One conception of race is that it is skin deep, and is no more than a matter of skin pigmentation. By implication, such a categorisation is superficial, trivial, and unlikely to be an explanation of any presumed racial differences in behaviour. There may be effects due to people making unwarranted assumptions based on skin colour, but that says more about them than anything else.
According to the skin-pigmentation theory, an X-ray should see right through that to the reality of the bones underneath. This would reveal, the theory says, that people are alike under the skin. Perhaps so, but is it true of their bones?
[Submitted on 21 Jul 2021]
Reading Race: AI Recognises Patient’s Racial Identity In Medical Images
Imon Banerjee, Ananth Reddy Bhimireddy, John L. Burns, Leo Anthony Celi, Li-Ching Chen, Ramon Correa, Natalie Dullerud, Marzyeh Ghassemi, Shih-Cheng Huang, Po-Chih Kuo, Matthew P Lungren, Lyle Palmer, Brandon J Price, Saptarshi Purkayastha, Ayis Pyrros, Luke Oakden-Rayner, Chima Okechukwu, Laleh Seyyed-Kalantari, Hari Trivedi, Ryan Wang, Zachary Zaiman, Haoran Zhang, Judy W Gichoya
Background: In medical imaging, prior studies have demonstrated disparate AI performance by race, yet there is no known correlation for race on medical imaging that would be obvious to the human expert interpreting the images.
Methods: Using private and public datasets we evaluate: A) performance quantification of deep learning models to detect race from medical images, including the ability of these models to generalize to external environments and across multiple imaging modalities, B) assessment of possible confounding anatomic and phenotype population features, such as disease distribution and body habitus as predictors of race, and C) investigation into the underlying mechanism by which AI models can recognize race.
Findings: Standard deep learning models can be trained to predict race from medical images with high performance across multiple imaging modalities. Our findings hold under external validation conditions, as well as when models are optimized to perform clinically motivated tasks. We demonstrate this detection is not due to trivial proxies or imaging-related surrogate covariates for race, such as underlying disease distribution. Finally, we show that performance persists over all anatomical regions and frequency spectrum of the images suggesting that mitigation efforts will be challenging and demand further study.
Interpretation: We emphasize that model ability to predict self-reported race is itself not the issue of importance. However, our findings that AI can trivially predict self-reported race — even from corrupted, cropped, and noised medical images — in a setting where clinical experts cannot, creates an enormous risk for all model deployments in medical imaging: if an AI model secretly used its knowledge of self-reported race to misclassify all Black patients, radiologists would not be able to tell using the same data the model has access to.
This is an astounding paper. It appears to reveal that deep learning methods can detect race in X-ray images, even when all obvious giveaway signals are stripped out of the image. That is extraordinary. The next point of interest is that the authors make very clear that they are alarmed that this is possible, and warn that it might lead to evil consequences.
They assert that race is not based on ancestry, but is a social construction based on self-report.
So, to take the authors at their word, this paper is a warning as well as the publication of a finding. The underlying results show that you cannot get rid of race in an X-ray, even though radiologists are reportedly unable to detect race in such images, and do not know how artificial intelligence achieves such good results.
I read these results some months ago and now come back to them, seeing this as an important finding, and as a vivid example of what academic authors must do when they report unwelcome results.
First, let us look at what the authors say to deflect critics. Here is their statement:
We emphasize that model ability to predict self-reported race is itself not the issue of importance. However, our findings that AI can trivially predict self-reported race — even from corrupted, cropped, and noised medical images — in a setting where clinical experts cannot, creates an enormous risk for all model deployments in medical imaging: if an AI model secretly used its knowledge of self-reported race to misclassify all Black patients, radiologists would not be able to tell using the same data the model has access to.
Bias and discrimination in Artificial Intelligence (AI) systems has been heavily studied in the domains of language modelling 1, criminal justice 2, automated speech recognition 3 and various healthcare application domains including dermatology 4,5, mortality risk prediction 6 and healthcare utilization prediction algorithms 7, among others.
While AI models have also been shown to produce racial disparities in the medical imaging domain 9,10, there are no known, reliable medical imaging biomarker correlates for racial identity. In other words, while it is possible to observe indications of racial identity in photographs and videos, clinical experts cannot easily identify patient race from medical images.
Race and racial identity can be difficult attributes to quantify and study in healthcare research 15, and are often incorrectly conflated with biological concepts such as genetic ancestry 16. In this work, we define racial identity as a social, political, and legal construct that relates to the interaction between external perceptions (i.e. “how do others see me?”) and self-identification, and specifically make use of the self-reported race of patients in all of our experiments.
After all those virtuous statements, they hope they are in the clear. They are against the evil discrimination shown by artificial intelligence. Only then can they turn to the very bad news.
In this study, we investigate a large number of publicly and privately available large-scale medical imaging datasets and find that self-reported race is trivially predictable by AI models trained with medical image pixel data alone as model inputs. We use standard deep learning methods for each of the image analysis experiments, training a variety of common models appropriate to the tasks. First, we show that AI models are able to predict race across multiple imaging modalities, various datasets, and diverse clinical tasks. The high level of performance persists during the external validation of these models across a range of academic centers and patient populations in the United States, as well as when models are optimised to perform clinically motivated tasks. We also perform ablations that demonstrate this detection is not due to trivial proxies, such as body habitus, age, tissue density or other potential imaging confounders for race such as the underlying disease distribution in the population. Finally, we show that the features learned appear to involve all regions of the image and frequency spectrum, suggesting that mitigation efforts will be challenging.
Essentially, a good data-crunching program can work out the race of people in X-rays, and nothing can be done about it. This is a horrible result which cannot be “mitigated”. The authors do not for a moment consider the possibility that race is biologically real.
They studied chest, hand, breast (mammogram), and lateral spine (including skull) X-rays. In seeking to control for things which might have explained the results, they corrected for body mass index and also for bone density. The latter is a little-publicised race difference: American black women and men achieve 5%-15% greater peak bone mass than white persons.
Despite all these corrections, race was detected with high accuracy. For some reason, it seems to be there in the bone structure.
By high accuracy, I mean that the system produces true positives of racial identification rather than false positives, and very few false negatives. The “receiver operating characteristic” (ROC) results are excellent, with areas under the curve often around 98%. ROC analysis was developed in the early days of radar, when it was often difficult to distinguish a true signal from background noise, and receivers varied in their accuracy. In this case the capacity to pick up a true signal is very high. Furthermore, the system can do this without being directly trained for that particular task.
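The ROC idea can be made concrete with a small sketch. The area under the ROC curve is simply the probability that a randomly chosen positive case gets a higher score than a randomly chosen negative case. The labels and scores below are invented for illustration, not taken from the paper:

```python
# Toy illustration of AUC (area under the ROC curve); all numbers are invented.

def auc(y_true, y_score):
    """Probability that a random positive case outranks a random negative case."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    # Count pairwise wins; ties count as half a win
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# 1 = the image belongs to the group the model is trying to detect
y_true  = [1, 1, 1, 1, 0, 0, 0, 0]
y_score = [0.95, 0.90, 0.85, 0.25, 0.30, 0.20, 0.15, 0.10]

print(auc(y_true, y_score))  # 0.9375: near-perfect separation; 0.5 is chance
```

A perfect classifier scores 1.0, a coin-flip scores 0.5, so values near 0.98 mean the model almost never ranks a wrong case above a right one.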
We hypothesized that if the model was able to identify a patient’s race, this would suggest the models had implicitly learned to recognize racial information despite not being directly trained for that task.
Again, even though races differ in fatness, body mass index does not do a good job of discriminating between X-rays of different races. Surprisingly, when they degraded the X-ray images so much that radiologists could not even see that they were X-rays, the system could still discriminate between the races.
In despair, they say:
One commonly proposed method to mitigate bias is through the selective removal of features that encode protected attributes such as racial identity, while retaining as much information useful for the clinical task as possible, in effect making the machine learning models “colorblind” 43. While this approach has already been criticised as being ineffective in some circumstances 44, our work further suggests that such an approach may not succeed in medical imaging simply for the fact that racial identity information appears to be incredibly difficult to isolate. The ability to detect race was not mitigated by any reasonable reduction in resolution or by the injection of noise, nor by frequency spectrum filtering or patch based masking.
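Two of the degradations the authors list (noise injection and resolution reduction) plus patch masking can be sketched in a few lines. This is a toy illustration with invented parameters and a tiny stand-in "image"; the frequency-spectrum filtering step is omitted for brevity:

```python
# Toy sketch of image degradations of the kind described; parameters invented.
import random

random.seed(0)
N = 8  # tiny stand-in "image"; a real X-ray would be e.g. 2048x2048 pixels
image = [[random.random() for _ in range(N)] for _ in range(N)]

# Noise injection: add Gaussian noise to every pixel
noised = [[p + random.gauss(0, 0.5) for p in row] for row in image]

# Resolution reduction: average 2x2 blocks, halving each dimension
low_res = [[sum(image[2 * i + di][2 * j + dj] for di in (0, 1) for dj in (0, 1)) / 4
            for j in range(N // 2)] for i in range(N // 2)]

# Patch masking: black out a square region of the image
masked = [row[:] for row in image]
for i in range(2, 6):
    for j in range(2, 6):
        masked[i][j] = 0.0
```

The paper's point is that none of these transformations, at any reasonable strength, removed the racial signal.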
It is a case of: “Out, damned spot! out, I say!”.
Lest they be misunderstood, they explain themselves further.
There has been extensive research into the correlation between self-reported race and genetic ancestry in the field of genomics, which have shown more genetic variation within races than between races, and that race is more a social than biological construct 16. We note that in the context of racial discrimination and bias, the vector of harm is not genetic ancestry, but instead is the social and cultural construct that is racial identity, which we have defined as the combination of external perceptions and self-identification. Indeed, biased decisions are not informed by genetic ancestry information, which is not directly available to medical decision makers in almost any plausible scenario. As such, self-reported race should be considered a strong proxy for racial identity.
We strongly recommend that all developers, regulators, and users who are involved with medical image analysis consider the use of deep learning models with extreme caution. In the setting of x-ray and CT imaging data, patient racial identity is readily learnable from the image data alone, generalises to new settings, and may provide a direct mechanism to perpetuate or even worsen the racial disparities that exist in current medical practice. Our findings indicate that future medical imaging AI work should emphasize explicit model performance audits based on racial identity, sex and age, and that medical imaging datasets should include the self-reported race of patients where possible to allow for further investigation and research into the human-hidden but model-decipherable information that these images appear to contain related to racial identity.
“Human-hidden but model-decipherable” is a danger, the authors aver.
What is a reasonable conclusion to draw from this research? Here is my version:
Interpretation: That a model can predict race — even from corrupted, cropped, and noised medical images — in a setting where clinical experts cannot, is a novel finding which reveals racial differences at the skeletal level. The biomechanical implications of these differences for health and behaviour should be the subject of further research.
By this I mean that if X-rays can detect race, then there is certainly more to it than skin pigmentation. Does having a particular skeleton have any behavioural consequences?
It might partially explain differences in performance across sports and occupations. If you have denser, stronger bones, might that give you an advantage in a fist fight? It could also partly explain vulnerability to falls and other injuries, with clear implications for health. This artificial intelligence system can correctly identify skeletons by race even when bone density is controlled for. Those density differences are nonetheless a real-world difference with big implications for falls in the elderly.
Shorn of its exculpatory protestations, this is an important paper. Race is real by objective measures. Perhaps radiographers can see it and think it politic not to discuss it in public. Deep learning networks don’t have to be so circumspect. If there is a difference to be found, they can report the signal they detect within the noise. They have found such a signal.