Shortcuts Causing Bias in Medical Imaging
Video Transcription
My name is Judy. I'm not an ICU doctor; I'm an interventional radiologist at Emory University. So I want to bring a different view to thinking about data and show how difficult it is. Plus, I'm biased: I really care about this topic. So thank you for allowing me to speak about shortcuts and how they can affect you. Here are my disclosures.

Since we have a very engaged audience: how many people have, in the last 24 hours, pressed these keys? Oh, thank you. Good. Okay, great. How many people undid the work that was generated by these keys? Nobody, right? Oh, one person. Okay. Do you remember how you learned about these keys? Is there anyone who doesn't know these keys, or at least one of them? Right, so this is Control or Command C, Command V. If you want to quickly copy or paste, this is how you do it. But how did you learn about it? Somebody taught you? Okay, so a struggling 20-year-old came and said, this is how to do it. Anyone different? People who don't have good friends? I'll tell you mine: my mouse died and I only had a keyboard, so I had to figure out how to use it.

It's so interesting, because we gain this knowledge and we don't even change it. Actually, I had to look up how to change it on my computer; you can change all these keyboard shortcuts, but we rarely do, and we rarely make mistakes with them. It turns out that deep learning has the same characteristic, but it's not always a good thing. So that's what I'll be talking about. I'll explain what shortcuts are for deep learning, and we had a really good lecture from Ankit, so this will be fast. I'll give you some examples. I'm a radiologist, so I'll bring in some images, and then I'll talk about some of the strategies for mitigating these shortcuts. In our daily work, keyboard shortcuts are very helpful, but for models in the real world we don't know whether a shortcut is helpful or not. You saw the examples of underfitting and overfitting; shortcuts are not either of those two things, but they're not the end product you're looking for either. And then I'll end with the implications for bias.

If you end up being very interested in this topic, I recommend this paper. It's not technical; it's very simple to read. The first example the authors give, on the left there (I see the slide doesn't show my mouse), is a captioning model: if you ask it to caption that image, it says they're grazing sheep, but there are no sheep in that image, and I'll show you why this becomes a problem. One of the most famous examples, and I'll show it again, is asking whether pneumonia appears on a chest x-ray. The AI system learns: this patient is from the ICU; ICU patients tend to have pneumonia; therefore I think this patient has pneumonia. This occurs because our images and our data sets have context, and I'll come back to non-imaging data. In this example you can see that some combinations have few samples and some have many: if you have a camel, it's against a background of desert, and if you have a cow, it has a background of grass. The flip side is very rare. I'm not saying it's impossible, but it's rare.
And so if the background is consistently the most common feature, then the model (remember how Ankit showed that these models learn from examples) doesn't really need to learn the smell of the fruits to tell the difference in its prediction. It just learns: hey, this is the easiest thing for me to do; I'm working very hard, so I'm going to stop here and learn that the background is the most important distinguishing feature. That comes from the characteristics of our data sets and the way we train our models. True, with the transformer architecture that may be changing, but previously you're just looking at lines and shapes; in that grid square you're trying to magnify, you're just looking at lines, lines, lines. So in the example on the right, I hope you can see there's a picture of a cat, but the model says it's an elephant, because it's looking at the texture more than the shape and losing the context of the shape. This results in a dilemma where some categories look the same to humans, on the left side, like the number five or the school bus, but look very different to AI models; and on the other hand, the neural networks think some classes are similar that, to human eyes, are not the same at all. This is a characteristic of how they learn.

And this can persist in text as well. We're in an era where we're using a lot of chatbots, or our patients are using chatbots. If your data set is biased, and we've seen quite a lot of issues that can come up, then if you rely on the model's recommendation to tell you where to take a patient, it will say white patients should go to the hospital if they are violent and Black patients should go to prison if they are violent, because it encodes the characteristics of our world today. There's also the game-playing example: the model learns that if it stays in one position it can hack the game and win, so it doesn't really need to play. And some of the stereotypes that exist in our society, around who's a doctor or who's a nurse, get encoded the same way. So with shortcuts, the model does learn some characteristics that are true (the background really is present), but that's not everything that represents the data set.

Now I can show you some examples in medical imaging. I mentioned the pneumonia example; this is the work. The way it was done, and this again goes back to reproducibility, is that you train a model on one data set, NIH, Mount Sinai, or a combination of both, and then you test it on another. This external validation is very, very important, and when you see a drop in performance, you go back, like Omar said, and figure out why the model is not working. The red areas you'll see are the L markers; those mark the laterality of the image for the radiologist, but they can encode quite a lot of characteristics. Here are all the possible examples from the Emory chest x-ray data set (not an ICU data set): you can see the Ls have different orientations, some are bright, some are not. So think about your world: in the ICU, you may see the same technologists and the same x-ray machine that comes to your transplant ICU unit every morning. Those images get encoded with things we're not really thinking about, but they're so important, because the models are going to learn: this is the cardiac ICU, these patients always tend to have some heart failure, and that becomes the basis of the prediction, not because the model learned the disease, but because it learned a proxy that works for the prediction.
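To make that concrete, here is a minimal synthetic sketch of shortcut learning (an illustration, not something from the talk): a "background" feature is almost perfectly correlated with the label during training, as with the camels and cows, and the correlation is flipped at test time. All feature names and numbers are invented.

```python
# Minimal synthetic sketch of shortcut learning (illustration only).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_data(n, background_matches_label):
    # y: 1 = cow, 0 = camel
    y = rng.integers(0, 2, size=n)
    # "shape" feature: weakly informative signal about the animal itself
    shape = y + rng.normal(0, 2.0, size=n)
    # "background" feature: grass vs sand, tied to the label during training
    background = (y if background_matches_label else 1 - y) + rng.normal(0, 0.1, size=n)
    return np.column_stack([shape, background]), y

X_train, y_train = make_data(5000, background_matches_label=True)
X_same, y_same = make_data(1000, background_matches_label=True)
X_flip, y_flip = make_data(1000, background_matches_label=False)  # cows on sand, camels on grass

clf = LogisticRegression().fit(X_train, y_train)
print("accuracy, background matches label:", clf.score(X_same, y_same))
print("accuracy, background flipped:      ", clf.score(X_flip, y_flip))
```

On the flipped test set the accuracy collapses well below chance, because the classifier put its weight on the easy background feature rather than the weaker "shape" signal, which is the behavior the talk describes.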
Here's another example: chest tubes. We know that we treat pleural effusions and pneumothorax with chest tubes. In this work they show that the model appears to perform much better than it really does. Unfortunately my cursor doesn't work, but in these two graphs, you saw these ROC curves, the closer the curve gets to that corner, the closer to a perfect model. And you find that for patients without a chest tube, the model performance drops. The reason this matters is that the model learns: if there's a chest tube, this patient probably has a pneumothorax. But if you think clinically, that's not the most important case, right? That's a patient who has already received treatment. The patient who needs treatment is the one with a pneumothorax and no chest tube, and if the model performance is low for those patients, it can actually be harmful, delaying the intervention for exactly that subgroup.

At Emory, we deploy five algorithms. Two of them are triage algorithms, for intracranial hemorrhage and pulmonary embolism, and we've noticed that for both of them, outpatients have a 10% drop in model performance. Those are the patients to worry about: in the ER or in the hospital, if you collapse, hopefully someone will see that you're not doing well; but if you're at home, you got your imaging, you're getting sick, and the triage algorithm is not performing well for you, that becomes very challenging.

Let me show you some examples beyond radiology. Here's one from dermatology. The top row has malignant lesions; the bottom row has benign lesions. The benign lesions, in the way these data were acquired, had these blue patches, and obviously we're reusing some of these old data sets. It turns out the model learned that the images with a patch are benign, and that's all it needed to make its predictions. When they tried to correct for this and removed the patches, they found the predictions became nonsensical. You can think of the blue patches as the dermatology equivalent of the radiographic markers on chest x-rays.

Similarly for ophthalmology images. These are tabletop cameras; depending on your ophthalmologist, you may have paid out of pocket for this fancy image instead of getting your eyes dilated. The tabletop camera leaves this artifact at the bottom of the image, that bright area, and that's the equivalent of the radiographic marker. It's present on all of its images, so if those images come from a specific subset of your patient population, the model can calibrate its predictions to that subset of patients.

And again, going back to pneumonia prediction: very poor performance for COVID-19 prediction. These authors did this really well. They picked public data sets and created two of them, data set one and data set two, to probe a behavior we frequently engage in without thinking of the downstream impact. For example, if you have some positive cases and you need negative cases for your AI system, you say, hey Leo, give me those negative x-rays you had, and you combine them together. Those are called Frankenstein data sets.
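Before moving on, the chest-tube and outpatient examples above suggest a simple habit: report performance per subgroup, not just one overall number. Here is a hedged sketch on simulated data (the cohort, column names, and the "model" are all invented) showing how an impressive overall AUC can hide weak within-group performance.

```python
# Synthetic illustration: overall AUC vs AUC stratified by chest-tube status.
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 4000
pneumothorax = rng.integers(0, 2, size=n)
# Treated patients usually already have a tube; untreated ones rarely do.
chest_tube = np.where(pneumothorax == 1, rng.random(n) < 0.8, rng.random(n) < 0.05).astype(int)
# A "model" that mostly keys on the tube plus a little genuine disease signal.
score = 0.8 * chest_tube + 0.3 * pneumothorax + rng.normal(0, 0.3, n)
df = pd.DataFrame({"pneumothorax": pneumothorax, "chest_tube": chest_tube, "score": score})

print("overall AUC:", round(roc_auc_score(df["pneumothorax"], df["score"]), 3))
for has_tube, group in df.groupby("chest_tube"):
    auc = roc_auc_score(group["pneumothorax"], group["score"])
    print(f"chest_tube={has_tube}: n={len(group)}, AUC={round(auc, 3)}")
# The overall number looks great because the tube does the work; within each
# subgroup the score has to rely on the weaker disease signal, so AUC drops,
# most importantly for the untreated patients who need the finding flagged.
```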
Unfortunately, with a Frankenstein data set, the models are so sensitive to where the data come from that they're actually unable to learn the disease. In this case, you can see on the middle image that the original red curve is near perfect. You'd love this model; we all know in this room that none of these AI systems were really used for patient care during COVID. But when you take it to the other data set, the performance drops, and the converse holds as well. Then they do something interesting with generative networks: just as you can generate text, you can generate images. So you can say, show me what you're seeing in this patient who is COVID negative, and then what would make them COVID positive. When they do that, they find that the R marker the model was looking at is the thing that gets moved between the positive and negative patients. It's not learning the disease; the lungs don't light up to show us that they're the focus of the model. It's really a shortcut.

Now, in medical imaging, remember the input from Ankit's talk: we just give images, we don't give anything else. You're not saying, here is a male or a female patient. But they show that if you change just the last layer of the network, remember the neural architecture, and ask it to tell you the patient's sex or the projection of the image, it has near-perfect performance on both the internal and external data sets. I show this to remind us that just because you don't provide a variable with your image, as long as the acquisition of your data is sensitive to the characteristics of the patients you care for, the models will figure out what that shortcut is and use it in their prediction.

We don't have many very granular data sets, but this is one that has disease variables, patient variables, and hospital-process variables, for example the timing of the image and which scanner was used. What they do is, instead of asking only whether the patient has a hip fracture, they predict all of the characteristics in the data set. In the left graph there, if you look at the top predictions, they're mainly the scanner model and manufacturer. Those are not clinically important; as radiologists, we don't even notice which scanner it is unless it's a really old one with an artifact. And if you think about the portable machine that comes to the ICU every day, we're not changing those; they're the same. So if the person who usually scans your patients places the radiographic markers differently, or if hip fractures imaged in the middle of the night are more likely to be positive than the ones imaged during the day at the orthopedic clinic, those variations in clinical care get encoded into the medical images, even though you never provide those variables.

So I keep saying, look, I can show you where the shortcuts are. Here's a simple example: these are knee radiographs from five hospitals, and the task was just to predict which hospital each radiograph comes from. If you look between the knees, there's a metal bar separating them, and it's slightly different, if you can believe me, between the three sets of knees I'm showing there. As a radiologist, I would say, okay, I'll look at the L marker, where it is, how it looks, its intensity, and tell you I think this is a different machine. And they show that the model is very good at predicting the source hospital.
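The "change the last layer" probe described above can be sketched roughly as follows. This assumes PyTorch and torchvision (not anything used in the studies shown), and the data loader and attribute labels are placeholders: keep a trained backbone frozen and retrain only the final layer to predict an attribute you never provided, such as patient sex or image projection.

```python
# Sketch of a last-layer probe for hidden attributes (assumed PyTorch/torchvision).
import torch
import torch.nn as nn
from torchvision import models

backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
for p in backbone.parameters():          # freeze everything the backbone already learned
    p.requires_grad = False
backbone.fc = nn.Linear(backbone.fc.in_features, 2)  # new head, e.g. male vs female

optimizer = torch.optim.Adam(backbone.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def train_probe(loader, epochs=3):
    # `loader` is assumed to yield (image_batch, attribute_label) pairs
    backbone.train()
    for _ in range(epochs):
        for images, attr in loader:
            optimizer.zero_grad()
            loss = criterion(backbone(images), attr)
            loss.backward()
            optimizer.step()
```

If a probe like this reaches near-perfect accuracy, the attribute is recoverable from the image features and is therefore available to the model as a potential shortcut.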
But when they visualize what the hospital-prediction model is looking at, it's not always the radiographic marker; it's the metal bar that turned out to be the most sensitive feature for telling you this is data from hospital A or hospital B. I show this to make two points. First, just because you remove one shortcut doesn't mean you're done; the model will find another one. It's like peeling an onion, with many layers to unpack. Second, you might think humans can be the check on where shortcuts cause problems, but that's not always realistic, because humans may not even be able to see some of them.

Now, the gold standard in medical imaging is drawing segmentations. We understand them as radiologists, we love them; you see the overlay and you think, okay, I know where the model is looking, I can see where the pneumothorax is. In the top image there, this is a film from a neonatal screening study, and it has a ruler, a caliper. That's pretty standard in acquisition. If you use this for model training, the model learns that it can center its prediction around the ruler. Then it encounters an example without a caliper, and the prediction fails.

Even more concerning is the location of the lesion. The top lesions here are all melanomas, and they sit in different parts of the image. It turns out, let me just show you the next slide, that in most of our public data sets the malignant lesion is usually centered in the image. These are four public data sets already in use, and the malignant lesions sit in the middle portion of the image. But if you start to move the location of the lesion and measure the model performance, you find that performance drops for lesions far away from the center, which is not what the model sees in most of its training examples. The model may not really have learned what melanoma looks like; it just learned that if something is in the middle, it's probably cancer. And if you think about real-world use, maybe you're photographing the lesion with your own camera and you're not very precise, so the model performance is going to drop, which is a problem.

And now we're entering the era of foundation models. They weren't discussed today, but this is the biggest hype: the same technology that powers large language models and GPT is being applied to everything. Every week there's a new foundation model discovering new cells, discovering new drugs, and they're all trained the same way. Traditionally, with a convolutional neural network, we see that the model can tell you the differences between diseases, you can see different peaks, but sex and race remain the same. In the foundation models, however, the features do encode a lot of sex and race characteristics. And when you extract these numbers from data sets, you can start to analyze them in ways we couldn't before. I mentioned the Emory chest x-ray data set. Now we can say, I'm just picking a set of numbers that represents each chest x-ray, remember we said all images can become zeros and ones, and then you overlay them and start to visualize what they are. And you can see there's clearly a difference between the lateral and the frontal images; that's pretty well encoded.
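The embedding visualization described here can be sketched as below, assuming PyTorch/torchvision and scikit-learn; the images and the "frontal"/"lateral" labels are random placeholders for a real chest x-ray loader and its metadata. The idea: turn each image into a vector of numbers with a frozen backbone, project the vectors to 2D, and color the points by metadata such as view, hospital, or EKG lead type.

```python
# Rough sketch of embedding extraction and 2D visualization (placeholder data).
import torch
import torch.nn as nn
from torchvision import models
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = nn.Identity()           # drop the classifier head, keep the 2048-d features
backbone.eval()

# Placeholder data standing in for preprocessed chest x-rays plus their metadata.
images = torch.randn(64, 3, 224, 224)
views = ["frontal" if i % 2 == 0 else "lateral" for i in range(64)]

with torch.no_grad():
    embeddings = backbone(images).numpy()           # one 2048-d vector per image

coords = PCA(n_components=2).fit_transform(embeddings)   # t-SNE or UMAP are also common
colors = [0 if v == "frontal" else 1 for v in views]
plt.scatter(coords[:, 0], coords[:, 1], c=colors, s=8)
plt.title("Image embeddings colored by view (placeholder data)")
plt.show()
# Clusters that line up with scanner, view, hospital, or EKG lead type rather than
# disease are exactly the shortcut candidates worth investigating.
```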
In those maps we also start to see groups of patients, possibly depending on the hospital they attend, where this becomes a problem, and we start to see differences across our Emory hospitals just from visualizing this large data set. When we go and look at the chest x-rays themselves, we can now see new clusters, in the unsupervised way Ankit showed: some images are flat, they could simply have been acquired differently, but that's embedded in the data set; some are frontal. The example I actually like is the top row here, which surprised me because it was sensitive to the EKG leads. Some patients in our data set have a clip type, some have a button type, and that was enough to produce different clusters. So you can imagine, if your charge nurse always orders a different set of EKG leads for your unit, that becomes the proxy the model latches onto.

Finally, I saw this recent paper when I was presenting at Grand Rounds at MD Anderson, and I asked them: why do you think that patients who get radiotherapy in the morning tend to survive better, actually significantly better? This is a Kaplan-Meier curve, that top curve, versus the ones who get radiotherapy in the afternoon or the evening. What do we think the reason could be? What? The staff? Oh, I would like to be treated in the morning then, okay. Okay, kids, right? So maybe our scheduling is sensitive to that: we like children not to be hungry, so they come in the morning, and other people come at night. What else? I actually don't know the answer to this. Connections? I don't know, you'd have to explain that a little more. Okay, so the group said maybe the evening and night patients are inpatients, so they're already much sicker. Or maybe they're patients who have a day job, or who live in a deprived area and have other social determinants of health, so they come in at these other hours but also carry a greater burden of disease; and we know that non-medical factors can affect health outcomes. Either way, this is a good example of a shortcut, although it isn't discussed that way.

So, finally: medical images can encode a lot of information. If you give me a chest x-ray, and I'll show you an example here, this is work we published around three years ago, and people have come back and said it's a little crazy, honestly. If you had told me this is what we'd be able to do with neural networks: from a chest x-ray you can tell the arrhythmia, the specific arrhythmia, and not because there's a replaced valve visible on the film. You can tell the age of the patient, the costs of the patient, their race; that was our work from a few years ago. So I like to summarize it this way: today, if you could get my chest x-ray, you could say that I'm Black, as a social and legal construct; that I'm old, aging faster than my chronological age; that I live in a deprived area; that I have COPD and CHF; and that I'm going to spend this amount of money in the next three years. Just from a chest x-ray, which is the most bizarre thing. And it turns out that this information is encoded in the predictions.
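For readers who want to run that kind of time-of-day comparison on their own cohort, here is an illustrative sketch, not the paper's analysis, assuming the lifelines package; the follow-up times below are simulated purely for illustration, with the evening group built to do worse, standing in for sicker inpatients scheduled at night.

```python
# Illustrative Kaplan-Meier comparison by treatment slot (simulated data, lifelines assumed).
import numpy as np
import matplotlib.pyplot as plt
from lifelines import KaplanMeierFitter
from lifelines.statistics import logrank_test

rng = np.random.default_rng(0)
n = 300
morning_t = rng.exponential(scale=40, size=n)   # follow-up in months
evening_t = rng.exponential(scale=25, size=n)
morning_e = rng.random(n) < 0.7                 # event observed (death) vs censored
evening_e = rng.random(n) < 0.7

ax = plt.subplot(111)
KaplanMeierFitter().fit(morning_t, event_observed=morning_e, label="morning").plot_survival_function(ax=ax)
KaplanMeierFitter().fit(evening_t, event_observed=evening_e, label="evening").plot_survival_function(ax=ax)

res = logrank_test(morning_t, evening_t, event_observed_A=morning_e, event_observed_B=evening_e)
print("log-rank p-value:", res.p_value)
plt.xlabel("months since treatment"); plt.ylabel("survival probability"); plt.show()
# A gap like this is more likely to reflect who gets scheduled when (inpatients,
# children, work schedules) than any biological effect of the hour itself.
```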
In the work I mentioned on demographic encoding, we looked at the encoding of sex, age, race, and the intersection of sex and race, with both in-distribution and out-of-distribution testing; those terms should not be unfamiliar after Ankit's talk. We find that depending on the prediction task, whether you're predicting "no finding", which is a normal x-ray, or cardiomegaly, these characteristics are encoded to varying degrees. Sex is the most strongly encoded, and that makes sense: pleural effusions and breast tissue, for example, overlap on a chest x-ray, so you can see why that would happen. But race is not something you can see on an image, and yet we find it is strongly encoded. We also find that the stronger the encoding, the larger the disparities in the model. And when you try to mitigate this, we applied five or six different algorithms to try to make the model fair across all groups, it turns out the most important thing is to remove the shortcuts, the demographic shortcuts, to make a fair model. All the other strategies give you far less gain than removing the shortcuts does.

So, to end: think about shortcuts. Think about the ICU; there are so many characteristics. I participate in these datathons around the world, and even oral health alone is enough to predict patient outcomes. If a lab is ordered outside the usual times you order labs for your patients, that timing alone can tell you something is wrong. So shortcuts put the domain expert, which is you, at the center, to give feedback. On the other hand, they bring a challenge, because the AI will never really work for everyone unless we can rethink how to manage the shortcuts. So thank you so much for your attention.
Video Summary
Judy, an interventional radiologist from Emory University, provides insights into the complexities of data interpretation using AI, focusing on the implications of shortcuts in deep learning. She highlights how shortcuts can lead models to incorrect conclusions by relying on superficial cues rather than core data, such as AI predicting pneumonia on ICU X-rays based on contextual clues rather than actual evidence. Judy illustrates this with examples from medical imaging, dermatology, and more, noting that non-contributory elements like radiographic markers or background artifacts can mislead AI. She stresses that these issues extend beyond imaging, citing biases in AI models affecting real-world applications like healthcare recommendations. New AI architectures, such as foundation models, continue to encode demographic biases, making it crucial to address shortcuts. Judy emphasizes the need for domain experts to provide ongoing feedback to mitigate biased outcomes, ensuring AI models serve all demographics fairly and effectively.
Keywords
AI shortcuts
deep learning
medical imaging
demographic biases
domain experts
healthcare AI