Deployment Challenges Downstream of Machine Learning
Video Transcription
So, the next session is going to be given by myself and a friend and colleague, Teresa Rincon. I'm going to have her introduce herself in a second. I just wanted to announce that we have been organizing datathons for the Society of Critical Care Medicine. We've had three so far, and we're going to have our fourth one in July, I believe the second weekend of July, at the SCCM headquarters outside of Chicago. It would be great if you could continue diving into the deep end of the pool. And don't be intimidated if you don't know machine learning. That's perfectly fine, because you're bringing an understanding of the data to those who are familiar with machine learning methodology. So I always tell the residents that you bring a lot of value to the table. I don't care if you don't understand data leakage or cross-validation, because you will learn that in the process of working with people who breathe machine learning every day. And they need you. They cannot do this on their own, because they don't understand the data. So with that, I'm going to very quickly pass it on to Teresa to introduce herself.

How can you plan for this? There should be announcements on the SCCM website. The very first one was right before the pandemic, and then we had two more, last July and the July before that. It might be getting booked up, so if you're interested, try to register soon. It should be on the website; it's in the middle of July in the Chicago area.

Hello, everyone. My name is Teresa, as Leo just let you know. I teach biomedical informatics at the Graduate School of Nursing at UMass Chan Medical School, and I also teach healthcare informatics at Regis College in Weston, primarily to DNP students. I do some consulting on the side, and I'll show a disclosure slide in just a second. There are my disclosures. I'm not going to represent any of these groups as I talk to you today; I'm just going to speak from my own experience.

And my name is Leo. I still practice in the intensive care unit; I work in the medical ICU of Beth Israel Deaconess Medical Center. My research, however, is hosted at MIT, and the main objective is to build capacity in health data science, going from country to country and to different parts of the United States. Most importantly, that means bringing together people with different backgrounds, from healthcare to computer science to engineering to the social sciences, because we need all of that expertise to be able to leverage the data. These are my disclosures: I receive research funding from several sources, but nothing that is conflicting or worth disclosing for this particular talk.

I'm going to start with this statement: most deployment challenges downstream of machine learning, that is, after a model has been developed, stem from issues upstream of machine learning, namely how the data was generated, collected, harmonized, and curated. This is why we think we need to invest more time, more capital, and more funding in understanding the dataset. And most of the challenges upstream and downstream of machine learning stem from societal issues.
It is unlikely that we can address the challenges upstream and downstream of machine learning without fixing some of the age-old problems that we see in society and in our hospitals and clinics.

I'm going to show this paper that has already been introduced by Judy. My takeaway from it is that whatever model performance you read in a journal article may not apply to your own hospital. What the authors found is that they carefully optimized a model, making sure it was fair across patients, and then, when they applied it in a different setting, they did not see the same performance. What that means is that you should not stop with the estimates of model performance you read in journal articles. In fact, I've been trying to convince the journals to stop sensationalizing AUC, AUPRC, precision, and recall, because in the end what matters is what happens to the patients when you deploy an AI. The only metrics that matter are those that pertain to relevant clinical outcomes. When you adopt an algorithm to reduce maternal mortality, are you seeing a drop in maternal mortality across different demographics? I do care about the initial statistical accuracy, but it is only the first step in the right direction. When you report the accuracy of a model, that is statistical accuracy; it does not predict how the model will change the behavior of clinicians.

So a very important question for every health system to ask is: who will perform local validation and post-deployment monitoring? A recent study showed that about 60 to 80% of health systems are now using some AI, whether for clinical decision support, operational tasks, or administrative tasks. I would say the figure is probably higher now, because that was a 2023 survey of hospital executives; at this time I would expect close to 100% of health systems to be using some AI. What was really troubling is that small community hospitals, rural hospitals, and critical access hospitals are using AI off the shelf. They buy AI from a vendor, or use AI embedded in their electronic health record system, without validating it locally on their own population and without any monitoring of what happens to patient outcomes once the AI is deployed.

The problem is that 99.99% of hospitals and clinics around the world do not have the resources to evaluate AI or to monitor it once it is deployed. The only solution we can offer is for health systems to partner with universities and colleges in their locality. This is the ethos of the datathons that we organize: we try to partner hospitals with local computer science and engineering departments, because hospitals cannot afford to pay those data scientists. They are being recruited by OpenAI and IBM and offered more money than we make in the hospitals, so they will not come to your hospitals. But you have free labor in the form of students who are studying machine learning at their universities. To us, that is the only feasible way of making sure AI is deployed safely: having many eyes looking at these models, evaluating them, and monitoring them after deployment.

But the biggest deployment challenge, I think, is the risk of harm from uninformed use. So what we're going to do now is go over guidelines that have been published to evaluate algorithms.
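As an editorial illustration of what local validation and post-deployment monitoring might look like in practice, here is a minimal sketch, assuming a local cohort with ground-truth outcomes and the deployed model's risk scores exported to a CSV. The file name, column names, and prediction-date field are hypothetical, not part of any specific vendor's export.

```python
# Minimal local-validation and monitoring sketch (illustrative only).
# Assumes a CSV export with ground-truth outcomes, the model's risk
# scores, and a prediction date -- all names are hypothetical.
import pandas as pd
from sklearn.metrics import roc_auc_score, brier_score_loss

df = pd.read_csv("local_cohort_scores.csv", parse_dates=["prediction_date"])

# Discrimination and calibration on the local population,
# to compare against whatever was reported in the paper or brochure.
print("Local AUROC:", round(roc_auc_score(df["outcome"], df["risk_score"]), 3))
print("Local Brier:", round(brier_score_loss(df["outcome"], df["risk_score"]), 3))

# Simple post-deployment monitoring: track performance month by month
# so that drift shows up before it harms patients.
monthly = df.set_index("prediction_date").groupby(pd.Grouper(freq="MS"))
for month, sub in monthly:
    if sub["outcome"].nunique() < 2:
        continue  # AUROC is undefined if only one class occurred that month
    auc = roc_auc_score(sub["outcome"], sub["risk_score"])
    print(month.date(), "AUROC:", round(auc, 3))
```

The point of the monthly loop is simply that monitoring is a continuing activity, not a one-time check at go-live.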
We're then going to divide you into five groups, and each group will be moderated by one of us. Teresa will give you a list of papers that you yourselves will evaluate, and you will come up with the questions you have for that particular algorithm. We will try to answer them as a group as well, and then we will segue into a workshop. You will stay in your groups, going over the studies, trying to apply what you have learned today, and coming up with questions. We may not be able to answer all of them, but at least we will exercise your critical thinking, because we think critical thinking is the most important skill everyone needs in order to leverage the power of this technology.

There are two guidelines I would like you to be aware of, because you can use them as your cheat sheet. You can use them to see whether an algorithm is robust, whether it is effective, and whether it is fair. The first is the TRIPOD+AI guideline. It was published last year, and it provides a checklist of the items you need to go through to see whether a paper, and the algorithm it describes, is good enough for operation. I'm going to highlight some of the most important items in the checklist.

The first is data preparation. TRIPOD+AI is specifically aimed at algorithms trained on numeric data and images. There is a second TRIPOD guideline, called TRIPOD-LLM, that is focused on language models. So you're going to use TRIPOD+AI when you're looking at an algorithm whose features are numeric data, vital signs, laboratory tests, maybe images, and maybe clinical notes too, but that is not specifically using or developing a language model. There has to be some explanation of how the data was prepared. A red flag for us when we're reviewing papers is when there's no description of how the data came about.

Then the outcome: the outcome has to be clinically relevant, and not only clinically relevant but also actionable. I have some misgivings about predicting ICU readmission, because I don't know what to do with that prediction. Does it mean that if there's a high risk of readmission, I should keep the patient in the ICU? Will that change the readmission risk if I keep the patient one or two more days? Because I am exposing that patient to the risk of nosocomial harm by keeping them in the ICU, if the risk of readmission is not going to change over time. So when you're looking at what outcome is being predicted, ask: is this an outcome for which we have interventions to improve it? Predicting Alzheimer's disease at this point, to me, is a waste of time, until and unless we find some effective way of preventing the onset or progression of Alzheimer's disease.

Predictors: the most important thing to look at, when you examine which variables were included to predict a disease trajectory or a complication, is how complete they are. This is where you do not know what you do not know: sometimes the data simply does not capture all the factors that are considered when a decision is made. An example I often give is predicting outcomes after heart attacks. What we know, and what the data is reflecting, is that women have poorer outcomes from heart attacks when their cardiologist is a man than when their cardiologist is a woman.
Unfortunately, we don't have that data, the sex of the provider. We also know that concordance in race and ethnicity between the patient and the provider has an impact, but we don't have that information either. That is a recipe for spurious associations to be learned. Going back to the example of women having poorer outcomes when their cardiologist is a man rather than a woman: there are so many more male cardiologists that, if you give that data to a computer, it's going to learn that being a woman is a risk factor for poor outcomes, when in fact it is being a woman cared for by a male cardiologist versus being a woman cared for by a female cardiologist. Because we have so many more male cardiologists, and this information is not present in the electronic health record, the model learns the wrong associations.

Sample size. I'm hoping this is no longer going to be an issue, because we are now using electronic health records from tens of thousands of patients. It is still an issue if you have an imbalance in the number of events or complications you are trying to predict. And there's really no good way of addressing the data imbalance problem, because if you create synthetic data based on what's available in your electronic health record, you're not really adding information; you're reusing the information already embedded in your dataset. So I'm not a believer in synthetic data. I think that if you do not have a good dataset, then maybe you should hold off on building prediction models.

Missing data. It's very important that authors explain how they handled missing data. And this is not only a machine learning problem; it is a problem in clinical research generally. As you go down through your inclusion and exclusion criteria, as you say, for example, that you're going to remove patients who did not stay at least 24 hours in the ICU, take note of which patients died or were discharged before 24 hours. If you change the composition of your cohort as you apply inclusion and exclusion criteria, you are probably introducing bias and providing an opportunity for the AI to learn spurious associations. So don't just drop patients; make sure you understand who got dropped as you establish the final cohort with which to train and validate your model.

Analytical methods. We're not going to go through all of these. The most important thing to know is that you need some form of cross-validation: within the training cohort, you run the models repeatedly, changing which patients you develop the model on and which patients you validate its performance on. Most packages used to develop models already have this feature. Another important thing to be aware of is the likelihood of data leakage. Strictly defined, data leakage occurs when there are patients in the training cohort who somehow also appear in the testing cohort. A common scenario is when you break down your data by year: you train the model on patients admitted from 2017 to 2019 and test its performance on patients admitted from 2019 to 2020, except that some patients appear in both cohorts. That's an example of data leakage.
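As an editorial illustration of how that kind of patient-level leakage can be avoided, here is a minimal sketch that splits at the patient level rather than the admission level, so no patient contributes data to both sides. It uses scikit-learn's GroupShuffleSplit and GroupKFold; the DataFrame, file, and column names are hypothetical.

```python
# Patient-level train/test split to avoid leakage (illustrative only).
# Assumes one row per ICU admission, with a patient identifier column;
# the file and column names are hypothetical.
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit, GroupKFold

admissions = pd.read_csv("admissions_features.csv")  # hypothetical file

# Hold out 20% of *patients* (not admissions), so the same patient
# can never appear in both the training and the testing cohort.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(
    splitter.split(admissions, groups=admissions["patient_id"])
)
train, test = admissions.iloc[train_idx], admissions.iloc[test_idx]

# Sanity check: no patient overlap between the two cohorts.
assert set(train["patient_id"]).isdisjoint(set(test["patient_id"]))

# The same idea applies to cross-validation within the training cohort:
# GroupKFold keeps each patient's admissions inside a single fold.
cv = GroupKFold(n_splits=5)
folds = list(cv.split(train, groups=train["patient_id"]))
```

The same grouping idea works for temporal splits too: if you split by admission year, check for, and remove or reassign, patients who straddle the boundary.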
Another example of data leakage, which is a little different, pertains to the sepsis model that was deployed within the Epic system during the pandemic. If you're not familiar with that story: the model was subsequently found to be very poor at predicting who had sepsis. It turned out that one of the features used to predict sepsis was the prescription of antibiotics, because, according to their definition, the prescription of antibiotics happened before the onset of sepsis. Since the model was perhaps developed entirely by data scientists without any clinician input, no one told them that when an antibiotic is already being prescribed, the clinicians are already thinking that the patient might have sepsis. That, to me, is an example of data leakage: they shouldn't have used that feature, because the clinicians had already identified the patient as potentially having sepsis.

I've spoken about class imbalance. Fairness evaluation is an area we are very actively working on, and the reason is that the labels we use for fairness evaluation, race and ethnicity, are, as you know, not very well captured in electronic health records. In the MIMIC dataset during the pandemic, 25% of patients had unknown ethnicity. What do you do with them? Do you drop them from the model? How do you test fairness when you cannot assign a race or ethnicity to 20 to 25% of patients? It's the same with sexual identity: we only have self-reported sexual identity, and that may not fully capture the richness of that variable. So what we're doing now, rather than trying to do fairness evaluation on these traditional demographic labels, is to identify the patients who receive poorer care to begin with. It doesn't matter whether they are Black, or a woman, or old; we lump them together. If they are receiving poorer care, if they have poorer outcomes to begin with, we evaluate the model performance on those patients, because we don't want them to be given wrong labels, wrong diagnoses, or wrong treatment recommendations when we deploy AI.

Conflict of interest is very important when you're reading a paper, and I've been burned. I'm on the editorial board of PLOS Digital Health, and a paper got published about an algorithm where the authors conveniently forgot to disclose that they have a company that will sell that algorithm. So when you're evaluating a paper or an algorithm, make sure you understand whether it was written by a company that is going to use the paper as a marketing ploy to start selling the algorithm.

The last item is what I consider the most important: patient engagement. Was there patient engagement when the group developed the algorithm? And not just patient engagement for the sake of patient engagement; there is a lot of patient-engagement washing, where as soon as you throw the term around, you think you have done your due diligence. It's really hard to say what type of patient engagement is relevant for developing algorithms, but at the very least authors should describe how they involved patients throughout the life cycle of the AI.

The other guideline I mentioned earlier is TRIPOD-LLM. This one is specific to work that uses or develops a language model for a particular task, and these are some of the tasks we have seen language models used for.
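As an editorial illustration of that kind of fairness check, here is a minimal sketch that compares error rates between patients flagged as receiving poorer baseline care and everyone else. The composite flag, threshold, file, and column names are all hypothetical; the actual grouping would be built upstream from documented disparities in care.

```python
# Fairness-check sketch (illustrative only): compare error rates between
# patients flagged as receiving poorer baseline care and everyone else.
# The file, threshold, and column names are hypothetical.
import pandas as pd
from sklearn.metrics import confusion_matrix

df = pd.read_csv("validation_scores.csv")
df["predicted"] = (df["risk_score"] >= 0.5).astype(int)  # hypothetical cutoff

def error_rates(sub: pd.DataFrame) -> dict:
    # Force a 2x2 matrix even if a subgroup happens to have one class only.
    tn, fp, fn, tp = confusion_matrix(
        sub["outcome"], sub["predicted"], labels=[0, 1]
    ).ravel()
    return {
        "n": len(sub),
        "false_negative_rate": fn / (fn + tp) if (fn + tp) else float("nan"),
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else float("nan"),
    }

# "poorer_baseline_care" is a hypothetical composite flag (e.g., patients
# with documented gaps in the care delivered before the model existed).
for flag, sub in df.groupby("poorer_baseline_care"):
    print(flag, error_rates(sub))
```

A large gap in false-negative rates between the two groups would be exactly the kind of signal that should block or delay deployment.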
Language models are used for text processing and for information retrieval, that is, search strategies. I can tell you: stop trying to do systematic reviews yourselves. Systematic reviews are going to be done better by language models, I would say, within a year or a couple of years. Even though they still hallucinate now, expect those hallucinations to diminish. So I think we need to rethink what type of research we should be engaging in, now that we have very powerful tools to help us.

There are also chatbots, which are very popular. Our concerns include the direct-to-consumer marketing of many of these chatbots: they bypass clinicians and are marketed directly to patients, for example for mental health counseling, and for that reason they may be flying under the radar of regulatory bodies. There are many loopholes in the current state of regulation. There is a distinction between a wellness product and a health product: if it's a wellness product, like a wearable, it is not under the purview of the FDA but of the FTC. For that reason, the data collected by wearables is not considered health data, it's considered wellness data, and it can be sold to and bought by any third party.

These, again, are just some of the uses of language models. There is summarization and simplification. There is also help drafting letters to the next provider or to the patient; we've seen examples of that, and you might already be using it, since it's embedded in some electronic health record systems. And then there is the use of language models for outcome forecasting.

The most important information on this slide is which specifics of the language model you need to understand. If the work uses GPT, you would like to see which version is being custom-tailored or fine-tuned for the specific task. You would also like to know something about the methodology used to fine-tune that language model: did they use reinforcement learning, which Ankit introduced earlier? If human experts are being used to assess whether a language model is accurate, interpretable, or biased, there has to be some explanation of who those experts are, how many there were, and the demographic composition of the group. I'm always astounded when I see, for example, a conversational agent for breast reconstructive surgery and then find that all the developers are male. And I kid you not, that is not unusual. I ask them: what makes you think you can build a good conversational agent for breast reconstructive surgery without a single woman in your group?

We recently published a paper where we introduced the concept of a team card. Right now there is some expectation that every model should be accompanied by a data card, describing the data used to train the model, and a model card, giving the specifics of how the data was curated and how the model was developed. To those two cards we add the team card: who are the people behind this particular algorithm? Are they all from rich countries? And if so, are you going to sell this to low- and middle-income countries? Maybe you shouldn't. This is the sort of transparency that we really want from algorithms.
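As a toy illustration of the idea, a team card could be as simple as a structured record published alongside the data card and model card. The fields below are purely illustrative and are not the schema from the published paper.

```python
# Toy "team card" example (illustrative only; fields are hypothetical,
# not the schema from the published paper).
import json

team_card = {
    "model_name": "example-sepsis-risk-model",  # hypothetical model
    "team_size": 12,
    "roles": ["clinicians", "data scientists", "nurses", "patient partners"],
    "gender_composition": {"women": 7, "men": 5},
    "countries_of_affiliation": ["US", "Philippines", "Brazil"],
    "clinical_specialties": ["critical care", "nursing informatics"],
    "patient_partners_involved": True,
    "intended_deployment_settings": ["academic centers", "community hospitals"],
}

# Publishing this next to the data card and model card lets readers judge
# whether the team's composition matches the populations the model targets.
print(json.dumps(team_card, indent=2))
```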
This is a bit more technical and maybe less relevant at this time, but I think it's going to become more relevant. We also want a tally of how much compute was used and how much carbon emission was generated as a result of building the algorithm, and of running it on your phones and computers. Because as we use AI, as we said, we are melting one iceberg at a time, and I think that should also be an important consideration when looking at the value of AI as we see it now. And again, if it's a language model trained on clinical notes, we want to know the composition of the patients who contributed their notes to the development of that language model.

I'm going to skip the glossary of terms, because you may not remember the definitions if I go through them now, but we will share the slide deck with everyone; we're just figuring out how to do that. There will also be a recording of the entire workshop, so you can watch it over and over again, or even use a language model to summarize the entire workshop and write a paper. But now I will pass it on to Teresa, who will describe a few studies that we would like you to jump into and ask questions about.

You should all have the TRIPOD documents that were out on the table; if you don't, we have extra copies up here and can send them around to help you evaluate. We already covered my introduction, so let's move on. Really quick: you'll see some images in the slides. I actually created those using AI tools, unless I tell you otherwise; that's where they came from. So you have the TRIPOD checklist; we have it up here in hard copy if you would like one, and you can also scan this QR code if you prefer to have the document on your phone. Sorry, there it is; I know the screen is turned that way, so I'll have to look over there. That checklist will let you work through some of the items Leo already went over, so we're going to keep going.

This is paper one. This side of the room over here, I'm going to have you evaluate paper one. Get your phones out and scan the code. You can scan all of the papers if you want to look at them at some point; you'll also get a copy of the slide deck, so you'll be able to get these papers later. It looks like everybody on that side of the room has scanned. Then paper two: the first four rows on this side, why don't you look at paper two. Paper three. Oh, back to paper two. It's not coming up? Okay, then we'll skip that paper; you go ahead. Everybody on this side of the room, try paper three, and make sure you can get into it. You should be able to; I tested all of these. What happened? Oh, these were swapped. Got it. Okay, sorry, I apologize for that; I got locked out of the system and couldn't check my slides until today. All right, so you should have two different papers: the first paper you scanned on this side will be for the first four rows, and the second paper will be for the next rows. It's not opening? Then choose one of the others, because I don't want to spend too much time on this. If you can't get in, just choose a paper; make sure you can get into one of them, either one.
Since that was a problem, just pick one, whichever paper you like. For the first half of the room here, up to about four or five rows, go ahead and scan paper four; paper five will be for the back of the room over here. Basically, you can use anything you've learned today: the TRIPOD tools, if you'd like, to evaluate the paper, or anything you've heard from the speakers. I want you to take a few minutes as you read the paper to think about how you would evaluate it. I think this goes well with the question we had from one of our colleagues over here, who was reviewing a paper and trying to figure out how to review an AI paper. Some of us do peer reviews, and these kinds of tools can help you put questions back to the authors; if they haven't shared their code or their dataset, maybe that's a question you raise in your review. So: look at the TRIPOD evaluation tool, pick out some items, review the paper, and draw on anything you've heard so far. We're going to be here; just raise your hand if you have a question and we'll walk around and help you out.
Video Summary
Leo and his colleague Teresa Rincon announced the upcoming datathon organized by the Society of Critical Care Medicine (SCCM), set for July in the Chicago area. These events bring healthcare professionals together with machine learning experts, emphasizing the value of domain specialists who understand the data over technical expertise alone. Leo, who practices in the ICU and leads health data science capacity-building research hosted at MIT, stresses the importance of understanding the dataset, noting that challenges in AI deployment often originate at the data generation and collection stage. He criticizes sensationalized reporting of AI model performance and advocates for metrics based on real clinical outcomes. Teresa, a biomedical and healthcare informatics educator, and Leo highlight concerns about AI applications in healthcare, especially the risk of harm from uninformed use and the lack of local validation in smaller hospitals, and they advocate partnerships between hospitals and academic institutions to address these gaps. The session focuses on critical thinking for robust AI evaluation, introducing guidelines such as TRIPOD+AI and TRIPOD-LLM. Attendees evaluate AI research papers, applying these concepts, with emphasis on understanding data preparation, predictor variables, and the ethical aspects of AI in healthcare.
Keywords
Datathon
SCCM
healthcare AI
data analysis
AI ethics
clinical outcomes
TRIPOD guidelines
Society of Critical Care Medicine