Current Concepts in Pediatric Critical Care
19: AI and Machine Learning in the PICU
Video Transcription
All right, thank you so much, Kyle and Sharon and Ty and others, for inviting me to talk. In the next 30 minutes I'm going to take a topic that some people might be intimidated by, try to keep it at a level I think we can all understand, and show a couple of applications as well. I have no particularly relevant disclosures for this content. A couple of learning objectives: I really want to focus on the basic approaches to understanding artificial intelligence and machine learning models and methods. We're going to talk about supervised learning, unsupervised learning, and, very lightly, deep learning. We're going to describe some of the performance metrics and limitations. These are things that I think we as clinicians and providers need to understand when we're reading these papers and trying to figure out what is useful at the bedside, what is not, why things are not getting to the bedside, and what we think is coming down the pipeline. And then we'll discuss a couple of examples of current applications. The chapter that my co-authors Mark Mai and Serene Shah and I put together has a couple of other examples in it and some references, so if you are interested in additional examples, please check out that content.

So again, this is the overview. I'm going to briefly touch on the content area of clinical informatics just to describe that, then we're going to talk about AI/ML evaluation and critique, and then examples. Clinical informatics is a medical subspecialty; I am board certified in clinical informatics as well as pediatric critical care. It really combines this idea of clinical care, the health system, and information and technology. So when you think about a subspecialty about your electronic health record, it really is much more than that. It's about thinking through all the different topics of technology and healthcare, legal and regulatory issues, bringing in some content around data science and clinical decision support, understanding the workflows that we work in as providers so we know where to put these tools and how to get them to the bedside at the right time, and then thinking about some of the other topics surrounding this figure. The goal of clinical informatics, and this is an older figure, but I think it still holds true, is really not to take the clinician out of the picture. It's not to put the computer interfacing directly with the patient without the clinician. It's to make the clinician function better, work better, with the aid of computers. The American Medical Informatics Association, which is our academic organization, defines informatics as the application of information and technology to deliver healthcare services. I think there are three big buckets of informatics. There's operational work, which some people might be familiar with; those are the folks who are choosing which electronic health record your system uses and what upgrades to do, and the folks you talk to when it doesn't have the right decision support, the right order set, things like that. There are folks who are interested in using the tools of informatics for research purposes, such as myself and others. And then there are folks who are using informatics tools to further education goals or who teach about informatics. Those are the three main buckets. Now, we do have a terminology problem in informatics. There are often a lot of interchangeable words, and people use similar words for different content.
So I'm just going to lay the groundwork. The broad topic of informatics applies to people, information, and technology, and doesn't necessarily have to be in the clinical space; there's certainly a lot of informatics work in industry. Biomedical and health informatics is sort of the overarching umbrella, and many departments are actually departments or centers of biomedical and health informatics. It covers bioinformatics, the proteomics and cellular and molecular work; medical or clinical informatics, which is what I'm really going to focus on; and other areas such as public health informatics. Within that umbrella of clinical or medical informatics, there are other subfields as well. And then we have a couple of adjacent fields: things like information technology, which is adjacent to, but not the same as, informatics; digital health; health information management, really managing the health record; and then data science. We're going to touch a little bit on data science in this talk. The idea of data science has certainly grown quite prevalent recently as more and more models are being produced, and it's become common in the lay press to discuss how clinicians are being superseded or replaced by computers. But let's take a step back and think about the broad topics of data science. The real idea of data science is to extract information and knowledge from the data that we have. This can be done in a number of different ways. We use data mining tools or techniques to glean the information that we need, and certainly as ICU clinicians, we know we have a ton of data available. We use machine learning or artificial intelligence tools, computational tools, to help us understand or make sense of that trove of data. And then we really, really rely on domain expertise. I say this is critical because data scientists often will come in without healthcare- or critical care-specific domain expertise, and trying to identify a problem and understand the implications of the tools developed without that domain expertise almost inevitably leads to failure. So it's really important for all of us as clinicians and experts in this area to provide that domain expertise to assist the data scientists. This is a figure from our text where we break down the hierarchy of artificial intelligence and machine learning, just to center what we're talking about when we use these terms. Artificial intelligence is the broad, overarching term for everything underneath it, and it can include things like natural language processing, robotics, and expert systems. Within artificial intelligence, there's a subdomain of machine learning, and we're going to talk a little bit more about that subdomain. It includes things like supervised learning, unsupervised learning or clustering, and some reinforcement learning tools. A subset of machine learning is deep learning, which usually utilizes neural networks, computational networks based on the way our brain processes information. And then finally, a subset of that, and I believe the most talked-about topic in deep learning right now, is generative AI tools. That's where we get into things like transformers and large language models, or your ChatGPTs. So just taking a step back up to machine learning, I want to talk about some of the groupings of machine learning.
This is a slightly older but still very relevant figure from Tell Bennett's 2019 pediatrics article, which looks at the three main groupings of machine learning, and I'll pull in some side figures from Matthew Churpek and Nelson Sanchez-Pinto's work. Supervised learning is really about having a labeled set of classes: we understand a set of data and we've labeled it with a set of outcomes. So for example, you can see here we have a number of patients listed; they have a number of different clinical features, and they have outcomes associated with those features, either survived or did not survive, and those outcomes are known and labeled. We take those features, we feed them into a machine learning model, and then, given a new patient without that label, we predict the likelihood of the label, survived or did not survive. Obviously those features and labels can be many different things, but the general idea of supervised learning is that you start with a labeled subset, you know the end result for that subset of patients, and then you move on to predicting for new patients. Unsupervised learning is quite different. In unsupervised learning, we don't pre-specify a label. It's about taking a heterogeneous population, in this example a number of different circles, and separating them out into different clusters by a variety of different features. Those clusters can then be grouped together, and you can examine outcomes from those clusters. So you're not pre-specifying a set of outcomes or a pre-specified set of labels. And then lastly, deep learning is based on the idea of using the brain's structure, the neural network structure, to take in inputs and feed them through a number of different layers, typically an input layer, one or many hidden layers, and finally an output layer, to generate results. This is often used in signal processing, one-, two-, and three-dimensional or higher-dimensional signal processing, so image processing makes use of a lot of deep learning tools. I am not going to focus more than this slide on large language models and ChatGPT. Certainly there's enough out there on that content. I think at this point it's relatively early in its use in our practice. I think it's being used a lot in the ambulatory space for understanding text and notes and generating content, things like after-visit summaries, but it's relatively in its infancy here in critical care and the inpatient space. I do think there's a ton of other content to talk about with regard to machine learning and artificial intelligence, and so we're going to avoid most of that. And ChatGPT agreed with me when I asked if it was reasonable to discuss healthcare without discussing generative pre-trained models. It thought it was. So that was great. Sort of talking itself out of existence. So basically, I'm going to step through a typical time course or outline of a machine learning problem and then point out where some of the problems and gaps occur. Let's say you're coming to this as a team of a data scientist and a clinician, and you have a problem that you want to try to develop a prediction model for. We're going to focus mostly on supervised learning because that tends to be the one we're seeing most often. There are some great studies and some great talks here around clustering and understanding heterogeneity of treatment effects and things like that in the unsupervised space.
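To make those groupings a bit more concrete, here is a minimal sketch using scikit-learn and purely synthetic data; the "clinical features" are random numbers, and nothing here comes from the studies or figures being discussed.

```python
# Minimal illustration of supervised vs. unsupervised learning on synthetic data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Supervised: every row of X has a known, labeled outcome y (e.g., 1 = survived, 0 = did not).
X = rng.normal(size=(200, 4))                                  # four made-up clinical features
y = (X[:, 1] + 0.5 * rng.normal(size=200) > 0).astype(int)     # synthetic labeled outcome

model = LogisticRegression().fit(X, y)                         # learn the feature-to-label mapping
new_patient = rng.normal(size=(1, 4))
print("Predicted probability for a new, unlabeled patient:",
      round(model.predict_proba(new_patient)[0, 1], 2))

# Unsupervised: no labels at all; the algorithm only groups similar patients, and any
# comparison of outcomes across the resulting clusters happens afterward.
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print("Cluster assignments for the first 10 patients:", clusters[:10])
```

A deep learning model would swap the logistic regression for a multi-layer neural network, but when labels are available it follows the same supervised pattern.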
We're going to focus mostly on supervised learning here. So you identify a problem that you have. You need to gather some clinical data. Typically you'll then develop, test, and validate a model internally in your own institution or your own health system. Ideally you'll then take that model and bring it somewhere else to have it externally validated, and this is really important to show that you haven't simply overfit your model to your data; you really want to show generalizability outside of your own space. That can be within your own health system in another unit, or potentially in other health systems, or in some of the data sets we'll talk about in a minute. The next step would be to bring it to the bedside, and that arrow should be a lot longer, because that's obviously a big chunk of time. You'd like to evaluate usability and effectiveness to make sure you're not just putting a tool there that nobody's going to use or that isn't usable. And lastly, you need to monitor it, because these models can drift over time: changes can occur in the data coming in, and that can change the model's predictions. The things that are highlighted in red here are things that we as a community do relatively well now, and you'll see a number of papers, abstracts, and RST sessions around them here at the meeting. The things that are not in red are things we're not good at. We really don't do a great job of externally validating models, implementation is quite poor, and I would say usability, effectiveness, and monitoring are even worse. Just to highlight the fact that implementation has not gone particularly well for these models, there are two studies I want to show. The first one was published in PCCM in 2024; it was work done out of the PEDAL subgroup of PALISI, the data science group. We were looking at supervised machine learning models in pediatric critical care medicine through this scoping review, and we found 141 studies that had been published in this space. Only one had been implemented at the time of publication; we actually contacted all of the authors to ask whether their models had been implemented since, and only 17 of them had been. This is a different study that was done by a group of pediatric clinical decision support folks. We looked at machine learning-based models in the pediatric space, so this was not specific to pediatric critical care. We screened 8,000 abstracts and found 17 studies that were implemented. I will say that although the number is the same, these are not the exact same 17 studies. But again, the idea is that a number of studies are being published and very few of them are actually making it to the bedside, unfortunately. So now we're going to talk a little bit about how we evaluate and critique these articles and these models, and this is based on an evaluation and critique framework that was set out in this 2024 PCCM article. This is the article if you're interested, and it does outline some of the principles we're focusing on. So the first idea we're going to talk about is choosing your data set. Then we're going to cover a little bit about data preparation, understanding model performance, and then thinking about how that translates to real-world performance, because those are two very different things.
So it's really important, when thinking about data to be used in model development, to understand the breadth and depth of the data we're using. Is it local, local to our unit or local to our hospital? Does it represent a broad swath? The data that we choose are going to inherently bias our model. There are a lot of options for clinical and biological data. Certainly the most common is for people to use their own local site data. That typically requires relatively little in terms of standardization and harmonization, certainly compared to when you move to multi-center, multi-site data. There are a number of common data models; that's a term you might hear, and it basically means that the data share the same structure across sites and across institutions. And then there are several pediatric research networks that I want to highlight. I've put a few up here on the slide as opportunities or networks that have data available and are always looking for collaboration. I've made use of a few of these myself, and I want to highlight the PICU Data Collaborative somewhat selfishly, because this is work that I'm intimately involved with, as are a few other folks in this room. This is a multi-center, now 19-site, granular data set extracted from the electronic health record. In our current eight-site data we have 150,000 to 160,000 pediatric ICU encounters, and that's only going to grow as we add the remainder of the 19 sites. We hope that it will be a resource to serve as an external validation data set for some of the work we're talking about here. So what is involved in data preparation? I would say most people who've done this work can agree that 80 to 90% of the work is in data preparation, the last 10 to 15% is in building the model, and then 5% is in evaluating it. It's really about understanding all of the different features and elements in the data, how we clean those data, and then how we prepare them for model development. I'm not going to get into a lot of the details of that piece, and I'd be happy to talk about it more individually, but I do want to say that when we think about how studies report the machine learning models they've built, it's really important to ask how they split the data coming in. When you're developing a machine learning model, you really want to have multiple data splits. You want a set of data that you're training on, you want a set of data that you're validating on internally, and then you want a set of data that you've kept off to the side, in the dark room, in the closet, and you're not letting anyone touch it until you have your final, final model, and then you test it on that data set. Because that's the true way to validate whether your model works on a clean, new set of data. It is a lot easier to get good results if you train, test, and validate on the same data set, because you are overfit to that particular data set. There are a number of different ways to do these splits. If you have the fortune of using multi-site data, and this is a figure from Serene Shah's proposal looking at an eight-site data set, you can see that you can hold out specific sites. More common recently has been to do temporal holdouts, which is holding out newer data: you train on older data, and then you hold out newer data. And then another method, as shown in this figure, would be to do a combination of site and temporal holdouts, so you sort of get the advantages of both. And for papers that are not transparent about how they did these splits, be very cautious of that work.
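As a rough sketch of those split strategies, the snippet below assumes a pandas DataFrame of encounters with hypothetical site and admit_date columns; the sites, dates, and column names are illustrative and are not drawn from the PICU Data Collaborative.

```python
# Sketch of site holdout vs. temporal holdout splits on a toy encounter table.
import pandas as pd

def site_holdout(df, holdout_sites):
    """Train on some sites; test on sites the model has never seen."""
    mask = df["site"].isin(holdout_sites)
    return df[~mask], df[mask]          # (train, test)

def temporal_holdout(df, cutoff):
    """Train on older encounters; test on encounters admitted after the cutoff."""
    mask = df["admit_date"] >= cutoff
    return df[~mask], df[mask]          # (train, test)

encounters = pd.DataFrame({
    "site": ["A", "A", "B", "B", "C", "D"],
    "admit_date": pd.to_datetime(["2019-03-01", "2021-07-15", "2020-02-10",
                                  "2022-01-05", "2019-11-20", "2021-09-30"]),
    "outcome": [0, 1, 0, 1, 0, 1],
})

train, test = temporal_holdout(encounters, pd.Timestamp("2021-01-01"))
print(len(train), "training encounters;", len(test), "held-out newer encounters")

# A combined holdout applies both filters, e.g., only newer encounters from held-out sites.
train_s, test_s = site_holdout(encounters, holdout_sites=["D"])
```

The key point is that the held-out rows, however they are chosen, are never touched during training or tuning.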
So I'm just going to briefly step through an example of model performance, because at the end of the day, a typical logistic regression model, or other types of machine learning models, results in a number from zero to one, and it's up to the authors, the model developers, to decide what cutoff they're going to choose to turn that into a yes-or-no prediction. This is a very basic example just to highlight some of those principles and ideas. Let's presume for a moment that the blue dots are the true, gold-standard yeses, and the red dots are the true, gold-standard nos. The model is a logistic regression function, and you choose a cutoff right in the center. You can see that there are a couple of points that fall on the opposite side of that cutoff: two blue dots at the top that fall below, which means the model would have predicted no when in actuality it was a yes, and one red dot at the opposite end that the model would have predicted yes when in actuality it was a no. This is a pretty basic, straightforward example from statistics: if we create a two-by-two table, or confusion matrix, we can count the two actual yeses that are reported as no and the one actual no that's reported as yes, and that allows us to compute a sensitivity, a specificity, and a positive predictive value. Now, if we slide our cut point a little bit to the side, because we really want to make sure we include all of those yeses, you can see the effect: you eliminate the model choosing no when it actually was a yes, but you increase by one the times the model says yes when the answer was actually no. So what does that do? It gives you an extra false positive. This obviously has implications if the model were put into practice, because the treating team is going to receive an alert, or in some way be notified, when in actuality there is nothing there. So that lowers your positive predictive value, or your precision, but your sensitivity is great. You can also go the other way. Say you really don't want to be alerting clinicians about false positives, so you slide the cut point the other way and give up a few of those blue dots at the top. In this case, you're going to lower your sensitivity, but you've made your positive predictive value great. The point of this exercise is to emphasize that there is an inherent trade-off in these models between sensitivity and positive predictive value. We can either be very sensitive, with a low false negative rate, or we can have a high positive predictive value, with few false alarms, but we likely cannot do both, and we have to find a balance between the two. And it's very context dependent, meaning we really have to understand what we are trying to predict. There are some things we can't afford to miss at all, so we're going to accept more false positives. There are some things we're willing to miss, and then we can slide the cut point the other way. We take these three cutoffs that I just mentioned and plot them on a true positive rate versus false positive rate plot, and that's how we end up with our receiver operating characteristic curve, which you'll see in a lot of papers. Typically they'll report the area under the receiver operating characteristic curve. An area of 0.5 means it's a coin flip. An area of less than 0.5 means they should have flipped their model around. An area of more than 0.5 is better than a coin flip, and you really want to aim towards an area under the receiver operating characteristic curve of 1.
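Here is a minimal sketch of that cut point trade-off using synthetic labels and scores with scikit-learn's metrics; the numbers are invented purely to show sensitivity and positive predictive value moving in opposite directions as the cutoff slides.

```python
# Same model scores, three different cutoffs: watch sensitivity and PPV trade off.
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])   # gold-standard labels
y_score = np.array([0.92, 0.81, 0.62, 0.38, 0.71, 0.33, 0.25, 0.20, 0.12, 0.05])

for cutoff in (0.30, 0.50, 0.75):
    y_pred = (y_score >= cutoff).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    sens = tp / (tp + fn)                                   # sensitivity (recall)
    spec = tn / (tn + fp)                                   # specificity
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")     # positive predictive value
    print(f"cutoff={cutoff:.2f}  sensitivity={sens:.2f}  specificity={spec:.2f}  PPV={ppv:.2f}")

# The ROC curve sweeps every possible cutoff; its area summarizes discrimination overall.
print("AUROC:", round(roc_auc_score(y_true, y_score), 2))
```

Lowering the cutoff catches every true yes but adds false alarms; raising it does the reverse, which is exactly the trade-off described above.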
Now, this area is inherent in the model; it's not based on the population. What I mean by that is that there are class imbalance issues with these models that can drive precision one way or the other. Let me explain what I mean. We just looked at an example where we picked a cut point and found a sensitivity, a specificity, and a positive predictive value, as you can see on the screen. That is a relatively balanced class, meaning the gold standards are relatively evenly split between yeses and nos. But for most of the things we're trying to predict in pediatric critical care, that is not the case. We don't have an even split of patients with sepsis and patients without sepsis at any given time in our unit. We don't have even splits of patient mortality; we're pediatric clinicians, not adult clinicians. So what we find is relative class imbalance, meaning that you'll have 14 yeses, for example, and 1,200 nos. That's a pretty imbalanced class. This is the same model sensitivity and specificity run through that two-by-two table, and you can see the positive predictive value drop from 92% to 11%. That means that as a clinician, if this model were implemented in your unit and these relative incidences held true, you would have a huge false positive rate. You would be alerted a number of times when there wasn't actually a problem. The number needed to alert is the inverse of the precision, or positive predictive value, and it's the number of times the model will alert to find one true positive. You can see it went from about 1.1 to 9. So of those nine alerts that you're going to get, only one of them is actually true. This is a huge problem in pediatric critical care, it's a huge problem in model development, and it's really why any model paper that you read needs to report the positive predictive value, or precision, because that's what tells you whether it will actually be useful in implementation. So in the last few minutes, we're going to step through just two examples, and I'm going to highlight a couple of these ideas that we've talked about. The first example is from Aaron Masino and a group at Children's Hospital of Philadelphia; Bob Grundmeier was the senior author. They were looking at developing a machine learning model for sepsis recognition in the NICU. They identified a number of features from the electronic health record. They used a retrospective case-control study of hospitalized infants: they knew which infants had developed sepsis, and they looked back for the data they wanted to pull. They had 36 individual features; features are the covariates, effectively, that are going into the model. And they used a number of time windows; in this case, the one they reported was a 44-hour window that ended four hours prior to the evaluation. So just a couple of the things they did that I already mentioned: they did some data extraction and some data processing, which we didn't really touch on. They had to figure out how to deal with missing data, right? Not all of our patients have all of the features we're looking for; we don't always get blood gases or all of our labs on all of our patients, for example. So you have to figure out how to deal with missingness, and how to normalize those data. Normalization is important because some data values are just numerically higher and some are lower, and if you put them into the same model, a mean blood pressure somewhere in the 60s or 70s is going to look different to a computer than a creatinine of 1. So you have to normalize those data and scale them.
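A minimal sketch of what that imputation and scaling step can look like with scikit-learn is below; the toy matrix (a blood pressure column next to a creatinine and a lactate column) is an illustrative assumption, not data from the study.

```python
# Impute missing labs, then rescale features so a MAP in the 60s and a creatinine near 1
# end up on comparable numeric scales before modeling.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Columns: mean arterial pressure, creatinine, lactate (toy values; NaN = never measured).
X = np.array([
    [65.0, 0.4, 1.1],
    [72.0, np.nan, 2.3],
    [58.0, 1.2, np.nan],
    [80.0, 0.6, 0.9],
])

preprocess = Pipeline([
    ("impute", SimpleImputer(strategy="median")),   # one common way to handle missingness
    ("scale", StandardScaler()),                    # zero mean, unit variance per feature
])
print(np.round(preprocess.fit_transform(X), 2))
```

Median imputation and standard scaling are only one of many reasonable choices; the point is that the handling of missingness and scale has to be decided and reported.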
And then they did a number of fancy techniques to train the model and evaluate it. They ended up with a receiver operating characteristic curve, as you can see here, and it looks similar to the one we showed earlier; they plotted their curve as steps rather than a smooth line, so that's why it looks a little different, but the same idea holds. They have an area under the receiver operating characteristic curve of 0.83, which would be considered moderately discriminant. They then fixed their sensitivity: they chose to fix the sensitivity at 0.8, and with that they got a positive predictive value of 0.23, which gives you a number needed to alert of about four. So you look at that and you say, that's actually not too bad for identifying sepsis in neonates. I could accept three false alarms for one actual alarm. That's reasonable. One of the challenges with that is asking yourself: is this reflective of real-world performance? Is this what the model will actually do if I put it into the system? Well, remember, they looked at a study population in which two out of every eight infants, one in four, had sepsis. Because it was a case-control study, they limited the number of controls, and that gave them a falsely elevated precision, or positive predictive value. When you take those infants and put them into the much, much larger real population of neonates, what you would find is that your positive predictive value would drop substantially, maybe dramatically, and your number needed to alert would increase similarly. The number of false positives would increase tremendously, because the control population in the study was limited. So these are some of the nuances that are really important to understand when reading these types of studies: what was the population they were looking at for development, and have they looked at what this would actually look like if it were implemented in a real-world setting, in an ICU?
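The arithmetic behind that caveat is straightforward to sketch: holding sensitivity and specificity fixed, the positive predictive value, and therefore the number needed to alert, is driven by how common the condition actually is. The sensitivity and specificity below are assumed round numbers for illustration, not values from the study.

```python
# Same sensitivity and specificity, very different PPV and number needed to alert
# once the condition becomes rare (the class-imbalance / case-control issue above).
def ppv_and_nna(sensitivity, specificity, prevalence):
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    ppv = true_pos / (true_pos + false_pos)
    return ppv, 1.0 / ppv                           # number needed to alert = 1 / PPV

for prevalence in (0.25, 0.05, 0.01):               # case-control mix vs. increasingly rare event
    ppv, nna = ppv_and_nna(sensitivity=0.80, specificity=0.90, prevalence=prevalence)
    print(f"prevalence={prevalence:.2f}  PPV={ppv:.2f}  number needed to alert={nna:.1f}")
```

At a one-in-four case-control mix the model looks quite precise, but at real-world prevalence the same sensitivity and specificity generate mostly false alarms.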
And the last example that I'll show is a little bit older, but I still think it's relevant. This is from Robbie Khemani and Nelson Sanchez-Pinto, when he was in LA, developing an early acute kidney injury model using electronic health record data. They were attempting to develop a parsimonious model, one with a relatively limited feature set. They were able to take 33 initial variables and reduce them down to seven, similar to developing a risk score, almost; they wanted it to be computable almost with a calculator rather than needing a computer. They looked at any-stage AKI by KDIGO criteria within 72 hours of admission, so relatively early AKI, at three days, using data that were available in the first 12 hours of admission. So the data had to be available by 12 hours of admission, and they were predicting AKI by three days of admission. And I'll just highlight one of the ways they went about developing this model, because again, it's important when you're thinking about model development to understand how the authors chose the features in the model. Did they do it by expert consensus? Did they say, I think the things that are going to predict X are Y and Z, or A, B, and C? Or did they use a data-driven approach? In this case, they used a data-driven approach. They looked at the univariate odds ratios for each of the different predictors, and they used the ones that were statistically significant. So these are the univariate odds ratios for the predictors of AKI that were significant. Now, you can set your statistical significance threshold a little more lenient than the typical 0.05, because you're going to be putting these into a multivariable prediction model, and that will pull out the ones that remain significant. But that's a more data-driven approach than simply saying, I think it's going to be X, Y, and Z, and I'm going to use those. This is their area under the receiver operating characteristic curve, and you can see that for their model they chose to smooth the curve. It shows an AUC of 0.84, again similar, and I would argue relatively moderate discrimination. They then plotted the performance measures, and the importance here is that they showed both the derivation and the validation cohorts, so they were transparent in reporting that performance was better in the derivation cohort, the one they derived the model in, than in the one they validated it in. They also reported it at two different cut points. Remember that the prior paper chose a fixed cut point with a sensitivity of 0.8; in this case, they used two different cut points, two percentiles, the 50th and the 90th, and you can see how the sensitivity and the specificity change with those cut points. You can also see that inherent trade-off between positive predictive value and sensitivity at those two cut points. So for example, in the validation data set, and I didn't highlight this here, the sensitivity went from 85 at the 50th percentile down to 58 at the 90th percentile, whereas the positive predictive value went from 7 up to 21. So you can have greater precision if you are willing to accept a lower sensitivity. And similarly, their number needed to alert went from about 14 down to about 4 as that positive predictive value went up. So again, it reflects the idea that as you change your cut point, you're going to change how sensitive your model is and how many false positives you get, and that's going to really impact how you can bring this to the bedside and what your content experts feel is important.
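For the data-driven feature selection approach described above, here is a minimal sketch using statsmodels on synthetic data: each candidate predictor gets a univariate odds ratio and p-value, and predictors passing a deliberately lenient threshold go into a multivariable model. The threshold and the made-up predictors are assumptions for illustration, not the variables from the AKI study.

```python
# Univariate screening by odds ratio and p-value, then a multivariable logistic model.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 500
X = rng.normal(size=(n, 5))                                    # five candidate predictors
y = (0.9 * X[:, 0] - 0.7 * X[:, 2] + rng.normal(size=n) > 0).astype(int)

SCREEN_P = 0.20                                                # more lenient than 0.05 on purpose
kept = []
for j in range(X.shape[1]):
    uni = sm.Logit(y, sm.add_constant(X[:, j])).fit(disp=0)    # univariate logistic regression
    odds_ratio, p_value = np.exp(uni.params[1]), uni.pvalues[1]
    print(f"predictor {j}: OR={odds_ratio:.2f}, p={p_value:.3f}")
    if p_value < SCREEN_P:
        kept.append(j)

multivariable = sm.Logit(y, sm.add_constant(X[:, kept])).fit(disp=0)
print("predictors retained for the multivariable model:", kept)
```

The multivariable fit then thins the screened list further, which is how a 33-variable candidate set can end up as a seven-variable bedside score.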
So the take-home points that I hope I was able to get across are that machine learning tasks are typically divided into supervised learning, unsupervised learning, and deep learning. There's a pipeline to development that includes data acquisition: we have to take the data, impute and normalize them, derive the model, and then ideally perform internal and then external validation. There's a trade-off inherent in model development, specifically between sensitivity and positive predictive value, and it can be magnified by the class imbalance that we often see in the problems we're working on. Post-implementation evaluation and monitoring, think phase four drug monitoring, are necessary to ensure that real-world performance continues. And unfortunately, the vast majority of machine learning models are not brought into clinical decision support in the real world; that's definitely a gap in the literature and a gap in research, and hopefully one that folks here will be interested in filling. So with that, I will finish up. Thank you very much. Thank you.
Video Summary
The speaker provides an overview of understanding and developing artificial intelligence (AI) and machine learning (ML) models, particularly in clinical informatics. The focus is on simplifying complex concepts regarding AI and ML, covering essential learning objectives like supervised, unsupervised, and deep learning. The speaker underscores the clinical relevance of these technologies, focusing on performance metrics and limitations to discern what is beneficial at the bedside. They discuss the workflow of developing prediction models, focusing primarily on supervised learning and emphasizing the importance of external validation to establish model generalizability. The speaker outlines a standard pipeline for ML projects and identifies common pitfalls, particularly in model implementation and real-world application. Through examples involving sepsis recognition and early acute kidney injury models, challenges like data set selection, model overfitting, and sensitivity versus positive predictive value trade-offs are illustrated. The summary emphasizes the gap in translating these models from research to practical clinical use.
Keywords
artificial intelligence
machine learning
clinical informatics
supervised learning
model validation
prediction models