Data Issues Upstream of Machine Learning
Video Transcription
All right, thank you. I'm Omar Badawi. I'm currently the chief of the Division of Data Sciences at the Telemedicine and Advanced Technology Research Center within the Defense Health Agency. My original background is as a critical care pharmacist. After academia, I worked for VISICU, a startup that was later acquired by Philips, and helped develop and manage the eICU Research Institute database for many years, though that was several years ago. I've done some past, not current, consulting with Clue Medical and Siva Health, and I am not here to represent the Department of Defense, the Defense Health Agency, or the government in any way; this is all my personal opinion.

Here's what I'm going to talk about today, and feel free to jump in with questions like the good ones we just heard. By the end of this, I hope everybody can describe the importance of data provenance as well as its limitations, assess different scenarios where commonly found data can lead to misleading model inference, and spend a little time with me proposing approaches to mitigate some of these insidious data issues.

First, and this is a perfect segue from the question that was just raised: how well do you really understand your data? There are four categories of competence. The first is unconscious incompetence, made famous many years ago by the "we don't know what we don't know" comment; that's when you're unaware of your knowledge deficit. You're moving along blissfully, you know what you know, and there are things so far outside your realm of consciousness that you don't even realize they exist. Then someday you become aware of them and say, wow, I didn't know that existed, but I have no idea what it is. Now you're in the stage of conscious incompetence: you have some humility, because now you know it's there, but you still don't understand it. Then, hopefully, you learn and understand it, and that becomes conscious competence: you get it, but it requires concentration; you have to think about it. Finally, some things eventually move into unconscious competence, where you know them so well they become second nature and you don't even have to think about them.

I would argue that when you're using data for research, in general but especially in healthcare research, you should assume you're living in quadrant number one: you don't know what you don't know. What I learned over 20 years of this work, probably in the first 10, was that every time I was confident about an assumption, I would eventually get burned. It might not be until years later, but at some point something comes up to humble you and show that you really didn't understand what you thought you did. People sometimes get a little frustrated with me because it doesn't seem like I have a lot of confidence in some of the things I'm talking about, and it's just because I've been beaten down so many times by data that I've learned to verify these things. Until you're at that stage, you're probably going to make some incorrect assumptions and go wrong.
So here's a scenario, one of the first ones we usually hit on. Let's assume you're a data analyst at a hospital doing a study on ICU readmission, using a data repository from your EHR or your ICU clinical information system, and the team wants a report on last year's readmission rate and whether it differs by ICU. You look in the database and find a field called ICU readmission. Can you just tabulate it, group by ICU, and report the readmission rate? What do you think that flag means? I'll give you a few options. Option A: the patient was discharged from the ICU to a lower acuity location, followed by transfer back to an ICU. Seems reasonable. Option B: the patient was discharged to a lower acuity location, followed by transfer back to an ICU, but only if at least 24 hours passed between stays. A little odd, but it could be something. Option C: the patient was discharged from an ICU to a lower acuity location and transferred back to an ICU, but only if less than 48 hours passed; we've seen a lot of people use that as a metric, readmission within 48 hours. Option D: the patient was in an ICU with flex or universal beds and their level of care was downgraded to step-down status; is that even a readmission? Or option E: not enough information. How many would vote for A? B? C? D? E?

The answer is E, because every one of the other options has literally been true in some database, depending on which one you're using and how it's organized. You have to throw logic out the door on what you think you know. It's not about what's clinically right; it's about how the data is defined and organized, and you just don't know. You have to go investigate, and you can't assume these things.

In research, the FDA and others talk a lot about data provenance. Data provenance is a documented trail that tracks the origin of data, logs its movements and changes over time, and ensures credibility and transparency for the research data. This sounds great, and it is important: you want to know how data flowed from one spot to the next, and when it was or wasn't changed. Why does that matter? It gives you confidence in the validity of your research, guarantees transparency from data creators, and provides a chain of trust for data reuse and adaptation: who entered the data? Was it edited, and if so, when and by whom? Was it replicated or received via an HL7 interface? That whole chain of custody is tracked with provenance. A lot of people would say that solves most of your problems: we have the whole chain of events, so we should be good. Do you think that by tracing that lineage you'll feel confident? You already know me, and you know there's zero confidence here. Even with good data provenance, full transparency is rare, and, as was just mentioned, measurement error often reveals more uncertainty than anticipated when you really look at the data closely. The vital signs, the labs, the things that were charted, the diagnoses, the problem lists, even the birth dates and the admission and discharge date-times: there are errors everywhere.
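To make the ambiguity concrete, here is a minimal sketch, assuming a hypothetical `icu_stays.csv` table with one row per ICU stay and illustrative column names, that computes a "readmission rate" under simplified versions of the competing definitions in options A through C:

```python
import pandas as pd

# Hypothetical ICU stay table: one row per ICU stay, with patient ID and
# in/out timestamps. Column names are illustrative, not from any real schema.
stays = pd.read_csv("icu_stays.csv", parse_dates=["icu_in", "icu_out"])
stays = stays.sort_values(["patient_id", "icu_in"])

# Hours between the end of one ICU stay and the start of the same patient's next one.
next_in = stays.groupby("patient_id")["icu_in"].shift(-1)
gap_hours = (next_in - stays["icu_out"]).dt.total_seconds() / 3600

# Simplified proxies for the talk's options A-C (ignoring the "lower acuity
# location" detail, which would need transfer-location data).
definitions = {
    "any_return":        gap_hours.notna(),   # A: any later ICU stay
    "return_after_24h":  gap_hours >= 24,     # B: at least 24 h between stays
    "return_within_48h": gap_hours < 48,      # C: less than 48 h between stays
}

# The headline "readmission rate" moves depending on which definition the
# source system's flag actually encodes -- which is the point of option E.
for name, flag in definitions.items():
    print(name, f"{flag.mean():.1%}")
```

The three rates can differ substantially on the same data, which is why the flag's definition has to be verified against the source system rather than assumed.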
That doesn't make the data useless, but it means there are challenges you need to be aware of and approach with some humility. Here's an example evaluating ICU discharge data; those of you who know me know I've done a lot with ICU discharge and readmission. You're analyzing a patient's chart to assess ICU discharge times. You've built a model, you're comparing against it, and you see that the patient was discharged later than expected. What can you conclude about their length of stay? Is it a no-brainer that the patient probably stayed longer than they needed to? Your model is good, it says they should have been discharged earlier, so they stayed too long, right? Especially with a really good model, people might trust that. But what about information that just isn't available to you? What if the hospital was experiencing bed strain and there were no other beds to move the patient to? Does that change how you interpret the findings? What if the ICU had universal beds that allow patients to move to general ward care without leaving the bed? These things aren't inherently documented in the data set. You may be able to find them if you work hard enough, but they're not always obvious. When you start looking at something as simple as length of stay, all of a sudden you realize there are other things that might influence it.

There are also things that none of us think about. I never thought about network drops before I was working in telemedicine: sometimes hospitals have network outages, and data can be lost. You're looking at the data years later, you have missing values, and do you even know the data is missing? Do you know why? Or are you just blaming it on bad data, on people not doing their job? Then there are replication errors. Most systems replicate data into an archive for reporting and research; sometimes defects occur, and data gets replicated incorrectly or goes missing. And there are reference data errors. Sometimes things have been coded and there are mismatches. You're just following the codes: I know they use this coding system, so I'll search for those codes and build all my research on them. Then one day somebody looks and asks, wait a minute, why is this drug name in your dataset? We didn't look for that drug, and it turns out somebody mismatched a code, and all of a sudden you've got drugs in there that you weren't supposed to have.

Some of these things are hard to envision, and maybe this is too abstract, so here is one of the wilder things I ever saw. It's really not this type of research, but there was an app we'd been working on that was being installed on tablets. Most of the time everything worked fine, but a whole batch of these tablets were corrupted, the data wasn't working right, the app was failing, and they had to be sent back in. The engineers were trying to figure out what was wrong with the app or the installation, and everyone searched for months and could not find a problem.
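For the reference-data problem, here is a minimal sketch of the kind of sanity check that can surface a mismatched code, assuming a hypothetical `medications.csv` table with illustrative `drug_code` and `drug_name` columns:

```python
import pandas as pd

# Hypothetical medication table: the analyst filtered on drug codes, but the
# code-to-name mapping in the reference data may be wrong or stale.
meds = pd.read_csv("medications.csv")

# 1. Does any code map to more than one drug name? That usually points to a
#    mismatched reference table rather than a clinical reality.
names_per_code = meds.groupby("drug_code")["drug_name"].nunique()
suspect_codes = names_per_code[names_per_code > 1]
print("Codes mapping to multiple names:\n", suspect_codes)

# 2. After filtering the cohort by code, do any unexpected drug names appear?
#    This is the "why is this drug in your dataset?" check from the talk.
expected = {"norepinephrine", "vasopressin"}          # drugs the study meant to capture
cohort = meds[meds["drug_code"].isin([1234, 5678])]   # hypothetical study codes
unexpected = set(cohort["drug_name"].str.lower()) - expected
print("Unexpected drug names in the code-filtered cohort:", unexpected)
```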
Then one day, two people from different offices sat down next to each other with tablets and said, okay, we're going to do this installation side by side and see what's going on. One of them held the tablet in portrait mode and the other in landscape mode, they ran through the installation process, and one of those orientations failed and caused the corruption. None of us would ever think that the orientation of a tablet during an installation could cause a failure. That's what I mean by unconscious incompetence: it would never enter your realm of existence. These are the things that can happen, and we really need to be diligent about them.

So I'm going to take the last few minutes before handing over to Judy and ask: there are a lot of smart people in the room, so what are some approaches you can think of to mitigate these insidious data issues? Don't be shy; if you don't want to come up to the mic, you can yell out and I'll try to repeat it. Any ideas?

All right, the response was to use more validated data sets: better data, cleaner data, vetted data. Okay, obviously higher quality data should reduce some of these problems. But what if you're using the MIMIC data set, the eICU data set, or some of the others? That's real-world data; it wasn't designed for research, so it will have issues in it.

Another suggestion: at the onset of the research, when you first sit down with the data, have the data scientists or the clinicians write out a list of assumptions about what's understood about the data. That's a nice idea; you can have an open dialogue about what's understood.

Anything else? Create visualizations of the data. Okay, and what would you do with those visualizations? How would they guide you? Right, you're looking for anomalies and odd patterns in the data. That exploratory data analysis step should always happen; I think it's a huge piece. I'll take it a step further: some things can escape visualization, so you have to do it, but beyond that, when you start looking at the relationships in your data, anything that looks odd should be investigated. If it doesn't match your clinical understanding, you should be asking why. There's no harm in that, because one of two things will happen: either you learn something new clinically that you didn't understand before, or, probably 90-plus percent of the time, you find an issue somewhere in the data that you didn't think was possible.

Here's another thing that happened to me. I was building a readmission model and found a 20-times increased risk of readmission in patients who had missing airway or ventilation status at the time of discharge. If I had just built a black-box model, my AUROC was great; but by investigating the odds ratios of all these variables independently, I found this bizarre pattern. Why is this? It can't be right. I looked at it and said, it can't be right, you've got to go over your code again. I went through all the code and could not find any defects.
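Here is a minimal sketch of that kind of univariate screen, assuming a hypothetical discharge-level table with a binary `readmitted` column and illustrative feature names; it checks whether the missingness of each field is strangely associated with the outcome, the pattern that flagged the airway-status issue:

```python
import pandas as pd

# Hypothetical discharge-level table with a binary readmission outcome.
# The idea: look at each feature's missingness and its crude association with
# the outcome before trusting a black-box model's overall performance.
df = pd.read_csv("icu_discharges.csv")
outcome = df["readmitted"].astype(bool)

def odds_ratio(flag: pd.Series, outcome: pd.Series) -> float:
    """Crude 2x2 odds ratio of the outcome for flag=True vs flag=False."""
    a = (flag & outcome).sum() + 0.5    # +0.5 Haldane correction avoids division by zero
    b = (flag & ~outcome).sum() + 0.5
    c = (~flag & outcome).sum() + 0.5
    d = (~flag & ~outcome).sum() + 0.5
    return (a / b) / (c / d)

# Screen the missingness of every feature: a huge odds ratio for
# "airway_status is missing" is exactly the kind of anomaly worth chasing
# back to the source system rather than publishing.
for col in df.columns.drop("readmitted"):
    or_missing = odds_ratio(df[col].isna(), outcome)
    if or_missing > 5 or or_missing < 0.2:
        print(f"{col}: odds ratio of readmission when missing = {or_missing:.1f}")
```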
Go back, talk to everybody: it can't be right, but keep digging, digging, digging. Finally, after weeks and weeks, I found it in the source system. When somebody was readmitted to the ICU, the software was engineered to take the most recent care plan data and move it into the new stay. The intent was to make things easier on the clinicians: if the data was less than 24 hours old, it would still be there and they would just have to revalidate it. But instead of copying the data, the system actually moved it out of the database record for the prior stay and into the new stay. So anybody readmitted within 24 hours had, by default, missing care plan data on the stay they had just left. How would you ever find that without being able to dig into the software? Sometimes it's not you, and sometimes it's not the data; sometimes it's something the nurse did three years ago at the bedside, or something that happened with the EHR five years ago.

I don't say all this to make everybody conclude that you just can't do any work. I'm trying to make the point that you really have to think carefully and look very critically at all of your data. You can still make good models; they don't have to be perfect to be useful. And a lot of the validation that we'll get into can really help make a difference. So I'll stop there. If there are any questions, we can take them now while we're transitioning, or at the end of Judy's talk.
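A hypothetical check in the same spirit, assuming the same illustrative `icu_stays.csv` layout as above, asks whether missing discharge care-plan data is concentrated in stays that were followed by a readmission within 24 hours:

```python
import pandas as pd

# Hypothetical check inspired by the care-plan story: is "missing airway /
# ventilation status at discharge" concentrated in stays followed by another
# ICU admission within 24 hours? Column names are illustrative.
stays = pd.read_csv("icu_stays.csv", parse_dates=["icu_in", "icu_out"])
stays = stays.sort_values(["patient_id", "icu_in"])

# Hours from this stay's discharge to the same patient's next ICU admission.
next_in = stays.groupby("patient_id")["icu_in"].shift(-1)
hours_to_next = (next_in - stays["icu_out"]).dt.total_seconds() / 3600

missing_status = stays["airway_status"].isna()   # care-plan field at discharge
readmit_24h = hours_to_next < 24                 # NaN (no next stay) compares as False

# If nearly every stay followed by a <24 h readmission is missing the field
# while other stays are not, suspect the source system, not the bedside staff.
print(pd.crosstab(readmit_24h, missing_status, normalize="index"))
```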
Video Summary
Omar Badawi, Chief of the Division of Data Sciences at the Telemedicine and Advanced Technology Research Center, discusses the importance of understanding data provenance in research. Drawing on his extensive experience, he emphasizes awareness of data's origin and its limitations. Badawi walks through the four stages of competence, stressing that in research we often don't know what we don't know, which leads to faulty assumptions. He illustrates scenarios in which data definitions can be misleading, underscoring the need to investigate data thoroughly rather than take field definitions at face value. He describes data provenance as a documentation trail and its role in validating research data, yet warns of common issues such as measurement error, network drops, replication errors, and reference data errors. He advocates practices such as using validated data sets, explicitly stating assumptions, and performing exploratory data analysis to improve data reliability and mitigate these issues.
Keywords
data provenance
research validation
data reliability
exploratory data analysis
measurement error