Methods For Minimizing Bias From Missing Data In Critical Care Research
Video Transcription
Welcome to today's webcast, Methods for Minimizing Bias from Missing Data in Critical Care Research. Today's webcast is brought to you by Discovery, the Critical Care Research Network at SCCM, in collaboration with the Clinical Pharmacology and Pharmacy Section. My name is Brooke Clark and I will be today's moderator. I am currently a Critical Care Clinical Pharmacy Specialist in the Neuromedicine ICU at UF Health Shands Hospital in Gainesville, Florida. I received my Doctor of Pharmacy from Jefferson College of Pharmacy in Philadelphia, Pennsylvania, and I completed a Critical Care Residency in Lexington, Kentucky, at University of Kentucky Healthcare. I have been involved with the SCCM CPP Section and the Item Writing Section Committee, and have served as a Visual Abstract Editor for Critical Care Medicine. I have no disclosures with regard to today's webcast. This webcast is being recorded. There is no CE associated with this educational program. Just a few housekeeping items before we get started: there will be a Q&A session at the conclusion of this presentation. To submit any questions throughout the presentation, please feel free to type them into the question box located on your control panel. And now I will gladly introduce today's speaker, Dr. Douglas Landsittel. Dr. Landsittel is a Professor and Department Chair of Epidemiology and Biostatistics at the School of Public Health, Indiana University Bloomington. He received his PhD in Biostatistics from the University of Pittsburgh in 1997 and has subsequently published over 150 papers and served as the principal investigator across numerous projects in occupational and clinical research, with an emphasis on biomarker studies and prognostic models, as well as evaluating comparative efficacy and causal effects of interventions and medical treatments. His previous positions have included roles as an Associate Professor or Interim Director for three different organizations, as well as Director of Biostatistics for Research for the Starzl Transplantation Institute. He currently leads a department of 20 faculty members and is the Director of two different data coordinating centers, for the SCCM Discovery Network and the Consortium for Radiologic Imaging Studies of Polycystic Kidney Disease. He also serves as a statistician on two different immunotherapy trials and a health systems intervention study. Thank you so much for joining us today, Dr. Landsittel. And now I'll turn the presentation over to you.

Okay, thank you very much. So the talk, as mentioned, is on methods for minimizing bias from missing data in critical care research. First, I'll just give a little outline for the webcast today. We're going to start with a very simple, very generic description of a surveillance study of COVID-19 clinical outcomes, just to kind of give some framework for how we might think of missing data. Then we'll talk about different missing data mechanisms — basically looking at why the data are missing. Then we'll talk about the approaches that we could use, either in terms of study design, deletion of observations that are missing, or some type of modeling or imputation approach, and I'll look at the implications of those different approaches under different assumptions about the missingness mechanisms. I'd also like to use this as an opportunity to really stress the importance of rigor, reproducibility, and transparency in thinking about missing data. And then I'll make some final general recommendations and stop for questions.
So I do want to preface this by saying this is going to be a talk about fairly basic, essential aspects of missing data, not going into a lot of detailed, sophisticated methods, but hopefully this will be appropriate for our audience today. So let me just frame this in terms of when missing data would come up. I'll just give this example: I work on a surveillance study of COVID-19 clinical outcomes. And so we might have patient characteristics, lab values, different interventions used, hospital stress, and clinical outcomes. In terms of the timing of data collection, we have patient characteristics usually at the time of admission. We have lab values and interventions that might happen daily — so you might measure different lab values daily or weekly, or in some longitudinal setting like that — and you could have interventions like mechanical ventilation that change from day to day. And then, oftentimes now, more and more we're interested in hospital stress, so let's say that we measure that weekly. And we have multiple sites with different variations on how they measure the data and how that might influence missingness. And we could have missingness at any step along the way. So that's sort of a setting that we might look at in critical care. And so let's just say, for the sake of argument, that we had the following levels of missingness, just to give us a broad idea of what the data might look like. Maybe in patient characteristics, let's say that most of our patient characteristics, like age and sex, were complete, but oftentimes we have a higher proportion of missingness in one variable like race, for instance. And let's say we had lab values and interventions that were sporadically missing, maybe at about 10% or 20%, and that might vary over time — so we might see more missing values when a patient is nearer to a worse outcome or in a more critical setting. And let's say that hospital stress was basically complete, but with these types of hierarchical measurements, measured at a different level of observation, what we might get is a high level of missingness that's just at one hospital. And let's say that there are variations across multiple sites, so different sites might have more or less missingness, and it might be in different variables. So that's basically the setting.

And so we can think about different types of missing data that we might run into, and let me just start with some of the very textbook definitions here. One way we could have missingness is missing completely at random. This means the probability of a value being missing for a given variable is totally unrelated to other factors. It's not changing over time, it's not more for some sites than others, it's not more for certain people that have certain characteristics; it's just some random occurrence that happens over time. This is really the easiest setting to deal with, and a lot of different things that we could do might work fine in this particular case. Another thing that we could have, instead of missing completely at random, is missing at random. That means there's some random mechanism behind the missingness, but it could vary by observed factors. So maybe in the case that I just gave with hospital stress data, we might have a high percentage of missing data at a single institution. So the probability of missingness is driven by the institution that you're at.
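To make the distinction between these first two mechanisms concrete, here is a minimal simulation sketch — not from the webcast; the variables site and lab and all of the numbers are made up — showing that a complete-case mean holds up under missing completely at random but drifts when missingness depends on an observed factor like site:

```python
# Sketch: MCAR vs. MAR missingness in a toy lab value that differs by site.
import numpy as np
import pandas as pd

rng = np.random.default_rng(2021)
n = 5000
site = rng.integers(0, 2, size=n)               # two hospitals, an observed factor
lab = rng.normal(loc=10 + 5 * site, scale=2)    # the lab runs higher at site 1
df = pd.DataFrame({"site": site, "lab": lab})

# MCAR: every value has the same 20% chance of being missing
mcar = df["lab"].where(rng.random(n) > 0.20)

# MAR: missingness depends on the observed site (40% at site 1, 5% at site 0)
p_miss = np.where(df["site"] == 1, 0.40, 0.05)
mar = df["lab"].where(rng.random(n) > p_miss)

print("true mean:          ", round(df["lab"].mean(), 2))
print("complete-case, MCAR:", round(mcar.mean(), 2))   # close to the truth
print("complete-case, MAR: ", round(mar.mean(), 2))    # pulled down toward site 0
```

Under the missing-at-random scenario, averaging within each site and then combining the site means would recover the truth, which is the sense in which missing at random can still be handled with fairly standard statistics once the observed driver of the missingness is modeled.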
Or maybe over time there are more missing lab values as the person is in the hospital longer, or if they're in certain units or get moved to certain units. So we have the probability of missingness varying over time, but that's an observed factor. And that setting also we can handle with fairly standard statistics. It's a little bit more complicated, but we can model and try to predict what the value would have been based on the factors that affect the missingness. So if, let's say, people that come in with a certain underlying condition have a higher chance of worse lab values, we can use that to predict the missing values. Now, the worst case is missing not at random. There is some non-constant probability of missingness even after accounting for other observed factors. Trying to come up with what the value would have been if it was not missing is more difficult, because we actually can't really specify what that underlying mechanism is. So I will say that in most of the applications that we normally have in practice, we usually assume that it's missing completely at random, or, more often, missing at random is the usual assumption that we can deal with. And we just keep our fingers crossed that it's not missing not at random. I will mention some more complex methods that are often pointed toward for missing not at random, but they may not actually specifically handle missing not at random. It's just that, because they use a more complex function of the data we do have, they may capture more of that missingness. So even though they're not using the unobserved data, they might get closer to the correct prediction. But there really aren't a lot of good ways to handle missing not at random. So these are our textbook definitions. And by the way, I think I have a small enough set of slides that we'll have plenty of time for questions at the end, so I just encourage you to jot down ideas that you have so we can try to have some discussion at the end.

So let me go back to our example. Let's say, what about race? We said that it's missing about 30% of the time. Many times this tends to be systematic — so missing not at random, or maybe missing at random based on observed factors. But maybe, for instance, for a person who has a certain ethnicity that is not easily guessed by the observer, they may just leave that missing, whereas maybe for someone that's Caucasian, they can look at them and just, without asking, write that down. So we might be able to predict something about the people that are missing, even though it seems to be systematic. In terms of lab values and interventions, just as a for instance, maybe they're captured less as the person has worse disease — so if it's a more intense setting, it might be harder to get those lab values. Or maybe if they're in worse condition, they're actually taking more labs, so they might get them more often, and maybe the lab values are measured less early on, when the person is in better shape or not yet in a critical situation. I'm not a clinician, so I won't claim that these are necessarily accurate scenarios; I'm just trying to throw out some possible scenarios you might have. Site differences are often systematic. So as I mentioned in our example, maybe stress is missing half the time at one given site and it's almost always complete at the other. So we could try to use that information to try to predict the missing value.
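In practice, a quick descriptive pass along these lines can show whether missingness tracks observed factors such as site or day in the hospital; this is only a sketch, and the data frame and the columns site, day, race, and lactate are hypothetical stand-ins:

```python
# Sketch: tabulate how much is missing, overall and by observed factors.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 300
race = rng.choice(["White", "Black", "Other"], n, p=[0.6, 0.25, 0.15]).astype(object)
race[rng.random(n) < 0.30] = None                    # ~30% missing race
df = pd.DataFrame({
    "site": rng.integers(1, 4, n),                   # three hospitals
    "day": rng.integers(1, 15, n),                   # day in hospital
    "race": race,
    "lactate": np.where(rng.random(n) < 0.15, np.nan, rng.normal(2, 1, n)),
})

# Overall percent missing per variable
print(df.isna().mean().mul(100).round(1))

def missingness_by(data: pd.DataFrame, factor: str) -> pd.DataFrame:
    """Percent missing in every other column, broken out by one observed factor."""
    return (data.drop(columns=[factor])
                .isna()
                .groupby(data[factor])
                .mean()
                .mul(100)
                .round(1))

# Does the missingness track the site, or the day in the hospital?
print(missingness_by(df, "site"))
print(missingness_by(df, "day"))
```

If the missing percentages differ sharply by site or drift with day, that is evidence against missing completely at random and a hint about which observed factors belong in an imputation model.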
So we could look at stress at that particular site where it's missing and try to use that information. Even though it's systematic, we might have some idea of what's happening there and how to predict that. And then differences over time also might be systematic. So maybe with lab values, we could look at some trend in what's happening with the lab values to predict those that are missing. That's often a useful tool in dealing with missing data.

So what can we do about missing data? I have a set of six slides where I'll just kind of talk through different sets of approaches, again sticking to some very main ideas to give you a feel for the general approaches. One thing we can do, which I think is always the best thing to do, is avoid missing data. So we might have something where you can use some automated data extraction, so it doesn't depend on people inputting the value, typing it in. Maybe there are other ways to operationalize the collection of data so that it's part of regular clinical practice, where you're basically avoiding the opportunity for scenarios that lead to missing data. Another thing with avoiding missing data is objective and consistent coding. So if, for instance, we have something in REDCap where you can either check the first box or check the second box, you're a lot less likely to get missing data than if it's a hand-entered form and there's just a blank space. Because what happens is people might leave it blank, or they might write it in differently or use different phrasings, and things get incorrectly translated. There's just more opportunity for missing data. So web-based data collection through something like REDCap is also much more ideal than typing it into Excel. If you're using Excel and you want your coordinators to just punch in data as you're entering it for a study, it's very easy to type it into the wrong column and have errors, and one of those errors could be missing data. Another is to shorten instruments. Many times people get tired of long instruments, longer surveys, so we tend to get more missing data. I know that's definitely the case for myself: as I'm doing any kind of survey, I get about five minutes into it and I kind of get tired of it and either close it out or go much quicker. Also, you may want to pilot test any kind of data collection instrument and see where you're getting missing data or running into other kinds of inaccuracies. And then you can go back later and think about ways you could get the data that appears to be missing. That could be time intensive, but maybe it means going back to the actual data extraction, et cetera. And so here I think is where we could go back to the phrase, prevention is the best medicine. Avoiding missing data is always the best way to deal with missing data, so try to prevent it through those kinds of strategies.

So the second slide I have is the obvious thing: you could just delete the values. We might do this in a couple of different ways. One might be to keep only complete cases. Here you might say, well, I have a hundred observations, and 85 of those do not have any missing data — this person's missing a value for race, this one's missing a couple of lab values — so let's just keep the complete cases, and then that makes everything straightforward. So what are the implications there? Well, if it's missing completely at random — that is, there's no systematic reason why a given observation is missing versus another — it's unbiased, okay, using complete cases.
But you do lose power or precision. So if you're just deleting every observation that isn't a complete case, you have a lower sample size than if you use other methods, but you're not introducing a systematic error if it's missing completely at random. Now, of course, rarely in practice do we get missing completely at random. Missing at random would at least be a little bit more realistic. So in this case, if it's missing at random — and when I say unbiased, I mean any subsequent statistics that you run: calculating the mean, calculating the variance, running a regression model, doing a t-test on some measurement between groups — those results are unbiased if you model the variables explaining the missingness. That is, you set up some model that predicts the value when it was missing, you know which variables actually affect the missingness mechanism, and you use those in your prediction. Actually, I'm sorry, I think I had that under the wrong setting here; that point belongs in the next setting, so ignore it for now. For complete cases under missing at random, this would be unbiased, but with less power and precision. Now, if it's missing not at random, your result here with complete cases is that you get biased results, because if there's something affecting that missingness mechanism, that's likely to mean that the people with complete cases are different from those without complete data.

Another thing you could do, rather than reducing everything to the complete cases, is to delete observations from a given statistical analysis only if they are missing the variables used in that analysis. So maybe you have one row of data where race is missing. You keep that row in your data set; it's just that when you run some analysis that uses race, it gets dropped out. And this is exactly what your software is doing. You don't have to tell it that, but it's going to drop the observation if you try to do some statistic that uses that particular variable. And so one thing I'll mention here is to be careful with comparing models. Because if you run a couple of regression models and you try to use some statistic like what's called a likelihood ratio test or a partial F test, or compare in some way a statistic for one model that has a certain variable against another model that doesn't, you'll actually get different sample sizes in fitting the models. So you have to be careful about that. This does lead to more precision than complete cases, and it's comparable bias to the complete case analysis. You're basically doing a similar thing there, just with more precision.

So what else can we do with missing data? You might say, instead of just deleting the observation, we can impute a value. And there are different ways that we can impute data for missingness. There are many approaches for single imputation. Single imputation just means that you're going to come up with a value — you have some mechanism by which to do that — that you think stands in for the missing value, and you're just going to use that single value instead; you're going to impute that single value. So how might you do that? You could fit some kind of regression model where you use the data you have to predict the values you don't have.
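As a concrete sketch of the two deletion strategies described above — keeping only complete cases versus the "available case" (pairwise) deletion that most software does silently — here is a minimal example; the columns age, sofa, and lactate and all of the numbers are hypothetical:

```python
# Sketch: complete-case deletion vs. available-case ("pairwise") deletion.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 1000
df = pd.DataFrame({
    "age": rng.normal(60, 15, n),
    "sofa": rng.integers(0, 20, n).astype(float),
    "lactate": rng.normal(2.0, 1.0, n),
})
df.loc[rng.random(n) < 0.10, "sofa"] = np.nan      # ~10% missing
df.loc[rng.random(n) < 0.30, "lactate"] = np.nan   # ~30% missing

# Complete-case analysis: one smaller data set, every row fully observed
complete = df.dropna()

# Available-case analysis: each analysis keeps whatever rows have the
# variables that particular analysis needs, so the N changes per analysis
n_model_1 = df[["age", "sofa"]].dropna().shape[0]             # age + sofa
n_model_2 = df[["age", "sofa", "lactate"]].dropna().shape[0]  # adds lactate
print(len(df), len(complete), n_model_1, n_model_2)

# Caveat from the talk: model 1 and model 2 end up fit on different rows, so
# a likelihood ratio or partial F test comparing them is not valid as-is;
# refit both models on the same subset before comparing.
```

The point of the printout is simply that the effective sample size differs between the complete-case data set and each available-case analysis, which is where the model-comparison caveat comes from.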
So if race is missing in a bunch of the observations, what you could do is set up a regression model where race is the outcome and then use the other variables as the predictors. And so then you get some prediction, and maybe you predict the probability of it being, let's say, white versus another race. And, as I go on to say here, if it's missing completely at random, then you're unbiased and you have more precision than just deleting. If it's missing at random, you can still get an unbiased estimate and more precision if you're using the right variables to explain the missingness. So again, remember, missing completely at random just means it's a totally random scattering of missing values. Missing at random means that there's some structure to the missingness, but you can account for that with the variables you observe. Now, if it's missing not at random, you still have a biased result.

We could also impute the value using just the mean from the non-missing data. So instead of using a regression model, we could just say, you know what, I'm going to take the mean of all the data for race. Now, for race, it's a little tricky since here it's a yes-no variable, so let's say it's some lab value. What I might do is just use all the observations that are not missing for that lab value, take the average, and impute that mean for the missing values. That's less precise than the predicted value imputation, and of course it gives you a setting where you have a bunch of the same value filled in for different observations. Another thing you could do is impute the baseline value or the last value. This is sometimes called carrying the last value forward, and it is probably a more biased approach unless you're leveraging some knowledge about the problem. So for instance, when I do research with kidney disease and we're looking at estimated GFR over time, maybe I know that once a person's GFR is very low, even if I didn't observe it, it's not going to jump back up. So maybe if it's low enough, it's okay to take the last value carried forward. Or maybe it's okay, early in the study, to use the baseline value for now if GFR has not started dropping. So it might be okay if you have some knowledge about the situation, but in general, this is considered a worse approach. So in general, I would say that using a prediction from the non-missing values gives you the most precise estimate.

Now, another thing we could do is impute many different values for the missing data. This is called multiple imputation, and there are many different approaches for multiple imputation. It's the same idea as single imputation, where we're using the non-missing data to in some way predict or come up with a number for the missing data. The difference here is that you repeat the imputation thousands of times. So what you're doing is creating thousands of different data sets where you do the imputation process on each of them. Now, the reason to do this is that it correctly estimates the variability. With all single imputation methods, when you then go to do subsequent statistics — you impute the values and then go do a t-test, or you impute the values and then run a regression to get the effect of the intervention on the outcome — all of the statistics you do do not take into account the fact that you just imputed a value that has variability. It wasn't necessarily exactly that value; it might have been a little more, might have been a little less, and you might be off by a lot.
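Here is a minimal sketch of the single-imputation flavors just described — mean fill, last value carried forward within a patient, and a regression prediction from observed covariates — on a made-up longitudinal lab value; the column names and the scikit-learn model choice are illustrative assumptions, not the speaker's own code:

```python
# Sketch: three single-imputation options on a toy patient-day lab value.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
df = pd.DataFrame({
    "patient": np.repeat(np.arange(50), 7),
    "day": np.tile(np.arange(1, 8), 50),
    "age": np.repeat(rng.normal(60, 12, 50), 7),
})
df["lab"] = 2 + 0.03 * df["age"] + 0.4 * df["day"] + rng.normal(0, 1, len(df))
df.loc[rng.random(len(df)) < 0.20, "lab"] = np.nan   # ~20% missing

# 1) Mean imputation: every hole gets the overall observed mean
df["lab_mean"] = df["lab"].fillna(df["lab"].mean())

# 2) Last value carried forward, within each patient only
df["lab_locf"] = df.groupby("patient")["lab"].ffill()

# 3) Regression prediction: model the lab from observed covariates, then
#    predict the holes (single imputation, so prediction error is ignored)
obs = df[df["lab"].notna()]
model = LinearRegression().fit(obs[["age", "day"]], obs["lab"])
df["lab_reg"] = df["lab"]
holes = df["lab"].isna()
df.loc[holes, "lab_reg"] = model.predict(df.loc[holes, ["age", "day"]])
```

Note that lab_locf stays missing wherever a patient's very first value was missing, and all three columns pretend the filled-in numbers are known exactly — which is the limitation multiple imputation addresses next.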
So the idea of multiple imputation is that you get a distribution of values for the missing data, and then you run the subsequent statistics across those many different data sets to get the variability in your subsequent statistics. And this is generally strongly preferred over single imputation. In theory, there's no reason to do single imputation, other than that multiple imputation can be more time consuming and you have to have the software that runs it. It's not just something that you usually put in a single command — or maybe you put in a command that does lots of things in the background — but there aren't always multiple imputation commands available for every type of analysis you could do. So this is held back a little bit by practical constraints, but there are a lot of tools out there for multiple imputation. It's still biased under missing not at random, though.

So my second-to-last slide on what we can do about missing data is that we can flag the missing data as a separate variable. You might say, well, why would you do that? The idea is that you use both the observed and the imputed data in your subsequent statistics. So you impute the data, but you also retain a flag for each of the variables saying whether the value was actually observed or not. And that way, if there are some systematic differences — let me just go down here to reveal the text — we have a separate indicator as to whether it was imputed or observed for each variable. The idea is that we're capturing the unobserved effects associated with missingness. So maybe the people that were missing always had worse outcomes, or maybe they were somehow systematically different in that the person collecting the data was like, oh, OK, this is fine, nothing's changed — so maybe they were better. So you're trying to capture those unobserved effects. And there are some papers out there, and I have the references at the end, that suggest you might get more accurate — unbiased and precise — estimates in your subsequent statistics. So when you actually go to estimate whether treatment has a significant effect on the outcome, by using this kind of model you can get a more unbiased and precise estimate of your treatment effect. So under missing completely at random, it's comparable to imputation alone. Under missing at random, it's also comparable to imputation. But it could potentially reduce the bias under missing not at random. And it's comparable bias to complete case analysis.

All right, what can we do about missing data? The last slide I have is for missing not at random. Now, I say here we may consider fancier statistics. So multivariate imputation by chained equations, or MICE imputation, was something, honestly, I was not familiar with until recently. I had a PhD student who did her PhD on a machine learning pipeline to predict treatment response to antidepressants. She used MICE imputation in hers and found that it seemed to have some better properties, and that's what the literature seems to say. So I think it may be kind of taking over. Again, we're using different kinds of more complicated regression methods to predict the missing values. So this may be something that's gaining popularity. Another is something called hot deck imputation, which uses a nearest-neighbor matching type of procedure to do imputation.
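Going back to the missing-indicator idea described above, here is a rough sketch — with hypothetical column names, and a simple mean fill standing in for whatever imputation is actually used — of retaining a per-variable flag alongside the imputed values so the flags can be carried into the downstream model:

```python
# Sketch: impute, but keep a 0/1 flag recording which values were imputed.
import numpy as np
import pandas as pd

def add_missing_flags(df: pd.DataFrame, cols: list[str]) -> pd.DataFrame:
    out = df.copy()
    for c in cols:
        out[f"{c}_was_missing"] = out[c].isna().astype(int)  # 1 = value was imputed
        out[c] = out[c].fillna(out[c].mean())                # any imputation could go here
    return out

rng = np.random.default_rng(4)
df = pd.DataFrame({
    "lactate": rng.normal(2, 1, 200),
    "sofa": rng.integers(0, 20, 200).astype(float),
})
df.loc[rng.random(200) < 0.25, "lactate"] = np.nan

analysis_df = add_missing_flags(df, ["lactate", "sofa"])
# A downstream regression would then include lactate, sofa, lactate_was_missing,
# and sofa_was_missing, letting the flags absorb systematic differences between
# observed and imputed rows.
```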
And there are a variety of Bayesian methods also. Bayesian methods, in case you're not familiar with them — I'll just give you the very simple, two-minute explanation — put some prior distribution on your quantity of interest. So you could use the information you know or suspect about the missingness and put some distribution on that. You use the data to update that distribution, and you get a posterior distribution. And then if you need an estimate, you can sample from that posterior distribution to get the estimate for something. So there are a number of Bayesian methods proposed in the literature for missing data. Now, these — especially the first two — are not really directly getting at missing not at random. If you look in the papers, the first thing they say is that they assume it's the observed data that drives the missingness, that is, missing at random. But since you're using a more complicated function of the data you do have, maybe the thought is that this could better get at missing not at random, by using the data you have in some more complex relationship. Also, looking at a flag for the missing data could help to address missing not at random.

So let me go back to our example and just say a few summarizing comments, and then I'll say a little about rigor and reproducibility, and we'll end up leaving a lot of time for questions, I think. Back to our example: let's say you had missing race. That's going to be really difficult to impute. You might know something about different races having different characteristics in terms of other demographics, so maybe you could impute that with some kind of regression function and use either single or, preferably, multiple imputation. So again, if it's missing at random, that means the missingness depends on data you have observed. For lab values and interventions, you may have non-missing data and trends over time that can be used to predict imputed values. So you might look at the trajectories of the non-missing labs over time in different groups of patients and then, using the non-missing values, try to predict which group a patient was in and where it fell over time — just try to get a prediction from that trajectory and impute the value. And again, preferably with multiple imputation versus single imputation. And then I mentioned there are some cases where some variation of last value carried forward may work for a missing value. I had mentioned that, for instance, if you have estimated glomerular filtration rate and it's dropped down to 12 or something, we may take that last value carried forward as our estimate, especially if we know that they haven't gone on dialysis or something like that. However, if you have an eGFR of 90 and then a long period of time before the next one, or a lot of chance for variability, and the next non-missing value you have is 60, it's kind of hard to estimate where you are, other than saying that maybe, again, we can look at some trajectory over time. So site and time effects might be useful in imputation. So it's really important with missing data that it's not just a matter of, well, let me pull out my fanciest statistical approach and impute values. It's important to use your knowledge of the clinical problem to think about the model for imputation.
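To make the multiple imputation and MICE discussion above more concrete, here is a rough sketch of a MICE-style workflow with pooling. It is an illustration, not the speaker's pipeline: it uses scikit-learn's experimental IterativeImputer with posterior sampling to create several completed data sets, fits the same made-up regression on each with statsmodels, and pools one coefficient by hand with Rubin's rules.

```python
# Sketch: MICE-style multiple imputation plus Rubin's-rules pooling by hand.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(5)
n = 500
age = rng.normal(60, 12, n)
sofa = np.clip(rng.normal(8 + 0.05 * (age - 60), 3, n), 0, 24)
los = 5 + 0.4 * sofa + 0.02 * age + rng.normal(0, 2, n)
df = pd.DataFrame({"age": age, "sofa": sofa, "los": los})
df.loc[rng.random(n) < 0.25, "sofa"] = np.nan   # ~25% missing SOFA

m = 20
coefs, variances = [], []
for i in range(m):
    imputer = IterativeImputer(sample_posterior=True, random_state=i)
    completed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
    X = sm.add_constant(completed[["age", "sofa"]])
    fit = sm.OLS(completed["los"], X).fit()
    coefs.append(fit.params["sofa"])
    variances.append(fit.bse["sofa"] ** 2)

coefs, variances = np.array(coefs), np.array(variances)
pooled = coefs.mean()                                # pooled point estimate
within = variances.mean()                            # average within-imputation variance
between = coefs.var(ddof=1)                          # between-imputation variance
total_se = np.sqrt(within + (1 + 1 / m) * between)   # Rubin's total variance
print(f"pooled SOFA effect {pooled:.3f} (SE {total_se:.3f})")
```

With m imputations, the pooled point estimate is just the average across imputations, and the pooled variance adds the average within-imputation variance to (1 + 1/m) times the between-imputation variance — that between-imputation piece is exactly the extra uncertainty that single imputation ignores.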
A former branch chief of mine from when I was in the government — he was giving a statistical talk and sort of tutorial for the investigators there, and it wasn't on missing data — used to always say, don't leave your brains at the door. And people kind of took offense to that, because they were like, what are you telling me, that I'm leaving my brains at the door? But what he meant is that you don't want to set aside your knowledge of the science and the clinical problem. You want to try to use that in the statistics. And that's really true with missing data. I don't consider myself any major expert on missing data, other than kind of knowing some of the general things we could do, but many times when we get into this discussion of what we could do, the investigators might be surprised that I'm actually asking a whole lot more questions than I am giving answers, to hopefully get us to a reasonable answer at the end. And certainly favor prediction-based multiple imputation over other methods — single imputation, mean value imputation, or even last value carried forward. So we might use a prediction with bounds on it, set up by your knowledge of the clinical problem. You may know that a certain value is not going to be negative, or that it's not going to change this much over time, so you could incorporate that into the prediction. Again, using your subject matter knowledge there is really important.

So let me go back to something I mentioned: I want to think about rigor, reproducibility, and transparency whenever you're dealing with missing data. Rigor, reproducibility, and transparency are always important to consider, and I think they don't get mentioned a lot with missing data, but I think they're really important here. So what do I mean by rigor? By rigor, I just mean using the most comprehensive valid approach. It's not necessarily the fanciest, but a comprehensive and valid approach. So describe the data thoroughly. What's the distribution of the data for those that are non-missing — you don't know what the values are for the missing ones — and what is it for those that are non-missing, by group? What does the data look like over time? What does the other, non-missing data look like for those that have either missing or non-missing values for a given variable? So if you're missing race, what does all the other data look like versus the cases where race is measured? And if that's different, that tells you something about the missingness and the mechanism. Try to use that information to characterize mechanisms and then conduct the optimal analysis based on that.

Reproducibility is really important also with missing data. Especially when you're looking at multiple imputation, where you do something 1,000 times and then your software does some procedure to look across a distribution of estimates to get an overall estimate and characterize variability, there could be lots of opportunities for doing slightly different things that lead to slightly different answers across a number of steps. And even doing something like checking the random seeds: if you're doing something that uses a random mechanism in what you're fitting, whether it's some kind of machine learning algorithm or a multiple imputation that randomly samples the data 1,000 times, I always like to set the random seed at the beginning so I can reproduce the results exactly. Otherwise, you never know.
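A tiny sketch of the seed habit just described, assuming NumPy-based tooling: build a seed from the current date and time, log it with the results, and pass it to every step that uses randomness so the whole pipeline can be rerun exactly.

```python
# Sketch: derive a reproducibility seed from the date and time, and record it.
import datetime
import numpy as np

now = datetime.datetime.now()
seed = int(now.strftime("%Y%m%d%H%M%S")) % (2**32 - 1)  # keep it in a safe range
print("random seed used for this analysis:", seed)       # log this with the output

rng = np.random.default_rng(seed)
# ...pass `seed` (or `rng`) into every imputation, resampling, or machine
# learning step so the entire analysis can be reproduced exactly later.
```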
You might get very unlucky and happen to have run it once and gotten a very fortuitous answer, or an answer that looks kind of bad. And then you're in the situation of, well, what if I try to reproduce it and I can't and it looks very different? What do I do? Do I keep running it a bunch of times, which looks like I'm trying to get the best answer? So I always like to set a random seed. I usually look at the hour, minute, and second, and maybe add the date, and put that in as the random seed in the software.

Also, transparency. Document the rationale for why you suspect a given mechanism for the missingness, and document all the steps that you're using. Many times, as you can probably pick up here, when we're trying to deal with missing data there are a lot of different decision points. There's a lot of descriptive analysis you might use. There are decisions in terms of, oh, I think that is missing at random; I don't think it is; let me try this; maybe we decide that wasn't really right and then we try something else. So you want to have transparency in what you do. Certainly, this requires some subjective judgment, so documenting that rationale at all steps is really important. The issue of rigor, reproducibility, and transparency is really not specific to missing data, but I like to mention it in any analysis when I have the opportunity, because it's extremely important.

So, some final points and then I'll leave it open for questions. Oh, this is perfect — I was saying I'd hopefully leave about 15 minutes for questions. So the biggest thing you could do, again: prevention is the best medicine. Since I'm probably talking to a number of physicians, hopefully that rings true. We want to avoid missing data through the design. That will always save you time later and always lead to the most precise and most unbiased results. When you do have missing data that you can't avoid, think about the mechanism of missingness and what's generating it. I really like to favor prediction of the missing values rather than just mean imputation, single-value imputation, or any kind of last value carried forward. And I also prefer that over deletion. Although I will say, if your percentage of missing is low — let's say you have a data set with 1,000 people and you're only dropping 80 out of 1,000 for the analysis — it's probably not going to matter a whole lot what you do. So that type of approach — it's often referred to as pairwise analysis, just dropping the cases that are missing a variable used in the particular analysis — is probably a reasonable thing to do. And I do that more often than I maybe should; I should try prediction and multiple imputation more. Favoring multiple imputation is recommended to incorporate the variability of your estimates. Again, if it's a small percentage, it's not going to matter that much, but this is something that could be important if you have a lot of missing data across a number of variables. Consider a missing indicator. This creates a little bit of what I'll — using my vast technical terms — call a clunky analysis, where you're carrying all these other values into the analysis as indicators to flag whether each of the variables is missing. But if your real focus is on some type of overall treatment effect, that might not be such a bad thing to do in the analysis. It may not make it that much harder or more difficult to interpret. Always evaluate your assumptions. It's not a bad thing, as long as you're transparent, to do a number of different things and see if they agree.
Again, you want to be transparent. You don't want to just do a bunch of different things and then pick the answer that suits you. You want to be transparent about what you tried, what you got, how it varied, and what you ended up doing in the end. Always strive for rigor, reproducibility, and transparency in missing data analysis and anything else. So these are some of the references I had. And with that, I'll just say thank you for the opportunity to present here today. I hope this was useful just in terms of getting you to think about missing data, what some of the issues are, and what the implications of the different approaches you can take are. And again, if you want one take-home point, it's really to try to avoid missingness through design. And with that, I'll pause for questions.

Thank you, Dr. Landsittel. I'll leave the floor open to questions if you want to type them into your question box at the bottom of your control panel. A common question that may come up, especially when there's missing data in observational research, is how much is too much when it comes to missing data? Is there any information you can give the audience about what they should consider with regard to thresholds of missing data?

Yeah, so that's a great question. That's maybe the most common question, and maybe I should incorporate it into the slides, other than the fact that I'm not sure I have a great answer for it. Certainly, a super low percentage, 1%, is not a problem. 50% is too much. How am I saying that? Well, I just think if you have half of the data missing, it's very hard to say what the mechanisms are for generating it, unless you know something that tells you, oh, well, the reason it's missing is because — I don't know — maybe every five days, the way we're collecting the data, our emergency room or ICU is interrupted and we always miss it on that day, or we miss the labs at those times. And it's for unrelated reasons, and we know they're unrelated. If you have some knowledge to know that it's unrelated, in theory you're only losing precision, if it's unrelated and it's missing completely at random. Or if it's missing at random and you can try to use the observed data to predict the unobserved, in theory there's not really too large of an amount, because you're only losing precision. So you're losing sample size. You'll get less significant results. You'll get wider confidence intervals. But in theory, that's not really affecting the bias of the result. The problem is, if you have a lot of missing data, I just have a hard time — unless there's some really clear reason why that's the case — accepting that, oh, we know the other 50% of the data was just like the data we have. And the more you have missing, the harder I think it is to argue that it's at random. And if it's not at random, it will bias your results. So even if you have 1% that's missing and you say, yeah, those were the worst-off people we were missing, well, maybe you could use that information and do a sensitivity analysis where you impute or predict really bad lab values for those people, because you know that they were missing because of some really poor condition. So you could do some kind of sensitivity analysis. So oftentimes I've heard 20% thrown out — like over 20% missing is just too much. And to be honest, it depends on the discipline. With survey research — not we, I don't do too much survey research — but in survey research, they get 50% missing all the time.
And they just give that as a caveat. If we're doing tight surveillance of labs and things, we might be really concerned if we get over 20%. So there's not really a cutoff. That's how I tend to think of it, and you can do sensitivity analysis to assess this. But in my experience, when you do some sensitivity analysis, if you're around 10% or 20%, you have to have a fairly drastic kind of bias in the data to affect your answer a lot. Now, if the result you're really interested in is the treatment effectiveness, and the result you get is borderline significant, then even a small amount of missing data could take you from borderline significant to borderline non-significant. So bottom line, if someone said to me, you're going to get fired from your position there at IU if you don't give me a number, I'd probably give 20% — just somewhere reasonable in terms of not a huge amount of missing data, where you probably could have a number of realistic scenarios in which it doesn't change the results with sensitivity analysis. I would say probably 40% or 50% is definitely too much in critical care settings. What's the right answer in between? I have no idea, and I don't think anyone else does. So sorry, that was a very long-winded answer to a very concise question. My apologies for that.

Oh, that was great. I appreciate the extensive response. That's very helpful. You also brought up a good point with regard to missing data, especially in the critical care population. So I'll just take one more question and ask: when it comes to deciding between doing multiple imputation versus just using cases that have complete data — I guess when it comes to just using cases with complete data, as you mentioned, those with missing data may be the most critically ill patients, whose outcomes we're interested in. So would your preference be toward using complete cases, or utilizing some of these strategies to fill in some of the missing data?

So oftentimes, I'll be very honest, in the studies that I'm on and the data coordinating centers that I'm on, including work with SCCM and Marty Gonzalez's great work with databases and others at SCCM, the emphasis is really on trying to reduce missing data in the data collection and study design phase. And then, when we do have the data, trying to look at, oh, do we have large percentages of missing? And if we do, can we go back and try to get the data? So I'll be honest, it's really pretty rare that I'm left in a situation where I'm having to use a lot of these more complicated missing data imputation methods. As I mentioned, my student did, using strictly EHR results, and I think that might come into play more as we go toward using the EHR extensively. So with deletion, I would say it's still better to do pairwise deletion or variable-specific deletion. You're only deleting observations that are missing age, if age is in the model, as opposed to keeping only the complete cases that are non-missing for every variable. You just have to be careful about doing nested model comparisons.

Yeah, excellent point. Like you said, the key to the cure here is prevention, so I totally agree with that. Well, that concludes our Q&A session here. Thank you again to Dr. Landsittel for his great presentation today, and thank you to the audience for attending. There's no CE associated with this educational program. And that concludes our presentation for today. Thank you. All right. Thank you, everyone.
Video Summary
In this webcast, Dr. Douglas Landsittel discusses methods for minimizing bias from missing data in critical care research. He emphasizes the importance of rigor, reproducibility, and transparency when dealing with missing data. Dr. Landsittel suggests strategies to avoid missing data, such as using automated data extraction, consistent coding, and shortening instruments. When missing data cannot be avoided, he recommends using prediction or multiple imputation methods. These approaches involve using observed data to predict missing values or creating multiple imputed datasets to estimate the variability of results. Dr. Landsittel also discusses different types of missing data mechanisms, including missing completely at random, missing at random, and missing not at random. He highlights the implications and limitations of each approach under different missing data mechanisms. Finally, he stresses the importance of considering the amount of missing data and how it may affect the results, as well as being transparent about the assumptions and methods used in the analysis. Overall, Dr. Landsittel provides valuable insights and practical recommendations for addressing missing data in critical care research.
Asset Subtitle
Research, 2021
Asset Caption
Missing data is a common, yet often overlooked, source of bias that plagues both observational and randomized studies. Failure to appropriately account for missing data can create selection bias that may lead to spurious, unreplicable findings. This webcast from Discovery, the Critical Care Research Network, will describe the biases that can be created by missing data and provide an overview of effective methodology for reducing missing data bias.
Meta Tags
Content Type: Webcast
Knowledge Area: Research
Knowledge Level: Intermediate, Advanced
Membership Level: Select, Professional, Associate
Tag: Research
Year: 2021
Keywords
missing data
bias
critical care research
rigor
reproducibility
transparency
prediction methods
multiple imputation
missing data mechanisms
Society of Critical Care Medicine