Using Propensity Scores for Causal Inference in Critical Care
Video Transcription
Hello, and welcome to today's webcast, Using Propensity Scores for Causal Inference in Critical Care. My name is Kaitlin Tinn-Lohaus. I am a critical care nurse practitioner at Emory University in Atlanta, Georgia, and I will be moderating today's webcast. A recording of this webcast will be available within five to seven business days. Log into mysccm.org and navigate to the My Learning tab to access the recording. A few housekeeping items before we get started: there will be a Q&A at the end of the presentation. To submit questions throughout the presentation, type into the question box located on your control panel. Please note the disclaimer stating that the content to follow is for educational purposes only. And now I'd like to introduce our speaker for today. Dr. Todd Miano is an Assistant Professor of Epidemiology at the Perelman School of Medicine at the University of Pennsylvania in Philadelphia. And now I'll turn things over to our presenter.

Thanks so much, Kaitlin, and welcome, everyone, to today's talk. Using propensity scores for data analysis is one of my favorite topics, and questions often arise during the study design phase: I'm worried about confounding; what methods should I use, and what is the role of propensity scores? Hopefully, by the end of today's talk, we'll have a little more insight into how to make those decisions.

To get started, we are going to think about causation. How do we determine cause and effect? This is a challenge that we face every day as clinicians and researchers. I have a patient admitted to the ICU with septic shock. I resuscitate them with a particular combination of fluids and vasopressors, and the patient has a good outcome. How do I then go about determining whether that specific intervention had an effect, whether it was the reason that the patient had a good or bad outcome? It turns out this is a really challenging problem. It's often the case that patients come into the ICU and get better, and we do lots of different interventions, and we often have no idea why the patient gets better. Oftentimes patients get better perhaps in spite of all the interventions that we make.

So how do we go about establishing cause and effect? For the individual person, it's very challenging, and we frame this challenge in terms of counterfactual contrasts. For a given individual, to understand whether an intervention had an effect on the outcome, we would need to apply the intervention, observe the patient's outcome, then go back in time, withhold the intervention, observe the patient's outcome again, and contrast those two counterfactual realities. Obviously we can't do this; time travel has not yet been invented. So at the individual level, establishing causality is extremely challenging. What we can do, however, is make comparisons between groups, and this is how we attempt to make causal inferences. The science of causal inference is all about understanding the assumptions and methods required to link contrasts of groups back to causal effects for the group and for the underlying individual patients. When we do studies to examine the effect of exposures, our goal is to mimic this counterfactual ideal.
When we make contrasts between groups of interest, there are two specific types of contrasts that we can make, and I'll reference these throughout the talk. The first is the average treatment effect, or ATE, which is defined as a contrast of the outcome in the entire study population had those patients received treatment versus the entire study population had they not received treatment. So the ATE focuses on the entire study population, whereas the average treatment effect in the treated, or ATT, focuses on the treated population: it's a contrast of outcomes in those who received treatment. It's important to highlight these two targets of causal inference because, for a given study population and a given question, these parameters can differ. The average treatment effect may not be the same as the average treatment effect in the treated, particularly if there is treatment effect heterogeneity, meaning that the effect of treatment varies based on patient characteristics. And as we'll see, depending on the methods that we employ, certain methods will target an ATE while others will target an ATT.
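In the potential-outcomes notation commonly used in the causal inference literature (notation added here for reference; the talk defines these contrasts in words), where Y^1 and Y^0 denote a patient's outcome under treatment and under no treatment, and A denotes the treatment actually received, the two estimands are:

```latex
\[
\text{ATE} = E\left[ Y^{1} - Y^{0} \right]
\qquad\qquad
\text{ATT} = E\left[ Y^{1} - Y^{0} \mid A = 1 \right]
\]
```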
In 2023, nearly all health systems and hospitals had an electronic health record, providing unprecedented amounts of data. It's easier now than it has ever been to query your EHR database and obtain large study populations to make contrasts that address causal questions. This highlights the potential value of observational studies for causal inference. As most folks are, I'm sure, aware, there are lots of challenges when attempting to establish causality using observational data. Here are some of the most important ways in which observational studies can go wrong. Today we're going to focus on confounding, but I actually think that some of the other statues on this Mount Rushmore of biases, in particular immortal time bias and collider stratification, are often the most important issues that I see popping up in the literature. Immortal time and collider bias are more often the fatal flaws that I see in studies than issues of confounding. Miguel Hernan and colleagues established the target trial emulation framework: for any question you're thinking about designing an observational study around, start by imagining the randomized controlled trial that you would design, and use that hypothetical target trial as a guide for decisions about the design of your observational study, in particular how you deal with time zero and how you assign treatment and eligibility in relation to time zero. Immortal time bias and collider bias occur when you look into the future to assign treatment and eligibility, and envisioning your target trial is one way to avoid making that mistake.

But I digress. As I mentioned, we're talking today about confounding. So what exactly is confounding? There are two ways to think of it. The first is as a lack of comparability at baseline: differences in baseline risk for the outcome of interest. Another way to think about confounding is that it's a mixing of effects: the exposure effect, which is what we're trying to measure, is mixed with the effect of an extraneous variable. Below is a classic figure that shows a confounding relationship between the exposure of interest and the outcome that we're trying to measure, and then there's this extraneous variable, the confounder, that is a common cause of both exposure and outcome. To isolate the effect of exposure, we need to adjust away the effects of the confounder.

To illustrate these concepts of confounding and how we address them, I'm going to introduce an example study of ours where we set out to examine whether vancomycin combined with piperacillin-tazobactam increases the risk of acute kidney injury when measured using various different biomarkers. This was our study population: we had 739 patients enrolled. This figure highlights the issues with confounding for this paper. Our hypothesis was that vancomycin combined with pip-tazo is, in fact, associated with increases in creatinine-defined acute kidney injury, but that the association is due to effects on creatinine handling in the kidney, the inhibition of creatinine tubular secretion, so that the increase in creatinine and the diagnosis of AKI is not true nephrotoxicity but a pseudotoxicity. If that hypothesis is correct, then we would expect that AKI associated with vancomycin plus pip-tazo would not retain the associations with the downstream important hard outcomes that we tend to think of as being related to AKI, including the need for renal replacement therapy and the risk of mortality. As we'll see in a moment, addressing these questions is subject to important confounding by differences in baseline severity of illness in our study population.

This is the table one from the paper, where we are looking at important baseline characteristics that are risk factors or surrogate measures of baseline severity of illness, which contribute to the risk of acute kidney injury and the risk of mortality. If we look across these characteristics, we'll note that the pip-tazo group appears to have a much higher severity of illness at baseline compared to the control group, which was patients who received vancomycin and cefepime. They had a higher age, were more often receiving mechanical ventilation, had a higher APACHE III score, had worse baseline kidney function, more often had cirrhosis, and had a higher lactate at baseline. Across all of these factors, they appear to be a good bit sicker at baseline.

And these are the outcomes of creatinine-defined acute kidney injury, renal replacement therapy, and mortality before adjusting for any confounding. We'll note that the incidence of acute kidney injury was substantially higher, with a rate ratio of about 1.6, and so too was mortality, with about an 11% absolute increase. So our question is: after seeing the baseline characteristics of these patients, where the pip-tazo patients are obviously sicker at baseline, are the differences we observe in acute kidney injury and mortality the effect of treatment with pip-tazo in combination with vancomycin? Or are these differences being driven by the differences in baseline risk and the fact that the pip-tazo patients are sicker? This is the essence of confounding and the challenge of confounding when trying to interpret the results of observational studies. Our goal is then to use design choices and analytical approaches to try to minimize those baseline differences and isolate the effect of the exposure. There are a number of ways to do this.
As I'm sure most folks are familiar, randomization is the most effective tool to minimize confounding: by flipping a coin to determine who receives treatment versus who does not, we remove any association between baseline risk and the exposure, so that if we randomize enough individuals, the groups being compared will be similar on average. It helps us to isolate the effect of exposure. For many questions, though, we don't have the time or the resources to do a large multicenter randomized controlled trial, so we try to fill in the gaps with observational methods. Today we're going to focus on how we can use propensity scores to address confounding, and also on the role of multivariable regression. We'll start with regression.

For decades, multivariable regression was the clear standard, the go-to method, to adjust for confounding in observational studies. For those who are not very familiar with multivariable regression, at first glance these approaches can seem very complicated, and there are certainly a lot of complexities. But somewhere in middle school, we all learned simple single-variable linear regression, y equals mx plus b, which quantifies the change in y, the outcome, for a one-unit change in x. Multivariable regression models essentially do the same thing, except instead of regressing on one predictor variable, we include multiple variables. This allows us to understand the change in y for a one-unit change in x, holding all the other variables in the model constant. So if we include our treatment variable, our exposure of interest, in a regression model along with additional confounding variables, it allows us to examine the change in outcome for a one-unit change in our exposure variable, holding all the other confounding variables constant. It helps us to isolate the effect of our exposure of interest.

There are multiple types of regression models. The way these models work under the hood, the underlying mathematics, is fairly similar; the primary difference is the type of outcome that is of interest. We use linear regression if the outcome is a continuous variable, logistic regression if it's a dichotomous yes-no outcome, and Cox proportional hazards if we have a time-to-event outcome. There are certainly additional types of regression models that can be used, but these are by far the most common. And in general, multivariable regression can be very effective for controlling confounding from measured variables in your study.
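As a minimal sketch of that "holding all other variables constant" interpretation, here is a multivariable logistic regression on simulated data using the statsmodels library (illustrative only, not the study's actual analysis; the variable names and coefficients are invented):

```python
# Sketch: adjusted exposure effect via multivariable logistic regression
# on simulated data (illustrative; not the study's real data).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
n = 1000
age = rng.normal(60, 10, n)      # confounder
apache = rng.normal(70, 20, n)   # confounder

# Older/sicker patients are more likely to be treated (confounding).
p_treat = 1 / (1 + np.exp(-(-4 + 0.03 * age + 0.02 * apache)))
treated = rng.binomial(1, p_treat)

# Outcome depends on the confounders and (weakly) on treatment.
p_out = 1 / (1 + np.exp(-(-6 + 0.04 * age + 0.03 * apache + 0.2 * treated)))
outcome = rng.binomial(1, p_out)

df = pd.DataFrame({"outcome": outcome, "treated": treated,
                   "age": age, "apache": apache})

# The coefficient on `treated` is the log odds ratio for the exposure
# holding age and APACHE score constant.
fit = smf.logit("outcome ~ treated + age + apache", data=df).fit(disp=0)
print(np.exp(fit.params["treated"]))  # adjusted odds ratio
```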
The rate-limiting step of multivariable regression, its Achilles heel, is overfitting. What I mean by overfitting is trying to include more variables in the model than you have information to support, and information in the model is driven by the number of outcome events. There's a rule of thumb that, in general, we want at least 8 to 10 outcome events for each variable we include in our multivariable model. That's not a hard-and-fast rule; there are scenarios where you can sometimes go beyond it, but in general it's a helpful landmark. If we do substantially overfit our regression models, the models will become biased: the coefficients from the model will be biased and not accurate, and the confidence intervals and p-values will be biased as well, so that we may be led to false positive or false negative conclusions. That is really the key limitation of multivariable regression models.

In our example study, here are the numbers of events for the outcomes of interest. We had 254 acute kidney injury events, which is about five events per variable. For renal replacement therapy, we had only 40 events. So for the number of confounders we had to deal with, 54, we could not adjust for all of them; for really any of these outcomes, there would be a substantial risk of overfitting. That, in general, is the challenge with using multivariable regression, and this limitation was one of the motivations for the development of propensity score methods. As we'll see, propensity score methods are generally more flexible in terms of the number of variables that you can deal with.

So what is a propensity score? It is simply the probability of receiving treatment, a number that varies from zero to one, and it is a function of the patient characteristics, the confounding variables that we measure and include. We take those variables and estimate a given patient's probability of receiving treatment given their particular pattern of covariates.

To provide some intuition for exactly what we mean here, let's come back to our table one and the differences observed between those who received pip-tazo with vancomycin and those who received cefepime: the pip-tazo patients were older, more often on mechanical ventilation, had a higher APACHE III score, and had a higher lactate. Given these patient characteristics and these differences at baseline, I could ask: which patient has the highest probability of being treated with piperacillin-tazobactam? Patient A, a 75-year-old woman with cirrhosis and a baseline GFR of 35 who is mechanically ventilated with a high APACHE III score, or Patient B, a 50-year-old man with a GFR of 80, not receiving mechanical ventilation, with a normal lactate concentration and a lower APACHE score? Based on the differences we're seeing, you might infer that Patient A has the higher probability. We can make this kind of subjective, intuitive assessment of the probability of being treated based on differences in patient characteristics, and that's essentially what a propensity score is, except that instead of doing this intuitively, we use a regression model to estimate the probabilities quantitatively.

The genius of propensity scores, one of their key advantages, is that it's been shown that, on average, conditioning or adjusting for the propensity score is equivalent to adjusting for all the variables included in the propensity score model. You can think of the propensity score as a one-number summary of the covariate pattern for each patient across however many covariates are in that model: whether there are 5 variables or 500, we can summarize the pattern of those covariates in a single number. And if we condition or adjust for that number, it gives us the ability to adjust for all of those covariates. By doing so, it often gives us greater flexibility to include a higher number of covariates in our analysis. So that's what propensity scores are, in contrast to multivariable regression.
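Concretely, the estimation step might look like the following sketch (simulated covariates and scikit-learn's logistic regression; the covariate names are illustrative stand-ins for the kinds of variables in the study's table one):

```python
# Sketch: estimate a propensity score as the predicted probability of
# treatment from a logistic regression of treatment on the confounders.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 500
X = np.column_stack([
    rng.normal(60, 10, n),     # age
    rng.binomial(1, 0.3, n),   # mechanical ventilation
    rng.normal(70, 20, n),     # APACHE III score
])
# Simulated treatment assignment that depends on age.
treated = rng.binomial(1, 1 / (1 + np.exp(-(X[:, 0] - 60) / 20)))

ps_model = LogisticRegression(max_iter=1000).fit(X, treated)
propensity = ps_model.predict_proba(X)[:, 1]  # P(treated | covariates)
print(propensity[:5])  # one number between 0 and 1 per patient
```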
In terms of using either of these methods, regression or propensity scores, for causal inference, there are two key assumptions that need to be met. The first is positivity, which is defined as all patients in the study population having a non-zero probability of receiving each treatment. Another way to think of this is that there needs to be overlap of the covariate distributions. It's easiest to understand when we think about a single covariate. Let's say we're doing a multicenter study that includes two centers, comparing treatment A versus treatment B, and it turns out that treatment A is only used in center one and treatment B is only used in center two. In that scenario, there is no overlap of center across the treatments: in center one, there's a 0% probability of receiving treatment B, and vice versa for treatment A. A comparison of A and B is then also intrinsically a comparison of center one and center two, and if center is related to a patient's risk for the outcome, we would be unable to adjust for confounding by center. So we need overlap for each covariate, or more broadly, across the whole joint distribution of the covariates that we mean to address. Another way to think about positivity is the concept of clinical equipoise. We often think of clinical equipoise in the context of randomized controlled trials, but it's also relevant here. Thinking about our vancomycin and piperacillin-tazobactam example, if there are patients we would never treat with pip-tazo, say because they have a prior isolate that's resistant to pip-tazo, then trying to estimate the effect of pip-tazo in those patients is not clinically relevant and is at risk for important residual confounding. So positivity is a key assumption.
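As a minimal sketch of checking positivity for a single categorical covariate (hypothetical data mirroring the two-center example above), a treatment-by-covariate cross-tabulation makes structural violations visible: any zero cell means one treatment has zero probability at that covariate level.

```python
# Sketch: a quick positivity check for one categorical covariate.
import pandas as pd

df = pd.DataFrame({
    "center":  ["one"] * 5 + ["two"] * 5,
    "treated": [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],  # A used only at center one
})
print(pd.crosstab(df["center"], df["treated"]))
# A zero cell flags a covariate level where one treatment is never
# observed, so confounding by that variable cannot be adjusted away.
```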
Then there's something that's perhaps more familiar: the assumption of no unmeasured confounding. In the observational setting, we must assume that we have collected data on all the important variables that affect the outcome and could confound the contrast, that we've measured all of them and included them in our statistical analyses. Sometimes that is a big challenge.

Here I'm outlining the steps of a propensity score analysis versus multivariable regression. With each method, we start by defining our study population, then selecting the covariates, the set of variables we think are confounders, and collecting data on those variables. If we're using multivariable regression, we then go straight to the last step, estimating the association with the outcome: we build multivariable models and include those covariates in the outcome model. With propensity scores, there are some additional steps. We first estimate the propensity score, and then we do two additional things that are actually very valuable; one of the important ways in which propensity scores can help researchers do better causal inference is that these steps directly examine the assumptions needed for causal inference. We can directly examine covariate overlap, and we'll see how to do that. And then we directly examine covariate balance: we use the propensity scores to balance covariates, and then we can check and ask, are we actually balancing the covariates? If we can achieve adequate balance of our covariates, we can remove confounding from those covariates.

These steps are missing from the regression approach, and as we'll see, they are potentially very helpful, because they allow the researcher to revise the study design. When you examine overlap of the covariate distributions, if you see important areas of non-overlap, a violation of the positivity assumption, you still have the opportunity to go back and reconsider the eligibility criteria for your study. If we've estimated a propensity score and used it to balance covariates and find that it's not working, that there's residual imbalance, we can go back and revise the propensity score model. And we can do this without worrying about potential issues of post hoc analysis and p-hacking, as long as we make sure that we do not look at the association with the outcome. So one of the advantages of propensity score analysis is that it allows you to separate study design issues, to separate dealing with confounding from estimating the association with the outcome.

Once we've estimated a propensity score, there are multiple ways we can use it: we can match on the propensity score, or use it for weighting, stratification, or covariate adjustment. We'll briefly go through each of these methods.

Matching is probably the most intuitive way to use propensity scores, and that may be one of the reasons that, for a long time, matching was the most common way propensity scores were implemented, although I think there's now been a shift toward weighting approaches. Here is a simple example of matching. We have six treated individuals and 10 control patients, with the blue dot representing an important confounder. To match, we basically want to take each treated individual and find a control patient who looks like that treated individual. We start with the first patient and match them to a control, and we repeat this matching process until, for every treated individual, we have a control patient who looks like them. In this example I'm focusing on just one variable, but if we had summarized all the variables for a patient into a propensity score, we could simply match on the propensity score: for each treated patient, we take their propensity score and find a control patient whose propensity score is as close as possible to that treated individual's.

Invariably, with matching, some patients will drop out of the analysis, and this can be one of the limitations of matching: you're generally making inferences on only a subset of the population. This can be particularly problematic if we're unable to match all the treated individuals. In general, matching estimates an ATT, the average treatment effect in the treated, and it is most effective when our focus is on estimating an ATT and we have many more untreated patients than treated patients. Our goal with matching is to ensure that we can match all the treated individuals, and we have the highest likelihood of doing that successfully if, as a rule of thumb that I use, there are at least twice as many controls as there are treated patients. If you fall below that ratio, you often find you're unable to match all the treated individuals. And if you're unable to match all the treated individuals, you are left with an effect estimate in the matched population that can be difficult to extrapolate to the bedside: instead of "this is the effect of treatment in individuals who get treated," it's "the effect of treatment in individuals who get treated and who were able to be matched," and that is a vague and hard-to-identify population. So that's one of the key limitations of matching.

Once we have our propensity score, there are many different matching algorithms we can use. Greedy nearest-neighbor matching is the most common and the simplest, and methodological studies suggest that it tends to be as effective as other, more complicated approaches. I suggest that if you're going to match, you should match within calipers. By caliper, I mean setting, a priori, a maximum allowable difference between propensity scores. Again, it's easiest to see with a particular covariate: let's say we were going to match patients just on age. We have to decide how close we want the matches to be. Do we want to match age to within plus or minus one year, plus or minus five years, plus or minus ten years? We would want to apply some type of caliper, and the same idea applies to the propensity score: we have to decide how close the propensity scores need to be.
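Here is a minimal sketch of 1:1 greedy nearest-neighbor matching on the propensity score within a caliper (toy code, not a production matcher; one common convention, not specified in the talk, is a caliper of 0.2 standard deviations of the logit of the propensity score):

```python
# Sketch: 1:1 greedy nearest-neighbor matching within a caliper.
import numpy as np

def greedy_match(ps_treated, ps_control, caliper=0.05):
    """Return (treated_idx, control_idx) pairs matched without replacement."""
    available = set(range(len(ps_control)))
    pairs = []
    for i, ps_t in enumerate(ps_treated):
        best, best_dist = None, caliper   # accept only matches within caliper
        for j in available:
            d = abs(ps_t - ps_control[j])
            if d < best_dist:
                best, best_dist = j, d
        if best is not None:
            pairs.append((i, best))
            available.remove(best)        # each control used at most once
    return pairs

pairs = greedy_match(np.array([0.30, 0.55, 0.90]),
                     np.array([0.28, 0.52, 0.10, 0.45]))
print(pairs)  # [(0, 0), (1, 1)]
```

Note how the third treated patient (propensity score 0.90) finds no control within the caliper and drops out of the analysis, which is exactly the matched-population problem described above.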
In general, propensity score matching has been shown to achieve excellent covariate balance, and so it is effective at removing confounding from measured covariates. As I mentioned, it estimates an average treatment effect in the treated. Reduction in sample size can be an issue: it can not only create problems of generalizability, but if you're doing a one-to-one match on the propensity score, your sample size can drop substantially.

As I mentioned, for many years propensity score matching was by far the most common way to use propensity scores, but in recent years there's been an upswing in weighting, and there are a number of important advantages that explain why. So what do I mean by weighting? Instead of matching on the propensity score, we can use the propensity score to re-weight the population, which has to do with how many times we count each individual in the outcome analysis population; we'll go through an example in a moment. The most common type of weighting is called inverse probability of treatment weighting, where for each patient we define a weight as 1 divided by the propensity score for the treated, and 1 over 1 minus the propensity score for the untreated. So patients are weighted by the inverse of the probability of receiving the treatment that they in fact received. Inverse probability weighting is the most common approach, although there are numerous types of weighting approaches, at least a half dozen and upwards of a dozen different ways we can weight patients. A newer method that is becoming more common is called overlap weighting, where patients are weighted by 1 minus the propensity score for the treated, and by the propensity score for the untreated. We'll talk about the differences between these in just a moment.
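In code, the two weight definitions are one line each (a sketch, where `ps` is a vector of fitted propensity scores and `treated` a 0/1 treatment indicator):

```python
# Sketch: IPTW and overlap weights computed from propensity scores.
import numpy as np

ps = np.array([0.2, 0.5, 0.8, 0.4])
treated = np.array([1, 0, 1, 0])

# Inverse probability of treatment weighting (targets the ATE):
# 1/ps for the treated, 1/(1 - ps) for the untreated.
iptw = np.where(treated == 1, 1 / ps, 1 / (1 - ps))

# Overlap weights (target the region of clinical equipoise):
# 1 - ps for the treated, ps for the untreated.
overlap = np.where(treated == 1, 1 - ps, ps)

print(iptw)     # approx. [5, 2, 1.25, 1.67]
print(overlap)  # [0.8, 0.5, 0.2, 0.4]
```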
Now, some intuition for how this works. Here I have another hypothetical study population, with six treated and eight untreated patients and a confounding variable that is out of balance: two-thirds of the treated patients have the covariate, let's say a past medical history of cirrhosis, while it's found in only 25% of the untreated group. If we take this population, estimate a propensity score based on this single covariate, and then create inverse probability weights, it looks like this. The treated individuals who have the covariate are multiplied by 1.5 in the weighted population, so those four individuals turn into six individuals in terms of the number of times we count them in what we call the pseudopopulation. The treated individuals who do not have the covariate are up-weighted even more, and a similar weighting is applied to the untreated population. What is happening with weighting is that patients in each group who have an uncommon covariate pattern are up-weighted more than those with common covariate patterns, and by up-weighting the uncommon covariate patterns, we end up with an equal prevalence of the confounding variable in both groups after weighting. We went from two-thirds versus 25% before weighting to 42.9% in each group after weighting, 6 out of 14 in each group.

Now, this may look like we are basically duplicating patients, and in fact that's what weighting is doing: we take the propensity score and translate it into a number that represents the number of times we're going to count each patient in the outcome analysis. What we have to do to account for this duplication is use clustered, robust variance estimation in our outcome models, which effectively brings the sample size back down to the original sample size. So even though this looks like we're making up new patients and creating a larger sample size, if we properly account for it in the outcome analysis, we retain the original sample size and the original power. By weighting patients in this way, we can remove the association between the confounders and treatment and balance the covariates.
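This sketch reproduces the arithmetic of that worked example end to end (6 treated patients, 4 of them with the covariate; 8 untreated, 2 with the covariate), confirming the 1.5 up-weight, the pseudopopulation size of 14 per group, and the 42.9% prevalence in each weighted group:

```python
# Sketch: verify the weighted pseudopopulation from the worked example.
import numpy as np

covariate = np.array([1]*4 + [0]*2 + [1]*2 + [0]*6)  # e.g., cirrhosis
treated   = np.array([1]*6 + [0]*8)

# Propensity score from the single covariate: P(treated | X).
ps = np.where(covariate == 1, 4 / 6, 2 / 8)  # 2/3 if X=1, 1/4 if X=0
w = np.where(treated == 1, 1 / ps, 1 / (1 - ps))
# Treated weights: 1.5 (X=1) and 4 (X=0); untreated: 3 (X=1) and 4/3 (X=0).

for g in (1, 0):
    m = treated == g
    size = np.sum(w[m])                         # pseudopopulation size
    prev = np.sum(w[m] * covariate[m]) / size   # weighted prevalence
    print(g, round(size, 1), round(prev, 3))    # 14.0 and 0.429 in each group
```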
Methodological studies have shown that weighting does a great job of balancing covariates. And whereas matching targets the average treatment effect in the treated specifically, weighting targets the average treatment effect in the entire study population, so if your research question is focused on the effect in all patients, that would be one reason to consider weighting. Weighting works great but can be more sensitive to violations of the underlying assumptions, in particular violations of positivity. Overlap weights, as I mentioned, are a newer type of weighting that is less sensitive to violations of positivity, but it doesn't estimate an average treatment effect, so it can produce effect estimates that can be hard to generalize. We'll talk more about these considerations in a moment. I think matching and weighting are becoming the go-to methods for propensity score analysis. There are other ways to use propensity scores, including stratification and covariate adjustment, but in general there are some important limitations to those approaches, so we're not going to focus much on them.

So those are the different ways we can use the propensity score. Now we'll quickly run through the steps of the analysis. First, how do we choose covariates for the propensity score model? In general, this is based on subject matter knowledge. We need to understand our outcome: what are the risk factors for the outcome? We try to measure as many of those as we can and include them. We want to avoid variables that are predictive of receiving treatment but are not risk factors for the outcome; those are called instrumental variables, and if we include them in our model, we can suffer a substantial loss of power, and they can potentially have a bias-magnification effect. So we focus on risk factors for the outcome and, as I mentioned, use our subject matter knowledge. We should also avoid univariate screening approaches, because there are a number of reasons why using statistical screening to select covariates can be biased.

How do we estimate the propensity scores? By far the most common way is to build a logistic regression model where the outcome is treatment and the predictor variables are the confounders, and from that we get the predicted probability, which is the propensity score. In general, this is often the only model we ever really need to consider, but it's not the only option: there are many other models we can use, including machine learning methods such as boosted regression. In many applications, more complicated approaches won't provide any additional benefit compared to logistic regression. Where I look to a more complicated model is when I'm having difficulty achieving balance. So I generally always start with logistic regression; if I'm having trouble balancing covariates, I might consider a more complicated model.

Once we have our propensity scores, a key step is to examine overlap of the propensity score distributions. From our example vanc/pip-tazo study, this is the overlap of the propensity scores in our study population, with the blue line being the distribution of propensity scores for those who received pip-tazo and the red line that of the comparator, those treated with cefepime. You can think of this as a picture of the confounding: if the propensity score is a single-number summary of all the covariates, then we can think of this as the overlap of the entire covariate distribution. And as I mentioned, there needs to be overlap; there needs to be positivity. Here we can see some violations of the positivity assumption. On the left-hand side, there are patients who, given their covariate pattern, were never treated with pip-tazo; on the other end, there were patients who, given their covariate pattern, were never treated with cefepime. These are patients for whom there may not be clinical equipoise. Areas of non-overlap, violations of positivity, are potential threats to our ability to adjust for confounding. One thing we can do is sensitivity analyses in which we repeat the analysis after excluding patients who fall within these areas of non-overlap. We can additionally trim patients in the tails of the overlapping distribution to focus on the region where there may be the greatest clinical equipoise for receiving one treatment or the other. And this area of overlap, where, at least in theory, there's the most clinical equipoise, is the target of inference of overlap weighting.
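A sketch of how one might draw this kind of overlap picture and define a common-support trimming region (simulated propensity scores standing in for the study's real ones, plotted with matplotlib):

```python
# Sketch: visualize propensity score overlap by treatment group.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
ps_treated = rng.beta(4, 2, 300)   # scores shifted toward 1
ps_control = rng.beta(2, 4, 300)   # scores shifted toward 0

bins = np.linspace(0, 1, 40)
plt.hist(ps_control, bins=bins, density=True, alpha=0.5, label="comparator")
plt.hist(ps_treated, bins=bins, density=True, alpha=0.5, label="treated")
plt.xlabel("Propensity score")
plt.ylabel("Density")
plt.legend()
plt.show()

# One common trimming sensitivity analysis keeps only the region of
# common support: [max of the two group minima, min of the two maxima].
lo = max(ps_treated.min(), ps_control.min())
hi = min(ps_treated.max(), ps_control.max())
print(round(lo, 2), round(hi, 2))
```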
This is perhaps, in my opinion, the most important step of propensity score methods, and of causal inference in general: directly looking at a picture of the confounding and getting a sense of whether we're comparing populations that are similar enough to make a contrast. Sometimes you look at this graph and there's very little overlap at all, and you say to yourself, gosh, I'm really comparing apples and oranges; maybe I need to rethink my study question to make a contrast that is more clinically relevant. It's an extremely valuable step that we don't have if we're using just multivariable regression to address confounding.

The next step is to check covariate balance. There are two ways people do this, standardized differences and p-values, and the take-home here is: do not use p-values to examine covariate balance. P-values are not measures of balance; they are measures of the probability that a difference is due to chance, and when we're assessing balance, we're not trying to make statistical inferences. More importantly, they're influenced by sample size. The standard measure we use is the standardized difference, a direct measure of balance: it's basically the mean difference over the pooled standard deviation for each covariate, and the threshold we use is a standardized difference of 0.1 or larger as a marker of meaningful imbalance. Just to illustrate the limitations of p-values, here are some data from my master's thesis project from several years ago, where I was comparing patients treated with colistin versus comparator antibiotics. Here are two risk factors, vasopressor use and blood products, that were substantially imbalanced before matching. Then I matched, and notice that my sample size went from roughly 2,300 down to less than 400. Obviously, my p-values look better simply because my sample size has dropped by a factor of five, but it would be erroneous to conclude that we've achieved balance because we have non-significant p-values: if you look at the numbers, they're obviously still substantially out of balance, and we still see that in the standardized differences.

Standardized differences tell us about balance in the mean of the covariate. For continuous variables, however, we may be concerned about the entire distribution. Here I'm showing density plots of baseline GFR across different versions of the propensity score model, showing that even if the standardized difference is below 0.1, we can still perhaps improve balance across the distribution by refining the propensity score model.
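A minimal sketch of the standardized difference for a continuous covariate, with the 0.1 threshold applied (dichotomous variables use a different pooled-standard-deviation formula, as noted in the Q&A below):

```python
# Sketch: standardized difference = mean difference / pooled SD.
import numpy as np

def std_diff(x_treated, x_control):
    pooled_sd = np.sqrt((np.var(x_treated, ddof=1) +
                         np.var(x_control, ddof=1)) / 2)
    return (np.mean(x_treated) - np.mean(x_control)) / pooled_sd

rng = np.random.default_rng(7)
age_treated = rng.normal(62, 10, 200)  # treated group older on average
age_control = rng.normal(58, 10, 300)

d = std_diff(age_treated, age_control)
print(round(d, 2), "imbalanced" if abs(d) > 0.1 else "balanced")
```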
Finally, we estimate the association with the outcome. Once we've matched or weighted with the propensity score, we can apply any type of outcome model in the matched or weighted sample; we just need to be careful to account for the propensity score design in the outcome model. With matching, we need variance estimation clustered on the matched pairs, and with weighting, we need robust variance estimation to account for the duplication.

So, in general, propensity scores versus regression: with propensity scores, overfitting is less of a problem, and the key advantages are the transparent evaluation of covariate balance and overlap, and the separation of the confounding analysis from the outcome analysis. An important potential advantage of multivariable regression is that, given a study population and a set of covariates, if you're not overfitting, regression will have more power than propensity score methods, and that can be an important advantage. So for a given data set, it's not accurate to say propensity scores are better or worse than multivariable regression; it's about understanding the methods and applying them to your specific question. Both methods are effective for controlling confounding from measured covariates. Importantly, neither method controls confounding from unmeasured variables; in that respect, propensity scores are certainly no better than regression approaches. Only randomized trials balance both measured and unmeasured variables.

Coming back to our study: as I mentioned, I had a concern about overfitting, we had a moderate overall sample size, and I was interested in estimating an average treatment effect. Those considerations led me to choose inverse probability weighting. There was an additional complication in that we had missing data, so we applied a multiple imputation analysis: we imputed 50 data sets, applied our weighting analysis in each data set, and then averaged across them for our final answer. Here's our table one after weighting: we had imbalance in these covariates before weighting and good balance after weighting. And as a reminder, these were the outcomes we were estimating: AKI, renal replacement therapy, and mortality. This is unadjusted, before weighting, and this is adjusted, after weighting. The association with AKI was reduced substantially, and the association with mortality was adjusted nearly completely away, suggesting that, in large part, those unadjusted signals were substantially confounded. And so that's all that I have, and we'll open it up to questions now.

Thank you, Dr. Miano. Really great talk. So we'll get started with the questions; just remember to put them in the question box on the right side. We'll start with the first question: When choosing covariates, do you recommend... actually, that question was already answered, so let's move on to the next one. Sorry about that. Your pip-tazo example resonates with me. I've struggled with how to compare outcomes using different meds, which by their presence suggest worse baseline status, for example, comparing different vasopressor combinations. I recently did an IPTW, inverse probability of treatment weighting, approach to compare angiotensin II plus norepi equivalents versus norepi alone. Had I done matching, would I expect the same result?

So, a great question. What I would say is that matching targets an average treatment effect in the treated: it focuses on estimating the effect just in the treated individuals, whereas weighting targets the entire study population. Number one, those two effects may not be the same if there is treatment effect heterogeneity, meaning that the effect of, in this example, angiotensin varies based on patient characteristics; say it works more effectively in those with a higher APACHE III score, just as an example, so that it's more effective in sicker patients than in less sick patients. The way matching and weighting work is that each method balances covariates.
Matching balances covariates by shifting the covariate distribution to look like the treated patients, whereas weighting shifts the covariate distribution to look like the average for the entire study population. So using matching or weighting would, in this case, shift the average APACHE III score to look like either the treated group or the average of the entire population. If that severity-of-illness measure is associated with treatment effect heterogeneity, then you could get different answers, and it's not necessarily that one would be correct and the other incorrect; they would just be two different estimates addressing somewhat different study questions. So, number one, they won't always give you the same answer. And then, in terms of deciding between the two, you would need to think about how many treated versus control patients you have. As I mentioned, matching works best when there are at least twice as many untreated as treated individuals, so that you can ensure you can match all the treated patients. I would also ask what the overlap in the covariate distributions looks like. If there are important positivity violations, weighting may not do as good a job, and that might be a reason to consider propensity score matching. So those would be the considerations for that question.

Okay, great. Our next question is: Where did the relative values of the weights come from? I think this is a couple of slides above, the 1.5 and the 3. Did you start with a percentage in the final population and then work backwards? That's a great question, and I should have shown that step. The first step, which I didn't show, is that I estimated the propensity score just from that single covariate, so each of those patients had a propensity score between zero and one. I then take that propensity score, and for the treated, the weight is one divided by the propensity score: if your propensity score is 0.25, your weight is 4; if your propensity score is 0.5, your weight is 2, and so on. So the first step is estimation of the propensity scores, and the second step is calculation of the weights.

Okay, great question, and more great questions coming in. I am going to keep going, so, I know it's almost 3 o'clock, or it is 3 o'clock, so if you want to stick around for the questions, please feel free. The next question we have: Great discussion. Fortunately, I have a biostats background, so I had the pre-knowledge to understand the details and concepts. This talk will be a great reference to discuss propensity analysis in general terms. So that's not a question, but the next one is: To confirm, the covariates chosen should have an association with both the intervention and the outcome. Is that right? Really great question. The most important characteristic is that they are associated with the outcome. A variable only causes confounding if it's associated with both the outcome and the exposure; that is the definition of a confounder. So those are the variables that must go in the model: you must include all confounders, those associated with both outcome and exposure. That's a requirement.
However, it does not hurt to also include variables that are associated only with the outcome. Those variables don't cause confounding, but if you adjust for them in your model, they reduce variance in the model, which increases power. So, in general, I would say focus on risk factors for the outcome when trying to identify covariates for the model.

Okay, that's great. Moving on to the next question: When comparing standardized differences of covariates, how would one compute the standardized difference if the variable is categorical rather than continuous? Great question. There are standardized difference equations for both continuous and categorical variables, and in general, data analysis software that can do these analyses will let you specify: this is a continuous variable, use that equation, or this is a categorical variable, use that equation.

Great. And can you share your thoughts on using propensity score matching for studies comparing not interventions but differing patient populations, such as COVID ARDS versus non-COVID ARDS, looking at an outcome? Yeah, that's a great question. Theoretically, the thinking about propensity scores is centered on a treatment, an intervention that we can control, and that's where the concepts make the most sense. But at the end of the day, a propensity score is just a mathematical way to summarize lots of covariates into a single number, and in that sense, there's no reason it couldn't be used when your exposure is something that is not an intervention, like COVID versus non-COVID, for example. It could still be useful if you're running into issues of overfitting with a multivariable regression model. I think a bigger issue, though, is thinking carefully about the causal question you are asking, especially as you think about the positivity assumption. So the methods can be used, but we have to be perhaps a little more careful about how we think about causality in those instances. Great, so really focus on your causal question before moving forward.

The next one we have: I use propensity score matching, and it's been extremely useful; it has saved a lot of resources and time and helped advance medical practice more rapidly, but I still get the "await the randomized controlled trial" response from some binary thinkers, even though many dozens of RCTs in my field of critical care have been highly confounded or found no differences in the respective interventional studies. How do you respond to them? Yeah, another great question. The challenge of learning about treatment effectiveness in critical care, and the role of randomized controlled trials, has evolved a lot and needs to continue to evolve. One of the key challenges with randomized controlled trials in critical care is the heterogeneity of our populations and the fact that we need to do a better job of identifying target populations who are likely to respond to a therapy. Treatment effect heterogeneity is really one of the Achilles heels of randomized controlled trials.
Addressing that problem is about doing a better job of enriching study populations, using subphenotyping and related methods, even at the molecular level, to identify patients who are likely to respond to a therapy based on its mechanism of action. We also need more effective ways to explore treatment effect heterogeneity statistically. We generally do that with single-variable subgroup analyses, but there's an emerging literature on more sophisticated ways to model treatment effect heterogeneity using multivariable methods. Those are important directions we need to continue to develop going forward. Observationally, it comes down to the individual study and the study design. Confounding is always a problem with observational studies, and propensity score methods can help with that, but oftentimes, as I alluded to earlier, confounding is not the fatal flaw of a study; more often it's collider stratification, or missing data, or immortal time bias. Effectively using observational data for causal inference requires not only the appropriate statistical methods for confounding but, first, good, sound research questions, and then sound study design. As I mentioned, the target trial framework is really useful for that. So I think there's certainly a role for both approaches, but it might be that observational methods are easier to get wrong; that's often the limitation there.

Exactly. And this is such a great set of methods for looking at the heterogeneous population we all deal with every day in critical care. So thank you so much. That concludes our Q&A session. Thank you so much, Dr. Miano; this was a really great talk. There are a lot of comments in the box saying what a great message this was and how helpful it can be for comparing groups, looking for causality, and addressing questions from our critical care population. So thank you so much for your talk. Thanks. Thanks for having me. I really enjoyed it, and I hope everyone has a good rest of the day. And thank you to the audience for attending. Again, this webcast is being recorded. The recording will be available to registered attendees within five to seven business days. Log into mysccm.org and navigate to the My Learning tab to access the recording. And that concludes our presentation for today.
Video Summary
In this webcast, Dr. Todd Miano, an epidemiology professor at the University of Pennsylvania, discussed the use of propensity scores for causal inference in critical care research. He began by highlighting the challenge of determining cause and effect in clinical research and the limitations of observational studies in establishing causality. Dr. Miano then explained the concept of confounding and the role of propensity scores in addressing confounding. Propensity scores are the probability of receiving treatment given a patient's characteristics and are used to balance the distribution of covariates between treatment groups. Dr. Miano discussed various methods for applying propensity scores, including matching, weighting, stratification, and covariate adjustment. He emphasized the importance of examining covariate overlap and balance to ensure the validity of propensity score analysis. Dr. Miano also compared the advantages and limitations of propensity score analysis with multivariable regression, noting that propensity scores are more flexible in handling a large number of covariates and can provide a transparent evaluation of confounding. However, he cautioned that neither method can control for unmeasured confounders. In conclusion, propensity score analysis can be a valuable tool for addressing confounding in observational studies and offers researchers a way to make causal inferences.
The bias created by confounders significantly affects the interpretation of outcomes in observational studies. This webcast from Discovery, the Critical Care Research Network, will:
Analyze the potential impact of confounders on the interpretation of outcomes in observational studies
Evaluate the effectiveness of covariate adjustment using multivariable regression models in controlling for confounders in observational studies
Create a plan to apply propensity score methods to match individuals in the control group to the exposure group based on equal propensity scores