Unlock the Power of Subgroup Analyses in Clinical Trials
Video Transcription
Welcome to today's webcast, Unlock the Power of Subgroup Analyses in Clinical Trials. My name is Otsania Ochobu, and I'm a clinical specialist at AdventHealth Central Florida in Orlando; I'll be moderating today's webcast. A recording of this webcast will be available within five to seven business days. To access the recording, log in to your MySCCM account and go to the My Learning tab. A few housekeeping items before we get started: there will be a Q&A at the end of the presentation. To submit questions throughout the presentation, type into the question box located on your control panel. Discovery has launched PERSICE, an inclusive survey of all critical care stakeholders to identify critical care research priorities. The goal is to involve all critical illness and injury stakeholders. Please take five minutes of your time to complete the survey and provide your feedback. Please note the disclaimer stating that the content to follow is for educational purposes only. And now I'd like to introduce our speaker for today. Dr. Todd Miano is an assistant professor of epidemiology at the University of Pennsylvania Perelman School of Medicine in Philadelphia, and I'll now turn things over to him for the education session.

Thank you, Otsania, for the kind introduction, and thanks to everyone in the audience for joining us today. One of my favorite topics, the utility of subgroup analysis for clinical decision-making, has been controversial for decades. On one side, we have epidemiologists like myself and statisticians arguing that subgroup analysis results are unreliable and should generally be viewed as hypothesis-generating only. On the other side, we have clinicians who argue that clinical trials typically enroll diverse patient groups with widely varying underlying characteristics and pathophysiology that may alter the effects of therapy, so that in a given trial some patients may benefit while others are harmed, and that the overall trial results are suboptimal for identifying those individual treatment effects. As is often the case with most controversies, both sides are correct, at least to some extent. So over the next 50 minutes, hopefully we'll provide some clarity on how to think about subgroup analysis. We'll first give some background on the importance and impact of heterogeneity of treatment effects in diverse patient populations, like those enrolling critically ill patients. We'll talk about some key principles, things to think about as we conduct subgroup analyses and, as readers of the literature, as we interpret and apply them. And then we'll finish with some more novel techniques that are being increasingly applied to clinical trial data to examine heterogeneous treatment effects.

The crux of the issue is the inherent challenge of how we determine cause and effect. For a given individual, how do we determine if a therapy will improve their outcome or have any effect at all? This is the fundamental Achilles heel of causal inference: for a given patient and a given therapy, it is impossible to determine whether or not the therapy had an effect. To determine causation at the individual level, we would need to be able to time travel.
So for a given patient and a given therapy, to determine the effect of that therapy, we would first need to treat the patient and observe their outcome over time, then travel back in time to the same starting point, withhold the therapy, and look at the difference in the two outcomes. In reality, we only ever see one of these counterfactual outcomes. We only observe whether a patient is treated or not; we don't observe the counterfactual. And so for any given patient and therapy, there are four potential outcome types. The individual could be immune, meaning that the outcome is not going to occur regardless of treatment: if the outcome is death, they're going to survive if we treat them, and they're also going to survive if we don't. The patient might be saved by the therapy: they will survive if treated but die if not. They may be harmed: they will die if treated but survive if not. And they may be doomed: fated to have the outcome regardless of therapy. So for a given individual, the key challenge is that if a patient survives, we cannot know whether they were immune or saved, and if they die, we cannot know whether they were harmed or doomed. At the individual patient level, we cannot determine causation.

What we can do, though, is try to mimic this counterfactual ideal at the group level, and this is what we do with randomized controlled trials. We start with a target population, and we flip a coin to determine who gets treated and who is untreated. That flipping of the coin renders the potential outcome distributions in the treated and the untreated equivalent on average, and that equivalency allows us to make causal inferences at the group level: inferences about average treatment effects. This key benefit provided by randomization is why randomized trials have become the gold standard for causal inference and how we learn which treatments work and which do not.

Unfortunately, the field of critical care has not had a lot of success with clinical trials. As everyone is, I'm sure, well aware, we've had dozens and dozens of trials examining therapies, and they often end up like this. Here I'm showing the summary results of early goal-directed therapy: the initial single-center trial showed benefit; subsequently, three large multicenter trials were conducted, and an individual patient meta-analysis summarized results across all three populations. The Kaplan-Meier curve shows that, on average, there seems to be no effect of early goal-directed therapy versus standard usual-care resuscitation approaches. We could talk a lot about the controversy of early goal-directed therapy, but it really just serves to highlight the fundamental issue: once we have an observed average treatment effect (in this example, a hazard ratio of 0.98, suggesting essentially no effect of the therapy on average), the key limitation is that average treatment effects are not the same as individual treatment effects. Individual effects are what we want, but as we've mentioned, we can't estimate those.
And so we're using average treatment effects as a surrogate. But as I'll show in this simple demonstration, average treatment effects can be ambiguous. In this hypothetical randomized controlled trial, 50 individuals were randomized to treatment and 50 were untreated, and in this hypothetical scenario, we know everyone's potential outcomes. Here the treatment in fact has no effect, overall or for any individual patient, so the population is comprised only of those who are either immune or doomed; because the treatment has no effect, there are no saved or harmed patients. In this scenario, 16 individuals died in each arm, so the risk is 32% in each, and there is no treatment effect. Now consider this second example. Again, we have 50 patients randomized to treatment and 50 controls, but now the treatment has an effect: some individuals are saved by the therapy, and others are harmed. Because this is a randomized controlled trial, the distribution of these potential outcomes is the same across arms. But it turns out that the number of patients who benefit is offset exactly by the number who are harmed, so that overall it appears there is no treatment effect, when in fact about 12 patients are saved by therapy and 12 are harmed; the numbers cancel out. This is the crux of the challenge: for a given trial and a given average treatment effect, it's difficult to know whether we are dealing with the first scenario, where there is no treatment effect overall or for any individual, or the second, where there are indeed individual treatment effects. And those individual treatment effects, as we've mentioned, are the ones we're really interested in; those are the effects that are most helpful for making clinical decisions.
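The cancellation in the second scenario is easy to verify numerically. Below is a minimal sketch, assuming a pooled population of 100 with the 32% mortality and the roughly 12 saved and 12 harmed patients described above; the exact potential-outcome counts are illustrative.

```python
import numpy as np

# Potential outcomes under treatment (y1) and under control (y0); 1 = death.
def make_population(n_saved, n_harmed, n_doomed, n_immune):
    y1 = np.array([0] * n_saved + [1] * n_harmed + [1] * n_doomed + [0] * n_immune)
    y0 = np.array([1] * n_saved + [0] * n_harmed + [1] * n_doomed + [0] * n_immune)
    return y1, y0

scenarios = {
    "no individual effects": make_population(0, 0, 32, 68),
    "12 saved, 12 harmed":   make_population(12, 12, 20, 56),
}
for label, (y1, y0) in scenarios.items():
    ate = y1.mean() - y0.mean()  # average treatment effect on the risk scale
    print(f"{label}: risk treated = {y1.mean():.0%}, "
          f"risk untreated = {y0.mean():.0%}, ATE = {ate:+.0%}")
# Both scenarios print 32% vs 32%, ATE = +0%: the average cannot tell them apart.
```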
This challenge of trying to infer individual treatment effects from the average treatment effect estimates provided by clinical trials is made most difficult when trials enroll heterogeneous populations, and that happens to be the norm in the field of critical care. We are rarely managing patients with a specific diagnosis or a specific disease; instead, we are managing patients who have a syndrome. Sepsis is a syndrome with multiple underlying causes. Acute kidney injury is a syndrome with multiple underlying causes: we define AKI by a change in either creatinine or urine output, and we say, all right, this patient has the AKI syndrome, which is associated with adverse outcomes, increased risk of mortality, length of stay, and so on. Let's say we want to design a trial to examine various options that might be beneficial for patients with acute kidney injury. The challenge is that for a given patient with a change in creatinine, it's generally impossible to know what's causing that change. Is it sepsis as the key driver, or ischemia-reperfusion injury, or nephrotoxicity from multiple different antibiotics?

So if we want to enroll patients in a trial examining a novel molecular therapy targeting a specific pathway that we think is involved in septic acute kidney injury, then to the extent that we enroll patients who have AKI because of ischemia-reperfusion or because of vancomycin nephrotoxicity, those patients have little chance of benefiting from this targeted therapy and are only at risk for its potential adverse effects. And if we take it a step further, each one of these contributing factors to acute kidney injury is often itself a syndrome that could be further broken down. This substantial diversity, the heterogeneity of underlying pathophysiology in patients enrolled in critical care trials, is the Achilles heel for our field and is thought to be one of the major factors that keeps us from identifying effective therapies for sepsis and other syndromes. In our field, perhaps more than most, average treatment effects are poor surrogates for individual treatment effects. This point has been highlighted previously: in a seminal paper from 2014, John Marshall reviewed more than 100 clinical trials examining multiple different molecular targets in an attempt to identify treatments for sepsis. The sum result of all of that research, cost, and time: zero novel therapeutics to treat sepsis, which remains the case today. But again, how much of this failure of clinical trials is related to the fallibility of average treatment effects in this specific population?

While it's not the focus of today's lecture, one approach to this problem is to take our syndromes and attempt to identify the underlying sub-phenotypes: taking a population with sepsis or AKI and applying unsupervised or supervised clustering algorithms and machine learning methodologies to group patients with a given phenotype, such as sepsis, into sub-phenotypes that may share a common underlying pathophysiology, and then enrolling those patients in a clinical trial, making the populations enrolled in trials more homogeneous. There's a growing literature there, and it seems to be a promising approach; a small sketch of the idea follows.
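As a rough illustration of the sub-phenotyping idea, not any specific published pipeline, the sketch below clusters simulated baseline features with k-means; the feature names, data, and cluster count are entirely hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Simulated baseline features for 500 sepsis patients; in practice these might
# be labs and biomarkers (e.g., lactate, creatinine, IL-6, platelet count).
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 4))

# Standardize so no single feature dominates, then cluster into candidate
# sub-phenotypes. The choice of k would need clinical and statistical support.
X_scaled = StandardScaler().fit_transform(X)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)

# A trial could then enroll, or stratify on, a single cluster to obtain a more
# homogeneous population.
print(np.bincount(labels))  # patients per candidate sub-phenotype
```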
Alternatively, in the trials that we have, we can, in addition, do a better job of exploring heterogeneity in the clinical trial populations. Three approaches for exploring heterogeneity in treatment effects include traditional subgroup analysis, which is the overarching focus of today's talk, and then some more novel approaches, including prognostic modeling and effect modeling. We'll start by focusing on traditional subgroup analysis.

As I suggested in the intro to the talk, subgroup analysis has long had a bad reputation as being unreliable, such that in general we should take the results of subgroup analyses with a grain of salt. Here are some representative quotes from thought leaders in the controversy: "Of all the various problematic methodologies in clinical trials, subgroup analyses remain the most overused and overinterpreted." "We suspect some investigators selectively report only interesting subgroup analyses, leaving the reader unaware of how many less exciting subgroup analyses were conducted and not reported." "Subgroup analyses are particularly prone to over-interpretation, and one is tempted to suggest don't do it, or at least don't believe it." And finally, "In general, we discourage subgroup analyses."

This perspective represents how I was taught in my training to view subgroup analyses: that they're hypothesis-generating only, and that rarely, if ever, should we use them to guide clinical decision-making. Why is that the case? Probably the most important reason subgroup analyses can be unreliable is the issue of multiple testing. When we set up a hypothesis test, the test is most reliable when it is based on a well-formulated, pre-planned hypothesis with an extensive amount of prior evidence to support it, and when we have designed our study around that single hypothesis. When those circumstances hold, we know what the false positive rate is: it's defined by our alpha level, which in general is p equal to 0.05. What that means is that if all those assumptions are met and we have a well-formulated hypothesis, we have a five percent chance of observing a false positive, meaning our test tells us there is a difference when in fact there is not one. Unfortunately, as the number of tests conducted goes up, the risk of observing a false positive rises rapidly and substantially, and subgroup analysis in a trial implicitly involves conducting multiple hypothesis tests. This risk of false positives is inescapable; it is always there when we're interpreting the results of subgroup analyses. For your average clinical trial reporting the results of 10 different subgroup analyses, the probability of observing at least one false positive is not five percent, but more on the order of 40 percent. So the risk of over-interpreting the results of subgroup analyses due to false positives is substantial, and it's perhaps most problematic when the overall results of a trial are negative, and particularly if the trial wasn't pre-registered: if there isn't a protocol posted before the trial was conducted saying these are the subgroups we are going to look at, these are the hypotheses underlying those subgroups, and then we conduct those tests and report all of the tests we said we were going to conduct. In the absence of a priori specification and reporting of all tests conducted, concerns about selective reporting of only the interesting subgroups become meaningful.

Here is an example that highlights the problem: if you look hard enough, you can generally find a difference that would be interesting. These are results from the Beta-Blocker Heart Attack Trial, published in the early 1980s, in which 146 different subgroup analyses were conducted; this histogram summarizes the distribution of the results. As you can see, it's basically a normal distribution, and the individual effect estimates in a given subgroup range from a beneficial effect with a risk difference of 10% to a harmful effect with a risk difference of 5%: vastly different results. If you look long enough and hard enough, you can find something interesting, but in reality, what you're observing is often just random variation.
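The 40 percent figure follows directly from the alpha level, assuming independent tests. A minimal check:

```python
alpha = 0.05
for k in (1, 5, 10, 22, 146):
    p_any = 1 - (1 - alpha) ** k  # P(at least one false positive in k tests)
    print(f"{k:>3} independent tests: P(>=1 false positive) = {p_any:.1%}")
# 10 tests -> ~40%, as quoted above; 146 tests (the Beta-Blocker Heart Attack
# Trial) -> >99%, so some "significant" subgroups are all but guaranteed.
```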
Now, while the risk of false positives is high, we are also plagued by a high risk of false negatives, because subgroup analyses are generally underpowered relative to the overall study population. In this figure, I'm showing the power of a subgroup interaction test versus the power of the overall trial. For example, the dark blue curve represents trials powered at 80% to detect the overall effect in the entire study population, and each estimate represents the power of the interaction test for various strengths of interaction. To detect a subgroup effect with an effect size equivalent to the overall effect size, if your study has 80% power for the main effect, you have less than 30% power for the subgroup effect. And a subgroup effect as large as the overall effect would be a notable subgroup effect; most subgroup effects are, in fact, smaller than the main effect, and so are generally woefully underpowered. A useful rule of thumb when thinking about study size: if you want to power your study to detect a subgroup effect, you need four times the sample size required for the overall effect (a quick numerical check of this rule follows below).
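This is a minimal sketch of that power calculation under a normal approximation with two equal-sized subgroups; the calibration is illustrative, not taken from the figure in the talk.

```python
from scipy.stats import norm

def power(effect, se, alpha=0.05):
    """Power of a two-sided z-test for a given effect size and standard error."""
    z = norm.ppf(1 - alpha / 2)
    return norm.sf(z - effect / se) + norm.cdf(-z - effect / se)

# Calibrate the main-effect SE so the overall trial has exactly 80% power.
delta = 1.0
se_main = delta / (norm.ppf(0.80) + norm.ppf(0.975))

# With two equal halves, the interaction estimator contrasts two half-size
# groups, so its variance is about 4x the main effect's and its SE doubles.
se_int = 2 * se_main

print(f"main effect:           {power(delta, se_main):.0%}")     # ~80%
print(f"interaction, same N:   {power(delta, se_int):.0%}")      # ~29%
print(f"interaction, 4x N:     {power(delta, se_int / 2):.0%}")  # ~80% again
```

Quadrupling the sample size halves every standard error, which is where the four-fold rule of thumb comes from.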
So on the one hand, subgroup analyses are plagued by a high risk of false positives, and on the other, by a high risk of false negatives. In addition, beyond those two inherent limitations, the methodology applied in clinical trials has generally been inadequate. A survey in the year 2000 suggested that less than half of trials reporting subgroup analyses used appropriate statistical tests. Twelve years later, 85% of trials reporting strong subgroup effects met less than half of established criteria for credibility, which we'll review in just a second. Twelve years after that, in 2024, we're not doing any better: in a review of 374 oncology trials, only 17% applied correct statistical analysis. So we're not helping ourselves out: we have an inherently challenging problem in interpreting the results of subgroup analyses, and we're not doing a good enough job of conducting those analyses as rigorously as possible.

Now, I've been painting a pretty stark picture of the limitations of subgroup analyses, but it doesn't have to be this way. Here are some quotes from a nice paper in the BMJ from 2015 that discusses when you can believe a subgroup analysis. They make the point that the well-documented unreliability of subgroup analyses is not inherent or unique to subgroup analyses; these are just the challenges of causal inference in general. The same problems would arise for clinical trials if we conducted them the way we conduct subgroup analyses: routinely performed in underpowered populations with haphazardly selected interventions. They suggest that if properly selected, meaning based on previous empirical evidence and current scientific theory, an adequately powered subgroup analysis can be a valid hypothesis test. So what helps us identify a scenario where a subgroup effect is believable? These are the key criteria we need to think about, and they are not new: this set of criteria was first suggested in the early 1990s and updated in the mid-2010s, and as we go through them, you'll notice that many are key principles of making causal inferences from randomized trials in general.

The first criterion is that the subgrouping variable should be a characteristic measured at baseline. Why is that so important? We should focus on baseline variables because anything measured after baseline could potentially be affected by the treatment itself. In this example, I'm showing a directed acyclic graph describing a hypothetical trial of angiotensin II vasopressor therapy, where the outcome is death, and let's say we're interested in whether the effect of angiotensin II varies based on whether the patients were treated with corticosteroids. That subgroup analysis makes a lot of sense: there are mechanistic reasons to think corticosteroids could be a source of differential effects of angiotensin II. But if we define this subgroup based on whether patients receive corticosteroids during follow-up, after randomization to treatment, then whether a patient is treated with corticosteroids is likely a function of the effect of angiotensin II. If angiotensin II is having a beneficial effect and blood pressure is improving, you're less likely to get corticosteroids; if, on the other hand, a patient is not responding, they may be more likely to get corticosteroids. So if we conduct a subgroup analysis on this variable, we are conditioning on the effect of angiotensin II, which would introduce a selection bias. That is the general concern: conditioning on post-baseline variables introduces a considerable risk of selection bias. So the first criterion is that we should focus only on baseline variables, information available at the time of randomization.

The next important factor is that the p-values we use to determine whether a subgroup effect represents a significant difference need to be p-values from interaction tests. What exactly do I mean? A subgroup analysis is a special case of testing for interaction, where we have one exposure, B, which is the subgrouping variable, and the treatment of interest, A, and we're interested in whether the effect of A differs for those with versus without the baseline subgrouping variable. The test of interest is an interaction contrast, which can be either a difference in differences, if we're working on the risk difference scale, or a ratio of risk ratios or rate ratios, if we're working on the ratio scale. Let's go through an example. This is a subgroup analysis from one of our previous papers, where we examined the risk of acute kidney injury in patients treated with NSAIDs for analgesia versus oxycodone, and we stratified our analysis by treatment with RAS inhibitors versus a calcium channel blocker, because there may be a drug-drug interaction between RAS inhibitors and NSAIDs. We conducted an analysis of NSAIDs versus oxycodone only in those on amlodipine at baseline and got a rate ratio of 1.21 with a non-significant p-value. We then repeated this analysis in those treated with RAS inhibitors and saw a rate ratio of 1.26 with a significant p-value in this subgroup. Now, if we looked at this finding, said we have a significant effect in one subgroup and not the other, and concluded that this is a significant subgroup interaction, that would be incorrect. That is not the appropriate statistical test for interaction. The subgroup-specific p-values are not what determine the significance of an interaction; it's the p-value associated with the interaction test, which in this instance is the ratio of rate ratios. If we take 1.26 divided by 1.21, we get a ratio of rate ratios of 1.04, with a confidence interval around that and a p-value of 0.807. This tells us that the effect of NSAIDs does not vary significantly across these strata.
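That interaction contrast is easy to reproduce from summary statistics. Below is a minimal sketch of a ratio-of-rate-ratios test on the log scale, backing out each standard error from its 95% CI; the confidence intervals supplied here are illustrative placeholders, not the published ones, so the p-value only approximates the 0.807 quoted above.

```python
import numpy as np
from scipy.stats import norm

def interaction_test(rr1, ci1, rr2, ci2):
    """Test the ratio of two rate ratios on the log scale, given 95% CIs."""
    se1 = (np.log(ci1[1]) - np.log(ci1[0])) / (2 * 1.96)
    se2 = (np.log(ci2[1]) - np.log(ci2[0])) / (2 * 1.96)
    log_rrr = np.log(rr1) - np.log(rr2)
    se = np.sqrt(se1**2 + se2**2)
    p = 2 * norm.sf(abs(log_rrr / se))
    ci = (np.exp(log_rrr - 1.96 * se), np.exp(log_rrr + 1.96 * se))
    return np.exp(log_rrr), ci, p

# Point estimates from the example; the CIs are hypothetical, for illustration.
rrr, ci, p = interaction_test(rr1=1.26, ci1=(1.05, 1.51),
                              rr2=1.21, ci2=(0.95, 1.54))
print(f"ratio of rate ratios = {rrr:.2f} "
      f"(95% CI {ci[0]:.2f} to {ci[1]:.2f}), p = {p:.2f}")
```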
Typically, these interaction analyses are shown in a forest plot like this, and what we're looking for is whether the confidence intervals of these estimates overlap: does the confidence interval for the RAS inhibitor effect include the point estimate of the other subgroup? If the confidence intervals overlap the effect estimates, in general we're going to have non-significant estimates of interaction. So in this example, if we looked just at the subgroup p-value, we would conclude that there is an interaction, but if we do this appropriately, we see that there is, in fact, no interaction. This is a key part of interpreting the results of subgroup analyses.

Going back to the early goal-directed therapy pooled analysis: they conducted 22 separate subgroup analyses and appropriately tested for interaction. They observed two significant subgroups, one of those being baseline liver disease, and what they found was that receipt of early goal-directed therapy appeared to be harmful in those with baseline liver disease, with no effect in those without, and a significant p-value for interaction. As we can see in the forest plot, there's no overlap of these confidence intervals. So this was a baseline variable with a significant interaction, and it begs the question: to what extent does this represent a true effect? To think about that, we need some additional credibility criteria. First, was the subgroup hypothesis pre-specified, ideally including the directionality? Did the investigators pre-specify that they think liver disease alters the effect and that patients with liver disease will have a worse outcome compared to those without? So not only pre-specifying the subgroup analysis, but also the direction. Second, is there a strong pre-existing biologic rationale? Can we explain the mechanism whereby having liver disease would lead to altered effects of early goal-directed therapy? And then, importantly, was the subgroup analysis one of a small number of subgroups tested? If we want to base clinical decisions on the results of subgroup analyses, ideally they are pre-specified with a strong biologic rationale, and the investigators tested only a handful, ideally just one or two.

When we think about subgroup hypothesis tests, or hypothesis tests in general, it can be helpful to think of them as a diagnostic test. If I have a patient with a positive urine culture, whether I think that urine culture represents infection depends on the pretest probability: how likely do I think they have infection before seeing the results of the test? That thinking can be directly extended to hypothesis tests. For a given significant p-value, the likelihood that it represents a true effect depends on the baseline prior probability and the number of tests that were conducted.
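The diagnostic-test analogy can be made concrete: treating statistical power as sensitivity and alpha as the false-positive rate gives a positive-predictive-value calculation for a significant result. A minimal sketch, with illustrative prior probabilities:

```python
def prob_true_effect(prior, power=0.80, alpha=0.05):
    """P(effect is real | p < alpha), by analogy to a diagnostic test's PPV."""
    return (power * prior) / (power * prior + alpha * (1 - prior))

for prior in (0.50, 0.20, 0.05):
    print(f"pretest probability {prior:4.0%} -> "
          f"P(true effect | significant) = {prob_true_effect(prior):.0%}")
# 50% -> 94%, 20% -> 80%, 5% -> 46%: the same significant p-value is far less
# convincing when the subgroup hypothesis had a low prior probability.
```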
So in the PRISM analysis, they conducted 22 separate subgroup analyses, they had 80% power to detect an interaction effect, and they observed two significant interactions: baseline liver disease, where early goal-directed therapy was worse, and baseline respiratory disease, where early goal-directed therapy was better. The key question, then, is what is the prior probability? Prior to seeing the results, did the investigators have a plausible mechanism by which these factors might affect the results of early goal-directed therapy for a given patient? If the prior probability is low, then regardless of the power of the test, the likelihood that we're observing a true effect is low. In contrast, if the subgroup effect is well powered and we're testing just a single subgroup, then we might have high confidence that what we are observing is a true effect.

A subgroup analysis that has had a major impact on care is the subgroup result from the RECOVERY trial of dexamethasone for COVID-19, where they found that the benefit of dexamethasone was strongest in those receiving mechanical ventilation at the time of randomization, still present for those on oxygen therapy, but absent, and in the direction of harm, for patients who were not on supplemental oxygen. These results have made it into care: I think it is not uncommon for clinicians and health systems, in their guidelines, to reserve corticosteroid therapy for those on oxygen support. But the question here is the same: this is a subgroup effect, so how reliable are these results? If we look across the criteria: Was it a baseline variable? Yes. Was there a significant interaction test? Yes. Was it pre-specified? Yes, although the direction of the effect was not specified in the protocol. Is there a biologic rationale? I would argue that there is. It was one of six subgroups tested; these results might be more believable if it were just one or two, but six is still a relatively small number. And are the subgroup effects consistent across multiple studies? Here we don't really know, because there's limited additional evidence; this is the main trial that we have. So I think there's still some question: if we repeated another large study, would these findings hold up?

In summary, traditional subgroup analyses carry a high risk of false positives and a high risk of false negatives, but subgroup results can be useful when carefully planned and executed, and when we're making decisions based on a limited number of pre-specified subgroups with a high pretest probability. Now let's say we're dealing with a trial where all of that is true. Let's say, in the early goal-directed therapy example, that all of these criteria are met both for liver disease and respiratory failure, so we have essentially the perfect scenario of believable subgroup effects.
There's still a challenge, though. The results of the subgroup analyses suggest that liver disease patients may be harmed, whereas those with respiratory failure might receive benefit. But then it begs the question: how do we treat a patient who has both liver disease and respiratory failure? An additional inherent limitation of subgroup analysis is that we're examining one variable at a time, but patients typically have multiple different characteristics that affect outcome and that may alter the effects of therapy, so one-variable-at-a-time subgroup analysis can become difficult to apply. This limitation was highlighted by the Predictive Approaches to Treatment effect Heterogeneity (PATH) statement, published in the Annals of Internal Medicine in 2020. That document does a really nice job of highlighting the challenges of interpreting one-variable-at-a-time subgroup analyses, and it proposes that clinical trials should routinely include multivariable approaches to examining treatment effect heterogeneity. The two broad approaches are examining interaction by baseline risk for the outcome using a prognostic model, or developing an effect model. Prognostic models you will be familiar with: things like the APACHE II score. A prognostic model estimates the risk of the outcome at baseline, before therapy is administered, and as we'll see, there are many scenarios where treatment effects may vary by a patient's baseline risk for the outcome. Alternatively, we can develop effect models. An effect model is not a model for the risk of the outcome, but a model that directly tries to predict the treatment effect, which is really what we're most interested in. Developing an effect model involves building a model that predicts your outcome if you were treated and predicts your outcome if you were not treated, and then estimates the difference between the two; importantly, it examines a broad set of interactions between treatment and baseline covariates.

For both approaches, the general analysis involves the same steps. The first step is to identify a model, either the prognostic model or the effect score model. In a perfect scenario, there is a pre-existing model developed in an external population that has been shown to have good predictive characteristics. Short of that, the next best option is to derive a model in an external population and then apply it. The worst-case scenario, which can still be valid, is to derive the model in the study population itself, that is, in the clinical trial population. Once we have a model, we apply it to the population to get a prediction for each individual patient, we stratify by that score from low to high, and then we test for interaction across those strata. So we're essentially testing for subgroup effects, not across a single variable, but across a score that represents multiple different variables; a sketch of this workflow follows.
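Here is a minimal end-to-end sketch of that risk-stratified workflow on simulated data. Everything is hypothetical: the covariates, effect sizes, and quartile cut are illustrative, and in practice the prognostic model would ideally be externally derived and validated rather than fit to the trial itself.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated trial: two baseline covariates, randomized treatment, binary outcome.
rng = np.random.default_rng(1)
n = 2000
df = pd.DataFrame({
    "age": rng.normal(60, 12, n),
    "lactate": rng.gamma(2.0, 1.5, n),
    "treated": rng.integers(0, 2, n),
})
logit = -6 + 0.05 * df["age"] + 0.4 * df["lactate"] - 0.3 * df["treated"]
df["died"] = rng.binomial(1, 1 / (1 + np.exp(-logit)))

# Step 1: a prognostic model from baseline covariates only (here, derived in the
# trial population itself, the least preferred but still valid option).
prognostic = smf.logit("died ~ age + lactate", data=df).fit(disp=0)

# Step 2: score every patient with their predicted baseline risk.
df["risk"] = prognostic.predict(df)

# Step 3: stratify the population from low to high predicted risk.
df["risk_q"] = pd.qcut(df["risk"], 4, labels=False)

# Step 4: test the treatment x risk-stratum interaction across the strata.
interaction = smf.logit("died ~ treated * C(risk_q)", data=df).fit(disp=0)
print(interaction.summary().tables[1])
```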
So first, prognostic models. It is often the case that baseline risk varies substantially within a study population. These are results from one of our recent analyses, where we were developing a prediction model for the risk of acute kidney injury. Overall, the risk of AKI in the population was 5%, but if we arrange people in order of their predicted baseline risk, nearly 60% of the population had a risk lower than 5%; the overall average risk is being driven by a relatively small number of very high-risk patients. So the first important factor providing the rationale for prognostic modeling is that risk varies substantially within populations. The second is that baseline risk determines treatment effect. If we assume a constant relative effect, say a relative risk of 0.5, that relative risk translates to a risk difference of 10% if the baseline risk in the untreated is 20%, but only 5% if the baseline risk is 10%. In general, the risk difference equals the baseline risk multiplied by one minus the relative risk, and this simple mathematical truism tells us that the treatment effect on the absolute scale will often be determined by the baseline risk. So if a therapy has a beneficial effect but also has important adverse effects, the balance of risk versus benefit will vary substantially across baseline risk for the outcome. Here I'm showing a hypothetical example of anticoagulation for COVID-19: let's assume that anticoagulation reduces your risk of thrombosis by a relative 25%, but also increases the risk of bleeding by a relative 25%. Whether the therapy is going to be beneficial, shown here as a number needed to treat or harm, varies substantially with the baseline risk of the outcome.

An example of a prognostic model applied to heterogeneity was published by McAllen and colleagues, who re-examined the SMART trial, which randomized patients to balanced crystalloids versus saline. They had developed a well-performing prognostic model in an external population, then stratified their population from low to high risk, and they showed that there was no interaction on the relative risk scale but a significant interaction on the risk difference scale, with the average benefit of therapy really driven by a significant effect in the highest-risk patients. So if we're worried about adverse effects from administering one fluid or the other, and we had a way to restrict to the highest-risk patients, we might optimize the balance of risk versus benefit.
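The benefit-harm arithmetic behind that kind of figure is simple to reproduce. A minimal sketch, using the talk's hypothetical relative effects (a 25% relative reduction in thrombosis, a 25% relative increase in bleeding) and an assumed fixed 2% baseline bleeding risk, which is my illustrative choice:

```python
rr_benefit = 0.75  # relative risk of thrombosis on anticoagulation
rr_harm = 1.25     # relative risk of bleeding on anticoagulation
p_bleed = 0.02     # assumed baseline bleeding risk (illustrative)

for p_thrombosis in (0.02, 0.10, 0.30):
    # Risk difference = baseline risk x (1 - relative risk).
    rd_benefit = p_thrombosis * (1 - rr_benefit)  # thromboses prevented
    rd_harm = p_bleed * (rr_harm - 1)             # bleeds caused
    print(f"baseline thrombosis risk {p_thrombosis:4.0%}: "
          f"NNT = {1 / rd_benefit:5.0f}, NNH = {1 / rd_harm:5.0f}")
# The NNT falls from 200 to about 13 as baseline risk rises from 2% to 30%,
# while the NNH stays fixed at 200: net benefit concentrates in the
# highest-risk patients.
```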
Now, a limitation of prognostic modeling is that it will sometimes enrich for treatment effect, but not always; differences in risk may not predict differences in treatment effect. The ideal scenario is what I'm showing here: when we stratify by our prognostic model, it sorts patients into increasing proportions of patients who would respond to therapy, who have the potential outcome type of being saved. But what may happen when we stratify by baseline risk is that we're not arranging people by whether or not they respond, but really by whether they were doomed to have the outcome or immune from having it. So prognostic modeling may not always be optimal for predicting treatment effects, and that is the rationale for developing a model that directly predicts effect.

There's a nice paper just recently published in Critical Care Medicine where they built an effect model in two of the three studies included in the PRISM individual patient-level meta-analysis. There were a total of 81 or 82 centers across the two trials, and they sorted those in random order into derivation and evaluation subsets. In the derivation subset, they applied three different machine learning algorithms to develop a score that predicts whether you're going to respond to early goal-directed therapy; they then applied the best-performing model to the evaluation dataset and conducted subgroup analyses. And these are the results, which are very intriguing. Going from the lowest score, which predicts that early goal-directed therapy would be beneficial, to the highest score, which predicts that it would be harmful, they show substantial variability in the estimated treatment effects across these strata, suggesting that a score drawing on multiple different variables may be more effective at explaining treatment effect heterogeneity than single, one-at-a-time subgroup analyses. Now, a limitation of these machine learning approaches is that they tend to be black boxes, so applying them at the bedside involves understanding why the model says a patient will benefit versus not benefit. There's still a substantial amount of work to be done in validating these findings in external populations and then developing tools to facilitate using an effect score to make decisions at the bedside. And so that takes us to the end. Thanks, everyone, for your attention, and we'll open it up to questions.

Thank you, Dr. Miano, for that very informative session. As a reminder, you can type your questions in the questions box on your control panel. One question that I have for you, Dr. Miano: you mentioned that if a study does not define or disclose its subgroup analyses a priori, it makes it difficult for us to interpret whether multiple subgroup analyses were done and whether only the interesting ones are being reported. What are some other things we can look at within studies that present subgroup analyses to assess them for validity, if they do not report a priori what they plan to test?

Yeah, so I think the criteria we discussed are helpful. If there isn't evidence that the hypotheses were pre-specified, we should in general have less confidence in whether or not they are valid. And short of that, how many subgroup analyses were conducted? I would say, in general, at more than five I start to have less confidence that the evidence is strong enough to make clinical decisions based on it.
So we want a limited number of analyses being conducted, and then, prior to the results of the subgroup analysis in a given trial, what was the pretest probability? How much evidence is there leading into the trial to suggest that it's a true effect? In the same way that we often don't rely on a single clinical trial to determine whether an effect is true, we shouldn't rely on a single subgroup analysis to determine whether an effect is true. But if we have a substantial amount of prior evidence, and the investigators pre-specify the subgroup analysis and post it to clinicaltrials.gov, and it's one of just a handful of hypotheses being tested, then those are the types of results that it may be reasonable to use for clinical decisions.

Okay, and that concludes our Q&A session. Thank you, Dr. Miano, for your time, and thank you to the audience for attending. Again, this webcast is being recorded. The recording will be available to all registered attendees within five to seven business days. Log in to mysccm.org and navigate to the My Learning tab to access the recording. That concludes our presentation for today. Thank you for your time.
Video Summary
In the webcast "Unlock the Power of Subgroup Analyses in Clinical Trials," Dr. Todd Miano discusses the intricate role and challenges of subgroup analysis in clinical research. Subgroup analysis can provide insights into how treatments affect different patient populations, yet it carries a reputation for unreliability due to issues such as multiple testing, high false-positive and false-negative risks, and underpowered analyses. Dr. Miano emphasizes the importance of pre-specification, limited testing, and strong biologic rationale for credible results. Proper subgroup analyses should be based on baseline characteristics to avoid selection bias and should employ appropriate statistical tests, such as interaction tests, to determine significance.

The session also highlights more advanced approaches, such as prognostic and effect modeling, which provide fuller insights by examining the influence of multiple variables on treatment effects. These methods promise a more enriched understanding compared with traditional one-variable-at-a-time subgroup analyses.

Finally, the webcast underscores the importance of robust methodologies and pre-trial planning in ensuring the reliability of subgroup analyses, advocating for these practices to guide clinical decision-making effectively.
Asset Subtitle
Research, 2024
Asset Caption
Gain insights into optimal timing and methods for conducting subgroup analyses based on solid statistical principles to ensure valid and applicable trial results during this webcast. Equip yourself with the knowledge to enhance your skills in clinical trial design, analysis, and interpretation to make better-informed treatment decisions for critically ill and injured patients. This essential learning is tailored for healthcare professionals and researchers.
Learning objectives
Discuss the importance of subgroup analyses in assessing treatment effects within diverse patient populations in large-scale trials involving critically ill patients
Identify when subgroup analyses should be prespecified and discuss the statistical principles that guide the appropriate number of such analyses
Explore various statistical methods used for evaluating heterogeneous treatment effects across different patient subgroups in clinical trials
Meta Tag
Content Type
Webcast
Knowledge Area
Research
Membership Level
Professional
Tag
Clinical Research Design
Year
2024
Keywords
Webcast
Research
Clinical Research Design
Professional
2024
subgroup analysis
clinical trials
Dr. Todd Miano
multiple testing
pre-specification
biological rationale
interaction tests
effect modeling
clinical decision-making