How Should I Make Sense of All These Conflicting Studies?
Video Transcription
A simple task. Thank you very much, Emily and Mike. I will do my best not to repeat a lot of what was said before, but I think a lot of the same techniques they've discussed I'll try to go through in some more detail here. I don't have disclosures related to this, but I will make one point: I think both Laura and Todd said an important thing about intellectual disclosures. For those of you who know me, I'm not a trialist, so you're about to get opinions from somebody who does not have a foot in this game. But I do think that anybody doing any research is already a bit biased when it comes to assessing other people's research, because we know what that means to us. So I will say that I am certainly a researcher.

In the next 15 minutes, this is what I would like to go through with you. The first is to talk about why I think it is plausible that ICU studies are particularly primed to have ongoing conflicting results, and things I think we should know as we think about studies going forward; many of our trialists have already been addressing some of these. Then, why ICU trials may be at higher risk than trials in other areas. And finally, what we as clinicians or readers can think about, even if we are not trialists, and how to evaluate these trials ourselves.

OK. So to start with, I'm going to go through a couple of forest plots from meta-analyses. You've seen these from both Laura and Todd, but just to orient you: on the left-hand side are rows of individual trials. The numbers, I know, are small, but the numbers next to each trial reflect its size, that is, how many people in each trial had events in either the intervention or the control arm. And the plot at the end, the forest plot itself, is centered on the line of unity, where we ask whether the risk or the odds (in this case it's a risk ratio) associated with the intervention, whatever that might be, increases or decreases the likelihood of having the outcome.

So just as an example, this was a recent systematic review and meta-analysis of the use of corticosteroids in the treatment of patients with sepsis. As you can see, there are a number of studies, of which a single one has a very clear signal in favor of the use of steroids, showing that it reduces mortality, with a confidence interval, that bar, that does not cross the line of unity. That would tell us the p-value is less than 0.05, or that this really is a confident study where we feel like we know the result. However, there are a series of other studies that indicate a substantially reduced risk of mortality, and the issue really is the confidence. I think this is part of what Todd was showing us in his work as well. Additionally, and we see this with many of our studies, we have a lot of studies where the suggestion is that this intervention, in this case steroids, is associated with a reduced risk of the outcome, mortality, but we just don't have enough confidence: the confidence interval crosses one. And of course, there are some other studies that are outliers in this group, some of which suggest potential harm. And at the bottom, similar to what Todd described, you can see the meta-analytic result, where the confidence intervals are, of course, much smaller, because we've accumulated a lot more evidence, including more and more patients.
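[Editor's illustration] To see the arithmetic behind why the pooled estimate at the bottom of a forest plot has a much tighter interval than any single trial, here is a minimal sketch of fixed-effect, inverse-variance pooling. The per-trial log risk ratios and standard errors are made-up illustrative values, not figures from the meta-analyses discussed in the talk.

```python
import math

# Hypothetical per-trial results as (log risk ratio, standard error).
# These are made-up values for illustration, not data from the talk's slides.
trials = [
    (math.log(0.65), 0.30),
    (math.log(0.75), 0.20),
    (math.log(0.85), 0.15),
    (math.log(0.70), 0.25),
]

# Fixed-effect (inverse-variance) pooling: each trial is weighted by 1 / SE^2,
# so more precise trials count more and the pooled standard error shrinks.
weights = [1 / se ** 2 for _, se in trials]
pooled_log_rr = sum(w * lrr for (lrr, _), w in zip(trials, weights)) / sum(weights)
pooled_se = math.sqrt(1 / sum(weights))

low = math.exp(pooled_log_rr - 1.96 * pooled_se)
high = math.exp(pooled_log_rr + 1.96 * pooled_se)
print(f"Pooled RR {math.exp(pooled_log_rr):.2f} (95% CI {low:.2f}-{high:.2f})")
```

With these assumed numbers, each individual trial's 95% interval crosses 1, while the pooled interval does not, which is the same behavior the speaker describes for the forest plots: individual trials point in the same direction but lack confidence on their own.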
And this is not unique to steroids in sepsis. This is another recently published meta-analysis pertaining to our critically ill patients, looking at tight glucose control. The range of the x-axis is different, so I'm not suggesting the confidence intervals are easily comparable, but you can see that, in general, many of the signals hover near the line of unity, and our confidence intervals remain wide. The same is true with selective gut decontamination. These are all meta-analyses published in the last 15 or so years, all in JAMA, so they would probably be considered high-quality meta-analyses, yet we consistently see this signal.

That is not to say this happens only in critical care, but I would point out that there are instances where other groups do meta-analyses and we don't see this persistent lack of confidence. This is just an example; it's from an alcohol use disorder drug, Campral (acamprosate). The x-axis here is not exactly the same as the others, but it's not very dissimilar, and the confidence intervals here are much tighter. That said, we are not alone. This is a meta-analysis I took from palliative care, and you can see they similarly have odds ratios, or risk ratios rather, for their trials with very large confidence intervals. So both in critical care and in palliative care, the sense may be different from some other fields: we are circling in on an answer, but our confidence may be less than we would like, and that is where meta-analysis may come into play.

So why might ICU trials be particularly at risk for this low confidence? I think there are three reasons, two of which were addressed by our prior speakers. The first is: why might we have this lack of confidence? Todd made a really nice point about the three fluid trials, that even at these very large numbers, 15,000, 10,000, and 5,000 patients, if our intervention has an effect at all, it is a very small effect, and those numbers might not be large enough. Just look at these two examples where the meta-analysis found that the intervention worked: there was a clear signal toward a reduction in the risk or odds of mortality, with a confidence interval not crossing 1. Yet only 3% of studies on their own supported this result in the first meta-analysis, and only 17% in the second. However, a significant majority actually pointed in that direction; they just didn't have the confidence on their own to say it. So I think that's part of it: we're really missing this confidence.

And why might that be? This is the only slide with an equation, and I am not suggesting that anybody needs to know it, but I would call your attention to the fact that to create the confidence interval we use something called the standard error, and the width of the confidence interval is directly related to the standard error. The standard error, in turn, is inversely related to the square root of the sample size, so every time I add more patients to a study, the standard error shrinks and the confidence interval narrows. Just to show you an example, I made up a study where the odds ratio for mortality comes out at 0.62. You can see what I think we all intrinsically know but sometimes forget: the confidence interval notably shrinks as the sample grows. Clearly it gets much better from 10,000 to 100,000 patients, but even going from 100 to 1,000 patients, which is really a pretty big difference, has a marked impact. So I think that's part of it, and if you look back at those meta-analyses, many of our studies on their own are just way too small to see the effect that we expect.
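[Editor's illustration] To make that concrete, here is a minimal sketch of how the 95% confidence interval around a fixed odds ratio of roughly 0.62 tightens as the trial grows. The event rates (25% mortality in the control arm, about 17% in the intervention arm) are assumptions chosen only to reproduce an odds ratio near 0.62; they are not taken from any real study.

```python
import math

def odds_ratio_ci(n_per_arm, p_treat, p_ctrl):
    """Wald 95% CI for the odds ratio of a hypothetical two-arm trial."""
    a = p_treat * n_per_arm             # events, intervention arm
    b = n_per_arm - a                   # non-events, intervention arm
    c = p_ctrl * n_per_arm              # events, control arm
    d = n_per_arm - c                   # non-events, control arm
    log_or = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)   # shrinks roughly with 1/sqrt(n)
    return math.exp(log_or), math.exp(log_or - 1.96 * se), math.exp(log_or + 1.96 * se)

# Assumed event rates: 25% control mortality vs ~17% with the intervention (OR ~0.62).
for n in (100, 1_000, 10_000, 100_000):
    or_, lo, hi = odds_ratio_ci(n, 0.171, 0.25)
    print(f"n per arm = {n:>7,}: OR {or_:.2f} (95% CI {lo:.2f}-{hi:.2f})")
```

Under these assumed rates, the interval comfortably crosses 1 at 100 patients per arm but excludes it by 1,000 per arm, the same intuition as the made-up 0.62 example on the slide.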
OK. The second reason is that I think oftentimes we're comparing apples and oranges when we think about the impact of an intervention on a group of patients. What's an example of that? This is probably an example you have heard in other contexts, but often we ask a very broad research question and design a study to answer it, and that question may be hard to answer on its own. For example: do steroids work in ARDS? We know that ARDS is a syndrome, with a lot of different characteristics in a heterogeneous patient population. Might a better research question be: do steroids work in a more homogeneous population of ARDS patients, where we narrow down from this big syndrome to something smaller? Two examples of that narrowing are the hyper- versus hypoinflammatory sub-phenotyping, which we'll talk about in a moment, and what we've all seen in the last few years, that we get more positive trials, trials that actually have a result, when we're looking at a relatively homogeneous group of patients, particularly those with COVID.

So just to give you an example, and many of you may be familiar with this: this is work from Carolyn Calfee, now about 10 years ago, where she looked at patients from several landmark trials. This one happens to be from ALVEOLI, the trial of high versus low PEEP in patients with ARDS. She sub-phenotyped them and identified two clusters of patients, termed hypoinflammatory and hyperinflammatory; that's on the left-hand side of this graph. In red are the hyperinflammatory patients, who tend to have higher levels of these inflammatory biomarkers, whereas the hypoinflammatory patients tend to have lower levels. This was particularly important because, in this study and in several others, they identified that the outcomes of these patients were very different. Specifically, as you can see here, overall mortality was very different: those in the hyperinflammatory group had significantly higher mortality than those in the hypoinflammatory group. But more importantly, when they looked at interventions and the relative impact of interventions on each of these subgroups, they found something even more interesting. Here are the results from that sub-phenotyping analysis: this is the hypoinflammatory group and this is the hyperinflammatory group, and they looked at how PEEP affected each of these groups separately.
As you can see, the use of high PEEP in patients in the hypoinflammatory group actually seemed to be harmful; they had worse mortality. Conversely, the use of high PEEP in patients in the hyperinflammatory group seemed to help; they had lower mortality. And this signal was completely buried, right? If we look over at the primary outcome from the original ALVEOLI trial, which did not sub-phenotype people, that signal is not there. So, very similar to what Todd was saying, if we look at all comers and include those TBI patients in our balanced fluid assessment, we may be muting a signal that is there, because in that subgroup we may see a different effect, where balanced fluids are perhaps actually harmful, versus the effect in the overall cohort. So I think we need to be careful about these sorts of phenotypes.

And then I think the third problem is that it's hard for us to figure out what's normal. What I mean by that is: how do we organize our studies around what we expect in the control group? This is very similar to what Laura was addressing early in her talk, that when you're planning a trial, especially a trial that's going to take a while, you're planning it based on the data in front of you, and if things are changing over time, your planned statistical calculation may be based on faulty assumptions. What's an example? I will not go through this in great detail; Laura did a much better job of it than I could in one slide. But just to call your attention to what she mentioned: in Manny Rivers' original study, the control group got about three and a half liters of fluid, whereas in the three coordinated, larger multicenter trials 13-plus years later, the control groups looked nothing like that original control group in terms of fluid management. So it's really hard, for all the other reasons Laura mentioned, to compare and contrast those.

But this is not unique to that example. This is the original statistical plan for the PETAL Network's ROSE study, the trial that came out fairly recently looking at early neuromuscular blockade for patients with ARDS. As you can see in their statistical plan, they made their sample size calculation on the assumption that 35% of patients in the control group would die, and they wanted to power the study to recognize an 8% absolute reduction in that mortality, from 35% to 27%. (You'll see that I have written 7% on the next slide; please ignore that, it should be 8%.) However, when they looked at actual enrollment, the control group had nearly 43% mortality. You might think that doesn't matter, but as some of you may remember, the closer the outcome rate in the intervention and control groups gets to 50-50, the larger the number of patients we need to recruit to detect the same absolute change. So where they had estimated they would need about 1,400 patients, to detect that same 8% difference at the observed mortality they would have needed about 2,000 patients for statistical significance. And as you may remember, this study was stopped early for futility, so it actually enrolled fewer than planned, closer to 1,000. Again, this is purely because their assumptions, which were entirely appropriate based on the data available to them at the time, proved not to be true.
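[Editor's illustration] For readers who want to see that sample-size effect directly, here is a rough sketch using the standard normal-approximation formula for comparing two proportions. The two-sided alpha of 0.05 and 90% power are assumptions for illustration; this will not reproduce the trial's exact planned figures, which depend on design details not given in the talk, but it shows why the same 8-point reduction costs more patients when the baseline rate sits closer to 50%.

```python
from math import ceil, sqrt
from statistics import NormalDist

def n_per_arm(p_ctrl, p_treat, alpha=0.05, power=0.90):
    """Approximate patients per arm to detect p_ctrl -> p_treat (two-proportion z-test)."""
    z = NormalDist().inv_cdf
    z_a, z_b = z(1 - alpha / 2), z(power)
    p_bar = (p_ctrl + p_treat) / 2
    num = (z_a * sqrt(2 * p_bar * (1 - p_bar))
           + z_b * sqrt(p_ctrl * (1 - p_ctrl) + p_treat * (1 - p_treat))) ** 2
    return ceil(num / (p_ctrl - p_treat) ** 2)

# The same 8-point absolute reduction, from two different baseline mortalities:
print(n_per_arm(0.35, 0.27))   # planned assumption: 35% control mortality
print(n_per_arm(0.43, 0.35))   # observed: roughly 43% control mortality
```

Under these assumptions the first call gives roughly 700 patients per arm (about 1,400 total) and the second roughly 780 per arm, illustrating the direction of change the speaker describes even though the trial's own recalculated figure was larger.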
OK, so what can we do with all of that in context? I think the first thing, and this is often hard to do, is to try to understand both the reliability and the validity of the study we're looking at. These are just three examples; something with low validity and low reliability we're probably not talking about, so that's not shown. Our goal is to have something that's highly valid and highly reliable. Reliable means that if I repeat the study again and again and again, I get the same answer. Valid means that when I repeat it, the answer I get is close to the truth, in this case the target in the middle. In the second graph, you have something that's fairly valid, I'm getting the result I expect to get, but it's not very reliable; it's all over the place. That's sort of our meta-analysis situation, where we're circling around the right answer but not hitting it on the nose, and when we look at each individual trial it becomes hard to tell. And then, of course, you can unfortunately have something that's quite reliable but just wrong, that is, not valid. I think this one on the right is the hardest to discern. What I would say is that if, when you look at the trials as a reader or a clinician, you can point to something systematically wrong with how the trials are being run, and this doesn't just mean you don't like the results, but something genuinely systematic, say we're recruiting a lot more kids into the intervention arm and a lot more adults into the control arm, and there's something intrinsic about why those two populations behave differently, that should raise your concern that you may have something reliable but not very valid. But I think the more common issue in critical care is this middle one: the idea that we may need to do our own sort of informal meta-analysis, if someone hasn't done it for us, and say, all right, I recognize the differences between the trials, but can I combine them, and if I combine them, do they circle around the same result?

And so, if we think about this in terms of the specific questions we have, there are certain things to consider about how we can apply trials broadly and in the specific populations they're meant for. As our prior speakers noted, if you're going to talk about people who did or didn't get fluids before they were enrolled in a trial, you have to know how that applies to the patient in front of you. Similarly, if your patient is going to get balanced or unbalanced fluids, are you talking about someone who has traumatic brain injury or not? As an example, for steroids in viral pneumonia, we know, at least from the data we have, that there may be a differential effect: in general, in COVID it's helpful, and potentially in influenza it's harmful. If we just talk about viral pneumonia overall, we may miss that. Similarly, for paralytics in early ARDS, we have two large trials that show conflicting results, one of which suggests that early paralytics are potentially helpful in reducing mortality, and the other of which shows no impact. But if we dig more deeply, and this was the reason for the PETAL ROSE trial, to repeat the question with a different control group construction, we know that the control groups here are not the same.
And so the issue here is not necessarily that steroids don't work for viral pneumonia or that paralytics don't work for ARDS, but that we've created a situation where we have studies that are not generalizable to one another. I can't go from one to the next and assume the same result, and we really need to think about which patient population we care about when we're trying to interpret the data available to us. And so for null trials, from my perspective, whether you're a statistician or not, there are a couple of things you can look at that serve as clues to whether the data you're seeing will actually help you interpret the result. The first is: did the trial meet its enrollment target? As Todd alluded to, when you have to stop trial recruitment early, you will often get a non-significant result, even if the true effect would have reached statistical significance with full enrollment. The second, which I think is the one we often forget, is: does the actual control group look like the planned control group? Because if it doesn't, all of the statistical calculations are wrong. They're not wrong because someone did a bad job; they're wrong because they happened not to mimic the population the trial actually got.

So, in summary, I think ICU trials are going to have contradictory findings. Some of this is because we are in an era where we have a lot of heterogeneity in our enrolled populations, partly because we still often deal in syndromes and not in diseases per se, and our sample sizes are quite small. There are going to be secular changes in practice that affect things, as Laura addressed. And I do think that, as readers, even if we are not trialists, there are things we can do to assess for this, and I've listed them here. Is it just a reliability issue, where the results are true but I just don't have the confidence? Is it fair or not fair to generalize between studies? And if one or more studies are null, did they actually meet their statistical plans? Then finally, and Todd and his group at Vanderbilt are really the prime example of this, trialists are starting to do things to help us address these issues. They're enrolling more homogeneous groups of patients, whether that's by sub-phenotyping or otherwise personalizing to the intervention in question. And they're constructing larger, more flexible, pragmatic adaptive trials that allow them to say: with the sample size we had planned and the interventions we were looking at, we can modify things as we gain data, in a statistically reliable way, to answer these questions well. So I think we're going in the right direction, but for a long time to come we're going to be relying on conventional randomized controlled trials, and if we know their limitations, we can use them with both their strengths and their weaknesses in mind. That is it. Thank you.
Video Summary
The speaker discusses challenges and considerations in interpreting results from intensive care unit (ICU) trials, noting that these trials often yield conflicting results due to factors such as small sample sizes, heterogeneous patient populations, and changing control group conditions. Results of ICU trials, particularly those concerning syndromes like acute respiratory distress syndrome (ARDS), often carry low statistical confidence because of the variability inherent in critical care environments. Highlighting examples such as the use of corticosteroids in sepsis and early paralytics in ARDS, the speaker emphasizes the importance of evaluating trials based on their reliability, validity, and adherence to their statistical plans. The complexity of ICU trials requires nuanced approaches, such as sub-phenotyping and adaptive trial designs, to better capture and interpret results. The speaker calls for careful examination of trial methodology and results to draw sound conclusions, acknowledging both the ongoing challenges and the improvements in trial design.
Asset Caption
One-Hour Concurrent Session | When New Data Conflict With Old Data: Making Sense Through the Lens of Landmark Trials
Meta Tag
Content Type: Presentation
Membership Level: Professional, Select
Year: 2024
Keywords: ICU trials, ARDS, corticosteroids, adaptive trial designs, sub-phenotyping