Biostatistics and Interpretation
Video Transcription
Hi, welcome to the Biostatistics and Interpretation section of the Pediatric Critical Care Board Review course. I'm Leslie Durbin from the University of Washington and Seattle Children's Hospital. So with biostatistics, we are interested in the results of studies, which depend on what is really true and on the mistakes made in the search for truth. Within those topics, we'll discuss problem and study design, outcome measurements, effect size, properties of diagnostic tests, random error, including type 1 and type 2 errors, and systematic errors, including ways to minimize and address these. This should match pretty closely with the content outline for biostatistics, epidemiology, and clinical research in the ABP guidelines. We'll start with problem and study designs. Warm-up question: what is the difference between a case-control study and a cohort study? One, a case-control study has comparison groups and a cohort study does not. Two, cohort studies are prospective and case-control studies are retrospective. Three, cohort study subjects are selected based on what happens to them and case-control study patients are selected based on risk factors. Or four, cohort study subjects are selected based on exposure and case-control study subjects are selected based on outcome. If you picked four, that's correct, and now we will dive into why. If we're going to consider all types of studies out there, people generally break these into two categories, descriptive and analytic. Descriptive studies cover things like case reports, case series, surveys, focus groups, and interviews. Even the most limited type of descriptive study, the case report, can give us important information, especially about new emerging diseases, such as the 2019 coronavirus pandemic. We can also add qualitative studies to our list of study designs. These don't have a formal location in this hierarchy, but I generally include them with descriptive studies because they are not analytic studies. Instead, they use their own sets of methodologies to systematically answer questions that analytic studies don't answer very well, related to the patient experience of illness and other qualitative topics. I would refer you to the EQUATOR Network Standards for Reporting Qualitative Research. It's a really good read for interpreting this kind of study. We will spend a good portion of time discussing analytic study designs in greater detail. All analytic studies evaluate exposures and outcomes. Outcomes are clinical events of interest, sometimes also called the dependent variable. Exposures are things that happen before the outcome that you think might influence it; these are also called independent variables or predictors. Within analytic studies, people generally think of two categories, experimental studies and observational studies. Within the experimental study category, the classic design is the randomized controlled trial, or RCT, although there are also interventional or experimental study designs that don't involve randomization or necessarily controls. Within observational studies, we generally talk about cohort, case control, and cross-sectional designs. We will start by looking at randomized controlled trials. In a randomized trial, eligible subjects are randomized to treatment A or treatment B, or a control in place of one of the treatment groups. Then each group is monitored for outcomes. Randomization allows you to balance confounders between the groups.
A confounder is any factor that might influence the outcome separately from the exposure. Control of confounding allows you to make a stronger assessment of the independent association between the exposure and the outcome. These studies are always prospective because neither the exposure nor the outcome has occurred when the study starts, but they're only applicable to exposures that can be modified, because the researcher wants to control the exposure in only the patients randomized to that group. The randomization is a powerful way to control both measured and unmeasured confounders. Even factors that you may not know about that might influence the outcome should be balanced between the two groups if randomization is done properly. The fact that the exposure clearly precedes the outcome helps support causal inference for the association if one is identified. However, these are expensive. They can be challenging to conduct. The exposure may not be one that can be modified, either ethically or logistically. In addition, the study can only be conducted over a defined time period, so outcomes that occur after this time period, or that occur very late, or that occur rarely, are going to be really challenging to identify in this type of study design. Next we'll look at observational studies including cohort, case control, and cross-sectional study designs. These are separated based on how the comparison groups are defined within each of these study designs. In a cohort study, potential study subjects are classified by exposure status into patients that are exposed to the factor of interest and not exposed to that factor of interest. Then patients in each group are evaluated for the outcome of interest. So in the cohort study, patients are grouped by exposure status. This can be either retrospective or prospective, defined based on whether the outcome occurs before the study start or after the study start. The strengths of this kind of study are that you can measure the incidence of disease and period prevalence of disease, which you can't actually measure in other study designs. You can measure multiple outcomes at once, which can be efficient. You can actually ensure that the exposure precedes the outcome in a prospective cohort study design. This is good for rare exposures because you can identify exposed patients in a large cohort of patients. However, this can be inefficient for rare or late outcomes. It can be expensive to follow cohorts for the length of time that is necessary to determine the outcome of interest. And this study design is at risk of bias from unmeasured confounders. You can do some work to reduce the effect of bias for confounders that you know might be associated with the exposure and the outcome, and therefore might hide or inflate the observed association between an exposure and outcome. But if you don't know that something is confounding that association, you won't be able to address it in the analysis part of your study. We'll do a quick review of incidence and prevalence. Incident cases are defined as new cases that occur in a specified period of time. If I enroll 20 people in a study monitoring for writer's cramp, six of whom already have the illness, and four people are newly diagnosed with the illness over the period of my study, those four are incident cases. These incident cases represent four out of the 14 at-risk people in my population at the start of my study, which is an incidence of 29%.
Those six people, the arrows at the beginning of the figure, all have the disease when I started my study, so they're not at risk of then developing the disease; they already have it, which is why they don't count in the denominator for the incidence, but they do count towards prevalence. So I can consider both a point prevalence at time point A, where six people out of the 20 in my study have the disease at the beginning, for a point prevalence of 30%, and a period prevalence of 50%, which is the 10 people who have the disease, the six at time point A plus the four that developed it during the time course of my study, out of the 20 people involved in the study. Case control studies are the next kind we will consider. In the case control study, you collect eligible subjects based on your enrollment criteria, but you sample them according to the outcome. So you're going to compare people who are cases, people with the outcome of interest, to people who are the controls, people who did not develop the outcome of interest, and then you look back and see who was and was not exposed to the risk factor. These are generally retrospective. You have to know the outcome status of the patients before you can group them into the case or control comparison group, and most importantly, the researcher controls the ratio of cases to controls. So if I want to, I can look at four controls for every case or six controls for every case. Because I'm controlling that ratio, you cannot measure risk in a case control study, which influences both how you interpret the findings as well as how you statistically handle them. This is really an ideal study design for rare or delayed outcomes. You can evaluate multiple exposures at the same time. The classic studies for this design are those looking at mesothelioma and exposure to asbestos from industrial sources as a risk factor. The limitations are that, as we mentioned, you can't estimate incidence or prevalence from this kind of study. Choosing appropriate controls can be tricky. If you choose controls that are not actually at equal risk of the outcome as the cases, outside of your exposure of interest, you can introduce bias. And there's often also recall bias. You may be asking people to recall events or exposures that happened decades ago, which can be challenging. Finally, cross-sectional studies. In a cross-sectional study design, patients are identified and evaluated for both the outcome and the exposure at one moment in time. Using this kind of study, you can calculate point prevalence, as shown in our previous example. But this doesn't account for how long the illness lasts. So you may actually have a period prevalence of your disease that is much higher than the point prevalence that you've calculated, depending on how long the disease lasts. It's important to remember that point prevalence is not the same thing as incidence, and it's not the same thing as period prevalence. So there are limitations to how you interpret data obtained from these studies. You also cannot really evaluate causality. You can't really be sure that the exposure has preceded the outcome if you're measuring them at the same point in time. However, they're easy to conduct and generally cost effective.
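To make the arithmetic of the writer's cramp example concrete, here is a minimal sketch in Python; the numbers mirror the example above, and the variable names are mine rather than anything from the talk.

```python
# Incidence vs. prevalence, using the writer's cramp example from the talk.
enrolled = 20           # people enrolled at the start of the study
prevalent_at_start = 6  # already have the disease at time point A
new_cases = 4           # develop the disease during the study period

at_risk = enrolled - prevalent_at_start           # 14 people who could become new cases
incidence = new_cases / at_risk                   # 4 / 14, about 29%
point_prevalence = prevalent_at_start / enrolled  # 6 / 20 = 30% at time point A
period_prevalence = (prevalent_at_start + new_cases) / enrolled  # 10 / 20 = 50%

print(f"Incidence over the study period: {incidence:.0%}")
print(f"Point prevalence at time A:      {point_prevalence:.0%}")
print(f"Period prevalence:               {period_prevalence:.0%}")
```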
So this leads a lot of people to question, why bother with observational studies at all? I have some examples later, but as alluded to, you can't design a randomized controlled trial without them. You need to know how many people you expect to be affected by the outcome in both a treatment group and a control group to be able to do a power calculation. And if you don't do that, you may end up with a very expensive randomized trial that doesn't actually answer the question you have. Many questions can't be answered with RCTs, especially things that evaluate how therapies work in the real world. This is where pragmatic trials come into play. Regionalization of care, access to care, risk factors for disease, as we mentioned. Observational studies also give doctors an easy opportunity to be dismissive: it's just a retrospective study, so I'm not sure I should change my practice based on it. But we should remember that not all observational studies are retrospective, and retrospective doesn't mean they're necessarily bad. We just need to be careful of how we interpret them. As an example, the structure of the solar system is a model that was based on observational data, for which we have no experimental data to back it up. So the best type of study really depends on the question at hand. If you are interested in knowing how to prevent or treat a disease, then a trial design is going to be the best way to answer that question with the least amount of bias. But if you're interested in more basic epidemiologic questions, how common is a condition, what makes it likely, and how often do patients recover, or the natural history of the disease, you'll need a cohort study. If you're interested in risk factors for very rare events, like E. coli O157:H7 infection, you'll need case control studies. So if you take all of the high quality observational studies and experimental studies that relate to a particular question, you can move into a type of study design that is best termed data synthesis. This includes things like decision analysis, cost effectiveness analysis, and systematic reviews and meta-analysis. The Cochrane Handbook is the place to go with any questions you might have about systematic reviews and meta-analyses. These aim to collate all of the existing evidence on a particular question into one study. And meta-analysis aims to go one step further by allowing you to statistically combine the results of different studies. In both of these, the aim is to minimize bias by using very systematic methods, which are laid out in the Cochrane Handbook. So as a quick example, just to walk you through how these are put together and published, Roberts in 2017 asked the question, what is the true effect of antenatal steroids on reducing preterm infant mortality? This had been addressed in 23 eligible studies that they then combined for this systematic review and meta-analysis. These studies occurred over a very long period of time and enrolled a wide range of patients. By statistically pooling the results of these studies, they were able to determine that across all of the studies, antenatal steroids were indeed associated with a reduction in mortality among infants born prematurely, with a relative risk, or risk ratio, of 0.7. And then they went through and evaluated each study for its quality, evaluating it for the risk of bias, consistency of effect, imprecision, indirectness, and publication bias. They come up with these complicated-looking tables where each study is judged on these criteria. And then they use that, along with the estimate of effect from pooling the results of the studies, to come up with a GRADE recommendation for how strong we think the evidence is for this practice.
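As a loose illustration of what "statistically combining" the results of different studies means, here is a minimal sketch of fixed-effect, inverse-variance pooling of risk ratios on the log scale. The study numbers are made up for illustration; they are not the trials from the Roberts 2017 review, and real meta-analyses also consider random-effects models and heterogeneity.

```python
import math

# Made-up example studies: (risk ratio, 95% CI lower bound, 95% CI upper bound).
studies = [(0.65, 0.45, 0.94), (0.80, 0.60, 1.07), (0.70, 0.50, 0.98)]

weights, weighted_log_rrs = [], []
for rr, lo, hi in studies:
    log_rr = math.log(rr)
    se = (math.log(hi) - math.log(lo)) / (2 * 1.96)  # standard error recovered from the CI
    w = 1 / se ** 2                                  # inverse-variance weight
    weights.append(w)
    weighted_log_rrs.append(w * log_rr)

pooled_log_rr = sum(weighted_log_rrs) / sum(weights)
pooled_se = math.sqrt(1 / sum(weights))
pooled_rr = math.exp(pooled_log_rr)
ci_low = math.exp(pooled_log_rr - 1.96 * pooled_se)
ci_high = math.exp(pooled_log_rr + 1.96 * pooled_se)
print(f"Pooled risk ratio: {pooled_rr:.2f} (95% CI {ci_low:.2f}-{ci_high:.2f})")
```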
Based on the pooled meta-analysis, the risk ratio of 0.7, they would have had a high quality rating where because of the sheer number of patients that were involved in those studies, we're pretty confident that the estimated effect of 0.7 is very close to the true effect. However, we then upgrade that or downgrade it based in part on the quality of the studies included. And in this case, because there are so many question marks and some red spots in that table, they actually downgraded the results to just a moderate with that pooled relative risk of 0.7. They said right there, the result was downgraded once for the risk of bias in the included trials. So that's all of the study designs we have to cover. We will now move into outcomes measurements. Some basic facts about how outcomes are measured that then influence the statistical tests that can be applied. So a warm-up question, which of the following statements is correct? A, age is a binary variable. B, weight is a categorical variable. C, P to F ratio is a continuous variable. Or D, cancer staging is a continuous variable. If you chose C, you would be correct. Let's see why. So there are different types of variables and these can include binary variables such as do you have COVID or do you not have COVID? Which I realize in the real world is a little more complicated than that. Nominal or unordered categorical variables. These are categorical, so there are categories that one can be assigned to, but they are not ordered. They don't have a natural hierarchy such as your Zodiac sign. We have ordered categorical variables. These are categorical variables in which there is an inherent hierarchy. So you can have normal blood pressure, elevated blood pressure, or various degrees of hypertension. And then continuous, age, continuous variable. And then among continuous variables, there are time to event variables and these need particular handling, so I will discuss them a little bit separately. Each kind of variable matches a different set of descriptors that can be used for it. So with binary variables, you can report a proportion, which is actually the same as the mean of the ones versus zeros, depending on how you coded your variable. You can also use odds. For nominal data, you can use proportions and odds. For ordered categorical, you can use proportions and odds, but you can also use mean. If I have two populations of people with various stages of hypertension and I compare them, the mean of their stage will actually tell me which group on average has a higher stage of disease. However, taking the mean of this kind of data is only a good idea if the categories are separated by about the same amount of quote unquote badness, which they may not be. Going from normal blood pressure to pre-hypertension may not be as bad for you as going from, for example, stage two to stage three hypertension. In that case, you actually wouldn't want to use a mean to describe those differences. And then continuous data. With continuous data, you can do a lot more. You can use means, standard deviations, medians, interquartile range. And with time-to-event data, we'll use medians as well as hazards to describe change over time. Data comes in different distributions. So we talk about the normal distribution where if I plot your SAT score versus frequency in a large population of people, you'll see that most people cluster around the middle with a few people on the high end of positive scores and a few people on the low end of the scores. 
The reason we talk about this is that normally distributed data allows you to do some statistical methods that aren't appropriate for skewed data. So knowing how your data is distributed is important to being able to choose the right methods. In a normal distribution of data, the mean or the arithmetic average, the median, the 50th percentile or middle, and the mode, the most frequent, all line up. They're all essentially the same number. And then the distribution of data, both on the positive or right-hand side and the left-hand side, is evenly distributed, with one standard deviation covering 68.2% of the data, two standard deviations covering 95.4%, and three standard deviations covering 99.7% of the data. We can compare this to skewed data. This is some salary data for university professors from my epidemiology course, comparing salary and frequency. And in this data, you can see that there's a little bit of a left predominance or a rightward tail describing this data. Because of that rightward tail, those high values are pulling the mean or the arithmetic average a little bit north of the median. So this is by definition asymmetric. The mean is not the same as the median. Generally, people would choose to describe this with a median and an interquartile range instead. And you can calculate things like skew and kurtosis to describe how skewed it is. The important thing here is that this data may fail to meet some of the assumptions of parametric statistical tests, and those therefore generally shouldn't be used. We also talk about how our measurements have precision and accuracy. And this is important because we shouldn't confuse the two. All measurements have some amount of variability. So if I'm using a thermometer to measure temperature and this thermometer is calibrated to read to 0.1 degrees, plus or minus 0.2 degrees, I have to keep in mind that any of my measurements might be off by as much as 0.2 degrees. If I'm comparing two groups' temperature measurements, I have to keep that amount of precision in mind. You can measure precision using standard deviation or standard error to describe the spread of data. So this is the right versus left column in our table. Estimates that are precise are grouped together. Estimates that are not precise are spread over a wider area. However, this is not a replacement for accuracy. Accuracy is shown in the rows of our two pieces of data. You can have data that's not precise but is still accurate, where its average is giving you a true estimate of effect. What if your thermometer is off by five degrees? You won't know that by just being able to measure the precision. For this, you need to know the external truth, or you need to be able to do many measurements with different devices to be able to collate what you think the true answer is.
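A small simulation can make the normal-versus-skewed distinction concrete. This is a minimal sketch using only the Python standard library; the data are simulated, not the SAT or salary data from the slides.

```python
import random
import statistics as stats

random.seed(0)

# Simulated normally distributed data (e.g., a test score): mean and median agree,
# and roughly 68% / 95% / 99.7% of values fall within 1, 2, and 3 standard deviations.
normal = [random.gauss(500, 100) for _ in range(100_000)]
mu, sd = stats.mean(normal), stats.stdev(normal)
for k in (1, 2, 3):
    frac = sum(abs(x - mu) <= k * sd for x in normal) / len(normal)
    print(f"Within {k} SD: {frac:.1%}")
print(f"Normal data: mean = {mu:.1f}, median = {stats.median(normal):.1f}")

# Simulated right-skewed data (e.g., salaries): the long right tail pulls the mean
# above the median, which is why median and IQR describe skewed data better.
skewed = [random.expovariate(1 / 60_000) for _ in range(100_000)]
print(f"Skewed data: mean = {stats.mean(skewed):.0f}, median = {stats.median(skewed):.0f}")
```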
So moving that into discussions of effect size and how effect size is described in different studies. So a warm-up question. Is a risk ratio the same thing as a relative risk? Turns out it is. Is a risk ratio the same thing as an absolute risk? No. And is a risk ratio the same thing as an odds ratio? No. Let's see why. So risk or absolute risk is the same thing as probability, the same thing as chance. It is the number of times something happens out of the number of times it could have happened. So if I have a deck of cards and I'm interested in drawing the ace of diamonds, the risk or probability or chance that I do that is one ace of diamonds out of 52 cards in my deck, which is about 2%. This can be applied to the chance that a baby would be born on a Tuesday. So that's one day of the week that the baby's born on out of seven days in the week, about 14%. You can calculate the risk in two groups or the probability in two groups and compare them by dividing one by the other. So a risk ratio is the probability of outcome or risk in one group, here the treatment group, divided by the probability of outcome in the other group or the control group. So in the ARDSNet tidal volume trial from the New England Journal in 2000, mortality was 31% in the low tidal volume ventilation group compared to 39.8% in the control group, which gave a relative risk of 0.78, or a reduced risk in the low tidal volume ventilation group. This is different from absolute risk reduction. So absolute risk reduction is, as the name implies, the difference in raw absolute risk between the groups. So in the ARDSNet study, that's going to be the difference between 39.8% and 31%, or 8.8%. The application of the intervention had the effect of reducing mortality by 8.8%. This is important because that's the number you need to calculate the number needed to treat. The number needed to treat is 100% divided by the absolute risk reduction, or 100 divided by 8.8, which is 11.4. This means that we would have to apply our therapy to 11.4 patients in order to save one life. Another warm-up question. Which of the following statements regarding relative risk is correct? One, the higher the relative risk, the lower the p-value. Two, the range of possible values is from zero to infinity. Three, it is nearly identical to the odds ratio if events are frequent, for example, over 25%. Or four, never tell me the odds. Well, if you chose two, you would be correct. The range of possible values is from zero to infinity. But if you chose four, of course, you would also be correct. Effect size, odds versus risk. So we will discuss odds now, and it's important to understand how odds are different from risk. So risk, to review, is probability or chance, the number of times something happens divided by the number of times it could have happened. Whereas odds are the number of times something happens divided by the number of times it did not happen. So with our card example, I can draw the Ace of Diamonds one out of 52 cards in my deck for a risk of 0.0192, or I have an odds of drawing the Ace of Diamonds one time compared to the 51 other cards in the deck. And you'll see that because this event is rare, one out of 52, these two numbers are pretty similar. That's actually not always the case. If I then look at drawing any diamond out of the deck of cards, the risk or probability of drawing any diamond is 13 diamonds out of the 52 cards in my deck, which is 0.25. But the odds is now 13 diamonds out of 39 other cards, non-diamonds in the deck, which is 0.33. As the frequency of the event increases, these two numbers are going to get further and further apart. So risk is similar to odds if the events are rare, one out of 52, but not if the events are common. Also, risk is always smaller than odds. So when actual risk or probability is not known, you have to use odds. This gets us back to the case control studies where, if I've chosen the number of cases and controls I'm looking at in the study, I can't talk about risk because I've essentially chosen the actual risk that occurs in the study. In that case, I have to use odds.
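Here is a brief sketch of the risk ratio, absolute risk reduction, number needed to treat, and risk-versus-odds arithmetic just described, using the ARDSNet and deck-of-cards numbers from the talk; the function and variable names are mine.

```python
# Relative risk, absolute risk reduction, and NNT, using the ARDSNet numbers above.
risk_treatment = 0.310  # mortality with low tidal volume ventilation
risk_control = 0.398    # mortality in the control group

relative_risk = risk_treatment / risk_control  # about 0.78
arr = risk_control - risk_treatment            # about 0.088, i.e., 8.8%
nnt = 1 / arr                                  # about 11.4 patients treated to save one life
print(f"RR {relative_risk:.2f}, ARR {arr:.1%}, NNT {nnt:.1f}")

# Risk vs. odds with the deck-of-cards example: similar when the event is rare
# (1 of 52 cards), clearly different when it is common (13 of 52 cards).
def risk_and_odds(events, total):
    return events / total, events / (total - events)

print("Ace of diamonds (risk, odds):", risk_and_odds(1, 52))   # ~ (0.019, 0.020)
print("Any diamond (risk, odds):    ", risk_and_odds(13, 52))  # ~ (0.25, 0.33)
```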
So for an example, we can look at the question, is volume control ventilation associated with pneumothorax in RSV? So let's say I look back in my medical record and I identify 10 cases with a pneumothorax and 190 controls without. And in looking at those groups, I see that eight patients with a pneumothorax were in volume control compared to two in pressure control. But out of my 190 controls, 92 were in volume control and 98 were in pressure control. So what's the odds of a pneumothorax with volume control ventilation? Well, odds is the number of times something happened compared to the number of times it didn't. So that's eight over 92 compared to the odds in the pressure control group, which is two out of 98. And you can do the math there to come up with 4.3 in this example, volume control ventilation is associated with a 4.3 times the odds of pneumothorax. So if this is a rare event, the odds ratio will approximate the relative risk. And then I could think about this as being the same as my relative risk. So a question, can I assume that this is rare? It only occurred in 10 out of 200 patients in my study. Wrong, this was a case control study. We chose the 10 patients to look at and the 200 controls. So we can't assume this is rare. We would have to know from some other study design, like a cohort study, whether this was a rare enough for us to say that the odds ratio approximated the relative risk. Warm up question, to identify factors independently associated with the need for intubation in children with asthma, the following type of analysis would be appropriate. One, linear regression, two, paired t-test, three, Cox regression, four, non-proportional hazards modeling, or five, logistic regression. If you chose logistic regression, you would be correct. Let's see why. So what do we do if the outcome or predictor isn't binary? I can only calculate things like relative risk or odds if I have a binary outcome. How can I deal with continuous outcomes? So you could make it binary. You could say out of this example data where I'm comparing height on the x-axis to frequency on the y-axis out of a population of women and men, you could say, well, maybe my outcome of interest is whether or not an individual is over 75 inches tall. In that case, the proportion of women would be very small and the proportion of men would be still small but slightly higher. And I could then compare that binary data. What is the relative risk of being over 75 inches if you're a man versus a woman? But that actually loses quite a bit of power. I sort of ignore then the fact that I have all this other data about the heights of men and women. And so that isn't the best use of my continuous data. I could generate a descriptive value. So I could take the average height of women and the average height of men and compare them or the median. I could measure the correlation between gender and height or I could use a regression and I could describe an average change in one value with respect to change in another. Or here, I could describe the average increase in height associated with male versus female gender. There are different kinds of regression depending on the data we're looking at, including linear, logistic, and Cox regression, which we will touch on briefly. So if I wanted to look at correlation to deal with this continuous data, I could measure a Pearson's correlation coefficient. 
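Before moving on to correlation and regression, here is a quick sketch of the odds ratio arithmetic from the pneumothorax case-control example above; the counts mirror the example, and the variable names are mine.

```python
# Odds ratio for the pneumothorax case-control example.
# Exposure = volume control ventilation; outcome = pneumothorax.
cases_exposed, cases_unexposed = 8, 2          # 10 cases with a pneumothorax
controls_exposed, controls_unexposed = 92, 98  # 190 controls without

odds_volume_control = cases_exposed / controls_exposed        # 8 / 92
odds_pressure_control = cases_unexposed / controls_unexposed  # 2 / 98
odds_ratio = odds_volume_control / odds_pressure_control      # about 4.3

# The cross-product ratio (a*d) / (b*c) gives the same number.
assert round(odds_ratio, 2) == round((8 * 98) / (2 * 92), 2)
print(f"Odds ratio: {odds_ratio:.1f}")
```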
The Pearson correlation coefficient essentially measures the strength of a linear association between your two variables, going from a perfect negative correlation at minus 1 to a perfect positive correlation at positive 1. Zero refers to no correlation. So if I'm looking at infant gestational age on the x-axis in weeks and birth weight on the y-axis in grams, you can see that there's a correlation between birth weight and gestational age. And if you calculate the Pearson's correlation coefficient on this data, it is .52, which is a moderately positive correlation. You can also do a Spearman's rank correlation. This is a nonparametric test in which each data point is replaced with its positional rank, 1, 2, 3, and so forth. It gives you the same range of possible values, but is a better fit for non-normally distributed data. And then regression. Regression is more powerful than correlation because you will get not just whether or not the variables are associated, but an actual measure of how much each variable changes with respect to the other. The type of regression that you will use depends on the variable. Logistic regression is good for dichotomous outcomes such as mortality. If there are paired values, not for mortality, but for some other dichotomous outcome, you would use conditional logistic regression. You can use linear regression to look at continuous outcomes like oxygenation index. And you can use Cox regression to look at time to event data such as survival after ICU discharge. We'll look at a couple of examples after going through general regression terminology. So when you're doing a regression, you are evaluating the change in the dependent variable, or the one thing that depends on the other things, your outcome of interest, for example, mortality, and how much that changes with change in your independent variables. These are things that you think might affect the outcome; covariates, predictors, or explanatory variables are other names for them. Examples would be illness severity, age, comorbidities. The dependent variable is on one side of the equation and all of your independent variables go on the right side of the equation. There will be a constant to make the math work out. The constant itself is not very interesting. And then there will be a coefficient multiplied by the independent variable of interest. The slope given by that coefficient is the effect size. How much does Y, or the outcome, change with each unit change in X, or your predictor? And you can extend this to include multiple predictors if that's appropriate for your data. So in linear regression, the interpretation, the question, is what is the average change in the outcome for each unit change in the predictor? Back to my gestational age data, you can run a linear regression on birth weight for gestational age, and you come up with this output from your statistical program, but you can rewrite that as a formula. So birth weight in this example is equal to 124 times gestational age plus a constant. And you would read this as saying that for each additional week of gestational age, birth weight increases on average by 124 grams. Now, that's only appropriate to do if you think that the change over each week is roughly stable over the entire period of data, which it may or may not be.
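As a rough illustration of correlation and simple linear regression, here is a sketch using synthetic data shaped loosely like the gestational age and birth weight example. It assumes NumPy is available and does not reproduce the actual dataset or its coefficients.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data only: gestational age in weeks, birth weight in grams with noise.
gest_age = rng.uniform(24, 41, size=200)
birth_wt = 124 * gest_age - 1200 + rng.normal(0, 600, size=200)

# Pearson correlation: strength of the linear association, from -1 to 1.
r = np.corrcoef(gest_age, birth_wt)[0, 1]

# Simple linear regression: birth weight = slope * gestational age + constant.
slope, intercept = np.polyfit(gest_age, birth_wt, 1)

print(f"Pearson r = {r:.2f}")
print(f"Each additional week of gestation adds roughly {slope:.0f} g on average")
```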
The last example for regression that we'll look at is survival analysis and Cox regression. You need a different set of methods when your continuous variable of interest is time to event. And the reason for this is that the number of patients that are under observation changes over time as patients develop the outcome of interest. A Kaplan-Meier curve is typically used to describe this kind of data. So this paper from 2003 looked at post-discharge survival among post-cardiac surgery patients compared to medical ICU patients. They looked over a three-year window and showed that mortality in the medical ICU patients in the dashed line was higher over time than the post-cardiac surgery patients. If we were interested in the amount of difference between those two lines, we would calculate a hazard ratio. Hazard refers to the moment-to-moment risk of event, in this case, risk of death. And we would calculate that using a Cox proportional hazards regression. Okay, we will move from measures of effect to properties of diagnostic tests. This will lead us into the dreaded slash famous two-by-two table. In the two-by-two table, these are generally constructed in this orientation where your two rows refer to patients who test positive and patients who test negative. And then your columns refer to patients who are actually diseased or actually healthy. This leads us to four categories. You can have a true positive, a false positive, a false negative, and a true negative. We usually use letters to correspond to each of these boxes. So we can use this to calculate the sensitivity, specificity, positive, and negative predictive values of the test in question compared to some gold standard by which we have diagnosed people with the disease versus healthy people. So two of these are intrinsic to the test. That is the sensitivity or the chance of a positive test if the patient actually has the disease. The notation I learned in my epi courses was the probability of a positive test given positive disease status. I find that notation easier to remember, so I'm gonna provide it here for you in case that's helpful. This is A, so patients that test positive, out of all the patients who have the disease, which is that column, A plus C. You can also calculate the specificity or the chance of a negative test if the patient does not have the disease. This is the same as the probability of a negative test given that there is no disease. So the no disease group is the healthy group, B plus D in that column. And out of those patients, the ones that test negative are D. And then there are two of these, positive and negative predictive value, that depend on the test characteristics as well as disease prevalence. So these will change depending on how common the disease is in the population you're testing. The positive predictive value is the chance of having the disease if your test is positive, or probability of positive disease given positive test. So out of my positive test group, which is A plus B, the ones that are actually having the disease are the A group, and negative predictive value, which is the chance of not having the disease if your test is negative, or probability of negative disease given a negative test. So the negative test group is C plus D, and the no disease group out of that is D, the healthy people. These are both a little more useful in the real world setting because I'm testing people with the test and I get the test result, which is either positive or negative. And what I really need to know is how likely does that make it that they either have or don't have the disease. So in general, positive and negative predictive values are more useful than sensitivity and specificity alone.
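The two-by-two table calculations just described can be captured in a small helper like the sketch below; the cell counts at the bottom are illustrative only.

```python
def test_characteristics(tp, fp, fn, tn):
    """Sensitivity, specificity, PPV, and NPV from the cells of a 2x2 table.

    tp/fp/fn/tn correspond to the A/B/C/D cells described above."""
    return {
        "sensitivity": tp / (tp + fn),  # A / (A + C): positive test given disease
        "specificity": tn / (tn + fp),  # D / (B + D): negative test given no disease
        "ppv": tp / (tp + fp),          # A / (A + B): disease given a positive test
        "npv": tn / (tn + fn),          # D / (C + D): no disease given a negative test
    }

# Illustrative counts only, not from a real dataset.
print(test_characteristics(tp=90, fp=30, fn=10, tn=870))
```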
All right, let's look at an example. Super Test has a sensitivity of 95% and a specificity of 81% for the detection of COVID-19. What is the positive and negative predictive value under the following conditions? I cared a lot about this and I actually did some of this math as my children were in a school that had mandatory testing, because if you had a positive test, you had to go home and get a PCR test. But if you had a negative test, you could stay in school. So let's see how that would work out for me. In one setting, we can consider a prevalence of 2%. So out of a thousand patients, 20 of them have COVID. And we can consider a situation where we have the same thousand patients, but now 500 of them, or 50%, have COVID. If I use the sensitivity, I can calculate that the probability of being test positive if you are disease positive is X out of 20, which should be 95%. So that's 19 people. And in my other group where the prevalence is 50%, that's 475 people. So I put those numbers into my table. And then I use the specificity, which, I apologize, is 75% rather than 81% for this worked example. If the specificity is 75%, then that means that the people who test negative out of the people who do not have disease, X out of 980, is 0.75 or 75%, which is 735 people. And in my higher prevalence cohort, I have the probability of the test being negative if disease is negative. Specificity of 75% equals X out of 500, so X would have to be 375. So I put those numbers into my table. And now I calculate positive and negative predictive values to see how this test is going to perform in my two different populations. So the positive predictive value is the same as the probability that I have the disease if my test is positive. So that is 19 people out of the 264 in the row that test positive, times 100, which is 7.2%. The positive predictive value in my other group, where the prevalence of COVID is much, much higher, is now 475 out of 600 people, or 79%. So the positive predictive value changed a lot when I had a much more common disease. Now the negative predictive value. So the negative predictive value is the likelihood that I don't have the disease if my test is negative. You can see that when COVID is rare, the negative predictive value is excellent, 99.9%. But when the disease is more common, it's a little bit lower, but not terrible, 94%. So you can see that the positive predictive value changed a lot and the negative predictive value changed a little. So what happens when we have a COVID test with good sensitivity and moderate specificity and we apply it to a population with a very low frequency of infection? The positive predictive value is extremely low, meaning that most people who test positive actually don't have the disease. But then that changes as the prevalence goes up. So this brings us to the concept of screening tests. So the ideal screening test is inexpensive and it's safe, and you're able to test for something that you can treat, or, in the case of COVID, quarantine. You want a test that has a very high negative predictive value. You want people who screen negative to definitely not have the disease, because otherwise they're gonna walk around and expose everybody at the grocery store or at school. So in our case, Super Test had a really excellent negative predictive value, particularly when COVID was rare. But then it also had this really terrible positive predictive value. And so with screening tests, that's an acceptable trade-off and we have to be prepared for that.
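Here is a short sketch showing how positive and negative predictive values shift with prevalence for a test with the Super Test characteristics used in the worked example above (sensitivity 95%, specificity 75%); the function name and population size are mine.

```python
def predictive_values(sensitivity, specificity, prevalence, n=1000):
    """PPV and NPV for a test applied to a population with a given prevalence."""
    diseased = prevalence * n
    healthy = n - diseased
    tp = sensitivity * diseased   # true positives
    fn = diseased - tp            # false negatives
    tn = specificity * healthy    # true negatives
    fp = healthy - tn             # false positives
    return tp / (tp + fp), tn / (tn + fn)

# Super Test from the example: sensitivity 95%, specificity 75%.
for prevalence in (0.02, 0.50):
    ppv, npv = predictive_values(0.95, 0.75, prevalence)
    print(f"Prevalence {prevalence:.0%}: PPV {ppv:.1%}, NPV {npv:.1%}")
```

With the 2% prevalence this reproduces the roughly 7% positive predictive value and 99.9% negative predictive value from the example, and with 50% prevalence the roughly 79% and 94% figures.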
So we have to be able to communicate those positive test results accurately. In this case, it was more likely than not that you still didn't have COVID even if you had a positive test, which is why you have to have an acceptable follow-up test, in our case PCR, so that you can sort the true positives out from the false positives in that positive test group. Okay, one more thing to consider. What if we're looking at a screening test that doesn't have a binary result? So instead of a test that is positive or negative, we have a test now that has maybe multiple possible outcomes. In that case, what you do is you generate a receiver operating characteristic curve. You do this by calculating the sensitivity and specificity of the test at various cutoff points. And then you plot them, sensitivity versus one minus specificity. This is from an epidemiology textbook looking at the performance of the CAGE questions as screening for alcoholism. You can see that as you go from requiring a CAGE score of four down to one, the sensitivity of diagnosing alcoholism with that screening question goes up pretty significantly with only a little bit of cost in the specificity. A perfect test would be represented by this right angle line, where a test would have 100% sensitivity and specificity. And an uninformative test would be at the X equals Y line there across the center. So this is a pretty good test. But then you have to choose clinically what the appropriate cutoff is. Should that be three? Should that be four? A higher cutoff will be more specific but less sensitive for the condition in question.
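To show how an ROC curve is built from a test with more than two possible results, here is a minimal sketch that computes sensitivity and one minus specificity at each cutoff; the score data are made up for illustration and are not the CAGE data from the textbook figure.

```python
# ROC points for a screening score with more than two possible values.
# Made-up data: (score, truly has the condition?).
data = ([(0, False)] * 40 + [(1, False)] * 8 + [(1, True)] * 2 +
        [(2, False)] * 2 + [(2, True)] * 6 + [(3, True)] * 8 + [(4, True)] * 4)

for cutoff in (1, 2, 3, 4):
    tp = sum(score >= cutoff and sick for score, sick in data)
    fn = sum(score < cutoff and sick for score, sick in data)
    tn = sum(score < cutoff and not sick for score, sick in data)
    fp = sum(score >= cutoff and not sick for score, sick in data)
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    # Each (1 - specificity, sensitivity) pair is one point on the ROC curve.
    print(f"Score >= {cutoff}: sensitivity {sens:.0%}, 1 - specificity {1 - spec:.0%}")
```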
So now we will move into error. We will talk about both random and systematic sources of error, starting with type one and type two errors. So example question, I am going to conduct an observational study and I find that wearing tennis shoes versus Dansko clogs is associated with an odds ratio of 0.78 of having to run to a code that day. The 95% confidence interval around this estimate is 0.46 to 1.1 with a P value of 0.2. Since P is greater than 0.05, A, wearing tennis shoes clearly does not prevent my having to run to a code. B, the observed association is possibly due to chance alone. C, the observed association is definitely due to chance alone. D, the study might have been inadequately powered. Or E, let's keep a little optimism here. If you chose B, you would be most correct. If you chose D, you would also be correct. And of course, if you chose E, you would also be correct. So P values, and all of the energy we spend looking at them and thinking about them, are about trying to avoid type one error in our studies. Type one error is denoted by alpha and it is the risk of a false positive. In a false positive study, you have a situation where there is no actual treatment effect, but your study incorrectly found one. The other way to phrase this is that you incorrectly reject the null hypothesis. So just to review, what is the null hypothesis? The null hypothesis is what you are trying to disprove. For example, hydrocortisone has no effect on mortality in children with sepsis. You might design a study where you try to disprove that by seeing whether there is a mortality difference in children with sepsis treated with hydrocortisone versus without. You will measure this by alpha, which is the same thing as a p-value. This is a probability that the results are due to chance alone in your study. The higher that probability is, the more likely it is that chance influenced your results and that you may not want to trust them. So this brings up a whole host of issues related to how we get a p-value. This area is called statistical inference or hypothesis testing. So the specific statistical inference or hypothesis testing methods that you will apply depend on the type of measurement that you conducted, going back to the measurement types we discussed earlier in the talk. If you have binary or categorical data, those are going to be different methods than if you have continuous data. You'll have to consider the distribution if your data is continuous, and whether the data are paired. Paired data occur when you have repeated measurements in subjects, so if you measure my blood pressure over periods of time, or if you have a matched case control study design where you identified cases and then matched them, for example, on age, ventilator strategy, and comorbidity to controls. Just to give a real world example of paired data, this data looked at central venous oxygen saturations on admission and six hours later for 10 patients admitted to an ICU. You can see that we measured it on admission and then six hours after admission and then looked at the difference. Because we measured the same population of patients two times, we would need to use statistical methods appropriate for paired data. So if I have binary or categorical data, I can use the following methods. Let's say I'm looking at a study where I have an intervention of a Herculean monoclonal antibody that saves lives, and I find that this many patients are alive and dead in both my control and my treatment group. I would use a chi-squared test to analyze that data. If I found that one of the cells in my data was very small, zero, one, two, maybe up to 10, I would use a Fisher exact test. This is appropriate for any data you would use a chi-squared test on, but when there are small numbers present. It's technically not wrong with big numbers present either, it's just computationally intensive. If I have paired data that is binary or categorical, I can use McNemar's test to compare. If I don't have binary or categorical data and I'm looking at continuous data instead, I need to check the distribution of that data. For normally distributed data like hemoglobin, I would use parametric tests. For skewed data like pH, I would have to use non-parametric tests. For normally distributed continuous data, I can compare a single mean to a reference, like is the white blood cell count elevated in asthmatics, using a one-sample t-test. If I have two unpaired means that I'm comparing, for example, are lipases higher in children with sepsis than children with seizures, I would use a two-sample t-test. For two paired means, or two measurements in the same set of subjects, like does hydrocortisone increase blood pressure and I measure it before and after, I would use a paired t-test. If I have three or more means, for example, does the number of admissions per night differ by the phase of the moon, I would use an ANOVA. There are multiple ways to do an ANOVA depending on the variables you have at hand. You can do a one-way ANOVA, which looks at just one independent variable, so admissions versus moon phase. For a two-way ANOVA, you can use two independent variables, which allows you to adjust for another variable. For example, I want to look at admissions versus phase of the moon, controlling for whether or not it's a weekend.
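Assuming SciPy is available, the tests just described map onto library calls along these lines; the data are illustrative only. (McNemar's test for paired categorical data is not in SciPy but is available in the statsmodels package.)

```python
from scipy import stats  # assumes SciPy is installed

# Categorical outcome, two groups: chi-squared test on a 2x2 table of counts
# (alive/dead in treatment vs. control); Fisher's exact test when cells are small.
table = [[30, 10],   # treatment: alive, dead
         [20, 20]]   # control:   alive, dead
chi2, p_chi2, dof, expected = stats.chi2_contingency(table)
odds_ratio, p_fisher = stats.fisher_exact(table)

# Continuous, roughly normally distributed outcomes (made-up measurements).
group_a = [7.1, 7.4, 6.9, 7.8, 7.2, 7.5]
group_b = [6.2, 6.8, 6.5, 6.9, 6.4, 6.7]
group_c = [6.0, 6.3, 5.9, 6.5, 6.1, 6.2]
t_unpaired, p_unpaired = stats.ttest_ind(group_a, group_b)   # two-sample t-test
t_paired, p_paired = stats.ttest_rel(group_a, group_b)       # paired t-test
f_stat, p_anova = stats.f_oneway(group_a, group_b, group_c)  # one-way ANOVA

print(f"chi-squared p={p_chi2:.3f}, Fisher p={p_fisher:.3f}")
print(f"two-sample t p={p_unpaired:.3f}, paired t p={p_paired:.3f}, ANOVA p={p_anova:.3f}")
```

The nonparametric counterparts discussed next follow the same pattern, for example stats.mannwhitneyu, stats.wilcoxon, and stats.kruskal.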
For skewed continuous data, I can compare a single mean or median to a reference using a sign test, which is basically the nonparametric equivalent of a one-sample t-test. If I'm looking at two means or medians that are unpaired, I can use a Wilcoxon rank sum or Mann-Whitney U test, which is like the nonparametric equivalent of a two-sample t-test. I can compare two paired means or medians using a Wilcoxon signed rank test, and I can compare three or more means or medians using a Kruskal-Wallis test. So now we'll move to type two error, or avoiding false negatives. First, a question. Which of the following statements about power is true? A, all things being equal, as the number of subjects increases, a study's power increases. B, it takes less power to find a 5% mortality difference than it does to find a 10% difference. C, if a study has 80% power, then there's an 80% chance that the study's findings will be true. Or D, Count Dooku was more powerful than Yoda. If you chose A, you would be correct, and if you chose D, you would definitely be wrong. So type two error is denoted as beta. Beta error is the error of finding a false negative. This means that there is an effect to your intervention, but your study didn't find it. Your study incorrectly concluded that there was no effect. Power is one minus beta; that is, power is the probability of finding an effect if one exists. Conventionally, most studies accept 80% power, which is actually a 20% risk of a false negative study. I think this is worth bearing in mind when we see how many negative studies there are, because if our statistics are correct, up to 20% of those are potentially falsely negative, and negative by design. Many important studies where we really wanna know the answer are designed for 90 or potentially even 95% power, but we'll see why we sometimes don't do that: it's really challenging because the number of patients you need will increase accordingly. So power increases with more patients and with a higher p-value threshold for statistical significance. So there's a little bit of a trade-off here. If you are more willing to accept a false positive, then you can have greater power, but at a higher risk of a type one error. If you have a larger treatment effect, power will increase, because it's easier to find an elephant than a mouse. If the outcome is more common, that will help. And if there's a narrow distribution of outcome in the treatment group, that means that it's less likely to overlap with the control group. So let's walk through some of the calculations just to illustrate those factors. If I'm designing a study in which the baseline rate of cure for my terrible disease is 10%, how many patients do I need to enroll to achieve 80% power? Or what if I wanna achieve 90% power? So this is a sort of complicated graph, but what we're looking at in each of these graphs is sample size on the x-axis, power, or one minus beta, on the y-axis, with 80% in the dashed line and 90% in the solid line. And we're looking at two different experimental group proportions. So if my treatment is only kind of effective and it increases the rate of survival to 20%, that's the blue line. And if it works more, then it increases survival to 30%, that's the red line. And then my two graphs are looking at an alpha of 0.01, where I'm willing to accept less risk of a type one error. And then on the right, I have an alpha of 0.05.
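The sample-size graphs just described come from this kind of calculation. Here is a rough sketch using a simplified normal-approximation formula for comparing two proportions; the exact numbers will differ somewhat from the figure in the talk, which presumably used a particular software routine with its own corrections.

```python
from statistics import NormalDist

def n_per_group(p_control, p_treatment, alpha=0.05, power=0.80):
    """Approximate sample size per group for comparing two proportions
    (simple normal-approximation formula, two-sided alpha)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    return (z_alpha + z_beta) ** 2 * variance / (p_control - p_treatment) ** 2

# Baseline cure rate 10%; candidate treatment effects and power targets
# mirroring the walkthrough above.
for p_tx in (0.20, 0.30):
    for power in (0.80, 0.90):
        n = n_per_group(0.10, p_tx, alpha=0.05, power=power)
        print(f"cure {p_tx:.0%}, power {power:.0%}: about {n:.0f} patients per group")
```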
So the first thing I need to do is estimate how effective my intervention is. Is it going to give me a 20% rate of cure or a 30% rate of cure? This is gonna depend on conducting cohort studies and other types of studies to understand what these numbers are. And I'm gonna set my alpha. And then I'm gonna look at what number I need. In this case, I'm going to need 200 patients, assuming a 30% rate of cure in my treatment group, alpha of 0.05, and beta of 10%. So that covers p-values and power. And then that leaves us with systematic bias as the last thing to discuss. So systematic error has both direction and magnitude. This is important because there are some things that bias you toward the null; if I found an effect anyway, that bias just made it harder for me to find the effect, and it doesn't necessarily invalidate the findings of my study. However, if it's a bias that would be towards the effect and I found an effect, that would be much more worrisome. You can reduce bias with rigorous design and analysis. Things like randomization and blinding, those are very important. Having objective outcome definitions, also important. And in the analysis phase, if you have to, you can adjust for the presence of confounding if you know a confounder is present. So the formal definition of confounding is that a confounder is a variable associated with the exposure and the outcome, but not in the causal pathway. This is important because if I do adjust for something that's on the causal pathway between my exposure and outcome, I will actually reduce my ability to find that association. Kind of shooting yourself in the foot, you don't wanna do that. This is really important in observational studies because the exposure isn't random. You are at risk of lots of things being associated with the exposure, which has occurred naturally. Let's consider an example. I might find in a study that coffee drinking is associated with lung cancer. You might say, well, how could that possibly be? Well, if all coffee drinking occurs in cafes and people who go to cafes like to smoke, then it might be that drinking coffee is associated with smoking and smoking is associated with lung cancer. And I would need to adjust for the effect of smoking to then see the true effect of coffee drinking on lung cancer. So there are multiple ways to control for confounding. The best is to randomize your patients, but that then requires that you conduct a randomized trial. You can match. So in a one-to-one case control study, you can match on a confounder and then that confounder is balanced between your two comparison groups. You can stratify, where you look only within, for example, a PRISM quartile for the association between your exposure and your outcome, to see whether it holds true when you're looking at patients that all have similar PRISM scores, or you can use regression. Regression is especially powerful because it will allow you to adjust for multiple variables at the same time. What else is crucial for interpreting studies? We'll just quickly touch on internal validity, external validity, and causal inference. Internal validity refers to the concept of, did we reach an accurate conclusion for the patients we enrolled? So this is the validity of the study you conducted. Did you retain all of your participants, i.e., minimal loss to follow-up? Is your data high quality? Was there a lot of data missing? That's always worrisome. How compliant were the patients that you enrolled in your study with the treatment? Then there's external validity.
This has to do with, how well do your study results apply outside of your study context? So is it generalizable? One other way to word that question is, would my patient that I'm considering applying your therapy to have met enrollment criteria for the study? If they didn't, I have to think really carefully about how my patients are different from your patients. This gets into pragmatic versus explanatory studies. Explanatory studies are really aimed at identifying, can a treatment work? Is it efficacious? We're really interested in causal inference here, minimizing variation. You often have a very rigid protocol so that you know, was it really my protocol that worked and not something else? You have very selective inclusion criteria where you want all of your patients to be essentially as similar as possible. And you collect your data very carefully with respect to answering the question you set out to answer. Pragmatic trials are very different and they're really interested in looking at, does a treatment work in the real world setting? For these, they often have broader inclusion criteria and the outcomes are generally more clinically oriented. The last thing we'll discuss is causal inference. Not all observed associations are causal. In the 17th and 18th century, it was well known that people that were sailing on ships for many months were at risk of developing scurvy. This was thought to be due to a lack of exposure to something in the land that they were away from that they needed. So then, when men who were mining in California and Alaska with very poor diets developed something called land scurvy, people thought, well maybe there's something deficient in this land that makes them like the sailors that are away from land completely. Some people tried treating them by burying them in earth to then re-expose them to whatever it was in the land that they were missing. Unfortunately, that didn't work. Features supporting causality include, most strongly, that data from randomized controlled trials support the exposure to disease association. This is prospective and you've controlled the exposure, so that's the strongest evidence. However, you can also use data from non-randomized studies to establish causality in a sense if the following criteria are met. You want the data to support the association and you need evidence that the exposure clearly precedes the disease, that the association is strong, that there is no plausible non-causal association out there, a plausible biological mechanism for causation exists, and there's a dose-response relationship. This has been successfully used many times over in medicine, for example, establishing that certain forms of hepatitis increase your risk of hepatocellular carcinoma without conducting a randomized controlled trial. So it can be done, but we need to think carefully about all of these features when we're considering whether non-randomized data supports causality. Bad news, there's a lot to know, and it's important for a lot more in your clinical life than just a challenging and very expensive exam. The good news is that it's not unknowable. There are many excellent web-based and paper-based resources. These are a couple of my recommendations, as well as this series from Critical Care in 2002. And may the force be with you. Thank you so much.
Video Summary
In this video, Leslie Durbin discusses various topics related to biostatistics and interpretation in pediatric critical care. She starts by explaining the different types of studies, including descriptive and analytic studies, as well as qualitative studies. She then goes into detail about analytic study designs, such as experimental and observational studies, including randomized controlled trials, cohort studies, case-control studies, and cross-sectional studies. Durbin also talks about outcome measurements and the different types of variables, such as binary, categorical, and continuous, and how to describe them using measures like proportions, odds, means, and medians. She then discusses properties of diagnostic tests, including sensitivity, specificity, positive predictive value, and negative predictive value, and how to calculate them using a two-by-two table. Durbin also explains the concepts of type 1 and type 2 errors, as well as power and p-values. She discusses different statistical tests and when to use them based on the type of data and study design. Durbin also touches on bias and confounding, and how to control for them in studies. She ends the video by discussing internal and external validity, and the importance of causal inference in establishing causality.
Keywords
Leslie Durbin
biostatistics
pediatric critical care
study types
analytic study designs
outcome measurements
variables
diagnostic tests
statistical tests
causal inference