Reinforcement Learning
Video Transcription
I'm an intensivist at Beth Israel Deaconess Medical Center in Boston. My research, based at the Massachusetts Institute of Technology, focuses on the creation of a learning ecosystem across disciplines to leverage data routinely collected in the process of care. This presentation is an introduction to reinforcement learning. Before I begin, I would like to acknowledge research funding from the NIH and various industry partners.

The most popular application of reinforcement learning is AlphaGo, an algorithm that beat Lee Sedol, a grandmaster of the board game Go, in 2016. The premise behind reinforcement learning is to train an algorithm to predict the best action given a certain state from data captured in a dynamic environment. Applications of reinforcement learning have included gaming, robotics, and self-driving cars. Up front, I want to highlight how the healthcare domain differs from others where reinforcement learning has been successfully applied. With the board game Go, robotics, and self-driving cars, confounding is not a huge issue. Getting to the next level of an Atari game or successfully avoiding a car crash is solely dependent on the moves and the maneuvers. Whether a patient improves is confounded by factors that may not be captured in the electronic health record. We don't know what would have happened if a different action or no action had been taken: the counterfactual.

Reinforcement learning is but one of several types of machine learning. Supervised learning is what most of us are familiar with, where the data have outcomes, diagnoses, or other labels that we would like to predict using some formula. Unsupervised learning does not require such labels. Clustering of patients based on features such as vital signs and laboratory tests, without knowing whether they survived or not, is an example of unsupervised learning. In reinforcement learning, we map every action or intervention to a state and to some downstream state. The goal is to maximize a future reward, for example, the probability of being extubated within three days, with a series of actions over time. The assumption is that the series of sequential actions contributes to downstream states.

Some key concepts before we proceed. The agent is a mathematical construct of an object that takes action. For us laypeople, the agent refers to the clinician or sometimes the patient. But in the machine learning world, the agent is typically a neural network that we are training. The environment is the world in which the agent exists, or the ICU in our case. The third component is the action, which for all intents and purposes is an intervention that produces an effect. In the ICU, this can be a medication, fluid administration, or exploratory laparotomy. The action space is the set of all possible actions, which in the ICU is confined by our current understanding of critical illness. Unlike in an Atari game or a self-driving car, one cannot predict the outcome of all possible actions in the ICU, especially those that are never taken. To a certain extent, it is similar to the rules of a chess game, where you can move the knight only in a certain way. The ICU is more complicated, though, in that the rules are not explicitly spelled out. The observations pertain to the changes in the environment once an action is taken. What happens to a patient after a vasopressor is initiated? The state is a representation, a vector, of the patient in our case. The state S sub t consists of all the measurements that are observed, representing the patient at time t.
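As an illustration that is not part of the talk, here is a minimal Python sketch of the vocabulary just described: state, action, reward, and the transition observed after acting. The feature names, action labels, and numbers are hypothetical, not values from any clinical dataset.

```python
# A minimal sketch (not a real clinical environment) of the agent-environment
# vocabulary: state, action, reward, and the transition observed after acting.
# All feature names, action labels, and values are hypothetical.
from dataclasses import dataclass
from typing import Dict

@dataclass
class State:
    """Snapshot of the patient at time t: the state S_t."""
    features: Dict[str, float]  # e.g., vital signs and laboratory tests

@dataclass
class Transition:
    """One step of agent-environment interaction."""
    state: State        # S_t: what was observed before acting
    action: str         # a_t: e.g., "start_vasopressor" or "give_fluid_bolus"
    reward: float       # r_{t+1}: an immediate signal, e.g., improved urine output
    next_state: State   # S_{t+1}: what is observed after the action

# One hypothetical transition in an ICU episode.
s0 = State({"map_mmHg": 58.0, "lactate_mmol_L": 4.1})
s1 = State({"map_mmHg": 67.0, "lactate_mmol_L": 3.2})
step = Transition(state=s0, action="start_vasopressor", reward=1.0, next_state=s1)
print(step.action, step.reward)
```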
In regular reinforcement learning, the state S sub t is a snapshot of the patient at time t: the vital signs, the laboratory tests, the physical exam findings. However, in dynamic treatment regimes, a more advanced type of reinforcement learning, the state S at time t represents the history of states that a patient goes through, including all the treatments the patient has received. An action is evaluated through a reward function. In the ICU, the reward function can be immediate, such as what happened to the urine output in the next hour, or distant, such as whether the patient survived or not. The more downstream the reward function, the more confounding we have to worry about. On the other hand, more immediate outcomes may not translate to longer-term reward. Note that the reward is predicted at the time the action is being taken. More explicitly, the reward is actually an expectation of an outcome if a certain action is taken.

Here's a diagram that illustrates the application of reinforcement learning in the ICU. The clinician is typically the agent, our interventions represent the actions, and patients transition from one state to another. The state is represented by the vital signs, the laboratory tests, the images, the waveforms, the clinical notes, et cetera. The reward can be survival at 90 days, liberation from mechanical ventilation, lowering of the serum lactate, et cetera. The reward can be quantified at every single time step. The total reward is the sum of all the subsequent rewards that follow a certain action. In chess, for example, a move generates a set of subsequent moves, which lead to different rewards. All these rewards are summed up as a total reward for that action at time t. In the ICU, the total reward is typically a clinical outcome that we consider relevant, such as survival.

This tandem of environment, action, and reward is called a Markov decision process. The basic assumption of a Markov decision process is that the reward is a function of the state and the action and nothing else. We will keep highlighting this assumption, as it is the basis of reinforcement learning. The discount factor is the weight assigned to a reward. For example, one might want to weight immediate rewards more than those that are further downstream, given that prediction of distant rewards has more uncertainty. At present, there is no standard way of setting the discount factor; it is an active area of research.

At the heart of value-driven reinforcement learning is the Q function. Q is the expected total future reward for an action executed in a specific state. For example, given a set of vital signs, laboratory tests, length of stay, and fluid input, Q pertains to the sum of the expected probabilities of some predefined outcome associated with a specific intervention. The policy is the set of rules that the agent learns to infer the best action to take in a certain state. The policy pi of s pertains to the action that maximizes the future reward. Each possible action for a specific state corresponds to a total future reward. The one that has the highest total future reward is the optimal policy for that state. Finding Q, which corresponds to the action that has the highest total future reward for a specific state, is the central tenet of value-driven reinforcement learning. It requires discrete actions, such as administering fluids or commencing vasopressors.
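The total reward, the discount factor, and the greedy policy just described can be sketched in a few lines of Python. This is an illustrative aside rather than material from the talk: the rewards, the gamma value, and the toy Q table are assumptions made up for the example.

```python
# A minimal sketch of the discounted total reward and of a greedy policy
# that picks the action with the highest Q value for a given state.
# All numbers, state names, and action names are illustrative assumptions.
from typing import Dict, List

def discounted_return(rewards: List[float], gamma: float = 0.99) -> float:
    """Total reward at time t: each future reward is weighted by gamma**k."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

def greedy_action(q_values: Dict[str, float]) -> str:
    """The policy pi(s): the action with the highest expected total future reward."""
    return max(q_values, key=q_values.get)

# Hypothetical rewards following an action, with a large terminal reward for survival.
print(discounted_return([0.0, 0.5, 0.0, 10.0], gamma=0.9))

# Hypothetical Q values for one hypotensive state.
q_for_state = {"no_action": 0.2, "fluid_bolus": 0.6, "vasopressor": 0.7}
print(greedy_action(q_for_state))  # the action with the highest Q value
```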
In policy learning, rather than going through all possible discrete actions and calculating their total reward, the agent samples a subset of all possible actions, which may not be discrete, to determine the optimal policy. For example, it may sample a range of fluid bolus volumes or a range of vasopressor dosages.

Let's look at this Atari Breakout video game. The goal is to hit and dissolve all the multicolored bricks by bouncing the ball off the red barrier that one can move left and right. If the ball is not bounced back up, the agent dies. The speed with which the ball moves depends on the angle at which it hits the red barrier, similar to table hockey. So given a set of state-action pairs, such as A and B, which one has the highest Q value? This is what the agent is trying to learn in the simulation. Not moving the red barrier in state A has a high Q value. On the contrary, not moving the red barrier in state B will cause the agent to lose the game. This nicely explains short-term and long-term reward. Moving the red barrier to the right in B has a short-term reward as long as it catches the ball, but how far one moves to the right leads to different long-term rewards.

For deep reinforcement learning, the input is all possible pairs of state and action, and the agent learns the Q value for each possible state-action pair, as shown in the left frame. A more efficient learning approach is to predict the Q values for all possible actions given a specific state, as shown in the right frame. We define the Q loss function based on the frame on the right. It is the expected mean squared error between the predicted Q value of action a at time t and the target Q value of that action observed in the simulation. The squared error is minimized by backpropagating the gradient and adjusting the weights to optimize the prediction of the Q value. At every step, say at time t plus one, one updates the model with a better estimate of the Q value as one observes the effect of action a taken at time t in state S sub t plus one. The repeated estimation and recalibration of the Q value for every possible action given a state S continues until a predefined duration or final state is reached. In the ICU, the simulation ends when the patient dies or is discharged from the unit.

The downsides of Q value-based learning include its inability to model scenarios where the action space is rather large. Fortunately, this is not the case for most healthcare applications. In addition, Q value-based learning cannot handle continuous action spaces, as I alluded to earlier. This is where policy gradient methods come in. Policy gradient methods output a probability distribution over actions rather than some deterministic value. Rather than discrete Q values, the policy is expressed as a distribution: if you add the probabilities of all possible actions, the sum should be one. In the Atari example, value-based Q-learning outputs a discrete Q value for each action: moving left, moving right, or not moving. In policy gradient reinforcement learning, the model learns a distribution across a continuous action space. In the ICU, the agent can learn a distribution of the probability of survival across a range of fluid bolus volumes or across a range of norepinephrine dosages. Instead of having an output of discrete actions and their respective rewards, one gets an output of the mean and the variance of a distribution over actions.
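The loss just described, the squared error between the predicted Q value and the better estimate observed one step later, can be sketched without a neural network by using a lookup table in place of the network; the gradient step on the squared error then reduces to the familiar temporal-difference update. The states, actions, learning rate, and rewards below are illustrative assumptions, not values from the talk.

```python
# A minimal tabular sketch of the Q update: nudge the predicted Q(s, a) toward
# the target r + gamma * max_a' Q(s', a'), which is a gradient step on the
# squared error between prediction and target. All values are illustrative.
from collections import defaultdict

Q = defaultdict(float)            # Q[(state, action)] -> expected total future reward
ACTIONS = ["no_action", "fluid_bolus", "vasopressor"]
ALPHA, GAMMA = 0.1, 0.99          # learning rate and discount factor

def q_update(state: str, action: str, reward: float, next_state: str) -> None:
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    target = reward + GAMMA * best_next   # better estimate available at time t+1
    error = target - Q[(state, action)]   # the quantity whose square is minimized
    Q[(state, action)] += ALPHA * error   # move the prediction toward the target

# One hypothetical transition: hypotension, vasopressor started, patient stabilizes.
q_update("hypotensive", "vasopressor", reward=1.0, next_state="stabilizing")
print(Q[("hypotensive", "vasopressor")])  # the updated estimate after one step
```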
Let's turn to training an agent for a self-driving car. The agent is the car. The state is represented by all the measurements that are sensed by the car. The action is the maneuver of the steering wheel, and the reward is the distance traveled without crashing. The agent is a neural network in an environment that can take actions. The agent at every time point is defined by three parameters: the state, the action, and the expected reward. We start by initializing the agent with a set of actions for each time t and an expected reward for each action. The reward is updated at time t plus one based on the state S sub t plus one. Running a policy until termination means we are going to observe what happens as the agent takes an action at each time point. We record the state, the action, and the expected reward at every step. Any action that ultimately crashes the car is weighted less during backpropagation, while those that keep the car going are weighted more.

To recap, the agent is a neural network whose input is a state, in this case a representation of what is sensed by the car, and the output is the mean and the variance of the distribution of actions that one can take at that state. In this slide, we identify the time when the reward took a nosedive. In the healthcare setting, this can correspond to an action or inaction that ultimately led to a path where all the subsequent rewards are significantly lower. In the ICU, the goal is not to keep the patient in the ICU forever. We adjust the reward function so that ICU discharge alive is weighted more than staying alive for an extra day in the ICU. We can also establish that keeping a patient alive for as long as possible, only to die at the end, is not a good scenario.

This, I believe, is the first high-profile reinforcement learning paper using clinical data; it was published more than three years ago now. In this paper, we employed reinforcement learning to identify optimal policies to treat hypotension using 90-day survival as the reward function. In general, we found that clinicians tend to use more fluids and less vasopressor compared to what the algorithm suggests. The system is undergoing prospective evaluation at a number of hospitals in the UK. The following year, we published a paper, also in Nature Medicine, to guide clinicians on how to evaluate research manuscripts that leverage reinforcement learning to identify best interventions given a patient's state and a specific reward function.

To summarize, the objective of reinforcement learning is to train an agent to identify the action for each state that is associated with the highest reward. In Q-learning, the actions are discrete and the reward is deterministic. In policy gradient methods, the action space can be continuous and the output is represented by the mean and variance of a distribution. There are some basic assumptions in reinforcement learning that are worth reiterating. The first posits that the action taken by the agent is solely influenced by the observable features of the state. The second presupposes that the reward is solely a product of the action taken given a set of features. Neither assumption is necessarily true in the healthcare setting. The use of reinforcement learning in healthcare remains very attractive for a number of reasons. Ideally, we would conduct randomized controlled trials to determine the best action for every possible clinical state that arises after a series of states and actions. But this is obviously not feasible.
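Before moving on, here is a minimal sketch of the reward shaping described above for the ICU, under assumed weights of my own choosing: a small reward for each day alive, a large terminal reward for discharge alive, and a penalty for dying at the end, so that merely prolonging the ICU stay is not rewarded.

```python
# A minimal sketch of reward shaping for an ICU episode. All weights are
# illustrative assumptions, not values proposed in the talk or any paper.
from typing import List

def shaped_rewards(days_in_icu: int, discharged_alive: bool,
                   daily_alive_reward: float = 0.1,
                   discharge_reward: float = 10.0,
                   death_penalty: float = -10.0) -> List[float]:
    per_day = [daily_alive_reward] * days_in_icu      # small reward per day alive
    terminal = discharge_reward if discharged_alive else death_penalty
    return per_day + [terminal]                        # large terminal reward or penalty

# During training, episodes with a higher total reward up-weight the actions
# taken along the way, while episodes ending in death down-weight them.
print(sum(shaped_rewards(5, discharged_alive=True)))    # short stay, discharged alive
print(sum(shaped_rewards(30, discharged_alive=False)))  # long stay, died at the end
```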
In addition, rather than measuring an average treatment effect across a cohort that may not be representative of the population, as we do in a randomized controlled trial, we estimate a treatment effect for every action given a specific patient state.

But we need to be aware of the limitations of reinforcement learning in the healthcare domain. First, we don't observe every feature that defines a state. Two patients may have identical snapshots based on the physiologic signals that are captured but may still be distinct with respect to biomarkers that are not measured. Unlike a board game, we cannot see everything we need to see in order to determine the best move. Second, traditional reinforcement learning does not model the history of states and actions, although new approaches have used recurrent neural networks to create vector representations of the entire trajectory of a patient. Third, reinforcement learning does not have a causal framework. Did the action cause state S sub t to transition to state S sub t plus one? What would have happened to a patient in state S as a result of an action that was not observed? Unlike a self-driving car, we cannot simulate all possible steering wheel maneuvers and observe an effect. Fourth, the algorithms do not currently provide the level of uncertainty in their output. How confident is the computer in recommending no intervention for an episode of hypotension?

But the biggest challenge that we now face in applying reinforcement learning, and all artificial intelligence for that matter, in healthcare is the bias that exists in real-world data. Artificial intelligence is a codification of clinical practice and the medical knowledge system, as it is developed from their digital exhaust. It is but an encoding of a clinical practice that leads to outcome disparities and of a medical knowledge system derived from research that disproportionately represents a majoritized population. When we train a model to optimize an outcome from a set of features pertaining to the patient and the disease, we assume that treatment decisions are the same across similar patients, but they are not.

These are papers from our group demonstrating outcomes for sepsis that vary across hospitals after adjusting for illness severity and other confounders. Using audio-recorded outpatient encounters from urban primary care physicians, a study found that doctors spend less time and build less emotional rapport with obese patients compared with normal-weight patients. A 2015 survey of 28,000 trans individuals in the US revealed that a third of respondents had had a negative encounter with a healthcare provider, including being refused treatment. Black patients are less likely than white patients to receive pain medication for the same symptoms, a pattern of disparate treatment that holds even for children. And there's more to complicate matters. These papers shown here demonstrate how gender and race concordance between the provider and the patient affects outcomes. This paper from December of 2021 found that sex discordance between surgeon and patient was associated with a 7% to 9% increased likelihood of postoperative complications and death. Patient sex significantly modified this association, with worse outcomes for female patients.
For those in this exciting field of medical AI, we need to remind ourselves that data routinely collected in the process of care are heavily influenced by longstanding social, cultural, and institutional biases, as well as provider subjectivity in decision-making. How do we create AI when the ground truth is not fair? At present, a key evaluation metric for machine learning applications is accuracy. But just because an algorithm is accurate does not mean it should be implemented. If all that matters is accuracy, then algorithms developed using real-world data will encode the biases and prejudices that taint clinical decision-making. To prevent AI from encoding social and cultural biases, we would like to predict an outcome if the world were fair and the quality of care were the same across populations. As an analogy, consider bank loan algorithms. The ideal data set to build the algorithm has everyone receiving a loan and everyone having equal opportunities to repay that loan, but those data sets don't exist. The data sets used to develop models exclude clients who were declined a loan and were never given an opportunity to repay had that loan been granted: the counterfactual data set. In healthcare, one needs to understand how an algorithm makes a decision. This is more than just making sure that there is adequate representation of different populations in the data sets used to train and validate the models. That is a requirement, but it is not enough. We need algorithms that are better than humans, less prejudiced, and more fair.

Before you build reinforcement learning models to optimize treatment, you need to ask the following questions. Which patient populations have poorer outcomes not explained by biologic factors? What are the drivers of these poor outcomes? Are there inequities with access to care, or provider bias? How did the patients end up in the database? For example, are there patient groups who are systematically discriminated against in a transplant database? For an ICU database, are there patients who are more likely to die before they reach the ICU? Are some patients less likely to be admitted to the ICU and more likely to be watched on the floor? Sampling selection bias is a recipe for spurious associations that find their way into algorithms.

A word of advice: the problem in healthcare is not simply a machine learning problem. Artificial intelligence is not just about predicting or optimizing for the sake of prediction or optimization. The most important task is to augment our capacity to make decisions, and that requires understanding how those decisions are made. I don't have the solutions to the problems I present today. However, I am certain that solutions to problems in healthcare that are designed with perspectives that underrepresent most of the world are bound to maintain the status quo. We should not only invest in storage and compute technologies, federated learning platforms, GPTs, GRUs, and NFTs. Our goal should be to build capacity across populations and diversity of perspectives in research. This is the biggest investment we can make to prioritize equity.

On that note, we introduce a new school of thought: Village Mentoring and Hive Learning. Our group has created an interconnected meshwork of experts, not just across countries, but across various cultural contexts. The network has developed into a web of students and teachers whose goal is to learn together and from each other.
The problems we are trying to address, health disparities and a medical knowledge system that disproportionately represents a majoritized few, are the same problems we faced in the previous century. We cannot solve them with the same strategies that created them in the first place. The idea that a single group with a narrow range of skills can have an impact is both arrogant and ignorant. We need to break down the silos across disciplines and across populations, and leverage each other's expertise, experiences, and perspectives.

I would like to invite submissions to our new journal, PLOS Digital Health. I urge folks to engage with colleagues and institutions that disproportionately serve minority populations so that we view problems through different lenses and with a wider range of perspectives and lived experiences. Let's ask ourselves: are we truly diverse, or are we diverse in appearance only? We have to build cognitive diversity, defined as differences in perspective and information-processing styles that are not predicted by factors such as gender, ethnicity, or age.

I'm going to end this presentation with one of my favorite quotes: if you don't have a seat at the table, you are probably on the menu. It is time to build a much larger table and forget about the menu, forget about eating. Instead, let us learn together. Thank you, and enjoy the rest of the conference.
Video Summary
The video transcript is a presentation on reinforcement learning in the healthcare domain. The speaker, an intensivist, explains the concept of reinforcement learning and its applications in various fields like gaming, robotics, and self-driving cars. Reinforcement learning trains an algorithm to predict the best action given a certain state in a dynamic environment. However, the healthcare domain presents unique challenges, as outcomes are confounded by factors that may not be captured in electronic health records. The presentation also discusses different types of machine learning, such as supervised learning, unsupervised learning, and reinforcement learning. The speaker emphasizes the importance of understanding the limitations and biases in real-world data when developing AI models for healthcare. They also highlight the need for diversity and inclusion in research to address health disparities and create more equitable solutions. The presentation concludes with an invitation to submit to a new journal and a call to build cognitive diversity and learn together.
Asset Subtitle
Professional Development and Education, 2022
Asset Caption
Hear from past SCCM presidents as they share their experience and wisdom about critical care and SCCM.
Meta Tag
Content Type
Presentation
Knowledge Area
Professional Development and Education
Knowledge Level
Intermediate
Knowledge Level
Advanced
Membership Level
Select
Tag
Medical Education
Year
2022
Keywords
reinforcement learning
healthcare domain
applications
machine learning
limitations and biases
diversity and inclusion