>>We have Finale Doshi-Velez, she’s a professor at
Harvard University, and has done lots of really cool
work on probabilistic modeling, sequential decision-making,
[inaudible] machine-learning, and with applications to health care. I think she’s going to tell us
about some health care stuff. Cool.>>Awesome. I’m super excited to be here to share some of the
work that we’ve been doing in our lab related to health care applications
in reinforcement learning. So today’s talk is going to be about the problems we work on, which come from very practical problems that we see, and then we go down and try to improve our algorithms to deal with some of these practical problems. So I’ll be giving a range of things that we’re
working on in the lab. I also invite interruptions, you’ve been a very polite
and quiet crowd all day. If you have questions along the way that’s fantastic, I’d
love to hear them. We don’t have to get
through all the slides, we probably won’t get through
all the slides anyway, so it’s not your fault. All right. So what do we work on in my lab? We work on a couple of
different application areas. We work on depression management, we work on critical care
scenarios in the ICU, and we also work on HIV. A couple of other problems too, but these are the main applications that we focus on. Today, because it’s reinforcement learning (and I know bandits are also part of reinforcement learning), I’m going to be focusing specifically on problems that have
sequences of decisions. So that’s mostly happening in the critical care space as well
as in the HIV management space. So first of all, just what makes the
healthcare scenario interesting or
challenging to work with? Many of these things people are
probably already familiar with, but I’m just going to share a few. First of all, the way we typically work is that our data comes in batches. So we go to a hospital and say, “Can we have data to help you analyze x?” They give us a big dump of data related to x. We’re not really able to get online updates and all of that, and it takes a lot of work
to get even something like a very simple prototype
out into the real world. The other thing that’s really
important to keep in mind is that the database was designed probably for billing purposes; it wasn’t designed to make care better for patients or make lives easier for doctors. So lots of data may be missing, things are going to be confounded, and importantly, the intents behind actions are also missing. So it may seem like sometimes
a doctor is behaving randomly, but maybe there’s a
variable that they’re very intentionally trying to manage that’s just not recorded in any of the data that
we have available. So that’s the setting
that we’re dealing with. And finally, success is not always easy to quantify. We’re doing some reward learning work, and there was some really great discussion about rewards this morning, so I’m not going to
talk about that here. But learning rewards is hard
and doctors have a hard time specifying what success means in different scenarios. All right. But despite all these troubles, and we were just having a chat about this during the coffee break, I would argue that it’s
important to still try to do the best we can
with the data that we have. To ignore the fact that we have these large databases
would also be problematic. We want to try to squeeze
as much out as possible, and be aware of how
much we can actually do given the limitations that I pointed out. All right. So here’s the problem set-up, and I’m not really going to
spend any time on it. There’s an agent, there’s a world,
actions, observations, rewards. All right. You’ve been hearing
about this all day. Let’s move into how you
might tackle this problem. So there’s one class of
approaches that I would consider like model-based
or value-based approaches. In these approaches, we have some hidden state, for example the state of the patient, that’s evolving over time, and we try to fit a model of how that state might be evolving: what are the observations that we’re going to get, or what’s the state evolution as we take different drugs? So here’s an example, with an example reward function; we’ve trained the model, we solve it. So that’s one class of
approach that we could take. Another class of approach
that people had thought about is forget all of that
complicated RL stuff, just find another patient
that looks like you, and just copy what worked for them. This is also a very reasonable strategy; you can just think of this as a non-parametric estimator. Now it also seems to be
a reasonable choice, if you ask the doctors
what they’re doing, they’re often thinking that, “Oh, yeah, I saw someone similar. I want to try a similar action this time because it worked
for a patient before.” All right. But getting at the long-term effects this way is a little bit tricky. So one of the things that we did, a couple of years ago now at this point, was just to notice that these two ideas that I just told you about, this parametric approach, fit the model, fit the value function, versus this non-parametric just-find-a-clone-in-your-data-set approach, really have complementary strengths. That is, if you happen to be
lucky enough to have a clone, that’s probably the best
world that you can live in. But if you’re not lucky enough
to have a clone in the data set, it’s probably better to use a simplified model, perhaps of disease progression, than to have to go to a clone that’s not really that similar to you. It’s not good to match to
somebody that’s far away. So that was the intuition, then you can actually use
that in the following way. Suppose we just have a mixture: our clone-type policy suggests action A, our model-fitted policy suggests action B. We have some statistics about the patients and the data set, like how that patient relates to different patients in the data set. We can identify how close they are, for example, to similar patients, and then output an actual action to take.
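[As a rough illustration of that mixing intuition (not the lab’s actual code: `kernel_policy`, `model_policy`, and the hard distance threshold are assumptions of mine, and in the talk the switch is actually learned end-to-end rather than hand-thresholded), a minimal sketch might look like this:]

```python
import numpy as np

def mixture_action(patient_features, dataset_features,
                   kernel_policy, model_policy, distance_threshold=1.0):
    """Choose between a non-parametric 'find a clone' policy and a
    parametric model-based policy, based on distance to the nearest
    neighbor in the batch data. Illustrative sketch only."""
    # Distance from this patient to every patient in the batch dataset.
    dists = np.linalg.norm(dataset_features - patient_features, axis=1)
    nearest = np.argmin(dists)

    if dists[nearest] < distance_threshold:
        # A close-enough "clone" exists: copy what worked for them.
        return kernel_policy(patient_features, neighbor_index=nearest)
    # No clone nearby: trust the (simplified) fitted model instead.
    return model_policy(patient_features)
```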
So we applied this strategy. The first place we applied it was in this scenario of HIV management. So this was with a registry
of about 33,000 patients. The observations include
CD4s, viral loads, etc, and the goal is to keep those viral loads low over
a long period of time, five years in this particular case. One thing I want to
point out is that there were a lot of different
drug combinations, so the action space was fairly large. We did this sort of thing, and the idea that I just talked about is this mixture policy, that’s the second line from the top, that says: choose, depending on where you are, to either go with the neighbor or go with the model. DR is for Doubly Robust; I’ll get to that point in a few more moments. Obviously we weren’t checking on actual patients; we had to use an importance sampling based estimator to estimate the quality
of our proposed policy. In this Model Mixture Policy, we took this idea one step
further and said, “Well, really the key idea is parametric
versus non-parametric.” So instead of mixing at
the level of a policy, why don’t we mix at
the level of a model. If a patient is at a
particular state right now, then for the next time step, when we’re trying to predict what’s going to happen next, let’s just see, at that point in time, do you have a clone in the data set? If you do, expect your next set of observations to be the observations that match your clone. But if you’re not lucky enough to have a clone in the data set, just use the model. Using that model-based
approach which again is mixing between this parametric and
non-parametric regressor, results in slightly
better performance. Not really statistically significant, but looks like it could
be if we had more data. All right. So this is what we did. In fact, that intuition that I told you about turned out to be true. We didn’t build in a rule that said you have to be x close to use the parametric versus the non-parametric, but if we look at when these
different models are being used, we see that when the history
is short, we often match. The kernel-based approach is the find-a-clone approach, which makes sense; it’s easier to match to a short history. Whereas when the history is long, we tend to use the POMDP or
the Model-Based Approach. Analogously, we notice that the Kernel is used when the distance
to your neighbors is shorter, and the POMDP is used
when it’s longer. We didn’t tell the algorithm to do that; we did an end-to-end optimization, but it makes sense, which makes you feel good. So at this point we’re like, this is awesome. Let’s see what other problems we can solve with this approach. So we started looking at
sepsis management in the ICU. In this specific case, we are looking at hypotension management or
circulation management. So we wanted to make sure that the patient’s blood pressure
stays in a particular range, while the rest of
their body is fighting the actual infection
associated with sepsis. When we’re doing that, there are two major actions to take. You can either increase their blood pressure by giving them vasopressors, which constrict the blood vessels, or you can increase
their blood pressure by just increasing the
amount of fluid in the body, and both of them have pros and cons. So that was the action space
that we were looking at, and the goal was to reduce 30-day mortality. Because that’s a long way off, we used a reward function that had to do with the log odds of 30-day mortality.
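[As a rough sketch of what a log-odds-style reward could look like (an illustration under my own assumptions, e.g. that some risk model supplies a predicted probability of 30-day mortality; not necessarily the exact reward used in the study):]

```python
import numpy as np

def log_odds_reward(predicted_mortality_prob, eps=1e-6):
    """Reward shaped from the log odds of 30-day mortality: lower
    predicted risk of death gives a higher (less negative) reward.
    Illustrative only; assumes a risk model supplies the probability."""
    p = np.clip(predicted_mortality_prob, eps, 1 - eps)
    return -np.log(p / (1 - p))
```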
Then, when we actually instantiated this approach, we did it slightly differently than before, but the idea
remained the same. The idea here is that we still had the find-a-clone policy, that’s your kernel. Then instead of using a model, we had an LSTM to compress the history, and then we trained a DDQN on that, and we placed a little barrier to make sure it didn’t recommend crazy
actions, because as soon as you start using things like DDQNs you have no idea what’s going to come out.
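[A minimal sketch of the kind of architecture being described, assuming PyTorch; the dimensions, the simple Q head, and this particular “barrier” (masking out actions deemed unreasonable) are my own illustrative assumptions, and the DDQN training loop is omitted:]

```python
import torch
import torch.nn as nn

class RecurrentQNetwork(nn.Module):
    """Sketch of an LSTM history encoder feeding a Q head.
    Dimensions and architecture details are assumptions, not the paper's."""
    def __init__(self, obs_dim=40, hidden_dim=128, n_actions=25):
        super().__init__()
        self.encoder = nn.LSTM(obs_dim, hidden_dim, batch_first=True)
        self.q_head = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, n_actions))

    def forward(self, obs_history):
        # obs_history: (batch, time, obs_dim); use the last hidden state.
        _, (h, _) = self.encoder(obs_history)
        return self.q_head(h[-1])          # (batch, n_actions)

def constrained_greedy_action(q_values, allowed_mask):
    """'Barrier' on recommendations: only consider actions flagged as
    reasonable (e.g., close to doses actually seen for similar patients)."""
    q = q_values.masked_fill(~allowed_mask, float('-inf'))
    return q.argmax(dim=-1)
```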
But anyway, we did all of that, and you do the same thing again, the same rough process of optimizing that function to choose between these two very intuitive choices, to see what comes out of it. So this is what we get.
So what we find is that our approach which is the
columns off to the side. So the MOE columns, especially if you use the LSTM encoding which
is a recurrent encoding, are doing better in terms of this reward function
that we’ve created. We can compare to
the physician and we can compare to the
kernel and the DQN. Just to make this intuitive, we provide the following plot. The first plot and the second plot are the same thing for the two different actions. What I’m showing here is that the x-axis is the difference between what we recommend and what the doctor did. So zero means that the doctor did what we recommended; negative or positive means that the doctor did something different in one direction or the other. On the y-axis, you
see mortality rate. What you’re noticing is
that when the doctor matches our action the
mortality rate is lower. If it’s off to the side, the mortality rate is overall higher. I’m seeing some nodding heads; right, it kind of makes sense. So this is a point where you’re
feeling good about yourself. I showed you some awesome numbers for the HIV case, I’m showing you some more awesome numbers here, and more than awesome numbers, I have a plot which seems to make a lot of sense and makes you feel good, until we plotted this. This is what happens when you do no action or if you
do random actions, and the plots look about the same as it does when we applied
our optimal policy. Some of you might have seen this before, because I’ve given this talk, or versions of this talk, a couple of times. Some of you may just be much more sophisticated about reinforcement learning, so you didn’t fall for the initial part in the first place. But I point this out just to emphasize that this is tricky. Knowing when to trust an output of these machine learning algorithms is super hard. I still think that there’s
value to the policies we found, and I’m going to explain to you why for the remainder
basically of the talk. But with this sort of stuff it is very easy to get yourself misled. In this particular case, if you’re wondering why this happened: the way we discretized the actions was based on quantiles in the dataset. So for example, for the
vasopressor dosage, the initial actions are quite
close together like Action 1, Action 2, Action 3 in
terms of absolute dose. Whereas the further-out actions, really Action 5, the biggest action, is far away, but there’s not much action going on out there.
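[To make the quantile-discretization point concrete, here’s a tiny illustrative sketch with made-up dose data (not the study’s): quantile bins crowd together where most of the mass is, so the low-dose actions end up very close in absolute dose.]

```python
import numpy as np
import pandas as pd

# Hypothetical vasopressor doses: most are small, a few are large.
doses = pd.Series(np.random.exponential(scale=0.2, size=10_000))

# Quantile-based discretization into 5 action bins. The bin edges crowd
# together at low doses, while the top bin covers a wide range.
bins, edges = pd.qcut(doses, q=5, retbins=True, labels=False)
print(edges)
```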
So what was really happening is that this plot is really focused only on the middle. The middle patients are
all the healthy patients. The healthy patients are
going to survive anyway. I mean, they’re in the ICU, but the more healthy ICU patients, the ones who are more likely to survive, are the ones who pretty much end up here, because they’re in the middle of the plot. More details are in our arXiv paper on why these effects happen. All right, and that observation
really was just the start. You’ve heard about some
issues earlier today, but I’m just going to reiterate. When you apply these importance sampling based estimators, you often get quite high variances. As you see here, our policy is up here; it looks slightly better, but it’s not clearly better than the alternatives. You also end up with skewed datasets. What I’m trying to show here is: once you apply your importance sampling, what are the trajectories that are retained? If you think of importance sampling as really taking a set of samples that you keep and throwing away the samples that don’t match (roughly, it’s weighted, but that’s what’s going on), your effective sample size is quite small compared to the size of the dataset. This is on a log scale, so we’ve lost about two orders of magnitude. And worse than losing the two orders of magnitude, I would say, is that the sequence lengths in the dataset vary from 0 to 20 in four-hour bins, so 0 to 80 hours. The sequences retained when we do our estimation are quite short, because at some point the importance sampling weight goes to zero and we lose the rest of the trajectory. So we’re really looking at trajectories that are in the 0-to-20-hour range, and those are a very different population of patients than the overall population of patients.
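[For concreteness, the standard effective-sample-size diagnostic for importance weights looks like this; the numbers below are made up for illustration, not the study’s:]

```python
import numpy as np

def effective_sample_size(weights):
    """Effective-sample-size diagnostic for importance weights:
    (sum w)^2 / sum w^2. Equals N for uniform weights and collapses
    toward 1 when a few trajectories dominate."""
    w = np.asarray(weights, dtype=float)
    return (w.sum() ** 2) / (w ** 2).sum()

# Example: 3,000 trajectories, but most weights are (near) zero.
weights = np.zeros(3000)
weights[:30] = 1.0
print(effective_sample_size(weights))   # ~30, two orders of magnitude smaller
```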
So this is just the start. We could spend a long time talking about all the issues with off-policy evaluation and batch RL. Then again, just to
emphasize: these are the sorts of things that can go wrong when you just blindly turn the crank on these sorts of approaches, and I feel like it’s really important for us, as the experts in this area, to remember to educate the people that we work with about these sorts of problems, because these are not things that the doctors are going to notice. They’re going to be like, “Oh, this
is a shiny looking number.” They’re not necessarily going to notice, or even if they’ve been educated about, for example, sample sizes, they might say, “Oh, your test set had 3,000 patients,” and that sounds wonderful. But the effective sample
size was quite a bit lower. So just emphasizing those
things that we probably want to share with our colleagues when
we’re working on these problems. All right, so how can we increase
confidence in these results? So for the last several years, my lab has really been thinking about different ways to holistically think about how to, again, squeeze as much out of these data as possible; that was the original goal. So there is the RL. That’s the part we really just know and love: optimizing agents in sample-efficient ways, with all sorts of cool estimation techniques and function approximators and all of that. So that’s sitting inside
this dashed line. Then there’s the world outside all of that, of implicit stuff that goes on. So there’s defining the reward function, which I’m not going to really focus on, and the representation, which I’m also not going to focus on, coming in as the input. Then when it comes to validation, there’s statistical validation, which I’m calling policy evaluation. But there are also other things
that you might want to do. I’m just calling that checks-plus-plus: seeing whether we are doing reasonable things with our policy. I emphasize that because at some point you run out; the statistics can
only take you so far. If there’s a dataset of a certain size and there’s a policy
that you’re trying to check, the confidence intervals
are just going to overlap and there’s nothing
you can do about that. What do you do when you hit that
particular scenario in your data? All right, so let me talk a little bit now about off-policy evaluation. We’ve heard about some of these
things a little bit earlier today. I’m just going to share
highlights from a couple of things that we’re working
on in the lab toward this that are again
inspired by things that we believe would be helpful in the particular types of
applications that we’re working on. So off-policy evaluation,
this is the quick definition. Again, I think most people in the room are pretty
familiar with this. We’re collecting data
from the clinicians. We want to estimate the value of some proposed policy, like the things that I talked to you about before: what if we use this mixture-based model, which seems quite sensible, but how well will it perform? So that would be our Pi_e. There are three major approaches, kind of like how in RL we split between models, values, and policies. We have IS-type approaches, which are typically high-variance; we have model-based methods, which are typically high-bias; and value-based methods, which
often suffer from high bias. We’ve been doing work
on the first two at the moment and some ongoing
work on the third category. I’m just going to share a couple of stories from what
we’ve been working on. So when it comes to importance sampling, as I mentioned, the big issue is that at some point our evaluation policy doesn’t want to do the thing that the clinician actually did. So we have a sequence of decisions going along. At some point, especially
if Pi_e is deterministic, probability of the
clinician action given our state under our proposed policy is going to be zero and at that
point this weight goes to zero. Even if you’re using
a stepwise estimator, you still lose the entire
end of the trajectory. If you care about something that
happens further in the future, this is obviously a problem. So one way we can ameliorate this
issue that we’ve been looking at is what if we can stitch trajectories that we
were going to throw out? So here’s your desired sequence. So you have four states. In each of those states, you were hoping to take this red action that
takes you to the right. So that’s your desired sequence, but in real life, here’s your batch of data. You never actually did that. You have an example of data that went to the right
twice and then went down. You have an example that
goes through the right ones, goes down, and goes to
the right twice again. Importantly, the
states still match up. Like this button in this
little toy example is not changing your transition but it’s something else that’s
providing rewards. So in this particular case, you might think okay, I have to throw these out. Or here again if you’re doing a per decision,
you would say, “Well, I can only keep the
first two time steps of the first one and the first
time step of the second one.” But what if you could take the two that you are going to
effectively throw out, and stitch them together? Again in this toy example, I can do that because the
states align exactly. So we work this out in
just all the maths of it, and the tabular case. You end up with some
subtleties in how you now have to adjust and calculate
importance weights. So in the tabular case and we’re working on it for
the continuous case, they’ll show you just a flavor
of the example that we get. So actually before I do this, so what does this intuitively doing, and why do we like
this as an approach? So what this is
intuitively doing is it’s increasing the size of the data, the number of
trajectories that you can use for any downstream thing. So whether you’re going
to apply your WDR, your PDIS, whatever it is. Whatever IS based technique
that you’re going to apply, you can apply it with
this Rejigger dataset. So the nice thing is that, as new and fancy techniques
come out in that space, we’re not preventing you
from using any of those. This isn’t an estimation
method per se. It’s a method for increasing
the amount of data that you can use for your estimation method. So what we show on some simple
examples is that if you have estimation methods on
the bottom axis is Episodes. As you gather more Episodes, the error goes down. As the error is going down, the dash line show the benefit
of using our approach on top of standard IS, WIS, PDWDR. So in all of these cases, it’s a standard method
but we’re just increasing the amount of the size of
the data that you have available so that you get a
better estimate. All right. So that was a way of augmenting any kind of importance sampling
based approach with more data. Any trajectory based important sampling based
approach with more data. So that was one example
of work that we’ve done. Now I’ll tell you a little bit about some model-based
work that we’re also doing. All right. So here we are, and this one I think is super cool because we’re
using mixtures again. So that intuition
before about, “Oh hey, what if we had this parametric
thing where we jump from place to place or we build a model
and we switch between the two?” Now what if we apply that idea, but now for evaluation? Because if we think about it
in the evaluation setting, we have exactly the same question that we had in the
optimization setting. When we’re over here, our evaluation policy
says take action A. If you’re lucky enough
to have a copy of action A taken in that
state in your dataset, then why not use it? Take that tuple and continue to
build your trajectory off of that. If you’re not lucky enough, maybe you fall back on
a parametric model. So that’s intuition number one. Intuition number two
is the following. I might want to make sure
that as I’m doing this, I am trying to stay in areas
that are well modeled. So let’s suppose that I have a region that’s not
very well modeled. That’s the one in yellow, and I have a region
that’s well modeled, that’s the one in blue. If I have this scenario, and now I have two forms
of simulators where the two simulators
again could be that parametric and non-parametric model. That’s how we usually end up
instantiating it in our case, but really this is general. You can have K approximate models. It doesn’t have to be
parametric and non-parametric. Let’s say you have your models. Those are the lines. Here is what really
happens in real life. The agent if it executes the policy, it’s going to follow this green line across this particular region. You have two choices
of which model to use at the first time step to guess where that agent
is going to end up. So if you use the black model, you’re going to get something that’s closer to the green
dot which is great. It’s more accurate, and it’s more
accurate than the dash line. But by going into this yellow region, you know that you might do
poorly later on in terms of evaluating the effect
of different actions. Because we’re in a
model-based scenario, we’re going to simulate out, we’re going to do rollouts. What could happen is that
once you reach here, you just go off the rails because this is the
poorly modeled region. You’re trying to
estimate the quality of this policy that goes through
a poorly modeled region. So maybe the better thing to
do if you just care about the error is to take
a hit the first time, then stick to the well-modeled area, so that now you’re following the trajectory assuming that the first step actually took you
here, rather than over here. So this is its own little
RL problem. It’s just cute. Now we’re using Reinforcement
Learning to choose which model, so the action space is
which model to choose, and the reward function is the mean square error of
the estimate of the policy. So we’re looking at different forms of choosing between
models to do our rollouts. The different models have different [inaudible]
in different places. So more formally, how do we do that? I won’t spend too much time on this. We have a bound on the quality, so this is our error that
we’re trying to manage. It depends on the error from state estimation and
reward estimation. The key term that I’ll
point out is that you have this Lipschitz constant
of the transition, which is a key thing that
will cause things to blow up as you accrue the biggest errors. We can do that Lipschitz
constant estimation both in the parametric
and non-parametric case, by just simulating or looking for nearby examples.
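[One crude way to get an empirical handle on that Lipschitz constant, in the spirit of “looking for nearby examples”; this is my own illustrative sketch, not the paper’s estimator:]

```python
import numpy as np

def empirical_lipschitz(states, next_states, n_pairs=10_000, seed=0):
    """Crude estimate of a transition model's Lipschitz constant:
    the largest ratio of output distance to input distance over
    randomly sampled pairs of examples. Illustrative only."""
    rng = np.random.default_rng(seed)
    i = rng.integers(0, len(states), size=n_pairs)
    j = rng.integers(0, len(states), size=n_pairs)
    d_in = np.linalg.norm(states[i] - states[j], axis=1)
    d_out = np.linalg.norm(next_states[i] - next_states[j], axis=1)
    mask = d_in > 1e-8                      # avoid dividing by zero
    return (d_out[mask] / d_in[mask]).max()
```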
So again, there are some estimation approximations involved, but this thing is computable once you’ve made a couple of approximations. So we can compute
our reward function, and the action space is relatively simple because we’re just choosing between two different actions, the two models.
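[A highly simplified sketch of the per-step model-choice idea; `one_step_error` and `region_error` are hypothetical helpers, and this greedy-plus-penalty rule is a stand-in for the bound- and planning-based criterion described above, not the actual method:]

```python
import numpy as np

def choose_model_for_step(state, action, models, one_step_error, region_error,
                          lookahead_weight=1.0):
    """Pick which approximate simulator to advance the rollout with at this
    step. A purely greedy choice would minimize the immediate prediction
    error; the extra term is a crude penalty for landing in a poorly modeled
    region, standing in for the planning criterion in the talk."""
    scores, candidates = [], []
    for m in models:
        s_next = m(state, action)                     # predicted next state
        score = one_step_error(m, state, action) + \
                lookahead_weight * region_error(s_next)
        scores.append(score)
        candidates.append(s_next)
    best = int(np.argmin(scores))
    return best, candidates[best]
```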
So here is a toy example, but an actual computed toy example, of what happens when we do this. So in this case, there are two actions: there is one action that takes you straight across and there is one action that takes
you off diagonally. So there’s two different actions
and we ended up collecting data with only two trajectories, each doing just one action or the other. So there’s one trajectory that does this and another trajectory that goes up; that’s our data. Now, just for the
sake of illustration, let’s suppose that I fit
a parametric model that doesn’t know the difference
between the two actions. So I fit a model with all the data and I ask it to predict next state
given current state, and because of that, I’m going to conflate these two actions together. So my parametric model says, regardless of what action you do, you are always going to go up at an angle that is in between these two. Did that make sense? Any
questions? All right. So that’s the setup for
this little toy example. Now, here is what the
evaluation policy wants to do. It wants to go up one step, go across, and then go up again. I want to get something
close to that policy based on my parametric model and by being able to use the data, samples from the data, which is my non-parametric model. So what happens if I’m greedy? Well, if I’m greedy, I’m like, okay, the first thing that
our evaluation policy did was go diagonally up, so I’m going to just take a sample from my data because
my data does that. I’m going to copy that, and then you reach a point
where you’re like, okay, now my model
says go to this side, and I have the choice
between sticking with my data and I can also say my
parametric model does this. I choose my parametric model
because it has less error. At this point, if I’m over here, I’m sorry to the people on the other side, I wish there was a way to clone myself, I’ll walk back and forth. So if they are over here, this is closer. Every step of the way, it’s going to be better to choose the parametric model. So here’s a trajectory; this is going to be your guess, this green line is going to be your guess for the purple line. I’ll go to the other side to give the better example. So that’s things going wrong; is there a better solution? Well, a better solution would
be to say, you know what, I am going to use my parametric model first even though it accrues
a little bit of error. Because then after that,
I can use all of this data that roughly
follows the purple line, not exactly, but it stays close. Then once the purple
line starts going up, I go back to my parametric model
and this blue line over here, which was planning ahead for the
errors that it’s going to accrue, ends up looking closer to the purple
line than the green one does. So that’s the key idea
in this approach. So we’re taking that same intuition from the beginning of the talk, about how can we mix
clones versus models, but now really formalizing it in a way that we can use it
for off-policy evaluation. So that’s the intuition of why this might work and
how it might work, why planning is actually important
and it works on a real task. So what I’m going to do now
because I’m just at close time, 10 minutes, that’s a little
longer than I thought, I thought I had less than that. I’m going to quickly go through, we’ve done other work
on model-based stuff, if you’re curious, I’m happy to
chat about it at some other point. I want to talk a little bit
about validation and skip to a couple of things that we’re doing there that I’m
really excited about. So the off-policy evaluation, even with all the stuff that
we’re developing, again, it’s going to run into some
hard statistical limits. In our particular case, the doctors are able to observe stuff about the patients that we cannot observe, so we know that there are major missing confounders in the dataset. But at the same time,
there’s a lot of data that is in the dataset and the doctors are not capable of looking at all the
data about a patient, certainly not capable of
comparing all that data about a patient to all the other
patients in the database. So there is a value
that we can provide, but we also have to
be very aware that there’s going to be
some major limitations, and in many of the assumptions
that are required also for off-policy evaluation are not
necessarily going to be met. So even if we have a method that has nice guarantees if your state space is well-defined, etc., those guarantees are not going to hold in our real applications. So what can we do, really? How can we feel good
about policies that have reasonable numbers but we’re just not sure about all of our assumptions? So I’m going to go quickly
through some of this. So we do stuff like making sure that our results reproduce across sites, that’s important; we check sensitivities to all modeling assumptions if we’re using something that requires a covariate model, like WDR. We also talk a lot with the experts: we compare against standard of care, we ask clinicians, “Do these policies make sense?”, we gather annotations. But these are somewhat basic things that
one might want to do. The thing that I’m going
to spend the last couple of minutes doing a bit of a deep dive into is: when we’re asking the doctors, what’s the best way to actually ask them if something makes sense? So for the numbers sitting up here on this slide, we gave them examples. We gave them a patient, we gave them a recommendation, and we asked, does this recommendation make sense? We got an agree or disagree
or partially disagree, but that’s a very limited way of showing doctors what the agent will do. In particular, a piece that was missing is: how do we select those patients? How do we know, how can we feel good, that this patient selection, this group, was good enough? Because if the clinician said that this group looks great, would we feel good about going forward with this policy? Probably not. Depending on how we
chose them, maybe not. If they said it made no sense, then we would know we have to
go back to the drawing board, but a positive result doesn’t necessarily indicate that
we have a good policy. So there’s a question
of how to validate an entire policy given
only some samples. So that’s a very little detour and another area that
we’ve been working in, is how can we best communicate a treatment policy to
a clinical expert? So we formalize it in
the following setting. We’re going to present the expert
with some state action pairs. In this particular state, the agent does this. The expert now is going to be
given a new state, S prime. We ask them, what action is going to happen in this state, S prime? If they can guess it, then we can feel good. We can be like, okay, the expert has a good sense of how this agent is going to behave, and then we’re going to trust the expert to tell us whether the agent is reasonable. We can only do that if the expert can do a great job of simulating what the agent will do. So our goal, the thing that we have control over, is that training set, the teaching set, that we
give to the expert for them to be able to recover
back the entire policy. So here’s an example.
Here’s a Gridworld and the arrows that are filled in are the examples of
state action pairs, and then the task, for example, would be for those queries
that are circled in black, what are the actions that you think that the
agent is going to take? Not the optimal action, but what is the action
that the agent is going to take? Any guesses? People have any hypotheses
about what’s going on? Down in green. Okay. Anyone else? People are too tired. You think that it’s going to go right->>In the yellow spot.>>-in the yellow. I see. In general, can you summarize it, though? You have no idea. So there are two forms of things that people often do with these tasks, which have names that you know. One was: the policy was down in green. That’s like an imitation-type policy: I think that this often
happens in this state. The other thing could be, it looks like they’re
trying to go to blue. That’s an inverse
reinforcement learning. So we don’t know what the
people are actually doing. Are they doing IRL or
are they doing IL? The implications would
be different for this state action pairs
that we want to present. So in this case, many times, people end up doing more
inverse reinforcement learning. They think that there’s a goal
that people are trying to get to. Whereas if we give them examples
that look like the following. So in this particular case where this is different treatments
and different biomarkers, people are like, “I don’t have the foggiest idea of what’s going on.” What happens in this state? I just copy, I do this too. I look at that one, I try to find a copy of this over here. I’m an imitation learner for this particular type of dataset. What we showed is that it’s really important to know what people are going to do. Wait, first of all, quantitatively, we noticed a difference. So this is a Turk study. We asked people, after they did the experiment, to tell us what
their thought process was. Then we went through all of those
things and tried to code them as something that looks like IRL, or IL, or “can’t tell.” So the gray is where we couldn’t tell from their description exactly what they were doing. So in the HIV case, this one over here, people were just like, “We copied.” That was clearly imitation learning. In the GridWorld, more often, people were describing something that looked more like the agent is trying to get to a goal: I understand now. Those differences mattered in the following sense. So now, we can imagine that I create a summary assuming that you
are an imitation learner. Then I give you that example. Or I create a summary
assuming that you’re an inverse reinforcement
learning learner, and that’s a different form of
summary that I’m going to give you. How would you perform depending on which extraction method you assume, whether I should be using IRL or IL? So the dashed lines show how you would expect performance to be. The important thing to notice is that in the different domains there’s a swap; there’s not one clearly better method. So it’s important to match. We also noticed that in
the different domains. So if we match what people did (remember, people were mostly imitation learners in the HIV domain), if we create the sample knowing that they’re going to do imitation learning, they do better than if we create the sample assuming that they’re going to do IRL. So this is an important
piece of information. If we’re going to be
presenting examples, having the right model of the human cognition, of how they’re going to extrapolate from the examples, is very important. This is obvious [inaudible] but we did the actual study to show that this is important. When we think about how to present examples to the doctors, we want to understand what these mental models look like. So what I’m going to do is
I’m just going to finish up. One other thing that we
have that we’re working on is if things are
statistically indistinguishable, we want to offer options. So if we have a couple
of different policies, we want to be able
to say that some of the policies are clearly
bad like the ones in gray. The other ones might be reasonable, and we’ve applied this again to the hypotension management
task in the ICU. I won’t go into detail on that because I want to save
time for a couple of questions. So going forward. So RL in the health space I think
is a really interesting area. I was working on RL applied to nothing during my PhD, and now that I’m working on these health applications, I feel like I’m working on very different questions, but they’re all very satisfying. There are all sorts of really interesting technical
questions that come up. When you’re working
with a real domain, you care a lot about how
things are going to work. In terms of things to keep in mind, let’s think holistically about how
RL performs in the real world, and keep in mind that usually it’s a human agent system
that’s making the choices. So in all of our work, the way we think about it is that
our agent is going to be providing information
and recommendations to a human who has
additional information. They’re going to combine those
together to make the final choice. So it’s important to think about how the system is going to
behave in the real context. Also, we have to be
careful with the analyses. Like that first example I showed you, where you can just run a bunch of numbers, things can look totally peachy, and then you’re like, “Wait. The numbers mean nothing.” That happens, and we have to
make sure we are aware of that, that we take care of that. But let’s not also make that
such that we don’t think about messy problems that have
a lot of potential impact. Just two quotes that I really like going back to
comic books actually. So we have: With great power comes great responsibility.
We have to be careful. But at the same time, perfect
is the enemy of the good. If we stick with the status quo, then that’s also a problem when there’s better solutions
that are potentially out there. So I’ll leave it at that and
take any questions. Thank you.>>So we have time for
a couple questions.>>So I have one question.>>Over there.>>It happened again.>>Hey, Hailey. So I guess I’m curious about
the implications of the work you have on people wanting
to do imitation learning. I’m curious about given that that’s
what people want to do, what->>Well, it’s scenario-dependent
and that’s what we found. This is an entirely a Turk study. It wasn’t with doctors. So my hypothesis is that if
you can understand the domain, you will probably use IRL. Then you’re like, “You’re doing that because you’re trying to get to this goal.” Whereas if you’re not super
familiar with the domain, you will probably just imitate. Somebody says, “Step to the left,” and I’m just going to step to the left, like, I have no idea why somebody said that or why somebody did that. So if we give some of these
examples to the doctors, like the HIV example, they may actually be doing
IRL and that’s something. Basically, the important
part of this study was that if we make the wrong assumption, we actually do get worse quality. Because it could have been that maybe the assumptions don’t matter. People are just somehow robust to
whatever examples you give them, which could have come out.>>So a related question. Going forward, how would
you think about doing this for a new task with doctors?>>So one thing
actually we discovered after this study is that we’re still really interested in extensions of this and how to do this with doctors. But one thing that came out of
more discussion as we’ve been talking to the clinicians is that even though they often
think in terms of clones, they think in terms of data points. I met this other patient five
years ago and this thing happened. When they communicate
with each other, they still have to communicate
in a language of features. Does that make sense? If I tell you about this patient from five years ago, it’s not in your head. I have to say something like, “The patient I saw five years ago who was over 70 years old and had a broken hip,” and then I’m describing this patient to you, but I’m describing it to you in terms of features. So seeing how that
communication works, we think that that might be actually the key to
describing a policy. It’s like looking at the features of the language of the communication that they’re already using and then putting the policy into those terms. So that’s where our
current work is headed.>>So just following
up on the parametric, non-parametric split
that you talked about. There might be some
ways to bridge it. So local learning is the one that comes to mind, where you train a parametric model on a [inaudible] neighbors weighted appropriately. Some of the interpretability methods, like LIME in particular, look at that. So I wonder, do you really
want to train separate models?>>Or could you just build some
model that works the best? People use all kinds of ensembling. So this was our initial intuition. My guess is that there’s always going to be some cases
where you want to do some ensembling and we come up
with intelligent ways to do that. From an interpretability perspective, I see that as different. I’m not sure. So LIME does these local perturbations; I think it’s really important to understand the delta of the perturbation. That’s maybe a longer discussion of how to do interpretability for policies, as there is for classifiers.>>Let’s take that offline. One more person.>>Let’s give [inaudible] another
round of applause. So that’s it. I just want to wrap up with a couple of thank you’s and shout outs. So first of all, thanks to all the speakers. This event would not have been
interesting without them. So let’s give them another
round of applause. Second of all, thanks to
all of our support staff. We have some people from Outreach. Jane Bianche is sitting outside, Louis Stevenson is in Redmond, and Jessica is in Montreal. They really made all this very
seamless for myself and Hall. Thanks to our videographer, our social media people, our photographer. I don’t
know where she went. Let’s give to them another
round of applause. Again, this is making the event. Then, I want to thank Hall, my co-conspirator, in helping
me put this together. It was very fun and I’ll try
to do it again next year. Lastly, thanks to all
of you for coming. This is part of what makes
the event really fun to see everyone and I hope you
all had a great time. Anything else? Thanks. That’s it.
