>>We have Finale Doshi-Velez; she’s a professor at

Harvard University, and has done lots of really cool

work on probabilistic modeling, sequential decision-making,

[inaudible] machine-learning, and with applications to health care. I think she’s going to tell us

about some health care stuff. Cool.>>Awesome. I’m super excited to be here to share some of the

work that we’ve been doing in our lab related to health care applications

in reinforcement learning. So today’s talk is going to cover the problems we work on, which come from very practical problems that we see, and then we go down

and we try to improve our algorithms to deal with some

of these practical problems. So I’ll be giving a range of things that we’re

working on in the lab. I also invite interruptions, you’ve been a very polite

and quiet crowd all day. If you have questions along the way that’s fantastic, I’d

love to hear them. We don’t have to get

through all the slides, we probably won’t get through

all the slides anyway, so it’s not your fault. All right. So what do we work on in my lab? We work on a couple of

different application areas. We work on depression management, we work on critical care

scenarios in the ICU, and we also work on HIV. There are a couple of other problems, but these are the main applications that we focus on. Today, because this is about reinforcement learning, and I know bandits are also part of reinforcement learning, I’m going to be focusing specifically on problems that have

sequences of decisions. So that’s mostly happening in the critical care space as well

as in the HIV management space. So first of all, just what makes the

healthcare scenario interesting or

challenging to work with? Many of these things people are

probably already familiar with, but I’m just going to share a few. First of all, the way we typically work is that our data comes in batches. So we go to a hospital and we say, “Can we have data to

help you analyze x?” They give us a big dump

of data related to x. We’re not really able to get online updates and so on, and it takes a lot of work

to get even something like a very simple prototype

out into the real world. The other thing that’s really

important to keep in mind is that the database was designed probably for billing purposes; it wasn’t designed to make care better for patients or make lives easier for doctors. So lots of documentation may be missing, things are going to be confounded, and also, importantly, the intents behind actions are missing. So it may seem like sometimes

a doctor is behaving randomly, but maybe there’s a

variable that they’re very intentionally trying to manage that’s just not recorded in any of the data that

we have available. So that’s the setting

that we’re dealing with. I mean finally success is

not always easy to quantify. We’re doing some

reward learning work, there were some really

great discussion about rewards this morning, so I’m not going to

talk about that here. But learning rewards is hard

and doctors have a hard time specifying what does success mean in different scenarios. All right. But this tie all these

troubles and we were just having a chat

during the coffee break. I would argue that it’s

important to still try to do the best we can

with the data that we have. To ignore the fact that we have these large databases

would also be problematic. We want to try to squeeze

as much out as possible, and be aware of how

much we can actually do given the limitations that I pointed out. All right. So here’s the problem set-up, and I’m not really going to

spend any time on it. There’s an agent, there’s a world,

actions, observations, rewards. All right. You’ve been hearing

about this all day. Let’s move into how you

might tackle this problem. So there’s one class of

approaches that I would consider like model-based

or value-based approaches. In these approaches, we have some hidden state, for example of the patient, that’s evolving over time, and we try to fit a model of how that state might be evolving: what are the observations that we’re going to get, or what’s the state evolution as we take different drugs? So here’s an example: given a reward function and a trained model, we solve it. So that’s one class of

approach that we could take. Another class of approach that people have thought about is to forget all of that

complicated RL stuff, just find another patient

that looks like you, and just copy what worked for them. This is also a very

reasonable strategy you can just think of this as a

non-parametric estimator. Now it seems to also be

a reasonable choice, if you ask the doctors

what they’re doing, they’re often thinking that, “Oh, yeah, I saw someone similar. I want to try a similar action this time because it worked

for a patient before.” All right. But getting the long-term effects of this right is a little bit tricky. So one of the things that we did, a couple of years ago now at this point, was just to notice that these two ideas that I

just told you about, this parametric

approach, fit the model, fit the value function, versus this non-parametric

just find a clone in your data set approach, really have complementary strengths. That is, if you happen to be

lucky enough to have a clone, that’s probably the best

world that you can live in. But if you’re not lucky enough

to have a clone in the data set, it’s probably better to use a simplified model, perhaps of disease progression, than to match to a clone that’s not really

that similar to you. It’s not good to match to

somebody that’s far away. So that was the intuition, then you can actually use

that in the following way. You can say that now suppose

we just have a mixture. Our clone type policy

suggests do action A, our model-fitted policy suggests do action B. We have some statistics about the patient and the data set, like how that patient relates to different patients in the data set. We can identify how close they are, for example to similar patients, and then output some

actual action to do. So we applied this strategy. The first place we

applied it was into this scenario of HIV management. So this was with a registry

of about 33,000 patients. The observations include

CD4s, viral loads, etc, and the goal is to keep those viral loads low over

a long period of time, five years in this particular case. One thing I want to

point out is that there were a lot of different

drug combinations. So the action space was fairly large. We did this sort of thing, and the idea that I just

talked about is this policy, mixture policy, that’s the

second line from the top, that says choose depending

on where you are, either go with neighbor

or go with the model. DR is for Doubly Robust; I’ll get to that in a few more moments. Obviously we weren’t testing on actual patients; we had to use an importance sampling based estimator to estimate the quality

of our proposed policy. With the Model Mixture Policy, we took this idea one step

further and said, “Well, really the key idea is parametric

versus non-parametric.” So instead of mixing at

the level of a policy, why don’t we mix at

the level of a model. If a patient is at a

particular state right now, for the next time step when we’re trying to predict what’s

going to happen next, let’s just see at that point in time: do you have a clone in the data set? If you do, we expect our next set of observations to be the observations that

would match your clone. But if you’re not

lucky enough to have a clone in the data set,

just use the model. Using that model-based

approach which again is mixing between this parametric and

non-parametric regressor, results in slightly

better performance. Not really statistically significant, but looks like it could

be if we had more data. All right. So this is what we did. In fact that intuition that I told you about

turned out to be true. So we didn’t build in a rule that said you have to be x close to use the parametric

versus the nonparametric, but if we look at when these

different models are being used, we see that when the history is short, we often use the kernel-based, find-a-clone, approach, which makes sense; it’s easier

to match to a short history. Whereas, when the history is long, we tend to use the POMDP or

the Model-Based Approach. Analogously, we notice that the Kernel is used when the distance

to your neighbors is shorter, and the POMDP is used

when it’s longer. We didn’t tell the algorithm to do that; we just did an end-to-end optimization, but the result makes sense, which

makes you feel good. So at this point we’re like, this is awesome. Let’s see what other problems we

can solve with this approach. So we started looking at

sepsis management in the ICU. In this specific case, we are looking at hypotension management or

circulation management. So we wanted to make sure that the patient’s blood pressure

stays in a particular range, while the rest of

their body is fighting the actual infection

associated with sepsis. When we’re doing that, there are two major actions to take. You can either increase

their blood pressure by giving them vasopressors, which constrict the blood vessels, or you can increase

their blood pressure by just increasing the

amount of fluid in the body, and both of them have pros and cons. So that was the action space

that we were looking at, and the goal was to reduce 30-day mortality. Because that’s a long way off, we used a reward function that had to do with the log odds of

30-day mortality. Then when we actually

instantiated this approach, we did it slightly differently than before but the idea

remained the same. The idea here is that we still had the find-a-clone policy, that’s your kernel. Then instead of using a model, we had an LSTM to

compress the history, and then we trained a DDQN on that, and we placed a little barrier to make sure it didn’t recommend crazy actions, because as soon as you start using things like DDQNs, you have no idea what’s going to come out. But anyway, we did all of that

and we did the same thing again, the same rough process of optimizing that function to choose between these two very intuitive choices, to see what would come out of it. So this is what we get.
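(To make the mixing idea concrete, here is a minimal Python sketch of this kind of mixture-of-experts gating. The 1-nearest-neighbor clone lookup and the fixed distance threshold are illustrative assumptions for this sketch, not the gating that was actually learned end-to-end in the work described.)

```python
import numpy as np

def kernel_policy(state, batch_states, batch_actions):
    """Non-parametric 'find a clone' expert: copy the action taken
    at the most similar recorded state (1-nearest-neighbor here)."""
    dists = np.linalg.norm(batch_states - state, axis=1)
    i = np.argmin(dists)
    return batch_actions[i], dists[i]

def mixture_policy(state, batch_states, batch_actions, model_policy,
                   dist_threshold=0.5):
    """Illustrative mixture: trust the clone when a close neighbor
    exists in the batch, otherwise fall back on the parametric model
    (e.g., the greedy action of a fitted value function)."""
    clone_action, dist = kernel_policy(state, batch_states, batch_actions)
    if dist <= dist_threshold:   # a 'clone' exists in the data
        return clone_action
    return model_policy(state)   # parametric fallback
```

(In the instantiation described in the talk, the non-parametric expert was a kernel over histories, the parametric expert a DDQN over LSTM-encoded histories, and the choice between them was optimized rather than set by a hand-picked threshold.)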

So what we find is that our approach which is the

columns off to the side. So the MOE columns, especially if you use the LSTM encoding which

is a recurrent encoding, are doing better in terms of this reward function

that we’ve created. We can compare to

the physician and we can compare to the

kernel and the DQN. Just to make this intuitive, we provide the following plot. The first plot and the

second plot are the same as for the two

different actions. What I’m showing here

is that if you take the action that the x-axis is the difference between what we recommend and what the doctor did. So zero means that the doctor

did what we recommended. Negative means that the doctor

did something different, positive means that the doctor

did something different. On the y-axis, you

see mortality rate. What you’re noticing is

that when the doctor matches our action the

mortality rate is lower. If it’s off to the side, the mortality rate is overall higher. I’m seeing some nodding heads

right kind of makes sense. So this is a point where you’re

feeling good about yourself. I showed you some awesome numbers for the HIV case, and here I’m showing you some more awesome numbers, and more than awesome numbers, I have a plot which seems to make a lot of sense and makes you feel good, until we plotted this. This is what happens when you do no action or if you do random actions, and the plots look about the same as they do when we applied our optimal policy. Some of you might have seen this before because I’ve given this talk, or versions of it, a couple of times. Some of you may just be much more sophisticated about

reinforcement learning, so you didn’t fall for the initial part in the first place. But I point this out just to

emphasize that this is tricky. Knowing when to trust the output of these machine learning algorithms is super hard. I still think that there’s

value to the policies we found, and I’m going to explain to you why for the remainder

basically of the talk. But with this sort of stuff, it’s very easy to get yourself misled. In this particular case, if you’re wondering why this happened, the way we discretized the actions was based on quantiles

in the dataset. So for example, for the

vasopressor dosage, the initial actions are quite

close together like Action 1, Action 2, Action 3 in

terms of absolute dose. Whereas the furthest-out action, Action 5, the biggest action, is a ways away, but there’s not much data going on there. So what was really happening is that this plot is really focused

only on the middle. The middle patients are

all the healthy patients. The healthy patients are

going to survive anyway. I mean, they’re in the ICU, but the healthier ICU patients, the ones who are more likely to survive, are the ones who are pretty much ending up here, in the middle of the plot. More details we have

in our arXiv paper on why these effects happen. All right, and that observation

really was just the start. You’ve heard about some

issues earlier today, but I’m just going to reiterate. When you apply these important

sampling based estimators, you often get quite high variances. As you see here, our policy is up here. It doesn’t look particularly, it looks slightly better

but it’s not clearly better than the alternatives. You also end up with skewed datasets. So what’s going on here

that I’m trying to show is that if you

look at the weighted, that once you apply your

important sampling, what are the trajectories

that are retained? If you think of important sampling

is really taking a set of samples that you keep and then throwing away the samples

that don’t match, roughly it’s weighted but

that’s what’s going on. Your effective sample size is quite small compared to

the size of the dataset. This is in log scale, so we’ve lost like two orders of magnitude. And worse than losing the two orders of magnitude, I would say, is that the sequence lengths in the dataset vary from 0 to 20 in four-hour bins, so 0 to 80 hours. The sequences retained when we do our estimation are quite short, because at some point the importance sampling weight goes to zero and we lose the trajectory. So we’re really looking

at trajectories that are in the 0-to-20-hour range, and those are a very different population of patients than the overall population of patients. So this is just the start. We could spend a long

time talking about all the issues with off-policy

evaluation and batch stuff. Then again, just to emphasize, these are the sorts of things that can go wrong when you just blindly turn the crank on these sorts of approaches, and I feel like it’s really important for us, as the experts in this area, to remember to educate the people that we work with about these sorts of problems, because these are not things that the doctors are going to notice. They’re going to be like, “Oh, this

is a shiny looking number.” They’re not necessarily going to be aware; even if they’ve been educated about, for example, sample sizes, they might say, “Oh, your test set had 3,000 patients, that sounds wonderful.” But the effective sample

size was quite a bit lower. So just emphasizing those

things that we probably want to share with our colleagues when

we’re working on these problems. All right, so how can we increase confidence in these results? So for the last several years my lab has really been thinking about different ways to holistically think about how to, again, squeeze as much out

of these data as possible. That was the original goal. So there is the RL. That’s the part that we really just know and love, and it’s optimizing agents in sample-efficient ways with all sorts of cool estimation techniques and function approximators

and all of that. So that’s sitting inside

this dashed line. Then there’s the world outside all of that of implicit

stuff that goes on. So there’s defining the reward function, which I’m not going to really focus on, and the representation, which I’m also not going to focus on, coming in as the input. Then when it comes to validation, there’s statistical validation, which I’m calling policy evaluation. But there’s also other things

that you might want to do. I’m just calling that “checks plus plus”: are we doing reasonable things with our policy? I emphasize that because at some point you run out; the statistics can

only take you so far. If there’s a dataset of a certain size and there’s a policy

that you’re trying to check, the confidence intervals

are just going to overlap and there’s nothing

you can do about that. What do you do when you hit that

particular scenario in your data? All right, so let me talk a little bit now about off-policy evaluation. We’ve heard about some of these

things a little bit earlier today. I’m just going to share

highlights from a couple of things that we’re working

on in the lab toward this that are again

inspired by things that we believe would be helpful in the particular types of

applications that we’re working on. So off-policy evaluation,

this is the quick definition. Again, I think most people in the room are pretty

familiar with this. We’re collecting data

from the clinicians. We want to estimate the value of some proposed policy, like the things that I talked to you about before: what if we use this mixture-based model? That seems quite sensible, but how well will it perform? So that would be our Pi_e. There are three major approaches

just like how in RL we split between models, values, and policies. We have IS-type approaches, which are typically high-variance; we have model-based methods, which are typically high-bias; and value-based methods, which also often suffer from high bias. We’ve been doing work

on the first two at the moment and some ongoing

work on the third category. I’m just going to share a couple of stories from what

we’ve been working on. So when it comes to importance sampling, as I mentioned, the big issue is that at some point, let’s say, our evaluation

policy doesn’t want to do the thing that the

clinician actually did. So we have a sequence of decisions that are going along in the sequence. At some point, especially

if Pi_e is deterministic, probability of the

clinician action given our state under our proposed policy is going to be zero and at that

point this weight goes to zero. Even if you’re using

a stepwise estimator, you still lose the entire

end of the trajectory. If you care about something that

happens further in the future, this is obviously a problem. So one way we can ameliorate this

issue that we’ve been looking at is what if we can stitch trajectories that we

were going to throw out? So here’s your desired sequence. So you have four states. In each of those states, you were hoping to take this red action that

takes you to the right. So that’s your desired sequence, but in real life, here’s your batch of data. You never actually did that. You have an example of data that went to the right

twice and then went down. You have an example that goes to the right once, goes down, and then goes to the right twice. Importantly, the

states still match up. Like, the bottom path in this little toy example is not changing your transitions, but it’s something else that’s providing rewards. So in this particular case, you might think, okay, I have to throw these out. Or again, if you’re doing per-decision,

you would say, “Well, I can only keep the

first two time steps of the first one and the first

time step of the second one.” But what if you could take the two that you are going to

effectively throw out, and stitch them together? Again in this toy example, I can do that because the

states align exactly. So we worked out all the maths of it in the tabular case. You end up with some subtleties in how you now have to adjust and calculate the importance weights. So, in the tabular case, and we’re working on it for the continuous case, I’ll show you just a flavor of the results that we get. Actually, before I do this: what is this intuitively doing, and why do we like

this as an approach? So what this is

intuitively doing is it’s increasing the size of the data, the number of

trajectories that you can use for any downstream thing. So whether you’re going

to apply your WDR, your PDIS, whatever it is. Whatever IS based technique

that you’re going to apply, you can apply it with this rejiggered dataset. So the nice thing is that, as new and fancy techniques

come out in that space, we’re not preventing you

from using any of those. This isn’t an estimation

method per se. It’s a method for increasing

the amount of data that you can use for your estimation method. So what we show on some simple

examples is that if you have estimation methods on

the bottom axis is Episodes. As you gather more Episodes, the error goes down. As the error is going down, the dash line show the benefit

of using our approach on top of standard IS, WIS, PDWDR. So in all of these cases, it’s a standard method

but we’re just increasing the amount of the size of

the data that you have available so that you get a

better estimate. All right. So that was a way of augmenting any kind of importance sampling

based approach with more data, really any trajectory-based importance sampling approach. So that was one example

of work that we’ve done. Now I’ll tell you a little bit about some model-based

work that we’re also doing. All right. So here we are, and this one I think is super cool because we’re

using mixtures again. So that intuition

before about, “Oh hey, what if we had this parametric

thing where we jump from place to place or we build a model

and we switch between the two?” Now what if we apply that idea, but now for evaluation? Because if we think about it

in the evaluation setting, we have exactly the same question that we had in the

optimization setting. When we’re over here, our evaluation policy

says take action A. If you’re lucky enough

to have a copy of action A taken in that

state in your dataset, then why not use it? Take that tuple and continue to

build your trajectory off of that. If you’re not lucky enough, maybe you fall back on

a parametric model. So that’s intuition number one. Intuition number two

is the following. I might want to make sure

that as I’m doing this, I am trying to stay in areas

that are well modeled. So let’s suppose that I have a region that’s not

very well modeled. That’s the one in yellow, and I have a region

that’s well modeled, that’s the one in blue. If I have this scenario, and now I have two forms

of simulators where the two simulators

again could be that parametric and non-parametric model. That’s how we usually end up

instantiating it in our case, but really this is general. You can have K approximate models. It doesn’t have to be

parametric and non-parametric. Let’s say you have your models. Those are the lines. Here is what really

happens in real life. The agent, if it executes the policy, is going to follow this green line across this particular region. You have two choices

of which model to use at the first time step to guess where that agent

is going to end up. So if you use the black model, you’re going to get something that’s closer to the green

dot, which is great. It’s more accurate than the dashed line. But by going into this yellow region, you know that you might do

poorly later on in terms of evaluating the effect

of different actions. Because we’re in a

model-based scenario, we’re going to simulate out, we’re going to do rollouts. What could happen is that

once you reach here, you just go off the rails because this is the

poorly modeled region. You’re trying to

estimate the quality of this policy that goes through

a poorly modeled region. So maybe the better thing to

do if you just care about the error is to take

a hit the first time, then stick to the well-modeled area, so that now you’re following the trajectory assuming that the first step actually took it here, rather than over here. So this is its own little

RL problem. It’s just cute. Now we’re using Reinforcement

Learning to choose which model, so the action space is

which model to choose, and the reward function is the mean square error of

the estimate of the policy. So we’re looking at different ways of choosing between models to do our rollouts. The different models have different [inaudible]

in different places. So more formally, how do we do that? I won’t spend too much time on this. We have a bound on the quality, so this is our error that

we’re trying to manage. It depends on the error from state estimation and reward estimation. The key term that I’ll

point out is that you have this Lipschitz constant

of the transition, which is a key thing that

will cause things to blow up as you accrue the biggest errors. We can do that Lipschitz

constant estimation both in the parametric

and non-parametric case by just simulating or

looking for nearby examples. So again, there are some estimations and approximations involved, but this thing is computable once you’ve made a couple of approximations. So we can compute

our reward function, the action space is

relatively simple because we’re just choosing between two different models. So here is a toy example, but an actual computed toy example of what happens when we do this. So in this case, there are two actions; there is one action that takes

you straight across and there is one action that takes

you off diagonally. So there are two different actions, and we ended up collecting only two trajectories of data, each doing just one action or the other. So there’s one trajectory

that does this and another trajectory that goes

up, so that’s our data. Now, just for the

sake of illustration, let’s suppose that I fit

a parametric model that doesn’t know the difference

between the two actions. So I fit a model with all the data and I ask it to predict next state

given current state, and because of that, I’m going to conflate these two actions together. So my parametric model says, regardless of what action you do, you are always going to go up at an angle that is in between these two. Did that make sense? Any

questions? All right. So that’s the setup for

this little toy example. Now, here is what the

evaluation policy wants to do. It wants to go up one step, go across, and then go up again. I want to get something

close to that policy based on my parametric model and by being able to use the data, right, samples from the data, which is my non-parametric model. So what happens if I’m greedy? Well, if I’m greedy, I’m like, okay, the first thing that

our evaluation policy did was go diagonally up, so I’m going to just take a sample from my data because

my data does that. I’m going to copy that, and then you reach a point

where you’re like, okay, now my policy says go to this side, and I have the choice between sticking with my data, or I can say my

parametric model does this. I choose my parametric model

because it has less error. At this point, if I’m over here, I’m sorry to the people

on the other side, I wish there was a way to clone

myself, I’ll walk back and forth. So they are over

here, this is closer. Every step of the way, it’s going to be better to

choose the parametric model. So here’s a trajectory; this green line is going to be your guess for the purple line. I’ll go to the other side

to give the better example. So that’s something going wrong; is there a better solution? Well, a better solution would

be to say, you know what, I am going to use my parametric model first even though it accrues

a little bit of error. Because then after that,

I can use all of this data that roughly

follows the purple line, not exactly, but it stays close. Then once the purple

line starts going up, I go back to my parametric model

and this blue line over here, which was planning ahead for the

errors that it’s going to accrue, ends up looking closer to the purple

line than the green one does. So that’s the key idea

in this approach. So we’re taking that same intuition from the beginning of the talk, about how can we mix

clones versus models, but now really formalizing it in a way that we can use it

for off-policy evaluation. So that’s the intuition of why this might work and

how it might work, why planning is actually important

and it works on a real task. So here’s what I’m going to do now, because I’m close to time. Ten minutes? That’s a little longer than I thought; I thought I had less than that. I’m going to quickly go through this; we’ve done other work

on model-based stuff, if you’re curious, I’m happy to

chat about it at some other point. I want to talk a little bit

about validation and skip to a couple of things that we’re doing there that I’m

really excited about. So the off-policy evaluation, even with all the stuff that

we’re developing, again, it’s going to run into some

hard statistical limits. In our particular case, the doctors are being

able to observe stuff about the patients that

we cannot observe, so we know that there’s major

missing confounders in the dataset. But at the same time,

there’s a lot of data that is in the dataset and the doctors are not capable of looking at all the

data about a patient, certainly not capable of

comparing all that data about a patient to all the other

patients in the database. So there is a value

that we can provide, but we also have to

be very aware that there’s going to be

some major limitations, and many of the assumptions that are required for off-policy evaluation are not necessarily going to be met. So even if we have a method

that has nice guarantees if your state space is well-defined, etc., those guarantees are not going to hold in our real applications. So what can we really do? How can we feel good

about policies that have reasonable numbers but we’re just not sure about all of our assumptions? So I’m going to go quickly

through some of this. So we do stuff like making sure that our results reproduce across sites, that’s important; we check sensitivities to all modeling assumptions if we’re using something that requires a model, like WDR. We also talk a lot with the experts; we compare against the standard of care; we ask clinicians, “Do

these policies make sense?” We gather annotations, but these are somewhat basic things that

one might want to do. The thing that I’m going to spend the last couple of minutes doing a deeper dive into is: when we’re asking the doctors, what’s the best way to actually ask them if something makes sense? So for the numbers sitting

up here on this slide, we gave them examples. We gave them a patient, we gave them a recommendation, and we asked, does this recommendation make sense? We got an agree or disagree or partially disagree, but that’s a very limited way of showing doctors what

the agent will do. In particular, a piece that was missing is: how do we select those patients? How do we know, how can we feel good, that this patient selection, this group, was good enough? Because if the clinicians said that for this group it looks great, would we feel good

that this group it looks great, would we feel good

about going forward with this policy? Probably not. Depending on how we

chose them, maybe not. If they said it made no sense, then we would know we have to

go back to the drawing board, but a positive result doesn’t necessarily indicate that

we have a good policy. So there’s a question

of how to validate an entire policy given

only some samples. So, as a little detour, another area that we've been working in is: how can we best communicate a treatment policy to

a clinical expert? So we formalize it in

the following setting. We’re going to present the expert

with some state-action pairs. In this particular state, the agent does this. The expert now is going to be

given a new state, S prime. We ask them, what

action is going to happen in this state, S prime? If they can guess it, then we can feel good. We can be like, okay, the expert has a good sense of how this

agent is going to behave, and then we’re going to trust the expert to tell us whether

the agent is reasonable. We can only do that if the expert can do a great job of simulating what the agent will do. So our goal, the thing that we have control over, is that training set, the teaching set that we

give to the expert for them to be able to recover

back the entire policy.
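That evaluation loop can be sketched in a few lines; the toy lookup-table policy, the teaching set, and the imitation-style "expert" below are all invented for illustration, not from the talk:

```python
def simulatability_score(agent_policy, expert_guess, query_states):
    """Fraction of held-out states where the expert correctly predicts
    the action the agent will take (not the optimal action)."""
    hits = [agent_policy(s) == expert_guess(s) for s in query_states]
    return sum(hits) / len(hits)

# Toy agent: a lookup table over four states.
agent = {0: "up", 1: "left", 2: "up", 3: "down"}.get

# Hypothetical teaching set shown to the expert: [(0, "up"), (2, "up")].
# This made-up expert extrapolates by pure imitation: always guess "up".
expert = lambda s: "up"

score = simulatability_score(agent, expert, query_states=[1, 2, 3])  # 1/3
```

A high score says the teaching set let the expert predict the agent; only then does the expert's "this agent is reasonable" judgment carry weight.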

Here’s a Gridworld and the arrows that are filled in are the examples of

state-action pairs, and then the task, for example, would be for those queries

that are circled in black, what are the actions that you think that the

agent is going to take? Not the optimal action, but what is the action

that the agent is going to take? Any guesses? People have any hypotheses

about what’s going on? Down in green. Okay. Anyone else?

People are too tired. You think that it’s

going to go right->>In the yellow spot.>>-in the yellow. I see. In general, can you summarize, though? You have no idea. So there are two forms of things that people often do with these tasks, which have words that you know. “The policy goes down in green,” that's like an imitation-type policy: I think that this action often happens in this state. The other thing could be, it looks like they're

trying to go to blue. That's an inverse reinforcement learning-type policy. So we don't know what people are actually doing. Are they doing IRL or are they doing IL? The implications would be different for the state-action pairs that we want to present. So in this case, many times, people end up doing more

inverse reinforcement learning. They think that there’s a goal

that people are trying to get to. Whereas if we give them examples

that look like the following. So in this particular case where this is different treatments

and different biomarkers, people are like, “I don't have the foggiest idea of what's going on.” What happens in this

state, I just copy. I do this too. I look at that one, I try to find a copy

of this over here. I’m an imitation learner for

this particular type of dataset. What we showed is that

it's really important to know what people are going to do. First of all, quantitatively, we noticed a difference. So this is a Turk study. We asked people after they did the experiment to tell us what

their thought process was. Then we went through all of those

things and tried to code them as something that looks like IRL, like IL, or “can't tell.” So the gray is where we couldn't tell from their description exactly

what they were doing. So in the HIV case, this one over here, people were just like, “We copied.” So that was clearly imitation learning. In the GridWorld, more often, people were describing something that looked more like the agent is trying to get to a goal: “I understand now.” Those differences mattered in the following sense. So now, we can imagine that I create a summary assuming that you

are an imitation learner. Then I give you that example. Or I create a summary

assuming that you’re an inverse reinforcement

learning learner, and that’s a different form of

summary that I'm going to give you. How would you perform if the extraction method assumed IRL, or if it assumed IL? So the dashed lines show how you would expect performance to be. The important thing to notice is that across the different domains there's a swap; there's not a clearly better method. So it's important to match. We also noticed that in

the different domains, if we match what people did (remember, people were mostly imitation learners in the HIV domain) and we create the sample knowing that they're going to do imitation learning, they do better than if we create the sample assuming that they're going to do IRL. So this is an important

piece of information. If we’re going to be

presenting examples, having the right model

of human cognition, of how they're going to extrapolate from the examples, is very important. This is obvious, [inaudible] but we did the actual study to

show that this is important. When we think about how to

present examples to the doctors, we want to understand what these mental models look like. So what I'm going to do is

I'm just going to finish up. One other thing that we're working on is, if things are

statistically indistinguishable, we want to offer options. So if we have a couple

of different policies, we want to be able

to say that some of the policies are clearly

bad, like the ones in gray. The other ones might be reasonable, and we've applied this again to the hypotension management

task in the ICU. I won’t go into detail on that because I want to save

time for a couple of questions. So going forward. So RL in the health space I think

is a really interesting area. I was working on RL applied to nothing during my PhD, and now that I'm working on

these health applications, I feel like I’m working on

very different questions but they’re all very satisfying. There’s all sorts of really interesting technical

questions that come up. When you’re working

with a real domain, you care a lot about how

things are going to work. In terms of things to keep in mind, let’s think holistically about how

RL performs in the real world, and keep in mind that usually it’s a human agent system

that's making the choices. So in all of our work, the way we think about it is that

our agent is going to be providing information

and recommendations to a human who has

additional information. They’re going to combine those

together to make the final choice. So it’s important to think about how the system is going to

behave in the real context. Also, we have to be

careful with the analyses. Like that first

example I showed you, where you can just

run a bunch of numbers, things can look totally peachy,

and then you're like, “Wait. The numbers mean nothing.” That happens, and we have to

make sure we are aware of that, that we take care of that. But let's not let that stop us from thinking about messy problems that have a lot of potential impact. Just two quotes that I really like, going back to

comic books actually. So we have: “With great power comes great responsibility.” We have to be careful. But at the same time, “perfect is the enemy of the good.” If we stick with the status quo, then that's also a problem when there are better solutions potentially out there. So I'll leave it at that and

take any questions. Thank you.>>So we have time for

a couple questions.>>So I have one question.>>Over there.>>It happened again.>>Hey, Hailey. So I guess I’m curious about

the implications of the work you have on people wanting

to do imitation learning. I’m curious about given that that’s

what people want to do, what->>Well, it’s scenario-dependent

and that's what we found. This was entirely a Turk study. It wasn't with doctors. So my hypothesis is that if you can understand the domain, you will probably use IRL. Then you're like, “You're doing that because you're trying to get to this goal.” Whereas if you're not super

familiar with the domain, you will probably just imitate. Or somebody says, “Step to the left.” I'm just going to step to the left, like, I have no idea why somebody said that or why somebody did that. So if we give some of these

examples to the doctors, like the HIV example, they may actually be doing

IRL and that’s something. Basically, the important

part of this study was that if we make the wrong assumption, we actually do get worse quality. Because it could have been that maybe the assumptions don't matter, that people are just somehow robust to whatever examples you give them. That could have been the outcome, but it wasn't.>>So a related question. Going forward, how would

you think about doing this for a new task with doctors?>>So one thing

we actually discovered after this study is that we're still really interested in extensions of this and in how to do this with doctors. But one thing that came out of

more discussion as we’ve been talking to the clinicians is that even though they often

think in terms of cases, they think in terms of data points: I met this other patient five

years ago and this thing happened. When they communicate

with each other, they still have to communicate

in a language of features. Does that make sense?
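As an illustrative sketch of that idea (the cohort, the feature names, and the scoring rule here are all invented, not from the talk), one could describe a case by the features where it deviates most from the population:

```python
import numpy as np

def describe_case(x, X, names, top=2):
    """Describe one patient by the features where they deviate most
    from the population mean, in standard-deviation units."""
    z = (x - X.mean(axis=0)) / X.std(axis=0)
    order = np.argsort(-np.abs(z))[:top]  # most atypical features first
    return [(names[i], round(float(z[i]), 1)) for i in order]

# Invented cohort: columns are age, heart rate, creatinine.
rng = np.random.default_rng(1)
X = rng.normal(loc=[60.0, 80.0, 1.0], scale=[10.0, 10.0, 0.3], size=(500, 3))
patient = np.array([78.0, 82.0, 1.1])
summary = describe_case(patient, X, ["age", "heart_rate", "creatinine"])
```

The output is a short list of (feature, z-score) pairs, a crude stand-in for how a clinician might say "over 70, with a broken hip" rather than reciting the whole record.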

So if I tell you about this patient five years

ago, it's not in your head. I have to say something like, “The patient

I saw five years ago who was over 70 years old and had a broken hip,” and then I'm describing this

patient to you but I’m describing it to you

in terms of features. So seeing how that

communication works, we think that that might be actually the key to

describing a policy. It's looking at the features, the language of communication that they're already using, and then putting the policy into those terms. So that's where our

current work is headed.>>So just following

up on the parametric, non-parametric split

that you talked about. There might be some

ways to bridge it. So local learning is the one that

comes to mind where you train a parametric model on a [inaudible] neighbors

weighted appropriately. Some of the interpretability

methods, like LIME in particular, look at that. So I wonder, do you really

want to train separate models?>>Or could you just build some

model that works the best? People use all kinds of ensembling. So this was our initial intuition. My guess is that there’s always going to be some cases

where you want to do some ensembling, and we could come up

with intelligent ways to do that. From an interpretability perspective, I see that as different. I'm not sure. So LIME does these local perturbations. I think it's really important to understand the delta of the perturbation. That's maybe a longer discussion of how to do interpretability for policies versus for classifiers.>>Let's take that offline. One person.>>Let's give [inaudible] another

round of applause. So that's it. I just want to wrap up with a couple of thank-yous and shout-outs. So first of all, thanks to all the speakers. This event would not have been

interesting without them. So let’s give them another

round of applause. Second of all, thanks to

all of our support staff. We have some people from Outreach. Jane Bianche is sitting outside, Louis Stevenson is in Redmond, and Jessica is in Montreal. They really made all this very

seamless for myself and Hall. Thanks to our videographer, our social media people, our photographer. I don’t

know where she went. Let's give them another

round of applause. Again, this is making the event. Then, I want to thank Hall, my co-conspirator, in helping

me put this together. It was very fun and I’ll try

to do it again next year. Lastly, thanks to all

of you for coming. This is part of what makes

the event really fun to see everyone and I hope you

all had a great time. Anything else? Thanks. That’s it.