Enhancing Data Quality and Rater Reliability with eClinROs

July 22, 2019


Dan DeBonis

So I think Chris used the word pioneer, to me that means someone who’s been out there and made a lot of mistakes, and hopefully learned from them. But you know, I think that kind of comes with the territory and what we’re trying to do here. Yesterday was great in terms of learning the process part of ePROs and how different folks are thinking about it. And we’re going to throw another stakeholder into the mix here, we’re going to throw raters in here, or clinicians. And we tried to give you an overview today, high level, how we started thinking about this and using eCOA really as a tool to improve the data quality, not just simply transferring something from paper so it can be captured electronically, but what can we do to actually enhance the quality of the data. And we’ve got a few examples here with some data that we’ll present. And then we’ll have Dr. Emil Coccaro come up and talk about a scale that he designed, that a client came to us and said, we’d like this to be eCOA. So kind of a case study in how we approach that, how we looked at that, some of the enhancements we made that made the scale and use of it a little bit easier and more efficient for the raters as we transition to eCOA.

I’m used to scientific conferences, so this is my disclosures. I think some of you out there, more grizzled veterans here, fax modem switches, that was my logistical challenge when I started in this business, so that’s how long I’ve been doing it. My focus is in CNS and neurology, so that’s what I’ll be talking about today, but really if there is some input from folks, either during the presentation please put your hands up, interrupt me, stop, ask questions. But other therapeutic areas, I’m always interested to hear about those. You know, CNS, I saw a slide someone put up about the, I guess the uptake of eCOA in different therapeutic areas. And I wish to God I’d taken a picture of it or where I saw it, but CNS was up there as the top adoption, I guess, in eCOA. And we’ll take full credit for the work that we’ve been doing for the last 15 years. But hopefully what I show you today will show you why that is. Because there’s a few of us out in the crowd I think some of my colleagues from Bracket, legacy Bracket, are up here, who have been doing this work too, clinicians who have been doing this work for the better part of six or seven years. And when we go out now and talk to clients, it’s not a question of are they going to use eCOA. It’s a question of how you’re going to support it, it’s more logistics. So I feel like for us at least, some of the data we presented and some of the impact we’ve shown, we’re more now talking—we’re over that hurdle, we’re a bit more mature, if you will, in our industry than perhaps some of the others where it’s still kind of the paper versus eCOA battles.

Hopefully this is what we get out of today. Again, introducing the raters into the mix and distinguishing some of the things that we think about when we think about ClinROs as opposed to ePROs in terms of implementation. And the approach here, the first two are probably, I think I’ve already talked about, but how can we use information gathering, if you will, with both outcome measures and questionnaires in the eligibility process. That’s something we do a lot of now, and are actively kind of adjudicating patients based upon data we’re collecting or questionnaires we’re collecting and having the external clinicians involved in that process. But using eCOA makes that so much more effective and efficient, that you can do that, rather than waiting for things to be faxed to you or scanned to you, you know, medical records.

But I think the third one is what I want to focus on. Enabling raters, enhancing their ability to do their jobs. And one of the things when I started out in this field, I remember I went to... has anybody been to a conference called Connected Health in Boston? A few people, okay. It turned out, my primary care doctor up in Boston (I’m living in Boston now) was someone who was very involved in Mass General and getting their telemedicine and using online portals and stuff, ten years ago. But I remember seeing him there, thinking, I know you, and he got up on stage and gave a talk and one of the first things he said is, I get handed apps, this and that, every day almost in my role in MGH as being the arbitrator of what we look at and what we don’t look at. He said, if you’re not going to give me something that’s going to make my job easier, don’t bother. And I think for clinicians out there, that’s kind of the approach, because we work with clinicians, that’s the approach we had to take, because we could not give them something that’s going to make their job harder. What can we do, what can we give them that’s going to make their job easier? And then by the way, start showing them some data, hey this actually works.


So here’s what the FDA calls a ClinRO. I think clinicians have a very hard job. They’re evaluating behaviors, a lot of this is subjective, they’ve been trained in medical school, fellowships, working in the field for years and years to really be able to evaluate these symptoms and deal with what are often very difficult patient populations, at least in CNS and neurology. ClinROs run the gamut of very simple—if I say CGI, or CGA, do people know what that is, very simple—0-7, how do you rate this patient? Mild, moderate, essentially. Taking all your clinical expertise to make that rating, which again that’s a challenge in itself but in terms of how we think about operationalizing it, that’s pretty straightforward, to fairly complex scales that are 30-page scales. You’re going to hear, I think later today, I think Adelphi is going to talk about the VAS, which is the violence scale that they use for autism. Very complex scale. How do you take that from paper and bring it to electronic form, knowing that it’s all these different rules you have to follow, whether it’s branching logic or can you access—what can you go back and access. How are they going to fill this out and kind of on paper it’s obviously a lot more flexibility sometimes, so how do you take an electronic tool and essentially apply rules which you need obviously to create eCOA, you need rules around how you’re programming these things, and how do you apply that to these clinician-reported outcomes.

I think the point here you see, we’re dealing with humans now, we’re dealing with human raters, who have a difficult job of eliciting information from human patients that often have, again, challenging populations here. A lot of our work is Alzheimer’s, neurology, pediatric, difficult populations to deal with. And again, you’re dealing with humans, you’re not, as we think about transitioning these to eCOA, you’re dealing with the variability that humans bring to this process.

And the other part of it is regulatory guidance. Now ePRO, everyone is familiar with it, very straightforward. And again, I kind of put copyright holders on there because I think they are a big stakeholder in the regulatory process. If you have an engaged copyright holder who is going to go out and say, I am going to certify that the scale you’re creating is essentially a valid version of my scale and I’m going to put my stamp on it, that’s great, we like that. Not many folks will do that. They kind of leave it up to you and you have to follow your processes. ClinRO, there is really no regulatory guidance. I think yesterday Bill talked about there’s some folks within the ePRO Consortium, myself included, doing some work there trying to write some guidance, kind of white papers on that. Couple of the folks, Todd Feaster and Todd Solomon out here have been involved at Bracket with a white paper that we put out about the practices we follow internally for migration of ClinROs and equivalence. So there’s been some work done but there’s nothing kind of standardized from regulatory bodies at this point as to how this needs to be done.

And really, some of these things, I’ll give you a few examples here of how we’ve done this over the years. I first started out in depression, which again, think about 15 years ago, the big thing was variability, raters inflating baselines, maybe some have seen all that, the famous HAM-Ds, which were all inflated right around the inclusion mark. So the one at the top there, that’s from the QIDS. And that’s a patient-reported outcome, obviously. And so that simply asks the patient, here you go, check one of these boxes. Now clinician-reported outcomes, which are the gold standard for depression, this is essentially an item that evaluates the same thing, and this is from a semi-structured interview, but they’re supposed to ask all these questions and you train the clinician to probe off these questions. And then they’re supposed to effectively digitize what they get. It’s a challenge. You can imagine the variability that this introduces, both from the standpoint of the interpretation of the responses and whether they are going to follow the standardized administration question format here.


And how do you do that? Well again, educate them, train them. Have they been in the field long enough? Do they have the right background? Have they been in trials before? Do they understand? And when you train them, you go out, we do this, this is how you administer the scale, the scoring benches. Something as simple as okay, if you’ve got symptoms that maybe are kind of like between a 2 and a 3, do you round up, do you round down. How do you tell someone to score that. And these studies are powered to detect three or four points of change on this scale. So it’s a big deal that everybody gets this right, if you’ve got half of them rounding down and half of them rounding up, you’ve got a problem. So these are the kinds of issues we’re dealing with at least in CNS and depression. And you could train them up and that’s great, so you—and this is the way trials are done, send them out there—we’ve brought them together, we’ve trained them, half of them were reading the newspaper when they were getting trained, but we trained them, we check that box. Send them out there, they’ll do a great job for us. However, this is how we first started the concept, trying to sell this concept 10-12 years ago.

Is anybody else here from Boston? Yeah okay. Driver’s licenses in Boston and driving in Boston, the traffic laws I think they are suggestions as opposed to laws.

But you know, so think about it. We train a rater, get your license. You go out there, is anybody monitoring what’s going on when they close the door to go evaluate their patient, at least in depression? Anybody know what’s going on behind that door? Fifteen years ago you didn’t. So first step was, okay let’s monitor them, different ways, recording it, doing some level of QC, recording the interviews. But where we’re trying to get to is, okay, can we essentially build in accident detection, are you getting too close with the cars, can we do that with eCOA, can we give them some guides that prevent them from really kind of going astray and keep them sticking to the administration. So hopefully I made that point.

A few things about—we talk about rater reliability. Really it’s—for some clinicians this is hard because they get into this field to treat patients, treat sick patients, therapeutic. But the job when you’re doing rating scales is, we call a neutral observer. Elicit symptoms, get the information you need, move on. And then there’s kind of the issue of drift. So, over time, raters may be in five or six studies at their site, different populations maybe using the same—you know, a lot going on obviously. And we’ve seen it, you know, they will vary in the way that they assess these patients. And you know, human reliability, just a kind of smattering of different data from—the first one up there, left-hand corner is we did—we first started doing depression, we started looking at the we call it concordance between we do computer administered versus—basically after the third interview of the day, these guys are—you know—you’re doing this some of these people doing five or six interviews a day. The quality will go down. And you know, it’s only logical, I mean how many times you’re doing this, and are you better in the morning, who knows. The length of the interview, bottom, Phase II study at baseline, raters tend to spend a lot more time interviewing the patients. That’s the inclusion visit. And then a little bit less time as the study goes on, maybe they’re less symptomatic, who knows. And then the one up in the corner is one of my favorites, this is kind of the hungry judges slide. What that means is that at the beginning of the day, what they said to this study ten years ago, judges were much more favorable to defendants when they first started their sessions. As the sessions got—you know, the hungry judges—toward lunch they were a lot less lenient. That’s why I asked Bill to let me go first today.


But again, you’re dealing with humans is the point. Sticking with depression, the case study here for depression. The Hamilton Depression Scale, Max Hamilton originally wrote it. He had two raters evaluate the patient. So one rater did it, next rater did it, take the consensus score. Obviously that’s not standard practice in clinical trials today, when you’re getting your outcomes. We first started out in Concordant, what we did is we kind of said, training is only going to get you so far, so how can we monitor these raters in a fairly unobtrusive way. Well for this MADRS, one of the depression scales along with the HAM-D that’s kind of the gold standard for depression, let’s ask a computer to administer the same type of interview. You remember I put all those questions up there that you’re supposed to ask for sadness. Have you been depressed in the past week? How many days? How bad has it been? These types of questions. So let’s put that on a computer, and we did that. NIMH funded the initial validation. It’s not an ePRO, it’s a computer-administered interview. You’re asking questions, it’s going through branching logic, and essentially digitizing those responses, mapping them to the anchor points of a MADRS. And we started going out and talking to clients and saying, you’re running a depression study, think about using this. So what happened is the rater would go administer their MADRS, and then the subject would come out, sit at the computer, and do the computer-administered MADRS. Now you’ve got a paired set of scores, and you can use that initially for quality control. The MADRS is 0-60, with roughly 20-24 being at least moderately depressed. So if a rater is reporting a score of 24, and the subject, based upon their responses, is reporting a score of 40, you’ve got a problem.
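As an illustration of the paired-score idea described above, here is a minimal Python sketch of a concordance check between the rater's MADRS total and the computer-administered total. The names `PairedRating`, `discordance`, and `needs_review`, and the review threshold of 7 points, are assumptions made for this example; the talk does not state what threshold was actually used.

```python
from dataclasses import dataclass

MADRS_MAX = 60  # the MADRS total runs from 0 to 60, as stated in the talk


@dataclass
class PairedRating:
    subject_id: str
    rater_total: int     # total from the site rater's interview
    computer_total: int  # total from the computer-administered interview


def discordance(pair: PairedRating) -> int:
    """Signed difference; positive means the computer-administered total is higher."""
    for total in (pair.rater_total, pair.computer_total):
        if not 0 <= total <= MADRS_MAX:
            raise ValueError(f"MADRS total out of range: {total}")
    return pair.computer_total - pair.rater_total


def needs_review(pair: PairedRating, threshold: int = 7) -> bool:
    """Flag the pair when the totals differ by more than `threshold` points.

    The default threshold is a hypothetical value chosen for illustration.
    """
    return abs(discordance(pair)) > threshold


# The example from the talk: rater reports 24, subject's responses map to 40.
example = PairedRating("S-001", rater_total=24, computer_total=40)
print(discordance(example), needs_review(example))  # 16 True
```

The sketch only flags a discrepancy for review. It does not say which of the two scores is right, which matches the point made in the talk that the mismatch itself is the signal.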

So we went out and tested this and got pretty good, okay, does a reasonable job of being aligned with the raters, the computer and the raters, so the red is the MADRS being done by the computer. So these have got paired interviews in industry studies. And it was great. So I said okay great, you can monitor my raters, you know which ones are drifting, that’s great. Of course those patients are already randomized, so if you do see a problem once you get into the study, not much you can do.

So then the concept was, what can you do at inclusion. And this is kind of cartoonish a little bit, but now start thinking about it. At my inclusion visit, baseline visit, if I’ve got that example I gave you, 24 and 40, what does that mean. To a degree you don’t care what it means, you just know you’ve got an issue, do you really want that subject in your study. Now we weren’t originally using it this way, we were simply using it to monitor the raters. But then when we went back in some of the studies that we worked on and looked at some of the data at inclusion, here’s what we found. Essentially this was from a Forest study for [unclear 19:02] I don’t even know what the commercial name is today, I should know that. But anyway it was an approved drug. I think these studies finished around 2010-11. But we went back, we’d been doing these tandem ratings, and we went back and analyzed the data and kind of said, okay, at inclusion, if we did a post hoc to say where are those patients, remember from the previous slide, that fell in the tan or orange, let’s take a look at those. And those represent the last three sets of bars there where they’re going south, that’s favoring placebo. I mean this was a positive study, but this small subgroup you can see what was going on there. If they were not getting back, if they were not within that green area, actually when you do the analysis, they favor placebo. And what was interesting to us, because I think part of why we got into this was, oh the raters are inflating, they’re up to all this, they just want to get these patients in.
The worst offender really was these overzealous patients, patients that were over-endorsing symptoms, and again you’d think someone coming in depressed, insight, maybe they just, who knows, but that was actually the group that contributed the most to the placebo response. And so, you know, the idea is that what you’re looking for here, you’re running a clinical trial, you’re looking for a reliable report.

Yes sir?



[inaudible 20:40-43]


So they were the group that—over here. So they were— you don’t want to say exaggerating the symptoms, they were reporting that they were more symptomatic than the rater. And you know, a good rater is probably going to tease out, is this really the worst you’ve ever been? But if you’re in the middle of a—do you want that person in your trial? No.

Moving on to Alzheimer’s. Yes, sir?


[inaudible 21:12-21] ...she’s got the Hamilton score that fits with the study but she’s in a shelter and she’s only depressed when she is actually in the shelter at night. What do you do? So I talked to the patient and I said, what’s the story here? She says, well yeah when I’m in the shelter I get depressed and I’m sad. And I said, well I got good news and bad news for you. She says, what’s the good news. Good news is you don’t have depression. So therefore I wouldn’t treat you. And the bad news is I won’t put you in the trial. But I think with other investigators, whose raters weren’t curious enough or savvy enough to bring it to their PIs, she would have wound up in the trial and would have been a placebo responder, which is not what you want.


Right. And so in a lot of the big trials in depression, you may have heard today like Janssen is running esketamine trials, Allergan running rapastinel trials. They’re using these methodologies in various different—essentially using the idea of tandem or independent evaluation of the patient at the inclusion visits to make sure that they are indeed meeting the symptoms in concordance.

Just moving on to the next, Alzheimer’s, we’ll shift gears a little bit here. Alzheimer’s outcomes are a bit more objective in the sense that you’re eliciting information, you’re saying yes or no to the responses. So for example, one of the questions on a Mini-Mental scale, ADAS-cog—the Mini-Mental and the ADAS-cog are two of the three big ones that are used in Alzheimer’s trials. One of the questions is, you know, orientation, what day is it. Pretty straightforward, and then they’re supposed to mark that, that’s a point on the scale. So you go through a series of questions and—but with Bracket, and again, a couple of guys that do this every day and are very good at it are here. But what we used to do before we started using eCOA was get the ratings in site on paper, they’d get faxed to us, we take a look at them, were they administering the scale properly, did they score them properly. And that was logistically hell. But what we do then is, you know again, there’s clear ways these scales are supposed to be administered and scored. If there is an error we go back to the rater and say, take a look at this, we can change the score, you may want to review this, so we’ll get quality control here. And we found error rates about 30%. We were doing this about ten years on these and that’s a little scary because the MMSE is typically used for inclusion criteria in Alzheimer’s trials. And then in clinical practice, those of you from the UK should be terrified by this, they had the nice neurologist, 90% of them didn’t know how to score one of the items or one of the domains in the MMSE. And they gave them an example to do and a quarter of them scored it wrong and it would have impacted by three points, which that’s a 0 to 30, that’s a big difference in terms of how you might get classified. So big issues with how these scales are administered. Here’s an example of the MMSE, this is our favorite called serial 7s. 
What’s 100 take away 7, I won’t ask anybody to do this, it’s always one of our favorite parlor tricks to ask someone to stand up and do this. Pretty straightforward, okay, and then if they get it right, okay you mark that as a 1, if they get it wrong, you mark that as a 0. But basically, there are some rules around this. If you said 93 for the first one, that’s a point. If you said 87 for the second one, that’s not a point. But then if you said 80 for the third one, that’s a point, because it’s 7 less than what you just said. But if you’re doing this, and you say okay they got 87 wrong, and now they say 79, oh that’s a point. No it’s not.


So we designed the eCOA version of this to implement this scoring automatically. So they say 93, that’s right, they say 87, that’s wrong. The next one is 80, that’s correct. You can see the eCOA version is doing this. So we’re not asking the rater to do this, this is all being scored. We’re putting in place the rules for these fairly objective scales, asking the rater to record, and then we’ll interpret it for the scoring purposes. And the result on data quality has been tremendous.
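The serial 7s rule described above can be sketched in a few lines of Python. This is a minimal illustration, assuming the rule exactly as stated in the talk (each response is judged against the previous response minus 7, not against the fixed sequence 93, 86, 79, ...). The function name `score_serial_sevens` is hypothetical and is not taken from any eCOA product.

```python
def score_serial_sevens(responses, start=100, step=7):
    """Return one 0/1 score per recorded response.

    Each response is compared with the previous response minus `step`
    (the first response with `start - step`), so a single wrong answer
    does not automatically make the answers that follow it wrong.
    """
    scores = []
    previous = start
    for answer in responses:
        scores.append(1 if answer == previous - step else 0)
        previous = answer  # the next item is judged against what was actually said
    return scores


# The two cases walked through in the talk.
print(score_serial_sevens([93, 87, 80]))  # [1, 0, 1]
print(score_serial_sevens([93, 87, 79]))  # [1, 0, 0]
```

In this arrangement the rater only records what the subject said, and the point for each response is derived from the recorded values.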

And I mean this is really good. We’re going out to Alzheimer’s trials now. We’ve presented this, some of the guys again are here, Todd Feaster and Todd Solomon, but we’ve presented this enough where they don’t care about this, they know this is done, okay yes we get it, we need to use eCOA if we’re doing an Alzheimer’s trial because of these error rates. Essentially what we’re saying is, we’ve reduced the error rates, different versions here, from I think the MMSE about 10% now just about 6% in trials, that’s tremendous both in terms of efficiency and just your response, level of confidence that these trials are going to get accurate data from your trials.

Final case study here is just data quality. And that’s using, again we’re legacy Bracket, work that I came from before, all based in evidence. Can you use like edit checks essentially? If you’re getting data from your depression scale that says the patient’s getting worse, and you’re getting data from your CGI, which is the global assessment, that says the patient’s getting better, maybe that’s happening, but that’s probably an anomaly in the data that we looked at. So what can we do about that with eCOA?

PANSS is a schizophrenia scale, second example here. And in that case, looking at delusions, essentially you don’t have delusions if you’re not suspicious and whatnot, it’s clinically almost impossible. And because we’ve analyzed a lot of this data over the years and gone back and looked at the impact on the data, we’re now building these kinds of edit checks, again not changing scores, just saying to the rater, if this occurs when you’re doing the scale in our eCOA system, go take a look at this. And again, impact. So in analyzing data, and being able to right it in real time, the blue represents, again as we do our analytics, the number of errors that would occur, these discrepancies where we’re getting delusions without suspiciousness, so things that clinically just are almost impossible. We’ve really been able to demonstrate that using an eCOA, alerting the raters to this before they submit their data, we’re getting a lot fewer errors.
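The two edit checks described above (a symptom scale and a global impression moving in opposite directions, and a PANSS delusions rating without a matching suspiciousness rating) might be sketched as follows. This is a hedged illustration only: the function names, the cut-offs `high`, `low`, and `margin`, and the assumption that the global rating is the CGI-Improvement item (1 to 3 improved, 4 no change, 5 to 7 worse) are choices made for this example, not rules stated in the talk. As in the talk, the checks return a message for the rater and never change a score.

```python
from typing import Dict, Optional


def panss_delusion_check(items: Dict[str, int], high: int = 5, low: int = 2) -> Optional[str]:
    """Flag a high P1 (Delusions) rating paired with a low P6 (Suspiciousness) rating.

    PANSS items are rated 1-7; `high` and `low` are hypothetical cut-offs.
    """
    if items["P1"] >= high and items["P6"] <= low:
        return "P1 Delusions is rated high while P6 Suspiciousness is rated low; please review."
    return None


def direction_check(total_change: int, cgi_i: int, margin: int = 4) -> Optional[str]:
    """Flag a symptom total and a CGI-Improvement rating that disagree.

    total_change: current total minus previous total (positive = worse).
    cgi_i: 1-3 improved, 4 no change, 5-7 worse.
    margin: hypothetical minimum change in the total that counts as a real shift.
    """
    if total_change >= margin and cgi_i <= 3:
        return "Symptom total worsened but CGI-I indicates improvement; please review."
    if total_change <= -margin and cgi_i >= 5:
        return "Symptom total improved but CGI-I indicates worsening; please review."
    return None


print(panss_delusion_check({"P1": 6, "P6": 1}))
print(direction_check(total_change=8, cgi_i=2))
print(direction_check(total_change=-1, cgi_i=4))  # None: nothing to flag
```

Running the checks before the rater submits is what makes them an alert and not a query sent back to the site afterwards.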

Didn’t get into process today, but there’s a lot of considerations for ClinROs here that we talked about. Happy to discuss those a bit later, but again, just the process of thinking through how can you use eCOA here, what can we leverage that’s going to improve not only the quality but the efficiency of the trial. Letting the site know right away that on one of their scales at inclusion, they don’t meet criteria. Get the patient out of there, they’re done. Why wait till the end of the... things like that.

And then queries of course. You don’t have missing data. You can build into the scale that they’re required to fill in every single data point, and they can’t submit the scale if they don’t. You can enforce the order. If they’re supposed to go sequential, make sure they do that. If they’re supposed to wait five minutes between tasks for a memory test, asking them a word, you can enforce that. These are the kinds of things you can use eCOA for, and I’d argue you should use eCOA for. You should not simply focus on replicating what’s on paper, because it can be done better.
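The three rules mentioned above (no missing items, enforced order, enforced waiting time) can be sketched as a small form object. Everything in this example is an assumption for illustration: the class name `ScaleForm`, the item names, and the choice to measure the delay from the moment the previous item was recorded. Times are passed in as plain seconds so the sketch can be run without actually waiting.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ScaleForm:
    item_order: List[str]                                         # items in required order
    min_delay_s: Dict[str, float] = field(default_factory=dict)   # item -> seconds to wait after the previous item
    answers: Dict[str, int] = field(default_factory=dict)
    answered_at: Dict[str, float] = field(default_factory=dict)

    def record(self, item: str, value: int, now: float) -> None:
        """Record one answer, rejecting out-of-order or too-early entries."""
        position = len(self.answers)
        if position >= len(self.item_order):
            raise ValueError("all items are already answered")
        expected = self.item_order[position]
        if item != expected:
            raise ValueError(f"expected item {expected!r}, got {item!r}")
        if position > 0:
            previous = self.item_order[position - 1]
            waited = now - self.answered_at[previous]
            required = self.min_delay_s.get(item, 0.0)
            if waited < required:
                raise ValueError(f"{item!r} opens in {required - waited:.0f} s")
        self.answers[item] = value
        self.answered_at[item] = now

    def can_submit(self) -> bool:
        """Submission is allowed only when every item has an answer."""
        return len(self.answers) == len(self.item_order)


form = ScaleForm(["word_learning", "orientation", "delayed_recall"], {"delayed_recall": 300})
form.record("word_learning", 3, now=0)
form.record("orientation", 1, now=60)
try:
    form.record("delayed_recall", 2, now=120)  # only 60 s after the previous item
except ValueError as err:
    print(err)
print(form.can_submit())  # False
form.record("delayed_recall", 2, now=400)
print(form.can_submit())  # True
```

Because an incomplete form cannot be submitted, the missing-data query never has to be raised in the first place.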


And then finally, I’m almost out of time here, but where are we going with all this? I mean I think certainly you’re getting your data quicker, we’re locking Alzheimer’s databases in weeks instead of months because you’re collecting it with eCOA and you’ve got all your data points to validate your scores. You’re not having to go back to your sites and do queries, it’s not happening. But really, bigger picture, this is where we can get to. Can we show enough data here that maybe we can start thinking about reducing sample sizes because we can power the studies in a different level because of using what we’ve done with eCOA and the different methodologies for data quality control. Nobody’s gone there yet, but hopefully that’s—as we think about this that’s where we want to go.
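To make the sample-size point concrete, here is a small Python sketch using the standard normal-approximation formula for comparing two means, n per arm = 2 * ((z_(1-alpha/2) + z_power) * SD / difference)^2. The standard deviations of 8 and 7 points are hypothetical values chosen only to show the direction of the effect, and the 3-point difference echoes the "three or four points of change" mentioned earlier in the talk. None of these numbers are results from the studies discussed.

```python
from math import ceil
from statistics import NormalDist


def n_per_arm(sd: float, delta: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Subjects per arm for a two-arm comparison of means (normal approximation)."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return ceil(2 * (z * sd / delta) ** 2)


# Hypothetical: the same 3-point treatment difference with noisier vs. cleaner ratings.
print(n_per_arm(sd=8, delta=3))  # 112
print(n_per_arm(sd=7, delta=3))  # 86
```

The required sample grows with the square of the score variability, so even a modest reduction in rating noise would, under these assumptions, translate into noticeably fewer subjects per arm.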

[END AT 31:19]
