I wanted to say that this conference started out with a story from a patient’s perspective, and a story of becoming a survivor. And it’s a very inspirational story, and it’s worthwhile to have that story to remind us why we’re all here and why we’re all in this business, because ultimately we’re here to help patients. But on our day-to-day, what are we dealing with? We’re dealing with data. So I thought it might be appropriate if we ended this conference with a story about data.
This is divided into two pieces. The first is the story: the use of eCOA metadata to detect fraud. The second part covers the rest, about developing risk indices and more.
So the story. This is the beginning of the story. We received a request from a sponsor. They had suspicious data. No explanation as to what that meant. No explanation as to what alerted them to investigate the site, but they were interested particularly in a site, and could we have a fresh look. So this request went to the project team. And the project team looked at this through the lens of what they do. Because they have three indicators that they feel could be indicative of fraud. The first is a high number of subjects in which the PIN code is the same as the year of birth. Sites are not supposed to be influencing subjects on their choice of PIN code, but they sometimes do. And when they do, what they’ll often do is they’ll suggest the subject use their year of birth. This can happen for quasi-legitimate reasons. It’s not supposed to happen, but quasi-legitimate, and not be fraud. But it is a potential indicator. Or there’s a high number of subjects with matching PIN codes, so whoever is committing fraud is not very imaginative and they just use the same PIN code over and over and over again. Or if they’re a little more imaginative, and you get the last scenario where what they’re doing is they’ve got sequential PIN codes, so they’re trying to avoid fraud but they still need help remembering them so they do them sequentially.
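These three PIN-code indicators are easy to mechanize. Below is a minimal sketch of what such a check could look like, assuming a hypothetical record layout with `pin` and `birth_year` fields; it is an illustration, not the project team's actual tooling:

```python
from collections import Counter

def pin_risk_indicators(subjects):
    """Compute the three PIN-based fraud indicators for one site.

    `subjects` is a list of dicts with hypothetical keys
    'pin' (a 4-digit string) and 'birth_year' (an int).
    """
    n = len(subjects)
    pins = [s["pin"] for s in subjects]

    # Indicator 1: PIN equals the subject's own year of birth.
    yob_matches = sum(1 for s in subjects if s["pin"] == str(s["birth_year"]))

    # Indicator 2: PINs shared between subjects at the site.
    counts = Counter(pins)
    shared = sum(c for c in counts.values() if c > 1)

    # Indicator 3: sequential PINs (e.g. 1111, 1112, 1113, ...),
    # counted as adjacent sorted values that differ by exactly 1.
    numeric = sorted(int(p) for p in pins)
    sequential = sum(1 for a, b in zip(numeric, numeric[1:]) if b - a == 1)

    return {
        "pct_pin_is_birth_year": yob_matches / n,
        "pct_pin_shared": shared / n,
        "sequential_pin_runs": sequential,
    }
```

The thresholds for what counts as a red flag are a judgment call; as noted below, there are no comparative statistics for what a "normal" rate of year-of-birth PINs looks like.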
So the project team then looked at these three indicators. First the high number of subjects in which PIN codes are the same as year of birth. What they found was 38% of subjects had the same PIN code as their year of birth. And to the project team, this looks like a red flag. Now we don't have comparative statistics to say what is normal here, so this has to be taken with a grain of salt. Maybe it’s real, maybe it isn’t, but it is somewhat suspicious. And from the project team’s point of view, this was highly suspicious. At the very least, the site is likely influencing the subject’s PIN code choice, if nothing worse.
The second indicator was a high number of subjects with matching PIN codes. Well, 42% of the subjects' PIN codes matched each other, but given that they were years of birth, and you have a limited range of years of birth, maybe that's not terribly unusual, so not necessarily a red flag. And there were sequential PIN codes, but again those were more an artifact of using years of birth.
In conclusion, all of this says maybe there’s something suspicious going on and maybe there isn’t. In the next steps, they called me in, and they gave me a little more information. They did not tell me what site they were interested in or suspicious of, but they did say that they had seen subjects with similar save times in the different forms. And what they wanted to know was, was this occurring at other sites, and what is normal.
Disclaimer: I am not a statistician. I do have some skills with Excel and I enjoy playing with data. So, not necessarily my thing but I had a good time with it.
This was the dataset. We had over 100 sites, we had over 700 subjects. The data itself, there was a morning diary being collected between those hours 4 AM and noon, and then an evening diary with availability between 6 PM and midnight. There were over 475,000 rows of data. So if anybody’s used Excel, this much data, I can tell you this took a long time because every time I tried to do something it would take ten minutes to sort itself out. But anyway I persevered. And this is what I did with it.
So basically it's using Excel to array the data in a way that makes sense and ultimately come up with subject pairs. If there were two lines of data from different subjects with completion times within five minutes of each other, it would mark them as a concatenated pair. And then we asked the question: what percent of the data that was collected was part of one of these concatenated pairs? (Obviously I'm using "Anon" for the sites; I renamed them consistently throughout this presentation, but the names aren't real, so that you can't tell what the study is.)
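The pairing exercise reduces to one sweep over the time-sorted save times. Here is a rough Python equivalent of what the Excel work does; the `(subject_id, save_time)` record layout is my own assumption, not the study's actual export format:

```python
from datetime import datetime, timedelta

def matching_pair_rate(records, window_minutes=5):
    """Fraction of diary records whose save time falls within
    `window_minutes` of a save by a *different* subject.

    `records` is a list of (subject_id, save_time) tuples, a
    hypothetical stand-in for the exported eCOA metadata rows.
    """
    records = sorted(records, key=lambda r: r[1])
    window = timedelta(minutes=window_minutes)
    flagged = set()
    # The list is time-sorted, so each record only needs to be
    # compared against its near neighbours before breaking out.
    for i in range(len(records)):
        subj_i, t_i = records[i]
        for j in range(i + 1, len(records)):
            subj_j, t_j = records[j]
            if t_j - t_i > window:
                break
            if subj_j != subj_i:
                flagged.add(i)
                flagged.add(j)
    return len(flagged) / len(records)
```

Sorting first keeps this tractable even on hundreds of thousands of rows, which is exactly where Excel struggled.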
But what you see is that you've got up to 58% matching data, which seems really, really high, down to 13%, which seems fine. What you also notice is that it roughly correlates with the number of data points. The site at 58% has over 20,000 data points, and you'd expect more data to fall within a given time window just by chance; the fewer the data points, the less likely that is. So I'm scratching my head, thinking, well, I can't tell if it's significant or not. So my brainstorm: let's graph it.
All of a sudden something becomes a lot clearer. And you can see then the correlation between the number of data points and by chance then seeing those matching pairs, but then you can also see Anon site #1 and Anon site #4. Obviously not falling on those lines. So from a site level, those two sites are suspicious.
So the next step was, well let’s look at this on a subject level, so looking at those individual subject pairs. Looking at a completion time within one minute, and for a given subject pair having at least 100 instances of that given pair happening within one minute of that data. I mean this is really stringent. But even so, five sites were identified with one or more of these pairs that met this criterion. And it’s Anon 1, 2, 5, 12, and 4, with Anon 1 having 25 of these pairs.
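The subject-level criterion also has a simple mechanical form: count, per subject pair, how many saves landed within a minute of each other, and keep pairs with at least 100 such instances. A hedged sketch, again assuming a hypothetical `(subject_id, save_time)` layout:

```python
from collections import Counter
from datetime import datetime, timedelta

def suspicious_subject_pairs(records, window=timedelta(minutes=1),
                             min_hits=100):
    """Count, per pair of distinct subjects, how many saves landed
    within `window` of each other, and return the pairs with at
    least `min_hits` such instances.

    `records` is a list of (subject_id, save_time) tuples; the
    layout is an assumption for illustration.
    """
    records = sorted(records, key=lambda r: r[1])
    hits = Counter()
    for i in range(len(records)):
        subj_i, t_i = records[i]
        for subj_j, t_j in records[i + 1:]:
            if t_j - t_i > window:
                break
            if subj_i != subj_j:
                # Store the pair in a canonical order.
                hits[tuple(sorted((subj_i, subj_j)))] += 1
    return {pair: n for pair, n in hits.items() if n >= min_hits}
```

With the real dataset, the `min_hits=100` cutoff is the stringent criterion described above; lowering it would trade specificity for sensitivity.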
If we look at this in the context of that graph, we do see that the two outlying sites have those instances, the worst being Anon 1 with 25 of those highly correlated pairs and Anon 4 with one. But there are others that sat on the line, like Anon 2 with seven of these subject pairs, or Anon 5 and Anon 12 with one each.
So we went a little deeper and we graphed a couple of these subject pairs, and looked at the time of completion of one subject pair versus the other subject pair. And you see two very distinct patterns. The pattern on the left, what’s happening is that the subject is responding to the alarm most likely and completing their diary within a very small time window. You can see that Anon 5-4 appears to be completing their diary almost exclusively at 8:00 in the morning and then 8:00 at night. And this other subject is doing something fairly similar. And so by chance, then—not by chance—their data is lining up, but that’s perfectly legitimate. We can expect that, because we have alarms and some people respond to those alarms practically religiously. And so we can see that.
The example on the right, however, is not expected. They are completing these diaries at very different times. They don’t have a pattern of time completion, and yet the other subject is also completing it in that exact same apparent pattern. Doesn’t make a lot of sense. So I would wager to say that the example on the left isn’t fraud, and the example on the right is probably site personnel sitting there with two devices filling in the data on one, filling in the data on the other; next day filling in the data on one, filling in the data on the other. It’s not the same time as the previous day, but it’s within a minute of each other.
Now, doing this to sort out what's legitimate from what's not is pretty laborious if you have to sit there and graph every pair. But you can see that the hallmark of the one on the left is a very low standard deviation in the completion times, and of the one on the right a high standard deviation. So we can just look at that. Adding it onto my happy little graphic, what you see is that Anon 1, up there on top with its 25 subject pairs, has a standard deviation of 1.35, pretty high; whereas Anon 12, with its one subject pair, has a very low standard deviation. That's the legitimate one, whereas Anon 1 with its 25 pairs is probably not so legitimate. And you also see Anon 2, with its seven subject pairs and a high standard deviation.
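The discriminator just described, low spread means an alarm-driven habit while high spread plus matching times means someone is probably filling in both devices, reduces to a single statistic per subject. A minimal sketch, with times expressed as decimal hours of the day; the example values are illustrative, not study data:

```python
from statistics import pstdev

def completion_time_spread(save_hours):
    """Standard deviation of one subject's diary completion times,
    expressed as decimal hours of the day (8.25 means 08:15)."""
    return pstdev(save_hours)

# Alarm-driven subject: saves cluster within minutes of 08:00, so
# the spread is tiny even if the times match another subject's.
alarm = [8.0, 8.02, 7.98, 8.01, 8.0]

# Erratic subject: no habit at all. If a second subject's saves
# still track these to within a minute, that is the suspicious case.
erratic = [6.5, 11.9, 4.2, 9.7, 7.3]
```

Ranking subject pairs by this spread separates the two patterns without graphing each pair by hand.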
And what I find interesting about this is that these different ways of looking at the data, at the site level versus at the subject level, detect different things. When we looked at the subject pairs, sites were flagged that didn't really fall off the line at the site level, certainly not above the line, which is what we would have expected.
Going back, it turns out that Anon 2 was the site where the PIN code analysis was done. So I went back and looked at that data. What we had is that 30% of the subjects at this site used the year of birth as their PIN; that's what we had found previously. Of the suspicious subjects, those in subject pairs, 42% used year of birth as the PIN. Not so different from the non-suspicious subjects, frankly. What I did is I color-highlighted the different subject pairs, so blue, in fact, marks four subjects who were all within a minute of each other; yellow is a subject pair, we've got a pink subject pair, and a sort of orangey-yellow subject pair. And if you look at them for a moment, what you see is that the blue ones and the orangey-yellow ones are using year of birth as a PIN. And on the right-hand side, the highlighted ones are those where there was a date of birth match: these subjects actually match the date of birth of another subject.
So what I think is happening, and this is probably digging far too deep but I think it’s kind of cool, is that you’ve got maybe two individuals at the site doing this, with two slightly different patterns. One individual is using the year of birth as the PIN code, and they are also then making the fake subjects up with dates of birth from other subjects. And another one, probably a little smarter, is using more random PIN codes that don’t align with the year of birth and aren’t basically stealing PIN codes or dates of birth from other subjects.
Coincidence? I’ll let the Incredibles speak for me.
So the story, the initial analysis based on PIN codes indicated that one factor could be considered a red flag. Looking at the matching data at sites, identified two suspicious sites. Looking at the subject level, we upped that to five suspicious sites. Looking at that in a little more depth brought it down to three sites. And one of the sites that was identified actually corresponds to the one that was initially identified by the sponsor. So, interesting data in that they came to us with suspicions about one site and we actually highlighted three for them just by doing this kind of data analysis.
Now, this was something that came to us out of the blue, not something we normally do, but it's certainly of interest. This was an ad hoc analysis, but what kind of power would you have, and what could you detect, if you were doing this on a routine basis? And so the question comes up: if you want to do this routinely, how are you going to go about it?
So this is the second part here, and what I want to do is give a little introduction to risk-based monitoring so we’re all on the same page, and then open it up for a bit of a discussion.
This was somewhat new to me; as I said, being pulled into this project was ad hoc. So I decided I needed a little more background, and I started with the logical choice of looking for FDA guidance and found this one, "Oversight of Clinical Investigations: A Risk-Based Approach to Monitoring." And I highlight this from the guidance: "The overarching goal of this guidance is to enhance human subject protection," so that human piece, "and the quality of the clinical trial data." And the idea is to focus a sponsor's oversight on these two factors, because they are the most important aspects of study conduct and reporting.
So they go into monitoring practices, and they talk about centralized monitoring of clinical data by statistical and data management personnel, which is what I'm really interested in here, and then targeted on-site visits. These are monitoring practices in general: one is centralized monitoring of the data; another is targeted site visits, which centralized monitoring supports, because by monitoring the data you know which sites you want to go visit. Or you can go for that frequent, comprehensive on-site visit schedule at all sites. In fact this guidance, and this is right from the guidance, "encourages greater use of centralized monitoring practices where appropriate than has been the case historically, with correspondingly less emphasis on on-site monitoring." So these are ways of monitoring, but let's get away from that last one and go targeted, and also look at the data.
And then, like I said, my focus is on that data piece. So what about monitoring data quality. They talk about, through routine review of submitted data, to look at various things. What I’m really trying to find here is what exactly could we do, what should we be looking for. Well, missing data, inconsistent data, outliers, potential protocol deviations that could be indicative of systemic or significant errors. And I will say that, having talked to the sponsor that got us involved in this whole thing in the first place, there were protocol deviations that made them suspicious of a site to begin with. So it’s not all data driven, it’s also things like deviations.
And then, conduct statistical analysis to identify data trends that aren’t detectable by on-site monitoring. And then again we get some specifics, so checks of range, checks of consistency, completeness of data. And then, unusual distribution of data within and between study sites, such as too little variance. Well, that’s what we were seeing, right, the too little variance in the form completion times.
Now those last two lines actually referenced an article. So I pulled out the article, “Statistical Approach to Central Monitoring of Data Quality in Clinical Trials,” because I’m still trying to get, how do we test this. And they had a nice little piece of information here, and listing the sources of data errors in clinical trials. And so the first is completely unintentional data errors, so in other words, you’ve got data errors because your instruments are screwy. And how would you see that; well you’d see that as a shift in the distribution of those values or large variabilities in the distributions. Instruments, you’re kind of expecting bell-shaped curves, you’ve got lots of outliers, something may be going wrong with the instrument.
Carelessness, how are you going to see this? Well, in fact you’re not really going to, it’s very hard to detect. But reassuringly, it’s generally not enough to impact the study. So their take is, you can kind of ignore carelessness.
These last two are clearly fraud: fabricated data and falsified data. You can detect fabricated data because you see small variation in the distribution of the values, or similarity of patterns in repeated measures. That's what we were picking up on. You can also see data falsified to reach a desired objective. I think that was referenced in one of the other talks, because what you can do is push the data one way so that the subject is eligible, and then later on they're a terrible subject, kind of thing. And this is detected by comparison of distributions or through center-by-treatment interactions, so getting into more complex math than I was particularly comfortable with, but these were some clues.
What I want to do is kind of open this up. What I want to know from you guys: Is anybody doing this? Is anybody doing risk-based analysis? You are? Well yes, you are. I was hoping for a sponsor.
AUDIENCE MEMBER 1
I think this is great. I think you've done a fantastic outlier analysis here, which is super useful. Legacy Bracket has had a pretty robust data monitoring and analytics program for quite some time, so it's exciting to see that other people have come to this, far outside of what we're doing, because it's always neat to talk to folks who are thinking about it in similar ways and different ways, and certainly with different types of data. We generally base these risk analyses on outcome data and not necessarily on time stamps and things like that, so that's really interesting. Again, I'd be really curious to hear if other folks are doing it. And certainly we're starting to think more about the different statistical methodologies for this, looking at how data is distributed within a study: like you said, is it parametric, is it not evenly distributed, are we using [unclear 22:19] analysis, are we looking at means and standard deviations, are we bootstrapping to do lots of comparisons to see what pops out. So there are many ways to get at this, and I think it's an ever-evolving field. It helps mitigate fraud, potentially, but really it gives some insight into why a site or a rater is looking different from their peers. And then, as opposed to it being a lagging thing, where you find out later that one site really screwed up your study because of whatever they were doing, is there a way to make it actionable and leading, so that you can intervene to help or realign that site, or get rid of someone if they're truly being fraudulent?
And then how do you take that knowledge and maybe from the eCOA space build it into your product so that, instead of waiting to do an analysis, when it actually happens, if it’s going beyond a certain threshold statistically, it alerts someone right away so there’s less of a chance that that rater may act on that data again in a way that shouldn’t happen.
Of the individuals in the room who are not CRF Bracket, who is doing risk-based monitoring? Okay, I’ve got two hands, three, four, five. Who is deliberately not doing risk-based monitoring, anybody? I can’t ask why not. Of the individuals who said they were doing risk-based monitoring, are you doing it using the eCOA data? You are.
AUDIENCE MEMBER 2
We contracted with a company called Analgesic Solutions, so they programmed triggers for us because we’re collecting pain data. So they’re not only looking at the compliance but they’re also looking at if a subject is consistently answering the pain question the same way all the time. So if every question every week they have 1, 1, 1, 1, 1, or the higher level. Or even worse, if they say it’s in the middle, that’s probably the worst answer because they’re just—
If they say what?
AUDIENCE MEMBER 2
In the middle. Like if they select 5 every time.
Oh, in the middle.
AUDIENCE MEMBER 2
Yes. But what we do is, since we can't tell the subject what answers to select, the way we address it is that we have the sites retrain the subject on how they should be answering the question and what they should be thinking about when they're doing their pain scale, so that we don't continue to see that occur. So we hope that retraining will help as they go along with the eCOA data. And yes, we still continue to monitor. We haven't seen much improvement yet, but we're hoping it will come if we keep at it.
That’s making the assumption that these are legitimate subjects who just don’t know how to—
Whereas what we were looking at here is just fraud. And I think Jerry's question is a good one: if you retrain them, do you see a different pattern? Because if you don't see a different pattern, then either they're not retrainable, and you wonder if they're decent subjects, or they're not real subjects. So my curiosity is also in that last one: how prevalent is fraud? So you've got that particular risk indicator. Any other input into what makes good risk indicators? Willie?
AUDIENCE MEMBER 3
I need to start by explaining that I'm not with ICON anymore; what I'm commenting on is work I did when I was at ICON. Three years ago, we had a project with an electronic diary, in a condition where you don't have a lab or any other way to confirm the diagnosis other than the answers the patient gives you. It was IBS, so very difficult to diagnose. And we used an electronic diary in that study, and we had the same experience: the first five slides would have been my slides. We got a call from the sponsor saying there's something dodgy going on, can you do some analysis. Now, our diary was done via phone, against our recommendation. So it wasn't screen-based, like a handheld device; it was via phone, IVRS. After some initial discussions, we looked at the audit trails and didn't get much out of that. So then we wanted to see the caller IDs. For data protection we couldn't get them, obviously, so after going back and forth over several weeks with AT&T, because the site we were looking at was in the US, we finally got masked caller IDs so that we could do some caller ID analysis, and we got time stamps from AT&T for when these individual calls had been made over a three-month period. And we came to a conclusion. I think that's what you said earlier: can we standardize how we analyze this? I'm not sure we can, because every case is slightly different. The way we went about it, and I'm not as good with Excel as you are, I had data scientists who used R to do the work for me, is that they did some analysis. They were techies, really sharp data scientists, and I had to try to make some sense out of the graphs that they gave me.
I gave some comments and said, I don't believe this, can you try that again, and they tried something else, and it went back and forth four, five, six times until slowly a picture started to evolve and we could make sense of it. Just like you interpreted the data, there were really two people doing it. After about two or three weeks we came to a conclusion and had an idea of what happened. The most likely scenario was that there was what I call a patient management organization, illegal, that was basically funneling patients into a study site that did nothing but research, so it didn't have any patient care. They would enroll patients, the patient would get reimbursed a lot of money, and somebody must have worked in cahoots with them, because these patients never filled in the diary themselves. Whoever it was most likely built a computer system that made the phone calls for them, because we had cohorts of five calls that would come in within 10 seconds and finish within 10 seconds of each other. And these calls were consistently 30% shorter than the calls from all the other 30 sites in the study. So some smart person built a computer program that made the phone calls for the patients and just filled in the diary, over three months, for every patient. There were like 70 to 80 patients at that site. So it's difficult to roll that out on a bigger scale, because you need to have the caller IDs masked, and you need agreements with the various toll-free number providers involved, a second company in addition to AT&T. So there's a lot of effort that needs to go into it, plus the two or three weeks of going back and forth until we could make sense of the data we found and had an idea of what actually happened. I think it's a worthwhile effort. I'm not sure there's a lot of this going on, but as in this case, when it does go on it's nasty. So it's certainly something that needs to be addressed, but it's not an easy one.
Thanks Willie. We had a couple other hands of individuals who know their companies are using risk-based monitoring. For those two people, do you use eCOA data in that basic algorithm? What’s that? Okay, so you know Willie’s story. This individual here? I want to talk to her.
AUDIENCE MEMBER 4
We incorporate the eCOA data into our RBM.
So what parts of that data do you use? What are you looking at in the eCOA data that acts as a flag?
AUDIENCE MEMBER 4
We have a whole department that does that. I’m not in that department so I can’t answer that but I know they look at all the data that comes in from the clinical trial and do that risk-based approach and flag protocol deviations and all that.
I think we all look at at least one risk index when we look at eCOA data, because we look at compliance. Diary compliance has got to be one of those flags we look at. If we see a site with patients with particularly low compliance, that ought to trigger some sort of event, whether it's a phone call from the CRA. Or if there are specific patients within a site, again with very low compliance, that ought to trigger retraining, or the phone call from the investigator, or whatever. So, on a very simple level, that's one risk measure, one risk index, that I think we all look at with eCOA. But some of the other aspects, around quality in the data, the detection of fraud, all of these other things, are perhaps areas where we're less sophisticated at the moment.
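Even this simplest risk index, diary compliance, can be mechanized as a routine check. A minimal sketch; the data layout and the 80% threshold are illustrative assumptions, not a validated cut-off:

```python
def low_compliance_flags(site_compliance, threshold=0.8):
    """Flag sites whose mean diary compliance falls below `threshold`.

    `site_compliance` maps a site id to a list of per-subject
    compliance rates (diaries completed / diaries expected).
    Both the layout and the threshold are hypothetical.
    """
    flags = {}
    for site, rates in site_compliance.items():
        mean_rate = sum(rates) / len(rates)
        if mean_rate < threshold:
            flags[site] = mean_rate
    return flags
```

A flagged site would then trigger the follow-up described above, such as a call from the CRA or subject retraining; the same shape of check works per subject within a site.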
It does concern me to some extent that if we get really good at detecting fraud and we publish what we’re doing, we’re going to get sites that are really good at committing fraud.
AUDIENCE MEMBER 5
[inaudible 32:15-21] at Bracket, and not just risk-based: we've essentially been verifying the validity and reliability of the data scientifically for a while. Again, it's mostly been outcomes data, and mostly clinician-rated instruments, because Bracket has primarily been in the CNS space, where there's been less ePRO. But one of the things we've seen, and we're thinking about how to quantitate it, is the idea of the Hawthorne Effect, the famous social psychology finding, for those of us who have gone through that. Essentially, as part of our eCOA device, for a lot of these measures we actually record the assessment. It happens in the background; raters are aware of it. They also become more aware that there is this large-dataset, risk-based monitoring going on. And because of that, in some of our newer programs we've seen a decrease, although I can't say it's statistically significant, in the things you might attribute to fraud, like identical scoring and what we would consider inliers. So just as an overall construct, I think the field, particularly in the ClinRO environment and CNS, as Dan talked about this morning, has had a much faster uptake of eCOA; in other spaces that have been more ePRO and slower to adopt, that may be coming. And the question is, will it become an arms race between how you detect it and the sites? If they're really interested in doing that, and that hasn't been our experience necessarily, but certainly eCOA has made it, I think, much harder for them to commit outright fraud.
That's interesting. I just want to finish up with one slide, about data, getting things in real time, and being proactive: finding problems at the beginning so that you're not looking after the fact. Actually, there are some real difficulties with that approach. You've got staggered availability of data; by the way, this is from that reference on the statistical approach to central monitoring. In the context of a clinical trial, one difficulty is that you have different sites coming on at different times. And just across sites, you have different numbers of subjects, and therefore different amounts of data. So if your analysis compares sites, you've got a lot of variability. It's like what I was doing with the site analysis, where the more data you have, the more likely you are to see these overlapping data times by chance, and the fewer, the less likely. So you've got this range in the amount of data, and you've got to take it into account.
Again, volume of data. Even if you're looking across the entire study, at the beginning the data is very sparse. And sometimes even at the end the data is sparse, because there aren't that many subjects. Some of these analyses may require a fair amount of data to be robust, and that data may just not exist, certainly not at the beginning of the trial, when it would be great to find these issues. The other thing is cleanliness of the data. Data does get cleaned up, and if you're doing some sort of analysis or monitoring before the data is cleaned, there could be signals that are either missing or false, one or the other, that wouldn't be there once the data was cleaned. And then, if you're going to do this type of monitoring or analysis, you need to be aware that there are systematic differences between centers. If you have a Japanese population versus a US population, the height and weight distributions are different curves, and you don't expect them to line up well, so you need to take into account things that are expected to differ. It's like our system, where we're looking at save times: we would expect to see some matching times where there's that regimented, alarm-driven behavior, and not elsewhere. So these systems have to be very carefully crafted. I think it's a fascinating topic, and something to be looked at in the future, so that we are not only collecting the best data we can, but making sure the source of that data is good data collection too.
So that’s all I have. I won’t pass it over yet. I will ask if there are any questions, and if anybody wants to ask anything about Harry Potter, I’m happy to answer that as well. Any questions or additional comments? Okay.
Thank you, Jill.
[END AT 37:53]