eCOA and Equivalence Testing: New Evidence from Meta-Analysis

January 22, 2016

Equivalence testing, the evaluation of the comparability between scores from an electronic version of a paper-based questionnaire and the original, has long been taken as a requirement for ensuring that data captured from electronic versions of patient-reported outcomes do not vary significantly from data captured on paper.

In this recorded webinar, experts from CRF Health and ICON discuss the results of a recently published meta-analysis which examined all published equivalence tests from 2007 up until 2013.

Learn about our findings and their potential impact on your future equivalence testing. 

  • What is equivalence testing and why do we do it?
  • How was the current meta-analysis run and what were the results?
  • What do these results mean in the context of the field of eCOA?
  • Is it possible to evolve to a state where equivalence testing is no longer required?

Full Transcript



Ladies and gentlemen, you are very welcome to this CRF Health webinar, in collaboration with ICON, entitled eCOA and Equivalence Testing: What the Evidence Tells Us. I’m very happy to be joined on this webinar by two of my friends and colleagues from ICON.

We’re joined by Willie Muehlhausen, who is the Head of Innovation. He has more than 15 years of experience with eClinical systems across a wide range of technologies, but more recently he’s been focusing on eCOA systems specifically. He’s worked at a number of eCOA vendors and is recognized as a subject matter expert in eCOA in the clinical trial industry.

Also, delighted to be joined by Helen Doll who is Director of Patient-reported outcomes over at ICON. She’s trained as a medical statistician, but with specialty in statistical and psychometric issues associated with patient-reported outcomes and the measurement properties thereof. She’s been involved in the development of a wide range of patient-reported outcomes across a range of different disciplines. And her main research interest is around the use of item response theory, and Rasch methodology in particular, for the development of those patient-reported outcomes.

And we’ve also got myself, Paul O’Donohoe, as Director of Health Outcomes for CRF Health based in our London office. And I’m responsible for providing and coordinating our scientific support internally but obviously for the clients that we work with closely during our clinical trials. I’m very passionate about driving this field of eCOA through research, including some of the research that we want to present to you today around equivalence testing in eCOA and some work that we’ve done with ICON exploring this particular issue.

Before we get going, a few housekeeping things. During the course of the presentation, if you have any questions, you should be able to see a box on the lefthand side of your screen where you can enter questions. No one else, apart from we who are presenting, will be able to see those questions. So if anything comes up during the course of the presentation that you would like more details on, feel free to enter a question there and we’ll do our absolute best to get to that at the end of the presentation. The presentation or a recording thereof will be made available automatically in a few days time, so keep an eye out for that. There’ll also be a short survey sent out at the end of the presentation. I think it’s only about three questions, so we’d really appreciate your time in completing that to help us continue improving these CRF Health webinars that we deliver.

The flow for the next hour or so: I plan to do a bit of scene setting, very basic scene setting, just to make sure everyone’s on the same page and knows exactly what it is that we’re talking about when it comes to equivalence testing. I’ll then hand over to Helen, who is going to dig into the details of exactly what research we did and the findings that we got out of it. And then we’ll hand over to Willie, who will wrap things up, give a bit of a summary of where things stand, and pose some questions about where we go from here. The plan is to have plenty of time for questions and answers at the end as well, because I know it’s a topic that, as well as creating some confusion, generates a lot of interest, so we’re very interested in hearing your thoughts on this issue as well as any questions you might have.

There’s going to be a couple of poll questions throughout, and I’m going to get started right away with one. This is just to give us a sense of the experience of those on the line with the various types of testing that we’re going to be talking about. So, have you run any of the below on electronic versions of patient-reported outcomes in support of a clinical trial? The options given are usability testing, cognitive interviewing, and what’s referred to as full equivalence or statistical testing, and you can choose all that apply. If you’re not certain what these different options mean, don’t worry, I’m going to touch on that in the next few slides. We’re just interested in seeing whether people on the line have experience in these areas, and if so which specific type of testing. Remember, you can tick all that apply. All right, just give it another second to get a few more people responding.


All right, very nice. So usability testing has the highest percentage, closely followed by cognitive interviewing, and then fewer, though more than I was expecting, have been involved in full equivalence testing. That ratio certainly makes sense, with equivalence testing being the least commonly performed type of testing, but it’s interesting to see that a number of people on the line have been involved in that level of statistical testing. We’ll talk in a bit more detail about these results and the specific work referenced in this particular poll.

I’m going to skip straight into another poll, and this is more scene setting in regards to exactly what we’re going to talk about here. It’s possible this question looks a bit odd on your screen, depending on how big your screen is, but it reads: Do you believe we are required to demonstrate equivalence of paper and electronic versions of PROs supporting primary or key secondary endpoints? Yes, no, or maybe. This gets to the heart of exactly what we’re going to be talking about during the presentation, so I’m very interested to hear how people feel about this particular topic. Good number of results coming in. Pretty definitive, in fact. Interesting. Okay, I’m just going to give that another second. All right. So the majority of people feel that we definitely do need to demonstrate equivalence of paper and electronic versions of patient-reported outcomes when they’re supporting primary or key secondary endpoints in a clinical trial: about 62% saying yes, 25% saying maybe, and very few saying no, which I think is very interesting. Okay, we’re going to revisit this result later on.

Let’s dive straight into the presentation. Like I said, I’m just going to do some very basic scene setting on the off chance that there’s anyone on the phone who is not aware of what electronic clinical outcome assessments, or eCOA, are. It’s basically when we administer traditionally paper-based patient-reported outcomes on an electronic platform, whether that be a hand-held smartphone system, a tablet system, a PC, a digital pen, or an IVR system. So it’s electronic administration of traditionally paper-based questionnaires. Traditionally paper-based because that’s been the predominant technology for past centuries. And obviously, as we move into a more technologically advanced society, we’re now seeing more and more of these questionnaires administered on these electronic platforms.

In the migration from paper to electronic, you do see a certain number of changes occurring. So we have an example here on the left hand side of a paper version of a questionnaire, the SF-36, and on the right hand side an electronic representation of that questionnaire. You can immediately see some quite drastic differences. We’ve now reduced down to a single item: only one question is being asked on a screen, as opposed to having multiple items on the A4 piece of paper. So arguably quite significant changes are occurring there. But even when you’re replicating the layout of the paper version almost exactly—so with certain tablet computers you can basically have an A4 piece of technology replicating the paper layout pretty exactly—even in those instances, subtle changes have occurred in how patients are interacting with the questionnaire. Very often they’ll be using a touchscreen, for example, as opposed to a pen and paper. Or they might be using a mouse with a PC-based version of the questionnaire. And obviously with something like an IVR system, it’s an even more fundamental change, where they’re interacting with voice or with the keypad on their phone.

So changes do occur, some quite subtle, some a bit less so, when you go from paper to an electronic version of a questionnaire.


So what does that mean? What do those changes mean to us in the industry? There are two papers that are basically our driving force in this space, as I’m sure you’re all well aware: the FDA Guidance for Industry on Patient-Reported Outcomes, which is kind of the fundamental document for how best to go about developing patient-reported outcomes in the first place, and the ISPOR ePRO Good Research Practices Task Force report.

So what do these papers actually say? Well, the FDA PRO Guidance, as well as talking about the best way of developing questionnaires, also touches, on page 20, on the fact that when a PRO instrument is modified, you generally should provide evidence to confirm the new instrument’s adequacy. And modification explicitly includes changing an instrument from paper to electronic format. So the FDA actually calls out going from paper to electronic as something that might require a certain level of additional work to demonstrate that it’s still “adequate.” Now, they don’t go into any more detail about how one might go about doing that, but that’s where the ISPOR ePRO Good Research Practices Task Force report comes in. It basically breaks down the amount of change that might occur when you’ve gone from paper to electronic, and the associated level of evidence that you might need to provide to demonstrate that no fundamental changes have occurred in how the questionnaire is behaving.

There’s this very famous table, which again I’m sure you’ve all seen, where they’ve broken down the level of change involved in going from paper to electronic, from minor and moderate to substantial, along with the associated level of evidence one needs to provide to demonstrate that you really haven’t broken or changed the fundamental properties of that instrument. The level of evidence obviously varies depending on the level of modification that’s happened to the questionnaire as it’s gone from paper to electronic. But they consistently talk about conducting usability testing, cognitive interviewing, and this so-called equivalence testing.

So what exactly are these things? Well, usability testing is basically examining whether patients or participants are able to interact with the hardware and software in the way that you expect and want them to. It’s to ensure that they’re able to interact with the software on the screen to provide you answers, but also able to interact with the hardware, to hold it and turn it on as needed. So it’s a very basic investigation of how patients interact with and use the device and the software, to make sure it’s as burden-free as possible. And the overall goal is to demonstrate that respondents can complete a full round of assessments as intended for that particular clinical trial, ideally in as intuitive and low-burden a way as possible.

Cognitive interviewing, which is typically done alongside usability testing, is basically a structured interview where you’re exploring whether participants are understanding and responding to the new electronic implementation of the questionnaire in the same way as the original paper version. So you’re exploring how patients are interpreting this so-called new version of the questionnaire, or at least a new way of administering the questionnaire. And ideally, they’re reporting that they’re interpreting and responding in the same way as the original paper version of the questionnaire.

And then we get to this so-called equivalence testing, which is a much more detailed statistical evaluation, looking at the scores from paper and electronic versions of the questionnaire to ensure that there are no statistical differences in how patients are responding. So you’re really trying to ensure that the scores on the electronic version do not vary significantly from the scores on the original paper version of the questionnaire.

The way I break these down in my head, I refer to all of this work as equivalence testing, because I think there is an element of demonstrating equivalence in all of it, but of different kinds. Usability testing and cognitive interviewing can really give you a sense of qualitative equivalence, that patients are interpreting and responding in the same way, whereas full statistical equivalence testing is more about quantitative equivalence, demonstrating that patients are giving you the same kind of data.

So we have all these different kinds of ways of exploring differences between paper and electronic versions of questionnaires. But unsurprisingly the equivalence testing, that quantitative piece, is a much more robust piece. You’re basically administering the questionnaire in a representative sample of patients, 50 or more, you randomize them to complete either the paper or the electronic version of the questionnaire, then you distract them somehow or you leave it long enough so that they’ve forgotten what they answered the first time, and then you administer the alternative version of the questionnaire. And then you statistically compare the results. This is kind of a very idealized way of doing it, there’s different ways of approaching the statistical quantitative testing, but this is just kind of one version, to give you a sense of how this work is done.
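The crossover design just described usually boils down to two statistics: the mean score difference between modes, and an intraclass correlation measuring absolute agreement. As a minimal illustrative sketch, with entirely made-up data (not from any real study), the following Python snippet computes both for a set of paired paper and electronic scores, using the two-way, absolute-agreement ICC(A,1):

```python
import numpy as np

def icc_agreement(x, y):
    """Two-way, absolute-agreement ICC(A,1) for one set of paired scores
    (x = paper administration, y = electronic administration)."""
    data = np.column_stack([x, y])          # n subjects x 2 administrations
    n, k = data.shape
    grand = data.mean()
    # Mean squares from a two-way ANOVA without replication
    ms_rows = k * ((data.mean(axis=1) - grand) ** 2).sum() / (n - 1)   # subjects
    ms_cols = n * ((data.mean(axis=0) - grand) ** 2).sum() / (k - 1)   # modes
    resid = data - data.mean(axis=1, keepdims=True) - data.mean(axis=0) + grand
    ms_err = (resid ** 2).sum() / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )

# Hypothetical example: 60 subjects, electronic scores close to paper scores
rng = np.random.default_rng(0)
paper = rng.normal(50, 10, 60)
electronic = paper + rng.normal(0, 2, 60)   # small mode effect + noise
mean_diff = float(np.mean(electronic - paper))
icc = icc_agreement(paper, electronic)
```

With simulated agreement this tight, the ICC lands well above the .85-.90 range the published studies report; the point of the sketch is only the shape of the calculation, not the numbers.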


And we have done a lot of this work over the years, so much so that in 2008, Gwaltney, Shields, and Shiffman were able to put together a meta-analysis looking at published equivalence studies between electronic and paper-and-pencil administrations. They found 65 studies in the literature and a weighted correlation of .90. So the relationship between how participants responded to the paper version and how they responded to the electronic version was very, very high, so high that they felt confident saying that “extensive evidence indicates that paper- and computer-administered PROs are equivalent.” Despite that—and this was eight years ago—we’ve still seen lots and lots of equivalence studies being done, and, as represented by that poll we did at the very beginning, we’re still seeing the perception that equivalence testing needs to be done in support of patient-reported outcomes used for primary and key secondary endpoints in clinical trials. 2008 was also quite a long time ago when it comes to technology, so maybe the technology has changed in the intervening period as well. So we basically wanted to replicate and build on what Gwaltney et al. did, following a very similar meta-analytic approach to explore the data from 2007 onwards, which is where the Gwaltney data finished: to see whether we were seeing the same thing, to explore a few additional details, to really get an understanding of what’s happening with these equivalence tests that we’re doing, and to explore the idea of whether we might have learned enough about equivalence testing at this point.

But I’m now going to hand over to Helen. She’s going to talk about in more detail exactly what it is that we did in this research. Helen.


Thank you, Paul. I’m very pleased to have the opportunity of talking about our research. But first of all I wanted to start by acknowledging the whole team who worked with me on this project, it very much was a team effort.

So to start with the aims of our study. As Paul said, we wanted to provide further evidence on the measurement equivalence of questionnaire scores collected on pen and paper versus electronic devices, and we did this by conducting a systematic review and associated meta-analysis of studies conducted after the Gwaltney et al. paper, so from 2007 onward. And specifically, we wanted to include data from instruments migrated to interactive voice response systems, IVRS; these papers had not been included in the Gwaltney et al. study. We also wanted to explore what factors might increase or decrease equivalence, and to what extent—and indeed whether—publication bias might be an issue. Our overall hypothesis was that equivalence between scores on pen-and-paper and electronic platforms would be highly similar to that observed in the Gwaltney et al. study.

So in terms of our methodology, briefly: we conducted a formal systematic literature review of PRO equivalence studies to identify all published data from 2007 to 2013, and we also included data from the gray literature, such as conference abstracts. We identified 1,997 publications; of these, 72 met our pre-defined inclusion criteria, did not meet our exclusion criteria, and were included in our meta-analysis. This compares with 46 unique studies identified by Gwaltney et al. in their earlier meta-analysis. Like Gwaltney et al., we then extracted information relating to correlation coefficients of varying types: intra-class correlations (ICCs), Spearman and Pearson correlations, and also Kappa statistics. And we extracted data on mean score differences, that is, the differences between the mean PRO scores reported by participants on the pen-and-paper and the electronic formats. We standardized these mean differences by either their standard deviations or the scale range.


We then calculated average correlation coefficients for each study, either over all types of correlation or just over the more statistically correct ICCs. In terms of the data analysis, which we undertook both in Stata and in the specialized program Comprehensive Meta-Analysis, we analyzed the correlation and the mean difference data separately. And we undertook either a fixed or a random effects meta-analysis depending on the amount of variability we identified in the data: a fixed effects meta-analysis if the I2 value was less than 75, and random effects otherwise. The I2 value is the amount of variability, on a 0-100 scale, that is due to heterogeneity rather than chance. We analyzed both the individual study data and the average scores, and we also excluded potentially outlying scores, identified as those with an effect size more than 3 standard deviations from the pooled effect.
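To make the fixed-versus-random decision concrete, here is a minimal illustrative sketch (not the authors' actual analysis code) of inverse-variance pooling with Cochran's Q, the I2 statistic, and a DerSimonian-Laird random-effects fallback when I2 reaches 75. In practice correlations would first be converted to the Fisher-z scale before pooling; this sketch just pools generic effect sizes with known variances:

```python
import numpy as np

def pool(effects, variances, i2_threshold=75.0):
    """Inverse-variance pooling of study effect sizes.

    Uses a fixed effects model unless I^2 (0-100) reaches the threshold,
    in which case a DerSimonian-Laird random effects model is used.
    Returns (pooled estimate, I^2, model name)."""
    e = np.asarray(effects, dtype=float)
    v = np.asarray(variances, dtype=float)
    w = 1.0 / v                                  # fixed-effect weights
    fixed = (w * e).sum() / w.sum()
    q = (w * (e - fixed) ** 2).sum()             # Cochran's Q
    df = len(e) - 1
    i2 = max(0.0, 100.0 * (q - df) / q) if q > 0 else 0.0
    if i2 < i2_threshold:
        return fixed, i2, "fixed"
    # DerSimonian-Laird estimate of between-study variance tau^2
    tau2 = max(0.0, (q - df) / (w.sum() - (w ** 2).sum() / w.sum()))
    w_re = 1.0 / (v + tau2)                      # random-effects weights
    return (w_re * e).sum() / w_re.sum(), i2, "random"
```

For example, three tightly agreeing correlations pool under the fixed model, while two widely separated ones trip the heterogeneity threshold and fall through to random effects.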

After this, we looked to see whether there were any modifying effects or factors such as mode of administration, year of publication, age of participants, study design, publication type, and time interval between administrations. And finally, we examined whether there was any suggestion of publication bias by using funnel plots and specific statistical tests.

So in this slide, we have a summary of our results. First, the mean differences: these are mean differences standardized first by the standard deviation, and then by the scale range. For those mean differences that we could standardize by a standard deviation, there were 307 of these, and we found that together they had a low I2, or variability, of 33%, and a fixed effects pooled standardized mean difference estimate of .037 of a standard deviation, which from the 95% confidence interval is likely to be no smaller than .031 and no larger than .042 of a standard deviation. Now, a greater number of standardized scores could be calculated using the scale range of the instruments (355 versus 307), because sometimes no standard deviation was presented in the publication. Standardizing the mean differences by the scale range, we identified an average difference of .018 of the range, which is .18 of a point on a ten-point scale, or .16 when each estimate was averaged over platforms. Also, the platform-specific mean difference estimate was within 5% of the scale range in 97% of the studies.
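The two standardizations just described are simple arithmetic, and can be sketched as follows (illustrative numbers only, not taken from any study in the review):

```python
def standardize_difference(mean_paper, mean_electronic, sd=None, scale_range=None):
    """Standardize a paper-vs-electronic mean score difference by the
    standard deviation (when one is reported) and/or by the instrument's
    scale range, as described above. Returns whichever were computable."""
    diff = mean_electronic - mean_paper
    result = {}
    if sd is not None:
        result["by_sd"] = diff / sd              # in standard-deviation units
    if scale_range is not None:
        result["by_range"] = diff / scale_range  # fraction of the scale range
    return result

# Hypothetical example: a .2-point difference on a 0-10 scale with SD 2.0
# gives .10 of an SD and .02 of the scale range.
example = standardize_difference(5.0, 5.2, sd=2.0, scale_range=10)
```

The scale-range version is why more scores could be standardized: the range of an instrument is always known from its design, whereas the SD has to be reported in the publication.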

Now in terms of correlations, we identified 435 individual correlations, which together were highly variable, having an I2 value of 93.8. So we fitted a random effects model, and our pooled correlation was .88. We identified 236 individual ICCs, ranging from .65 to .99; these were also highly variable, with an I2 of 92.2, and the pooled estimate was .89, with a tight 95% confidence interval of .88 to .90. The estimates were similar if we used average platform-specific correlations, of which we had 61 values: the pooled estimate was .88, with a 95% confidence interval of .86 to .90. In terms of outlying studies, we found 20, but if we excluded these we did not see any change in the pooled correlation estimates, which remained at .875, or approximately .88. And lastly, the pooled platform-specific ICC was .90, which is identical to the value found in Gwaltney et al.’s earlier study.


This next slide—and I hope you can see it, it’s quite a busy slide—it shows a forest plot of the 61 correlation coefficients averaged over each study within each platform. I think the main message from this plot is that there is clear variability in the estimates. And we have some 95% confidence intervals, either being very wide—i.e. imprecise—and/or not crossing the pooled value, this is the red vertical line.

So what factors may explain this variability? Here we have a table that shows the pooled correlation coefficient by pre-defined possible modifying factors. And each of these factors actually had a significant effect on the pooled correlation coefficient. This can be seen by the non-overlapping confidence intervals. And indeed each factor had a highly statistically significant effect at P<.001. We can see from this table that studies showing a greater equivalence—i.e. a higher correlation coefficient—were those studies conducted later, .89 versus .87. Randomized studies .88 versus .85. Those with data extracted from an abstract or a poster .90 versus .87. And those with a shorter time interval between administrations .90 versus .82. Now while each was significant, it’s clear that in terms of the size of the effect, the interval between administrations is most important.

Likewise, in terms of age, younger subjects tended to show less agreement—.79 versus .90 for those age 28 to 46 years. And while there was little effect of sample size, in terms of platform, the lowest agreement was between paper and IVRS platforms—.85. And the highest between paper and tablet, touch screens— .89.

Now, to come on to our investigation of publication bias, here we have the funnel plot. We found no significant effect, suggesting that publication bias was not actually an issue here. We did find 20 potentially excluded studies; however, these studies (highlighted in red in the plot) actually sit to the right of the pooled estimate rather than to the left, that is, they are larger than those identified in this review rather than smaller. This is likely to reflect the variability of the data, as we saw, with some studies having relatively lower levels of agreement than seen in the majority.

So just to conclude briefly: we found a pooled estimate of average platform-specific ICCs of .90, which is the same as the overall estimate reported by Gwaltney et al. in their earlier meta-analysis. Taken together, this suggests that PRO assessments administered on paper are quantitatively comparable with assessments administered on electronic devices. And the lower bound of the confidence interval means that it’s unlikely that the true correlation is lower than .88.

Now, this conclusion of comparability between the paper and electronic data was robust whether the correlations or the mean differences were examined, with the mean difference being equivalent to approximately .02 of the scale range (i.e. .2 of a point on a ten-point scale), and whether all means or average means across each study and platform were used. This also was identical to the mean difference identified by Gwaltney et al.


Finally, while we don’t have any information on the extent of the migrations—i.e. whether they were minor, moderate, or substantial—these results, with the high level of equivalence identified, should be reassuring to investigators, regulators, and sponsors. We did find that the agreement in studies migrated to IVRS devices was somewhat lower than in those migrated to other devices, but the agreement was nevertheless still acceptable. And thus, even for instruments with moderate modifications, such as migration to an IVRS, the results can be interpreted to question the necessity of conducting equivalence testing in the future.

So now, I will hand over to Willie to put these results in context and to talk about some ideas that we have for work moving forward. Thank you.


Yeah, thank you Helen. So where do we go from here? What do these results mean for us, and how are we going to move on?

We were trying to figure out how we can help, and how we can substantiate these findings even further. As Helen said, we did this with IVR as well, and IVR is a moderate, if not substantial, change according to the task force report. So we took a step back and looked at the instruments, and we also did that within the context of the ePRO Consortium a while ago. Currently, when we do equivalence studies, whether qualitative or quantitative, we always consider the instrument as a whole: we look at the instrument as a whole and we do the analysis on the instrument as a whole. But we know that instruments consist of different items, and these items can be categorized from a technical, or scale, point of view. We also know that not all items or domains within an instrument have the same level of equivalence. What we found during equivalence work that we’ve done ourselves, and also in the literature, is that there may be some variability within instruments across different domains. So an instrument is not necessarily an indivisible whole; we can dissect it. I’m a veterinary surgeon, so I like to dissect things anyway. So what we did in the ePRO Consortium is look at other ways of approaching this: if you wanted to look at the individual pieces, how would you do that?

So in the ePRO Consortium we defined widgets. Widget is a term we know from a technology point of view, and a definition can be found, for example, on Wikipedia; I’ll come to that in the next slide. The reason we did this is that BYOD is being discussed widely in the industry, and probably by now also applied in many different ways, and we were wondering how BYOD affects the current ways of doing equivalence testing. BYOD basically means that we have little or no control over the actual device that patients will use to access and answer our questionnaires. Screen sizes can vary from a 3.7- or 4-inch device to a 50-inch or larger plasma or LED TV, so we don’t know what patients will use. The questionnaires can be accessed at home, in the doctor’s office, in the office while at work, or on the run on a mobile device; and they can be completed online or, if patients use an app, offline. So there are different ways for patients to access them, and some of these factors we have under control, others we do not. In the extreme, patients could actually use different devices at different times during a clinical trial: at home or in the office they could use a web system, while on the run, or at home, they might use a tablet or a phone. So these are scenarios that are out there, and we thought we just have to look into how they affect equivalence.


We obviously cannot test all these different scenarios on all the different devices that are out there. For example, when you look at the iPhone family, there are probably by now the iPhone 4, 5, and 6, the 6 in two different sizes. So that’s four or five different devices, which is manageable. Then you add a couple of iPads and now we’re talking about 10 or 12 devices. But when you then look at the Android market, we’re probably adding hundreds if not thousands of devices with different screen sizes. So we figure that BYOD can only really be applied, can only really fly and help us in the industry, if we can figure out a way to demonstrate that these different parameters that BYOD brings with it don’t affect equivalence.

So what are we going to do about that or how are we going to go about that, and that’s what I’m going to talk about now.

So now we come to the widgets. As I said, every questionnaire consists of different scales or sub-scales. From a technical point of view we talk about widgets, and a widget is a graphical control element (there are different terms for it), but it’s basically part of a user interface. On the programming side there are libraries; for example, you’ve all heard of ResearchKit, which has a certain library of these GUI elements or widgets. Now, not all of them apply to what we do. In the ePRO Consortium we put a paper out analyzing, I think, about 100 qualified questionnaires that are out there, and basically there are three, probably four, core elements: the verbal response scale, or adjectival scale; the numeric rating scale; and the visual analogue scale. These are the main scales we currently use in the development of instruments in our space, so actually a small number, with three. Now obviously there are other elements: you can have a date picker or a time picker, or just a multiple choice question that is not considered a scale. So there are others, but again, there are not that many of them. And we basically homed in on these three because we felt they were special, that we should look into them, and we’ll continue to do so.

So the idea, or the underlying assumption, is that patients understand the concept of a numeric rating scale. We’re just using the numeric rating scale as an example here, because we don’t have enough time to go through all of them, but it’s probably a good example. And how do we know that patients actually understand it?

So first we’d like to define what a numeric rating scale is. A numeric rating scale has certain elements, an element being a sub-unit of a widget. We have the question as number one. We have the different response options, 0-10 in this case. And we have textual anchors, left and right. And that’s about it. Sometimes you may have a third text anchor in the centre, but that’s basically it. These are the key elements. And patients do understand these, because a numeric rating scale is used in dozens if not hundreds of questionnaires, across different therapeutic areas and different patient populations, and we’re currently doing some inventory work, which I’ll come to later, to substantiate that. So the idea is: patients do understand what a numeric rating scale is. And if they understand the concept of the numeric rating scale, then it doesn’t really matter what these rating scales look like, as long as they have the same elements. On this slide you’ll see that if you do a quick Google search for “numeric rating scale” and, instead of the links, click the button that says images, you will get hundreds of different images of numeric rating scales; here are just four examples, some of which we made up ourselves because we have used them in the past. When you look at them, you will see that numeric rating scales as they are being used right now across different questionnaires can look very different from each other; these four don’t look the same at all. But they all have the same elements: the question, the answer options, and the text anchors. So again, obviously, patients do understand the numeric rating scale concept, and therefore the actual look and feel does not affect the way patients answer these questions.
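One way to picture the "same elements, different look and feel" argument is to model the NRS widget as a data structure: the question, the numeric response options, and the textual anchors are the invariant elements, and everything visual is left to whatever rendering layer displays it. This is our own illustrative sketch, not an ePRO Consortium specification:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class NumericRatingScale:
    """Core elements of an NRS widget, independent of visual presentation."""
    question: str
    left_anchor: str                      # e.g. "No pain"
    right_anchor: str                     # e.g. "Worst pain imaginable"
    low: int = 0
    high: int = 10
    center_anchor: Optional[str] = None   # the occasional third anchor

    def options(self) -> List[int]:
        """The numeric response options patients can choose from."""
        return list(range(self.low, self.high + 1))

# A hypothetical pain NRS; any rendering (web radio buttons, a touchscreen
# row of numbers, a spoken IVR prompt) would be driven by this same instance.
pain_nrs = NumericRatingScale(
    question="How would you rate your pain over the last 24 hours?",
    left_anchor="No pain",
    right_anchor="Worst pain imaginable",
)
```

The design point is that two very different-looking administrations share one underlying instance, which is precisely the sense in which the widget, rather than its appearance, carries the measurement properties.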


So with that in mind, we are now planning some additional research. We are currently analyzing widgets and prompts. We have a list of, let's say, the top 100 questionnaires out there as they're being used, and we're doing an inventory, this is ongoing work, of how many of them use an NRS, how many use a visual analogue scale, and how many use a verbal response scale. We also look into the actual therapeutic areas, so we can analyze it by therapeutic area, by patient population, and by a couple of other parameters. And we will publish that as soon as we have the results. As part of our meta-analysis, we also wrote to the authors of all 72 projects that we analyzed, asking them to send us screenshots and scripts of the original paper versions, so this is still ongoing work. We're analyzing those as well, because we know from the published results that those versions were equivalent. So if we can show how different these scales can be while still sharing the same basic elements, then we're back to this: if the concept of a numeric rating scale, or any of the others, is known to the patient and acceptable, then we probably don't have to do what we currently do with equivalence studies. We're also analyzing data within ICON from the cognitive debriefing and usability tests that we have done over the past three, four, five years. And we have reached out to some of our friends in other companies, pharma and also consultancies and CROs, to pool that with other data; we're obviously not the only ones that do cognitive debriefing and usability testing.
So we're trying to prove and harden the assumption that patients do understand these concepts. If they understand the concept, then it doesn't really matter what the question is, it doesn't matter who the patient is, and it doesn't really matter what the answer options or the anchors are; they will answer it properly. Now again, that doesn't mean you don't have to do all the hard work when you develop the instrument in the first place. But when you then migrate to an electronic platform, or from one electronic platform to another, you may not have to do any more testing; that's basically the ultimate outcome we want to get to. We have also conducted a couple of studies, not yet published but about to be, where we built new user interfaces that are very different from what we see on IVR or hand-held devices, and we found equivalence there as well. So we come back to our original hypothesis: if patients understand these scales, the concept of these scales or these widgets, then they will answer the questions appropriately, and the widgets will not add bias to the actual questionnaires or instruments. One question that we still need to answer is whether there are certain patient populations, or specific characteristics of patient populations, that would influence the equivalence between devices. You saw that we looked at age, for example, the age distribution. And our point is that age is not necessarily a parameter that affects equivalence, or affects how patients will answer the questionnaires. I think it's more about visual impairment, cognitive impairment, or dexterity challenges that may affect equivalence. So that's where we're back to usability testing.
So if a patient can't see the questionnaire or the answers, or can't understand and comprehend it, or has issues with dexterity, then that may well affect the way questionnaires are answered. But then we're back to BYOD. If we allow these patients to use their own device, which they presumably know how to use, otherwise they wouldn't have it, and there are special devices out there with bigger fonts and bigger screens and so on, then the question is: do we really still need to do usability testing, or can we assume that patients who use their own devices know how to use them?
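For readers unfamiliar with what the quantitative side of an equivalence study actually tests, here is a minimal sketch. This is not the analysis used in the meta-analysis discussed above; the paired scores, the ±0.5-point margin on a 0-10 NRS, and the confidence-interval inclusion rule (equivalent to two one-sided tests at alpha = 0.05, under a normal approximation) are all illustrative assumptions.

```python
import math
import statistics

def equivalence_ci(paper, electronic, margin=0.5, z=1.645):
    """Check equivalence of paired paper vs. electronic scores.

    Declares the two modes equivalent if the 90% confidence interval
    for the mean paired difference lies entirely within
    [-margin, +margin]. Uses a normal approximation (z = 1.645 for a
    90% CI), which is reasonable for larger samples.
    """
    diffs = [p - e for p, e in zip(paper, electronic)]
    mean = statistics.fmean(diffs)
    se = statistics.stdev(diffs) / math.sqrt(len(diffs))
    lo, hi = mean - z * se, mean + z * se
    return (-margin <= lo) and (hi <= margin), (lo, hi)

# Hypothetical paired 0-10 NRS scores from the same patients
paper      = [3, 5, 7, 2, 8, 6, 4, 5, 7, 3, 6, 5]
electronic = [3, 5, 6, 2, 8, 6, 5, 5, 7, 3, 6, 5]
equivalent, ci = equivalence_ci(paper, electronic)
```

The point of the sketch is the logic, not the numbers: equivalence is concluded only when the whole interval of plausible mean differences fits inside the pre-specified margin, which is a stricter claim than simply failing to find a significant difference.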


So even after we've done that research, there are a lot of questions we still think we need to find answers to. But currently, all the data that we have from finished and ongoing research would indicate that instrument developers, and the companies and anyone else doing the migration, can't really mess it up. You would have to make really severe changes, and we haven't found any changes severe enough, to cause any non-compliance or non-equivalence. So the current thinking is that all the testing we currently do, and that we have done over the last few years since the guidance came out, and even prior to the guidance, was done because we didn't know any better and we didn't have the data to prove otherwise. I believe that with the data we now have, and that others have, and that will come out over the next year, year and a half, two years, we probably have enough to really revisit the need for equivalence studies, even as far as cognitive debriefing and usability testing. And my prediction is that we won't have to do that anymore moving forward. But we will have to show that, and that's what we're doing more research on.

So again, here's the conclusion; I've just talked through most of this. We will do some more research, and we're looking for partners every now and then to do that. We have a lot of data, but in this case more data is better. We believe that moving forward, if you follow some very basic rules about how to migrate or even how to develop an instrument, and as a guideline you can use the ePRO Consortium white papers, there's one for migration and one for development, then all you may need is an expert review to assure that the developers followed these guidelines and didn't introduce any funky elements. If you do that, I think we're good, and I believe the research we're currently doing will support that; we already have some data that does. And my recommendation for instrument developers, when you Google "numeric rating scale" and look at the pictures as I suggested, is: don't be too creative, keep it simple. Simple is better in this case.

And with that, I’ll hand it back to Paul.


Lovely. Thank you kindly Willie, and thank you very much Helen as well. I think Willie very neatly summarized my own take on the situation. I think we've widely overestimated the impact that going from paper to electronic might actually have on how patients respond. The meta-analysis gave us quite rigorous statistical evidence, and the cognitive interviewing and usability testing work covers what the meta-analysis, as Willie said, didn't really touch. When you really step back and look at all of it, it's actually quite difficult to not get equivalence when you're going from paper to electronic. Patients are responding based on how they feel, on their symptoms, and the way they go about giving you that data doesn't, I think, have an undue impact on how they're going to answer and what they're going to tell you.

So on that note, we wanted to revisit the poll question from the very beginning. Again, it might have been cut off a bit on your screen depending on your screen size, but the question is: do you believe we are required to demonstrate equivalence of paper and electronic versions of PROs supporting primary or key secondary endpoints? So yes, no, or maybe. We'll just give that another few seconds. As a reminder, when we ran the poll at the very beginning, 62% of people said yes, it was a requirement to demonstrate equivalence, and 25% said maybe. So let's see how we're looking now. Very nice. Things have balanced out slightly, which suggests we're very, very persuasive speakers. Based on the evidence we've presented here, people also seem to agree that there isn't as much of a requirement as perception within the field might dictate. And I think it makes a lot of sense that more people have moved to the maybe camp, because there are always going to be specific cases, situations where you might want to demonstrate equivalence of some kind, maybe not the full quantitative equivalence, but certainly the qualitative equivalence. I think that's a very interesting finding, and I appreciate everyone providing their feedback.

Okay, we’re going to open it up to a Q&A quickly.

[Q&A section begins at 50:30]
