
Case Study: eCOA Equivalency Testing - Autism

July 22, 2019


Rebecca Grimes

Hi everyone. As Bill just mentioned, I’m here today to talk to you about a case study that was a collaborative piece between CRF Bracket, Adelphi Values, and a pharmaceutical sponsor earlier in the year. Just to give a bit of background, CRF Bracket initially worked with the pharmaceutical company to develop an electronic version of an existing clinical outcome assessment for use in a clinical trial in autism spectrum disorder for children and adolescents. This was a bit unusual in that it is a clinician-completed, observer-reported outcome. So for the purposes of today’s presentation, I’ve referred to it as kind of a ClinRo-ObsRo. And the reason we’ve called it this is because the clinician informs their answers by interviewing the parent or caregiver of the patient, rather than through direct observation of the patient. As emphasized in the FDA’s PRO guidance for industry, whenever a COA is administered in a new mode of administration, evidence must be generated to show that the measurement properties are equivalent or superior to the original version. Therefore, we then collaborated with the sponsor and CRF Bracket to conduct some qualitative research, assessing the equivalence of this new eCOA compared to the existing paper version of the measure.

Just to provide a bit of background about the instrument itself, it’s an interview-based assessment. And as I mentioned previously, it’s designed to be administered by a clinician through interview of the parent or caregiver, and that clinician is usually a psychologist or a neurologist. This measure is split into three core domains. It assesses socialization, communication, and daily living skills. And it’s generally used in the diagnosis and classification of a number of learning difficulties and developmental delays, such as autism and Asperger’s syndrome. Although this measure can actually be completed with patients of all ages, for the purpose of our study we assessed patients who were aged between 3 and 18 years, just to reflect the target clinical trial population. And completion of this measure is quite unusual, as it varies considerably depending on the patient in question. Therefore, completion time with this measure can be anything from 20 to 60 minutes in length.

Just to explain that completion in a bit more detail, each of the three domains that I just mentioned is split further into sub-domains, and then these are split even further into age categories. So the administrator begins the sub-domain at the appropriate point for that patient and their age. And they work through the measure, scoring each item as follows: they score an item 2 points if the individual can usually perform the behavior independently, they score 1 point if they can sometimes or partially complete the behavior independently, and 0 points if they can never complete the behavior independently. And there are also options for “don’t know” and “no opportunity,” where the patient has not had the opportunity to perform that behavior in their life. The administrator works through the sub-domain from their chosen start point, and they go down until a point where they score four 0’s in a row. The lowest of these items is then defined as the ceiling item, and they stop there. They then go back up to their start point and work backwards through the measure until they score four consecutive 2’s. And the highest of those 2’s is classed as the basal item. And it’s the difference between the basal and the ceiling item that gives the sub-domain score.
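The basal and ceiling stopping rules just described can be sketched in a few lines of code. This is a minimal illustration of the logic as stated in the talk, not the instrument’s actual scoring implementation; the function names, the list-of-scores representation, and the fallback behavior when no run is found are all my own assumptions.

```python
def find_ceiling(scores, start):
    """Scan forward from the start item until four consecutive 0s are scored.
    Returns the index of the ceiling item (the lowest, i.e. furthest, of the four)."""
    run = 0
    for i in range(start, len(scores)):
        run = run + 1 if scores[i] == 0 else 0
        if run == 4:
            return i
    return len(scores) - 1  # no four-0 run found: assume the last item acts as the ceiling

def find_basal(scores, start):
    """Scan backward from the start item until four consecutive 2s are scored.
    Returns the index of the basal item (the highest of the four)."""
    run = 0
    for i in range(start, -1, -1):
        run = run + 1 if scores[i] == 2 else 0
        if run == 4:
            return i + 3  # the run occupies indices i..i+3; the highest is i+3
    return 0  # no four-2 run found: assume the first item acts as the basal

# Hypothetical item scores for one sub-domain
# (2 = usually, 1 = sometimes/partially, 0 = never)
scores = [2, 2, 2, 2, 1, 2, 0, 1, 0, 0, 0, 0, 1]
start = 5  # hypothetical age-appropriate start point
print(find_basal(scores, start), find_ceiling(scores, start))  # prints "3 11"
```

The sub-domain score is then derived from the items between those two indices, per the description above.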

The electronic version was designed to be part of a trial management application whereby subjects’ information at each visit could be collected in terms of clinical and subjective measures, and it allowed for a healthcare professional to complete the measure as part of a trial visit. The electronic version was designed to reflect the paper version as closely as possible, and the rationale for using an eCOA in this planned trial was to allow for real-time data collection and transfer.

The ePRO Consortium is something that’s been discussed quite a lot over the last couple of days, so I just wanted to go into a bit more depth about some of the best practices regarding these instruments.

First of all, instructions. It’s acknowledged that modification of instructions may be necessary to ensure that they make sense in the new target mode. So for example, where a paper version might ask you to circle an item response option, the electronic version would ask you to select one. And it’s kind of advised now, when developing a new instrument, to make language as neutral as possible so as to avoid these necessary modifications.

In terms of item stems. On a paper version of an instrument, sometimes you’d get a stem, for example, saying “During the past four weeks, how often has…” and then you’d have a series of items. On an electronic version, with scrolling through the instrument, or with a different item being presented on a different page, it’s important that the item stem is repeated for each item. And this prevents any issues with patient recall when completing the measure. Also, if a full item cannot fit on the electronic screen, it’s acknowledged that there should be workarounds, such as hover notes or scroll bars.


Not all response option scales are actually suitable for migration to an electronic version. And it’s recommended that each item has kind of an active response. This means that the person completing the measure cannot go on to the next question without actively selecting a response. And this prevents any passive responses from scrolling too fast or clicking “next” twice by accident. And as such, it’s also quite beneficial on an electronic version of an instrument to include edit checks, and these alert respondents of any missing data and can prevent the measure from being finished with large amounts of missing data.
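As a minimal sketch of what such an edit check might look like (the function names and the use of None to mark a missing response are my own assumptions, not the actual platform’s code):

```python
def missing_items(responses):
    """Return the indices of items with no actively selected response."""
    return [i for i, r in enumerate(responses) if r is None]

def can_finalize(responses):
    """An edit check: the assessment can only be finished once every item
    has an active response (no passive skips from scrolling or double-clicks)."""
    return not missing_items(responses)

responses = [2, None, 0, 1, None]  # None marks items the respondent skipped
print(missing_items(responses))  # prints "[1, 4]"
print(can_finalize(responses))   # prints "False"
```

In practice the platform would surface the flagged indices to the administrator so they can return to and complete those items before submission.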

For the migration that took place, as I mentioned before, the aim was to keep it as close to the paper version as possible. So aspects such as item wording and the layout were unchanged. But there were a few modifications that were made just to allow for it to be completed on an electronic platform. The Notes section on the paper version was five or six lines at the bottom of the page that allowed the administrator to make freehand notes about the patient. On the electronic version, there was a pop-up box that jumped out to the side of the screen and allowed the administrator to make notes about the overall instrument, and they could then close that and proceed with the assessment. And there were also additional comments sections with each corresponding item where item-specific notes could be made.

In terms of the information, additional information to help answer items on the paper version can be found in appendices that aren’t attached to the instrument, whereas on the electronic version additional information was presented with a small “i” icon next to the corresponding item, and it could just be popped out, and the administrator could read that information, close that box, and carry on with the measure.

Scoring of the paper version was done manually using a kind of scoring algorithm box at the bottom of the page. And on the electronic version, scores were automatically generated. And as I mentioned on the previous slide, the electronic version had an edit check function to prevent missing data.

We saw this slide yesterday, and it comes from the ISPOR ePRO good research practices task force. And it just explains the level of evidence that is required, based on the modifications that were made during migration to a new format. Just to go into a bit more detail with it again today, if the changes are minor—so things like, as I mentioned, changes to the instructions so that they made more sense in the new format, and minor changes to the formatting—then cognitive debriefing and usability testing are the two main levels of evidence that should be generated. For moderate changes—these are things like small changes to item wording or changes in mode of administration that would change the cognitive processes that take place—then it is also recommended that equivalence testing is performed. And for substantial changes—these are changes to the wording of the items and to the response options—then full psychometric testing must take place, as you are basically generating a new instrument. So, looking at these, we came to the conclusion that the changes that were made for this measure were minor. And therefore, for our study we wanted to focus on cognitive debriefing and usability testing.

Just to explain those in a bit more detail, cognitive debriefing assesses whether the ePRO changes the way the respondent interprets the questions and then selects their answers. Usability testing demonstrates that the computerized assessment is completed as intended. And then equivalence testing, which we didn’t need to do, evaluates the comparability of the scores generated from the two formats of the measure.

The overall objective of our study was to evaluate conceptual equivalence between the electronic version of the measure and the original paper version to support the use of the electronic version as a primary endpoint in ASD clinical trials. This was achieved by conducting cognitive debriefing interviews with healthcare professionals that had previous experience administering the paper version of the measure. So we wanted to look at their understanding and interpretation of the eCOA version and how that was comparable to the paper version. And then we also wanted to explore the usability and acceptability of the new electronic version.


First of all, we obtained ethical approval from a centralized IRB. And then we used a third-party recruitment agency to identify eligible HCPs. They provided informed consent, and then they completed a kind of screener where they answered some questions relating to years’ experience and the kinds of patients that they treat. And then we scheduled interviews with ten participants based in the US. These interviews were two-hour, face-to-face, semi-structured cognitive debriefing interviews, and these were audio recorded and then transcripts were generated, and then we analyzed the transcripts by thematic analysis using software called Atlas.

As I mentioned, participants were recruited using a third-party recruitment agency from two key locations in the US. There was a Midwest location and an East Coast location. And we used the following eligibility criteria. We looked at HCPs who were currently practicing in the US. They had either a doctoral degree or a master’s degree in a relevant field. They must have had a minimum of three years’ experience administering the measure in question, and also a minimum of three years treating patients with a reduced mental capacity.

This is just some detail about the interview process itself. It was broken down into four key sections. First of all, we had a 10-15 minute training session, where the interviewer showed the participant how to use the electronic device itself. And then we had a 10-15 minute role playing exercise, where the interviewer and the participant completed sections of the measure together in a kind of mock way where the interviewer acted as the caregiver of someone with autism spectrum disorder. And then we had two cognitive debriefing sections, and these were each 45 minutes in length. In both cases, the participant completed sections of the measure using a think-aloud process. And just to avoid any completion bias, we alternated it so that half the participants completed the paper version first and then the electronic version, and half completed the electronic version first and then the paper version.

Just to provide a bit more detail about each of those sections individually, the interviewer covered basic training topics in this section, to ensure that the participant was familiar with the electronic device. And we tried to reflect the level of information that they’d be given if they were to be asked to complete the measure as part of the trial itself. So we covered topics including how to select responses, how to populate the notes and comments boxes that I mentioned earlier, navigation between screens, opening the questionnaire, and using the stylus for the tablet.

And the mock administration exercise. The interviewer acted as the caregiver of a child with ASD. And this just allowed for the participant to get some hands-on experience of using the eCOA measure. We alternated the domain that we covered during this section of the interview for each successive interview, just to get broad coverage of the instrument. And we also varied the age of the hypothetical patient each time.

You might be wondering why we used hypothetical patients for this. First of all, this meant that the interviewer could dictate the age of the patient in question, and as the completion of the measure varies based on the age of the patient, this meant that we got good coverage of each sub-domain in terms of the start point and the end point, and how long each one took to complete. Secondly, we felt that using real patient data would have led to an emphasis on the item content, where they were really thinking about what answers to select. And considering that item wording is unchanged from the paper version, which is a well-established and commonly used measure, this was not a key focus of our interviews. And finally, collecting real patient data would have required parent and caregiver consent, and that would have extended timelines and added costs.

And the cognitive debriefing sections. I’m not sure if many of you are aware of what a cognitive debriefing interview actually entails. We use a think-aloud technique where the participant is asked to read out an item, select a response, and speak to that response if they have any comment to make, and then the interviewer asks them further questions about the item and why they selected their response. For the first aspect of cognitive debriefing, we wanted to look at interpretation and understanding of items and to check if response options were consistent between the two versions, so we asked questions such as those in the first blue box—why did you select your answer, how easy or difficult was it for you to select an answer, things like that. And then for the second aspect, we wanted to look at the usability of the device itself. We asked questions like, overall, what was it like for you to complete the questionnaire on the electronic device.


And now, just looking at the results that we had. First of all, just to summarize the sample characteristics for you, we had eight females and two males in our sample, and there was a broad range of educational levels and occupations. The experience in current role varied considerably, from 4 to 34 years. And in terms of administration of the measure in the past 12 months, the majority of participants had completed it 0-5 times, and then we had one who had completed it 6-10 times, and then one participant who had completed it 20-30 times.

First of all, looking at the instruction page of the measure, understanding was consistent for both versions, and all participants showed a good understanding of these instructions. And then we asked them how they found reading the instructions, and two participants felt that on the paper version the instructions were too long. And then two participants—and this is the same two participants—suggested changing this by breaking the instructions down into bullet points. And then in terms of the electronic version, one participant explained that the instructions were a little bit hard to follow. And three participants suggested minor changes: to add colors, bullet points, and examples, respectively. But obviously, the instruction page was formatted to directly match the paper version, so the text was presented in the exact same way.

To summarize, although three participants did suggest minor formatting changes to how the instructions are presented on the electronic version, there were no apparent differences between how they interpreted the instructions or how they understood them. And it appeared that any issues that the participants did have with the instructions were kind of preferences for each of the versions, and it didn’t actually affect their ability to read the instructions or understand what the instrument was asking them to do.

And then, for the debriefing section, given that this measure is very long—it can take up to 60 minutes to complete as it is—we decided to kind of prioritize one domain of the measure for each interview. So we alternated those. And we asked participants, for this, to debrief the domain thinking of a patient that they knew particularly well or that they had seen recently. And you might be wondering why we didn’t use hypothetical patients for this. The reason is, for cognitive debriefing, we wanted them to think aloud, vocalize their own thoughts, and focus on the instrument itself. And if they spent that time interviewing the interviewer, as it were, there was a likelihood that the interviewer could bias that in some way and influence the participant’s perception of the instrument. And order of completion, as I mentioned before, was also alternated to avoid any risk of order effects.

We looked at the different domains, as I mentioned, and we looked at aspects such as item comprehension, selecting responses, and item layout. We kind of concluded that there were no significant differences in participant completion across the three domains, and we could draw consistent conclusions across them. To summarize the whole measure for you, there were no notable differences in item comprehension between the paper and electronic versions. And all participants showed a good understanding of the items across all the domains. There were no identified differences with response selection, and all participants explained that they would select a response the same way for the two versions. And there were also no differences identified regarding item layout for either version. But it’s mentioned as a footnote that these findings aren’t really surprising, given that the item layout and wording were the aspects of the instrument that were to remain consistent from the paper to the electronic version. These findings were really more a confirmation of that.

And then, as I mentioned before, additional information for the paper version is found in a separate appendix. And we asked eight of the participants if they were likely to use this appendix when they were completing the measure in clinical practice. Two participants explained that they would use the appendix as it was intended. Three explained that they would refer to the appendix if they were unsure of how to answer an item, but they would only refer to it after the consultation, kind of retrospectively. And because of the way that the measure is completed, where you stop in the domain when you reach your basal and ceiling items, if an item then needs to be changed after that, then you’ve impacted your scores. Two participants explained that they would never use this appendix, and one participant explained that they would only use it very rarely.


But then, comparing that to the electronic version, where the information was provided alongside the item, nine participants mentioned at least one occasion during the debriefing where the information box helped them to select a response or made them change their response. And eight participants spontaneously told me that having this information readily available provided more valid and appropriate responses. There’s just a quote at the bottom there, just to show that the person checked the information box and then changed their answer because they understood the item more.

And then, for the paper version with the manual scoring box at the bottom of each page, the eight participants who calculated a score manually were asked if they found it easy or difficult to do so. Four participants explained that they found it easy to calculate the score manually for the sub-domains, whereas four explained that calculating the score was difficult. But interestingly, two of the four participants who found it easy actually calculated the score wrong. I was observing them, and they used their own methods. And then, seven participants additionally specifically explained that they disliked manual scoring. The reasons that they gave were worrying that the score was wrong, finding the system itself confusing, and that the display of the algorithm box was not user friendly. Additionally, seven participants explained that they had occasions in the past where they’d completed the measure in clinical practice and there had been errors.

On the electronic version, as I mentioned, scores are calculated by the computer. And all ten participants explained that they liked this feature. So they explained that it increased accuracy, they found it time saving, and that it was an easy-to-use function. Three participants suggested minor changes to the summary score table that was generated at the end, and that was just using color coding to allow for easier reading and including reference scores. If the scores and data from this measure are being uploaded to a trial manager application, then clinician interpretation of the scores isn’t wholly necessary for the purpose of the trial.

Overall, the participants showed a clear preference for the electronic version and how it calculated the score for them. And mainly this was because it gave them more confidence that their score would be accurate and error free.

Then finally, we just looked at some aspects of usability of the device. We gave participants free rein to use the stylus or the touchscreen feature of the tablet as they would, because it would be their own personal preference when they were completing it. Seven participants opted to use the stylus, and six of those found it easy to use. There was just one participant that found it a little bit difficult. The remaining three participants that chose to use the touchscreen all found this feature easy to use. All participants felt that they would be comfortable completing the measure in their own time without the assistance of the interviewer. And every participant felt that it was easy to open the questionnaire and it was easy to navigate between items.

Just some final conclusions from that. The research that we conducted provided strong evidence that the electronic version of the measure was conceptually equivalent to the paper version in this population of interest. There were no issues with the usability or acceptability of the electronic version noted from the interviews, and all the healthcare professionals explained that they would be comfortable completing this measure on their own. The findings of this research provide a bit of evidence that the electronic version of the measure could be used instead of the paper version if implemented in this clinical trial.

So thank you for taking the time to listen to me today. Does anyone have any questions?

[Q&A Section 24:10]


If I understand this assessment right, there are different sections. And you had them looking at a specific sub-section. But did you get any feedback on their ability to jump across sections, so not necessarily do it Section 1, Section 2, Section 3? We require that patients go through it systematically, but when I’ve seen some of these interviews done, they’re jumping around. Was there a mechanism by which they could jump around? Or was it not addressed?


There is a navigation aspect of the tablet, where the clinician can actually choose the section that they want to complete. And then the edit check function at the end kind of flags if they haven’t completed any of those properly, and they can move around those and complete them in their own order, based on their own preference.


Congratulations, that was a really well methodologically designed study and really interesting. I’m just wondering, qualitatively it seems like you’ve established the validity, but have you actually looked at the data? Was there any stuff about looking at regression analysis or doing a Bland-Altman to see if the true concurrent validity held up between the two methodologies in terms of where the mean scores were distributed?


No, we didn’t look at any quantitative aspects of it. That would have been the equivalence testing that would have been required if the modifications were defined as being moderate rather than minor. But yes, if the instrument had been migrated with more modifications made, then that would have been necessary. But as per current guidelines it’s not required.


Do you think you will, just to see?


I mean, there could be scope for doing so, just to kind of evaluate the equivalency of scores, particularly seeing as there was evidence that kind of the scores generated from the electronic version may be more accurate with the accessibility of the information boxes. It might be interesting to see if there was actually some quantitative way of comparing the scores.


Yeah I only ask because I think—I mean this is great and I think this is fine to go with the guidance. But for our own sort of establishing validity it’d be nice to be able to say not only does the diagnostic guideline say we did this, but we took it beyond that and actually we can show quantitatively that these things are truly concurrently valid psychometrically.


Yes, definitely. And I think that would be especially important if there would be instances in clinical practice where the paper version and the electronic version were both being used by different clinicians, whereas in the clinical trial the aim was to directly replace the paper version with the electronic version and only generate electronic scores. But if there were instances where both versions of the measure would be used concurrently, some kind of sensitivity analysis or something could be important to look at the scores.


Great, thanks very much.


One more quick question. You had IRB review of it? I was a little surprised to see that. Is that normal? They aren’t patients, they’re clinicians—so why IRB review?


I don’t know, it’s just standard procedure for our own studies to generate ethical approval. And at the time when we first obtained ethical approval we were still trying to decide whether we wanted to look at hypothetical patients or to look at real patient data. So we kind of got it—we added it in as a methodology step to get ethical approval before we made those kind of design decisions.


It sometimes helps with publication if you want to publish the results, right, even if you don’t actually need IRB approval, but sort of got it, journals like that.

[END AT 27:59]

