Case Study: Physical Performance Outcome Assessments

July 22, 2019


So I’m pleased to speak with you today about a study that we conducted with Nicki Bush and colleagues at Eli Lilly. And I’ve previously presented on this, and we’ve had publications out recently, so I’m not going to go too much into the details of the study methods and results. Really what I want to focus on is some of the challenges that we had in our performance outcome assessment study, and think about how that relates to digital health and wearables as well.

So at this point, as we’ve all been sitting down for a while, I thought I’d like to invite you to take part in one of the assessments that we included in our study. And if you do this, have a think about what it feels like, and what you think it may be measuring.

So this is the repeated chair stand, where you start with your arms crossed across your chest like this, sitting on a chair, and then standing up fully—don’t start quite yet. Sit back down, rise up and down five times as quickly as you feel safe to do so. I would encourage you to move your chair a little bit, make sure please that it’s flat ground. And then, on the count of three we can do this together.

[group activity]

Well thanks everyone for joining in with that. So you may have felt, well, this is something that is assessing my leg strength, muscle, and power. And this is an assessment that’s used to help predict the risk of future falls. And really when we’re thinking about the kind of validation of this, you think well how relevant is this type of movement for patients in their daily life, and what if they can only do one or two of these? So these are the sorts of assessment issues that we faced, and I’ll talk about those in a moment.

I’m going to start with my conclusion, give the game away. The lessons learned from these kinds of physical performance outcome studies have relevance for thinking about studies with wearables in particular. And to back this up, I’m going to talk a bit about clinical outcome assessments and wearables, and the proposed evolution of PerfO terminology, to set the context. I’ll talk a bit about the case study background and design, and then focus on four key challenges and considerations for wearables.

I’m sure many of you are familiar with this; this is really a reference context kind of slide about what clinical outcome assessments are, and there’s a definition here from FDA. You’re probably aware there’s four different types of clinical outcome assessments, and recently at an ISPOR conference somebody suggested that TechOs could be a fifth one. But I think, really, technologies run through each of these in various ways, so I’m not sure that’s going to quite catch on. But for the purposes of today, I’m focusing very much on the performance outcome assessment. So this is specifically a measurement that’s based on a task performed by a patient, and it’s according to healthcare-provider-administered instructions. So examples are a memory recall test or a six-minute walk test. And it requires the cooperation and the motivation of the patient to do these.

Performance outcome assessments in clinical settings allow for standardization: where there may be differences in patients’ daily lives, you’re able to standardize that. However, we may also want to look at this variability in patients’ lives. Jen talked yesterday around the value of wearables, and just to build on this really, we know that this enables assessment of free-living activity. It’s increasingly patient centric. It has the potential to be considered appropriate for use where it adheres to the basic set of properties important to clinical trials and if evidence can be provided to support the reliability, validity and interpretability of the data it generates. So really again it’s just this thought that we don’t do them just for the sake of it, but really where it’s appropriate, where it can support the endpoints of interest.

This is now really beginning to push out to how we may consider the definition of PerfO assessments. So there’s been a couple of recent suggestions about how to expand on the definition given by FDA. One from a recent DIA workshop was speaking specifically about it not just being evaluated by an appropriately trained individual, it could be independently completed. And then secondly from the ePRO consortium paper, it’s whether it could be in a supervised or an unsupervised setting, and it may be assessed directly by a qualified test administrator or instrumented using sensors and wearable devices such as an accelerometer.


So the field is rapidly changing, when you think about it. And our case study for example, we started back in 2012. And actually the PerfO terminology hadn’t existed at that time, and we referred to it as our PBM study, so performance-based measures. And the closest that we could think of it is really as a type of ClinRo, because these were assessments that had been considered relevant by clinicians. So that gives us a bit of historical context to the study and why the design was the way that it was, and the interesting things that that has shown us.

So this particular study was concerned with patients who had hip and knee joint problems, and really it’s looking at the context of increasing overall life expectancy being met with a corresponding increase in demand for joint arthroplasty and hip fracture surgery. And Eli Lilly and Company were interested to evaluate a compound for use to help in the recovery post surgically, specifically in elective hip and knee replacement patients and hip fracture patients. Those performance outcomes were of interest because they help eliminate errors in personal judgement, memory, and the influence of pain on the perception of functional ability. And so in that context, we undertook a literature review to look at the measurement properties of five performance outcome assessments, and through that identified the need to standardize the administration of these assessments and recommended a full evaluation of the measurement properties in the target populations, if it’s to support a label claim in the US. At this point, we hadn’t really designed an endpoint model as such, it was very much just looking at the ability of these assessments to look at lower extremity functional capability.

We went forward with four performance outcome assessments, per the FDA guidance for PROs. To the extent that they could be applied to these performance outcome assessments, we first of all started with a main statistical evaluation study, started back in 2012. These are the characteristics that we assessed. And we also aimed to get responder definitions on each of those for these patient populations. A bit later into the study and through feedback from FDA, which I’ll talk about in a while, we then undertook a content validation study. Again, it kind of reflects that shift in thinking that it’s originally a type of ClinRo to it becoming a type of PerfO, so the focus became on how is this valid for patients themselves.

Apologies for the slightly complex—lots going on in this slide. This just gives an outline of the study design. This was a cross-sectional longitudinal study with three visits. Each of the three patient groups had three visits. In the hip fracture group, these were all post-surgery. In the elective hip and knee, there were two pre-surgical visits and one post-surgery visit. These were all conducted in the US. Elective hip and knee replacement patients had their study at 12 sites, and we needed to bring on three additional sites to help with the hip fracture recruitment. We had very strict patient eligibility criteria.

Each of the groups undertook three performance outcome assessments, and these were the same for two. This is the timed up and go, where patients were asked to stand up from a seated position, walk three metres, turn around a cone in the direction of their unaffected hip or knee, and walk back and sit back down at their normal speed. And that’s timed. All groups were then asked to undertake the four-step stair climb. We had sensor mats at the bottom and at the top of these steps, and they were asked to ascend these four steps as quickly as possible and the sensor mats gave us the precise calculations for the time. And in the elective hip and the elective knee replacement patient groups, those that could do the four steps were then invited to take part in a long stair climb. We had to ensure that the sites we recruited actually had a suitable stairwell with enough steps. And again, we had the sensor mats at the bottom and at the top. The hip fracture group undertook the repeated chair stand, so the version that you’ve just done, but also the second version where they were able to use the chair armrests. And then each of these, as you may notice, these are all to be done as quickly as they think they could while remaining safe.

We then went on to do the content validation study, so this was just due to the nature of the timing of it, we had very prompt recruitment for the elective knee replacement patients, so we only did this in the elective hip and hip fracture patients. This was following a recent site visit; we undertook qualitative telephone interviews to get their experience of these assessments.


Thinking about our challenges that we had in this study. These are the ones that I think are probably most relevant for thinking about these additional types of performance outcome assessments. There’s a mixture of practical, logistical, methodological, conceptual, all kind of thinking, as Jen talked about yesterday, some of these might be verification or validation issues. But really I hope that these just bringing awareness of these sorts of questions helps with future study design and planning.

The first one was indemnification. This is a type of agreement where one party agrees not to hold another party liable for legal causes of action in the future. I think this was a bit of a revelation to us, who had been more used to working with PROs, that actually sites were going to be a bit concerned about patients falling down the stairs. And this actually added significantly to the timeline while various legal parties developed a suitable letter for use with clinical sites. So it’s not so much about involving the manufacturer of the four-step stair climb or anything like that. It was really between ourselves and Lilly in this case, so that we could use these with clinical sites. And we know from the talk yesterday with TJ just how important it is to plan for this paperwork early, get it resolved as quickly as possible, and just bear that in mind. And I think, thinking about wearables, presumably this could be done in a supervised setting, so the same issues would apply there. But what about in unsupervised settings, where you’re asking patients to use a wearable in their everyday life? Is there any increased risk? Are sites going to be concerned about their patients going about their daily business, using the wearables as part of the study where they’re not going to be supervised? So essentially, we’re asking, does the study require the patient to do more or anything differently in their daily life, and how do we know what is normal for them anyway? And might wearing a wearable actually encourage this in some way? So the recent ePRO Consortium paper specifically makes the recommendation to not actually give feedback until the end of the study, apart from where safety is paramount. That was one issue.

The second issue, there’s a couple of points I want to make around the standardization. So of course we want to look at the variability in patients and everyday life. We still need to try and think about standardization as well. So for our study, we had each site set up or designate a chair that they would use for the same patients at each visit. If a patient started off using the cushion, they would use the cushion for each of their three visits. We ensured that we had the same step model as those planned for the clinical trials. Thinking about the same kind of issues in wearables for patients’ daily lives, you’d kind of expect they’re likely to have the same chairs, they’re getting up from the sofa during the day or from their chair in the kitchen. So that’s just something to bear in mind.

But also, to paraphrase Willie yesterday, thinking about to bring your own wearable or not. So maybe not right now, there may be too many issues with consumer devices, but potentially in the future, thinking about trying to ensure that patients have the option of what devices they want to use, or if they want to bring their own wearable, then perhaps we can work towards more research-friendly wearables that would enable these to happen. So, thinking about equivalence between devices or wearables and the need for calibration, those are the issues there.

Secondly, we had to provide quite a lot of site training and support, so we provided in-person site training, included an operations manual and scripts that the assessors would use to ensure that there was standardized kind of encouragement, and e-letters as well, so ongoing support to the studies. Again, thinking about how this might be equivalent for wearables that patients may potentially, or perhaps ideally, should start a study in a supervised setting as far as possible for their initial training in the wearables. And then, at what point do you take a baseline. So at a recent workshop at ISOQOL, it was suggested that maybe after a three-week period you would perhaps get a true baseline because they’ve become acclimatized to using the wearable, they’ve gotten used to using that as part of their daily life, the novelty may have worn off. And I think that three-week period refers to the 21 days where people think that you’ve been able to change a habit over that period of time, so I think that’s where that comes in. So that’s something to bear in mind; it may not be very practical to allow for a three-week wash-in period, but just something to bear in mind. And obviously try to provide some lay-friendly written reference instructions. And on an ongoing basis you have the study team to remotely monitor results in real time as far as possible and contact patients if results are looking strange, if there’s been no activity or things are looking like outliers, just to be able to understand what’s going on there.


Data monitoring is obviously essential for ensuring high data quality, and for this study we were particularly mindful of the particular time frames that we had the clinical visits. To some extent, it may be that for digital health devices, it may be easier in some ways, it’s more automated. But we do need to understand what happens if there’s minimal or absent movements on devices like accelerometers. It could be that you can triangulate this information with additional resources. For example, looking at weather patterns; if people haven’t been very active, oh it’s because there was a storm that week, giving some kind of contextual information.
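As a rough illustration of the kind of check described here, the sketch below flags days where a device shows little or no data, and separates "probably not worn" from "worn but little movement". It is only a sketch under stated assumptions: the `DailySummary` fields, the `flag_days` function, and both thresholds are hypothetical and are not taken from the study.

```python
from dataclasses import dataclass


@dataclass
class DailySummary:
    """One day of wearable data for one patient (hypothetical structure)."""
    day: str            # ISO date, e.g. "2019-07-22"
    wear_minutes: int   # minutes the device reported being worn
    step_count: int     # steps recorded that day


def flag_days(summaries, min_wear_minutes=600, min_steps=500):
    """Return {day: reason} for days a study team may want to follow up on.

    Both thresholds are illustrative assumptions, not validated cut-offs.
    """
    flags = {}
    for s in summaries:
        if s.wear_minutes < min_wear_minutes:
            # Too little wear time to say anything about activity.
            flags[s.day] = "possible non-wear"
        elif s.step_count < min_steps:
            # Device was worn, but very little movement was recorded.
            flags[s.day] = "low activity while worn"
    return flags
```

A flagged day is only a prompt to contact the patient or look for context, such as the weather example above. It does not explain on its own why movement was low.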

And then a last point on standardization is that we had variations in the conditions. So it’s probably not surprising to think that each of the hospital sites we used had some variation in their stairwells. Most sites had 12, but one site had 11 steps. The step heights were different. The landing steps were present in some of the sites, and so we had to think about how to handle that variability within the evaluation. So we provided unadjusted times and adjusted times, through various statistical calculations. When we think about this in daily life (and this is obviously just outside here), should we just think that all steps are the same, should we just ignore the issue of people walking up and down different steps and them all not being the same in daily life? Should we ask patients to describe what sort of environment they’re in? Or can we make some observations? You know, I don’t think going around with a selfie stick and videoing yourself is going to be very practical. But you could potentially look at GPS and use Google Street View. It might be a bit out of date, but at least it gives you some idea. So for example, I don’t know if you noticed on your way in, we had one curb, if you’re coming from off the road or walking up the few steps to the front door here. So you can imagine getting a sense of patients’ daily lives in their local environment. Or would you measure this in some way? And I don’t think it’s very practical to go around with a tape measure and measure every single step in any way. But there may be some devices in the future that would be able to measure these small numbers of steps, whether it’s just one or two. And I think at the moment there isn’t enough sensitivity, at least to my understanding, in the measures of the wearables that are currently available.
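The talk does not say which statistical adjustment was used for the stair climb times, so the following is only one simple possibility, shown to make the idea concrete: normalize the raw time by the number of steps, or by the total vertical rise. The function names and the example step height are assumptions for illustration.

```python
def seconds_per_step(total_seconds: float, n_steps: int) -> float:
    """Normalize a stair climb time by the number of steps climbed."""
    if n_steps <= 0:
        raise ValueError("n_steps must be positive")
    return total_seconds / n_steps


def vertical_speed_cm_per_s(total_seconds: float, n_steps: int,
                            step_height_cm: float) -> float:
    """Vertical rise covered per second, which also accounts for step height."""
    if total_seconds <= 0:
        raise ValueError("total_seconds must be positive")
    return n_steps * step_height_cm / total_seconds
```

With this kind of normalization, a 12.0 second climb at a 12-step site and an 11.0 second climb at an 11-step site both come out at 1.0 second per step. It does not handle landings, which would need their own treatment.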

Another challenge we had was trying to capture the small increments in improvement. With our long stair climb, we wanted to see, if patients couldn’t complete all of the 12 steps, then it’s fairly easy to document how many steps they did actually climb. And it could be, for wearables, that information may be easier to just have to hand. But I think if we’re trying to understand the patient in their daily life, well, do we know how many steps they’ve been able to get up compared to how many they wanted to get up. So on this example here from local steps in Philadelphia, with Rocky and Butkus, they’ve gone up a few steps here, but actually when you see how many steps there are on these Rocky steps, there’s a lot more to go. So it’s about understanding the patient’s wider context and their goals and were they able to complete as much as they would have liked.

For the repeated chair stands, the concern that I shared earlier, what if they’re only able to do one or two of these repeated chair stands. We had the assessors clicking the stopwatch for every full rise that the participant completed, so that we’d be able to capture those partial completions. And so similarly even thinking about these kind of activities in a patient’s home, you may require additional sensors like postural sensors to be able to assess that type of movement.
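To show how a click per full rise lets partial completions be recorded, here is a minimal sketch. It assumes the stopwatch gives cumulative times in seconds, one per completed rise. The function name and the returned fields are hypothetical, not the study’s data format.

```python
def chair_stand_summary(rise_times_s, target_rises=5):
    """Summarize a repeated chair stand from cumulative lap times.

    rise_times_s: cumulative seconds at each completed full rise,
    e.g. [2.5, 5.5] means two rises were completed, the second at 5.5 s.
    """
    # Lap times from a stopwatch must be strictly increasing.
    if any(b <= a for a, b in zip(rise_times_s, rise_times_s[1:])):
        raise ValueError("rise times must be strictly increasing")
    completed = len(rise_times_s)
    total = rise_times_s[-1] if completed else None
    return {
        "completed_rises": completed,
        "completed_all": completed >= target_rises,
        "time_for_completed_s": total,
        "mean_seconds_per_rise": total / completed if completed else None,
    }
```

A patient who manages two of the five rises then still contributes usable data, rather than being recorded only as unable to complete the test.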


At this point, I just wanted to recap on our main statistical evaluation results, before I move on to the content validation sub-study. Despite these logistical issues, we were able to recruit enough patients and demonstrate that the measurement properties of these particular performance outcome measures were supported for consideration in future use, and we were able to provide estimates for interpretation of change as well. That was recently published, if you’re interested in further information there.

Then our last challenge was with the content validation sub-study. I’ve got a couple of slides here just to outline how we approached this before thinking about how this relates to wearables.

Content validation is the extent to which the instrument measures the concept of interest. And we had specific feedback from FDA at this point. We were interested in knowing how well the patients believe the test reflects their ability to function on a day-to-day basis, and how the level of difficulty reflects the challenges they face in daily function and related topics. So this was really helpful because it helped to inform the development of our interview guide. Patients had recently undertaken a site visit, and we started our interview just by kind of recapping on what the outcome assessment was and the instructions they were given. And then we asked about their overall experience and the specific components. So whether it was around turning around the cone, lifting their foot up for the step, sitting down, each of those components, which in a way is quite similar to thinking about the domains in a PRO. And then we asked questions around the relevance to their everyday activity, so how well it matched overall the issues around speed and the level of difficulty.

We had limited availability of the participants who were still within the wider evaluation study, which made it even more important to really assess for saturation, to have confidence in the sample size that we ended up with.

The approach to data saturation, we needed to think of a way that would be appropriate for this type of study. And we developed a summary grid. Each of the participants, and for each of their three performance outcome assessments, we derived the summary from their interview data and we summarized what the participant said about the relevance of the test, what they said around speed and the difficulty of the test. So, we ended up with nine summaries per participant.

And then we looked at comparing each participant with the prior interview. This was done in a chronological basis. And we compared the prior summaries to identify the new elements from each interview. This is really about summarizing the new information rather than the thought of applying a new code, which is a typical approach to saturation. And then we looked at the overall summary of the new elements for each theme and each arm of the performance outcome assessments. Was there enough variation within what people were saying about the speed of the four-step stair climb, for example. And of course, for each of these summaries that we reviewed, we undertook a quality control, back to the data source, made sure it was accurate reflection of what participants had said, etc.
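The bookkeeping side of this chronological comparison can be sketched as below. This is not the study team’s tool. Deciding what counts as a new element was a qualitative judgement, and the code only tracks which elements have already appeared in each cell of the grid. The function name and the element labels in the usage note are hypothetical.

```python
def new_elements_by_interview(interviews):
    """Find what each interview adds relative to all earlier ones.

    interviews: list of (participant_id, grid) in chronological order,
    where grid maps (assessment, theme) to a set of summary elements.
    Returns a list of (participant_id, {cell: new_elements}).
    """
    seen = {}      # cell -> every element recorded so far
    result = []
    for participant_id, grid in interviews:
        new = {}
        for cell, elements in grid.items():
            fresh = set(elements) - seen.get(cell, set())
            if fresh:
                new[cell] = fresh
            seen.setdefault(cell, set()).update(elements)
        result.append((participant_id, new))
    return result
```

For example, if interview 15 records "performed at own normal speed" under the four-step stair climb and speed, and interview 16 records that plus "no difficulty going fast", only the second element is returned as new for interview 16. Interviews that add nothing across all cells are what would suggest saturation is being approached.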

Here’s an example of the summary grid. This was for the four-step stair climb. As you may remember, they had been asked to complete as fast as possible, and here are two hip fracture patients. The ID#15 said that he did the steps at his normal speed without trying to go especially faster. And the 16th interview, the participant said her norm is to move quite quickly and be slightly aggressive when climbing stairs, and she had no problem doing this in the test. So both patients were talking about actually their norms, rather than making additional efforts to be fast. But what was different with the 16th interviewee was saying that there’s no problem with her doing this fast, even though it was her norm. So it’s a subtle thing, but it’s important to capture these nuances.

Just to give you some top line results from this part of the study. We found that all hip fracture and most of the total hip replacement patients related these performance outcome assessments to similar activities performed in daily life. So even though they may not climb four steps in their daily life, they may just have one or two steps at home, or they walk up from a road up to a curb, they can see that these movements were similar. Some of the variations people talked about, well their armchair at home was quite different from the chair that was used at clinic. Now there’s only so many types of variations in types of chairs. So these subtleties were pulled out through the interviews. What we found was that actually most of the elective total hip replacement patients didn’t undertake the longer stair climbs in daily life, but by actually being asked to do this on their clinic visits, they found that actually it gave them confidence to do this in their daily life. And I think if we’d started off with content validation first of all and selected our measures based on what they said was relevant, then perhaps we wouldn’t have seen this kind of interesting result.


And all participants reported that these outcome assessments were relevant and had a similar level of difficulty to daily life activities. And eight went on to discuss other types of activities that they felt indicated the level of difficulty or improvement associated with their hip, so things like getting in and out of their vehicles. None of the assessments we had took into account those sorts of twisting motions, so that would have been helpful to include.

When thinking about wearables, we still really need to get the patient perspective on the relevance and importance of activities, whether it’s something that’s just about their everyday life, whether it’s a passive observation or an active assessment that we’re asking them to undertake. We still need to get the PRO on the PerfO, and this could be undertaken through interviews or through diaries and it could be that these collected verbally and they could be recorded into a digital device. We need to think about the length of time that these wearables are worn and how typical that period of time might be in their daily lives, and which activities are most important. Is there a difference between being able to walk around the block or get in and out of your car a bit easier, and which of these are going to be most sensitive to change.

I recommend that we really think about probing around activities that patients avoid or say that they don’t do or think are not relevant, because it may be that their self-perception on what they’re able to do has just changed over time and they may not really be thinking about what they no longer do, unless you specifically ask around that. And then it’s important to relate this back to either the specific endpoint model or back to the general aim, so thinking about the improvement in their extremity functional capacity. Patients may not cross their arms over in the way that we’ve just done on the test, but all the patients could say yes, I’m frequently up and down from my chair, I’m answering the phone, I’m going to the door. They could see that the underlying concept of this with the leg strength, muscle, and power was relevant for their daily lives.

In conclusion, and just to repeat as I said at the beginning, the physical PerfO assessments present specific challenges, and these are relevant to consider for using wearables. Consider the extent and the approach to standardization in the assessment of wearables. And PerfO assessment experience can impact the patient’s confidence to perform certain types of activities in daily life. And lastly, daily life activities will reflect what is meaningful to patients, but be aware of the absence of activities, just to take that into account.

Lastly, I just want to thank Lilly for the permission to present this as a case study example and all those that took part in this study. So, thank you.


Thank you, Rachel. We’ve got time for a few questions. Would anybody like to ask a question? One at the back there.


Thanks very much, that was super interesting. I’m curious, when you guys were monitoring the data as it was coming in and you said that, looking at unusual variants of the data, what were the things that were the outliers, and when you followed up what were they attributed to, what were the things that kind of stuck out in this?


The reason we had to really closely monitor the data was particularly around the time frames. We had very limited windows in which the patients had to undertake their clinical site visit, and that was in relation to the expected trajectory of improvement following surgery. So we had to clinically understand the anticipated nature of recovery. I think the windows were a three- to four-day period, and so we had to always ensure that the data coming back fell within that timeframe. So that was the reason for us to closely monitor in the first place, I think. The outliers may be, well was there one or two days out, and that didn’t happen that often. We obviously had to train the sites to make sure, but sometimes, you know, surgical dates may change for the elective group. So that’s something we had to keep a close eye on. We needed to look at whether there were quite substantial differences between the two raters, so we assessed for inter-rater reliability and we wanted to make sure there were no outliers with different raters presenting quite different data, I think that was another key aspect. And making sure that there weren’t some patients that just looked really very different from the rest of the group, just to try and better understand and ensure that they were eligible. So it was just fairly standard, I don’t think there was anything particularly different that was coming out, it was just the need to closely monitor the data for this particular study.
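A visit-window check of the kind described in this answer could look like the sketch below. The target day and the window width are placeholders, since the talk gives only an approximate window and no visit schedule, and the function name is hypothetical.

```python
from datetime import date, timedelta


def visit_in_window(surgery_date: date, visit_date: date,
                    target_day: int, half_width_days: int) -> bool:
    """True if the visit falls within the allowed window around its target.

    target_day is counted from the surgery date and may be negative for
    pre-surgical visits. Both numbers are assumptions for illustration.
    """
    target = surgery_date + timedelta(days=target_day)
    return abs((visit_date - target).days) <= half_width_days
```

If an elective surgery date moves, rerunning the check against the new date shows straight away which scheduled visits have fallen out of window.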



Rachel, you brought up again the theme of our friends, the lawyers, in this. You talked about indemnity. And that was interesting because I hadn’t imagined that, and it sounded like that was a surprise to the study team as well to have to think through those things. What do you think we might want to do, or are there any special considerations or ways to mitigate that indemnity piece, when perhaps we’re asking patients to do some of these performance outcomes at home in an unsupervised setting? Have you got any thoughts on that at all? Or maybe folks in the audience might as well.


Yes, well I think it’s how those patients are recruited. So if it’s through clinical sites, I think the concern will come from the clinical sites primarily. It will be, well these are our patients, we are responsible for recruiting them, for their wellbeing. So I think that’s the main hurdle. And I can’t really imagine other routes for recruitment would throw up different issues. So you know, who is willing to undertake that risk, is it something that is just very clear language in the consent form, who takes that burden of that responsibility of risk. So I don’t know, I’d be interested to see if this is an issue that’s come up for anybody actually, because it’s a relatively new—something I’ve only just started thinking about in this.

[END AT 32:29]
