To select labels from the pool for the new levels, an interviewer-administered response scaling exercise similar to those used in previous studies [14, 19, 20] was adopted to estimate the severity represented by each label. For this exercise, respondents were shown a rating scale in the form of a vertical, hash-marked, 40 cm visual analog scale (VAS) with end points of 0 and 100 to be used as a visual aid in grading label severity. For the Mobility, Self-Care and Usual Activities dimensions, the same set of labels was used. The interviewer placed a card labeled ‘No problems’, ‘No pain/discomfort’, or ‘No anxiety/depression’, as appropriate, at the bottom of the scale (0) to act as the lower anchor, and a card labeled ‘Unable to’, ‘The worst pain or discomfort I can imagine’, or ‘As anxious or depressed as I can imagine’ at the top (100) as the upper anchor. The respondent was then shown the other labels from the pool one at a time in a quasi-random order and asked to assign each a score between 0 and 100 to indicate its severity in relation to the lower and upper anchors.
The interviewer noted all scores, and when the respondent had rated all labels for a particular dimension, the interviewer laid them out in rank order alongside the VAS and asked the respondent to review the ranking and make any changes he or she thought necessary. If labels were reordered at this point, the respondent was asked to assign a new score to the relevant labels. Final scores assigned were recorded in an answer booklet. The scaling task was repeated for each dimension. Before finishing with the cards, the respondent was asked whether any of the labels sounded unusual, or should not be used in relation to a particular dimension.
Respondents rated labels for all five dimensions. The three functional dimensions (Mobility, Self-Care and Usual Activities) were always separated by the Pain/Discomfort and Anxiety/Depression dimensions, so that the respondent did not rate the same label types consecutively. Before rating the actual labels, respondents performed a practice task based on levels of overall health to get used to the study requirements. Data on age, level of education, main activity, and use of any current treatment for health problems, together with the existing EQ-5D-3L descriptive system and EQ-VAS, were collected after the response scaling task.
Before the main response scaling task, a pilot test was performed to test study procedures and materials. Based on the results of the pilot test, some labels were eliminated from the initial pool to achieve a more manageable number for the response scaling task. In particular, any labels using additional modifiers such as ‘very’ or ‘quite’ were eliminated, as were any that were considered excessively colloquial or pitched at too high a level of language. After pilot testing, it was concluded that the feasible limit was about 10–12 labels per dimension for an individual respondent.
Responses to the scaling task were analyzed by calculating means and medians and the corresponding standard deviations and interquartile ranges (IQR). Labels to go forward for further testing were selected based on criteria that had been identified before data collection started. These included selecting labels close to or at the 25th, 50th, and 75th centiles on the VAS, ensuring consistency across dimensions and coherence with wording in the descriptive system. No quantitative comparison of label scores was carried out in deciding which labels to carry forward to the next stage; median scores were simply used as a guide to determine which labels fell closest to the 25th, 50th, and 75th centiles. Labels were also required to be in colloquial language. The choice of labels and their appropriateness was discussed by the task force at several meetings during the course of the study.
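The descriptive analysis and the use of medians as a guide can be illustrated with a minimal Python sketch. It is not the procedure used in the study: the label names and scores are invented, the function names `summarize` and `closest_labels` are hypothetical, and the sketch assumes that the 25th, 50th, and 75th centiles correspond to the points 25, 50, and 75 on the 0 to 100 VAS. The other selection criteria (consistency across dimensions, coherence with existing wording, task force discussion) are judgment-based and are not modeled here.

```python
from statistics import mean, median, stdev, quantiles


def summarize(scores):
    """Return mean, median, standard deviation, and IQR for one label's scores."""
    # Quartiles computed with the "inclusive" method (an assumption; the
    # study does not state how the interquartile range was calculated).
    q1, _, q3 = quantiles(scores, n=4, method="inclusive")
    return {
        "mean": mean(scores),
        "median": median(scores),
        "sd": stdev(scores),
        "iqr": q3 - q1,
    }


def closest_labels(ratings, targets=(25, 50, 75)):
    """For each target VAS point, return the label whose median score is nearest."""
    medians = {label: median(scores) for label, scores in ratings.items()}
    return {
        target: min(medians, key=lambda label: abs(medians[label] - target))
        for target in targets
    }


# Invented scores for illustration only; these are not study data.
ratings = {
    "slight": [20, 25, 30, 22, 28],
    "some": [30, 35, 40, 38, 33],
    "moderate": [45, 50, 55, 50, 48],
    "severe": [70, 80, 75, 78, 72],
}

for label, scores in ratings.items():
    print(label, summarize(scores))

print(closest_labels(ratings))
```

In this sketch the median serves only to shortlist a candidate label for each target point, mirroring the statement that no quantitative comparison of label scores was carried out; the final choice would still rest on the qualitative criteria.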