First, descriptive information on the activity trackers (model, release date, placement, size, weight, and cost) was recorded from the Garmin website. Second, the abstraction tool used for this review was expanded from a tool initially created by De Vries et al. (2009) to document study characteristics and measurement properties of the activity trackers. Specifically, we extracted information on the study population, protocol, statistical analysis, and results related to validity and reliability. A primary reviewer extracted details and a second reviewer checked each entry, with discrepancies resolved by consensus. When abstracted information was missing from a publication, we attempted to contact at least one study author to obtain it. In total, we contacted authors of 15 papers, of whom 12 responded. Summary tables were created from the abstracted information.
Reliability of the activity trackers included (Duking et al., 2018): (i) intra-device reliability, defined as reproducibility within the same tracker; and (ii) inter-device reliability, defined as reproducibility across different trackers. Validity of the activity trackers included (Higgins and Straub, 2006): (i) criterion validity, defined by comparing the trackers to a criterion measure; and (ii) construct validity, defined by comparing the trackers to measures of other constructs expected to correlate positively (convergent validity) or negatively (divergent validity).
If reported, we abstracted correlation coefficients (CC). We interpreted the CC using the following ratings: <0.60 low, 0.60-<0.75 moderate, 0.75-<0.90 good, and >=0.90 excellent. If reported, we abstracted the mean percentage error (MPE), which captured over- and under-estimation and was defined as [(criterion value minus Garmin tracker value)/criterion value]*100. If reported, we also abstracted the mean absolute percentage error (MAPE), which captured the magnitude of mis-estimation and was defined as the absolute value of [(criterion value minus Garmin tracker value)/criterion value]*100. A smaller MAPE represented better accuracy and accounted for both over- and underestimation. We interpreted a MAPE <5% in laboratory or controlled conditions (Fokkema et al., 2017) and <10% in free-living conditions (Chen et al., 2016; Crouter et al., 2003; Nelson et al., 2016; Tudor-Locke et al., 2006) as indicating equivalence to the criterion measure; differences exceeding these thresholds were considered practically relevant. We also summarized results from Bland-Altman plots when presented (Bland and Altman, 1986).
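The error metrics and correlation cutoffs described above can be sketched in code; this is an illustrative example only, with function names and sample values of our own choosing rather than data from any reviewed study:

```python
def mean_percentage_error(criterion, tracker):
    """MPE: mean of [(criterion - tracker) / criterion] * 100.
    Positive values mean the tracker under-estimated the criterion;
    over- and under-estimation can cancel out."""
    errors = [(c - t) / c * 100 for c, t in zip(criterion, tracker)]
    return sum(errors) / len(errors)

def mean_absolute_percentage_error(criterion, tracker):
    """MAPE: mean of |(criterion - tracker) / criterion| * 100.
    Captures the magnitude of mis-estimation; smaller is more accurate."""
    errors = [abs((c - t) / c) * 100 for c, t in zip(criterion, tracker)]
    return sum(errors) / len(errors)

def rate_correlation(cc):
    """Apply the review's interpretation of a correlation coefficient:
    <0.60 low, 0.60-<0.75 moderate, 0.75-<0.90 good, >=0.90 excellent."""
    if cc < 0.60:
        return "low"
    if cc < 0.75:
        return "moderate"
    if cc < 0.90:
        return "good"
    return "excellent"

# Hypothetical step counts: a criterion measure (e.g., manually counted
# steps) paired with readings from a Garmin tracker.
criterion_steps = [1000, 2000, 1500]
tracker_steps = [950, 2100, 1400]
print(round(mean_percentage_error(criterion_steps, tracker_steps), 2))           # 2.22
print(round(mean_absolute_percentage_error(criterion_steps, tracker_steps), 2))  # 5.56
print(rate_correlation(0.82))                                                    # good
```

Note how the signed errors partially cancel in the MPE (2.22%) while the MAPE (5.56%) retains their full magnitude, which is why the review used the MAPE thresholds to judge equivalence with the criterion measure.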
Assessing study quality is standard practice for systematic reviews. However, we could locate no assessment tool specific to testing the validity and reliability of a device. Therefore, we developed a 10-item assessment, guided both by a paper describing reporting suggestions for wearable sensors (Duking et al., 2018) and by a critical appraisal tool originally developed to assess the quality of cross-sectional studies (Downes et al., 2016). The questions asked:
Yes or no responses were recorded for all 10 items, with “yes” indicating higher study quality.