First, to investigate the effect of training on crowdsourced versus single-expert annotations, we trained ‘comparison’ models for semantic segmentation. These models were trained on annotations from evaluation-set ROIs and evaluated on the post-correction core-set annotations (see
Second, to evaluate peak accuracy, we trained ‘full’ models for semantic segmentation using the largest amounts of crowdsourced annotations possible. The full models were trained using annotations from core-set ROIs, assigning the ROIs from 82 slides (from 11 institutes) to the training set, and the ROIs from 43 slides (from seven institutes) to the testing set. Strict separation of ROIs by institute into either training or testing provides a better measure of how models developed with our data will generalize to slides from new institutions and multi-institute studies.
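For concreteness, an institute-level split of this kind can be sketched as below. This is an illustrative sketch, not the pipeline used in the study; the table `roi_df` and its `institute` column are hypothetical stand-ins for per-ROI metadata, and the test fraction is chosen only to approximate the 82/43 slide split described above.

```python
# Illustrative sketch of a grouped train/test split, where every ROI from a
# given institute is assigned to exactly one of the two sets (never both).
# 'roi_df' and its column names are hypothetical, not from the study's code.
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit


def split_rois_by_institute(roi_df: pd.DataFrame,
                            test_size: float = 0.35,
                            seed: int = 0):
    """Split ROI metadata so that no institute appears in both sets."""
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size,
                                 random_state=seed)
    train_idx, test_idx = next(
        splitter.split(roi_df, groups=roi_df["institute"]))
    return roi_df.iloc[train_idx], roi_df.iloc[test_idx]
```

Grouping by institute, rather than splitting ROIs at random, prevents institution-specific characteristics (for example, staining or scanning differences) from appearing in both sets, which is why it better reflects generalization to slides from new institutions.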
Finally, to evaluate the effect of training set size on the accuracy of predictive models, we developed ‘scale-dependent’ image classification models using varying amounts of our crowdsourced annotation data (