The CAMELYON16 dataset consists of 400 patients in total, each with a single WSI provided in tag image file format (TIFF). Annotations are provided in extensible markup language (XML) format, one per positive slide. Each annotation may contain several regions defined by vertex coordinates. Because these slides were scanned at a higher resolution than the slides scanned at MSK, we developed a tiling method to extract tissue-containing tiles from both inside and outside the annotated regions at MSK's 20× equivalent magnification (0.5 μm per pixel), enabling direct comparison with our datasets. This method generates a grid of candidate tiles, excludes background via Otsu thresholding and determines whether a tile lies inside an annotation region by solving a point-in-polygon problem.
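The point-in-polygon step can be sketched as below. The ray-casting test and the use of the tile center as the query point are illustrative assumptions; the actual implementation may test tiles differently (for example, by corner coverage or area overlap).

```python
def point_in_polygon(x, y, vertices):
    """Ray-casting test: does point (x, y) fall inside the polygon
    defined by an annotation's vertex coordinates?"""
    inside = False
    n = len(vertices)
    j = n - 1
    for i in range(n):
        xi, yi = vertices[i]
        xj, yj = vertices[j]
        # Toggle 'inside' each time a horizontal ray from (x, y)
        # crosses one of the polygon's edges.
        if (yi > y) != (yj > y) and x < (xj - xi) * (y - yi) / (yj - yi) + xi:
            inside = not inside
        j = i
    return inside


def tile_is_positive(tile_center, annotation_polygons):
    """Label a candidate tile positive if its center lies inside any
    annotated tumor region (a simplification for illustration)."""
    x, y = tile_center
    return any(point_in_polygon(x, y, poly) for poly in annotation_polygons)
```

A tile whose center falls outside every annotation polygon (but inside tissue, per the Otsu mask) would then be treated as a negative tile.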
We used 80% of the training data to train the model and reserved the remaining 20% for model selection. We randomly extracted 1,000 tiles from each negative slide, and 1,000 negative and 1,000 positive tiles from each positive slide. A ResNet34 model was trained with on-the-fly data augmentation comprising 90° rotations, horizontal flips and color jitter, and was optimized with SGD. The model that performed best on the validation set was selected. Slide-level predictions were generated with the random forest aggregation approach described above, trained on the entire training portion of the CAMELYON16 dataset. To train the random forest, we exhaustively tiled the training slides with no overlap to generate tumor probability maps. The trained random forest was then evaluated on the CAMELYON16 test set and on our large breast lymph node metastasis test datasets.
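The aggregation step can be sketched as follows, assuming scikit-learn. The specific summary features computed from each tumor probability map (maximum tile probability, mean probability, fraction of positive tiles) are hypothetical placeholders; the actual feature set used for the random forest may differ.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def map_features(prob_map):
    """Summarize a slide's tumor probability map into a fixed-length
    feature vector (hypothetical features for illustration)."""
    p = np.asarray(prob_map, dtype=float).ravel()
    return np.array([
        p.max(),           # highest tile probability on the slide
        p.mean(),          # average tile probability
        (p > 0.5).mean(),  # fraction of tiles called positive
    ])


def train_slide_classifier(prob_maps, slide_labels, seed=0):
    """Fit a random forest on per-slide features, turning tile-level
    probability maps into slide-level predictions."""
    X = np.stack([map_features(m) for m in prob_maps])
    rf = RandomForestClassifier(n_estimators=100, random_state=seed)
    rf.fit(X, slide_labels)
    return rf
```

At test time, each slide is exhaustively tiled in the same way, its probability map is summarized with the same features, and the fitted forest emits the slide-level prediction.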