The main training dataset contains 25,331 dermoscopic images acquired at multiple sites, with different preprocessing methods applied beforehand. It covers eight classes: melanoma (MEL), melanocytic nevus (NV), basal cell carcinoma (BCC), actinic keratosis (AK), benign keratosis (BKL), dermatofibroma (DF), vascular lesion (VASC), and squamous cell carcinoma (SCC). One part of the training dataset is the HAM10000 dataset, which contains images of size 600 × 450 that were centered and cropped around the lesion; the dataset curators applied histogram corrections to some images [1]. Another part, the BCN_20000 dataset, contains images of size 1024 × 1024. This dataset is particularly challenging, as many images are uncropped and lesions appear in difficult and uncommon locations [2]. Last, the MSK dataset contains images of various sizes.
The dataset also contains metadata about the patient's age group (in steps of five years), the anatomical site (eight possible sites), and the sex (male/female). The metadata is partially incomplete, i.e., values are missing for some images.
In addition, we make use of external data. We use the 955 dermoscopic images from the 7-point dataset [3]. Moreover, we use an in-house dataset consisting of 986 images. For the unknown class, we use 353 images obtained from a web search, including images of healthy skin, angiomas, warts, cysts, and other benign alterations. The key idea is to build a broad class of skin variations that encourages the model to assign any image not belonging to the eight main classes to this ninth, broad pool of skin alterations. We also use the three types of metadata for our external data where available.
For internal evaluation, we split the main training dataset into five folds. Because the dataset contains multiple images of the same lesion, we ensure that all images of a lesion are placed in the same fold, as sketched below. We add all our external data to each of the training sets. Note that we do not include any of our unknown-class images in the evaluation, as we do not know whether they accurately represent the actual unknown class. Thus, all our models are trained to predict nine classes, but we evaluate only on the eight known classes.
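A minimal sketch of such a lesion-grouped five-fold split, assuming a pandas DataFrame `df` with hypothetical columns "image" and "lesion_id"; scikit-learn's GroupKFold guarantees that all rows sharing a group value land in the same fold. The paper does not name a specific library, so this is one plausible implementation, not the authors' exact code.

```python
import pandas as pd
from sklearn.model_selection import GroupKFold

df = pd.read_csv("train_metadata.csv")  # hypothetical metadata file

gkf = GroupKFold(n_splits=5)
for fold, (train_idx, val_idx) in enumerate(
        gkf.split(df, groups=df["lesion_id"])):
    # all images of one lesion appear in exactly one of the two sets
    train_images = df.iloc[train_idx]["image"]
    val_images = df.iloc[val_idx]["image"]
    # external data (7-point, in-house, unknown-class images) would be
    # appended to every training set here, never to the validation set
```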
We use the mean sensitivity for our internal evaluation, which is defined as

$$S = \frac{1}{C} \sum_{i=1}^{C} \frac{TP_i}{TP_i + FN_i},$$

where $TP_i$ are the true positives of class $i$, $FN_i$ are the false negatives, and $C$ is the number of classes. The metric is also used for the final challenge ranking.
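As a brief illustration, the mean sensitivity is the macro-averaged per-class recall, so it can be computed with scikit-learn's `recall_score` (it coincides with `balanced_accuracy_score`); the toy labels below are only for demonstration.

```python
import numpy as np
from sklearn.metrics import recall_score

y_true = np.array([0, 0, 1, 1, 2, 2])  # toy labels, C = 3 classes
y_pred = np.array([0, 1, 1, 1, 2, 0])

# per-class sensitivity TP_i / (TP_i + FN_i), averaged over classes:
# class 0: 1/2, class 1: 2/2, class 2: 1/2  ->  S = (0.5 + 1.0 + 0.5) / 3
mean_sensitivity = recall_score(y_true, y_pred, average="macro")
print(mean_sensitivity)  # 0.666...
```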