We determined the influence of training data variability (in particular, public datasets versus routine clinical data) on generalizability to other public test datasets and, specifically, to cases with a variety of pathologies. To establish comparability, we limited the number of volumes and slices to match the smallest dataset (LCTSC), with 36 volumes and 3,393 slices. In this experiment, we considered only slices that showed the lung (during both training and testing) to prevent a bias induced by the field of view. For example, images in VISCERAL Anatomy 3 showed either the whole body or the trunk, including the abdomen, while other datasets, such as LTRC, LCTSC, or VESSEL12, contained only images limited to the chest.
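A minimal sketch of how lung-bearing slices can be selected is given below, assuming binary lung masks stored as NumPy volumes with the slice axis first; the function and parameter names are illustrative and not taken from the paper's code.

```python
import numpy as np

def lung_slice_indices(mask_volume: np.ndarray, min_voxels: int = 1) -> np.ndarray:
    """Return indices of axial slices whose lung mask is non-empty.

    mask_volume: binary lung mask with shape (slices, height, width).
    min_voxels:  minimal number of lung voxels a slice must contain (assumed threshold).
    """
    voxels_per_slice = mask_volume.reshape(mask_volume.shape[0], -1).sum(axis=1)
    return np.where(voxels_per_slice >= min_voxels)[0]

# Usage: keep only lung-bearing slices of an image/mask pair
# keep = lung_slice_indices(mask)
# image, mask = image[keep], mask[keep]
```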
Further, we compared the generic models trained on the R-231 dataset to the publicly available systems CIP and P-HNN. For this comparison, we processed the full volumes. The CIP algorithm was shown to be sensitive to image noise. Thus, if the CIP algorithm failed, we pre-processed the volumes with a Gaussian filter kernel. If the algorithm still failed, the case was excluded from the comparison. The trained P-HNN model does not distinguish between the left and right lung; thus, evaluation metrics for masks created by P-HNN were computed on the full lung. In addition to the evaluation on publicly available datasets and methods, we performed an independent evaluation of our lung segmentation model by submitting solutions to the LOLA11 challenge, for which 55 CT scans are published but ground truth masks are available only to the challenge organisers. Prior research and earlier submissions suggest inconsistencies in the ground truth of the LOLA11 dataset, especially with respect to pleural effusions [24]. We specifically included effusions in our training datasets. To account for this discrepancy and improve comparability, we submitted two solutions: first, the masks as yielded by our model and, second, the same masks with dense areas subsequently removed. The automatic exclusion of dense areas was performed by simple thresholding of values between -50 and 70 HU and morphological operations.
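A possible implementation of this dense-area removal is sketched below, assuming the HU volume and the predicted lung mask are NumPy arrays of identical shape. The specific morphological operations (a binary opening followed by a closing with a 3x3x3 structuring element) are an assumption, as the text only states that morphological operations were applied.

```python
import numpy as np
from scipy import ndimage

def remove_dense_areas(lung_mask: np.ndarray, hu_volume: np.ndarray) -> np.ndarray:
    """Remove dense regions (e.g. effusions) from a binary lung mask.

    Dense voxels are those inside the mask with -50 < HU < 70; the
    opening/closing clean-up below is an assumed choice, not specified
    in the paper.
    """
    dense = lung_mask.astype(bool) & (hu_volume > -50) & (hu_volume < 70)
    # Suppress isolated dense voxels and fill small gaps before subtraction
    dense = ndimage.binary_opening(dense, structure=np.ones((3, 3, 3)))
    dense = ndimage.binary_closing(dense, structure=np.ones((3, 3, 3)))
    return lung_mask.astype(bool) & ~dense
```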
Studies on lung segmentation usually use overlap and surface metrics to assess the automatically generated lung mask against the ground truth. However, segmentation metrics computed on the full lung can only marginally quantify the capability of a method to cover pathological areas, as pathologies may be small relative to the lung volume. Carcinomas are an example of high-density areas that are at risk of being excluded by threshold- or registration-based methods when they are close to the lung border. We utilised the publicly available, previously published Lung1 dataset [38] to quantify the model’s ability to cover tumour areas within the lung. The collection contains scans of 318 non-small cell lung cancer patients before treatment, with a manual delineation of the tumours. In this experiment, we evaluated the proportion of the tumour volume covered by the lung mask.
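This coverage metric can be computed as the fraction of tumour voxels that fall inside the lung mask; a minimal sketch follows, assuming co-registered binary masks of identical shape (names are illustrative).

```python
import numpy as np

def tumour_coverage(lung_mask: np.ndarray, tumour_mask: np.ndarray) -> float:
    """Proportion of the tumour volume covered by the lung mask."""
    tumour = tumour_mask.astype(bool)
    covered = np.logical_and(tumour, lung_mask.astype(bool)).sum()
    # Guard against empty tumour delineations
    return covered / max(tumour.sum(), 1)
```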