We determined the influence of training data variability (in particular, public datasets versus routine clinical data) on generalizability to other public test datasets and, specifically, to cases with a variety of pathologies. To establish comparability, we limited the number of volumes and slices to match the smallest dataset (LCTSC), with 36 volumes and 3,393 slices. In this experiment, we considered only slices that showed the lung (during both training and testing) to prevent a bias induced by the field of view. For example, images in VISCERAL Anatomy 3 showed either the whole body or the trunk, including the abdomen, while other datasets, such as LTRC, LCTSC, or VESSEL12, contained only images limited to the chest.
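A minimal sketch of how lung-bearing slices can be selected is given below, assuming binary lung masks stored as NumPy volumes with the slice axis first; the function and parameter names are illustrative and not taken from the paper's code.

```python
import numpy as np

def lung_slice_indices(mask_volume: np.ndarray, min_voxels: int = 1) -> np.ndarray:
    """Return indices of axial slices whose lung mask is non-empty.

    mask_volume: binary lung mask with shape (slices, height, width).
    min_voxels:  minimal number of lung voxels a slice must contain (assumed threshold).
    """
    voxels_per_slice = mask_volume.reshape(mask_volume.shape[0], -1).sum(axis=1)
    return np.where(voxels_per_slice >= min_voxels)[0]

# Usage: keep only lung-bearing slices of an image/mask pair
# keep = lung_slice_indices(mask)
# image, mask = image[keep], mask[keep]
```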
Further, we compared the generic models trained on the R-231 dataset to the publicly available systems CIP and P-HNN. For this comparison, we processed the full volumes. The CIP algorithm was shown to be sensitive to image noise. Thus, if the CIP algorithm failed, we pre-processed the volumes with a Gaussian filter kernel. If the algorithm still failed, the case was excluded from the comparison. The trained P-HNN model does not distinguish between the left and right lung; thus, evaluation metrics for masks created by P-HNN were computed on the full lung. In addition to the evaluation on publicly available datasets and methods, we performed an independent evaluation of our lung segmentation model by submitting solutions to the LOLA11 challenge, for which 55 CT scans are published but ground truth masks are available only to the challenge organisers. Prior research and earlier submissions suggest inconsistencies in the ground truth of the LOLA11 dataset, especially with respect to pleural effusions [24]. We specifically included effusions in our training datasets. To account for this discrepancy and improve comparability, we submitted two solutions: first, the masks as yielded by our model and, second, the same masks with dense areas subsequently removed. The automatic exclusion of dense areas was performed by simple thresholding of values between -50 and 70 HU and morphological operations.
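A possible implementation of this dense-area removal is sketched below, assuming the HU volume and the predicted lung mask are NumPy arrays of identical shape. The specific morphological operations (a binary opening followed by a closing with a 3x3x3 structuring element) are an assumption, as the text only states that morphological operations were applied.

```python
import numpy as np
from scipy import ndimage

def remove_dense_areas(lung_mask: np.ndarray, hu_volume: np.ndarray) -> np.ndarray:
    """Remove dense regions (e.g. effusions) from a binary lung mask.

    Dense voxels are those inside the mask with -50 < HU < 70; the
    opening/closing clean-up below is an assumed choice, not specified
    in the paper.
    """
    dense = lung_mask.astype(bool) & (hu_volume > -50) & (hu_volume < 70)
    # Suppress isolated dense voxels and fill small gaps before subtraction
    dense = ndimage.binary_opening(dense, structure=np.ones((3, 3, 3)))
    dense = ndimage.binary_closing(dense, structure=np.ones((3, 3, 3)))
    return lung_mask.astype(bool) & ~dense
```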
Studies on lung segmentation usually use overlap and surface metrics to assess the automatically generated lung mask against the ground truth. However, segmentation metrics computed on the full lung can only marginally quantify the capability of a method to cover pathological areas, as pathologies may be small relative to the lung volume. Carcinomas are an example of high-density areas that are at risk of being excluded by threshold- or registration-based methods when they are close to the lung border. We utilised the publicly available, previously published Lung1 dataset [38] to quantify the model’s ability to cover tumour areas within the lung. The collection contains scans of 318 non-small cell lung cancer patients before treatment, with a manual delineation of the tumours. In this experiment, we evaluated the proportion of the tumour volume covered by the lung mask.
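This coverage metric can be computed as the fraction of tumour voxels that fall inside the lung mask; a minimal sketch follows, assuming co-registered binary masks of identical shape (names are illustrative).

```python
import numpy as np

def tumour_coverage(lung_mask: np.ndarray, tumour_mask: np.ndarray) -> float:
    """Proportion of the tumour volume covered by the lung mask."""
    tumour = tumour_mask.astype(bool)
    covered = np.logical_and(tumour, lung_mask.astype(bool)).sum()
    # Guard against empty tumour delineations
    return covered / max(tumour.sum(), 1)
```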