Comparative Evaluation of Sampling Algorithms for Spectroscopic Data Classification

Datasets. Six real-world datasets were used to compare the classification performance of the RS, KS and MLM algorithms. Dataset 1 contains 280 infrared (IR) spectra of two Cryptococcus fungi specimens acquired via attenuated total reflection Fourier-transform infrared (ATR-FTIR) spectroscopy. This dataset is publicly available at https://doi.org/10.6084/m9.figshare.7427927.v1. Class 1 is composed of 140 spectra of Cryptococcus neoformans samples and class 2 of 140 spectra of Cryptococcus gattii samples. Spectra were acquired in the 400–4000 cm⁻¹ spectral range with a resolution of 4 cm⁻¹ and 16 co-added scans using a Bruker VERTEX 70 FTIR spectrometer (Bruker Optics, Ltd., UK). The spectral data were pre-processed by excising the biofingerprint region (900–1800 cm⁻¹), followed by automatic weighted least squares (AWLS) baseline correction and normalization to the Amide I peak (1650 cm⁻¹). More details regarding this dataset can be found in the literature (Costa et al., 2016; Morais et al., 2017).
Dataset 2 contains 240 IR spectra derived from formalin-fixed paraffin-embedded brain tissues separated into two classes. Class 1 contains 140 spectra from normal brain tissue, and class 2 contains 100 spectra from glioblastoma brain tissue. Spectra were collected via ATR-FTIR spectroscopy using a Bruker VECTOR 27 FTIR spectrometer with a Helios ATR attachment (Bruker Optics, Ltd., UK). The raw spectra, acquired in the 400–4000 cm⁻¹ spectral range with a resolution of 8 cm⁻¹ and 32 co-added scans, were pre-processed by excising the biofingerprint region (900–1800 cm⁻¹), followed by rubberband baseline correction and normalization to the Amide I peak (1650 cm⁻¹). This dataset is publicly available as part of the IRootLab toolbox (http://trevisanj.github.io/irootlab/) (Trevisan et al., 2013), and more information about it can be found in Gajjar et al. (2012).
Dataset 3 contains 183 IR spectra distributed into three classes. Class 1 contains 59 spectra of Syrian hamster embryo (SHE) cells treated with benzo[a]pyrene (B[a]P), class 2 contains 62 spectra of SHE cells treated with 3-methylcholanthrene (3-MCA) and class 3 contains 62 spectra of SHE cells treated with anthracene (Ant). Spectra were acquired in the 400–4000 cm⁻¹ spectral range with a resolution of 8 cm⁻¹ using a Bruker TENSOR 27 spectrometer with a Helios ATR attachment (Bruker Optics, Ltd., UK). Pre-processing was performed by excising the biofingerprint region (900–1800 cm⁻¹), followed by rubberband baseline correction and normalization to the Amide I peak (1650 cm⁻¹). This dataset is publicly available as part of the IRootLab toolbox (http://trevisanj.github.io/irootlab/) (Trevisan et al., 2013), and further information can be found in Trevisan et al. (2010).
Dataset 4 contains 270 IR spectra from blood samples divided into four classes. Class 1 is composed of 90 IR spectra of control samples, class 2 contains 88 spectra from patients with Dengue, class 3 contains 66 spectra from patients with Zika and class 4 contains 26 spectra from patients with Chikungunya. This dataset is publicly available at https://doi.org/10.6084/m9.figshare.7427933.v1. Spectra were collected in ATR mode using a Bruker VERTEX 70 FTIR spectrometer (Bruker Optics, Ltd., UK). Acquisition was performed in the 400–4000 cm⁻¹ spectral range with a resolution of 4 cm⁻¹ and 16 co-added scans. Pre-processing was performed by excising the biofingerprint region (900–1800 cm⁻¹), followed by Savitzky-Golay smoothing (window of 7 points) (Savitzky and Golay, 1964), AWLS baseline correction and normalization to the Amide I peak (1650 cm⁻¹). Further details about this dataset can be found in Santos et al. (2018).
Dataset 5 contains 351 Raman spectra of blood plasma divided into two classes: 162 spectra of healthy individuals (class 1), and 189 spectra of ovarian cancer patients (class 2). This dataset is publicly available at https://doi.org/10.6084/m9.figshare.6744206.v1. Raman spectra were collected using an InVia Renishaw Raman spectrometer coupled with a charge-coupled device (CCD) detector and Leica microscope, with 5% laser power (785 nm), 5x objective magnification, 10 s exposure time and 2 accumulations in the spectral range of 400–2000 cm⁻¹. The spectral data were pre-processed by Savitzky-Golay smoothing (window of 15 points), AWLS baseline correction and vector normalization. Further details about this dataset can be found in Paraskevaidi et al. (2018).
Dataset 6 contains 322 surface-enhanced Raman spectroscopy (SERS) spectra of blood plasma, also divided into two classes: 133 spectra of healthy individuals (class 1), and 189 spectra of ovarian cancer patients (class 2). This dataset is publicly available at https://doi.org/10.6084/m9.figshare.6744206.v1. SERS spectra were collected using the same settings as for dataset 5 but, in this case, silver nanoparticles were mixed with the biofluid before spectral acquisition. The spectral pre-processing was performed using Savitzky-Golay smoothing (window of 15 points), AWLS baseline correction and vector normalization. Further details about this dataset can be found in Paraskevaidi et al. (2018).
Tests were also performed with simulated data. These data were generated anew for each of 1000 simulations from normally distributed random matrices of size 100 × 1000 for class 1 and 100 × 1000 for class 2 (100 observations, 1000 variables per observation). The matrix values ranged randomly from -10 to 10 units. A shift of 5 units was randomly added to class 2 to create a difference between the classes. The codes to produce class 1 and class 2 in MATLAB are ‘class_1 = randn(100, 1000).*randn(100, 1000);’ and ‘class_2 = (randn(100, 1000)+5).*randn(100, 1000);’. Both classes were regenerated in every simulation, and all algorithms (RS, KS and MLM) were applied independently to each simulation.
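The two MATLAB lines above can be mirrored in other environments. The following is a minimal sketch in Python using only the standard library; the function name `simulate_classes` and its parameters are hypothetical and are not part of the original routines. It assumes that `randn` corresponds to independent standard normal draws and that `.*` is an element-wise product.

```python
import random


def simulate_classes(n_obs=100, n_vars=1000, shift=5.0, seed=None):
    """Generate one simulated pair of classes (hypothetical helper).

    Each element is the product of two independent standard normal draws;
    for class 2 the first factor is offset by `shift`, mirroring
    (randn(...) + 5) .* randn(...) in the MATLAB lines quoted in the text.
    """
    rng = random.Random(seed)

    def product_matrix(offset):
        # n_obs rows (observations) by n_vars columns (variables)
        return [
            [(rng.gauss(0.0, 1.0) + offset) * rng.gauss(0.0, 1.0)
             for _ in range(n_vars)]
            for _ in range(n_obs)
        ]

    class_1 = product_matrix(0.0)
    class_2 = product_matrix(shift)
    return class_1, class_2


# One simulation run; the study repeats this 1000 times.
class_1, class_2 = simulate_classes(seed=0)
```

Under this construction both classes have an expected value of zero per element, because the second factor has zero mean; the offset appears as a wider spread in class 2 (per-element variance of 26 against 1 for class 1).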
Software. Data analysis was performed within the MATLAB R2014b (MathWorks, Inc., USA) environment. Pre-processing was performed using PLS Toolbox 7.9.3 (Eigenvector Research, Inc., USA) and classification was performed using the Classification Toolbox for MATLAB (http://www.michem.unimib.it/) (Ballabio and Consonni, 2013). The RS, KS and MLM algorithms were run using laboratory-generated routines. The MLM algorithm is publicly available at https://doi.org/10.6084/m9.figshare.7393517.v1.
Sample selection. Samples were divided into training (70%) and test (30%) sets using, independently, the RS, KS or MLM algorithms. RS is based on a random sample selection in which spectra from the original dataset are randomly assigned to training or test. The KS algorithm is based on a Euclidean distance calculation: the sample with maximum distance to all other samples is selected first, and then the samples that are as far away as possible from those already selected are added until the required number of samples is reached. The samples are therefore selected in such a way that they uniformly cover the complete sample space, reducing the need for extrapolation to the remaining samples. The MLM algorithm first applies the KS method to the data, as described above; then, a random-mutation factor is applied to the KS results, whereby some samples from the training set are transferred to the test set, and some samples from the test set are transferred to training. Herein, the mutation factor was set at 10%. This value is inspired by the mutation probability of genetic algorithms (Morais et al., 2019), where 10% is a common threshold employed to keep a balance between the degree of randomness and model convergence. The MLM algorithm is visually illustrated in Figure 1.
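The selection procedure can be sketched as follows. This is an illustrative Python version, not the authors' MATLAB routine. It assumes the commonly used maximin form of Kennard-Stone (seeded with the two most distant samples), and it assumes that the 10% mutation swaps an equal number of training and test samples so that the 70/30 proportion is preserved. The names `kennard_stone` and `mlm_split` are hypothetical.

```python
import math
import random


def kennard_stone(X, n_train):
    """Indices of n_train samples picked by a maximin Kennard-Stone rule."""
    n = len(X)
    if not 2 <= n_train <= n:
        raise ValueError("n_train must be between 2 and the number of samples")
    dist = [[math.dist(X[i], X[j]) for j in range(n)] for i in range(n)]
    # Seed with the two samples that are farthest apart (assumed variant).
    i0, j0 = max(((i, j) for i in range(n) for j in range(i + 1, n)),
                 key=lambda p: dist[p[0]][p[1]])
    selected = [i0, j0]
    remaining = [k for k in range(n) if k not in (i0, j0)]
    while len(selected) < n_train:
        # Add the sample whose nearest selected neighbour is farthest away.
        nxt = max(remaining, key=lambda k: min(dist[k][s] for s in selected))
        selected.append(nxt)
        remaining.remove(nxt)
    return selected


def mlm_split(X, train_fraction=0.7, mutation=0.10, seed=None):
    """KS split followed by a random swap between training and test sets."""
    rng = random.Random(seed)
    n = len(X)
    n_train = round(train_fraction * n)
    train = kennard_stone(X, n_train)
    in_train = set(train)
    test = [k for k in range(n) if k not in in_train]
    # Assumption: the same number of samples moves in each direction.
    n_swap = min(round(mutation * n_train), len(test))
    to_test = rng.sample(train, n_swap)
    to_train = rng.sample(test, n_swap)
    train = [k for k in train if k not in to_test] + to_train
    test = [k for k in test if k not in to_train] + to_test
    return sorted(train), sorted(test)


# Ten one-dimensional samples: 7 go to training, 3 to test, 1 pair is swapped.
X = [(float(v),) for v in range(10)]
train_idx, test_idx = mlm_split(X, seed=1)
```

Setting `mutation=0.0` reduces the sketch to a plain KS split, which makes the effect of the mutation step easy to isolate when comparing the two.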
Classification. Classification was performed based on a PCA-LDA algorithm. For this, initially a principal component analysis (PCA) model is applied to the pre-processed data, decomposing the spectral space into a small number of PCs representing most of the original data-explained variance (Bro and Smilde, 2014). Each PC is composed of scores and loadings, the former representing the variance on samples direction, and the latter the variance on variables (e.g. wavenumber) direction. Then, the PCA scores are used as input for a linear discriminant analysis (LDA) classifier. LDA performs a Mahalanobis distance calculation to linearly classify the input space (PCA scores) into at least two classes (Dixon and Brereton, 2009; Morais and Lima, 2018). The LDA classification scores (L_{ik}) can be calculated in a non-Bayesian form as (Dixon and Brereton, 2009; Morais and Lima, 2018):

L_{ik} = (x_{i} - \bar{x}_{k})^{T} C_{pooled}^{-1} (x_{i} - \bar{x}_{k})

where x_{i} is a vector containing the input variables for sample i; \bar{x}_{k} is the mean vector of class k; C_{pooled} is the pooled covariance matrix between the classes; and T represents the matrix transpose operation. Model optimization was performed using venetian blinds cross-validation with 10 splits.
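To make the score L_{ik} concrete, the sketch below evaluates it for samples described by two input variables (for example, the scores on two PCs), so that the inverse of the pooled covariance matrix can be written in closed form. It is an illustration only and not the Classification Toolbox implementation. The function names are hypothetical; the pooled covariance is assumed to be the within-class scatter divided by (n - K) for n samples and K classes, and each sample is assumed to be assigned to the class with the smallest score.

```python
def mean_vector(rows):
    """Column-wise mean of a list of equal-length points."""
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]


def pooled_covariance(classes):
    """Pooled within-class covariance (2 x 2) for a list of classes.

    Assumed definition: summed within-class scatter divided by (n - K).
    """
    n_total = sum(len(rows) for rows in classes)
    dof = n_total - len(classes)
    c = [[0.0, 0.0], [0.0, 0.0]]
    for rows in classes:
        m = mean_vector(rows)
        for r in rows:
            d = [r[0] - m[0], r[1] - m[1]]
            for a in range(2):
                for b in range(2):
                    c[a][b] += d[a] * d[b] / dof
    return c


def lda_score(x, class_mean, c_pooled):
    """L_ik = (x - mean_k)^T C_pooled^-1 (x - mean_k) for two variables."""
    (a, b), (c, d) = c_pooled
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]
    diff = [x[0] - class_mean[0], x[1] - class_mean[1]]
    tmp = [inv[0][0] * diff[0] + inv[0][1] * diff[1],
           inv[1][0] * diff[0] + inv[1][1] * diff[1]]
    return diff[0] * tmp[0] + diff[1] * tmp[1]


def classify(x, classes):
    """Return (index of the class with the smallest score, all scores)."""
    c_pooled = pooled_covariance(classes)
    scores = [lda_score(x, mean_vector(rows), c_pooled) for rows in classes]
    return scores.index(min(scores)), scores


# Two toy classes in a two-variable score space.
toy = [
    [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)],
    [(4.0, 4.0), (5.0, 4.0), (4.0, 5.0), (5.0, 5.0)],
]
label, scores = classify((1.0, 1.0), toy)
```

With more than two input variables the same expression applies, but the inverse of C_{pooled} would be obtained with a general linear algebra routine instead of the closed-form 2 × 2 inverse used here.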
The PCA-LDA classification performance was evaluated by means of accuracy, sensitivity and specificity calculations. Accuracy represents the proportion of samples correctly classified, considering both true positives and true negatives; sensitivity measures the proportion of positives that are correctly identified; and specificity measures the proportion of negatives that are correctly identified (Morais and Lima, 2017). These parameters are calculated as follows:

Accuracy (%) = ((TP + TN) / (TP + FP + TN + FN)) \times 100

Sensitivity (%) = (TP / (TP + FN)) \times 100

Specificity (%) = (TN / (TN + FP)) \times 100

where TP stands for true positives; TN for true negatives; FP for false positives; and, FN for false negatives.
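As a worked illustration of the three formulas, the following Python sketch counts TP, TN, FP and FN from true and predicted labels and converts them to percentages. The function names and the toy labels are hypothetical and are not taken from the study.

```python
def confusion_counts(y_true, y_pred, positive):
    """Count TP, TN, FP and FN, treating `positive` as the positive class."""
    tp = tn = fp = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive:
            if t == positive:
                tp += 1
            else:
                fp += 1
        else:
            if t == positive:
                fn += 1
            else:
                tn += 1
    return tp, tn, fp, fn


def classification_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity and specificity, each as a percentage."""
    accuracy = 100.0 * (tp + tn) / (tp + fp + tn + fn)
    sensitivity = 100.0 * tp / (tp + fn)
    specificity = 100.0 * tn / (tn + fp)
    return accuracy, sensitivity, specificity


# Toy test set: class 1 is treated as the positive class.
y_true = [1, 1, 1, 2, 2, 2, 2, 1]
y_pred = [1, 1, 2, 2, 2, 1, 2, 1]
acc, sens, spec = classification_metrics(*confusion_counts(y_true, y_pred, 1))
# Here TP = 3, TN = 3, FP = 1 and FN = 1, so all three metrics equal 75.0.
```

For datasets with more than two classes, one option is to compute sensitivity and specificity per class by treating each class in turn as the positive one; whether the study did so is not stated in this section.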

Morais C.L., Santos M.C., Lima K.M, & Martin F.L. (2019). Improving data splitting for classification applications in spectrochemical analyses employing a random-mutation Kennard-Stone algorithm approach. Bioinformatics, 35(24), 5257-5263.

Publication 2019

Corresponding Organization : University of Central Lancashire

Other organizations : Universidade Federal do Rio Grande do Norte

Protocol cited in 11 other protocols

Variable analysis

independent variables

Sample selection algorithms (RS, KS, and MLM)

dependent variables

Classification performance metrics (accuracy, sensitivity, and specificity)

control variables

PCA-LDA classification algorithm
Cross-validation with 10 splits
