We used CNNs built from multiple cell structures, each consisting of two one-dimensional convolution layers, one pooling layer, and one dropout layer. The convolution layers are designed to extract features with high-dimensional abstract representations. The pooling layer keeps the number of model parameters tractable through pooling operations. The dropout layer prevents overfitting of the model by randomly setting some of the input units to 0. Four prediction methods were established based on four different network structures composed of the cell structures described above. One-hot encoded data were fed into the network with four cell structures and fully connected layers, while neighboring methylation state encoding data, RNA word embedding data, and Gene2vec-processed data were fed into networks with two cell structures (Fig. 10). The final result was obtained by applying a voting strategy to the four prediction probabilities.
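As a rough illustration of this architecture, the sketch below stacks such cell structures in Keras. The filter counts, kernel sizes, dropout rate, and input shape are placeholder assumptions for illustration only, not the settings used in our experiments.

```python
# Illustrative sketch only: one "cell structure" (two 1D convolutions,
# max pooling, dropout) stacked n_cells times and followed by fully
# connected layers. Filter counts, kernel sizes, dropout rate, and input
# shape are assumptions, not the configuration reported in this work.
import tensorflow as tf
from tensorflow.keras import layers

def cell_structure(filters=64, kernel_size=5, dropout_rate=0.3):
    return [
        layers.Conv1D(filters, kernel_size, padding="same", activation="relu"),
        layers.Conv1D(filters, kernel_size, padding="same", activation="relu"),
        layers.MaxPooling1D(pool_size=2),   # downsampling keeps parameters tractable
        layers.Dropout(dropout_rate),       # rate corresponds to 1 - P
    ]

def build_network(n_cells, input_shape):
    """n_cells = 4 for the one-hot network, 2 for the other three encodings."""
    inputs = tf.keras.Input(shape=input_shape)
    x = inputs
    for _ in range(n_cells):
        for layer in cell_structure():
            x = layer(x)
    x = layers.Flatten()(x)
    x = layers.Dense(64, activation="relu")(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)  # binary probability
    return tf.keras.Model(inputs, outputs)

# e.g. a one-hot encoded sequence of length 41 over 4 nucleotides (assumed shape)
model = build_network(n_cells=4, input_shape=(41, 4))
```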
Taking the one-hot encoded sequence as an example, the input data matrix $X_n$ was first fed into a 1D convolutional layer, which used a convolutional filter $W_f \in \mathbb{R}^H$, where $H$ is the length of the filter vector. The output feature $A_i$ at the $i$th position was computed by
$$A_i = \mathrm{ReLU}\!\left(\sum_{h=1}^{H} W_f^{h}\, X_{n,i+h} + b_f\right),$$
where $\mathrm{ReLU}(x) = \max(0, x)$ is the rectified linear unit function and $b_f \in \mathbb{R}$ is a bias (Mairal et al. 2014). These convolutional operations are equivalent to filtering a data block of length $H$ in the sequence with a sliding filter window at each position $i$.
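A minimal NumPy sketch of this operation, with an assumed filter length of H = 3 and toy input values, is shown below; it slides the filter window along a single input channel and applies the ReLU at each position.

```python
# Minimal NumPy sketch of the convolution formula above, for a single
# filter w of length H applied to a one-channel input x (toy values assumed).
import numpy as np

def conv1d_feature(x, w, b):
    H = len(w)
    # A_i = ReLU(sum_{h=1..H} w[h] * x[i+h] + b) at every valid position i
    out = np.array([w @ x[i:i + H] + b for i in range(len(x) - H + 1)])
    return np.maximum(out, 0.0)   # ReLU

x = np.array([0.0, 1.0, 0.0, 1.0, 1.0, 0.0])
w = np.array([0.5, -1.0, 0.5])   # assumed filter length H = 3
print(conv1d_feature(x, w, b=0.1))
```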
Next, a max-pooling layer was used to reduce the dimensionality of the output generated by the multiple convolutional filter operations. Max pooling is a form of nonlinear downsampling that outputs the maximum of each subregion.
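For example, assuming non-overlapping subregions of size 2, max pooling halves the length of a feature map, as in the following sketch.

```python
# Max pooling as nonlinear downsampling: each non-overlapping subregion of
# the feature map is replaced by its maximum (subregion size of 2 assumed).
import numpy as np

def max_pool1d(a, pool_size=2):
    n = len(a) // pool_size                              # drop any trailing remainder
    return a[:n * pool_size].reshape(n, pool_size).max(axis=1)

print(max_pool1d(np.array([0.2, 0.9, 0.1, 0.4, 0.7, 0.3])))   # -> [0.9 0.4 0.7]
```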
To reduce overfitting, we added a dropout layer in which individual nodes were either “dropped out” of the network with probability 1 − P or kept with probability P at each training stage. This not only prevented overfitting, but also effectively integrated many thinned network structures, generating more robust features that generalize better to new data.
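The sketch below illustrates this behavior; the 1/P rescaling ("inverted dropout") is an assumed, commonly used implementation detail that preserves the expected activation, not necessarily the exact scheme used here.

```python
# Training-time dropout: each unit is zeroed with probability 1 - P and kept
# with probability P. The 1/P rescaling (inverted dropout) is an assumed detail.
import numpy as np

def dropout(a, keep_prob, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    mask = rng.random(a.shape) < keep_prob   # keep each unit with probability P
    return np.where(mask, a / keep_prob, 0.0)

print(dropout(np.ones(8), keep_prob=0.75))
```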
Finally, a flattening layer was used to transform the multidimensional data into a single dimension. Fully connected layers with a ReLU activation function and an output layer then predict the binary classification probability using the sigmoid activation function (Han and Moraga 1995):
$$\hat{y}(x) = \mathrm{sigmoid}(x) = \frac{1}{1 + e^{-x}}.$$
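The following sketch shows how this classification head combines the pieces: the pooled feature maps are flattened, passed through a fully connected ReLU layer, and mapped by a single sigmoid unit to the binary probability. All weight shapes below are arbitrary placeholders.

```python
# Classification head: flatten the pooled feature maps, apply a fully connected
# ReLU layer, then a single sigmoid unit. All weight shapes are placeholders.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(feature_maps, w1, b1, w2, b2):
    x = feature_maps.ravel()                 # flatten to a single dimension
    h = np.maximum(w1 @ x + b1, 0.0)         # fully connected layer with ReLU
    return sigmoid(w2 @ h + b2)              # predicted probability y_hat

rng = np.random.default_rng(0)
fmap = rng.random((5, 4))                    # e.g. 5 positions x 4 filters
w1, b1 = rng.normal(size=(8, 20)), np.zeros(8)
w2, b2 = rng.normal(size=8), 0.0
print(predict(fmap, w1, b1, w2, b2))
```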