The first few layers of the CNN serve as a feature extractor that learns image features automatically through supervised training; the final layer applies a Softmax function to produce the detection result [22]. Figure 6 presents the CNN structure.
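The Softmax function mentioned above converts the final layer's raw scores into a probability distribution over classes, from which the detected class is read off as the largest probability. A minimal stdlib-only sketch (the logit values are illustrative, not taken from the paper):

```python
import math

def softmax(logits):
    """Map raw class scores to probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Example: three hypothetical class scores from the last layer
probs = softmax([2.0, 1.0, 0.1])
predicted_class = probs.index(max(probs))  # index 0 wins here
```

In a real network this would be applied to the 10 outputs of the last fully connected layer described below.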
As can be seen from Figure 6, the CNN has eight layers in total: the first five are alternating convolutional and Max Pooling layers, and the remaining three are fully connected layers. The inputs to the CNN are the harmonic and percussive spectrograms produced by HPSS separation, together with the spectrogram of the original signal. The images are resized to a uniform 256 × 256 and fed into the first convolutional layer, which filters them with 96 kernels of size 11 × 11 at a stride of 4 pixels, the stride being the distance between the receptive-field centers of neighboring neurons in the same kernel map [23]. A Max Pooling layer then takes the output of the first convolutional layer and filters it with 3 × 3 pooling windows. After the input size is unified, the second convolutional layer filters the pooled output with 256 kernels of size 5 × 5. The third, fourth, and fifth convolutional layers are connected to one another with no pooling or normalization layers in between: the third has 384 kernels of size 3 × 3 connected to the output of the second convolutional layer [24], the fourth has 384 kernels of size 3 × 3, and the fifth has 256 kernels of size 3 × 3. These five convolutional layers yield 256 feature maps of size 6 × 6, which are fed to three fully connected layers of 4096, 1000, and 10 neurons, respectively. The last fully connected layer outputs the final detection result [25].
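The spatial sizes in the architecture above follow from the standard convolution/pooling output-size formula, floor((size + 2·padding − kernel) / stride) + 1. The sketch below traces a 256 × 256 input through the stated kernels and strides; the paddings, the pooling strides, and the second and final pooling steps are assumptions in the AlexNet style, since the text does not specify them, but under these assumptions the trace reproduces the stated 6 × 6 feature maps:

```python
def out_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a conv or pooling layer (floor convention)."""
    return (size + 2 * padding - kernel) // stride + 1

s = 256
s = out_size(s, 11, stride=4)   # conv1: 96 kernels, 11x11, stride 4
s = out_size(s, 3, stride=2)    # max pool, 3x3 window (assumed stride 2)
s = out_size(s, 5, padding=2)   # conv2: 256 kernels, 5x5 (assumed padding 2)
s = out_size(s, 3, stride=2)    # max pool (assumed, AlexNet-style)
s = out_size(s, 3, padding=1)   # conv3: 384 kernels, 3x3 (assumed padding 1)
s = out_size(s, 3, padding=1)   # conv4: 384 kernels, 3x3
s = out_size(s, 3, padding=1)   # conv5: 256 kernels, 3x3
s = out_size(s, 3, stride=2)    # final max pool (assumed, AlexNet-style)
print(s)  # -> 6, i.e. 256 feature maps of size 6 x 6
```

The 6 × 6 × 256 = 9216 values are then flattened and passed to the first fully connected layer of 4096 neurons.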