In this paper, a multimodal fusion model is designed; the overall flowchart is shown in Figure 1. The model consists of unimodal feature extraction encoders and a classification head that performs the multimodal fusion. Weights are shared among the unimodal encoders, and within each convolutional layer all spatial locations share the same kernel, which greatly reduces the number of parameters required by the convolutional layers. Each feature extraction encoder consists of four residual blocks; the specific structure is shown in Figure 2. The CNN constructed in this paper mainly consists of convolutional layers, max pooling layers, and Dropout layers, with a fully connected layer as the output layer.
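Since the paper does not give implementation details, the following is a minimal PyTorch sketch of one plausible reading of this architecture. The class names (`ResidualBlock`, `MultimodalFusionNet`), the channel sizes, the dropout rate, the assumption of two modalities, and fusion by feature concatenation are all illustrative assumptions, not the authors' specification.

```python
# Minimal sketch of the described architecture, under assumed
# hyperparameters: a shared four-residual-block encoder applied to
# each modality, then a fully connected classification head.
import torch
import torch.nn as nn


class ResidualBlock(nn.Module):
    """Two conv/BN layers with an identity skip connection."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
        )
        # 1x1 projection so the skip path matches the output channels
        self.skip = (nn.Identity() if in_ch == out_ch
                     else nn.Conv2d(in_ch, out_ch, 1))

    def forward(self, x):
        return torch.relu(self.conv(x) + self.skip(x))


class MultimodalFusionNet(nn.Module):
    def __init__(self, in_ch=3, num_classes=2):
        super().__init__()
        # One encoder of four residual blocks; applying the SAME module
        # to every modality is what shares the weights between them.
        self.encoder = nn.Sequential(
            ResidualBlock(in_ch, 32), nn.MaxPool2d(2), nn.Dropout(0.25),
            ResidualBlock(32, 64),    nn.MaxPool2d(2), nn.Dropout(0.25),
            ResidualBlock(64, 128),   nn.MaxPool2d(2), nn.Dropout(0.25),
            ResidualBlock(128, 256),  nn.AdaptiveAvgPool2d(1),
        )
        # Fully connected output layer fusing the per-modality features;
        # concatenation is an assumed fusion choice.
        self.head = nn.Linear(256 * 2, num_classes)

    def forward(self, modality_a, modality_b):
        fa = self.encoder(modality_a).flatten(1)  # shared-weight encoder
        fb = self.encoder(modality_b).flatten(1)  # same weights reused
        return self.head(torch.cat([fa, fb], dim=1))


# Usage: two modalities as 64x64 images in a batch of one
net = MultimodalFusionNet()
logits = net(torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64))
```

Reusing one `nn.Module` instance for both inputs, rather than constructing two encoders, is what realizes the weight sharing described above: both modalities are processed by the same parameters, halving the encoder parameter count relative to independent encoders.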