To better fuse multimodal features, the feature extraction module expresses the data of each modality as a low-dimensional semantic vector, and a semantic similarity model is then trained so that the different modalities are constrained to a unified representation space for multimodal fusion. Here we design a channel attention mechanism for multimodal feature fusion. Specifically, for the image of the m-th modality, where m ∈ {1, 2, 3, 4}, the output features Fm of the feature extraction module are globally pooled over the spatial dimensions to obtain a C × 1 × 1 × 1 channel descriptor, where C is the number of channels of a single modal feature. A sigmoid activation function is then applied to obtain the weighting coefficients. Finally, the weighting coefficients are multiplied with the corresponding input features Fm to obtain the new weighted features. The calculation of the weighted features is shown in the following equation:

F̂m = σ(wm · GAP(Fm)) ⊗ Fm
where σ represents the sigmoid function, wm represents the parameter matrix learned during training, GAP(·) denotes the global pooling described above, and ⊗ denotes channel-wise multiplication. The features of the different modalities are concatenated after the maximum pooling layer. Finally, a Fully Connected (FC) layer is applied along the corresponding channel dimension, and its output is passed to the classifier to obtain the classification result.
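The following is a minimal sketch of the fusion step described above, written in PyTorch. The framework, the assumption of 3D (volumetric) feature maps, and the specific names and sizes (ChannelAttentionFusion, channels=64, num_classes=2) are illustrative choices, not details taken from the text.

import torch
import torch.nn as nn


class ChannelAttentionFusion(nn.Module):
    """Channel attention on per-modality features, then concatenation,
    max pooling, and a fully connected classifier (sketch)."""

    def __init__(self, num_modalities=4, channels=64, num_classes=2):
        super().__init__()
        # One parameter matrix w_m per modality, applied to the pooled
        # C-dimensional channel descriptor.
        self.w = nn.ModuleList(
            [nn.Linear(channels, channels, bias=False) for _ in range(num_modalities)]
        )
        self.gap = nn.AdaptiveAvgPool3d(1)       # global pooling -> C x 1 x 1 x 1
        self.max_pool = nn.AdaptiveMaxPool3d(1)  # pooling before concatenation
        self.fc = nn.Linear(num_modalities * channels, num_classes)

    def forward(self, feats):
        # feats: list of per-modality feature maps F_m, each of shape (B, C, D, H, W)
        fused = []
        for m, f in enumerate(feats):
            desc = self.gap(f).flatten(1)                 # (B, C) channel descriptor
            weights = torch.sigmoid(self.w[m](desc))      # sigmoid weighting coefficients
            f_hat = f * weights.view(*weights.shape, 1, 1, 1)   # reweight F_m channel-wise
            fused.append(self.max_pool(f_hat).flatten(1))       # (B, C) after max pooling
        out = torch.cat(fused, dim=1)                     # concatenate the modalities
        return self.fc(out)                               # logits for the classifier


# Example usage with four modalities of 3D features:
# model = ChannelAttentionFusion()
# feats = [torch.randn(2, 64, 8, 8, 8) for _ in range(4)]
# logits = model(feats)   # shape (2, 2)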