Abstract
As social media platforms proliferate, more people share their thoughts and feelings through online videos. Multimodal sentiment analysis makes it easier to predict users' emotional inclinations from these expressions, and effective fusion of emotional information across modalities is central to the task; however, most existing work falls short in this respect. The field has consequently attracted rapidly growing research interest. In real social media, images and text do not always jointly convey the same emotional polarity, and the several information modalities each contribute to the overall polarity in their own way. To address these issues, we present a multimodal sentiment analysis approach that incorporates contextual knowledge. The approach first mines topic information from social media texts to comprehensively describe the comment content, and employs state-of-the-art pre-trained models to extract emotional features across domains. We then propose feature-fusion mechanisms at multiple levels, including cross-modal global fusion and cross-modal high-level semantic fusion. Finally, we evaluate the approach on a real-world multimodal dataset. The results show that it correctly classifies the sentiment of heterogeneous online reviews and outperforms standard baselines on several other measures.
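To illustrate the kind of multi-level cross-modal fusion summarized above, the following is a minimal sketch in PyTorch. It assumes text and image features produced by pre-trained encoders (for example, BERT token embeddings and CNN/ViT patch features). The module name CrossModalFusion, the feature dimensions, and the attention-plus-concatenation design are illustrative assumptions, not the architecture proposed in this paper.

import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    # Illustrative two-level fusion: "global fusion" is concatenation of
    # pooled unimodal features; "high-level semantic fusion" uses
    # cross-modal attention. All sizes are assumptions, not the paper's.
    def __init__(self, text_dim=768, image_dim=512, hidden_dim=256, num_classes=3):
        super().__init__()
        # Project both modalities into a shared space
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        self.image_proj = nn.Linear(image_dim, hidden_dim)
        # High-level semantic fusion: text tokens attend to image regions
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads=4, batch_first=True)
        # Classifier over the concatenation of global and attended features
        self.classifier = nn.Sequential(
            nn.Linear(hidden_dim * 3, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, text_feats, image_feats):
        # text_feats: (B, T, text_dim) token features from a pre-trained text encoder
        # image_feats: (B, R, image_dim) region/patch features from a pre-trained vision encoder
        t = self.text_proj(text_feats)
        v = self.image_proj(image_feats)

        # Cross-modal high-level semantic fusion via attention
        attended, _ = self.cross_attn(query=t, key=v, value=v)
        semantic = attended.mean(dim=1)

        # Cross-modal global fusion: concatenate pooled unimodal features
        global_fused = torch.cat([t.mean(dim=1), v.mean(dim=1)], dim=-1)

        return self.classifier(torch.cat([global_fused, semantic], dim=-1))

# Usage with random stand-in features
model = CrossModalFusion()
text = torch.randn(2, 32, 768)    # e.g. BERT token embeddings
image = torch.randn(2, 49, 512)   # e.g. vision-encoder patch features
logits = model(text, image)       # (2, 3) sentiment logits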