Background: Visual question answering (VQA) is a challenging task that requires multimodal reasoning and knowledge. The objective of VQA is to answer natural language questions based on the information present in a given image. The core challenge is to extract visual and textual features and project them into a common representation space. In addition, a VQA method must detect the objects present in an image and capture the relationships between them.
Methods: In this study, we explored different feature fusion methods for VQA, using pre-trained models to encode the text and image features and then applying different attention mechanisms to fuse them. We evaluated our methods on the DAQUAR dataset.
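To make the compared fusion variants concrete, the following is a minimal PyTorch sketch, not the exact implementation from the paper: the module name `FusionVQA`, the 768-dimensional hidden size, the 8 attention heads, the mean pooling, and the classifier head are placeholder assumptions, since the abstract does not specify these details.

```python
import torch
import torch.nn as nn


class FusionVQA(nn.Module):
    """Fuse precomputed text and image features by concatenation,
    self-attention, or cross-attention (text as query, image as key/value).
    Illustrative sketch only; dimensions and pooling are assumptions."""

    def __init__(self, text_dim=768, image_dim=768, hidden_dim=768,
                 num_answers=1000, mode="concat"):
        # num_answers is a placeholder; set it to the dataset's answer vocabulary size.
        super().__init__()
        self.mode = mode
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        self.image_proj = nn.Linear(image_dim, hidden_dim)
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads=8,
                                          batch_first=True)
        in_dim = 2 * hidden_dim if mode == "concat" else hidden_dim
        self.classifier = nn.Linear(in_dim, num_answers)

    def forward(self, text_feats, image_feats):
        # text_feats: (batch, text_len, text_dim)
        # image_feats: (batch, image_len, image_dim)
        t = self.text_proj(text_feats)
        v = self.image_proj(image_feats)
        if self.mode == "concat":
            # Concatenate pooled features from both modalities.
            fused = torch.cat([t.mean(dim=1), v.mean(dim=1)], dim=-1)
        elif self.mode == "self":
            # Self-attention over the joint sequence of text and image tokens.
            joint = torch.cat([t, v], dim=1)
            fused, _ = self.attn(joint, joint, joint)
            fused = fused.mean(dim=1)
        else:
            # Cross-attention: text as query, image as key and value.
            fused, _ = self.attn(t, v, v)
            fused = fused.mean(dim=1)
        return self.classifier(fused)
```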
Results: We used three metrics to measure the performance of our methods: WUPS (Wu-Palmer similarity score), accuracy (Acc), and F1 score. We found that concatenating raw text and image features performs slightly better than self-attention for VQA. We also found that using text as the query and the image as the key and value performs worse than the other cross-attention and self-attention variants, possibly because it does not capture the bidirectional interactions between the text and image modalities.
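For reference, the sketch below shows how a thresholded WUPS score can be computed with NLTK's WordNet interface for single-word answers. The 0.9 threshold and the 0.1 down-weighting follow the common WUPS@0.9 convention on DAQUAR; the exact evaluation script used in this study may differ.

```python
from nltk.corpus import wordnet as wn


def word_score(a: str, b: str) -> float:
    """Maximum Wu-Palmer similarity over all synset pairs of two words."""
    syns_a, syns_b = wn.synsets(a), wn.synsets(b)
    if not syns_a or not syns_b:
        return 0.0
    return max((sa.wup_similarity(sb) or 0.0)
               for sa in syns_a for sb in syns_b)


def wups(predictions, ground_truths, threshold=0.9):
    """Average thresholded Wu-Palmer score over single-word answers."""
    scores = []
    for pred, gt in zip(predictions, ground_truths):
        s = word_score(pred, gt)
        if s < threshold:
            s *= 0.1  # down-weight matches below the threshold (WUPS@threshold)
        scores.append(s)
    return sum(scores) / max(len(scores), 1)


# Example usage (requires nltk.download("wordnet") beforehand):
# print(wups(["chair"], ["sofa"]))
```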
Conclusion: In this paper, we presented a comparative study of different feature fusion methods for VQA, using pre-trained models to encode the text and image features and then applying different attention mechanisms to fuse them. We showed that concatenating raw text and image features is a simple but effective method for VQA, whereas using text as the query and the image as the key and value is suboptimal. We also discussed the limitations and future directions of our work.