Recent Advances in Computer Science and Communications

Cross-attention Based Text-image Transformer for Visual Question Answering

Author(s): Mahdi Rezapour*

DOI: 10.2174/0126662558291150240102111855

Article ID: e300124226509 Pages: 7


Abstract

Background: Visual question answering (VQA) is a challenging task that requires multimodal reasoning and knowledge. The objective of VQA is to answer natural language questions based on the information present in a given image. A central challenge in VQA is extracting visual and textual features and projecting them into a common representation space. Further difficulties arise in detecting the objects present in an image and modeling the relationships between them.

Methods: In this study, we explored different feature fusion methods for VQA, using pre-trained models to encode the text and image features and then applying different attention mechanisms to fuse them. We evaluated our methods on the DAQUAR dataset.
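As a rough illustration of this pipeline, the following PyTorch sketch fuses pooled text and image features by simple concatenation before an answer classifier. The encoder dimensions, classifier head, and answer-vocabulary size are illustrative assumptions; the study only specifies that pre-trained models produce the text and image features before fusion.

import torch
import torch.nn as nn

class ConcatFusionVQA(nn.Module):
    """Fuse pooled text and image features by concatenation (sketch)."""
    def __init__(self, text_dim=768, image_dim=768, hidden_dim=512, num_answers=582):
        super().__init__()
        # num_answers=582 is a placeholder answer-vocabulary size, not a DAQUAR constant.
        self.classifier = nn.Sequential(
            nn.Linear(text_dim + image_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_answers),
        )

    def forward(self, text_feat, image_feat):
        # text_feat:  (batch, text_dim)  pooled output of a pre-trained text encoder
        # image_feat: (batch, image_dim) pooled output of a pre-trained image encoder
        fused = torch.cat([text_feat, image_feat], dim=-1)
        return self.classifier(fused)

# Random tensors stand in for real encoder outputs.
model = ConcatFusionVQA()
logits = model(torch.randn(4, 768), torch.randn(4, 768))
print(logits.shape)  # torch.Size([4, 582])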

Results: We used three metrics to measure the performance of our methods: WUPS, accuracy (Acc), and F1. We found that concatenating raw text and image features performs slightly better than self-attention for VQA. We also found that using text as the query and the image as the key and value performs worse than the other cross-attention and self-attention variants, possibly because it does not capture the bidirectional interactions between the text and image modalities.
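For contrast, the "text as query, image as key and value" variant discussed above can be sketched with a single cross-attention layer, as below. The layer sizes, mean pooling, and one-directional design are assumptions made for illustration rather than the paper's exact architecture.

import torch
import torch.nn as nn

class TextQueryCrossAttention(nn.Module):
    """Text tokens attend over image patch features in one direction only (sketch)."""
    def __init__(self, dim=768, num_heads=8, num_answers=582):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.classifier = nn.Linear(dim, num_answers)

    def forward(self, text_tokens, image_patches):
        # text_tokens:   (batch, n_text, dim)   token features from a pre-trained text encoder
        # image_patches: (batch, n_patch, dim)  patch features from a pre-trained image encoder
        attended, _ = self.cross_attn(query=text_tokens,
                                      key=image_patches,
                                      value=image_patches)
        pooled = attended.mean(dim=1)  # mean-pool the attended text tokens
        return self.classifier(pooled)

model = TextQueryCrossAttention()
logits = model(torch.randn(2, 16, 768), torch.randn(2, 196, 768))
print(logits.shape)  # torch.Size([2, 582])

Because attention flows only from text to image in this sketch, the image features never attend back to the question, which is one way to read the reported drop relative to bidirectional or concatenation-based fusion.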

Conclusion: In this paper, we presented a comparative study of different feature fusion methods for VQA, using pre-trained models to encode the text and image features and then applying different attention mechanisms to fuse them. We showed that concatenating raw text and image features is a simple but effective method for VQA, while using text as the query and the image as the key and value is suboptimal. We also discussed the limitations and future directions of our work.
