Efficiently Gluing Pre-Trained Language and Vision Models for Image Captioning

Published: 19 November 2024

Abstract

Vision-and-language pre-training models have achieved impressive performance on image captioning, but most of them are trained on millions of paired image-text data and require huge memory and computing overhead. To alleviate this, we stand on the shoulders of large-scale pre-trained language models (PLMs) and pre-trained vision models (PVMs) and efficiently connect them for image captioning. There are two major challenges: the language and vision modalities have different semantic granularity (e.g., a noun may cover many pixels), and a semantic gap remains between the pre-trained language and vision models. To this end, we design a lightweight and efficient connector that glues the PVM and PLM, following a selection-then-transformation criterion. Specifically, in the selection phase, we treat each image as a set of patches instead of pixels, select salient image patches, and cluster them into visual regions to align with text. Then, to effectively reduce the semantic gap, we map the selected image patches into the text space through spatial and channel transformations. When trained on image captioning datasets, the connector learns to bridge the semantic granularity and the semantic gap via backpropagation, preparing the inputs for the PLM to generate descriptions. Experimental results on the MSCOCO and Flickr30k datasets demonstrate that our method yields performance comparable to existing works. By training only the small connector, we achieve a CIDEr score of 132.2% on the MSCOCO Karpathy test split. Moreover, our findings reveal that fine-tuning the PLM can further unlock its potential, raising the CIDEr score to 140.6%. Code and models are available at https://github.com/YuanEZhou/PrefixCap.
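
As a rough illustration of how such a connector might be wired up, the following PyTorch sketch selects the top-K salient patches and then applies spatial and channel transformations to map them into the PLM embedding space. This is not the authors' implementation (the linked repository contains that): all module names, dimensions, and the saliency scores are assumptions, and the paper's clustering of salient patches into visual regions is approximated here by a learned linear mixing over the selected patches.

import torch
import torch.nn as nn


class Connector(nn.Module):
    """Illustrative selection-then-transformation connector (not the paper's code)."""

    def __init__(self, vis_dim=1024, txt_dim=768, num_selected=64, num_regions=16):
        super().__init__()
        self.num_selected = num_selected
        # Spatial transformation: mix the K selected patches along the token axis
        # and compress them into R region-level tokens (a learned stand-in for the
        # paper's clustering of salient patches into visual regions).
        self.spatial = nn.Linear(num_selected, num_regions)
        # Channel transformation: project visual features into the PLM text space.
        self.channel = nn.Sequential(
            nn.Linear(vis_dim, txt_dim),
            nn.GELU(),
            nn.Linear(txt_dim, txt_dim),
        )

    def forward(self, patch_feats, saliency):
        # patch_feats: (B, N, vis_dim) patch features from a frozen PVM
        # saliency:    (B, N) per-patch importance scores (e.g., CLS attention)
        idx = saliency.topk(self.num_selected, dim=1).indices              # (B, K)
        idx = idx.unsqueeze(-1).expand(-1, -1, patch_feats.size(-1))       # (B, K, C)
        sel = torch.gather(patch_feats, 1, idx)                            # (B, K, C)
        regions = self.spatial(sel.transpose(1, 2)).transpose(1, 2)        # (B, R, C)
        return self.channel(regions)                                       # (B, R, txt_dim)


# Toy usage: the returned tokens would act as a visual prefix prepended to the
# caption token embeddings of a frozen (or lightly fine-tuned) PLM such as GPT-2.
connector = Connector()
feats = torch.randn(2, 196, 1024)   # dummy patch features (B, N, vis_dim)
scores = torch.rand(2, 196)         # dummy saliency scores (B, N)
prefix = connector(feats, scores)
print(prefix.shape)                 # torch.Size([2, 16, 768])

Since only the connector's parameters are trained, the PVM and PLM can stay frozen, which is what keeps the memory and compute overhead low relative to full vision-language pre-training.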

Cited By

  • (2025) Integrated Image-Text Augmentation for Few-Shot Learning in Vision-Language Models. ACM Transactions on Intelligent Systems and Technology. https://doi.org/10.1145/3712700. Online publication date: 20-Jan-2025
  • (2025) Adaptafood: an intelligent system to adapt recipes to specialised diets and healthy lifestyles. Multimedia Systems 31, 1. https://doi.org/10.1007/s00530-025-01667-y. Online publication date: 1-Feb-2025

Published In

ACM Transactions on Intelligent Systems and Technology, Volume 15, Issue 6
December 2024, 727 pages
EISSN: 2157-6912
DOI: 10.1145/3613712
Editor: Huan Liu

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

Published: 19 November 2024
Online AM: 29 July 2024
Accepted: 16 July 2024
Revised: 23 May 2024
Received: 07 December 2023
Published in TIST Volume 15, Issue 6

Author Tags

1. image captioning
2. large-scale pre-trained model
3. lightweight and efficient

Qualifiers

• Research-article

Funding Sources

• National Natural Science Foundation of China
• Fundamental Research Funds for the Central Universities
