Efficiently Gluing Pre-Trained Language and Vision Models for Image Captioning

Published: 19 November 2024

Abstract

Vision-and-language pre-training models have achieved impressive performance on image captioning, but most of them are trained on millions of paired image-text data and require huge memory and computing overhead. To alleviate this, we stand on the shoulders of large-scale pre-trained language models (PLMs) and pre-trained vision models (PVMs) and efficiently connect them for image captioning. There are two major challenges: the language and vision modalities have different semantic granularity (e.g., a noun may cover many pixels), and a semantic gap remains between the pre-trained language and vision models. To this end, we design a lightweight and efficient connector that glues the PVM and PLM, following a selection-then-transformation criterion. Specifically, in the selection phase, we treat each image as a set of patches instead of pixels, select salient image patches, and cluster them into visual regions to align with text. Then, to effectively reduce the semantic gap, we map the selected image patches into the text space through spatial and channel transformations. When trained on image captioning datasets, the connector learns to bridge the semantic granularity and the semantic gap via backpropagation, preparing the inputs for the PLM to generate descriptions. Experimental results on the MSCOCO and Flickr30k datasets demonstrate that our method yields performance comparable to existing works. By training only the small connector, we achieve a CIDEr score of 132.2% on the MSCOCO Karpathy test split. Moreover, our findings reveal that fine-tuning the PLM can further unlock its potential, raising the CIDEr score to 140.6%. Code and models are available at https://github.com/YuanEZhou/PrefixCap.
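
As a rough illustration of how such a connector might be wired up, the following PyTorch sketch selects the top-K salient patches and then applies spatial and channel transformations to map them into the PLM embedding space. This is not the authors' implementation (the linked repository contains that): all module names, dimensions, and the saliency scores are assumptions, and the paper's clustering of salient patches into visual regions is approximated here by a learned linear mixing over the selected patches.

import torch
import torch.nn as nn


class Connector(nn.Module):
    """Illustrative selection-then-transformation connector (not the paper's code)."""

    def __init__(self, vis_dim=1024, txt_dim=768, num_selected=64, num_regions=16):
        super().__init__()
        self.num_selected = num_selected
        # Spatial transformation: mix the K selected patches along the token axis
        # and compress them into R region-level tokens (a learned stand-in for the
        # paper's clustering of salient patches into visual regions).
        self.spatial = nn.Linear(num_selected, num_regions)
        # Channel transformation: project visual features into the PLM text space.
        self.channel = nn.Sequential(
            nn.Linear(vis_dim, txt_dim),
            nn.GELU(),
            nn.Linear(txt_dim, txt_dim),
        )

    def forward(self, patch_feats, saliency):
        # patch_feats: (B, N, vis_dim) patch features from a frozen PVM
        # saliency:    (B, N) per-patch importance scores (e.g., CLS attention)
        idx = saliency.topk(self.num_selected, dim=1).indices              # (B, K)
        idx = idx.unsqueeze(-1).expand(-1, -1, patch_feats.size(-1))       # (B, K, C)
        sel = torch.gather(patch_feats, 1, idx)                            # (B, K, C)
        regions = self.spatial(sel.transpose(1, 2)).transpose(1, 2)        # (B, R, C)
        return self.channel(regions)                                       # (B, R, txt_dim)


# Toy usage: the returned tokens would act as a visual prefix prepended to the
# caption token embeddings of a frozen (or lightly fine-tuned) PLM such as GPT-2.
connector = Connector()
feats = torch.randn(2, 196, 1024)   # dummy patch features (B, N, vis_dim)
scores = torch.rand(2, 196)         # dummy saliency scores (B, N)
prefix = connector(feats, scores)
print(prefix.shape)                 # torch.Size([2, 16, 768])

Since only the connector's parameters are trained, the PVM and PLM can stay frozen, which is what keeps the memory and compute overhead low relative to full vision-language pre-training.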

Cited By

  • (2025) Integrated Image-Text Augmentation for Few-Shot Learning in Vision-Language Models. ACM Transactions on Intelligent Systems and Technology. https://doi.org/10.1145/3712700. Online publication date: 20-Jan-2025
  • (2025) Adaptafood: an intelligent system to adapt recipes to specialised diets and healthy lifestyles. Multimedia Systems 31, 1. https://doi.org/10.1007/s00530-025-01667-y. Online publication date: 1-Feb-2025

Published In

ACM Transactions on Intelligent Systems and Technology, Volume 15, Issue 6
December 2024, 727 pages
EISSN: 2157-6912
DOI: 10.1145/3613712
Editor: Huan Liu

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

Published: 19 November 2024
Online AM: 29 July 2024
Accepted: 16 July 2024
Revised: 23 May 2024
Received: 07 December 2023
Published in TIST Volume 15, Issue 6

Author Tags

1. image captioning
2. large-scale pre-trained model
3. lightweight and efficient

Qualifiers

• Research-article

Funding Sources

• National Natural Science Foundation of China
• Fundamental Research Funds for the Central Universities
