A Comprehensive Survey on Multimodal RAG: All Combinations of Modalities as Input and Output
Rui Zhang1, Chen Liu1, Yixin Su1, Ruixuan Li1, Xuanjing Huang2, Xuelong Li3, Philip S. Yu4
1Huazhong University of Science and Technology, 2Fudan University, 3Institute of Artificial Intelligence (TeleAI) of China Telecom, 4University of Illinois at Chicago
📚 In this paper, we conduct a comprehensive survey of the most recent work on Multimodal RAG (MM-RAG). Unlike existing surveys, which typically focus on one or two modalities, it covers almost all combinations of modalities as input and output. Based on these input-output modality combinations, we present a taxonomy of MM-RAG methods that gives a much clearer picture of their key technical components.
⚙️ Guided by this taxonomy, we identify four essential stages of the MM-RAG workflow, summarize common approaches to each stage, and discuss optimization strategies for each modality.
🌐 To provide a holistic understanding and practical guidance for building MM-RAG systems, we also discuss training strategies and evaluation methods of MM-RAG. Finally, we discuss various MM-RAG applications and future directions.
- Taxonomy (from the perspective of different input-output modality combinations)
- Workflow (including Pre-Retrieval, Retrieval, Augmentation, Generation)
- Training Strategy (including Parameter-Frozen Strategy, Parameter-Trainable Strategy)
- Evaluation and Benchmarks (including Evaluation Metrics and Benchmarks)
| Paper | Task | Code |
|---|---|---|
| Re-ViLM: Retrieval-augmented visual language model for zero and few-shot image captioning [paper] | Image Captioning | |
| Cross-modal retrieval and semantic refinement for remote sensing image captioning [paper] | Image Captioning | |
| SmallCap: Lightweight image captioning prompted with retrieval augmentation [paper] | Image Captioning | [Code] |
| Retrieval-augmented multimodal language modeling [paper] | Image Captioning | |
| Retrieval-augmented transformer for image captioning [paper] | Image Captioning | |
| Memory-augmented image captioning [paper] | Image Captioning | |
| Retrieval-augmented image captioning [paper] | Image Captioning | |
| DeltaNet: Conditional medical report generation for COVID-19 diagnosis [paper] | Image Captioning | |
| Retrieval-enhanced adversarial training with dynamic memory-augmented attention for image paragraph captioning [paper] | Image Captioning | [Code] |
| Fact-aware multimodal retrieval augmentation for accurate medical radiology report generation [paper] | Image Captioning | [Code] |
| Retrieval, analogy, and composition: A framework for compositional generalization in image captioning [paper] | Image Captioning | |
| Paper | Task | Code |
|---|---|---|
| RetrieveGAN: Image Synthesis via Differentiable Patch Retrieval [paper] | Image Generation | |
| FineRAG: Fine-grained Retrieval-Augmented Text-to-Image Generation [paper] | Image Generation | |
| ImageRAG: Dynamic Image Retrieval for Reference-Guided Image Generation [paper] | Image Generation | [Code] |
| X&Fuse: Fusing Visual Information in Text-to-Image Generation [paper] | Image Generation | |
| TIGeR: Unifying Text-to-Image Generation and Retrieval with Large Multimodal Models [paper] | Image Generation | [Code] |
| Re-Imagen: Retrieval-Augmented Text-to-Image Generator [paper] | Image Generation | |
| Retrieval-augmented diffusion models [paper] | Image Generation | [Code] |
| KNN-Diffusion: Image Generation via Large-Scale Retrieval [paper] | Image Generation | |
| Memory-Driven Text-to-Image Generation [paper] | Image Generation | |
| Paper | Task | Code |
|---|---|---|
| Augmenting Transformers with KNN-Based Composite Memory for Dialog [paper] | Visual Dialog | |
| Maria: A Visual Experience Powered Conversational Agent [paper] | Visual Dialog | [Code] |
| MuRAG: Multimodal retrieval-augmented generator for open question answering over images and text [paper] | Visual QA | |
| Retrieval Augmented Visual Question Answering with Outside Knowledge [paper] | Visual QA | [Code] |
| RAMM: Retrieval-augmented Biomedical Visual Question Answering with Multi-modal Pre-training [paper] | Visual QA | [Code] |
| KAT: A Knowledge Augmented Transformer for Vision-and-Language [paper] | Visual QA | [Code] |
| An empirical study of GPT-3 for few-shot knowledge-based VQA [paper] | Visual QA | [Code] |
| Cross-modal retrieval augmentation for multi-modal classification [paper] | Visual QA | |
| MLLM is a strong reranker: Advancing multimodal retrieval-augmented generation via knowledge-enhanced reranking and noise-injected training [paper] | Visual QA | [Code] |
| Learning to compress contexts for efficient knowledge-based visual question answering [paper] | Visual QA | |
| REVEAL: Retrieval-augmented visual-language pre-training with multi-source multimodal knowledge memory [paper] | Visual QA | [Code] |
| VISA: Retrieval augmented generation with visual source attribution [paper] | Visual QA | [Code] |
| M3DocRAG: Multi-modal Multi-page Document RAG System [paper] | Visual QA | [Code] |
| EchoSight: Advancing visual-language models with wiki knowledge [paper] | Visual QA | [Code] |
| VisRAG: Vision-based retrieval-augmented generation on multi-modality documents [paper] | Visual QA | [Code] |
| Fine-grained late-interaction multi-modal retrieval for retrieval augmented visual question answering [paper] | Visual QA | [Code] |
| MMed-RAG: Versatile Multimodal RAG System for Medical Vision Language Models [paper] | Visual QA | [Code] |
| Paper | Task | Code |
|---|---|---|
| RECAP: Retrieval-Augmented Audio Captioning [paper] | Audio Captioning | [Code] |
| Audio Captioning using Pre-Trained Large-Scale Language Model Guided by Audio-based Similar Caption Retrieval [paper] | Audio Captioning | |
| Paper | Task | Code |
|---|---|---|
| Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models [paper] | Audio Generation | [Code] |
| Retrieval-Augmented Text-to-Audio Generation [paper] | Audio Generation | |
| Audiobox TTA-RAG: Improving Zero-Shot and Few-Shot Text-To-Audio with Retrieval-Augmented Generation [paper] | Audio Generation | [Code] |
| Retrieval augmented generation in prompt-based text-to-speech synthesis with context-aware contrastive language-audio pretraining [paper] | Audio Generation | [Code] |
| Retrieval-augmented classifier guidance for audio generation | Audio Generation | |
| Paper | Task | Code |
|---|---|---|
| Retrieval-augmented egocentric video captioning [paper] | Video Captioning | [Code] |
| Incorporating background knowledge into video description generation [paper] | Video Captioning | |
| Paper | Task | Code |
|---|---|---|
| Animate-A-Story: Storytelling with Retrieval-Augmented Video Generation [paper] | Video Generation | [Code] |
| Paper | Task | Code |
|---|---|---|
| MotionRAG: Motion Retrieval-Augmented Image-to-Video Generation [paper] | Image-to-Video Generation | [Code] |
| Paper | Task | Code |
|---|---|---|
| ViTA: An Efficient Video-to-Text Algorithm using VLM for RAG-based Video Analysis System [paper] | Video QA | |
| Retrieving-to-Answer: Zero-Shot Video Question Answering with Frozen Large Language Models [paper] | Video QA | |
| iRAG: Advancing RAG for Videos with an Incremental Approach [paper] | Video QA | |
| Retrieval augmented convolutional encoder-decoder networks for video captioning [paper] | Video QA | |
| VideoRAG: Retrieval-Augmented Generation over Video Corpus [paper] | Video QA | [Code] |
| Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension [paper] | Video QA | [Code] |
| VideoRAG: Retrieval-Augmented Generation with Extreme Long-Context Videos [paper] | Video QA | [Code] |
| Paper | Task | Code |
|---|---|---|
| ReMoDiffuse: Retrieval-Augmented Motion Diffusion Model [paper] | Text-to-3D | [Code] |
| Retrieval-Augmented Score Distillation for Text-to-3D Generation [paper] | Text-to-3D | [Code] |
| Paper | Task | Code |
|---|---|---|
| Retrieval-based neural source code summarization [paper] | Code Summarization | [Code] |
| Retrieval augmented code generation and summarization [paper] | Code Generation and Summarization | [Code] |
| Retrieval-Augmented Generation for Code Summarization via Hybrid GNN [paper] | Code Summarization | [Code] |
| RACE: Retrieval-Augmented Commit Message Generation [paper] | Commit Message Generation | [Code] |
| Paper | Task | Code |
|---|---|---|
| Retrieval-Based Prompt Selection for Code-Related Few-Shot Learning [paper] | Code Generation | [Code] |
| RepoCoder: Repository-Level Code Completion Through Iterative Retrieval and Generation [paper] | Code Completion | [Code] |
| Retrieval augmented code generation and summarization [paper] | Code Generation and Summarization | [Code] |
| CoCoMIC: Code Completion By Jointly Modeling In-file and Cross-file Context [paper] | Code Generation | [Code] |
| EVOR: Evolving Retrieval for Code Generation [paper] | Code Generation | [Code] |
| Synchromesh: Reliable code generation from pre-trained language models [paper] | Code Generation | [Code] |
| A Retrieve-and-Edit Framework for Predicting Structured Outputs [paper] | Code Generation | |
| ReACC: A Retrieval-Augmented Code Completion Framework [paper] | Code Completion | [Code] |
| DocPrompting: Generating Code by Retrieving the Docs [paper] | Code Generation | [Code] |
| XRICL: Cross-lingual Retrieval-Augmented In-Context Learning for Cross-lingual Text-to-SQL Semantic Parsing [paper] | Text-to-SQL | |
| Paper | Task | Code |
|---|---|---|
| KnowledGPT: Enhancing Large Language Models with Retrieval and Storage Access on Knowledge Bases [paper] | KBQA | |
| G-Retriever: Retrieval-Augmented Generation for Textual Graph Understanding and Question Answering [paper] | KBQA | [Code] |
| ReTraCk: A Flexible and Efficient Framework for Knowledge Base Question Answering [paper] | KBQA | [Code] |
| KAG: Boosting LLMs in Professional Domains via Knowledge Augmented Generation [paper] | KBQA | [Code] |
| LI-RAGE: Late Interaction Retrieval Augmented Generation with Explicit Signals for Open-Domain Table Question Answering [paper] | Table QA | [Code] |
| End-to-End Table Question Answering via Retrieval-Augmented Generation [paper] | Table QA | |
| RAG over Tables: Hierarchical Memory Index, Multi-Stage Retrieval, and Benchmarking [paper] | Table QA | [Code] |
| Dual Reader-Parser on Hybrid Textual and Tabular Evidence for Open Domain Question Answering [paper] | Table QA | [Code] |
For a functional MM-RAG system, we identify four essential stages in its workflow: pre-retrieval, retrieval, augmentation, and generation. Retrieval and generation involve the retriever and generator, respectively. Pre-retrieval covers knowledge base construction and query preparation. Augmentation covers the preprocessing of the query and the retrieved documents before they are fed into the generator. For each stage, we discuss common approaches and modality-specific optimization strategies, and summarize representative studies. A minimal sketch of the four-stage pipeline is shown below.
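To make the four stages concrete, here is a minimal, hypothetical Python sketch. The `embed` function is a stand-in for a real multimodal encoder (e.g., a CLIP-style model) and `generate` is a placeholder for an (M)LLM call; all names and data below are illustrative, not taken from the survey.

```python
# Minimal, hypothetical sketch of the four MM-RAG stages over a toy corpus.
import numpy as np

def embed(item: str) -> np.ndarray:
    """Placeholder embedder: deterministic pseudo-embedding per item."""
    rng = np.random.default_rng(abs(hash(item)) % (2**32))
    v = rng.standard_normal(128)
    return v / np.linalg.norm(v)

# Stage 1: pre-retrieval -- prepare the knowledge base and build an index.
corpus = [
    "caption: a dog running on a beach",
    "caption: a red sports car",
    "report: chest x-ray, no abnormal findings",
]
index = np.stack([embed(doc) for doc in corpus])

# Stage 2: retrieval -- top-k nearest neighbours of the query embedding.
def retrieve(query: str, k: int = 2) -> list[str]:
    scores = index @ embed(query)  # cosine similarity (unit vectors)
    return [corpus[i] for i in np.argsort(-scores)[:k]]

# Stage 3: augmentation -- fuse the query and retrieved documents into a prompt.
def augment(query: str, docs: list[str]) -> str:
    context = "\n".join(f"- {d}" for d in docs)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

# Stage 4: generation -- hand the augmented prompt to the generator.
def generate(prompt: str) -> str:
    return f"[MLLM output conditioned on]\n{prompt}"

query = "What animal is on the beach?"
print(generate(augment(query, retrieve(query))))
```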
There are mainly two training strategies: the parameter-frozen strategy and the parameter-trainable strategy. We summarize training methods and commonly used training datasets per input-output modality combination in the following tables; a toy sketch contrasting the two strategies follows.
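As a rough illustration (not the training recipe of any specific paper), the PyTorch sketch below contrasts the two strategies: the parameter-frozen setting uses both components as-is, while the parameter-trainable setting fine-tunes the generator on retrieval-augmented inputs with the retriever kept frozen. The tiny `nn.Linear` modules are placeholders for a real retriever and generator.

```python
# Hypothetical PyTorch sketch of parameter-frozen vs. parameter-trainable MM-RAG.
import torch
import torch.nn as nn

retriever = nn.Sequential(nn.Linear(128, 64))  # stand-in for a multimodal encoder
generator = nn.Sequential(nn.Linear(64, 32))   # stand-in for an (M)LLM

# Parameter-frozen strategy: no gradients anywhere; the pretrained components
# are connected only through prompting / retrieval at inference time.
for p in list(retriever.parameters()) + list(generator.parameters()):
    p.requires_grad = False

# Parameter-trainable strategy: keep the retriever frozen (cheap, stable) and
# fine-tune the generator so it learns to use retrieved context.
for p in generator.parameters():
    p.requires_grad = True
optimizer = torch.optim.AdamW(
    (p for p in generator.parameters() if p.requires_grad), lr=1e-5
)

x = torch.randn(4, 128)                  # toy batch of query features
with torch.no_grad():                    # frozen retriever forward pass
    context = retriever(x)
loss = generator(context).pow(2).mean()  # placeholder training loss
loss.backward()
optimizer.step()
```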
We discuss the evaluation metrics and benchmarks for the retriever and the generator, as these are the two core components that determine the performance of an MM-RAG system; a toy implementation of a standard retrieval metric is sketched after the table below.
| Modality | Benchmark | Evaluation Targets |
|---|---|---|
| Text + Image → Text | WebQA [Paper] [Resource] | Textual knowledge retrieval and reasoning. |
| | OK-VQA [Paper] [Resource] | |
| | A-OKVQA [Paper] [Resource] | |
| | MRAG-BENCH [Paper] [Resource] | Visual knowledge retrieval and reasoning. |
| | Visual-RAG [Paper] [Resource] | |
| | $M^2$RAG [Paper] [Resource] | Document retrieval and generation with interleaved text-image content. |
| | MRAMG-Bench [Paper] [Resource] | |
| | Dyn-VQA [Paper] [Resource] | Ability to adapt to rapidly changing multimodal knowledge; multi-hop and multimodal reasoning. |
| | CogBench [Paper] | Adaptive information acquisition by recording the entire planning procedure. |
| | Liu et al. [Paper] [Resource] | Evaluates MM-RAG across four tasks: image captioning, multimodal QA, fact verification, and image reranking. |
| Text + Table + Image → Text | OMG-QA [Paper] | Retrieval and reasoning over complex document structures. |
| | PDF-MVQA [Paper] [Resource] | |
| | Real-MM-RAG [Paper] | |
| Text → Text | RAGAS [Paper] [Resource] | Context relevance, answer faithfulness, and answer relevance. |
| | ARES [Paper] [Resource] | |
| | RGB [Paper] [Resource] | Noise robustness, negative rejection, information integration, and counterfactual robustness. |
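As an illustration of the retriever side, the snippet below implements Recall@k, one standard retrieval metric; generator quality is typically scored separately with task metrics (e.g., accuracy or CIDEr) or the LLM-based judges used by benchmarks such as RAGAS and ARES. The document IDs here are made up for the example.

```python
# Recall@k: fraction of queries whose top-k retrieved documents include
# at least one gold (relevant) document.
def recall_at_k(ranked_ids: list[list[str]], gold_ids: list[set[str]], k: int) -> float:
    hits = sum(
        1 for ranked, gold in zip(ranked_ids, gold_ids)
        if gold & set(ranked[:k])  # non-empty intersection -> hit
    )
    return hits / len(ranked_ids)

# Two queries: the first finds its gold doc "a" at rank 2, the second misses.
ranked = [["b", "a", "c"], ["d", "e", "f"]]
gold = [{"a"}, {"z"}]
print(recall_at_k(ranked, gold, k=2))  # 0.5
```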
The survey and the repository are still work in progress and will be updated regularly.
🙋 If you would like your paper included in this survey and repository, please feel free to submit a pull request or open an issue with the paper's title and a brief summary highlighting its key techniques. You can also contact us via email. Please let us know if you find a mistake or have any suggestions. We greatly appreciate your feedback on this repository and survey!
🌟 If you find this resource helpful for your work, please consider citing our research.
```bibtex
@article{Zhang_2025,
  title={A Comprehensive Survey on Multimodal RAG: All Combinations of Modalities as Input and Output},
  url={http://dx.doi.org/10.36227/techrxiv.176341513.38473003/v2},
  DOI={10.36227/techrxiv.176341513.38473003/v2},
  publisher={IEEE},
  author={Rui Zhang and Chen Liu and Yixin Su and Ruixuan Li and Xuanjing Huang and Xuelong Li and Philip S Yu},
  year={2025},
  month=nov
}
```