A Comprehensive Survey on Multimodal RAG: All Combinations of Modalities as Input and Output
Rui Zhang1, Chen Liu1, Yixin Su1, Ruixuan Li1, Xuanjing Huang2, Xuelong Li3, Philip S. Yu4
1Huazhong University of Science and Technology, 2Fudan University, 3Institute of Artificial Intelligence (TeleAI) of China Telecom, 4University of Illinois at Chicago
📚 In this paper, we conduct a comprehensive survey of the most recent work on Multimodal RAG (MM-RAG). Unlike existing surveys, which typically focus on one or two modalities, it covers almost all combinations of modalities as input and output. Based on these input-output modality combinations, we present a taxonomy of MM-RAG methods that gives a much clearer picture of their key technical components.
⚙️ Guided by this taxonomy, we identify four essential stages of the MM-RAG workflow, summarize common approaches to each stage, and discuss optimization strategies for each modality.
🌐 To provide a holistic understanding and practical guidance for building MM-RAG systems, we also discuss training strategies and evaluation methods of MM-RAG. Finally, we discuss various MM-RAG applications and future directions.
- Taxonomy (from the perspective of different input-output modality combinations)
- Workflow (including Pre-Retrieval, Retrieval, Augmentation, Generation)
- Training Strategy (including Parameter-Frozen Strategy, Parameter-Trainable Strategy)
- Evaluation and Benchmarks (including Evaluation Metrics and Benchmarks)
| Paper | Task | Code |
|---|---|---|
| Re-ViLM: Retrieval-augmented visual language model for zero and few-shot image captioning [paper] | Image Captioning | |
| Cross-modal retrieval and semantic refinement for remote sensing image captioning [paper] | Image Captioning | |
| SmallCap: Lightweight image captioning prompted with retrieval augmentation [paper] | Image Captioning | [Code] |
| Retrieval-augmented multimodal language modeling [paper] | Image Captioning | |
| Retrieval-augmented transformer for image captioning [paper] | Image Captioning | |
| Memory-augmented image captioning [paper] | Image Captioning | |
| Retrieval-augmented image captioning [paper] | Image Captioning | |
| DeltaNet: Conditional medical report generation for COVID-19 diagnosis [paper] | Image Captioning | |
| Retrieval-enhanced adversarial training with dynamic memory-augmented attention for image paragraph captioning [paper] | Image Captioning | [Code] |
| Fact-aware multimodal retrieval augmentation for accurate medical radiology report generation [paper] | Image Captioning | [Code] |
| Retrieval, analogy, and composition: A framework for compositional generalization in image captioning [paper] | Image Captioning | |
| Paper | Task | Code |
|---|---|---|
| RetrieveGAN: Image Synthesis via Differentiable Patch Retrieval [paper] | Image Generation | |
| FineRAG: Fine-grained Retrieval-Augmented Text-to-Image Generation [paper] | Image Generation | |
| ImageRAG: Dynamic Image Retrieval for Reference-Guided Image Generation [paper] | Image Generation | [Code] |
| X&Fuse: Fusing Visual Information in Text-to-Image Generation [paper] | Image Generation | |
| TIGeR: Unifying Text-to-Image Generation and Retrieval with Large Multimodal Models [paper] | Image Generation | [Code] |
| Re-Imagen: Retrieval-Augmented Text-to-Image Generator [paper] | Image Generation | |
| Retrieval-augmented diffusion models [paper] | Image Generation | [Code] |
| KNN-Diffusion: Image Generation via Large-Scale Retrieval [paper] | Image Generation | |
| Memory-Driven Text-to-Image Generation [paper] | Image Generation | |
| Paper | Task | Code |
|---|---|---|
| Augmenting Transformers with KNN-Based Composite Memory for Dialog [paper] | Visual Dialog | |
| Maria: A Visual Experience Powered Conversational Agent [paper] | Visual Dialog | [Code] |
| MuRAG: Multimodal retrieval-augmented generator for open question answering over images and text [paper] | Visual QA | |
| Retrieval Augmented Visual Question Answering with Outside Knowledge [paper] | Visual QA | [Code] |
| RAMM: Retrieval-augmented Biomedical Visual Question Answering with Multi-modal Pre-training [paper] | Visual QA | [Code] |
| KAT: A Knowledge Augmented Transformer for Vision-and-Language [paper] | Visual QA | [Code] |
| An empirical study of GPT-3 for few-shot knowledge-based VQA [paper] | Visual QA | [Code] |
| Cross-modal retrieval augmentation for multi-modal classification [paper] | Visual QA | |
| MLLM is a strong reranker: Advancing multimodal retrieval-augmented generation via knowledge-enhanced reranking and noise-injected training [paper] | Visual QA | [Code] |
| Learning to compress contexts for efficient knowledge-based visual question answering [paper] | Visual QA | |
| REVEAL: Retrieval-augmented visual-language pre-training with multi-source multimodal knowledge memory [paper] | Visual QA | [Code] |
| VISA: Retrieval augmented generation with visual source attribution [paper] | Visual QA | [Code] |
| M3DocRAG: Multi-modal Multi-page Document RAG System [paper] | Visual QA | [Code] |
| EchoSight: Advancing visual-language models with wiki knowledge [paper] | Visual QA | [Code] |
| VisRAG: Vision-based retrieval-augmented generation on multi-modality documents [paper] | Visual QA | [Code] |
| Fine-grained late-interaction multi-modal retrieval for retrieval augmented visual question answering [paper] | Visual QA | [Code] |
| MMed-RAG: Versatile Multimodal RAG System for Medical Vision Language Models [paper] | Visual QA | [Code] |
| Paper | Task | Code |
|---|---|---|
| RECAP: Retrieval-Augmented Audio Captioning [paper] | Audio Captioning | [Code] |
| Audio Captioning using Pre-Trained Large-Scale Language Model Guided by Audio-based Similar Caption Retrieval [paper] | Audio Captioning | |
| Paper | Task | Code |
|---|---|---|
| Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models [paper] | Audio Generation | [Code] |
| Retrieval-Augmented Text-to-Audio Generation [paper] | Audio Generation | |
| Audiobox TTA-RAG: Improving Zero-Shot and Few-Shot Text-To-Audio with Retrieval-Augmented Generation [paper] | Audio Generation | [Code] |
| Retrieval augmented generation in prompt-based text-to-speech synthesis with context-aware contrastive language-audio pretraining [paper] | Audio Generation | [Code] |
| Retrieval-augmented classifier guidance for audio generation | Audio Generation | |
| Paper | Task | Code |
|---|---|---|
| Retrieval-augmented egocentric video captioning [paper] | Video Captioning | [Code] |
| Incorporating background knowledge into video description generation [paper] | Video Captioning | |
| Paper | Task | Code |
|---|---|---|
| Animate-A-Story: Storytelling with Retrieval-Augmented Video Generation [paper] | Video Generation | [Code] |
| Paper | Task | Code |
|---|---|---|
| MotionRAG: Motion Retrieval-Augmented Image-to-Video Generation [paper] | Image-to-Video Generation | [Code] |
| Paper | Task | Code |
|---|---|---|
| ViTA: An Efficient Video-to-Text Algorithm using VLM for RAG-based Video Analysis System [paper] | Video QA | |
| Retrieving-to-Answer: Zero-Shot Video Question Answering with Frozen Large Language Models [paper] | Video QA | |
| iRAG: Advancing RAG for Videos with an Incremental Approach [paper] | Video QA | |
| Retrieval augmented convolutional encoder-decoder networks for video captioning [paper] | Video QA | |
| VideoRAG: Retrieval-Augmented Generation over Video Corpus [paper] | Video QA | [Code] |
| Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension [paper] | Video QA | [Code] |
| VideoRAG: Retrieval-Augmented Generation with Extreme Long-Context Videos [paper] | Video QA | [Code] |
| Paper | Task | Code |
|---|---|---|
| ReMoDiffuse: Retrieval-Augmented Motion Diffusion Model [paper] | Text-to-3D | [Code] |
| Retrieval-Augmented Score Distillation for Text-to-3D Generation [paper] | Text-to-3D | [Code] |
| Paper | Task | Code |
|---|---|---|
| Retrieval-based neural source code summarization [paper] | Code Summarization | [Code] |
| Retrieval augmented code generation and summarization [paper] | Code Generation and Summarization | [Code] |
| Retrieval-Augmented Generation for Code Summarization via Hybrid GNN [paper] | Code Summarization | [Code] |
| RACE: Retrieval-Augmented Commit Message Generation [paper] | Commit Message Generation | [Code] |
| Paper | Task | Code |
|---|---|---|
| Retrieval-Based Prompt Selection for Code-Related Few-Shot Learning [paper] | Code Generation | [Code] |
| RepoCoder: Repository-Level Code Completion Through Iterative Retrieval and Generation [paper] | Code Completion | [Code] |
| Retrieval augmented code generation and summarization [paper] | Code Generation and Summarization | [Code] |
| CoCoMIC: Code Completion By Jointly Modeling In-file and Cross-file Context [paper] | Code Generation | [Code] |
| EVOR: Evolving Retrieval for Code Generation [paper] | Code Generation | [Code] |
| Synchromesh: Reliable code generation from pre-trained language models [paper] | Code Generation | [Code] |
| A Retrieve-and-Edit Framework for Predicting Structured Outputs [paper] | Code Generation | |
| ReACC: A Retrieval-Augmented Code Completion Framework [paper] | Code Completion | [Code] |
| DocPrompting: Generating Code by Retrieving the Docs [paper] | Code Generation | [Code] |
| XRICL: Cross-lingual Retrieval-Augmented In-Context Learning for Cross-lingual Text-to-SQL Semantic Parsing [paper] | Text-to-SQL | |
| Paper | Task | Code |
|---|---|---|
| KnowledGPT: Enhancing Large Language Models with Retrieval and Storage Access on Knowledge Bases [paper] | KBQA | |
| G-Retriever: Retrieval-Augmented Generation for Textual Graph Understanding and Question Answering [paper] | KBQA | [Code] |
| ReTraCk: A Flexible and Efficient Framework for Knowledge Base Question Answering [paper] | KBQA | [Code] |
| KAG: Boosting LLMs in Professional Domains via Knowledge Augmented Generation [paper] | KBQA | [Code] |
| LI-RAGE: Late Interaction Retrieval Augmented Generation with Explicit Signals for Open-Domain Table Question Answering [paper] | Table QA | [Code] |
| End-to-End Table Question Answering via Retrieval-Augmented Generation [paper] | Table QA | |
| RAG over Tables: Hierarchical Memory Index, Multi-Stage Retrieval, and Benchmarking [paper] | Table QA | [Code] |
| Dual Reader-Parser on Hybrid Textual and Tabular Evidence for Open Domain Question Answering [paper] | Table QA | [Code] |
For a functional MM-RAG system, we identify four essential stages in its workflow: pre-retrieval, retrieval, augmentation, and generation. Retrieval and generation involve the retriever and generator, respectively. Pre-retrieval covers knowledge base construction and query preparation. Augmentation covers the preprocessing of the query and the retrieved documents before they are fed into the generator. For each stage, we discuss common approaches and modality-specific optimization strategies, and summarize representative studies. A minimal sketch of the four-stage pipeline is shown below.
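To make the four stages concrete, here is a minimal, hypothetical Python sketch. The `embed` function is a stand-in for a real multimodal encoder (e.g., a CLIP-style model) and `generate` is a placeholder for an (M)LLM call; all names and data below are illustrative, not taken from the survey.

```python
# Minimal, hypothetical sketch of the four MM-RAG stages over a toy corpus.
import numpy as np

def embed(item: str) -> np.ndarray:
    """Placeholder embedder: deterministic pseudo-embedding per item."""
    rng = np.random.default_rng(abs(hash(item)) % (2**32))
    v = rng.standard_normal(128)
    return v / np.linalg.norm(v)

# Stage 1: pre-retrieval -- prepare the knowledge base and build an index.
corpus = [
    "caption: a dog running on a beach",
    "caption: a red sports car",
    "report: chest x-ray, no abnormal findings",
]
index = np.stack([embed(doc) for doc in corpus])

# Stage 2: retrieval -- top-k nearest neighbours of the query embedding.
def retrieve(query: str, k: int = 2) -> list[str]:
    scores = index @ embed(query)  # cosine similarity (unit vectors)
    return [corpus[i] for i in np.argsort(-scores)[:k]]

# Stage 3: augmentation -- fuse the query and retrieved documents into a prompt.
def augment(query: str, docs: list[str]) -> str:
    context = "\n".join(f"- {d}" for d in docs)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

# Stage 4: generation -- hand the augmented prompt to the generator.
def generate(prompt: str) -> str:
    return f"[MLLM output conditioned on]\n{prompt}"

query = "What animal is on the beach?"
print(generate(augment(query, retrieve(query))))
```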
There are mainly two training strategies: the parameter-frozen strategy and the parameter-trainable strategy. We summarize training methods and commonly used training datasets per input-output modality combination in the following tables; a toy sketch contrasting the two strategies follows.
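As a rough illustration (not the training recipe of any specific paper), the PyTorch sketch below contrasts the two strategies: the parameter-frozen setting uses both components as-is, while the parameter-trainable setting fine-tunes the generator on retrieval-augmented inputs with the retriever kept frozen. The tiny `nn.Linear` modules are placeholders for a real retriever and generator.

```python
# Hypothetical PyTorch sketch of parameter-frozen vs. parameter-trainable MM-RAG.
import torch
import torch.nn as nn

retriever = nn.Sequential(nn.Linear(128, 64))  # stand-in for a multimodal encoder
generator = nn.Sequential(nn.Linear(64, 32))   # stand-in for an (M)LLM

# Parameter-frozen strategy: no gradients anywhere; the pretrained components
# are connected only through prompting / retrieval at inference time.
for p in list(retriever.parameters()) + list(generator.parameters()):
    p.requires_grad = False

# Parameter-trainable strategy: keep the retriever frozen (cheap, stable) and
# fine-tune the generator so it learns to use retrieved context.
for p in generator.parameters():
    p.requires_grad = True
optimizer = torch.optim.AdamW(
    (p for p in generator.parameters() if p.requires_grad), lr=1e-5
)

x = torch.randn(4, 128)                  # toy batch of query features
with torch.no_grad():                    # frozen retriever forward pass
    context = retriever(x)
loss = generator(context).pow(2).mean()  # placeholder training loss
loss.backward()
optimizer.step()
```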
We discuss the evaluation metrics and benchmarks for the retriever and the generator, as these are the two core components that determine the performance of an MM-RAG system; a toy implementation of a standard retrieval metric is sketched after the table below.
| Modality | Benchmark | Evaluation Targets |
|---|---|---|
| Text + Image → Text | WebQA [Paper] [Resource] | Textual knowledge retrieval and reasoning. |
| | OK-VQA [Paper] [Resource] | |
| | A-OKVQA [Paper] [Resource] | |
| | MRAG-BENCH [Paper] [Resource] | Visual knowledge retrieval and reasoning. |
| | Visual-RAG [Paper] [Resource] | |
| | $M^2$RAG [Paper] [Resource] | Document retrieval and generation with interleaved text-image content. |
| | MRAMG-Bench [Paper] [Resource] | |
| | Dyn-VQA [Paper] [Resource] | Ability to adapt to rapidly changing multimodal knowledge; multi-hop and multimodal reasoning. |
| | CogBench [Paper] | Adaptive information acquisition by recording the entire planning procedure. |
| | Liu et al. [Paper] [Resource] | Evaluates MM-RAG across four tasks: image captioning, multimodal QA, fact verification, and image reranking. |
| Text + Table + Image → Text | OMG-QA [Paper] | Retrieval and reasoning over complex document structures. |
| | PDF-MVQA [Paper] [Resource] | |
| | Real-MM-RAG [Paper] | |
| Text → Text | RAGAS [Paper] [Resource] | Context relevance, answer faithfulness, and answer relevance. |
| | ARES [Paper] [Resource] | |
| | RGB [Paper] [Resource] | Noise robustness, negative rejection, information integration, and counterfactual robustness. |
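As an illustration of the retriever side, the snippet below implements Recall@k, one standard retrieval metric; generator quality is typically scored separately with task metrics (e.g., accuracy or CIDEr) or the LLM-based judges used by benchmarks such as RAGAS and ARES. The document IDs here are made up for the example.

```python
# Recall@k: fraction of queries whose top-k retrieved documents include
# at least one gold (relevant) document.
def recall_at_k(ranked_ids: list[list[str]], gold_ids: list[set[str]], k: int) -> float:
    hits = sum(
        1 for ranked, gold in zip(ranked_ids, gold_ids)
        if gold & set(ranked[:k])  # non-empty intersection -> hit
    )
    return hits / len(ranked_ids)

# Two queries: the first finds its gold doc "a" at rank 2, the second misses.
ranked = [["b", "a", "c"], ["d", "e", "f"]]
gold = [{"a"}, {"z"}]
print(recall_at_k(ranked, gold, k=2))  # 0.5
```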
The survey and the repository are still work in progress and will be updated regularly.
🙋 If you would like your paper included in this survey and repository, please feel free to submit a pull request or open an issue with the paper's title and a brief summary highlighting its key techniques. You can also contact us via email. Please let us know if you find a mistake or have any suggestions. We greatly appreciate your feedback on this repository and survey!
🌟 If you find this resource helpful for your work, please consider citing our research.
```bibtex
@article{Zhang_2025,
  title={A Comprehensive Survey on Multimodal RAG: All Combinations of Modalities as Input and Output},
  url={http://dx.doi.org/10.36227/techrxiv.176341513.38473003/v2},
  DOI={10.36227/techrxiv.176341513.38473003/v2},
  publisher={IEEE},
  author={Rui Zhang and Chen Liu and Yixin Su and Ruixuan Li and Xuanjing Huang and Xuelong Li and Philip S Yu},
  year={2025},
  month=nov
}
```