
Awesome MM-RAG


This repository is for our survey paper:

A Comprehensive Survey on Multimodal RAG: All Combinations of Modalities as Input and Output

Rui Zhang¹, Chen Liu¹, Yixin Su¹, Ruixuan Li¹, Xuanjing Huang², Xuelong Li³, Philip S. Yu⁴

¹Huazhong University of Science and Technology, ²Fudan University, ³Institute of Artificial Intelligence (TeleAI) of China Telecom, ⁴University of Illinois at Chicago

📚 In this paper, we conduct a comprehensive survey of the most recent work on Multimodal RAG (MM-RAG), covering almost all combinations of modalities as input and output, whereas existing surveys typically focus on one or two modalities. Based on the different input-output modality combinations, we present a taxonomy of MM-RAG methods that gives a much clearer picture of their key technical components.

⚙️ Facilitated by this taxonomy, we identify four essential stages of the MM-RAG workflow, summarize common approaches to each stage, and discuss optimization strategies for each modality.

🌐 To provide a holistic understanding and practical guidance for building MM-RAG systems, we also discuss training strategies and evaluation methods of MM-RAG. Finally, we discuss various MM-RAG applications and future directions.


Quick Index

  • Taxonomy (from the perspective of different input-output modality combinations)

  • Workflow (including Pre-Retrieval, Retrieval, Augmentation, Generation)

  • Training Strategy (including Parameter-Frozen Strategy, Parameter-Trainable Strategy)

  • Evaluation and Benchmarks (including Evaluation Metrics and Benchmarks)


Taxonomy

Image→Text

| Paper | Task | Code |
| --- | --- | --- |
| Re-vilm: Retrieval-augmented visual language model for zero and few-shot image captioning [paper] | Image Captioning | |
| Cross-modal retrieval and semantic refinement for remote sensing image captioning [paper] | Image Captioning | |
| Smallcap: lightweight image captioning prompted with retrieval augmentation [paper] | Image Captioning | [Code] |
| Retrieval-augmented multimodal language modeling [paper] | Image Captioning | |
| Retrieval-augmented transformer for image captioning [paper] | Image Captioning | |
| Memory-augmented image captioning [paper] | Image Captioning | |
| Retrieval-augmented image captioning [paper] | Image Captioning | |
| Deltanet: Conditional medical report generation for COVID-19 diagnosis [paper] | Image Captioning | |
| Retrieval-enhanced adversarial training with dynamic memory-augmented attention for image paragraph captioning [paper] | Image Captioning | [Code] |
| Fact-aware multimodal retrieval augmentation for accurate medical radiology report generation [paper] | Image Captioning | [Code] |
| Retrieval, analogy, and composition: A framework for compositional generalization in image captioning [paper] | Image Captioning | |

↑ Back to Index ↑

Text→Image

| Paper | Task | Code |
| --- | --- | --- |
| RetrieveGAN: Image Synthesis via Differentiable Patch Retrieval [paper] | Image Generation | |
| FineRAG: Fine-grained Retrieval-Augmented Text-to-Image Generation [paper] | Image Generation | |
| ImageRAG: Dynamic Image Retrieval for Reference-Guided Image Generation [paper] | Image Generation | [Code] |
| X&Fuse: Fusing Visual Information in Text-to-Image Generation [paper] | Image Generation | |
| TIGeR: Unifying Text-to-Image Generation and Retrieval with Large Multimodal Models [paper] | Image Generation | [Code] |
| Re-Imagen: Retrieval-Augmented Text-to-Image Generator [paper] | Image Generation | |
| Retrieval-augmented diffusion models [paper] | Image Generation | [Code] |
| KNN-Diffusion: Image Generation via Large-Scale Retrieval [paper] | Image Generation | |
| Memory-Driven Text-to-Image Generation [paper] | Image Generation | |

↑ Back to Index ↑

Text+Image→Text

| Paper | Task | Code |
| --- | --- | --- |
| Augmenting Transformers with KNN-Based Composite Memory for Dialog [paper] | Visual Dialog | |
| Maria: A Visual Experience Powered Conversational Agent [paper] | Visual Dialog | [Code] |
| Murag: Multimodal retrieval-augmented generator for open question answering over images and text [paper] | Visual QA | |
| Retrieval Augmented Visual Question Answering with Outside Knowledge [paper] | Visual QA | [Code] |
| RAMM: Retrieval-augmented Biomedical Visual Question Answering with Multi-modal Pre-training [paper] (ACM MM 2023) | Visual QA | [Code] |
| KAT: A Knowledge Augmented Transformer for Vision-and-Language [paper] | Visual QA | [Code] |
| An empirical study of gpt-3 for few-shot knowledge-based vqa [paper] | Visual QA | [Code] |
| Cross-modal retrieval augmentation for multi-modal classification [paper] | Visual QA | |
| Mllm is a strong reranker: Advancing multimodal retrieval-augmented generation via knowledge-enhanced reranking and noise-injected training [paper] | Visual QA | [Code] |
| Learning to compress contexts for efficient knowledge-based visual question answering [paper] | Visual QA | |
| Reveal: Retrieval-augmented visual-language pre-training with multi-source multimodal knowledge memory [paper] | Visual QA | [Code] |
| Visa: Retrieval augmented generation with visual source attribution [paper] | Visual QA | [Code] |
| M3DOCRAG: Multi-modal Multi-page Document RAG System [paper] | Visual QA | [Code] |
| Echosight: Advancing visual-language models with wiki knowledge [paper] | Visual QA | [Code] |
| Visrag: Vision-based retrieval-augmented generation on multi-modality documents [paper] | Visual QA | [Code] |
| Fine-grained late-interaction multi-modal retrieval for retrieval augmented visual question answering [paper] | Visual QA | [Code] |
| MMed-RAG: Versatile Multimodal RAG System for Medical Vision Language Models [paper] | Visual QA | [Code] |

↑ Back to Index ↑

Audio→Text

| Paper | Task | Code |
| --- | --- | --- |
| RECAP: Retrieval-Augmented Audio Captioning [paper] | Audio Captioning | [Code] |
| Audio Captioning using Pre-Trained Large-Scale Language Model Guided by Audio-based Similar Caption Retrieval [paper] | Audio Captioning | |

↑ Back to Index ↑

Text→Audio

| Paper | Task | Code |
| --- | --- | --- |
| Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models [paper] | Audio Generation | [Code] |
| Retrieval-Augmented Text-to-Audio Generation [paper] | Audio Generation | |
| Audiobox TTA-RAG: Improving Zero-Shot and Few-Shot Text-To-Audio with Retrieval-Augmented Generation [paper] | Audio Generation | [Code] |
| Retrieval augmented generation in prompt-based text-to-speech synthesis with context-aware contrastive language-audio pretraining [paper] | Audio Generation | [Code] |
| Retrieval-augmented classifier guidance for audio generation | Audio Generation | |

↑ Back to Index ↑

Video→Text

| Paper | Task | Code |
| --- | --- | --- |
| Retrieval-augmented egocentric video captioning [paper] | Video Captioning | [Code] |
| Incorporating background knowledge into video description generation [paper] | Video Captioning | |

↑ Back to Index ↑

Text→Video

| Paper | Task | Code |
| --- | --- | --- |
| Animate-A-Story: Storytelling with Retrieval-Augmented Video Generation [paper] | Video Generation | [Code] |

↑ Back to Index ↑

Image→Video

| Paper | Task | Code |
| --- | --- | --- |
| MotionRAG: Motion Retrieval-Augmented Image-to-Video Generation [paper] | Image-to-Video Generation | [Code] |

↑ Back to Index ↑

Video+Text→Text

| Paper | Task | Code |
| --- | --- | --- |
| ViTA: An Efficient Video-to-Text Algorithm using VLM for RAG-based Video Analysis System [paper] | Video QA | |
| Retrieving-to-Answer: Zero-Shot Video Question Answering with Frozen Large Language Models [paper] | Video QA | |
| iRAG: Advancing RAG for Videos with an Incremental Approach [paper] | Video QA | |
| Retrieval augmented convolutional encoder-decoder networks for video captioning [paper] | Video QA | |
| VideoRAG: Retrieval-Augmented Generation over Video Corpus [paper] | Video QA | [Code] |
| Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension [paper] | Video QA | [Code] |
| VideoRAG: Retrieval-Augmented Generation with Extreme Long-Context Videos [paper] | Video QA | [Code] |

↑ Back to Index ↑

Text→3D

| Paper | Task | Code |
| --- | --- | --- |
| ReMoDiffuse: Retrieval-Augmented Motion Diffusion Model [paper] | Text-to-3D | [Code] |
| Retrieval-Augmented Score Distillation for Text-to-3D Generation [paper] | Text-to-3D | [Code] |

↑ Back to Index ↑

Code→Text

| Paper | Task | Code |
| --- | --- | --- |
| Retrieval-based neural source code summarization [paper] | Code Summarization | [Code] |
| Retrieval augmented code generation and summarization [paper] | Code Generation and Summarization | [Code] |
| Retrieval-Augmented Generation for Code Summarization via Hybrid GNN [paper] | Code Summarization | [Code] |
| RACE: Retrieval-Augmented Commit Message Generation [paper] | Commit Message Generation | [Code] |

↑ Back to Index ↑

Text→Code

| Paper | Task | Code |
| --- | --- | --- |
| Retrieval-Based Prompt Selection for Code-Related Few-Shot Learning [paper] | Code Generation | [Code] |
| RepoCoder: Repository-Level Code Completion Through Iterative Retrieval and Generation [paper] | Code Completion | [Code] |
| Retrieval augmented code generation and summarization [paper] | Code Generation and Summarization | [Code] |
| CoCoMIC: Code Completion By Jointly Modeling In-file and Cross-file Context [paper] | Code Generation | [Code] |
| EVOR: Evolving Retrieval for Code Generation [paper] | Code Generation | [Code] |
| Synchromesh: Reliable code generation from pre-trained language models [paper] | Code Generation | [Code] |
| A Retrieve-and-Edit Framework for Predicting Structured Outputs [paper] | Code Generation | |
| ReACC: A Retrieval-Augmented Code Completion Framework [paper] | Code Completion | [Code] |
| DocPrompting: Generating Code by Retrieving the Docs [paper] | Code Generation | [Code] |
| XRICL: Cross-lingual Retrieval-Augmented In-Context Learning for Cross-lingual Text-to-SQL Semantic Parsing [paper] | Text-to-SQL | |

↑ Back to Index ↑

Text+Structured Data→Text

| Paper | Task | Code |
| --- | --- | --- |
| KnowledGPT: Enhancing Large Language Models with Retrieval and Storage Access on Knowledge Bases [paper] | KBQA | |
| G-Retriever: Retrieval-Augmented Generation for Textual Graph Understanding and Question Answering [paper] | KBQA | [Code] |
| ReTraCk: A Flexible and Efficient Framework for Knowledge Base Question Answering [paper] | KBQA | [Code] |
| KAG: Boosting LLMs in Professional Domains via Knowledge Augmented Generation [paper] | KBQA | [Code] |
| LI-RAGE: Late Interaction Retrieval Augmented Generation with Explicit Signals for Open-Domain Table Question Answering [paper] | Table QA | [Code] |
| End-to-End Table Question Answering via Retrieval-Augmented Generation [paper] | Table QA | |
| RAG over Tables: Hierarchical Memory Index, Multi-Stage Retrieval, and Benchmarking [paper] | Table QA | [Code] |
| Dual Reader-Parser on Hybrid Textual and Tabular Evidence for Open Domain Question Answering [paper] | Table QA | [Code] |

↑ Back to Index ↑

Workflow

For a functional MM-RAG system, we identify four essential stages of its workflow: pre-retrieval, retrieval, augmentation, and generation. Retrieval and generation involve the retriever and generator, respectively. Pre-retrieval covers knowledge-base and query preparation. Augmentation covers the preprocessing of the query and the retrieved documents before they are fed into the generator. For each stage, we discuss common approaches and modality-specific optimization strategies, and summarize representative studies.
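
To make the four stages concrete, here is a minimal, framework-agnostic sketch of the loop in Python. All names (`Document`, `embed`, the toy character-count encoder) are illustrative placeholders for this README, not the API of any paper listed above.

```python
"""Minimal sketch of the four MM-RAG workflow stages (illustrative only)."""
from dataclasses import dataclass


@dataclass
class Document:
    content: str   # raw payload: text, an image path, an audio path, ...
    modality: str  # e.g. "text", "image", "audio", "video"


def embed(item: str) -> list[float]:
    # Stand-in for a shared multimodal encoder (e.g. a CLIP-style model).
    # Here: a toy character-frequency vector, so the sketch actually runs.
    vec = [0.0] * 26
    for ch in item.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec


def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0


# Stage 1 - pre-retrieval: prepare the knowledge base (chunk + index)
# and the query (rewrite/expand, then encode).
def pre_retrieval(docs: list[Document], query: str):
    index = [(doc, embed(doc.content)) for doc in docs]
    return index, embed(query)


# Stage 2 - retrieval: score every indexed item against the query.
def retrieve(index, query_vec, k: int = 2) -> list[Document]:
    ranked = sorted(index, key=lambda p: cosine(p[1], query_vec), reverse=True)
    return [doc for doc, _ in ranked[:k]]


# Stage 3 - augmentation: preprocess query + retrieved documents into a
# generator input (filtering, reranking, per-modality formatting).
def augment(query: str, retrieved: list[Document]) -> str:
    context = "\n".join(f"[{d.modality}] {d.content}" for d in retrieved)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


# Stage 4 - generation: hand the augmented input to a (multimodal) LLM.
def generate(prompt: str) -> str:
    return f"<LLM output conditioned on>\n{prompt}"  # stub generator


if __name__ == "__main__":
    kb = [
        Document("A photo of the Eiffel Tower at night", "image"),
        Document("The Eiffel Tower is 330 metres tall.", "text"),
        Document("Podcast episode about Paris landmarks", "audio"),
    ]
    question = "How tall is the Eiffel Tower?"
    index, qvec = pre_retrieval(kb, question)
    print(generate(augment(question, retrieve(index, qvec))))
```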

Training Strategy

There are two main strategies: parameter-frozen strategies and parameter-trainable strategies. We summarize training methods and commonly used training datasets per input-output modality combination in the following tables.
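
As a rough illustration of the distinction, the PyTorch-style sketch below toggles between the two setups by setting `requires_grad`. The `MMRAGModel` container and its `nn.Linear` stand-ins are hypothetical; real systems differ in which components (retriever, projector, generator) they train.

```python
import torch.nn as nn


class MMRAGModel(nn.Module):
    """Hypothetical container for the trainable pieces of an MM-RAG system."""
    def __init__(self):
        super().__init__()
        self.retriever = nn.Linear(512, 512)    # stand-in for a multimodal encoder
        self.projector = nn.Linear(512, 1024)   # maps retrieved features into the LLM space
        self.generator = nn.Linear(1024, 1024)  # stand-in for the (multimodal) LLM


def set_strategy(model: MMRAGModel, strategy: str) -> None:
    if strategy == "parameter-frozen":
        # No gradient updates anywhere: the system relies purely on
        # prompting / in-context use of the retrieved evidence.
        for p in model.parameters():
            p.requires_grad = False
    elif strategy == "parameter-trainable":
        # Train selected components; here (one common choice) the generator
        # stays frozen while the retriever and projector are fine-tuned.
        for p in model.parameters():
            p.requires_grad = True
        for p in model.generator.parameters():
            p.requires_grad = False
    else:
        raise ValueError(f"unknown strategy: {strategy}")


model = MMRAGModel()
set_strategy(model, "parameter-trainable")
print([n for n, p in model.named_parameters() if p.requires_grad])
# -> ['retriever.weight', 'retriever.bias', 'projector.weight', 'projector.bias']
```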

Evaluation and Benchmarks

We discuss the evaluation metrics and benchmarks for the retriever and the generator, as these are the two core components that determine the performance of an MM-RAG system.
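
On the retriever side, the most common metrics are ranking measures such as Recall@K and mean reciprocal rank (MRR); generator quality is measured with task-specific metrics (e.g. answer accuracy or captioning scores). As a self-contained sketch, the snippet below computes Recall@K and MRR over toy rankings; the variable names and data are invented for illustration.

```python
def recall_at_k(ranked: list[list[str]], relevant: list[set[str]], k: int) -> float:
    """Fraction of queries with at least one relevant item in the top-k results."""
    hits = sum(bool(set(r[:k]) & rel) for r, rel in zip(ranked, relevant))
    return hits / len(ranked)


def mrr(ranked: list[list[str]], relevant: list[set[str]]) -> float:
    """Mean reciprocal rank of the first relevant item per query (0 if none)."""
    total = 0.0
    for r, rel in zip(ranked, relevant):
        for rank, doc_id in enumerate(r, start=1):
            if doc_id in rel:
                total += 1.0 / rank
                break
    return total / len(ranked)


# Two toy queries: retrieved rankings vs. gold relevant sets.
ranked = [["d3", "d1", "d7"], ["d2", "d9", "d4"]]
relevant = [{"d1"}, {"d5"}]
print(recall_at_k(ranked, relevant, k=2))  # 0.5 : query 1 finds d1 in its top-2
print(mrr(ranked, relevant))               # 0.25: 1/2 for query 1, 0 for query 2
```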

| Modality | Benchmark | Evaluation Targets |
| --- | --- | --- |
| Text + Image → Text | WebQA [Paper] [Resource] | Textual knowledge retrieval and reasoning. |
| | OK-VQA [Paper] [Resource] | Textual knowledge retrieval and reasoning. |
| | A-OKVQA [Paper] [Resource] | Textual knowledge retrieval and reasoning. |
| | MRAG-BENCH [Paper] [Resource] | Visual knowledge retrieval and reasoning. |
| | Visual-RAG [Paper] [Resource] | Visual knowledge retrieval and reasoning. |
| | $M^2$RAG [Paper] [Resource] | Document retrieval and generation with interleaved text-image content. |
| | MRAMG-Bench [Paper] [Resource] | Document retrieval and generation with interleaved text-image content. |
| | Dyn-VQA [Paper] [Resource] | Ability to adapt to rapidly changing multimodal knowledge; multi-hop and multimodal reasoning. |
| | CogBench [Paper] | Adaptive information acquisition by recording the entire planning procedure. |
| | Liu et al. [Paper] [Resource] | Evaluates MM-RAG across four tasks: image captioning, multimodal QA, fact verification, and image reranking. |
| Text + Table + Image → Text | OMG-QA [Paper] | Retrieval and reasoning over complex document structures. |
| | PDF-MVQA [Paper] [Resource] | Retrieval and reasoning over complex document structures. |
| | Real-MM-RAG [Paper] | Retrieval and reasoning over complex document structures. |
| Text → Text | RAGAS [Paper] [Resource] | Context relevance, answer faithfulness, and answer relevance. |
| | ARES [Paper] [Resource] | Context relevance, answer faithfulness, and answer relevance. |
| | RGB [Paper] [Resource] | Noise robustness, negative rejection, information integration, and counterfactual robustness. |

↑ Back to Index ↑

Contributing and Citation

The survey and the repository are still works in progress and will be updated regularly.

🙋 If you would like your paper to be included in this survey and repository, please feel free to submit a pull request or open an issue with the paper's title and a brief summary highlighting its key techniques. You can also contact us via email. Please let us know if you find a mistake or have any suggestions; we greatly appreciate your feedback on this repository and the survey!

🌟 If you find this resource helpful for your work, please consider citing our research.

@article{Zhang_2025,
	title={A Comprehensive Survey on Multimodal RAG: All Combinations of Modalities as Input and Output},
	url={http://dx.doi.org/10.36227/techrxiv.176341513.38473003/v2},
	DOI={10.36227/techrxiv.176341513.38473003/v2},
	publisher={IEEE},
	author={Rui Zhang and Chen Liu and Yixin Su and Ruixuan Li and Xuanjing Huang and Xuelong Li and Philip S Yu},
	year={2025},
	month=nov
}
