This is a list of awesome works on accelerating Multimodal Large Language Models (MLLMs).
- MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models (Jun. 23, 2023)
- MMBench: Is Your Multi-modal Model an All-around Player? (Jul. 12, 2023, ECCV 2024)
- Evaluating Object Hallucination in Large Vision-Language Models (May 17, 2023, EMNLP 2023)
- Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering (Sep. 20, 2022, NeurIPS 2022)
- Towards VQA Models That Can Read (Apr. 18, 2019, CVPR 2019)
- GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering (Feb. 25, 2019, CVPR 2019)
- Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering (Dec. 2, 2016, CVPR 2017)
- Hybrid-Level Instruction Injection for Video Token Compression in Multi-modal Large Language Models (Mar. 20, 2025, CVPR 2025)
- EfficientLLaVA: Generalizable Auto-Pruning for Large Vision-language Models (Mar. 19, 2025, CVPR 2025)
- Adaptive Keyframe Sampling for Long Video Understanding (Feb. 22, 2025, CVPR 2025)
- LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token (Jan. 7, 2025, ICLR 2025)
- FastVLM: Efficient Vision Encoding for Vision Language Models (Dec. 17, 2024, CVPR 2025)
- VisionZip: Longer is Better but Not Necessary in Vision Language Models (Dec. 5, 2024, CVPR 2025)
- FlashSloth: Lightning Multimodal Large Language Models via Embedded Visual Compression (Dec. 5, 2024, CVPR 2025)
- CLS Attention is All You Need for Training-Free Visual Token Pruning: Make VLM Inference Faster (Dec. 2, 2024)
- DyCoke: Dynamic Compression of Tokens for Fast Video Large Language Models (Nov. 22, 2024, CVPR 2025)
- SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference (Oct. 6, 2024)
- Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding (Sep. 22, 2024, CVPR 2025)
- TG-LLaVA: Text Guided LLaVA via Learnable Latent Embeddings (Sep. 15, 2024, AAAI 2025)
- VoCo-LLaMA: Towards Vision Compression with Large Language Models (Jun. 18, 2024, CVPR 2025)
- Matryoshka Query Transformer for Large Vision-Language Models (May 29, 2024, NeurIPS 2024)
- LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models (Mar. 22, 2024)
- An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models (Mar. 11, 2024, ECCV 2024)
- LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models (Nov. 28, 2023, ECCV 2024)
- Efficient Streaming Language Models with Attention Sinks (Sep. 29, 2023, ICLR 2024)
- Token Merging: Your ViT But Faster (Oct. 17, 2022, ICLR 2023)
- Not All Patches are What You Need: Expediting Vision Transformers via Token Reorganizations (Feb. 16, 2022, ICLR 2022)