Stars
[ICCV25 Highlight] The official implementation of the paper "LEGION: Learning to Ground and Explain for Synthetic Image Detection"
(NeurIPS 2025 🔥) Official implementation for "Efficient Multi-modal Large Language Models via Progressive Consistency Distillation"
TRivia: Self-supervised Fine-tuning of Vision-Language Models for Table Recognition
An official implementation of "CapRL: Stimulating Dense Image Caption Capabilities via Reinforcement Learning"
Code for "The Devil behind the mask: An emergent safety vulnerability of Diffusion LLMs"
[ICCV 2025] SparseMM: Head Sparsity Emerges from Visual Concept Responses in MLLMs
🤗 Transformers: the model-definition framework for state-of-the-art machine learning models across text, vision, audio, and multimodal domains, for both inference and training.
[NeurIPS 2025] IEAP: Image Editing As Programs with Diffusion Models
📚 Collection of token-level model compression resources.
AnywhereDoor is a multi-target backdoor attack tailored for object detection. Once implanted, it enables adversaries to specify different attack types (object vanishing, fabrication, or misclassifi…
A large scale camera-taken table detection and recognition dataset.
[EMNLP 2025 main 🔥] Code for "Stop Looking for Important Tokens in Multimodal Language Models: Duplication Matters More"
Official implementation for the paper "Towards Understanding How Knowledge Evolves in Large Vision-Language Models"
AL-Bench: A benchmark for automatic logging
MDocAgent: A Multi-Modal Multi-Agent Framework for Document Understanding
Solve Visual Understanding with Reinforced VLMs
A Survey on Multimodal Retrieval-Augmented Generation
Use PEFT or full-parameter training for CPT/SFT/DPO/GRPO on 600+ LLMs (Qwen3, Qwen3-MoE, DeepSeek-R1, GLM4.5, InternLM3, Llama4, ...) and 300+ MLLMs (Qwen3-VL, Qwen3-Omni, InternVL3.5, Ovis2.5, GLM4.5v, Llava, …
MM-Eureka V0, also called R1-Multimodal-Journey; the latest version is in MM-Eureka
Minimal reproduction of DeepSeek R1-Zero
This project aims to collect the latest "call for reviewers" links from various top CS/ML/AI conferences/journals
Witness the aha moment of VLM with less than $3.
Fully open reproduction of DeepSeek-R1