Lead of NJU-MiG (Multimodal Intelligence Group, Nanjing University MiG Group)
- https://bradyfu.github.io/
A curated paper list and taxonomy of efficient Vision-Language-Action (VLA) models for embodied manipulation.
The official implementation of VITA, VITA-1.5, Long-VITA, VITA-Audio, VITA-VLA, and VITA-E.
LongLive: Real-time Interactive Long Video Generation
GitHub repository for the ACL 2025 paper: Recent Advances in Speech Language Models: A Survey.
CrossLMM: Decoupling Long Video Sequences from LMMs via Dual Cross-Attention Mechanisms
✨✨R1-Reward: Training Multimodal Reward Model Through Stable Reinforcement Learning
✨✨[NeurIPS 2025] VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model
MME-Unify: A Comprehensive Benchmark for Unified Multimodal Understanding and Generation Models
MMR1: Enhancing Multimodal Reasoning with Variance-Aware Sampling and Open Resources
Sparrow: Data-Efficient Video-LLM with Text-to-Image Augmentation
✨✨[NeurIPS 2025] This is the official implementation of our paper "Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension"
✨✨[AAAI 2026] This is the official implementation of our paper "QuoTA: Query-oriented Token Assignment via CoT Query Decouple for Long Video Comprehension"
Phantom: Subject-Consistent Video Generation via Cross-Modal Alignment
The Next Step Forward in Multimodal LLM Alignment
MME-CoT: Benchmarking Chain-of-Thought in LMMs for Reasoning Quality, Robustness, and Efficiency
LUCY: Linguistic Understanding and Control Yielding Early Stage of Her
✨✨Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy
Repository for the paper "T2Vid: Translating Long Text into Multi-Image is the Catalyst for Video-LLMs".
[MM 2025] A Multimodal Finance Benchmark for Expert-level Understanding and Reasoning
✨✨Freeze-Omni: A Smart and Low Latency Speech-to-speech Dialogue Model with Frozen LLM
Eagle: Frontier Vision-Language Models with Data-Centric Strategies
✨✨ [ICLR 2025] MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?
✨✨[NeurIPS 2025] VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
RWKU: Benchmarking Real-World Knowledge Unlearning for Large Language Models. NeurIPS 2024
✨✨Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models
Awesome OVD-OVS - A Survey on Open-Vocabulary Detection and Segmentation: Past, Present, and Future
Simple PyTorch implementation of "Libra: Building Decoupled Vision System on Large Language Models" (accepted by ICML 2024)
✨✨[CVPR 2025] Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis