Stars
[NeurIPS 2025] The official repository for our paper, "Open Vision Reasoner: Transferring Linguistic Cognitive Behavior for Visual Reasoning".
Code for ICML 2025 Paper "Highly Compressed Tokenizer Can Generate Without Training"
[NeurIPS 2025] Efficient Reasoning Vision Language Models
[NeurIPS 2025] Official code for the paper "Beyond Attention or Similarity: Maximizing Conditional Diversity for Token Pruning in MLLMs".
Official repository for VisionZip (CVPR 2025)
[AAAI 2026] Global Compression Commander: Plug-and-Play Inference Acceleration for High-Resolution Large Vision-Language Models
SLED: Self Logits Evolution Decoding for Improving Factuality in Large Language Models https://arxiv.org/pdf/2411.02433
Official repository of 'Visual-RFT: Visual Reinforcement Fine-Tuning' & 'Visual-ARFT: Visual Agentic Reinforcement Fine-Tuning'
[CVPR 2025 & NTIRE 2025] HVI: A New Color Space for Low-light Image Enhancement (Official Implementation)
CycleResearcher: Improving Automated Research via Automated Review
WorldGen - Generate Any 3D Scene in Seconds
Physics-Informed Neural networks for Advanced modeling
PyTorch implementation of JiT https://arxiv.org/abs/2511.13720
A deep learning toolkit for Text-to-Speech, battle-tested in research and production
A generative speech model for daily dialogue.
A large-scale dataset of music sheet images designed for VQA in music understanding.
Repository for the paper https://arxiv.org/abs/2504.13837
[CVPR 2025] Devils in Middle Layers of Large Vision-Language Models: Interpreting, Detecting and Mitigating Object Hallucinations via Attention Lens
Paper2Agent is a multi-agent AI system that automatically transforms research papers into interactive AI agents with minimal human input.
[CVPR 2025] Teaching Large Language Models to Regress Accurate Image Quality Scores using Score Distribution
A high-throughput and memory-efficient inference and serving engine for LLMs
OmniVinci is an omni-modal LLM for joint understanding of vision, audio, and language.
VILA is a family of state-of-the-art vision language models (VLMs) for diverse multimodal AI tasks across the edge, data center, and cloud.