Stars
VILA is a family of state-of-the-art vision language models (VLMs) for diverse multimodal AI tasks across the edge, data center, and cloud.
LEAKED SYSTEM PROMPTS FOR CHATGPT, GEMINI, GROK, CLAUDE, PERPLEXITY, CURSOR, DEVIN, REPLIT, AND MORE! - AI SYSTEMS TRANSPARENCY FOR ALL! 👐
Model compression toolkit engineered for enhanced usability, comprehensiveness, and efficiency.
"AI-Trader: Can AI Beat the Market?" Live Trading Bench: https://ai4trade.ai Tech Report Link: https://arxiv.org/abs/2512.10971
Mobile-Agent: The Powerful GUI Agent Family
Use PEFT or full-parameter training for CPT/SFT/DPO/GRPO on 600+ LLMs (Qwen3, Qwen3-MoE, DeepSeek-R1, GLM4.5, InternLM3, Llama4, ...) and 300+ MLLMs (Qwen3-VL, Qwen3-Omni, InternVL3.5, Ovis2.5, GLM4.5v, Llava, …
A comprehensive list of papers for the definition of World Models and using World Models for General Video Generation, Embodied AI, and Autonomous Driving, including papers, codes, and related webs…
The official implementation of VITA, VITA-1.5, LongVITA, VITA-Audio, VITA-VLA, and VITA-E.
Fully Open Framework for Democratized Multimodal Training
SpatialVID: A Large-Scale Video Dataset with Spatial Annotations
PDF Parsing for RAG — Convert to Markdown & JSON, Fast, Local, No GPU
A Survey of Reinforcement Learning for Large Reasoning Models
Unlimited-length talking video generation that supports image-to-video and video-to-video generation
🦛 CHONK docs with Chonkie ✨ — The lightweight ingestion library for fast, efficient and robust RAG pipelines
Qwen3-Omni is a natively end-to-end, omni-modal LLM developed by the Qwen team at Alibaba Cloud, capable of understanding text, audio, images, and video, as well as generating speech in real time.
E2M converts various file types (doc, docx, epub, html, htm, url, pdf, ppt, pptx, mp3, m4a) into Markdown. It’s easy to install, with dedicated parsers and converters, supporting custom configs. E2…
An open-source implementation for fine-tuning the Qwen-VL series by Alibaba Cloud.
Video-R1: Reinforcing Video Reasoning in MLLMs [🔥the first paper to explore R1 for video]
Official repository of 'Visual-RFT: Visual Reinforcement Fine-Tuning' & 'Visual-ARFT: Visual Agentic Reinforcement Fine-Tuning'
R1-Onevision, a visual language model capable of deep CoT reasoning.
Solve Visual Understanding with Reinforced VLMs
Extend OpenRLHF to support LMM RL training for reproduction of DeepSeek-R1 on multimodal tasks.
A fork to add multimodal model training to open-r1
Fully open reproduction of DeepSeek-R1
A library for advanced large language model reasoning
Scalable RL solution for advanced reasoning of language models
The Next Step Forward in Multimodal LLM Alignment
An Easy-to-use, Scalable and High-performance RLHF Framework based on Ray (PPO & GRPO & REINFORCE++ & TIS & vLLM & Ray & Dynamic Sampling & Async Agentic RL)
Unified KV Cache Compression Methods for Auto-Regressive Models