Stars
Pioneering Automated GUI Interaction with Native Agents
[CVPR 2026 Highlight] WorldMM: Dynamic Multimodal Memory Agent for Long Video Reasoning
Embedding model geared toward multimodal RAG; top-1 on both Overall and VisDoc on the MMEB benchmark
The repository provides code for running inference and finetuning with the Meta Segment Anything Model 3 (SAM 3), links for downloading the trained model checkpoints, and example notebooks that sho…
🔧Tool-Star: Empowering LLM-brained Multi-Tool Reasoner via Reinforcement Learning
SmartHome-Bench: A Comprehensive Benchmark for Video Anomaly Detection in Smart Homes Using Multi-Modal Foundation Models
A Chinese-language financial trading framework based on multi-agent LLMs - TradingAgents Chinese Enhanced Edition
Visual Embodied Brain: Let Multimodal Large Language Models See, Think, and Control in Spaces
Ming - facilitating advanced multimodal understanding and generation capabilities built upon the Ling LLM.
skip-vision: efficient and scalable acceleration of vision-language models via adaptive token skipping
Inference, fine-tuning, and many more recipes with the Gemma family of models
[NeurIPS 2025] 3DRS: MLLMs Need 3D-Aware Representation Supervision for Scene Understanding
[ICCV'25] Ross3D: Reconstructive Visual Instruction Tuning with 3D-Awareness
A new zero-shot framework for exploring unknown environments and searching for language-described targets, based on a Large Vision-Language Model.
[IROS 2025 Best Paper Award Finalist & IEEE TRO 2026] The Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems
[CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to GPT-4o. An open-source multimodal chat model approaching GPT-4o performance
NVIDIA Isaac GR00T N1.7 - A Foundation Model for Generalist Robots.
Efficient Triton Kernels for LLM Training
Live-streamed development of RL tuning for LLM agents
Fully local web research and report writing assistant
🦉 OWL: Optimized Workforce Learning for General Multi-Agent Assistance in Real-World Task Automation
[NeurIPS2025 Spotlight 🔥 ] Official implementation of 🛸 "UFO: A Unified Approach to Fine-grained Visual Perception via Open-ended Language Interface"
(CVPR 2025 highlight✨) Official repository of paper "LLMDet: Learning Strong Open-Vocabulary Object Detectors under the Supervision of Large Language Models"
FlashMLA: Efficient Multi-head Latent Attention Kernels