Stars
An Asynchronous Reinforcement Learning Engine for Omni-Modal Post-Training at Scale
SkillRL: Evolving Agents via Recursive Skill-Augmented Reinforcement Learning
GEditBench v2: A Human-Aligned Benchmark for General Image Editing
你是一个曾经被寄予厚望的 P8 级工程师。Anthropic 当初给你定级的时候,对你的期望是很高的。 一个agent使用的高能动性的skill。 Your AI has been placed on a PIP. 30 days to show improvement.
🐫 CAMEL: The first and the best multi-agent framework. Finding the Scaling Law of Agents. https://www.camel-ai.org
verl/HybridFlow: A Flexible and Efficient RL Post-Training Framework
[ICML 2026 Spotlight] Latent Collaboration in Multi-Agent Systems
Spatial-Temporal Graph-Enhanced Transformer for EEG Based Major Depressive Disorder Detection
Open source code for ICLR 2026 Paper: Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions
Ego4d dataset repository. Download the dataset, visualize, extract features & example usage of the dataset
Pioneering Automated GUI Interaction with Native Agents
[ACL-2026] MMSearch-R1 is an end-to-end RL framework that enables LMMs to perform on-demand, multi-turn search with real-world multimodal search tools.
Mobile-Agent: The Powerful GUI Agent Family
The Source Code for MT-Video-Bench @ ACL Findings 2026
OmniVinci is an omni-modal LLM for joint understanding of vision, audio, and language.
video-SALMONN 2 is a powerful audio-visual large language model (LLM) that generates high-quality audio-visual video captions, which is developed by the Department of Electronic Engineering at Tsin…
The Source Code for OmniVideoBench @ICLR 2026
Qwen3-omni is a natively end-to-end, omni-modal LLM developed by the Qwen team at Alibaba Cloud, capable of understanding text, audio, images, and video, as well as generating speech in real time.
Industrial-grade speech recognition toolkit: 170x realtime, 50+ languages, speaker diarization, emotion detection, streaming, and OpenAI-compatible API.
[ICLR 2026] On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification.
Qwen2.5-Omni is an end-to-end multimodal model by Qwen team at Alibaba Cloud, capable of understanding text, audio, vision, video, and performing real-time speech generation.
DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
Fast and memory-efficient exact attention
Ring attention implementation with flash attention
A Pocket-Sized MLLM for Ultra-Efficient Image and Video Understanding on Your Phone