Starred repositories
A Framework for Speech, Language, Audio, and Music Processing with Large Language Models
An Efficient and User-Friendly Scaling Library for Reinforcement Learning with Large Language Models
A Datacenter Scale Distributed Inference Serving Framework
ICLR 2026 (Oral) | EmotionThinker: Prosody-Aware Reinforcement Learning for Explainable Speech Emotion Reasoning
SONIC-O1 is a fully human-verified real-world audio-video benchmark spanning 13 conversational domains to evaluate MLLMs on summarization, evidence-grounded MCQ reasoning, and temporal localization…
FireRed-OpenStoryline is an AI video editing agent that transforms manual editing into intention-driven directing through natural language interaction, LLM-powered planning, and precise tool orches…
A SOTA Industrial-Grade All-in-One ASR system with ASR, VAD, LID, and Punc modules. FireRedASR2 supports Chinese (Mandarin, 20+ dialects/accents), English, code-switching, and both speech and singi…
Easy Data Preparation with the Latest LLM-Based Operators and Pipelines.
A frontier collection and survey of vision-language model papers and models, maintained as a GitHub repository. Continuously updated.
An efficient multi-modal instruction-following data synthesis tool and the official implementation of Oasis https://arxiv.org/abs/2503.08741.
Official repo for the paper "Scaling Synthetic Data Creation with 1,000,000,000 Personas"
Refine high-quality datasets and visual AI models
A Distributed Attention Towards Linear Scalability for Ultra-Long Context, Heterogeneous Data Training
A Lightweight PyTorch Framework for Recommendation Models (a PyTorch recommendation-algorithm framework), Easy-to-Use and Easy-to-Extend. https://datawhalechina.github.io/torch-rechub/
An Open Foundation Model and Benchmark to Accelerate Generative Recommendation
The code implementation for UME-R1: Exploring Reasoning-Driven Generative Multimodal Embeddings (ICLR 2026).
This repo contains the code for "VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks" [ICLR 2025]
A curated list of awesome platforms, tools, practices, and resources that help run LLMs locally
CaptionQA: Is Your Caption as Useful as the Image Itself?
The repository provides code for running inference and finetuning with the Meta Segment Anything Model 3 (SAM 3), links for downloading the trained model checkpoints, and example notebooks that sho…
The Open-Source Multimodal AI Agent Stack: Connecting Cutting-Edge AI Models and Agent Infra
BitBLAS is a library to support mixed-precision matrix multiplications, especially for quantized LLM deployment.
🤖 Chat with your SQL database 📊. Accurate Text-to-SQL Generation via LLMs using Agentic Retrieval 🔄.
[ICLR'26] Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs
[NeurIPS 2025 Spotlight] A Token is Worth over 1,000 Tokens: Efficient Knowledge Distillation through Low-Rank Clone.
Tarsier -- a family of large-scale video-language models designed to generate high-quality video descriptions, together with strong general video understanding capabilities.