University of Chinese Academy of Sciences
Beijing, China
Starred repositories
The repository provides code for running inference with the Meta Segment Anything Audio Model (SAM-Audio), links for downloading the trained model checkpoints, and example notebooks that show how t…
An Open Phone Agent Model & Framework. Unlocking the AI Phone for Everyone
GLM-TTS: Controllable & Emotion-Expressive Zero-shot TTS with Multi-Reward Reinforcement Learning
GLM-ASR-Nano: A robust, open-source speech recognition model with 1.5B parameters
Versatile audio super resolution (any -> 48kHz) with AudioSR.
AudioLDM: Generate speech, sound effects, music and beyond, with text.
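As a hedged illustration of how AudioLDM's text-to-audio generation is typically driven from Python: the sketch below assumes the `audioldm` pip package, and the `build_model`, `text_to_audio`, and `save_wave` names and arguments are taken from that package's API, so treat them as assumptions if your installed version differs.
```python
# Minimal sketch, assuming the `audioldm` pip package (pip install audioldm).
# Function names/arguments follow its published API; verify against your version.
from audioldm import build_model, text_to_audio, save_wave

model = build_model()  # loads the default AudioLDM checkpoint
waveform = text_to_audio(
    model,
    "A hammer is hitting a wooden surface",  # text prompt
    duration=5,          # seconds of audio to generate
    guidance_scale=2.5,  # classifier-free guidance strength
)
save_wave(waveform, "./output", name="hammer")  # write the result as a .wav file
```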
🤯 LobeHub - an open-source, modern-design AI Agent Workspace. Supports multiple AI providers, Knowledge Base (file upload / RAG), one-click install MCP Marketplace and Artifacts / Thinking. One-cl…
GELab: GUI Exploration Lab. One of the best GUI agent solutions in the galaxy, built by the StepFun-GELab team and powered by Step’s research capabilities.
Turn any PDF or image document into structured data for your AI. A powerful, lightweight OCR toolkit that bridges the gap between images/PDFs and LLMs. Supports 100+ languages.
Lightning-Fast, On-Device TTS — running natively via ONNX.
Official implementation of YingMusic-SVC.
τ²-Bench: Evaluating Conversational Agents in a Dual-Control Environment
We Speech Toolkit: an LLM-based toolkit for speech understanding, generation, and interaction
FLM-Audio is an audio-language sub-version of RoboEgo/FLM-Ego, an omnimodal model with native full duplexity.
Qwen3-omni is a natively end-to-end, omni-modal LLM developed by the Qwen team at Alibaba Cloud, capable of understanding text, audio, images, and video, as well as generating speech in real time.
The official implementation of OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows
RAGFlow is a leading open-source Retrieval-Augmented Generation (RAG) engine that fuses cutting-edge RAG with Agent capabilities to create a superior context layer for LLMs
DeepResearchAgent is a hierarchical multi-agent system designed not only for deep research tasks but also for general-purpose task solving. The framework leverages a top-level planning agent to coo…
MiMo-Audio: Audio Language Models are Few-Shot Learners
A multimodal RAG application that enables semantic search on multimedia sources like audio, video and images
A lightweight, powerful framework for multi-agent workflows
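Assuming this entry refers to the OpenAI Agents SDK (openai/openai-agents-python), a minimal hedged sketch of a single-agent run looks like the following; the `Agent`/`Runner` names come from that SDK's quickstart, and the prompt text is purely illustrative.
```python
# Minimal sketch, assuming the OpenAI Agents SDK (pip install openai-agents)
# and an OPENAI_API_KEY set in the environment.
from agents import Agent, Runner

agent = Agent(
    name="Assistant",
    instructions="You are a concise, helpful assistant.",
)

# Runner.run_sync drives one agent loop to completion and returns the result.
result = Runner.run_sync(agent, "Summarize what a multi-agent workflow is.")
print(result.final_output)
```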
Official Repository of "OmniTry: Virtual Try-On Anything without Masks"
FlashCosyVoice: A lightweight vLLM implementation built from scratch for CosyVoice.
Accepted as a NeurIPS 2024 Spotlight Presentation paper.
LLM-powered framework for deep document understanding, semantic retrieval, and context-aware answers using RAG paradigm.