University of Chinese Academy of Sciences
- Beijing
Stars
High-Quality Voice Cloning TTS for 600+ Languages
[ACL 2026 Main] FineLAP: Taming Heterogeneous Supervision for Fine-grained Language-Audio Pre-training
VITA-QINYU: Expressive Spoken Language Model for Role-Playing and Singing
The repository provides code for running inference with the Meta Segment Anything Audio Model (SAM-Audio), links for downloading the trained model checkpoints, and example notebooks that show how t…
GitHub Repository for the AudSemThinker Model and the AudSem Dataset
A framework for efficient model inference with omni-modality models
Send a phone call from AI agent, in an API call. Or, directly call the bot from the configured phone number!
Open-source framework for conversational voice AI agents
[ACL 2025] OZSpeech: One-step Zero-shot Speech Synthesis with Learned-Prior-Conditioned Flow Matching
The first medical SpeechLM, open-sourced with weight, data, and code of training, inference, and evaluation.
Your own personal AI assistant. Any OS. Any Platform. The lobster way. 🦞
Qwen3-TTS is an open-source series of TTS models developed by the Qwen team at Alibaba Cloud, supporting stable, expressive, and streaming speech generation, free-form voice design, and vivid voice…
A Fully Self-Hosted Solution for Full-Duplex Voice Interaction
GPT-4o-level, real-time spoken dialogue system.
An audio/acoustic activity detection and audio segmentation tool
Silero VAD: pre-trained enterprise-grade Voice Activity Detector
FlowMirror-HydraVox — A natively accelerated multi-head autoregressive TTS system derived from CosyVoice 3.0. It predicts multiple tokens per step for faster, high-quality speech synthesis, featuri…
Added vLLM support to IndexTTS for faster inference.
Multilingual TTS model with voice cloning and duration control, based on T5Gemma encoder-decoder LLM
💖🧸 Self hosted, you-owned Grok Companion, a container of souls of waifu, cyber livings to bring them into our worlds, wishing to achieve Neuro-sama's altitude. Capable of realtime voice chat, Minec…
GLM-4.6V/4.5V/4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning
The official repo for "Vidi: Large Multimodal Models for Video Understanding and Editing"
Controllable and fast Text-to-Speech for over 7000 languages!
StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models
A generative speech model for daily dialogue.
GLM-ASR-Nano: A robust, open-source speech recognition model with 1.5B parameters