Stars
Transforming cold farewells into warm skills? It's giving rebirth era. Welcome to Digital Life 1.0. 🫶
High-Quality Voice Cloning TTS for 600+ Languages
Codebase for 'ParaSpeechCLAP: A Dual-Encoder Speech-Text Model for Rich Stylistic Language-Audio Pretraining'
The repo is finally unlocked. Enjoy the party! The fastest repo in history to surpass 100K stars ⭐. Built in Rust using oh-my-codex. Join Discord: https://discord.gg/5TUQKqFWd
Automatic evaluation of speech-to-speech models via TRACE.
MiMo-Audio: Audio Language Models are Few-Shot Learners
Diffusion-based singing voice pitch correction
Covo-Audio is a 7B-parameter end-to-end large audio language model that directly processes continuous audio inputs and generates audio outputs within a single unified architecture.
Plug-and-play streaming semantic VAD for real-time full-duplex spoken dialogue systems.
A CLI for Bilibili — browse videos, users, search, and feeds from the terminal
A CLI for Xiaohongshu (小红书) — search, read, interact via reverse-engineered API
Qwen3-ASR is an open-source series of ASR models developed by the Qwen team at Alibaba Cloud, supporting stable multilingual speech/music/song recognition, language detection and timestamp prediction.
ACM MM 2021: 'Is Someone Speaking? Exploring Long-term Temporal Features for Audio-visual Active Speaker Detection'
This repository is maintained by the Speech Team at Alibaba’s Tongyi Lab, serving as an open-source platform for our cutting-edge research in speech, audio, NLP technologies. We believe in accelera…
An Open-Source Multidimension Speech Understanding Foundation Model Built upon OpenPangu on Ascend NPUs
FlowMirror-HydraVox — A natively accelerated multi-head autoregressive TTS system derived from CosyVoice 3.0. It predicts multiple tokens per step for faster, high-quality speech synthesis, featuri…
Unofficial PyTorch implementation of the paper "Mean Flows for One-step Generative Modeling" by Geng et al.
Claude Code is an agentic coding tool that lives in your terminal, understands your codebase, and helps you code faster by executing routine tasks, explaining complex code, and handling git workflo…
Official implementation of "DistDF: Time-Series Forecasting Needs Joint-Distribution Wasserstein Alignment"
Ming-omni-tts: Simple and Efficient Unified Generation of Speech, Music, and Sound with Precise Control
Ming - facilitating advanced multimodal understanding and generation capabilities built upon the Ling LLM.
A Semantically Consistent Dataset for Data-Efficient Query-Based Universal Sound Separation
Official implementation of YingMusic-SVC.
[ACL 2026 Main] Open-Ended Speaking Style Modeling via Fine-Grained and Multi-Granular Contrastive Language-Speech Pre-training
Qwen3-TTS is an open-source series of TTS models developed by the Qwen team at Alibaba Cloud, supporting stable, expressive, and streaming speech generation, free-form voice design, and vivid voice…
World's first open-source real-time end-to-end spoken dialogue model with personalized voice cloning.
A concise but complete full-attention transformer with a set of promising experimental features from various papers