- Institute of Computing Technology, Chinese Academy of Sciences
- Beijing Haidian
- https://ming-er.github.io/
Stars
[NeurIPS 2023] AV-NeRF: Learning Neural Fields for Real-World Audio-Visual Scene Synthesis
[ICLR 2026] SmartDJ: declarative audio editing with audio language model.
5Hz Deep-Compression Speech VAE for AR-Diffusion and CALMs
Official implementation of "ViSAGe: Video-to-Spatial Audio Generation" (ICLR 2025)
Official code of the ICML 2025 paper "NTPP: Generative Speech Language Modeling for Dual-Channel Spoken Dialogue via Next-Token-Pair Prediction"
Your faithful, impartial partner for genuine audio evaluation: know yourself, know your rivals.
A Benchmark for Evaluating Turn-Taking and Overlap Handling in Full-Duplex Spoken Dialogue Models
This repository contains a series of works on diffusion-based speech tokenizers, including the official implementation of the paper: "TaDiCodec: Text-aware Diffusion Speech Tokenizer for Speech Lan…
[ACL 2026 Main] MeanAudio: Fast and Faithful Text-to-Audio Generation with Mean Flows
Reference implementation for Token-level Direct Preference Optimization (TDPO)
The official code repository for SongBloom: Coherent Song Generation via Interleaved Autoregressive Sketching and Diffusion Refinement
Kyutai's Speech-To-Text and Text-To-Speech models based on the Delayed Streams Modeling framework.
Step-Audio 2 is an end-to-end multi-modal large language model designed for industry-strength audio understanding and speech conversation.
Text-audio foundation model from Boson AI
An official implementation of DanceGRPO: Unleashing GRPO on Visual Generation
N-dimensional Rotary Position Embeddings for PyTorch
This is the code for the paper "XY-Tokenizer: Mitigating the Semantic-Acoustic Conflict in Low-Bitrate Speech Codecs"
[ICLR 2026] AliTok: Towards Sequence Modeling Alignment between Tokenizer and Autoregressive Model
Unofficial PyTorch implementation of "Autoregressive Speech Synthesis without Vector Quantization (MELLE)"
PodAgent: A Comprehensive Framework for Podcast Generation
MOSS-TTSD is a spoken dialogue generation model designed for expressive multi-speaker synthesis. It features long-context modeling, flexible speaker control, and multilingual support, while enablin…
Fast and High-Quality Zero-Shot Text-to-Speech with Flow Matching