Computer Science and Technology, Beijing
Stars
MindWatcher: Toward Smarter Multimodal Tool-Integrated Reasoning
The official VeRL repository for Autoregressive TTS.
High-Resolution Image Synthesis with Latent Diffusion Models
verl: Volcano Engine Reinforcement Learning for LLMs
[NeurIPS '25] Benchmark for evaluating TTS models on complex prosodic, expressive, and linguistic challenges.
Official PyTorch Implementation of "Diffusion Transformers with Representation Autoencoders"
Compute WER and SER for speech recognition evaluation (a minimal sketch of both metrics follows this list).
MiMo-Audio: Audio Language Models are Few-Shot Learners
Qwen3-omni is a natively end-to-end, omni-modal LLM developed by the Qwen team at Alibaba Cloud, capable of understanding text, audio, images, and video, as well as generating speech in real time.
[TMLR 2025🔥] A survey of autoregressive models in vision.
FlashCosyVoice: A lightweight vLLM implementation built from scratch for CosyVoice.
Official code for "F5R-TTS: Improving Flow-Matching based Text-to-Speech with Group Relative Policy Optimization"
Text-audio foundation model from Boson AI
Step-Audio 2 is an end-to-end multi-modal large language model designed for industry-strength audio understanding and speech conversation.
[CVPR'2025] VoCo-LLaMA: This repo is the official implementation of "VoCo-LLaMA: Towards Vision Compression with Large Language Models".
CUDA Templates and Python DSLs for High-Performance Linear Algebra
A native-PyTorch library for large-scale multimodal LLM (text/audio) training with tensor, context, and data parallelism (TP/CP/DP).
The Bert-VITS2 project has many bugs and an unfriendly tutorial. This project fixes as many of Bert-VITS2's bugs as possible and supports one-click training. With only 50 utterances from the target speaker, you can obtain a stable, fast TTS model.
Your faithful, impartial partner for audio evaluation: honest evaluation, know yourself and know your rivals.
A PyTorch library for implementing flow matching algorithms, featuring continuous and discrete flow matching implementations. It includes practical examples for both text and image modalities.
Unified Efficient Fine-Tuning of 100+ LLMs & VLMs (ACL 2024)
SGLang is a high-performance serving framework for large language models and multimodal models.
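As referenced in the WER/SER entry above, the two metrics are standard for speech recognition evaluation: WER is the word-level Levenshtein distance divided by the number of reference words, and SER is the fraction of sentences containing at least one error. The following is a minimal, self-contained Python sketch of both; the function names wer and ser are illustrative and not taken from that repository.

# Minimal sketch of WER and SER (assumed helper names, plain Python).
def wer(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i          # delete all remaining reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j          # insert all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # match / substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

def ser(references: list[str], hypotheses: list[str]) -> float:
    """SER = fraction of sentences with at least one word error."""
    wrong = sum(r.split() != h.split() for r, h in zip(references, hypotheses))
    return wrong / max(len(references), 1)

# Example: one substitution in a four-word reference gives WER 0.25, SER 1.0.
print(wer("the cat sat down", "the cat sat dawn"))      # 0.25
print(ser(["the cat sat down"], ["the cat sat dawn"]))  # 1.0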