Stars
This repository provides a valuable reference for researchers in the field of multimodality. Start your exploration of RL-based reasoning MLLMs here!
AI Audio Datasets (AI-ADS) 🎵, including Speech, Music, and Sound Effects, which can provide training data for Generative AI, AIGC, AI model training, intelligent audio tool development, and audio a…
verl/HybridFlow: A Flexible and Efficient RL Post-Training Framework
A project page template for academic papers. Demo at https://eliahuhorwitz.github.io/Academic-project-page-template/
Very low latency speech to text, intent recognition, and text to speech, for building voice agents and interfaces
CAT is more than a CRF-based ASR toolkit: it provides a complete workflow for data-efficient end-to-end ASR, supporting CTC, CTC-CRF, RNN-T, and language-model training and inference.
Kyutai's Speech-To-Text and Text-To-Speech models based on the Delayed Streams Modeling framework.
Qwen3-ASR is an open-source series of ASR models developed by the Qwen team at Alibaba Cloud, supporting stable multilingual speech/music/song recognition, language detection and timestamp prediction.
A Fundamental End-to-End Speech Recognition Toolkit and Open Source SOTA Pretrained Models, Supporting Speech Recognition, Voice Activity Detection, Text Post-processing etc.
Use PEFT or Full-parameter to CPT/SFT/DPO/GRPO 600+ LLMs (Qwen3.6, DeepSeek-R1, GLM-5.1, InternLM3, Llama4, ...) and 300+ MLLMs (Qwen3-VL, Qwen3-Omni, InternVL3.5, Ovis2.5, GLM4.5v, Gemma4, Llava, …
A Python library for audio data augmentation. Useful for making audio ML models work well in the real world, not just in the lab.
The official repo of NBC & SpatialNet for multichannel speech separation, denoising, and dereverberation
Tools for handling multimodal data in machine learning projects.
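"A Guide to Consuming Open-Source Large Models" (开源大模型食用指南): a tutorial tailored for Chinese beginners on quickly fine-tuning (full-parameter/LoRA) and deploying Chinese and international open-source large language models (LLMs) and multimodal large language models (MLLMs) in a Linux environment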
Transform-average-concatenate (TAC) method for end-to-end ad-hoc beamforming that is invariant to microphone permutation and number.
Awesome Neural Codec Models, Text-to-Speech Synthesizers & Speech Language Models
AcademiCodec: An Open Source Audio Codec Model for Academic Research
Reading notes about Multimodal Large Language Models, Large Language Models, and Diffusion Models