HuggingFace 🤗 | ModelScope 🔮 | Paper 📖 | Demo 🎥
Shanghai Jiao Tong University | Ant Group
VocalNet is a series of high-performance, low-latency speech LLMs enabled by a scalable and model-agnostic training framework for real-time voice interaction. VocalNet introduces multi-token prediction (MTP), a novel approach optimized for speech LLMs that simultaneously improves generation speed and quality. VocalNet outperforms many mainstream Omni LLMs and existing open-source speech LLMs despite using significantly less training data.
- VocalNet-1B: Built upon LLaMA-3.2-1B-Instruct. With far fewer parameters, VocalNet-1B achieves performance comparable to mainstream speech LLMs, including LLaMA-Omni, Freeze-Omni, and GLM-4-Voice.
- VocalNet-8B: Built upon LLaMA-3.1-8B-Instruct. In the evaluation of speech interaction models, VocalNet-8B significantly outperforms most mainstream speech LLMs and Omni LLMs.
Model Architecture & Training Strategy
- VocalNet consists of a speech encoder that converts waveforms into speech representations, a pre-trained LLM backbone, and a speech decoder for speech token generation. A downsample adaptor is added after the speech encoder to achieve a lower frame rate, and a speech projector bridges the dimension gap between the LLM hidden states and the decoder inputs. The generated speech tokens are sent to the speech vocoder, which constructs the corresponding speech response (see the sketch after this list).
- VocalNet adopts a dual-stage training strategy: Multi-Modal Alignment and Generative Supervised Fine-Tuning. In the first stage, VocalNet is trained on speech queries paired with text responses; the LLM backbone is trained with LoRA, together with the downsample adaptor. In the second stage, VocalNet is trained on speech queries paired with speech responses; the major components are frozen, while the speech projector and speech decoder are trained.
Streaming Decoding
VocalNet employs two attention mask mechanisms, inspired by MiniCPM-o, tailored for complete-sequence processing and real-time speech generation respectively. In non-streaming mode, each text position attends to all text positions, and each speech position attends to all text positions plus itself and its preceding speech positions. In streaming mode, each text position attends to itself and its preceding text positions, and each speech position attends to a chunk-limited set of text positions plus itself and its preceding speech positions.
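A hedged sketch of the two mask patterns is shown below; the sequence layout (text positions followed by speech positions) and the chunk schedule are assumptions for illustration, not VocalNet's exact configuration.
import torch

def build_attention_mask(num_text, num_speech, streaming, text_chunk=4):
    """Boolean mask of shape [L, L]; True means query position i may attend to key position j.
    Positions are laid out as [text_0 .. text_{T-1}, speech_0 .. speech_{S-1}]."""
    total = num_text + num_speech
    mask = torch.zeros(total, total, dtype=torch.bool)
    for t in range(num_text):                                   # text positions
        if streaming:
            mask[t, : t + 1] = True                             # itself and previous text positions
        else:
            mask[t, :num_text] = True                           # whole text
    for s in range(num_speech):                                 # speech positions
        q = num_text + s
        if streaming:
            visible_text = min(num_text, (s // text_chunk + 1) * text_chunk)  # chunk-limited text
        else:
            visible_text = num_text                             # whole text
        mask[q, :visible_text] = True
        mask[q, num_text : q + 1] = True                        # itself and previous speech positions
    return mask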
MTP Implementation
VocalNet utilizes N-1 sequential Transformer layers as MTP modules, enabling the prediction of N speech tokens in a single inference step while preserving their temporal relationships. In addition, a layer-wise decaying cross-entropy loss is introduced as the MTP loss.
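As a rough illustration, the sketch below implements a layer-wise decaying cross-entropy over the MTP heads; the head count, decay factor, and tensor shapes are assumptions, not the released training code.
import torch.nn.functional as F

def mtp_loss(head_logits, speech_targets, decay=0.8):
    """head_logits: list of N tensors of shape [B, T, V], where head k predicts the speech
    token k steps further ahead; speech_targets: [B, T + N - 1] ground-truth token ids."""
    total, weight_sum = 0.0, 0.0
    for k, logits in enumerate(head_logits):
        steps = logits.size(1)
        targets = speech_targets[:, k : k + steps]              # shift targets for head k
        ce = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
        weight = decay ** k                                     # deeper MTP heads get smaller weight
        total = total + weight * ce
        weight_sum += weight
    return total / weight_sum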
Installation
git clone https://github.com/YHWmz/VocalNet.git
cd VocalNet
conda create -n vocalnet python==3.10
conda activate vocalnet
pip install --upgrade pip
conda install pytorch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 pytorch-cuda=12.1 -c pytorch -c nvidia
pip install -e .
# Optional: Install additional packages for training
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
- We provide two re-synthesized datasets for VocalNet training (a download sketch follows this list):
- VoiceAssistant-430K from Mini-Omni: ModelScope HuggingFace
- UltraChat from SLAM-Omni: ModelScope HuggingFace
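If you pull the datasets from HuggingFace, a snapshot_download call along the lines of the hedged sketch below is sufficient; the repo_id is a placeholder, so substitute the actual dataset id from the links above.
from huggingface_hub import snapshot_download

# Placeholder repo_id: replace it with the dataset id behind the HuggingFace link above.
snapshot_download(
    repo_id="<org>/VoiceAssistant-430K-vocalnet",
    repo_type="dataset",
    local_dir="./playground/VoiceAssistant-430K-vocalnet",
)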
Here's an example of a training data instance in JSON format:
[
  {
    "id": "19090",
    "speech": "VoiceAssistant-430K-vocalnet/audios/19090.wav",
    "conversations": [
      {
        "from": "human",
        "value": "<speech>"
      },
      {
        "from": "gpt",
        "value": "The legality of shooting someone who breaks into your house depends on the laws of your specific location. In some places, \"stand your ground\" laws or \"castle doctrine\" laws allow homeowners to use lethal force if they feel threatened. However, other areas may require you to retreat if possible before using force. It's important to familiarize yourself with the laws in your region and consider non-lethal methods of protection as well. Consulting with a legal expert can provide you with guidance tailored to your situation."
      }
    ],
    "units": "VoiceAssistant-430K-vocalnet/cosyvoice2_tokens/19090.npy"
  }
]
After downloading the datasets, organize them under ./playground/ as follows:
├── VoiceAssistant-430K-vocalnet
│ ├── audios
│ └── cosyvoice2_tokens
└── UltraChat-vocalnet
├── audios
└── cosyvoice2_tokens
The speech waveforms are synthesized with CosyVoice2, and we use CosyVoice2's speech tokenizer to extract the speech tokens.
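Once the data is organized as above, an instance from the JSON index can be read back with standard json/librosa/NumPy calls, as in the hedged snippet below (the index path and the 16 kHz sampling rate are assumptions based on the training configs later in this README).
import json
import numpy as np
import librosa

with open("./playground/VoiceAssistant-430K-VocalNet/VoiceAssistant-430K.json") as f:
    index = json.load(f)

item = index[0]
wave, sr = librosa.load("./playground/" + item["speech"], sr=16000)  # speech query waveform
units = np.load("./playground/" + item["units"])                     # CosyVoice2 speech tokens
text_response = item["conversations"][1]["value"]                    # target text response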
- Download the base LLM and speech encoder
- Base LLM: LLaMA-3.2-1B-Instruct or LLaMA-3.1-8B-Instruct
- Speech encoder: Whisper-large-v3
- Path Modification in scripts/mtp/llama_stage1_s2t.sh
# Model checkpoint configuration
CHECKPOINT_NAME="llama32-1B-instruct-s2t" # Checkpoint name (will be used as the directory name)
CHECKPOINT_DIR="./checkpoints/${CHECKPOINT_NAME}" # Path to save checkpoints
# Base model configuration
BASE_MODEL="./hf_hub/Llama-3.2-1B-Instruct" # Path to the pretrained base LLM
# Data path configuration
DATA_PATH="./playground/VoiceAssistant-430K-VocalNet/VoiceAssistant-430K.json" # Training data index
SPEECH_FOLDER="./playground/" # Root directory for speech files (contains speech queries and preprocessed speech tokens)
# Speech encoder configuration
SPEECH_ENCODER="./models/speech_encoder/whisper-large-v3" # Path to the Whisper speech encoder model
- Start Training
bash scripts/mtp/llama_stage1_s2t.sh
- Path Modification in scripts/mtp/llama_stage2_s2s.sh
# Model Configuration
CHECKPOINT_NAME="llama32-1B-instruct-s2s-mtp5" # Checkpoint name
CHECKPOINT_DIR="./checkpoints/${CHECKPOINT_NAME}" # Directory to store training checkpoints
# Model Parameters
BASE_MODEL="./checkpoints/llama32-1B-instruct-s2t" # Stage I S2T model (used as initialization)
# Dataset Paths
DATA_PATH="./playground/VoiceAssistant-430K-VocalNet/VoiceAssistant-430K.json"
SPEECH_FOLDER="./playground/"
# Speech encoder configuration
SPEECH_ENCODER="./models/speech_encoder/whisper-large-v3" # Whisper-Large-V3 encoder weights
- Start Training
bash scripts/mtp/llama_stage2_s2s.sh
Inference
- Model Preparation:
- Download Our Open-source Models: VocalNet-1B from HuggingFace or ModelScope, and VocalNet-8B from HuggingFace or ModelScope.
- Download the Whisper model from HuggingFace and place it in the ./models/speech_encoder/ directory.
- CosyVoice Preparation: We utilize CosyVoice2's flow-matching model to convert VocalNet-generated speech tokens into the final audio waveform.
- To run inference with VocalNet-1B or VocalNet-8B, you need to download CosyVoice2-0.5B from HuggingFace.
- Path Modification
- Modify the paths in omni_speech/infer/vocalnet.py:
COSYVOICE_MODEL="" ## CosyVoice2-0.5B, e.g. /workspace/CosyVoice/pretrained_models/CosyVoice2-0.5B-VocalNet
VOCALNET_MODEL = "" ## VocalNet speech LLM, e.g. ./checkpoints/VocalNet-1B
- Local Infer
## stage 1 infer (s2t)
python3 omni_speech/infer/vocalnet.py --query_audio ./omni_speech/infer/llama_questions_42.wav
## stage 2 infer (s2s)
python3 omni_speech/infer/vocalnet.py --query_audio ./omni_speech/infer/llama_questions_42.wav --s2s --save_dir ./
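To run the s2s command above over a whole folder of speech queries, a simple wrapper like the hedged sketch below is enough; the query folder and save directory are example paths.
import pathlib
import subprocess

query_dir = pathlib.Path("./my_queries")      # example folder containing .wav speech queries
for wav in sorted(query_dir.glob("*.wav")):
    subprocess.run(
        ["python3", "omni_speech/infer/vocalnet.py",
         "--query_audio", str(wav), "--s2s", "--save_dir", "./outputs"],
        check=True,
    )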
Evaluation
VocalNet is evaluated on OpenAudioBench, which consists of AlpacaEval, LLaMA Questions, TriviaQA, and Web Questions. Bold indicates the best result within each model-size group, and underline indicates the second-best result among the base-size models.
Overall Performance
| Model | LLM size | Modality | AlpacaEval | LLaMA Questions | TriviaQA | Web Questions |
| --- | --- | --- | --- | --- | --- | --- |
| Tiny Models | | | | | | |
| Mini-Omni | 0.5B | s→t | 1.84 | 2.7 | 0.12 | 0.22 |
| | | s→s | 1.80 | 2.7 | 0.08 | 0.20 |
| SLAM-Omni | 0.5B | s→t | 3.50 | 29.4 | 0.39 | 0.84 |
| | | s→s | 3.01 | 26.7 | 0.34 | 0.69 |
| VocalNet-1B (VA) | 1B | s→t | 5.38 | 70.3 | 3.38 | 4.93 |
| | | s→s | 4.83 | 61.0 | 2.78 | 4.47 |
| VocalNet-1B | 1B | s→t | **5.79** | **71.7** | **3.60** | **5.16** |
| | | s→s | **5.03** | **63.7** | **3.06** | **4.68** |
| Base Models | | | | | | |
| LLaMA-Omni | 8B | s→t | 5.31 | 69.7 | 4.44 | 5.44 |
| | | s→s | 3.89 | 55.1 | 2.44 | 4.00 |
| Freeze-Omni | 7B | s→t | 4.51 | 77.7 | 5.32 | 6.41 |
| | | s→s | 2.99 | 60.2 | 3.53 | 4.78 |
| GLM-4-Voice | 9B | s→t | 5.86 | 77.4 | 4.95 | 5.56 |
| | | s→s | 5.27 | 64.3 | 4.63 | 5.40 |
| Baichuan-Omni-1.5 | 7B | s→t | 5.20 | 77.6 | 5.72 | 6.12 |
| | | s→s | 4.10 | 61.2 | 4.13 | 5.18 |
| MiniCPM-o | 8B | s→t | 6.13 | 77.2 | **6.43** | **7.16** |
| | | s→s | 4.95 | 65.8 | 4.99 | <u>6.22</u> |
| Minmo* | 8B | s→t | - | 78.9 | 4.83 | 5.50 |
| | | s→s | **6.48** | 64.1 | 3.75 | 3.99 |
| Qwen2.5-Omni | 8B | s→t | 6.01 | <u>79.0</u> | 5.89 | <u>6.88</u> |
| | | s→s | 5.73 | **76.3** | <u>5.59</u> | **6.70** |
| VocalNet-8B (VA) | 8B | s→t | <u>7.05</u> | 77.1 | 6.15 | 6.34 |
| | | s→s | 6.30 | 71.4 | 5.24 | 5.81 |
| VocalNet-8B | 8B | s→t | **7.12** | **79.5** | <u>6.24</u> | 6.48 |
| | | s→s | <u>6.37</u> | <u>73.1</u> | **5.67** | 6.16 |
Response Alignment and Acoustic Quality
| Model | AlpacaEval WER | AlpacaEval UTMOS | LLaMA Questions WER | LLaMA Questions UTMOS | TriviaQA WER | TriviaQA UTMOS | Web Questions WER | Web Questions UTMOS | Avg WER | Avg UTMOS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Tiny Models | | | | | | | | | | |
| Mini-Omni | 20.78 | 4.429 | 5.20 | 4.428 | 7.43 | 4.428 | 8.51 | 4.433 | 8.66 | 4.430 |
| SLAM-Omni | 5.52 | 4.439 | 5.55 | 4.467 | 6.16 | 4.470 | 6.50 | 4.461 | 6.17 | 4.464 |
| VocalNet-1B (VA) | **3.43** | **4.495** | 3.65 | **4.498** | **5.97** | **4.499** | 6.40 | 4.489 | 5.66 | **4.495** |
| VocalNet-1B | **3.43** | 4.491 | **3.27** | 4.497 | 6.73 | 4.486 | **4.88** | **4.493** | **5.31** | 4.491 |
| Base Models | | | | | | | | | | |
| LLaMA-Omni | 6.00 | 3.942 | 10.00 | 4.003 | 20.93 | 3.965 | 14.60 | 3.935 | 15.90 | 3.956 |
| Freeze-Omni | 14.33 | 4.377 | 14.20 | 4.417 | 20.39 | 4.404 | 18.25 | 4.398 | 18.31 | 4.401 |
| GLM-4-Voice | 18.71 | 4.025 | 14.45 | 4.152 | 8.33 | 4.306 | 6.08 | 4.214 | 8.99 | 4.228 |
| Baichuan-Omni-1.5 | 20.84 | 4.082 | 22.82 | 4.332 | 22.36 | 4.401 | 23.29 | 4.350 | 22.67 | 4.347 |
| MiniCPM-o | 15.35 | 4.102 | 5.73 | 4.228 | 8.08 | 4.128 | 8.94 | 4.125 | 8.72 | 4.137 |
| Qwen2.5-Omni | **2.41** | 4.299 | **0.93** | 4.315 | **1.13** | 4.339 | 4.68 | 4.363 | **2.63** | 4.342 |
| VocalNet-8B (VA) | <u>2.65</u> | **4.490** | 3.00 | **4.503** | 5.02 | **4.499** | <u>4.21</u> | <u>4.485</u> | 4.26 | **4.493** |
| VocalNet-8B | 4.71 | <u>4.489</u> | <u>2.68</u> | <u>4.500</u> | <u>4.04</u> | <u>4.482</u> | **3.11** | **4.492** | <u>3.56</u> | <u>4.489</u> |
Acknowledgements
- LLaMA-Omni: VocalNet is developed on the codebase of LLaMA-Omni.
- LLaVA: We borrow a substantial amount of model-training code from LLaVA.
- CosyVoice: VocalNet borrows speech generation code from CosyVoice.
- Freeze-Omni: We borrow some speech decoder code from Freeze-Omni.
This repository is released under the Apache-2.0 license as found in the LICENSE file.
If you find our data/model/code/paper helpful, please consider citing our paper 📝 and giving us a star ⭐️!
@article{wang2025vocalnet,
title={VocalNet: Speech LLM with Multi-Token Prediction for Faster and High-Quality Generation},
author={Wang, Yuhao and Liu, Heyang and Cheng, Ziyang and Wu, Ronghua and Gu, Qunshan and Wang, Yanfeng and Wang, Yu},
journal={arXiv preprint arXiv:2504.04060},
year={2025}
}