HuggingFace 🤗 | ModelScope 🔮 | Paper 📖 | Demo 🎥
Shanghai Jiao Tong University | Ant Group
VocalNet is a series of high-performance, low-latency speech LLMs enabled by a scalable and model-agnostic training framework for real-time voice interaction. VocalNet introduces multi-token prediction (MTP), a novel approach optimized for speech LLMs that simultaneously improves generation speed and quality. VocalNet outperforms many mainstream Omni LLMs and existing open-source speech LLMs despite using significantly less training data.
- VocalNet-1B: Built upon LLaMA-3.2-1B-Instruct. With far fewer parameters, VocalNet-1B achieves performance comparable to mainstream speech LLMs, including LLaMA-Omni, Freeze-Omni, and GLM-4-Voice.
- VocalNet-8B: Built upon LLaMA-3.1-8B-Instruct. In the evaluation of speech interaction models, VocalNet-8B significantly outperforms most mainstream speech LLMs and Omni LLMs.
Model Architecture & Training Strategy
- VocalNet consists of a speech encoder that converts waveforms into speech representations, a pre-trained LLM backbone, and a speech decoder for speech token generation. A downsample adaptor is added after the speech encoder to achieve a lower frame rate, and a speech projector bridges the dimension gap between the LLM hidden states and the decoder inputs. The generated speech tokens are sent to the speech vocoder, which constructs the corresponding speech response (see the sketch after this list).
- VocalNet adopts a dual-stage training strategy: Multi-Modal Alignment and Generative Supervised Fine-Tuning. In the first stage, VocalNet is trained on speech queries paired with text responses; the LLM backbone is trained with LoRA, together with the downsample adaptor. In the second stage, VocalNet is trained on speech queries paired with speech responses; the major components are frozen, while the speech projector and speech decoder are trained.
Streaming Decoding
VocalNet employs two attention mask mechanisms, inspired by MiniCPM-o, tailored for complete-sequence processing and real-time speech generation respectively. In non-streaming mode, each text position attends to all text positions, and each speech position attends to all text positions plus itself and its preceding speech positions. In streaming mode, each text position attends to itself and its preceding text positions, and each speech position attends to a chunk-limited set of text positions plus itself and its preceding speech positions.
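A hedged sketch of the two mask patterns is shown below; the sequence layout (text positions followed by speech positions) and the chunk schedule are assumptions for illustration, not VocalNet's exact configuration.
import torch

def build_attention_mask(num_text, num_speech, streaming, text_chunk=4):
    """Boolean mask of shape [L, L]; True means query position i may attend to key position j.
    Positions are laid out as [text_0 .. text_{T-1}, speech_0 .. speech_{S-1}]."""
    total = num_text + num_speech
    mask = torch.zeros(total, total, dtype=torch.bool)
    for t in range(num_text):                                   # text positions
        if streaming:
            mask[t, : t + 1] = True                             # itself and previous text positions
        else:
            mask[t, :num_text] = True                           # whole text
    for s in range(num_speech):                                 # speech positions
        q = num_text + s
        if streaming:
            visible_text = min(num_text, (s // text_chunk + 1) * text_chunk)  # chunk-limited text
        else:
            visible_text = num_text                             # whole text
        mask[q, :visible_text] = True
        mask[q, num_text : q + 1] = True                        # itself and previous speech positions
    return mask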
MTP Implementation
VocalNet utilizes N-1 sequential Transformer layers as MTP modules, enabling the prediction of N speech tokens in a single inference step while preserving their temporal relationships. In addition, a layer-wise decaying cross-entropy loss is introduced as the MTP loss.
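As a rough illustration, the sketch below implements a layer-wise decaying cross-entropy over the MTP heads; the head count, decay factor, and tensor shapes are assumptions, not the released training code.
import torch.nn.functional as F

def mtp_loss(head_logits, speech_targets, decay=0.8):
    """head_logits: list of N tensors of shape [B, T, V], where head k predicts the speech
    token k steps further ahead; speech_targets: [B, T + N - 1] ground-truth token ids."""
    total, weight_sum = 0.0, 0.0
    for k, logits in enumerate(head_logits):
        steps = logits.size(1)
        targets = speech_targets[:, k : k + steps]              # shift targets for head k
        ce = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
        weight = decay ** k                                     # deeper MTP heads get smaller weight
        total = total + weight * ce
        weight_sum += weight
    return total / weight_sum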
Installation
git clone https://github.com/YHWmz/VocalNet.git
cd VocalNet
conda create -n vocalnet python==3.10
conda activate vocalnet
pip install --upgrade pip
conda install pytorch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 pytorch-cuda=12.1 -c pytorch -c nvidia
pip install -e .
# Optional: Install additional packages for training
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
- We provide two re-synthesized datasets for VocalNet training (a download sketch follows this list):
- VoiceAssistant-430K from Mini-Omni: ModelScope HuggingFace
- UltraChat from SLAM-Omni: ModelScope HuggingFace
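If you pull the datasets from HuggingFace, a snapshot_download call along the lines of the hedged sketch below is sufficient; the repo_id is a placeholder, so substitute the actual dataset id from the links above.
from huggingface_hub import snapshot_download

# Placeholder repo_id: replace it with the dataset id behind the HuggingFace link above.
snapshot_download(
    repo_id="<org>/VoiceAssistant-430K-vocalnet",
    repo_type="dataset",
    local_dir="./playground/VoiceAssistant-430K-vocalnet",
)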
Here's an example of a training data instance in JSON format:
[
  {
    "id": "19090",
    "speech": "VoiceAssistant-430K-vocalnet/audios/19090.wav",
    "conversations": [
      {
        "from": "human",
        "value": "<speech>"
      },
      {
        "from": "gpt",
        "value": "The legality of shooting someone who breaks into your house depends on the laws of your specific location. In some places, \"stand your ground\" laws or \"castle doctrine\" laws allow homeowners to use lethal force if they feel threatened. However, other areas may require you to retreat if possible before using force. It's important to familiarize yourself with the laws in your region and consider non-lethal methods of protection as well. Consulting with a legal expert can provide you with guidance tailored to your situation."
      }
    ],
    "units": "VoiceAssistant-430K-vocalnet/cosyvoice2_tokens/19090.npy"
  }
]
After downloading the datasets, organize them under ./playground/ as follows:
├── VoiceAssistant-430K-vocalnet
│ ├── audios
│ └── cosyvoice2_tokens
└── UltraChat-vocalnet
├── audios
└── cosyvoice2_tokens
The speech waveforms are synthesized with CosyVoice2, and we use CosyVoice2's speech tokenizer to extract the speech tokens.
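Once the data is organized as above, an instance from the JSON index can be read back with standard json/librosa/NumPy calls, as in the hedged snippet below (the index path and the 16 kHz sampling rate are assumptions based on the training configs later in this README).
import json
import numpy as np
import librosa

with open("./playground/VoiceAssistant-430K-VocalNet/VoiceAssistant-430K.json") as f:
    index = json.load(f)

item = index[0]
wave, sr = librosa.load("./playground/" + item["speech"], sr=16000)  # speech query waveform
units = np.load("./playground/" + item["units"])                     # CosyVoice2 speech tokens
text_response = item["conversations"][1]["value"]                    # target text response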
- Download the base LLM and speech encoder
- Base LLM: LLaMA-3.2-1B-Instruct or LLaMA-3.1-8B-Instruct
- Speech encoder: Whisper-large-v3
- Path Modification in scripts/mtp/llama_stage1_s2t.sh
# Model checkpoint configuration
CHECKPOINT_NAME="llama32-1B-instruct-s2t" # Checkpoint name (will be used as the directory name)
CHECKPOINT_DIR="./checkpoints/${CHECKPOINT_NAME}" # Path to save checkpoints
# Base model configuration
BASE_MODEL="./hf_hub/Llama-3.2-1B-Instruct" # Path to the pretrained base LLM
# Data path configuration
DATA_PATH="./playground/VoiceAssistant-430K-VocalNet/VoiceAssistant-430K.json" # Training data index
SPEECH_FOLDER="./playground/" # Root directory for speech files (contains speech queries and preprocessed speech tokens)
# Speech encoder configuration
SPEECH_ENCODER="./models/speech_encoder/whisper-large-v3" # Path to the Whisper speech encoder model
- Start Training
bash scripts/mtp/llama_stage1_s2t.sh
- Path Modification in scripts/mtp/llama_stage2_s2s.sh
# Model Configuration
CHECKPOINT_NAME="llama32-1B-instruct-s2s-mtp5" # Checkpoint name
CHECKPOINT_DIR="./checkpoints/${CHECKPOINT_NAME}" # Directory to store training checkpoints
# Model Parameters
BASE_MODEL="./checkpoints/llama32-1B-instruct-s2t" # Stage I S2T model (used as initialization)
# Dataset Paths
DATA_PATH="./playground/VoiceAssistant-430K-VocalNet/VoiceAssistant-430K.json"
SPEECH_FOLDER="./playground/"
# Speech encoder configuration
SPEECH_ENCODER="./models/speech_encoder/whisper-large-v3" # Whisper-Large-V3 encoder weights
- Start Training
bash scripts/mtp/llama_stage2_s2s.sh
Inference
- Model Preparation:
- Download Our Open-source Models: VocalNet-1B from HuggingFace or ModelScope, and VocalNet-8B from HuggingFace or ModelScope.
- Download the Whisper model from HuggingFace and place it in the ./models/speech_encoder/ directory.
- CosyVoice Preparation: We utilize CosyVoice2's flow-matching model to convert VocalNet-generated speech tokens into the final audio waveform.
- To run inference with VocalNet-1B or VocalNet-8B, you need to download CosyVoice2-0.5B from HuggingFace.
- Path Modification
- Modify the paths in omni_speech/infer/vocalnet.py:
COSYVOICE_MODEL="" ## CosyVoice2-0.5B, e.g. /workspace/CosyVoice/pretrained_models/CosyVoice2-0.5B-VocalNet
VOCALNET_MODEL = "" ## VocalNet speech LLM, e.g. ./checkpoints/VocalNet-1B
- Local Infer
## stage 1 infer (s2t)
python3 omni_speech/infer/vocalnet.py --query_audio ./omni_speech/infer/llama_questions_42.wav
## stage 2 infer (s2s)
python3 omni_speech/infer/vocalnet.py --query_audio ./omni_speech/infer/llama_questions_42.wav --s2s --save_dir ./
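To run the s2s command above over a whole folder of speech queries, a simple wrapper like the hedged sketch below is enough; the query folder and save directory are example paths.
import pathlib
import subprocess

query_dir = pathlib.Path("./my_queries")      # example folder containing .wav speech queries
for wav in sorted(query_dir.glob("*.wav")):
    subprocess.run(
        ["python3", "omni_speech/infer/vocalnet.py",
         "--query_audio", str(wav), "--s2s", "--save_dir", "./outputs"],
        check=True,
    )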
Evaluation
VocalNet is evaluated on OpenAudioBench, which consists of AlpacaEval, LLaMA Questions, TriviaQA, and Web Questions. Bold indicates the best result within each model-size group, and underline indicates the second-best result among the base-size models.
Overall Performance
| Model | LLM size | Modality | AlpacaEval | LLaMA Questions | TriviaQA | Web Questions |
| --- | --- | --- | --- | --- | --- | --- |
| Tiny Models | | | | | | |
| Mini-Omni | 0.5B | s→t | 1.84 | 2.7 | 0.12 | 0.22 |
| | | s→s | 1.80 | 2.7 | 0.08 | 0.20 |
| SLAM-Omni | 0.5B | s→t | 3.50 | 29.4 | 0.39 | 0.84 |
| | | s→s | 3.01 | 26.7 | 0.34 | 0.69 |
| VocalNet-1B (VA) | 1B | s→t | 5.38 | 70.3 | 3.38 | 4.93 |
| | | s→s | 4.83 | 61.0 | 2.78 | 4.47 |
| VocalNet-1B | 1B | s→t | **5.79** | **71.7** | **3.60** | **5.16** |
| | | s→s | **5.03** | **63.7** | **3.06** | **4.68** |
| Base Models | | | | | | |
| LLaMA-Omni | 8B | s→t | 5.31 | 69.7 | 4.44 | 5.44 |
| | | s→s | 3.89 | 55.1 | 2.44 | 4.00 |
| Freeze-Omni | 7B | s→t | 4.51 | 77.7 | 5.32 | 6.41 |
| | | s→s | 2.99 | 60.2 | 3.53 | 4.78 |
| GLM-4-Voice | 9B | s→t | 5.86 | 77.4 | 4.95 | 5.56 |
| | | s→s | 5.27 | 64.3 | 4.63 | 5.40 |
| Baichuan-Omni-1.5 | 7B | s→t | 5.20 | 77.6 | 5.72 | 6.12 |
| | | s→s | 4.10 | 61.2 | 4.13 | 5.18 |
| MiniCPM-o | 8B | s→t | 6.13 | 77.2 | **6.43** | **7.16** |
| | | s→s | 4.95 | 65.8 | 4.99 | <u>6.22</u> |
| Minmo* | 8B | s→t | - | 78.9 | 4.83 | 5.50 |
| | | s→s | **6.48** | 64.1 | 3.75 | 3.99 |
| Qwen2.5-Omni | 8B | s→t | 6.01 | <u>79.0</u> | 5.89 | <u>6.88</u> |
| | | s→s | 5.73 | **76.3** | <u>5.59</u> | **6.70** |
| VocalNet-8B (VA) | 8B | s→t | <u>7.05</u> | 77.1 | 6.15 | 6.34 |
| | | s→s | 6.30 | 71.4 | 5.24 | 5.81 |
| VocalNet-8B | 8B | s→t | **7.12** | **79.5** | <u>6.24</u> | 6.48 |
| | | s→s | <u>6.37</u> | <u>73.1</u> | **5.67** | 6.16 |
Response Alignment and Acoustic Quality
| Model | AlpacaEval WER | AlpacaEval UTMOS | LLaMA Questions WER | LLaMA Questions UTMOS | TriviaQA WER | TriviaQA UTMOS | Web Questions WER | Web Questions UTMOS | Avg WER | Avg UTMOS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Tiny Models | | | | | | | | | | |
| Mini-Omni | 20.78 | 4.429 | 5.20 | 4.428 | 7.43 | 4.428 | 8.51 | 4.433 | 8.66 | 4.430 |
| SLAM-Omni | 5.52 | 4.439 | 5.55 | 4.467 | 6.16 | 4.470 | 6.50 | 4.461 | 6.17 | 4.464 |
| VocalNet-1B (VA) | **3.43** | **4.495** | 3.65 | **4.498** | **5.97** | **4.499** | 6.40 | 4.489 | 5.66 | **4.495** |
| VocalNet-1B | **3.43** | 4.491 | **3.27** | 4.497 | 6.73 | 4.486 | **4.88** | **4.493** | **5.31** | 4.491 |
| Base Models | | | | | | | | | | |
| LLaMA-Omni | 6.00 | 3.942 | 10.00 | 4.003 | 20.93 | 3.965 | 14.60 | 3.935 | 15.90 | 3.956 |
| Freeze-Omni | 14.33 | 4.377 | 14.20 | 4.417 | 20.39 | 4.404 | 18.25 | 4.398 | 18.31 | 4.401 |
| GLM-4-Voice | 18.71 | 4.025 | 14.45 | 4.152 | 8.33 | 4.306 | 6.08 | 4.214 | 8.99 | 4.228 |
| Baichuan-Omni-1.5 | 20.84 | 4.082 | 22.82 | 4.332 | 22.36 | 4.401 | 23.29 | 4.350 | 22.67 | 4.347 |
| MiniCPM-o | 15.35 | 4.102 | 5.73 | 4.228 | 8.08 | 4.128 | 8.94 | 4.125 | 8.72 | 4.137 |
| Qwen2.5-Omni | **2.41** | 4.299 | **0.93** | 4.315 | **1.13** | 4.339 | 4.68 | 4.363 | **2.63** | 4.342 |
| VocalNet-8B (VA) | <u>2.65</u> | **4.490** | 3.00 | **4.503** | 5.02 | **4.499** | <u>4.21</u> | <u>4.485</u> | 4.26 | **4.493** |
| VocalNet-8B | 4.71 | <u>4.489</u> | <u>2.68</u> | <u>4.500</u> | <u>4.04</u> | <u>4.482</u> | **3.11** | **4.492** | <u>3.56</u> | <u>4.489</u> |
Acknowledgements
- LLaMA-Omni: VocalNet is developed on the codebase of LLaMA-Omni.
- LLaVA: We borrow a substantial amount of model-training code from LLaVA.
- CosyVoice: VocalNet borrows speech generation code from CosyVoice.
- Freeze-Omni: We borrow some speech decoder code from Freeze-Omni.
This repository is released under the Apache-2.0 license as found in the LICENSE file.
If you find our data/model/code/paper helpful, please consider citing our paper 📝 and giving us a star ⭐️!
@article{wang2025vocalnet,
title={VocalNet: Speech LLM with Multi-Token Prediction for Faster and High-Quality Generation},
author={Wang, Yuhao and Liu, Heyang and Cheng, Ziyang and Wu, Ronghua and Gu, Qunshan and Wang, Yanfeng and Wang, Yu},
journal={arXiv preprint arXiv:2504.04060},
year={2025}
}