AffectSpeech: A Large-Scale Emotional Speech Dataset with Fine-Grained Textual Descriptions for Speech Emotion Captioning and Synthesis

This repository provides the official distribution of the AffetSpeech dataset. AffetSpeech is a large-scale emotional speech dataset with fine-grained, multi-level textual descriptions, enabling advanced research in speech emotion captioning (SEC) and emotional speech synthesis (ESS), containing 253,799 high-quality emotional speech data and 1,522,794 natural language descriptions.

📊 Dataset Overview

Emotions: angry, disgust, fear, happy, neutral, sad, surprise, contempt, calm
Language: English
Format: 16kHz, 16-bit, PCM wav.
Annotations (ShareGPT format): Includes sentiment polarity, open-vocabulary emotion captions, prosody (pitch, tempo, energy), emotional intensity, prominent segments, and semantic analysis.

📥 Access Request (EULA)

AffetSpeech is released under a Restricted End User License Agreement (EULA) for non-commercial academic research purposes only.

Steps to Apply:

Download the EULA.pdf from this repository.
Read and Sign: Please read the terms carefully. The form must be signed by a Permanent Staff/Faculty Member (e.g., Professor or Senior Researcher) of your institution.
Submit: Send the scanned PDF copy to qitianhua@seu.edu.cn.
- Email Subject: [AffetSpeech Request] Name - Institution
- Email Body: Please use your institutional email address (.edu, .ac, etc.) and briefly describe your intended use of the dataset.
Verification: Once approved, we will send you a private link to download the full version.

📑 Citation

If you use AffectSpeech in your research, please cite:

BibTeX

@article{qi2026affectspeech,
  title        = {AffectSpeech: A Large-Scale Emotional Speech Dataset with Fine-Grained Textual Descriptions for Speech Emotion Captioning and Synthesis},
  author       = {Qi, Tianhua and Zheng, Wenming and Schuller, Bj{\"o}rn W. and Luo, Zhaojie and Li, Haizhou},
  journal={arXiv preprint arXiv:2604.04160},
  year         = {2026}
}

Plain Text

Qi, T., Zheng, W., Schuller, B. W., Luo, Z., & Li, H. (2026). AffectSpeech: A Large-Scale Emotional Speech Dataset with Fine-Grained Textual Descriptions for Speech Emotion Captioning and Synthesis. arXiv:2604.04160.

📁 Repository Structure

.
├── EULA.pdf               # License agreement to be signed
├── README.md              # Project description
├── metadata_sample/       # Small samples of annotations for preview
│   └── sample_sec.json
│   └── sample_ess.json
└── scripts/               # Helper scripts for data loading/evaluation
    └── data_loader.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

AffectSpeech: A Large-Scale Emotional Speech Dataset with Fine-Grained Textual Descriptions for Speech Emotion Captioning and Synthesis

📊 Dataset Overview

📥 Access Request (EULA)

Steps to Apply:

📑 Citation

BibTeX

Plain Text

📁 Repository Structure

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Name		Name	Last commit message	Last commit date
Latest commit History 18 Commits
metadata_sample		metadata_sample
EULA.pdf		EULA.pdf
README.md		README.md

Folders and files

Latest commit

History

Repository files navigation

AffectSpeech: A Large-Scale Emotional Speech Dataset with Fine-Grained Textual Descriptions for Speech Emotion Captioning and Synthesis

📊 Dataset Overview

📥 Access Request (EULA)

Steps to Apply:

📑 Citation

BibTeX

Plain Text

📁 Repository Structure

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Packages