ESPnet version 202511

@jctian98

Summary

Release date: 2025‑11‑17
Milestone: 202511 – A version that brings a large number of new features, stability improvements, and a refreshed CI & Docker workflow.

ESPnet 202511 introduces robust parallel processing primitives, a fully‑refactored inference & evaluation pipeline, extensive SpeechLM support, and a modernized Docker/CI stack. The release also resolves lingering bugs in codec EMA logic, MPS device handling, and category‑balanced batching while tightening dependency management and documentation quality.

Highlighted Pull Requests

#	Title	Category	Key Impact
6300	Bump js-yaml from 4.1.0 to 4.1.1 in `/doc/vuepress`	Dep-Update	Picks up the js-yaml fix for prototype pollution in YAML merge-key (`<<`) handling, which affects the documentation build
6284	codec fix: DDP logic and dead code revival logic	Bugfix	Restores EMA state for dead‑code recovery and synchronizes codec updates across all DDP workers
6279	[SpeechLM] model, preprocessor and collect_stats	New Feature	Core SpeechLM components: job templates, preprocessing, multimodal IO, and stats collection
6278	[SpeechLM] Deepspeed trainer	New Feature	Adds full DeepSpeed support (train.py + deepspeed_trainer.py) for large-scale SpeechLM training
6276	Docker Updates	Refactor	Upgrades to Ubuntu 24.04, CUDA 12.6, and PyTorch 2.8.0, transitions to Miniforge, and modernizes Dockerfile syntax
6275	CI Installation fix	Bugfix	Adds `--no-build-isolation` for editable installs, improving reproducibility across CI environments
6273	[ESPnet‑Codec] Bug fix on codec activation function	Bugfix	Enables BF16 inference by registering `torch.ones` for auto‑cast
6272	Add Pytorch version 2.9	Dep‑Update	Extends supported PyTorch releases (2.5.1, 2.7.1, 2.8.0, 2.9.0) in CI and docs
6263	[ESPnet‑3] Merge master into espnet3 branch	Merge	Syncs espnet3 with master, fixing CI and dependency mismatches
6259	pre-commit.ci autoupdate	Tooling	Updates black and isort to latest stable versions
6257	SpeechLM Data Infra: dataset management	New Feature	Implements data registry, dataset loaders, and configuration templates for SpeechLM
6255	Fix default batch sampler fallback for category iterator	Bugfix	Restores legacy `folded` → `catbel` mapping, improving backward compatibility
6253	Restrict Docker Github Actions to Original Repo	Security	Prevents accidental image publishing from forks or non‑master branches
6249	[espnet3‑7] Add Callbacks	New Feature	Adds `AverageCheckpointsCallback` and standard callback factory for Lightning trainers
6248	Get forced alignments from CTC model	Feature	Enables forced alignment extraction for any CTC‑based S2T model
6246	MPS Support for loading float64 models	Bugfix	Handles float‑64 to float‑32 conversion for MPS device, avoiding dtype errors
6240	Package Build Patch	Build	Moves `g2p_en` & `ctc-segmentation` installation to Makefile, fixing pip package build
6239	[espnet-3] Merge master into espnet3 and fixed CI	Merge	Syncs espnet3 with master, removing `underthesea` dependency
6238	Upgrade pyopenjtalk to 0.4.1	Dep-Update	Updates pyopenjtalk installer to version 0.4.1
6174	LID-7: VoxLingua107 recipe	Recipe	Adds a new spoken-language-identification recipe for VoxLingua107
6227	Terry/parallelize spk emb extraction	Feature	Parallel speaker‑embedding extraction for TTS recipes
6210	LID‑8: CI and unit tests	Test	Adds comprehensive unit tests for LID functionality
6178	[espnet3‑6] Add evaluation scripts	Feature	Modularizes inference & evaluation pipelines in espnet3
6179	[espnet3] ESPnet1 Support Sunset	Refactor	Removes legacy ESPnet1 support, consolidates to espnet2.legacy
6177	Merge master into espnet3	Merge	Syncs espnet3 with master, fixing CI issues
6175	[espnet3‑5] Add parallel module and collect_stats	Feature	Adds Dask‑based parallel processing and `collect_stats` for data stats collection

Note: The table above summarizes the most impactful PRs for this release. Several PRs are grouped by shared functionality (e.g., SpeechLM, Docker, and LID). Contributors for these changes include dependabot[bot], whr-a, chinjouli, jctian98, Fhrozen, Masao‑Someki, KanTakahiro, akreal, pre‑commit‑ci[bot], Qingzheng‑Wang, Shikhar‑S, SanderGi, sw005320, and ZhuoyanTao.

Key Takeaways

Parallelism & Scalability – The Dask-based `espnet3.parallel` module, `collect_stats`, and the new callbacks support parallel data processing, inference, and checkpoint averaging.
SpeechLM Maturity – Core modules, DeepSpeed integration, multimodal IO, and data infrastructure create a solid foundation for large‑scale speech‑language models.
Stability & Security – Updated dependencies (js-yaml, PyTorch), Docker images rebuilt on Ubuntu 24.04 with CUDA 12.6 and Miniforge; bugfixes for codec EMA, MPS device handling, and category sampling.
CI & Packaging – Modernized GitHub Actions, improved pip install flags (`--no-build-isolation` for editable installs), and new Docker images for Ubuntu 24.04.
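
The parallel statistics collection mentioned in the takeaways follows a map-and-merge pattern: compute partial statistics per shard, then combine them. The sketch below illustrates that pattern with the standard library's `concurrent.futures` instead of Dask, and it does not use the `espnet3.parallel` API. The names `shard_stats`, `merge_stats`, and `collect_stats_parallel` are hypothetical and chosen only for this illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Sequence


def shard_stats(values: Sequence[float]) -> Dict[str, float]:
    """Compute partial statistics (count, sum, sum of squares) for one shard."""
    return {
        "count": len(values),
        "sum": sum(values),
        "sum_sq": sum(v * v for v in values),
    }


def merge_stats(parts: List[Dict[str, float]]) -> Dict[str, float]:
    """Merge partial statistics and derive the global mean and variance."""
    count = sum(p["count"] for p in parts)
    if count == 0:
        raise ValueError("cannot compute statistics over zero values")
    total = sum(p["sum"] for p in parts)
    total_sq = sum(p["sum_sq"] for p in parts)
    mean = total / count
    var = total_sq / count - mean * mean  # population variance
    return {"count": count, "mean": mean, "var": var}


def collect_stats_parallel(
    shards: Sequence[Sequence[float]], max_workers: int = 4
) -> Dict[str, float]:
    """Map shard_stats over all shards in a worker pool, then merge the results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        parts = list(pool.map(shard_stats, shards))
    return merge_stats(parts)
```

Because each shard only returns sums and counts, the merge step is cheap and does not depend on the order in which workers finish. A Dask-based version would keep the same two functions and swap the executor.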

What's Changed (Full changelog)

New Features

[SpeechLM] model, preprocessor and collect_stats (See #6279, by @jctian98)
[SpeechLM] Deepspeed trainer (See #6278, by @jctian98)
SpeechLM Data Infra: multimodal IO (See #6258, by @jctian98)
[espnet3-7] Add Callbacks (See #6249, by @Masao-Someki)
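
For readers unfamiliar with checkpoint averaging, the idea behind a callback such as `AverageCheckpointsCallback`, the following is a minimal sketch of the technique. It is not the callback's implementation: plain Python lists stand in for tensors, and `average_checkpoints` is a hypothetical helper name.

```python
from typing import Dict, List

# A checkpoint is modeled as parameter name -> flat list of values (assumption).
Checkpoint = Dict[str, List[float]]


def average_checkpoints(checkpoints: List[Checkpoint]) -> Checkpoint:
    """Average parameters element-wise across checkpoints of identical layout."""
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    reference = checkpoints[0]
    for ckpt in checkpoints[1:]:
        if ckpt.keys() != reference.keys():
            raise ValueError("checkpoints must share the same parameter names")
        for name, values in ckpt.items():
            if len(values) != len(reference[name]):
                raise ValueError(f"size mismatch for parameter {name!r}")
    n = len(checkpoints)
    averaged: Checkpoint = {}
    for name in reference:
        # zip(*...) walks the same position of this parameter in every checkpoint.
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        averaged[name] = [sum(col) / n for col in columns]
    return averaged
```

In practice the same loop would run over tensors from the last few saved checkpoints, and the averaged weights would be written out as a single model file.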

Recipe

POWSM-2: update code for data preparation (See #6283, by @chinjouli)
POWSM-1: renaming directory (See #6282, by @chinjouli)
SpeechLM Data Infra: Data batchfy, sampling and iterator (See #6260, by @jctian98)
SpeechLM Data Infra: dataset management (See #6257, by @jctian98)
Update wham_noise link for LibriMix Recipe (See #6251, by @Fhrozen)
LID-7: VoxLingua107 recipe (See #6174, by @Qingzheng-Wang)

Bugfix

[espnet3-8] Bugfix for recipe (See #6270, by @Masao-Someki)
Fix HF tests by switching them to upstream testing models (See #6261, by @akreal)
Fix default batch sampler fallback for category iterator (See #6255, by @Qingzheng-Wang)
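
The batch sampler fallback in #6255 is described in the highlights as restoring the legacy `folded` → `catbel` mapping. A minimal sketch of such a fallback is shown below. The function name, its arguments, and the assumption that the mapping applies only when a category iterator is requested are illustrative and not taken from the ESPnet code.

```python
# Legacy mapping named in the release notes: `folded` falls back to `catbel`.
LEGACY_CATEGORY_FALLBACK = {"folded": "catbel"}


def resolve_batch_type(batch_type: str, use_category_iterator: bool) -> str:
    """Return the batch type to use, applying the legacy fallback if needed.

    Hypothetical helper: when a category iterator is requested (assumption),
    a legacy batch type is replaced by its category-balanced counterpart.
    All other batch types pass through unchanged.
    """
    if use_category_iterator:
        return LEGACY_CATEGORY_FALLBACK.get(batch_type, batch_type)
    return batch_type
```

Keeping the mapping in one table makes the backward-compatible behavior explicit, so older configurations that still specify `folded` continue to work.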

Documentation

Bump js-yaml from 4.1.0 to 4.1.1 in /doc/vuepress (See #6300, by @dependabot[bot])
[espnet3-5] (2) Add parallel module and collect_stats (See #6242, by @Masao-Someki)
[Doc 1] Add AI-gen documentation to espnetez (See #6241, by @Fhrozen)
[espnet-3] Merge master into espnet3 and fixed CI (See #6239, by @Masao-Someki)

Refactoring

[espnet3] ESPnet1 Support Sunset and Migration to espnet2.legacy (See #6179, by @Masao-Someki)

Others

codec fix: DDP logic and dead code revival logic (See #6284, by @whr-a)
[SpeechLM] Minor fix on data loading (See #6280, by @jctian98)
Docker Updates (See #6276, by @Fhrozen)
CI Installation fix (See #6275, by @Fhrozen)
[ESPnet-Codec] Bug fix on codec activation function (See #6273, by @jctian98)
Add Pytorch version 2.9 (See #6272, by @Fhrozen)
Codec codebase bug fixes: detach() in RVQ residual and target_bandwidth in inference (See #6268, by @whr-a)
Add support for MPS devices in CTC prefix scoring (See #6266, by @KanTakahiro)
[ESPnet-3] Merge master into espnet3 branch (See #6263, by @Masao-Someki)
[pre-commit.ci] pre-commit autoupdate (See #6259, by @pre-commit-ci[bot])
Restrict Docker Github Actions to Original Repo (See #6253, by @Fhrozen)
Get forced alignments from CTC model (See #6248, by @Shikhar-S)
MPS Support for loading float64 models like OWSM as float32 (See #6246, by @SanderGi)
Package Build Patch (See #6240, by @Fhrozen)
Upgrade pyopenjtalk to version 0.4.1 (See #6238, by @sw005320)
Terry/parallelize spk emb extraction (See #6227, by @ZhuoyanTao)
LID-8: CI and unit tests (See #6210, by @Qingzheng-Wang)
[espnet3-6] Add evaluation scripts (See #6178, by @Masao-Someki)
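
As background for "Get forced alignments from CTC model" (#6248), the sketch below shows a textbook Viterbi forced alignment over CTC frame log-probabilities. It is a generic illustration in pure Python, not the code added in that PR. The function name `ctc_forced_align` and the convention that label `0` is the blank are assumptions.

```python
from typing import List

NEG_INF = float("-inf")


def ctc_forced_align(
    log_probs: List[List[float]], tokens: List[int], blank: int = 0
) -> List[int]:
    """Return the best frame-level label sequence that collapses to `tokens`."""
    if not log_probs:
        raise ValueError("log_probs must contain at least one frame")
    # Interleave blanks: [blank, t1, blank, t2, ..., blank].
    ext = [blank]
    for tok in tokens:
        ext += [tok, blank]
    T, S = len(log_probs), len(ext)

    score = [[NEG_INF] * S for _ in range(T)]
    back = [[0] * S for _ in range(T)]
    score[0][0] = log_probs[0][ext[0]]
    if S > 1:
        score[0][1] = log_probs[0][ext[1]]

    for t in range(1, T):
        for s in range(S):
            # Allowed predecessors: stay, advance by one state, or skip the
            # blank between two different tokens.
            candidates = [s]
            if s >= 1:
                candidates.append(s - 1)
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                candidates.append(s - 2)
            best = max(candidates, key=lambda p: score[t - 1][p])
            if score[t - 1][best] > NEG_INF:
                score[t][s] = score[t - 1][best] + log_probs[t][ext[s]]
                back[t][s] = best

    # A valid path ends in the last token or in the trailing blank.
    end_states = [S - 1] if S == 1 else [S - 1, S - 2]
    s = max(end_states, key=lambda p: score[T - 1][p])
    if score[T - 1][s] == NEG_INF:
        raise ValueError("no valid alignment: too few frames for the tokens")

    # Follow the back-pointers from the last frame to the first.
    path = [blank] * T
    for t in range(T - 1, -1, -1):
        path[t] = ext[s]
        s = back[t][s]
    return path
```

The returned list has one label per frame. Token start and end times follow from the runs of identical non-blank labels multiplied by the model's frame shift.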

Acknowledgements

@Fhrozen, @KanTakahiro, @Masao-Someki, @Qingzheng-Wang, @SanderGi, @Shikhar-S, @ZhuoyanTao, @akreal, @chinjouli, @dependabot[bot], @jctian98, @sw005320, @whr-a.
