Summary
Release date: 2025‑11‑17
Milestone: 202511 – A version that brings a large number of new features, stability improvements, and a refreshed CI & Docker workflow.
ESPnet 202511 introduces robust parallel processing primitives, a fully‑refactored inference & evaluation pipeline, extensive SpeechLM support, and a modernized Docker/CI stack. The release also resolves lingering bugs in codec EMA logic, MPS device handling, and category‑balanced batching while tightening dependency management and documentation quality.
Highlighted Pull Requests
| # | Title | Category | Key Impact |
|---|---|---|---|
| 6300 | Bump js‑yaml from 4.1.0 to 4.1.1 in /doc/vuepress | Dep‑Update | Secures the documentation build against a prototype‑pollution CVE in YAML merge |
| 6284 | codec fix: DDP logic and dead code revival logic | Bugfix | Restores EMA state for dead‑code recovery and synchronizes codec updates across all DDP workers |
| 6286 | [SpeechLM] Deepspeed trainer | New Feature | Adds full DeepSpeed support (train.py + deepspeed_trainer.py) for large‑scale SpeechLM training |
| 6279 | [SpeechLM] model, preprocessor and collect_stats | New Feature | Core SpeechLM components – job templates, preprocessing, multimodal IO, and stats collection |
| 6278 | [SpeechLM] Deepspeed trainer | New Feature | See above – DeepSpeed integration for SpeechLM workflows |
| 6276 | Docker Updates | Refactor | Upgrades to Ubuntu 24.04, CUDA 12.6, and PyTorch 2.8.0; transitions to Miniforge; modernizes Dockerfile syntax |
| 6275 | CI Installation fix | Bugfix | Adds --no-build-isolation for editable installs, improving reproducibility across CI environments |
| 6273 | [ESPnet‑Codec] Bug fix on codec activation function | Bugfix | Enables BF16 inference by registering torch.ones for auto‑cast |
| 6272 | Add Pytorch version 2.9 | Dep‑Update | Extends supported PyTorch releases (2.5.1, 2.7.1, 2.8.0, 2.9.0) in CI and docs |
| 6263 | [ESPnet‑3] Merge master into espnet3 branch | Merge | Syncs espnet3 with master, fixing CI and dependency mismatches |
| 6257 | SpeechLM Data Infra: dataset management | New Feature | Implements data registry, dataset loaders, and configuration templates for SpeechLM |
| 6259 | pre‑commit.ci autoupdate | Tooling | Updates black and isort to latest stable versions |
| 6255 | Fix default batch sampler fallback for category iterator | Bugfix | Restores legacy folded → catbel mapping, improving backward compatibility |
| 6253 | Restrict Docker Github Actions to Original Repo | Security | Prevents accidental image publishing from forks or non‑master branches |
| 6249 | [espnet3‑7] Add Callbacks | New Feature | Adds AverageCheckpointsCallback and standard callback factory for Lightning trainers |
| 6248 | Get forced alignments from CTC model | Feature | Enables forced alignment extraction for any CTC‑based S2T model |
| 6246 | MPS Support for loading float64 models | Bugfix | Handles float‑64 to float‑32 conversion for MPS device, avoiding dtype errors |
| 6244 | LID‑7: VoxLingua107 recipe | Recipe | Adds a new spoken‑language‑identification recipe for VoxLingua107 |
| 6240 | Package Build Patch | Build | Moves g2p_en & ctc‑segmentation installation to Makefile, fixing pip package build |
| 6239 | [espnet‑3] Merge master into espnet3 and fixed CI | Merge | Syncs espnet3 with master, removing underthesea dependency |
| 6238 | Upgrade pyopenjtalk to 0.4.1 | Dep‑Update | Updates the pyopenjtalk installer to version 0.4.1 |
| 6227 | Terry/parallelize spk emb extraction | Feature | Parallel speaker‑embedding extraction for TTS recipes |
| 6210 | LID‑8: CI and unit tests | Test | Adds comprehensive unit tests for LID functionality |
| 6178 | [espnet3‑6] Add evaluation scripts | Feature | Modularizes inference & evaluation pipelines in espnet3 |
| 6179 | [espnet3] ESPnet1 Support Sunset | Refactor | Removes legacy ESPnet1 support, consolidates to espnet2.legacy |
| 6177 | Merge master into espnet3 | Merge | Syncs espnet3 with master, fixing CI issues |
| 6175 | [espnet3‑5] Add parallel module and collect_stats | Feature | Adds Dask‑based parallel processing and collect_stats for data stats collection |
Note: The table above summarizes the most impactful PRs for this release. Several PRs are grouped by shared functionality (e.g., SpeechLM, Docker, and LID). Contributors for these changes include dependabot[bot], whr-a, chinjouli, jctian98, Fhrozen, Masao‑Someki, KanTakahiro, akreal, pre‑commit‑ci[bot], Qingzheng‑Wang, Shikhar‑S, SanderGi, sw005320, and ZhuoyanTao.
Key Takeaways
- Parallelism & Scalability – Dask‑based `espnet3.parallel`, `collect_stats`, and new callbacks support parallel data processing, inference, and checkpoint averaging.
- SpeechLM Maturity – Core modules, DeepSpeed integration, multimodal IO, and data infrastructure create a solid foundation for large‑scale speech‑language models.
- Stability & Security – Updated dependencies (js‑yaml, PyTorch) and Docker images built on CUDA 12.6 with Miniforge; bugfixes for codec EMA, MPS device handling, and category sampling.
- CI & Packaging – Modernized GitHub Actions, `--no-build-isolation` for editable pip installs, and new Docker images based on Ubuntu 24.04.
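The checkpoint averaging mentioned in the first takeaway can be illustrated with a minimal sketch. This is not the implementation of `AverageCheckpointsCallback`: `average_checkpoints` is a hypothetical helper, and plain Python lists stand in for parameter tensors so that the example runs without PyTorch.

```python
def average_checkpoints(state_dicts):
    """Return the element-wise mean of same-named parameters.

    Each "state dict" here is a plain dict mapping a parameter name to a
    flat list of floats (an assumption made to keep the sketch dependency-free).
    """
    if not state_dicts:
        raise ValueError("need at least one checkpoint")
    keys = state_dicts[0].keys()
    # All checkpoints must come from the same model architecture.
    if any(sd.keys() != keys for sd in state_dicts):
        raise ValueError("checkpoints have different parameter names")
    n = len(state_dicts)
    return {
        name: [sum(vals) / n for vals in zip(*(sd[name] for sd in state_dicts))]
        for name in keys
    }


# Two toy checkpoints of the same two-parameter "model".
ckpts = [
    {"w": [1.0, 2.0], "b": [0.0]},
    {"w": [3.0, 4.0], "b": [1.0]},
]
averaged = average_checkpoints(ckpts)
```

With real models the same idea would be applied tensor by tensor to the saved state dicts of the last or best few checkpoints.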
What's Changed (Full changelog)
New Features
- [SpeechLM] model, preprocessor and collect_stats (See #6279, by @jctian98)
- [SpeechLM] Deepspeed trainer (See #6278, by @jctian98)
- SpeechLM Data Infra: multimodal IO (See #6258, by @jctian98)
- [espnet3-7] Add Callbacks (See #6249, by @Masao-Someki)
Recipe
- POWSM-2: update code for data preparation (See #6283, by @chinjouli)
- POWSM-1: renaming directory (See #6282, by @chinjouli)
- SpeechLM Data Infra: Data batchfy, sampling and iterator (See #6260, by @jctian98)
- SpeechLM Data Infra: dataset management (See #6257, by @jctian98)
- Update wham_noise link for LibriMix Recipe (See #6251, by @Fhrozen)
- LID-7: VoxLingua107 recipe (See #6174, by @Qingzheng-Wang)
Bugfix
- [espnet3-8] Bugfix for recipe (See #6270, by @Masao-Someki)
- Fix HF tests by switching them to upstream testing models (See #6261, by @akreal)
- Fix default batch sampler fallback for category iterator (See #6255, by @Qingzheng-Wang)
Documentation
- Bump js-yaml from 4.1.0 to 4.1.1 in /doc/vuepress (See #6300, by @dependabot[bot])
- [espnet3-5] (2) Add parallel module and collect_stats (See #6242, by @Masao-Someki)
- [Doc 1] Add AI-gen documentation to espnetez (See #6241, by @Fhrozen)
- [espnet-3] Merge master into espnet3 and fixed CI (See #6239, by @Masao-Someki)
Refactoring
- [espnet3] ESPnet1 Support Sunset and Migration to `espnet2.legacy` (See #6179, by @Masao-Someki)
Others
- codec fix: DDP logic and dead code revival logic (See #6284, by @whr-a)
- [SpeechLM] Minor fix on data loading (See #6280, by @jctian98)
- Docker Updates (See #6276, by @Fhrozen)
- CI Installation fix (See #6275, by @Fhrozen)
- [ESPnet-Codec] Bug fix on codec activation function (See #6273, by @jctian98)
- Add Pytorch version 2.9 (See #6272, by @Fhrozen)
- Codec codebase bug fixes: `detach()` in RVQ residual and `target_bandwidth` in inference (See #6268, by @whr-a)
- Add support for MPS devices in CTC prefix scoring (See #6266, by @KanTakahiro)
- [ESPnet-3] Merge master into espnet3 branch (See #6263, by @Masao-Someki)
- [pre-commit.ci] pre-commit autoupdate (See #6259, by @pre-commit-ci[bot])
- Restrict Docker Github Actions to Original Repo (See #6253, by @Fhrozen)
- Get forced alignments from CTC model (See #6248, by @Shikhar-S)
- MPS Support for loading float64 models like OWSM as float32 (See #6246, by @SanderGi)
- Package Build Patch (See #6240, by @Fhrozen)
- Upgrade pyopenjtalk to version 0.4.1 (See #6238, by @sw005320)
- Terry/parallelize spk emb extraction (See #6227, by @ZhuoyanTao)
- LID-8: CI and unit tests (See #6210, by @Qingzheng-Wang)
- [espnet3-6] Add evaluation scripts (See #6178, by @Masao-Someki)
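For readers unfamiliar with CTC forced alignment (see #6248 above), the following is a minimal, generic sketch of the underlying Viterbi search. It is not the code added in that PR: `ctc_forced_align` is a hypothetical function, the inputs are plain nested lists of per-frame log-probabilities, at least one frame is assumed, and the blank index is assumed to be 0.

```python
import math


def ctc_forced_align(log_probs, targets, blank=0):
    """Return the most likely frame-level label path that spells `targets`."""
    if not targets:
        return [blank] * len(log_probs)
    # Interleave blanks: [blank, y1, blank, y2, ..., blank]
    ext = [blank]
    for tok in targets:
        ext += [tok, blank]
    T, S = len(log_probs), len(ext)
    NEG = -math.inf
    score = [[NEG] * S for _ in range(T)]
    back = [[0] * S for _ in range(T)]
    # A valid path starts in the leading blank or the first token.
    score[0][0] = log_probs[0][ext[0]]
    score[0][1] = log_probs[0][ext[1]]
    for t in range(1, T):
        for s in range(S):
            # Allowed predecessors: stay, advance by one, or skip a blank
            # between two different tokens.
            cands = [s]
            if s >= 1:
                cands.append(s - 1)
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append(s - 2)
            best = max(cands, key=lambda p: score[t - 1][p])
            if score[t - 1][best] == NEG:
                continue  # state not reachable at this frame
            score[t][s] = score[t - 1][best] + log_probs[t][ext[s]]
            back[t][s] = best
    # A valid path ends in the last token or the trailing blank.
    s = S - 1 if score[T - 1][S - 1] >= score[T - 1][S - 2] else S - 2
    if score[T - 1][s] == NEG:
        raise ValueError("too few frames to align the target sequence")
    path = [blank] * T
    for t in range(T - 1, -1, -1):
        path[t] = ext[s]
        s = back[t][s]
    return path


# Toy example: 4 frames, vocabulary {0: blank, 1, 2}, target sequence [1, 2].
probs = [
    [0.1, 0.8, 0.1],
    [0.8, 0.1, 0.1],
    [0.1, 0.1, 0.8],
    [0.1, 0.1, 0.8],
]
log_probs = [[math.log(p) for p in frame] for frame in probs]
alignment = ctc_forced_align(log_probs, [1, 2])
```

Each entry of `alignment` is the label assigned to one frame; token boundaries are the frames where the non-blank label changes.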
Acknowledgements
@Fhrozen, @KanTakahiro, @Masao-Someki, @Qingzheng-Wang, @SanderGi, @Shikhar-S, @ZhuoyanTao, @akreal, @chinjouli, @dependabot[bot], @jctian98, @sw005320, @whr-a.