ICASSP 2026 Submission
Generative target-speaker extraction (TSE) methods often produce more natural outputs than predictive models. Recent diffusion- or flow-matching-based approaches typically rely on a fixed number of reverse steps with uniform step size.
We introduce Adaptive Discriminative Flow Matching TSE (AD-FlowTSE), a generative framework that extracts target speech with an adaptive step size.
Unlike prior FM-based speech enhancement and TSE methods that transport between the mixture (or a normal prior) and the clean-speech distribution, AD-FlowTSE defines the flow between the background and the source, governed by the mixing ratio (MR) of the source and background forming the mixture.
This design enables MR-aware initialization, where the model starts at an adaptive point along the background–source trajectory rather than using a fixed reverse schedule across all noise levels.
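The idea of MR-aware initialization can be illustrated with a toy sketch. It is not the repository's implementation: it assumes a linear background-to-source path `x_t = (1 - t) * background + t * source`, assumes the mixture sits on that path at `t = MR`, and replaces the learned velocity network with a hypothetical oracle (`oracle_velocity`). All function names here are illustrative.

```python
import random


def mix(source, background, mr):
    """Assumed linear path evaluated at t = mr: (1 - mr) * background + mr * source."""
    return [(1.0 - mr) * b + mr * s for s, b in zip(source, background)]


def oracle_velocity(source, background):
    """Stand-in for a learned velocity field; on a linear path it is source - background."""
    return [s - b for s, b in zip(source, background)]


def extract(mixture, velocity, t_start, num_steps=1):
    """Euler integration from t_start to 1; the step size adapts to t_start."""
    step = (1.0 - t_start) / num_steps
    x = list(mixture)
    for _ in range(num_steps):
        x = [xi + step * vi for xi, vi in zip(x, velocity)]
    return x


rng = random.Random(0)
source = [rng.gauss(0.0, 1.0) for _ in range(8)]
background = [rng.gauss(0.0, 1.0) for _ in range(8)]

mr = 0.7  # mixing ratio, used here as the starting time on the path
mixture = mix(source, background, mr)

# Start at t = mr instead of t = 0, so a single step of size (1 - mr) suffices.
estimate = extract(mixture, oracle_velocity(source, background), t_start=mr)
max_err = max(abs(e - s) for e, s in zip(estimate, source))
```

With an exact velocity, one step of size `1 - mr` lands on the source up to floating-point error. In practice the velocity is predicted by a network and the MR may itself be estimated, so the result is only approximate.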
💡 Experiments show that AD-FlowTSE delivers efficient and accurate TSE, achieving strong performance even with a single reverse step. Auxiliary MR estimation, path alignment with mixture composition, and noise-adaptive step sizes improve it further.
Follow the official data-preparation pipeline from SpeakerBeam. After preparation, ensure your dataset follows the same directory structure (mixture, clean, and reference files).
Pre-trained AD-FlowTSE models and mixing-ratio predictors are available here.
python train_t_predicter.py \
  --config config/<config_FlowTSE_alpha.yaml | config_FlowTSE_alpha_noisy.yaml>

python train.py \
  --config config/<config_FlowTSE_large.yaml | config_FlowTSE_large_noisy.yaml>

Run evaluation with different MR-predictor variants:

python eval.py \
  --config config/<config_FlowTSE_large.yaml | config_FlowTSE_large_noisy.yaml> \
  --t_predicter <ECAPAMLP | GT | ZERO | ONE | RAND>

Our UDiT backbone is ported and modified from SoloAudio. We thank the authors for releasing their high-quality implementation.
If you find this work helpful, please cite:
@inproceedings{hsieh2026adflowtse,
title = {Adaptive Discriminative Flow Matching for Target Speaker Extraction},
author = {Tsun-An Hsieh and Minje Kim},
booktitle = {submitted to Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
year = {2026},
}