Official data preparation scripts for the URGENT 2025 Challenge.
The metadata files generated by this repo are compatible with the baseline code. See the instructions for more details on how to run the baseline code.
❗️❗️[2024-11-18] We have added some missing files which are necessary for data preparation in Track 2, commonvoice_19.0_es_train_track2.json.gz. If you cloned the repository before Nov. 18, please pull the latest commit.
❗️❗️[2024-11-16] We have modified some data preparation and evaluation scripts. If you cloned the repository before Nov. 16, please pull the latest commit.
- The default generated data/speech_train subset is only intended for dynamic mixing (on-the-fly simulation) in the ESPnet framework. Its spk1.scp (clean reference speech) and wav.scp (noisy speech) files have the same content to facilitate on-the-fly simulation of different distortions.
- The validation set made by this script is different from the official validation set used in the leaderboard, although the data source and the types of distortions do not change. The official one will be provided when the leaderboard opens (Nov. 25). Note that, to avoid cheating in the leaderboard, we provide only the noisy data of the official validation set, not the ground truth, until the leaderboard switches to the test phase (Dec. 23).
- The unofficial validation set made by this script can be used to select the best checkpoint. Participants can freely change the configuration to generate the unofficial validation set.
- More than 8 CPU cores
- At least 1.3 TB of free disk space for Track 1 and ??? TB for Track 2
  - Note that we only counted audio files and did not include the size of archived files (e.g., .zip or .tar.gz files)
  - Speech
    - DNS5 speech (original 131 GB + resampled 187 GB): 318 GB
    - LibriTTS (original 44 GB + resampled 7 GB): 51 GB
    - VCTK: 12 GB
    - WSJ (original sph 24 GB + converted 31 GB): 55 GB
    - EARS: 61 GB
    - CommonVoice 19.0 speech
      - Track 1 (original mp3 221 GB + resampled 200 GB): 421 GB
      - Track 2 (original mp3 221 GB + resampled fr102+ g23 GB): ??? GB
    - MLS (less compressed version downloaded from LibriVox)
      - Track 1 (original 60 GB + resampled 60 GB): 120 GB
      - Track 2 (original 6 TB + resampled ??? TB): ??? TB
  - Noise
    - DNS5 noise (original 58 GB + resampled 35 GB): 93 GB
    - WHAM! noise (48 kHz): 76 GB
    - FSD50K (original 24 GB + resampled 6 GB): 30 GB
    - FMA (original 24 GB + resampled 36 GB): 60 GB
  - RIR
    - DNS5 RIRs (48 kHz): 6 GB
  - Others
    - default simulated validation data: 2 GB
    - simulated wind noise for training (with default config): 1 GB
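
As a quick sanity check, the Track 1 figures listed above can be summed and compared against the free space on the target disk. The following is a minimal sketch, not part of this repository: the dictionary simply copies the Track 1 totals from the list, `has_enough_space` is a hypothetical helper, and sizes are assumed to be decimal gigabytes.

```python
import shutil

# Track 1 totals in GB, copied from the list above (original + resampled audio).
TRACK1_SIZES_GB = {
    "DNS5 speech": 318,
    "LibriTTS": 51,
    "VCTK": 12,
    "WSJ": 55,
    "EARS": 61,
    "CommonVoice 19.0 (Track 1)": 421,
    "MLS (Track 1)": 120,
    "DNS5 noise": 93,
    "WHAM! noise": 76,
    "FSD50K": 30,
    "FMA": 60,
    "DNS5 RIRs": 6,
    "Simulated validation data": 2,
    "Simulated wind noise": 1,
}


def total_required_gb(sizes=TRACK1_SIZES_GB):
    """Sum the per-dataset sizes (in GB)."""
    return sum(sizes.values())


def has_enough_space(path=".", required_gb=None):
    """Check whether the filesystem holding `path` has enough free space."""
    if required_gb is None:
        required_gb = total_required_gb()
    # Assumption: 1 GB = 1e9 bytes (decimal gigabytes).
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_gb


print(f"Track 1 needs about {total_required_gb()} GB")
print("Enough free space here:", has_enough_space("."))
```

The listed Track 1 sizes add up to 1306 GB, which matches the stated requirement of roughly 1.3 TB. Archived downloads are not counted, so some extra headroom is advisable.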
- After cloning this repository, run the following command to initialize the submodules:

      git submodule update --init --recursive

- Install the environment. Python 3.10 and Torch 2.0.1+ are recommended. With Conda, just run

      conda env create -f environment.yaml
      conda activate urgent2025

  If you encounter the following error

      ERROR: Failed building wheel for pypesq
      ERROR: Could not build wheels for pypesq, which is required to install pyproject.toml-based projects

  you can manually install pypesq in advance (make sure you have numpy installed before trying this to avoid compilation errors):

      python -m pip install https://github.com/vBaiCai/python-pesq/archive/master.zip

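
Once the environment is activated, the recommended versions can be checked from Python. This is only an illustrative sketch: `parse_version`, `meets_minimum`, and `installed_version` are hypothetical helpers, and it assumes PyTorch is installed under the distribution name `torch`.

```python
import re
import sys
from importlib import metadata


def parse_version(text):
    """Turn '2.0.1+cu117' into (2, 0, 1), ignoring any local/build suffix."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"Unrecognized version string: {text!r}")
    # A missing patch component (e.g. '2.0') is treated as 0.
    return tuple(int(part) if part else 0 for part in match.groups())


def meets_minimum(version_text, minimum):
    """Compare a version string against a (major, minor, patch) tuple."""
    return parse_version(version_text) >= minimum


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


print("Python:", ".".join(map(str, sys.version_info[:3])), "(3.10 is recommended)")
torch_version = installed_version("torch")
if torch_version is None:
    print("torch is not installed in this environment")
else:
    status = "OK" if meets_minimum(torch_version, (2, 0, 1)) else "older than 2.0.1"
    print("torch:", torch_version, status)
```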
- Get the download link of the CommonVoice dataset v19.0 from https://commonvoice.mozilla.org/en/datasets

  For German, English, Spanish, French, and Chinese (China), please do the following.

  a. Select Common Voice Corpus 19.0

  b. Enter your email and check the two mandatory boxes

  c. Right-click the Download Dataset Bundle button and select "Copy link"

  d. Paste the link to utils/prepare_CommonVoice19_speech.sh

- Make symbolic links to the wsj0 and wsj1 data

  a. Make a directory ./wsj

  b. Make symbolic links to wsj0 and wsj1 under ./wsj (./wsj/wsj0/ and ./wsj/wsj1/)

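
Before running the preparation script, it may help to confirm that the WSJ links resolve as expected. The check below is a small sketch under the layout described above (./wsj/wsj0/ and ./wsj/wsj1/); `missing_wsj_dirs` is a hypothetical helper and not part of this repository.

```python
from pathlib import Path


def missing_wsj_dirs(root="./wsj"):
    """Return the expected sub-directories that do not resolve to a directory."""
    root = Path(root)
    # Path.is_dir() follows symbolic links, so a link to a real directory counts.
    return [name for name in ("wsj0", "wsj1") if not (root / name).is_dir()]


missing = missing_wsj_dirs()
print("WSJ layout OK" if not missing else f"Missing under ./wsj: {missing}")
```

A dangling symbolic link is reported as missing, because it does not resolve to an existing directory.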
- FFmpeg-related

  To simulate wind noise and codec artifacts, our scripts utilize FFmpeg.

  a. Activate your Python environment

  b. Get the path to FFmpeg by running: which ffmpeg

  c. Change /path/to/ffmpeg in simulation/simulate_data_from_param.py to the path to your ffmpeg.

- Run the script

      ./prepare_espnet_data.sh

  NOTE: Please do not change output_dir in each shell script called in prepare_{dataset}.sh. If you want to download datasets to somewhere else, make a symbolic link to that directory.

      # example when you want to download FSD50K noise to /path/to/somewhere
      # prepare_fsd50k_noise.sh specifies ./fsd50k as output_dir, so make a symbolic link from /path/to/somewhere to ./fsd50k
      mkdir -p /path/to/somewhere
      ln -s /path/to/somewhere ./fsd50k

- Install eSpeak-NG (used for the phoneme similarity metric computation)
  - Follow the instructions in https://github.com/espeak-ng/espeak-ng/blob/master/docs/guide.md#linux
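
The two external tools mentioned in the steps above, FFmpeg and eSpeak-NG, can be located from within the activated Python environment. The snippet below is a minimal sketch that mirrors `which ffmpeg`; `find_tool` is a hypothetical helper, and it assumes the eSpeak-NG installation exposes an executable named `espeak-ng` on PATH.

```python
import os
import shutil


def find_tool(name):
    """Return the absolute path of an executable on PATH, or None if absent."""
    path = shutil.which(name)
    return os.path.abspath(path) if path else None


for tool in ("ffmpeg", "espeak-ng"):
    location = find_tool(tool)
    print(f"{tool}: {location or 'not found on PATH'}")
```

The path printed for ffmpeg is the value that would replace /path/to/ffmpeg in simulation/simulate_data_from_param.py.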