An Improved Event-Independent Network (EIN) for Polyphonic Sound Event Localization and Detection (SELD)
from the Centre for Vision, Speech and Signal Processing, University of Surrey.
- Introduction
- Requirements
- Download Dataset
- Preprocessing
- QuickEvaluate
- Usage
- Results
- FAQs
- Citing
- Reference
This is a PyTorch implementation of Event-Independent Networks for Polyphonic SELD.
Event-Independent Networks for Polyphonic SELD use a trackwise output format and multi-task learning (MTL) with a soft parameter-sharing scheme. For more information, please read the papers here.
The features of this method are:
- It uses a trackwise output format to detect different sound events of the same type but with different DoAs.
- It uses permutation-invariant training (PIT) to solve the track permutation problem introduced by the trackwise output format (see the sketch after this list).
- It uses multi-head self-attention (MHSA) to separate tracks.
- It uses multi-task learning (MTL) with a soft parameter-sharing scheme for joint SELD.
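A minimal sketch of two-track PIT is shown below (illustrative only, not the repository's exact loss): the joint SED/DoA loss is computed under both possible track orderings and the cheaper ordering is kept for each example. The shapes and loss choices (sigmoid SED probabilities with BCE, Cartesian DoA with MSE) are assumptions for the sketch.

```python
# Minimal two-track PIT sketch (illustrative, not the repository's exact loss).
# Assumed shapes: pred_sed / target_sed -> (batch, time, 2, classes), with
# pred_sed already passed through a sigmoid; pred_doa / target_doa ->
# (batch, time, 2, 3) Cartesian DoA vectors.
import torch
import torch.nn.functional as F

def pit_loss(pred_sed, pred_doa, target_sed, target_doa):
    def loss_for(perm):
        # Re-order the target tracks according to one permutation.
        sed = F.binary_cross_entropy(pred_sed, target_sed[:, :, perm], reduction='none')
        doa = F.mse_loss(pred_doa, target_doa[:, :, perm], reduction='none')
        return sed.mean(dim=(1, 2, 3)) + doa.mean(dim=(1, 2, 3))  # per-example loss

    loss_keep = loss_for([0, 1])   # keep the track order
    loss_swap = loss_for([1, 0])   # swap the two tracks
    # Keep whichever ordering is cheaper; for frame-level PIT the minimum
    # would be taken per frame instead of per example.
    return torch.minimum(loss_keep, loss_swap).mean()
```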
Currently, the code is available for the TAU-NIGENS Spatial Sound Events 2020 dataset. Data augmentation methods are not included.
We provide two ways to set up the environment. Both are based on Anaconda.
- Use the provided `prepare_env.sh`. Note that you need to set the `anaconda_dir` in `prepare_env.sh` to your anaconda directory, then directly run `bash scripts/prepare_env.sh`.
- Use the provided `environment.yml`. Note that you also need to set the `prefix` to your intended env directory, then directly run `conda env create -f environment.yml`.
After setting up your environment, don't forget to activate it:
`conda activate ein`
Downloading the dataset is easy. Directly run
`bash scripts/download_dataset.sh`
The data and meta files need to be preprocessed. The .wav files will be saved as .h5 files, and the meta files will also be converted to .h5 files. After downloading the data, directly run
`bash scripts/preproc.sh`
Preprocessing of the meta files (labels) separates the labels into different tracks, each containing at most one event and its corresponding DoA. The same event is consistently put in the same track. For frame-level permutation-invariant training this may not be necessary, but for chunk-level PIT, or when no PIT is used, consistently arranging the same event in the same track is reasonable.
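As a rough illustration of this track assignment (the actual preprocessing code may differ), each event class active in a chunk can be bound to a fixed track index so that its frames always land in the same track. The event tuples, array shapes, and two-track limit below are assumptions made for the sketch.

```python
# Toy sketch of trackwise label assignment, not the repository's preprocessing.
# Each labelled event is a (frame, class_idx, azimuth, elevation) tuple; within
# a chunk, a given class is always routed to the same track index.
import numpy as np

def assign_tracks(events, num_frames, num_tracks=2):
    sed = np.full((num_frames, num_tracks), -1, dtype=int)  # -1 means "no event"
    doa = np.zeros((num_frames, num_tracks, 2))              # (azimuth, elevation)
    class_to_track = {}
    for frame, class_idx, azi, ele in events:
        # The first time a class appears in the chunk, pin it to a track.
        if class_idx not in class_to_track:
            class_to_track[class_idx] = len(class_to_track) % num_tracks
        t = class_to_track[class_idx]
        sed[frame, t] = class_idx
        doa[frame, t] = (azi, ele)
    return sed, doa

# Class 3 always ends up in track 0, class 7 in track 1.
sed, doa = assign_tracks([(0, 3, -30, 10), (0, 7, 60, 0), (1, 3, -28, 10)], num_frames=2)
```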
We uploaded the pre-trained model here. Download it and unzip it in the code folder (EIN-SELD folder) using
`wget 'https://zenodo.org/record/4158864/files/out_train.zip' && unzip out_train.zip`
Then directly run
`bash scripts/predict.sh && sh scripts/evaluate.sh`
Hyper-parameters are stored in `./configs/ein_seld/seld.yaml`. You can change some of them, such as `train_chunklen_sec`, `train_hoplen_sec`, `test_chunklen_sec`, `test_hoplen_sec`, `batch_size`, `lr`, and others.
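If you want to inspect these values from a script rather than editing the file by hand, a minimal sketch using PyYAML is shown below; the exact nesting of keys inside `seld.yaml` is not assumed, so the helper simply walks the whole file.

```python
# Minimal sketch (not part of the repo): load seld.yaml with PyYAML and print
# the hyper-parameters mentioned above, wherever they sit in the file.
import yaml

with open('./configs/ein_seld/seld.yaml') as f:
    cfg = yaml.safe_load(f)

wanted = {'train_chunklen_sec', 'train_hoplen_sec',
          'test_chunklen_sec', 'test_hoplen_sec', 'batch_size', 'lr'}

def walk(node, prefix=''):
    if isinstance(node, dict):
        for key, value in node.items():
            if key in wanted:
                print(f'{prefix}{key}: {value}')
            walk(value, f'{prefix}{key}.')

walk(cfg)
```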
To train a model yourself, set up `./configs/ein_seld/seld.yaml` and directly run
`bash scripts/train.sh`
`train_fold` and `valid_fold` in `./configs/ein_seld/seld.yaml` specify which folds are used for training and validation. Note that `valid_fold` can be None, which means no validation is performed; this is usually used when training on folds 1-6.
`overlap` can be 1, 2, or the combined 1&2, which means training on non-overlapped sound events, on overlapped sound events, or on both.
`--seed` is set to a random integer by default. You can set it to a fixed number, but results will not be exactly the same if an RNN or Transformer is used.
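A minimal sketch of what fixing the seed typically involves in PyTorch is shown below (assumed, not copied from this repo). Even with all of this, some GPU kernels used by recurrent and attention layers remain non-deterministic, which is why runs can still differ.

```python
# Typical seeding boilerplate for PyTorch experiments (illustrative).
import random
import numpy as np
import torch

def fix_seed(seed: int):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True   # best-effort determinism only
    torch.backends.cudnn.benchmark = False
```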
You can consider adding the `--read_into_mem` argument in `train.sh` to pre-load all of the data into memory and speed up training, depending on your resources.
`--num_workers` also affects the training speed; adjust it according to your resources.
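The sketch below shows what these two options roughly correspond to: caching the .h5 features in RAM versus reading them on the fly, and parallelising loading with DataLoader workers. The dataset key name `'feature'` and the file path are assumptions, not taken from the repo.

```python
# Illustrative Dataset showing the effect of --read_into_mem and --num_workers.
import h5py
import torch
from torch.utils.data import Dataset, DataLoader

class FeatureDataset(Dataset):
    def __init__(self, h5_paths, read_into_mem=False):
        self.h5_paths = h5_paths
        self.read_into_mem = read_into_mem
        self.cache = {}
        if read_into_mem:
            # Pay the I/O cost once up front; every __getitem__ is then a RAM read.
            for p in h5_paths:
                with h5py.File(p, 'r') as f:
                    self.cache[p] = f['feature'][:]   # dataset name is an assumption

    def __len__(self):
        return len(self.h5_paths)

    def __getitem__(self, idx):
        p = self.h5_paths[idx]
        if self.read_into_mem:
            feat = self.cache[p]
        else:
            with h5py.File(p, 'r') as f:
                feat = f['feature'][:]
        return torch.as_tensor(feat)

# Example wiring (placeholder path); num_workers sets how many subprocesses
# load batches in parallel:
# loader = DataLoader(FeatureDataset(['features/foa_dev.h5'], read_into_mem=True),
#                     batch_size=32, num_workers=4, pin_memory=True)
```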
Prediction predicts results and saves them to the `./out_infer` folder. The saved results are the submission results for the DCASE challenge. Directly run
`bash scripts/predict.sh`
Prediction runs on the set specified by `testset_type`, which can be dev or eval. If it is dev, `test_fold` cannot be None.
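For orientation only, the snippet below writes rows in the spirit of the DCASE 2020 Task 3 submission format as commonly described (one CSV per audio file, rows of frame index, class index, azimuth, elevation); check the exact column convention and file naming against the challenge documentation and the files produced in `./out_infer`.

```python
# Rough illustration of writing a SELD submission CSV (values are made up).
import csv
import os

predictions = [
    # (frame_idx, class_idx, azimuth_deg, elevation_deg)
    (0, 3, -30, 10),
    (1, 3, -28, 10),
]

os.makedirs('out_infer_example', exist_ok=True)
with open('out_infer_example/mix001.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for row in predictions:
        writer.writerow(row)
```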
Evaluation evaluates the generated submission result. Directly run
`bash scripts/evaluate.sh`
It is notable that EINV2-DA is a single model with a plain VGGish architecture, using only the channel-rotation and SpecAugment data-augmentation methods.
- If you have any questions, please email caoyfive@gmail.com or report an issue here.
- Currently, `pin_memory` can only be set to True. For more information, please check the PyTorch Doc and the Nvidia Developer Blog.
- After downloading, you can delete the `downloaded_packages` folder to save some space.
If you use the code, please consider citing the papers below
@article{cao2020anevent,
title={An Improved Event-Independent Network for Polyphonic Sound Event Localization and Detection},
author={Cao, Yin and Iqbal, Turab and Kong, Qiuqiang and An, Fengyan and Wang, Wenwu and Plumbley, Mark D},
journal={arXiv preprint arXiv:2010.13092},
year={2020}
}
@article{cao2020event,
title={Event-Independent Network for Polyphonic Sound Event Localization and Detection},
author={Cao, Yin and Iqbal, Turab and Kong, Qiuqiang and Zhong, Yue and Wang, Wenwu and Plumbley, Mark D},
journal={arXiv preprint arXiv:2010.00140},
year={2020}
}
- Archontis Politis, Sharath Adavanne, and Tuomas Virtanen. A dataset of reverberant spatial sound scenes with moving sources for sound event localization and detection. In Proceedings of the Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE2020). November 2020. URL
- Annamaria Mesaros, Sharath Adavanne, Archontis Politis, Toni Heittola, and Tuomas Virtanen. Joint measurement of localization and detection of sound events. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). New Paltz, NY, Oct 2019. URL
- Sharath Adavanne, Archontis Politis, Joonas Nikunen, and Tuomas Virtanen. Sound event localization and detection of overlapping sources using convolutional recurrent neural networks. IEEE Journal of Selected Topics in Signal Processing, 13(1):34–48, March 2018. URL