wavCSE is a modular speech representation learning framework designed for systematic and reproducible research in self-supervised speech representation learning and downstream multi-task modeling.
The framework enforces a clear separation between upstream embedding extraction and downstream model training and evaluation.
The wavCSE framework follows a two-stage pipeline executed sequentially:
1. Pretrained self-supervised learning (SSL) models extract fixed-dimensional speech representations from raw audio and store them on disk.
2. Downstream multi-task neural networks are trained and evaluated on the extracted embeddings.
**Important**
- Stage 1 must be executed from the `upstream/` directory.
- Stage 2 must be executed from the `downstream/` directory.
- Upstream embedding extraction must be completed before downstream training.
```
wavCSE/
├── upstream/          # Upstream SSL-based embedding extraction
├── downstream/        # Downstream multi-task training and evaluation
├── environment.yml    # Conda environment definition
└── README.md          # Project documentation
```
Create the conda environment:
```bash
conda env create -f environment.yml
conda activate opencv
```
Core libraries (already included in `environment.yml`):
- python 3.9
- torch 2.7.1
- torchaudio 2.7.1
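To confirm the environment is set up correctly, a quick sanity check can be run from the activated environment:

```python
# Quick sanity check: report installed versions and GPU visibility.
import torch
import torchaudio

print("torch:", torch.__version__)
print("torchaudio:", torchaudio.__version__)
print("CUDA available:", torch.cuda.is_available())
```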
This stage loads a pretrained upstream model and converts raw audio signals into fixed-dimensional speech embeddings that are serialized to disk.
Navigate to the `upstream/` directory before running the extraction script.
```bash
cd upstream
python main.py \
    --dataset_name speechcommand \
    --config configs/extract_embedding.yml \
    --device_index 0
```

| Argument | Description |
|---|---|
| `--dataset_name` | Dataset identifier. Must be one of the supported dataset codes listed below. |
| `--config` | Path to the upstream YAML configuration file. |
| `--device_index` | GPU index to use (optional; overrides the configuration file). |
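As an illustration of what this stage produces, the sketch below extracts a single fixed-dimensional embedding with the WavLM Large bundle from `torchaudio` and serializes it to disk. It is a minimal sketch of the idea only; the file names, layer choice, and mean pooling are assumptions and do not reflect the actual `upstream/main.py` implementation or its configuration options.

```python
# Minimal sketch of upstream embedding extraction (illustrative; not wavCSE's actual pipeline).
import torch
import torchaudio

bundle = torchaudio.pipelines.WAVLM_LARGE            # pretrained WavLM Large bundle
model = bundle.get_model().eval()

waveform, sr = torchaudio.load("example.wav")        # hypothetical input file
waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)

with torch.inference_mode():
    features, _ = model.extract_features(waveform)   # per-layer frame-level features
    embedding = features[-1].mean(dim=1).squeeze(0)  # mean-pool the last layer over time

torch.save(embedding, "example_embedding.pt")        # fixed-dimensional vector on disk
```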
The following datasets are currently supported by wavCSE.
Only the dataset codes listed below are valid values for the `--dataset_name` argument.

| Dataset Code | Dataset Name |
|---|---|
| `speechcommand` | Google Speech Commands v1 |
| `voxceleb` | VoxCeleb1 |
| `iemocap` | IEMOCAP |
| `fluentspeechcommand` | Fluent Speech Commands |
**Note**
- For each dataset, the folder name must exactly match the dataset code shown above.
- The internal directory structure of the dataset must remain unchanged.
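For example, a dataset root organized as follows satisfies the naming requirement (the `data/` root is an assumed location; the contents of each folder must follow the corresponding dataset's official release structure):

```
data/
├── speechcommand/          # Google Speech Commands v1, original structure
├── voxceleb/               # VoxCeleb1, original structure
├── iemocap/                # IEMOCAP, original structure
└── fluentspeechcommand/    # Fluent Speech Commands, original structure
```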
Stage 1 produces the following outputs:
- Serialized embedding files (`.pt`)
- A CSV file containing:
  - Audio path
  - Label
  - Embedding file size
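Any of the serialized embedding files can be loaded directly with PyTorch to spot-check its contents (the file name below is hypothetical):

```python
# Inspect one serialized embedding file (file name is hypothetical).
import torch

embedding = torch.load("example_embedding.pt", map_location="cpu")
print(embedding.shape, embedding.dtype)
```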
This stage loads the extracted embeddings, constructs a downstream multi-task model, trains it, and evaluates multiple checkpoint variants.
Navigate to the `downstream/` directory before running the training and evaluation script.
```bash
cd downstream
python main.py \
    --task_type ks_si_er_ic \
    --config configs/build_model.yml \
    --device_index 0
```

| Argument | Description |
|---|---|
| `--task_type` | Task combination string specifying the downstream task configuration. |
| `--config` | Path to the downstream YAML configuration file. |
| `--device_index` | GPU index to use (optional; overrides the configuration file). |
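The sketch below shows one plausible way a command-line `--device_index` can override a value from the YAML configuration file; the configuration key names are assumptions and not wavCSE's actual schema.

```python
# Illustrative sketch of CLI arguments overriding YAML settings (key names assumed).
import argparse
import yaml

parser = argparse.ArgumentParser()
parser.add_argument("--task_type", required=True)
parser.add_argument("--config", required=True)
parser.add_argument("--device_index", type=int, default=None)
args = parser.parse_args()

with open(args.config) as f:
    config = yaml.safe_load(f)

# A --device_index given on the command line takes precedence over the config file.
if args.device_index is not None:
    config["device_index"] = args.device_index
```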
| Code | Task |
|---|---|
| `ks` | Keyword Spotting |
| `si` | Speaker Identification |
| `er` | Emotion Recognition |
| `ic` | Intent Classification |
The `task_type` argument defines the set of downstream tasks to be trained jointly.

**Single-task configurations**
- `ks`: Keyword Spotting
- `si`: Speaker Identification
- `er`: Emotion Recognition
- `ic`: Intent Classification

**Multi-task configurations**

Multiple tasks are specified by concatenating task codes with underscores.

- Two-task combinations: `ks_si`, `ks_er`, `si_er`, `er_ic`
- Three-task combinations: `ks_si_er`, `ks_er_ic`, `si_er_ic`
- Four-task combination: `ks_si_er_ic`
All specified tasks are trained jointly using a shared downstream model.
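As an illustration of joint training over a shared model, the sketch below splits the task string on underscores and attaches one classification head per task to a shared trunk. The module structure, hidden size, and class counts are hypothetical and are not taken from wavCSE.

```python
# Illustrative shared-trunk, multi-head model keyed by task codes (sizes assumed).
import torch.nn as nn

NUM_CLASSES = {"ks": 12, "si": 1251, "er": 4, "ic": 31}  # assumed class counts per task
EMBED_DIM, HIDDEN_DIM = 1024, 256                        # assumed dimensions

class MultiTaskHead(nn.Module):
    def __init__(self, task_type: str):
        super().__init__()
        self.tasks = task_type.split("_")                # e.g. "ks_si_er" -> ["ks", "si", "er"]
        self.trunk = nn.Sequential(nn.Linear(EMBED_DIM, HIDDEN_DIM), nn.ReLU())
        self.heads = nn.ModuleDict({t: nn.Linear(HIDDEN_DIM, NUM_CLASSES[t]) for t in self.tasks})

    def forward(self, embedding):
        shared = self.trunk(embedding)                   # shared representation for all tasks
        return {t: self.heads[t](shared) for t in self.tasks}
```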
The downstream pipeline supports evaluation of multiple checkpoint types.
| Checkpoint | Description |
|---|---|
| `best` | Checkpoint that achieves the best overall validation accuracy during training. |
| `opt` | Checkpoint that achieves the best average task accuracy during training (applicable only to multi-task learning). |
| `epochX` | Checkpoint corresponding to epoch index X. If X is not specified (i.e., `epoch`), the final epoch checkpoint is used by default. |
For the `epochX` option, the corresponding checkpoint must exist in the checkpoint directory created during training.
For example, specifying `epoch10` loads the checkpoint saved at the tenth training epoch.
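One plausible way such a selector could be resolved to a file on disk is sketched below, following the naming pattern of the released checkpoints; the directory layout and prefix are assumptions.

```python
# Illustrative resolution of a checkpoint selector ("best", "opt", "epochX") to a file path.
import os
import re

def resolve_checkpoint(ckpt_dir: str, prefix: str, selector: str) -> str:
    """Map e.g. ("checkpoints", "wavlm_large_ks_si_er", "epoch10") to
    "checkpoints/wavlm_large_ks_si_er_epoch10.pth" (naming pattern assumed)."""
    if selector in ("best", "opt") or re.fullmatch(r"epoch\d+", selector):
        return os.path.join(ckpt_dir, f"{prefix}_{selector}.pth")
    raise ValueError(f"Unknown checkpoint selector: {selector}")
```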
Evaluation produces the following outputs:
- Metric summaries
- Per-task accuracy
- Prediction CSV files
wavCSE uses pretrained self-supervised models for upstream embedding extraction.
The latest official WavLM Large checkpoint is automatically downloaded when running the upstream embedding extraction stage. No manual download is required.
Pretrained downstream multi-task models are available for selected task configurations.
For the task combination:
- `ks_si_er` (Keyword Spotting + Speaker Identification + Emotion Recognition)
Pretrained checkpoints can be downloaded from:
https://drive.google.com/drive/folders/1kRlOutpEg9Ii7BR4QbntdSr01ZmIjdgo?usp=drive_link
The folder contains the following checkpoint files:
- `wavlm_large_ks_si_er_best.pth`: best validation checkpoint
- `wavlm_large_ks_si_er_epoch100.pth`: checkpoint saved at epoch 100
- `wavlm_large_ks_si_er_opt.pth`: best average task performance checkpoint (multi-task optimization)
These checkpoints can then be evaluated using the corresponding checkpoint option (`best`, `epoch100`, or `opt`) in the downstream evaluation stage.
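Before running evaluation, a downloaded file can be loaded directly with PyTorch as a quick integrity check (whether it stores a plain `state_dict` or a larger training dictionary is not specified here, so the snippet only reports what it finds):

```python
# Quick integrity check of a downloaded checkpoint (contents of the file are not assumed).
import torch

ckpt = torch.load("wavlm_large_ks_si_er_best.pth", map_location="cpu", weights_only=False)
if isinstance(ckpt, dict):
    print("Top-level keys:", list(ckpt.keys())[:10])
else:
    print("Loaded object of type:", type(ckpt))
```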
All experimental settings are controlled via YAML configuration files, including:
- Upstream model selection
- Transformer layer selection
- Pooling strategies
- Training hyperparameters
- Evaluation behavior
This design enables fully reproducible experiments without modifying code.
- Stage 1 must be executed from the `upstream/` directory
- Stage 2 must be executed from the `downstream/` directory
- Upstream embedding extraction must be completed before downstream training
- GPU usage can be controlled through configuration files or command-line overrides
The wavCSE framework is designed to support future research extensions, including:
- Additional upstream models such as HuBERT and wav2vec 2.0
- New pooling and aggregation strategies
- Additional downstream tasks
- Analysis and visualization modules
For questions or feedback related to wavCSE, please contact:
- Braveenan Sritharan: braveenans.22@cse.mrt.ac.lk
- Uthayasanker Thayasivam: rtuthaya@cse.mrt.ac.lk
This project is intended for research and academic use.