sslsv is a PyTorch-based deep learning toolkit consisting of a collection of Self-Supervised Learning (SSL) frameworks for learning speaker representations, applicable to various speaker-related downstream tasks, notably Speaker Verification (SV).
Its main objectives are to: (1) provide implementations of state-of-the-art SSL frameworks by adapting algorithms from the computer vision domain; and (2) evaluate them within a consistent and comparable environment.
An overview of the general training and evaluation framework is provided in the figure below.
- June 2025: Release of results and checkpoints (v2.0).
- June 2025: Support for Python 3.13 and PyTorch 2.7.
- December 2024: Implementation of SimCLR MultiViews and MoCo Margins.
- November 2024: Implementation of Self-Supervised Positive Sampling (SSPS).
- July 2024: Implementation of more losses for SimCLR Margins (SphereFace, CurricularFace, MagFace, AdaFace).
- May 2024: Documentation of the complete codebase.
- April 2024: Complete refactoring, including typing, tests, and coding style (v2.0).
- January 2024: Implementation of the W-MSE framework.
- July 2023: Support for PyTorch Distributed Data Parallel (DDP).
- June 2023: Evaluation on language, emotion, age, and gender recognition tasks.
- April 2023: Additional benchmarks (SITW, VOiCES) and metrics (CLLR, ActDCF, AvgRPrec).
- March 2023: Support for cosine scoring normalizations and PLDA evaluations.
- January 2023: Implementation of SimCLR Margins (CosFace and ArcFace).
- December 2022: Implementation of SSL frameworks: LIM, CPC, SimCLR, MoCo, Barlow Twins, VICReg, VIbCReg, DeepCluster, SwAV, SimSiam, BYOL, and DINO.
- June 2022: First release of sslsv (v1.0).
General
- Data:
  - Supervised and self-supervised datasets (siamese and DINO sampling)
  - Audio augmentation (noise and reverberation)
- Training:
  - CPU, GPU and multi-GPU (DataParallel and DistributedDataParallel)
  - Checkpointing, resuming, early stopping and logging
  - TensorBoard and wandb
- Evaluation:
  - Speaker verification
    - Backend: cosine scoring and PLDA
    - Metrics: EER, MinDCF, ActDCF, CLLR, AvgRPrec
  - Classification (emotion, language, ...)
- Notebooks: DET curve, score distributions, t-SNE on embeddings, ...
- Misc: scalable config, typing, documentation and tests
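
To make the speaker verification evaluation listed above more concrete, here is a minimal, dependency-free sketch of cosine scoring followed by an approximate EER computation. The function names `cosine_score` and `compute_eer` are hypothetical and are not taken from sslsv; the toolkit's own implementation may differ (for example, by interpolating the ROC curve).

```python
import math


def cosine_score(a, b):
    """Cosine similarity between two embeddings given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def compute_eer(scores, labels):
    """Approximate Equal Error Rate.

    Sweeps every observed score as a decision threshold and returns the
    average of the false acceptance rate (FAR) and false rejection rate (FRR)
    at the threshold where the two are closest. Quadratic in the number of
    trials, so only suitable for illustration.
    """
    targets = [s for s, l in zip(scores, labels) if l == 1]
    nontargets = [s for s, l in zip(scores, labels) if l == 0]
    best_gap, eer = float("inf"), 1.0
    for threshold in sorted(set(scores)):
        frr = sum(s < threshold for s in targets) / len(targets)
        far = sum(s >= threshold for s in nontargets) / len(nontargets)
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer
```

In this sketch, a trial is accepted when its score is greater than or equal to the threshold, and labels follow the trials-file convention of 1 for same-speaker pairs and 0 for different-speaker pairs.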
Encoders
- TDNN (`sslsv.encoders.TDNN`)
  X-vectors: Robust dnn embeddings for speaker recognition [PDF]
  David Snyder, Daniel Garcia-Romero, Gregory Sell, Daniel Povey, Sanjeev Khudanpur
- Simple Audio CNN (`sslsv.encoders.SimpleAudioCNN`)
  Representation Learning with Contrastive Predictive Coding [PDF]
  Aaron van den Oord, Yazhe Li, Oriol Vinyals
- ResNet-34 (`sslsv.encoders.ResNet34`)
  VoxCeleb2: Deep Speaker Recognition [PDF]
  Joon Son Chung, Arsha Nagrani, Andrew Zisserman
- ECAPA-TDNN (`sslsv.encoders.ECAPATDNN`)
  ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification [PDF]
  Brecht Desplanques, Jenthe Thienpondt, Kris Demuynck
- S3PRL (`sslsv.encoders.S3PRL`)
  Pre-trained speech foundation models (e.g., WavLM, HuBERT, wav2vec 2.0) can be used as encoders using the s3prl toolkit.
Frameworks
- LIM (`sslsv.methods.LIM`)
  Learning Speaker Representations with Mutual Information [PDF]
  Mirco Ravanelli, Yoshua Bengio
- CPC (`sslsv.methods.CPC`)
  Representation Learning with Contrastive Predictive Coding [PDF]
  Aaron van den Oord, Yazhe Li, Oriol Vinyals
- SimCLR (`sslsv.methods.SimCLR`)
  A Simple Framework for Contrastive Learning of Visual Representations [PDF]
  Ting Chen, Simon Kornblith, Mohammad Norouzi, Geoffrey Hinton
- MoCo v2+ (`sslsv.methods.MoCo`)
  Improved Baselines with Momentum Contrastive Learning [PDF]
  Xinlei Chen, Haoqi Fan, Ross Girshick, Kaiming He
- DeepCluster v2 (`sslsv.methods.DeepCluster`)
  Deep Clustering for Unsupervised Learning of Visual Features [PDF]
  Mathilde Caron, Piotr Bojanowski, Armand Joulin, Matthijs Douze
- SwAV (`sslsv.methods.SwAV`)
  Unsupervised Learning of Visual Features by Contrasting Cluster Assignments [PDF]
  Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, Armand Joulin
- W-MSE (`sslsv.methods.WMSE`)
  Whitening for Self-Supervised Representation Learning [PDF]
  Aleksandr Ermolov, Aliaksandr Siarohin, Enver Sangineto, Nicu Sebe
- Barlow Twins (`sslsv.methods.BarlowTwins`)
  Barlow Twins: Self-Supervised Learning via Redundancy Reduction [PDF]
  Jure Zbontar, Li Jing, Ishan Misra, Yann LeCun, Stéphane Deny
- VICReg (`sslsv.methods.VICReg`)
  VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning [PDF]
  Adrien Bardes, Jean Ponce, Yann LeCun
- VIbCReg (`sslsv.methods.VIbCReg`)
  Computer Vision Self-supervised Learning Methods on Time Series [PDF]
  Daesoo Lee, Erlend Aune
- BYOL (`sslsv.methods.BYOL`)
  Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning [PDF]
  Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre H. Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Daniel Guo, Mohammad Gheshlaghi Azar, Bilal Piot, Koray Kavukcuoglu, Rémi Munos, Michal Valko
- SimSiam (`sslsv.methods.SimSiam`)
  Exploring Simple Siamese Representation Learning [PDF]
  Xinlei Chen, Kaiming He
- DINO (`sslsv.methods.DINO`)
  Emerging Properties in Self-Supervised Vision Transformers [PDF]
  Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, Armand Joulin
Methods (contributions)
- Combiner (`sslsv.methods.Combiner`)
  Label-Efficient Self-Supervised Speaker Verification With Information Maximization and Contrastive Learning [PDF] [Ref]
  Theo Lepage, Reda Dehak
- Margins (`sslsv.methods.SimCLRMargins`, `sslsv.methods.MoCoMargins`)
  Additive Margin in Contrastive Self-Supervised Frameworks to Learn Discriminative Speaker Representations [PDF] [Ref]
  Theo Lepage, Reda Dehak
- SSPS (`sslsv.methods._SSPS`)
  Self-Supervised Frameworks for Speaker Verification via Bootstrapped Positive Sampling [PDF] [Ref]
  Theo Lepage, Reda Dehak
sslsv runs on Python 3.13.3 with the following dependencies.
| Module | Versions |
|---|---|
| torch | 2.7.1 |
| torchaudio | 2.7.1 |
| numpy | * |
| pandas | * |
| soundfile | * |
| scikit-learn | * |
| speechbrain | * |
| tensorboard | * |
| wandb | * |
| ruamel.yaml | * |
| dacite | * |
| prettyprinter | * |
| tqdm | * |
Note: developers will also need pytest, pre-commit and twine to work on this project.
Speaker recognition:
Language recognition:
Emotion recognition:
Data-augmentation:
Data used for main experiments (conducted on VoxCeleb1 and VoxCeleb2 + data-augmentation) can be automatically downloaded, extracted and prepared using the following scripts.
python tools/prepare_data/prepare_voxceleb.py data/
python tools/prepare_data/prepare_augmentation.py data/

The resulting data folder should have the structure presented below.
data
├── musan_split/
├── simulated_rirs/
├── voxceleb1/
├── voxceleb2/
├── voxceleb1_test_O
├── voxceleb1_test_H
├── voxceleb1_test_E
├── voxsrc2021_val
├── voxceleb1_train.csv
└── voxceleb2_train.csv
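
As a quick sanity check after running the preparation scripts, the layout above can be verified with a few lines of Python. This is only a sketch: the helper name `missing_entries` and the `EXPECTED_ENTRIES` list are illustrative, derived from the tree shown above, and are not part of sslsv.

```python
from pathlib import Path

# Entries expected directly under the data folder, taken from the tree above.
EXPECTED_ENTRIES = [
    "musan_split",
    "simulated_rirs",
    "voxceleb1",
    "voxceleb2",
    "voxceleb1_test_O",
    "voxceleb1_test_H",
    "voxceleb1_test_E",
    "voxsrc2021_val",
    "voxceleb1_train.csv",
    "voxceleb2_train.csv",
]


def missing_entries(data_dir):
    """Return the expected files or folders that are absent from data_dir."""
    root = Path(data_dir)
    return [name for name in EXPECTED_ENTRIES if not (root / name).exists()]
```

Calling `missing_entries("data")` from the project root returns an empty list when every expected file and folder is present.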
Other datasets have to be manually downloaded and extracted but their train and trials files can be created using the corresponding scripts from the tools/prepare_data/ folder.
- Example format of a train file (`voxceleb1_train.csv`):

      File,Speaker
      voxceleb1/id10001/1zcIwhmdeo4/00001.wav,id10001
      ...
      voxceleb1/id11251/s4R4hvqrhFw/00009.wav,id11251

- Example format of a trials file (`voxceleb1_test_O`):

      1 voxceleb1/id10270/x6uYqmx31kE/00001.wav voxceleb1/id10270/8jEAjG6SegY/00008.wav
      ...
      0 voxceleb1/id10309/0cYFdtyWVds/00005.wav voxceleb1/id10296/Y-qKARMSO7k/00001.wav
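
The two formats above are simple enough to read with the Python standard library. The sketch below assumes exactly the layouts shown: a CSV with `File` and `Speaker` columns, and a whitespace-separated trials file whose first field is 1 for a same-speaker pair and 0 otherwise. The function names are hypothetical and do not correspond to sslsv's own loaders.

```python
import csv
import io


def parse_train_file(text):
    """Parse train-file content into a list of (file, speaker) tuples."""
    reader = csv.DictReader(io.StringIO(text))
    return [(row["File"], row["Speaker"]) for row in reader]


def parse_trials_file(text):
    """Parse trials-file content into a list of (label, enrol, test) tuples."""
    trials = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        label, enrol, test = parts
        trials.append((int(label), enrol, test))
    return trials
```

Both functions take the file content as a string, so they can be used with `Path("data/voxceleb1_train.csv").read_text()` or tested without touching the disk.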
- Clone this repository: `git clone https://github.com/theolepage/sslsv.git`.
- Install dependencies: `pip install -r requirements.txt`.
Note: sslsv can also be installed as a standalone package via pip with `pip install sslsv`, or with `pip install .` (in the project root folder) to get the latest version.
- Start a training (2 GPUs): `./train_ddp.sh 2 <config_path>`.
- Evaluate your model (2 GPUs): `./evaluate_ddp.sh 2 <config_path>`.
Note: use `sslsv/bin/train.py` and `sslsv/bin/evaluate.py` for non-distributed mode to run with a CPU, a single GPU or multiple GPUs (DataParallel).
You can visualize your experiments with `tensorboard --logdir models/your_model/`.
Use `wandb online` and `wandb offline` to toggle wandb. To log your experiments, you first need to provide your API key with `wandb login API_KEY`.
Documentation is currently being developed...
- Configs: `models/ssl/voxceleb2/`
- Train set: VoxCeleb2
- Evaluation: VoxCeleb1-O (Original)
- Encoder: Fast ResNet-34 and ECAPA-TDNN

Fast ResNet-34:

| Method | Model | EER (%) | minDCF (p=0.01) | Checkpoint |
|---|---|---|---|---|
| LIM | `lim/lim_loss-NCE_proj-2048-BN-R-2048-BN-R-512` | 16.13 | 0.9015 | |
| CPC | `cpc/cpc_t-4_agg-GRU-1-256` | 12.77 | 0.8033 | |
| SimCLR | `simclr/simclr_proj-none_t-0.03` | 9.05 | 0.6364 | 🔗 |
| MoCo | `moco/moco_proj-none_Q-32768_t-0.03_m-0.999` | 8.49 | 0.5990 | 🔗 |
| DeepCluster | `deepcluster/deepcluster_proj-2048-BN-R-2048-BN-R-512_K-3000-3000-3000_t-0.1` | 15.16 | 0.8193 | |
| SwAV | `swav/swav_proj-2048-BN-R-2048-BN-R-512_K-6000_t-0.1` | 11.82 | 0.7177 | 🔗 |
| W-MSE | `wmse/wmse_proj-1024-BN-R-64_ws-128` | 14.62 | 0.8506 | |
| Barlow Twins | `barlowtwins/barlowtwins_proj-2048-BN-R-2048-BN-R-512_lambda-0.005` | 13.22 | 0.7658 | |
| VICReg | `vicreg/vicreg_proj-2048-BN-R-2048-BN-R-512_inv-1.0_var-1.0_cov-0.1` | 11.33 | 0.6658 | 🔗 |
| BYOL | `byol/byol_proj-2048-BN-R-2048-BN-R-512_pred-4096-BN-R-256_m-0.996-sched` | 13.99 | 0.7509 | |
| SimSiam | `simsiam/simsiam_proj-2048-BN-R-2048-BN-R-512-BN_pred-512-BN-R-2048` | 28.94 | 0.9984 | |
| DINO | `dino/dino_proj-2048-BN-G-2048-BN-G-256-L2-65536_G-2x4_L-4x2_t-0.04` | 6.04 | 0.4526 | 🔗 |
| Supervised | `supervised/supervised_loss-AAM_s-30_m-0.2` | 2.95 | 0.3122 | 🔗 |

ECAPA-TDNN:

| Method | Model | EER (%) | minDCF (p=0.01) | Checkpoint |
|---|---|---|---|---|
| SimCLR | `simclr/simclr_enc-ECAPATDNN-1024_proj-none_t-0.03` | 6.41 | 0.5160 | 🔗 |
| MoCo | `moco/moco_enc-ECAPATDNN-1024_proj-none_Q-32768_t-0.03_m-0.999` | 6.48 | 0.5372 | 🔗 |
| SwAV | `swav/swav_enc-ECAPATDNN-1024_proj-2048-BN-R-2048-BN-R-512_K-6000_t-0.1` | 8.12 | 0.6148 | 🔗 |
| VICReg | `vicreg/vicreg_enc-ECAPATDNN-1024_proj-2048-BN-R-2048-BN-R-512_inv-1.0_var-1.0_cov-0.1` | 7.42 | 0.5659 | 🔗 |
| DINO | `dino/dino_enc-ECAPATDNN-1024_proj-2048-BN-G-2048-BN-G-256-L2-65536_G-2x4_L-4x2_t-0.04` | 2.82 | 0.3463 | 🔗 |
| Supervised | `supervised/supervised_enc-ECAPATDNN-1024_loss-AAM_s-30_m-0.2` | 1.34 | 0.1521 | 🔗 |

- Configs: `models/ssps/voxceleb2/`
- Train set: VoxCeleb2
- Evaluation: VoxCeleb1-O (Original)
- Encoder: ECAPA-TDNN

| Method | Model | EER (%) | minDCF (p=0.01) | Checkpoint |
|---|---|---|---|---|
| SimCLR | `simclr_e-ecapa/ssps_kmeans_25k_uni-1` | 2.57 | 0.3033 | 🔗 |
| DINO | `dino_e-ecapa/ssps_kmeans_25k_uni-1` | 2.53 | 0.2843 | 🔗 |
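
For readers unfamiliar with the minDCF (p=0.01) column in the result tables, the sketch below shows one common way to compute a normalized minimum detection cost from trial scores. Only the target prior of 0.01 comes from the tables; the unit costs `c_miss=1.0` and `c_fa=1.0` and the function name `compute_min_dcf` are assumptions, and sslsv's own implementation may differ.

```python
def compute_min_dcf(scores, labels, p_target=0.01, c_miss=1.0, c_fa=1.0):
    """Normalized minimum detection cost over all candidate thresholds.

    labels use 1 for target (same-speaker) trials and 0 for non-target trials.
    A trial is accepted when its score is >= the threshold.
    """
    targets = [s for s, l in zip(scores, labels) if l == 1]
    nontargets = [s for s, l in zip(scores, labels) if l == 0]
    # +inf covers the "reject every trial" operating point.
    thresholds = sorted(set(scores)) + [float("inf")]
    min_cost = float("inf")
    for threshold in thresholds:
        p_miss = sum(s < threshold for s in targets) / len(targets)
        p_fa = sum(s >= threshold for s in nontargets) / len(nontargets)
        cost = c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)
        min_cost = min(min_cost, cost)
    # Normalize by the cost of the best trivial system
    # (always reject or always accept).
    default_cost = min(c_miss * p_target, c_fa * (1.0 - p_target))
    return min_cost / default_cost
```

With this normalization, a value of 1.0 corresponds to a system no better than always rejecting, which is why the weakest entries in the tables sit close to 1.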
sslsv contains third-party components and code adapted from other open-source projects, including: voxceleb_trainer, voxceleb_unsupervised and solo-learn.
If you use sslsv, please consider starring this repository on GitHub and citing one of the following papers.
@Article{lepage2025SLSRReview,
title = {Self-Supervised Learning for Speaker Recognition: A study and review},
author = {Lepage, Theo and Dehak, Reda},
year = {2026},
journal = {Speech Communication},
volume = {176},
pages = {103333},
doi = {10.1016/j.specom.2025.103333},
url = {https://arxiv.org/pdf/2602.10829}
}
@InProceedings{lepage2025SSPS,
title = {SSPS: Self-Supervised Positive Sampling for Robust Self-Supervised Speaker Verification},
author = {Lepage, Theo and Dehak, Reda},
year = {2025},
booktitle = {Interspeech 2025},
pages = {1098--1102},
doi = {10.21437/Interspeech.2025-183},
url = {https://www.isca-archive.org/interspeech_2025/lepage25_interspeech.pdf}
}
@Article{lepage2025BootstrappedPositiveSampling,
title = {Self-Supervised Frameworks for Speaker Verification via Bootstrapped Positive Sampling},
author = {Lepage, Theo and Dehak, Reda},
year = {2025},
journal = {IEEE Transactions on Audio, Speech and Language Processing},
volume = {33},
pages = {2932--2945},
doi = {10.1109/TASLPRO.2025.3587462},
url = {https://arxiv.org/pdf/2501.17772}
}

This project is released under the MIT License.