sslsv

sslsv is a PyTorch-based deep learning toolkit consisting of a collection of Self-Supervised Learning (SSL) frameworks for learning speaker representations, applicable to various speaker-related downstream tasks, notably Speaker Verification (SV).

Its main objectives are to: (1) provide implementations of state-of-the-art SSL frameworks by adapting algorithms from the computer vision domain; and (2) evaluate them within a consistent and comparable environment.

An overview of the general training and evaluation framework is provided in the figure below.


News

  • June 2025 – Release of results and checkpoints (v2.0).
  • June 2025 – 🔖 Support for Python 3.13 and PyTorch 2.7.
  • December 2024 – 🧪 Implementation of SimCLR MultiViews and MoCo Margins.
  • November 2024 – 💡 Implementation of Self-Supervised Positive Sampling (SSPS).
  • July 2024 – 🌱 Implementation of more losses for SimCLR Margins (SphereFace, CurricularFace, MagFace, AdaFace).
  • May 2024 – 📚 Documentation of the complete codebase.
  • April 2024 – 🛠️ Complete refactoring, including typing, tests, and coding style (v2.0).
  • January 2024 – 🚀 Implementation of the W-MSE framework.
  • July 2023 – ⚡ Support for PyTorch Distributed Data Parallel (DDP).
  • June 2023 – 🧠 Evaluation on language, emotion, age, and gender recognition tasks.
  • April 2023 – 📊 Additional benchmarks (SITW, VOiCES) and metrics (CLLR, ActDCF, AvgRPrec).
  • March 2023 – Support for cosine scoring normalizations and PLDA evaluations.
  • January 2023 – 🧪 Implementation of SimCLR Margins (CosFace and ArcFace).
  • December 2022 – 🚀 Implementation of SSL frameworks: LIM, CPC, SimCLR, MoCo, Barlow Twins, VICReg, VIbCReg, DeepCluster, SwAV, SimSiam, BYOL, and DINO.
  • June 2022 – 🌠 First release of sslsv (v1.0).

Features

General

  • Data:
    • Supervised and Self-supervised datasets (siamese and DINO sampling)
    • Audio augmentation (noise and reverberation)
  • Training:
    • CPU, GPU and multi-GPUs (DataParallel and DistributedDataParallel)
    • Checkpointing, resuming, early stopping and logging
    • Tensorboard and wandb
  • Evaluation:
    • Speaker verification
      • Backend: Cosine scoring and PLDA
      • Metrics: EER, MinDCF, ActDCF, CLLR, AvgRPrec
    • Classification (emotion, language, ...)
  • Notebooks: DET curve, scores distribution, t-SNE on embeddings, ...
  • Misc: scalable config, typing, documentation and tests
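
To make the speaker verification evaluation listed above more concrete, here is a minimal, standard-library sketch of cosine scoring followed by an Equal Error Rate (EER) estimate. It is not the sslsv implementation: the function names `cosine_score` and `equal_error_rate`, the toy embeddings, and the simple threshold sweep are all assumptions made for illustration.

```python
import math


def cosine_score(a, b):
    """Cosine similarity between two embedding vectors (lists of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def equal_error_rate(scores, labels):
    """Approximate the EER by sweeping a threshold over the observed scores.

    labels: 1 for target (same-speaker) trials, 0 for non-target trials.
    Returns the mean of the false acceptance rate (FAR) and false rejection
    rate (FRR) at the threshold where the two rates are closest.
    """
    n_target = sum(labels)
    n_nontarget = len(labels) - n_target
    best_gap, eer = float("inf"), 1.0
    for threshold in sorted(set(scores)):
        # A trial is accepted when its score is >= threshold.
        false_accepts = sum(
            1 for s, y in zip(scores, labels) if y == 0 and s >= threshold
        )
        false_rejects = sum(
            1 for s, y in zip(scores, labels) if y == 1 and s < threshold
        )
        far = false_accepts / n_nontarget
        frr = false_rejects / n_target
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer


# Toy trials: (label, embedding A, embedding B). All values are made up.
trials = [
    (1, [1.0, 0.0], [0.9, 0.1]),
    (1, [0.0, 1.0], [0.2, 0.8]),
    (0, [1.0, 0.0], [0.0, 1.0]),
    (0, [1.0, 0.0], [0.6, 0.8]),
]
labels = [label for label, _, _ in trials]
scores = [cosine_score(a, b) for _, a, b in trials]
print(f"EER: {100 * equal_error_rate(scores, labels):.2f}%")
```

The sweep is quadratic in the number of trials, which is fine for a toy example; a real evaluation would typically sort the scores once and interpolate between operating points.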
Encoders
  • TDNN (sslsv.encoders.TDNN)
    X-vectors: Robust dnn embeddings for speaker recognition [PDF]
    David Snyder, Daniel Garcia-Romero, Gregory Sell, Daniel Povey, Sanjeev Khudanpur

  • Simple Audio CNN (sslsv.encoders.SimpleAudioCNN)
    Representation Learning with Contrastive Predictive Coding [PDF]
    Aaron van den Oord, Yazhe Li, Oriol Vinyals

  • ResNet-34 (sslsv.encoders.ResNet34)
    VoxCeleb2: Deep Speaker Recognition [PDF]
    Joon Son Chung, Arsha Nagrani, Andrew Zisserman

  • ECAPA-TDNN (sslsv.encoders.ECAPATDNN)
    ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification [PDF]
    Brecht Desplanques, Jenthe Thienpondt, Kris Demuynck

  • S3PRL (sslsv.encoders.S3PRL)
    Pre-trained speech foundation models (e.g., WavLM, HuBERT, wav2vec 2.0) can be used as encoders using the s3prl toolkit

Frameworks
  • LIM (sslsv.methods.LIM)
    Learning Speaker Representations with Mutual Information [PDF]
    Mirco Ravanelli, Yoshua Bengio

  • CPC (sslsv.methods.CPC)
    Representation Learning with Contrastive Predictive Coding [PDF]
    Aaron van den Oord, Yazhe Li, Oriol Vinyals

  • SimCLR (sslsv.methods.SimCLR)
    A Simple Framework for Contrastive Learning of Visual Representations [PDF]
    Ting Chen, Simon Kornblith, Mohammad Norouzi, Geoffrey Hinton

  • MoCo v2+ (sslsv.methods.MoCo)
    Improved Baselines with Momentum Contrastive Learning [PDF]
    Xinlei Chen, Haoqi Fan, Ross Girshick, Kaiming He

  • DeepCluster v2 (sslsv.methods.DeepCluster)
    Deep Clustering for Unsupervised Learning of Visual Features [PDF]
    Mathilde Caron, Piotr Bojanowski, Armand Joulin, Matthijs Douze

  • SwAV (sslsv.methods.SwAV)
    Unsupervised Learning of Visual Features by Contrasting Cluster Assignments [PDF]
    Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, Armand Joulin

  • W-MSE (sslsv.methods.WMSE)
    Whitening for Self-Supervised Representation Learning [PDF]
    Aleksandr Ermolov, Aliaksandr Siarohin, Enver Sangineto, Nicu Sebe

  • Barlow Twins (sslsv.methods.BarlowTwins)
    Barlow Twins: Self-Supervised Learning via Redundancy Reduction [PDF]
    Jure Zbontar, Li Jing, Ishan Misra, Yann LeCun, Stéphane Deny

  • VICReg (sslsv.methods.VICReg)
    VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning [PDF]
    Adrien Bardes, Jean Ponce, Yann LeCun

  • VIbCReg (sslsv.methods.VIbCReg)
    Computer Vision Self-supervised Learning Methods on Time Series [PDF]
    Daesoo Lee, Erlend Aune

  • BYOL (sslsv.methods.BYOL)
    Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning [PDF]
    Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre H. Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Daniel Guo, Mohammad Gheshlaghi Azar, Bilal Piot, Koray Kavukcuoglu, Rémi Munos, Michal Valko

  • SimSiam (sslsv.methods.SimSiam)
    Exploring Simple Siamese Representation Learning [PDF]
    Xinlei Chen, Kaiming He

  • DINO (sslsv.methods.DINO)
    Emerging Properties in Self-Supervised Vision Transformers [PDF]
    Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, Armand Joulin

Methods (contributions)
  • Combiner (sslsv.methods.Combiner)
    Label-Efficient Self-Supervised Speaker Verification With Information Maximization and Contrastive Learning [PDF] [Ref]
    Theo Lepage, Reda Dehak

  • Margins (sslsv.methods.SimCLRMargins, sslsv.methods.MoCoMargins)
    Additive Margin in Contrastive Self-Supervised Frameworks to Learn Discriminative Speaker Representations [PDF] [Ref]
    Theo Lepage, Reda Dehak

  • SSPS (sslsv.methods._SSPS)
    Self-Supervised Frameworks for Speaker Verification via Bootstrapped Positive Sampling [PDF] [Ref]
    Theo Lepage, Reda Dehak


Requirements

sslsv runs on Python 3.13.3 with the following dependencies.

| Module | Versions |
|---|---|
| torch | 2.7.1 |
| torchaudio | 2.7.1 |
| numpy | * |
| pandas | * |
| soundfile | * |
| scikit-learn | * |
| speechbrain | * |
| tensorboard | * |
| wandb | * |
| ruamel.yaml | * |
| dacite | * |
| prettyprinter | * |
| tqdm | * |

Note: developers will also need pytest, pre-commit and twine to work on this project.


Datasets

Speaker recognition:

Language recognition:

Emotion recognition:

Data-augmentation:

Data used for main experiments (conducted on VoxCeleb1 and VoxCeleb2 + data-augmentation) can be automatically downloaded, extracted and prepared using the following scripts.

python tools/prepare_data/prepare_voxceleb.py data/
python tools/prepare_data/prepare_augmentation.py data/

The resulting data folder should have the structure presented below.

data
├── musan_split/
├── simulated_rirs/
├── voxceleb1/
├── voxceleb2/
├── voxceleb1_test_O
├── voxceleb1_test_H
├── voxceleb1_test_E
├── voxsrc2021_val
├── voxceleb1_train.csv
└── voxceleb2_train.csv
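
As a quick sanity check after running the preparation scripts, the sketch below lists which of the entries shown in the tree above are missing from a data folder. The helper `find_missing_entries` and the `EXPECTED_ENTRIES` list are hypothetical and not part of sslsv; the list simply mirrors the structure shown above.

```python
from pathlib import Path

# Entries listed in the tree above; names ending with "/" are directories.
EXPECTED_ENTRIES = [
    "musan_split/",
    "simulated_rirs/",
    "voxceleb1/",
    "voxceleb2/",
    "voxceleb1_test_O",
    "voxceleb1_test_H",
    "voxceleb1_test_E",
    "voxsrc2021_val",
    "voxceleb1_train.csv",
    "voxceleb2_train.csv",
]


def find_missing_entries(data_root):
    """Return the expected entries that are absent from data_root."""
    root = Path(data_root)
    missing = []
    for entry in EXPECTED_ENTRIES:
        path = root / entry.rstrip("/")
        exists = path.is_dir() if entry.endswith("/") else path.is_file()
        if not exists:
            missing.append(entry)
    return missing


if __name__ == "__main__":
    missing = find_missing_entries("data")
    print("Missing entries:", missing or "none")
```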

Other datasets have to be manually downloaded and extracted but their train and trials files can be created using the corresponding scripts from the tools/prepare_data/ folder.

  • Example format of a train file (voxceleb1_train.csv)

    File,Speaker
    voxceleb1/id10001/1zcIwhmdeo4/00001.wav,id10001
    ...
    voxceleb1/id11251/s4R4hvqrhFw/00009.wav,id11251
    
  • Example format of a trials file (voxceleb1_test_O)

    1 voxceleb1/id10270/x6uYqmx31kE/00001.wav voxceleb1/id10270/8jEAjG6SegY/00008.wav
    ...
    0 voxceleb1/id10309/0cYFdtyWVds/00005.wav voxceleb1/id10296/Y-qKARMSO7k/00001.wav
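
The two formats above can be read with a few lines of standard-library Python. This is an illustrative sketch rather than the loader used by sslsv: the function names `read_train_file` and `read_trials_file` are hypothetical, and the parser assumes that file paths contain no spaces and that the leading 1/0 marks same-speaker and different-speaker trials, as the speaker IDs in the example suggest.

```python
import csv
import io


def read_train_file(handle):
    """Parse a train CSV with a 'File,Speaker' header into (path, speaker) pairs."""
    return [(row["File"], row["Speaker"]) for row in csv.DictReader(handle)]


def read_trials_file(handle):
    """Parse trial lines of the form '<label> <path A> <path B>'."""
    trials = []
    for line in handle:
        if not line.strip():
            continue  # skip blank lines
        label, path_a, path_b = line.split()
        trials.append((int(label), path_a, path_b))
    return trials


# In-memory stand-ins for data/voxceleb1_train.csv and data/voxceleb1_test_O.
train_text = (
    "File,Speaker\n"
    "voxceleb1/id10001/1zcIwhmdeo4/00001.wav,id10001\n"
    "voxceleb1/id11251/s4R4hvqrhFw/00009.wav,id11251\n"
)
trials_text = (
    "1 voxceleb1/id10270/x6uYqmx31kE/00001.wav voxceleb1/id10270/8jEAjG6SegY/00008.wav\n"
    "0 voxceleb1/id10309/0cYFdtyWVds/00005.wav voxceleb1/id10296/Y-qKARMSO7k/00001.wav\n"
)

train = read_train_file(io.StringIO(train_text))
trials = read_trials_file(io.StringIO(trials_text))
print(len(train), "training utterances,", len(trials), "trials")
```

With real data, the `io.StringIO` objects would be replaced by files opened from the data folder.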
    

Installation

  1. Clone this repository: git clone https://github.com/theolepage/sslsv.git.
  2. Install dependencies: pip install -r requirements.txt.

Note: sslsv can also be installed as a standalone package with pip install sslsv, or with pip install . (run from the project root folder) to get the latest version.


Usage

  • Start a training (2 GPUs): ./train_ddp.sh 2 <config_path>.
  • Evaluate your model (2 GPUs): ./evaluate_ddp.sh 2 <config_path>.

Note: use sslsv/bin/train.py and sslsv/bin/evaluate.py for non-distributed mode to run with a CPU, a single GPU or multiple GPUs (DataParallel).

Tensorboard

You can visualize your experiments with tensorboard --logdir models/your_model/.

wandb

Use wandb online and wandb offline to toggle wandb. To log your experiments you first need to provide your API key with wandb login API_KEY.


Documentation

Documentation is currently being developed...


Results

SSL frameworks

  • Configs: models/ssl/voxceleb2/
  • Train set: VoxCeleb2
  • Evaluation: VoxCeleb1-O (Original)
  • Encoder: Fast ResNet-34 and ECAPA-TDNN

Fast ResNet-34

| Method | Model | EER (%) | minDCF (p=0.01) | Checkpoint |
|---|---|---|---|---|
| LIM | lim/lim_loss-NCE_proj-2048-BN-R-2048-BN-R-512 | 16.13 | 0.9015 | |
| CPC | cpc/cpc_t-4_agg-GRU-1-256 | 12.77 | 0.8033 | |
| SimCLR | simclr/simclr_proj-none_t-0.03 | 9.05 | 0.6364 | 🔗 |
| MoCo | moco/moco_proj-none_Q-32768_t-0.03_m-0.999 | 8.49 | 0.5990 | 🔗 |
| DeepCluster | deepcluster/deepcluster_proj-2048-BN-R-2048-BN-R-512_K-3000-3000-3000_t-0.1 | 15.16 | 0.8193 | |
| SwAV | swav/swav_proj-2048-BN-R-2048-BN-R-512_K-6000_t-0.1 | 11.82 | 0.7177 | 🔗 |
| W-MSE | wmse/wmse_proj-1024-BN-R-64_ws-128 | 14.62 | 0.8506 | |
| Barlow Twins | barlowtwins/barlowtwins_proj-2048-BN-R-2048-BN-R-512_lambda-0.005 | 13.22 | 0.7658 | |
| VICReg | vicreg/vicreg_proj-2048-BN-R-2048-BN-R-512_inv-1.0_var-1.0_cov-0.1 | 11.33 | 0.6658 | 🔗 |
| BYOL | byol/byol_proj-2048-BN-R-2048-BN-R-512_pred-4096-BN-R-256_m-0.996-sched | 13.99 | 0.7509 | |
| SimSiam | simsiam/simsiam_proj-2048-BN-R-2048-BN-R-512-BN_pred-512-BN-R-2048 | 28.94 | 0.9984 | |
| DINO | dino/dino_proj-2048-BN-G-2048-BN-G-256-L2-65536_G-2x4_L-4x2_t-0.04 | 6.04 | 0.4526 | 🔗 |
| Supervised | supervised/supervised_loss-AAM_s-30_m-0.2 | 2.95 | 0.3122 | 🔗 |

ECAPA-TDNN

| Method | Model | EER (%) | minDCF (p=0.01) | Checkpoint |
|---|---|---|---|---|
| SimCLR | simclr/simclr_enc-ECAPATDNN-1024_proj-none_t-0.03 | 6.41 | 0.5160 | 🔗 |
| MoCo | moco/moco_enc-ECAPATDNN-1024_proj-none_Q-32768_t-0.03_m-0.999 | 6.48 | 0.5372 | 🔗 |
| SwAV | swav/swav_enc-ECAPATDNN-1024_proj-2048-BN-R-2048-BN-R-512_K-6000_t-0.1 | 8.12 | 0.6148 | 🔗 |
| VICReg | vicreg/vicreg_enc-ECAPATDNN-1024_proj-2048-BN-R-2048-BN-R-512_inv-1.0_var-1.0_cov-0.1 | 7.42 | 0.5659 | 🔗 |
| DINO | dino/dino_enc-ECAPATDNN-1024_proj-2048-BN-G-2048-BN-G-256-L2-65536_G-2x4_L-4x2_t-0.04 | 2.82 | 0.3463 | 🔗 |
| Supervised | supervised/supervised_enc-ECAPATDNN-1024_loss-AAM_s-30_m-0.2 | 1.34 | 0.1521 | 🔗 |

SSPS

  • Configs: models/ssps/voxceleb2/
  • Train set: VoxCeleb2
  • Evaluation: VoxCeleb1-O (Original)
  • Encoder: ECAPA-TDNN

| Method | Model | EER (%) | minDCF (p=0.01) | Checkpoint |
|---|---|---|---|---|
| SimCLR | simclr_e-ecapa/ssps_kmeans_25k_uni-1 | 2.57 | 0.3033 | 🔗 |
| DINO | dino_e-ecapa/ssps_kmeans_25k_uni-1 | 2.53 | 0.2843 | 🔗 |
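
For readers unfamiliar with the minDCF (p=0.01) column, the sketch below shows one common way to compute a normalized minimum detection cost from trial scores. It is an assumption-laden illustration, not the sslsv code: the function name `min_dcf` is hypothetical, and the unit costs `c_miss = c_fa = 1` are assumed since the tables only state the target prior p=0.01.

```python
def min_dcf(scores, labels, p_target=0.01, c_miss=1.0, c_fa=1.0):
    """Normalized minimum detection cost over all decision thresholds.

    labels: 1 for target (same-speaker) trials, 0 for non-target trials.
    c_miss and c_fa default to 1.0, which is an assumption of this sketch.
    """
    n_target = sum(labels)
    n_nontarget = len(labels) - n_target
    # Candidate thresholds: each observed score, plus +inf (reject everything).
    thresholds = sorted(set(scores)) + [float("inf")]
    best = float("inf")
    for threshold in thresholds:
        # A trial is accepted when its score is >= threshold.
        misses = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
        false_alarms = sum(
            1 for s, y in zip(scores, labels) if y == 0 and s >= threshold
        )
        p_miss = misses / n_target
        p_fa = false_alarms / n_nontarget
        dcf = c_miss * p_miss * p_target + c_fa * p_fa * (1 - p_target)
        best = min(best, dcf)
    # Normalize by the cost of the best trivial system (accept all or reject all).
    return best / min(c_miss * p_target, c_fa * (1 - p_target))


# Made-up scores: two target trials followed by two non-target trials.
example_scores = [0.9, 0.4, 0.5, 0.1]
example_labels = [1, 1, 0, 0]
print(f"minDCF: {min_dcf(example_scores, example_labels):.4f}")
```

With a low target prior such as 0.01, false alarms are weighted far more heavily than misses, which is why minDCF can rank systems differently from EER.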

Acknowledgements

sslsv contains third-party components and code adapted from other open-source projects, including: voxceleb_trainer, voxceleb_unsupervised and solo-learn.


Citations

If you use sslsv, please consider starring this repository on GitHub and citing one of the following papers.

@Article{lepage2025SLSRReview,
  title   = {Self-Supervised Learning for Speaker Recognition: A study and review},
  author  = {Lepage, Theo and Dehak, Reda},
  year    = {2026},
  journal = {Speech Communication},
  volume  = {176},
  pages   = {103333},
  doi     = {10.1016/j.specom.2025.103333},
  url     = {https://arxiv.org/pdf/2602.10829}
}

@InProceedings{lepage2025SSPS,
  title     = {SSPS: Self-Supervised Positive Sampling for Robust Self-Supervised Speaker Verification},
  author    = {Lepage, Theo and Dehak, Reda},
  year      = {2025},
  booktitle = {Interspeech 2025},
  pages     = {1098--1102},
  doi       = {10.21437/Interspeech.2025-183},
  url       = {https://www.isca-archive.org/interspeech_2025/lepage25_interspeech.pdf}
}

@Article{lepage2025BootstrappedPositiveSampling,
  title     = {Self-Supervised Frameworks for Speaker Verification via Bootstrapped Positive Sampling},
  author    = {Lepage, Theo and Dehak, Reda},
  year      = {2025},
  journal   = {IEEE Transactions on Audio, Speech and Language Processing},
  volume    = {33},
  pages     = {2932--2945},
  doi       = {10.1109/TASLPRO.2025.3587462},
  url       = {https://arxiv.org/pdf/2501.17772}
}

License

This project is released under the MIT License.
