GMOD: Voice-Face Association Learning via Graph Mining and Orthogonal Disentanglement

GMOD is an unsupervised framework for voice-face association learning. It addresses two common limitations of existing methods: false-negative conflicts caused by treating cross-video same-identity samples as negatives, and unreliable pseudo-labels introduced by rigid clustering or prototype assumptions. GMOD constructs a global cross-modal similarity graph to mine latent positive samples for multi-positive contrastive learning, and uses orthogonal disentanglement to separate shared identity features from modality-private noise.
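As a rough illustration of the graph-mining idea, the sketch below builds a voice-to-face cosine similarity matrix and treats mutual top-p neighbors as latent positives. The function names (`cosine`, `mine_positives`), the mutual-neighbor rule, and the assumption that both modalities are already projected into one shared space are illustrative choices, not taken from the repository.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def mine_positives(voice, face, top_p):
    """Return, for each sample i, a set of indices treated as positives.

    voice[i] and face[i] are the embeddings of sample i in a shared space.
    Assumption: j is a latent positive of i when face j is among the top_p
    neighbors of voice i AND voice i is among the top_p neighbors of face j.
    The sample itself is always kept as a positive.
    """
    n = len(voice)
    sim = [[cosine(voice[i], face[j]) for j in range(n)] for i in range(n)]

    def top(scores):
        order = sorted(range(n), key=lambda idx: scores[idx], reverse=True)
        return set(order[:top_p])

    v2f = [top(sim[i]) for i in range(n)]                       # per voice row
    f2v = [top([sim[i][j] for i in range(n)]) for j in range(n)]  # per face column

    positives = []
    for i in range(n):
        mutual = {j for j in v2f[i] if i in f2v[j]}
        positives.append(mutual | {i})
    return positives
```

With two well-separated identities, each represented by two samples, the mined sets group the samples of the same identity together, which is what lets cross-video same-identity pairs stop acting as negatives.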

Framework

[Figure: GMOD framework]

Overview

This repository focuses on the GMOD training and evaluation pipeline. The dataset layout and extracted feature format are compatible with my-yy/vfal-eva, which can be used as the reference setup for preparing the environment and input files.

GMOD includes:

  • Graph mining over video/movie-level voice-face samples.
  • Multi-positive bidirectional InfoNCE training.
  • Orthogonal identity/nuisance disentanglement.
  • Reconstruction regularization for both modalities.
  • Validation and testing on verification, matching, and retrieval tasks.
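The components listed above suggest a weighted training objective. One plausible form is sketched below; it is an assumption for illustration, not the formulation confirmed by the code. Here $s^{m}_{i}$ and $n^{m}_{i}$ denote the shared identity feature and the modality-private feature of sample $i$ in modality $m \in \{v, f\}$ (assumed to have the same dimension), $P(i)$ is the mined positive set, $\tau$ is the temperature, $x^{m}_{i}$ is the input feature, and $D^{m}$ is a hypothetical decoder. $\mathcal{L}_{f \to v}$ is defined symmetrically to $\mathcal{L}_{v \to f}$.

```latex
\begin{aligned}
\mathcal{L} &= \lambda_{\mathrm{nce}}\,\mathcal{L}_{\mathrm{nce}}
  + \lambda_{\mathrm{ortho}}\,\mathcal{L}_{\mathrm{ortho}}
  + \lambda_{\mathrm{rec}}\,\mathcal{L}_{\mathrm{rec}} \\
\mathcal{L}_{\mathrm{nce}} &= \tfrac{1}{2}\left(\mathcal{L}_{v \to f} + \mathcal{L}_{f \to v}\right) \\
\mathcal{L}_{v \to f} &= -\frac{1}{N}\sum_{i=1}^{N}\frac{1}{|P(i)|}\sum_{j \in P(i)}
  \log \frac{\exp\left(\cos(s^{v}_{i}, s^{f}_{j})/\tau\right)}
            {\sum_{k=1}^{N}\exp\left(\cos(s^{v}_{i}, s^{f}_{k})/\tau\right)} \\
\mathcal{L}_{\mathrm{ortho}} &= \frac{1}{N}\sum_{m \in \{v, f\}}\sum_{i=1}^{N}
  \cos\left(s^{m}_{i}, n^{m}_{i}\right)^{2} \\
\mathcal{L}_{\mathrm{rec}} &= \frac{1}{N}\sum_{m \in \{v, f\}}\sum_{i=1}^{N}
  \left\| x^{m}_{i} - D^{m}\left(s^{m}_{i}, n^{m}_{i}\right) \right\|_{2}^{2}
\end{aligned}
```

The three weights correspond to the `--lambda_nce`, `--lambda_ortho`, and `--lambda_rec` arguments, and $\tau$ to `--temperature`.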

Repository Structure

GMOD/
├── dataset/
│   ├── face_input.pkl
│   ├── voice_input.pkl
│   ├── info/
│   └── evals/
├── image/
│   └── model.png
├── loaders/
├── models/
│   └── gmod_model.py
├── preprocess/
├── scripts/
├── utils/
├── works/
│   └── GMOD.py
├── requirements.txt
└── readme.md

Installation

Create a Python environment with CUDA-enabled PyTorch, then install the dependencies used by GMOD:

pip install -r requirements.txt

The training code calls .cuda() directly, so a CUDA-enabled PyTorch installation and an available GPU are expected.
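Because a missing GPU would only surface once training starts, a quick pre-flight check can save time. The helper below is a minimal sketch; `cuda_ready` is a hypothetical name and is not part of the repository.

```python
import importlib.util

def cuda_ready() -> bool:
    """Return True only if PyTorch is importable and reports a usable CUDA device."""
    if importlib.util.find_spec("torch") is None:
        return False  # PyTorch is not installed in this environment
    import torch
    return bool(torch.cuda.is_available())

if __name__ == "__main__":
    print("CUDA ready:", cuda_ready())
```

If this prints `CUDA ready: False`, the `.cuda()` calls in the training code can be expected to fail.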

Dataset

GMOD expects pre-extracted voice and face features. The expected file layout follows the vfal-eva convention:

dataset/
├── evals/
│   ├── test_matching_10.pkl
│   ├── test_matching_g.pkl
│   ├── test_matching.pkl
│   ├── test_retrieval.pkl
│   ├── test_verification_g.pkl
│   ├── test_verification.pkl
│   └── valid_verification.pkl
├── info/
│   ├── name2gender.pkl
│   ├── name2jpgs_wavs.pkl
│   ├── name2movies.pkl
│   ├── name2voice_id.pkl
│   ├── train_valid_test_names.pkl
│   └── works/
│       └── wen_weights.txt
├── face_input.pkl
└── voice_input.pkl

Input feature files:

  • face_input.pkl: a dictionary from face image paths to 512-dimensional face embeddings.
  • voice_input.pkl: a dictionary from voice clip paths to 192-dimensional voice embeddings.
  • dataset/info/*.pkl: metadata for names, genders, movies/videos, voice IDs, and train/valid/test splits.
  • dataset/evals/*.pkl: predefined evaluation protocols for verification, matching, and retrieval.
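Before training, it may help to confirm that the feature files match the format described above. The following sketch assumes each pickle holds a plain dictionary whose values are flat vectors (lists or 1-D arrays); `check_features` is a hypothetical helper, and loading real files will require whatever array library was used to save them.

```python
import pickle

def check_features(path, expected_dim):
    """Load a feature pickle and report entries with an unexpected dimension.

    Returns (number_of_entries, keys_with_wrong_dimension).
    Assumes values are flat vectors, so len(vec) is the embedding size;
    a 2-D array such as shape (1, 512) would be reported as a mismatch.
    """
    with open(path, "rb") as fh:
        features = pickle.load(fh)
    if not isinstance(features, dict):
        raise TypeError(f"{path}: expected a dict, got {type(features).__name__}")
    bad = [key for key, vec in features.items() if len(vec) != expected_dim]
    return len(features), bad
```

For the layout above, the calls would be `check_features("dataset/face_input.pkl", 512)` and `check_features("dataset/voice_input.pkl", 192)`.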

Training

Run GMOD with:

python works/GMOD.py

Common arguments:

python works/GMOD.py \
  --epoch 200 \
  --batch_size 256 \
  --lr 1e-4 \
  --top_p 20 \
  --top_p_end 8 \
  --top_p_ramp 20 \
  --temperature 0.07 \
  --lambda_nce 1.0 \
  --lambda_ortho 0.1 \
  --lambda_rec 1.0 \
  --eval_step 200 \
  --early_stop 10

Important options:

Argument        Default  Description
--epoch         200      Maximum number of training epochs
--batch_size    256      Training batch size
--lr            1e-4     Adam learning rate
--top_p         20       Initial graph-mining neighborhood size
--top_p_end     8        Final neighborhood size after ramping
--top_p_ramp    20       Epochs used to ramp from top_p to top_p_end
--temperature   0.07     Temperature for multi-positive InfoNCE
--lambda_nce    1.0      Weight for the contrastive loss
--lambda_ortho  0.1      Weight for the orthogonal disentanglement loss
--lambda_rec    1.0      Weight for the reconstruction loss
--eval_step     200      Validation interval in training steps
--load_model    ""       Optional checkpoint path for continued training
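To make the interaction of `--top_p`, `--top_p_end`, and `--top_p_ramp` concrete, the sketch below computes a neighborhood size per epoch. It assumes a linear ramp with rounding and zero-based epochs; the actual schedule in works/GMOD.py may differ, and `top_p_at` is a hypothetical name.

```python
def top_p_at(epoch, top_p=20, top_p_end=8, top_p_ramp=20):
    """Neighborhood size at a given (zero-based) epoch.

    Assumption: linear interpolation from top_p to top_p_end over the first
    top_p_ramp epochs, then constant at top_p_end.
    """
    epoch = max(epoch, 0)
    if top_p_ramp <= 0 or epoch >= top_p_ramp:
        return top_p_end
    frac = epoch / top_p_ramp
    return round(top_p + (top_p_end - top_p) * frac)
```

Under these assumptions and the default values, the neighborhood starts at 20, reaches 14 halfway through the ramp, and stays at 8 from epoch 20 onward. A wide early neighborhood admits more candidate positives while embeddings are still noisy; a narrower late one keeps only the closest matches.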

Weights & Biases

To enable online W&B logging, create .wb_config.json in the project root:

{
  "WB_KEY": "Your wandb auth key"
}
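How works/GMOD.py reads this file is not documented here. The sketch below shows one way such a config could be consumed, falling back to offline mode when no key is found; `load_wandb_key` and `configure_logging_mode` are hypothetical helpers, and only the `WB_KEY` field and the `WANDB_MODE` variable come from this README.

```python
import json
import os
from pathlib import Path

def load_wandb_key(config_path=".wb_config.json"):
    """Return the WB_KEY value from the config file, or None if unavailable."""
    path = Path(config_path)
    if not path.is_file():
        return None
    with path.open(encoding="utf-8") as fh:
        config = json.load(fh)
    return config.get("WB_KEY") or None

def configure_logging_mode(config_path=".wb_config.json", env=os.environ):
    """Set WANDB_MODE=offline when no key is configured (illustrative policy)."""
    key = load_wandb_key(config_path)
    if key is None:
        env.setdefault("WANDB_MODE", "offline")
    return key
```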

The default W&B project/name in works/GMOD.py is:

project = GMOD
name = gmod

For offline logging:

export WANDB_MODE=offline

On Windows PowerShell:

$env:WANDB_MODE = "offline"

Evaluation

Evaluation is triggered automatically during training. The main metrics are:

  • valid/auc: validation verification AUC.
  • test/auc: test verification AUC.
  • test/auc_g: gender-constrained verification AUC.
  • test/ms_v2f: voice-to-face matching score.
  • test/ms_f2v: face-to-voice matching score.
  • test/map_v2f: voice-to-face retrieval mAP.
  • test/map_f2v: face-to-voice retrieval mAP.

When the validation AUC improves and passes the test threshold, the script runs the full test protocol and saves the model.
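For readers unfamiliar with the verification metric, the sketch below computes AUC directly from pair similarity scores and same-identity labels, reported as a percentage like the values in this README. It is a generic pairwise (Mann-Whitney) formulation, not the repository's evaluation code, and `verification_auc` is a hypothetical name.

```python
def verification_auc(scores, labels):
    """AUC in percent for voice-face pairs.

    scores: similarity per pair (higher means more likely the same identity).
    labels: 1 for a same-identity pair, 0 otherwise.
    Counts how often a positive pair outscores a negative pair; ties count half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative pair")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return 100.0 * wins / (len(pos) * len(neg))
```

The quadratic loop is fine for small checks; large protocols are usually scored with a rank-based implementation instead.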

Outputs

Checkpoints and result JSON files are saved under:

output/GMOD/gmod/

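Example output file names (in this example, the second auc value and the ms and map pairs match test/auc, test/ms_* and test/map_* in the result JSON shown after them):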

auc[87.92,87.73]_ms[87.04,87.04]_map[7.14,7.59].pkl
auc[87.92,87.73]_ms[87.04,87.04]_map[7.14,7.59].pkl.json

Example result JSON:

{
  "test/auc": 87.73,
  "test/auc_g": 77.37,
  "test/map_v2f": 7.14,
  "test/map_f2v": 7.59,
  "test/ms_v2f": 87.04,
  "test/ms_f2v": 87.04,
  "test/ms_v2f_g": 76.90,
  "test/ms_f2v_g": 76.21
}
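When several checkpoints accumulate, the result JSON files can be scanned to find the best run. The sketch below assumes every `*.json` file in the output directory has the flat structure shown above; `best_result` is a hypothetical helper.

```python
import json
from pathlib import Path

def best_result(output_dir="output/GMOD/gmod", metric="test/auc"):
    """Return (path, metrics) of the result JSON with the highest metric value.

    Files that lack the metric are skipped; returns (None, None) if none qualify.
    """
    best_path, best_metrics = None, None
    for path in sorted(Path(output_dir).glob("*.json")):
        with path.open(encoding="utf-8") as fh:
            metrics = json.load(fh)
        if metric not in metrics:
            continue
        if best_metrics is None or metrics[metric] > best_metrics[metric]:
            best_path, best_metrics = path, metrics
    return best_path, best_metrics
```

Passing a different key, for example `metric="test/map_v2f"`, ranks the runs by retrieval quality instead.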

Regenerating Evaluation Protocols

The repository includes scripts for regenerating the evaluation pickle files:

python scripts/1_verification.py
python scripts/2_matching.py
python scripts/3_retrieval.py

These scripts depend on the dataset/info/*.pkl metadata files from the vfal-eva dataset.

Preprocessing

Most users should use the extracted features from vfal-eva. If you need to rebuild features from raw audio/video, see:

  • preprocess/preprocess/: audio extraction, VAD, frame extraction, face cropping, and pose estimation.
  • preprocess/voice_extractor/: voice feature extraction.
  • preprocess/face_extractor/: face feature extraction.

The preprocessing pipeline depends on external models and raw VoxCeleb-style media paths, so feature reuse is recommended for reproducing GMOD.

Citation

If you use the dataset or evaluation setup from my-yy/vfal-eva, please also acknowledge that benchmark configuration.
