GMOD: Voice-Face Association Learning via Graph Mining and Orthogonal Disentanglement

GMOD is an unsupervised framework for voice-face association learning. It addresses two common limitations of existing methods: false-negative conflicts caused by treating cross-video same-identity samples as negatives, and unreliable pseudo-labels introduced by rigid clustering or prototype assumptions. GMOD constructs a global cross-modal similarity graph to mine latent positive samples for multi-positive contrastive learning, and uses orthogonal disentanglement to separate shared identity features from modality-private noise.
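
As a rough illustration of the graph-mining idea, the sketch below builds a cross-modal cosine-similarity matrix and marks mutually similar voice-face pairs as latent positives. The function name `mine_latent_positives`, the mutual top-p rule, and the use of NumPy are assumptions made for illustration only; the mining procedure actually used in works/GMOD.py may differ.

```python
import numpy as np

def mine_latent_positives(voice, face, top_p):
    """Return a boolean (N, N) mask of mined voice-face positive pairs.

    voice, face: (N, D) arrays already projected into a shared space,
                 where row i of each array comes from the same video.
    Assumption: pair (i, j) is kept only if face j is among voice i's
    top-p faces AND voice i is among face j's top-p voices.
    """
    v = voice / np.linalg.norm(voice, axis=1, keepdims=True)
    f = face / np.linalg.norm(face, axis=1, keepdims=True)
    sim = v @ f.T  # sim[i, j] = cosine(voice_i, face_j)

    n = sim.shape[0]
    k = min(top_p, n)
    rows = np.arange(n)[:, None]

    # Top-k faces for every voice (row-wise neighbors).
    v2f = np.zeros((n, n), dtype=bool)
    v2f[rows, np.argsort(-sim, axis=1)[:, :k]] = True

    # Top-k voices for every face, stored with the same [voice, face] layout.
    f2v = np.zeros((n, n), dtype=bool)
    f2v[np.argsort(-sim.T, axis=1)[:, :k], rows] = True

    mask = v2f & f2v
    np.fill_diagonal(mask, True)  # the co-occurring pair is always positive
    return mask
```

The resulting mask can serve as the positive set for a multi-positive contrastive loss, so that mined same-identity pairs from different videos are no longer pushed apart as negatives.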

Framework

Overview

This repository focuses on the GMOD training and evaluation pipeline. The dataset layout and extracted feature format are compatible with my-yy/vfal-eva, which can be used as the reference setup for preparing the environment and input files.

GMOD includes:

Graph mining over video/movie-level voice-face samples.
Multi-positive bidirectional InfoNCE training.
Orthogonal identity/nuisance disentanglement.
Reconstruction regularization for both modalities.
Validation and testing on verification, matching, and retrieval tasks.
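
The loss components listed above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the code in models/gmod_model.py: the function names, the squared-cosine form of the orthogonality penalty, and the weighted-sum combination are hypothetical, and the reconstruction term is taken as an already computed scalar.

```python
import numpy as np

def _log_softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def multi_positive_info_nce(voice, face, pos_mask, temperature=0.07):
    """Bidirectional InfoNCE where each anchor may have several positives.

    voice, face: (N, D) L2-normalized shared-identity embeddings.
    pos_mask:    (N, N) boolean; pos_mask[i, j] is True when voice i and
                 face j are treated as the same identity. Every row and
                 column must contain at least one True (e.g. the diagonal).
    """
    logits = voice @ face.T / temperature
    mask = pos_mask.astype(float)

    # Voice -> face: mean log-probability over each voice's positive faces.
    v2f = -(_log_softmax(logits, axis=1) * mask).sum(axis=1) / mask.sum(axis=1)
    # Face -> voice: same, but normalizing over voices (columns).
    f2v = -(_log_softmax(logits, axis=0) * mask).sum(axis=0) / mask.sum(axis=0)
    return float(0.5 * (v2f.mean() + f2v.mean()))

def orthogonality_penalty(shared, private):
    """Mean squared cosine between shared and private features (assumed form)."""
    s = shared / np.linalg.norm(shared, axis=1, keepdims=True)
    p = private / np.linalg.norm(private, axis=1, keepdims=True)
    return float(((s * p).sum(axis=1) ** 2).mean())

def total_loss(l_nce, l_ortho, l_rec,
               lambda_nce=1.0, lambda_ortho=0.1, lambda_rec=1.0):
    """Weighted sum of the three terms (assumed combination rule)."""
    return lambda_nce * l_nce + lambda_ortho * l_ortho + lambda_rec * l_rec
```

With an identity mask this reduces to standard symmetric InfoNCE; extra True entries in `pos_mask` add mined positives to the numerator side of the objective.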

Repository Structure

GMOD/
├── dataset/
│   ├── face_input.pkl
│   ├── voice_input.pkl
│   ├── info/
│   └── evals/
├── image/
│   └── model.png
├── loaders/
├── models/
│   └── gmod_model.py
├── preprocess/
├── scripts/
├── utils/
├── works/
│   └── GMOD.py
├── requirements.txt
└── readme.md

Installation

Create a Python environment with CUDA-enabled PyTorch, then install the dependencies used by GMOD:

pip install -r requirements.txt

The training code calls .cuda() directly, so it requires a CUDA-enabled PyTorch installation and a visible GPU; it will not run on a CPU-only machine without modification.
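
Before launching training, it can help to confirm that PyTorch actually sees a GPU. The helper below is a small sketch; `cuda_ready` is a hypothetical name and is not part of this repository.

```python
import importlib.util

def cuda_ready() -> bool:
    """Return True only if PyTorch is importable and reports a usable GPU."""
    if importlib.util.find_spec("torch") is None:
        return False  # PyTorch is not installed in this environment
    import torch
    return torch.cuda.is_available()

if __name__ == "__main__":
    print("CUDA available:", cuda_ready())
```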

Dataset

GMOD expects pre-extracted voice and face features. The expected file layout follows the vfal-eva convention:

dataset/
├── evals/
│   ├── test_matching_10.pkl
│   ├── test_matching_g.pkl
│   ├── test_matching.pkl
│   ├── test_retrieval.pkl
│   ├── test_verification_g.pkl
│   ├── test_verification.pkl
│   └── valid_verification.pkl
├── info/
│   ├── name2gender.pkl
│   ├── name2jpgs_wavs.pkl
│   ├── name2movies.pkl
│   ├── name2voice_id.pkl
│   ├── train_valid_test_names.pkl
│   └── works/
│       └── wen_weights.txt
├── face_input.pkl
└── voice_input.pkl

Input feature files:

face_input.pkl: a dictionary from face image paths to 512-dimensional face embeddings.
voice_input.pkl: a dictionary from voice clip paths to 192-dimensional voice embeddings.
dataset/info/*.pkl: metadata for names, genders, movies/videos, voice IDs, and train/valid/test splits.
dataset/evals/*.pkl: predefined evaluation protocols for verification, matching, and retrieval.
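
A quick sanity check of the two feature files might look like the sketch below. It assumes each value is a flat vector (a list or a 1-D array) whose length equals the embedding dimension; `check_features` is a hypothetical helper, not a function provided by the repository.

```python
import pickle

def check_features(path, expected_dim):
    """Load a {path: embedding} pickle and verify every embedding's length.

    Returns the number of entries. Only unpickle files you trust.
    """
    with open(path, "rb") as fh:
        features = pickle.load(fh)
    if not isinstance(features, dict):
        raise TypeError(f"{path}: expected a dict, got {type(features).__name__}")
    for key, vec in features.items():
        if len(vec) != expected_dim:
            raise ValueError(
                f"{path}: {key!r} has length {len(vec)}, expected {expected_dim}"
            )
    return len(features)
```

For the layout described above, this would be called as `check_features("dataset/face_input.pkl", 512)` and `check_features("dataset/voice_input.pkl", 192)`.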

Training

Run GMOD with:

python works/GMOD.py

Common arguments:

python works/GMOD.py \
  --epoch 200 \
  --batch_size 256 \
  --lr 1e-4 \
  --top_p 20 \
  --top_p_end 8 \
  --top_p_ramp 20 \
  --temperature 0.07 \
  --lambda_nce 1.0 \
  --lambda_ortho 0.1 \
  --lambda_rec 1.0 \
  --eval_step 200 \
  --early_stop 10

Important options:

Argument	Default	Description
`--epoch`	`200`	Maximum number of training epochs
`--batch_size`	`256`	Training batch size
`--lr`	`1e-4`	Adam learning rate
`--top_p`	`20`	Initial graph-mining neighborhood size
`--top_p_end`	`8`	Final neighborhood size after ramping
`--top_p_ramp`	`20`	Epochs used to ramp from `top_p` to `top_p_end`
`--temperature`	`0.07`	Temperature for multi-positive InfoNCE
`--lambda_nce`	`1.0`	Weight for the contrastive loss
`--lambda_ortho`	`0.1`	Weight for the orthogonal disentanglement loss
`--lambda_rec`	`1.0`	Weight for the reconstruction loss
`--eval_step`	`200`	Validation interval in training steps
`--load_model`	`""`	Optional checkpoint path for continued training
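
The three `top_p` options together describe a schedule for the neighborhood size. The sketch below assumes a linear ramp followed by a constant value; the interpolation rule and the name `top_p_at_epoch` are assumptions, and the schedule in works/GMOD.py may be shaped differently.

```python
def top_p_at_epoch(epoch, top_p=20, top_p_end=8, top_p_ramp=20):
    """Neighborhood size at a given epoch (0-based), assuming a linear ramp."""
    if top_p_ramp <= 0 or epoch >= top_p_ramp:
        return top_p_end  # ramp finished: hold the final size
    frac = max(epoch, 0) / top_p_ramp
    return round(top_p + frac * (top_p_end - top_p))
```

Under this assumption and the default values, the neighborhood shrinks from 20 at epoch 0 to 14 at epoch 10 and stays at 8 from epoch 20 onward.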

Weights & Biases

To enable online W&B logging, create .wb_config.json in the project root:

{
  "WB_KEY": "Your wandb auth key"
}
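
Reading this file from Python could look like the sketch below. `load_wb_key` is a hypothetical helper shown for illustration; how works/GMOD.py actually consumes the file is not documented here.

```python
import json
from pathlib import Path

def load_wb_key(project_root="."):
    """Return the WB_KEY string from .wb_config.json, or None if unavailable."""
    config_path = Path(project_root) / ".wb_config.json"
    if not config_path.is_file():
        return None
    with config_path.open(encoding="utf-8") as fh:
        return json.load(fh).get("WB_KEY")

# The returned key could then be passed to wandb.login(key=...).
```

Since the file holds a credential, it is worth keeping it out of version control.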

The default W&B project/name in works/GMOD.py is:

project = GMOD
name = gmod

For offline logging:

export WANDB_MODE=offline

On Windows PowerShell:

$env:WANDB_MODE = "offline"

Evaluation

Evaluation is triggered automatically during training. The main metrics are:

valid/auc: validation verification AUC.
test/auc: test verification AUC.
test/auc_g: gender-constrained verification AUC.
test/ms_v2f: voice-to-face matching score.
test/ms_f2v: face-to-voice matching score.
test/map_v2f: voice-to-face retrieval mAP.
test/map_f2v: face-to-voice retrieval mAP.
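
For reference, verification AUC can be computed from pairwise similarity scores as the probability that a matched voice-face pair scores higher than a mismatched one. The pure-Python sketch below illustrates the metric itself and is not the repository's evaluation code; `verification_auc` is a hypothetical name, and the percent scaling is chosen to match the example results in this README.

```python
def verification_auc(scores, labels):
    """AUC in percent; labels are 1 for matched pairs and 0 for mismatched pairs.

    Ties count as half a win. This is O(P * N) and meant for small examples.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one matched and one mismatched pair")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return 100.0 * wins / (len(pos) * len(neg))
```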

When the validation AUC improves and passes the test threshold, the script runs the full test protocol and saves the model.
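
The gating described above can be pictured roughly as follows. This sketch rests on several assumptions that the README does not confirm: that the test threshold is a fixed AUC value, and that `--early_stop` counts consecutive validations without improvement. `BestTracker` is a hypothetical class, not part of the repository.

```python
class BestTracker:
    """Track the best validation AUC and decide when to test or stop."""

    def __init__(self, test_threshold, early_stop=10):
        self.test_threshold = test_threshold  # assumed fixed AUC threshold
        self.early_stop = early_stop          # assumed patience, in validations
        self.best_auc = float("-inf")
        self.stale_evals = 0

    def update(self, valid_auc):
        """Return (run_full_test, should_stop) for one validation result."""
        if valid_auc > self.best_auc:
            self.best_auc = valid_auc
            self.stale_evals = 0
            return valid_auc >= self.test_threshold, False
        self.stale_evals += 1
        return False, self.stale_evals >= self.early_stop
```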

Outputs

Checkpoints and result JSON files are saved under:

output/GMOD/gmod/

Example output file names:

auc[87.92,87.73]_ms[87.04,87.04]_map[7.14,7.59].pkl
auc[87.92,87.73]_ms[87.04,87.04]_map[7.14,7.59].pkl.json
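
The metric values encoded in these file names can be recovered with a small parser such as the sketch below (`parse_result_name` is a hypothetical helper). Judging from the example result JSON in this README, the second `auc` value corresponds to test/auc and the `ms` and `map` pairs are ordered v2f, f2v; the first `auc` value is presumably the validation AUC, which is an assumption.

```python
import re

# Matches groups such as "auc[87.92,87.73]" and captures the tag and numbers.
_GROUP = re.compile(r"(auc|ms|map)\[([^\]]*)\]")

def parse_result_name(file_name):
    """Map each metric tag in the file name to its list of floats."""
    return {
        tag: [float(x) for x in values.split(",")]
        for tag, values in _GROUP.findall(file_name)
    }
```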

Example result JSON:

{
  "test/auc": 87.73,
  "test/auc_g": 77.37,
  "test/map_v2f": 7.14,
  "test/map_f2v": 7.59,
  "test/ms_v2f": 87.04,
  "test/ms_f2v": 87.04,
  "test/ms_v2f_g": 76.90,
  "test/ms_f2v_g": 76.21
}

Regenerating Evaluation Protocols

The repository includes scripts for regenerating the evaluation pickle files:

python scripts/1_verification.py
python scripts/2_matching.py
python scripts/3_retrieval.py

These scripts depend on the dataset/info/*.pkl metadata files from the vfal-eva dataset.
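
Because the scripts rely on those metadata files, a pre-flight check such as the sketch below can catch a missing file early. The file names are taken from the dataset layout shown in this README; `missing_info_files` is a hypothetical helper, and whether each script needs all five files is an assumption.

```python
from pathlib import Path

# Metadata files listed under dataset/info/ in the Dataset section.
REQUIRED_INFO = (
    "name2gender.pkl",
    "name2jpgs_wavs.pkl",
    "name2movies.pkl",
    "name2voice_id.pkl",
    "train_valid_test_names.pkl",
)

def missing_info_files(dataset_root="dataset"):
    """Return the names of required metadata files that are not present."""
    info_dir = Path(dataset_root) / "info"
    return [name for name in REQUIRED_INFO if not (info_dir / name).is_file()]
```

An empty list means all listed metadata files were found under `dataset/info/`.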

Preprocessing

Most users should use the extracted features from vfal-eva. If you need to rebuild features from raw audio/video, see:

preprocess/preprocess/: audio extraction, VAD, frame extraction, face cropping, and pose estimation.
preprocess/voice_extractor/: voice feature extraction.
preprocess/face_extractor/: face feature extraction.

The preprocessing pipeline depends on external models and raw VoxCeleb-style media paths, so feature reuse is recommended for reproducing GMOD.

Citation

If you use the dataset or evaluation setup from my-yy/vfal-eva, please also acknowledge that benchmark configuration.
