GMOD is an unsupervised framework for voice-face association learning. It addresses two common limitations of existing methods: false-negative conflicts caused by treating cross-video same-identity samples as negatives, and unreliable pseudo-labels introduced by rigid clustering or prototype assumptions. GMOD constructs a global cross-modal similarity graph to mine latent positive samples for multi-positive contrastive learning, and uses orthogonal disentanglement to separate shared identity features from modality-private noise.
This repository focuses on the GMOD training and evaluation pipeline. The dataset layout and extracted feature format are compatible with my-yy/vfal-eva, which can be used as the reference setup for preparing the environment and input files.
GMOD includes:
- Graph mining over video/movie-level voice-face samples.
- Multi-positive bidirectional InfoNCE training.
- Orthogonal identity/nuisance disentanglement.
- Reconstruction regularization for both modalities.
- Validation and testing on verification, matching, and retrieval tasks.
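The multi-positive bidirectional InfoNCE term listed above can be illustrated with a small standard-library sketch. This is not the code in works/GMOD.py. It assumes one common formulation, where the loss for each anchor is the mean negative log-probability over all of its mined positives, and the function names `multi_positive_info_nce` and `bidirectional_loss` are hypothetical. Real training would operate on PyTorch tensors rather than nested lists.

```python
import math


def multi_positive_info_nce(sim, pos_mask, temperature=0.07):
    """Mean multi-positive InfoNCE loss over the rows of a similarity matrix.

    sim[i][j]      : similarity between anchor i and candidate j.
    pos_mask[i][j] : True if candidate j is treated as a positive for anchor i.
    Assumed formulation: average log-probability over each anchor's positives.
    """
    losses = []
    for row, mask in zip(sim, pos_mask):
        logits = [s / temperature for s in row]
        peak = max(logits)  # subtract the max for numerical stability
        log_denom = peak + math.log(sum(math.exp(l - peak) for l in logits))
        pos_log_probs = [l - log_denom for l, is_pos in zip(logits, mask) if is_pos]
        if pos_log_probs:  # anchors without positives contribute nothing
            losses.append(-sum(pos_log_probs) / len(pos_log_probs))
    return sum(losses) / len(losses) if losses else 0.0


def bidirectional_loss(sim, pos_mask, temperature=0.07):
    """Average the voice-to-face loss and the face-to-voice loss (transposed)."""
    sim_t = [list(col) for col in zip(*sim)]
    mask_t = [list(col) for col in zip(*pos_mask)]
    forward = multi_positive_info_nce(sim, pos_mask, temperature)
    backward = multi_positive_info_nce(sim_t, mask_t, temperature)
    return 0.5 * (forward + backward)
```

With a single positive per row this reduces to ordinary InfoNCE. Extra entries in `pos_mask` are how mined positives from other videos stop being pushed away as negatives.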
GMOD/
├── dataset/
│ ├── face_input.pkl
│ ├── voice_input.pkl
│ ├── info/
│ └── evals/
├── image/
│ └── model.png
├── loaders/
├── models/
│ └── gmod_model.py
├── preprocess/
├── scripts/
├── utils/
├── works/
│ └── GMOD.py
├── requirements.txt
└── readme.md
Create a Python environment with CUDA-enabled PyTorch, then install the dependencies used by GMOD:
pip install -r requirements.txt
The training code calls .cuda() directly, so a CUDA-enabled PyTorch installation and an available GPU are expected.
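Before launching a run, it can help to confirm that PyTorch actually sees a GPU. The check below is a minimal sketch and is not part of the repository. The helper name `cuda_ready` is hypothetical, and the only PyTorch call it relies on is `torch.cuda.is_available()`.

```python
import importlib.util


def cuda_ready():
    """Return True only if PyTorch is importable and reports a usable CUDA device."""
    if importlib.util.find_spec("torch") is None:
        return False  # PyTorch is not installed in this environment
    import torch  # imported lazily so the check also works without PyTorch

    return bool(torch.cuda.is_available())


if __name__ == "__main__":
    print("CUDA ready:", cuda_ready())
```

If this prints `CUDA ready: False`, the direct .cuda() calls in the training code are likely to fail.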
GMOD expects pre-extracted voice and face features. The expected file layout follows the vfal-eva convention:
dataset/
├── evals/
│ ├── test_matching_10.pkl
│ ├── test_matching_g.pkl
│ ├── test_matching.pkl
│ ├── test_retrieval.pkl
│ ├── test_verification_g.pkl
│ ├── test_verification.pkl
│ └── valid_verification.pkl
├── info/
│ ├── name2gender.pkl
│ ├── name2jpgs_wavs.pkl
│ ├── name2movies.pkl
│ ├── name2voice_id.pkl
│ ├── train_valid_test_names.pkl
│ └── works/
│ └── wen_weights.txt
├── face_input.pkl
└── voice_input.pkl
Input feature files:
- face_input.pkl: a dictionary from face image paths to 512-dimensional face embeddings.
- voice_input.pkl: a dictionary from voice clip paths to 192-dimensional voice embeddings.
- dataset/info/*.pkl: metadata for names, genders, movies/videos, voice IDs, and train/valid/test splits.
- dataset/evals/*.pkl: predefined evaluation protocols for verification, matching, and retrieval.
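A quick sanity check of the two feature files can catch dimension mismatches before training starts. The sketch below assumes each embedding is a flat sequence (a list or a 1-D array) that supports `len()`. The helper names `load_features` and `check_features` are hypothetical and not part of the repository. Only unpickle files from a source you trust.

```python
import pickle


def load_features(path):
    """Load a pickled {path: embedding} dictionary."""
    with open(path, "rb") as f:
        return pickle.load(f)


def check_features(features, expected_dim):
    """Return the keys whose embedding length differs from expected_dim."""
    return [key for key, emb in features.items() if len(emb) != expected_dim]


if __name__ == "__main__":
    # Dimensions taken from the descriptions above: 512 for faces, 192 for voices.
    for path, dim in (("dataset/face_input.pkl", 512), ("dataset/voice_input.pkl", 192)):
        bad = check_features(load_features(path), dim)
        print(f"{path}: {len(bad)} entries with unexpected dimension")
```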
Run GMOD with:
python works/GMOD.py
Common arguments:
python works/GMOD.py \
--epoch 200 \
--batch_size 256 \
--lr 1e-4 \
--top_p 20 \
--top_p_end 8 \
--top_p_ramp 20 \
--temperature 0.07 \
--lambda_nce 1.0 \
--lambda_ortho 0.1 \
--lambda_rec 1.0 \
--eval_step 200 \
--early_stop 10
Important options:
| Argument | Default | Description |
|---|---|---|
| --epoch | 200 | Maximum number of training epochs |
| --batch_size | 256 | Training batch size |
| --lr | 1e-4 | Adam learning rate |
| --top_p | 20 | Initial graph-mining neighborhood size |
| --top_p_end | 8 | Final neighborhood size after ramping |
| --top_p_ramp | 20 | Epochs used to ramp from top_p to top_p_end |
| --temperature | 0.07 | Temperature for multi-positive InfoNCE |
| --lambda_nce | 1.0 | Weight for the contrastive loss |
| --lambda_ortho | 0.1 | Weight for the orthogonal disentanglement loss |
| --lambda_rec | 1.0 | Weight for the reconstruction loss |
| --eval_step | 200 | Validation interval in training steps |
| --load_model | "" | Optional checkpoint path for continued training |
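To make the interaction between --top_p, --top_p_end, and --top_p_ramp concrete, here is one possible schedule. The table only says that the neighborhood size is ramped. The linear interpolation, the zero-based epoch index, and the function name `top_p_at_epoch` are all assumptions made for illustration, so check works/GMOD.py for the actual rule.

```python
def top_p_at_epoch(epoch, top_p=20, top_p_end=8, top_p_ramp=20):
    """Neighborhood size at a given epoch, assuming a linear ramp.

    Starts at top_p for epoch 0 and reaches top_p_end once epoch >= top_p_ramp.
    """
    if top_p_ramp <= 0 or epoch >= top_p_ramp:
        return top_p_end
    frac = max(epoch, 0) / top_p_ramp  # fraction of the ramp completed
    return round(top_p + frac * (top_p_end - top_p))


if __name__ == "__main__":
    for epoch in (0, 5, 10, 20, 50):
        print(epoch, top_p_at_epoch(epoch))
```

Under these assumptions the default settings start with 20 neighbors and settle at 8 from epoch 20 onward, so positive mining becomes stricter as the embeddings improve.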
To enable online W&B logging, create .wb_config.json in the project root:
{
"WB_KEY": "Your wandb auth key"
}
The default W&B project/name in works/GMOD.py is:
project = GMOD
name = gmod
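How works/GMOD.py consumes .wb_config.json is not shown here. As a rough sketch, the key could be read and exported through the `WANDB_API_KEY` environment variable that W&B recognizes, falling back to offline mode when no key is found. The helper name `load_wandb_key` is hypothetical.

```python
import json
import os


def load_wandb_key(config_path=".wb_config.json"):
    """Return the WB_KEY value from the config file, or None if unavailable."""
    if not os.path.isfile(config_path):
        return None
    with open(config_path, "r", encoding="utf-8") as f:
        return json.load(f).get("WB_KEY")


if __name__ == "__main__":
    key = load_wandb_key()
    if key:
        # Do not overwrite a key that is already set in the environment.
        os.environ.setdefault("WANDB_API_KEY", key)
    else:
        # No key available: fall back to offline logging.
        os.environ.setdefault("WANDB_MODE", "offline")
```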
For offline logging:
export WANDB_MODE=offline
On Windows PowerShell:
$env:WANDB_MODE = "offline"
Evaluation is triggered automatically during training. The main metrics are:
- valid/auc: validation verification AUC.
- test/auc: test verification AUC.
- test/auc_g: gender-constrained verification AUC.
- test/ms_v2f: voice-to-face matching score.
- test/ms_f2v: face-to-voice matching score.
- test/map_v2f: voice-to-face retrieval mAP.
- test/map_f2v: face-to-voice retrieval mAP.
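For readers unfamiliar with verification AUC, the sketch below computes it from pairwise similarity scores and binary same-identity labels, using the standard definition: the probability that a positive pair scores higher than a negative pair. It is an illustration only, not the evaluation code in this repository. The function name `verification_auc` is hypothetical, and the result is scaled to a percentage to match the reported numbers.

```python
def verification_auc(scores, labels):
    """AUC in percent: chance that a positive pair outscores a negative pair.

    scores : similarity score for each voice-face pair.
    labels : 1 if the pair shares an identity, 0 otherwise.
    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative pair")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return 100.0 * wins / (len(pos) * len(neg))
```

This pairwise form is quadratic in the number of pairs. It is fine for illustration, but a rank-based implementation is preferable for full evaluation lists.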
When the validation AUC improves and passes the test threshold, the script runs the full test protocol and saves the model.
Checkpoints and result JSON files are saved under:
output/GMOD/gmod/
Example output file names:
auc[87.92,87.73]_ms[87.04,87.04]_map[7.14,7.59].pkl
auc[87.92,87.73]_ms[87.04,87.04]_map[7.14,7.59].pkl.json
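When comparing many runs, the metrics can be recovered from the file names alone. The parser below is a minimal sketch based only on the example names above. The function name `parse_checkpoint_name` is hypothetical, and because the meaning of each position inside the brackets is not spelled out here, it returns the raw number lists without labeling them.

```python
import re

# Matches names such as auc[87.92,87.73]_ms[87.04,87.04]_map[7.14,7.59].pkl
_NAME_RE = re.compile(
    r"auc\[(?P<auc>[^\]]+)\]_ms\[(?P<ms>[^\]]+)\]_map\[(?P<map>[^\]]+)\]"
)


def parse_checkpoint_name(name):
    """Extract the bracketed number groups from a checkpoint file name."""
    match = _NAME_RE.search(name)
    if match is None:
        raise ValueError(f"unrecognized checkpoint name: {name!r}")
    return {
        key: [float(x) for x in value.split(",")]
        for key, value in match.groupdict().items()
    }
```

The same pattern also matches the accompanying .pkl.json file name, since only the bracketed prefix is inspected.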
Example result JSON:
{
"test/auc": 87.73,
"test/auc_g": 77.37,
"test/map_v2f": 7.14,
"test/map_f2v": 7.59,
"test/ms_v2f": 87.04,
"test/ms_f2v": 87.04,
"test/ms_v2f_g": 76.90,
"test/ms_f2v_g": 76.21
}
The repository includes scripts for regenerating the evaluation pickle files:
python scripts/1_verification.py
python scripts/2_matching.py
python scripts/3_retrieval.py
These scripts depend on the dataset/info/*.pkl metadata files from the vfal-eva dataset.
Most users should use the extracted features from vfal-eva. If you need to rebuild features from raw audio/video, see:
- preprocess/preprocess/: audio extraction, VAD, frame extraction, face cropping, and pose estimation.
- preprocess/voice_extractor/: voice feature extraction.
- preprocess/face_extractor/: face feature extraction.
The preprocessing pipeline depends on external models and raw VoxCeleb-style media paths, so feature reuse is recommended for reproducing GMOD.
If you use the dataset or evaluation setup from my-yy/vfal-eva, please also acknowledge that benchmark configuration.