This repository contains the PyTorch implementation of the paper:
Joint Wasserstein Autoencoders for Aligning Multimodal Embeddings (ICCV Workshops 2019)
The method, Joint Wasserstein Autoencoders (JWAE), aligns embeddings from different modalities; the experiments use the COCO and Flickr30K datasets.
The code is written for Python 2.7.0 and CUDA 9.0.
Requirements:
- torch 0.3
- torchvision 0.3.0
- nltk 3.5
- gensim
- Punkt Sentence Tokenizer. Run the following in a Python session, then enter d punkt at the downloader prompt:
import nltk
nltk.download()
> d punkt
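For scripted or headless setups, the interactive prompt can be skipped. The following is a minimal sketch, assuming NLTK is installed; ensure_punkt is a hypothetical helper name and not part of this repository. It relies on nltk.data.find, which raises LookupError when a resource is missing, and on nltk.download("punkt"), which fetches the package by its identifier.

```python
def ensure_punkt():
    """Download the Punkt tokenizer models only if they are not found locally.

    Returns True if a download was attempted, False if Punkt was already present.
    (Hypothetical helper, not part of the repository.)
    """
    import nltk  # imported lazily so this file loads even before NLTK is installed

    try:
        # Raises LookupError when the resource is not in any NLTK data directory.
        nltk.data.find("tokenizers/punkt")
    except LookupError:
        nltk.download("punkt")
        return True
    return False
```

Calling ensure_punkt() once after creating the environment should be sufficient; later calls find the local copy and return False.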
To install the requirements:
conda config --add channels pytorch
conda config --add channels anaconda
conda config --add channels conda-forge
conda config --add channels conda-forge/label/cf202003
conda create -n <environment_name> --file requirements.txt
conda activate <environment_name>
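Once the environment is active, it can help to confirm which of the listed packages are importable. The following is a small sketch using only the standard library; installed_versions and REQUIRED are hypothetical names, and reading the __version__ attribute is a common convention rather than a guarantee.

```python
import importlib

# Package names taken from the requirements list above.
REQUIRED = ["torch", "torchvision", "nltk", "gensim"]


def installed_versions(names):
    """Map each package name to its version string, or None if it cannot be imported."""
    versions = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            versions[name] = None
        else:
            # Not every package exposes __version__, so fall back to a marker.
            versions[name] = getattr(module, "__version__", "unknown")
    return versions


if __name__ == "__main__":
    for name, version in sorted(installed_versions(REQUIRED).items()):
        print("%s: %s" % (name, version if version is not None else "NOT INSTALLED"))
```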
- The preprocessed COCO and Flickr30K datasets used in the experiments are based on SCAN and can be downloaded from:
  Place the downloaded datasets in the data folder.
- Run vocab.py to generate the vocabulary for the datasets:
python vocab.py --data_path data --data_name f30k_precomp
python vocab.py --data_path data --data_name coco_precomp
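To illustrate what generating a vocabulary typically involves, here is a minimal sketch of a frequency-thresholded word-to-index mapping. It is not the repository's vocab.py: the class SimpleVocabulary, the function build_vocab, the special tokens, and the threshold parameter are all assumptions made for illustration, and whitespace splitting stands in for a proper tokenizer such as Punkt.

```python
from collections import Counter


class SimpleVocabulary(object):
    """Hypothetical word <-> index mapping with an <unk> fallback."""

    def __init__(self):
        self.word2idx = {}
        self.idx2word = {}

    def add_word(self, word):
        if word not in self.word2idx:
            idx = len(self.word2idx)
            self.word2idx[word] = idx
            self.idx2word[idx] = word

    def __call__(self, word):
        # Unknown words map to the index of <unk>.
        return self.word2idx.get(word, self.word2idx["<unk>"])

    def __len__(self):
        return len(self.word2idx)


def build_vocab(captions, threshold=1):
    """Keep words that occur at least `threshold` times across all captions."""
    counter = Counter()
    for caption in captions:
        # Assumption: lowercase + whitespace split instead of a real tokenizer.
        counter.update(caption.lower().split())

    vocab = SimpleVocabulary()
    for token in ("<pad>", "<start>", "<end>", "<unk>"):
        vocab.add_word(token)
    for word in sorted(w for w, c in counter.items() if c >= threshold):
        vocab.add_word(word)
    return vocab
```

For example, build_vocab(["A dog runs", "a dog sleeps"], threshold=2) keeps only "a" and "dog" besides the four special tokens, and every other word is looked up as <unk>.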
Train a new JWAE model using the following:
python train.py --data_path "$DATA_PATH" --data_name coco_precomp --vocab_path "$VOCAB_PATH"
Evaluate the trained model using:
from vocab import Vocabulary
import evaluation
evaluation.evalrank("$CHECKPOINT_PATH", data_path="$DATA_PATH", split="test")
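Note that Python does not expand shell-style variables inside string literals, so "$CHECKPOINT_PATH" and "$DATA_PATH" above are placeholders to be replaced with real paths. If the paths are exported as environment variables, a sketch like the following could resolve them; resolve_path is a hypothetical helper, and the commented call simply repeats the evalrank call from above.

```python
import os


def resolve_path(template):
    """Expand environment variables such as $DATA_PATH in a path string."""
    path = os.path.expandvars(template)
    if "$" in path:
        # os.path.expandvars leaves unknown variables untouched, so a
        # remaining "$" signals a variable that was never set.
        raise ValueError("unset environment variable in: %s" % template)
    return path


# Inside the repository, the evaluation call would then read:
# from vocab import Vocabulary
# import evaluation
# evaluation.evalrank(resolve_path("$CHECKPOINT_PATH"),
#                     data_path=resolve_path("$DATA_PATH"), split="test")
```

Failing early on an unset variable avoids passing a literal "$DATA_PATH" string on to the evaluation code.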
@inproceedings{Mahajan:2019:JWA,
author = {Shweta Mahajan and Teresa Botschen and Iryna Gurevych and Stefan Roth},
booktitle = {ICCV Workshop on Cross-Modal Learning in Real World},
title = {Joint {W}asserstein Autoencoders for Aligning Multi-modal Embeddings},
year = {2019}
}