By Dongze Hao, Qunbo Wang and Jing Liu
This is the official implementation of the paper. In this paper, we propose a Semantic-Visual Graph Reasoning framework (SVG) for VisDial. Specifically, we first construct a semantic graph to capture the semantic relationships between different entities in the current question and the dialog history. Secondly, we construct a semantics-aware visual graph to capture high-level visual semantics, including key objects of the image and their visual relationships. Extensive experimental results on the VisDial v0.9 and v1.0 datasets show that our method achieves superior performance compared to state-of-the-art models across most evaluation metrics.
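For intuition only, the sketch below shows the general shape of one graph-reasoning step over a set of node features (e.g., detected objects), in the spirit of the visual graph described above. This is a generic attention-based message-passing step, not the paper's actual SVG module; the class name, dimensions, and update rule are illustrative assumptions.

```python
# Generic graph-attention message-passing step, for intuition only.
# NOT the paper's SVG module; names, shapes, and the update rule are assumptions.
import torch
import torch.nn as nn

class GraphReasoningStep(nn.Module):
    """Each node attends over all nodes (a fully connected graph)
    and aggregates the attended features with a residual update."""
    def __init__(self, dim: int):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)

    def forward(self, nodes: torch.Tensor) -> torch.Tensor:
        # nodes: (batch, num_nodes, dim), e.g. 36 Faster-RCNN object features
        q, k, v = self.query(nodes), self.key(nodes), self.value(nodes)
        attn = torch.softmax(q @ k.transpose(1, 2) / nodes.size(-1) ** 0.5, dim=-1)
        return nodes + attn @ v  # updated node states

nodes = torch.randn(2, 36, 512)           # toy batch of visual-graph nodes
updated = GraphReasoningStep(512)(nodes)  # -> (2, 36, 512)
```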
- Set up the environment

  ```sh
  conda create -n svg python=3.8
  conda activate svg
  conda install pytorch==1.7.0 torchvision==0.8.0 torchaudio==0.7.0 cudatoolkit=10.2 -c pytorch
  pip install tqdm pyyaml nltk setproctitle
  ```

- Download the data
  - Download the VisDial v0.9 and v1.0 dialog JSON files from here and keep them under the `$PROJECT_ROOT/data/v0.9` and `$PROJECT_ROOT/data/v1.0` directories, respectively.
  - batra-mlp-lab provides the word counts for the VisDial v1.0 train split: `visdial_1.0_word_counts_train.json`. They are used to build the vocabulary. Keep the file under the `$PROJECT_ROOT/data/v1.0` directory.
  - batra-mlp-lab provides Faster-RCNN image features pre-trained on Visual Genome. Keep them under `$PROJECT_ROOT/data/visdial_1.0_img` and set the argument `img_feature_type` to `faster_rcnn_x101` in the `config/hparams.py` file (a minimal feature-loading sketch is shown after this list).
    - `features_faster_rcnn_x101_train.h5`: Bottom-up features of 36 proposals from images of the `train` split.
    - `features_faster_rcnn_x101_val.h5`: Bottom-up features of 36 proposals from images of the `val` split.
    - `features_faster_rcnn_x101_test.h5`: Bottom-up features of 36 proposals from images of the `test` split.
  - gicheonkang provides pre-extracted Faster-RCNN image features, which include bounding-box information. Set the argument `img_feature_type` to `dan_faster_rcnn_x101` in the `config/hparams.py` file.
    - `train_btmup_f.hdf5`: Bottom-up features of 10 to 100 proposals from images of the `train` split (32 GB).
    - `train_imgid2idx.pkl`: `image_id` to bbox index file for the `train` split.
    - `val_btmup_f.hdf5`: Bottom-up features of 10 to 100 proposals from images of the `validation` split (0.5 GB).
    - `val_imgid2idx.pkl`: `image_id` to bbox index file for the `val` split.
    - `test_btmup_f.hdf5`: Bottom-up features of 10 to 100 proposals from images of the `test` split (2 GB).
    - `test_imgid2idx.pkl`: `image_id` to bbox index file for the `test` split.
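As a quick sanity check that the feature files are in place, they can be opened with `h5py`. Note that the dataset key `features` below is an assumption, so list the file's keys first to see what it actually contains.

```python
# Sanity-check a downloaded feature file; the "features" key is an assumption,
# so inspect f.keys() first to see the actual layout.
import h5py

with h5py.File("data/visdial_1.0_img/features_faster_rcnn_x101_train.h5", "r") as f:
    print(list(f.keys()))       # datasets actually stored in the file
    feats = f["features"]       # hypothetical key: (num_images, 36, feat_dim)
    print(feats.shape, feats.dtype)
```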
- Preprocess the data
  - Download the GloVe pretrained word vectors from here, and keep `glove.6B.300d.txt` under the `$PROJECT_ROOT/data/word_embeddings/glove` directory, then run (an illustrative sketch of this step follows this list):

    ```sh
    python data/preprocess/init_glove.py
    ```
  - Preprocess the textual inputs:

    ```sh
    python data/data_utils.py
    ```
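For readers unfamiliar with the GloVe step, the sketch below shows what such an initialization typically does: parse `glove.6B.300d.txt` and stack one 300-d vector per vocabulary word. It is an illustration of the general recipe, not the repository's `init_glove.py`; the toy vocabulary and zero-vector fallback are assumptions.

```python
# Minimal illustration of building a GloVe-initialized embedding matrix;
# NOT the repository's init_glove.py, just the general recipe.
import numpy as np

def load_glove(path: str) -> dict:
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *vals = line.rstrip().split(" ")
            vectors[word] = np.asarray(vals, dtype=np.float32)
    return vectors

glove = load_glove("data/word_embeddings/glove/glove.6B.300d.txt")
vocab = ["<pad>", "<unk>", "dog", "frisbee"]  # toy vocabulary
matrix = np.stack([glove.get(w, np.zeros(300, dtype=np.float32)) for w in vocab])
print(matrix.shape)  # (4, 300); out-of-vocabulary words fall back to zeros
```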
- Train the model

  ```sh
  python main.py --model svg --version 1.0
  ```
- Evaluate the model

  ```sh
  python main.py --model svg --evaluate /path/to/checkpoint.pth --eval_split val --version 1.0
  ```
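Evaluation on VisDial is rank-based: each ground-truth answer is ranked among 100 candidate answers. As a reference, here is a minimal sketch of the standard sparse metrics (MRR, R@k, mean rank) computed from those ranks; it is illustrative only, not the repository's evaluation code.

```python
# Sketch of the standard rank-based VisDial metrics (MRR, R@k, mean rank),
# computed from the 1-indexed rank of each ground-truth answer among the
# 100 candidates; illustrative only, not the repo's evaluation code.
import torch

def sparse_metrics(gt_ranks: torch.Tensor) -> dict:
    ranks = gt_ranks.float()
    return {
        "mrr": (1.0 / ranks).mean().item(),
        **{f"r@{k}": (ranks <= k).float().mean().item() for k in (1, 5, 10)},
        "mean": ranks.mean().item(),
    }

print(sparse_metrics(torch.tensor([1, 3, 12, 7])))
# {'mrr': ~0.39, 'r@1': 0.25, 'r@5': 0.5, 'r@10': 0.75, 'mean': 5.75}
```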
This code is implemented as a fork of batra-mlp-lab/visdial-challenge-starter-pytorch and builds on yuleiniu/rva.