The TensorFlow implementation of our ACL 2018 paper:
A Deep Relevance Model for Zero-Shot Document Filtering, Chenliang Li, Wei Zhou, Feng Ji, Yu Duan, Haiqing Chen
Paper url: http://aclweb.org/anthology/P18-1214
Requirements
- Python 3.5
- TensorFlow 1.2
- Numpy
- Traitlets
Prepare your dataset: first, prepare your own data; see Data Preparation below.
Configure: then, configure the model through the config file. Configurable parameters are listed under Model Configurations below.
See the example: sample.config
In addition, you need to change the zero-shot label settings in get_label.py.
(Make sure get_label.py and model.py are in the same directory; a sketch of such a setting follows below.)
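What a zero-shot label setting looks like depends on your dataset. The sketch below is purely hypothetical (the label names, function name, and settings table are placeholders, not the repository's actual code); it only illustrates the idea that each (zsl_num, zsl_type) pair selects which labels are held out from training:

```python
# Hypothetical sketch of a zero-shot label setting; get_label.py in the
# repository defines the real settings. Each (zsl_num, zsl_type) pair
# names one choice of labels held out from training as zero-shot labels.
ALL_LABELS = ['alt.atheism', 'comp.graphics', 'sci.space', 'rec.autos']

def get_zero_shot_split(zsl_num, zsl_type):
    """Return (train_labels, zero_shot_labels) for one experimental setting."""
    settings = {
        (1, 1): ['alt.atheism'],                # one held-out label, setting type 1
        (1, 2): ['sci.space'],                  # one held-out label, setting type 2
        (2, 1): ['alt.atheism', 'rec.autos'],   # two held-out labels, setting type 1
    }
    zero_shot = settings[(zsl_num, zsl_type)]
    train = [label for label in ALL_LABELS if label not in zero_shot]
    return train, zero_shot
```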
Training: pass the config file, training data, and validation data:

```
python model.py config-file \
  --train \
  --train_file: path to training data \
  --validation_file: path to validation data \
  --checkpoint_dir: directory to store/load model checkpoints \
  --load_model: True or False (continue training from an existing checkpoint, or start with a new model)
```

See example: sample-train.sh
Testing: pass the config file and testing data:

```
python model.py config-file \
  --test \
  --test_file: path to testing data \
  --test_size: size of testing data (number of testing samples) \
  --checkpoint_dir: directory to load the trained model \
  --output_score_file: file to output document scores
```

Relevance scores will be written to output_score_file, one score per line, in the same order as test_file.
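Because the scores come back one per line in the same order as test_file, pairing each test sample with its score is a one-pass zip. A minimal sketch (the file paths are placeholders):

```python
# Minimal sketch: pair each test line with its predicted relevance score.
with open('test_file') as tests, open('output_score_file') as scores:
    for line, score in zip(tests, scores):
        seed_words, document = line.rstrip('\n').split('\t')
        print(seed_words, float(score))
```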
Data Preparation
All seed words and documents must be mapped into sequences of integer term ids. Term ids start from 1.
Training Data Format
Each training sample is a tuple of (seed words, positive document, negative document)
seed_words \t positive_document \t negative_document
Example: 334,453,768 \t 123,435,657,878,6,556 \t 443,554,534,3,67,8,12,2,7,9
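As a sanity check for your generated files, a line in this format can be parsed as below (a minimal sketch; the names are illustrative). A testing line parses the same way, just with two fields instead of three:

```python
# Minimal sketch: parse one tab-separated training line into integer id lists.
def parse_train_line(line):
    seed_words, pos_doc, neg_doc = line.rstrip('\n').split('\t')
    to_ids = lambda field: [int(t) for t in field.split(',')]
    return to_ids(seed_words), to_ids(pos_doc), to_ids(neg_doc)

seeds, pos, neg = parse_train_line('334,453,768\t123,435,657,878,6,556\t443,554,534,3,67,8,12,2,7,9')
assert seeds == [334, 453, 768] and len(neg) == 10
```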
Testing Data Format
Each testing sample is a tuple of (seed words, document)
seed_words \t document
Example: 334,453,768 \t 123,435,657,878,6,556
Validation Data Format
The format is the same as the training data format.
Label Dict File Format
Each line is a tuple of (label_name, seed_words)
label_name/seed_words
Example: alt.atheism/atheist christian atheism god islamic
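Note that the label name and its seed words are separated by a single '/', and the seed words are space-separated word strings (which must then be mapped to term ids via the word2id file). A minimal parsing sketch (the function name is illustrative):

```python
# Minimal sketch: parse a label dict line of the form "label_name/seed_words".
def parse_label_line(line):
    label_name, seed_words = line.rstrip('\n').split('/', 1)
    return label_name, seed_words.split()

name, seeds = parse_label_line('alt.atheism/atheist christian atheism god islamic')
assert name == 'alt.atheism' and seeds == ['atheist', 'christian', 'atheism', 'god', 'islamic']
```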
Word2id File Format
Each line is a tuple of (word, id)
word id
Example: world 123
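Loading this file into a lookup table is straightforward; a minimal sketch, assuming one space-separated word/id pair per line:

```python
# Minimal sketch: load the "word id" file into a dict for word -> id lookups.
def load_word2id(path):
    word2id = {}
    with open(path, encoding='utf-8') as f:
        for line in f:
            word, wid = line.split()
            word2id[word] = int(wid)
    return word2id
```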
Embedding File Format
Each line is a tuple of (id, embedding)
id embedding
Example: 1 0.3 0.4 0.5 0.6 -0.4 -0.2
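A minimal sketch of loading these vectors into a matrix, assuming ids run from 1 to vocabulary_size so that row 0 can be left as padding (the function name and the zero-padding choice are assumptions, not the repository's code):

```python
import numpy as np

# Minimal sketch: read "id v1 v2 ... vk" lines into a (vocab_size + 1, dim)
# matrix. Row 0 is reserved because term ids start from 1.
def load_embeddings(path, vocab_size, dim):
    emb = np.zeros((vocab_size + 1, dim), dtype=np.float32)
    with open(path, encoding='utf-8') as f:
        for line in f:
            parts = line.split()
            emb[int(parts[0])] = [float(v) for v in parts[1:]]
    return emb
```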
Model Configurations
- BaseNN.embedding_size: word embedding dimension
- BaseNN.max_q_len: max query length
- BaseNN.max_d_len: max document length
- DataGenerator.max_q_len: max query length; should be the same as BaseNN.max_q_len
- DataGenerator.max_d_len: max document length; should be the same as BaseNN.max_d_len
- BaseNN.vocabulary_size: vocabulary size
- DataGenerator.vocabulary_size: vocabulary size
- BaseNN.batch_size: batch size
- BaseNN.max_epochs: max number of epochs to train
- BaseNN.eval_frequency: evaluate the model on the validation set every this many epochs
- BaseNN.checkpoint_steps: save a model checkpoint every this many epochs
Data
- DAZER.emb_in: path to the initial embeddings file
- DAZER.label_dict_path: path to the label dict file
- DAZER.word2id_path: path to the word2id file
Training Parameters
- DAZER.epsilon: epsilon for the Adam optimizer
- DAZER.embedding_size: word embedding dimension
- DAZER.vocabulary_size: vocabulary size of the dataset
- DAZER.kernal_width: width of the convolution kernel
- DAZER.kernal_num: number of kernels
- DAZER.regular_term: weight of the L2 loss
- DAZER.maxpooling_num: number of K-max pooling values
- DAZER.decoder_mlp1_num: number of hidden units in the first MLP of the relevance aggregation part
- DAZER.decoder_mlp2_num: number of hidden units in the second MLP of the relevance aggregation part
- DAZER.model_learning_rate: learning rate for the model (as opposed to the adversarial classifier)
- DAZER.adv_learning_rate: learning rate for the adversarial classifier
- DAZER.train_class_num: number of classes at training time
- DAZER.adv_term: weight of the adversarial loss when updating the model's parameters
- DAZER.zsl_num: number of zero-shot labels
- DAZER.zsl_type: type of zero-shot label setting (you may have multiple zero-shot settings with the same number of zero-shot labels; this indicates which setting you pick for the experiment; see get_label.py for more details)
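Since the repository lists Traitlets as a dependency, the config file is presumably a Traitlets-style Python config. A hypothetical fragment in that style (all values and paths below are placeholders, not recommended settings; see sample.config for the real format):

```python
# Hypothetical Traitlets-style config fragment; values are placeholders.
c = get_config()

c.BaseNN.embedding_size = 300
c.BaseNN.max_q_len = 10
c.BaseNN.max_d_len = 500
c.BaseNN.vocabulary_size = 100000
c.BaseNN.batch_size = 16
c.BaseNN.max_epochs = 30

c.DataGenerator.max_q_len = 10      # must match BaseNN.max_q_len
c.DataGenerator.max_d_len = 500     # must match BaseNN.max_d_len
c.DataGenerator.vocabulary_size = 100000

c.DAZER.emb_in = '/path/to/embedding_file'
c.DAZER.label_dict_path = '/path/to/label_dict_file'
c.DAZER.word2id_path = '/path/to/word2id_file'
c.DAZER.kernal_width = 5
c.DAZER.kernal_num = 50
c.DAZER.zsl_num = 1
c.DAZER.zsl_type = 1
```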