Code and Data for Language Modeling with Editable External Knowledge.
To set up your environment, run:

```bash
conda create -n mem_rewrite python=3.11
conda activate mem_rewrite
# get pytorch with a version of CUDA compatible with your machine
conda install pytorch pytorch-cuda=12.1 -c pytorch -c nvidia
# install other requirements
bash setup.sh
```
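After installation, you can optionally confirm that the PyTorch build you installed can see CUDA. This quick check is not part of `setup.sh`; it is just a convenience:

```bash
# optional sanity check: print the PyTorch version and whether CUDA is visible
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```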
To run Mixtral, set your TogetherAI API token:

```bash
export TOGETHER_API_KEY=<TOGETHER_API_KEY>
```

Set your OpenAI API token if you wish to run GPT* models:

```bash
export OPENAI_API_KEY=<OPENAI_API_KEY>
```

To run ERASE on the CLARK-News dataset, use:
```bash
python lm_eval.py \
    --dataset news \
    --datapath CLARK_news/ \
    (--model_name [mistralai/Mixtral-8x7B-Instruct-v0.1|meta-llama/Llama-3-8b-chat-hf]) \
    (--local_model_path <local_model_fp>) \
    --context_length [2048|4096] \
    --save_as_facts \
    --retrieve_facts similarity \
    (--overwrite_facts similarity --edit_hops 1)
```

- `--model_name` sets the model name for querying the TogetherAI API (for open-source models) or the OpenAI API (for GPT* models). If this flag is set, the respective API is queried for model inference; otherwise, a local model is queried.
- `--local_model_path` sets the filepath to a local copy of a Huggingface Instruct model. One of `--model_name` or `--local_model_path` must be set.
- `--context_length` sets the context window of the model.
- `--save_as_facts` toggles saving entries to the KB as facts (rather than as passages).
- `--retrieve_facts` sets how KB entries are retrieved. Set it to `similarity` for dense retrieval. To turn off retrieval, do not include this flag.
- `--overwrite_facts` toggles updating existing KB entries according to new documents. Set it to `similarity` to use dense retrieval to find the facts to update. To turn off updating behavior, do not include this flag.
- `--edit_hops` sets how many "hops" of retrieval to perform when updating existing entries. For each hop beyond the first, the retriever performs another round of retrieval based on similarity to the facts retrieved in the previous round. This is set to 1 by default.
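For example, to evaluate Llama-3 through the TogetherAI API on CLARK-News with fact-based KB entries, dense retrieval, and one-hop fact overwriting, the template above instantiates as:

```bash
# example run: Llama-3-8B chat via the TogetherAI API, facts + dense retrieval + overwriting
python lm_eval.py \
    --dataset news \
    --datapath CLARK_news/ \
    --model_name meta-llama/Llama-3-8b-chat-hf \
    --context_length 4096 \
    --save_as_facts \
    --retrieve_facts similarity \
    --overwrite_facts similarity \
    --edit_hops 1
```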
The CLARK-News dataset is available under CLARK_news.
If you want to collect more data, you may run our data collection process; an end-to-end sketch of the full pipeline is given after the steps below.
- Get Wikidata triples that change over time:

  ```bash
  python script/get_wikidata_triples.py --data_dir <output_dir>
  ```

  This saves the final triples to `<output_dir>/property_to_results.csv`.
- Get candidate sources for each fact from Google:

  ```bash
  python script/extract_queries.py \
      --source_csv <csv_of_wikidata_triples> \
      --target_csv <csv_with_candidate_sources>
  ```

  where `csv_of_wikidata_triples` is the filepath to the CSV file from step 1.
  This populates `csv_with_candidate_sources` with a list of candidate sources from Google.
- Get human-validated annotations (launch the annotation interface):

  ```bash
  python AnnotationInterface/webserver.py \
      --source_file <csv_with_candidate_sources> \
      --target_file <csv_with_human_validated_sources> \
      --download_date <download_date>
  ```

  where `csv_with_candidate_sources` is the filepath to the CSV file from step 2.
  This populates `csv_with_human_validated_sources` with human annotations.
  `download_date` is the date that step 2 was run, in YYYY-MM-DD format. This is needed to infer the origin date of articles mined from Google.
- Pull text of sources from links:

  ```bash
  python script/pull_external_sources.py \
      --edits_file <csv_with_human_validated_sources> \
      --output_dir <output_dir_of_sources>
  ```

- Automated validation of round 1 annotations:
  ```bash
  python script/check_annotations.py  # display annotations in annotations.html
  ```

- Second round of human annotation to validate round 1 (launch the checking interface):
  ```bash
  python CheckInterface/webserver.py
  ```

- Make questions from Wikidata relations:
  ```bash
  python script/generate_wikidata_questions.py \
      --wikidata_csv <csv_with_human_validated_sources> \
      --output_dir <qs_output_dir>
  ```

Coming soon.
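For reference, the data collection steps above chain together roughly as follows. This is a minimal sketch: the file and directory names are placeholders (not required by the scripts), the date is only an example, and the two webserver steps launch interactive interfaces rather than batch jobs.

```bash
# sketch of the full data collection pipeline (placeholder paths)
python script/get_wikidata_triples.py --data_dir data/
python script/extract_queries.py \
    --source_csv data/property_to_results.csv \
    --target_csv data/candidate_sources.csv
python AnnotationInterface/webserver.py \
    --source_file data/candidate_sources.csv \
    --target_file data/validated_sources.csv \
    --download_date 2024-06-15   # date step 2 was run (YYYY-MM-DD)
python script/pull_external_sources.py \
    --edits_file data/validated_sources.csv \
    --output_dir data/sources/
python script/check_annotations.py   # writes annotations.html
python CheckInterface/webserver.py
python script/generate_wikidata_questions.py \
    --wikidata_csv data/validated_sources.csv \
    --output_dir data/questions/
```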
To cite this work, you may use the following BibTeX entry:

```bibtex
@misc{li2024language,
    title={Language Modeling with Editable External Knowledge},
    author={Belinda Z. Li and Emmy Liu and Alexis Ross and Abbas Zeitoun and Graham Neubig and Jacob Andreas},
    year={2024},
    eprint={2406.11830},
    archivePrefix={arXiv},
}
```