This is the code for the paper Design Challenges in Low-resource Cross-lingual Entity Linking (EMNLP'20): https://arxiv.org/abs/2005.00692
We also have an online demo and a paper page!
Good research is always accompanied by thorough and faithful references. If you use or extend our work, please cite the following paper:
```
@inproceedings{FSYZR20,
    author = {Xingyu Fu and Weijia Shi and Xiaodong Yu and Zian Zhao and Dan Roth},
    title = {{Design Challenges in Low-resource Cross-lingual Entity Linking}},
    booktitle = {Proc. of the Conference on Empirical Methods in Natural Language Processing (EMNLP)},
    year = {2020},
    url = "https://cogcomp.seas.upenn.edu/papers/FSYZR20.pdf",
    funding = {LORELEI},
}
```
Clone the repository from our GitHub page (don't forget to star us!) and install the requirements:

```
git clone https://github.com/zeyofu/EDL.git
pip install -r requirements.txt
```
For faster processing, we store the various maps (e.g., string to Wikipedia candidates, string to LORELEI KB candidates) in a MongoDB database collection. MongoDB stores various statistics and string-to-candidate indices that are used to compute candidates. To start the MongoDB daemon, run:
```
mongod --dbpath your_mongo_path
```

The `--dbpath` argument is where MongoDB creates the database and indexes. You can specify a different path, but you will then need to rebuild the indices below.
Note that MongoDB should run on the default port 27017.
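As an illustration, here is a minimal pymongo sketch of what such a string-to-candidate lookup looks like. The database and collection names (`candidate_db`, `string2wiki`) and the document fields are placeholders for this sketch, not the repository's actual schema; the real indices are built automatically by link_entity.py on first run.

```
from pymongo import MongoClient

# Connect to the local MongoDB daemon on the default port.
client = MongoClient("localhost", 27017)

# Placeholder database/collection names, for illustration only.
db = client["candidate_db"]
string2wiki = db["string2wiki"]

# Look up Wikipedia candidate entities for a mention string.
doc = string2wiki.find_one({"mention": "Barack Obama"})
if doc is not None:
    print(doc["candidates"])
```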
First, preprocess the Wikipedia dumps for the languages you care about using the wikidump_preprocessing code. Preprocessing produces a folder outwiki containing one subfolder per language Wikipedia. For example, the layout of the outwiki folder should be:
```
outwiki
├── enwiki
├── eswiki
├── zhwiki
└── ...
```
Set abs_path in utils/misc_utils.py to the outwiki path. The first time you run link_entity.py, the data will be loaded into MongoDB automatically (this may take a long time).
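Concretely, this is the kind of one-line change meant here (a sketch; the exact surrounding code in utils/misc_utils.py may differ):

```
# utils/misc_utils.py (sketch)
# Point abs_path at the folder produced by the Wikipedia preprocessing step.
abs_path = "/path/to/outwiki"
```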
- Get a `google_api_key` following https://developers.google.com/custom-search/v1/overview.
- Get a `google_api_cx` (search engine ID) from https://programmablesearchengine.google.com/, where you should set "Sites to search" to `*.wikipedia.org,map.google.com`.
- Pass `google_api_cx` and `google_api_key` as parameters to run src/link_entity.py (see the sketch below for a quick way to sanity-check them).
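To sanity-check the key and engine ID before a full run, here is a minimal sketch that queries the Google Custom Search JSON API directly; the query string is just an example:

```
import requests

GOOGLE_API_KEY = "your_google_api_key"   # from the Custom Search overview page
GOOGLE_API_CX = "your_search_engine_id"  # from programmablesearchengine.google.com

# Query the Custom Search JSON API; since the engine is restricted to
# *.wikipedia.org and map.google.com, the hits are mostly Wikipedia pages.
resp = requests.get(
    "https://www.googleapis.com/customsearch/v1",
    params={"key": GOOGLE_API_KEY, "cx": GOOGLE_API_CX, "q": "Kanye West"},
)
resp.raise_for_status()
for item in resp.json().get("items", []):
    print(item["title"], item["link"])
```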
Download the pretrained Multilingual BERT model, or train your own BERT model.
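For instance, the Hugging Face transformers library is one common way to obtain pretrained Multilingual BERT (a sketch; the repository itself may load the model differently):

```
from transformers import BertModel, BertTokenizer

# Download pretrained Multilingual BERT (cased) weights and tokenizer.
tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
model = BertModel.from_pretrained("bert-base-multilingual-cased")
```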
```
cd src
python link_entity.py --kbdir ${kbdir} --lang ${lang} --year ${year}
```
where kbdir is the directory of the preprocessed Wikipedia outwiki, lang is the language code used when preprocessing Wikipedia, and year is the version of the downloaded Wikipedia dump (e.g. "20191020").
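For example, here is an illustrative invocation for Spanish; the paths and keys are placeholders, and the exact spellings of the Google API flags should be checked against link_entity.py's argument parser:

```
cd src
python link_entity.py --kbdir /path/to/outwiki --lang es --year 20191020 \
    --google_api_key your_google_api_key --google_api_cx your_search_engine_id
```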