Code and information to replicate the experiments presented at http://webdatacommons.org/largescaleproductcorpus/v2/index.html
Requirements:
- Anaconda (or similar, for standard packages)
- py_entitymatching
- xgboost
- deepmatcher
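A quick way to confirm that the packages named in this README are available before opening the notebooks is sketched below. It is only an illustrative check, not part of the repository; the helper name `missing_packages` is hypothetical, and the list assumes each package is imported under the name shown.

```python
from importlib.util import find_spec

# Import names taken from the requirements named in this README.
REQUIRED = ["py_entitymatching", "xgboost", "deepmatcher", "textdistance"]


def missing_packages(names):
    """Return the names that cannot be located in the current environment."""
    # find_spec returns None for a top-level module that is not installed.
    return [name for name in names if find_spec(name) is None]
```

Calling `missing_packages(REQUIRED)` in the environment used for the notebooks returns an empty list when everything can be found.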
Please download and unzip the WDC LSPC v2 normalized data files into the corresponding folder under data/raw/wdc-lspc/
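If the downloaded archives have already been placed in their folders under data/raw/wdc-lspc/, the unzip step could be scripted roughly as follows. This is a minimal sketch under the assumption that the downloads are ordinary .zip files; the function name `unzip_in_place` is hypothetical and not part of the repository.

```python
import zipfile
from pathlib import Path


def unzip_in_place(root):
    """Extract every .zip found below `root` into the folder that contains it."""
    extracted = []
    for archive in sorted(Path(root).rglob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            # Keep the extracted files next to the archive they came from.
            zf.extractall(archive.parent)
        extracted.append(archive)
    return extracted
```

For example, `unzip_in_place("data/raw/wdc-lspc")` would process all archives below that folder and return the list of archives it handled.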
- Run the noise-training-sets notebook
- Run the process-to-magellan and process-to-wordcooc notebooks
Run the run-wordcooc, run-magellan, or run-deepmatcher notebook to replicate the learning-curve and label-noise experiments
Find the best parameter combinations in the file optimized-parameters.txt
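The notebook order described above can also be written down as a list of commands for non-interactive execution. The sketch below only builds the command lines; the `notebooks` folder, the `.ipynb` extension, and the helper name `nbconvert_commands` are assumptions, not something this README specifies.

```python
# Notebook order as listed in this README; the "notebooks" folder and the
# .ipynb extension are assumptions.
PREPARATION = ["noise-training-sets", "process-to-magellan", "process-to-wordcooc"]
EXPERIMENTS = ["run-wordcooc", "run-magellan", "run-deepmatcher"]


def nbconvert_commands(names, folder="notebooks"):
    """Build one `jupyter nbconvert` command per notebook, preserving order."""
    return [
        ["jupyter", "nbconvert", "--to", "notebook", "--execute", "--inplace",
         f"{folder}/{name}.ipynb"]
        for name in names
    ]
```

Each returned command can be passed to `subprocess.run(cmd, check=True)`, for instance the preparation notebooks followed by one of the experiment notebooks.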
To allow for gradient updates of the embedding layer, simply change the line
embed.weight.requires_grad = False
in models/core.py of the deepmatcher package so that it assigns True instead of False
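For those who prefer not to edit the file by hand, the same one-line change could be applied to the file's text as sketched here. The helper `enable_embedding_updates` is hypothetical, and the sketch assumes the installed deepmatcher version contains the line exactly as quoted above.

```python
OLD = "embed.weight.requires_grad = False"
NEW = "embed.weight.requires_grad = True"


def enable_embedding_updates(source):
    """Return `source` with the requires_grad line switched from False to True."""
    if OLD not in source:
        # Fail loudly rather than silently leaving the file unchanged.
        raise ValueError("expected line not found; the installed version may differ")
    return source.replace(OLD, NEW)
```

The function works on the text of models/core.py: read the file from the installed deepmatcher package, pass its contents through `enable_embedding_updates`, and write the result back. Where the package is installed depends on the environment.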
Additional requirement: textdistance
The notebook sample-training-sets contains the code used to build the four training sets for each product category
Project structure based on Cookiecutter Data Science: https://drivendata.github.io/cookiecutter-data-science/