Columbia Reviews Summarization
The data crawler was written before the class and was tweaked and extended during the project. It is available in a separate repo: soid/columbia-catalog-scraper
git clone https://github.com/soid/columbia-catalog-scraper
cd columbia-catalog-scraper
pip3 install -r requirements.txt
./run-script.sh scripts/wiki_search_train.py
./run-script.sh scripts/wiki_article_train.py
./run-crawler.sh -p
For convenience, raw data is available for download in a separate repository: soid/columbia-catalog-data
For annotation, the project uses the JupyterLab notebook data-annotation.ipynb.
It requires installing JupyterLab and ipywidgets.
During the annotation process it refers to my website peqod.com, which is located in a separate repo: soid/peqod.
The annotated data is saved in culpa.text.jsonl, and each annotated example is removed from culpa.jsonl; both files use the JSON-object-per-line (JSONL) format. The training data is additionally saved in *.json files, each containing a single JSON object, for convenience.
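For reference, here is a minimal sketch of loading both formats in Python, assuming only that each line of the *.jsonl files is a standalone JSON object (field names depend on the annotation schema):

import json

# Load a JSON-object-per-line file, e.g. culpa.jsonl or culpa.text.jsonl
def load_jsonl(path):
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

# Load a single-object training file, e.g. culpa.test.json
def load_json(path):
    with open(path) as f:
        return json.load(f)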
We use the COOP model to establish the strongest baseline. Our modifications to the original paper's code are located in a separate repository: soid/coop-cusum
Retrieve the CuSum data for COOP:
git clone git@github.com:soid/coop-cusum.git
cd coop-cusum
python scripts/get_summ.py culpa data/culpa
cp ../path/to/CuSum/culpa.jsonl data/culpa/train1.jsonl
python scripts/preprocess.py culpa data/culpa/train1.jsonl > data/culpa/train.jsonl
cat data/culpa/train.jsonl | sort -R > data/culpa/train2.jsonl
mv data/culpa/train2.jsonl data/culpa/train.jsonl
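Note that sort -R shuffles by a random hash of the keys, so duplicate lines end up adjacent; if a true random permutation is preferred, a small Python shuffle (a sketch, not part of the original pipeline) can be used instead:

import random

# Shuffle the preprocessed training examples line by line
with open("data/culpa/train.jsonl") as f:
    lines = f.readlines()
random.shuffle(lines)
with open("data/culpa/train.jsonl", "w") as f:
    f.writelines(lines)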
Generate the SentencePiece vocabulary (parameters in scripts/spm_train.py can be adjusted if desired):
python3 scripts/spm_train.py data/culpa/train.jsonl data/sentencepiece/culpa
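spm_train.py wraps SentencePiece training; the following is a rough sketch of what such a script typically does, where the review field name, vocab_size, and model_type are assumptions to be matched against the actual script:

import json
import os
import sentencepiece as spm

# Dump raw review text from the JSONL training file (the "reviews" field is an assumption)
with open("data/culpa/train_text.txt", "w") as out:
    with open("data/culpa/train.jsonl") as f:
        for line in f:
            for review in json.loads(line).get("reviews", []):
                out.write(review + "\n")

# Train a SentencePiece model; vocab_size and model_type are the usual knobs to tune
os.makedirs("data/sentencepiece", exist_ok=True)
spm.SentencePieceTrainer.train(
    input="data/culpa/train_text.txt",
    model_prefix="data/sentencepiece/culpa",
    vocab_size=32000,
    model_type="unigram",
)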
The first baseline we consider is the naive first-sentence baseline. Evaluate it:
python model_first_sentence.py culpa.test.json
rouge-1 : 19.42
rouge-2 : 2.80
rouge-l : 12.75
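For illustration, here is a sketch of what the naive first-sentence baseline amounts to, scored with the rouge_score package; the field names ("reviews", "summary") and the sentence splitting are assumptions, and the exact scorer in model_first_sentence.py may differ:

import json
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

with open("culpa.test.json") as f:
    examples = json.load(f)  # assumed to be a list of examples

scores = []
for ex in examples:
    # Naive baseline: take the first sentence of the first review as the summary
    prediction = ex["reviews"][0].split(".")[0]
    scores.append(scorer.score(ex["summary"], prediction))

# Average ROUGE-1 F1 over the test set
print(sum(s["rouge1"].fmeasure for s in scores) / len(scores))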
Similarly, compute the Extractive Oracle:
python model_oracle.py culpa.test.json
To show a sample:
python model_oracle.py culpa.test.json --show-sample
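An extractive oracle typically selects, for each example, the review sentence(s) that score best against the reference summary; below is a minimal single-sentence sketch of that selection step (the actual model_oracle.py may select multiple sentences and use different fields):

from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True)

def oracle_sentence(review_sentences, reference_summary):
    # Pick the candidate sentence with the highest ROUGE-1 F1 against the reference
    return max(
        review_sentences,
        key=lambda s: scorer.score(reference_summary, s)["rouge1"].fmeasure,
    )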
Training the COOP/BiMeanVAE baseline:
python train.py config/culpa/culpa-bimeanvae.jsonnet -s log/culpa/ex1
The model is saved in log/culpa/ex1.
Training the COOP/Optimus baseline (requires a GPU with 16 GB+ of memory):
python train.py config/culpa/culpa-optimus.jsonnet -s log/culpa/ex2
Printing out ROUGE scores by steps:
python ../CuSum/coopmetrics.py log/culpa/ex1
The model training code is based on the COOP code with modifications. I forked the project into soid/coop-cusum, and my modifications are in revision 8d4fe0a.
Training is done by running COOP with the cusum-bimeanvae.jsonnet config:
python train.py config/culpa/cusum-bimeanvae.jsonnet -s log/culpa/ex1
Our decoder requires the latest version of kmeans_pytorch, installed from the GitHub repo subhadarship/kmeans_pytorch.
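kmeans_pytorch can be installed straight from GitHub with pip (e.g. pip install git+https://github.com/subhadarship/kmeans_pytorch). A minimal usage sketch of its clustering call, with placeholder latent dimensionality and cluster count:

import torch
from kmeans_pytorch import kmeans

# Cluster a batch of latent vectors; 200 x 128 and 5 clusters are placeholder values
latents = torch.randn(200, 128)
cluster_ids, cluster_centers = kmeans(
    X=latents, num_clusters=5, distance="euclidean", device=torch.device("cpu"),
)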
Move culpa_coop.py to the COOP repository and run:
python culpa_coop.py log/culpa/ex1 data/culpa/test.json
This will evaluate and print out the sample from the paper, as well as evaluate 30 references and print out the ROUGE score.