GePE: Generalizable Paper Embedding with Language-Driven Biased Random Walk

Project of AI3602 Data Mining, 2024 Spring, SJTU
Kailing Wang · Weiji Xie · Xiangyuan Xue

GitHub Repository | Project Poster

This project introduces the Generalizable Paper Embedding (GePE) model, which leverages both textual and structural information from academic papers to improve paper classification, citation prediction, and recommendation tasks. Using a language-driven biased random walk, GePE efficiently captures semantic relationships between papers, enhancing the embeddings' quality and applicability to unseen data. This approach helps researchers effectively navigate and analyze extensive academic literature.
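The core idea can be illustrated with a small sketch: at each step, the walk is biased toward neighbors whose text embedding is most similar to the current node's. Everything here is a toy stand-in (a hand-made citation graph, 2-D "text embeddings", and a softmax over cosine similarity as the bias); the actual walk policy used in this repo lives in the training code and may differ in its details.

```python
import math
import random

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def language_biased_walk(graph, embeddings, start, length, rng=random):
    """One random walk whose transition probabilities are a softmax over
    the text similarity between the current node and each neighbor."""
    walk = [start]
    for _ in range(length - 1):
        cur = walk[-1]
        neighbors = graph.get(cur, [])
        if not neighbors:
            break
        sims = [cosine(embeddings[cur], embeddings[n]) for n in neighbors]
        mx = max(sims)  # subtract max for numerical stability
        weights = [math.exp(s - mx) for s in sims]
        total = sum(weights)
        probs = [w / total for w in weights]
        walk.append(rng.choices(neighbors, weights=probs, k=1)[0])
    return walk

# Toy citation graph: node -> neighbors, plus 2-D "text embeddings".
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
embeddings = {0: [1.0, 0.0], 1: [0.9, 0.1], 2: [0.0, 1.0]}
walk = language_biased_walk(graph, embeddings, start=0, length=5,
                            rng=random.Random(0))
print(walk)
```

Walks sampled this way tend to stay within semantically coherent neighborhoods, which is what lets the learned embeddings generalize to unseen papers.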

🛠️ Requirements

You can install the dependencies by following the instructions below.

  • Create a new conda environment and activate it:

    conda create -n gepe python=3.10
    conda activate gepe
  • Install PyTorch with the appropriate CUDA version and the corresponding pyg-lib, e.g.

    pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 torchaudio==0.12.1 --extra-index-url https://download.pytorch.org/whl/cu113
    pip install pyg-lib -f https://data.pyg.org/whl/torch-1.12.1+cu113.html
  • Then install other dependencies:

    pip install -r requirements.txt

The latest versions are recommended for all packages, but make sure your CUDA version is compatible with your PyTorch installation.

⚓ Preparation

Before training, you should prepare the necessary dataset and embeddings. In this project, we use the ogbn-arxiv dataset for experiments. Although the graph can be downloaded automatically, you have to download the raw texts of titles and abstracts manually by running the following commands:

PYTHONPATH=. python dataset/dataloader.py # This would download ogbn-arxiv
mkdir data && cd data
wget https://snap.stanford.edu/ogb/data/misc/ogbn_arxiv/titleabs.tsv.gz
gunzip titleabs.tsv.gz
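After downloading, the file can be sanity-checked with a few lines of Python. The sketch below assumes the common three-column layout of titleabs.tsv (paper id, title, abstract, tab-separated, no tabs inside fields); verify this against your copy before relying on it.

```python
import csv
import io

def parse_titleabs(fileobj, limit=None):
    """Parse a titleabs-style TSV into {paper_id: (title, abstract)}.
    Assumes three tab-separated columns per row; malformed rows are skipped."""
    records = {}
    reader = csv.reader(fileobj, delimiter="\t")
    for i, row in enumerate(reader):
        if limit is not None and i >= limit:
            break
        if len(row) != 3:
            continue
        paper_id, title, abstract = row
        records[paper_id] = (title, abstract)
    return records

# Usage with a small in-memory sample instead of the real file:
sample = ("101\tdeep nets\tAn abstract about deep networks.\n"
          "102\tgraph walks\tAnother abstract.\n")
records = parse_titleabs(io.StringIO(sample))
print(len(records))  # → 2
```

Pass `limit=` to peek at only the first few rows of the real (large) file.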

To test the encoding ability of SciBERT without node2vec, you can run the following command to download the model and generate the embeddings:

PYTHONPATH=. python dataset/embedding.py
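The cached file names below (embeddings_cls.pth and embeddings_mean.pth) suggest two standard ways of pooling a BERT encoder's token states into one paper vector: taking the CLS token versus averaging the non-padding tokens. This is a plain-Python sketch of those two poolings on toy vectors, not the repo's actual code.

```python
def cls_pool(hidden_states):
    # CLS pooling: take the first token's vector as the sequence embedding.
    return hidden_states[0]

def mean_pool(hidden_states, attention_mask):
    # Mean pooling: average token vectors, ignoring padding positions
    # (positions where the attention mask is 0).
    dim = len(hidden_states[0])
    total = [0.0] * dim
    count = 0
    for vec, m in zip(hidden_states, attention_mask):
        if m:
            total = [t + v for t, v in zip(total, vec)]
            count += 1
    return [t / count for t in total]

# Toy sequence of three 2-D token vectors; the last position is padding.
hidden = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
print(cls_pool(hidden))         # → [1.0, 2.0]
print(mean_pool(hidden, mask))  # → [2.0, 3.0]
```

Masking matters: averaging over padded positions as well would drag the embedding toward the (meaningless) pad vectors.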

If you want to train BERT-based models, we recommend pre-tokenizing the abstracts of the papers in the dataset and caching them as files. For each model, run the following command to generate the cache:

PYTHONPATH=. python model/the_model_you_want_to_use.py # scibert, distilbert, bert.
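The caching pattern is simple: tokenize every abstract once, store the (ids, mask) pairs keyed by paper id, and reload that file at training time. The sketch below is hypothetical throughout: `toy_tokenize` stands in for a real BERT tokenizer, and `pickle` stands in for `torch.save` (which the repo's .pth files imply).

```python
import os
import pickle
import tempfile

def toy_tokenize(text, vocab, max_len=8):
    """Stand-in for a BERT tokenizer: map words to ids, pad/truncate to max_len.
    Id 0 is reserved for padding."""
    ids = [vocab.setdefault(w, len(vocab) + 1) for w in text.lower().split()][:max_len]
    mask = [1] * len(ids) + [0] * (max_len - len(ids))
    ids = ids + [0] * (max_len - len(ids))
    return ids, mask

def pretokenize_and_cache(abstracts, path):
    # Tokenize every abstract once and write the cache to disk.
    vocab = {}
    cache = {pid: toy_tokenize(text, vocab) for pid, text in abstracts.items()}
    with open(path, "wb") as f:
        pickle.dump(cache, f)
    return cache

abstracts = {"p1": "graph embeddings for papers", "p2": "biased random walks"}
path = os.path.join(tempfile.mkdtemp(), "pre_tokenize.pkl")
cache = pretokenize_and_cache(abstracts, path)
with open(path, "rb") as f:
    reloaded = pickle.load(f)
print(reloaded == cache)  # → True
```

Doing this once up front removes the tokenizer from the training loop, which is the point of the cache files listed below.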

After running the above commands, the data structure should be like this:

- data/
  |- ogbn_arxiv/
  |- titleabs.tsv
  |- embeddings_cls.pth
  |- embeddings_mean.pth
  |- pre_tokenize.pth
  |- other similar stuff
- utils/
- model/
- other_folders/
- *.py
- readme.md

🚀 Training

Training arguments can be found in utils/args.py. You can run the following command to train the model:

python train.py --model_type embedding --batch_size 16384 
python train.py --model_type pretrained_bert --batch_size 8192 
python train.py --model_type pretrained_bert --batch_size 8192 --pretrain your_model.pth # resume training from a checkpoint

The batch sizes above are for GPUs with 24 GB of memory. You can adjust the batch size according to your GPU memory.

Here is a visualization of the embedding space of our method:

Embedding Space

💯 Evaluation

We support two types of evaluation: classification and link prediction. You can run the following commands to evaluate the model:

python validate_cls.py --model_type pretrained_bert --pretrain your_model.pth
python validate_lp.py --model_type pretrained_bert --pretrain your_model.pth
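The two reported metrics, node-classification accuracy (ACC) and link-prediction AUC, can be sketched in plain Python. This mirrors what scikit-learn's `accuracy_score` and `roc_auc_score` compute, using the rank (Mann-Whitney) formulation for AUC; the repo's own evaluation scripts may compute them differently.

```python
def accuracy(y_true, y_pred):
    # Fraction of exactly matching labels.
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def auc(y_true, scores):
    """ROC AUC via the Mann-Whitney formulation: the probability that a
    randomly chosen positive is scored above a randomly chosen negative,
    counting ties as half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

print(accuracy([0, 1, 1, 2], [0, 1, 0, 2]))      # → 0.75
print(auc([1, 1, 0, 0], [0.9, 0.4, 0.35, 0.1]))  # → 1.0
```

The quadratic loop is fine for a sketch; for the full ogbn-arxiv edge set you would use the rank-sum version or `sklearn.metrics.roc_auc_score`.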

Here are our evaluation results for your reference:

| Method | # Parameters | Generalizable | NC (ACC) | LP (AUC) |
| --- | --- | --- | --- | --- |
| Hash Mapping | $3.7\text{M}$ | No | $9.8\%$ | $0.558$ |
| Vanilla Embedding | $130.0\text{M}$ | No | $60.5\%$ | $0.934$ |
| Language Encoding | $30.5\text{M}$ | Yes | $26.9\%$ | $0.733$ |
| GePE (ours) | $23.1\text{M}$ | Yes | $68.3\%$ | $0.859$ |

Refer to the poster if you want to see more details.

🤖 Demo

We provide a recommendation system based on the trained model. You can run the following command to start the recommendation application:

# To run in command line
PYTHONPATH=. python app/rs_cmd.py --model_type scibert # Note that the default pretrained path is model/test.pth
# To run in gradio
PYTHONPATH=. python app/rs_gradio.py --model_type scibert

Here is a sample result of the recommendation system:

Recommendation System
