This repo contains the sample code for reproducing the results of our ICML 2025 paper: Hierarchical Graph Tokenization for Molecule-Language Alignment, which was also presented at the ICML 2024 Workshop on Foundation Models in the Wild. 😆
Updates:
- The customized datasets, including HiPubChem and MotifHallu, are open-sourced.
- The model checkpoints are open-sourced.
The installation mostly follows the LLaVA installation guide.
- Clone this repository and navigate to the project folder
- Install the package
- If you have any trouble installing torch-geometric related packages, please refer to guide-to-pyg-install for detailed instructions. A quick environment sanity check is also sketched after the install commands below.
conda create -n env_hight python=3.10 -y
conda activate env_hight
pip install --upgrade pip # enable PEP 660 support
pip install -e .
# Install graph-related packages. We use torch 1.13 with CUDA 11.8 and pytorch_geometric 2.3.1; please change accordingly.
pip install -r requirements.txt
- Install additional packages for training
pip install ninja
pip install flash-attn --no-build-isolation
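After installation, you may want to confirm that the environment is usable before training. The snippet below is not part of this repo; it is a minimal sanity check that torch and torch_geometric import cleanly and that CUDA is visible, with exact versions depending on your setup.

```python
# Optional sanity check (not part of this repo): confirm that torch and
# torch_geometric are installed and that CUDA is visible to PyTorch.
import torch
import torch_geometric

print("torch:", torch.__version__, "| CUDA build:", torch.version.cuda)
print("torch_geometric:", torch_geometric.__version__)
print("CUDA available:", torch.cuda.is_available())
```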
- [] will be released soon!
- PubChem: refer to MoleculeSTM.
- Mol-Instructions: refer to Mol-Instructions.
Following LLaVA, training consists of two stages:
- Stage 1: Alignment Pretraining. The initial stage aligns molecules with text using a PubChem dataset of 330K pairs. It fine-tunes only the alignment projector while keeping the graph encoder and LLM frozen, so that their pre-trained knowledge is preserved.
- Stage 2: Task-specific Instruction Tuning. The second stage targets compound property prediction, chemical reaction analysis, and molecule description generation. It uses task-specific instruction datasets and LoRA for LLM adaptation, which retains the LLM's common-sense reasoning while allowing adapters to be swapped in for specific needs or for modular knowledge integration.
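For intuition, here is a minimal sketch of how such a two-stage recipe can be wired up in PyTorch. It is not the repo's actual training code: the attribute names `graph_tower`, `mm_projector`, and `llm` are hypothetical stand-ins for the graph encoder, alignment projector, and language model, and the LoRA `target_modules` below are the usual LLaMA-style attention projections, which should be adjusted to your backbone.

```python
# Illustrative sketch only (assumed attribute names, not the repo's API).
from peft import LoraConfig, get_peft_model  # Hugging Face PEFT


def freeze(module):
    """Disable gradients so the module stays fixed during training."""
    for p in module.parameters():
        p.requires_grad = False


def setup_stage1(model):
    """Stage 1: alignment pretraining -- only the projector is trained."""
    freeze(model.graph_tower)   # hierarchical graph encoder stays frozen
    freeze(model.llm)           # language model stays frozen
    return list(model.mm_projector.parameters())  # parameters to optimize


def setup_stage2(model):
    """Stage 2: instruction tuning -- adapt the LLM with LoRA adapters."""
    lora_cfg = LoraConfig(
        r=64,
        lora_alpha=16,
        lora_dropout=0.05,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed LLaMA-style names
        task_type="CAUSAL_LM",
    )
    model.llm = get_peft_model(model.llm, lora_cfg)  # only LoRA weights receive gradients
    return model
```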
See pretrain.sh for an example of how to run the pretraining stage.
`$GRAPH_TOWER` can be chosen from `vqvae2` or `hvqvae2`.
You can train all specific tasks together with finetune.sh, or train them separately.
See Evaluation.md for detailed instructions on how to evaluate the model.
If you find our paper and repo useful, please cite our paper:
@inproceedings{chen2025hierarchical,
title={Hierarchical Graph Tokenization for Molecule-Language Alignment},
author={Yongqiang Chen and Quanming Yao and Juzheng Zhang and James Cheng and Yatao Bian},
booktitle={Forty-second International Conference on Machine Learning},
year={2025},
url={https://openreview.net/forum?id=wpbNczwAwV}
}

We would like to acknowledge the contribution of InstructMol to the base code.