HIGHT: Hierarchical Graph Tokenization for Graph-Language Alignment


This repo contains the sample code for reproducing the results of our ICML 2025 paper: Hierarchical Graph Tokenization for Molecule-Language Alignment, which was also presented at the ICML 2024 Workshop on Foundation Models in the Wild. 😆😆😆

Updates:

  • The customized datasets, including HiPubChem and MotifHallu, are open-sourced.
  • The model checkpoints are open-sourced.

Environment Setup

The setup mostly follows the LLaVA installation instructions.

  1. Clone this repository and navigate to the project folder
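A hedged example of this step (the repository path is taken from the GitHub page; adjust it if you use SSH or a fork):

git clone https://github.com/LFhase/HIGHT.git
cd HIGHT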

  2. Install Package

  • If you have any trouble installing torch-geometric related packages, please refer to guide-to-pyg-install for detailed instructions.
conda create -n env_hight python=3.10 -y
conda activate env_hight
pip install --upgrade pip  # enable PEP 660 support
pip install -e .

# Install graph-related packages. We use torch 1.13 with CUDA 11.8 and pytorch_geometric 2.3.1; please change the versions accordingly.
pip install -r requirements.txt
  3. Install additional packages for training cases
pip install ninja
pip install flash-attn --no-build-isolation
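After installation, a quick sanity check (a hypothetical one-liner, not part of the repo) can confirm the versions mentioned in the comment above:

python -c "import torch, torch_geometric; print(torch.__version__, torch.version.cuda, torch_geometric.__version__)"  # expect a torch 1.13 build with CUDA 11.8 and pytorch_geometric 2.3.1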

Weights

  • [ ] The component weights and download links will be released soon!

Dataset

The customized datasets, including HiPubChem and MotifHallu, are open-sourced; see the Updates section above.

Train

Following LLaVA, training consists of two stages:

  • Stage 1: Alignment Pretraining. The initial stage aligns molecules with text using a PubChem dataset of 330K pairs. It fine-tunes only the alignment projector while keeping the graph encoder and the LLM frozen to leverage their pre-trained knowledge.
  • Stage 2: Task-specific Instruction Tuning. The second stage targets compound property prediction, chemical reaction analysis, and molecule description generation. It uses task-specific instruction datasets and LoRA for LLM adaptation, retaining the LLM's common-sense reasoning capabilities while allowing adapters to be swapped for specific needs or for modular knowledge integration.

Stage 1: Alignment Pretraining

See pretrain.sh for an example of how to run the pretraining stage.

  • $GRAPH_TOWER can be chosen from vqvae2 or hvqvae2.
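A minimal invocation sketch, assuming pretrain.sh reads $GRAPH_TOWER from the environment (it may instead be set inside the script; check pretrain.sh for the exact arguments and paths):

GRAPH_TOWER=hvqvae2 bash pretrain.sh  # or GRAPH_TOWER=vqvae2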

Stage 2: Task-specific Instruction Tuning

You can train all of the specific tasks together with finetune.sh, or train them separately.
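A minimal sketch, assuming the script is invoked from the project root (how tasks are selected or combined is handled inside finetune.sh; check the script for details):

bash finetune.sh  # joint instruction tuning over all tasks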

Evaluation

See Evaluation.md for detailed instructions on how to evaluate the model.

Misc

If you find our paper and repo useful, please cite our paper:

@inproceedings{chen2025hierarchical,
  title={Hierarchical Graph Tokenization for Molecule-Language Alignment},
  author={Yongqiang Chen and Quanming Yao and Juzheng Zhang and James Cheng and Yatao Bian},
  booktitle={Forty-second International Conference on Machine Learning},
  year={2025},
  url={https://openreview.net/forum?id=wpbNczwAwV}
}

We would like to acknowledge the contribution of InstructMol to our codebase.
