This repo contains the sample code for reproducing the results of our ICML 2025 paper: Hierarchical Graph Tokenization for Molecule-Language Alignment, which was also presented at the ICML 2024 Workshop on Foundation Models in the Wild. 😆
Updates:
- The customized datasets, including HiPubChem and MotifHallu, are open-sourced.
- The model checkpoints are open-sourced.
The installation mostly follows the LLaVA installation guide.
- Clone this repository and navigate to the project folder
- Install the package
- If you have any trouble installing torch-geometric related packages, please refer to guide-to-pyg-install for detailed instructions. A quick environment sanity check is also sketched after the install commands below.
conda create -n env_hight python=3.10 -y
conda activate env_hight
pip install --upgrade pip # enable PEP 660 support
pip install -e .
# Install graph-related packages. We use torch 1.13 with CUDA 11.8 and pytorch_geometric 2.3.1; please change accordingly.
pip install -r requirements.txt
- Install additional packages for training
pip install ninja
pip install flash-attn --no-build-isolation
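After installation, you may want to confirm that the environment is usable before training. The snippet below is not part of this repo; it is a minimal sanity check that torch and torch_geometric import cleanly and that CUDA is visible, with exact versions depending on your setup.

```python
# Optional sanity check (not part of this repo): confirm that torch and
# torch_geometric are installed and that CUDA is visible to PyTorch.
import torch
import torch_geometric

print("torch:", torch.__version__, "| CUDA build:", torch.version.cuda)
print("torch_geometric:", torch_geometric.__version__)
print("CUDA available:", torch.cuda.is_available())
```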
- [] will be released soon!
- PubChem: refer to MoleculeSTM.
- Mol-Instructions: refer to Mol-Instructions.
Following LLaVA, training consists of two stages:
- Stage 1: Alignment Pretraining. The initial stage aligns molecules with text using a PubChem dataset of 330K pairs. It fine-tunes only the alignment projector while keeping the graph encoder and LLM frozen, so that their pre-trained knowledge is preserved.
- Stage 2: Task-specific Instruction Tuning. The second stage targets compound property prediction, chemical reaction analysis, and molecule description generation. It uses task-specific instruction datasets and LoRA for LLM adaptation, which retains the LLM's common-sense reasoning while allowing adapters to be swapped in for specific needs or for modular knowledge integration.
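For intuition, here is a minimal sketch of how such a two-stage recipe can be wired up in PyTorch. It is not the repo's actual training code: the attribute names `graph_tower`, `mm_projector`, and `llm` are hypothetical stand-ins for the graph encoder, alignment projector, and language model, and the LoRA `target_modules` below are the usual LLaMA-style attention projections, which should be adjusted to your backbone.

```python
# Illustrative sketch only (assumed attribute names, not the repo's API).
from peft import LoraConfig, get_peft_model  # Hugging Face PEFT


def freeze(module):
    """Disable gradients so the module stays fixed during training."""
    for p in module.parameters():
        p.requires_grad = False


def setup_stage1(model):
    """Stage 1: alignment pretraining -- only the projector is trained."""
    freeze(model.graph_tower)   # hierarchical graph encoder stays frozen
    freeze(model.llm)           # language model stays frozen
    return list(model.mm_projector.parameters())  # parameters to optimize


def setup_stage2(model):
    """Stage 2: instruction tuning -- adapt the LLM with LoRA adapters."""
    lora_cfg = LoraConfig(
        r=64,
        lora_alpha=16,
        lora_dropout=0.05,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed LLaMA-style names
        task_type="CAUSAL_LM",
    )
    model.llm = get_peft_model(model.llm, lora_cfg)  # only LoRA weights receive gradients
    return model
```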
See pretrain.sh for an example of how to run the pretraining stage.
`$GRAPH_TOWER` can be chosen from `vqvae2` or `hvqvae2`.
You can train all specific tasks together with finetune.sh, or train them separately.
See Evaluation.md for detailed instructions on how to evaluate the model.
If you find our paper and repo useful, please cite our paper:
@inproceedings{chen2025hierarchical,
title={Hierarchical Graph Tokenization for Molecule-Language Alignment},
author={Yongqiang Chen and Quanming Yao and Juzheng Zhang and James Cheng and Yatao Bian},
booktitle={Forty-second International Conference on Machine Learning},
year={2025},
url={https://openreview.net/forum?id=wpbNczwAwV}
}

We would like to acknowledge the contribution of InstructMol to the base code.