
Mubaoy/molformer

 
 


MoLFormer under ChemVL Protocol

This repository is a fork of IBM MoLFormer for reproducing MoLFormer as an external baseline under the ChemVL MoleculeNet protocol.

The purpose of this fork is not to redesign MoLFormer. It keeps the original MoLFormer encoder/checkpoint interface and adds a ChemVL-compatible evaluation wrapper so MoLFormer can be compared fairly with MoleculeSTM, GEM, ChemVL, and MolMCL-style baselines under the same datasets, split rules, seeds, metrics, and output format.

Project Scope

Model         Input            Finetune setup
MoLFormer-XL  SMILES sequence  Full-model fine-tuning with a ChemVL-compatible property head

Benchmark tables:

Table  Split            Datasets
A      scaffold         BACE, BBBP, ClinTox, HIV, SIDER, Tox21, ESOL, FreeSolv, Lipo, QM7
B      random_scaffold  BACE, BBBP, ClinTox, HIV, SIDER, Tox21, ESOL, FreeSolv, Lipo, QM7
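The two split names can be illustrated with a minimal sketch. It assumes the common MoleculeNet-style convention: molecules are grouped by scaffold, whole groups are assigned to train/valid/test at roughly 80/10/10, and the only difference is group order (largest first for scaffold, seeded shuffle for random_scaffold). The function names, the fractions, and the precomputed scaffold keys are assumptions for illustration; the split code shipped in this fork is authoritative.

```python
import random
from collections import defaultdict


def group_by_scaffold(scaffolds):
    """Group molecule indices by a precomputed scaffold key (e.g. a scaffold SMILES)."""
    groups = defaultdict(list)
    for idx, key in enumerate(scaffolds):
        groups[key].append(idx)
    return list(groups.values())


def split_groups(groups, n_total, frac_train=0.8, frac_valid=0.1):
    """Assign whole scaffold groups to train/valid/test so no scaffold crosses a split."""
    train_cut = frac_train * n_total
    valid_cut = (frac_train + frac_valid) * n_total
    train, valid, test = [], [], []
    for group in groups:
        if len(train) + len(group) <= train_cut:
            train.extend(group)
        elif len(train) + len(valid) + len(group) <= valid_cut:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test


def scaffold_split(scaffolds):
    """Deterministic variant: largest scaffold groups are placed first."""
    groups = sorted(group_by_scaffold(scaffolds), key=lambda g: (-len(g), g[0]))
    return split_groups(groups, len(scaffolds))


def random_scaffold_split(scaffolds, seed=1):
    """Seeded variant: scaffold groups are shuffled before assignment."""
    groups = group_by_scaffold(scaffolds)
    random.Random(seed).shuffle(groups)
    return split_groups(groups, len(scaffolds))


if __name__ == "__main__":
    keys = ["a"] * 5 + ["b"] * 3 + ["c", "d"]
    print(scaffold_split(keys))
    print(random_scaffold_split(keys, seed=1))
```

In a real run the scaffold keys would come from RDKit, which is why the pinned RDKit version matters for reproducing the same groups.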

MoleculeACE Table C is not implemented in this fork. It would require a separate MoleculeACE loader and the MolMCL split protocol.

The metric convention follows ChemVL: classification datasets use ROC-AUC, QM7 uses MAE, and ESOL/FreeSolv/Lipo use RMSE. Each (dataset, split) setting is run with runseed = 1, 2, 3.
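As a rough sketch of the resulting experiment grid, the snippet below enumerates every (split, dataset, runseed) combination and attaches the metric named above. The function names and the dictionary layout are hypothetical and are not taken from the fork's runner scripts.

```python
from itertools import product

CLASSIFICATION = ["BACE", "BBBP", "ClinTox", "HIV", "SIDER", "Tox21"]
REGRESSION = ["ESOL", "FreeSolv", "Lipo", "QM7"]
SPLITS = ["scaffold", "random_scaffold"]  # Table A and Table B
RUNSEEDS = [1, 2, 3]


def metric_for(dataset):
    """Return the metric name used for a dataset under the stated convention."""
    if dataset in CLASSIFICATION:
        return "roc_auc"
    if dataset == "QM7":
        return "mae"
    if dataset in REGRESSION:
        return "rmse"
    raise ValueError(f"unknown dataset: {dataset}")


def run_grid():
    """Enumerate all runs as dictionaries (hypothetical layout)."""
    return [
        {"split": s, "dataset": d, "runseed": r, "metric": metric_for(d)}
        for s, d, r in product(SPLITS, CLASSIFICATION + REGRESSION, RUNSEEDS)
    ]


if __name__ == "__main__":
    # 2 splits x 10 datasets x 3 runseeds
    print(len(run_grid()))
```

A full Table A/B reproduction therefore consists of 60 individual fine-tuning runs.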

Repository Layout

scripts/chemvl_protocol/       ChemVL-compatible MoLFormer runner and batch scripts
configs/chemvl_protocol/       Base configs for Table A/B runs
finetune/                      Original MoLFormer finetune/model components
parameter_audit.sh             One-command parameter audit entry point
parameter_audit.py             MoLFormer parameter audit implementation
parameter_summary.csv          Generated by parameter_audit.sh

Environment

Install ChemVL protocol dependencies in the Python environment used for this fork:

pip install -r scripts/chemvl_protocol/requirements.txt

For strict split reproducibility, keep the RDKit version aligned with the ChemVL/GEM runs:

rdkit-pypi==2022.9.5

The encoder also requires the original MoLFormer stack, including PyTorch and pytorch-fast-transformers.

Data And Checkpoints

Set the roots before running:

export CHEMVL_DATA_ROOT=/path/to/chemvl-data
export MOLFORMER_REPO=/path/to/this/molformer/repo

Expected MoleculeNet layout:

${CHEMVL_DATA_ROOT}/finetuning_datasets/MPP/classification/<task>/processed/<task>_processed_ac.csv
${CHEMVL_DATA_ROOT}/finetuning_datasets/MPP/regression/<task>/processed/<task>_processed_ac.csv

Expected pretrained checkpoint:

${MOLFORMER_REPO}/data/Pretrained MoLFormer/checkpoints/N-Step-Checkpoint_3_30000.ckpt

Checkpoints are runtime artifacts and should not be committed.
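Before launching a long batch, it can help to confirm that the expected files exist. The following is a minimal sketch that rebuilds the paths listed above from the two environment variables; the helper names are hypothetical, and the lowercase task names in the example (bbbp, esol) are an assumption about how the task directories are spelled.

```python
import os
from pathlib import Path


def dataset_csv(data_root, task_type, task):
    """Build the expected processed CSV path for one MoleculeNet task."""
    if task_type not in ("classification", "regression"):
        raise ValueError(f"unexpected task type: {task_type}")
    return (
        Path(data_root) / "finetuning_datasets" / "MPP" / task_type
        / task / "processed" / f"{task}_processed_ac.csv"
    )


def checkpoint_path(repo_root):
    """Build the expected pretrained checkpoint path inside the repository."""
    return (
        Path(repo_root) / "data" / "Pretrained MoLFormer" / "checkpoints"
        / "N-Step-Checkpoint_3_30000.ckpt"
    )


def missing_inputs(data_root, repo_root, tasks):
    """Return every expected input file that is not present on disk."""
    paths = [dataset_csv(data_root, kind, name) for kind, name in tasks]
    paths.append(checkpoint_path(repo_root))
    return [p for p in paths if not p.is_file()]


if __name__ == "__main__":
    data_root = os.environ.get("CHEMVL_DATA_ROOT")
    repo_root = os.environ.get("MOLFORMER_REPO")
    if not data_root or not repo_root:
        print("Set CHEMVL_DATA_ROOT and MOLFORMER_REPO first.")
    else:
        # Task names here are illustrative; adjust to the directories you have.
        tasks = [("classification", "bbbp"), ("regression", "esol")]
        for path in missing_inputs(data_root, repo_root, tasks):
            print("missing:", path)
```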

Run Experiments

Dry-run commands:

DRY_RUN=1 bash scripts/chemvl_protocol/run_moleculenet_scaffold.sh
DRY_RUN=1 bash scripts/chemvl_protocol/run_moleculenet_random_scaffold.sh

Run Table A/B:

bash scripts/chemvl_protocol/run_moleculenet_scaffold.sh
bash scripts/chemvl_protocol/run_moleculenet_random_scaffold.sh

Run both in the background:

setsid -f bash scripts/chemvl_protocol/run_ab_background.sh \
  > "${CHEMVL_DATA_ROOT}/results/moleculenet/molformer_under_chemvl/run_ab.log" \
  2>&1 < /dev/null

Useful overrides (Python interpreter and runseed range):

PYTHON=/path/to/python RUNSEED_START=1 RUNSEED_END=3 \
  bash scripts/chemvl_protocol/run_moleculenet_scaffold.sh

SIDER may need a smaller batch size on memory-constrained GPUs:

SIDER_BATCH_SIZE=16 bash scripts/chemvl_protocol/resume_after_sider_oom.sh

Parameter Audit

Run one command from the repository root. A minimal verified audit environment is:

conda create -n molformer_param_audit python=3.8 pip -y
conda activate molformer_param_audit
pip install numpy==1.24.4 pandas scikit-learn scipy rdkit-pypi==2022.9.5 torch==1.13.1 transformers==4.30.2
PATH=/root/miniconda3/bin:/usr/local/bin:/usr/local/nvidia/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin \
  pip install pytorch-fast-transformers==0.4.0 --no-build-isolation
bash parameter_audit.sh --strict

The temporary PATH override hides /usr/local/cuda/bin/nvcc so pytorch-fast-transformers builds CPU extensions only. This is sufficient for parameter counting and avoids CUDA toolkit/PyTorch CUDA-version mismatch during installation.

The script writes:

parameter_summary.csv

The audit first tries to instantiate the local ChemVL MoLFormer adapter and count model.parameters() after applying freeze_encoder. Use --strict when producing the formal table so dependency issues fail fast instead of falling back. If pytorch-fast-transformers is unavailable, the non-strict mode falls back to a transparent static architecture count from MolFormerPropertyModel, the checked-in config, and finetune/bert_vocab.txt. The fallback is marked as STATIC_ARCHITECTURE_FALLBACK in the CSV.
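The counting step can be sketched as follows. The function relies only on the parameters(), numel(), and requires_grad interface that a PyTorch module exposes; the two stand-in classes are hypothetical and exist only so the example runs without PyTorch. Reporting a separate frozen count is an illustrative choice, not a description of what parameter_audit.py writes to the CSV.

```python
class FakeParam:
    """Stand-in for a torch parameter: a size and a requires_grad flag."""

    def __init__(self, n, requires_grad=True):
        self.n = n
        self.requires_grad = requires_grad

    def numel(self):
        return self.n


class FakeModel:
    """Stand-in for a torch.nn.Module exposing parameters()."""

    def __init__(self, params):
        self._params = list(params)

    def parameters(self):
        return iter(self._params)


def count_parameters(model):
    """Count total, trainable, and frozen parameters of a module-like object."""
    total = 0
    trainable = 0
    for p in model.parameters():
        n = p.numel()
        total += n
        if p.requires_grad:
            trainable += n
    return {"total": total, "trainable": trainable, "frozen": total - trainable}


if __name__ == "__main__":
    # Toy numbers only: a frozen "encoder" tensor plus a trainable "head".
    model = FakeModel([FakeParam(1000, requires_grad=False), FakeParam(10)])
    print(count_parameters(model))
```

With a real model, the same function can be called on the instantiated adapter after any freezing has been applied, so the trainable count reflects the actual fine-tuning setup.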

The default audit target is Table A bbbp with num_tasks=1, matching the formal comparison table. To audit a different head size, or to write the CSV to a different path:

bash parameter_audit.sh --dataset tox21 --num_tasks 12
bash parameter_audit.sh --output outputs/parameter_summary.csv

Outputs

Completed runs are written under:

${CHEMVL_DATA_ROOT}/results/moleculenet/molformer_under_chemvl/

Important files:

molformer_under_chemvl_summary_by_dataset.csv
molformer_under_chemvl_summary_macro.csv
molformer_under_chemvl_summary.png

Each individual run stores:

<result_root>/<version>/<dataset>/<timestamp>/
  config.json
  result.json
  train_val_test_history.csv
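To check which runs finished, one option is to walk the directory layout above and flag run folders that lack any of the three files. This is a minimal sketch: the function name and the returned dictionary keys are hypothetical, and it inspects file presence only, since the schema of result.json is not described here.

```python
from pathlib import Path

EXPECTED_FILES = ("config.json", "result.json", "train_val_test_history.csv")


def find_runs(result_root):
    """List run directories at <result_root>/<version>/<dataset>/<timestamp>/."""
    runs = []
    for run_dir in sorted(Path(result_root).glob("*/*/*")):
        if not run_dir.is_dir():
            continue
        missing = [name for name in EXPECTED_FILES if not (run_dir / name).is_file()]
        version, dataset, timestamp = run_dir.parts[-3:]
        runs.append(
            {
                "version": version,
                "dataset": dataset,
                "timestamp": timestamp,
                "complete": not missing,
                "missing": missing,
            }
        )
    return runs


if __name__ == "__main__":
    import sys

    root = sys.argv[1] if len(sys.argv) > 1 else "."
    for run in find_runs(root):
        status = "ok" if run["complete"] else "missing " + ", ".join(run["missing"])
        print(run["version"], run["dataset"], run["timestamp"], status)
```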

Reproducibility Notes

  • The split implementation is designed to match ChemVL scaffold and random_scaffold behavior.
  • seed = 1 controls the split in the shipped configs.
  • runseed = 1, 2, 3 controls training randomness.
  • Missing classification labels are ignored, following the ChemVL multitask convention.
  • Regression metric selection follows ChemVL: QM7 uses MAE, ESOL/FreeSolv/Lipo use RMSE.
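The missing-label rule can be illustrated with a small pure-Python sketch of a masked multitask ROC-AUC. Several details are assumptions made for illustration rather than facts about this fork: missing labels are encoded as None or NaN, tasks with only one class among their labelled molecules are skipped, and the per-task scores are macro-averaged.

```python
import math


def roc_auc(labels, scores):
    """Pairwise ROC-AUC: fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None  # undefined when only one class is present
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def is_missing(y):
    """Treat None and NaN as missing labels (assumed encoding)."""
    return y is None or (isinstance(y, float) and math.isnan(y))


def masked_multitask_auc(label_rows, score_rows):
    """Average per-task ROC-AUC, ignoring molecules with a missing label for that task."""
    n_tasks = len(label_rows[0])
    per_task = []
    for t in range(n_tasks):
        pairs = [
            (ys[t], ss[t])
            for ys, ss in zip(label_rows, score_rows)
            if not is_missing(ys[t])
        ]
        auc = roc_auc([y for y, _ in pairs], [s for _, s in pairs])
        if auc is not None:
            per_task.append(auc)
    return sum(per_task) / len(per_task) if per_task else None


if __name__ == "__main__":
    labels = [[1, None], [0, 1], [1, 0], [0, float("nan")]]
    scores = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3], [0.4, 0.5]]
    print(masked_multitask_auc(labels, scores))
```

The pairwise formulation is quadratic in the number of molecules, so it is meant to show the masking logic rather than to replace a library metric on large datasets such as HIV.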

Upstream

Original MoLFormer project/paper:
