MoLFormer under ChemVL Protocol

This repository is a fork of IBM MoLFormer for reproducing MoLFormer as an external baseline under the ChemVL MoleculeNet protocol.

The purpose of this fork is not to redesign MoLFormer. It keeps the original MoLFormer encoder/checkpoint interface and adds a ChemVL-compatible evaluation wrapper so MoLFormer can be compared fairly with MoleculeSTM, GEM, ChemVL, and MolMCL-style baselines under the same datasets, split rules, seeds, metrics, and output format.

Project Scope

| Model | Input | Finetune setup |
| --- | --- | --- |
| MoLFormer-XL | SMILES sequence | Full-model fine-tuning with a ChemVL-compatible property head |

Benchmark tables:

| Table | Split | Datasets |
| --- | --- | --- |
| A | `scaffold` | BACE, BBBP, ClinTox, HIV, SIDER, Tox21, ESOL, FreeSolv, Lipo, QM7 |
| B | `random_scaffold` | BACE, BBBP, ClinTox, HIV, SIDER, Tox21, ESOL, FreeSolv, Lipo, QM7 |
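
As a rough illustration of how the two split names differ, the sketch below groups molecules by a precomputed scaffold string and fills train/valid/test greedily. It follows the convention common in MoleculeNet baselines (deterministic size ordering for `scaffold`, seeded shuffling of scaffold groups for `random_scaffold`); whether ChemVL uses exactly this ordering, these fractions, or this random generator is an assumption. The function names are hypothetical and are not part of this repository.

```python
import random
from collections import defaultdict


def scaffold_groups(scaffolds):
    """Group molecule indices by scaffold string (one scaffold per molecule)."""
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)
    return list(groups.values())


def split_by_scaffold(scaffolds, mode="scaffold", seed=1,
                      frac_train=0.8, frac_valid=0.1):
    """Return (train, valid, test) index lists; a scaffold never spans two splits.

    Assumed conventions (not confirmed by this README):
    - "scaffold": largest scaffold groups are placed first, deterministically.
    - "random_scaffold": scaffold groups are shuffled with `seed` before placement.
    """
    groups = scaffold_groups(scaffolds)
    if mode == "scaffold":
        # Sort by descending size; break ties by first index for determinism.
        groups.sort(key=lambda g: (-len(g), g[0]))
    elif mode == "random_scaffold":
        # The real protocol may use a different RNG, so orderings will differ.
        random.Random(seed).shuffle(groups)
    else:
        raise ValueError(f"unknown split mode: {mode}")

    n = len(scaffolds)
    train_cutoff = frac_train * n
    valid_cutoff = (frac_train + frac_valid) * n
    train, valid, test = [], [], []
    for group in groups:
        if len(train) + len(group) <= train_cutoff:
            train.extend(group)
        elif len(train) + len(valid) + len(group) <= valid_cutoff:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test
```

In practice the scaffold strings would typically come from RDKit's `MurckoScaffold.MurckoScaffoldSmiles`, which is one reason the RDKit version matters for reproducing splits exactly.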

MoleculeACE Table C is not implemented in this fork. It would require a separate MoleculeACE loader and the MolMCL split protocol.

The metric convention follows ChemVL: classification datasets use ROC-AUC, QM7 uses MAE, and ESOL, FreeSolv, and Lipo use RMSE. Each (dataset, split) setting is run three times, with runseed = 1, 2, 3.
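
The protocol above amounts to 10 datasets x 2 splits x 3 run seeds = 60 runs. A minimal sketch of that grid and the per-dataset metric lookup is shown here. The lowercase dataset identifiers are assumptions (only `bbbp` and `tox21` appear verbatim in this README), and `metric_for` and `run_grid` are hypothetical helper names, not functions shipped in `scripts/chemvl_protocol/`.

```python
from itertools import product

# Metric per dataset, as stated in this README. Identifier spelling is assumed.
METRIC_BY_DATASET = {
    "bace": "roc_auc",
    "bbbp": "roc_auc",
    "clintox": "roc_auc",
    "hiv": "roc_auc",
    "sider": "roc_auc",
    "tox21": "roc_auc",
    "esol": "rmse",
    "freesolv": "rmse",
    "lipo": "rmse",
    "qm7": "mae",
}
SPLITS = ("scaffold", "random_scaffold")  # Table A, Table B
RUNSEEDS = (1, 2, 3)


def metric_for(dataset):
    """Return the evaluation metric name for a dataset identifier."""
    return METRIC_BY_DATASET[dataset.lower()]


def run_grid():
    """Enumerate every (dataset, split, runseed) run with its metric."""
    return [
        {"dataset": d, "split": s, "runseed": r, "metric": metric_for(d)}
        for d, s, r in product(METRIC_BY_DATASET, SPLITS, RUNSEEDS)
    ]


if __name__ == "__main__":
    grid = run_grid()
    print(f"{len(grid)} runs in total")
```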

Repository Layout

scripts/chemvl_protocol/       ChemVL-compatible MoLFormer runner and batch scripts
configs/chemvl_protocol/       Base configs for Table A/B runs
finetune/                      Original MoLFormer finetune/model components
parameter_audit.sh             One-command parameter audit entry point
parameter_audit.py             MoLFormer parameter audit implementation
parameter_summary.csv          Generated by parameter_audit.sh

Environment

Install ChemVL protocol dependencies in the Python environment used for this fork:

pip install -r scripts/chemvl_protocol/requirements.txt

For strict split reproducibility, keep the RDKit version aligned with the ChemVL/GEM runs:

rdkit-pypi==2022.9.5

The encoder also requires the original MoLFormer stack, including PyTorch and pytorch-fast-transformers.

Data And Checkpoints

Set the roots before running:

export CHEMVL_DATA_ROOT=/path/to/chemvl-data
export MOLFORMER_REPO=/path/to/this/molformer/repo

Expected MoleculeNet layout:

${CHEMVL_DATA_ROOT}/finetuning_datasets/MPP/classification/<task>/processed/<task>_processed_ac.csv
${CHEMVL_DATA_ROOT}/finetuning_datasets/MPP/regression/<task>/processed/<task>_processed_ac.csv

Expected pretrained checkpoint:

${MOLFORMER_REPO}/data/Pretrained MoLFormer/checkpoints/N-Step-Checkpoint_3_30000.ckpt

Checkpoints are runtime artifacts and should not be committed.
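
Before launching a long batch, it can help to confirm that the expected files exist. The following is a small preflight sketch that rebuilds the paths described above from the two root directories. The helper names are hypothetical, and the task names passed in are assumed to match the directory names on disk.

```python
import os
from pathlib import Path


def dataset_csv(data_root, task_type, task):
    """Path to a processed MoleculeNet CSV; task_type is 'classification' or 'regression'."""
    return (Path(data_root) / "finetuning_datasets" / "MPP" / task_type
            / task / "processed" / f"{task}_processed_ac.csv")


def pretrained_checkpoint(repo_root):
    """Path to the pretrained MoLFormer checkpoint inside the repository."""
    return (Path(repo_root) / "data" / "Pretrained MoLFormer" / "checkpoints"
            / "N-Step-Checkpoint_3_30000.ckpt")


def missing_inputs(data_root, repo_root, tasks):
    """Return the expected files that do not exist; tasks is a list of (task_type, task)."""
    expected = [dataset_csv(data_root, task_type, task) for task_type, task in tasks]
    expected.append(pretrained_checkpoint(repo_root))
    return [path for path in expected if not path.is_file()]


if __name__ == "__main__":
    data_root = os.environ.get("CHEMVL_DATA_ROOT")
    repo_root = os.environ.get("MOLFORMER_REPO")
    if not data_root or not repo_root:
        print("Set CHEMVL_DATA_ROOT and MOLFORMER_REPO first.")
    else:
        # Example task list; extend it with the datasets you plan to run.
        for path in missing_inputs(data_root, repo_root, [("classification", "bbbp")]):
            print("missing:", path)
```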

Run Experiments

Dry-run commands:

DRY_RUN=1 bash scripts/chemvl_protocol/run_moleculenet_scaffold.sh
DRY_RUN=1 bash scripts/chemvl_protocol/run_moleculenet_random_scaffold.sh

Run Table A/B:

bash scripts/chemvl_protocol/run_moleculenet_scaffold.sh
bash scripts/chemvl_protocol/run_moleculenet_random_scaffold.sh

Run both in the background:

setsid -f bash scripts/chemvl_protocol/run_ab_background.sh \
  > "${CHEMVL_DATA_ROOT}/results/moleculenet/molformer_under_chemvl/run_ab.log" \
  2>&1 < /dev/null

Useful overrides (Python interpreter and run-seed range):

PYTHON=/path/to/python RUNSEED_START=1 RUNSEED_END=3 \
  bash scripts/chemvl_protocol/run_moleculenet_scaffold.sh

SIDER may need a smaller batch size on memory-constrained GPUs:

SIDER_BATCH_SIZE=16 bash scripts/chemvl_protocol/resume_after_sider_oom.sh

Parameter Audit

Run the audit with a single command from the repository root. A minimal environment that has been verified for the audit is:

conda create -n molformer_param_audit python=3.8 pip -y
conda activate molformer_param_audit
pip install numpy==1.24.4 pandas scikit-learn scipy rdkit-pypi==2022.9.5 torch==1.13.1 transformers==4.30.2
PATH=/root/miniconda3/bin:/usr/local/bin:/usr/local/nvidia/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin \
  pip install pytorch-fast-transformers==0.4.0 --no-build-isolation
bash parameter_audit.sh --strict

The temporary PATH override hides /usr/local/cuda/bin/nvcc so pytorch-fast-transformers builds CPU extensions only. This is sufficient for parameter counting and avoids CUDA toolkit/PyTorch CUDA-version mismatch during installation.

The script writes:

parameter_summary.csv

The audit first tries to instantiate the local ChemVL MoLFormer adapter and count model.parameters() after applying freeze_encoder. Use --strict when producing the formal table so dependency issues fail fast instead of falling back. If pytorch-fast-transformers is unavailable, the non-strict mode falls back to a transparent static architecture count from MolFormerPropertyModel, the checked-in config, and finetune/bert_vocab.txt. The fallback is marked as STATIC_ARCHITECTURE_FALLBACK in the CSV.

The default audit target is Table A bbbp with num_tasks=1, matching the formal comparison table. To audit a different head size, or to write the summary to a different path:

bash parameter_audit.sh --dataset tox21 --num_tasks 12
bash parameter_audit.sh --output outputs/parameter_summary.csv
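
To make the idea of a static architecture count concrete, and to show why the total changes with num_tasks, here is a generic sketch for a standard Transformer encoder with a two-layer property head. It is not the logic of `parameter_audit.py`: the layer structure, the head shape, and all function names are assumptions for illustration, and the real MoLFormer layers may contain components this simple count omits.

```python
def linear_params(n_in, n_out, bias=True):
    """Weights plus optional bias of one dense layer."""
    return n_in * n_out + (n_out if bias else 0)


def encoder_params(vocab_size, d_model, n_layers, d_ff):
    """Token embedding plus n_layers of attention, feed-forward, and two LayerNorms."""
    embedding = vocab_size * d_model
    attention = 4 * linear_params(d_model, d_model)  # query, key, value, output
    feed_forward = linear_params(d_model, d_ff) + linear_params(d_ff, d_model)
    layer_norms = 2 * (2 * d_model)  # scale and shift for each LayerNorm
    return embedding + n_layers * (attention + feed_forward + layer_norms)


def head_params(d_model, d_hidden, num_tasks):
    """Hypothetical two-layer MLP head: d_model -> d_hidden -> num_tasks."""
    return linear_params(d_model, d_hidden) + linear_params(d_hidden, num_tasks)


def parameter_summary(vocab_size, d_model, n_layers, d_ff, d_hidden,
                      num_tasks, freeze_encoder=False):
    """Total and trainable counts; a frozen encoder leaves only the head trainable."""
    encoder = encoder_params(vocab_size, d_model, n_layers, d_ff)
    head = head_params(d_model, d_hidden, num_tasks)
    total = encoder + head
    trainable = head if freeze_encoder else total
    return {"encoder": encoder, "head": head, "total": total, "trainable": trainable}
```

Under these assumptions, only the final layer of the head depends on num_tasks, so moving from 1 task to 12 adds 11 * (d_hidden + 1) parameters and leaves the encoder count unchanged.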

Outputs

Completed runs are written under:

${CHEMVL_DATA_ROOT}/results/moleculenet/molformer_under_chemvl/

Important files:

molformer_under_chemvl_summary_by_dataset.csv
molformer_under_chemvl_summary_macro.csv
molformer_under_chemvl_summary.png

Each individual run stores:

<result_root>/<version>/<dataset>/<timestamp>/
  config.json
  result.json
  train_val_test_history.csv
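
The per-run layout above lends itself to a simple aggregation over run seeds. The sketch below walks `<result_root>/<version>/<dataset>/<timestamp>/result.json` and reports the mean and sample standard deviation per (version, dataset). The schema of `result.json` is not documented here, so the `metric_key` default is a hypothetical field name that must be adjusted to the real file, and the choice of sample standard deviation is also an assumption.

```python
import json
from collections import defaultdict
from pathlib import Path
from statistics import mean, stdev


def collect_results(result_root, metric_key="test_metric"):
    """Gather one metric value per run, grouped by (version, dataset).

    metric_key is a hypothetical field name; check result.json for the real one.
    """
    per_group = defaultdict(list)
    for result_file in sorted(Path(result_root).glob("*/*/*/result.json")):
        dataset = result_file.parent.parent.name
        version = result_file.parent.parent.parent.name
        with result_file.open() as handle:
            payload = json.load(handle)
        per_group[(version, dataset)].append(float(payload[metric_key]))
    return dict(per_group)


def summarize(per_group):
    """Mean and sample standard deviation across runs in each group."""
    summary = {}
    for key, values in per_group.items():
        summary[key] = {
            "n_runs": len(values),
            "mean": mean(values),
            "std": stdev(values) if len(values) > 1 else 0.0,
        }
    return summary
```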

Reproducibility Notes

The split implementation is designed to match ChemVL's scaffold and random_scaffold behavior.
seed = 1 controls the data split in the shipped configs.
runseed = 1, 2, 3 controls training randomness only.
Missing classification labels are ignored, following the ChemVL multitask convention.
Regression metric selection follows ChemVL: QM7 uses MAE; ESOL, FreeSolv, and Lipo use RMSE.
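
The missing-label rule for multitask classification can be sketched as a masked ROC-AUC: each task is scored only on molecules that carry a label for it, and the task scores are then averaged. In the example below, the encoding of a missing label (None or NaN) and the skipping of tasks with only one class present are assumptions based on common MoleculeNet practice, not confirmed details of ChemVL. The pairwise AUC is written in plain Python for clarity rather than speed.

```python
import math


def _is_missing(label):
    """Treat None and NaN as missing labels (assumed encoding)."""
    return label is None or (isinstance(label, float) and math.isnan(label))


def roc_auc(labels, scores):
    """Pairwise ROC-AUC: fraction of (positive, negative) pairs ranked correctly.

    Ties count as half. Returns None if only one class is present.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        return None
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def masked_multitask_roc_auc(y_true, y_score):
    """Average per-task ROC-AUC, ignoring missing labels within each task."""
    n_tasks = len(y_true[0])
    per_task = []
    for task in range(n_tasks):
        kept = [(row_y[task], row_s[task])
                for row_y, row_s in zip(y_true, y_score)
                if not _is_missing(row_y[task])]
        auc = roc_auc([y for y, _ in kept], [s for _, s in kept])
        if auc is not None:  # skip tasks with a single class after masking
            per_task.append(auc)
    if not per_task:
        raise ValueError("no task has both classes present")
    return sum(per_task) / len(per_task)
```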

Upstream

Original MoLFormer project/paper:

Nature Machine Intelligence: https://rdcu.be/c12D0
arXiv: https://arxiv.org/abs/2106.09553
Original data/checkpoints: https://ibm.box.com/v/MoLFormer-data
