This repository is a fork of IBM MoLFormer for reproducing MoLFormer as an external baseline under the ChemVL MoleculeNet protocol.
The purpose of this fork is not to redesign MoLFormer. It keeps the original MoLFormer encoder/checkpoint interface and adds a ChemVL-compatible evaluation wrapper so MoLFormer can be compared fairly with MoleculeSTM, GEM, ChemVL, and MolMCL-style baselines under the same datasets, split rules, seeds, metrics, and output format.
| Model | Input | Finetune setup |
|---|---|---|
| MoLFormer-XL | SMILES sequence | Full-model fine-tuning with a ChemVL-compatible property head |
Benchmark tables:
| Table | Split | Datasets |
|---|---|---|
| A | scaffold | BACE, BBBP, ClinTox, HIV, SIDER, Tox21, ESOL, FreeSolv, Lipo, QM7 |
| B | random_scaffold | BACE, BBBP, ClinTox, HIV, SIDER, Tox21, ESOL, FreeSolv, Lipo, QM7 |
MoleculeACE Table C is not implemented in this fork. It would require a separate MoleculeACE loader and the MolMCL split protocol.
Metric convention follows ChemVL: classification uses ROC-AUC, QM7 uses MAE, and ESOL/FreeSolv/Lipo use RMSE. Each (dataset, split) setting is run with runseed = 1, 2, 3.
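The metric convention and run grid described above can be sketched in a few lines of Python. This is an illustrative sketch, not the fork's code: the names `metric_for` and `run_grid`, the metric strings, and the lowercase dataset identifiers are assumptions made for the example.

```python
from itertools import product

# Dataset groups taken from the benchmark tables; lowercase names are an assumption.
CLASSIFICATION = ["bace", "bbbp", "clintox", "hiv", "sider", "tox21"]
RMSE_REGRESSION = ["esol", "freesolv", "lipo"]
MAE_REGRESSION = ["qm7"]
DATASETS = CLASSIFICATION + RMSE_REGRESSION + MAE_REGRESSION

SPLITS = ["scaffold", "random_scaffold"]  # Table A and Table B
RUNSEEDS = [1, 2, 3]


def metric_for(dataset: str) -> str:
    """Return the metric name used for a dataset (hypothetical helper)."""
    name = dataset.lower()
    if name in CLASSIFICATION:
        return "roc_auc"
    if name in RMSE_REGRESSION:
        return "rmse"
    if name in MAE_REGRESSION:
        return "mae"
    raise ValueError(f"unknown dataset: {dataset}")


def run_grid():
    """Enumerate every (split, dataset, runseed) combination of Tables A and B."""
    return list(product(SPLITS, DATASETS, RUNSEEDS))
```

With ten datasets, two splits, and three runseeds, the full grid contains 60 runs.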
scripts/chemvl_protocol/ ChemVL-compatible MoLFormer runner and batch scripts
configs/chemvl_protocol/ Base configs for Table A/B runs
finetune/ Original MoLFormer finetune/model components
parameter_audit.sh One-command parameter audit entry point
parameter_audit.py MoLFormer parameter audit implementation
parameter_summary.csv Generated by parameter_audit.sh
Install ChemVL protocol dependencies in the Python environment used for this fork:
pip install -r scripts/chemvl_protocol/requirements.txt
For strict split reproducibility, keep the RDKit version aligned with the ChemVL/GEM runs:
rdkit-pypi==2022.9.5
The encoder also requires the original MoLFormer stack, including PyTorch and pytorch-fast-transformers.
Set the roots before running:
export CHEMVL_DATA_ROOT=/path/to/chemvl-data
export MOLFORMER_REPO=/path/to/this/molformer/repo
Expected MoleculeNet layout:
${CHEMVL_DATA_ROOT}/finetuning_datasets/MPP/classification/<task>/processed/<task>_processed_ac.csv
${CHEMVL_DATA_ROOT}/finetuning_datasets/MPP/regression/<task>/processed/<task>_processed_ac.csv
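Before launching a batch, it can help to confirm that the processed CSVs sit where the layout above expects them. The following is a minimal sketch; `processed_csv` and `missing_files` are hypothetical helpers, and the lowercase task directory names are an assumption that should be adjusted to match the actual data root.

```python
import os
from pathlib import Path

# Task groups from the benchmark tables; directory casing is an assumption.
CLASSIFICATION = {"bace", "bbbp", "clintox", "hiv", "sider", "tox21"}
REGRESSION = {"esol", "freesolv", "lipo", "qm7"}


def processed_csv(task: str, data_root=None) -> Path:
    """Build the expected processed-CSV path for one task."""
    root = Path(data_root if data_root is not None else os.environ["CHEMVL_DATA_ROOT"])
    if task in CLASSIFICATION:
        kind = "classification"
    elif task in REGRESSION:
        kind = "regression"
    else:
        raise ValueError(f"unknown task: {task}")
    return (root / "finetuning_datasets" / "MPP" / kind / task
            / "processed" / f"{task}_processed_ac.csv")


def missing_files(tasks, data_root=None):
    """Return the expected paths that do not exist on disk."""
    return [p for p in (processed_csv(t, data_root) for t in tasks) if not p.is_file()]
```

Calling `missing_files(sorted(CLASSIFICATION | REGRESSION))` with `CHEMVL_DATA_ROOT` exported returns an empty list when every expected file is present.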
Expected pretrained checkpoint:
${MOLFORMER_REPO}/data/Pretrained MoLFormer/checkpoints/N-Step-Checkpoint_3_30000.ckpt
Checkpoints are runtime artifacts and should not be committed.
Dry-run commands:
DRY_RUN=1 bash scripts/chemvl_protocol/run_moleculenet_scaffold.sh
DRY_RUN=1 bash scripts/chemvl_protocol/run_moleculenet_random_scaffold.sh
Run Table A/B:
bash scripts/chemvl_protocol/run_moleculenet_scaffold.sh
bash scripts/chemvl_protocol/run_moleculenet_random_scaffold.sh
Run both in the background:
setsid -f bash scripts/chemvl_protocol/run_ab_background.sh \
> "${CHEMVL_DATA_ROOT}/results/moleculenet/molformer_under_chemvl/run_ab.log" \
2>&1 < /dev/null
Useful override:
PYTHON=/path/to/python RUNSEED_START=1 RUNSEED_END=3 \
bash scripts/chemvl_protocol/run_moleculenet_scaffold.sh
SIDER may need a smaller batch size on memory-constrained GPUs:
SIDER_BATCH_SIZE=16 bash scripts/chemvl_protocol/resume_after_sider_oom.sh
The parameter audit runs as one command from the repository root. A minimal verified audit environment is:
conda create -n molformer_param_audit python=3.8 pip -y
conda activate molformer_param_audit
pip install numpy==1.24.4 pandas scikit-learn scipy rdkit-pypi==2022.9.5 torch==1.13.1 transformers==4.30.2
PATH=/root/miniconda3/bin:/usr/local/bin:/usr/local/nvidia/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin \
pip install pytorch-fast-transformers==0.4.0 --no-build-isolation
bash parameter_audit.sh --strict
The temporary PATH override hides /usr/local/cuda/bin/nvcc so pytorch-fast-transformers builds CPU extensions only. This is sufficient for parameter counting and avoids a mismatch between the CUDA toolkit and the PyTorch CUDA version during installation.
The script writes:
parameter_summary.csv
The audit first tries to instantiate the local ChemVL MoLFormer adapter and count model.parameters() after applying freeze_encoder. Use --strict when producing the formal table so dependency issues fail fast instead of falling back. If pytorch-fast-transformers is unavailable, the non-strict mode falls back to a transparent static architecture count from MolFormerPropertyModel, the checked-in config, and finetune/bert_vocab.txt. The fallback is marked as STATIC_ARCHITECTURE_FALLBACK in the CSV.
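When the audit is run without --strict, it is worth checking whether the generated CSV came from the static fallback. The sketch below scans every cell for the marker so that it does not depend on the CSV's column names, which are not documented here; `used_static_fallback` is a hypothetical helper, not part of the fork.

```python
import csv

FALLBACK_MARKER = "STATIC_ARCHITECTURE_FALLBACK"


def used_static_fallback(csv_path: str) -> bool:
    """Return True if any cell of the audit CSV contains the fallback marker."""
    with open(csv_path, newline="") as handle:
        return any(
            FALLBACK_MARKER in cell
            for row in csv.reader(handle)
            for cell in row
        )
```

A formal table should only be built from a summary for which `used_static_fallback("parameter_summary.csv")` returns `False`.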
The default audit target is Table A bbbp with num_tasks=1, matching the formal comparison table. To audit a different head size or write the summary to a different path:
bash parameter_audit.sh --dataset tox21 --num_tasks 12
bash parameter_audit.sh --output outputs/parameter_summary.csv
Completed runs are written under:
${CHEMVL_DATA_ROOT}/results/moleculenet/molformer_under_chemvl/
Important files:
molformer_under_chemvl_summary_by_dataset.csv
molformer_under_chemvl_summary_macro.csv
molformer_under_chemvl_summary.png
Each individual run stores:
<result_root>/<version>/<dataset>/<timestamp>/
config.json
result.json
train_val_test_history.csv
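The per-run layout above makes it straightforward to gather results for custom analysis. The following sketch walks the `<version>/<dataset>/<timestamp>/` directories and loads each `result.json`. The function name `collect_results` is hypothetical, and the JSON payload is returned unchanged because its keys are not documented here.

```python
import json
from pathlib import Path


def collect_results(result_root):
    """Load every <version>/<dataset>/<timestamp>/result.json under result_root."""
    records = []
    for path in sorted(Path(result_root).glob("*/*/*/result.json")):
        # The three directories above result.json encode the run identity.
        version, dataset, timestamp = path.parts[-4:-1]
        with path.open() as handle:
            payload = json.load(handle)
        records.append({
            "version": version,
            "dataset": dataset,
            "timestamp": timestamp,
            "result": payload,  # keys depend on the runner; not assumed here
        })
    return records
```

Runs that crashed before writing `result.json` are skipped by the glob, so comparing the record count against the expected number of runs is a quick completeness check.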
- Split implementation is designed to match ChemVL scaffold and random_scaffold behavior.
- seed = 1 controls the split in the shipped configs.
- runseed = 1, 2, 3 controls training randomness.
- Classification missing labels are ignored following ChemVL multitask convention.
- Regression metric selection follows ChemVL: QM7 uses MAE, ESOL/FreeSolv/Lipo use RMSE.
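To illustrate what ignoring missing classification labels means for a multitask ROC-AUC, here is a dependency-free sketch. It rests on assumptions: missing labels are represented as `None` (the fork's actual encoding is not stated here), tasks that lack both classes after masking are skipped, and the per-task scores are averaged with equal weight. It is not the fork's evaluation code.

```python
def roc_auc(labels, scores):
    """Binary ROC-AUC as the probability that a positive outranks a negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative")
    # Ties between a positive and a negative score count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def masked_multitask_auc(label_rows, score_rows):
    """Average per-task ROC-AUC, ignoring entries whose label is None."""
    num_tasks = len(label_rows[0])
    aucs = []
    for t in range(num_tasks):
        pairs = [(ys[t], ss[t]) for ys, ss in zip(label_rows, score_rows)
                 if ys[t] is not None]
        task_labels = [y for y, _ in pairs]
        # Assumption: a task without both classes is skipped, not scored.
        if 0 in task_labels and 1 in task_labels:
            aucs.append(roc_auc(task_labels, [s for _, s in pairs]))
    if not aucs:
        raise ValueError("no task has both classes after masking")
    return sum(aucs) / len(aucs)
```

Masking per task rather than dropping whole molecules keeps every observed label in play, which matters for sparse multitask datasets such as Tox21 and SIDER.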
Original MoLFormer project/paper:
- Nature Machine Intelligence: https://rdcu.be/c12D0
- arXiv: https://arxiv.org/abs/2106.09553
- Original data/checkpoints: https://ibm.box.com/v/MoLFormer-data