STRIPES (Spatio-Temporal Representation of Interactions in Protein-ligand Engagement Strings) is a molecular fingerprinting method that encodes per-atom protein–ligand interactions from MD trajectories into symbolic strings. This repository contains the full pipeline: extraction from simulations, pairwise similarity computation, SMILES generation via a pretrained Transformer, and embedding visualization with t-SNE.
STRIPES/
├── STRIPES_similarity/                  # Pairwise STRIPES similarity (Hungarian algorithm)
│   ├── similarity_function_hungarian.py
│   └── requirements.txt
├── STRIPES2SMILES/                      # BERT-style pretraining + SMILES decoder finetuning
│   ├── pretraining.py
│   ├── finetuning.py
│   ├── smiles_utils.py
│   ├── run_pretraining.py               # entry point: pretraining
│   ├── run_finetuning.py                # entry point: finetuning (Optuna HPO)
│   ├── generate.py                      # entry point: SMILES generation
│   ├── extract_test_and_generate.py     # batch generation over the finetuning test sets
│   ├── causal_generation/               # STRIPES sequences for perturbation experiments (Fig. 3e, Supp. Fig. S6)
│   │   └── PIM1/
│   │       └── stripes.txt              # PIM1 sequences with manually perturbed interaction tokens
│   └── requirements.txt
├── t-SNE/                               # t-SNE grid search on STRIPES embeddings
│   ├── tsne_stripes_gridsearch.py
│   └── requirements.txt
├── PubChem_analysis/                    # PubChem search + bioactivity analysis on generated molecules
│   ├── pubchem_search.py
│   └── requirements.txt
├── MD/                                  # MD-derived results and STRIPES similarity per dataset
├── misc/                                # Analysis / figure-generation scripts used for the paper
├── figures/                             # Paper figure assets
└── data/
    ├── MISATO/                          # Pretraining dataset (26 MB)
    ├── PPAR/                            # Finetuning dataset
    ├── PIM1/                            # Finetuning dataset
    ├── JAK1/                            # Finetuning dataset
    └── AR/                              # Finetuning dataset
Note: STRIPES extraction from GROMACS MD trajectories (STRIPES_extractor) lives in a separate repository and is not part of this codebase.
Python >= 3.8 is required. GPU support (CUDA) is optional but strongly recommended for STRIPES2SMILES.
To install all dependencies at once:
pip install -r requirements.txt
Each module also ships its own requirements.txt if you only need a specific component:
pip install -r <module>/requirements.txt
A STRIPES string encodes per-atom interaction profiles across MD frames. Atoms are separated by a semicolon (;); within each atom, per-frame interaction tokens are separated by a period (.).
Interaction tokens:
| Token | Interaction |
|---|---|
| H(a1) / H(a2) / H(a3) | H-bond, ligand as acceptor (strong / moderate / weak) |
| H(d1) / H(d2) / H(d3) | H-bond, ligand as donor (strong / moderate / weak) |
| B | Hydrophobic interaction |
| S(-) / S(+) | Salt bridge (negative / positive ligand charge) |
| X | Halogen bond |
| P(f) / P(o) / P(t) | π–π stacking (face-to-face / offset / T-shaped) |
| C(p) / C(t) / C(e) | Cation–π interaction (parallel / tilted / edge) |
| - | No interaction |
Example: H(a1).H(a1).B;-.-.-;...
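The format above can be unpacked with two string splits. The following is a minimal sketch, not the parser used by this repository; the function name `parse_stripes` is hypothetical, and dropping empty atom segments (for example after a trailing semicolon) is an assumption.

```python
from typing import List


def parse_stripes(stripes: str) -> List[List[str]]:
    """Split a STRIPES string into atoms (on ';') and per-frame tokens (on '.')."""
    # Assumption: empty segments (e.g. from a trailing ';') carry no atom and are skipped.
    atoms = [atom for atom in stripes.strip().split(";") if atom]
    return [atom.split(".") for atom in atoms]


profile = parse_stripes("H(a1).H(a1).B;-.-.-")
print(len(profile))  # number of atoms
print(profile[0])    # tokens of the first atom, one per MD frame
```

Each inner list holds one atom's interaction tokens in frame order, so `profile[i][t]` is the token of atom `i` at frame `t`.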
The pairwise-similarity (section 2), pretraining (section 3a), and t-SNE (section 4) steps all require a STRIPES token-to-index mapping at data/stripes_tokens2label.json. Generate it once from the MISATO dataset:
python misc/stripes_token2label.py
This reads data/MISATO/dataset.csv (column STRIPES) and writes data/stripes_tokens2label.json.
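For orientation, a token-to-index mapping of this kind could be built as sketched below. This is an illustration only: the helper `build_token_index` is hypothetical, and the alphabetical index order and the absence of special tokens are assumptions that may not match what misc/stripes_token2label.py actually writes.

```python
import json
from typing import Dict, Iterable


def build_token_index(stripes_strings: Iterable[str]) -> Dict[str, int]:
    """Collect every distinct interaction token and assign it an integer index."""
    tokens = set()
    for stripes in stripes_strings:
        for atom in stripes.split(";"):
            tokens.update(tok for tok in atom.split(".") if tok)
    # Assumption: indices follow sorted token order, starting at 0.
    return {tok: idx for idx, tok in enumerate(sorted(tokens))}


mapping = build_token_index(["H(a1).H(a1).B;-.-.-", "B.B.X;S(+).-.-"])
print(json.dumps(mapping, indent=2))
```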
STRIPES fingerprints are extracted from GROMACS MD trajectories (.tpr/.xtc plus .itp topology files) into a CSV with columns mol_id, STRIPES. The extraction tool lives in a separate companion repository and is not included here — this repo picks up the pipeline starting from the resulting STRIPES CSVs (see data/).
Computes all-vs-all STRIPES similarity using the Hungarian (optimal bipartite matching) algorithm on per-atom Jaccard similarity.
pip install -r STRIPES_similarity/requirements.txt
python STRIPES_similarity/similarity_function_hungarian.py \
--dataset data/PPAR/dataset.csv \
--token-index data/stripes_tokens2label.json \
--output ppar_similarities.csv
The input CSV must contain the columns STRIPES, smiles, pKi.
Output: CSV with columns smiles1, smiles2, STRIPES1, STRIPES2, pKi1, pKi2, similarity.
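The idea of matching atoms optimally on per-atom Jaccard similarity can be sketched on toy inputs as follows. Several details are assumptions rather than the behavior of similarity_function_hungarian.py: each atom is reduced to the set of tokens it shows across frames, the score is normalized by the larger atom count, and the optimal assignment is found by exhaustive search over permutations instead of the Hungarian algorithm. Exhaustive search returns the same optimum but is only feasible for a handful of atoms; for real inputs a polynomial-time solver such as `scipy.optimize.linear_sum_assignment` is the usual choice.

```python
from itertools import permutations
from typing import List, Set


def atom_token_sets(stripes: str) -> List[Set[str]]:
    """One token set per atom (assumption: frame order is ignored)."""
    return [set(atom.split(".")) for atom in stripes.split(";") if atom]


def jaccard(a: Set[str], b: Set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def stripes_similarity(s1: str, s2: str) -> float:
    """Best one-to-one atom matching, scored by summed Jaccard similarity."""
    atoms1, atoms2 = atom_token_sets(s1), atom_token_sets(s2)
    if len(atoms1) > len(atoms2):
        atoms1, atoms2 = atoms2, atoms1  # atoms1 is now the shorter list
    if not atoms2:
        return 1.0
    # Brute-force optimal assignment; stands in for the Hungarian algorithm.
    best = max(
        sum(jaccard(a, atoms2[j]) for a, j in zip(atoms1, perm))
        for perm in permutations(range(len(atoms2)), len(atoms1))
    )
    # Assumption: unmatched atoms count as zero similarity.
    return best / len(atoms2)


print(stripes_similarity("H(a1).B;-.-", "-.-;B.H(a1)"))
print(stripes_similarity("B.B", "B.B;X.X"))
```

Because atoms are matched rather than compared position by position, the score does not depend on the order in which atoms are listed in either string.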
A two-stage deep learning pipeline: (i) BERT-style masked language model pretraining on STRIPES, (ii) encoder–decoder finetuning for STRIPES → SMILES translation with Optuna hyperparameter optimization.
pip install -r STRIPES2SMILES/requirements.txt
python STRIPES2SMILES/run_pretraining.py \
--data_path data/MISATO \
--output_dir results_pre
Key options: --d_model 512, --n_heads 8, --n_layers 8, --batch_size 8, --num_epochs 100, --lr 1e-4, --seed 42.
Outputs in results_pre/:
- pretrained_stripes_encoder.pth: best encoder checkpoint
- stripes_vocab.pkl: vocabulary for finetuning
- training_metadata.json: loss curves and model config
- pretraining_loss.png: training/validation loss plot
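The masking step behind BERT-style pretraining can be illustrated in a few lines. This is a generic sketch, not the code in pretraining.py: the `[MASK]` symbol, the 15% default masking rate, and the flat token list used as input are all assumptions.

```python
import random
from typing import List, Optional, Tuple

MASK = "[MASK]"  # hypothetical mask symbol; the real vocabulary may use another


def mask_tokens(
    tokens: List[str], mask_prob: float = 0.15, seed: int = 42
) -> Tuple[List[str], List[Optional[str]]]:
    """Hide a random subset of tokens; the model is trained to predict them."""
    rng = random.Random(seed)
    inputs: List[str] = []
    labels: List[Optional[str]] = []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            labels.append(tok)   # target the model must recover
        else:
            inputs.append(tok)
            labels.append(None)  # position ignored by the loss
    return inputs, labels


tokens = "H(a1).H(a1).B.-.-.B".split(".")
masked, targets = mask_tokens(tokens, mask_prob=0.5)
print(masked)
print(targets)
```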
python STRIPES2SMILES/run_finetuning.py \
--data_path data/ \
--pretrained_model results_pre/pretrained_stripes_encoder.pth \
--pretrained_vocab results_pre/stripes_vocab.pkl \
--output_dir results_fine \
--datasets PPAR PIM1 JAK1 AR \
--n_trials 100
Each dataset directory must contain a dataset.csv with columns: STRIPES, can_smiles, pKi (mol_id is optional; it is auto-generated from the row index if missing). Only rows with pKi >= 6.0 are used for finetuning.
Outputs per dataset in results_fine/:
- <DATASET>_model.pth: finetuned model checkpoint
- <DATASET>_config.json: model config and best Optuna hyperparameters
- <DATASET>_optuna.json: full Optuna trial log
- <DATASET>_split.json: train/val/test split sizes
- <DATASET>_test_set.csv: held-out test set (mol_id, STRIPES, can_smiles), used as input for generation (see 3c below)
- <DATASET>_metrics.json: training stats
- <DATASET>_plots.png: training/validation loss curves
- summary.json: aggregated metrics across all datasets
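The input requirements for finetuning (pKi >= 6.0, optional mol_id) can be checked on your own CSV before launching a run. The sketch below uses only the standard library; the function name `load_finetuning_rows` and the exact format of the generated IDs are assumptions, and the three rows are made-up placeholders rather than real data.

```python
import csv
import io

PKI_THRESHOLD = 6.0  # threshold stated for finetuning


def load_finetuning_rows(csv_text: str):
    """Keep rows with pKi >= 6.0 and fill in mol_id from the row index if absent."""
    rows = []
    for idx, row in enumerate(csv.DictReader(io.StringIO(csv_text))):
        if float(row["pKi"]) < PKI_THRESHOLD:
            continue
        if not row.get("mol_id"):
            row["mol_id"] = str(idx)  # assumption: plain row index as identifier
        rows.append(row)
    return rows


example = (
    "STRIPES,can_smiles,pKi\n"
    "B.B;-.-,CCO,6.5\n"
    "-.-;B.B,CCN,5.2\n"
    "X.X;B.-,CCCl,7.1\n"
)
kept = load_finetuning_rows(example)
print([(r["mol_id"], r["can_smiles"]) for r in kept])
```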
Generates SMILES for the held-out test set of each dataset, sweeping over beam_size, n_molecules, temperature, and temperature_increment (used to produce the results reported in the paper).
python STRIPES2SMILES/extract_test_and_generate.py \
--pretrained_model results_pre/pretrained_stripes_encoder.pth \
--pretrained_vocab results_pre/stripes_vocab.pkl \
--results_dir results_fine \
--datasets PIM1 JAK1 AR \
--beam_sizes 5 10 15 \
--n_molecules 5 10 \
--temperatures 1.2 1.4 \
--increments -0.2 -0.3 -0.4 -0.5
Reads <DATASET>_test_set.csv (produced by run_finetuning.py, see 3b). For each parameter combination, writes to results_fine/<DATASET>_finetuned/comparison/:
- generated_beam<b>_N<n>_T<t>_step<s>.csv: generated molecules
- generated_beam<b>_N<n>_T<t>_step<s>_metrics.json: validity/uniqueness/novelty
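To see how many runs a sweep implies, the parameter grid from the command above can be enumerated as a Cartesian product. This is a sketch for planning purposes; how the script formats numbers inside the file names is an assumption (plain Python string conversion is used here).

```python
from itertools import product

beam_sizes = [5, 10, 15]
n_molecules = [5, 10]
temperatures = [1.2, 1.4]
increments = [-0.2, -0.3, -0.4, -0.5]


def sweep_filenames():
    """One output name per (beam, N, temperature, increment) combination."""
    names = []
    for b, n, t, s in product(beam_sizes, n_molecules, temperatures, increments):
        # Assumption: numbers are rendered with default str() formatting.
        names.append(f"generated_beam{b}_N{n}_T{t}_step{s}.csv")
    return names


names = sweep_filenames()
print(len(names))  # 3 * 2 * 2 * 4 = 48 combinations per dataset
```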
To select the unique generated molecules across the whole sweep — e.g. as a starting point for downstream MD simulations and STRIPES-similarity analysis (see MD/) — deduplicate them by canonical_smiles:
python misc/merging_generated_mols.py \
--folder results_fine/PPAR_finetuned/comparison
This writes all_unique_molecules.csv to <folder>/all_generated_combined/.
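Deduplication by canonical_smiles amounts to keeping the first row seen for each distinct value. A minimal standard-library sketch follows; it is not the code of misc/merging_generated_mols.py, the helper name is hypothetical, and keeping the first occurrence (rather than, say, the best-ranked one) is an assumption.

```python
import csv
import io


def unique_by_canonical_smiles(csv_texts):
    """Merge several generation CSVs, keeping one row per canonical SMILES."""
    seen = set()
    unique_rows = []
    for text in csv_texts:
        for row in csv.DictReader(io.StringIO(text)):
            key = row["canonical_smiles"]
            if key and key not in seen:  # skip empty entries and repeats
                seen.add(key)
                unique_rows.append(row)
    return unique_rows


run_a = "mol_id,canonical_smiles\n0,CCO\n1,CCN\n"
run_b = "mol_id,canonical_smiles\n0,CCO\n2,c1ccccc1\n3,\n"
merged = unique_by_canonical_smiles([run_a, run_b])
print([row["canonical_smiles"] for row in merged])
```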
To test whether modifying individual interaction tokens in a STRIPES sequence produces chemically coherent changes in the generated molecules, targeted perturbations were applied to representative PIM1 sequences: hydrogen-bond, hydrophobic contact, and salt bridge tokens were independently added or removed. The perturbed sequences are provided in STRIPES2SMILES/causal_generation/PIM1/stripes.txt (CSV with columns mol_id, STRIPES).
Generate SMILES from the perturbed sequences using the standard generation script (see 3d):
python STRIPES2SMILES/generate.py \
--finetuned_model results_fine/PIM1_model.pth \
--pretrained_model results_pre/pretrained_stripes_encoder.pth \
--pretrained_vocab results_pre/stripes_vocab.pkl \
--input_csv STRIPES2SMILES/causal_generation/PIM1/stripes.txt \
--output causal_generation_PIM1.csv
For ad-hoc generation from a single sequence, a CSV (column stripes/STRIPES), or a plain text file:
# Single sequence
python STRIPES2SMILES/generate.py \
--finetuned_model results_fine/PPAR_model.pth \
--pretrained_model results_pre/pretrained_stripes_encoder.pth \
--pretrained_vocab results_pre/stripes_vocab.pkl \
--sequence "<stripes_string>" \
--output generated.csv
# Batch from CSV
python STRIPES2SMILES/generate.py \
--finetuned_model results_fine/PPAR_model.pth \
--pretrained_model results_pre/pretrained_stripes_encoder.pth \
--pretrained_vocab results_pre/stripes_vocab.pkl \
--input_csv data/PPAR/dataset.csv \
--output generated.csv
This writes <output>.csv (columns mol_id, can_smiles, stripes, rank, smiles, canonical_smiles, is_valid) and <output>_metrics.json (validity, uniqueness, novelty).
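The three metrics can be recomputed from the output columns. The definitions below are common conventions and are assumptions here, since the exact formulas used by generate.py are not stated: validity is the fraction of valid generations, uniqueness is the fraction of distinct molecules among the valid ones, and novelty is the fraction of distinct molecules absent from a reference set.

```python
def generation_metrics(records, reference_smiles):
    """records: dicts with 'canonical_smiles' and a boolean 'is_valid'."""
    valid = [r["canonical_smiles"] for r in records if r["is_valid"]]
    unique = set(valid)
    novel = unique - set(reference_smiles)
    return {
        "validity": len(valid) / len(records) if records else 0.0,
        "uniqueness": len(unique) / len(valid) if valid else 0.0,
        "novelty": len(novel) / len(unique) if unique else 0.0,
    }


# Placeholder records for illustration only.
records = [
    {"canonical_smiles": "CCO", "is_valid": True},
    {"canonical_smiles": "CCO", "is_valid": True},
    {"canonical_smiles": "CCN", "is_valid": True},
    {"canonical_smiles": "", "is_valid": False},
]
metrics = generation_metrics(records, reference_smiles={"CCO"})
print(metrics)
```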
Grid search over 36 combinations of t-SNE hyperparameters (perplexity, n_iter, learning_rate) evaluated by trustworthiness and continuity.
pip install -r t-SNE/requirements.txt
python t-SNE/tsne_stripes_gridsearch.py \
--vocab_path data/stripes_tokens2label.json \
--data_path data/MISATO/dataset.csv \
--output_dir results/tsne_grid_search \
--svg_save_dir results/figures
The input CSV (dataset.csv) must contain columns: STRIPES, lig_MW, lig_logP, lig_TPSA, lig_H_donor, lig_H_acceptor, hydrophobic_atoms, polar_atoms, net_charge, frac_hydrophobic, frac_polar, frac_positive, frac_negative, mean_hydropathy, mean_sasa.
Outputs in --output_dir:
- quality_metrics.csv: trustworthiness and continuity for all 36 configurations
- best_configuration_results.csv: t-SNE coordinates for the best configuration
- experiment_summary.txt: full experiment report
- <config>/tsne_<config>_<property>.svg: per-property SVG plots for every configuration
- results/figures/tsne_BEST_<config>_<property>.svg: SVG plots for the best configuration
- correlation_*.png: correlation matrices
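For readers unfamiliar with the two quality metrics, the sketch below implements the standard rank-based definition of trustworthiness in pure Python, and obtains continuity by swapping the roles of the original and embedded spaces. It is meant to show what the numbers measure, not to reproduce tsne_stripes_gridsearch.py; Euclidean distance and the toy one-dimensional points are assumptions, and in practice `sklearn.manifold.trustworthiness` is the usual choice for trustworthiness.

```python
import math


def _ranked_neighbors(points, i):
    """Indices of all other points, sorted by Euclidean distance from point i."""
    others = [j for j in range(len(points)) if j != i]
    return sorted(others, key=lambda j: math.dist(points[i], points[j]))


def trustworthiness(high, low, k):
    """Penalize embedded k-neighbors that are not k-neighbors in the original space.

    Valid for k < n / 2, where n is the number of points.
    """
    n = len(high)
    penalty = 0
    for i in range(n):
        rank_high = {j: r for r, j in enumerate(_ranked_neighbors(high, i), start=1)}
        for j in _ranked_neighbors(low, i)[:k]:
            if rank_high[j] > k:
                penalty += rank_high[j] - k
    return 1 - 2 * penalty / (n * k * (2 * n - 3 * k - 1))


def continuity(high, low, k):
    """Same penalty with the two spaces swapped: original neighbors lost in the embedding."""
    return trustworthiness(low, high, k)


high = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,), (5.0,)]
faithful = list(high)                                           # neighborhoods preserved
scrambled = [(0.0,), (5.0,), (2.0,), (3.0,), (4.0,), (1.0,)]    # two points swapped
print(trustworthiness(high, faithful, k=2), trustworthiness(high, scrambled, k=2))
```

Both metrics equal 1 when every k-neighborhood is preserved and decrease as neighbors are introduced or lost by the embedding.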
Validates generated molecules against PubChem and retrieves bioactivity data (EC50, IC50, Ki, Kd, AC50) for each target.
pip install -r PubChem_analysis/requirements.txt
# Run once per dataset
python PubChem_analysis/pubchem_search.py \
--dataset PPAR \
--results_dir results_fine \
--data_dir data \
--output_dir results_pubchem/PPAR
Expected --results_dir layout (produced by extract_test_and_generate.py + misc/merging_generated_mols.py, see 3c):
results_fine/
└── <DATASET>_finetuned/comparison/all_generated_combined/
    └── all_unique_molecules.csv # must contain column 'canonical_smiles'
Alternatively, pass --input <path_to_all_unique_molecules.csv> directly to bypass --results_dir.
--data_dir layout expected (same as the rest of the pipeline):
data/<DATASET>/dataset.csv # must contain column 'can_smiles' (used as the novelty reference set)
Output: <DATASET>_bioactivity_results.csv (saved to --output_dir, or alongside the input CSV if not given) — for each novel molecule found on PubChem and/or ChEMBL: canonical_smiles, target, exists_on_pubchem, exists_on_chembl, and the lowest reported EC50_uM/IC50_uM/Ki_uM/Kd_uM/AC50_uM against the target.
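Reducing many reported measurements to "the lowest value per activity type" is a simple minimum per key. The sketch below illustrates that reduction only; it performs no PubChem or ChEMBL queries, the function name is hypothetical, the input tuples are made-up placeholders, and it assumes all values have already been converted to µM.

```python
ACTIVITY_TYPES = ("EC50", "IC50", "Ki", "Kd", "AC50")


def lowest_activities(measurements):
    """measurements: iterable of (activity_type, value_uM) pairs for one molecule/target."""
    best = {f"{t}_uM": None for t in ACTIVITY_TYPES}
    for activity_type, value_um in measurements:
        key = f"{activity_type}_uM"
        if key not in best:
            continue  # ignore activity types outside the five reported columns
        if best[key] is None or value_um < best[key]:
            best[key] = value_um
    return best


# Placeholder measurements for illustration only.
summary = lowest_activities(
    [("IC50", 2.5), ("IC50", 0.8), ("Ki", 0.05), ("Potency", 1.0)]
)
print(summary)
```

Activity types with no measurement stay `None`, which maps naturally to an empty cell in the output CSV.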
All random seeds are fixed to 42. Results may vary slightly across hardware and software versions due to non-deterministic GPU operations.
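If you write your own scripts on top of this pipeline and want the same behavior, seeding typically looks like the sketch below. The helper `set_seed` is hypothetical and is not taken from this repository; NumPy and PyTorch are seeded only if they are installed, and the cuDNN flags reduce, but do not fully eliminate, GPU non-determinism.

```python
import os
import random


def set_seed(seed: int = 42) -> None:
    """Seed the random number generators commonly used in ML scripts."""
    random.seed(seed)
    # Only affects Python processes started after this point.
    os.environ["PYTHONHASHSEED"] = str(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # silently ignored without CUDA
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    except ImportError:
        pass


set_seed(42)
first = random.random()
set_seed(42)
second = random.random()
print(first == second)
```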
If you use this repository in your work, please cite:
Criscuolo, E., & Grisoni, F. (2026). Towards a physically interpretable symbolic language of molecular recognition. ChemRxiv. https://doi.org/10.26434/chemrxiv.15000358
BibTeX
@article{criscuolo2026towards,
title = {Towards a physically interpretable symbolic language of molecular recognition},
author = {Criscuolo, Emanuele and Grisoni, Francesca},
year = {2026},
journal = {ChemRxiv},
doi = {10.26434/chemrxiv.15000358},
note = {Preprint},
url = {https://doi.org/10.26434/chemrxiv.15000358}
}
This project is released under the MIT License. See LICENSE for details.