
roshan2004/molscope

MolScope


Turn a molecular structure file into descriptors, contact maps, ML graphs, and coarse-grained bead models, with a small, readable Python API.

Reads .xyz, .pdb, .cif, and .sdf (or fetches from the RCSB by ID). The core depends only on NumPy and Matplotlib; heavier backends (RDKit, PyTorch Geometric, DGL, Gemmi) are opt-in extras. Built for teaching, exploratory analysis, and ML-for-molecules prototyping, not as a replacement for full simulation or cheminformatics stacks.

📖 Full documentation: https://molscope.readthedocs.io

Figure: Aquaporin-1 in four views. 3D structure coloured by element; secondary structure from DSSP (helices red, turns cyan, coil grey); residue-level contact map heatmap; coarse-grained bead model.

Install

pip install molscope            # core: NumPy + Matplotlib only

Optional extras, added only when a workflow needs them:

| Extra | Adds |
|---|---|
| fast | scipy KD-tree for faster bond/contact search on large structures |
| chem | RDKit chemical perception and descriptors |
| cif | Gemmi mmCIF parsing and validation |
| pyg / dgl / graph / gnn | PyTorch Geometric / DGL / NetworkX graph export |
| viz | py3Dmol interactive viewer |
| xlsx | read/write .xlsx molecule tables |
| gpu | Torch dense distance backend |
| mcp | MCP server for AI assistants (Python >= 3.10) |

pip install "molscope[chem,cif,pyg]"   # combine as needed

For local development with uv: uv sync (creates .venv and installs deps + dev tools from the lockfile), then uv run pytest.

Quickstart

Given a .pdb (or .xyz / .cif / .sdf), here is what you can pull out:

import molscope as ms

mol = ms.read("protein.pdb")        # or ms.fetch("1fqy") from the RCSB
print(mol.summary())                # atoms, formula, chains, bounding box

ca   = mol.select(chain="A").alpha_carbons()   # metadata selections
cmap = mol.contact_map(cutoff=8.0)             # residue contact map (NumPy)
desc = mol.descriptors()                       # dict of structural descriptors
graph = mol.to_graph()                         # ML-ready graph, no extra deps
data  = mol.to_pyg_data()                      # PyTorch Geometric Data ([pyg])
cg    = mol.coarse_grain("residue_com")        # one bead per residue

Molecule is immutable: translate, centered, and rotate each return a new molecule, so transformations chain cleanly.
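To illustrate why returning a new object makes transformations chain cleanly, here is a minimal sketch of the pattern. The `Points` class is a hypothetical stand-in written for this example, not MolScope's `Molecule`; only the method names `translate` and `centered` are borrowed from the text above.

```python
from dataclasses import dataclass
from typing import Tuple

Coord = Tuple[float, float, float]


@dataclass(frozen=True)
class Points:
    """Hypothetical immutable container of coordinates (not MolScope's Molecule)."""

    coords: Tuple[Coord, ...]

    def translate(self, dx: float, dy: float, dz: float) -> "Points":
        # Build and return a new object; self is never modified.
        return Points(tuple((x + dx, y + dy, z + dz) for x, y, z in self.coords))

    def centered(self) -> "Points":
        # Shift so the unweighted centroid sits at the origin.
        n = len(self.coords)
        cx = sum(c[0] for c in self.coords) / n
        cy = sum(c[1] for c in self.coords) / n
        cz = sum(c[2] for c in self.coords) / n
        return self.translate(-cx, -cy, -cz)


original = Points(((1.0, 0.0, 0.0), (3.0, 0.0, 0.0)))
moved = original.translate(0.0, 2.0, 0.0).centered()  # each call returns a new Points
```

Because every step returns a fresh object, `original` still holds its starting coordinates after the chain, so intermediate results can be reused safely.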

From a folder of structures to a trained GNN

build_dataset collapses read → featurise → label-join → split into one call, and GraphDataset carries it the last mile to a training loop:

ds = ms.build_dataset(
    "data/*.pdb",                 # glob, list of paths/Molecules, or fetch_dataset(ids) from RCSB
    fmt="pyg",                    # "pyg" | "dgl" | "networkx" | "raw"
    labels="labels.csv",          # joined to each graph by file stem
    split=(0.8, 0.1, 0.1),
    cache_dir=".graph_cache",     # on-disk featurisation cache; reruns reuse it
)
scaler = ds.standardize_targets() # fit on train only; no val/test leakage
for batch in ds.loader("train", batch_size=32):   # batching PyG/DGL DataLoader
    ...                           # scaler.inverse_transform(pred) -> physical units

This graph-ML on-ramp adds no core dependency (fmt="pyg"/"dgl" need their extras). See Molecular graphs and the runnable PDB to a trained GNN example.
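The "fit on train only" comment above describes a general technique: the scaling statistics come from the training targets alone and are then applied unchanged to validation and test data. A minimal sketch of that idea follows. `TargetScaler` is a hypothetical class for illustration and is not the object `ds.standardize_targets()` returns.

```python
from statistics import mean, pstdev
from typing import List, Sequence


class TargetScaler:
    """Hypothetical z-score scaler fitted on training targets only."""

    def __init__(self, train_targets: Sequence[float]) -> None:
        self.mean = mean(train_targets)
        # Guard against constant targets, which would give a zero spread.
        self.std = pstdev(train_targets) or 1.0

    def transform(self, values: Sequence[float]) -> List[float]:
        return [(v - self.mean) / self.std for v in values]

    def inverse_transform(self, values: Sequence[float]) -> List[float]:
        # Map model outputs back to the original (physical) units.
        return [v * self.std + self.mean for v in values]


train, val = [1.0, 2.0, 3.0], [10.0]
scaler = TargetScaler(train)            # statistics come from train only
scaled_val = scaler.transform(val)      # val is scaled with the train statistics
restored = scaler.inverse_transform(scaled_val)
```

The validation value never influences `scaler.mean` or `scaler.std`, which is what prevents leakage.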

Choose your workflow

Not sure what to run first? Match your goal to a starting point. Nearly every row starts from a CLI command, and most of those have a one-line Python equivalent; the full list is in the Command line table below.

| Your goal | Start here | Next step |
|---|---|---|
| Look at a structure | molscope file.pdb: render it, save a PNG/GIF | molscope report file.pdb for a one-file overview |
| Sanity-check a file | molscope qc file.pdb: did it parse cleanly? (any format) | molscope structure-report file.pdb: is this protein ML-ready? |
| Compare two structures | molscope compare a.pdb b.pdb: RMSD, per-residue and contact/descriptor deltas | add --out diff.md for a written report |
| Featurise to a table | molscope presets descriptors: see what's available | molscope analyze "*.pdb" --out features.csv |
| Build an ML graph dataset | ms.build_dataset(...) (API) or molscope export … --to pyg (CLI) | molscope prepare table.csv for train/val/test splits |
| Coarse-grain a structure | molscope presets coarse-grain | molscope coarse-grain file.pdb --mapping martini |
| Triage docking output | molscope dock-report poses.sdf: HTML report + top poses | dock-summary · dock-diverse · dock-rank |
| Drive it from an AI assistant | pip install "molscope[mcp]", then add the MCP server | ask for analyses in natural language |

On any unfamiliar file, molscope report file.pdb gives the broadest single-command overview, and molscope presets lists every descriptor, graph, and coarse-grain option. Before a featurisation run, molscope preflight file.pdb --workflow graph warns about inputs that would silently degrade the result: inferred bonds, missing metadata, no hydrogens, and huge dense matrices. The same check is available by passing --preflight to analyze / export / coarse-grain, or preflight=True to mol.to_graph() / descriptors() / coarse_grain().
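The "huge dense matrices" warning comes down to simple arithmetic: an all-pairs distance matrix grows with the square of the atom count. The sketch below shows that estimate. The helper names and the 2 GiB threshold are assumptions made for illustration, not MolScope's actual preflight rule.

```python
def dense_matrix_bytes(n_atoms: int, bytes_per_value: int = 8) -> int:
    """Memory for an n x n matrix of float64 values (8 bytes each by default)."""
    return n_atoms * n_atoms * bytes_per_value


def is_too_large(n_atoms: int, limit_bytes: int = 2 * 1024**3) -> bool:
    # Hypothetical threshold: flag anything above 2 GiB.
    return dense_matrix_bytes(n_atoms) > limit_bytes


small = dense_matrix_bytes(1_000)    # 8,000,000 bytes, about 8 MB
large = dense_matrix_bytes(50_000)   # 20,000,000,000 bytes, about 20 GB
```

A 50-fold increase in atoms costs 2,500 times the memory, which is why a check like this is worth running before featurising a large structure.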

What you can do

| Capability | Guide |
|---|---|
| Read/write XYZ, PDB, mmCIF, SDF; fetch from RCSB; build from SMILES | Reading files |
| Stream large multi-model files frame by frame | Reading files |
| Select atoms by metadata (chain, residue, name, ...) | Selections |
| Geometry, RMSD, distances, angles, torsions | Geometry and measurements |
| Contact maps and distance matrices | Contact maps |
| DSSP secondary structure, torsions, interfaces, binding sites | Protein analysis |
| Native and RDKit-backed descriptors | Structural descriptors |
| Preflight guardrails before featurising (inferred bonds, missing metadata, dense-matrix sizes) | Structure QC |
| Chemical perception, protein template bonds, bond-order inference | Chemical perception |
| Atom/bond and residue-contact graphs for ML (with positional encodings) | Molecular graphs |
| Assemble a split, labelled graph dataset (cache, loaders, target scaling, RCSB fetch) | Molecular graphs |
| Coarse-grained bead mappings (residue, Martini-style, custom) | Coarse-graining |
| NMR ensembles and clustering | Ensemble analysis |
| Plotting and py3Dmol viewing | Plotting and viewing |
| Diverse subset selection from a CSV/XLSX table | Diverse selection |
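As a concrete picture of what a residue contact map holds, the sketch below marks two residues as in contact when their representative points lie within a cutoff. It is a plain-Python illustration under stated assumptions (one point per residue, an inclusive cutoff, the diagonal counted as contact) and does not reproduce `mol.contact_map`, whose conventions may differ.

```python
from math import dist
from typing import List, Sequence, Tuple

Coord = Tuple[float, float, float]


def contact_map(coords: Sequence[Coord], cutoff: float = 8.0) -> List[List[int]]:
    """Return an n x n 0/1 matrix: 1 where two points are within `cutoff`."""
    n = len(coords)
    return [
        [1 if dist(coords[i], coords[j]) <= cutoff else 0 for j in range(n)]
        for i in range(n)
    ]


# Four points spaced 3.8 apart along x, roughly a stretched C-alpha trace.
ca = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
cmap = contact_map(ca)
```

With an 8.0 cutoff, the first point is in contact with its two nearest neighbours (3.8 and 7.6 away) but not with the last one (11.4 away).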

Task-oriented tutorials: PDB to descriptors, PDB to graph/GNN, and PDB to coarse-grained beads. A runnable tour over the bundled samples lives in examples/tour.py.

Command line

| Command | Does |
|---|---|
| molscope <file> (view) | visualise a structure, save a PNG or GIF |
| molscope report | one-file structure report (QC, chains/ligands, descriptors, contact map, graph stats, optional CG preview) as HTML/Markdown |
| molscope compare | compare two static structures: aligned RMSD, per-residue deviations, contact-map delta, descriptor delta |
| molscope qc | lightweight structure-quality report (atoms, chains, ligands, metadata, elements, bonds, altLoc, CIF/PDB warnings) |
| molscope structure-report | ML-readiness check for a protein (residue gaps, missing/truncated atoms, chain breaks, net charge) |
| molscope preflight | warn about inputs that silently degrade descriptor/graph/CG output (inferred bonds, missing metadata, altLocs, no hydrogens, large dense matrices) |
| molscope presets | list the available descriptor / graph / coarse-grain presets and what each one produces |
| molscope analyze | batch descriptor table to CSV |
| molscope binding-site | ligand binding-site contacts and pocket descriptors |
| molscope export | batch graph export to PyG / DGL / NetworkX |
| molscope prepare | build ML-ready train/validation/test splits from a table or SDF |
| molscope coarse-grain | map a structure to CG beads and write a coordinate file (PDB CONECT bonds) |
| molscope select | diverse subset from a CSV/XLSX table |
| molscope dock-summary | rank docking poses from an SDF; summary + top-hit tables + score plot |
| molscope dock-diverse | diverse shortlist of top hits by Tanimoto clustering |
| molscope dock-rank | transparent consensus ranking across scored SDFs |
| molscope dock-report | self-contained HTML report + top poses for PyMOL/ChimeraX/Mol* |

molscope examples/data/1fqy.pdb --select atom_name=CA --color-by residue --save ca.png
molscope report examples/data/3ptb.pdb --out-dir report/ --coarse-grain
molscope compare apo.pdb holo.pdb --atoms ca --out compare.md
molscope qc examples/data/3ptb.pdb
molscope structure-report --fetch 1ubq      # is this protein ML-ready?
molscope preflight examples/data/1ubq.pdb --workflow graph   # what would degrade my graph?
molscope presets descriptors          # discover the --preset / node_features options
molscope analyze examples/data/*.pdb --out results.csv --preset native-3d --jobs 4
molscope export "data/*.cif" --to pyg --out-dir pyg_graphs/ --pe laplacian --jobs 8
molscope prepare data.csv --smiles-col SMILES --split scaffold --out-dir prepared/
molscope coarse-grain examples/data/1fqy.pdb --mapping martini --out cg.pdb
molscope select molecules.csv --smiles-col SMILES --compute-descriptors -n 100 --out picked.csv
molscope dock-summary vina_out.sdf --score-field minimizedAffinity --top 20
molscope dock-diverse vina_out.sdf --top 500 --select 50
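The idea behind a Tanimoto-based diverse shortlist can be sketched in a few lines. This example uses a simple greedy "leader" pick as a stand-in, which is not necessarily the clustering that dock-diverse performs. Fingerprints are represented as sets of on-bit indices, and the function names and the 0.5 threshold are assumptions for illustration.

```python
from typing import FrozenSet, List, Sequence


def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Shared on-bits divided by total distinct on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def pick_diverse(fingerprints: Sequence[FrozenSet[int]],
                 max_similarity: float = 0.5) -> List[int]:
    """Walk the list in rank order; keep an item only if it is not too
    similar to anything already kept (greedy leader picking)."""
    picked: List[int] = []
    for idx, fp in enumerate(fingerprints):
        if all(tanimoto(fp, fingerprints[k]) <= max_similarity for k in picked):
            picked.append(idx)
    return picked


# Toy fingerprints, assumed to be sorted best score first.
fps = [
    frozenset({1, 2, 3, 4}),
    frozenset({1, 2, 3, 5}),   # similarity 0.6 to the first: dropped
    frozenset({7, 8, 9}),
    frozenset({7, 8, 9, 10}),  # similarity 0.75 to the third: dropped
]
chosen = pick_diverse(fps)
```

Because the input is walked in rank order, the best-scoring member of each group of similar molecules is the one that survives.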

Pipeline-friendly output. Every --json command (qc, structure-report, compare, preflight, presets) prints the same envelope, so a downstream tool can read one shape regardless of the command:

{
  "tool": "molscope", "version": "0.16.0", "command": "qc",
  "input": "3ptb.pdb",        // path/id, or a list for batch commands
  "parser": "pdb",            // reader chosen from the extension
  "backends": ["scipy"],      // optional packages this run engaged
  "warnings": [ ... ],
  "result": { ... }           // the command-specific payload
}
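A downstream tool could consume that envelope along the following lines. This is a sketch that assumes the real output is plain JSON, with the `//` notes above being documentation only. The sample string, the warning text in it, and the `read_envelope` helper are made up for the example.

```python
import json
from typing import Any, Dict

# Keys taken from the envelope shape shown above.
REQUIRED = {"tool", "version", "command", "input", "parser",
            "backends", "warnings", "result"}

# Hand-written sample; the warning text is a placeholder, not real output.
raw = """
{
  "tool": "molscope", "version": "0.16.0", "command": "qc",
  "input": "3ptb.pdb", "parser": "pdb", "backends": ["scipy"],
  "warnings": ["placeholder warning"],
  "result": {}
}
"""


def read_envelope(text: str) -> Dict[str, Any]:
    """Parse an envelope and fail loudly if an expected key is absent."""
    envelope = json.loads(text)
    missing = REQUIRED - envelope.keys()
    if missing:
        raise ValueError(f"envelope is missing keys: {sorted(missing)}")
    return envelope


env = read_envelope(raw)
has_warnings = bool(env["warnings"])  # same check works for any --json command
```

Since the outer shape is shared, only the handling of `env["result"]` needs to vary per command.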

The batch commands (analyze, export) take --manifest PATH to write that envelope alongside the outputs. The manifest adds feature_names (CSV columns / graph node+edge features) and skipped (each input that failed, with the reason), which is handy for reproducible runs and for catching dropped structures:

molscope analyze "data/*.pdb" --out features.csv --manifest run.json
molscope export "data/*.cif" --to pyg -o graphs/ --manifest run.json
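One way to act on a manifest is to summarise what was dropped before training on the outputs. The sketch below works on an in-memory dictionary and makes two assumptions not confirmed by the text above: that `feature_names` and `skipped` are top-level keys, and that each `skipped` entry is an object with `input` and `reason` fields. Adjust the lookups to match the real file.

```python
from typing import Any, Dict, List


def summarise_manifest(manifest: Dict[str, Any]) -> List[str]:
    """Return human-readable lines describing features and skipped inputs."""
    features = manifest.get("feature_names", [])
    skipped = manifest.get("skipped", [])
    lines = [f"{len(features)} feature columns", f"{len(skipped)} inputs skipped"]
    for entry in skipped:
        # Assumed entry layout: {"input": ..., "reason": ...}
        lines.append(f"  {entry['input']}: {entry['reason']}")
    return lines


# Hand-written example data, not real MolScope output.
example = {
    "feature_names": ["n_atoms", "radius_of_gyration"],
    "skipped": [{"input": "data/broken.pdb", "reason": "no ATOM records"}],
}
report = summarise_manifest(example)
```

For a file on disk, the same function can be fed the result of `json.load` on the manifest path.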

Use from an AI assistant (MCP)

MolScope ships an optional Model Context Protocol server, so an MCP-capable assistant (Claude Code/Desktop, Codex CLI, Gemini CLI) can drive its analyses in natural language. It exposes the public API as 28 tools (structure analysis, graphs, plots, dataset prep, docking-hit triage) and adds no new science.

pip install "molscope[mcp]"              # needs Python >= 3.10
claude mcp add molscope -- molscope-mcp  # Claude Code
codex mcp add molscope -- molscope-mcp   # Codex CLI
gemini mcp add molscope molscope-mcp     # Gemini CLI

For example: "fetch trypsin (3ptb), find the benzamidine binding-site residues, and render a contact map." See docs/user-guide/mcp-server.md for the full tool reference.

Scientific validation

MolScope is explicit about which results are cross-checked against reference tools and which are intentionally lightweight:

| Feature | Status |
|---|---|
| Geometry, RMSD, contact maps | Cross-checked vs MDAnalysis (near machine precision) |
| Bond perception, chemical features | Cross-checked vs RDKit |
| Secondary structure (simplified DSSP) | Cross-checked vs mkdssp: ~98 to 99% 3-state agreement across helical, mixed, and all-beta folds |
| Protein template bonds | Cross-checked vs known per-residue chemistry |
| Native descriptors, molecular graphs | Deterministic; not benchmarked against a curated library |
| Coarse-graining | Mapping and visualisation only; not a validated force-field model |
| Standard protonation | Idealised pH-7 textbook model ("standard") |
| pKa-aware protonation | Environment-aware via PROPKA (proteins) / Dimorphite-DL (SMILES) at a chosen pH |

Methods, tolerances, and failure modes are in docs/validation.md. The CI validation job runs physical invariants plus these cross-checks on every push, and publishes a generated validation summary (a table of what passed or was skipped) to the workflow run page and as a downloadable artifact, so the scientific boundaries are easy to inspect without reading the logs.
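A "physical invariant" in this sense is a property that must hold regardless of the input, such as RMSD being zero for identical structures and unchanged when both structures are shifted by the same vector. The sketch below checks those two properties for a plain, unaligned RMSD. It is an illustration of the idea, not a test taken from MolScope's validation suite.

```python
from math import sqrt
from typing import List, Sequence, Tuple

Coord = Tuple[float, float, float]


def rmsd(a: Sequence[Coord], b: Sequence[Coord]) -> float:
    """Unaligned root-mean-square deviation between matched points."""
    if len(a) != len(b):
        raise ValueError("structures must have the same number of points")
    total = sum((p - q) ** 2 for u, v in zip(a, b) for p, q in zip(u, v))
    return sqrt(total / len(a))


def translate(coords: Sequence[Coord], shift: Coord) -> List[Coord]:
    return [(x + shift[0], y + shift[1], z + shift[2]) for x, y, z in coords]


ref = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0)]
other = [(0.0, 0.0, 1.0), (1.5, 0.0, 1.0), (1.5, 1.5, 1.0)]
shift = (10.0, -4.0, 2.5)

self_rmsd = rmsd(ref, ref)                                      # invariant 1: zero
before = rmsd(ref, other)
after = rmsd(translate(ref, shift), translate(other, shift))    # invariant 2: unchanged
```

Checks like these need no reference tool, which makes them a useful complement to cross-checks against external packages.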

Scope and design philosophy

MolScope is a lightweight core (NumPy and Matplotlib only) with a broad optional surface. Those optional extras are deliberately of two kinds, and the distinction is what keeps "lightweight" honest:

  • Interop / output targets (networkx, pyg, dgl, viz, xlsx, gpu, mcp). These let MolScope emit to your framework or run faster. Speaking many formats is the bridge mission, so this surface is expected to grow.
  • Method backends (chem/RDKit, cif/gemmi, propka, dimorphite, validation/MDAnalysis). These delegate a scientific computation to an external tool, and each one inherits that tool's scope, versioning, and correctness surface. This is the surface we keep deliberately small.

A new method backend earns inclusion only when MolScope adds real integration value beyond what you could write in a line or two against the tool directly (for example, mapping RDKit perception back onto MolScope's atom model and residue templates), and when the wrapped tool is maintained. The core does something useful on its own; everything heavier degrades gracefully when its extra is absent. MolScope is not, and is not trying to become, a re-implementation of a cheminformatics or simulation stack: where a dedicated tool is the right answer, it integrates that tool rather than reinventing it, and says so.
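The "degrades gracefully" behaviour can be pictured with the common optional-import pattern: use the faster backend when it is installed and fall back to a pure-Python path otherwise. The sketch below is an assumption about how such a fallback might look for a neighbour search. `close_pairs` is a hypothetical helper, not part of MolScope's API.

```python
from importlib.util import find_spec
from itertools import combinations
from math import dist
from typing import Sequence, Set, Tuple

Coord = Tuple[float, float, float]

# True only when the optional dependency is importable.
HAVE_SCIPY = find_spec("scipy") is not None


def close_pairs(points: Sequence[Coord], radius: float) -> Set[Tuple[int, int]]:
    """Index pairs (i < j) whose points lie within `radius` of each other."""
    if HAVE_SCIPY:
        # Fast path: KD-tree query, imported lazily so the core never needs it.
        from scipy.spatial import cKDTree
        return {(int(i), int(j)) for i, j in cKDTree(points).query_pairs(radius)}
    # Fallback: brute-force all-pairs check, fine for small inputs.
    return {
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(points), 2)
        if dist(a, b) <= radius
    }


points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
pairs = close_pairs(points, 1.5)
```

Both paths return the same set, so callers get identical results and only the speed changes with the installed extras.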

FAQ

Which formats can it read? .xyz, .pdb, .cif, and .sdf; fetch from the RCSB with ms.fetch("1fqy"); or build from SMILES with ms.read_smiles(...) (needs [chem]).

Does it handle MD trajectories? It works on static structures and multi-model files (NMR ensembles, and ms.stream(...) to iterate large multi-model PDB/XYZ frame by frame). It has no trajectory engine; for DCD/XTC and friends use MDAnalysis or MDTraj.
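Frame-by-frame iteration over a multi-model PDB can be sketched with a generator that splits on the standard MODEL and ENDMDL records, so only one frame is held in memory at a time. This is an illustration of the idea under that assumption. `iter_models` is a hypothetical helper and does not show how `ms.stream(...)` is implemented.

```python
from typing import Iterable, Iterator, List


def iter_models(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield the ATOM/HETATM lines of each MODEL ... ENDMDL block in turn."""
    frame: List[str] = []
    inside = False
    for line in lines:
        record = line[:6].strip()
        if record == "MODEL":
            frame, inside = [], True
        elif record == "ENDMDL":
            yield frame
            frame, inside = [], False
        elif inside and record in ("ATOM", "HETATM"):
            frame.append(line.rstrip("\n"))


sample = """MODEL        1
ATOM      1  CA  GLY A   1       0.000   0.000   0.000
ATOM      2  CA  GLY A   2       3.800   0.000   0.000
ENDMDL
MODEL        2
ATOM      1  CA  GLY A   1       0.100   0.000   0.000
ENDMDL
END
"""

# Passing an iterator keeps the example lazy, as it would be with an open file.
frame_sizes = [len(frame) for frame in iter_models(iter(sample.splitlines()))]
```

With a real file, passing the open file handle instead of `sample.splitlines()` avoids loading the whole file at once.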

Is the coarse-graining a real force field? No. It produces CG mappings and bead graphs for inspection and ML prototyping. The OpenMM XML export describes topology only and is not a validated Martini parameter set.
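To make "mapping only" concrete, a one-bead-per-residue mapping can be as simple as placing each bead at the mass-weighted centre of its residue's atoms. The sketch below shows that calculation in plain Python. The input layout, the small mass table, and the `residue_com` function are assumptions for illustration and are not MolScope's `coarse_grain("residue_com")` implementation.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Coord = Tuple[float, float, float]

# Standard atomic weights for a few common elements.
MASSES = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999, "S": 32.06}


def residue_com(atoms: Iterable[Tuple[int, str, Coord]]) -> Dict[int, Coord]:
    """Map (residue_id, element, xyz) records to one centre of mass per residue."""
    sums = defaultdict(lambda: [0.0, 0.0, 0.0, 0.0])  # sum(m*x), sum(m*y), sum(m*z), sum(m)
    for res_id, element, (x, y, z) in atoms:
        m = MASSES[element]
        acc = sums[res_id]
        acc[0] += m * x
        acc[1] += m * y
        acc[2] += m * z
        acc[3] += m
    return {res: (sx / m, sy / m, sz / m) for res, (sx, sy, sz, m) in sums.items()}


atoms = [
    (1, "C", (0.0, 0.0, 0.0)),
    (1, "C", (2.0, 0.0, 0.0)),
    (2, "O", (5.0, 5.0, 5.0)),
]
beads = residue_com(atoms)  # residue 1 lands midway between its two carbons
```

The result carries positions only. Nothing here assigns bead types, charges, or interaction parameters, which is the gap between a mapping and a force field.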

Do I need RDKit or PyTorch? No. The core runs on NumPy and Matplotlib; those are opt-in extras you install only for the matching workflow.

Will odd PDB files parse? ATOM/HETATM lines are read by fixed columns (not whitespace), so touching, large, or negative coordinate fields read correctly. Alternate conformations default to the primary altLoc.
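The difference between fixed-column and whitespace parsing shows up as soon as coordinate fields touch. The sketch below builds an ATOM line with three negative coordinates that fill their 8-character fields, then reads it back by column position using the standard PDB column layout. The helper names are hypothetical and this is not MolScope's parser.

```python
from typing import Any, Dict


def format_atom(serial: int, name: str, res_name: str, chain: str,
                res_seq: int, x: float, y: float, z: float) -> str:
    """Lay out an ATOM record so x, y, z occupy columns 31-38, 39-46, 47-54."""
    return "{:<6}{:>5} {:<4}{:1}{:>3} {:1}{:>4}{:1}   {:8.3f}{:8.3f}{:8.3f}".format(
        "ATOM", serial, name, "", res_name, chain, res_seq, "", x, y, z
    )


def parse_atom(line: str) -> Dict[str, Any]:
    """Read fields by fixed column slices (0-based, end-exclusive)."""
    return {
        "name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
    }


line = format_atom(1, " CA ", "GLY", "A", 12, -100.123, -200.456, -300.789)
atom = parse_atom(line)
tokens = line.split()  # whitespace splitting fuses the three coordinates into one token
```

Here `tokens` ends with the single string `-100.123-200.456-300.789`, which cannot be converted to a number, while the column slices recover all three values.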

Development and citation

uv run pytest                  # full test suite
uv run pytest tests/validation # validation suite only
uv run ruff check .            # lint

CI runs the suite and linting across Python 3.9 / 3.11 / 3.13, smoke-imports the extras, and runs a separate validation job on every push and PR. Notable changes per release are in CHANGELOG.md.

Each release is archived on Zenodo with a citable DOI. The concept DOI 10.5281/zenodo.20433850 always resolves to the latest version; citation metadata is in CITATION.cff, so GitHub's "Cite this repository" button produces BibTeX and APA entries.

License

MIT

About

Lightweight Python toolkit for molecular structure analysis, visualisation, descriptors, graph export, and coarse-graining.
