Turn a molecular structure file into descriptors, contact maps, ML graphs, and coarse-grained bead models, with a small, readable Python API.
Reads .xyz, .pdb, .cif, and .sdf (or fetches from the RCSB by ID). The
core depends only on NumPy and Matplotlib; heavier backends (RDKit, PyTorch
Geometric, DGL, Gemmi) are opt-in extras. Built for teaching, exploratory
analysis, and ML-for-molecules prototyping, not as a replacement for full
simulation or cheminformatics stacks.
📖 Full documentation: https://molscope.readthedocs.io
Gallery: 3D structure (element) · secondary structure (DSSP) · residue contact map · coarse-grained beads.
```bash
pip install molscope   # core: NumPy + Matplotlib only
```

Optional extras, added only when a workflow needs them:
| Extra | Adds |
|---|---|
| `fast` | scipy KD-tree for faster bond/contact search on large structures |
| `chem` | RDKit chemical perception and descriptors |
| `cif` | Gemmi mmCIF parsing and validation |
| `pyg` / `dgl` / `graph` / `gnn` | PyTorch Geometric / DGL / NetworkX graph export |
| `viz` | py3Dmol interactive viewer |
| `xlsx` | read/write .xlsx molecule tables |
| `gpu` | Torch dense distance backend |
| `mcp` | MCP server for AI assistants (Python >= 3.10) |
```bash
pip install "molscope[chem,cif,pyg]"   # combine as needed
```

For local development with uv: `uv sync` (creates `.venv` and installs deps + dev tools from the lockfile), then `uv run pytest`.
Given a .pdb (or .xyz / .cif / .sdf), here is what you can pull out:
```python
import molscope as ms

mol = ms.read("protein.pdb")                 # or ms.fetch("1fqy") from the RCSB
print(mol.summary())                         # atoms, formula, chains, bounding box
ca = mol.select(chain="A").alpha_carbons()   # metadata selections
cmap = mol.contact_map(cutoff=8.0)           # residue contact map (NumPy)
desc = mol.descriptors()                     # dict of structural descriptors
graph = mol.to_graph()                       # ML-ready graph, no extra deps
data = mol.to_pyg_data()                     # PyTorch Geometric Data ([pyg])
cg = mol.coarse_grain("residue_com")         # one bead per residue
```

`Molecule` is immutable: `translate`, `centered`, and `rotate` each return a new molecule, so transformations chain cleanly.
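The immutability pattern described above can be sketched with a frozen dataclass. This is a minimal illustration, not MolScope's implementation: the `Points` class is a hypothetical stand-in that holds coordinates only, and its `translate` and `centered` methods merely mirror the names mentioned in the text.

```python
from dataclasses import dataclass
from typing import Tuple

Vec = Tuple[float, float, float]


@dataclass(frozen=True)
class Points:
    """Hypothetical stand-in for an immutable molecule (coordinates only)."""

    coords: Tuple[Vec, ...]

    def translate(self, shift: Vec) -> "Points":
        # Build and return a new object; the original is never modified.
        return Points(tuple(
            (x + shift[0], y + shift[1], z + shift[2]) for x, y, z in self.coords
        ))

    def centered(self) -> "Points":
        # Shift so that the unweighted centroid sits at the origin.
        n = len(self.coords)
        centroid = tuple(sum(p[k] for p in self.coords) / n for k in range(3))
        return self.translate(tuple(-c for c in centroid))


original = Points(((0.0, 0.0, 0.0), (2.0, 0.0, 0.0)))
moved = original.translate((1.0, 1.0, 1.0)).centered()  # calls chain cleanly
```

Because every method returns a fresh object, `original` still holds its starting coordinates after the chained call, which is what makes chaining safe.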
build_dataset collapses read → featurise → label-join → split into one call,
and GraphDataset carries it the last mile to a training loop:
```python
import molscope as ms

ds = ms.build_dataset(
    "data/*.pdb",               # glob, list of paths/Molecules, or fetch_dataset(ids) from RCSB
    fmt="pyg",                  # "pyg" | "dgl" | "networkx" | "raw"
    labels="labels.csv",        # joined to each graph by file stem
    split=(0.8, 0.1, 0.1),
    cache_dir=".graph_cache",   # on-disk featurisation cache; reruns reuse it
)
scaler = ds.standardize_targets()                 # fit on train only; no val/test leakage
for batch in ds.loader("train", batch_size=32):   # batching PyG/DGL DataLoader
    ...                                           # scaler.inverse_transform(pred) -> physical units
```

This graph-ML on-ramp adds no core dependency (`fmt="pyg"`/`"dgl"` need their extras). See Molecular graphs and the runnable PDB to a trained GNN example.
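To make the "label-join by file stem" and "fit on train only" steps concrete, here is a standard-library sketch of the idea. It is an illustration under stated assumptions, not the code behind `build_dataset`: the helper `join_and_split`, the class `TrainOnlyScaler`, the shuffle-then-cut split, and the sample paths and labels are all hypothetical.

```python
import random
import statistics
from pathlib import Path


def join_and_split(paths, labels, fractions=(0.8, 0.1, 0.1), seed=0):
    """Join labels to paths by file stem, shuffle, and cut into three splits."""
    rows = [(p, labels[Path(p).stem]) for p in paths if Path(p).stem in labels]
    random.Random(seed).shuffle(rows)  # seeded, so reruns give the same split
    n_train = int(fractions[0] * len(rows))
    n_val = int(fractions[1] * len(rows))
    return {
        "train": rows[:n_train],
        "val": rows[n_train:n_train + n_val],
        "test": rows[n_train + n_val:],
    }


class TrainOnlyScaler:
    """Standardise targets with statistics taken from the training split only."""

    def __init__(self, train_targets):
        self.mean = statistics.fmean(train_targets)
        self.std = statistics.pstdev(train_targets) or 1.0  # guard constant targets

    def transform(self, y):
        return (y - self.mean) / self.std

    def inverse_transform(self, z):
        return z * self.std + self.mean  # back to physical units


paths = [f"data/mol{i}.pdb" for i in range(10)]     # made-up file names
labels = {f"mol{i}": float(i) for i in range(10)}   # made-up targets
splits = join_and_split(paths, labels)
scaler = TrainOnlyScaler([y for _, y in splits["train"]])
```

Validation and test targets never enter the mean or standard deviation, so reported metrics are not flattered by information from held-out data.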
Not sure what to run first? Match your goal to a starting point. Every row is a CLI command (most have a one-line Python equivalent); the full list is in the Command line table below.
| Your goal | Start here | Next step |
|---|---|---|
| Look at a structure | `molscope file.pdb`: render it, save a PNG/GIF | `molscope report file.pdb` for a one-file overview |
| Sanity-check a file | `molscope qc file.pdb`: did it parse cleanly? (any format) | `molscope structure-report file.pdb`: is this protein ML-ready? |
| Compare two structures | `molscope compare a.pdb b.pdb`: RMSD, per-residue and contact/descriptor deltas | add `--out diff.md` for a written report |
| Featurise to a table | `molscope presets descriptors`: see what's available | `molscope analyze "*.pdb" --out features.csv` |
| Build an ML graph dataset | `ms.build_dataset(...)` (API) or `molscope export … --to pyg` (CLI) | `molscope prepare table.csv` for train/val/test splits |
| Coarse-grain a structure | `molscope presets coarse-grain` → `molscope coarse-grain file.pdb --mapping martini` | |
| Triage docking output | `molscope dock-report poses.sdf`: HTML report + top poses | `dock-summary` · `dock-diverse` · `dock-rank` |
| Drive it from an AI assistant | `pip install "molscope[mcp]"`, then add the MCP server | ask for analyses in natural language |
On any unfamiliar file, molscope report file.pdb gives the broadest
single-command overview and molscope presets lists every descriptor, graph,
and coarse-grain option. Before a featurisation run, molscope preflight file.pdb --workflow graph (or pass --preflight to analyze / export /
coarse-grain, or preflight=True to mol.to_graph() / descriptors() /
coarse_grain()) warns about inputs — inferred bonds, missing metadata, no
hydrogens, huge dense matrices — that would silently degrade the result.
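One of the preflight concerns above, huge dense matrices, comes down to simple arithmetic: an N x N distance matrix grows quadratically with atom count. The sketch below shows that estimate only. It assumes 8-byte float64 entries, and both the function names and the 2 GiB limit are hypothetical, not MolScope's actual thresholds.

```python
def dense_matrix_bytes(n_atoms: int, bytes_per_value: int = 8) -> int:
    """Memory for a full n_atoms x n_atoms matrix (float64 by default)."""
    return n_atoms * n_atoms * bytes_per_value


def preflight_dense(n_atoms: int, limit_bytes: int = 2 * 1024**3) -> list:
    """Return a list of warnings; empty when the matrix fits under the limit."""
    size = dense_matrix_bytes(n_atoms)
    if size > limit_bytes:
        gib = size / 1024**3
        return [f"dense {n_atoms}x{n_atoms} matrix needs about {gib:.1f} GiB"]
    return []


small = preflight_dense(1_000)    # 8 MB, no warning
large = preflight_dense(50_000)   # 2e10 bytes, well over the limit
```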
| Capability | Guide |
|---|---|
| Read/write XYZ, PDB, mmCIF, SDF; fetch from RCSB; build from SMILES | Reading files |
| Stream large multi-model files frame by frame | Reading files |
| Select atoms by metadata (chain, residue, name, ...) | Selections |
| Geometry, RMSD, distances, angles, torsions | Geometry and measurements |
| Contact maps and distance matrices | Contact maps |
| DSSP secondary structure, torsions, interfaces, binding sites | Protein analysis |
| Native and RDKit-backed descriptors | Structural descriptors |
| Preflight guardrails before featurising (inferred bonds, missing metadata, dense-matrix sizes) | Structure QC |
| Chemical perception, protein template bonds, bond-order inference | Chemical perception |
| Atom/bond and residue-contact graphs for ML (with positional encodings) | Molecular graphs |
| Assemble a split, labelled graph dataset (cache, loaders, target scaling, RCSB fetch) | Molecular graphs |
| Coarse-grained bead mappings (residue, Martini-style, custom) | Coarse-graining |
| NMR ensembles and clustering | Ensemble analysis |
| Plotting and py3Dmol viewing | Plotting and viewing |
| Diverse subset selection from a CSV/XLSX table | Diverse selection |
Task-oriented tutorials: PDB to descriptors,
PDB to graph/GNN, and
PDB to coarse-grained beads. A
runnable tour over the bundled samples lives in examples/tour.py.
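As a concrete picture of the contact maps listed in the capability table, a map is a square 0/1 matrix marking which pairs of points lie within a cutoff. The sketch below uses only the standard library and made-up alpha-carbon coordinates spaced 3.8 Å apart. Whether MolScope uses `<=` or `<`, and how it treats the diagonal, are assumptions here.

```python
import math


def contact_map(points, cutoff=8.0):
    """Return an n x n list of 0/1 flags: 1 where distance <= cutoff."""
    n = len(points)
    return [
        [1 if math.dist(points[i], points[j]) <= cutoff else 0 for j in range(n)]
        for i in range(n)
    ]


# Four hypothetical alpha carbons along a straight line, 3.8 Å apart.
ca = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
cmap = contact_map(ca)
```

With an 8 Å cutoff, the first and last points (11.4 Å apart) are the only pair not in contact.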
| Command | Does |
|---|---|
| `molscope <file>` (view) | visualise a structure, save a PNG or GIF |
| `molscope report` | one-file structure report (QC, chains/ligands, descriptors, contact map, graph stats, optional CG preview) as HTML/Markdown |
| `molscope compare` | compare two static structures: aligned RMSD, per-residue deviations, contact-map delta, descriptor delta |
| `molscope qc` | lightweight structure-quality report (atoms, chains, ligands, metadata, elements, bonds, altLoc, CIF/PDB warnings) |
| `molscope structure-report` | ML-readiness check for a protein (residue gaps, missing/truncated atoms, chain breaks, net charge) |
| `molscope preflight` | warn about inputs that silently degrade descriptor/graph/CG output (inferred bonds, missing metadata, altLocs, no hydrogens, large dense matrices) |
| `molscope presets` | list the available descriptor / graph / coarse-grain presets and what each one produces |
| `molscope analyze` | batch descriptor table to CSV |
| `molscope binding-site` | ligand binding-site contacts and pocket descriptors |
| `molscope export` | batch graph export to PyG / DGL / NetworkX |
| `molscope prepare` | build ML-ready train/validation/test splits from a table or SDF |
| `molscope coarse-grain` | map a structure to CG beads and write a coordinate file (PDB CONECT bonds) |
| `molscope select` | diverse subset from a CSV/XLSX table |
| `molscope dock-summary` | rank docking poses from an SDF; summary + top-hit tables + score plot |
| `molscope dock-diverse` | diverse shortlist of top hits by Tanimoto clustering |
| `molscope dock-rank` | transparent consensus ranking across scored SDFs |
| `molscope dock-report` | self-contained HTML report + top poses for PyMOL/ChimeraX/Mol* |
```bash
molscope examples/data/1fqy.pdb --select atom_name=CA --color-by residue --save ca.png
molscope report examples/data/3ptb.pdb --out-dir report/ --coarse-grain
molscope compare apo.pdb holo.pdb --atoms ca --out compare.md
molscope qc examples/data/3ptb.pdb
molscope structure-report --fetch 1ubq                       # is this protein ML-ready?
molscope preflight examples/data/1ubq.pdb --workflow graph   # what would degrade my graph?
molscope presets descriptors                                 # discover the --preset / node_features options
molscope analyze examples/data/*.pdb --out results.csv --preset native-3d --jobs 4
molscope export "data/*.cif" --to pyg --out-dir pyg_graphs/ --pe laplacian --jobs 8
molscope prepare data.csv --smiles-col SMILES --split scaffold --out-dir prepared/
molscope coarse-grain examples/data/1fqy.pdb --mapping martini --out cg.pdb
molscope select molecules.csv --smiles-col SMILES --compute-descriptors -n 100 --out picked.csv
molscope dock-summary vina_out.sdf --score-field minimizedAffinity --top 20
molscope dock-diverse vina_out.sdf --top 500 --select 50
```

Pipeline-friendly output. Every `--json` command (`qc`, `structure-report`, `compare`, `preflight`, `presets`) prints the same envelope, so a downstream tool can read one shape regardless of the command. The envelope itself is shown further below.
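Because every `--json` command shares one envelope, a downstream consumer needs only a single reader. The sketch below is a hedged illustration: the key names are the ones in the envelope shown in this README, while `read_envelope`, the required-key check, and the sample payload (empty `warnings` and `result`) are assumptions made for the example.

```python
import json

# Key names as they appear in the README's envelope example.
REQUIRED_KEYS = {
    "tool", "version", "command", "input",
    "parser", "backends", "warnings", "result",
}


def read_envelope(text: str) -> dict:
    """Parse one envelope and fail loudly if a top-level key is missing."""
    envelope = json.loads(text)
    missing = REQUIRED_KEYS - envelope.keys()
    if missing:
        raise ValueError(f"envelope is missing keys: {sorted(missing)}")
    return envelope


# Placeholder payload standing in for real command output.
sample = json.dumps({
    "tool": "molscope", "version": "0.16.0", "command": "qc",
    "input": "3ptb.pdb", "parser": "pdb", "backends": ["scipy"],
    "warnings": [], "result": {},
})
env = read_envelope(sample)
```

A pipeline can then branch on `env["command"]` and treat `env["result"]` as the command-specific payload, without per-command parsing of the outer shape.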
The batch commands (`analyze`, `export`) take `--manifest PATH` to write that envelope alongside the outputs, with `feature_names` (CSV columns / graph node+edge features) and `skipped` (each input that failed, with the reason). This is handy for reproducible runs and for catching dropped structures:

```bash
molscope analyze "data/*.pdb" --out features.csv --manifest run.json
molscope export "data/*.cif" --to pyg -o graphs/ --manifest run.json
```

MolScope ships an optional Model Context Protocol server, so an MCP-capable assistant (Claude Code/Desktop, Codex CLI, Gemini CLI) can drive its analyses in natural language. It exposes the public API as 28 tools (structure analysis, graphs, plots, dataset prep, docking-hit triage) and adds no new science.
```bash
pip install "molscope[mcp]"               # needs Python >= 3.10
claude mcp add molscope -- molscope-mcp   # Claude Code
codex mcp add molscope -- molscope-mcp    # Codex CLI
gemini mcp add molscope molscope-mcp      # Gemini CLI
```

For example: "fetch trypsin (3ptb), find the benzamidine binding-site residues, and render a contact map." See docs/user-guide/mcp-server.md for the full tool reference.
MolScope is explicit about which results are cross-checked against reference tools and which are intentionally lightweight:
| Feature | Status |
|---|---|
| Geometry, RMSD, contact maps | Cross-checked vs MDAnalysis (near machine precision) |
| Bond perception, chemical features | Cross-checked vs RDKit |
| Secondary structure (simplified DSSP) | Cross-checked vs mkdssp: ~98 to 99% 3-state agreement across helical, mixed, and all-beta folds |
| Protein template bonds | Cross-checked vs known per-residue chemistry |
| Native descriptors, molecular graphs | Deterministic; not benchmarked against a curated library |
| Coarse-graining | Mapping and visualisation only; not a validated force-field model |
| Standard protonation | Idealised pH-7 textbook model ("standard") |
| pKa-aware protonation | Environment-aware via PROPKA (proteins) / Dimorphite-DL (SMILES) at a chosen pH |
Methods, tolerances, and failure modes are in docs/validation.md.
The CI validation job runs physical invariants plus these cross-checks on every
push, and publishes a generated validation summary
(a table of what passed or was skipped) to the workflow run page and as a downloadable
artifact, so the scientific boundaries are easy to inspect without reading the logs.
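A "physical invariant" in this sense is a property that must not change under an operation that leaves the molecule physically the same. One common example, offered here as an assumption about the kind of check meant rather than a copy of the actual validation suite, is that pairwise distances survive a rigid rotation plus translation. All function names below are hypothetical.

```python
import math


def pairwise_distances(points):
    """Distances for every unordered pair, in a fixed order."""
    return [math.dist(p, q) for i, p in enumerate(points) for q in points[i + 1:]]


def rotate_z(points, angle):
    """Rotate every point about the z axis by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in points]


def translate(points, shift):
    return [(x + shift[0], y + shift[1], z + shift[2]) for x, y, z in points]


def distances_invariant(points, tol=1e-9):
    """True if a rigid move leaves all pairwise distances unchanged within tol."""
    moved = translate(rotate_z(points, 0.7), (5.0, -3.0, 2.0))
    before, after = pairwise_distances(points), pairwise_distances(moved)
    return all(math.isclose(a, b, abs_tol=tol) for a, b in zip(before, after))


ok = distances_invariant([(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (0.0, 2.0, 1.0)])
```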
MolScope is a lightweight core (NumPy and Matplotlib only) with a broad optional surface. Those optional extras are deliberately of two kinds, and the distinction is what keeps "lightweight" honest:
- Interop / output targets (`networkx`, `pyg`, `dgl`, `viz`, `xlsx`, `gpu`, `mcp`). These let MolScope emit to your framework or run faster. Speaking many formats is the bridge mission, so this surface is expected to grow.
- Method backends (`chem`/RDKit, `cif`/gemmi, `propka`, `dimorphite`, `validation`/MDAnalysis). These delegate a scientific computation to an external tool, and each one inherits that tool's scope, versioning, and correctness surface. This is the surface we keep deliberately small.
A new method backend earns inclusion only when MolScope adds real integration value beyond what you could write in a line or two against the tool directly (for example, mapping RDKit perception back onto MolScope's atom model and residue templates), and when the wrapped tool is maintained. The core does something useful on its own; everything heavier degrades gracefully when its extra is absent. MolScope is not, and is not trying to become, a re-implementation of a cheminformatics or simulation stack: where a dedicated tool is the right answer, it integrates that tool rather than reinventing it, and says so.
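"Degrades gracefully when its extra is absent" usually means a guarded import with a slower pure-Python fallback. The sketch below shows that general pattern for a neighbour search. It is not MolScope's code: `pairs_within` is a hypothetical helper, and only `scipy.spatial.cKDTree.query_pairs` is a real library call.

```python
import math

try:  # optional accelerator; everything below still works without it
    from scipy.spatial import cKDTree
except ImportError:
    cKDTree = None


def pairs_within(points, cutoff):
    """Sorted list of index pairs (i, j), i < j, with distance <= cutoff."""
    if cKDTree is not None:
        # KD-tree path: query_pairs returns a set of (i, j) tuples with i < j.
        found = cKDTree(points).query_pairs(cutoff)
        return sorted((int(i), int(j)) for i, j in found)
    # Fallback: brute-force O(n^2) scan using only the standard library.
    n = len(points)
    return sorted(
        (i, j)
        for i in range(n)
        for j in range(i + 1, n)
        if math.dist(points[i], points[j]) <= cutoff
    )


near = pairs_within([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 0.0, 0.0)], 1.5)
```

Both branches return the same result, so callers never need to know which backend ran. Only the speed differs on large inputs.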
Which formats can it read? .xyz, .pdb, .cif, and .sdf; fetch from the
RCSB with ms.fetch("1fqy"); or build from SMILES with ms.read_smiles(...)
(needs [chem]).
Does it handle MD trajectories? It works on static structures and multi-model
files (NMR ensembles, and ms.stream(...) to iterate large multi-model PDB/XYZ
frame by frame). It has no trajectory engine; for DCD/XTC and friends use
MDAnalysis or MDTraj.
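Frame-by-frame iteration can be pictured as a generator that reads one model, yields it, and only then reads the next, so memory use stays at one frame. The sketch below does this for the multi-frame XYZ layout (atom count, comment line, then one line per atom). It is a hypothetical stand-in for illustration and makes no claim about how `ms.stream(...)` is implemented.

```python
import io


def stream_xyz(handle):
    """Yield (comment, atoms) per frame; atoms is a list of (element, x, y, z)."""
    while True:
        header = handle.readline()
        if not header.strip():
            return  # end of file, or a trailing blank line
        n_atoms = int(header)
        comment = handle.readline().rstrip("\n")
        atoms = []
        for _ in range(n_atoms):
            element, x, y, z = handle.readline().split()[:4]
            atoms.append((element, float(x), float(y), float(z)))
        yield comment, atoms


# Two made-up H2 frames in one file-like object.
text = (
    "2\nframe 0\nH 0.0 0.0 0.0\nH 0.74 0.0 0.0\n"
    "2\nframe 1\nH 0.0 0.0 0.0\nH 0.80 0.0 0.0\n"
)
frames = list(stream_xyz(io.StringIO(text)))  # list() only for the demo
```

In real use you would loop over the generator directly rather than calling `list()`, which defeats the point by loading every frame.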
Is the coarse-graining a real force field? No. It produces CG mappings and bead graphs for inspection and ML prototyping. The OpenMM XML export describes topology only and is not a validated Martini parameter set.
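To show what a mapping (as opposed to a force field) amounts to, here is a sketch of a one-bead-per-residue centre-of-mass reduction. It carries positions only, with no interaction parameters. The function `residue_com`, the tuple layout of `atoms`, and the small mass table are assumptions for illustration, not MolScope's API.

```python
from collections import defaultdict

# Approximate atomic masses for a few common elements (g/mol).
MASS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999, "S": 32.06}


def residue_com(atoms):
    """Map atoms (chain, resseq, element, (x, y, z)) to one bead per residue."""
    sums = defaultdict(lambda: [0.0, 0.0, 0.0, 0.0])  # m*x, m*y, m*z, total mass
    for chain, resseq, element, (x, y, z) in atoms:
        m = MASS[element]
        acc = sums[(chain, resseq)]
        acc[0] += m * x
        acc[1] += m * y
        acc[2] += m * z
        acc[3] += m
    # Divide the mass-weighted sums by total mass to get each bead position.
    return {key: (mx / m, my / m, mz / m) for key, (mx, my, mz, m) in sums.items()}


atoms = [
    ("A", 1, "C", (0.0, 0.0, 0.0)),
    ("A", 1, "C", (2.0, 0.0, 0.0)),
    ("A", 2, "N", (5.0, 1.0, 0.0)),
]
beads = residue_com(atoms)
```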
Do I need RDKit or PyTorch? No. The core runs on NumPy and Matplotlib; those are opt-in extras you install only for the matching workflow.
Will odd PDB files parse? ATOM/HETATM lines are read by fixed columns (not whitespace), so touching, large, or negative coordinate fields read correctly. Alternate conformations default to the primary altLoc.
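The difference between fixed-column and whitespace parsing shows up as soon as coordinate fields touch. The sketch below slices an ATOM record at the column positions given in the PDB format (x in columns 31-38, y in 39-46, z in 47-54). `parse_atom_line` is a hypothetical helper, not MolScope's reader, and the record is built with a format string so the columns are exact.

```python
def parse_atom_line(line: str) -> dict:
    """Read selected ATOM/HETATM fields by column position (0-based slices)."""
    return {
        "name": line[12:16].strip(),    # columns 13-16
        "altloc": line[16].strip(),     # column 17
        "resname": line[17:20].strip(), # columns 18-20
        "chain": line[21],              # column 22
        "resseq": int(line[22:26]),     # columns 23-26
        "xyz": (
            float(line[30:38]),         # columns 31-38
            float(line[38:46]),         # columns 39-46
            float(line[46:54]),         # columns 47-54
        ),
    }


# Made-up record whose coordinate fields run together with no spaces.
line = "ATOM  {:5d} {:<4s}{:1s}{:>3s} {:1s}{:4d}{:1s}   {:8.3f}{:8.3f}{:8.3f}".format(
    1, " CA ", "", "GLY", "A", 42, "", -100.123, -200.456, 1234.567
)
atom = parse_atom_line(line)
```

Splitting the same line on whitespace would yield a single fused token, `-100.123-200.4561234.567`, for all three coordinates, which is why column slicing is the safer choice.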
```bash
uv run pytest                    # full test suite
uv run pytest tests/validation   # validation suite only
uv run ruff check .              # lint
```

CI runs the suite and linting across Python 3.9 / 3.11 / 3.13, smoke-imports the extras, and runs a separate validation job on every push and PR. Notable changes per release are in CHANGELOG.md.
Each release is archived on Zenodo with a citable DOI. The concept DOI
10.5281/zenodo.20433850 always resolves
to the latest version; citation metadata is in CITATION.cff, so
GitHub's "Cite this repository" button produces BibTeX and APA entries.
The envelope printed by every `--json` command (and written by `--manifest`):

```json
{
  "tool": "molscope",
  "version": "0.16.0",
  "command": "qc",
  "input": "3ptb.pdb",      // path/id, or a list for batch commands
  "parser": "pdb",          // reader chosen from the extension
  "backends": ["scipy"],    // optional packages this run engaged
  "warnings": [ ... ],
  "result": { ... }         // the command-specific payload
}
```