MolScope

Turn a molecular structure file into descriptors, contact maps, ML graphs, and coarse-grained bead models, with a small, readable Python API.

Reads .xyz, .pdb, .cif, and .sdf (or fetches from the RCSB by ID). The core depends only on NumPy and Matplotlib; heavier backends (RDKit, PyTorch Geometric, DGL, Gemmi) are opt-in extras. Built for teaching, exploratory analysis, and ML-for-molecules prototyping, not as a replacement for full simulation or cheminformatics stacks.

📖 Full documentation: https://molscope.readthedocs.io

Gallery panels: 3D structure (coloured by element), secondary structure (DSSP), residue contact map, coarse-grained beads.

Install

pip install molscope            # core: NumPy + Matplotlib only

Optional extras, added only when a workflow needs them:

Extra	Adds
`fast`	scipy KD-tree for faster bond/contact search on large structures
`chem`	RDKit chemical perception and descriptors
`cif`	Gemmi mmCIF parsing and validation
`pyg` / `dgl` / `graph` / `gnn`	PyTorch Geometric / DGL / NetworkX graph export
`viz`	py3Dmol interactive viewer
`xlsx`	read/write `.xlsx` molecule tables
`gpu`	Torch dense distance backend
`mcp`	MCP server for AI assistants (Python >= 3.10)

pip install "molscope[chem,cif,pyg]"   # combine as needed

For local development with uv, run uv sync (creates .venv and installs dependencies and dev tools from the lockfile), then uv run pytest.

Quickstart

Given a .pdb (or .xyz / .cif / .sdf), here is what you can pull out:

import molscope as ms

mol = ms.read("protein.pdb")        # or ms.fetch("1fqy") from the RCSB
print(mol.summary())                # atoms, formula, chains, bounding box

ca   = mol.select(chain="A").alpha_carbons()   # metadata selections
cmap = mol.contact_map(cutoff=8.0)             # residue contact map (NumPy)
desc = mol.descriptors()                       # dict of structural descriptors
graph = mol.to_graph()                         # ML-ready graph, no extra deps
data  = mol.to_pyg_data()                      # PyTorch Geometric Data ([pyg])
cg    = mol.coarse_grain("residue_com")        # one bead per residue

Molecule is immutable: translate, centered, and rotate each return a new molecule, so transformations chain cleanly.
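
The immutable-transform pattern can be illustrated with a minimal stand-in class. `Points` below is a hypothetical toy, not MolScope's `Molecule`; only the method names `translate` and `centered` are borrowed from the text, and their signatures here are assumptions.

```python
from dataclasses import dataclass
from typing import Tuple

Vec = Tuple[float, float, float]


@dataclass(frozen=True)
class Points:
    """Toy immutable coordinate container (hypothetical, not MolScope's Molecule)."""

    coords: Tuple[Vec, ...]

    def translate(self, shift: Vec) -> "Points":
        # Build a new object; the original tuple is never modified.
        moved = tuple(
            (x + shift[0], y + shift[1], z + shift[2]) for x, y, z in self.coords
        )
        return Points(moved)

    def centered(self) -> "Points":
        # Shift so that the geometric centre sits at the origin.
        n = len(self.coords)
        cx = sum(p[0] for p in self.coords) / n
        cy = sum(p[1] for p in self.coords) / n
        cz = sum(p[2] for p in self.coords) / n
        return self.translate((-cx, -cy, -cz))


original = Points(((0.0, 0.0, 0.0), (2.0, 0.0, 0.0)))
result = original.translate((1.0, 1.0, 0.0)).centered()  # calls chain
```

Because every method returns a fresh object, `original` is still usable after the chain, which is what makes intermediate results safe to share between analyses.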

From a folder of structures to a trained GNN

build_dataset collapses read → featurise → label-join → split into one call, and GraphDataset carries it the last mile to a training loop:

ds = ms.build_dataset(
    "data/*.pdb",                 # glob, list of paths/Molecules, or fetch_dataset(ids) from RCSB
    fmt="pyg",                    # "pyg" | "dgl" | "networkx" | "raw"
    labels="labels.csv",          # joined to each graph by file stem
    split=(0.8, 0.1, 0.1),
    cache_dir=".graph_cache",     # on-disk featurisation cache; reruns reuse it
)
scaler = ds.standardize_targets() # fit on train only; no val/test leakage
for batch in ds.loader("train", batch_size=32):   # batching PyG/DGL DataLoader
    ...                           # scaler.inverse_transform(pred) -> physical units

This graph-ML on-ramp adds no core dependency (fmt="pyg"/"dgl" need their extras). See Molecular graphs and the runnable PDB to a trained GNN example.
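
The "fit on train only" comment refers to a general leakage-avoidance pattern that can be sketched without any ML framework. `TargetScaler` is a hypothetical name used for illustration; the object actually returned by `standardize_targets()` may behave differently.

```python
from statistics import fmean, pstdev


class TargetScaler:
    """Hypothetical z-score scaler; statistics come from the training split only."""

    def __init__(self, train_targets):
        self.mean = fmean(train_targets)
        # Guard against a zero spread so transform never divides by zero.
        self.std = pstdev(train_targets) or 1.0

    def transform(self, values):
        return [(v - self.mean) / self.std for v in values]

    def inverse_transform(self, values):
        # Map model outputs back to physical units.
        return [v * self.std + self.mean for v in values]


train = [1.0, 2.0, 3.0, 4.0, 5.0]
test = [10.0]  # never seen when fitting

scaler = TargetScaler(train)
scaled_test = scaler.transform(test)
restored = scaler.inverse_transform(scaled_test)
```

The test value is scaled with the training mean and spread, so nothing about the held-out split influences the statistics.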

Choose your workflow

Not sure what to run first? Match your goal to a starting point. Most rows start from a CLI command (many with a one-line Python equivalent); the full list is in the Command line table below.

Your goal	Start here	Next step
Look at a structure	`molscope file.pdb` — render it, save a PNG/GIF	`molscope report file.pdb` for a one-file overview
Sanity-check a file	`molscope qc file.pdb` — did it parse cleanly? (any format)	`molscope structure-report file.pdb` — is this protein ML-ready?
Compare two structures	`molscope compare a.pdb b.pdb` — RMSD, per-residue and contact/descriptor deltas	add `--out diff.md` for a written report
Featurise to a table	`molscope presets descriptors` — see what's available	`molscope analyze "*.pdb" --out features.csv`
Build an ML graph dataset	`ms.build_dataset(...)` (API) or `molscope export … --to pyg` (CLI)	`molscope prepare table.csv` for train/val/test splits
Coarse-grain a structure	`molscope presets coarse-grain`	`molscope coarse-grain file.pdb --mapping martini`
Triage docking output	`molscope dock-report poses.sdf` — HTML report + top poses	`dock-summary` · `dock-diverse` · `dock-rank`
Drive it from an AI assistant	`pip install "molscope[mcp]"`, then add the MCP server	ask for analyses in natural language

On any unfamiliar file, molscope report file.pdb gives the broadest single-command overview, and molscope presets lists every descriptor, graph, and coarse-grain option. Before a featurisation run, molscope preflight file.pdb --workflow graph warns about inputs that would silently degrade the result: inferred bonds, missing metadata, no hydrogens, huge dense matrices. The same check runs if you pass --preflight to analyze / export / coarse-grain, or preflight=True to mol.to_graph() / descriptors() / coarse_grain().
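
To see why dense matrices are on that warning list, note that an all-pairs matrix grows quadratically with atom count. The helper below is a back-of-the-envelope sketch assuming one 8-byte float64 per entry; `dense_matrix_bytes` is a hypothetical name, and the thresholds MolScope's preflight actually uses are not stated in this README.

```python
def dense_matrix_bytes(n_atoms: int, bytes_per_entry: int = 8) -> int:
    """Memory for a full n x n matrix (float64 by default)."""
    return n_atoms * n_atoms * bytes_per_entry


for n in (1_000, 10_000, 100_000):
    gib = dense_matrix_bytes(n) / 2**30
    print(f"{n:>7} atoms -> {gib:8.2f} GiB")
```

Under these assumptions a 10,000-atom structure needs roughly 0.75 GiB and a 100,000-atom structure roughly 75 GiB, which is why a neighbour search is usually preferable at that scale.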

What you can do

Capability	Guide
Read/write XYZ, PDB, mmCIF, SDF; fetch from RCSB; build from SMILES	Reading files
Stream large multi-model files frame by frame	Reading files
Select atoms by metadata (chain, residue, name, ...)	Selections
Geometry, RMSD, distances, angles, torsions	Geometry and measurements
Contact maps and distance matrices	Contact maps
DSSP secondary structure, torsions, interfaces, binding sites	Protein analysis
Native and RDKit-backed descriptors	Structural descriptors
Preflight guardrails before featurising (inferred bonds, missing metadata, dense-matrix sizes)	Structure QC
Chemical perception, protein template bonds, bond-order inference	Chemical perception
Atom/bond and residue-contact graphs for ML (with positional encodings)	Molecular graphs
Assemble a split, labelled graph dataset (cache, loaders, target scaling, RCSB fetch)	Molecular graphs
Coarse-grained bead mappings (residue, Martini-style, custom)	Coarse-graining
NMR ensembles and clustering	Ensemble analysis
Plotting and py3Dmol viewing	Plotting and viewing
Diverse subset selection from a CSV/XLSX table	Diverse selection

Task-oriented tutorials: PDB to descriptors, PDB to graph/GNN, and PDB to coarse-grained beads. A runnable tour over the bundled samples lives in examples/tour.py.
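
As one illustration of the capabilities above, a residue contact map reduces to thresholding pairwise distances between one representative point per residue. This is a dependency-free sketch under that assumption (for example, C-alpha positions); `contact_map` here is a standalone hypothetical function, and MolScope's own `mol.contact_map` may define contacts differently.

```python
from math import dist


def contact_map(points, cutoff=8.0):
    """Boolean matrix: True where two residue points lie within `cutoff` (angstroms)."""
    n = len(points)
    return [[dist(points[i], points[j]) <= cutoff for j in range(n)] for i in range(n)]


# Three made-up residue positions spaced along the x axis.
ca_positions = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (12.0, 0.0, 0.0)]
cmap = contact_map(ca_positions, cutoff=8.0)
```

The result is symmetric with a True diagonal; here only the first two residues are in contact with each other.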

Command line

Command	Does
`molscope <file>` (view)	visualise a structure, save a PNG or GIF
`molscope report`	one-file structure report (QC, chains/ligands, descriptors, contact map, graph stats, optional CG preview) as HTML/Markdown
`molscope compare`	compare two static structures: aligned RMSD, per-residue deviations, contact-map delta, descriptor delta
`molscope qc`	lightweight structure-quality report (atoms, chains, ligands, metadata, elements, bonds, altLoc, CIF/PDB warnings)
`molscope structure-report`	ML-readiness check for a protein (residue gaps, missing/truncated atoms, chain breaks, net charge)
`molscope preflight`	warn about inputs that silently degrade descriptor/graph/CG output (inferred bonds, missing metadata, altLocs, no hydrogens, large dense matrices)
`molscope presets`	list the available descriptor / graph / coarse-grain presets and what each one produces
`molscope analyze`	batch descriptor table to CSV
`molscope binding-site`	ligand binding-site contacts and pocket descriptors
`molscope export`	batch graph export to PyG / DGL / NetworkX
`molscope prepare`	build ML-ready train/validation/test splits from a table or SDF
`molscope coarse-grain`	map a structure to CG beads and write a coordinate file (PDB CONECT bonds)
`molscope select`	diverse subset from a CSV/XLSX table
`molscope dock-summary`	rank docking poses from an SDF; summary + top-hit tables + score plot
`molscope dock-diverse`	diverse shortlist of top hits by Tanimoto clustering
`molscope dock-rank`	transparent consensus ranking across scored SDFs
`molscope dock-report`	self-contained HTML report + top poses for PyMOL/ChimeraX/Mol*

molscope examples/data/1fqy.pdb --select atom_name=CA --color-by residue --save ca.png
molscope report examples/data/3ptb.pdb --out-dir report/ --coarse-grain
molscope compare apo.pdb holo.pdb --atoms ca --out compare.md
molscope qc examples/data/3ptb.pdb
molscope structure-report --fetch 1ubq      # is this protein ML-ready?
molscope preflight examples/data/1ubq.pdb --workflow graph   # what would degrade my graph?
molscope presets descriptors          # discover the --preset / node_features options
molscope analyze examples/data/*.pdb --out results.csv --preset native-3d --jobs 4
molscope export "data/*.cif" --to pyg --out-dir pyg_graphs/ --pe laplacian --jobs 8
molscope prepare data.csv --smiles-col SMILES --split scaffold --out-dir prepared/
molscope coarse-grain examples/data/1fqy.pdb --mapping martini --out cg.pdb
molscope select molecules.csv --smiles-col SMILES --compute-descriptors -n 100 --out picked.csv
molscope dock-summary vina_out.sdf --score-field minimizedAffinity --top 20
molscope dock-diverse vina_out.sdf --top 500 --select 50
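
For intuition about the dock-diverse row in the table above, Tanimoto similarity compares two fingerprints by the overlap of their on-bits. The sketch below pairs it with greedy leader-style picking, which is a simple stand-in and may not be the clustering MolScope uses; the function names, the 0.6 threshold, and the fingerprints are all made up for illustration.

```python
def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto (Jaccard) similarity between two sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def pick_diverse(fingerprints, threshold=0.6):
    """Greedy leader picking: keep an item only if it is dissimilar to all kept ones.

    `fingerprints` is assumed to be ordered best score first.
    """
    kept = []
    for idx, fp in enumerate(fingerprints):
        if all(tanimoto(fp, fingerprints[k]) < threshold for k in kept):
            kept.append(idx)
    return kept


# Made-up fingerprints as sets of bit indices, best-scoring pose first.
fps = [
    frozenset({1, 2, 3, 4}),
    frozenset({1, 2, 3, 4, 5}),  # close analogue of the first
    frozenset({7, 8, 9}),        # different chemotype
]
chosen = pick_diverse(fps)
```

The close analogue is skipped because it is too similar to an already kept, better-scoring pose, so the shortlist keeps one representative per chemotype.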

Pipeline-friendly output. Every --json command (qc, structure-report, compare, preflight, presets) prints the same envelope, so a downstream tool can read one shape regardless of the command:

{
  "tool": "molscope", "version": "0.16.0", "command": "qc",
  "input": "3ptb.pdb",        // path/id, or a list for batch commands
  "parser": "pdb",            // reader chosen from the extension
  "backends": ["scipy"],      // optional packages this run engaged
  "warnings": [ ... ],
  "result": { ... }           // the command-specific payload
}
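
Because the envelope has one shape, a consumer can be written once. The snippet below is a sketch that parses a hand-written sample rather than real MolScope output: the keys come from the listing above, while the contents of `warnings` and `result` are invented placeholders and `load_envelope` is a hypothetical helper. The `//` annotations in the listing are explanatory and would not parse as strict JSON, so the sample omits them.

```python
import json

# Hand-written sample in the documented shape (values are placeholders).
raw = """
{
  "tool": "molscope", "version": "0.16.0", "command": "qc",
  "input": "3ptb.pdb",
  "parser": "pdb",
  "backends": ["scipy"],
  "warnings": ["example warning"],
  "result": {"placeholder": true}
}
"""

REQUIRED_KEYS = {"tool", "version", "command", "input", "parser",
                 "backends", "warnings", "result"}


def load_envelope(text):
    """Parse an envelope and fail early if a documented key is missing."""
    envelope = json.loads(text)
    missing = REQUIRED_KEYS - envelope.keys()
    if missing:
        raise ValueError(f"envelope is missing keys: {sorted(missing)}")
    return envelope


env = load_envelope(raw)
# "input" may be a single path/id or a list for batch commands.
inputs = env["input"] if isinstance(env["input"], list) else [env["input"]]
```

Normalising `input` to a list up front lets the same downstream loop handle single-file and batch commands.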

The batch commands (analyze, export) accept --manifest PATH, which writes that envelope alongside the outputs and adds feature_names (CSV columns / graph node+edge features) and skipped (each input that failed, with the reason). This is handy for reproducible runs and for catching dropped structures:

molscope analyze "data/*.pdb" --out features.csv --manifest run.json
molscope export "data/*.cif" --to pyg -o graphs/ --manifest run.json

Use from an AI assistant (MCP)

MolScope ships an optional Model Context Protocol server, so an MCP-capable assistant (Claude Code/Desktop, Codex CLI, Gemini CLI) can drive its analyses in natural language. It exposes the public API as 28 tools (structure analysis, graphs, plots, dataset prep, docking-hit triage) and adds no new science.

pip install "molscope[mcp]"              # needs Python >= 3.10
claude mcp add molscope -- molscope-mcp  # Claude Code
codex mcp add molscope -- molscope-mcp   # Codex CLI
gemini mcp add molscope molscope-mcp     # Gemini CLI

For example: "fetch trypsin (3ptb), find the benzamidine binding-site residues, and render a contact map." See docs/user-guide/mcp-server.md for the full tool reference.

Scientific validation

MolScope is explicit about which results are cross-checked against reference tools and which are intentionally lightweight:

Feature	Status
Geometry, RMSD, contact maps	Cross-checked vs MDAnalysis (near machine precision)
Bond perception, chemical features	Cross-checked vs RDKit
Secondary structure (simplified DSSP)	Cross-checked vs `mkdssp`: ~98 to 99% 3-state agreement across helical, mixed, and all-beta folds
Protein template bonds	Cross-checked vs known per-residue chemistry
Native descriptors, molecular graphs	Deterministic; not benchmarked against a curated library
Coarse-graining	Mapping and visualisation only; not a validated force-field model
Standard protonation	Idealised pH-7 textbook model (`"standard"`)
pKa-aware protonation	Environment-aware via PROPKA (proteins) / Dimorphite-DL (SMILES) at a chosen pH

Methods, tolerances, and failure modes are in docs/validation.md. The CI validation job runs physical invariants plus these cross-checks on every push, and publishes a generated validation summary (a table of what passed or was skipped) to the workflow run page and as a downloadable artifact, so the scientific boundaries are easy to inspect without reading the logs.
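
The "3-state agreement" figure for secondary structure can be read as a per-residue match rate after collapsing DSSP's finer classes. The sketch below assumes one common reduction (H, G, I to helix; E, B to strand; everything else to coil); the exact reduction and residue alignment behind the numbers in the table are not given in this README, and the two strings are made-up inputs.

```python
# One common 8-to-3 state reduction (an assumption, see the paragraph above).
TO_THREE = {"H": "H", "G": "H", "I": "H", "E": "E", "B": "E"}


def three_state(ss: str) -> str:
    """Collapse a DSSP string to H (helix), E (strand), C (coil)."""
    return "".join(TO_THREE.get(code, "C") for code in ss)


def agreement(predicted: str, reference: str) -> float:
    """Fraction of residues whose 3-state labels match."""
    if len(predicted) != len(reference):
        raise ValueError("sequences must be aligned to the same length")
    a, b = three_state(predicted), three_state(reference)
    return sum(x == y for x, y in zip(a, b)) / len(a)


score = agreement("HHHHCCEEEE", "HHHGTSEEE-")
```

Here nine of ten residues agree after reduction; the single mismatch is the final residue, which the made-up reference leaves unassigned.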

Scope and design philosophy

MolScope is a lightweight core (NumPy and Matplotlib only) with a broad optional surface. Those optional extras are deliberately of two kinds, and the distinction is what keeps "lightweight" honest:

Interop / output targets (networkx, pyg, dgl, viz, xlsx, gpu, mcp). These let MolScope emit to your framework or run faster. Speaking many formats is the bridge mission, so this surface is expected to grow.
Method backends (chem/RDKit, cif/gemmi, propka, dimorphite, validation/MDAnalysis). These delegate a scientific computation to an external tool, and each one inherits that tool's scope, versioning, and correctness surface. This is the surface we keep deliberately small.

A new method backend earns inclusion only when MolScope adds real integration value beyond what you could write in a line or two against the tool directly (for example, mapping RDKit perception back onto MolScope's atom model and residue templates), and when the wrapped tool is maintained. The core does something useful on its own; everything heavier degrades gracefully when its extra is absent. MolScope is not, and is not trying to become, a re-implementation of a cheminformatics or simulation stack: where a dedicated tool is the right answer, it integrates that tool rather than reinventing it, and says so.

FAQ

Which formats can it read? .xyz, .pdb, .cif, and .sdf; fetch from the RCSB with ms.fetch("1fqy"); or build from SMILES with ms.read_smiles(...) (needs [chem]).

Does it handle MD trajectories? It works on static structures and multi-model files: NMR ensembles are supported, and ms.stream(...) iterates large multi-model PDB/XYZ files frame by frame. It has no trajectory engine; for DCD/XTC and friends use MDAnalysis or MDTraj.

Is the coarse-graining a real force field? No. It produces CG mappings and bead graphs for inspection and ML prototyping. The OpenMM XML export describes topology only and is not a validated Martini parameter set.

Do I need RDKit or PyTorch? No. The core runs on NumPy and Matplotlib; those are opt-in extras you install only for the matching workflow.

Will odd PDB files parse? ATOM/HETATM lines are read by fixed columns (not whitespace), so touching, large, or negative coordinate fields read correctly. Alternate conformations default to the primary altLoc.
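
The difference between fixed-column and whitespace parsing is easiest to see on a record whose coordinate fields touch. The sketch below uses the standard PDB layout (x, y, z in columns 31-38, 39-46, 47-54); `parse_xyz` is a hypothetical helper for illustration, not MolScope's reader, and the record is synthetic.

```python
def parse_xyz(line: str):
    """Read x, y, z from an ATOM/HETATM record by fixed columns (31-54)."""
    # PDB columns are 1-indexed; Python slices are 0-indexed and end-exclusive.
    return float(line[30:38]), float(line[38:46]), float(line[46:54])


# Assemble a record field by field so every column lands where the format expects.
record = (
    "ATOM  "        # cols 1-6   record name
    f"{1:>5} "      # cols 7-12  serial + blank
    " CA "          # cols 13-16 atom name
    " ALA A"        # cols 17-22 altLoc, residue name, blank, chain
    f"{1:>4}    "   # cols 23-30 residue number, insertion code, blanks
    f"{-100.123:8.3f}{-200.456:8.3f}{1234.567:8.3f}"  # cols 31-54, no gaps
)

xyz = parse_xyz(record)
naive_tokens = record.split()  # whitespace splitting fuses the three numbers
```

Slicing recovers all three coordinates, while the whitespace split leaves them fused in a single final token that cannot be converted to a float.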

Development and citation

uv run pytest                  # full test suite
uv run pytest tests/validation # validation suite only
uv run ruff check .            # lint

CI runs the suite and linting across Python 3.9 / 3.11 / 3.13, smoke-imports the extras, and runs a separate validation job on every push and PR. Notable changes per release are in CHANGELOG.md.

Each release is archived on Zenodo with a citable DOI. The concept DOI 10.5281/zenodo.20433850 always resolves to the latest version; citation metadata is in CITATION.cff, so GitHub's "Cite this repository" button produces BibTeX and APA entries.

License

MIT
