SECTOR: Structural Entropy-based Learning of Spatiotemporal Organisation in Spatial Transcriptomics

SECTOR (Structural Entropy-based Clustering and pseudoTime ORdering) is a lightweight deep graph learning framework for spatial transcriptomics (ST). SECTOR jointly infers spatial domains and a continuous within-section pseudotime from the same model. It builds a fused spatial–expression graph from coordinates and gene expression of spatial locations (spots/cells/bins), then optimises a differentiable structural entropy objective regularised by spatial total variation (TV) to obtain spatially coherent domains and smooth pseudotime fields.

SECTOR has been evaluated across seven ST datasets grouped into three data regimes:

Sequencing-based ST: 10x Visium DLPFC and Stereo-seq mouse embryo.
Imaging-based ST: MERFISH hypothalamus, STARmap cortex and BaristaSeq primary cortex.
Large-scale high-resolution ST: Visium HD colorectal cancer (CRC) and Xenium breast infiltrating ductal carcinoma (IDC).

The recommended way to run SECTOR is through the Python API, as demonstrated in the tutorial notebooks. A command-line interface (CLI) is also provided as a secondary convenience.

1. Clone the repository

git clone https://github.com/LHBCB/SECTOR.git
cd SECTOR

2. Installation

SECTOR has been developed and evaluated with Python 3.12, PyTorch 2.7.1 with CUDA 12.6, and torch_geometric 2.7.0. We recommend using a dedicated conda environment.

2.1 Create a conda environment

conda create -n sector_env python=3.12
conda activate sector_env

2.2 Install PyTorch

CUDA build, recommended:

pip install torch==2.7.1 --index-url https://download.pytorch.org/whl/cu126

For CPU-only installation or a different CUDA version, follow the official PyTorch installation instructions for your system.

2.3 Install SECTOR dependencies

From the repository root:

pip install -r requirements.txt

This installs, among others:

Core scientific libraries: numpy, scipy, pandas, scikit-learn, matplotlib, scikit-misc
ST / AnnData ecosystem: anndata, scanpy, h5py
GNN stack (on top of installed PyTorch): torch-geometric
Notebook support: notebook, ipykernel

3. Tutorial

3.1 Detailed tutorials

Three group-level tutorial notebooks are provided in this repository:

Tutorial	Datasets covered	Main content
`tutorial_sequencing_based_ST.ipynb`	10x Visium DLPFC; Stereo-seq mouse embryo	Sequencing-based ST workflow and parameter settings.
`tutorial_imaging_based_ST.ipynb`	MERFISH hypothalamus; STARmap cortex; BaristaSeq primary cortex	Imaging-based ST workflow and parameter settings.
`tutorial_large_scale_high_resolution_ST.ipynb`	Visium HD CRC; Xenium IDC	Large-scale high-resolution ST workflow and parameter settings.

Each notebook demonstrates:

expected .h5ad input format;
model initialisation through the Python API;
key parameter settings for the corresponding dataset group;
model fitting with fit();
domain and pseudotime inference with pred();
visualisation of spatial domains and pseudotime;
metric reporting when annotations are available;
practical tuning and troubleshooting guidance.

It is recommended to start from the tutorial notebook that best matches the technology and scale of the target dataset.

3.2 Expected input format

SECTOR expects each spatial section to be stored as an .h5ad / AnnData object. By default, both the Python API and CLI look for files at:

{dataset_path}/{dataset}/{slice}.h5ad

This path structure is convenient when a project contains multiple datasets or multiple slices per dataset. For example:

./data/10x_visium/151673.h5ad

For a custom dataset, create the same structure, for example:

./data/my_dataset/my_slice.h5ad

and set:

dataset_path = "./data"
dataset = "my_dataset"
slice = "my_slice"
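Before launching a run, it can help to confirm that the three settings resolve to an existing file. The following is a minimal sketch that assumes only the path template stated above; `resolve_slice_path` is a hypothetical helper and not part of the SECTOR API.

```python
from pathlib import Path


def resolve_slice_path(dataset_path: str, dataset: str, slice_name: str) -> Path:
    """Build {dataset_path}/{dataset}/{slice}.h5ad (hypothetical helper)."""
    return Path(dataset_path) / dataset / f"{slice_name}.h5ad"


path = resolve_slice_path("./data", "my_dataset", "my_slice")
print(path.as_posix())   # data/my_dataset/my_slice.h5ad
print(path.is_file())    # True only if the file exists on disk
```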

The input .h5ad file should contain:

Field	Requirement
`adata.X`	Expression matrix with observations as rows and genes/features as columns. Raw or count-like expression values are recommended. Sparse matrices are supported.
`adata.obsm["spatial"]`	Spatial coordinates with shape `n_obs × 2`.
`adata.obs[label]`	Optional annotation column for evaluation only. The default label key is `Region`.
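
The requirements in the table can be checked programmatically before fitting. This is a sketch only: `check_sector_input` is a hypothetical helper, not a SECTOR function, and it assumes an AnnData-like object exposing `X`, `obsm` and `obs`. The annotation column is checked only when evaluation is requested.

```python
def check_sector_input(adata, label="Region", eval_mode=1):
    """Return a list of problems found in an AnnData-like object (hypothetical helper)."""
    problems = []
    n_obs = adata.X.shape[0]
    if "spatial" not in adata.obsm:
        problems.append('missing adata.obsm["spatial"]')
    else:
        shape = tuple(adata.obsm["spatial"].shape)
        if shape != (n_obs, 2):
            problems.append(
                f'adata.obsm["spatial"] has shape {shape}, expected ({n_obs}, 2)'
            )
    # The annotation column is needed only for metric calculation.
    if eval_mode == 1 and label not in adata.obs:
        problems.append(f'eval_mode=1 but adata.obs["{label}"] is missing')
    return problems
```

For an object loaded with `anndata.read_h5ad`, `check_sector_input(adata)` returns an empty list when all three requirements are met.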

If no annotation is available, run SECTOR with eval_mode=0. When eval_mode=1, SECTOR computes clustering metrics such as normalised mutual information (NMI), homogeneity (HOM) and completeness (COM) using the label column specified by label. Labels are not used during model training.
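
For readers unfamiliar with these metrics, the sketch below computes them from their standard entropy-based definitions in plain Python. It is illustrative only: how SECTOR computes them internally is not stated here, and the arithmetic-mean normalisation used for NMI is an assumption.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (in nats) of a label sequence."""
    n = len(labels)
    return -sum(c / n * log(c / n) for c in Counter(labels).values())


def clustering_metrics(true_labels, pred_labels):
    """Return (NMI, HOM, COM) from the standard entropy-based definitions."""
    n = len(true_labels)
    h_true, h_pred = entropy(true_labels), entropy(pred_labels)
    joint = Counter(zip(true_labels, pred_labels))
    true_counts, pred_counts = Counter(true_labels), Counter(pred_labels)
    # Mutual information between annotation and prediction.
    mi = sum(
        c / n * log(c * n / (true_counts[t] * pred_counts[p]))
        for (t, p), c in joint.items()
    )
    hom = 1.0 if h_true == 0 else mi / h_true   # each domain holds one class
    com = 1.0 if h_pred == 0 else mi / h_pred   # each class sits in one domain
    nmi = 1.0 if h_true + h_pred == 0 else 2 * mi / (h_true + h_pred)
    return nmi, hom, com


# Two annotated regions merged into a single predicted domain:
print(clustering_metrics(["L1", "L1", "L2", "L2"], [0, 0, 0, 0]))
```

In the merged case, completeness stays at 1 while homogeneity and NMI drop to 0, which is the signature of an oversmoothed result.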

3.3 Basic usage example: 10x Visium DLPFC

from sector import SECTOR

m = SECTOR(
    # input slice path
    dataset_path='./data',       # Root directory for ST datasets.
    dataset='10x_visium',        # Dataset folder name.
    slice='151673',              # DLPFC slice file name without the .h5ad suffix.

    # evaluation and output
    num_clusters=7,              # Expected number of spatial domains.
    eval_mode=1,                 # Evaluation mode; annotations are used only for metric calculation.
    label='Region',              # Ground-truth annotation column in adata.obs, required when eval_mode=1.

    # key graph and feature parameters
    n_comps=20,                  # Number of PCA components.
    n_top_genes=2000,            # Number of HVGs.
    k=1,                         # Feature-graph neighbours (k_feat in the manuscript).
    k_s=6,                       # Spatial-graph neighbours.
)

m.fit(
    lambda_tv=2.0,              # Spatial TV regularisation coefficient.
    lr=0.001,                   # Learning rate.
    stability_nmi_thr=0.97,     # Label-free early-stopping threshold based on assignment stability.
    balance_probe_epochs=20,    # Number of balance-probe epochs.
    gamma_balance=1.0,          # Balance regularisation weight, used only if the probe detects cluster under-use.
)

adata = m.pred(
    spatial_anchor='south',     # Pseudotime orientation; root_cluster can be used instead after inspecting domains.
    plot=True,                  # If True, spatial domains and pseudotime are plotted and saved.
    island_min_frac=0.1,        # Relative minimum component size for post hoc island cleaning.
    island_min_abs=40,          # Absolute minimum component size for post hoc island cleaning.
)

3.4 Outputs

By default, SECTOR saves outputs to:

./sector_model/{dataset}_{slice}_K{num_clusters}.pt
./output/{dataset}.{slice}.sector.h5ad
./figures/{dataset}.{slice}.clusters.png
./figures/{dataset}.{slice}.pseudotime.png

The output .h5ad file stores:

Output	Location
Predicted spatial domains	`adata.obs["pred_region"]`
Inferred pseudotime	`adata.obs["pseudotime"]`
SECTOR embedding	`adata.obsm["sector_embedding"]`
Metrics, if `eval_mode=1`	`adata.uns["SECTOR"]["final_metrics"]`
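
When scripting downstream steps, the default output locations can be derived from the run settings. The sketch below simply mirrors the path templates listed above; `default_output_paths` is a hypothetical helper and does not read SECTOR's own configuration.

```python
from pathlib import Path


def default_output_paths(dataset: str, slice_name: str, num_clusters: int) -> dict:
    """Default SECTOR output locations, following the documented templates."""
    stem = f"{dataset}.{slice_name}"
    return {
        "model": Path("sector_model") / f"{dataset}_{slice_name}_K{num_clusters}.pt",
        "adata": Path("output") / f"{stem}.sector.h5ad",
        "clusters_png": Path("figures") / f"{stem}.clusters.png",
        "pseudotime_png": Path("figures") / f"{stem}.pseudotime.png",
    }


for name, path in default_output_paths("10x_visium", "151673", 7).items():
    print(f"{name:15s} {path.as_posix()}")
```

The saved `.h5ad` file can then be loaded with `anndata.read_h5ad` and the fields in the table read directly, for example `adata.obs["pred_region"]`.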

3.5 Optional: running SECTOR from the command-line interface

A CLI is available for users who prefer a one-command workflow. It calls the same SECTOR fitting and prediction logic as the Python API. For exploratory analysis and custom datasets, we recommend the Python API and notebooks over the CLI.

Example: 10x Visium DLPFC

python run_sector.py \
    --dataset_path ./data \
    --dataset 10x_visium \
    --slice 151673 \
    --num_clusters 7 \
    --lambda_tv 2.0 \
    --eval_mode 1 \
    --plot True \
    --island_min_frac 0.1 \
    --island_min_abs 40
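
To run the CLI over several slices from a script, the argument list can be assembled in Python. This is a sketch under the assumption that every option is passed as `--name value`, as in the example above; `build_sector_command` is a hypothetical helper, and only flags shown in this README are used.

```python
import shlex
import sys


def build_sector_command(**options):
    """Turn keyword options into a run_sector.py argument list (hypothetical helper)."""
    cmd = [sys.executable, "run_sector.py"]
    for flag, value in options.items():
        cmd += [f"--{flag}", str(value)]
    return cmd


cmd = build_sector_command(
    dataset_path="./data", dataset="10x_visium", slice="151673",
    num_clusters=7, lambda_tv=2.0, eval_mode=1, plot=True,
    island_min_frac=0.1, island_min_abs=40,
)
print(shlex.join(cmd))
# To launch it from the repository root: subprocess.run(cmd, check=True)
```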

4. Key parameters and practical tuning guidance

The tables below summarise the most important user-facing parameters. Defaults are sensible starting points, but some datasets may require limited tuning.

4.1 Data and evaluation

Parameter	Default	Controls	Practical guidance
`dataset_path`	`./data`	Root folder for datasets.	Use with `dataset` and `slice` to locate `{dataset_path}/{dataset}/{slice}.h5ad`.
`dataset`	`DLPFC`	Dataset folder name.	For custom data, use the folder name under `dataset_path`.
`slice`	`151673`	Slice/file name without `.h5ad`.	For custom data, use the `.h5ad` file stem.
`label`	`Region`	Annotation column in `adata.obs`.	Required only when `eval_mode=1`.
`eval_mode`	`1`	Whether to compute label-based metrics.	Use `0` for unannotated datasets.

4.2 Graph construction and representation

Parameter	Default	Controls	Practical guidance
`num_clusters`	`7`	Expected number of spatial domains.	Set based on annotations, known anatomy, exploratory runs or the biological resolution of interest. Persistent under-use may indicate that this value is too large.
`lambda_tv`	`2.0`	Strength of spatial TV regularisation.	Increase for fragmented/noisy domains; decrease for oversmoothed domains or narrow adjacent regions.
`k_s`	`6`	Number of neighbours in the spatial graph.	Larger values increase spatial continuity; smaller values preserve fine boundaries.
`k`	`1`	Number of neighbours in the feature graph.	Larger values increase feature-graph connectivity but may over-aggregate weak signals.
`n_top_genes`	`2000`	Number of HVGs used before PCA.	Important for large-panel or whole-transcriptome datasets. Targeted-panel datasets often retain most or all informative genes.
`n_comps`	`20`	Number of PCA dimensions.	Moderate values usually work well. Increase if feature variation is not captured; reduce to lower memory cost.
`use_svg`	`False`	Use spatially variable genes instead of HVGs.	Experimental option. HVG-based feature construction is the default.
`beta_f`	`0.5`	Weight of feature graph in the fused graph.	Higher values emphasise expression similarity; lower values emphasise spatial adjacency.
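
To build intuition for `lambda_tv`, the sketch below evaluates a generic graph total variation: the sum of L1 differences between the cluster assignments of neighbouring spots. It is an illustration of the general idea, not SECTOR's implementation; the exact form and edge weighting used by SECTOR may differ.

```python
def total_variation(assignments, edges):
    """Sum of L1 differences between assignments of neighbouring spots."""
    return sum(
        sum(abs(a - b) for a, b in zip(assignments[i], assignments[j]))
        for i, j in edges
    )


# Four spots on a line, each linked to its immediate neighbour.
edges = [(0, 1), (1, 2), (2, 3)]
coherent = [[1, 0], [1, 0], [0, 1], [0, 1]]     # one boundary
fragmented = [[1, 0], [0, 1], [1, 0], [0, 1]]   # alternating labels

print(total_variation(coherent, edges))    # 2
print(total_variation(fragmented, edges))  # 6
```

A larger `lambda_tv` penalises the fragmented pattern more heavily, which is why increasing it tends to merge isolated islands and decreasing it preserves narrow regions.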

4.3 Optimisation and stability

Parameter	Default	Controls	Practical guidance
`lr`	`1e-3`	Learning rate.	Reduce for unstable training, especially on large or heterogeneous datasets.
`epochs`	`1000`	Maximum number of training epochs.	Increase if convergence is slow.
`tv_warmup_epochs`	`100`	Warm-up period for TV regularisation.	Helps avoid imposing spatial smoothing too early.
`unsup_patience_checks`	`6`	Label-free early stopping patience.	Increase for noisy or large datasets.
`stability_nmi_thr`	automatic	Stability threshold between consecutive assignments.	If omitted, SECTOR adapts this threshold by dataset size.
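
One plausible reading of how `stability_nmi_thr` and `unsup_patience_checks` interact is sketched below: training stops once the NMI between consecutive assignments has stayed at or above the threshold for the required number of checks. This interplay is an assumption made for illustration, and `StabilityStopper` is a hypothetical class, not part of SECTOR.

```python
class StabilityStopper:
    """Stop once consecutive-assignment NMI stays above a threshold (illustrative)."""

    def __init__(self, nmi_thr: float = 0.97, patience_checks: int = 6):
        self.nmi_thr = nmi_thr
        self.patience_checks = patience_checks
        self.stable_checks = 0

    def update(self, nmi_with_previous: float) -> bool:
        """Record one stability check; return True when training should stop."""
        if nmi_with_previous >= self.nmi_thr:
            self.stable_checks += 1
        else:
            self.stable_checks = 0   # any unstable check resets the count
        return self.stable_checks >= self.patience_checks


stopper = StabilityStopper(nmi_thr=0.97, patience_checks=3)
history = [0.80, 0.98, 0.95, 0.98, 0.99, 0.99]
for check, nmi in enumerate(history, start=1):
    if stopper.update(nmi):
        print(f"stop at check {check}")   # stop at check 6
        break
```

Under this reading, raising the patience makes stopping more conservative, which matches the guidance for noisy or large datasets.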

4.4 Balance probe and cluster under-use

Parameter	Default	Controls	Practical guidance
`balance_probe_epochs`	`20`	Probe period with balance term disabled.	Keep enabled. SECTOR first tests whether all clusters are naturally used.
`gamma_balance`	`1.0`	Strength of optional balance regularisation.	Treat as a safeguard against severe cluster under-use, not a routine tuning knob. Increase only if under-use persists.
`balance_mode`	`volume`	Cluster-usage definition for the balance term.	`volume` is generally used for sequencing-based ST; `node` can be useful for cell-level imaging-based ST.

If the balance probe succeeds, keep the balance term disabled. If it fails, start from the built-in default and tune only when necessary.

4.5 Pseudotime orientation and post-processing

Parameter	Default	Controls	Practical guidance
`root_cluster`	`None`	Cluster used to orient pseudotime.	Set when a biologically meaningful start domain is known.
`spatial_anchor`	`south`	Spatial anchor for pseudotime orientation when `root_cluster` is not set.	Choose from `north`, `south`, `east`, `west` according to tissue orientation.
`invert_y`	`True`	Whether to invert y-axis for plotting.	Adjust according to coordinate convention.
`island_min_frac`	`0.0`	Relative threshold for post hoc island cleaning.	Increase modestly for fragmented domains; reduce if small real regions are removed.
`island_min_abs`	`0`	Absolute minimum island size.	Use dataset-specific values, for example `40` in DLPFC tutorial settings.
`island_max_iter`	`2`	Maximum island-cleaning passes.	Usually does not require tuning.
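
The idea behind post hoc island cleaning can be sketched generically: find connected components of identical label on the spatial graph and let components below a minimum size adopt the most common label on their border. The code below shows one such pass. It is not SECTOR's implementation, and how SECTOR combines `island_min_frac` and `island_min_abs` into a size threshold is not specified here, so `min_size` is passed directly.

```python
from collections import Counter


def connected_components(labels, neighbours):
    """Group spot indices into connected components of identical label."""
    seen, components = set(), []
    for start in range(len(labels)):
        if start in seen:
            continue
        seen.add(start)
        component, stack = [], [start]
        while stack:
            node = stack.pop()
            component.append(node)
            for nb in neighbours[node]:
                if nb not in seen and labels[nb] == labels[start]:
                    seen.add(nb)
                    stack.append(nb)
        components.append(component)
    return components


def clean_islands(labels, neighbours, min_size):
    """One cleaning pass: small components adopt their most common bordering label."""
    cleaned = list(labels)
    for component in connected_components(labels, neighbours):
        if len(component) >= min_size:
            continue
        border = Counter(
            labels[nb] for node in component for nb in neighbours[node]
            if labels[nb] != labels[component[0]]
        )
        if border:
            winner = border.most_common(1)[0][0]
            for node in component:
                cleaned[node] = winner
    return cleaned


# Seven spots on a line; spot 3 is an isolated island of label 1.
neighbours = [[1], [0, 2], [1, 3], [2, 4], [3, 5], [4, 6], [5]]
print(clean_islands([0, 0, 0, 1, 0, 0, 0], neighbours, min_size=2))
# [0, 0, 0, 0, 0, 0, 0]
```

A threshold that is too large would also absorb small real regions, which is why the guidance above suggests increasing it only modestly.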

4.6 Large-scale mode

Parameter	Default	Controls	Practical guidance
`large_scale_mode`	`auto`	Dense/sparse implementation switching.	Keep as `auto` for most users.
`large_scale_n_obs_threshold`	`100000`	Threshold for very-large sparse mode.	Lower this value if memory is limited; raise it if hardware allows a more adaptive feature graph.
`use_hvg_only`	`1`	Whether large-scale mode keeps HVGs only.	Use `1` for whole-transcriptome data unless all genes are needed.
`attr_graph_mode`	`cached_exact`	Feature-graph builder in large-scale mode.	Default is recommended.

In large_scale_mode="auto", SECTOR uses:

dense mode for n_obs < 10,000;
sparse graph construction with an MLP-derived feature graph for 10,000 <= n_obs < large_scale_n_obs_threshold;
sparse graph construction with a PCA/raw feature graph for n_obs >= large_scale_n_obs_threshold.

This preserves the same SECTOR objective but avoids dense N × N distance or adjacency matrices for large datasets.
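
The switching rule above can be written as a small function, which is handy for predicting which path a dataset will take before loading it. The function name and the returned strings are illustrative; only the thresholds come from the rule described above.

```python
def select_graph_mode(n_obs: int, large_scale_n_obs_threshold: int = 100_000) -> str:
    """Mirror the documented large_scale_mode="auto" rule (labels are illustrative)."""
    if n_obs < 10_000:
        return "dense"
    if n_obs < large_scale_n_obs_threshold:
        return "sparse, MLP-derived feature graph"
    return "sparse, PCA/raw feature graph"


for n in (5_000, 50_000, 100_000):
    print(n, "->", select_graph_mode(n))
```

Lowering `large_scale_n_obs_threshold` moves mid-sized datasets into the PCA/raw branch, for example `select_graph_mode(50_000, large_scale_n_obs_threshold=40_000)`.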

5. Diagnostics on common failure modes

SECTOR is usually robust across moderate parameter ranges, but different platforms and tissues may require limited tuning.

5.1 Oversmoothed domains

Typical pattern: neighbouring anatomical regions are merged, narrow boundaries disappear, or small biologically meaningful domains are absorbed into larger regions.

Possible adjustments:

Decrease lambda_tv to reduce spatial smoothing.
Reduce spatial or feature-graph connectivity by decreasing k_s and/or k.
Reduce post hoc island-cleaning strength by decreasing island_min_frac and/or island_min_abs.

5.2 Fragmented or noisy domains

Typical pattern: predicted domains contain many isolated islands, boundaries appear noisy, or spatially coherent tissue regions are split into many small pieces.

Possible adjustments:

Increase lambda_tv to encourage stronger spatial coherence.
Increase spatial or feature-graph connectivity by increasing k_s and/or k.
Apply modest island cleaning by slightly increasing island_min_frac and/or island_min_abs.

5.3 Under-used clusters

Typical pattern: the number of predicted clusters is smaller than num_clusters, or one or more clusters contain very few spots/cells/bins.

Possible adjustments:

Keep the balance probe enabled so SECTOR can first test whether all clusters are naturally used.
Increase gamma_balance only if cluster under-use persists after the balance probe.
Reconsider whether num_clusters is biologically reasonable; persistent under-use may indicate that the requested number of clusters is larger than the tissue structure supports.
For large-scale datasets, also consider tuning lr, because unstable optimisation can contribute to poor cluster usage.
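
Cluster under-use can be detected directly from the predicted labels. In the sketch below, `underused_clusters` is a hypothetical helper and `min_frac` is an illustrative cut-off rather than a SECTOR parameter.

```python
from collections import Counter


def underused_clusters(pred_labels, num_clusters, min_frac=0.01):
    """Report empty clusters and clusters below min_frac of all observations.

    min_frac is an illustrative cut-off, not a SECTOR parameter.
    """
    counts = Counter(pred_labels)
    n = len(pred_labels)
    missing = num_clusters - len(counts)
    small = sorted(c for c, size in counts.items() if size / n < min_frac)
    return {"missing": missing, "small": small}


labels = [0] * 600 + [1] * 395 + [2] * 5
print(underused_clusters(labels, num_clusters=4))
# {'missing': 1, 'small': [2]}
```

With SECTOR output, the labels would come from `list(adata.obs["pred_region"])`.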

5.4 Unstable training or poor convergence

Typical pattern: results vary substantially across runs, cluster usage changes abruptly, or training does not stabilise.

Possible adjustments:

Reduce lr, especially for large or highly heterogeneous datasets.
Increase epochs if the model has not converged.
Check whether early stopping is too aggressive for the dataset.
Keep the random seed fixed when comparing parameter settings.

5.5 High memory usage

Typical pattern: training is slow, memory is exhausted, or graph construction becomes the bottleneck.

Possible adjustments:

Keep large_scale_mode="auto"; SECTOR automatically switches from dense to sparse graph construction for larger datasets.
For very large datasets, SECTOR further reduces memory usage by constructing the feature graph from the PCA representation rather than repeatedly updating it from a changing neural representation.
Adjust large_scale_n_obs_threshold depending on available hardware. The default is 100000.
For whole-transcriptome or large-panel datasets, reduce n_top_genes and/or n_comps.
Reducing k_s and/or k can also reduce graph size, but may affect spatial continuity and should be tuned together with lambda_tv.
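
A back-of-the-envelope estimate shows why dense matrices become impractical at scale. The sketch assumes float32 entries for a dense matrix and, for a sparse k-neighbour graph, two int64 indices plus one float32 weight per edge; these storage layouts are assumptions, and actual usage in SECTOR will differ.

```python
def dense_matrix_gib(n_obs: int, bytes_per_entry: int = 4) -> float:
    """Memory of one dense n_obs x n_obs matrix in GiB (float32 by default)."""
    return n_obs ** 2 * bytes_per_entry / 2 ** 30


def knn_graph_mib(n_obs: int, k: int, bytes_per_edge: int = 20) -> float:
    """Rough size in MiB of a k-neighbour edge list (assumed 20 bytes per edge)."""
    return n_obs * k * bytes_per_edge / 2 ** 20


for n in (10_000, 100_000):
    print(f"n_obs={n}: dense {dense_matrix_gib(n):.2f} GiB, "
          f"k=6 sparse {knn_graph_mib(n, 6):.2f} MiB")
```

A single dense matrix grows from roughly 0.37 GiB at 10,000 observations to roughly 37 GiB at 100,000, whereas the sparse edge list grows only linearly with `n_obs` and the neighbour count.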
