pyECOD Mini - Clean Domain Partitioning Tool

Version: 2.0.0

Status: ✅ Production Ready - Integrated with pyecod_prod Framework

A minimal, validated domain partitioning tool for ECOD protein classification with production-ready scaling.

Overview

pyECOD Mini is a clean extraction of the proven domain partitioning algorithm from the legacy pyECOD project. This repository represents a fresh start with:

✅ Validated Algorithm: 6/6 regression tests passing, ~80% domain boundary accuracy
✅ Modern Python: Type hints, dataclasses, mypy compliance
✅ Production Ready: Integrated with pyecod_prod for large-scale processing
✅ Clean Architecture: Simple, maintainable, well-documented
✅ External Workflow Support: CLI arguments for integration with batch processing systems

Quick Start

Installation

# Clone repository
git clone git@github.com:rschaeff/pyecod_mini.git
cd pyecod_mini

# Install in development mode
pip install -e . --user

# Verify installation
pyecod-mini --validate

Basic Usage

Partition a single protein from a batch:

pyecod-mini 8ovp_A --batch-id ecod_weekly_20250905 --verbose

Use custom input/output paths (for integration with external workflows):

pyecod-mini 8s72_N \
    --summary-xml /path/to/domains/8s72_N.develop291.domain_summary.xml \
    --output /path/to/partitions/8s72_N.domains.xml \
    --verbose

Generate PyMOL visualization:

pyecod-mini 8ovp_A --batch-id ecod_weekly_20250905 --visualize
pymol /data/ecod/pdb_updates/batches/ecod_weekly_20250905/comparison_8ovp_A.pml

Current Status

✅ Production Ready - Fully integrated with pyecod_prod framework

The algorithm currently:

✅ Partitions domains with ~80% domain boundary accuracy
✅ Passes 6 regression tests
✅ Handles discontinuous domains
✅ Optimizes domain boundaries
✅ Produces ECOD-compliant XML output
✅ Integrates with batch processing workflows
✅ Processes 15-chain test batch successfully (100%)

Goals

Phase 1: Algorithm Extraction

Extract the proven mini algorithm with improvements:

Modern Python Standards
- All data structures as dataclasses (no dicts)
- Comprehensive type hints (mypy compliant)
- Black code formatting
- Ruff linting
Validated Correctness
- All 6 regression tests passing
- >= 80% test coverage
- Performance benchmarks met
Clean Architecture
- Modular design
- Clear separation of concerns
- Comprehensive documentation

Phase 2: Production Framework

Build scalable production processing:

SLURM Integration
- Parallel processing of 40k+ proteins
- Job tracking and monitoring
- Automatic retry on failure
Database Integration
- Import results to PostgreSQL
- Quality filtering
- Collision detection
Observability
- Real-time progress monitoring
- Quality metrics tracking
- Performance dashboards

Key Features

Core Algorithm

Evidence Integration: Combine BLAST and HHsearch evidence
Chain Decomposition: Handle multi-domain chains via BLAST alignment
Boundary Optimization: Refine domain boundaries using alignment quality
Discontinuous Domains: Support multi-segment domains
Provenance Tracking: Complete audit trail of decisions
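
As an illustration of the discontinuous-domain support listed above, a multi-segment domain can be modelled as a frozen dataclass holding an ordered tuple of residue ranges. This is a minimal sketch, not the package's actual data model: the names Segment, Domain, and range_string, and the "start-end,start-end" range format, are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Segment:
    """One contiguous residue range, 1-based and inclusive (hypothetical type)."""

    start: int
    end: int

    def __post_init__(self) -> None:
        if self.start < 1 or self.end < self.start:
            raise ValueError(f"invalid segment {self.start}-{self.end}")

    @property
    def length(self) -> int:
        return self.end - self.start + 1


@dataclass(frozen=True)
class Domain:
    """A domain made of one or more segments (hypothetical type)."""

    family: str
    segments: Tuple[Segment, ...]

    @property
    def is_discontinuous(self) -> bool:
        # More than one segment means the domain is split along the chain.
        return len(self.segments) > 1

    @property
    def length(self) -> int:
        return sum(segment.length for segment in self.segments)

    @property
    def range_string(self) -> str:
        # Assumed text form, e.g. "1-100,150-200".
        return ",".join(f"{s.start}-{s.end}" for s in self.segments)
```

Keeping the segments in a tuple on a frozen dataclass makes a domain hashable and prevents accidental mutation after boundaries have been optimized.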

CLI Arguments

Batch Mode (Default):

pyecod-mini PROTEIN_ID [--batch-id BATCH_ID] [--verbose] [--visualize]

PROTEIN_ID: Protein to partition (e.g., 8ovp_A, 8s72_N)
--batch-id: Optional batch ID for batch detection
--verbose: Show detailed processing information
--visualize: Generate PyMOL comparison script

Integration Mode (Custom Paths):

pyecod-mini PROTEIN_ID \
    --summary-xml PATH_TO_DOMAIN_SUMMARY_XML \
    --output PATH_TO_OUTPUT_XML \
    [--verbose]

--summary-xml: Path to input domain summary XML (overrides batch detection)
--output: Path to output partition XML (overrides batch detection)

This mode enables integration with external batch processing workflows like pyecod_prod.
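
An external workflow could assemble the integration-mode command as sketched below. The sketch uses only the flags documented above; the helper names build_command and run_partition are hypothetical, and treating a zero exit status as success is an assumption about the CLI.

```python
import subprocess
from pathlib import Path
from typing import List


def build_command(
    protein_id: str,
    summary_xml: Path,
    output: Path,
    verbose: bool = False,
    executable: str = "pyecod-mini",
) -> List[str]:
    """Build the argument list for an integration-mode call."""
    command = [
        executable,
        protein_id,
        "--summary-xml",
        str(summary_xml),
        "--output",
        str(output),
    ]
    if verbose:
        command.append("--verbose")
    return command


def run_partition(command: List[str]) -> bool:
    """Run the command; assumes exit status 0 signals success."""
    completed = subprocess.run(command, check=False)
    return completed.returncode == 0
```

Passing the arguments as a list rather than a shell string avoids quoting problems when paths contain spaces.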

Integration with pyecod_prod

pyecod-mini is designed to integrate seamlessly with the pyecod_prod batch processing framework:

1. pyecod_prod runs BLAST and HHsearch and generates domain summaries.
2. pyecod_prod calls pyecod-mini with --summary-xml and --output for each chain.
3. pyecod-mini partitions domains and writes output XML with provenance metadata.
4. pyecod_prod tracks completion in the batch manifest.

Example Integration Call:

from pyecod_prod.partition.partition_runner import PartitionRunner

runner = PartitionRunner(pyecod_mini_path="/home/user/.local/bin/pyecod-mini")
runner.partition_protein(
    pdb_id="8s72",
    chain_id="N",
    summary_xml="/data/ecod/batches/ecod_weekly_20250905/domains/8s72_N.develop291.domain_summary.xml",
    output_path="/data/ecod/batches/ecod_weekly_20250905/partitions/8s72_N.domains.xml"
)
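
The per-chain flow described in this section can be sketched as a loop that calls a partition function for each chain and records the outcome. This is an illustration only: the Status enum, the process_chains function, and the manifest modelled as a mapping from chain ID to status are assumptions, since the real pyecod_prod manifest format is not described here.

```python
from enum import Enum
from typing import Callable, Dict, Iterable


class Status(Enum):
    """Hypothetical per-chain states recorded in the manifest."""

    COMPLETE = "complete"
    FAILED = "failed"


def process_chains(
    chain_ids: Iterable[str],
    partition: Callable[[str], bool],
    manifest: Dict[str, Status],
) -> Dict[str, Status]:
    """Partition each chain once and record whether it succeeded."""
    for chain_id in chain_ids:
        if manifest.get(chain_id) is Status.COMPLETE:
            continue  # finished in an earlier run, so skip it
        try:
            succeeded = partition(chain_id)
        except Exception:
            # One failing chain should not abort the rest of the batch.
            succeeded = False
        manifest[chain_id] = Status.COMPLETE if succeeded else Status.FAILED
    return manifest
```

Injecting the partition callable keeps the loop testable without a real pyecod-mini installation; in practice it would wrap a call such as runner.partition_protein.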

Production Processing

Batch Scanning: Identify proteins to process
External Workflow Integration: Custom file paths via CLI
Progress Monitoring: Real-time dashboard in pyecod_prod
Quality Control: Automatic quality assessment
Database Import: Safe import with collision detection

Project Structure

pyecod_mini/
├── src/pyecod_mini/          # Main package
│   ├── core/                 # Core algorithm
│   ├── cli/                  # Command-line interface
│   └── production/           # Production framework
├── tests/                    # Test suite
├── data/                     # Reference data
├── scripts/                  # Utility scripts
├── docs/                     # Documentation
└── config/                   # Configuration templates

See EXTRACTION_PLAN.md for detailed structure.

Development

Prerequisites

Python >= 3.9
PostgreSQL (for production database)
SLURM cluster (for production processing)

Setup

# Clone repository
git clone git@github.com:rschaeff/pyecod_mini.git
cd pyecod_mini

# Install in development mode
pip install -e . --user

# Validate installation
pyecod-mini --validate

# Run tests
pytest tests/

# Type checking
mypy src/

# Format code
black src/ tests/

Testing

Regression Tests

The 6 critical regression tests validate algorithm correctness:

pytest tests/test_ecod_regression.py -v

Test Case 1: 8ovp_A (GFP-PBP fusion protein)

Expected: 3 domains (or 2 if the discontinuous segments are merged into one domain)
Coverage: >= 80%
Classification: GFP + PBP families correctly identified
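
The domain-count and coverage expectations for this test case could be checked along the following lines. This is a sketch under stated assumptions, not the code in tests/test_ecod_regression.py: coverage is taken to mean the fraction of chain residues assigned to at least one domain, and the function names are hypothetical.

```python
from typing import Iterable, Set, Tuple


def residue_coverage(
    segments: Iterable[Tuple[int, int]], sequence_length: int
) -> float:
    """Fraction of residues (1-based, inclusive ranges) covered by any segment."""
    if sequence_length <= 0:
        raise ValueError("sequence_length must be positive")
    covered: Set[int] = set()
    for start, end in segments:
        # Clamp to the chain so out-of-range boundaries cannot inflate coverage.
        covered.update(range(max(start, 1), min(end, sequence_length) + 1))
    return len(covered) / sequence_length


def meets_8ovp_a_expectations(domain_count: int, coverage: float) -> bool:
    """Accept 3 domains (or 2 when merged) with at least 80% coverage."""
    return domain_count in (2, 3) and coverage >= 0.80
```

Collecting residues in a set means overlapping segments are counted once, so overlapping domain boundaries cannot push coverage above 100%.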

Performance Tests

pytest tests/ -m performance

Target metrics:

Single protein: < 10 seconds
100 proteins: < 15 minutes
Memory: < 500 MB per protein
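
A single-protein measurement against these targets might look like the sketch below. The function names are hypothetical, and tracemalloc only tracks memory allocated through Python's allocator, so the peak it reports is an approximation that can understate the process's true memory footprint.

```python
import time
import tracemalloc
from typing import Callable, Tuple

MAX_SECONDS_PER_PROTEIN = 10.0
MAX_MEGABYTES_PER_PROTEIN = 500.0


def measure(task: Callable[[], object]) -> Tuple[float, float]:
    """Return (elapsed seconds, peak traced megabytes) for one call of task."""
    tracemalloc.start()
    started = time.perf_counter()
    try:
        task()
        elapsed = time.perf_counter() - started
        _, peak_bytes = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return elapsed, peak_bytes / (1024 * 1024)


def within_single_protein_targets(seconds: float, megabytes: float) -> bool:
    """Compare one measurement with the single-protein targets listed above."""
    return seconds < MAX_SECONDS_PER_PROTEIN and megabytes < MAX_MEGABYTES_PER_PROTEIN
```

Here task would be a zero-argument callable that partitions one protein, for example a lambda wrapping the call under test.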

Documentation

EXTRACTION_PLAN.md - Repository setup and extraction plan
ALGORITHM_VALIDATION.md - Validation strategy and test plan
PRODUCTION_DESIGN.md - Production framework architecture
CLAUDE.md - Lessons learned from original pyECOD

Key Lessons Applied

From the original pyECOD failure:

✅ Algorithm First, Infrastructure Second
- Validate algorithm before building complex infrastructure
- Use proven algorithm as foundation
✅ Consistent Data Structures
- All structured data as dataclasses
- No dict/dataclass confusion
- Type safety throughout
✅ Test-Driven Validation
- Regression tests prove correctness
- Performance tests ensure scalability
- Quality tests validate production readiness
✅ Simple > Complex
- No premature abstraction
- Clear, maintainable code
- Avoid over-engineering

Roadmap

Phase 1: Extraction & Validation ✅ COMPLETED

Phase 2: Production Framework ✅ COMPLETED

Integration with pyecod_prod
CLI arguments for external workflows
Custom path support (--summary-xml, --output)
Batch processing integration
End-to-end validation (15/15 chains, 100% success)

Phase 3: Production Deployment 🚧 IN PROGRESS

Process 15-chain test batch (100% success)
Process 100-protein test batch
Process 1000-protein staging batch
Full production run (40k+ proteins)
Quality assessment
Database import
Documentation finalization

Contributing

This is an internal research project. Development follows:

Code Style: Black (line length 100)
Type Checking: mypy strict mode
Testing: pytest with >= 80% coverage
Documentation: Google-style docstrings

License

Internal research project - not for public distribution.

Authors

Based on the original pyECOD Mini implementation, redesigned and refactored for production use.

Acknowledgments

Original mini algorithm developers
ECOD database team
Lessons learned from the failed main pyECOD pipeline (see ../pyecod/CLAUDE.md)

Current Status: ✅ Production Ready - Integrated with pyecod_prod

Algorithm: 6/6 regression tests passing
Integration: End-to-end validation complete (15/15 chains, 100%)
CLI: Enhanced with --summary-xml and --output arguments
Next: Scale to full production batch (40k+ proteins)
