Version: 2.0.0
Status: ✅ Production Ready - Integrated with pyecod_prod Framework
A minimal, validated domain partitioning tool for ECOD protein classification with production-ready scaling.
pyECOD Mini is a clean extraction of the proven domain partitioning algorithm from the legacy pyECOD project. This repository represents a fresh start with:
- ✅ Validated Algorithm: 6/6 regression tests passing, ~80% domain boundary accuracy
- ✅ Modern Python: Type hints, dataclasses, mypy compliance
- ✅ Production Ready: Integrated with pyecod_prod for large-scale processing
- ✅ Clean Architecture: Simple, maintainable, well-documented
- ✅ External Workflow Support: CLI arguments for integration with batch processing systems
# Clone repository
git clone git@github.com:rschaeff/pyecod_mini.git
cd pyecod_mini
# Install in development mode
pip install -e . --user
# Verify installation
pyecod-mini --validate
Partition a single protein from a batch:
pyecod-mini 8ovp_A --batch-id ecod_weekly_20250905 --verbose
Use custom input/output paths (for integration with external workflows):
pyecod-mini 8s72_N \
    --summary-xml /path/to/domains/8s72_N.develop291.domain_summary.xml \
    --output /path/to/partitions/8s72_N.domains.xml \
    --verbose
Generate PyMOL visualization:
pyecod-mini 8ovp_A --batch-id ecod_weekly_20250905 --visualize
pymol /data/ecod/pdb_updates/batches/ecod_weekly_20250905/comparison_8ovp_A.pml
✅ Production Ready - Fully integrated with pyecod_prod framework
The algorithm currently:
- ✅ Partitions domains with ~80% accuracy
- ✅ Passes 6 regression tests
- ✅ Handles discontinuous domains
- ✅ Optimizes domain boundaries
- ✅ Produces ECOD-compliant XML output
- ✅ Integrates with batch processing workflows
- ✅ Processes 15-chain test batch successfully (100%)
Extract the proven mini algorithm with improvements:
- Modern Python Standards
  - All data structures as dataclasses (no dicts)
  - Comprehensive type hints (mypy compliant)
  - Black code formatting
  - Ruff linting
- Validated Correctness
  - All 6 regression tests passing
  - 80% test coverage
  - Performance benchmarks met
- Clean Architecture
  - Modular design
  - Clear separation of concerns
  - Comprehensive documentation
Build scalable production processing:
- SLURM Integration
  - Parallel processing of 40k+ proteins
  - Job tracking and monitoring
  - Automatic retry on failure
- Database Integration
  - Import results to PostgreSQL
  - Quality filtering
  - Collision detection
- Observability
  - Real-time progress monitoring
  - Quality metrics tracking
  - Performance dashboards
- Evidence Integration: Combine BLAST and HHsearch evidence
- Chain Decomposition: Handle multi-domain chains via BLAST alignment
- Boundary Optimization: Refine domain boundaries using alignment quality
- Discontinuous Domains: Support multi-segment domains
- Provenance Tracking: Complete audit trail of decisions
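As an illustration of the "Discontinuous Domains" and "dataclasses, no dicts" points above, here is a minimal sketch of how a multi-segment domain could be modeled. The names `Segment`, `Domain`, and `parse_range` are hypothetical and are not taken from the pyecod_mini source; the comma-separated range format is also an assumption.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Segment:
    """One contiguous run of residues, inclusive on both ends (hypothetical)."""

    start: int
    end: int

    def __post_init__(self) -> None:
        if self.start > self.end:
            raise ValueError(f"segment start {self.start} exceeds end {self.end}")

    @property
    def length(self) -> int:
        return self.end - self.start + 1


@dataclass(frozen=True)
class Domain:
    """A domain built from one or more segments (hypothetical structure)."""

    domain_id: str
    segments: Tuple[Segment, ...]

    @property
    def is_discontinuous(self) -> bool:
        # More than one segment means the domain is interrupted in sequence.
        return len(self.segments) > 1

    @property
    def size(self) -> int:
        return sum(seg.length for seg in self.segments)

    @property
    def range_string(self) -> str:
        return ",".join(f"{seg.start}-{seg.end}" for seg in self.segments)


def parse_range(domain_id: str, text: str) -> Domain:
    """Parse a range such as '1-50,120-200' into a Domain.

    Assumes positive residue numbers; negative numbering would need a
    more careful parser than a plain split on '-'.
    """
    segments = []
    for part in text.split(","):
        start, end = part.strip().split("-")
        segments.append(Segment(int(start), int(end)))
    # Keep segments ordered by start position for stable output.
    return Domain(domain_id, tuple(sorted(segments, key=lambda s: s.start)))
```

Freezing the dataclasses keeps a domain immutable once built, which is one way to avoid the dict/dataclass confusion described elsewhere in this README.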
Batch Mode (Default):
pyecod-mini PROTEIN_ID [--batch-id BATCH_ID] [--verbose] [--visualize]
- PROTEIN_ID: Protein to partition (e.g., 8ovp_A, 8s72_N)
- --batch-id: Optional batch ID for batch detection
- --verbose: Show detailed processing information
- --visualize: Generate PyMOL comparison script
Integration Mode (Custom Paths):
pyecod-mini PROTEIN_ID \
    --summary-xml PATH_TO_DOMAIN_SUMMARY_XML \
    --output PATH_TO_OUTPUT_XML \
    [--verbose]
- --summary-xml: Path to input domain summary XML (overrides batch detection)
- --output: Path to output partition XML (overrides batch detection)
This mode enables integration with external batch processing workflows like pyecod_prod.
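For callers that do not use pyecod_prod, the integration-mode invocation could be assembled from Python roughly as follows. This is a sketch only: `build_partition_command` and `run_partition` are hypothetical helpers, and treating exit status 0 as success is an assumption about the CLI rather than documented behavior.

```python
import subprocess
from pathlib import Path
from typing import List


def build_partition_command(
    protein_id: str,
    summary_xml: Path,
    output_xml: Path,
    executable: str = "pyecod-mini",
    verbose: bool = False,
) -> List[str]:
    """Assemble the integration-mode command line as an argument list."""
    cmd = [
        executable,
        protein_id,
        "--summary-xml", str(summary_xml),
        "--output", str(output_xml),
    ]
    if verbose:
        cmd.append("--verbose")
    return cmd


def run_partition(cmd: List[str]) -> bool:
    """Run the command; exit status 0 is assumed to mean success."""
    completed = subprocess.run(cmd, capture_output=True, text=True)
    return completed.returncode == 0
```

Passing an argument list instead of a shell string avoids quoting problems when paths contain spaces.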
pyecod-mini is designed to integrate seamlessly with the pyecod_prod batch processing framework:
- pyecod_prod runs BLAST and HHsearch, generates domain summaries
- pyecod_prod calls pyecod-mini with --summary-xml and --output for each chain
- pyecod-mini partitions domains and writes output XML with provenance metadata
- pyecod_prod tracks completion in batch manifest
Example Integration Call:
from pyecod_prod.partition.partition_runner import PartitionRunner

runner = PartitionRunner(pyecod_mini_path="/home/user/.local/bin/pyecod-mini")
runner.partition_protein(
    pdb_id="8s72",
    chain_id="N",
    summary_xml="/data/ecod/batches/ecod_weekly_20250905/domains/8s72_N.develop291.domain_summary.xml",
    output_path="/data/ecod/batches/ecod_weekly_20250905/partitions/8s72_N.domains.xml",
)
- Batch Scanning: Identify proteins to process
- External Workflow Integration: Custom file paths via CLI
- Progress Monitoring: Real-time dashboard in pyecod_prod
- Quality Control: Automatic quality assessment
- Database Import: Safe import with collision detection
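The "Batch Scanning" step could look something like the sketch below. It assumes the `domains/` and `partitions/` layout and the file naming seen in the example paths above; `PartitionTask` and `scan_batch` are hypothetical names, and the real scanner in pyecod_prod may work differently (for example, by reading the batch manifest).

```python
from dataclasses import dataclass
from pathlib import Path
from typing import List


@dataclass(frozen=True)
class PartitionTask:
    """One chain that still needs partitioning (hypothetical structure)."""

    protein_id: str
    summary_xml: Path
    output_xml: Path


def scan_batch(batch_dir: Path) -> List[PartitionTask]:
    """List chains that have a domain summary but no partition output yet."""
    tasks = []
    for summary in sorted((batch_dir / "domains").glob("*.domain_summary.xml")):
        # '8s72_N.develop291.domain_summary.xml' -> '8s72_N'
        protein_id = summary.name.split(".", 1)[0]
        output = batch_dir / "partitions" / f"{protein_id}.domains.xml"
        if not output.exists():
            tasks.append(PartitionTask(protein_id, summary, output))
    return tasks
```

Skipping chains whose output file already exists makes the scan safe to rerun after an interrupted batch.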
pyecod_mini/
├── src/pyecod_mini/ # Main package
│ ├── core/ # Core algorithm
│ ├── cli/ # Command-line interface
│ └── production/ # Production framework
├── tests/ # Test suite
├── data/ # Reference data
├── scripts/ # Utility scripts
├── docs/ # Documentation
└── config/ # Configuration templates
See EXTRACTION_PLAN.md for detailed structure.
- Python >= 3.9
- PostgreSQL (for production database)
- SLURM cluster (for production processing)
# Clone repository
git clone git@github.com:rschaeff/pyecod_mini.git
cd pyecod_mini
# Install in development mode
pip install -e . --user
# Validate installation
pyecod-mini --validate
# Run tests
pytest tests/
# Type checking
mypy src/
# Format code
black src/ tests/
The 6 critical regression tests validate algorithm correctness:
pytest tests/test_ecod_regression.py -v
Test Case 1: 8ovp_A (GFP-PBP fusion protein)
- Expected: 3 domains (or 2 if the discontinuous segments are merged)
- Coverage: >= 80%
- Classification: GFP + PBP families correctly identified
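The domain-count and coverage expectations above might be checked along these lines. This is a sketch, not the actual regression test: it assumes coverage means the fraction of chain residues assigned to any domain, and the function names are hypothetical.

```python
from typing import List, Tuple

Segment = Tuple[int, int]  # inclusive (start, end) residue positions


def coverage(domains: List[List[Segment]], chain_length: int) -> float:
    """Fraction of chain residues assigned to at least one domain."""
    covered = set()
    for segments in domains:
        for start, end in segments:
            covered.update(range(start, end + 1))
    return len(covered) / chain_length


def meets_expectation(domains: List[List[Segment]], chain_length: int) -> bool:
    """Accept 2 or 3 domains with at least 80% coverage."""
    return len(domains) in (2, 3) and coverage(domains, chain_length) >= 0.80
```

A toy call with made-up boundaries (not the real 8ovp_A coordinates) would be `meets_expectation([[(1, 30)], [(31, 50), (81, 100)], [(51, 80)]], chain_length=100)`.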
pytest tests/ -m performance
Target metrics:
- Single protein: < 10 seconds
- 100 proteins: < 15 minutes
- Memory: < 500 MB per protein
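One way to check the single-protein targets from Python is sketched below using only the standard library. `measure` and `within_budget` are hypothetical helpers. Note that `tracemalloc` only tracks memory allocated by Python itself, so it is a rough proxy for the per-protein memory figure rather than a measure of total process memory.

```python
import time
import tracemalloc
from typing import Callable, Tuple


def measure(func: Callable[[], object]) -> Tuple[float, float]:
    """Return (elapsed seconds, peak traced memory in MB) for one call."""
    tracemalloc.start()
    started = time.perf_counter()
    try:
        func()
    finally:
        elapsed = time.perf_counter() - started
        _, peak_bytes = tracemalloc.get_traced_memory()
        tracemalloc.stop()
    return elapsed, peak_bytes / (1024 * 1024)


def within_budget(
    elapsed: float,
    peak_mb: float,
    max_seconds: float = 10.0,
    max_mb: float = 500.0,
) -> bool:
    """Compare a measurement against the single-protein targets."""
    return elapsed < max_seconds and peak_mb < max_mb
```

In a real performance test, the callable passed to `measure` would wrap the partitioning of one protein.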
- EXTRACTION_PLAN.md - Repository setup and extraction plan
- ALGORITHM_VALIDATION.md - Validation strategy and test plan
- PRODUCTION_DESIGN.md - Production framework architecture
- CLAUDE.md - Lessons learned from original pyECOD
From the original pyECOD failure:
- ✅ Algorithm First, Infrastructure Second
  - Validate algorithm before building complex infrastructure
  - Use proven algorithm as foundation
- ✅ Consistent Data Structures
  - All structured data as dataclasses
  - No dict/dataclass confusion
  - Type safety throughout
- ✅ Test-Driven Validation
  - Regression tests prove correctness
  - Performance tests ensure scalability
  - Quality tests validate production readiness
- ✅ Simple > Complex
  - No premature abstraction
  - Clear, maintainable code
  - Avoid over-engineering
- Create repository structure
- Design extraction plan
- Design validation plan
- Design production framework
- Extract core algorithm with type hints
- Migrate test suite
- Run regression tests (6/6 passing)
- Performance benchmarks
- Integration with pyecod_prod
- CLI arguments for external workflows
- Custom path support (--summary-xml, --output)
- Batch processing integration
- End-to-end validation (15/15 chains, 100% success)
- Process 15-chain test batch (100% success)
- Process 100-protein test batch
- Process 1000-protein staging batch
- Full production run (40k+ proteins)
- Quality assessment
- Database import
- Documentation finalization
This is an internal research project. Development follows:
- Code Style: Black (line length 100)
- Type Checking: mypy strict mode
- Testing: pytest with >= 80% coverage
- Documentation: Google-style docstrings
Internal research project - not for public distribution.
Based on the original pyECOD Mini implementation, redesigned and refactored for production use.
- Original mini algorithm developers
- ECOD database team
- Lessons learned from the failed main pyECOD pipeline (see ../pyecod/CLAUDE.md)
Current Status: ✅ Production Ready - Integrated with pyecod_prod
- Algorithm: 6/6 regression tests passing
- Integration: End-to-end validation complete (15/15 chains, 100%)
- CLI: Enhanced with --summary-xml and --output arguments
- Next: Scale to full production batch (40k+ proteins)