cellethology/sei-framework

SEI Framework

SEI (Sequence-to-Effect Inference) is a deep learning framework for predicting chromatin profiles and sequence classes from DNA sequences. This framework allows you to extract embeddings from DNA sequences using the pre-trained SEI model.

Overview

The SEI framework provides tools to:

  • Extract embeddings from FASTA sequences using the pre-trained SEI model
  • Predict chromatin profiles and sequence classes from DNA sequences
  • Process sequences in batches for efficient inference
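Batch processing, as listed above, amounts to grouping sequences into fixed-size chunks before each forward pass. The following is a minimal sketch, assuming the sequences are held in a plain Python list; the helper name `iter_batches` is hypothetical and is not taken from this repository.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def iter_batches(items: Sequence[T], batch_size: int = 32) -> Iterator[List[T]]:
    """Yield consecutive chunks of at most `batch_size` items (hypothetical helper)."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# 70 placeholder sequences are split into batches of 32, 32, and 6.
sequences = ["ACGT"] * 70
batch_sizes = [len(batch) for batch in iter_batches(sequences, batch_size=32)]
print(batch_sizes)
```

The last batch is simply shorter than the others, so no padding sequences are needed.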

Prerequisites

Environment Setup

1. Install uv (if not already installed)

# Install uv with pip
pip install uv

2. Create Virtual Environment and Install Dependencies

Create the virtual environment, install the dependencies, and activate the environment:

uv sync
source .venv/bin/activate

3. Download the SEI Model

The trained SEI model needs to be downloaded from Zenodo. You can use the provided download script:

bash download_data.sh

This will download:

  • The trained SEI model (sei_model.tar.gz) from Zenodo
  • SEI framework resources (FASTA files) from Zenodo

Note: The model files should be extracted to the model/ directory. The main model file should be at model/sei.pth.

Alternatively, download the files manually from Zenodo, or run the environment setup script:

bash env_setup.sh

Usage

Extract Embeddings from FASTA Sequences

The main script for extracting embeddings is retrieve_embeddings/retrieve_embeddings.py.

Basic Usage

python retrieve_embeddings/retrieve_embeddings.py \
    --input-file retrieve_embeddings/test.fasta \
    --output-file output/embeddings.npz

Full Command with All Options

python retrieve_embeddings/retrieve_embeddings.py \
    --input-file <path-to-input.fasta> \
    --output-file <path-to-output.npz> \
    --model-path model/sei.pth \
    --batch-size 32 \
    --sequence-length 4096 \
    --use-hooks

Command-Line Arguments

  • --input-file (required): Path to input FASTA file containing DNA sequences
  • --output-file (required): Path to output .npz file where embeddings will be saved
  • --model-path (optional): Path to SEI model file (default: model/sei.pth)
  • --batch-size (optional): Batch size for processing sequences (default: 32)
  • --sequence-length (optional): Target sequence length for encoding (default: 4096)
  • --use-hooks (optional, enabled by default): Use the register_hooks method for embedding extraction (recommended)
  • --no-use-hooks (optional): Use the manual layer-by-layer method instead of hooks
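For readers who want to see how the flags and defaults above fit together, here is a sketch of an equivalent `argparse` declaration. It is an illustration built only from the argument list in this README; the actual parser in retrieve_embeddings.py may be written differently, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the documented options (hypothetical reconstruction, not the repo's code)."""
    parser = argparse.ArgumentParser(description="Extract SEI embeddings from a FASTA file")
    parser.add_argument("--input-file", required=True, help="Input FASTA file")
    parser.add_argument("--output-file", required=True, help="Output .npz file")
    parser.add_argument("--model-path", default="model/sei.pth", help="SEI model file")
    parser.add_argument("--batch-size", type=int, default=32, help="Sequences per batch")
    parser.add_argument("--sequence-length", type=int, default=4096, help="Target length")
    # BooleanOptionalAction (Python 3.9+) creates both --use-hooks and --no-use-hooks.
    parser.add_argument("--use-hooks", action=argparse.BooleanOptionalAction, default=True)
    return parser


# Parse an example command line that switches to the manual method.
args = build_parser().parse_args(
    ["--input-file", "in.fasta", "--output-file", "out.npz", "--no-use-hooks"]
)
print(args.model_path, args.batch_size, args.sequence_length, args.use_hooks)
```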

Output Format

The script outputs a compressed NumPy archive (.npz) file containing:

  • ids: Array of sequence IDs from the FASTA file
  • embeddings: Array of embeddings with shape (num_sequences, 960, 16)
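To make the archive layout concrete, the sketch below writes and re-reads a file with the same two keys using `np.savez_compressed`. The IDs and the all-zero embeddings are placeholder data, and whether the script itself calls `np.savez_compressed` is an assumption based on the "compressed NumPy archive" description.

```python
import os
import tempfile

import numpy as np

# Placeholder data with the documented keys and embedding shape.
ids = np.array(["seq1", "seq2", "seq3"])
embeddings = np.zeros((3, 960, 16), dtype=np.float32)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "embeddings.npz")
    # Write both arrays into one compressed archive under named keys.
    np.savez_compressed(path, ids=ids, embeddings=embeddings)
    # Read the arrays back while the temporary file still exists.
    with np.load(path) as data:
        loaded_ids = data["ids"]
        loaded_embeddings = data["embeddings"]

print(loaded_ids.tolist(), loaded_embeddings.shape)
```

Row `i` of `embeddings` belongs to `ids[i]`, so the two arrays should always be indexed together.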

Example

# Extract embeddings using the default method (hooks)
python retrieve_embeddings/retrieve_embeddings.py \
    --input-file retrieve_embeddings/test.fasta \
    --output-file output/embeddings.npz

# Extract embeddings using manual method
python retrieve_embeddings/retrieve_embeddings.py \
    --input-file retrieve_embeddings/test.fasta \
    --output-file output/embeddings.npz \
    --no-use-hooks

Loading Embeddings

You can load the saved embeddings in Python:

import numpy as np

# Load embeddings
data = np.load('output/embeddings.npz')
sequence_ids = data['ids']
embeddings = data['embeddings']

print(f"Loaded {len(sequence_ids)} sequences")
print(f"Embeddings shape: {embeddings.shape}")  # (num_sequences, 960, 16)
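Many downstream tools (PCA, clustering, linear models) expect one feature vector per sequence rather than a 3-D array. The sketch below shows two possible reductions on random placeholder data of the documented shape. The README does not say what the two trailing axes represent, so averaging over the last axis is only one option, not the project's prescribed method.

```python
import numpy as np

# Random stand-in for data["embeddings"], with the documented trailing shape.
rng = np.random.default_rng(0)
embeddings = rng.standard_normal((4, 960, 16)).astype(np.float32)

# Option 1: flatten everything after the first axis into one long vector.
flat = embeddings.reshape(embeddings.shape[0], -1)

# Option 2: average over the last axis to get a shorter vector per sequence.
pooled = embeddings.mean(axis=-1)

print(flat.shape, pooled.shape)
```

Flattening keeps all values at the cost of 15,360 features per sequence; averaging gives 960 features and discards variation along the last axis.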

Project Structure

sei-framework/
├── model/                # SEI model files
│   ├── sei.py            # Model architecture
│   ├── sei.pth           # Trained model weights (download required)
│   └── *.names           # Target and sequence class names
├── retrieve_embeddings/  # Embedding extraction scripts
│   ├── retrieve_embeddings.py  # Main embedding extraction script
│   ├── util.py           # Utility functions for embeddings
│   └── test.fasta        # Example input file
├── encode/               # Sequence encoding utilities
├── pca/                  # PCA analysis tools
├── tests/                # Unit tests
├── env_setup.sh          # Environment setup script
├── download_data.sh      # Model download script
└── pyproject.toml        # Project dependencies

Testing

Run the test suite to verify your installation:

pytest tests/

Troubleshooting

Model File Not Found

If you encounter FileNotFoundError: Model file not found: model/sei.pth, make sure you have:

  1. Downloaded the model using bash download_data.sh
  2. Extracted the model files to the model/ directory
  3. Verified that model/sei.pth exists
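The third check can also be done from Python. This is a small sketch using only the standard library; `model_file_status` is a hypothetical helper, not a function provided by the framework.

```python
from pathlib import Path


def model_file_status(model_path: str = "model/sei.pth") -> str:
    """Describe whether the model file is present (hypothetical helper)."""
    path = Path(model_path)
    if path.is_file():
        return f"found ({path.stat().st_size} bytes)"
    if not path.parent.is_dir():
        return f"missing: directory {path.parent} does not exist"
    return f"missing: {path.name} not found in {path.parent}"


# Run from the repository root so the relative default path resolves.
print(model_file_status())
```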

The framework will automatically use CPU if CUDA is not available.
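To see which device would be chosen on a given machine, a check along these lines can be used. It is a sketch of the usual PyTorch pattern, assuming the framework runs on PyTorch (as the .pth file extension suggests); `pick_device` is a hypothetical name and the framework's own selection logic may differ.

```python
def pick_device() -> str:
    """Return "cuda" when PyTorch can see a CUDA GPU, otherwise "cpu"."""
    try:
        import torch  # imported lazily so the check also runs without PyTorch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


print(f"Inference would run on: {pick_device()}")
```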
