SEI (Sequence-to-Effect Inference) is a deep learning framework for predicting chromatin profiles and sequence classes from DNA sequences. This framework allows you to extract embeddings from DNA sequences using the pre-trained SEI model.
The SEI framework provides tools to:
- Extract embeddings from FASTA sequences using the pre-trained SEI model
- Predict chromatin profiles and sequence classes from DNA sequences
- Process sequences in batches for efficient inference
- Python 3.9 or higher
- uv package manager (install from https://github.com/astral-sh/uv)
- CUDA-capable GPU (optional, but recommended for faster inference)
# Or using pip
pip install uv

Run the environment setup command:
uv sync
source .venv/bin/activate

The trained SEI model needs to be downloaded from Zenodo. You can use the provided download script:
bash download_data.sh

This will download:
- The trained SEI model (sei_model.tar.gz) from Zenodo
- SEI framework resources (FASTA files) from Zenodo
Note: The model files should be extracted to the model/ directory. The main model file should be at model/sei.pth.
Alternatively, you can manually download from:
- Model: https://zenodo.org/record/4906996 (DOI: 10.5281/zenodo.4906996)
- Extract the model files to the model/ directory
Or just run
bash env_setup.sh
The main script for extracting embeddings is retrieve_embeddings/retrieve_embeddings.py.
python retrieve_embeddings/retrieve_embeddings.py \
--input-file retrieve_embeddings/test.fasta \
--output-file output/embeddings.npz

python retrieve_embeddings/retrieve_embeddings.py \
--input-file <path-to-input.fasta> \
--output-file <path-to-output.npz> \
--model-path model/sei.pth \
--batch-size 32 \
--sequence-length 4096 \
--use-hooks

- --input-file (required): Path to input FASTA file containing DNA sequences
- --output-file (required): Path to output .npz file where embeddings will be saved
- --model-path (optional): Path to SEI model file (default: model/sei.pth)
- --batch-size (optional): Batch size for processing sequences (default: 32)
- --sequence-length (optional): Target sequence length for encoding (default: 4096)
- --use-hooks (default): Use register_hooks method for embedding extraction (recommended)
- --no-use-hooks: Use manual layer-by-layer method instead of hooks
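
To make the --input-file, --sequence-length, and --batch-size options more concrete, here is a minimal sketch of the kind of preprocessing they imply. It is not the code in retrieve_embeddings.py or encode/. The helper names (read_fasta, fit_to_length, one_hot, iter_batches), the A, C, G, T channel order, the center-crop behavior, and the N padding are all assumptions made for illustration; the repository may handle each of these differently.

```python
import numpy as np

BASES = "ACGT"  # assumed channel order; check encode/ for the real one


def read_fasta(lines):
    """Yield (id, sequence) pairs from an iterable of FASTA lines."""
    seq_id, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            header = line[1:].split()
            seq_id, chunks = (header[0] if header else ""), []
        elif seq_id is not None:
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)


def fit_to_length(seq, length=4096, pad="N"):
    """Center-crop long sequences and pad short ones on both sides (assumed policy)."""
    seq = seq.upper()
    if len(seq) >= length:
        start = (len(seq) - length) // 2
        return seq[start:start + length]
    missing = length - len(seq)
    left = missing // 2
    return pad * left + seq + pad * (missing - left)


def one_hot(seq):
    """Return a (4, len(seq)) float32 array; non-ACGT characters stay all-zero."""
    out = np.zeros((4, len(seq)), dtype=np.float32)
    for i, base in enumerate(seq):
        j = BASES.find(base)
        if j >= 0:
            out[j, i] = 1.0
    return out


def iter_batches(seqs, batch_size=32, length=4096):
    """Yield arrays of shape (<=batch_size, 4, length), one per batch."""
    for start in range(0, len(seqs), batch_size):
        chunk = seqs[start:start + batch_size]
        yield np.stack([one_hot(fit_to_length(s, length)) for s in chunk])
```

With a FASTA file open as `handle`, `list(read_fasta(handle))` gives the (id, sequence) pairs, and the sequences can then be passed to `iter_batches` to obtain model-ready arrays one batch at a time.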
The script outputs a compressed NumPy archive (.npz) file containing:
- ids: Array of sequence IDs from the FASTA file
- embeddings: Array of embeddings with shape (num_sequences, 960, 16)
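
Given that layout, a quick sanity check on an output archive can catch mismatched files early. The sketch below is illustrative only: check_embeddings and flatten_per_sequence are hypothetical helpers, not part of the repository. They rely solely on the two keys and the shape documented above, and on the archive loading with NumPy's default settings.

```python
import numpy as np

EXPECTED_TAIL = (960, 16)  # per-sequence shape documented for the output file


def check_embeddings(path):
    """Load an .npz archive and verify it matches the documented layout."""
    with np.load(path) as data:
        ids = data["ids"]
        embeddings = data["embeddings"]
    if embeddings.ndim != 3 or embeddings.shape[1:] != EXPECTED_TAIL:
        raise ValueError(f"unexpected embeddings shape: {embeddings.shape}")
    if len(ids) != embeddings.shape[0]:
        raise ValueError("number of ids does not match number of embeddings")
    return ids, embeddings


def flatten_per_sequence(embeddings):
    """Reshape (N, 960, 16) into (N, 15360), one row per sequence."""
    return embeddings.reshape(embeddings.shape[0], -1)
```

Flattening is one simple way to obtain a single vector per sequence for tools that expect a 2-D matrix; other reductions, such as averaging over the last axis, are equally possible.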
# Extract embeddings using the default method (hooks)
python retrieve_embeddings/retrieve_embeddings.py \
--input-file retrieve_embeddings/test.fasta \
--output-file output/embeddings.npz
# Extract embeddings using manual method
python retrieve_embeddings/retrieve_embeddings.py \
--input-file retrieve_embeddings/test.fasta \
--output-file output/embeddings.npz \
--no-use-hooks

You can load the saved embeddings in Python:
import numpy as np
# Load embeddings
data = np.load('output/embeddings.npz')
sequence_ids = data['ids']
embeddings = data['embeddings']
print(f"Loaded {len(sequence_ids)} sequences")
print(f"Embeddings shape: {embeddings.shape}")  # (num_sequences, 960, 16)

sei-framework/
├── model/ # SEI model files
│ ├── sei.py # Model architecture
│ ├── sei.pth # Trained model weights (download required)
│ └── *.names # Target and sequence class names
├── retrieve_embeddings/ # Embedding extraction scripts
│ ├── retrieve_embeddings.py # Main embedding extraction script
│ ├── util.py # Utility functions for embeddings
│ └── test.fasta # Example input file
├── encode/ # Sequence encoding utilities
├── pca/ # PCA analysis tools
├── tests/ # Unit tests
├── env_setup.sh # Environment setup script
├── download_data.sh # Model download script
└── pyproject.toml # Project dependencies
Run the test suite to verify your installation:
pytest tests/

If you encounter FileNotFoundError: Model file not found: model/sei.pth, make sure you have:
- Downloaded the model using bash download_data.sh
- Extracted the model files to the model/ directory
- Verified that model/sei.pth exists
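
The last check can be scripted before launching a long run. This is a small sketch using only the standard library; require_model_file is a hypothetical helper, and the default path is the one documented above.

```python
from pathlib import Path


def require_model_file(model_path="model/sei.pth"):
    """Return the model path, or raise a descriptive error if it is missing."""
    path = Path(model_path)
    if not path.is_file():
        raise FileNotFoundError(
            f"Model file not found: {path}. "
            "Run `bash download_data.sh` and extract the archive into model/."
        )
    return path
```

Calling `require_model_file()` from the repository root either returns the path or fails immediately with a hint about the download step.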
The framework will automatically use CPU if CUDA is not available.
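
For reference, this kind of fallback is usually a one-line check in PyTorch. The sketch below shows the common pattern and is not taken from the repository; pick_device is a hypothetical name, and the extra guard for a missing torch installation is only there so the snippet runs anywhere.

```python
import importlib.util


def pick_device():
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    if importlib.util.find_spec("torch") is None:
        return "cpu"  # PyTorch not installed in this environment
    import torch
    return "cuda" if torch.cuda.is_available() else "cpu"


print(f"Inference would run on: {pick_device()}")
```

If this prints cpu on a machine with a GPU, the installed PyTorch build likely lacks CUDA support or the driver is not visible to it.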