gao-lab/SEERS

SEERS: Selective Enrichment of Episomes with Random Sequences

A systematic delineation of 3-prime UTR regulatory elements and their contextual associations.

Paper

This repository accompanies the SEERS manuscript:

https://www.biorxiv.org/content/10.1101/2025.06.09.658412v2

Repository Contents

This public repository is organized around two reproducible components:

  1. SEERS count/enrichment processing and k-mer analysis in SEERS_data_processing/
  2. TALE model training, evaluation, and ClinVar ISM analysis in TALE_model_260312/

The earlier v260128 TALE scripts are retained for reference in TALE_models_260128_legacy/.

Folder Guide

SEERS_data_processing/

  • Nn_pp.R: extracts and counts N45 sequences from merged FASTQ files.
  • Nn_pp_pool.R: pools sequence proportions across replicates.
  • combine_dna_cyt_nuc.R: computes cytoplasmic and nuclear enrichment scores from DNA/Cyt/Nuc counts.
  • kmer_profiling.R: performs k-mer association analyses.

These scripts use local input filenames configured near the top of each script. Run them from the directory containing the input files, or update work_dir and the input filenames explicitly.
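As a rough illustration of what an enrichment score computed from DNA/Cyt/Nuc counts can look like, the Python sketch below takes the log2 ratio of a sequence's RNA fraction to its DNA fraction. The formula, the pseudocount, and the function name `log2_enrichment` are assumptions made for illustration; combine_dna_cyt_nuc.R may normalize or filter differently.

```python
import math


def log2_enrichment(rna_counts, dna_counts, pseudocount=1.0):
    """Return {sequence: log2(RNA fraction / DNA fraction)}.

    Assumption: each library is normalized to its total count and a
    pseudocount is added to every sequence; the R script may differ.
    """
    sequences = set(rna_counts) | set(dna_counts)
    n = len(sequences)
    rna_total = sum(rna_counts.values()) + pseudocount * n
    dna_total = sum(dna_counts.values()) + pseudocount * n
    scores = {}
    for seq in sequences:
        rna_frac = (rna_counts.get(seq, 0) + pseudocount) / rna_total
        dna_frac = (dna_counts.get(seq, 0) + pseudocount) / dna_total
        scores[seq] = math.log2(rna_frac / dna_frac)
    return scores


# Toy counts (hypothetical, not taken from the SEERS data).
dna = {"ACGT": 100, "TTTT": 100}
cyt = {"ACGT": 300, "TTTT": 100}
scores = log2_enrichment(cyt, dna)
```

With these toy counts, the sequence that is over-represented in the cytoplasmic library gets a positive score and the other a negative one. The same function could be applied to nuclear counts in place of `cyt`.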

TALE_model_260312/

Recommended TALE model release.

  • model/model_bundle.pt: selected pretrained PyTorch model bundle.
  • model/summary.json: run configuration and metrics.
  • notebooks/TALE_training_260312.ipynb: training notebook snapshot.
  • notebooks/TALE_ClinVar_ISM_v6.ipynb: ClinVar SNV ISM logo notebook.
  • results/: training histories and summary plots.
  • SHA256SUMS: checksums for the bundled model and summary.
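The checksums can be verified before loading the model. The sketch below assumes SHA256SUMS follows the usual `sha256sum` layout (one `<hex digest>  <relative path>` pair per line, with paths relative to the directory containing SHA256SUMS); the function names are illustrative and not part of the repository.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large model bundles need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksums(sums_file):
    """Return {relative path: True/False} for each entry in a SHA256SUMS file.

    Assumption: lines look like '<hex digest>  <relative path>'.
    """
    sums_path = Path(sums_file)
    results = {}
    for line in sums_path.read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        name = name.strip().lstrip("*")  # drop sha256sum's binary-mode marker
        results[name] = sha256_of(sums_path.parent / name) == expected.lower()
    return results
```

For example, `verify_checksums("TALE_model_260312/SHA256SUMS")` run from the repository root should return `True` for every listed file if the download is intact.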

Selected run:

  • TALE_e5-lstm256bx128b-fc256-0.1-ema0.999-seed3407_260320_132204
  • Training data: TALE_train_data_260312.csv
  • Independent A549 evaluation data: 3pL6-A549-T1.csv
  • Selected weights: EMA
  • Best validation loss: 0.2024837457580668
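The run configuration and metrics can be inspected without opening the notebooks. The key names in model/summary.json are not documented here, so this sketch only assumes the file holds a JSON object and lists its top-level keys; `summarize_json` is a hypothetical helper.

```python
import json
from pathlib import Path


def summarize_json(path):
    """Return {top-level key: value type name} for a JSON object file."""
    data = json.loads(Path(path).read_text())
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    return {key: type(value).__name__ for key, value in data.items()}
```

Calling `summarize_json("TALE_model_260312/model/summary.json")` from the repository root gives a quick overview of which fields are available.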

TALE_models_260128_legacy/

Legacy v260128 scripts and pretrained weights retained for comparison with the previous TALE workflow.

Data

Full training and analysis data are distributed through Zenodo:

https://doi.org/10.5281/zenodo.18737939

For the current TALE workflow, use:

  • TALE_train_data_260312.csv
  • 3pL6-A549-T1.csv
  • ClinVar_3UTR_SNPs.tsv when running ClinVar-related analyses

Large source tables are not committed directly to this GitHub repository.

Environment

Minimal requirements:

  • R >= 4.0
  • Python >= 3.9
  • NGmerge for paired-end read merging

Python dependencies are listed in requirements.txt.

Quick Start

Prepare merged FASTQ files when starting from paired-end reads:

./NGmerge -d -1 read1.fq.gz -2 read2.fq.gz -o merged.fq.gz

Run SEERS count and enrichment processing:

cd SEERS_data_processing
Rscript Nn_pp.R
Rscript Nn_pp_pool.R
Rscript combine_dna_cyt_nuc.R
Rscript kmer_profiling.R

Run or inspect the updated TALE notebooks (from the repository root):

cd TALE_model_260312/notebooks

The training notebook expects TALE_train_data_260312.csv and 3pL6-A549-T1.csv to be available locally. The ClinVar ISM notebook uses the bundled model by default and expects ClinVar input sequences to be supplied in the notebook parameter cell or through local data downloaded from the data DOI.
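A quick pre-flight check can confirm that the expected inputs are present before a notebook is run. In this sketch, the directory passed as `data_dir` is a hypothetical location for the files downloaded from the data DOI, and `missing_inputs` is an illustrative helper rather than part of the repository.

```python
from pathlib import Path

# File names as listed in the Data section of this README.
TRAINING_INPUTS = ["TALE_train_data_260312.csv", "3pL6-A549-T1.csv"]
CLINVAR_INPUTS = ["ClinVar_3UTR_SNPs.tsv"]


def missing_inputs(data_dir, filenames):
    """Return the names from `filenames` that are not files in `data_dir`."""
    base = Path(data_dir)
    return [name for name in filenames if not (base / name).is_file()]
```

For instance, `missing_inputs(".", TRAINING_INPUTS)` returns an empty list when both training inputs sit in the current directory, and the names of any absent files otherwise.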

Changelog

Date | Version | Description
2026-06-11 | v2.2 | Added the v260312 TALE model release, refreshed documentation, and reorganized the repository structure for reproducibility.
2026-04-08 | v2.1 | Added split-aware usage examples in README and synchronized quick-start commands with the v260128 workflow.
2026-01-28 | v2.0 | Updated scripts and datasets.
2025-04-21 | v1.2 | Added kmer_motif.ipynb and N45_dissect.ipynb.
2024-12-08 | v1.1 | Added TALE_SNP_effect.ipynb.
2024-08-14 | v1.0 | TensorFlow 2.16 compatibility update and notebook refactoring.
