SEERS: Selective Enrichment of Episomes with Random Sequences

A systematic delineation of 3-prime UTR regulatory elements and their contextual associations.

Paper

This repository accompanies the SEERS manuscript:

https://www.biorxiv.org/content/10.1101/2025.06.09.658412v2

Repository Contents

This public repository is organized around two reproducible components:

- SEERS count/enrichment processing and k-mer analysis in `SEERS_data_processing/`
- TALE model training, evaluation, and ClinVar ISM analysis in `TALE_model_260312/`

The earlier v260128 TALE scripts are retained for reference in TALE_models_260128_legacy/.

Folder Guide

`SEERS_data_processing/`

- `Nn_pp.R`: extracts and counts N45 sequences from merged FASTQ files.
- `Nn_pp_pool.R`: pools sequence proportions across replicates.
- `combine_dna_cyt_nuc.R`: computes cytoplasmic and nuclear enrichment scores from DNA/Cyt/Nuc counts.
- `kmer_profiling.R`: performs k-mer association analyses.

Each script reads input filenames that are set near the top of the script. Run the scripts from the directory containing the input files, or edit `work_dir` and the input filenames to point at your data.
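As a rough illustration of the first two steps (counting N45 sequences and pooling proportions across replicates), here is a minimal Python sketch. It is not a port of `Nn_pp.R` or `Nn_pp_pool.R`. It assumes the merged reads are already the bare 45-nt insert (no flank trimming), that reads containing ambiguous bases are discarded, and that pooling means averaging per-replicate proportions. The function names `count_n45` and `pool_proportions` are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List


def count_n45(fastq_lines: Iterable[str], length: int = 45) -> Counter:
    """Count reads that are exactly `length` unambiguous bases long.

    Assumption: each read is already the bare insert; the real script
    may instead locate the insert between constant flanking sequences.
    """
    counts: Counter = Counter()
    for i, line in enumerate(fastq_lines):
        if i % 4 != 1:  # the sequence is the 2nd line of each 4-line record
            continue
        seq = line.strip().upper()
        if len(seq) == length and set(seq) <= set("ACGT"):
            counts[seq] += 1
    return counts


def pool_proportions(replicates: List[Counter]) -> Dict[str, float]:
    """Average per-replicate proportions (an assumed pooling rule).

    A sequence absent from a replicate, or an empty replicate,
    contributes zero to the average.
    """
    pooled: Dict[str, float] = {}
    for counts in replicates:
        total = sum(counts.values())
        if total == 0:
            continue
        for seq, n in counts.items():
            pooled[seq] = pooled.get(seq, 0.0) + n / total / len(replicates)
    return pooled
```

For a gzipped file, `gzip.open(path, "rt")` yields text lines that can be passed directly to `count_n45`.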

`TALE_model_260312/`

Recommended TALE model release.

- `model/model_bundle.pt`: selected pretrained PyTorch model bundle.
- `model/summary.json`: run configuration and metrics.
- `notebooks/TALE_training_260312.ipynb`: training notebook snapshot.
- `notebooks/TALE_ClinVar_ISM_v6.ipynb`: ClinVar SNV ISM logo notebook.
- `results/`: training histories and summary plots.
- `SHA256SUMS`: checksums for the bundled model and summary.
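The checksums can be verified before loading the model. The sketch below assumes `SHA256SUMS` follows the usual `sha256sum` layout (one `<hex digest>  <relative path>` pair per line, paths relative to the folder holding the file); that layout is an assumption, not something this README states. The helper names `sha256_of` and `verify_sums` are hypothetical.

```python
import hashlib
from pathlib import Path
from typing import Dict


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_sums(sums_file: Path) -> Dict[str, bool]:
    """Map each listed path to True if its digest matches.

    Assumes sha256sum-style lines: '<hex digest>  <relative path>'.
    """
    results: Dict[str, bool] = {}
    base = sums_file.parent
    for line in sums_file.read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        name = name.strip().lstrip("*")  # '*' marks binary mode in sha256sum
        results[name] = sha256_of(base / name) == expected.lower()
    return results
```

Calling `verify_sums(Path("TALE_model_260312/SHA256SUMS"))` would return one boolean per listed file; any `False` suggests a corrupted or modified download.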

Selected run:

- `TALE_e5-lstm256bx128b-fc256-0.1-ema0.999-seed3407_260320_132204`
- Training data: `TALE_train_data_260312.csv`
- Independent A549 evaluation data: `3pL6-A549-T1.csv`
- Selected weights: EMA
- Best validation loss: 0.2024837457580668

`TALE_models_260128_legacy/`

Legacy v260128 scripts and pretrained weights retained for comparison with the previous TALE workflow.

Data

Full training and analysis data are distributed through Zenodo:

https://doi.org/10.5281/zenodo.18737939

For the current TALE workflow, use:

- `TALE_train_data_260312.csv`
- `3pL6-A549-T1.csv`
- `ClinVar_3UTR_SNPs.tsv` when running ClinVar-related analyses

Large source tables are not committed directly to this GitHub repository.

Environment

Minimal requirements:

- R >= 4.0
- Python >= 3.9
- NGmerge for paired-end read merging

Python dependencies are listed in requirements.txt.

Quick Start

Prepare merged FASTQ files when starting from paired-end reads:

./NGmerge -d -1 read1.fq.gz -2 read2.fq.gz -o merged.fq.gz

Run SEERS count and enrichment processing:

cd SEERS_data_processing
Rscript Nn_pp.R
Rscript Nn_pp_pool.R
Rscript combine_dna_cyt_nuc.R
Rscript kmer_profiling.R

Run or inspect the updated TALE notebooks:

cd TALE_model_260312/notebooks

The training notebook expects `TALE_train_data_260312.csv` and `3pL6-A549-T1.csv` to be available locally. The ClinVar ISM notebook uses the bundled model by default and expects ClinVar input sequences either in the notebook parameter cell or in local files downloaded from the Zenodo DOI listed under Data.
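Before opening the training notebook, it can help to confirm that the expected input files are present. The following is a small sketch; the helper name `missing_inputs` and the idea of a single `data_dir` folder are assumptions made for illustration, and the notebooks may locate their inputs differently.

```python
from pathlib import Path
from typing import Iterable, List

# File names taken from this README; their location on disk is up to you.
TRAINING_INPUTS = ("TALE_train_data_260312.csv", "3pL6-A549-T1.csv")


def missing_inputs(data_dir: str,
                   names: Iterable[str] = TRAINING_INPUTS) -> List[str]:
    """Return the names from `names` that are not regular files in `data_dir`."""
    root = Path(data_dir)
    return [name for name in names if not (root / name).is_file()]
```

An empty list from `missing_inputs(".")` means both training inputs were found in the current directory. For ClinVar-related analyses, `ClinVar_3UTR_SNPs.tsv` can be added to `names`.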

Changelog

| Date | Version/Update | Description |
| --- | --- | --- |
| 2026-06-11 | v2.2 | Added the v260312 TALE model release, refreshed documentation, and reorganized the repository structure for reproducibility. |
| 2026-04-08 | v2.1 | Added split-aware usage examples in README and synchronized quick-start commands with the v260128 workflow. |
| 2026-01-28 | v2.0 | Updated scripts and datasets. |
| 2025-04-21 | v1.2 | Added `kmer_motif.ipynb` and `N45_dissect.ipynb`. |
| 2024-12-08 | v1.1 | Added `TALE_SNP_effect.ipynb`. |
| 2024-08-14 | v1.0 | TensorFlow 2.16 compatibility update and notebook refactoring. |
