A systematic delineation of 3-prime UTR regulatory elements and their contextual associations.
This repository accompanies the SEERS manuscript:
https://www.biorxiv.org/content/10.1101/2025.06.09.658412v2
This public repository is organized around two reproducible components:
- SEERS count/enrichment processing and k-mer analysis in SEERS_data_processing/
- TALE model training, evaluation, and ClinVar ISM analysis in TALE_model_260312/
The earlier v260128 TALE scripts are retained for reference in TALE_models_260128_legacy/.
SEERS_data_processing/ contains the following scripts:
- Nn_pp.R: extracts and counts N45 sequences from merged FASTQ files.
- Nn_pp_pool.R: pools sequence proportions across replicates.
- combine_dna_cyt_nuc.R: computes cytoplasmic and nuclear enrichment scores from DNA/Cyt/Nuc counts.
- kmer_profiling.R: performs k-mer association analyses.
These scripts use local input filenames configured near the top of each script. Run them from the directory containing the input files, or update work_dir and the input filenames explicitly.
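For readers unfamiliar with k-mer profiling, the counting step that such analyses start from can be sketched in a few lines. This is a minimal Python illustration, not a port of kmer_profiling.R: the function name `count_kmers` and the toy sequences are assumptions made for this example, and the association statistics computed by the R script are not reproduced here.

```python
from collections import Counter


def count_kmers(sequences, k):
    """Count every overlapping k-mer across an iterable of sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        # Slide a window of width k along the sequence, one base at a time.
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


# Toy input (hypothetical sequences, much shorter than real N45 inserts).
example_sequences = ["ATTTA", "TTTAT"]
kmer_counts = count_kmers(example_sequences, k=3)
```

Sequences shorter than `k` contribute no counts, because the window range is empty for them.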
TALE_model_260312/ is the recommended TALE model release. It contains:
- model/model_bundle.pt: selected pretrained PyTorch model bundle.
- model/summary.json: run configuration and metrics.
- notebooks/TALE_training_260312.ipynb: training notebook snapshot.
- notebooks/TALE_ClinVar_ISM_v6.ipynb: ClinVar SNV ISM logo notebook.
- results/: training histories and summary plots.
- SHA256SUMS: checksums for the bundled model and summary.
- Selected run: TALE_e5-lstm256bx128b-fc256-0.1-ema0.999-seed3407_260320_132204
- Training data: TALE_train_data_260312.csv
- Independent A549 evaluation data: 3pL6-A549-T1.csv
- Selected weights: EMA
- Best validation loss: 0.2024837457580668
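The bundled SHA256SUMS file can be used to confirm that the model bundle and summary were downloaded intact. The sketch below is one way to do that in Python. It assumes the file follows the usual `sha256sum` output layout (hex digest, whitespace, relative path) and that paths are relative to the directory containing SHA256SUMS; neither assumption is confirmed by this README. The helper names `sha256_of` and `verify_checksums` are introduced here for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksums(sums_file):
    """Return {relative_path: bool} for each entry of a sha256sum-style file.

    Assumes one "<hex digest>  <relative path>" entry per line, with paths
    resolved against the directory that holds the checksum file.
    """
    sums_path = Path(sums_file)
    results = {}
    for line in sums_path.read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        name = name.lstrip("*")  # sha256sum prefixes binary-mode entries with '*'
        results[name] = sha256_of(sums_path.parent / name) == expected.lower()
    return results


# Example call (path assumed from the repository layout described above):
# print(verify_checksums("TALE_model_260312/SHA256SUMS"))
```

Any entry that maps to `False` points to a file that differs from the released version and should be downloaded again.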
TALE_models_260128_legacy/ holds the legacy v260128 scripts and pretrained weights, retained for comparison with the previous TALE workflow.
Full training and analysis data are distributed through Zenodo:
https://doi.org/10.5281/zenodo.18737939
For the current TALE workflow, use:
- TALE_train_data_260312.csv
- 3pL6-A549-T1.csv
- ClinVar_3UTR_SNPs.tsv when running ClinVar-related analyses
Large source tables are not committed directly to this GitHub repository.
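Because these tables have to be downloaded separately, it can help to check that they are in place before opening the notebooks. The following Python sketch only tests for file presence. The directory argument `data_dir` and the helper name `missing_files` are hypothetical; the file names are the ones listed above.

```python
from pathlib import Path

# File names taken from the list above; ClinVar input is only needed
# for ClinVar-related analyses.
REQUIRED_FILES = ["TALE_train_data_260312.csv", "3pL6-A549-T1.csv"]
CLINVAR_FILES = ["ClinVar_3UTR_SNPs.tsv"]


def missing_files(data_dir, names):
    """Return the subset of names that is not present as a file in data_dir."""
    data_dir = Path(data_dir)
    return [name for name in names if not (data_dir / name).is_file()]


# Example call (the download location "data" is an assumption):
# print(missing_files("data", REQUIRED_FILES + CLINVAR_FILES))
```

An empty list means every expected file was found in the given directory.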
Minimal requirements:
- R >= 4.0
- Python >= 3.9
- NGmerge for paired-end read merging
Python dependencies are listed in requirements.txt.
Prepare merged FASTQ files when starting from paired-end reads:

    ./NGmerge -d -1 read1.fq.gz -2 read2.fq.gz -o merged.fq.gz

Run SEERS count and enrichment processing:

    cd SEERS_data_processing
    Rscript Nn_pp.R
    Rscript Nn_pp_pool.R
    Rscript combine_dna_cyt_nuc.R
    Rscript kmer_profiling.R

Run or inspect the updated TALE notebooks:

    cd TALE_model_260312/notebooks

The training notebook expects TALE_train_data_260312.csv and 3pL6-A549-T1.csv to be available locally. The ClinVar ISM notebook uses the bundled model by default and expects ClinVar input sequences to be supplied in the notebook parameter cell or through local data downloaded from the data DOI.
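As background for the ClinVar ISM notebook, and assuming ISM here refers to in silico mutagenesis, the general idea can be sketched as follows: enumerate every single-nucleotide variant of a sequence and record how a model's prediction changes relative to the reference. This is a generic Python illustration, not the notebook's code. `toy_predict` is a hypothetical stand-in for a trained model, and the function names are introduced only for this example.

```python
NUCLEOTIDES = "ACGT"


def single_nucleotide_variants(sequence):
    """Yield (position, ref, alt, mutant) for every possible SNV of sequence."""
    sequence = sequence.upper()
    for pos, ref in enumerate(sequence):
        for alt in NUCLEOTIDES:
            if alt != ref:
                yield pos, ref, alt, sequence[:pos] + alt + sequence[pos + 1:]


def ism_scores(sequence, predict):
    """Return {(position, alt): predict(mutant) - predict(reference)}."""
    reference_score = predict(sequence.upper())
    return {
        (pos, alt): predict(mutant) - reference_score
        for pos, ref, alt, mutant in single_nucleotide_variants(sequence)
    }


def toy_predict(seq):
    """Hypothetical scoring function: fraction of A/T bases in the sequence."""
    return sum(base in "AT" for base in seq) / len(seq)


# Each of the 4 positions has 3 alternative bases, giving 12 variants.
scores = ism_scores("ACGT", toy_predict)
```

In practice `predict` would wrap a call to the trained model; the resulting position-by-base score differences are what a sequence logo of variant effects is typically drawn from.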
| Date | Version/Update | Description |
|---|---|---|
| 2026-06-11 | v2.2 | Added the v260312 TALE model release, refreshed documentation, and reorganized the repository structure for reproducibility. |
| 2026-04-08 | v2.1 | Added split-aware usage examples in README and synchronized quick-start commands with the v260128 workflow. |
| 2026-01-28 | v2.0 | Updated scripts and datasets. |
| 2025-04-21 | v1.2 | Added kmer_motif.ipynb and N45_dissect.ipynb. |
| 2024-12-08 | v1.1 | Added TALE_SNP_effect.ipynb. |
| 2024-08-14 | v1.0 | TensorFlow 2.16 compatibility update and notebook refactoring. |