griffithlab/NEAT


NEAT - Neoantigen Evaluation & Automated Triage

Machine learning models for neoantigen candidate classification in personalized cancer vaccine design

License: MIT


Overview

pVACml houses the machine learning models and analysis code developed to support automated neoantigen candidate pre-classification within the pVACtools suite. The model is trained on expert Immunogenomics Tumor Board (ITB) decisions from real-world personalized cancer vaccine clinical trials and classifies neoantigen peptide candidates as Accept, Review, or Reject based on a combination of genomic, expression, and MHC binding features.

The repository is organized into two independent sections. They use different dependency files and different model bundles:

| Section | Purpose | Code & data location | Dependencies |
| --- | --- | --- | --- |
| Manuscript | Reproduce figures, analyses, manuscript model training workflows, and a demonstration prediction on a new case | manuscript/ | manuscript/requirements.txt |
| Model development | Retrain / refresh the pipeline model and artifacts intended for pVACtools integration (current version compatible with pVACtools 7.0) | model_development/ | model_development/requirements.txt |

Important distinctions

End users running pVACseq with ML enabled should follow pVACtools documentation (e.g. pvacseq add_ml_predictions). This README focuses on developers reproducing the paper or refreshing the v7 model from this repo.


Repository structure

NEAT/
├── README.md
├── LICENSE
├── .gitignore
│
├── manuscript/                      # Publication reproducibility (NOT the pVACtools-shipped bundle)
│   ├── requirements.txt             
│   ├── manuscript_model/            # Artifacts for manuscript/scripts/predict.py demo
│   ├── data/
│   │   ├── predict_new_case_data/   # Demo inputs for manuscript prediction script
│   │   ├── training_testing_data/
│   │   ├── imputation_analysis/
│   │   ├── review_time_analysis_data/
│   │   ├── manuscript_prediction_results/
│   │   └── …
│   └── scripts/
│       ├── ml_randomforest_model.py
│       ├── ml_logistic_model.py
│       ├── predict.py               # Manuscript: demo prediction on a new case
│       ├── evaluation_on_prospective_test_set.py
│       ├── imputation_analysis.py
│       └── review_time_analysis.py
│
└── model_development/               # Model development 
    ├── requirements.txt             
    ├── data/                        # Pre- & post-imputation tables; prediction outputs
    ├── scripts/
    │   ├── impute_missing.py        # Step 1: fit encoders + IterativeImputer
    │   ├── train.py                 # Step 2: tune + train BalancedRandomForest
    │   └── predict.py               # Step 3: score a new case
    └── model/
        ├── temporary_model_artifacts/   
        └── pvactools7.0_model/          # Staging → pVACtools …/ml_model_artifacts
            ├── README.md                
            ├── rf_downsample_model_*.pkl
            ├── trained_imputer_*.joblib
            └── label_encoders_*.pkl

Manuscript (manuscript/)

Use this when reproducing figures, statistical analyses, manuscript RF/logistic workflows, and the manuscript walkthrough of prediction on a new case.

Environment

pip install -r manuscript/requirements.txt

Use only manuscript/requirements.txt for scripts under manuscript/scripts/. That environment is aligned with the paper’s tooling (e.g. matplotlib, seaborn, broader analysis stack) and is separate from model_development/requirements.txt.

Demo prediction (manuscript)

The script manuscript/scripts/predict.py merges three pVACseq-style TSVs for one sample, applies the manuscript imputer/encoders/model, and writes an aggregated TSV.

Place inputs under manuscript/data/predict_new_case_data/:

  • <sample>.MHC_I.all_epitopes.aggregated.tsv
  • <sample>.MHC_I.all_epitopes.tsv
  • <sample>.MHC_II.all_epitopes.aggregated.tsv

From the repository root:

python manuscript/scripts/predict.py
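
Before running the script, it can help to confirm that all three inputs are in place. The following is a minimal sketch, not part of the repository: the helper name `missing_inputs` and the sample name `my_sample` are hypothetical, and only the file suffixes come from the list above.

```python
from pathlib import Path

# Suffixes copied from the input list above; everything else is illustrative.
EXPECTED_SUFFIXES = (
    ".MHC_I.all_epitopes.aggregated.tsv",
    ".MHC_I.all_epitopes.tsv",
    ".MHC_II.all_epitopes.aggregated.tsv",
)


def missing_inputs(data_dir, sample):
    """Return the expected input file names that are absent from data_dir."""
    data_dir = Path(data_dir)
    return [
        f"{sample}{suffix}"
        for suffix in EXPECTED_SUFFIXES
        if not (data_dir / f"{sample}{suffix}").is_file()
    ]


if __name__ == "__main__":
    # "my_sample" is a placeholder sample name, not one shipped with the repo.
    missing = missing_inputs("manuscript/data/predict_new_case_data", "my_sample")
    print("missing inputs:", missing or "none")
```

An empty list means the three files for that sample were found; any names returned are the ones still to be added.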

Other manuscript analyses

cd manuscript/scripts
python <script_name>.py

Examples include prospective test evaluation, review-time analysis, and imputation comparisons. Each script may assume paths under manuscript/data/.


Model development (model_development/)

Use this when retraining or regenerating artifacts for the pVACtools–compatible pipeline. Scripts live under model_development/scripts/ and are intended to be run in this order:

  1. model_development/scripts/impute_missing.py — load pre-imputation table, impute missing values.
  2. model_development/scripts/train.py — tune and train BalancedRandomForestClassifier and save the model.
  3. model_development/scripts/predict.py — for a new case, merge class I/II inputs, apply saved imputer/encoders/model, write ML prediction TSV.
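
The ordering above can be enforced with a small driver. This is a sketch under stated assumptions rather than a script that ships with the repository: it assumes each step can be invoked without command-line arguments from the repository root (check each script for its actual options), and the names `build_commands` and `run_pipeline` are hypothetical.

```python
import subprocess
import sys

# Script order taken from the numbered list above.
PIPELINE = (
    "model_development/scripts/impute_missing.py",
    "model_development/scripts/train.py",
    "model_development/scripts/predict.py",
)


def build_commands(python=sys.executable, scripts=PIPELINE):
    """Return one argv list per step, in pipeline order."""
    return [[python, script] for script in scripts]


def run_pipeline(commands):
    """Run each command in turn and stop at the first failure."""
    for cmd in commands:
        # check=True raises CalledProcessError on a non-zero exit code,
        # so a later step never runs against stale or missing artifacts.
        subprocess.run(cmd, check=True)
```

Calling `run_pipeline(build_commands())` from the repository root would then run the three steps in sequence, assuming the environment from model_development/requirements.txt is active.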

Environment

pip install -r model_development/requirements.txt

Use model_development/requirements.txt (NumPy / scikit-learn / imbalanced-learn pins aligned with the neoantigen_ml_numpy126–style stack used for these scripts). Do not mix this file with manuscript/requirements.txt unless you understand the version differences.

Artifacts shipped to pVACtools

The directory model_development/model/pvactools7.0_model/ holds the model bundle intended to be copied into the pVACtools repository for the v7 ML path. In pVACtools, the same files live under:

pvactools/supporting_files/ml_model_artifacts/
(griffithlab/pVACtools on GitHub)

| Here (this repo) | In pVACtools |
| --- | --- |
| model_development/model/pvactools7.0_model/ | pvactools/supporting_files/ml_model_artifacts/ |
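
Copying the bundle can be scripted so that a missing artifact is caught early. The sketch below is illustrative only: the function name `stage_artifacts` is hypothetical, the glob patterns mirror the file names shown in the repository structure, and the destination is assumed to be a local checkout of pVACtools.

```python
import shutil
from pathlib import Path

# Patterns mirror the artifact names listed under pvactools7.0_model/.
ARTIFACT_PATTERNS = (
    "rf_downsample_model_*.pkl",
    "trained_imputer_*.joblib",
    "label_encoders_*.pkl",
)


def stage_artifacts(src_dir, dest_dir):
    """Copy every matching artifact from src_dir into dest_dir.

    Raises FileNotFoundError if any pattern has no match, so an
    incomplete bundle is never copied silently.
    """
    src, dest = Path(src_dir), Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for pattern in ARTIFACT_PATTERNS:
        matches = sorted(src.glob(pattern))
        if not matches:
            raise FileNotFoundError(f"no file matching {pattern} in {src}")
        for path in matches:
            shutil.copy2(path, dest / path.name)
            copied.append(path.name)
    return copied
```

For example, `stage_artifacts("model_development/model/pvactools7.0_model", "<pVACtools checkout>/pvactools/supporting_files/ml_model_artifacts")` would copy the model, imputer, and encoder files, where the checkout location is a placeholder.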

Dataset

The entire dataset (training set, development test set, prospective test set) comprises 1,943 expert-labeled neoantigen peptide candidates spanning 33 patients and 8 cancer types from three clinical trials at Washington University School of Medicine:

| Trial | Cancer type | Patients |
| --- | --- | --- |
| NCT05111353 | Pancreatic cancer | 14 |
| NCT03606967 | Metastatic TNBC | 11 |
| NCT05741242 | Basket trial (multiple types) | 8 |

Each record includes ITB labels (Accept / Reject / Review) and up to 72 features covering MHC binding predictions, RNA expression, tumor variant allele frequency, transcript support, driver gene status, etc.

Note on data availability: Clinical genomic data from individual patients cannot be shared publicly due to IRB restrictions. Aggregate feature matrices used for model training are provided in manuscript/data/ in de-identified form.


Integration with pVACtools

The production model is integrated into pVACtools v7. Example end-user commands:

pvacseq run ... --run-ml-predictions
pvacseq add_ml_predictions \
  input.tsv \
  output_dir/ \
  --accept-threshold 0.55 \
  --reject-threshold 0.30
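
To illustrate how two thresholds can partition a score into the three labels, here is a minimal sketch. It is an assumption, not the pVACtools implementation: it treats the model output as a predicted probability of Accept, assigns Accept at or above the accept threshold, Reject below the reject threshold, and Review in between. The function name `triage_label` is hypothetical, and the exact rule and boundary handling in pVACtools may differ.

```python
def triage_label(accept_probability, accept_threshold=0.55, reject_threshold=0.30):
    """Map a score in [0, 1] to Accept / Review / Reject.

    Assumed rule (not confirmed by pVACtools):
      score >= accept_threshold  -> "Accept"
      score <  reject_threshold  -> "Reject"
      otherwise                  -> "Review"
    """
    if not 0.0 <= reject_threshold <= accept_threshold <= 1.0:
        raise ValueError("expected 0 <= reject_threshold <= accept_threshold <= 1")
    if accept_probability >= accept_threshold:
        return "Accept"
    if accept_probability < reject_threshold:
        return "Reject"
    return "Review"


if __name__ == "__main__":
    for score in (0.80, 0.40, 0.10):
        print(score, triage_label(score))
```

Under this assumed rule, widening the gap between the two thresholds sends more candidates to Review, while narrowing it yields more automatic Accept or Reject calls.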

Predictions are displayed in pVACview alongside binding affinity, expression, and variant-level features. Predicted labels are pre-populated but fully editable during ITB review.

Developers: when updating the bundled model in pVACtools from this repository, first run the model_development/scripts pipeline in order (impute_missing.py → train.py → predict.py) using model_development/requirements.txt, and validate the outputs. Then copy the contents of model_development/model/pvactools7.0_model/ into pvactools/supporting_files/ml_model_artifacts/.


Citation

If you use pVACml or the associated model in your work, please cite:

Yao J, Singhal K, Kiwala S, et al.
Automating immunogenomic tumor board decision-making for neoantigen cancer vaccine design.
[Journal] (2025). DOI: [pending]


Questions and contributions

For questions about the model or codebase, please open an issue. For questions related to the pVACtools integration, see the pVACtools GitHub.

About

This repository contains a new feature for the pVACtools suite that enables machine learning–based prediction of neoantigen candidate selection following pVACseq. Trained on Immunogenomics Tumor Board decisions, the model supports automated prioritization and integrates into the pVACview interface.
