NEAT - Neoantigen Evaluation & Automated Triage

Machine learning models for neoantigen candidate classification in personalized cancer vaccine design

Overview

pVACml houses the machine learning models and analysis code developed to support automated neoantigen candidate pre-classification within the pVACtools suite. The model is trained on expert Immunogenomics Tumor Board (ITB) decisions from real-world personalized cancer vaccine clinical trials and classifies neoantigen peptide candidates as Accept, Review, or Reject based on a combination of genomic, expression, and MHC binding features.

The repository is organized into two independent sections. They use different dependency files and different model bundles:

| Section | Purpose | Code & data location | Dependencies |
| --- | --- | --- | --- |
| Manuscript | Reproduce figures, analyses, manuscript model training workflows, and a demonstration prediction on a new case | `manuscript/` | `manuscript/requirements.txt` |
| Model development | Retrain / refresh the pipeline model and artifacts intended for pVACtools integration (current version compatible with pVACtools 7.0) | `model_development/` | `model_development/requirements.txt` |

Important distinctions

The Model development section is the one meant for future retraining and for the files that are copied into the pVACtools codebase for the ML function. The staging folder is model_development/model/pvactools7.0_model/, which corresponds to pvactools/supporting_files/ml_model_artifacts/ in griffithlab/pVACtools (see model_development/model/pvactools7.0_model/README.md).

End users running pVACseq with ML enabled should follow pVACtools documentation (e.g. pvacseq add_ml_predictions). This README focuses on developers reproducing the paper or refreshing the v7 model from this repo.

Repository structure

NEAT/
├── README.md
├── LICENSE
├── .gitignore
│
├── manuscript/                      # Publication reproducibility (NOT the pVACtools-shipped bundle)
│   ├── requirements.txt             
│   ├── manuscript_model/            # Artifacts for manuscript/scripts/predict.py demo
│   ├── data/
│   │   ├── predict_new_case_data/   # Demo inputs for manuscript prediction script
│   │   ├── training_testing_data/
│   │   ├── imputation_analysis/
│   │   ├── review_time_analysis_data/
│   │   ├── manuscript_prediction_results/
│   │   └── …
│   └── scripts/
│       ├── ml_randomforest_model.py
│       ├── ml_logistic_model.py
│       ├── predict.py               # Manuscript: demo prediction on a new case
│       ├── evaluation_on_prospective_test_set.py
│       ├── imputation_analysis.py
│       └── review_time_analysis.py
│
└── model_development/               # Model development 
    ├── requirements.txt             
    ├── data/                        # Pre- & post-imputation tables; prediction outputs
    ├── scripts/
    │   ├── impute_missing.py        # Step 1: fit encoders + IterativeImputer
    │   ├── train.py                 # Step 2: tune + train BalancedRandomForest
    │   └── predict.py               # Step 3: score a new case
    └── model/
        ├── temporary_model_artifacts/   
        └── pvactools7.0_model/          # Staging → pVACtools …/ml_model_artifacts
            ├── README.md                
            ├── rf_downsample_model_*.pkl
            ├── trained_imputer_*.joblib
            └── label_encoders_*.pkl

Manuscript (`manuscript/`)

Use this when reproducing figures, statistical analyses, manuscript RF/logistic workflows, and the manuscript walkthrough of prediction on a new case.

Environment

pip install -r manuscript/requirements.txt

Use only manuscript/requirements.txt for scripts under manuscript/scripts/. That environment is aligned with the paper’s tooling (e.g. matplotlib, seaborn, broader analysis stack) and is separate from model_development/requirements.txt.

Demo prediction (manuscript)

The script manuscript/scripts/predict.py merges three pVACseq-style TSVs for one sample, applies the manuscript imputer/encoders/model, and writes an aggregated TSV.

Place inputs under manuscript/data/predict_new_case_data/:

<sample>.MHC_I.all_epitopes.aggregated.tsv
<sample>.MHC_I.all_epitopes.tsv
<sample>.MHC_II.all_epitopes.aggregated.tsv

From the repository root:

python manuscript/scripts/predict.py
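
Before running the script, it can help to confirm that all three input files for a sample are in place. The following is a minimal sketch, not part of the repository: the helper name `missing_inputs` and the sample name `my_sample` are hypothetical, and only the filename suffixes and the data directory come from the instructions above.

```python
from pathlib import Path

# Filename suffixes taken from the input list above; one file of each kind per sample.
REQUIRED_SUFFIXES = (
    ".MHC_I.all_epitopes.aggregated.tsv",
    ".MHC_I.all_epitopes.tsv",
    ".MHC_II.all_epitopes.aggregated.tsv",
)


def missing_inputs(sample: str, data_dir: Path) -> list[Path]:
    """Return the expected input files for `sample` that are not present in `data_dir`."""
    expected = [data_dir / f"{sample}{suffix}" for suffix in REQUIRED_SUFFIXES]
    return [path for path in expected if not path.is_file()]


if __name__ == "__main__":
    # "my_sample" is a placeholder, not a sample shipped with the repository.
    data_dir = Path("manuscript/data/predict_new_case_data")
    for path in missing_inputs("my_sample", data_dir):
        print(f"missing: {path}")
```

An empty result means the three files are present; it says nothing about whether their contents are valid pVACseq output.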

Other manuscript analyses

cd manuscript/scripts
python <script_name>.py

Examples include prospective test evaluation, review-time analysis, and imputation comparisons. Each script may assume paths under manuscript/data/.

Model development (`model_development/`)

Use this when retraining or regenerating artifacts for the pVACtools-compatible pipeline. Scripts live under model_development/scripts/ and are intended to be run in this order:

1. model_development/scripts/impute_missing.py: load the pre-imputation table and impute missing values.
2. model_development/scripts/train.py: tune and train BalancedRandomForestClassifier and save the model.
3. model_development/scripts/predict.py: for a new case, merge class I/II inputs, apply the saved imputer/encoders/model, and write the ML prediction TSV.
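
The run order above can be captured in a small driver. This is a sketch under a stated assumption: that each script runs without command-line arguments when started from the repository root, as manuscript/scripts/predict.py does. This README does not confirm that for the model_development scripts, so check each script before relying on it. The function names `build_commands` and `run_pipeline` are hypothetical.

```python
import subprocess
import sys
from pathlib import Path

SCRIPTS_DIR = Path("model_development/scripts")
# Order taken from the three steps listed above.
PIPELINE = ("impute_missing.py", "train.py", "predict.py")


def build_commands(scripts_dir: Path = SCRIPTS_DIR) -> list[list[str]]:
    """Build one command per step, using the current Python interpreter."""
    return [[sys.executable, str(scripts_dir / name)] for name in PIPELINE]


def run_pipeline(scripts_dir: Path = SCRIPTS_DIR) -> None:
    """Run each step in order; check=True raises at the first failing step."""
    for command in build_commands(scripts_dir):
        print("running:", " ".join(command))
        subprocess.run(command, check=True)
```

Calling `run_pipeline()` from the repository root, inside the environment built from model_development/requirements.txt, runs the three steps in sequence and stops if any of them exits with an error, so a failed imputation never feeds into training.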

Environment

pip install -r model_development/requirements.txt

Use model_development/requirements.txt for these scripts. Its NumPy, scikit-learn, and imbalanced-learn pins are aligned with the neoantigen_ml_numpy126-style stack used for them. Do not mix this file with manuscript/requirements.txt unless you understand the version differences.

Artifacts shipped to pVACtools

The directory model_development/model/pvactools7.0_model/ holds the model bundle intended to be copied into the pVACtools repository for the v7 ML path. In pVACtools, the same files live under:

pvactools/supporting_files/ml_model_artifacts/
(griffithlab/pVACtools on GitHub)

| Here (this repo) | In pVACtools |
| --- | --- |
| `model_development/model/pvactools7.0_model/` | `pvactools/supporting_files/ml_model_artifacts/` |
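
The copy from the staging folder into a pVACtools checkout can be scripted. The sketch below is illustrative rather than part of either repository: `copy_artifacts` is a hypothetical helper, the three glob patterns are taken from the repository tree shown earlier, and it assumes only those three artifact files need to be copied (the staging README.md is left behind).

```python
import shutil
from pathlib import Path

# Glob patterns taken from the repository tree for pvactools7.0_model/.
ARTIFACT_PATTERNS = (
    "rf_downsample_model_*.pkl",
    "trained_imputer_*.joblib",
    "label_encoders_*.pkl",
)


def copy_artifacts(staging_dir: Path, target_dir: Path) -> list[Path]:
    """Copy staged model artifacts into target_dir and return the copied paths."""
    target_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for pattern in ARTIFACT_PATTERNS:
        matches = sorted(staging_dir.glob(pattern))
        if not matches:
            # Fail loudly rather than ship an incomplete bundle.
            raise FileNotFoundError(f"no file matching {pattern} in {staging_dir}")
        for source in matches:
            copied.append(Path(shutil.copy2(source, target_dir / source.name)))
    return copied
```

Here `staging_dir` would be model_development/model/pvactools7.0_model/ in this repository and `target_dir` would be pvactools/supporting_files/ml_model_artifacts/ inside a local pVACtools checkout. Raising on a missing pattern keeps the model, imputer, and encoders from drifting out of sync.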

Dataset

The entire dataset (training set, development test set, prospective test set) comprises 1,943 expert-labeled neoantigen peptide candidates spanning 33 patients and 8 cancer types from three clinical trials at Washington University School of Medicine:

| Trial | Cancer type | Patients |
| --- | --- | --- |
| NCT05111353 | Pancreatic cancer | 14 |
| NCT03606967 | Metastatic TNBC | 11 |
| NCT05741242 | Basket trial (multiple types) | 8 |

Each record includes ITB labels (Accept / Reject / Review) and up to 72 features covering MHC binding predictions, RNA expression, tumor variant allele frequency, transcript support, driver gene status, etc.

Note on data availability: Clinical genomic data from individual patients cannot be shared publicly due to IRB restrictions. Aggregate feature matrices used for model training are provided in manuscript/data/ in de-identified form.

Integration with pVACtools

The production model is integrated into pVACtools v7. Example end-user commands:

pvacseq run ... --run-ml-predictions

pvacseq add_ml_predictions \
  input.tsv \
  output_dir/ \
  --accept-threshold 0.55 \
  --reject-threshold 0.30
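
This README does not spell out how the two thresholds turn a model score into a label. One plausible reading, shown here purely as an assumption, is that a candidate's predicted Accept probability is compared against both cutoffs, with everything in between sent to Review. The function `triage_label` and the inclusive boundaries are hypothetical; consult the pVACtools documentation for the actual rule.

```python
def triage_label(accept_probability: float,
                 accept_threshold: float = 0.55,
                 reject_threshold: float = 0.30) -> str:
    """Map a predicted Accept probability to Accept / Review / Reject (assumed rule)."""
    if not 0.0 <= reject_threshold <= accept_threshold <= 1.0:
        raise ValueError("expected 0 <= reject_threshold <= accept_threshold <= 1")
    if accept_probability >= accept_threshold:
        return "Accept"   # confident enough to pre-select
    if accept_probability <= reject_threshold:
        return "Reject"   # confident enough to pre-exclude
    return "Review"       # in between: left for the tumor board


for probability in (0.80, 0.40, 0.10):
    print(probability, triage_label(probability))
```

Under this reading, widening the gap between the two thresholds sends more candidates to Review, which trades automation for more manual ITB attention.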

Predictions are displayed in pVACview alongside binding affinity, expression, and variant-level features. Predicted labels are pre-populated but fully editable during ITB review.

Developers: to update the model bundled with pVACtools from this repository, run the scripts in model_development/scripts/ in order (impute_missing.py → train.py → predict.py) using model_development/requirements.txt, validate the outputs, and then copy the artifacts from model_development/model/pvactools7.0_model/ into pvactools/supporting_files/ml_model_artifacts/.

Citation

If you use pVACml or the associated model in your work, please cite:

Yao J, Singhal K, Kiwala S, et al.
Automating immunogenomic tumor board decision-making for neoantigen cancer vaccine design.
[Journal] (2025). DOI: [pending]

Related resources

Questions and contributions

For questions about the model or codebase, please open an issue. For questions related to the pVACtools integration, see the pVACtools GitHub.
