Machine learning models for neoantigen candidate classification in personalized cancer vaccine design
pVACml houses the machine learning models and analysis code developed to support automated neoantigen candidate pre-classification within the pVACtools suite. The model is trained on expert Immunogenomics Tumor Board (ITB) decisions from real-world personalized cancer vaccine clinical trials and classifies neoantigen peptide candidates as Accept, Review, or Reject based on a combination of genomic, expression, and MHC binding features.
The repository is organized into two independent sections. They use different dependency files and different model bundles:
| Section | Purpose | Code & data location | Dependencies |
|---|---|---|---|
| Manuscript | Reproduce figures, analyses, manuscript model training workflows, and a demonstration prediction on a new case | manuscript/ | manuscript/requirements.txt |
| Model development | Retrain / refresh the pipeline model and artifacts intended for pVACtools integration (current version compatible with pVACtools 7.0) | model_development/ | model_development/requirements.txt |
Important distinctions
- The Model development section is the one meant for future retraining and for the files that are copied into the pVACtools codebase for the ML function. The staging folder is model_development/model/pvactools7.0_model/, which corresponds to pvactools/supporting_files/ml_model_artifacts/ in griffithlab/pVACtools (see model_development/model/pvactools7.0_model/README.md).
- End users running pVACseq with ML enabled should follow the pVACtools documentation (e.g. pvacseq add_ml_predictions). This README focuses on developers reproducing the paper or refreshing the v7 model from this repo.
NEAT/
├── README.md
├── LICENSE
├── .gitignore
│
├── manuscript/                              # Publication reproducibility (NOT the pVACtools-shipped bundle)
│   ├── requirements.txt
│   ├── manuscript_model/                    # Artifacts for manuscript/scripts/predict.py demo
│   ├── data/
│   │   ├── predict_new_case_data/           # Demo inputs for manuscript prediction script
│   │   ├── training_testing_data/
│   │   ├── imputation_analysis/
│   │   ├── review_time_analysis_data/
│   │   ├── manuscript_prediction_results/
│   │   └── …
│   └── scripts/
│       ├── ml_randomforest_model.py
│       ├── ml_logistic_model.py
│       ├── predict.py                       # Manuscript: demo prediction on a new case
│       ├── evaluation_on_prospective_test_set.py
│       ├── imputation_analysis.py
│       └── review_time_analysis.py
│
└── model_development/                       # Model development
    ├── requirements.txt
    ├── data/                                # Pre- & post-imputation tables; prediction outputs
    ├── scripts/
    │   ├── impute_missing.py                # Step 1: fit encoders + IterativeImputer
    │   ├── train.py                         # Step 2: tune + train BalancedRandomForest
    │   └── predict.py                       # Step 3: score a new case
    └── model/
        ├── temporary_model_artifacts/
        └── pvactools7.0_model/              # Staging → pVACtools …/ml_model_artifacts
            ├── README.md
            ├── rf_downsample_model_*.pkl
            ├── trained_imputer_*.joblib
            └── label_encoders_*.pkl
Use this when reproducing figures, statistical analyses, manuscript RF/logistic workflows, and the manuscript walkthrough of prediction on a new case.
pip install -r manuscript/requirements.txt
Use only manuscript/requirements.txt for scripts under manuscript/scripts/. That environment is aligned with the paper’s tooling (e.g. matplotlib, seaborn, and a broader analysis stack) and is separate from model_development/requirements.txt.
The script manuscript/scripts/predict.py merges three pVACseq-style TSVs for one sample, applies the manuscript imputer/encoders/model, and writes an aggregated TSV.
Place inputs under manuscript/data/predict_new_case_data/:
- <sample>.MHC_I.all_epitopes.aggregated.tsv
- <sample>.MHC_I.all_epitopes.tsv
- <sample>.MHC_II.all_epitopes.aggregated.tsv
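Before running the demo, it can help to confirm that all three inputs are in place. The following is a minimal sketch, not part of the repository: the helper names `expected_inputs` and `missing_inputs` are hypothetical, and the only facts it relies on are the three file name patterns and the input directory listed above.

```python
from pathlib import Path

# File name suffixes taken from the input list above.
EXPECTED_SUFFIXES = (
    "MHC_I.all_epitopes.aggregated.tsv",
    "MHC_I.all_epitopes.tsv",
    "MHC_II.all_epitopes.aggregated.tsv",
)

# Demo input directory named in this README (relative to the repository root).
DEFAULT_INPUT_DIR = "manuscript/data/predict_new_case_data"


def expected_inputs(sample, input_dir=DEFAULT_INPUT_DIR):
    """Return the three input paths expected for one sample (hypothetical helper)."""
    return [Path(input_dir) / f"{sample}.{suffix}" for suffix in EXPECTED_SUFFIXES]


def missing_inputs(sample, input_dir=DEFAULT_INPUT_DIR):
    """Return the expected input paths that are not present on disk."""
    return [path for path in expected_inputs(sample, input_dir) if not path.is_file()]
```

For example, `missing_inputs("SAMPLE01")` (a placeholder sample name) returns an empty list only when all three TSVs for that sample exist in the demo input directory.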
From the repository root:
python manuscript/scripts/predict.py
To run any of the other manuscript scripts:
cd manuscript/scripts
python <script_name>.py
Examples include prospective test evaluation, review-time analysis, and imputation comparisons. Each script may assume paths under manuscript/data/.
Use this when retraining or regenerating artifacts for the pVACtools–compatible pipeline. Scripts live under model_development/scripts/ and are intended to be run in this order:
1. model_development/scripts/impute_missing.py: load the pre-imputation table and impute missing values.
2. model_development/scripts/train.py: tune and train BalancedRandomForestClassifier and save the model.
3. model_development/scripts/predict.py: for a new case, merge class I/II inputs, apply the saved imputer/encoders/model, and write the ML prediction TSV.
pip install -r model_development/requirements.txt
Use model_development/requirements.txt (NumPy / scikit-learn / imbalanced-learn pins aligned with the neoantigen_ml_numpy126–style stack used for these scripts). Do not mix this file with manuscript/requirements.txt unless you understand the version differences.
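The three-step order can be made explicit with a small runner. This is a sketch under assumptions rather than a script shipped with the repository: it assumes each script runs from the repository root without required command-line arguments, and the names `build_commands` and `run_pipeline` are hypothetical.

```python
import subprocess
import sys
from pathlib import Path

# Script directory and run order as documented in this README.
SCRIPTS_DIR = Path("model_development/scripts")
PIPELINE = ("impute_missing.py", "train.py", "predict.py")


def build_commands(python=sys.executable, scripts_dir=SCRIPTS_DIR):
    """Build one command per pipeline step, in the documented order."""
    return [[python, str(Path(scripts_dir) / name)] for name in PIPELINE]


def run_pipeline(commands=None):
    """Run each step in turn; check=True stops at the first failing step."""
    for command in commands or build_commands():
        print("running:", " ".join(command))
        subprocess.run(command, check=True)
```

Calling `run_pipeline()` from the repository root, inside the environment built from model_development/requirements.txt, runs the steps sequentially. Stopping on the first failure avoids training on a stale or partial imputation output.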
The directory model_development/model/pvactools7.0_model/ holds the model bundle intended to be copied into the pVACtools repository for the v7 ML path. In pVACtools, the same files live under:
pvactools/supporting_files/ml_model_artifacts/
(griffithlab/pVACtools on GitHub)
| Here (this repo) | In pVACtools |
|---|---|
| model_development/model/pvactools7.0_model/ | pvactools/supporting_files/ml_model_artifacts/ |
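Copying the staged bundle into a local pVACtools checkout could be scripted roughly as follows. This is an illustrative sketch, not a tool from either repository: `copy_bundle` is a hypothetical helper, and it assumes that only the three artifact patterns shown in the directory tree need to be copied. Check the staging folder's README.md for the authoritative file list.

```python
import shutil
from pathlib import Path

# Target location inside a pVACtools checkout, as given in the table above.
TARGET_SUBDIR = Path("pvactools/supporting_files/ml_model_artifacts")

# Artifact patterns from the directory tree in this README (assumed complete).
ARTIFACT_PATTERNS = (
    "rf_downsample_model_*.pkl",
    "trained_imputer_*.joblib",
    "label_encoders_*.pkl",
)


def copy_bundle(staging_dir, pvactools_root):
    """Copy staged model artifacts into a pVACtools checkout (hypothetical helper)."""
    staging_dir = Path(staging_dir)
    target = Path(pvactools_root) / TARGET_SUBDIR
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for pattern in ARTIFACT_PATTERNS:
        matches = sorted(staging_dir.glob(pattern))
        if not matches:
            # Fail loudly rather than ship an incomplete bundle.
            raise FileNotFoundError(f"no file matching {pattern} in {staging_dir}")
        for source in matches:
            copied.append(Path(shutil.copy2(source, target / source.name)))
    return copied
```

A call such as `copy_bundle("model_development/model/pvactools7.0_model", "../pVACtools")` assumes the pVACtools checkout sits next to this repository; adjust the second argument to your own layout.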
The entire dataset (training set, development test set, prospective test set) comprises 1,943 expert-labeled neoantigen peptide candidates spanning 33 patients and 8 cancer types from three clinical trials at Washington University School of Medicine:
| Trial | Cancer type | Patients |
|---|---|---|
| NCT05111353 | Pancreatic cancer | 14 |
| NCT03606967 | Metastatic TNBC | 11 |
| NCT05741242 | Basket trial (multiple types) | 8 |
Each record includes ITB labels (Accept / Reject / Review) and up to 72 features covering MHC binding predictions, RNA expression, tumor variant allele frequency, transcript support, driver gene status, etc.
Note on data availability: Clinical genomic data from individual patients cannot be shared publicly due to IRB restrictions. Aggregate feature matrices used for model training are provided in manuscript/data/ in de-identified form.
The production model is integrated into pVACtools v7. End-user commands (for example):
pvacseq run ... --run-ml-predictions
pvacseq add_ml_predictions \
    input.tsv \
    output_dir/ \
    --accept-threshold 0.55 \
    --reject-threshold 0.30
Predictions are displayed in pVACview alongside binding affinity, expression, and variant-level features. Predicted labels are pre-populated but fully editable during ITB review.
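The two thresholds in the example command suggest a three-way split of a predicted probability into Accept, Review, and Reject. The sketch below shows one plausible reading and is an assumption, not the pVACtools implementation: it assumes the thresholds apply to the model's predicted Accept probability, and the function name and boundary handling are hypothetical. Consult the pVACtools documentation for the actual rule.

```python
def label_from_probability(p_accept, accept_threshold=0.55, reject_threshold=0.30):
    """Map a predicted Accept probability to a label (assumed semantics)."""
    if not 0.0 <= p_accept <= 1.0:
        raise ValueError("p_accept must be a probability in [0, 1]")
    if reject_threshold > accept_threshold:
        raise ValueError("reject_threshold must not exceed accept_threshold")
    if p_accept >= accept_threshold:
        return "Accept"
    if p_accept < reject_threshold:
        return "Reject"
    # Anything between the two thresholds is left for human review.
    return "Review"
```

Under this reading, widening the gap between the two thresholds sends more candidates to Review, which trades automation for additional expert attention.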
Developers: when updating the bundled model in pVACtools from this repository, first run the model_development/scripts pipeline (impute_missing.py → train.py → predict.py) in the environment built from model_development/requirements.txt and validate the outputs. Then copy the contents of model_development/model/pvactools7.0_model/ into pvactools/supporting_files/ml_model_artifacts/.
If you use pVACml or the associated model in your work, please cite:
Yao J, Singhal K, Kiwala S, et al.
Automating immunogenomic tumor board decision-making for neoantigen cancer vaccine design.
[Journal] (2025). DOI: [pending]
For questions about the model or codebase, please open an issue. For questions related to the pVACtools integration, see the pVACtools GitHub.