lordim/SHIVER

SHIVER (Somatic Hypermutation Informed Vocabulary Encoder Representations)

Mutation-Aware BERT-style language model for human memory B-Cell (MBC) receptor sequences.

Code for the paper "Somatic Hypermutation Informed Vocabulary Encoder Representations." [paper] (MLCB 2025, Spotlight)

Data pre-processing

  • To obtain the mutation labels of MBC receptor sequences, we need raw single-cell RNA sequencing reads (fastq.gz files)
  • We use MiXCR to profile the reads using the 10x Genomics single cell VDJ (10x-sc-xcr-vdj) preset

Commands

Example command to execute a complete upstream analysis pipeline from the raw fastq files to clonotype tables:

mixcr analyze 10x-sc-xcr-vdj --species hsa \
READ_FILE_1.fastq.gz READ_FILE_2.fastq.gz /output/dir

Example command to export clonotypes or raw alignments by cell in a tabular form:

mixcr exportCloneGroups --drop-default-fields --dont-show-secondary-chain-on-export-cell-groups \
-aaFeature VDJRegion -aaFeature '{FR1Begin:FR3End}' germline -aaFeature CDR3 -aaFeature '{FR4Begin:FR4End}' germline \
-aaMutations '{FR1Begin:FR3End}' -aaMutations '{FR4Begin:FR4End}' -aaLength '{FR1Begin:FR4Begin}' -vGene -jGene \
outputs/*.assembledCells.clns output_seqs.tsv
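
The mutation columns in the exported table can be turned into per-position mutation records. The sketch below is illustrative only and is not part of this repository: it assumes the `-aaMutations` columns use MiXCR's compact mutation encoding (`S` for substitution, `D` for deletion, `I` for insertion, e.g. `SA3T`, `DG15`, `I17C`) with events concatenated without separators. The names `Mutation` and `parse_mutations` are hypothetical. Check the encoding against your own export before relying on it.

```python
import re
from typing import List, NamedTuple, Optional


class Mutation(NamedTuple):
    kind: str            # "S" (substitution), "D" (deletion) or "I" (insertion)
    position: int        # position exactly as written in the encoded string
    ref: Optional[str]   # germline residue (None for insertions)
    alt: Optional[str]   # observed residue (None for deletions)


# Assumed encoding: S<ref><pos><alt>, D<ref><pos>, I<pos><alt>,
# concatenated without separators (an assumption, not confirmed by the README).
_MUTATION_RE = re.compile(
    r"S(?P<s_ref>[A-Z*_])(?P<s_pos>\d+)(?P<s_alt>[A-Z*_])"
    r"|D(?P<d_ref>[A-Z*_])(?P<d_pos>\d+)"
    r"|I(?P<i_pos>\d+)(?P<i_alt>[A-Z*_])"
)


def parse_mutations(encoded: str) -> List[Mutation]:
    """Parse a compact mutation string into a list of Mutation records."""
    encoded = encoded.strip()
    mutations: List[Mutation] = []
    cursor = 0
    while cursor < len(encoded):
        # Match one event anchored at the cursor, so a residue letter such as
        # "D" or "S" is never mistaken for the start of a new event.
        m = _MUTATION_RE.match(encoded, cursor)
        if m is None:
            raise ValueError(
                f"Unrecognised mutation at offset {cursor}: {encoded[cursor:]!r}"
            )
        if m.group("s_pos") is not None:
            mutations.append(
                Mutation("S", int(m.group("s_pos")), m.group("s_ref"), m.group("s_alt"))
            )
        elif m.group("d_pos") is not None:
            mutations.append(
                Mutation("D", int(m.group("d_pos")), m.group("d_ref"), None)
            )
        else:
            mutations.append(
                Mutation("I", int(m.group("i_pos")), None, m.group("i_alt"))
            )
        cursor = m.end()
    return mutations
```

An empty cell (no mutations in the region) yields an empty list, and any text that does not fit the assumed encoding raises `ValueError` rather than being silently skipped.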

Mutation-Aware Vocabulary

Training

Example command for training the SHIVER model:

python -m src.run_train --batch-size 32 --dataset-path dataset/dir/file.tsv \
--lr 1e-4 --weight-decay 0.01 --max-epochs 100 --sample-prob 0.5 --partial-mask-tokens \
--run-id RUN_ID --gpus-used 0 [--save-ckpts] --output-dir /output/dir [--use-wandb]

Pre-trained Weights:

  • The pre-trained weights for SHIVER are deposited on Zenodo

Evaluation

  • SHIVER was evaluated on binding assay data for MBC receptors against influenza hemagglutinin (HA); binding is predicted under 5-fold cross-validation.
    • The data are awaiting approval for release
  • Comparison against:
    • General protein language models: e.g. ESM-2
    • Antibody language models: e.g. AbLang2
    • MBC-specific language model: e.g. mBLM
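
As a rough illustration of the 5-fold cross-validation protocol mentioned above, the following sketch produces train/test index splits using only the standard library. It is plain random k-fold, which is an assumption: the README does not say whether the actual evaluation groups sequences by clonotype, donor, or anything else. The function name `k_fold_indices` is hypothetical and not taken from this repository.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_indices(
    n_samples: int, k: int = 5, seed: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for each of k random folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")

    # Shuffle once with a fixed seed so the splits are reproducible.
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)

    # Fold sizes differ by at most one sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]

    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size
```

Each sample appears in exactly one test fold, so per-fold metrics can be averaged to give a single cross-validated score for each model being compared.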
