SHIVER (Somatic Hypermutation Informed Vocabulary Encoder Representations)

A mutation-aware BERT-style language model for human memory B cell (MBC) receptor sequences.

Code for the paper "Somatic Hypermutation Informed Vocabulary Encoder Representations" [paper] (MLCB 2025, Spotlight).

Data pre-processing

To obtain the mutation labels of MBC receptor sequences, we need raw single-cell RNA sequencing reads (fastq.gz files).
We use MiXCR to profile the reads with the 10x Genomics single-cell VDJ (10x-sc-xcr-vdj) preset.

Commands

Example command to execute a complete upstream analysis pipeline from the raw fastq files to clonotype tables:

mixcr analyze 10x-sc-xcr-vdj --species hsa \
READ_FILE_1.fastq.gz READ_FILE_2.fastq.gz /output/dir

Example command to export clonotypes or raw alignments by cell in a tabular form:

mixcr exportCloneGroups --drop-default-fields --dont-show-secondary-chain-on-export-cell-groups \
-aaFeature VDJRegion -aaFeature '{FR1Begin:FR3End}' germline -aaFeature CDR3 -aaFeature '{FR4Begin:FR4End}' germline \
-aaMutations '{FR1Begin:FR3End}' -aaMutations '{FR4Begin:FR4End}' -aaLength '{FR1Begin:FR4Begin}' -vGene -jGene \
outputs/*.assembledCells.clns output_seqs.tsv
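
The exported table carries amino-acid mutation strings, which need to be turned into structured labels before they can inform a vocabulary. The sketch below is one possible way to do that in Python. It assumes the mutation field uses the compact encoding `S<from><pos><to>` for substitutions, `D<from><pos>` for deletions, and `I<pos><to>` for insertions, either comma-separated or concatenated; check this against the actual contents of output_seqs.tsv before relying on it. The names `AAMutation` and `parse_aa_mutations` are hypothetical and not part of this repository.

```python
import re
from typing import List, NamedTuple, Optional


class AAMutation(NamedTuple):
    """One amino-acid mutation (hypothetical helper type, not from the repo)."""
    kind: str            # "S" = substitution, "D" = deletion, "I" = insertion
    position: int        # position as written in the mutation string
    ref: Optional[str]   # germline residue (None for insertions)
    alt: Optional[str]   # observed residue (None for deletions)


# ASSUMPTION: tokens look like SA12G (substitution), DT5 (deletion), I7C (insertion).
_TOKEN = re.compile(
    r"S([A-Z*_])(\d+)([A-Z*_])"   # substitution: from, position, to
    r"|D([A-Z*_])(\d+)"           # deletion: from, position
    r"|I(\d+)([A-Z*_])"           # insertion: position, to
)


def parse_aa_mutations(field: str) -> List[AAMutation]:
    """Parse a mutation string into a list of AAMutation records.

    Text that does not match the assumed token format (for example
    separators) is skipped rather than raising an error.
    """
    mutations = []
    for m in _TOKEN.finditer(field):
        if m.group(1) is not None:
            mutations.append(AAMutation("S", int(m.group(2)), m.group(1), m.group(3)))
        elif m.group(4) is not None:
            mutations.append(AAMutation("D", int(m.group(5)), m.group(4), None))
        else:
            mutations.append(AAMutation("I", int(m.group(6)), None, m.group(7)))
    return mutations
```

For example, `parse_aa_mutations("SA12G,DT5,I7C")` yields one substitution, one deletion, and one insertion record, and an empty field yields an empty list.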

Mutation-Aware Vocabulary

Training

Example command for training the SHIVER model (flags in square brackets are optional):

python -m src.run_train --batch-size 32 --dataset-path dataset/dir/file.tsv \
--lr 1e-4 --weight-decay 0.01 --max-epochs 100 --sample-prob 0.5 --partial-mask-tokens \
--run-id RUN_ID --gpus-used 0 [--save-ckpts] --output-dir /output/dir [--use-wandb]
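
When launching several runs (for example, a small learning-rate sweep), it can help to assemble the training command programmatically. The following is a minimal sketch that only reuses the flags shown in the example command above; `build_train_command` is a hypothetical helper, not part of the repository, and the default values simply mirror that example rather than any recommended settings.

```python
from typing import List


def build_train_command(
    dataset_path: str,
    run_id: str,
    output_dir: str,
    *,
    batch_size: int = 32,
    lr: float = 1e-4,
    weight_decay: float = 0.01,
    max_epochs: int = 100,
    sample_prob: float = 0.5,
    gpus_used: int = 0,
    partial_mask_tokens: bool = True,
    save_ckpts: bool = False,
    use_wandb: bool = False,
) -> List[str]:
    """Build the argv list for `python -m src.run_train`.

    Only flags that appear in the README example are emitted; their
    exact semantics are defined by the repository, not by this helper.
    """
    cmd = [
        "python", "-m", "src.run_train",
        "--batch-size", str(batch_size),
        "--dataset-path", dataset_path,
        "--lr", str(lr),
        "--weight-decay", str(weight_decay),
        "--max-epochs", str(max_epochs),
        "--sample-prob", str(sample_prob),
        "--run-id", run_id,
        "--gpus-used", str(gpus_used),
        "--output-dir", output_dir,
    ]
    # Boolean switches are appended only when enabled.
    if partial_mask_tokens:
        cmd.append("--partial-mask-tokens")
    if save_ckpts:
        cmd.append("--save-ckpts")
    if use_wandb:
        cmd.append("--use-wandb")
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root; passing a list rather than a single string avoids shell quoting issues.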

Pre-trained Weights:

The pre-trained weights for SHIVER are deposited at Zenodo.

Evaluation

SHIVER was evaluated on binding assay data of MBC receptors against influenza hemagglutinin (HA); binding is predicted using 5-fold cross-validation.
- The data are awaiting approval for release.

Comparison against:
- General protein language models, e.g. ESM-2
- Antibody language models, e.g. AbLang2
- MBC-specific language models, e.g. mBLM
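
Since the binding data are not yet released, the evaluation cannot be reproduced directly, but the 5-fold protocol itself is easy to illustrate. The sketch below shows a generic k-fold index split in plain Python; it is not the repository's evaluation code, and `k_fold_indices`, the shuffling seed, and the round-robin fold assignment are all assumptions made for illustration.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_indices(
    n_samples: int, k: int = 5, seed: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for each of k folds.

    Indices are shuffled once with a fixed seed, then dealt round-robin
    into k folds so that fold sizes differ by at most one.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must satisfy 2 <= k <= n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        # Training set is every index not in the held-out fold.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx
```

Each sample appears in exactly one test fold, so every model in the comparison can be scored on the same held-out splits by reusing the same seed.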
