Mutation-Aware BERT-style language model for human memory B-Cell (MBC) receptor sequences.
Code for the paper "Somatic Hypermutation Informed Vocabulary Encoder Representations." [paper] (MLCB 2025, Spotlight)
- To obtain the mutation labels of MBC receptor sequences, we need raw single-cell RNA sequencing reads (`fastq.gz` files)
- We use MiXCR to profile the reads using the 10x Genomics single cell VDJ (`10x-sc-xcr-vdj`) preset
Example command to execute a complete upstream analysis pipeline from the raw fastq files to clonotype tables:
mixcr analyze 10x-sc-xcr-vdj --species hsa \
READ_FILE_1.fastq.gz READ_FILE_2.fastq.gz /output/dir
Example command to export clonotypes or raw alignments by cell in a tabular form:
mixcr exportCloneGroups --drop-default-fields --dont-show-secondary-chain-on-export-cell-groups \
-aaFeature VDJRegion -aaFeature '{FR1Begin:FR3End}' germline -aaFeature CDR3 -aaFeature '{FR4Begin:FR4End}' germline \
-aaMutations '{FR1Begin:FR3End}' -aaMutations '{FR4Begin:FR4End}' -aaLength '{FR1Begin:FR4Begin}' -vGene -jGene \
outputs/*.assembledCells.clns output_seqs.tsv
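The exported table carries the amino-acid mutation strings requested with `-aaMutations`. As a minimal sketch of turning one such string into structured labels, the snippet below assumes the tokens follow a substitution/deletion/insertion convention of the form `S<from><pos><to>`, `D<from><pos>` and `I<pos><to>`. That token shape, and the helper name `parse_mutations`, are assumptions made for illustration; check them against the actual contents of `output_seqs.tsv` before relying on this. It is not the labeling code used by SHIVER.

```python
import re

# Assumed token shapes (not confirmed by this README):
#   S<from><pos><to>  substitution
#   D<from><pos>      deletion
#   I<pos><to>        insertion
_TOKEN = re.compile(
    r"S(?P<s_from>[A-Z*_])(?P<s_pos>\d+)(?P<s_to>[A-Z*_])"
    r"|D(?P<d_from>[A-Z*_])(?P<d_pos>\d+)"
    r"|I(?P<i_pos>\d+)(?P<i_to>[A-Z*_])"
)


def parse_mutations(field):
    """Return (kind, from_residue, position, to_residue) tuples for one field.

    Tokens may be separated by commas or simply concatenated; an empty or
    missing field yields an empty list.
    """
    mutations = []
    for m in _TOKEN.finditer(field or ""):
        if m.group("s_pos") is not None:
            mutations.append(
                ("S", m.group("s_from"), int(m.group("s_pos")), m.group("s_to"))
            )
        elif m.group("d_pos") is not None:
            mutations.append(("D", m.group("d_from"), int(m.group("d_pos")), None))
        else:
            mutations.append(("I", None, int(m.group("i_pos")), m.group("i_to")))
    return mutations
```

Each value read from a mutation column of the TSV (for example with `csv.DictReader(f, delimiter="\t")`) can be passed through `parse_mutations` to get per-position labels.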
Example command for training the SHIVER model:
python -m src.run_train --batch-size 32 --dataset-path dataset/dir/file.tsv \
--lr 1e-4 --weight-decay 0.01 --max-epochs 100 --sample-prob 0.5 --partial-mask-tokens \
--run-id RUN_ID --gpus-used 0 [--save-ckpts] --output-dir /output/dir [--use-wandb]
- The pre-trained weights for SHIVER are deposited on Zenodo
- SHIVER was evaluated on binding assay data for MBC receptors against influenza hemagglutinin (HA); binding is predicted under 5-fold cross-validation.
  - The data are awaiting approval for release
- Comparison against:
  - General protein language models: e.g. ESM-2
  - Antibody language models: e.g. AbLang2
  - MBC-specific language models: e.g. mBLM
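
The 5-fold cross-validation mentioned above can be sketched as follows. This is a generic index splitter using only the standard library, not the evaluation code from the paper; the function name `k_fold_indices` and the seed are hypothetical, and how sequences are actually grouped into folds (for example by clonotype or donor) is not stated here.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation.

    Indices are shuffled once with a fixed seed, then dealt into k folds
    whose sizes differ by at most one. Each fold serves as the test set once.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

A model would be fit on each `train` split and scored on the matching `test` split, with the five scores then averaged.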