Metadata of all SMaHT samples at the time of this study.
We consider the following input:
ST001: healthy liver sample, sequenced at ~230x with PacBio Revio (this includes a ~26x Fiber-seq BAM) and at ~210x with ONT PromethION 24 (coverages estimated from chr1). Data downloaded from the benchmarking section of the data portal.SMHT001: death caused by liver failure, alcohol abuse as a death circumstance. Sequenced at ~10x with PacBio Revio. Data downloaded from workspace SMaHT_Short_Read_Long_Read_Analysis, tableSMAHT001_collaborator_long_read, fieldsample_collaborator_id=SMHT001-3I-001A2.
For each ST001 technology, we merge all the BAMs, we take random samples at multiples of 10x, and on every such subsample we run sniffles --mosaic requiring just two reads to support a call (i.e. we set --mosaic-af-min as a function of coverage). We only output calls that Sniffles considers as somatic, i.e. that have --mosaic-af-max=0.218 (the default). Each call is annotated with the IDs of the reads that support it. We only consider calls in the standard chromosomes.
Circles: 10x SMHT001 (liver failure). In the 230x ST001 VCF, TR calls do not seem to be enriched in specific TR intervals (according to
bedtools intersect -c).
Running sniffles with a constant min allele frequency (--mosaic-af-min 0.05, the default) shows a peak at 100x instead:
Calls from the BAM with max coverage:
Examples of calls of length >6k in the 230x VCF:
There are only 14 total calls, all of which except one occur in a TR.
We consider the following input:
ST001: healthy liver sample, sequenced at ~122x with Illumina NovaSeq X. Data downloaded from the benchmarking section of the data portal, file accession SMAFILE7Y4Y9.SMHT001: cirrotic liver sample sequenced at ~124x with Illumina NovaSeq X. Data downloaded from workspace SMaHT_Benchmarking_Short_Read, tableSMAHT001_collaborator_short_read, fieldSMAHT001_collaborator_short_read_id=SM-OLQZF, fieldscollaborator_sample_id=SMHT001-3I-001A1.
Many short-read somatic callers work in a tumor-normal setting, so we select the following data as "normal":
ST001: a skin sample sequenced at ~100x with Illumina NovaSeq X. Data downloaded from the benchmarking section of the data portal, file accession SMAFINU18SZV.SMHT001: a skin sample sequenced at ~90x with Illumina NovaSeq X. Data downloaded from workspace SMaHT_Benchmarking_Short_Read, tableSMAHT001_collaborator_short_read, fieldSMAHT001_collaborator_short_read_id=SM-OLRTD, fieldscollaborator_sample_id=SMHT001-3AF-001A2.
We run Manta in tumor-normal somatic mode (Manta has also a tumor-only mode, but it is experimental). For ST001 (healthy) we get 134 calls, 120 of which are BNDs. Example:
chr1 10012296 MantaBND:1200:0:0:0:0:0:0 G [chr1:10012368[G . MinSomaticScore SVTYPE=BND;MATEID=MantaBND:1200:0:0:0:0:0:1;CIPOS=0,3;HOMLEN=3;HOMSEQ=CTT;SOMATIC;SOMATICSCORE=25;BND_DEPTH=83;MATE_BND_DEPTH=85 PR:SR 58,0:88,0 63,2:115,2
chr1 10012365 MantaBND:1200:0:0:0:0:0:1 A [chr1:10012299[A . MinSomaticScore SVTYPE=BND;MATEID=MantaBND:1200:0:0:0:0:0:0;CIPOS=0,3;HOMLEN=3;HOMSEQ=GCA;SOMATIC;SOMATICSCORE=25;BND_DEPTH=85;MATE_BND_DEPTH=83 PR:SR 58,0:88,0 63,2:115,2
chr1 16035669 MantaDEL:0:3498:3499:0:0:0 G <DEL> . MinSomaticScore END=16058763;SVTYPE=DEL;SVLEN=-23094;IMPRECISE;CIPOS=-135,136;CIEND=-115,115;SOMATIC;SOMATICSCORE=25 PR 140,1 136,8
chr1 17868650 MantaBND:0:3926:3929:0:0:0:0 A [chr11:68505958[A . MinSomaticScore SVTYPE=BND;MATEID=MantaBND:0:3926:3929:0:0:0:1;IMPRECISE;CIPOS=-295,295;SOMATIC;SOMATICSCORE=21;BND_DEPTH=90;MATE_BND_DEPTH=119 PR 105,1 96,9
chr1 21981417 MantaDUP:TANDEM:2443:0:2:0:0:0 C <DUP:TANDEM> . MinSomaticScore END=22006096;SVTYPE=DUP;SVLEN=24679;IMPRECISE;CIPOS=-205,205;CIEND=-152,152;SOMATIC;SOMATICSCORE=10 PR 166,2 135,7
chr1 28377206 MantaBND:3449:0:1:0:0:0:1 A A]chr1:28377946] . MinSomaticScore SVTYPE=BND;MATEID=MantaBND:3449:0:1:0:0:0:0;CIPOS=0,18;HOMLEN=18;HOMSEQ=GGATTACAGGCGTGAGCC;SOMATIC;SOMATICSCORE=16;BND_DEPTH=92;MATE_BND_DEPTH=89 PR:SR 101,0:114,0 109,1:139,3
chr1 28377928 MantaBND:3449:0:1:0:0:0:0 T T]chr1:28377224] . MinSomaticScore SVTYPE=BND;MATEID=MantaBND:3449:0:1:0:0:0:1;CIPOS=0,18;HOMLEN=18;HOMSEQ=GGCTCACGCCTGTAATCC;SOMATIC;SOMATICSCORE=16;BND_DEPTH=89;MATE_BND_DEPTH=92 PR:SR 101,0:114,0 109,1:139,3
For SMHT001 (cirrotic) we get 32 calls, 26 of which are BNDs. Example:
chr1 90378706 MantaBND:8990:0:1:0:0:0:0 T T]chr7:134929711] . MinSomaticScore SVTYPE=BND;MATEID=MantaBND:8990:0:1:0:0:0:1;IMPRECISE;CIPOS=-128,128;SOMATIC;SOMATICSCORE=16;BND_DEPTH=12;MATE_BND_DEPTH=53 PR 17,0 28,5
chr2 91518954 MantaDEL:30438:1:1:0:1:0 CTGGCTGCCTGGCTGGCTGTTTGGCTTGGCTGGCTTGGCTGGCTGGGTGGCTTGGCTGGTTTGGCTAGCTGGCTGGCTGGGTAGCTTGGCTAGCTGGCTGGCTTGGCTGGTTGGCTGGCTTGGCTGGCTTGGCTAGCCGGCTGGCTTAGTTGGCTGGATGGCTTGGCTGGCATGCCTGGCT C . MaxMQ0Frac;MinSomaticScore END=91519134;SVTYPE=DEL;SVLEN=-180;CIGAR=1M180D;CIPOS=0,6;HOMLEN=6;HOMSEQ=TGGCTG;SOMATIC;SOMATICSCORE=14 PR:SR 19,0:58,0 38,0:83,8
chr2 94646936 MantaBND:30805:0:3:0:0:0:0 C C[chr6:9285873[ . MinSomaticScore SVTYPE=BND;MATEID=MantaBND:30805:0:3:0:0:0:1;IMPRECISE;CIPOS=-160,161;SOMATIC;SOMATICSCORE=21;BND_DEPTH=72;MATE_BND_DEPTH=41 PR 37,1 32,8
chr4 20885473 MantaBND:62961:0:2:0:0:0:1 G G]chr5:58943836] . PASS SVTYPE=BND;MATEID=MantaBND:62961:0:2:0:0:0:0;IMPRECISE;CIPOS=-85,85;SOMATIC;SOMATICSCORE=56;BND_DEPTH=103;MATE_BND_DEPTH=113 PR 119,0 107,9
chr4 63202121 MantaDEL:66559:0:0:0:4:0 TATATAGGATATATAGTATATATATAATATATATAGGATATATAGTATATATATAATATAGGATATATAGTATATTTACACTATATATAATGTCTAATAGGATATATAGTATATATATA T . MinSomaticScore END=63202239;SVTYPE=DEL;SVLEN=-118;CIGAR=1M118D;CIPOS=0,30;HOMLEN=30;HOMSEQ=ATATAGGATATATAGTATATATATAATATA;SOMATIC;SOMATICSCORE=10 PR:SR 4,0:20,4 7,0:18,5
chr5 58943836 MantaBND:62961:0:2:0:0:0:0 T T]chr4:20885473] . PASS SVTYPE=BND;MATEID=MantaBND:62961:0:2:0:0:0:1;IMPRECISE;CIPOS=-93,94;SOMATIC;SOMATICSCORE=56;BND_DEPTH=113;MATE_BND_DEPTH=103 PR 119,0 107,9
chr5 77395471 MantaBND:84650:0:4:1:0:0:1 C C]chr12:95432519] . PASS SVTYPE=BND;MATEID=MantaBND:84650:0:4:1:0:0:0;IMPRECISE;CIPOS=-81,81;SOMATIC;SOMATICSCORE=35;BND_DEPTH=106;MATE_BND_DEPTH=104 PR 100,0 70,7
We IGV'd each one of the top 50 expressed genes in the liver according to GTEx, but by eye we could only find intronic SVs in C1R and HPD that are germline (we checked other tissues from the same donor and they appear there as well).
In the ST001 230x PacBio BAM, ~3% of all reads have at least one clipped alignment, defined as an alignment with either a 100bp left or a 100bp right soft clip and that involves a canonical chromosome. Often these reads map to more than one chromosome, or contain sequence that does not map to any region of GRCh38. We assume that such reads are chimeras and discard them, unless their clips are aligned and their patterns of mismatching bases are supported by more than one read. The same phenomenon occurs in PacBio samples from HPRC, and in ONT reads from ST001.
More info on chimeras: Heinz et al. Detecting foldback artifacts in long reads. bioRxiv 2025.
People in the SMaHT network are already focusing on TRs, see e.g. this concept sheet.
We IGV'd (in the ST001 230x PacBio BAM) each one of the top 50 expressed genes in the liver according to GTEx, and we saw multiple haplotypes in some TRs:
It is well known that PacBio reads (top) can capture the structure of TRs better than Illumina reads (bottom). Examples with candidate somatic variation from ST001 230x PacBio:
PacBio reads can also show the 5mC status of TRs (examples from ST001 230x PacBio):
In the ST001 230x PacBio BAM.
Note: some interesting regions in or around these genes might be missing from what follows. Go over IGV in detail once again.
We extract spanning reads from the ST001 230x PacBio BAM with hapestry's command:
extract_reads_from_windows --output_dir ./reads_with_q/ \
--bam_csv ${SAMPLES_LIST} \
--windows ${FLANKED_WINDOWS_BED} \
--bam_not_hardclipped \
--require_spanning \
--flank_length 0 \
--fetch_max_length 100000 \
--get_qualities \
--tags NM \
--n_threads 1 \
--force_forward
Then, we load the FASTA into Jalview and use MAFFT with preset FFT-NS-1 (Speed oriented). Note that a multiple sequence alignment automatically hides all germline homs, and puts all the germline hets in the same cluster.
This TR has no 5mC marks.
TRGT detects two haplotypes in one interval:
chr4 99305501 .
CATATATATATATATATATATATACAATCACTTAACTATATGAACCACTTGCCCCATAGTGTATATATATATATATATATATATATATA
CATATATATATATATATATATATATACAATCACTTAACTATATGAACCACTTGCCCCATAGTGTATATATATATATATATATA,
CATATATATATATATATATATATACAATCACTTAACTATATGAACCACTTGCCCCATAGTGTATATATATATATATATATATATATATATA
. .
TRID=4-99305501-99305524-AT,4-99305558-99305562-GT;END=99305589;MOTIFS=AT,GT;STRUC=<VC181780>
GT:AL:ALLR:SD:MC:MS:AP:AM
1/2:82,90:66-86,56-95:114,136:25_2,29_2:0(0-24)_0(27-29)_0(38-42)_0(56-58)_1(59-63)_0(63-81),0(0-22)_0(25-27)_0(36-40)_0(54-56)_1(57-61)_0(61-89):0.682927,0.711111:.,.
And 250 reads span them (coverage in the region: 330). However, aligning the reads to the two haplotypes shows recurrent CIGAR patterns:
There is an overlapping TRGT catalog interval in this region, which is also genotyped by TRGT with two haplotypes:
chr4 99305561 .
GTATATATATATATATATATATATATATATATATATATATATATATATA
GTATATATATATATATATATATATATATATATATATATATA,
GTATATATATATATATATATATATATATATATATATATATATATATATATA
. .
TRID=4-99305561-99305609-TA;END=99305609;MOTIFS=TA;STRUC=<TR2721601>
GT:AL:ALLR:SD:MC:MS:AP:AM
1/2:40,50:24-44,16-54:113,137:20,25:0(0-40),0(0-50):1.000000,1.000000:.,.
And 250 reads span them. However, as above, aligning the reads to the two haplotypes shows recurrent CIGAR patterns.
- Sniffles: no support.
- Severus: no support.
This TR seems to be methylated, and to have >=2 distinct 5mC patterns (gray: FWD strand, green: REV strand):
Zooming in, several clipped alignments seem to have mismatching bases aligned with a hom INS, and the patterns of mismatch seem to be supported by more than one read. These may just be the beginning/end of the same hom INS, but it is worth investigating.
We extract all alignments (spanning and non-spanning) with the "Export alignments" feature of IGV. Then, we build a POA graph using abPOA in local alignment mode with flags -m 1 --amb-strand --sort-by-len --result 3 (the TR sequence in the reference is in red):
Strangely the reference sequence is not aligned with the INS, the left INS is supported by ~10 reads and the right INS is supported by ~5 reads. Probably the POA graph is inaccurate in this region.
TRGT detects two haplotypes:
chr1 16565544 .
CAGCTCAGTAATGGCCACTTGGAGCAGGAATATGATCTTTATATGGAAGACTCAGTGGATCCTTATCACCTTCATAGAAAGGTACTCACCTCCCACGTCAAGAGAAAAGCCAACATGTTTTTCCTCCAATGCATAAAAGGAACTTCCATAGGGCAGGCAGGAGTCAGGCTGTTCAAGACAACTGGAAGGAGTTGAATAACATCTATCCAGTGAGTCCTGCAAGACTTCAGGCTCTACTGCCTCCAGCAGCTCCCTGCTGAGCCTGGAAAAGTAGGAAAAAGTAAAGAATAAGCCAGGGGGAATCAGAAACCACACAGCCCCAGCTACATTTCATGGCTAACATAAGGAACTGTTTAAACAGAAAAAGGACAGATCCATTAATGAGGTAATGAATTATTGCCTTTATGTTGGGATAGACCAGGGCCAGGTAGAAAAGAATGAAAGAGAAAGACAGGGAGAGGGAGAGGGAGAGAGAGACAGAGGAGAAAGTGAGCTCAGCGAATTGGCCGGGTGACACACTGACGAAGGGGTCAAAGGACACTCTGAGTTAGTGCCCTCGGGACACACAGAGAACAGTGATCATGAAAAGAGTGGGCTCAATAATTTTCCATAAACTTGCTTAAGATTCCATGCAGTTGCCATACAGCCTTTGAGGTATGGTCAACCTACAGTAAGTTAGTAAATGATAAGGGGAGGAAGAAATGGAAACCTAAACATCTACTGCAAGGAAAACCAACAGCAATGTCAGTAGGAGTAATTCAACCTTCGTTGAAAACATGAAATTGAACATACTCTTGTTTTCCCTGGACCTGGCATCTCCAGGTGTCAACACAGAATTAAGCATCCATAATTGCTCAAAGTTACCTGGGGCATGATGGGTCTTGGTCTTCTTCCACTTCTTGGTACTTTTCAATTTCTGCAATAAGTTCAGACATGGACAGACATATTAAGCTGGTTCTCCTACACACATAACAATCCACTGTCTAATCCTCACGCAGGGACTTCAGGCTCCTCAGCATGAGAATAGGACACTGTGAGAGATCTTCTTCAGGAGGCCTGAAGGCTGATCATGATAGAGATTCCTGGGTTTTTGTCCCAGAAACTGTGGGTAAAATTCCCTATTCTGGTAGATCGTTATCCCAAGATCATTTGTCCCAAGTTTGTGCAAATGGTTATGCCATATTTTTCCAATCGATTTAAAGCAAATGCCCCCAAATGGTTGCTGGGAGAAAAACTGCAATATTCAGCCCTGTCTCATCAAATACTCAGATTCTTCATGGTAGCGAGGATTTTAGATGCTGAAATTAGAGTGAAGGATGAAATCTACAAGATCTACAAAATTGAGACAAAATCAGAGTTGTGTGAATTTGTCACATCTGCCCAGATCCAACATCTTGAGAGTGGGATTAGGGTGCCACAGGCATGGCCTGAGACTAGGAAGAGAGCCCTGCTCACTGACCCATCCCTTGCCTGGGCTTCCAAGTGGAACTAGAGTTTCATTCAACCTACATGTGCCTATAGGTCCTCCCTGTGGCAATGACATCTCTCAGCTCAGTAAGGGCCATTTGCAGTAGGAATATGACCCTAACCAGAAGACTCAGTGGATCCTTATCACCTTCATAGAAAGGTACTCACCATCCATGTCAAGAGCCCAGCCAACACGCTGTTGCTCCAATATGTAAAAGGCACTTCTGTAGGGCTGGCATGAGTCAGTCAGTTCAAGATAACCTGAAGGAGTTGAATAACATCTATCCAGTGAGTCCTGCAAGACTTCAGGCCCTTTCTCATCCAGCAGCTCCCTGCTGAGCCTGGAACAGTGGGAAAAAGTAAAGAATAAGCCAGGGGGAATCAGAAACCACACAGCCCCAGCTAGATTTCATGGCTAACATAAGGAAGAGTTTGAAAAGAAAAAGGACAGATCCATTAATGAGGTAACAAATTATTGCCTTTATATTGGGATAGACTAGGGCCAGGTAGAAAAGGATGAAAGAGAAAGACACACACACACACACACACACACACACACACACACACACACACACAGAGTGAGCTCAGTGAATTGGCCAGGTGACACACTGATGAGGGAGTCAACGGTCATTCTCTATTTGTGCTCTCAGGACACACAGTGAACAGTGATCATGAAAAGCATGGCCTCAATAATTTTGCATAAAATGTGCTCAAGTTTCCCTGCAGCCACCATGAGAATACAGCTTTTGAGGTATGGTCAACCTTCACTAGGTTAGTAAATGATAAGGGTAGGAAGAAATGGAAACCTAAACATTTACTCTAATGAGAACCAAAA
CAGCTCAGTAAGGGCCATTTGCAGTAGGAATATGACCCTAACCAGAAGACTCAGTGGATCCTTATCACCTTCATAGAAAGGTACTCACCATCCATGTCAAGAGCCCAGCCAACACGCTGTTGCTCCAATATGTAAAAGGCACTTCTGTAGGGCTGGCATGAGTCAGTCAGTTCAAGATAACCTGAAGGAGTTGAATAACATCTATCCAGTGAGTCCTGCAAGACTTCAGGCCCTTTCTCATCCAGCAGCTCCCTGCTGAGCCTGGAACAGTGGGAAAAAGTAAAGAATAAGCCAGGGGGAATCAGAAACCACACAGCCCCAGCTAGATTTCATGGCTAACATAAGGAAGAGTTTGAAAAGAAAAAGGACAGATCCATTAATGAGGTAACAAATTATTGCCTTTATGTTGGGATAGACTAGGGCCAGGTAGAAAAGGATGAAAGAGAAAGACACACACACACACACACACACACACACACACACACACACACACAGAGTGAGCTCAGTGAATTGGCCAGGTGACACACTGATGAGGGAGTCAACGGTCATTCTCTATTTGTGCTCTCAGGACACACAGTGAACAGTGATCATGAAAAGCATGGCCTCAATAATTTTGCATAAAATGTGCTCAAGTTTCCCTGCAGCCACCATGAGAATACAGCTTTTGAGGTATGGTCAACCTTCACTAGGTTAGTAAATGATAAGGGTAGGAAGAAATGGAAACCTAAACATTTACTCTAATGAGAACCAAAA,
CAGCTCAGTAATGGCCACTTGGAGCAGGAATATGATCTTTATATGGAAGACTCAGTGGATCCTTATCACCTTCATAGAAAGGTACTCACCTCCCACGTCAAGAGAAAAGCCAACATGTTTTTCCTCCAATGCATAAAAGGAACTTCCATAGGGCAGGCAGGAGTCAGGCTGTTCAAGACAACTGGAAGGAGTTGAATAACATCTATCCAGTGAGTCCTGCAAGACTTCAGGCTCTACTGCCTCCAGCAGCTCCCTGCTGAGCCTGGAAAAGGAGGAAAAAGTAAAGAATAAGCCAGGGGAAATCAGACACAACAGAGCCCCAACTAGGTTTCATGGGTAGCATAAGGAAGTGGTTAAAAAAGTAAAAGGATAGATCCATTAATGAGGTAACAAATTATTGCCTTCATGTTGGGACAGAACAGGGCCAAATGGAAAAGAATGAAAGAGAAAGACAGATAGACACACACACACACACACACACACACACACAGAGAGAGAGAGAATGAGCTCAGTGAATTGTCCAGGTGACACACTGATGAGGGAGTAACAGGACACTCTGAGTTAGTGCCCTCAGGACACACAGCATACAGTGATCAGGAAAGGACTGTGCTCAATAATTTTCCATAAAATGTGCTCAAGTTTCCATGCAGTCGCCATGAGAATACAGTTTTTGAAGTCTGGTCCACCTACAGTAGGTTAGTAAATGATAAGGGGAGGAAGAAATGGAAACCTAAATATCTACTGCAATGAAAACCAACAGCAATGTTAGTAGGAATAATTCAGGCTCGGTTGAAAAGATGTAATCGATAATGTCAGCCCGCCCTGTTTTCCCTGAACCAGGAGTCTCCAGATGTCAACACAGAAGAAGCTGTTCACAATTGCTCAGTTACCTGGGGCATGGTGGGCCTTGGTCTTCTTCCTCTTCTTGGTCCTTTTTAATTCCTGCAATACATTCAGACAGGGACAGACAAAATAAGCCAATTCACCTACACCCGTAACAGTCCACTGTCTAATCCCCACACAGGGATCTCAGGCTCCTCAGCAAGAGAACAGGACAATGTGAGAGATATACTTCAGGAGGCCTGAAAGCTGGTCATGATATTCTTTGGTTTGCATCTCAGAACCAAGGGTGAAATATCCCTATTCTGGTAGATCGTTATCCCAAAATCATTTATCCCAAGTTTGTGCAAACAGTTATGCCTTATTGTTCCCATCAGTTCAAAGACAATGCCCCAGATGATTTCTAGGAGGAAAACTGCAGTATTCAGCCCTGTCTCATCAAATGCCCAGCTCGTTCATGGATGCAAGAATTTTAGACACTGAAATTAGAATGAGGGAGGAAATCTACAAACCCTTGAGTCCAAATCATAGTTCTGTGAATTTTTTACATCTGCCTGGGTCCAATGTGCTGAGAGCGGGCTCAGGTTGATACAGGCATGGCTGGAGACTAGGAATAGAGCCTTGCTCACTGACCCATTTCATGTCTAGGCTTCCAGCGGAGACTACAGTTTCATTACAACCTATATGCGCCCATAGGTCCTGCCTGCGGCAATGACATCTCTCGGGTCAGTAAGGGCCACTGGGAACAGGAATATCACCCCTATCTGGAAGACCAGGTGGAGGCTTATCACCTTCATAGTAAGGTACTCACTGTCCACGTCAAGAGCCAAGCCAAGGTACTGTTCCTCCAATGAGTAAACAGCACTTCTGTAGGGCTGGCCTAAGTCAGGCAGTTCAAGATAACCTGAAGGAGTCGAATAACATCTATCCAGTGAGTCCTGCAAGACTTCAGGCTCTTTCTCATCCAGCAGCTCCCTGCTGAGCCTGGAAAAGTAGGAAAAAGTAAAGAATAAGCCAGGGGGAATCAGAAACCACACAGCCCCAGCTACATTTCATGGCTAACATAAGGAACTGTTTAAACAGAAAAAGGACAGATCCATTAATGAGGTAATGAATTATTGCCTTTATGTTGGGATAGACCAGGGCCAGGTAGAAAAGAATGAAAGAGAAAGACAGGGAGAGGGAGAGGGAGAGAGAGACAGAGGAGAAAGTGAGCTCAGCGAATTGGCCGGGTGACACACTGATGAAAGGGTCAAAGGACACTCTGAGTTAGTGCCCTCGGGACACACAGAGAACAGTGATCATGAAAAGAGTGGGCTCAATAATTTTCCATAAACTTGCTTAAGATTCCATGCAGTTGCCATACAGCCTTTGAGGTATGGTCAACCTACAGTAAGTTAGTAAATGATAAGGGGAGGAAGAAATGGAAACCTAAACATCTACTGCAAGGAAAACCAACAGCAATGTCAGTAGGAGTAATTCAACCTTCGTTGAAAACATGAAATTGAACATACTCTTGTTTTCCCTGGACCTGGCATCTCCAGGTGTCAACACAGAATTAAGCATCCATAATTGCTCAAAGTTACCTGGGGCATGATGGGTCTTGGTCTTCTTCCACTTCTTGGTACTTTTCAATTTCTGCAATAAGTTCAGACATGGACAGACATATTAAGCTGGTTCTCCTACACACATAACAATCCACTGTCTAATCCTCACACAGGGACTTCAGGCTCCTCAGCATGAGAATAGGACACTGTGAGAGATCTTCTTCAGGAGGCCTGAAGGCTGATCATGATAGAGATTCCTGGGTTTTTGTCCCAGAAACTGTGGGTAAAATTCCCTATTCTGGTAGATCGTTATCCCAAGATCATTTGTCCCAAGTTTGTGCAAATGGTTATGCCATATTTTTCCAATCGATTTAAAGCAAATGCCCCCAAATGGTTGCTGGGAGAAAAACTGCAATATTCAGCCCTGTCTCATCAAATACTCAGATTCTTCATGGTAGCGAGGATTTTAGACGCTGAAATTAGAGTGAAGGATGAAATCTACAAGATCTACAAAATTGAGACAAAATCAGAGTTGTGTGAATTTGTCACATCTGCCCAGATCCAACATCTTGAGAGTAGGATTAGGGTGCCACAGGCATGGCCTGAGACTAGGAAGAGAGCCCTGCTCACTGACCCATCCCTTGCCTGGGCTTCCAAGTGGAACTAGAGTTTCATTCAACCTACATGTGCCTATAGGTCCTCCCTGTGGCAATGACATCTCTCAGCTCAGTAAGGGCCATTTGCAGTAGGAATATGACCCTAACCAGAAGACTCAGTGGATCCTTATCACCTTCATAGAAAGGTACTCACCATCCATGTCAAGAGCCCAGCCAACACGCTGTTGCTCCAATATGTAAAAGGCACTTCTGTAGGGCTGGCATGAGTCAGTCAGTTCAAGATAACCTGAAGGAGTTGAATAACATCTATCCAGTGAGTCCTGCAAGACTTCAGGCCCTTTCTCATCCAGCAGCTCCCTGCTGAGCCTGGAACAGTGGGAAAAAGTAAAGAATAAGCCAGGGGGAATCAGAAACCACACAGCCCCAGCTAGATTTCATGGCTAACATAAGGAAGAGTTTGAAAAGAAAAAGGACAGATCCATTAATGAGGTAACAAATTATTGCCTTTATGTTGGGATAGACTAGGGCCAGGTAGAAAAGGATGAAAGAGAAAGACACACACACACACACACACACACACACACACACACACACACACAGAGTGAGCTCAGTGAATTGGCCAGGTGACACACTGATGAGGGAGTCAACGGTCATTCTCTATTTGTGCTCTCAGGACACACAGTGAACAGTGATCATGAAAAGCATGGCCTCAATAATTTTGCATAAAATGTGCTCAAGTTTCCCTGCAGCCACCATGAGAATACAGCTTTTGAGGTATGGTCAACCTTCACTAGGTTAGTAAATGACAAGGGTAGGAAGAAATGGAAACCTAAACATTTACTCTAATGAGAACCAAAA
And 219 reads span them (coverage in the region: 340). However, aligning the reads to the two haplotypes shows recurrent CIGAR patterns:
- Sniffles: no support.
- Severus: 1561bp INS, not somatic.
The ends of clipped alignments seem to be consistently located, and the mismatching bases seem to be supported by more than one read:
Moreover, this TR seems to be methylated with a consistent pattern, which seems to diverge in the clipped alignments (gray: FWD strand, green: REV strand):
We extract all alignments (spanning and non-spanning) with the "Export alignments" feature of IGV. Then, we build a POA graph using abPOA with flags --amb-strand --sort-by-len --result 3 (the TR sequence in the reference is in red):
No read supports the red branching arm (only the ref supports it), and the blue branching arm is supported by ~150 reads: this seems to contradict the BAM.
TRGT detects only the hom DEL:
chr19 3973097 .
TGCCTGTAGTCCCGGCTACTCGGGAGGCTGAGGCAGGAGAATGGCGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCAAGATTGAGCCACTGCACTCCAGCCTGGGTGACAGAGCAAAGACTGTCTCAAAAAAAAAAAAAAATTCAGCTTCCACATACCATCTTAACGTTTGCACTACTAAAAAAGAATAGGACCCGGTATGGTGGCTCACACCTGTAACCCCAGCACTTTGGGAGGCTGAGGCGGGCTGATCTCTTACGGTCGAGCTTGAGACCAGCCTGGCCAACATGGTGAAACCCCATCTCACCTAAATATACAAAAATTACACAGACCTGGTGATAGGCGCCTGTAATTCCAGCTACTCGGGAGGCTGAAGCACAAGAATTGCTTGAGCACTTGAATCCGGAGGCAGAGGCTATAGTGAGCCAAGATCACAACACTGCACTCCATCCTGGGCGATAGAGTGAGATTCTCGAAACACAGTAATCCGTTACCCTGACTGCTGGAAACAGCGTCAGTTACAGCCCCAATTCAAATGCGACAGCCAAACCAAACCTCAGGCATACCTACATGTGGGCAAAACCTCAGCTGAGGCTTGGCCTTGACAATGCAGGGCCAGAGCTGCCCCCGCCTTTGCCGAGAACCCCACAGCCCTACTTCACAGTCCCCACCTGCCTTCCTGAGGACAGAGTAAAAGGCGGAAAATGATCATGTTAACCTTATCAGCACTGACAAGATTAGGCCTCCTCACATCTTAGGCACTATTTAGAAAAACAAATAGTTTTCTAAGCTATTGGCGCTCACACCTGAAATCCCAGCACTTTGGGAGGCTGAGGCAGAAGGATCGCTTGAGCCCGGGAAGGAGACCAGCCTGAGCAACATAGTGAGACCCCATTACTACAAAAAAAGTAAAAATTAGGTCAGCACAGTGGCTCACGCCTATAATTCCCAGCACTTTGGGAGGCCGAGGCGGGTGGATCACGAGGTCAGGAGATCGAGACTATCCTGGCTAACACGGTGAAACCCTGTCTCTACTAATAAATACAAAAAATTAGCCGGGCATGGTGGCGGGCGCCTGTAGTCCCGGCTACTCGGGAGGCTGAGGCAGGAGAATGGCGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCAAGATTGAGCCACTGCACTCCAGCCTGGGTGACAGAGCAAAGACTGTCTCAAAAAAAAAAAAAAAT
TGCCTGTAGTCCCGGCTACTCGGGAGGCTGAGGCAGGAGAATGGCGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCAAGATTGAGCCACTGCACTCCAGCCTGGGTGACAGAGCAAAGACTGTCTCAAAAAAAAAAAAAAAT
. .
TRID=19-3973223-3973238-A,19-3974295-3974310-A;END=3974311;MOTIFS=A;STRUC=<VC111219>
GT:AL:ALLR:SD:MC:MS:AP:AM
1/1:142,142:141-144,141-143:80,66:43,43:0(6-7)_0(16-17)_0(23-24)_0(29-30)_0(33-34)_0(36-37)_0(38-40)_0(47-49)_0(55-56)_0(61-62)_0(68-69)_0(72-73)_0(76-78)_0(79-80)_0(83-84)_0(87-88)_0(92-93)_0(97-98)_0(107-108)_0(109-110)_0(111-112)_0(114-117)_0(118-119)_0(126-141),0(6-7)_0(16-17)_0(23-24)_0(29-30)_0(33-34)_0(36-37)_0(38-40)_0(47-49)_0(55-56)_0(61-62)_0(68-69)_0(72-73)_0(76-78)_0(79-80)_0(83-84)_0(87-88)_0(92-93)_0(97-98)_0(107-108)_0(109-110)_0(111-112)_0(114-117)_0(118-119)_0(126-141):0.302817,0.302817:0.59,0.59
And 146 reads span it (coverage in the region: 160). However, aligning the reads to the two haplotypes shows recurrent CIGAR patterns, e.g.:
- Sniffles: no support.
- Severus: somatic 1070bp DEL with VAF=0.89.
Several clipped alignments seem to have mismatching bases aligned with a het INS, and the patterns of mismatch seem to be supported by more than one read. These may just be the beginning/end of the same het INS, but it is worth investigating.
Moreover, this TR seems to be methylated with a consistent pattern, which seems to diverge in the clipped alignments (gray: FWD strand, green: REV strand):
We extract all alignments (spanning and non-spanning) with the "Export alignments" feature of IGV. Then, we build a POA graph using abPOA with flags --amb-strand --sort-by-len --result 3 (the TR sequence in the reference is in red):
Note: some interesting regions in or around these genes might be missing from what follows. Go over IGV in detail once again.
We inspect each gene with IGV and we focus on regions that contain clusters of nearby SVs, clusters of clipped alignments, or clusters of SNPs with different patterns. In every IGV screenshot, the top track shows SMHT001 (cirrotic) and the bottom track shows ST001 (healthy). We perform two analyses:
- Do clipped alignments support different variants from spanning alignments? We try to answer this question by extracting all alignments in a window, from both ST001 and SMHT001, with the "Export alignments" feature of IGV. We convert them to FASTA with
samtools fasta -@ 16 file.samand we build a POA graph using abPOA in local alignment mode with flags-m 1 --amb-strand --sort-by-len --result 3(global alignment creates much more complex graphs). We show in blue POA nodes that are traversed by >=3 reads and in red nodes that belong to the reference sequence path. - Do spanning reads in the ST001 230x PacBio BAM support >2 haplotypes? We try to answer this question by extracting just the spanning portion of each read, with the hapestry command that follows, and then by loading the FASTA into Jalview and using MAFFT (preset
FFT-NS-1, speed oriented).
extract_reads_from_windows --output_dir ./reads_with_q/ \
--bam_csv ${SAMPLES_LIST} \
--windows ${FLANKED_WINDOWS_BED} \
--bam_not_hardclipped \
--require_spanning \
--flank_length 0 \
--fetch_max_length 100000 \
--tags NM \
--n_threads 1 \
--force_forward
Often several clipped alignments seem to have mismatching bases aligned with a hom or het INS, and the patterns of mismatch seem to be supported by more than one read. Of course these may just be the beginning/end of the same INS. Often variation occurs inside TRs.
Ng, Stanley WK, et al. "Convergent somatic mutations in metabolism genes in chronic liver disease." Nature 598.7881 (2021): 473-478.
Some notes on the paper. Interestingly every one of their short-read samples is just 31x, but they do take multiple microdissections from the same liver, at controlled distances between microdissections (to probe different clones and prove convergent mutation). It's also interesting that they find structural variants at or near AVCR2A, GPAM, FOXO1, and that they estimate telomere lengths (another analysis for which long reads may be superior).
See directory figures/ng_et_al_2021 for a full list of screenshots. In what follows we only show genes with potentially interesting features.
The POA graph seems to support the presence of multiple INS:
By selecting some arbitrary nodes in each branch of the graph, and by enumerating all paths that use any of these nodes, there seem to be one main haplotype and a second one with smaller read support:
nReads path
106 28157+,28384+,28808+,28877+,28969+,29184+,29329+,29495+,29687+,
6 29687-,29329-,29184-,28877-,28808-,28157-,
However the MSA of all spanning reads seems to show two frequent haplotypes:
- Severus: somatic 1168bp INS with VAF=0.72.
There seems to be just one INS:
Genes with clear associations with liver disease, e.g. genes that are either known genetic risk factors for fatty liver disease, or protect from alcohol-related liver disease. The Goodman lab has interests, tools and techniques to study GCKR, ADH1B, and MLXIPL, so any novel biology related to those three genes would have the lowest activation energy for mechanistic studies.
PNPLA3, TM6SF2, APOE, GCKR, TRIB1, GPAM, MARC1, MTTP, ADH1B, TOR1B, TMC4/MBOAT7, COBLL1, SREBF1, INSR, FTO, PNPLA2, MTARC1, MLXIPL, ADH1C, HFE, ATP7B, FRZB, IL18RAP, FLT3, GDF15, HGFAC, FSTL3, INHBA, INHBB
See directory figures/goodman for a full list of screenshots. In what follows we only show genes with potentially interesting features.
There seem to be multiple nearby INS, which may be the same ones described by the spanning alignments in the BAM:
By selecting some arbitrary nodes in each branch of the graph, and by enumerating all paths that use any of these nodes, there seem to be at least 3 well-supported haplotypes, which combine the 5 INS in different ways:
nReads path
66 16861+,17093+,17223+,17289+,17491+,17515+,17619+,17725+,17806+,
70 16861+,17093+,17223+,17289+,17491+,17515+,17725+,17806+,
4 17806-,77420-,17725-,17619-,17515-,17491-,17289-,77357-,17223-,17093-,16861-,
MSA of all spanning reads:
TRGT detects two haplotypes:
chr4 99588686 .
AATATATATATACACACATATATATATATGTTCATATATATACACACATATATATATGTTCATATATATACACACATATATATATGTTCATATATATACACACATATATATATGTTCATATATATACACACATATATATATGTTCATATATATACACACATATATATATGTTCATATATATACACACATATATATATGTTCATATATATACACACATATATATATGTTCATATATATACACACATATATATATGTTCATATATATACACACATATATATGTTCATATATATACACACATATATATATGTTCATATATATACACACATATATATGTTCATATATATACACACATATATATGTTCATATATATATTCATATATATATATATTTTTTTTTT
AATATATATATACACACATATATATATATGTTCATATATATACACACATATATATATGTTCATATATATACACACATATATATATGTTCATATATATACACACATATATATATGTTCATATATATACACACATATATATGTTCATATATATACACACATATATATGTTCATATATATACACACATATATATGTTCATATATATTCATATATATATGTTCATATATATTCATATATATGTTCATATATATATTCATATATATGTTCATATATATTCATATATATGTTCATATATATATTCATATATATGTTCATATATATATTCATATATATATGTTCATATATATATTCATATATATATGTTCATATATATATTCATATATATATGTTCATATATATATTCATATATATATGTTCATATATATATTCATATATATATGTTCATATATATATTCATATATATATGTTCATATATATATTCATATATATATGTTCATATATATATTCATATATATATGTTCATATATATATTCATATATATGTTCATATATATATTCATATATATATGTTCATATATATATTCATATATATATGTTCATATATATATTCATATATATATGTTCATATATATATTCATATATATATGTTCATATATATATTCATATATATATGTTCATATATATATTCATATATATATGTTCATATATATATTCATATATATATGTTCATATATATATTCATATATATATGTTCATATATATATTCATATATATATATATATTTTTTTT,
AATATATATATACACACATATATATATATGTTCATATATATACACACATATATATATGTTCATATATATACACACATATATATGTTCATATATATACACACATATATATATGTTCATATATATACACACATATATATATGTTCATATATATACACACATATATATATGTTCATATATATACACACATATATATATGTTCATATATATACACACATATATATATGTTCATATATATACACACATATATATGTTCATATATATACACACATATATATATGTTCATATATATACACACATATATATGTTCATATATATACACACATATATATGTTCATATATATATTCATATATATATGTTCATGTATATTCATATATATATGTTCATATATATATTCATATATATGTTCATATATATATTCATATATATATGTTCATATATATATTCATATATATATGTTCATATATATATTCATATATATATGTTCATATATATATTCATATATATATGTTCATATATATATTCATATATATATGTTCATATATATATTCATATATATATGTTCATATATATATTCATATATATATGTTCATATATATATTCATATATATATGTTCATATATATATTCATATATATATGTTCATATATATATTCATATATATATGTTCATATATATATTCATATATATATGTTCATATATATATTCATATATATATGTTCATATATATATTCATATATATATGTTCATATATATATTCATATATATATGTTCATATATATATTCATATATATATGTTCATATATATATTCATATATATGTTCATATATATATTCATATATATATGTTCATATATATATTCATATATATATGTTCATATATATATTCATATATATGTTCATATATATATTCATATATATATATATTTTTTTTTT
And 250 reads span them (coverage in the region: 300), but with no consistent CIGAR pattern.
Sniffles reports one 64bp INS:
chr4 99588927
A
AATGTTCATATATATATTCATATATATGTTCATATATATTCATATATATGTTCATATATATATTC
Severus: somatic 389bp INS with VAF=0.90.
TATATATATGTTCATATATATATTCATATATATATGTTCATATATATATTCATATATATATGTTCATATATATATTCATATATATATGTTCATATATATATTCATATATATATGTTCATATATATATTCATATATATATGTTCATATATATATTCATATATATATGTTCATATATATATTCATATATATGTTCATATATATATTCATATATATATGTTCATATATATATTCATATATATATGTTCATATATATATTCATATATATATGTTCATATATATATTCATATATATATGTTCATATATATATTCATATATATATGTTCATATATATATTCATATATATATGTTCATATATATATTCATATATATATGTTCATATATATATTCAT
There seem to be more than two haplotypes:
TRGT detects two haplotypes:
chr19 54168770 .
CCTTTCTTTCTTTCTTTTCTTTTCTTTCTTTTCTTTTCTTTTCTTTTCTTTTCTTTTCTTTTCTTTTCTTTTCTTTTCTTTTCTTTTCTTTTCTTTTCTTTTCTTTTCTTTTCTTTTCTTTCTTTCCTTTCTTTCTTCTTTCTTTCCTTCCTTCCTTCCTTCCTTCCTTCCTTCCTTCCTTCCTTCCTTCCTTCCTTCTTCTTTCTCTCTCTCTCTCTCTCTCTCTATATATATATATATATTTTTTTTTTTTTCTTTTCTTTTCTTTTCTTTTTTTTTTTT
CCTTTCTTTCTTTCTTTTCTTTTCTTTCTTTTCTTTTCTTTTCTTTTCTTTTCTTTTCTTTTCTTTTCTTTTCTTTTCTTTTCTTTTCTTTTCTTTTCTTTTCTTTTCTTTTCTTTTCTTTCTTTCCTTTCTTTCTTCTTTCTTTCCTTCCTTCCTTCCTTCCTTCCTTCCTTCCTTCCTTCCTTCCTTCCTTCCTTCTTCTTTCTCTCTCTCTCTCTCTCTCTCTATATATATATATATTTTTTTTTTTTTTCTTTTCTTTTCTTTTCTTTTTTTTTTTT,
CCTTTCTTTCTTTCTTTTCTTTCTTTTCTTTCTTTCTTTCTTTTCTTTCTTTCTTTTTCTTTCTTTCTTTCTTTCTTTCTTTCTTTCTTTCTTTCTTTCTTTCTTTCTTTCTTTCTTTCTTTCTTTTCTTTCTTTCTTTCTTTCTTTCTTCCTTTCTTTCTTTTCTTTCCTTCCTTCCTTCCTTCCTTCCTTCCTTCCTTCCTTCCTTCCTTCCTTTCTTTCTCTCTCTCTCTCTCTCTATATATATATATATATATTTTTCTTTTCTTTTCTTTTCTTTTTTTTTTTTT
And 123 reads span them (coverage in the region: 140). Aligning the reads to the two haplotypes shows some recurrent CIGAR edits, e.g.:
- Sniffles: no support.
- Severus: no support (but there is a somatic DEL in the last TR of this gene).
There seem to be more than two haplotypes:
TRGT detects only one haplotype:
chr19 54181615 .
CAGGGAGGGAGGGAGGGAGGGAGGGAAGGAGTGAAGAAGGGAAGAAAGAAGGGAGGGAAGGAGGGAAGGAAGGAGGGAGGGAAGGAGGGAAGGAAGAAGGGAGGGAAGGAGGGAAGGAAGGAGGGAGGGAGGGAAGGAGGGAAGGAAGGAGGGAGGGAAGGAGGGAAGGAAGGAGGGAGGGAGGGAAGGAGGGAAGGAAGGAGGGAGGGAAGGAGGGAAGGAAGGAGGGAGGGAGGGAAGGAGGGAAGGAAGGAGGGAGGGAGGGAAGGAGGGAAGGAAGGAGGGAGGGAGGGAAGGAGGGAAGGAGGGAGGGAGGGAGGGAAGGAGGGAAGGAAGGAGGGAGGGAAGGAGGGAAGGAAGGAGGGAGGGAAGGAAGGAGGGAGGGAGGGAAGGAGGGAAGGAAGGAGGGAGGGAAGGAAGGAGGGAAGGAAGGAAG
CAGGGAGGGAGGGAGGGAGGGAGGGAAGGAGTGAAGAAGGGAAGAAAGAAGGGAGGGAAGGAGGGAAGGAAGGAGGGAGGGAAGGAGGGAAGGAAGAAGGGAGGGAAGGAGGGAAGGAAGGAGGGAGGGAGGGAAGGAGGGAAGGAAGGAGGGAGGGAAGGAGGGAAGGAAGGAGGGAGGGAGGGAAGGAGGGAAGGAAGGAGGGAGGGAAGGAGGGAAGGAAGAAGGGAGGGAAGGAGGGAAGGAAGGAGGGAGGGAGGGAAGGAGGGAAGGAAGGAGGGAGGGAGGGAAGGAGGGAAGGAAGGAGGGAGGGAGGGAGGGAAGGAGGGAAGGAAGGAGGGAGGGAAGGAGGGAAGGAAGGAGGGAGGGAAGGAAGGAGGGAGGGAAGGAGGGAAGGAAGGAGGGAGGGAAGGAAGGAGGGAAGGAAGGAGGGAAGGAAGGAGGGAGGGAAGGAAGGAGGGAGGGAAGGAAGGAGGGAGGGAGGGAGGGAAGGAGGGAAGGAAGGAGGGAGGGAAGGAGGGAAGGAAGGAGGGAGGGAAGGAAGGAGGGAGGGAAGGAGGGAAGGAAGGAGGGAGGGAAGGAAGGAGGGAAGGAAGGAAG
And 87 reads span it (coverage in the region: 100). Aligning the reads to the REF and ALT does not show recurrent CIGAR patterns:
- Sniffles: no support.
- Severus: somatic 165bp INS with VAF=0.40.
GAGGGAAGGAGGGAAGGAAGGAGGGAGGGAAGGAAGGAGGGAAGGAAGGAGGGAAGGAAGGAGGGAGGGAAGGAAGGAGGGAGGGAAGGAAGGAGGGAGGGAGGGAGGGAAGGAGGGAAGGAAGGAGGGAGGGGAAGGAGGGAAGGAAGGAGGGAGGGAAGGAAG
This variation is unlikely to be somatic, but it is interesting because it affects the cirrotic sample but not the healthy sample.
The variant seems to be a 1.2kb LINE1 INS:
It seems that all left clips continue on chr20 or on a consistent chrUn, and that all right clips are short and unmapped:
The simplest explanation might be that chr20 contains a new copy of that intronic segment, and in fact the intronic segment is an L1:
There seem to be more than one INS, which may be the same ones described by the spanning alignments in the BAM:
By selecting some arbitrary nodes in each branch of the graph, and by enumerating all paths that use any of these nodes, there seem to be mainly two haplotypes, with a third hap. only supported by one read:
nReads path
45 10542-,10389-,10198-,10014-,9858-,9690-,
45 10542-,64302-,10389-,64155-,63933-,10198-,10014-,9858-,9690-,
1 10542-,10389-,10198-,9858-,9690-,
MSA of all spanning reads:
TRGT detects two haplotypes:
chr19 7152980 .
TACACACACACACACACCCCACACACACACACACCACACACACACACCACACACACACACACCACACACCCCCCCACACACACACACCACACACATCACACAACACACCACACATACACACACCACAAACACACAACCACACACACACACCACACACCACACACACCACACAACACACCACACATACACACACCACAAACACACAACCACACACACACCACACACACCACACACCACACACACGCACCACACACACACCATACACGCACCACACACACACTACATGCACACCTCACACACCACACCACACACCGCACACACACCACACAAATCACACACACACCACACACACCCACGCCACACACCACAGACCACACACCACACACCCCACACACACCATACACGCACCACACACACACCACATACACACTTCACACACACCACACACCACACAAATCACACACACCACCACACCCACGCCACACACCACACACACACCACACACGCACCACACACACACACAGTGAAG
TACACACACACACACACCCCCCACACACACACACCACACACACACACCACACACACCACACATACACACACCACACACCCCCCCCACACACACACCACACACATCACACAACACACCACACACCACACATACACACACCACAAACACACAACCACACACACACACCACACACACCACACACACCACACAACACACCACACATACACACACCACAAACACACAACCACACACACCACACACACCACACACCACACACACCACACAACACACCACACATACACACACCACAAACACAACCACACACACACACCACACACACCACACACACCACACACGCACCACACACACACCATACACGCACCACACACACACTACATGCACACCTCACACACCACACCACACACCGCACACACACCACACAAATCACACACACACCACACACACCCACGCCACACACCACAGACCACACACACCACACACCCCACACACACCACACACCACACACACCAAACACACCATACATGCACCACACACACACCACATGCACACCTCACACACCATACCACACACCACACACACCACACAAATCACACACATACTACACACACCCACGCCACACACCACACACACACCACACACGCCACACAACGGACCACACATACACCACACACCACAAACACAACCACACACACACTATACACACACACCACACACACCACACAACAGACCACACATACACACACCACACACCACAAACACACACCATACACACACACCACACATACCACACACACCACACACACAACCACACACACACACCACACACACAACACACAACAGACCACACATACACACACCACAAACACAACCACACACACCACACATACCACACACACACCACACATACCACACACACACCACACACACAACCACACACACCACACCATACACACAACCACACACACACCATACACGCACCACACACACACCACATACACACTTCACACACACCACACACCACACAAATCACACACACCACCACACCCACGCCACACACCACACACACACCACACACGCACCACACACACACACAGTGAAG,
TACACACACACACACCCCCCCCACACACACACACCACACACACACACCACACACACCACACATACACACACCACACACCCCCCCACACACACACCACACACATCACACAACACACCACACATACACACACCACAAACACACAACCACACACACACACCACACACCACACACACCACACAACACACCACACATACACACACCACAAACACACAACCACACACACACCACACACCACACACCACACACACCACACAACACACCACACATACACACACCACAAACACACAACCACACACACACACCACACACACCACACACACCACACACGCACCACACACACACCATACACGCACCACACACACACTACATGCACACCTCACACACCACACCACACACCGCACACACACCACACAAATCACACACACACCACACACACCCACGCCACACACCACAGACCACACACACCACACACCCCACACACACCACACACCACACACACCAAACACACCATACATGCACCACACACACACCACATGCACACCTCACACACCATACCACACACCACACACACCACACAAATCACACACATACTACACACACCCACGCCACACACCACACACCACACACGCCACACAACGGACCACACATACACCACACACCACAAACACAACCACACACACACTATACACACACACCACACACACCACACAACAGACCACACATACACACACCACACACCACACACACCATACACACACACCACACATACCACACACACCACACACACAACCACACACACACACCACACACACAACACACAACAGACCACACATACACACACCACAAACACAACCACACACACCACACCATACACACACACCATACACGCACCACACACACACCACATACACACTTCACACACACCACACACCACACAAATCACACACACCACCACACCCACGCCACACACCACACACACACCATACACGCACCACACACACACTACATGCACACCTCACACACCACACCACACACCGCACACACACCACACAAATCACACACACACCACACACACCCACGCCACACACCACAGACCACACACACCACACACCCCACACACACCACACACCACACACACCAAACACACCATACATGCACCACACAACCACATGCACACCTCACACACCATACCACACACCACACACACCACACAAATCACACACATACTACACACACCCACGCCACACACCACACACACACCACACACGCCACACAACGGACCACACATACACCACACACCACAAACACAACCACACACACACTATACACACACACCACACACACCACACAACAGACCACACATACACACACCACACACCACACACACCATACACACACACCACACATACCACACACACCACACACACAACCACACACACACACCACACACACAACACACAACAGACCACACATACACACACCACAAACACAACCACACACACCACACCATACACACACACCATACACGCACCACACACACACCACATACACACTTCACACACACCACACACCACACAAATCACACACACCACCACACCCACGCCACACACCACACACGCACCACACACACACACAGTGAAG
And 135 reads span them (coverage in the region: 160). Aligning the reads to the two haplotypes does not show recurrent CIGAR patterns.
Sniffles report one 77bp INS:
chr19 7153186
A
CCACACACACACCACACACCACACACCACACACACCACACAACACACCACACATACACACACCACAAACACACAACCA
Severus: only a germline 1102bp INS.
There seem to be just 3 INS, which may be the same ones described by spanning alignments in the BAM:
By selecting some arbitrary nodes in each branch of the graph, and by enumerating all paths that use any of these nodes, there seems to be just one haplotype:
nReads path
74 1268-,730-,312-,158-,103-,71038-,70873-,70683-,70611-,
However, the MSA of all spanning reads seems to show two haplotypes:
TRGT detects two haplotypes:
chr13 27998700 .
ATGTGTGTGTGTGTGTATATGTAACTATGTGTGTGCAGTTATATATATATAAATATATATATTTATATATAAATATATGTTATATATACATAAATATATGTTATATATATTTATATATATTTATATACATATAAATATATATATTTATATACATATATAAAAATATATATATATTTATATATAAATATATGTTATATATATTTATATACATATAAATATATATTTATATACATATAAATATATATTTATATACATATAAATATATGTATATACATATAAATATATATTTATATACATATAAATATATATTTATATACATATAAATATATGTATATACATATATATATGTATATACATATAAATATATATGTATATACATATAAATATATATGTATATACATATAAATATATATAATATTTATATACATATATATATTTATATTTATATGTATATACATATATATTTATATTTATATGTATATACATATATATATTTATATTTATATGTATATACATATATATATGTATATAAATATATATATTTATATGTATATAATATACATACATATATATATAT ATGTGTGTGTGTGTGTATATGTAACTATGTGTGTGCAGTTATATATATATAAATATATATATTTATATATAAATATATGTTATATATATAAATATATGTTATATATATAAATATATGTTATATATATAAATATATGTTATATATATAAATATATGTTATATATATAAATATATGTTATATATATTTATATATATTTATATACATATAAATATATATATTTATATACATATAAATATATATATTTATATACATATAAATATATATATTTATATACATATAAATATATATATTTATATACATATAAATATATATATTTATATACATATAAATATATATATTTATATACATATAAATATATATATTTATATACATATAAATATATATATTTATATATAAATATATGTTATATATATTTATATGCATATAAATATATATTTATATGCATATAAATATATATTTATATGCATATAAATATATATTTATATACATATAAATATATATGTATATACATATAAATATATGTATATACATATAAATATATATGTATATACATATATATATGTATATACATATATATATATGTATATACATATAAATATATATGTATATACATATAAATATATGTATATACATATAAATATATATGTATATACATATAAATATATATGTATATACATATAAATATATATGTATGTATATTATATACATATAAATATATATATTTATATACATATAAATATATATATTTATATACATATAAATATATATTTATATACATATAAATATATATATTTATATACATATAAATATATATTTATATACATATAAATATATATTTATATACATATAAATATATATATTTATATACATATAAATATATATTTATATACATATAAATATATATATTTATATACATATAAATATATATTTATATACATATAAATATATATATTTATATACATATAAATATATATTTACATACATATAAATATATATTTACATACATATAAATATATACATTTATATACATATAAATATATACATATATACATATAAATATATATTTATATACATATAAATATATATATTTATATATATTTATATATATATAAATATATATATATTTGTATATATATATAAATATATATATATTTATATATATATAAATATATATATATTTATATATATATACATATAAATATATATTTATATACATATAAATATATAATTATATACATATAAATATATAATTATATACATATAAATATATATAATTATATACATATAAATATATATTTATATACATATAAATTTATATACATATAAATATAAATATTTATATTTATATACATATAAATATATATATTTATATTTATATACATATATATTTATATTTATATGTATATACATATATATATTTATATTTATATGTATATACATATATATATTTATATTTATGTATATACATATATATATTTATATTTATATGTATATAAATATATATATTTATATTTATATAATATACATACATATATATATAT,
ATGTGTGTGTGTGTGTATATGTAACTATGTGTGTGCAGTTATATATATATAAATATATATATTTATATATAAATATATGTTATATATATAAATATATGTTATATATATAAATATATGTTATATATATAAATATATATATTTATATATAAATATATGTTATATATATAAATATATGTTATATATATTTATATATATTTATATACATATAAATATATATATTTATATACATATAAATATATATATTTATATACATATAAATATATATATTTATATACATATAAATATATATATTTATATACATATAAATATATATATTTATATACATATAAATATATATATTTATATACATATAAATATATATATTTATATACATATAAATATATATATTTATATATAAATATATGTTATATATATTTATATGCATATAAATATATATTTATATGCATATAAATATATATTTATATGCATATAAATATATATTTATATACATATAAATATATATGTATATACATATAAATATATGTATATACATATAAATATATATGTATATACATATAAATATATATGTATATACATATATATATATGTATATACATATAAATATATATGTATATACATATATATATGTATATACATATATATATATGTATATACATATAAATATATATGTATATACATATAAATATATGTATATACATATAAATATATATGTATATACATATAAATATATATGTATATACATATAAATATATATGTATGTATATTATATACATATAAATATATATATTTATATACATATAAATATATATATTTATATACATATAAATATATATTTATATACATATAAATATATATATTTATATACATATAAATATATATTTATATACATATAAATATATATTTATATACATATAAATATATATATTTATATACATATAAATATATATTTATATACATATAAATATATATATTTATATACATATAAATATATATTTACATACATATAAATATATATTTACATACATATAAATATATACATTTATATACATATAAATATATACATATATACATATAAATATATATTTATATACATATAAATATATATATTTATATATATTTATATATATATAAATATATATATATTTGTATATATATATAAATATATATATATTTATATATATATAAATATATATATATTTATATATATATACATATAAATATATATTTATATACATATAAATATATAATTATATACATATAAATATATAATTATATACATATAAATATATATAATTATATACATATAAATATATATTTATATACATATAAATTTATATACATATAAATATAAATATTTATATTTATATACATATAAATATATATATTTATATTTATATACATATATATTTATATTTATATGTATATACATATATATATTTATATTTATATGTATATACATATATATATTTATATTTATGTATATACATATATATATTTATATTTATATGTATATAAATATATATATTTATATTTATATAATATACATACATATATATATAT
And 138 reads span them (coverage in the region: 150). It's unclear if aligning the reads to the two haplotypes shows recurrent CIGAR patterns, e.g.:
- Sniffles: no support.
- Severus: only a germline 964bp INS.
There seems to be just one INS, as described by spanning alignments in the BAM:
There seems to be just one INS, as described by spanning alignments in the BAM:
There seems to be just one INS, as described by spanning alignments in the BAM:
There seem to be just two haplotypes:
TRGT detects two haplotypes:
chr4 3440598 .
TCCCAGCACCCACCATGCCTCCACATTCACCGTGCCCCCAGGGCCCACCATGACCTCCCCCCCAACCCCCAACCCCCCATTCACCGTGCCCCCAGCACCCACCATGCCTCCACATTCACCGTGCCCCCAGGGCCCACCATGACCTCCCCCCCAACCCCCAACCCCCCATTCACCGTGCCCCAGGGCCCACCATGACCTCCCCCCCAACCCCCAACCC
TCCCAGCACCCACCATGCCTCCACATTCACCGTGCCCCCAGGGCCCACCATGACCTCCCCCCCAACCCCCAACCC,
TCCCAGCACCCACCATGCCTCCACATTCACCGTGCCCCCAGGGCCCACCATGACCTCCCCCCCAACCCCCAACCCCCCATTCACCGTGCCCCCAGCACCCACCATGCCTCCACATTCACCGTGCCCCCAGGGCCCACCATGACCTCCCCCCCAACCCCCAACCCCCCATTCACCGTGCCCCAGGGCCCACCATGACCTCCCCCCCAACCCCCAACCCCCCATTCATCGTGCCCCCAGGGCCCACCATGACCTCCCCCCCAACCCCCAACCCCCCATTCACCGTGCCCCAGGGCCCACCATGACCTCCCCCCCAACCCCCAACTC
. .
TRID=4-3440598-3440776-CCCAGCACCCACCATGCCTCCACATTCACCGTGCCCCCAGGGCCCACCATGACCTCCCCCCCAACCCCCAACCCCCCATTCACCGTGCC;END=3440814;MOTIFS=CCCAGCACCCACCATGCCTCCACATTCACCGTGCCCCCAGGGCCCACCATGACCTCCCCCCCAACCCCCAACCCCCCATTCACCGTGCC;STRUC=<VC172810>
GT:AL:ALLR:SD:MC:MS:AP:AM
1/2:74,323:71-75,317-325:73,59:1,5:0(0-74),0(0-323):0.831461,0.723596:0.65,0.63
And 132 reads span them (coverage in the region: 140). Aligning the reads to the two haplotypes seems to show some recurrent CIGAR operations, e.g.:
- Sniffles: no support.
- Severus: only a germline 142bp DEL and a germline 106bp INS.
Brzozowska, N. et al. "Selection for somatic escape variants in SERPINA1 in the liver of patients with alpha-1 antitrypsin deficiency." Nature Genetics (2025): 1-9.
There seems to be just the 2kb hom INS described by the BAM:
It might be another copy of SERPINA2:
Brunnet, S. et al. "Somatic mutations and clonal dynamics in healthy and cirrhotic human liver". Nature 2019.
They do 30-70x Illumina WGS per sample. They call SVs with BRASS (breakpoint-oriented), and they remove calls supported by >2 reads. They run another tool for copy-number calling.
"Genomically, one middle-aged, healthy liver looks much like any other: a community of small, tightly-packed clones, each comprising a few hundred cells and containing around 1-1.5k mutations that come from a limited palette of signatures. Unhealthy livers diverge from this norm and instead exhibit large dynasties of clones, which are sequestered by bands of fibrosis and have a repertoire of signatures that is more variable, more vigorous and more regionally variegated."
They also mention that mature hepatocytes at the late stage of differentiation are polyploid and multinuclear. And they observe aneuploidy at whole-chromosome or arm level., unbalanced translocations, and patterns indicative of chromothripsis. These suggest that sustained toxicity and regeneration increase mitotic stress in hepatocytes.
Other genes mentioned but with no iteresting event in ST001:
ACVR2A, ARID2, TSC2, ALB, CDKN2A
We tried to run SAVANA 1.3.6 on the 230x PacBio ST001 liver. Specifically, we ran savana to --pb --tumour ~{aligned_bam} --ref ~{reference_fa} --g1000_vcf 1000g_hg38 --contigs /savana/example/contigs.chr.hg38.txt. Remarks:
- SAVANA seems to be designed for tumor-normal input, but it has a tumor-ony mode (
to) and we used that in the experiment. - It seems to be designed for ONT, but it has a PacBio mode (
--pb) and we used that in the experiment.
Unfortunately it crashed after 6.5h as follows:
[...]
INFO:root:Breakpoint 25663 rejected
INFO:root:Testing validity of 25716 in interval from 7393 to 25822 yields statistic 3.4292046763790474
INFO:root:Breakpoint 25716 accepted
Traceback (most recent call last):
File "/opt/conda/bin/savana", line 8, in <module>
sys.exit(main())
^^^^^^
File "/opt/conda/lib/python3.11/site-packages/savana/savana.py", line 762, in main
args.func(args)
File "/opt/conda/lib/python3.11/site-packages/savana/savana.py", line 367, in savana_tumour_only
savana_cna(args, True)
File "/opt/conda/lib/python3.11/site-packages/savana/savana.py", line 277, in savana_cna
fit_absolute.fit_absolute_cn(outdir, log2r_cn_path, allele_counts_bed_path, args.sample,
File "/opt/conda/lib/python3.11/site-packages/savana/fit_absolute.py", line 167, in fit_absolute_cn
fits = cnfitter.estimate_grid_distances(min_cellularity, max_cellularity, cellularity_step, min_ploidy, max_ploidy, ploidy_step, relative_CN, weights=weights, distance_function=distance_function)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/savana/cn_functions.py", line 233, in estimate_grid_distances
d = acn_distance(relative_CN, cur_pur, cur_ploi, weights=weights, distance_function=distance_function)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/savana/cn_functions.py", line 181, in acn_distance
acn = [relative_to_absolute_CN(x, purity, ploidy) for x in relative_CN]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/savana/cn_functions.py", line 181, in <listcomp>
acn = [relative_to_absolute_CN(x, purity, ploidy) for x in relative_CN]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/savana/cn_functions.py", line 173, in relative_to_absolute_CN
acn = ploidy + (relative_CN - 1)*(ploidy+(2/purity)-2)
~^~~~~~~
ZeroDivisionError: division by zero
On a VM with 32 cores and 64GB of RAM, it used 28 cores and 20GB of RAM, and it cost $10.
We run Severus in single-sample mode, using the PON and VNTR files provided in the repo:
severus --target-bam aligned.bam --PON PoN_1000G_hg38.tsv.gz --vntr-bed human_GRCh38_no_alt_analysis_set.trf.bed.gz --out-dir ./severus
See this concept sheet. This person has already developed a tool and a paper, and is joining Tim's lab soon.
Cells under stress might replicate more often. TERT is the telomerase gene, and mutations in its promoter might affect telomere length. E.g. mutations could lead to liver cancer if they allow to escape checks on telomere length (Natalia).
Separate reads from mature hepatocytes vs progenitors. Using methylation or Fiber-seq. May depend on how good the annotation of the cell types we want is. However we could do it reference-free, by just clustering the methylation patterns of the reads.
All 60x PacBio samples should be available in the data portal (in the other section than benchmarking). But the portal is down now.
Joining the SV donors paper makes sense, esp. for doing a deeper dive on some genes or types of variation.
Might be useful also for detecting global splicing patterns (i.e. systematic differences in which transcripts are expressed), induced by a global splicing cofactor gene (Natalia).
According to Peter Park, non-human reads are in the BAMs as unmapped reads (i.e. they were not removed during library prep), so this analysis is doable. He points to a similar analysis on the UK Biobank.
Hepatitis viruses in liver BAMs?
https://pmc.ncbi.nlm.nih.gov/articles/PMC12339610/
https://trtools.readthedocs.io/en/stable/source/prancSTR.html
https://academic.oup.com/bioinformatics/article/40/8/btae485/7723996
Which liver genes have TRs in exons?
https://pubmed.ncbi.nlm.nih.gov/40004542/
https://www.mdpi.com/2073-4425/16/2/213
RNA aberrant splicing: https://github.com/gagneurlab/fraser?tab=readme-ov-file
https://www.nature.com/articles/s41586-019-1670-9
https://www.nature.com/articles/s41586-021-03974-6
https://pubmed.ncbi.nlm.nih.gov/31597092/
https://www.nature.com/articles/s41467-024-55325-4
Hepatocites become tetraploid at some point in their lifetime (Natalia).
Different types of hepatocyte:
- Zone 1 (peri-portal vein)
- Zone2
- Zone3 (peri-central vein) There is an oxygen gradient, since blood comes in with a lot of oxygen, and the hepatocytes in the three zones might have slightly diffeent metabolic function. However, they do not seem to separate in RNA umaps.
If there is rewiring of some metabolic pathways due to somatic mutations, in which zone is it happening? Maybe we could use methylation to assign variants to zones.
Other cell types in the liver:
- Cholangiocytes
- Kupfer cells.