DNA and sequencing (mostly Illumina)
Many slides adapted from Ben Langmead. Thanks, Ben!
https://langmead-lab.org/teaching-materials/
What is DNA?
• There are many types of biomolecules
we pretty exclusively focus on these in class
• Carbohydrates, lipids, proteins, nucleic acids
• DNA is a type of nucleic acid (deoxyribonucleic acid)
• DNA stores all the genetic information that a particular organism needs to
survive
DNA is stored in nearly every human cell
Question: which cells don’t have DNA?
https://en.wikipedia.org/wiki/DNA
DNA, genes, RNA, and proteins
[Diagram: DNA → RNA → protein]
DNA and the double helix
[Figure: double helix with paired bases and the 5’ and 3’ ends labeled]
DNA, chromatid, chromosome
Some useful facts about DNA
• A human “genome” is about 3.1 Gb (gigabases)
  • Just one side of a double helix
• Humans are 99.9% genetically identical
• A great overestimate of a person’s variability is 3M genetic variants
• If we take the union of all single nucleotide variants, it’s only ~8M (> 5% allele frequency)
• …so why sequence DNA?
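The 3M figure follows from the two numbers above it. A minimal sketch of that arithmetic, assuming "99.9% identical" is read as a per-base fraction of the 3.1 Gb genome (the constant and function names are illustrative, not from the slides):

```python
GENOME_SIZE_BP = 3.1e9       # "about 3.1 Gb", one side of the double helix
FRACTION_IDENTICAL = 0.999   # "99.9% genetically identical"

def expected_differences(genome_size_bp, fraction_identical):
    """Number of positions expected to differ between two genomes."""
    return genome_size_bp * (1.0 - fraction_identical)

differing = expected_differences(GENOME_SIZE_BP, FRACTION_IDENTICAL)
print(f"about {differing / 1e6:.1f} million differing positions")
```

Under that reading, the result is roughly 3.1 million positions, the same order as the 3M on the slide.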
Genomics technology
• Sanger DNA sequencing: 1977-1990s
• DNA microarrays: since mid-1990s
• 2nd-generation DNA sequencing: since ~2007
• 3rd-generation & single-molecule DNA sequencing: since ~2010
Sanger sequencing
Fred Sanger, 1918-2013
“Chain termination” sequencing, 1977-1990s
First practical method invented by Fred Sanger in 1977. Initially used to sequence shorter genomes, e.g. viral genomes 10,000s of bases long.
[Images: Fred Sanger, and not-so-high-throughput Sanger sequencing, from episode 3 of the PBS documentary “DNA”]
Sequencing
No sequencing technology yet invented can read much more than 10,000 nucleotides at a time with reasonable cost, throughput, and accuracy.
Instead, there’s a vigorous race to see whose sequencer can read “short” fragments of DNA (100s of nucleotides) with the best cost, throughput, and accuracy.
Headlines (source: nytimes.com)
• “Decoding DNA With Semiconductors,” by Nicholas Wade. Published: July 20, 2011
• “Company Unveils DNA Sequencing Device Meant to Be Portable, Disposable and Cheap,” by Andrew Pollack. Published: February 17, 2012
• “Cost of Gene Sequencing Falls, Raising Hopes for Medical Advances,” by John Markoff. Published: March 7, 2012
Sequencing
Since 2005, many DNA sequencing instruments have been described and released. They are based on a few different principles:
• Synthesis / ligation
• SMRT cell
• Nanopore
Sequencing by synthesis (“massively parallel sequencing”) provides the greatest throughput and is the most prevalent today.
Pictures:
http://www.illumina.com/systems/miseq/technology.ilmn
http://www.genengnews.com/gen-articles/third-generation-sequencing-debuts/3257/
DNA: double helix
Base pairs: A pairs with T, G pairs with C
http://ghr.nlm.nih.gov/handbook/basics/dna
Forward strand: TCACACTGAGCGTGCTG
Reverse strand: AGTGTGACTCGCACGAC
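The pairing rule is enough to compute one strand from the other. A minimal sketch, assuming plain uppercase A/C/G/T strings (the function names are illustrative): `complement` reproduces the reverse strand as drawn on the slide, base by base under the forward strand, and `reverse_complement` gives the same strand read in the opposite direction.

```python
# Watson-Crick pairing from the slide: A pairs with T, G pairs with C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def complement(strand):
    """Return the base-by-base complement, in the same left-to-right order."""
    return "".join(COMPLEMENT[base] for base in strand)

def reverse_complement(strand):
    """Return the complementary strand read in the opposite direction."""
    return complement(strand)[::-1]

forward = "TCACACTGAGCGTGCTG"
print(complement(forward))          # AGTGTGACTCGCACGAC
print(reverse_complement(forward))  # CAGCACGCTCAGTGTGA
```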
Your genome
CGTCTGGGGGGTATGCACGCGATAGCATTGCGAGACGCTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTG
Reads
GTATGCACGCGATAG TATGTCGCAGTATCT CACCCTATGTCGCAG GAGACGCTGGAGCCG
TAGCATTGCGAGACG GGTATGCACGCGATA TGGAGCCGGAGCACC CGCTGGAGCCGGAGC
TGTCTTTGATTCCTG CGCGATAGCATTGCG GCATTGCGAGACGCT CCTATGTCGCAGTAT
GACGCTGGAGCCGGA GCACCCTATGTCGCA GTATCTGTCTTTGAT CCTCATCCTATTATT
TATCGCACCTACGTT CAATATTCGATCATG GATCACAGGTCTATC ACCCTATTAACCACT
CACGGGAGCTCTCCA TGCATTTGGTATTTT CGTCTGGGGGGTATG CACGCGATAGCATTG
GTATGCACGCGATAG ACCTACGTTCAATAT TATTTATCGCACCTA CCACTCACGGGAGCT
GCGAGACGCTGGAGC CTATCACCCTATTAA CTGTCTTTGATTCCT ACTCACGGGAGCTCT
CCTACGTTCAATATT GCACCTACGTTCAAT GTCTGGGGGGTATGC AGCCGGAGCACCCTA
GACGCTGGAGCCGGA GCACCCTATGTCGCA GTATCTGTCTTTGAT CCTCATCCTATTATT
TATCGCACCTACGTT CAATATTCGATCATG GATCACAGGTCTATC ACCCTATTAACCACT
CACGGGAGCTCTCCA TGCATTTGGTATTTT CGTCTGGGGGGTATG CACGCGATAGCATTG
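Reads like these can be mimicked by copying short substrings from random positions of a genome. A minimal sketch, assuming error-free reads of fixed length drawn uniformly from the forward strand only (`sample_reads` and its parameters are illustrative names, not part of the slides; real sequencers add errors and sample both strands):

```python
import random

def sample_reads(genome, read_length, num_reads, seed=0):
    """Draw num_reads substrings of read_length from random genome positions."""
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    last_start = len(genome) - read_length
    reads = []
    for _ in range(num_reads):
        start = rng.randint(0, last_start)  # randint is inclusive on both ends
        reads.append(genome[start:start + read_length])
    return reads

genome = "CGTCTGGGGGGTATGCACGCGATAGCATTGCGAGACGCTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTG"
reads = sample_reads(genome, read_length=15, num_reads=8)
for read in reads:
    print(read)
```

Every read is a substring of the genome, but nothing in the read itself says where it came from.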
Your genome
CGTCTGGGGGGTATGCACGCGATAGCATTGCGAGACGCTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTG
Reads: 100 nt
Your genome: 100,000,000 nt
[Figure: a 100 nt read drawn against the 100,000,000 nt genome, marked with “?”]
The sequencing Oracle
Your genome
chr1: CGTCTGGGGGGTATGCACGCGATAGCATTGCGAGACGCTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTG
chr2
[Figure sequence: double stranded DNA shown as a double helix and as a “lego version” with A-T and G-C pairs. The two strands are separated into single stranded templates, and DNA polymerase then adds complementary bases to each template one at a time until both are double stranded again.]
More details: Accurate whole human genome sequencing using
reversible terminator chemistry. Nature. 2008 Nov 6;456(7218):53-9
Input DNA
CCATAGTATATCTCGGCTCTAGGCCCTCATTTTTT
CCATAGTATATCTCGGCTCTAGGCCCTCATTTTTT
CCATAGTATATCTCGGCTCTAGGCCCTCATTTTTT
CCATAGTATATCTCGGCTCTAGGCCCTCATTTTTT
Cut into snippets
CCATAGTA TATCTCGG CTCTAGGCCCTC ATTTTTT
CCA TAGTATAT CTCGGCTCTAGGCCCTCA TTTTTT
CCATAGTAT ATCTCGGCTCTAG GCCCTCA TTTTTT
CCATAG TATATCT CGGCTCTAGGCCCT CATTTTTT
Deposit on slide
CCATAG (drawn vertically, attached to the slide)
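The "cut into snippets" step can be sketched as cutting each copy of the input DNA at a few random positions. This is only an illustration: the number of cuts, the uniform choice of cut points, and the name `cut_into_snippets` are assumptions, not a description of the actual library preparation chemistry.

```python
import random

def cut_into_snippets(dna, num_cuts, rng):
    """Cut one DNA copy at num_cuts distinct random positions."""
    cut_points = sorted(rng.sample(range(1, len(dna)), num_cuts))
    boundaries = [0] + cut_points + [len(dna)]
    return [dna[a:b] for a, b in zip(boundaries, boundaries[1:])]

input_dna = "CCATAGTATATCTCGGCTCTAGGCCCTCATTTTTT"
rng = random.Random(1)  # seeded so the sketch is repeatable
copies = [cut_into_snippets(input_dna, num_cuts=3, rng=rng) for _ in range(4)]
for snippets in copies:
    print(" ".join(snippets))
```

Because each copy is cut at different places, snippets from different copies overlap one another.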
Each DNA ‘cluster’ is about 1-2 microns
Flow cell with several lanes and capable of sequencing billions of reads
Prepped sequences “flow” on the flow cell and bind to wells
[Figure sequence: three single stranded templates (CCATAG, CCACGG, CTTAAG) flow across the flow cell and bind to it one at a time.]
tinyurl.com/cs121sp24
Billions of microwells, each containing “one” sequence
[Figure: templates (billions of them!) attached to the slide]
• It is exceptionally difficult to sequence one molecule
• Imagine that each template is actually many copies of the same sequence in one microwell
[Figure sequence: one sequencing cycle, repeated]
• DNA polymerase adds one complementary base with a “terminator” to each template (A pairs with T, C pairs with G)
• Take a photograph (snap)
• Remove terminators
• Repeat! (snap) (snap) (snap) (snap) (snap)
Sequencing by synthesis
Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6
complement complement complement complement complement complement
G A T A C C
Template (drawn vertically): CCATAG
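The cycle structure can be sketched in a few lines. This is a toy model under stated assumptions: every template gains exactly one complementary base per cycle, each "photo" records that base for all templates at once, and each template is read left to right as written, which may not match the orientation drawn in the figure. The function name and the three example templates are illustrative.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def sequence_by_synthesis(templates, num_cycles):
    """Return (photos, reads): one photo per cycle, one read per template."""
    photos = []
    for cycle in range(num_cycles):
        # One terminated base is added to every template, then one "snap"
        # records the added base for all templates simultaneously.
        photos.append([COMPLEMENT[t[cycle]] for t in templates])
    # A read is what one template showed across all photos, in cycle order.
    reads = ["".join(bases) for bases in zip(*photos)]
    return photos, reads

templates = ["CCATAG", "CCACGG", "CTTAAG"]
photos, reads = sequence_by_synthesis(templates, num_cycles=6)
for template, read in zip(templates, reads):
    print(template, "->", read)
```

The number of photos equals the read length, while the number of templates only affects how much each photo contains, which is where the parallelism comes from.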
Sequencing by synthesis
Actual Illumina HiSeq 3000 image
http://dnatech.genomecenter.ucdavis.edu/2015/05/07/first-hiseq-3000-data-download/
Sequencing by synthesis
• Billions of templates on a slide
• Massively parallel: photograph captures all templates simultaneously
• Terminators are “speed bumps,” keeping reactions in sync
Eh, I thought it was really hard to sequence specific molecules?
Yes. Yes, it is.
https://youtu.be/oIJaA6h2bFM?t=613
Bridge amplification example
[Figure with handwritten annotations, partly illegible. Recoverable labels: adapters used to connect to the microwell; complementary sequences; bend strand; original DNA; denature; cluster of clones]
[Figure: within a cluster, an unterminated strand gets ahead of schedule]
Base quality: Q = -10 · log10(p)
p = probability that base call is incorrect
Q = 10 → 1 in 10 chance call is incorrect
Q = 20 → 1 in 100
Q = 30 → 1 in 1,000
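The formula and its inverse are one line each. A minimal sketch (the function names `p_to_q` and `q_to_p` are illustrative):

```python
import math

def p_to_q(p):
    """Base quality Q from the probability p that the call is incorrect."""
    return -10.0 * math.log10(p)

def q_to_p(q):
    """Probability that the call is incorrect, from base quality Q."""
    return 10.0 ** (-q / 10.0)

for q in (10, 20, 30):
    print(f"Q = {q} -> 1 in {round(1 / q_to_p(q)):,}")
```

Each additional 10 points of Q means the call is ten times less likely to be wrong.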
Call: orange (C)
Estimate p, probability incorrect:
p = non-orange light / total light
p = 3 green / 9 total = 1/3
Q = -10 · log10(1/3) ≈ 4.77
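The same estimate written as code. A minimal sketch of the slide's simplified rule, assuming the light in a cluster is summarized as a count per colour (the dictionary format and the name `call_from_light` are illustrative; real base callers use more elaborate models):

```python
import math

def call_from_light(light_by_colour):
    """Call the brightest colour and estimate p and Q from the remaining light."""
    total = sum(light_by_colour.values())
    called = max(light_by_colour, key=light_by_colour.get)
    # p = light that is not the called colour / total light
    p = (total - light_by_colour[called]) / total
    # A perfectly clean signal gives p = 0, for which Q is unbounded.
    q = -10.0 * math.log10(p) if p > 0 else float("inf")
    return called, p, q

called, p, q = call_from_light({"orange": 6, "green": 3})
print(called, round(p, 3), round(q, 2))
```

With 6 units of orange and 3 of green, this gives the slide's p = 1/3 and Q of about 4.77.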
A read in FASTQ format
Name @ERR194146.1 HSQ1008:141:D0CC8ACXX:3:1308:20201:36071/1
Sequence ACATCTGGTTCCTACTTCAGGGCCATAAAGCCTAAATAGCCCACACGTTCCCCTTAAAT
(ignore) +
Base qualities ?@@FFBFFDDHHBCEAFGEGIIDHGH@GDHHHGEHID@C?GGDG@FHIGGH@FHBEG:G
FASTQ
Each read takes four lines, and reads follow one another in the file:
Read 1: Name / Sequence / (placeholder) / Base qualities
Read 2: Name / Sequence / (placeholder) / Base qualities
Read 3: Name / Sequence / (placeholder) / Base qualities
Read 4: Name / Sequence / (placeholder) / Base qualities
Read 5: Name / Sequence / (placeholder) / Base qualities
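The four-line layout makes a FASTQ file straightforward to read. A minimal sketch, assuming every record is exactly four lines with no blank lines in between (`parse_fastq` is an illustrative name, and the example uses the read shown earlier in an in-memory file):

```python
import io

def parse_fastq(handle):
    """Yield (name, sequence, qualities) for each four-line FASTQ record."""
    while True:
        name = handle.readline().rstrip()
        if not name:
            break  # end of file
        sequence = handle.readline().rstrip()
        handle.readline()  # placeholder line ("+"), ignored
        qualities = handle.readline().rstrip()
        yield name[1:], sequence, qualities  # drop the leading "@"

example = io.StringIO(
    "@ERR194146.1 HSQ1008:141:D0CC8ACXX:3:1308:20201:36071/1\n"
    "ACATCTGGTTCCTACTTCAGGGCCATAAAGCCTAAATAGCCCACACGTTCCCCTTAAAT\n"
    "+\n"
    "?@@FFBFFDDHHBCEAFGEGIIDHGH@GDHHHGEHID@C?GGDG@FHIGGH@FHBEG:G\n"
)
records = list(parse_fastq(example))
for name, sequence, qualities in records:
    print(name, len(sequence), len(qualities))
```

A more defensive reader would also check that the first line starts with "@" and that the sequence and quality lines have equal length.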
Quality degrades as a function of read length (later cycles have lower base quality)
How do we get errors?
• This process is called “base calling”
• “calling” the correct base from the images
• Lots of methods, my fav: “BayesCall”
https://genome.cshlp.org/content/19/10/1884.full
Base qualities
Bases and qualities line up:
AGCTCTGGTGACCCATGGGCAGCTGCTAGGGA
||||||||||||||||||||||||||||||||
HHHHHHHHHHHHHHHGCGC5FEFFFGHHHHHH
Base quality is an ASCII-encoded version of Q = -10 · log10(p)
[Figure: ASCII table]
Base qualities
Usual ASCII encoding is “Phred+33”:
take Q, rounded to integer, add 33, convert to character
def QtoPhred33(Q):
    """ Turn Q into Phred+33 ASCII-encoded quality """
    # chr converts integer to character according to ASCII table
    return chr(int(round(Q)) + 33)

def phred33ToQ(qual):
    """ Turn Phred+33 ASCII-encoded quality into Q """
    # ord converts character to integer according to ASCII table
    return ord(qual) - 33
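Applying the decoding to a whole quality string shows which base calls are least trustworthy. A minimal sketch using the aligned bases and qualities from the earlier slide (the helper names are illustrative and are redefined here so the block runs on its own):

```python
def phred33_to_q(qual_char):
    """Turn one Phred+33 ASCII-encoded quality character into Q."""
    return ord(qual_char) - 33

def decode_qualities(qual_string):
    """Turn a Phred+33 quality string into a list of Q values."""
    return [phred33_to_q(c) for c in qual_string]

def error_probability(q):
    """Invert Q = -10 * log10(p) to recover p."""
    return 10.0 ** (-q / 10.0)

bases = "AGCTCTGGTGACCCATGGGCAGCTGCTAGGGA"
quals = "HHHHHHHHHHHHHHHGCGC5FEFFFGHHHHHH"
q_values = decode_qualities(quals)

# Position of the lowest-quality base call in the read.
worst = min(range(len(q_values)), key=q_values.__getitem__)
print(bases[worst], q_values[worst], error_probability(q_values[worst]))
```

Here "H" decodes to Q = 39 and "5" to Q = 20, so the single "5" marks the base most likely to be wrong.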