1 DNA Sequencing

The document provides information about DNA, including its structure as a double helix, storage in cells, relationship to genes and proteins, and sequencing technologies. It discusses DNA sequencing technologies including Sanger sequencing, second generation sequencing, and third generation sequencing. It also provides some useful facts about the human genome and DNA sequencing process.

Uploaded by

Abraham Lincoln

DNA and sequencing (mostly Illumina)

Many slides adapted from Ben Langmead. Thanks, Ben!


https://langmead-lab.org/teaching-materials/
What is DNA?

• There are many types of biomolecules


we pretty exclusively focus on these in class

• Carbohydrates, lipids, proteins, nucleic acids

• DNA is a type of nucleic acid (deoxyribonucleic acid)

• DNA stores all the genetic information that a particular organism needs to survive
DNA is stored in nearly every human cell

Question: which cells don’t have DNA?

https://en.wikipedia.org/wiki/DNA
DNA, genes, RNA, and proteins

[Figure: DNA → RNA → protein]
DNA and the double helix
[Figure: base pairs along the double helix, with the 5’ and 3’ ends labeled]
DNA, chromatid, chromosome

Some useful facts about DNA
• A “genome” is about 3.1 Gb

• Just one side of a double helix

• Humans are 99.9% genetically identical

• A great overestimate of a person’s variability is 3M genetic variants

• If we take the union of all single nucleotide variants, it’s only ~8M (> 5% allele frequency)

• …so why sequence DNA?
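
As a rough back-of-the-envelope check on the facts above, a minimal sketch: it assumes the 99.9% figure applies uniformly across a 3.1 Gb genome, which is a simplification. The constant names are illustrative only.

```python
# Back-of-the-envelope check; assumes 99.9% identity applies uniformly.
GENOME_LENGTH = 3_100_000_000   # ~3.1 Gb, one side of the double helix
DIFFERING_PER_THOUSAND = 1      # 99.9% identical -> 1 base in 1,000 differs

# Integer arithmetic avoids floating-point rounding noise.
expected_differences = GENOME_LENGTH * DIFFERING_PER_THOUSAND // 1000
print(f"{expected_differences:,}")  # 3,100,000
```

Under that assumption the count lands close to the 3M figure quoted in the list.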


Genomics technology

Sanger DNA sequencing: 1977-1990s
DNA Microarrays: since mid-1990s
2nd-generation DNA sequencing: since ~2007
3rd-generation & single-molecule DNA sequencing: since ~2010

Fred Sanger, 1918-2013: “chain termination” sequencing
Sanger sequencing

[Figures: Sanger sequencing; Fred Sanger in episode 3 of PBS documentary “DNA”; not-so-high-throughput Sanger sequencing]
1977-1990s
First practical method invented by Fred Sanger in 1977. Initially used to sequence shorter genomes, e.g. viral genomes 10,000s of bases long.
Sanger sequencing

From "DNA" documentary, episode 3


Sequencing

No sequencing technology yet invented can read much more than 10,000 nucleotides at a time with reasonable cost, throughput, accuracy

Instead, there’s a vigorous race to see whose sequencer can read “short” fragments of DNA (around 100s of nucleotides) with best cost, throughput, accuracy
“Decoding DNA With Semiconductors” by NICHOLAS WADE. Published: July 20, 2011
“Company Unveils DNA Sequencing Device Meant to Be Portable, Disposable and Cheap” by ANDREW POLLACK. Published: February 17, 2012
“Cost of Gene Sequencing Falls, Raising Hopes for Medical Advances” by JOHN MARKOFF. Published: March 7, 2012
Source: nytimes.com
Sequencing
Since 2005, many DNA sequencing instruments have been described and released. They are based on a few different principles:

Synthesis / ligation, SMRT cell, Nanopore

Sequencing by synthesis (“massively parallel sequencing”) provides greatest throughput, and is the most prevalent today
Pictures: http://www.illumina.com/systems/miseq/technology.ilmn, http://www.genengnews.com/gen-articles/third-generation-sequencing-debuts/3257/
DNA: double helix

[Figure: double helix; A pairs with T, G pairs with C]

http://ghr.nlm.nih.gov/handbook/basics/dna

Forward strand
TCACACTGAGCGTGCTG
Reverse strand
AGTGTGACTCGCACGAC
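
The pairing rule can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides; the helper names `complement` and `reverse_complement` are made up here. The slide writes the reverse strand aligned base-by-base under the forward strand, which corresponds to the plain complement; read in its own 5’ to 3’ direction, the same strand is the reverse complement.

```python
# Watson-Crick pairing: A pairs with T, G pairs with C.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def complement(seq):
    """Complement each base, keeping the left-to-right order."""
    return "".join(COMPLEMENT[base] for base in seq)

def reverse_complement(seq):
    """Complement each base, then reverse the result."""
    return complement(seq)[::-1]

forward = "TCACACTGAGCGTGCTG"
print(complement(forward))          # AGTGTGACTCGCACGAC
print(reverse_complement(forward))  # CAGCACGCTCAGTGTGA
```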
Your genome

CGTCTGGGGGGTATGCACGCGATAGCATTGCGAGACGCTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTG

Reads

GTATGCACGCGATAG TATGTCGCAGTATCT CACCCTATGTCGCAG GAGACGCTGGAGCCG
TAGCATTGCGAGACG GGTATGCACGCGATA TGGAGCCGGAGCACC CGCTGGAGCCGGAGC
TGTCTTTGATTCCTG CGCGATAGCATTGCG GCATTGCGAGACGCT CCTATGTCGCAGTAT
GACGCTGGAGCCGGA GCACCCTATGTCGCA GTATCTGTCTTTGAT CCTCATCCTATTATT
TATCGCACCTACGTT CAATATTCGATCATG GATCACAGGTCTATC ACCCTATTAACCACT
CACGGGAGCTCTCCA TGCATTTGGTATTTT CGTCTGGGGGGTATG CACGCGATAGCATTG
GTATGCACGCGATAG ACCTACGTTCAATAT TATTTATCGCACCTA CCACTCACGGGAGCT
GCGAGACGCTGGAGC CTATCACCCTATTAA CTGTCTTTGATTCCT ACTCACGGGAGCTCT
CCTACGTTCAATATT GCACCTACGTTCAAT GTCTGGGGGGTATGC AGCCGGAGCACCCTA
GACGCTGGAGCCGGA GCACCCTATGTCGCA GTATCTGTCTTTGAT CCTCATCCTATTATT
TATCGCACCTACGTT CAATATTCGATCATG GATCACAGGTCTATC ACCCTATTAACCACT
CACGGGAGCTCTCCA TGCATTTGGTATTTT CGTCTGGGGGGTATG CACGCGATAGCATTG
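
To make the relationship between a genome and its reads concrete, here is a minimal sketch that draws fixed-length substrings from random positions of the genome string shown above. It is an illustration under simplifying assumptions (error-free reads, forward strand only, uniform start positions); `sample_reads` and its parameters are hypothetical names, not part of any sequencing software.

```python
import random

def sample_reads(genome, read_len, n_reads, seed=0):
    """Draw n_reads substrings of length read_len from random positions."""
    rng = random.Random(seed)  # seeded so the example is repeatable
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(genome) - read_len + 1)
        reads.append(genome[start:start + read_len])
    return reads

genome = ("CGTCTGGGGGGTATGCACGCGATAGCATTGCGAGACGCTGGAGCCGGAGCACC"
          "CTATGTCGCAGTATCTGTCTTTGATTCCTG")
reads = sample_reads(genome, read_len=15, n_reads=8)
for read in reads:
    print(read)
```

A real sequencer reports the reads without their positions, which is why putting them back together is the hard part.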

[Figure: reads of 100 nt each, compared with your genome of 100,000,000 nt]
The sequencing Oracle

Your genome

chr1 CGTCTGGGGGGTATGCACGCGATAGCATTGCGAGACGCTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTG
chr2
[Figure sequence: double stranded DNA (double helix) next to double stranded DNA (lego version), showing A-T and G-C pairs; the strands separate into single stranded templates; DNA polymerase adds complementary bases to a template one at a time]
More details: Accurate whole human genome sequencing using
reversible terminator chemistry. Nature. 2008 Nov 6;456(7218):53-9
Input DNA
CCATAGTATATCTCGGCTCTAGGCCCTCATTTTTT
CCATAGTATATCTCGGCTCTAGGCCCTCATTTTTT
CCATAGTATATCTCGGCTCTAGGCCCTCATTTTTT
CCATAGTATATCTCGGCTCTAGGCCCTCATTTTTT

Cut into snippets

CCATAGTA TATCTCGG CTCTAGGCCCTC ATTTTTT
CCA TAGTATAT CTCGGCTCTAGGCCCTCA TTTTTT
CCATAGTAT ATCTCGGCTCTAG GCCCTCA TTTTTT
CCATAG TATATCT CGGCTCTAGGCCCT CATTTTTT

Deposit on slide

[Figure: snippet CCATAG deposited on the slide]
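
The “cut into snippets” step can be imitated with a short sketch: each copy of the input DNA is cut at random points, so different copies break in different places. This is only a toy model under stated assumptions; the function name `cut_into_snippets` and the length bounds 3 and 18 are arbitrary choices for illustration, not values from the sequencing protocol.

```python
import random

def cut_into_snippets(seq, min_len, max_len, rng):
    """Cut seq left to right into pieces of random length.

    The final piece may be shorter than min_len if little sequence remains.
    """
    snippets = []
    pos = 0
    while pos < len(seq):
        size = rng.randint(min_len, max_len)  # inclusive on both ends
        snippets.append(seq[pos:pos + size])
        pos += size
    return snippets

input_dna = "CCATAGTATATCTCGGCTCTAGGCCCTCATTTTTT"
rng = random.Random(1)  # seeded so the example is repeatable
copies = [cut_into_snippets(input_dna, 3, 18, rng) for _ in range(4)]
for snippets in copies:
    print(" ".join(snippets))
```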

Each DNA ‘cluster’ is about 1-2 microns

Flow cell with several lanes and capable of sequencing billions of reads
tinyurl.com/cs121sp24
Prepped sequences “flow” on the flow cell and bind to wells

[Figure: flow cell holding the single stranded templates CCATAG, CCACGG and CTTAAG]

Billions of microwells which contain “one” sequence
[Figure: templates (billions of them!) on the slide]

• It is exceptionally difficult to sequence one molecule

• Imagine that each template is actually many copies of the same sequence in one microwell
[Figure sequence: DNA polymerase adds one complementary base carrying a “terminator”; a photograph is taken (snap); remove terminators; repeat!]
Sequencing by synthesis

Cycle:        1  2  3  4  5  6
(complement)  G  A  T  A  C  C

[Figure: template CCATAG]
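
One way to read the six cycles, as a minimal sketch: assume the polymerase works along the template starting from its last base, adds one terminated complementary base per cycle, and that each observed base is complemented to recover the template base. The function name `sequence_by_synthesis` and the read direction are assumptions, chosen so that the output matches the G A T A C C shown on the slide.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def sequence_by_synthesis(template, n_cycles):
    """Return (incorporated bases, base calls) after n_cycles.

    Assumes n_cycles <= len(template) and that synthesis starts at the
    last base of the template (an assumption, see the lead-in).
    """
    incorporated = []
    calls = []
    for cycle in range(n_cycles):
        template_base = template[-1 - cycle]
        added = COMPLEMENT[template_base]   # terminated base added this cycle
        incorporated.append(added)          # this is what the photograph sees
        calls.append(COMPLEMENT[added])     # complement it to call the template
    return "".join(incorporated), "".join(calls)

incorporated, calls = sequence_by_synthesis("CCATAG", 6)
print(incorporated)  # CTATGG
print(calls)         # GATACC
```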
Sequencing by synthesis

Actual Illumina HiSeq 3000 image


http://dnatech.genomecenter.ucdavis.edu/2015/05/07/first-hiseq-3000-data-download/
Sequencing by synthesis
Billions of templates on a slide
Massively parallel: photograph captures all templates simultaneously
Terminators are “speed bumps,” keeping reactions in sync
Eh, I thought it was really hard to sequence specific molecules?
Yes. Yes, it is.

https://youtu.be/oIJaA6h2bFM?t=613
Bridge amplification example

[Figure with handwritten notes, partly illegible: adapters with complementary sequences are used to connect to the microwell; bend strand; original DNA; denature]
[Figure: cluster of clones; an unterminated base puts one strand ahead of schedule]
Q = -10 · log10 p

Q: base quality
p: probability that base call is incorrect

Q = 10 → 1 in 10 chance call is incorrect
Q = 20 → 1 in 100
Q = 30 → 1 in 1,000
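
The formula and its inverse can be sketched directly in Python. The function names `p_to_q` and `q_to_p` are illustrative, not from any library.

```python
import math

def p_to_q(p):
    """Base quality Q from the probability p that the call is incorrect."""
    return -10 * math.log10(p)

def q_to_p(q):
    """Inverse: probability the call is incorrect, given base quality Q."""
    return 10 ** (-q / 10)

for q in (10, 20, 30):
    print(q, q_to_p(q))
```

Each extra 10 points of Q makes an incorrect call ten times less likely.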
Call: orange (C)

Estimate p, probability incorrect: non-orange light / total light

p = 3 green / 9 total = 1/3

Q = -10 log10 1/3 = 4.77
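
The same estimate as a minimal sketch, assuming light is counted in arbitrary units (6 orange and 3 green here, matching the 9 total above). The function name `quality_from_light` is hypothetical, and real base callers use more elaborate models than this ratio.

```python
import math

def quality_from_light(called_light, total_light):
    """Estimate p as (light not matching the call) / (total light), then Q.

    If all the light matches the call, p is 0 and Q is undefined, so this
    sketch raises ValueError in that case.
    """
    p = (total_light - called_light) / total_light
    if p == 0:
        raise ValueError("p = 0: Q is unbounded for a perfectly clean signal")
    return p, -10 * math.log10(p)

p, q = quality_from_light(called_light=6, total_light=9)
print(round(p, 3), round(q, 2))  # 0.333 4.77
```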
A read in FASTQ format

Name            @ERR194146.1 HSQ1008:141:D0CC8ACXX:3:1308:20201:36071/1
Sequence        ACATCTGGTTCCTACTTCAGGGCCATAAAGCCTAAATAGCCCACACGTTCCCCTTAAAT
(ignore)        +
Base qualities  ?@@FFBFFDDHHBCEAFGEGIIDHGH@GDHHHGEHID@C?GGDG@FHIGGH@FHBEG:G
FASTQ

Each read occupies four lines, in this order:

Read 1: Name, Sequence, (placeholder), Base qualities
Read 2: Name, Sequence, (placeholder), Base qualities
Read 3: Name, Sequence, (placeholder), Base qualities
Read 4: Name, Sequence, (placeholder), Base qualities
Read 5: Name, Sequence, (placeholder), Base qualities
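
Because every read takes exactly four lines, a FASTQ file can be walked four lines at a time. The sketch below is a minimal reader under that assumption (no blank lines, no wrapped sequences); `read_fastq` is an illustrative name, and the second record (`read2`) is made up for the example.

```python
import io

def read_fastq(handle):
    """Yield (name, sequence, base qualities) for each four-line record."""
    while True:
        name = handle.readline().rstrip()
        if not name:               # empty string means end of file
            break
        seq = handle.readline().rstrip()
        handle.readline()          # placeholder line ("+"), ignored
        qual = handle.readline().rstrip()
        yield name[1:], seq, qual  # drop the leading "@" from the name

example = io.StringIO(
    "@ERR194146.1 HSQ1008:141:D0CC8ACXX:3:1308:20201:36071/1\n"
    "ACATCTGGTTCCTACTTCAGGGCCATAAAGCCTAAATAGCCCACACGTTCCCCTTAAAT\n"
    "+\n"
    "?@@FFBFFDDHHBCEAFGEGIIDHGH@GDHHHGEHID@C?GGDG@FHIGGH@FHBEG:G\n"
    "@read2\n"                     # hypothetical second read
    "ACGT\n"
    "+\n"
    "IIII\n"
)
records = list(read_fastq(example))
for name, seq, qual in records:
    print(name, len(seq), len(qual))
```

The sequence and base-quality lines of a record always have the same length, one quality character per base.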
Quality degrades as a function of length
How do we get errors?
• This process is called “base calling”

• “calling” the correct base from the images

• Lots of methods, my fav: “BayesCall”


https://genome.cshlp.org/content/19/10/1884.full
Base qualities

Bases and qualities line up:

AGCTCTGGTGACCCATGGGCAGCTGCTAGGGA
||||||||||||||||||||||||||||||||
HHHHHHHHHHHHHHHGCGC5FEFFFGHHHHHH

Base quality is ASCII-encoded version of Q = -10 log10 p
Base qualities

Usual ASCII encoding is “Phred+33”: take Q, rounded to integer, add 33, convert to character

def QtoPhred33(Q):
    """ Turn Q into Phred+33 ASCII-encoded quality """
    # chr converts integer to character according to ASCII table
    return chr(int(round(Q)) + 33)

def phred33ToQ(qual):
    """ Turn Phred+33 ASCII-encoded quality into Q """
    # ord converts character to integer according to ASCII table
    return ord(qual) - 33
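
As a usage sketch, the Phred+33 decoding can be applied to the aligned bases and qualities shown earlier to find the least trustworthy base call. The helper names below are illustrative and the block redefines the decoding so it runs on its own.

```python
def phred33_to_q(qual_char):
    """Decode one Phred+33 character into its integer Q."""
    return ord(qual_char) - 33

def q_to_error_prob(q):
    """Probability that the base call is incorrect, from Q = -10 log10 p."""
    return 10 ** (-q / 10)

bases = "AGCTCTGGTGACCCATGGGCAGCTGCTAGGGA"
quals = "HHHHHHHHHHHHHHHGCGC5FEFFFGHHHHHH"

qs = [phred33_to_q(c) for c in quals]
worst = min(range(len(qs)), key=qs.__getitem__)  # index of the lowest Q

print(qs[0])                      # 39
print(bases[worst], qs[worst])    # C 20
print(q_to_error_prob(qs[worst]))
```

Here the character “5” decodes to Q = 20, i.e. roughly a 1 in 100 chance that this base call is incorrect.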
