BIOINFORMATICS(BIOCOMPUTING)
(3)
ALIGNMENT AND MATCHING
DR. IBRAHIM ZAGHLOUL
PROTEIN STRUCTURE
https://www.rcsb.org/structure/3ERT 2
MACROMOLECULAR STRUCTURE
• Primary structure of proteins
– Linear polymers linked by peptide bonds
– Sense of direction
3
SECONDARY STRUCTURE
• Polypeptide chains fold into regular local structures
– alpha helix, beta sheet, turn, loop
– based on energy considerations
4
ALPHA HELIX
5
BETA SHEET
anti-parallel parallel
schematic
6
TERTIARY STRUCTURE
• 3-d structure of a polypeptide sequence
– interactions between non-local and foreign atoms
– often separated into domains
tertiary structure of domains of CD4
myoglobin
7
QUATERNARY STRUCTURE
• Arrangement of protein subunits
quaternary structure
of Cro
human hemoglobin
tetramer
8
ACTIVE SITE (BINDING SITE)
- Upon folding, the protein active site is formed.
- The spot at which molecules fit and interact.
- The major point for protein activity.
- Usually it is a Cleft, Pocket, Cavity.
- Called active because interaction usually
results in some chemical change or reaction.
- Basis of the lock and key model.
https://www.slideshare.net/MerlynH/protein-structure-
function-46933802
http://www.chemeddl.org/collections/TSTS/Gellman/Gellm
anpg5-8/Active%20Sites.html 9
ACTIVE SITE AND DRUG DESIGN: LOCK AND KEY
• Structure (chemically interact).
• Shape should match .
• Interacting molecules (ligands).
• Lock: Protein (Receptor, Target).
• Key: Ligand (Compound)
10
PROTEIN-LIGAND DOCKING
Computational method that mimics the binding of a
ligand to a protein
Given: Target (Protein), Binding Site, Ligand (set of
ligands)
• Predicts:
• The pose of the molecule in the binding site
• The binding affinity or a score representing the
strength of binding
• Docking can be used for:
• virtual screening: Virtual testing of compounds
• Lead optimization: Investigate specific compounds
• De novo design of ligands: Synthesize new
compounds.
Image credit: Charaka Goonatilake, Glen Group, University of Cambridge.
11
http://www-ucc.ch.cam.ac.uk/research/cg369-research.html
VIDEO: MOLECULAR DOCKING USING GLIDE
Bioinformatics: A simple view
Biological Computational
+
Data methods
13
Challenges of working in bioinformatics
• Need to feel comfortable in interdisciplinary area
• Depend on others for primary data
• Need to address important biological and computer
science problems
14
Skill set
• Artificial intelligence
• Machine learning
• Statistics & probability
• Algorithms
• Databases
• Programming
15
Bioinformatics Topics Genome Sequence
• Finding Genes in Genomic DNA
– introns
– exons
– promotors
• Characterizing Repeats in Genomic DNA
– Statistics
– Patterns
• Duplications in the Genome
– Large scale genomic alignment
16
Bioinformatics Topics Protein Sequence
• Sequence Alignment
– non-exact string matching, gaps
• Scoring schemes and Matching
– How to align two strings optimally via statistics
Dynamic Programming
• Patterns – How to tell if a given
– Local vs Global Alignment alignment or match is
– TM-helix finding – Amino acid substitution scoring
statistically significant
– Motifs
– A P-value (or an e-value)?
• Secondary Structure “Prediction”
– Assessing Secondary Structure Prediction – Score Distributions
(extreme val. dist.)
– Ab initio
– Low Complexity Sequences
• Function Prediction
• Evolutionary Issues
– Active site identification
– Rates of mutation and
• Tertiary Structure Prediction change
– Fold Recognition • Relation of Sequence Similarity to Structural
– Threading Similarity
17
Evolution
1
DNA Sequencers
DNA Sequencing
• DNA sequencing refers to the general laboratory technique for determining the exact
sequence of nucleotides, or bases, in a DNA molecule.
• A DNA sequencer is a scientific instrument used to automate the DNA
sequencing process. Given a sample of DNA, a DNA sequencer is used to determine the
order of the four bases: G (guanine), C (cytosine), A (adenine) and T (thymine). This is then
reported as a text string, called a read.
DNA Sequencers
DNA Sequencers
Sequencing Reads
Alignment
Assembly
Evolution
3
Sequence conservation implies function
Alignment is the key to
• Finding important regions
• Determining function
• Uncovering the evolutionary forces
Sequence alignment
• Comparing DNA/protein sequences for
– Similarity
– Homology
• Prediction of function
• Construction of phylogeny: the history of the evolution of a species or
group.
• Shotgun assembly
– End-space-free alignment / overlap alignment
• Finding motifs
3
Homology
• Homology: Homology among DNA, or proteins is
inferred from their sequence similarity. Significant
similarity is strong evidence that two sequences are related
by evolutionary changes from a common ancestral
sequence.
• Orthologs (Different Species)
– Divergence follows speciation
– Similarity can be used to construct phylogeny between
species
• Paralogs (Same Species)
- Divergence follows duplication
3
Sequence Alignment
• Procedure of comparing two (pairwise) or more
(multiple) sequences by searching for a series of
individual characters that are in the same order in
the sequences
GCTAGTCAGATCTGACGCTA
| |||| ||||| |||
TGGTCACATCTGCCGC
35
Sequence Alignment
• Procedure of comparing two (pairwise) or more
(multiple) sequences by searching for a series of
individual characters that are in the same order
in the sequences
VLSPADKTNVKAAWGKVGAHAGYEG
||| | | | || | ||
VLSEGDWQLVLHVWAKVEADVAGEG
3
Sequence Alignment
AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
Definition
Given two strings x = x1x2...xM, y = y1y2…yN,
an alignment is an assignment of gaps to positions 0,…, M
in x, and 0,…, N in y, so as to line up each letter in one
sequence with either a letter, or a gap in the other sequence
3
Sources of variation
• Nucleotide substitution
– Replication error
– Chemical reaction
• Insertions or deletions (indels)
– Unequal crossing over
– Replication slippage
• Duplication
– a single gene (complete gene duplication)
– part of a gene (internal or partial gene duplication)
• Domain duplication
• Exon shuffling
– part of a chromosome (partial polysomy)
– an entire chromosome (aneuploidy or polysomy)
– the whole genome (polyploidy)
38
A simple alignment
• Let us try to align two short nucleotide sequences:
– AATCTATA and AAGATA
• Without considering any gaps (insertions/deletions) there
are 3 possible ways to align these sequences
AATCTATA AATCTATA AATCTATA
AAGATA AAGATA AAGATA
• Which one is better?
39
Scoring the alignments
• We need to have a scoring mechanism to evaluate alignments
– match score
– mismatch score
• We can have the total score as:
n
∑
=1
i
match or mismatch score at position i
• For the simple example, assume a match score of 1 and a
mismatch score of 0:
AATCTATA AATCTATA AATCTATA
AAGATA AAGATA AAGATA
4 1 3
40
Simple alignment with gaps
• Considering gapped alignments vastly
increases the number of possible alignments.
• If gap penalty is -1 what will be the new
scores?
AATCTATA AATCTATA AATCTATA
AAG-AT-A AA-G-ATA AA--GATA
1 3 3
41
BLOSUM 62 matrix
String Definitions
A string S is a finite ordered list of characters.
Characters are drawn from an alphabet Σ.
Nucleic acid alphabet: { A, C, G, T }
Amino acid alphabet: { A, R, N, D, C, E, Q, G, H, I, L, K, M, F, P, S, T, W, Y, V }
Length of S, |S |, is the number of characters in S
ϵ is the empty string. | ϵ | = 0
43
String Definitions
• For strings S and T over Σ, their concatenation consists of the characters of
S followed by the characters of T, denoted ST
• S is a substring of T if there exist (possibly empty) strings u and v such that
T = uSv
• S is a prefix of T if there exists a string u such that T = Su.
If neither S nor u are ϵ, S is a proper prefix of T.
• Definitions of suffix and proper suffix are similar.
44
String Definitions
• We defined substring. Subsequence is similar except the
characters need not be consecutive.
• “cat” is a substring and a subsequence of “concatenate”
• “cant” is a subsequence of “concatenate”, but not a
substring
45
Exact matching
• Looking for places where a pattern P occurs as a substring
of a text T. Each such place is an occurrence or match.
• An alignment is a way of putting P’s characters opposite
T’s characters. It may or may not correspond to an
occurrence.
46
Exact Matching
47
Exact matching: naïve algorithm
48
Exact matching: naïve algorithm
49
Exact matching: naïve algorithm
50
Can we improve on the naïve algorithm?
P: word
T: There would have been a time for such a word
word
u doesn’t occur in P,so skip next two alignments
P: word
T: There would have been a time for such a word
word
word skip!
word skip!
word
51