Substitution Matrix

A substitution matrix is a scoring system that quantifies the likelihood of one amino acid being substituted for another in an alignment. It is derived from statistical analysis of reliable alignments of highly related protein sequences. There are two main types of amino acid substitution matrices: those based on amino acid properties and those empirically derived like PAM and BLOSUM matrices from actual sequence alignments. PAM and BLOSUM matrices assign scores reflecting the observed frequency of substitutions versus the expected random frequency, converted to log-odds ratios. Higher scores indicate more evolutionarily conserved substitutions.

Uploaded by

Rashmi Dhiman

Substitution Matrix

A substitution matrix is a scoring system: a set of values that quantify the
likelihood of one residue being substituted by another in an alignment.
It is derived from statistical analysis of residue substitution data from
sets of reliable alignments of highly related sequences.
SCORING MATRICES
• Scoring matrices for nucleotide sequences are relatively simple.
• A positive value or high score is given for a match and a negative
value or low score for a mismatch.
• These simple matrices assume that the frequencies of mutation are
equal for all bases.
• However, this assumption may not be realistic; observations show
that transitions (substitutions between purines and purines or
between pyrimidines and pyrimidines, such as A↔G or C↔T) occur more
frequently than transversions (substitutions between purines and
pyrimidines).
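The match/mismatch scheme and the transition/transversion distinction above can be sketched as a small lookup table. This is a minimal illustration: the numeric scores (1, -1, -2) and the function names are assumptions chosen for the example, not values from a published matrix.

```python
# Toy nucleotide scoring: match > transition > transversion.
# The default scores are illustrative assumptions, not a published matrix.
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def nucleotide_score(a, b, match=1, transition=-1, transversion=-2):
    """Score one aligned pair of bases."""
    if a == b:
        return match
    # A transition stays within one chemical class (purine or pyrimidine).
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return transition if same_class else transversion


def build_matrix(**scores):
    """Build a 4 x 4 scoring matrix as a dict keyed by (base, base)."""
    bases = "ACGT"
    return {(a, b): nucleotide_score(a, b, **scores) for a in bases for b in bases}


def score_alignment(seq1, seq2, matrix):
    """Sum pair scores over an ungapped alignment of equal-length sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    return sum(matrix[(a, b)] for a, b in zip(seq1, seq2))


matrix = build_matrix()
print(matrix[("A", "G")])  # transition: -1
print(matrix[("A", "T")])  # transversion: -2
print(score_alignment("ACGT", "ACAT", matrix))  # three matches, one transition: 2
```

Setting `transition` and `transversion` to the same value recovers the simpler matrix that treats all mismatches equally.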
SCORING MATRICES
• Scoring matrices for amino acids are more complicated because scoring has
to reflect the physicochemical properties of amino acid residues, as well as
the likelihood of certain residues being substituted among true homologous
sequences.
• Certain amino acids with similar physicochemical properties can be more
easily substituted than those without similar characteristics.
• Substitutions among similar residues are likely to preserve the essential
functional and structural features. However, substitutions between residues
of different physicochemical properties are more likely to disrupt
structure and function. This type of disruptive substitution is less
likely to be selected in evolution because it tends to render the protein nonfunctional.
Amino Acid Scoring Matrices
• Amino acid substitution matrices, which are 20 × 20 matrices, have been devised to reflect the
likelihood of residue substitutions.
• There are essentially two types of amino acid substitution matrices.
• One type is based on interchangeability of the genetic code or amino acid properties, and the
other is derived from empirical studies of amino acid substitutions. Although the two different
approaches coincide to a certain extent, the first approach, which is based on the genetic code or
the physicochemical features of amino acids, has been shown to be less accurate than the second
approach, which is based on surveys of actual amino acid substitutions among related proteins.
• The empirical matrices, which include PAM and BLOSUM matrices, are derived from actual
alignments of highly similar sequences. By analyzing the probabilities of amino acid substitutions
in these alignments, a scoring system can be developed by giving a high score for a more likely
substitution and a low score for a rare substitution.
• For a given substitution matrix, a positive score means that the frequency of amino acid
substitutions found in a data set of homologous sequences is greater than would have occurred
by random chance. A zero score means that the frequency of amino acid substitutions found in
the homologous sequence data set is equal to that expected by chance. A negative score means
that the frequency of amino acid substitutions found in the homologous sequence data set is less
than would have occurred by random chance.
Log-odds Ratio
• The substitution matrices apply logarithmic conversions to describe the probability
of amino acid substitutions.
• The converted values are the so-called log-odds scores (or log-odds ratios), which
are logarithms of the ratio of the observed substitution frequency to the
frequency of substitution expected by random chance.
• The logarithm can be taken either to the base of 10 or to the base of 2.
• For example, in an alignment that involves ten sequences, each having only one
aligned position, nine of the sequences are F (phenylalanine) and the remaining
one I (isoleucine). The observed frequency of I being substituted by F is one in ten
(0.1), whereas the probability of I being substituted by F by random chance is one
in twenty (0.05). Thus, the ratio of the two probabilities is 2 (0.1/0.05). Taking
the logarithm of this ratio to the base of 2 gives a log-odds score of 1.
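The worked example above can be checked with a few lines of Python. This is a minimal sketch; the function name `log_odds` is an assumption made for illustration.

```python
import math


def log_odds(observed, expected, base=2):
    """Return log_base(observed / expected) for two positive frequencies."""
    if observed <= 0 or expected <= 0:
        raise ValueError("frequencies must be positive")
    return math.log(observed / expected, base)


# Example from the text: observed 0.1, expected 0.05, base 2.
score = log_odds(0.1, 0.05)
print(round(score, 6))  # 1.0

# The sign follows the rule stated earlier for substitution matrices.
print(log_odds(0.05, 0.05))       # equal to chance: 0.0
print(log_odds(0.025, 0.05) < 0)  # rarer than chance: True
```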
PAM Matrices
• PAM stands for “point accepted mutation”. The PAM matrices were first constructed by Margaret Dayhoff, who compiled
alignments of seventy-one groups of very closely related protein sequences.
• One PAM unit is defined as a change in 1% of the amino acid positions.
• Construction of the PAM1 matrix involves alignment of full-length sequences and subsequent construction of
phylogenetic trees using the parsimony principle. This allows computation of ancestral sequences for each
internal node of the trees.
• Ancestral sequence information is used to count the number of substitutions along each branch of a tree.
• The PAM score for a particular residue pair is derived from a multistep procedure: calculation of
relative mutability (the number of mutational changes from a common ancestor for a particular
amino acid residue divided by the total number of such residues occurring in an alignment), normalization of
the expected residue substitution frequencies by random chance, and logarithmic transformation to the
base of 10 of the normalized mutability value divided by the frequency of a particular residue.
• The resulting value is rounded to the nearest integer and entered into the substitution matrix, which reflects
the likelihood of amino acid substitutions. This completes the log-odds score computation.
• After compiling all substitution probabilities of possible amino acid mutations, a 20 × 20 PAM matrix is
established.
• Positive scores in the matrix denote substitutions occurring more frequently than expected among
evolutionarily conserved replacements. Negative scores correspond to substitutions that occur less
frequently than expected.
PAM Matrices
• A PAM unit is defined as 1% amino acid change or one mutation per 100 residues. Increasing PAM numbers
correspond to increasing PAM units and thus to greater evolutionary distances between protein sequences.
• For example, PAM250 represents 250 mutations per 100 residues and corresponds to about 20% amino acid
identity; identity does not fall to zero because repeated substitutions can hit the same position.
• In theory, this number of evolutionary changes approximately corresponds to an expected evolutionary span of
2,500 million years. Thus, the PAM250 matrix is normally used for divergent sequences. Accordingly, PAM
matrices with lower serial numbers are more suitable for aligning more closely related sequences.
• The extrapolated values of the PAM250 amino acid substitution matrix are shown in Figure.
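The word "extrapolated" refers to how higher-numbered PAM matrices are obtained: the PAM1 mutation probability matrix is multiplied by itself, so PAM250 corresponds to PAM1 raised to the 250th power. The sketch below illustrates this with a hypothetical three-letter alphabet. The matrix `TOY_PAM1` and its values are assumptions made for the example, not Dayhoff's data.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, n):
    """Raise a square matrix to a non-negative integer power by repeated squaring."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    base = [row[:] for row in m]
    while n > 0:
        if n & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        n >>= 1
    return result


# Hypothetical 3-letter alphabet: each residue has a 1% chance of changing
# per PAM unit, split evenly between the two alternatives.
TOY_PAM1 = [
    [0.990, 0.005, 0.005],
    [0.005, 0.990, 0.005],
    [0.005, 0.005, 0.990],
]

toy_pam250 = mat_pow(TOY_PAM1, 250)

# Probability that a residue is unchanged after 250 PAM units.
# It stays well above zero because later changes can revert earlier ones.
print(round(toy_pam250[0][0], 3))  # 0.349
```

The real procedure works on a 20 × 20 matrix and then converts the extrapolated probabilities to log-odds scores; only the extrapolation step is shown here.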
BLOSUM Matrices
• BLOSUM stands for blocks substitution matrix. All matrices in this series are derived from
direct observation of every possible amino acid substitution in multiple sequence alignments.
• These were constructed based on more than 2,000 conserved amino acid patterns representing 500 groups
of protein sequences. The sequence patterns, also called blocks, are ungapped alignments of less than sixty
amino acid residues in length.
• The frequencies of amino acid substitutions of the residues in these blocks are calculated to produce a
numerical table, or block substitution matrix.
• Unlike the PAM series, BLOSUM matrices are not extrapolated; the number in the name is an actual percentage
identity value of the sequences selected for construction of the matrix. For example, BLOSUM62 is built from
blocks in which sequences sharing 62% identity or more are clustered and counted as one.
• Other BLOSUM matrices based on sequence groups of various identity levels have also been constructed. In
the reverse order of the PAM numbering system, the lower the BLOSUM number, the more divergent the
sequences it represents.
• The BLOSUM score for a particular residue pair is derived from the log ratio of the observed residue substitution
frequency to the frequency expected by chance. The log odds is taken to the base of 2
instead of 10 as in the PAM matrices.
• The resulting value is rounded to the nearest integer and entered into the substitution matrix. As in the PAM
matrices, positive and negative values correspond to substitutions that occur more or less frequently than
expected among evolutionarily conserved replacements. The values of the BLOSUM62 matrix are shown in
Figure.
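The counting and log-odds steps described above can be sketched on a toy block. This is a simplified illustration under stated assumptions: the block `TOY_BLOCK` is invented, the function name is hypothetical, the clustering of similar sequences is omitted, and the factor of 2 reflects the half-bit scaling commonly used for BLOSUM scores.

```python
import math
from collections import Counter
from itertools import combinations


def blosum_like_scores(block):
    """Rounded 2 * log2(observed / expected) scores from one ungapped block."""
    # Count every unordered residue pair within each column.
    pair_counts = Counter()
    for column in zip(*block):
        for a, b in combinations(column, 2):
            pair_counts[tuple(sorted((a, b)))] += 1
    total_pairs = sum(pair_counts.values())

    # Observed pair frequencies.
    q = {pair: n / total_pairs for pair, n in pair_counts.items()}

    # Residue frequencies implied by the pairs.
    p = Counter()
    for (a, b), freq in q.items():
        if a == b:
            p[a] += freq
        else:
            p[a] += freq / 2
            p[b] += freq / 2

    # Log-odds of observed versus chance pairing; unobserved pairs get no score.
    scores = {}
    for (a, b), freq in q.items():
        expected = p[a] * p[a] if a == b else 2 * p[a] * p[b]
        scores[(a, b)] = round(2 * math.log2(freq / expected))
    return scores


# Invented block: four ungapped sequences, five columns.
TOY_BLOCK = ["AAIIA", "AAIIA", "AAIIA", "AAIVI"]

scores = blosum_like_scores(TOY_BLOCK)
print(scores[("I", "V")])  # 3: seen more often than chance
print(scores[("A", "I")])  # -4: seen less often than chance
```

Because the block is tiny, the numbers themselves are not meaningful; the point is that pairs observed more often than their residue frequencies predict receive positive scores, and rarer pairs receive negative ones.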
