Molecular Phylogeny- Introduction

Subject: Bioinformatics

Lesson: Molecular Phylogeny- Introduction

Lesson Developer: Shailendra Goel

College/ Department: Department of Botany, University of Delhi


Table of Contents

Chapter: Molecular Phylogeny

• Introduction
• How to generate trees
    Positive and negative selection
• Understanding Trees
    Cladograms vs Phylograms
    Rooted vs Unrooted trees
    Tree Terminology
• Methods of Phylogenetic reconstruction
    Distance Method
    UPGMA distance based method
    Neighbor joining method
    Maximum Parsimony
• Statistical methods of phylogeny
• Summary
• Exercise/ Practice
• Glossary
• References/ Bibliography/ Further Reading

Introduction
Mutation generates the variation on which selection acts, and together they drive evolution.
All life forms are expected to be part of a single tree of life that explains their origin
and evolution. In practice, such a tree is difficult to reconstruct because many species are
extinct and because organisms can acquire genes in ways other than descent (e.g. lateral
gene transfer). Phylogenetics uses available comparative information to generate trees that
explain evolutionary relationships. Traditionally, morphological features were compared to
generate trees. More recently, molecular sequences have been used for comparisons among
species, helping to define species, families and other taxa; this approach is called
“Molecular Phylogeny”.

How to generate trees


Trees are generated by comparing traits among organisms. In classical phylogeny these
traits are morphological, whereas molecular phylogeny uses DNA, RNA or protein sequence
data. As a general rule, DNA carries more phylogenetic information than protein. Proteins
are translated through the triplet code, which is degenerate: many changes at the third
base of a codon (the “wobble” position) do not alter the amino acid, so they are invisible
at the protein level and that phylogenetic information is lost. DNA sequences comprise
coding and non-coding regions that evolve at different rates. The rate of evolution also
depends on the type of organism.
Sequences can be compared only after they have been aligned. Without an alignment it is
very difficult to decide which nucleotide or amino acid in one sequence corresponds to
which in another (homology). Nucleotide changes in protein-coding sequences are of two
types: synonymous and non-synonymous. A synonymous change does not alter the encoded
amino acid, whereas a non-synonymous change does.
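The distinction between synonymous and non-synonymous changes can be illustrated with a minimal sketch. It assumes the standard genetic code and two already aligned, gap-free coding sequences of equal length; the function name `classify_codon_changes` and the example sequences are hypothetical and chosen only for illustration.

```python
# Standard genetic code, built from the conventional TCAG ordering of bases.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_codon_changes(cds_a, cds_b):
    """Compare two aligned coding sequences codon by codon.

    Returns a list of (codon_index, codon_a, codon_b, kind) for every codon
    that differs, where kind is "synonymous" or "non-synonymous".
    """
    if len(cds_a) != len(cds_b) or len(cds_a) % 3 != 0:
        raise ValueError("sequences must be aligned, equal length, multiple of 3")
    changes = []
    for pos in range(0, len(cds_a), 3):
        codon_a, codon_b = cds_a[pos:pos + 3], cds_b[pos:pos + 3]
        if codon_a == codon_b:
            continue
        same_amino_acid = CODON_TABLE[codon_a] == CODON_TABLE[codon_b]
        kind = "synonymous" if same_amino_acid else "non-synonymous"
        changes.append((pos // 3, codon_a, codon_b, kind))
    return changes


# Hypothetical example: the second codon changes at its third base only,
# the third codon changes at its second base.
changes = classify_codon_changes("ATGGCTAAA", "ATGGCCAGA")
for change in changes:
    print(change)
```

In this example both sequences differ at two nucleotide positions, but only one of those differences would be visible in a protein alignment.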

Positive and negative selection

Traditionally, selection that favors a change is called positive selection. The change is
favored because it helps the organism survive and reproduce. Similarly, selection that acts
against an unfavorable trait, which is normally eliminated, is called negative selection.


A similar kind of selection operates on molecular sequences. Gene duplication is common.
A duplicated copy of a gene is free to accumulate mutations and create variation. This
variation is subject to positive or negative selection and often leads to
neofunctionalization, that is, new genes with new functions.

Understanding Trees
Cladograms vs Phylograms
Trees fall into two categories: cladograms and phylograms. A cladogram provides information
only about the relationships between organisms, while a phylogram also provides a measure
of the amount of evolutionary change, shown by its branch lengths. Branch length therefore
has no meaning in a cladogram but is meaningful in a phylogram.

Figure: Phylogram
Source: Author

Figure: Cladogram
Source: Author

Rooted vs Unrooted trees


The root of a tree denotes the ultimate common ancestor and provides direction in time. At
times this information is not available, hence both types of algorithms exist: those that
apply a common-ancestor hypothesis and those that do not. A common way to decide the root
of a tree is to use an outgroup. An outgroup is a taxon that lies outside the ingroup (the
taxa under study) but is closely related to it.
Another way to identify the root is midpoint rooting, which places the root at the midpoint
of the longest path between any two taxa in the tree.

Figure: Midpoint rooted tree
Source: Author

Figure: Outgroup rooted tree
Source: Author

Tree Terminology
Trees can be described in terms of branches and nodes. Terminal branches end in
Operational Taxonomic Units (OTUs). The point where two branches join is an internal node.
Two terminal branches that are directly connected to the same internal node are called
sister branches.


Figure: Defining Trees


Source: Author
If two lineages (branches) originate from one internal node, it is called a bifurcation or
dichotomy. If more than two branches arise from one internal node, it is called a polytomy
and the tree is said to be multifurcating.

Methods of Phylogenetic reconstruction


Various methods have been proposed to build a phylogenetic tree. We will consider only
three here: distance-based methods (UPGMA and NJ), maximum parsimony (MP) and
maximum likelihood (ML).

Distance Method

Distance-based methods start by calculating pairwise distances between sequences based
on their pairwise alignment. These distances form a distance matrix, which is used to
generate the tree. Commonly used methods to generate the tree from this matrix are the
Unweighted Pair Group Method using Arithmetic mean (UPGMA) and Neighbor Joining (NJ).
Distance-based methods are fast but overlook a substantial amount of the information in a
multiple sequence alignment, because each pair of sequences is reduced to a single number.
Distance is calculated as the dissimilarity between the sequences of each pair of taxa.

Figure: Triangular and rectangular matrices. Notice that the upper part of the rectangular
matrix is identical to the lower part and is, therefore, redundant.
Source: Author
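As a minimal sketch of how such a matrix can be filled, the code below uses the simplest dissimilarity measure, the proportion of aligned sites that differ (often called the p-distance). The choice of measure, the function names and the three example sequences are assumptions made for illustration; positions containing a gap are simply skipped.

```python
from itertools import combinations


def p_distance(seq_a, seq_b):
    """Proportion of aligned, gap-free sites at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no gap-free sites to compare")
    return sum(a != b for a, b in pairs) / len(pairs)


def distance_matrix(alignment):
    """Build a full (rectangular) matrix as a nested dict: matrix[a][b]."""
    names = list(alignment)
    matrix = {name: {name: 0.0} for name in names}
    # Only the lower triangle is computed; the upper one is a mirror image.
    for a, b in combinations(names, 2):
        d = p_distance(alignment[a], alignment[b])
        matrix[a][b] = d
        matrix[b][a] = d
    return matrix


# Hypothetical aligned sequences for three OTUs.
alignment = {
    "A": "ACGTACGTAC",
    "B": "ACGTACGTTC",
    "C": "ACGAACGTTG",
}
matrix = distance_matrix(alignment)
for name in alignment:
    print(name, [round(matrix[name][other], 2) for other in alignment])
```

Because the distance from A to B equals the distance from B to A, only one triangle of the matrix needs to be stored, which is the redundancy noted in the figure caption.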

UPGMA distance based method


UPGMA is no longer a popular method; distance-based trees are now usually built with NJ.
UPGMA is a progressive clustering method. A distance matrix is first calculated for all the
sequences, and the two closest taxa are joined into a group. The matrix is then
recalculated with this group treated as a single node, and the pair with the minimum
distance is again joined. These steps are repeated until only two groups remain, which are
then connected. UPGMA assumes that the rate of nucleotide or amino acid substitution is
constant (a molecular clock), so that branch lengths reflect actual times of divergence.
This assumption is often not true, in which case the method can produce an inaccurate
tree. Midpoint rooting is applied in this method.
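The clustering loop described above can be sketched as follows. This is a minimal illustration, not a reference implementation: the function name `upgma`, the input format (a dict of pairwise distances keyed by OTU pairs) and the example distances are assumptions. The distance from a merged group to any other group is taken as the average over all member OTUs, which is what "unweighted arithmetic mean" refers to.

```python
def upgma(names, pairwise):
    """Cluster OTUs by UPGMA.

    names: list of OTU labels.
    pairwise: dict mapping (label_a, label_b) to their distance.
    Returns (tree, root_height); the tree is a nested tuple of labels.
    """
    index = {name: i for i, name in enumerate(names)}
    # Each cluster is stored as id -> (subtree, number_of_OTUs, node_height).
    clusters = {i: (name, 1, 0.0) for name, i in index.items()}
    dist = {frozenset((index[a], index[b])): d for (a, b), d in pairwise.items()}
    next_id = len(names)
    while len(clusters) > 1:
        # Join the two closest clusters.
        pair = min(dist, key=dist.get)
        i, j = sorted(pair)
        tree_i, size_i, _ = clusters.pop(i)
        tree_j, size_j, _ = clusters.pop(j)
        # With a constant rate, the new node lies halfway between the two.
        height = dist.pop(pair) / 2
        # Recalculate the matrix: size-weighted mean keeps every OTU equal.
        for k in clusters:
            d_ik = dist.pop(frozenset((i, k)))
            d_jk = dist.pop(frozenset((j, k)))
            total = size_i * d_ik + size_j * d_jk
            dist[frozenset((next_id, k))] = total / (size_i + size_j)
        clusters[next_id] = ((tree_i, tree_j), size_i + size_j, height)
        next_id += 1
    [(tree, _size, root_height)] = clusters.values()
    return tree, root_height


# Hypothetical distances that happen to fit a constant rate exactly.
names = ["A", "B", "C", "D"]
pairwise = {
    ("A", "B"): 2.0, ("A", "C"): 6.0, ("B", "C"): 6.0,
    ("A", "D"): 10.0, ("B", "D"): 10.0, ("C", "D"): 10.0,
}
tree, root_height = upgma(names, pairwise)
print(tree, root_height)
```

In this example A and B are joined first, then C, then D, and every OTU ends up at the same distance from the root. When rates differ between lineages, the same procedure still forces that shape on the tree, which is the source of the inaccuracy mentioned above.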

Neighbor joining method


NJ allows different rates of evolution in different branches of the tree. It starts by
joining the pair of OTUs with the minimum distance after each distance has been corrected
for how far the two OTUs lie, on average, from all other OTUs; the node thus created
replaces the pair in subsequent calculations. The resulting tree is unrooted because the
method does not assume a constant rate of evolution, but it can be rooted using an
outgroup.
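In the standard formulation of neighbor joining (Saitou and Nei), the pair to join is chosen by a rate-corrected criterion, usually written Q, rather than by raw distance alone. The sketch below performs only the first joining step; the function name `nj_first_join` and the five-OTU distance matrix are illustrative assumptions.

```python
from itertools import combinations


def nj_first_join(names, d):
    """Perform the first neighbor-joining step.

    names: list of OTU labels.
    d: dict mapping frozenset({a, b}) to the distance between a and b.
    Returns (pair, branch_lengths, q) where q holds the Q value of each pair.
    """
    n = len(names)

    def dist(a, b):
        return 0.0 if a == b else d[frozenset((a, b))]

    # r[a] is the total distance from a to every other OTU.
    r = {a: sum(dist(a, b) for b in names) for a in names}
    # Q corrects each distance for the overall divergence of both OTUs.
    q = {
        (a, b): (n - 2) * dist(a, b) - r[a] - r[b]
        for a, b in combinations(names, 2)
    }
    a, b = min(q, key=q.get)
    # Branch lengths from a and b to the new node may differ (no clock).
    len_a = dist(a, b) / 2 + (r[a] - r[b]) / (2 * (n - 2))
    len_b = dist(a, b) - len_a
    return (a, b), (len_a, len_b), q


# Illustrative distances for five OTUs.
names = ["a", "b", "c", "d", "e"]
raw = {
    ("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
    ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
    ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3,
}
d = {frozenset(pair): float(value) for pair, value in raw.items()}
pair, branch_lengths, q = nj_first_join(names, d)
print(pair, branch_lengths)
```

With these numbers the smallest raw distance is between d and e, yet the Q criterion selects a and b, and the two branches leading to the new node have unequal lengths. This is how NJ accommodates different rates of evolution in different branches.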

Figure: How an NJ tree is made. The OTUs with the lowest distance are connected first
(shown in orange). This pair then works as a node, and the next OTU with the lowest
distance is connected to it (shown in blue).
Source: Author

Corrections: Observed distances are not always a good measure of evolutionary distance
because they do not take into account hidden changes due to multiple hits (repeated
substitutions at the same site). For this reason, converting an observed distance into a
measure of evolution requires a correction. Two common corrections are the Jukes-Cantor
and Kimura two-parameter models.
The Jukes-Cantor one-parameter model assumes that transitions and transversions occur at
the same rate, so any nucleotide has an equal chance of changing to each of the other
three. It also assumes that the four bases are present in equal frequencies.
Usually, the transition rate is higher than the transversion rate. The Kimura two-parameter
model adjusts pairwise distances by taking the transition/transversion ratio into account.
Various other, more sophisticated models have also been developed.
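Both corrections have closed-form distance formulas, sketched below. Here p is the observed proportion of differing sites, P the proportion of sites differing by a transition and Q the proportion differing by a transversion. The function names and the example sequences are assumptions made for illustration.

```python
import math

# Transitions are purine-purine (A, G) or pyrimidine-pyrimidine (C, T) changes.
TRANSITIONS = {frozenset("AG"), frozenset("CT")}


def count_differences(seq_a, seq_b):
    """Return (P, Q): proportions of transition and transversion differences."""
    sites = transitions = transversions = 0
    for a, b in zip(seq_a, seq_b):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguous bases
        sites += 1
        if a == b:
            continue
        if frozenset((a, b)) in TRANSITIONS:
            transitions += 1
        else:
            transversions += 1
    if sites == 0:
        raise ValueError("no comparable sites")
    return transitions / sites, transversions / sites


def jukes_cantor(p):
    """Jukes-Cantor distance: d = -(3/4) ln(1 - 4p/3)."""
    return -0.75 * math.log(1 - 4 * p / 3)


def kimura_2p(P, Q):
    """Kimura two-parameter distance: d = -(1/2) ln(1 - 2P - Q) - (1/4) ln(1 - 2Q)."""
    return -0.5 * math.log(1 - 2 * P - Q) - 0.25 * math.log(1 - 2 * Q)


# Hypothetical pair of aligned sequences.
seq_a = "ACGTACGTACGTACGTACGT"
seq_b = "GCGTACGTATGTACGTACGA"
P, Q = count_differences(seq_a, seq_b)
print("observed:", P + Q)
print("Jukes-Cantor:", jukes_cantor(P + Q))
print("Kimura 2P:", kimura_2p(P, Q))
```

Both corrected distances are larger than the observed proportion of differences, which reflects the hidden multiple hits. The formulas are undefined when the sequences are so divergent that the argument of the logarithm is zero or negative.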

Figure: Jukes-Cantor model. Rate of transition = rate of transversion (x).
Source: Author

Figure: Kimura two-parameter model. Rate of transition (y) ≠ rate of transversion (x).
Source: Author

Maximum Parsimony
Parsimony-based methods work on the principle of choosing the most parsimonious tree, that
is, the tree that requires the smallest number of evolutionary changes. Maximum parsimony
works as follows:
• Identify informative sites in the dataset. A site is informative if it favors some trees
over others; for nucleotide data this means it has at least two different character states,
each present in at least two OTUs.
• Construct trees. All possible trees are constructed and evaluated. The score of a tree is
the number of evolutionary changes required to explain the data on that tree.
• Retain the trees with the minimum score. More than one tree may be retained if they
share the same minimum score.
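The three steps can be sketched for a single alignment column and four OTUs. The text does not say how changes are counted; the sketch assumes Fitch's counting procedure, which is one common choice. The function names, OTU labels and the example column are hypothetical. Each unrooted tree is written as a nested tuple rooted at an arbitrary point, which does not affect the count.

```python
def is_informative(column):
    """True if at least two character states each occur in at least two OTUs."""
    counts = {}
    for state in column.values():
        counts[state] = counts.get(state, 0) + 1
    return sum(1 for count in counts.values() if count >= 2) >= 2


def fitch_score(tree, column):
    """Minimum number of changes needed to explain one column on one tree."""
    def visit(node):
        if isinstance(node, str):  # a leaf: its state set is the observed base
            return {column[node]}, 0
        left, right = node
        left_states, left_cost = visit(left)
        right_states, right_cost = visit(right)
        common = left_states & right_states
        if common:  # the children can agree, so no extra change is needed
            return common, left_cost + right_cost
        return left_states | right_states, left_cost + right_cost + 1

    return visit(tree)[1]


# Hypothetical alignment column for four OTUs.
column = {"OTU1": "A", "OTU2": "A", "OTU3": "G", "OTU4": "G"}

# The three possible unrooted trees for four OTUs.
trees = [
    (("OTU1", "OTU2"), ("OTU3", "OTU4")),
    (("OTU1", "OTU3"), ("OTU2", "OTU4")),
    (("OTU1", "OTU4"), ("OTU2", "OTU3")),
]

scores = {tree: fitch_score(tree, column) for tree in trees}
best = min(scores.values())
most_parsimonious = [tree for tree, score in scores.items() if score == best]
print(is_informative(column), scores, most_parsimonious)
```

In a real analysis the scores of all informative columns are summed for each tree before the minimum is taken. A column in which only one OTU differs would give every tree the same score, which is why such sites are not informative.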

Figure: For the column shown in color, there are three possible unrooted trees.
Source: Author

Figure: Numbers of changes (steps) are counted and trees with minimum score are
selected (changes marked with bullet). For the example given here, there are 15 possible
rooted trees for one column in sequence alignment, but we are showing only four as
example.
Source: Author
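The counts of three unrooted and 15 rooted trees for four OTUs follow from the standard formulas for strictly bifurcating trees: (2n - 5)!! unrooted and (2n - 3)!! rooted trees for n OTUs, where !! is the product of every second integer down to 1. A minimal sketch, with hypothetical function names, shows how quickly these numbers grow.

```python
def odd_double_factorial(k):
    """Product k * (k - 2) * ... * 3 * 1 for odd k; 1 if k is below 1."""
    product = 1
    while k > 1:
        product *= k
        k -= 2
    return product


def count_unrooted_trees(n_otus):
    """Number of bifurcating unrooted trees for n OTUs (n >= 3)."""
    return odd_double_factorial(2 * n_otus - 5)


def count_rooted_trees(n_otus):
    """Number of bifurcating rooted trees for n OTUs (n >= 2)."""
    return odd_double_factorial(2 * n_otus - 3)


for n in (4, 5, 10):
    print(n, count_unrooted_trees(n), count_rooted_trees(n))
```

Ten OTUs already give more than two million unrooted trees, which is why evaluating all possible trees becomes impractical for larger datasets.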

Statistical methods of phylogeny


Distance and maximum parsimony methods are often criticized for lacking a statistical
approach. Both have criteria for selecting trees but cannot calculate the probability that
one tree, rather than another, is the true tree. Various methods have been proposed to
overcome this drawback. Two such methods are the likelihood and Bayesian approaches.
In simple terms, likelihood is the probability of obtaining the dataset (the observed
characters, such as nucleotides) under a particular hypothesis (a tree and a model of
evolution). In a way this is similar to maximum parsimony because each tree is assigned a
score, but here the score is a likelihood based on statistical analysis. The best tree is
the one with the highest likelihood under a particular model of how changes occur. Both
maximum parsimony and maximum likelihood are computationally intensive and hence slow.
A detailed discussion of likelihood can be found in the referenced textbooks.
Another statistical method for phylogeny is the Bayesian method. In maximum likelihood we
calculate the probability of observing the data given a hypothesis; in the Bayesian method,
we calculate the probability of a particular hypothesis given the data.
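The contrast between the two approaches can be written compactly. The symbols are introduced here only for illustration: D stands for the data (the alignment), T for a tree and θ for the parameters of the model of evolution.

```latex
% Maximum likelihood: score each hypothesis by the probability of the data
% under it, and choose the hypothesis with the highest score.
\[
  L(T, \theta) = P(D \mid T, \theta), \qquad
  (\hat{T}, \hat{\theta}) = \arg\max_{T,\,\theta} \; P(D \mid T, \theta)
\]

% Bayesian method: use Bayes' theorem to obtain the probability of the
% hypothesis itself, given the data and a prior P(T, \theta).
\[
  P(T, \theta \mid D) = \frac{P(D \mid T, \theta)\, P(T, \theta)}{P(D)}
\]
```

The likelihood appears in both expressions; the Bayesian method additionally requires a prior probability for each hypothesis.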

Summary
Molecular phylogeny is the study of evolutionary relationships based on molecular sequence
data. Different methods have been proposed for studying phylogeny. Earlier methods were
distance based, and some of them (such as UPGMA) assumed constant evolutionary rates.
These were followed by more exhaustive and computationally intensive methods such as
maximum parsimony. These methods are now being supplemented or replaced by more
sophisticated statistical methods such as maximum likelihood and the Bayesian method. The
benefits and pitfalls of these methods are still debated, and their applicability may
depend on the situation. A basic understanding of these methods is essential for using
them effectively to reconstruct phylogeny.

Exercises
1. Define phylogenetics.
2. How will you define an Alignment?
3. Name three methods to draw a phylogenetic tree.
4. Define bootstrap.
5. How will you differentiate a dendrogram from a cladogram?
6. What is the difference between a distance-based method (NJ) and the maximum parsimony
(MP) method?
7. What is the difference between the UPGMA and NJ methods?
8. Differentiate between the maximum parsimony (MP) and maximum likelihood (ML) methods.
9. What is the difference between positive and negative selection?
10. Explain the principle of parsimony.
11. How will you explain a monophyletic group?
12. How will you explain a paraphyletic group?
13. How will you explain a polyphyletic group?
14. Differentiate between cladogram and phylogram.
15. How will you define root in a phylogenetic tree?
16. What is an outgroup?
17. Explain Jukes-Cantor model for calculating distance.
18. Explain Kimura-2 parameter model for calculating distance.
19. Differentiate between triangular and rectangular matrix.
20. How will you define sister clades?
21. Explain polytomy.
22. Explain how the neighbor joining method of phylogenetic tree construction works.

Glossary
Monophyly: When a group includes an ancestor and all its descendants.
Polyphyly: When different species in a taxon evolved from different ancestors.
Polytomy: When more than two branches arise from one internal node, usually because the
phylogeny is not resolved.
Bootstrap: A statistical method to assess confidence in groupings through random
resampling of the data.
Homoplasy: A condition in which similarity is a coincidence and not due to common lineage.
Outgroup: A taxon from a closely related group outside the taxa under study.
Informative sites: Sites that show alternate forms in different OTUs.

References

Suggested Readings
1. Bioinformatics and Functional Genomics by Jonathan Pevsner. YEAR. Publisher: Wiley-Blackwell.
2. Evolution by Nicholas H. Barton, Derek E.G. Briggs, Jonathan A. Eisen, David B. Goldstein
and Nipam H. Patel. 2007-2010. Publisher: CSHL Press.

Web Links
http://evolution.genetics.washington.edu
http://evolution.genetics.washington.edu/phylip.html
