Molecular Biology Primer
2 Polymers
Chemical characteristics of organisms, particularly polymers, can be readily quantified
and correlated using logical and statistical methods.
Three types of polymers (DNA, RNA, proteins) play an essential role in biology,
either as carriers of information, or as activating molecules of the metabolism.
• DNA sequences are the information-containing molecules and are composed of
nucleotides from an alphabet of four letters: a, c, g and t.1
The DNA of an organism plays a central role in its existence. It is arranged in
the form of chromosomes. These strings may be millions of nucleotides long,
measured in base pairs (bp).
The entire set of genetic information of an organism is called its genome.
The following are the genome sizes of certain species2:
1 The meaning of these symbols we will describe below.
2 Apart from our own species, the organisms listed are important in molecular biology and genetic
research
Species    Number of chromosomes (diploid)    Genome size (haploid, base pairs)
Roughly speaking, the order of genome size is kbp, Mbp and Gbp for Viruses,
Prokarya and Eukarya, respectively.
• Proteins, which are the operational molecules, are composed of chains of amino
acids, called polypeptides, each from an alphabet of 20 letters:
1 A ala alanine
2 C cys cysteine
3 D asp aspartic acid
4 E glu glutamic acid
5 F phe phenylalanine
6 G gly glycine
7 H his histidine
8 I ile isoleucine
9 K lys lysine
10 L leu leucine
11 M met methionine
12 N asn asparagine
13 P pro proline
14 Q gln glutamine
15 R arg arginine
16 S ser serine
17 T thr threonine
18 V val valine
19 W trp tryptophan
20 Y tyr tyrosine
Typical proteins contain about 300 amino acids (aa), but there are proteins with
fewer than 100 or as many as 5000 aa.
• RNA sequences, which stand between DNA and protein, are composed of nu-
cleotides from an alphabet of four letters: a, c, g and u.
a adenine
c cytosine
g guanine
t thymine
u uracil
r purine (a or g)
y pyrimidine (c or t)
The Central Dogma of Molecular Biology describes the interaction of these poly-
mers:
- DNA acts as a template to replicate itself;
- DNA is also transcribed into RNA; and
- RNA is translated into protein.
More precisely,
• Integral form: DNA makes RNA makes protein.
• Differential form: Changed DNA can make changed protein.
This runs in the following steps:
1. Replication of DNA.
Each strand of a DNA molecule is a chemical “mirror image” of the other. If there is
an a on one strand, there will always be a t in the same position on the other
strand, and vice versa; if there is a c on one strand, its “partner” on the
other strand will always be a g, and vice versa.
When a cell divides to form daughter cells, DNA is replicated by untwisting the
two strands and using each strand as a template to produce its chemical mirror
image.
2. Transcription of DNA.
DNA also acts as a blueprint for RNA, more exactly for three main types of RNA:
messenger RNA (mRNA), transfer RNA (tRNA), and ribosomal RNA (rRNA).
They carry information from the genome to the ribosomes, the protein synthesis
apparatus in a cell.
3. Translation of mRNA.
The information in an mRNA will be translated into a sequence of amino acids,
creating a polypeptide molecule.3
3 The coding scheme we will discuss below.
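The pairing rule of step 1 and the change of alphabet in step 2 can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the helper names `complement` and `transcribe` are hypothetical, strand direction is ignored, and transcription is reduced to rewriting a DNA word in the RNA alphabet (t becomes u).

```python
# Base-pairing rule from step 1: a pairs with t, c pairs with g.
PAIRING = {"a": "t", "t": "a", "c": "g", "g": "c"}


def complement(strand: str) -> str:
    """Return the partner strand, position by position (hypothetical helper)."""
    return "".join(PAIRING[base] for base in strand)


def transcribe(dna_word: str) -> str:
    """Simplified transcription: rewrite a DNA word in the RNA alphabet."""
    return dna_word.replace("t", "u")


dna = "atggctgct"            # made-up example word
partner = complement(dna)    # the "mirror image" strand
rna = transcribe(dna)        # same word over the alphabet a, c, g, u

print(partner)
print(rna)
```

Applying `complement` twice returns the original word, which is the property replication relies on when each strand serves as a template for the other.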
3 Proteins
Organic chemistry is the chemistry of carbon compounds. Biochemistry
is the study of carbon compounds that crawl.
Mike Adam
Structural proteins act as tissue building blocks, whereas other proteins known as
enzymes act as catalysts of chemical reactions. Proteins are not laid out simply as
straight chains of amino acids. The fact that they curl and fold into complex forms
plays a crucial role in determining the distinctive biological properties of each protein.
We distinguish the following structural levels for proteins:
1. The primary structure is the amino acid sequence.
2. The secondary structure is the local arrangement of the amino acid chain in space.
3. The tertiary structure is the three-dimensional folding pattern, which is super-
imposed on the secondary structure.
4. The quaternary structure is the arrangement of two or more polypeptides.
For instance, human insulin is composed of two words (chains):
A: gly ile val glu gln cys cys thr ser ile cys ser leu tyr gln leu glu asn tyr cys asn.
B: phe val asn gln his leu cys gly ser his leu val glu ala leu tyr leu val cys gly glu arg
gly phe phe tyr thr pro lys thr.
The function of a protein is a direct consequence of its three-dimensional structure,
in short:
Sequence ⇒ Structure ⇒ Function.
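Since a primary structure is just a word over the 20-letter alphabet, the three-letter notation used for insulin can be converted mechanically to one-letter codes. The following sketch uses the amino acid table given earlier and chain B exactly as printed above; the names `ONE_LETTER` and `to_word` are hypothetical.

```python
# Three-letter -> one-letter codes, copied from the amino acid table.
ONE_LETTER = {
    "ala": "A", "cys": "C", "asp": "D", "glu": "E", "phe": "F",
    "gly": "G", "his": "H", "ile": "I", "lys": "K", "leu": "L",
    "met": "M", "asn": "N", "pro": "P", "gln": "Q", "arg": "R",
    "ser": "S", "thr": "T", "val": "V", "trp": "W", "tyr": "Y",
}


def to_word(three_letter_sequence: str) -> str:
    """Turn a space-separated three-letter sequence into a one-letter word."""
    return "".join(ONE_LETTER[residue] for residue in three_letter_sequence.split())


# Chain B of human insulin as listed above.
INSULIN_B = (
    "phe val asn gln his leu cys gly ser his leu val glu ala leu tyr "
    "leu val cys gly glu arg gly phe phe tyr thr pro lys thr"
)

word_b = to_word(INSULIN_B)
print(word_b, len(word_b))
```

The one-letter word carries only the primary structure; nothing about folding can be read off from this representation alone.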
4 Genes
Historically, the heritable factors which determine much of the physical makeup of
organisms are called genes.
The AB0 blood groups in humans are determined by a system of three alleles A, B
and 0. The phenotypes resulting from the various unions of gametic genotypes are
shown in the following table:
male/female    A     B     0
A              A     AB    A
B              AB    B     B
0              A     B     0
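The table can be reproduced from a single rule that is read off from its entries: A and B each mask 0, and neither masks the other. A minimal sketch in Python, with the hypothetical function name `blood_group`:

```python
def blood_group(allele1: str, allele2: str) -> str:
    """Phenotype for a union of two gametic alleles from 'A', 'B', '0'.

    Rule assumed from the table: 0 is masked by A and by B,
    while A and B together give the phenotype AB.
    """
    visible = {allele1, allele2} - {"0"}
    if not visible:
        return "0"
    return "".join(sorted(visible))  # 'A', 'B' or 'AB'


ALLELES = ["A", "B", "0"]
table = {(m, f): blood_group(m, f) for m in ALLELES for f in ALLELES}

# Print one row per male gamete, columns ordered A, B, 0.
for m in ALLELES:
    print(m, [table[(m, f)] for f in ALLELES])
```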
4.3 Mutations
Although DNA replication is a very accurate system, it does not work correctly
on every occasion. Sometimes errors, called mutations, can creep into the process.
There are many different types of mutation:
DNA mutations: These point mutations fall into the following categories:
1. Transitions occur when a purine nucleotide (a or g) is substituted for an-
other purine; or a pyrimidine (c or t) is replaced by another pyrimidine.
2. Transversions occur when a pyrimidine is substituted for a purine, or vice
versa.
3. Indels are insertions or deletions of nucleotides.
Unless their length is a multiple of three, indels change the nucleotide sequence
such that the grouping of the nucleotides into triplets during translation is no
longer the same.
4 Note that the ratio (a − t) : (c − g) base pairs can vary widely from species to species.
Chromosomal mutations: Mutations can also occur at the chromosomal level. They are
classified in the following way:
1. The number of chromosomes in the cell is altered.
2. An inversion is a break in the chromosome such that the broken part
flips end-for-end before rejoining the rest of the chromosome in the reverse
direction.
3. In a translocation, a part of the broken chromosome joins another chromosome.
4. A duplication occurs when a part of the chromosome is present twice.
5. A deletion occurs when a part of the broken chromosome is lost.
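The two substitution categories listed under DNA mutations above can be told apart mechanically. The sketch below compares two DNA words of equal length and labels each differing position; the function names are hypothetical, and indels are deliberately not handled because they require an alignment.

```python
PURINES = {"a", "g"}
PYRIMIDINES = {"c", "t"}


def classify_substitution(old: str, new: str) -> str:
    """Label a single-nucleotide substitution as transition or transversion."""
    if old == new:
        raise ValueError("old and new nucleotide are identical")
    if not {old, new} <= PURINES | PYRIMIDINES:
        raise ValueError("nucleotides must be one of a, c, g, t")
    # Transition: both nucleotides belong to the same chemical class.
    same_class = {old, new} <= PURINES or {old, new} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def point_mutations(original: str, mutated: str):
    """List (position, old, new, kind) for every differing position."""
    if len(original) != len(mutated):
        raise ValueError("sequences differ in length; indels are not handled here")
    return [
        (i, x, y, classify_substitution(x, y))
        for i, (x, y) in enumerate(zip(original, mutated))
        if x != y
    ]


print(point_mutations("acgt", "gcga"))  # made-up example words
```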
auggcugcuauucccacccacaauaugcccuga
2. Decompose the sequence into successive triples (codons):
aug gcu gcu auu ccc acc cac aau aug ccc uga
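The decomposition in step 2 is easy to sketch in Python. The coding scheme itself is not given in this section, so the table below is an assumption: it is the standard genetic code in the usual compact ordering (bases u, c, a, g; `*` marks a stop codon). The names `codons` and `translate` are hypothetical.

```python
from itertools import product

# Standard genetic code (assumed here), codons ordered uuu, uuc, uua, uug, ucu, ...
BASES = "ucag"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(triple): aa
    for triple, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def codons(mrna: str) -> list:
    """Split an mRNA word into successive triples, dropping an incomplete tail."""
    usable = len(mrna) - len(mrna) % 3
    return [mrna[i:i + 3] for i in range(0, usable, 3)]


def translate(mrna: str) -> str:
    """Translate codon by codon until the first stop codon."""
    protein = []
    for codon in codons(mrna):
        amino_acid = CODON_TABLE[codon]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


mrna = "auggcugcuauucccacccacaauaugcccuga"
print(" ".join(codons(mrna)))
print(translate(mrna))
```

Inserting or deleting a single nucleotide in `mrna` shifts every later triple, which is why indels can change the whole downstream translation.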
5 Classifications
Classifications are of great relevance in biology. Here a class is defined as a group of
entities which are
• similar, and
• related.
In the book The System of Nature Linnaeus introduced a system still in use today.
He gave every species two Latinized names; the first for the group it belongs to, the
genus; and the second for the particular organism itself. Today we divide life into
- Domain5 ;
- Kingdom;
- Phylum;
- Class;
- Order;
- Family;
- Genus;
- Species.
More or less all of these groups are artificial, insofar as their members are categorized
according to agreed-upon levels of similarity rather than precise definitions. The
exception is the species, which is defined as a maximal group of individual organisms
that are able to interbreed and produce fertile offspring.
For example
group \ species human fruit fly
6 Mendel’s laws
A Mendelian population may be considered to be a group of reproducing organisms
with a relatively close degree of genetic relationship. We consider all the gametes produced
by a Mendelian population as a hypothetical mixture of genetic units from which
the next generation will develop. In such organisms adults produce female and male
gametes6 , which fuse to form zygotes, which develop and mature to adulthood. These
factors determining various traits are passed through the generations. It is of great
interest to describe and to understand this process.
Mendel published the results of his genetic studies on the garden pea in 1866 and
thereby laid the foundation of modern genetics. In this paper Mendel proposed some
basic genetic principles.
Principle of segregation: From any one parent, only one allelic form of a gene is
transmitted through a gamete to the offspring.
Principle of independent assortment: The segregation of one factor pair occurs
independently of any other factor pair.
We discuss several specific cases:
• Suppose that there are two and only two alleles A and a that are to be found
at a locus. A given individual may then have one of three genotypes: the
homozygotes AA or aa or the heterozygote Aa. The allele A may be dominant
over a, so that we cannot distinguish between the appearance of AA or Aa.
Generation 0 is known as the parental generation (P = F0 ), and generation n
as the nth filial generation (Fn ). Then
F0 : AA aa (1)
is followed by the generation
F1 : Aa aA (2)
which is uniform. But in the next generation we find
F2 : AA Aa aA aa (3)
with a phenotype ratio of 3 : 1 in favor of the dominant allele. This leads
to the following phenotypes in the next generation:
F3:        AA    Aa         aA         aa
     AA    4A    4A         4A         4A
     Aa    4A    3A + 1a    3A + 1a    2A + 2a          (4)
     aA    4A    3A + 1a    3A + 1a    2A + 2a
     aa    4A    2A + 2a    2A + 2a    4a
Altogether we obtain 48A + 16a, which is proportional to 3A + 1a. Thus
#A : #a = 3 : 1. (5)
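The counts in (3), (4) and (5) can be checked by enumerating all equally likely unions of gametes. The sketch below is a direct enumeration rather than a general genetics library; the helper names `cross` and `phenotype` are hypothetical, and A is assumed to be dominant over a as in the text.

```python
from collections import Counter
from itertools import product


def cross(parent1: str, parent2: str) -> list:
    """All four equally likely offspring genotypes of two diploid parents."""
    return [g1 + g2 for g1, g2 in product(parent1, parent2)]


def phenotype(genotype: str) -> str:
    """'A' if the dominant allele is present, otherwise 'a'."""
    return "A" if "A" in genotype else "a"


f1 = cross("AA", "aa")            # uniform generation, all heterozygous
f2 = cross(f1[0], f1[1])          # AA, Aa, aA, aa
f2_phenotypes = Counter(phenotype(g) for g in f2)

# F3: every pairing of F2 genotypes, four offspring per pairing.
f3_phenotypes = Counter(
    phenotype(child)
    for p1, p2 in product(f2, f2)
    for child in cross(p1, p2)
)

print(f1, f2)
print(f2_phenotypes, f3_phenotypes)
```

Each of the 16 pairings contributes four offspring, so the F3 counts sum to 64 and reduce to the same 3 : 1 ratio as in F2.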
      RY    RG    WY    WG
RY    RY    RY    RY    RY
RG    RY    RG    RY    RG
WY    RY    RY    WY    WY
WG    RY    RG    WY    WG
This shows the phenotype that results from each union of gametic genotypes.
Each of these possibilities is equally likely, so that
7 Darwin’s evolution
Biological evolution is part of the general idea that the universe has changed through
time.7 Moreover, Dobzhansky said that “Nothing in biology makes sense except in
the light of evolution.”
In his fundamental book The Origin of Species [3] Darwin created a theory of
evolution, the core of which is described by the following three facts.
• Reproduction;
• Mutation;
• Selection.
(Mayr [16] added a fourth fact: Catalyse.)
Evolution, by definition, is the change in allelic frequencies in populations from
generation to generation. Evolution by natural selection depends on five factors:
Excess progeny: More offspring are produced than can survive to reproduce.
Variability: The characteristics of living entities differ among individuals of the same
species.
Heritability: Many differences are the result of heritable genetic differences.
7 Don’t confuse the origin of life itself with evolution; the two are conceptually separate.
Differential adaptedness: Some differences affect how well adapted an organism
is.
Differential reproduction: Some differences in the quality of adaptation are reflected
in the number of offspring successfully reared.
Evolutionary biologists are concerned with both
• The history of life; and
• The processes and mechanisms that produced the tree of life.
Natural selection is evolution’s major cause. The principle is simple: Generate a
variety of possible solutions, and then pick one that works well for the problem. So
the essence of natural selection is
1. Genetic variation within a population,
2. An environmental condition favors some of these variations more than others,
and
3. Differential reproduction of the individuals who happen to have the favored
variations.
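The three ingredients above can be made quantitative with a very small model. The sketch below is an assumption that is not taken from the text: a deterministic one-locus model with two variants, where the favored variant has relative fitness `w_favored` and the other `w_other`, and the frequency p of the favored variant is updated by p' = p·w_favored / (p·w_favored + (1 − p)·w_other). All numerical values are hypothetical.

```python
def next_frequency(p: float, w_favored: float, w_other: float) -> float:
    """Frequency of the favored variant after one round of differential reproduction."""
    mean_fitness = p * w_favored + (1.0 - p) * w_other
    return p * w_favored / mean_fitness


def trajectory(p0: float, w_favored: float, w_other: float, generations: int) -> list:
    """Frequencies over successive generations, starting from p0."""
    frequencies = [p0]
    for _ in range(generations):
        frequencies.append(next_frequency(frequencies[-1], w_favored, w_other))
    return frequencies


# Hypothetical values: a variant starting at 10% with a 10% fitness advantage.
freqs = trajectory(p0=0.1, w_favored=1.1, w_other=1.0, generations=50)
print([round(p, 3) for p in freqs[::10]])
```

With equal fitness values the frequency does not change at all, which shows that in this model variation alone, without differential reproduction, produces no shift in allelic frequencies.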
Note that
Natural selection is the “survival of the fit enough”,
not, as the well-known phrase has it, the “survival of the fittest”: optimal
structures are not expected to be the end result in every case. We will see that
“survival of the fittest” can be false, and cannot be a scientific term.
It is crucial to define the term “fitness” for a genotype. We distinguish
Darwinian natural selection, it also became clear that this understanding could not be sought only
at a qualitative level. Mathematical methods must be added.
• The metabolic reactions are catalyzed largely by proteins.
• Proteins are manufactured in the cell by a complete coding process. The sequence
of amino acids in each protein is determined by the sequence of nucleotides
in its gene, “written” as DNA.
• The universal genetic code.
That so many things could have originated independently in different organisms by
chance is incredible.
This synthetic theory is usually called Neo-Darwinism, and has the following fea-
tures:
1. The average fitness increases; most of the mutations which are fixed in a popu-
lation are advantageous.
2. The molecular clock goes faster or slower depending on the population size.
In contrast there is a Neutral Theory, created by Kimura which says:
1. Most of the offspring have disadvantageous (fatal) genes, few have advantageous
genes.
2. The molecular clock holds.
extinct species... The limbs divided into great branches, and these into
lesser and lesser branches, were themselves once, when the tree was small,
budding twigs; and this connexion of the former and present buds by
ramifying branches may well represent the classification of all extinct and
living species in groups subordinate to groups... From the first growth of
the tree, many a limb and branch has decayed and dropped off, and these
lost branches of various sizes may represent those whole orders, families,
and genera which have now no living representatives, and which are known
to us only from having been found in a fossil state... As buds give rise by
growth to fresh buds, and these, if vigorous, branch out and overtop on
all sides many a feebler branch, so by generation I believe it has been with the great
Tree of Life, which fills with its dead and broken branches the crust of
the earth, and covers the surface with its ever branching and beautiful
ramifications.
Historically, this was a new idea: The concept of species having a continuity through
time was only developed in the late 17th century; higher life forms were no longer
thought to transmute into different kinds during the lifetime of an individual. It took
over 150 years from the development of this concept before a rooted tree was proposed
by Darwin.9
The phylogenetic tree can therefore be thought of as a central metaphor for evolution,
providing a natural and meaningful way to order data, and with an enormous amount
of evolutionary information contained within its branches.
for more information compare [18].
are, the closer they are to their common ancestor.
It is a central tenet of modern evolutionary biology that all “living things” trace
back to a single common ancestor. Humans and other mammals are descended from
shrew-like creatures that lived more than 150 Mya (million years ago); mammals,
birds, reptiles and fish share as ancestors aquatic worms that lived 600 Mya; all plants
and animals are derived from bacteria-like organisms that originated more than 3000
Mya. If we go back far enough, humans, frogs, bacteria and slime moulds share a
common ancestor.
Then in the series of species from the origin of life up to today there must be a last
universal common ancestor (LUCA). Note that this proposition does not assert that
life arose just once, but that all starting points except one became extinct.11
Finding the LUCA for a set of species, or a set of populations, or a collection of
genes is a very difficult task. How the LUCA for species can be found is discussed in
[24].
Eigen [5] found that the LUCA for genes is an RNA molecule of length 76 bp that
existed 3.5 to 4 Gya (billion years ago).
7.3 Diversity
The theory of evolution is concerned with the extraordinary diversity of life on Earth.
The diversity of the living world is staggering: more than 2 million existing species
of plants and animals have been named and described; and many more remain to be
discovered, up to 10 times this number according to some estimates. What
is impressive is not just the numbers but also the incredible heterogeneity. These
virtually infinite variations of life are the fruit of the evolutionary process.
Taxonomy, the classification of organisms, is the first aspect in any view of life.
Each phylogenetic tree is also a classification, but not vice versa.
The classification of animals and plants played an important role as a basis for Dar-
win’s theory of evolution. Moreover, taxonomy is necessary to describe the diversity
of living organisms.
The diversity of genomes is twofold:
• The presence of numerous species on Earth; and
• The polymorphism within each species.
There are many reasons why knowledge of biodiversity is necessary; compare [8],
[13] and [21].12 There are several subquestions:
1. How many species are there?
2. How many have become extinct, both in the past and in the present? How
many are lost every year?
11 For more facts about early (molecular) evolution see Eigen [6].
12 In particular, there is no successful vaccine to prevent or halt HIV infection. In part, this is
because of the high genetic diversity of HIV. For this specific case see Dress and Wetzel [4]. Here,
the main question is the prediction of the winning strain (or strains).
3. How long did species typically survive?
4. How much of evolutionary history is knowable?
For the idea of using evolutionary history to describe biodiversity, see Schleifer
and Horn [19].
8 Bioinformatics
It is extremely remarkable that the molecules which are the carriers of information
and the operational units which make life work are all linear polymers. Such polymers
can be written as sequences or words; and exactly these entities are the subjects which
can be handled by computers.
Bioinformatics means addressing biological questions with a computer, in particular
by
• Searching in biological databases, in particular using public databases;
• Comparing sequences, in particular by aligning sequences;
• Looking at protein structures;
• Phylogenetic analysis.
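As a small illustration of what comparing sequences means once polymers are treated as words, the sketch below counts mismatches between two words of equal length. It is not an alignment method: gaps are not considered, and the function names and the example words are hypothetical.

```python
def hamming_distance(s: str, t: str) -> int:
    """Number of positions at which two equal-length words differ."""
    if len(s) != len(t):
        raise ValueError("words must have equal length; no gaps are handled here")
    return sum(x != y for x, y in zip(s, t))


def percent_identity(s: str, t: str) -> float:
    """Share of matching positions, in percent."""
    if not s:
        raise ValueError("words must be non-empty")
    return 100.0 * (len(s) - hamming_distance(s, t)) / len(s)


first = "acgtacgt"    # made-up example words
second = "acgaacgg"
print(hamming_distance(first, second), percent_identity(first, second))
```

Alignment methods extend this idea by also allowing insertions and deletions, so that words of different lengths can be compared.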
It is worth noting here that the culture of computational biology
differs from the culture of bioinformatics [11]:
Sequence analysis plays an important role in both fields, but its methods
and goal are understood differently by computational biologists and by
bioinformaticians. Computational biology originally attracted a consider-
able number of practically minded theoretical biologists in the 1970s and
1980s who were both curious about the phenomenon of life and mathematically
literate. They wanted to study nucleic acid and protein sequences
in order to better understand life itself. In contrast, bioinformatics has at-
tracted a large number of skilled computer enthusiasts with knowledge of
computer programs that could serve as tools for laboratory biologists. . . .
Today’s split between computational biology and bioinformatics appears
to be a reflection of a profound cultural clash between curiosity-driven attitude
of computational scientists and adversarial competitiveness of molecular
biology software providers.
The main sources are:
Introduction: - www.molgen.mpg.de
- www.bioinformatik.de
- www.bioinformaticsonline.org
Genebank: - www.ncbi.nlm.nih.gov
- www.ebi.ac.uk
- www.ddbj.nig.ac.ac.jp
Human genome: - www.nhgri.nih.gov
Proteinbank: - www.expasy.ch
- www.embl-heidelberg.de
- www.pdb.bni.gov
Phylogeny: - www.ucmp.berkeley.edu/exhibit/phylogeny.html
- evolution.genetics.washington.edu
- awcmee.massey.ac.nz
- tolweb.org
9 Further reading
For biology, mostly from the viewpoint of evolution, compare Gould and Keeton [9],
Maynard Smith and Szathmáry [14], [15], Mayr [16].
Surveys of “Computational Molecular Biology” (with different approaches)
are given by Clote and Backofen [2], Fitch [7], Konopka and Crabbe [11], Setubal and
Meidanis [20], Vingron, Lenhof and Mutzel [22], Waterman [23].
Introductions to bioinformatics are given by Attwood and Parry-Smith [1],
Kanehisa [12], Mount [17].
References
[1] T.K. Attwood and D.J. Parry-Smith. Introduction to bioinformatics. Prentice
Hall, 1999.
[2] P. Clote and R. Backofen. Computational Molecular Biology. John Wiley & Sons,
2000.
[3] C. Darwin. The Origin of Species. London, 1859.
[4] A. Dress and R. Wetzel. The Human Organism - a Place to Thrive for the
Immuno-Deficiency Virus. In E. Diday, Y. Lechevallier, M. Schader, P. Bertrand,
and B. Burtschy, editors, New Approaches in Classification and Data Analysis,
pages 636–643. Springer Verlag, 1994.
[5] M. Eigen. Das Urgen. Nova Acta Leopoldina 243/52, Deutsche Akademie der
Naturforscher Leopoldina, 1980.
[6] M. Eigen. Stufen zum Leben. Serie Piper, 1992.
[7] W.M. Fitch. An Introduction to Molecular Biology for Mathematicians and Com-
puter Programmers. DIMACS Series in Discrete Mathematics and Theoretical
Computer Science, 47:1–31, 1999.
[8] M. Glaubrecht. Die ganze Welt ist eine Insel. Hirzel Verlag, 2002.
[9] J.L. Gould and W.T. Keeton. Biological Sciences. W.W.Norton and Company,
1996.
[10] D. Graur and W.H. Li. Fundamentals of Molecular Evolution. Sinauer Associates,
Inc., 1999.
[11] A.K. Konopka and M.J.C. Crabbe. Compact Handbook of Computational Biology.
Marcel Dekker, 2004.
[12] M. Kanehisa. Post-genome Informatics. Oxford University Press, 2000.
[13] B. Lomborg. The sceptical environmentalist. Cambridge University Press, 2002.