Chapter 4
Introduction to Bioinformatics
Tolga Can
Abstract
Bioinformatics is an interdisciplinary field mainly involving molecular biology and genetics, computer science,
mathematics, and statistics. Data intensive, large-scale biological problems are addressed from a computa-
tional point of view. The most common problems are modeling biological processes at the molecular level
and making inferences from collected data. A bioinformatics solution usually involves the following steps:
•	   Collect statistics from biological data.
•	   Build a computational model.
•	   Solve a computational modeling problem.
•	   Test and evaluate a computational algorithm.
      This chapter gives a brief introduction to bioinformatics by first providing an introduction to biologi-
cal terminology and then discussing some classical bioinformatics problems organized by the types of data
sources. Sequence analysis is the analysis of DNA and protein sequences for clues regarding function and
includes subproblems such as identification of homologs, multiple sequence alignment, searching sequence
patterns, and evolutionary analyses. Protein structures are three-dimensional data and the associated prob-
lems are structure prediction (secondary and tertiary), analysis of protein structures for clues regarding
function, and structural alignment. Gene expression data is usually represented as matrices and analysis of
microarray data mostly involves statistics analysis, classification, and clustering approaches. Biological net-
works such as gene regulatory networks, metabolic pathways, and protein–protein interaction networks are
usually modeled as graphs and graph theoretic approaches are used to solve associated problems such as
construction and analysis of large-scale networks.
      Key words Bioinformatics, Sequence analysis, Structure analysis, Microarray data analysis, Biological
      networks
1  A Brief Introduction to Biological Terminology
                                The most abundant data in Bioinformatics consists of DNA
                                sequences. DNA is composed of four bases, A, G, C, and T. A
                                DNA sequence can be a coding or a noncoding sequence. There
                                can be several thousand bases in a gene and several million bases in
Malik Yousef and Jens Allmer (eds.), miRNomics: MicroRNA Biology and Computational Analysis, Methods in Molecular Biology,
vol. 1107, DOI 10.1007/978-1-62703-748-8_4, © Springer Science+Business Media New York 2014
                                                           51
52   Tolga Can
                 a typical bacterial genome [1]. There are about 3.2 billion bases in
                 the human genome. Protein sequences are sequences of amino
                 acids and there are 20 different types of basic amino acids in nature
                 making a protein sequence a character string composed of a
                 20-letter alphabet. An average length protein consists of about 300
                 amino acids [1]. There are millions of known protein sequences.
                 The genotype, i.e., the DNA and the corresponding protein
                 sequences, determines the phenotype (the appearance of an organ-
                 ism and how it performs its functions). The structure of a protein
                 plays an important role in determining its function. Prediction of
                 protein structure from primary protein sequence is one of the big-
                 gest unsolved challenges in biophysics and bioinformatics today.
                 There are ab initio techniques that are based on energy-
                 minimization, and there are knowledge-based techniques such as
                 homology modeling and threading.
                      Gene expression is the process of using the information in the
                 DNA to synthesize a messenger-RNA (transcription). The tran-
                 scribed messenger RNA (mRNA) is then used to synthesize the
                 corresponding protein via translation. This whole work-flow is
                 also known as the central dogma of molecular biology. Regulation
                 of gene expression, in other words, determining which genes are
                 expressed in a specific tissue or condition and which genes are
                 not, is also a challenge in molecular biology. The area of tran-
                 scriptomics deals with this problem and latest advances in bio-
                 technology allow researchers to assay gene expression on a large
                 scale in a high-throughput manner. Microarrays (or gene expres-
                 sion arrays, or gene chips) provide massive amount of data about
                 gene activity under a specific condition or in a certain biological
                 sample. Interaction assays also provide information about which
                 proteins interact with each other and allow researchers to gain a
                 systems level view of the cell. Interaction data accumulated in
                 various databases such as MINT [2], Intact [3], and DIP [4]
                 continuously grow and methods that are able to analyze large-
                 scale graphs which can represent this type of information are dis-
                 cussed in Subheading 5.
                      Analysis of molecular biology data has many challenges such as
                 redundancy and multiplicity, noise, and incompleteness. There are
                 thousands of data sources for molecular biology data. As of January
                 2012, the Nucleic Acids Research (NAR) online database collec-
                 tion available at http://www.oxfordjournals.org/nar/database/a/
                 lists 1,380 carefully selected databases covering aspects of molecu-
                 lar and cell biology. The most popular databases are NCBI Entrez
                 [5], EBI Ensemble [6], UCSC Genome Browser [7], and KEGG
                 [8] databases. Although there are initiatives for data standardiza-
                 tion such as the Gene Ontology [9] consortium, most of the data-
                 bases have their own data format and one has to learn specifics of a
                 database to be able to use the data effectively.
                                                            Introduction to Bioinformatics          53
2  Sequence Analysis
                       In this section, the main sequence analysis problems in
                       Bioinformatics are described, such as pairwise sequence alignment,
                       multiple sequence alignment, pattern search, and construction of
                       phylogenetic trees.
2.1  Pairwise          Pairwise sequence alignment is the problem of identifying similari-
Sequence Alignment     ties and differences between a pair of DNA or protein sequences [1].
                       The common assumption is that the sequences have diverged from
                       a common ancestor and the alignment result shows us the conserved
                       (i.e., unchanged) regions and regions that are diverged by evolu-
                       tionary events such as mutations, insertions, inversions, duplications,
                       and deletions. Figure 1 shows an alignment of two DNA sequences.
                            The level of similarity between two sequences is measured
                       using a scoring mechanism applied on the obtained alignment.
                       A simple scoring mechanism involves three parameters: match
                       score, mismatch penalty, and gap penalty. Biologically more accu-
                       rate scoring mechanisms give different match scores to different
                       types of matched amino acids (e.g., scoring matrices such as
                       BLOSUM62 and PAM250) and penalize the opening and exten-
                       sion of a gaps differently (i.e., affine gap penalties). Given a scoring
                       mechanism the goal of sequence alignment is to align two given
                       input sequences such that the obtained score is optimal with
                       respect to the scoring function. The well-known dynamic pro-
                       gramming solutions to the sequence alignment problem by
                       Needleman–Wunsch [10] for global alignment and Smith–
                       Waterman [11] for local alignment find the optimal alignment
                       between two sequences in polynomial time (O(mn), where m and
                       n are the lengths of the input sequences). The two variations of the
                       sequence alignment problem, the global alignment problem and
                       the local alignment problem, have very similar dynamic program-
                       ming solutions. In global alignment every nucleotide or amino
                       Fig. 1 Alignment of two DNA sequences of same length. The unaligned sequences
                       are shown at the top and the aligned sequences are shown at the bottom. The
                       dashes show the deletions, i.e., gaps, in the respective sequence. Likewise, the
                       nucleotide in the other sequence matched against a gap can be considered
                       insertion. As the two events cannot be differentiated easily, they are usually
                       referred to as indels
54   Tolga Can
                 acid in each sequence has to be aligned against an amino acid or a
                 gap, whereas in local alignment some parts of one or both sequences
                 may be ignored and a local sub-sequence alignment can be
                 obtained. The dynamic programming solution to the sequence
                 alignment problem makes use of the observation that the align-
                 ment score is additive and does not have non-local terms. Therefore,
                 the alignment problem can be divided into subproblems and the
                 scores of the subproblems can be added to obtain the score of the
                 overall alignment. Using this observation one can construct the
                 following recurrence relations to optimally align a pair of DNA or
                 protein sequences for global alignment (Eq. 1).
                            F (i , 0 ) =   ås ( A (k ) , - )     0£k £i
                            F ( 0,j ) =    ås ( -, B (k ) )      0£k £ j                      (1)
                                                é F (i , j - 1) + s (-, B ( j )), ù
                                               ê                                      ú
                            F (i ,j ) =    max ê F (i - 1, j ) + s (A (i), -), ú
                                               êëF (i - 1, j - 1) + s (A (i), B ( j )úû	                                                                                         	
                      Equation 1: the recurrence relations for the dynamic program-
                 ming solution to global sequence alignment.
                      In the recurrence relations given above, A and B denote the
                 input sequences and F(i,j) denotes the score of the optimum align-
                 ment of the prefixes of A and B including the first i and j charac-
                 ters, respectively. The scoring function s(x,y) calculates the
                 individual score for aligning particular amino acids or nucleotides,
                 x and y, or alignment against a gap, i.e., s(x,−). The recurrence rela-
                 tions above are for the scoring mechanism which assumes a linear
                 gap penalty. The recurrence relations can be modified easily for the
                 affine gap model [1].
                      Using the recurrence relations given above, one can fill a table
                 named partial scores table by using F(0,0) = 0 and the first two
                 equations above two initialize the first column and row of the table
                 in case of global alignment. The subsequent entries in the table can
                 be easily filled in by using the third equation and at the end of
                 algorithm execution, the entry F(m,n) indicates the optimum score
                 of globally aligning A and B. In order to construct the actual align-
                 ment which shows which nucleotide or amino acid is matched
                 against which, one can trace back the partial scores table and fol-
                 low a path from F(m,n) to F(0,0) by tracing the cells that deter-
                 mine the value at a particular cell F(i,j). In other words, by noting
                 which of the three terms inside the max function is used to find the
                 maximum value, we can construct the alignment easily. Figure 2
                 shows a completed partial scores table and the corresponding
                 alignment. The alignment path is indicated as arrows on the partial
                 scores table. The overall optimum score of the alignment is 2.
                 A linear gap penalty of −8 is used in this example. The match scores
                 and mismatch penalties are taken from a scoring matrix which gives
                 different scores and penalties to different types of amino acids.
                                          Introduction to Bioinformatics        55
    Fig. 2 The completed partial scores table for the global alignment of two
    sequences. The alignment path is indicated with arrows. The table is filled by
    using Eq. 1
       The local sequence alignment problem can be solved by a small
    modification to the dynamic programming recurrence relations as
    shown in Eq. 2 below.
                F (i , 0 ) = 0
                F ( 0,j ) = 0                                                  (2)
                                  é                 0                    ù
                                  ê F i , j - 1 + s (-, B ( j )), ú
                                         (        )                      ú
                F (i ,j ) =   max ê
                                  ê F (i - 1, j ) + s (A (i), -), ú
                                  ê                                      ú
                                  êëF (i - 1, j - 1) + s (A (i), B ( j )úû	                                                                          	
        Equation 2: the recurrence relations for the dynamic program-
    ming solution to local sequence alignment.
        The modifications ensure that there are no negative values in
    the partial scores table; hence, once can start an alignment any-
    where within the sequences and end the alignment anywhere.
    Figure 3 shows the partial scores table for a local alignment of two
    protein sequences. Linear gap penalty with a gap penalty of −1 is
    used. Match score is 4 and mismatch score is −2 in this example.
    The optimum local alignment score is 14 as highlighted by a circle
    and the alignment path is indicated by arrows.
        The running time and space complexities of both algorithms
    are O(mn). Although polynomial, the quadratic time complexity
    may be a big bottleneck when aligning whole genomes and when
    performing database searches. Therefore, algorithms which use
    several indexing mechanisms and heuristics have been developed
    among which BLAST [12] and FASTA [13] are two prominent
    ones which provide faster ways to perform larger-scale alignments.
56      Tolga Can
                     Fig. 3 The completed partial scores table for the local alignment of two sequences.
                     The alignment path is indicated with arrows. The table is filled by using Eq. 2
                     These algorithms do not guarantee optimality; however, they are
                     widely used in practice due to the interactive running-time perfor-
                     mances they provide. Another reason for their popularity is that
                     they provide statistical significance measures such as an E-value to
                     indicate a level of significance for the alignments.
2.2  Multiple        Quite often biologists need to align multiple related sequences
Sequence Alignment   simultaneously. The problem of aligning three or more DNA or
                     protein sequences is called the multiple sequence alignment (MSA)
                     problem. MSA tools are one of the most essential tools in molecu-
                     lar biology. Biologists use MSA tools for finding highly conserved
                     subregions or embedded patterns within a set of biological
                     sequences, for estimating evolutionary distance between sequences,
                     and for predicting protein secondary/tertiary structure. An exam-
                     ple alignment of eight short protein sequences is shown in Fig. 4.
                          In the example alignment (Fig. 4), conserved regions and pat-
                     terns are marked with different shades of gray. Extending the
                     dynamic programming solution for pairwise sequence alignment
                     to multiple sequence alignment, results in a computationally
                     expensive algorithm. For three sequences of length n, the running
                     time complexity is 7n3 which is O(n3). For k sequences, in order to
                     run the dynamic programming solution a k-dimensional matrix
                     needs to be built which results in a running time complexity of
                     (2k − 1)(nk) which is O(2knk). Therefore, the dynamic programming
                     approach for alignment between k sequences is impractical due to
                     exponential running time and can only be used for few or very
                     short sets of sequences. Because of this drawback several heuristic
                     solutions have been developed to solve the multiple sequence
                     alignment problem suboptimally in a reasonable amount of time.
                                    Introduction to Bioinformatics          57
Fig. 4 Conserved residues, regions, patterns in a multiple alignment of eight
protein sequences are highlighted
Fig. 5 Star alignment of four sequences. The center sequence is S2. Each of the
other sequences is aligned pairwise to S2 and these alignments are combined
one by one to get the multiple sequence alignment
Most of these heuristics use pairwise alignments between the input
sequences to build a multiple sequence alignment progressively.
These heuristics are named “progressive alignment approaches.”
The simplest of progressive alignment methods is the star align-
ment [14]. In star alignment, one of the input sequences is selected
as the center and the pairwise alignments of the center sequence to
the rest of the sequences are used to construct a multiple sequence
alignment progressively. The center sequence is chosen as the most
similar sequence to all of the other sequences. One may perform an
all-to-all pairwise alignment of input sequences for this purpose
and choose the sequence which maximizes the total score of align-
ments to that sequence. After the center sequence is chosen, the
pairwise alignments to the center sequence can be written one after
another by using the center sequence as the reference in the align-
ment. A star alignment example is given in Fig. 5.
     In this example, S2 is selected as the center sequence and the
pairwise alignments shown on the left are used to build the multi-
ple sequence alignment shown on the lower right. If the average
length of k input sequences is n, the cost of finding the center
58        Tolga Can
Fig. 6 Multiple alignment of four sequences using a guide tree. The guide tree shows that we should align S1
to S3 and S2 to S4 and then align the resulting pairwise alignments to get the multiple sequence alignment
                            sequence is O(k2n2), which constitutes the overall cost of the star
                            alignment. The biggest disadvantage of the progressive alignment
                            approach is that decisions made early in the iterations are fixed and
                            propagated to the final alignment. If one makes an incorrect align-
                            ment decision due to not foreseeing the rest of the sequences, this
                            error is going to appear in the final alignment (once a gap always a
                            gap). Therefore, the order of pairwise alignments in the progres-
                            sive alignment approach is very important and affects the final
                            quality of the alignment. Aligning more similar sequences in the
                            early iterations is more likely to produce more accurate alignments.
                            Better progressive alignment techniques, such as ClustalW [15],
                            use a guide tree, which is a schedule of pairwise alignments to build
                            the multiple sequence alignment. ClustalW is one of the most pop-
                            ular progressive alignment methods. The guide tree is built using
                            the neighbor joining method [16] which is a method for building
                            phylogenetic trees. Figure 6 shows the main components of
                            ClustalW for an example run of four sequences.
                                The distance matrix on the left of Fig. 6 shows the pairwise
                            distances between the input sequences. This distance matrix is used
                            to build the guide tree shown on the right. The guide tree provides
                            the schedule of progressive alignments and the multiple sequence
                            alignment of the sequences are constructed by iterative pairwise
                            alignments. Some of the drawbacks of the progressive approach are
                            that they depend on pairwise sequence alignments. They produce
                            inaccurate alignments if the sequences are very distantly related,
                            and care must be made in choosing scoring matrices and penalties.
                            In addition to progressive alignment approaches, there are also
                            iterative approaches which make random changes in the final align-
                            ment successively as long as the alignment gets better [17].
                                The result of a multiple alignment of sequences is usually rep-
                            resented as a sequence profile. A simple sequence profile may indi-
                            cate for each alignment position the composition of amino acids or
                            nucleotides at that position. There are also more elaborate statisti-
                            cal methods to represent profiles, such as Hidden Markov Models
                            (HMMs).
                                                                 Introduction to Bioinformatics              59
                         Fig. 7 The suffix tree for the string xabxac. The root of the tree is on the left. Each
                         suffix can be traced from the root to the corresponding leaf node by a unique
                         path. The leaf node number indicates the corresponding suffix
2.3  Efficient Pattern   Some problems in sequence analysis involve searching a small
Matching Using Suffix    sequence (a pattern) in a large sequence database, finding the longest
Trees                    common subsequence of two sequences, and finding an oligonucle-
                         otide sequence specific to a gene sequence in a set of thousands of
                         genes. Finding a pattern P of length m in a sequence S of length n can
                         be solved simply with a scan of the string S in O(mn) time. However,
                         when S is very long and we want to perform many pattern searches,
                         it would be desirable to have a search algorithm that could take O(m)
                         time. To facilitate this running time we have to preprocess S. The
                         preprocessing step is especially useful in scenarios where the sequence
                         is relatively constant over time (e.g., a genome), and when search is
                         needed for many different patterns. In 1973, Weiner [18] introduced
                         a data structure named suffix trees that allows searching of patterns in
                         large sequences in O(m) time. He also proposed an algorithm to con-
                         struct the suffix tree in O(n) time. The construction of the suffix tree
                         is an offline onetime cost which can be ignored when multiple
                         searches are performed. Below we give a quadratic-time construction
                         algorithm which is easier to describe and implement. But first we
                         formally define the suffix tree. Let S be a sequence of length n over a
                         fixed alphabet Σ. In biological applications the alphabet usually con-
                         sists of the four nucleotides for DNA sequences or the 20 amino
                         acids for protein sequences. A suffix tree for S is a tree with n leaves
                         (representing n suffixes) and the following properties:
                         	1.	 Every internal node other than the root has at least two children.
                         	2.	 Every edge is labeled with a nonempty substring of S.
                         	3.	 The edges leaving a given node have labels starting with differ-
                              ent letters.
                         	4.	 The concatenation of the labels of the path from the root to the
                              leaf i spells out the ith suffix of S. We denote the ith suffix by Si. Si
                              is the substring of S from the ith character to the last character of S.
                             Figure 7 shows an example suffix tree which is constructed for
                         the sequence xabxac.
60       Tolga Can
                            Note that if a suffix is a prefix of another suffix we cannot have
                       a tree with the properties defined above. Therefore, we introduce
                       a terminal characters at the end of the sequences, such as $, which
                       does not exist in the input alphabet. The following quadratic time
                       algorithm can be used to construct a suffix tree for a given input
                       sequence S of length n.
                       	 1.	 Start with a root and a leaf numbered 1, connected by an edge
                             labeled S$.
                       	 2.	 Enter suffixes S2$, S3$; … ; Sn$ into the tree as follows:
                           	(a)	 To insert Si = S[i … n]$, follow the path from the root
                                 matching characters of Si until the first mismatch at char-
                                 acter Si[j] (which is bound to happen).
                           	(b)	 If the matching cannot continue from a node, denote that
                                 node by w.
                           	(c)	 If the mismatch occurs at the middle of an edge e with a
                                 label S[u…v], split e and create a new node w. Replace e
                                 by two edges S[u … u + k − 1] and S[u + k … v] if the mis-
                                 match occurs at character S[k].
                           	(d)	 Create a new leaf numbered i, and connect w to it by an
                                 edge labeled with Si[j … |Si|].
                           Given a sequence S and a pattern P, we search for all occur-
                       rences of P in S using the observation that each occurrence of P has
                       to be a prefix of some suffix in S. Each such prefix corresponds to
                       a path starting at the root. So, we try to match P on a path, starting
                       from the root. There are three possible cases.
                       	 1.	 The pattern does not match along the path → P does not occur
                             in T.
                       	 2.	 The pattern match ends in a node u of the tree (set x = u) → All
                             leaves below x represent occurrences of P.
                       	 3.	 The match ends inside an edge (v,w) of the tree (set x = w) → All
                             leaves below x represent occurrences of P.
                           It is important to note that the suffix tree method described
                       above performs exact pattern searches. In certain biological problems,
                       where one is looking for similarities rather than exact occurrences,
                       approximate string matching may be sought. In such cases, approx-
                       imate extensions of suffix trees or other probabilistic methods
                       should be used [19].
2.4  Construction of   Phylogeny is a term coined by Haeckel in 1866 [20] which means
Phylogenetic Trees     the line of descent or evolutionary development of any plant or
                       animal species or the origin and evolution of a division, group or
                       race of animals or plants. Phylogenetic analyses can be used for
                       understanding evolutionary history like the origin of species, assist
                       in epidemiology of infectious diseases or genetic defects, aid in
                                                                  Introduction to Bioinformatics          61
Fig. 8 The UPGMA method to construct a phylogenetic tree for five sequences. The method works on the distance
matrix on the upper right and constructs the tree on the upper left
                             prediction of function of novel genes, used in biodiversity studies,
                             and can be used for understanding microbial ecologies. The data
                             for building phylogenetic trees may be characteristics such as traits
                             and biomolecular features, or they can be numerical distance esti-
                             mates such as the edit distance between genomic sequences. The
                             result of a phylogenetic analysis is a phylogenetic tree which shows
                             the relationships among a group of sequences or species in a hier-
                             archical manner. In this regard, phylogenetic trees can be consid-
                             ered as a hierarchical clustering. Below, we describe two
                             distance-based methods, UPGMA [21] and neighbor joining [16],
                             which construct a phylogenetic tree of input sequences given an
                             all-to-all distance (or similarity) matrix between them. Both
                             UPGMA and neighbor joining methods are agglomerative cluster-
                             ing methods which start with all the input sequences in separate
                             clusters and group them into larger clusters in subsequent steps.
                             UPGMA stands for unweighted pair-group method with arithme-
                             tic mean. At each iteration, the closest sequences (or groups of
                             sequences) are identified and merged to form a combined larger
                             cluster which makes it a greedy algorithm. In order to proceed,
                             one has to define the new distance values between clusters. In
                             UPGMA, the cluster distance is defined as the average distance of
                             all pairwise distances between two cluster elements, one from each
                             cluster. Figure 8 shows the resulting phylogenetic tree after appli-
                             cation of UPGMA on five input sequences. The corresponding
                             distance matrix is shown on the right.
                                  In the first iteration, B and C are combined to obtain the cluster
                             BC. The total distance of 4 between B and C is divided equally
                             between the branches that connect B and C at the intermediate
                             node BC. B and C are then removed from the original distance
                             matrix and replaced by BC and the distances are updated accord-
                             ingly (shown on the lower left). BC is then combined with D and
62   Tolga Can
                 Fig. 9 The phylogenetic tree constructed by the neighbor joining method on the
                 five sequences. The distance matrix is shown on the right and the final NJ tree is
                 shown on the left
                 A is combined with E in the following step. In the last step BCD is
                 combined with AE. The intermediate distance matrices are shown
                 in Fig. 8. UPGMA produces ultrametric trees, in other words the
                 distance of every leaf to the root is equal in the tree. If the original
                 distance matrix is ultrametric, then the UPGMA method can pro-
                 duce the correct tree; otherwise, it produces an approximate tree.
                 A more accurate algorithm is the neighbor joining algorithm. The
                 neighbor joining algorithm was developed in 1987 by Saitou and
                 Nei [16]. At each iteration, clusters that are close to each other and
                 far from the rest are joined. Neighbor joining always finds the cor-
                 rect tree if the distances are additive. However, in biological prac-
                 tice, the distances may not be additive; hence, this method produces
                 approximate trees since finding the optimum tree, in which the
                 difference between path distances and actual distances is minimum,
                 is an NP-hard problem [22]. The neighbor joining algorithm is
                 given below for n sequences provided as input:
                 	 1.	For each node i, define ui as summation of all the distance
                      from i to all other nodes divided by n − 2.
                 	 2.	 Iterate until two nodes are left.
                 	 3.	 Choose a pair of nodes (i,j) with smallest Dij − ui − uj.
                 	 4.	 Merge the chosen node to a new node ij and update distance
                       matrix.
                      	(a)	 Dk,ij = (Di,k + Dj,k − Di,j)/2.
                      	(b)	 Di,ij = (Di,j + ui − uj)/2.
                      	(c)	 Dj,ij = Di,j − Di,ij.
                 	 5.	 Delete nodes i and j.
                 	 6.	 For the final group (i,j), use Di,j as the edge weight.
                     Figure 9 shows the neighbor joining tree for the same set of
                 sequences as in the UPGMA example (Fig. 8). Note the differ-
                 ences between the two resulting trees.
                                                        Introduction to Bioinformatics     63
3  Structure Analysis
                        Proteins perform their functions within the cell by their structural
                        and physiochemical properties. Therefore, analysis of protein
                        structure is very important to understand their function. Drugs
                        can be designed more accurately if the three-dimensional structures
                        of proteins are known. There are different levels of protein struc-
                        ture: primary structure, secondary structure, tertiary structure,
                        and quaternary structure [1]. Primary structure is the linear
                        sequence of amino acids. Secondary structure is the formation of
                        local shapes such as helices or extended beta strands. Tertiary
                        structure is the global three-dimensional shape of a single polypep-
                        tide chain. Quaternary structure is the formation of a molecular
                        complex consisting of multiple polypeptide chains and or pros-
                        thetic groups. One of the biggest challenges of structural bioinfor-
                        matics is the prediction of tertiary structure from primary structure.
                        This problem is named as the protein folding problem, and all the
                        methods that are known today are approximate methods which do
                        no guarantee to find the actual true structure. Therefore, biolo-
                        gists mostly trust the experimentally determined structures when
                        working with protein structures. However, the number of experi-
                        mentally determined structures is less than 1 % of the known
                        protein sequences. Determination of structure using X-ray crystal-
                        lography and nucleic magnetic resonance is very time-consuming
                        and expensive. Moreover, crystallization leads to distortion of the
                        protein structure and thus the resulting structure is an approxima-
                        tion of the exact structure of the native protein. This section
                        describes the main approaches used for protein structure predic-
                        tion and then details the problem of structural alignment.
3.1  Prediction of      There are three main approaches for predicting protein structure
Protein Structures      from protein sequence: ab initio methods, homology modeling
from Primary            methods, and threading methods. In the ab initio methods struc-
Sequence                ture is predicted using pure chemistry and physics knowledge
                        without the use of other information. These techniques exhaus-
                        tively search the fold space and therefore can only predict structures
                        of small globular proteins (length < 50 amino acids). Homology
                        modeling, on the other hand, makes use of the experimentally
                        determined structures and uses sequence similarity to predict the
                        structure of the target protein. The main steps of homology mod-
                        eling are as follows:
                        	 1.	Identify a set of template proteins (with known structures)
                             related to the target protein. This is based on sequence similar-
                             ity (BLAST, FASTA) with sequence identity of 30 % or more.
                        	 2.	 Align the target sequence with the template proteins. This is
                              based on multiple sequence alignment (ClustalW). Identify
                              conserved regions.
64       Tolga Can
                      	 3.	 Build a model of the protein backbone, taking the backbone
                            of the template structures (conserved regions) as a model.
                      	 4.	 Model the loops. In regions with gaps, use a loop modeling
                            procedure to substitute segments of appropriate length.
                      	 5.	 Add side chains to the model backbone.
                      	 6.	 Evaluate and optimize entire structure.
                          In order to be able to apply homology modeling approaches,
                      the target protein sequence has to exhibit at least 30 % sequence
                      identity to a protein sequence with known experimental structure.
                      To be able to predict novel sequences with no similarity to sequences
                      with known structures, one can use threading based approaches
                      [1]. In threading, the target sequence is aligned with a library of
                      structures called the template library. The motivation behind
                      threading is that if the template library covers the structural space
                      well, with no sequence similarity present, one can evaluate the fit-
                      ness of the target sequence on known structures (i.e., folds) and
                      predict its overall structure. The threading problem involves the
                      computation of the optimal alignment score between a given
                      sequence and a template structure, i.e., a fold. If we can solve this
                      problem, then given a sequence, we can try each fold and find the
                      best structure that fits this sequence. Because there are only a few
                      thousands folds, we can find the correct fold for the given sequence.
                      However, it was shown that threading is an NP-hard problem [23].
                      Therefore, researchers have proposed heuristics which do not guar-
                      antee the optimal alignment between the sequence and the fold.
                      The components of a full threading method include the following:
                      ●●   Template library: Use structures from DB classification catego-
                           ries (PDB).
                      ●●   Scoring function.
                      ●●   Single and pairwise energy terms.
                      ●●   Alignment.
                      ●●   Consideration of pairwise terms leads to NP-hardness.
                      ●●   Use heuristics.
                      ●●   Confidence assessment.
                      ●●   Z-score, P-value similar to sequence alignment statistics.
                      ●●   Improving the final structure.
                      ●●   Improvement by local threading and multi-structure threading.
3.2  The Structural   Pairwise structural alignment of protein structures is the problem
Alignment Problem     of finding similar subregions of two given protein structures. The
                      key problem in structural alignment is to find an optimal corre-
                      spondence between the arrangements of atoms in two molecular
                      structures in order to align them in 3D. Optimality of the
                                Introduction to Bioinformatics     65
alignment is often determined using a root mean square measure
of the distances between corresponding atoms in the two mole-
cules although other measures have been proposed. However, the
difficulty is that it is not known a priori which atom in the first
molecule corresponds to which atom in the second molecule.
Furthermore, the two molecules may not even have the same num-
ber of atoms. The alignment can be done at various levels such as
the atom level, amino acid level, or secondary structure element
(SSE) level. During comparison, one can take into account only
the geometry or architecture of coordinates or relative positions of
amino acids or the sequential order of residues along the backbone
may be considered. Some approaches use the physiochemical prop-
erties of amino acids but most approaches just treat protein struc-
tures as geometrical shapes during comparison. Pairwise structure
alignment problem is an NP-hard problem unlike pairwise sequence
alignment [23]. Therefore, many different techniques have been
proposed for the structural alignment problem. The most widely
used measures in assessing the quality of the alignment is the length
of the alignment and the RMSD between the two aligned regions.
RMSD (root mean squared distance) shows how similar the identi-
fied regions are in terms of their shapes and the length of the align-
ment shows whether the method is capable of identifying more
significant larger similarities between structures. An RMSD value
of less than 3.0 Å is considered a good structural match. One of the
earliest algorithms for structural alignment is an iterative dynamic
programming algorithm called STRUCTAL [24]. STRUCTAL
was proposed by Subbiah in 1993 and has the following steps:
	 1.	 Start with arbitrary alignment of the amino acids in the two
      protein structures A and B.
	2.	Superimpose the two structures with respect to the initial
    alignment in order to minimize RMSD.
	 3.	 Compute a structural alignment (SA) matrix where entry (i,j)
      is the score for the structural similarity between the ith amino
      acid of A and the jth amino acid of B. Structural similarity is
      some constant divided by the distance between the ith amino
      acid of A and the jth amino acid of B.
	 4.	Use Dynamic Programming to compute the next alignment
     with gap penalty 0.
	5.	Iterate steps 2–4 until the overall score converges.
	 6.	 Repeat with a number of initial alignments.
    Another algorithm for structural alignment is Dali developed
by Holm and Sander in 1993 [25]. Dali uses intra-atomic distance
matrices to represent three-dimensional proteins structures as two-
dimensional matrices. It then compares the intra-atomic distance
matrices of the two input protein structures in order to find a
structural alignment. There have been many other algorithms like
66     Tolga Can
                     VAST [26], CE [27], LOCK [28], and TOPS [29] developed for
                     structural alignment, which differ in how they represent the pro-
                     tein structure, extract structural features, and match matching
                     structural features. Some algorithms find short but very similar
                     regions, whereas some algorithms are able to detect larger regions
                     with less similarity. The statistical theory of structural alignments is
                     similar to that of BLAST as many methods compare the likelihood
                     of a match as compared to a random match. However, there is less
                     agreement regarding the score matrix; hence, the z-scores of CE,
                     DALI, and VAST may not be compatible.
4  Microarray Data Analysis
                     Cells containing the same DNA are still different because of dif-
                     ferential gene expression. Only about 40 % of human genes are
                     expressed at any one time. A gene is expressed by transcribing
                     DNA into single-stranded mRNA. Then, the mRNA is translated
                     into a protein. This is known as the central dogma of molecular
                     biology [1]. The microarray technology allows for measuring the
                     level of mRNA expression in a high-throughput manner. Messenger
                     RNA expression represents the dynamic aspects of the cell. In
                     order to measure the mRNA levels, mRNA is isolated and labeled
                     using a fluorescent material. When the mRNA is hybridized to the
                     target, the level of hybridization corresponds to light emission
                     which is measured with a LASER. Therefore, higher concentration
                     means more hybridization which indicates more mRNA.
                          Microarrays can be used to measure mRNA levels in different
                     tissues, different developmental stages, different disease states, and
                     in response to different treatments. One of the main sources of
                     microarray data is NCBI’s Gene Expression Omnibus (GEO) [30].
                     As of May 2012, there are about 750,000 microarray samples per-
                     formed in about 30,000 different experiments in the GEO. A typi-
                     cal microarray sample’s raw data is about 10–30 Mb.
                          The main characteristic of microarray data are that they are
                     extremely high dimensional where the number of genes is usually
                     on the order of tens of thousands and the number of experiments
                     is on the order of tens. Microarray data is considered to be noisy
                     due to imperfect hybridization and off-target hybridization.
                     Normalization and thresholding are important for interpreting
                     microarray data (see Chapters 16 and 17). Some, mRNA levels may
                     not be read correctly and may be missing from the final output due
                     to a gene failing to hybridize the microarray spot. These character-
                     istics of microarray data make data mining on this data a challeng-
                     ing task and having too many genes leads to many false positive
                     identifications. For exploration purposes, a large set of all relevant
                                        Introduction to Bioinformatics                                  67
    genes is desired, whereas for diagnostics or identification of thera-
    peutic targets, a small set of genes is needed.
         In a microarray experiment, biologists usually seek for differen-
    tially expressed genes between two conditions such as a diseased
    tissue versus a normal tissue. There are well-developed statistical
    techniques to identify differentially expressed genes accurately, such
    as the significance analysis of microarrays [31]. There are also some
    other data mining related problems in microarray data analysis.
    Some clustering problems can be identified by trying to find genes
    with similar function, or trying to subdivide experiments or genes
    into meaningful classes. Classification problems also exist by trying
    to correctly classify an unknown experiment or gene into a known
    class. Also by classifying the patient’s microarray sample into a
    known cancer subtype accurately one can make better treatment
    decisions for a cancer patient based on gene expression profile.
         The rest of this section describes the distance measures used in
    clustering methods and describes a well-known clustering algo-
    rithm: the K-means algorithm [32].
         The goal of a clustering algorithm is to group similar genes or
    samples. In order to identify similarity, one needs a distance or
    similarity function. The most widely accepted distance measures in
    microarray analysis are the Euclidean distance and the Pearson’s
    correlation coefficient. For two n dimensional vectors, the
    Euclidean distance is defined as:
                                               n
                        d ( X ,Y ) =          å(X             - Yi )
                                                                          2
                                                          i                                            (3)
	                                             i =1                                                                              	
         Equation  3: the Euclidean distance between two n dimen-
    sional points X and Y.
         The upregulation and downregulation behavior of a gene is
    relative to its basal expression level which may vary among genes.
    Therefore, the Euclidean distance measure may result in high dis-
    tances between similarly behaving genes if their basal expression
    level is different. To overcome this problem, Pearson’s correlation
    coefficient is used. Pearson’s correlation coefficient for two n
    dimensional vectors is defined as:
                                        n
                                       å(X           i            (
                                                         - X ) Yi - Y                 )
               PCC ( X ,Y ) =          i =1
                                                                                                       (4)
                                 n                                    n
                                å(X                -X)            å (Y                     )
                                                              2                                2
                                              i                                   i   -Y
	                               i =1                              i =1
                                                                                                   	
       Equation 4: the Pearson’s correlation coefficient between two
    genes X and Y with n samples
       where X and Y are the means of the dimensions of the vectors.
68     Tolga Can
                         Given a distance measure and a microarray dataset, K-means
                     algorithm can be used to group genes or samples to a predetermined
                     number of groups, i.e., k. K-means algorithm proceeds as follows:
                     	 1.	 Randomly assign k points to k clusters.
                     	2.	Iterate.
                         	(a)	 Assign each point to its nearest cluster (use centroid of
                               clusters to compute distance).
                         (b)	After all points are assigned to clusters, compute new                         	
                             centroids of the clusters and reassign all the points to the
                             cluster of the closest centroid.
                          In order to find the natural cluster number in the dataset, dif-
                     ferent values of k may be tried.
                          There are many other clustering techniques such are hierarchi-
                     cal clustering (e.g., UPGMA [21] and neighbor joining [16]) and
                     self organizing maps (SOM) (see Chapter 16 and [33]). Different
                     clustering algorithms produce different clusterings of the data and
                     one can choose to use a clustering method based on the applica-
                     tion requirements such as the cluster sizes.
5  Construction and Analysis of Biological Networks
                     Recent discoveries show that the complex biological functions of
                     higher organisms are due to combinatorial interactions between
                     their proteins. There are signaling pathways, metabolic pathways,
                     protein complexes, gene regulatory networks that take part in the
                     complex organization of the cell. There are many research projects
                     to find the complete repertoire of interactions between proteins in
                     an organism. There are experimental high throughput methods,
                     like yeast-two-hybrid (Y2H) and affinity purification with mass
                     spectrometry (APMS). There are also machine learning methods
                     to predict interactions from indirect data sources such as genomic
                     context, sequence similarity, and microarray expression profiles.
                          A collection of interactions defines a network and a pathway is
                     a subset of it. We can define a pathway as a biological network that
                     relates to a known physiological process or a complete function.
                     Figure 10 shows an example of a biological network which are usu-
                     ally modeled as graphs.
                          Results of high-throughput experiments to identify interacting
                     proteins are usually collected in databases such as DIP [4], MINT
                     [2], and IntAct [3]. The results of low-throughput experiments can
                     be found in literature by using text mining techniques. Hundreds
                     of thousands of unstructured free text articles should be processed
                     automatically to extract protein–protein interaction information.
                     The main challenges are that there is no standard naming of genes,
                     proteins, or processes and methods to understand natural language
                                        Introduction to Bioinformatics              69
                   FANCD2                 H2AFX
           RBBP8
                                        ATM
                                                            TP53BP1
            BARD1            RAD51                                       CDKN1A
   BRIP1
                                                  TP53
                     BRCA2      BRCA1                                 MDM4
                                                                                  USP7
                                                                 MDM2
           UIMC1
                              ESR1
                                           SP1           EP300
                                                                        HIF1A
                                 SRC
Fig. 10 Functional associations around the BRCA1 gene in human. The snapshot
is created from the interactive network visualization tool from the STRING
Database at string.embl.de
need to be developed to facilitate the process. The accuracy and
coverage of current techniques is limited.
     In order to construct large-scale biological networks, predicted
and experimentally determined interactions can be combined to
generate a large-scale graph of interactions. Large-scale protein–
protein interaction networks can be analyzed for predicting mem-
bers of a partially known protein complex or pathway, for inferring
individual genes’ functions on the basis of linked neighbors, for
finding strongly connected components by detecting clusters to
reveal unknown complexes, and for finding the best interaction
path between a source and a target gene. Several clustering
algorithms such as MCL (Markov clustering) [34], RNSC
(restricted neighborhood search clustering) [35], SPC (super para-
magnetic clustering) [36], and MCODE (molecular complex
detection) [37] were developed to find clusters in biological net-
works. The MCL algorithm simulates a flow on the input graph
and calculates successive powers of the adjacency matrix to reveal
clusters in the graph. It has only one parameter which influences
the number of clusters that are generated. The RNSC algorithm
starts with an initial random clustering and tries to minimize a cost
function by iteratively moving vertices between neighboring clus-
ters using several parameters. SPC is a hierarchical algorithm
70        Tolga Can
                            inspired from an analogy with the physical properties of a ferro-
                            magnetic model subject to fluctuation at nonzero temperature.
                            MCODE gives a weight to each vertex by its local neighborhood
                            density (using a modified version of clustering coefficient using
                            k-cores). It starts from the top weighted vertex and includes neigh-
                            borhood vertices with similar weights to the cluster. MCODE has
                            a post-processing step to remove or add new vertices. Iteratively, it
                            continues with the next highest weight vertex in the network.
                            MCODE may provide overlapping clusters.
                                 In a cell, all these networks are intertwined and perform their
                            functions in a dynamic manner. As new components, regulatory
                            mechanisms are discovered; we are able to construct a more com-
                            plete picture of the genome-wide systems biology of the cell. One
                            such regulatory mechanism is the regulation of the translation of a
                            protein by microRNAs. microRNAs and their targets form another
                            network of regulatory interactions and this network can be com-
                            bined with gene regulatory networks, which contain transcription
                            factors and their targets, to get a more complete picture of regula-
                            tory mechanisms. Further integration of the regulatory mecha-
                            nisms with protein–protein interaction networks may provide
                            insights into the dynamics of signaling pathways and other pro-
                            cesses in the cell.
References
	1.	Zvelebil M, Baum J (2007) Understanding           	 9.	 The Gene Ontology Consortium (2000) Gene
      bioinformatics. Garland Science, New York,            ontology: tool for the unification of biology.
      NY. ISBN 978-0815340249                               Nat Genet 25(1):25–29
	 2.	 Chatr-aryamontri A, Ceol A, Palazzi LM et al    	10.	 Needleman SB, Wunsch CD (1970) A general
      (2007) MINT: the Molecular INTeraction                method applicable to the search for similarities
      database. Nucleic Acids Res 35(Suppl                  in the amino acid sequence of two proteins.
      1):D572–D574                                          J Mol Biol 48(3):443–453
	 3.	 Kerrien S, Aranda B, Breuza L et al (2012)      	11.	 Smith TF, Waterman MS (1981) Identification
      The IntAct molecular interaction database             of common molecular subsequences. J Mol
      in 2012. Nucleic Acids Res 40(D1):                    Biol 147:195–197
      D841–D846                                       	12.	 Altschul S, Gish W, Miller W et al (1990) Basic
	 4.	 Xenarios I, Rice DW, Salwinski L et al (2000)         local alignment search tool. J Mol Biol
      DIP: the database of interacting proteins.            215(3):403–410
      Nucleic Acids Res 28:289–291                    	13.	Lipman DJ, Pearson WR (1985) Rapid and
	 5.	 Maglott D, Ostell J, Pruitt KD, Tatusova T            sensitive protein similarity searches. Science
      (2010) Entrez Gene: gene-centered informa-            227(4693):1435–1441
      tion at NCBI. Nucleic Acids Res 33(D1):         	14.	Bafna V, Lawler EL, Pevzner PA (1993)
      D54–D58                                               Approximation algorithms for multiple
	 6.	 Flicek P, Amode MR, Barrell D et al (2012)            sequence alignment. Theor Comput Sci
      Ensemble 2012. Nucleic Acids Res 40(D1):              182(1–2):233–244
      D84–D90                                         	15.	 Chenna R, Sugawara H, Koike T et al (2003)
	 7.	 Kent WJ, Sugnet CW, Furey TS et al (2002)             Multiple sequence alignment with the Clustal
      The human genome browser at UCSC.                     series of programs. Nucleic Acids Res 31(13):
      Genome Res 12(6):996–1006                             3497–3500
	8.	Kanehisa M, Goto S, Sato Y et al (2012)           	16.	 Saitou N, Nei M (1987) The neighbor-joining
      KEGG for integration and interpretation of            method: a new method for reconstructing
      large-scale molecular datasets. Nucleic Acids         phylogenetic trees. Mol Biol Evol 4(4):
      Res 40(D1):D109–D114                                  406–425
                                                                   Introduction to Bioinformatics          71
	17.	 Chakrabarti S, Lanczycki CJ, Panchenko AR et       	28.	Singh AP, Brutlag DL (1997) Hierarchical
      al (2006) State of the art: refinement of mul-           protein structure superposition using both sec-
      tiple sequence alignments. BMC Bioinformatics            ondary structure and atomic representations.
      7:499                                                    In Proc. Fifth Int. Conf. on Intell. Sys. for
	18.	Weiner, P. (1973) Linear pattern matching                 Mol. Biol. AAAI Press, Menlo Park, CA, pp
      algorithm, 14th Annual IEEE Symposium on                 284–293
      Switching and Automata Theory, 15–17               	29.	 Viksna J, Gilbert D (2001) Pattern matching
      October, 1973, USA, pp 1–11                              and pattern discovery algorithms for protein
	19.	 Cobbs AL (1995) Fast approximate matching                topologies. Algorithms in bioinformatics: first
      using suffix trees. Combinatorial pattern                international workshop, WABI 2001 proceed-
      matching, vol 937, Lecture notes in computer             ings, vol 2149, Lecture notes in computer sci-
      science. Springer, New York, NY, pp 41–54                ence. Springer, New York, NY, pp 98–111
	20.	 Haeckel E (1868) The history of creation, vol      	30.	Edgar R, Domrachev M, Lash AE (2002)
      1, 3rd edn. Trench & Co., London, Translated             Gene expression omnibus: NCBI gene expres-
      by E. Ray Lankester, Kegan Paul                          sion and hybridization array data repository.
	21.	Sokal R, Michener C (1958) A statistical                  Nucleic Acids Res 30(1):207–210
      method for evaluating systematic relationships.    	31.	Tusher VG, Tibshirani R, Chu G (2001)
      Univ Kans Sci Bull 38:1409–1438                          Significance analysis of microarrays applied to
	22.	 Day WHE (1986) Computational complexity                  the ionizing radiation response. Proc Natl
      of inferring phylogenies from dissimilarity              Acad Sci 98(9):5116–5121
      matrices. Bull Math Biol 49:461–467                	32.	 Lloyd SP (1982) Least squares quantization in
	23.	Lathrop RH (1994) The protein threading                   PCM. IEEE Trans Inform Theor 28(2):
      problem with sequence amino acid interaction             129–137
      preferences is NP-complete. Protein Eng 7(9):      	33.	 Kohonen T (1982) Self-organized formation
      1059–1068                                                of topologically correct feature maps. Biol
	24.	Subbiah S, Laurents DV, Levitt M (1993)                   Cybern 43(1):59–69
      Structural similarity of DNA-binding domains       	34.	van Dongen, S. (2000) Graph clustering by
      of bacteriophage repressors and the globin               flow simulation. Ph.D. thesis, University of
      core. Curr Biol 3:141–148                                Utrecht, May 2000
	25.	 Holm L, Sander C (1993) Protein structure          	35.	 King AD, Pržulj N, Jurisica I (2004) Protein
      comparison by alignment of distance matrices.            complex prediction via cost-based clustering.
      J Mol Biol 233(1):123–138                                Bioinformatics 20(17):3013–3020
	26.	Gibrat JF, Madej T, Bryant SH (1996)                	36.	Blatt M, Wiseman S, Domany E (1996)
      Surprising similarities in structure comparison.         Superparamagnetic clustering of data. Phys
      Curr Opin Struct Biol 6(3):377–385                       Rev Lett 76:3251–3254
	27.	Shindyalov IN, Bourne PE (1998) Protein             	37.	 Bader GD, Hogue CW (2003) An automated
      structure alignment by incremental combina-              method for finding molecular complexes in
      torial extension of the optimum path. Protein            large protein interaction networks. BMC
      Eng 11(9):739–747                                        Bioinformatics 4:2