Bioinformatics is an interdisciplinary field that combines biology, computer science,
mathematics, and statistics to analyze and interpret biological data.
     Example Tools & Databases
         •   Databases: NCBI, UniProt, Ensembl, KEGG, Pfam
         •   Tools: BLAST, Clustal Omega, FastQC, GROMACS, HMMER
     Computational Biology is a broad field that applies mathematical models, computational
     simulations, and algorithmic techniques to study biological systems. While closely related to
     bioinformatics, computational biology places a stronger emphasis on developing theoretical
     models and simulations to understand biological behavior.
      Bioinformatics                              Computational Biology
      Focuses on data management & analysis       Focuses on modeling and simulation
      Works with biological databases             Builds predictive biological models
      Emphasizes tools and pipelines              Emphasizes theory and mathematical models
      Data-driven (e.g., genome sequencing)       Hypothesis-driven (e.g., simulating evolution)
     Biological sequences are linear chains of monomers (nucleotides in DNA and RNA, amino acids
     in proteins) whose order carries the information required for life. They are fundamental to
     molecular biology and bioinformatics.
     Biological databases are organized collections of biological data, essential for storing, retrieving,
     and analyzing information such as DNA sequences, protein structures, gene expression data,
     and more.
     CLASSIFICATION OF DATABASES
1.      Nucleotide Sequence Databases
     Store DNA or RNA sequences.
         •   GenBank (NCBI, USA)
         •   ENA, the European Nucleotide Archive (EMBL-EBI, Europe)
         •   DDBJ (Japan)
     These three collaborate and exchange data daily through the INSDC (International Nucleotide
     Sequence Database Collaboration).
        Protein Sequence Databases
     Store amino acid sequences of proteins.
         •   UniProt (Universal Protein Resource)
                 o   UniProtKB/Swiss-Prot: Manually curated, high-quality data
                 o   UniProtKB/TrEMBL: Computationally annotated, not manually reviewed
         •   PIR (Protein Information Resource)
         •   PDB (Protein Data Bank; primarily a 3D structure archive that also holds the sequences
             of the deposited proteins)
Applications
   •   Gene identification and annotation
   •   Phylogenetic analysis
   •   Comparative genomics
   •   Protein structure and function prediction
   •   Drug target discovery
 2.        Structure Databases in Bioinformatics
   •   Structure databases store three-dimensional (3D) structural data of biological
       macromolecules like proteins, nucleic acids (DNA/RNA), and complexes. These
       structures are crucial for understanding biological function, drug design, enzyme
       mechanisms, and protein-ligand interactions.
       E.g., PDB (accessed through portals such as RCSB PDB), CATH, SCOP
       Visualization Tools
   •   PyMOL
   •   Chimera/ChimeraX
   •   Jmol
Applications
   •   Drug design: Understanding how molecules bind to proteins
   •   Molecular docking: Predicting binding interactions
   •   Enzyme engineering
   •   Protein folding and dynamics
   •   Structure-function relationship analysis
  3.   Genome-Specific Databases in Bioinformatics
       Genome-specific databases store and organize genomic information of a
       particular species, group of organisms, or model organisms. These databases are
       essential for studying gene function, genome organization, evolution, mutations,
       and comparative genomics.
          Eg: Ensembl, FlyBase, WormBase
          Applications
      •   Studying gene regulation and structure
      •   Comparative genomics and evolution
      •   Understanding model organisms in biomedical research
      •   Finding candidate genes for diseases or traits
      •   CRISPR guide RNA design and gene editing
4. Specialized Databases in Bioinformatics
  Specialized databases (or special databases) are focused repositories that contain specific types
  of biological data—such as pathways, gene expression, protein families, diseases, or molecular
  interactions—rather than general sequences or structures. These are critical for functional
  genomics, systems biology, and translational research.
  E.g., KEGG, Pfam, ExPASy ENZYME
  Applications of Special Databases
      •   Functional annotation of genes/proteins
      •   Disease-gene association studies
      •   Drug discovery and toxicology
      •   Biomarker identification
      •   Systems biology and network modelling
                                             Microarray
  Microarray Technology in Bioinformatics
  Microarray is a high-throughput technique used to measure the expression levels of thousands
  of genes simultaneously or to genotype multiple regions of a genome. It plays a key role in
  transcriptomics, diagnostics, and disease research.
      What is a Microarray?
  A microarray is a small chip (glass or silicon) onto which DNA probes are fixed in a grid
  pattern. These probes hybridize with complementary DNA (cDNA) or RNA from a sample, and
  the intensity of the signal indicates the expression level of each gene.
   Basic Steps of a Microarray Experiment
   1. Sample Preparation
           o   Extract RNA from cells or tissue
           o   Convert it to cDNA and label it with fluorescent dyes (e.g., Cy3 and Cy5)
   2. Hybridization
           o   Apply the labeled cDNA to the microarray chip
           o   cDNA binds (hybridizes) to complementary DNA probes on the chip
   3. Scanning
           o   Use a laser scanner to detect fluorescence intensity at each spot
   4. Data Analysis
           o   Convert fluorescence signals into gene expression values
           o   Normalize and compare data across samples (e.g., normal vs. diseased)
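The data analysis step can be illustrated with a minimal sketch for a two-colour array. It assumes that the Cy5 and Cy3 channels are given as plain lists of background-corrected intensities, one value per gene. The function names, the pseudocount, and median centring as the normalization method are illustrative assumptions, not a prescribed pipeline.

```python
import math


def log2_ratios(cy5, cy3, pseudocount=1.0):
    """Return per-gene log2(Cy5/Cy3) ratios from two-channel intensities.

    The pseudocount (an assumption of this sketch) avoids division by zero.
    """
    if len(cy5) != len(cy3):
        raise ValueError("channel lists must have the same length")
    return [math.log2((r + pseudocount) / (g + pseudocount))
            for r, g in zip(cy5, cy3)]


def median_center(values):
    """Subtract the median so that the typical gene has a log ratio of 0."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    median = ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2
    return [v - median for v in values]


# Hypothetical intensities for three genes.
ratios = log2_ratios([3, 1, 7], [1, 1, 1])
centered = median_center(ratios)
```

A positive value means higher expression in the Cy5-labelled sample, a negative value higher expression in the Cy3-labelled sample.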
Applications of Microarray
          Area                                   Applications
Gene expression profiling    Compare expression in healthy vs. diseased tissues
Cancer research              Identify tumor-specific gene signatures
Drug discovery               Evaluate drug impact on gene expression
Diagnostics                  Identify infections, genetic disorders
Microarray Data Repositories
Database                                Description
GEO (Gene Expression Omnibus, NCBI)     Repository for microarray and RNA-seq datasets
ArrayExpress (EMBL-EBI)                 Public archive for functional genomics data
                                      Metabolic Pathway
Metabolic Pathway in Bioinformatics and Biology
A metabolic pathway is a series of interconnected biochemical reactions that transform a
starting molecule into a final product, catalyzed by enzymes. These pathways are essential for
energy production, growth, and maintenance of cellular functions in living organisms.
   Types of Metabolic Pathways
Pathway Type    Function                                   Example
Catabolic       Break down molecules to release energy     Glycolysis, β-oxidation
Anabolic        Build complex molecules using energy       Protein synthesis, gluconeogenesis
Amphibolic      Serve both anabolic and catabolic roles    Citric acid cycle (TCA/Krebs cycle)
   Examples of Key Metabolic Pathways
Pathway                                Description
Glycolysis                             Converts glucose to pyruvate, generating ATP
TCA Cycle (Krebs cycle)                Oxidizes acetyl-CoA to CO₂, producing NADH, FADH₂
Electron Transport Chain               Uses electrons from NADH/FADH₂ to produce ATP
Pentose Phosphate Pathway              Generates NADPH and ribose sugars
Fatty Acid Metabolism                  Includes β-oxidation (breakdown) and synthesis
Amino Acid Metabolism                  Transamination, deamination, urea cycle
Photosynthesis (plants)                Converts light energy into chemical energy
Nitrogen Fixation (microbes/plants)    Converts atmospheric nitrogen to ammonia
   Bioinformatics Resources for Metabolic Pathways
Database        Description
KEGG Pathway    Interactive maps of metabolic pathways with gene-enzyme-compound links
MetaCyc         Curated database of experimentally validated metabolic pathways
BioCyc          Collection of organism-specific pathway/genome databases
Components of a Metabolic Pathway
   •   Substrates: Starting molecules (e.g., glucose)
   •   Products: Final molecules (e.g., pyruvate)
   •   Enzymes: Biological catalysts for each step
   •   Intermediates: Compounds formed between start and end
   •   Coenzymes: NAD⁺, FAD, ATP – help transfer energy or atoms
Applications of Metabolic Pathway Analysis
   •   Drug target identification
   •   Metabolic engineering (e.g., in synthetic biology)
   •   Understanding disease mechanisms (e.g., cancer metabolism)
                                                   Motif
Motif in Bioinformatics and Molecular Biology
A motif is a short, conserved sequence pattern in DNA, RNA, or protein molecules that has a
biological function. Motifs are often involved in important roles such as binding sites, active
sites, structural regions, or regulatory elements.
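Because a motif is a short sequence pattern, a simple scan can be written with a regular expression. The sketch below assumes the commonly quoted TATA box consensus TATA(A/T)A(A/T). Real motifs are usually modelled with position weight matrices rather than exact patterns, and the function name is hypothetical.

```python
import re

# Assumed consensus for illustration: TATA(A/T)A(A/T).
# The lookahead lets overlapping occurrences be reported as well.
TATA_BOX = re.compile(r"(?=(TATA[AT]A[AT]))")


def find_motif(sequence, pattern=TATA_BOX):
    """Return (0-based start, matched text) for every hit in the sequence."""
    return [(m.start(), m.group(1)) for m in pattern.finditer(sequence.upper())]


# Hypothetical promoter fragment.
hits = find_motif("ggcTATAAAAggc")
```

Here `hits` holds one match starting at index 3. A sequence without the pattern returns an empty list.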
   Types of Motifs
Molecule    Motif Type           Function
DNA         Regulatory motifs    Act as transcription factor binding sites (e.g., TATA box, CAAT box)
RNA         RNA motifs           Secondary structure motifs (e.g., hairpin loops, riboswitches)
Protein     Protein motifs       Structural/functional units (e.g., zinc finger, helix-turn-helix, EF-hand)
   Examples of Biological Motifs
   DNA Motifs
   •   TATA Box: Promoter region in eukaryotes for transcription initiation
   •   CpG Island: Regions rich in CpG dinucleotides (a C followed by a G), often near promoters
   •   Enhancer motifs: Bind activator proteins to boost gene expression
   Protein Motifs
   •   Zinc Finger: Binds DNA; involved in gene regulation
   •   Leucine Zipper: Mediates protein-protein interactions
   •   SH2 Domain: Binds phosphorylated tyrosines in signaling pathways
   •   Walker A/B: ATP binding motifs in enzymes
   Motif Discovery Tools
Tool          Purpose
MEME Suite    Discover novel motifs in DNA/protein sequences
PROSITE       Database of protein motifs and patterns
Pfam          Protein families and domains (some include motifs)
JASPAR        DNA binding motifs for transcription factors
MotifScan     Scans sequences for known motifs
   Applications of Motif Analysis
   •   Predicting gene regulatory elements
   •   Identifying binding sites in proteins and DNA
   •   Understanding evolutionary conservation
   •   Classifying proteins into functional families
   •   Designing mutational studies or synthetic biology parts
                                              Domain Databases
Domain Databases in Bioinformatics
Domains are structurally and functionally distinct units within proteins that can evolve,
function, and exist independently. Domain databases collect and annotate these conserved
regions, helping to predict protein function, structure, and evolutionary relationships.
   Major Domain Databases
Database    Description                                        Key Features
Pfam        Protein families and domains                       Uses multiple sequence alignments and hidden Markov models (HMMs)
InterPro    Integrative resource of protein domains            Combines Pfam, SMART, PROSITE, TIGRFAMs, and more
PROSITE     Protein domains, families, and functional sites    Uses patterns and profiles for detection
SCOP        Structural classification of protein domains       Hierarchical classification based on structure
CATH        Protein domain classification by structure         Class, Architecture, Topology, Homologous superfamily
   Difference Between Motif and Domain
Motif                             Domain
Short, conserved sequence         Larger, functional unit
May not fold independently        Can fold/function independently
Often a small part of a domain    Can consist of several motifs
                                           Data file formats
1. Sequence Data Formats
Format                Used For                             Description
FASTA (.fa/.fasta)    DNA/RNA/protein sequences            Plain text; each record starts with a ">" header line followed by the sequence
FASTQ (.fq/.fastq)    Raw sequencing reads with quality    Includes quality scores per base (e.g., from Illumina)
GenBank (.gb/.gbk)    Annotated nucleotide sequences       Includes features, genes, CDS, organism
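The two simplest formats above can be read with a few lines of code. The sketch below is a minimal illustration rather than a replacement for a library parser. It assumes well-formed input, and it assumes the Phred+33 quality encoding used by current Illumina output. The function names are hypothetical.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict of {header: sequence}."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A ">" line starts a new record; the rest of the line is the header.
            header = line[1:]
            records[header] = []
        elif header is not None:
            # Sequence lines may be wrapped, so collect them and join later.
            records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


def phred_scores(quality_line, offset=33):
    """Decode a FASTQ quality string into integer Phred scores.

    offset=33 is an assumption (Phred+33); older files may use 64.
    """
    return [ord(ch) - offset for ch in quality_line]


demo = parse_fasta(">seq1 demo\nACGT\nTTAA\n>seq2\nGG\n")
qualities = phred_scores("I!5")
```

A Phred score Q corresponds to an error probability of 10^(-Q/10), so a score of 40 means roughly one expected error in 10,000 base calls.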
2. Protein Structure Formats
Format          Used For                           Description
PDB (.pdb)      3D structure of biomolecules       Atomic coordinates from X-ray, NMR, cryo-EM
mmCIF (.cif)    Alternative to PDB format          Richer metadata, used by RCSB PDB now
DSSP            Secondary structure assignments    Assigns alpha-helix, beta-sheet from PDB files
   3. Expression and Microarray Formats
Format       Used For                            Description
CEL files    Raw microarray data (Affymetrix)    Contains probe intensities
CHP files    Processed microarray data           Results from Affymetrix analysis
4. Phylogenetic and Multiple Sequence Alignments
Format                     Used For                        Description
CLUSTAL (.aln/.clustal)    Multiple sequence alignments    From tools like ClustalW/Clustal Omega
PHYLIP (.phy)              Phylogenetic analysis           Input for PHYLIP programs
NEXUS (.nex)               Phylogeny + character data      Used in PAUP*, MrBayes
                                               UNIT-II
Sequence Alignment in Bioinformatics
Sequence alignment is the process of arranging DNA, RNA, or protein sequences to identify
regions of similarity. These similarities may indicate functional, structural, or evolutionary
relationships between the sequences.
Difference Between Homology and Similarity
 Feature        Homology                                                     Similarity
 Definition     Indicates a common evolutionary origin                       Measures how alike two sequences are
 Type           Qualitative (yes or no)                                      Quantitative (measured in %)
 Expression     "Two genes are homologous" or "are not homologous"           "Sequences are 80% similar"
 Basis          Inferred from sequence similarity, structure, or function    Calculated from aligned sequence data
 Types          Orthologs (speciation), paralogs (duplication)               Sequence similarity, structural similarity
 Measurement    Not directly measurable; inferred                            Directly measurable via tools like BLAST, Clustal
 Implication    Shared ancestry                                              Possible shared function or structure
Difference Between Identity and Similarity
 Aspect           Identity                                        Similarity
 Definition       Exact match of residues at the same position    Degree of resemblance between residues (including similar ones)
 Type of Match    Strict: only identical residues                 Flexible: includes chemically similar residues
 Applicable To    DNA, RNA, and protein sequences                 Mostly protein sequences
 Tools            BLAST, Clustal, MAFFT                           BLAST with substitution matrices (e.g., BLOSUM, PAM)
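The distinction between identity and similarity can be made concrete with a short sketch. It assumes two already aligned protein sequences of equal length with "-" for gaps, and it uses a small hand-made grouping of chemically similar amino acids. That grouping is a hypothetical simplification, since real tools decide similarity from substitution matrix scores. Gap columns are counted in the denominator, which is one of several conventions in use.

```python
# Hypothetical grouping of chemically similar amino acids, for illustration only.
SIMILAR_GROUPS = [set("ILVM"), set("FYW"), set("KRH"), set("DE"), set("ST"), set("NQ")]


def _check(a, b):
    if not a or len(a) != len(b):
        raise ValueError("aligned sequences must be non-empty and of equal length")


def percent_identity(a, b):
    """Percentage of alignment columns that hold the same residue."""
    _check(a, b)
    same = sum(1 for x, y in zip(a, b) if x == y and x != "-")
    return 100.0 * same / len(a)


def percent_similarity(a, b):
    """Percentage of columns holding identical or chemically similar residues."""
    _check(a, b)

    def similar(x, y):
        if x == "-" or y == "-":
            return False
        return x == y or any(x in g and y in g for g in SIMILAR_GROUPS)

    return 100.0 * sum(similar(x, y) for x, y in zip(a, b)) / len(a)


identity = percent_identity("KLDE", "RLDQ")
similarity = percent_similarity("KLDE", "RLDQ")
```

In this toy pair, two of four columns are identical, while a third (K aligned with R) counts only towards similarity.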
                                  Types of sequence alignment
1.Pairwise Sequence Alignment
Pairwise sequence alignment is a fundamental bioinformatics method used to compare two
biological sequences — either DNA, RNA, or protein — to identify regions where they are
similar or different. The goal of pairwise alignment is to arrange the sequences in such a way
that equivalent or related residues (nucleotides or amino acids) are aligned to each other,
highlighting evolutionary, structural, or functional relationships.
There are two main types of pairwise alignment:
   1. Global Alignment: This method attempts to align the entire length of both sequences
      from beginning to end. It is most effective when the sequences are of roughly the same
      length and are expected to be similar throughout. The global alignment algorithm
      systematically scores all possible alignments and finds the best overall match, including
      gaps introduced to maximize alignment. The classic algorithm used for global alignment
      is the Needleman-Wunsch algorithm.
   2. Local Alignment: This method focuses on finding the best matching region(s) or
      subsequences within the two sequences, rather than aligning them from end to end.
      Local alignment is useful when sequences may only share small regions of similarity,
      such as conserved functional domains within otherwise divergent sequences. The Smith-
      Waterman algorithm is a popular method for local alignment. A widely used practical
      tool that performs fast local alignments is BLAST.
In pairwise alignment, matches, mismatches, and gaps are scored using specific scoring
schemes, often guided by substitution matrices (for proteins) like BLOSUM or PAM. Matches
add positive scores, while mismatches and gaps usually incur penalties. The alignment with the
highest score represents the best alignment under the scoring criteria.
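The scoring idea can be sketched for an alignment that has already been made. The values below (match +1, mismatch -1, gap -2) and the function name are illustrative assumptions. For proteins, the match/mismatch term would normally be looked up in a substitution matrix such as BLOSUM or PAM, and many tools use separate gap-opening and gap-extension penalties instead of one flat gap cost.

```python
def score_alignment(a, b, match=1, mismatch=-1, gap=-2):
    """Sum column scores of two aligned sequences ("-" marks a gap)."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            score += gap          # insertion or deletion
        elif x == y:
            score += match        # identical residues
        else:
            score += mismatch     # substitution
    return score


# Two hypothetical alignments.
ungapped = score_alignment("GAC", "GAT")
gapped = score_alignment("GA-CT", "GATCT")
```

An alignment algorithm searches over all possible placements of gaps for the arrangement that maximizes this total.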
2.Multiple Sequence Alignment (MSA)
Multiple Sequence Alignment (MSA) is an extension of pairwise alignment where three or more
biological sequences (DNA, RNA, or proteins) are aligned simultaneously. The objective is to
arrange the sequences so that similar regions, conserved motifs, or functional domains are
aligned across all sequences in the set.
Why is MSA important?
•   Detect conserved regions: MSA helps identify sequences or motifs that have been
    preserved throughout evolution, suggesting they have important structural or functional
    roles.
•   Infer evolutionary relationships: By comparing multiple sequences, MSA supports the
    construction of phylogenetic trees, which show the evolutionary history of genes or
    species.
•   Predict structure and function: Conserved regions highlighted by MSA can indicate
    critical parts of a protein or gene involved in binding, catalysis, or regulation.
•   Guide experimental design: MSA can help design primers for PCR or identify target
    sites for mutagenesis.
    Common Tools for MSA
•   Clustal Omega: Widely used, good balance between speed and accuracy.
•   MUSCLE: Faster and often more accurate for large datasets.
•   MAFFT: Efficient for very large numbers of sequences.
•   T-Coffee: Provides consensus alignments combining results from different methods
    3.Global Alignment
    Global alignment is a method used in bioinformatics to align two biological sequences
    (DNA, RNA, or protein) across their entire lengths. It attempts to match the sequences
    from the beginning of both sequences to the end, even if there are mismatches or gaps.
       When is Global Alignment Used?
•   When the two sequences are of similar length
•   When the sequences are closely related (e.g., from the same gene family or species)
•   When a full-length comparison is important — such as comparing two homologous
    genes or complete protein sequences
    Key Characteristics
    Feature                    Description
    Alignment Scope            Full-length (start to end of both sequences)
    Gaps                       Allowed and inserted to optimize overall alignment
    Match/Mismatch Handling    All positions are considered, even mismatches
    Common Algorithm           Needleman–Wunsch Algorithm
    Tools Used                 EMBOSS Needle, Clustal Omega (for MSA), MAFFT
4.Local Alignment in Bioinformatics
Local alignment is a sequence alignment technique used to find the most similar region(s)
between two biological sequences — such as DNA, RNA, or proteins. Unlike global alignment, it
does not try to align the sequences from end to end, but instead focuses on aligning the best
matching subsequences within the larger sequences.
   Key Features of Local Alignment
Feature           Description
Scope             Aligns only the most similar region(s) (subsequences)
Gaps              Allowed, only within the aligned region
Algorithm Used    Smith–Waterman algorithm (dynamic programming)
Common Tools      BLAST (heuristic local alignment), EMBOSS Water
Best For          Dissimilar sequences or domain-level comparisons
Output            A high-scoring segment pair showing best local match
                                    Dot Plot in Bioinformatics
A dot plot is a graphical method used in bioinformatics to compare two biological sequences
(DNA, RNA, or protein). It helps visually identify regions of similarity, such as conserved
motifs, repeats, or alignments, by placing a dot wherever residues (nucleotides or amino acids)
in the two sequences match.
   What is a Dot Plot?
A dot plot is a 2D matrix where:
   •    One sequence is plotted along the horizontal (x-axis).
   •    The other sequence is plotted along the vertical (y-axis).
     •       A dot is placed at (i, j) if the character at position i in sequence 1 matches the character
             at position j in sequence 2.
     Purpose of a Dot Plot
     •       To visually detect similarity between two sequences
     •       To identify repeats, inversions, or palindromes
     •       To give an intuitive overview before running full alignments
     How to Interpret a Dot Plot
Pattern Seen                   Meaning
A diagonal line (↘)            Good match or alignment between sequences
Breaks/gaps in the diagonal    Mismatches or insertions/deletions
Parallel diagonals             Repeats or duplications
Horizontal/vertical lines      A residue or short segment in one sequence matching a repeated run in the other (low-complexity regions)
Inverted diagonals (↙)         Inversions or palindromic sequences
Tools for Creating Dot Plots
Tool                 Features
EMBOSS Dotmatcher    Creates dot plots with customizable window size
Example
Let’s compare two sequences:
     •       Sequence 1: ACGTG
     •       Sequence 2: ACGTC
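       A C G T G     ← Sequence 1 (x-axis)
     -----------
 A |   ● . . . .
 C |   . ● . . .
 G |   . . ● . ●
 T |   . . . ● .
 C |   . ● . . .     ← no dot at the last position (5, 5): G ≠ C
 Sequence 2 runs down the y-axis. The main diagonal is unbroken for the first four positions
 (ACGT) and stops at position 5, where the sequences differ. The off-diagonal dots at (5, 3)
 and (2, 5) are chance matches of single residues.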
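The dot placement rule can be sketched in a few lines for the two example sequences. This is a minimal version that compares single residues. Tools such as Dotmatcher compare windows of several residues to suppress chance matches. The function names are hypothetical.

```python
def dot_matrix(seq_x, seq_y):
    """Return a grid of booleans: rows follow seq_y, columns follow seq_x.

    Cell [j][i] is True when residue i of seq_x equals residue j of seq_y.
    """
    return [[x == y for x in seq_x] for y in seq_y]


def render(seq_x, seq_y):
    """Draw the dot plot as text, with seq_x across the top."""
    lines = ["  " + " ".join(seq_x)]
    for y, row in zip(seq_y, dot_matrix(seq_x, seq_y)):
        lines.append(y + " " + " ".join("●" if hit else "." for hit in row))
    return "\n".join(lines)


# Sequence 1 on the x-axis, Sequence 2 on the y-axis.
matrix = dot_matrix("ACGTG", "ACGTC")
picture = render("ACGTG", "ACGTC")
```

Printing `picture` shows the diagonal for the shared prefix ACGT and no dot in the bottom-right cell, where G and C differ.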
                                     Alignment Algorithms
Sequence alignment algorithms are at the core of bioinformatics and computational biology.
They allow us to compare biological sequences (DNA, RNA, or proteins) to find similarities,
differences, and evolutionary relationships. These algorithms use mathematical and
computational techniques to optimally align two or more sequences, considering matches,
mismatches, and gaps.
Types of Alignment Algorithms
 Type                  Purpose                                          Common Algorithms
 Global Alignment      Aligns sequences from end to end                 Needleman–Wunsch
 Local Alignment       Aligns best matching regions within sequences    Smith–Waterman
 Multiple Alignment    Aligns 3 or more sequences simultaneously        Clustal, MUSCLE, MAFFT, T-Coffee
                 Needleman–Wunsch Algorithm – Global Sequence Alignment
The Needleman–Wunsch algorithm is a dynamic programming method used for global
alignment of two biological sequences (DNA, RNA, or proteins). Published in 1970, it was one
of the first applications of dynamic programming to biological sequence comparison and
remains a foundational concept in bioinformatics.
     Purpose
To find the best alignment of two sequences across their entire lengths, including matches,
mismatches, and gaps, in a way that maximizes an alignment score.
     Key Concepts
Term                   Meaning
Global alignment       Aligns sequences from beginning to end
Dynamic programming    Breaks problem into subproblems and builds solution step by step
Scoring scheme         Rewards matches (+), penalizes mismatches (−), and gaps (−)
    Scoring Example
    •   Match = +1
    •   Mismatch = –1
    •   Gap (insertion/deletion) = –2
        Step-by-Step Procedure
Sequence A: G A C
Sequence B: G A T
Step 1: Initialization
Create a matrix with dimensions (len(A)+1) × (len(B)+1), and initialize the first row and column
with gap penalties.
Step 2: Fill the Matrix
Score(i,j) = max( Score(i–1, j–1) + match/mismatch, Score(i–1, j) + gap, Score(i, j–1) + gap)
Step 3: Traceback
Start from the bottom-right cell and trace back to the top-left, following the path that gave the
optimal score (diagonal = match/mismatch, up = gap in seq B, left = gap in seq A).
This gives you the optimal global alignment.
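The three steps can be put together in a compact sketch using the scoring example above (match +1, mismatch -1, gap -2) and the sequences GAC and GAT. It is a plain teaching implementation with a linear gap penalty, and the function name is hypothetical. When several alignments share the best score, it returns only one of them, preferring the diagonal move.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for a global alignment of a and b."""
    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # Step 1: the first row and column hold cumulative gap penalties.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    # Step 2: fill each cell from its diagonal, upper, and left neighbours.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Step 3: trace back from the bottom-right cell to the top-left cell.
    i, j = len(a), len(b)
    out_a, out_b = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1   # gap in b
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1   # gap in a
    return score[len(a)][len(b)], "".join(reversed(out_a)), "".join(reversed(out_b))


result = needleman_wunsch("GAC", "GAT")
```

For GAC and GAT the best global alignment has no gaps: two matches and one mismatch give a score of 1.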
Applications
    •   Comparing full gene or protein sequences
   •     Studying closely related species
   •     Detecting mutations, insertions, deletions
   •     Building tools like EMBOSS Needle
Tools That Use Needleman–Wunsch
   •     EMBOSS Needle (online & command-line)
   •     Biopython and BioPerl (programmatic implementation)
                    Smith–Waterman Algorithm – Local Sequence Alignment
The Smith–Waterman algorithm is a dynamic programming method used for local alignment of
two biological sequences (DNA, RNA, or protein). Unlike the Needleman–Wunsch algorithm
(which aligns entire sequences), Smith–Waterman identifies the highest-scoring pair of
subsequences, i.e., the best matching region.
   Purpose
To find the most similar local region (subsequence) between two sequences by maximizing the
alignment score, allowing matches, mismatches, and gaps.
   Key Features
Aspect            Smith–Waterman
Type              Local alignment
Approach          Dynamic programming
Best for          Comparing distantly related sequences
Output            Best matching subsequences, not full-length alignment
Algorithm Basis Recurrence relation with zero as minimum
   Scoring System
Element        Score Example
Match          +2
Mismatch       –1
Gap (indel)    –2
   Step-by-Step: Smith–Waterman Algorithm
Let’s align these sequences:
   •   Sequence A = G A C T
   •   Sequence B = G A T
   Step 1: Initialize Matrix
Create a (m+1) × (n+1) matrix, where m and n are lengths of the sequences. Initialize the first
row and column to 0 (important for local alignment!).
Step 2: Fill the Matrix
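Score(i,j) = max( 0, Score(i–1, j–1) + match/mismatch, Score(i–1, j) + gap, Score(i, j–1) + gap)
The extra 0 resets the score whenever a running alignment would become negative, so a new
local alignment can start at any position.
Step 3: Traceback
Start from the cell with the highest score anywhere in the matrix and trace back until a cell
with score 0 is reached, following the path that gave each score (diagonal = match/mismatch,
up = gap in seq B, left = gap in seq A).
This gives you the optimal local alignment.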
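A minimal sketch of Smith–Waterman with the scoring system above (match +2, mismatch -1, gap -2) and the sequences GACT and GAT follows. It assumes the standard formulation, in which cell scores are never allowed to drop below zero and the traceback starts at the highest-scoring cell and stops at the first zero. The function name is hypothetical, and ties are resolved in favour of the first maximal cell found.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for the best local alignment."""
    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    rows, cols = len(a) + 1, len(b) + 1
    # Step 1: first row and column stay at 0.
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Step 2: fill the matrix, flooring every cell at 0.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(0,
                              score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Step 3: trace back from the best cell until a zero is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        if score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1   # gap in b
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1   # gap in a
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


result = smith_waterman("GACT", "GAT")
```

For these sequences the best local score is 4, reached by the shared prefix GA. The gapped alignment GACT / GA-T scores the same (2 + 2 - 2 + 2 = 4), which shows that an optimal local alignment is not always unique.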
Applications
   •   Finding conserved domains or motifs within otherwise divergent sequences
   •   Comparing distantly related sequences
   •   Serving as the exact reference method that heuristic tools such as BLAST approximate
   Tools Using Smith–Waterman
    Tool                  Usage
    EMBOSS Water          Command-line & web alignment
   Substitution Matrices – PAM (Point Accepted Mutation)
   In bioinformatics, substitution matrices are used to score alignments between protein
   sequences by assigning values to amino acid substitutions. PAM is one of the earliest and
   most widely used substitution matrices in sequence alignment.
    What is PAM?
PAM stands for Point Accepted Mutation. It is a scoring matrix used in protein sequence
alignment to estimate the likelihood of one amino acid being replaced by another during
evolution.
•   Developed by Margaret Dayhoff in the 1970s.
•   Based on observed mutations in closely related protein families.
•   Measures evolutionary distance between proteins.
    Concept of 1 PAM
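•   1 PAM = an average of one accepted point mutation per 100 amino acids (1% change).
•   Constructed from alignments of closely related proteins.
•   A PAM1 matrix shows the probabilities of amino acid substitutions after 1% sequence
    divergence.
To model larger evolutionary distances, PAM matrices are extrapolated by multiplying the PAM1
matrix by itself: PAMn = (PAM1)^n.
Matrix         Meaning
PAM1           1 accepted mutation per 100 residues (closely related sequences)
PAM250         250 accepted mutations per 100 residues (more distant sequences). Because the
               same site can mutate more than once, such sequences still share roughly 20%
               identity.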
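Extrapolation means repeated matrix multiplication of the PAM1 mutation probability matrix. The sketch below uses a toy two-residue alphabet instead of the real 20 × 20 Dayhoff matrix, so the numbers are purely illustrative. It shows why 250 PAM units does not mean that every position has changed: later substitutions often hit sites that have already mutated.

```python
def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """Raise a square matrix to a non-negative integer power."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(power):
        result = mat_mul(result, m)
    return result


# Toy "PAM1": each row sums to 1 and 99% of residues stay unchanged per step.
TOY_PAM1 = [[0.99, 0.01],
            [0.01, 0.99]]

toy_pam250 = mat_pow(TOY_PAM1, 250)
```

In this toy model the probability that a residue is unchanged falls from 0.99 after one step to just above 0.5 after 250 steps, the value expected by chance for a two-letter alphabet, and it never reaches zero.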
    Structure of a PAM Matrix
It is a 20 × 20 matrix (for the 20 amino acids), where:
•   Each cell (i, j) contains a log-odds score:
         o   High positive → substitution is likely
         o   Negative → substitution is unlikely
Substitution Matrices – BLOSUM (BLOcks SUbstitution Matrix)
  BLOSUM is another widely used substitution matrix in bioinformatics, especially for
  protein sequence alignments. It helps score amino acid substitutions based on evolutionary
  conservation, similar to PAM, but is constructed using a different strategy and is more
  effective for local alignments and distantly related sequences.
      What is BLOSUM?
  •   BLOSUM = BLOcks SUbstitution Matrix
  •   Developed by Henikoff & Henikoff in 1992
  •   Based on observed substitutions in conserved protein blocks (ungapped regions of
      multiple alignments)
  •   Unlike PAM (which is extrapolated), BLOSUM is directly derived from real sequence
      alignments
      Key Concept
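  •   BLOSUM matrices are labeled as BLOSUMx, where x is the percentage identity
      threshold used to cluster sequences: sequences sharing at least x% identity are merged
      into one cluster before substitutions are counted, so that closely related sequences do
      not dominate the counts.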
   BLOSUM Matrix          Best For
   BLOSUM80               Closely related sequences
   BLOSUM62               Moderately divergent sequences (default in BLAST)
   BLOSUM45               Distantly related sequences
      Lower BLOSUM number → greater evolutionary distance.
  Structure of a BLOSUM Matrix
  Like PAM, BLOSUM is a 20×20 matrix (for amino acids) with log-odds scores:
Positive = more likely substitution
Negative = less likely substitution
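A log-odds score compares how often two amino acids are actually seen aligned with how often they would be paired by chance. The sketch below is a simplified illustration with hypothetical frequencies. It treats the observed value as the frequency of an ordered pair and uses a scale of 2 with a base-2 logarithm (half-bit units, as in the BLOSUM series). The function name and the numbers are assumptions.

```python
import math


def log_odds(observed_pair_freq, freq_i, freq_j, scale=2.0):
    """Rounded log-odds score: scale * log2(observed / expected by chance)."""
    expected = freq_i * freq_j   # chance of pairing i with j if independent
    return round(scale * math.log2(observed_pair_freq / expected))


# Hypothetical frequencies: both residues make up 10% of the data set.
favoured = log_odds(0.04, 0.1, 0.1)       # pair seen 4x more often than chance
disfavoured = log_odds(0.0025, 0.1, 0.1)  # pair seen 4x less often than chance
```

A pair seen exactly as often as chance predicts scores 0, which is why the sign of a matrix entry shows whether a substitution is favoured or disfavoured.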
 Feature          PAM                                                  BLOSUM
 Full form        Point Accepted Mutation                              BLOcks SUbstitution Matrix
 Based on         Mutations observed in closely related families,      Observed substitutions in blocks
                  extrapolated to larger distances
 Suited for       Closely related sequences                            Distantly related sequences
 Label meaning    PAM250 = 250 accepted mutations per 100 residues     BLOSUM62 = sequences clustered at 62% identity
 Common usage     Older tools, full alignments                         BLAST, protein alignment tools
Applications of Multiple Sequence Alignment (MSA)
Multiple Sequence Alignment (MSA) is a core technique in bioinformatics used to compare
three or more biological sequences (DNA, RNA, or protein) simultaneously. Its applications
span evolutionary biology, genomics, drug design, and functional annotation.
   1. Identification of Conserved Regions
   •   Conserved sequences often indicate important functional or structural roles (e.g., active
       sites, binding domains).
   •   Helps identify motifs or signatures characteristic of a protein family.
   Example: Finding conserved catalytic residues in enzymes across different organisms.
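Finding conserved positions amounts to checking each column of the alignment. The sketch below assumes the MSA is given as a list of equal-length strings with "-" for gaps. The function name, the toy sequences, and the threshold (the fraction of sequences that must share the most common residue) are illustrative assumptions.

```python
def conserved_columns(alignment, threshold=1.0):
    """Return 0-based indices of columns whose most common residue
    occurs in at least `threshold` (a fraction) of the sequences."""
    if not alignment or len({len(seq) for seq in alignment}) != 1:
        raise ValueError("alignment must contain sequences of equal length")
    hits = []
    for index, column in enumerate(zip(*alignment)):
        residues = [c for c in column if c != "-"]
        if not residues:
            continue  # column made only of gaps
        top = max(residues.count(r) for r in set(residues))
        if top / len(column) >= threshold:
            hits.append(index)
    return hits


# Hypothetical three-sequence alignment.
msa = ["MKHG", "MRHG", "MK-G"]
fully_conserved = conserved_columns(msa)
mostly_conserved = conserved_columns(msa, threshold=0.6)
```

Columns 0 and 3 are identical in all three sequences. Lowering the threshold to 0.6 also admits the columns where two of three sequences agree.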
   2. Phylogenetic Tree Construction
   •   MSA is the starting point for building evolutionary trees.
   •   It helps trace common ancestry and divergence between species or genes.
   Example: Studying evolutionary relationships among coronavirus spike proteins.
   3. Protein Structure and Function Prediction
   •   Conserved regions suggest functional importance and structural stability.
   •   Aligning unknown proteins with known structures may reveal 3D folding patterns.
   Example: Predicting zinc finger domain in a newly discovered transcription factor.
   4. Primer and Probe Design
   •   Helps in designing universal primers or probes that bind to conserved regions across
       species.
   •   Critical for PCR, qPCR, microarray, or diagnostic kits.
   Example: Designing a primer to detect conserved rRNA genes in bacteria.
   5. Annotation of New Sequences
   •   Annotate newly sequenced DNA or proteins based on alignment with well-annotated
       homologs.
   •   Assign gene function, exon-intron boundaries, or domain labels.
   Example: Assigning function to a novel gene based on alignment with known kinase family
genes.
   6. Drug Target Discovery and Vaccine Design
   •   Identify conserved drug targets across multiple pathogenic strains.
   •   Use conserved epitopes to design broad-spectrum vaccines.
    Example: Conserved epitopes in the influenza virus HA protein used in universal flu vaccine
design.
   7. Detecting Mutations and SNPs
   •   Compare aligned sequences to detect point mutations, insertions, or deletions.
   •   Useful in cancer genomics, personalized medicine, and evolution studies.
   Example: Identifying a pathogenic SNP in the BRCA1 gene.
Viewing and Editing Multiple Sequence Alignments (MSA)
Once you perform a Multiple Sequence Alignment (MSA), viewing and editing it effectively is
crucial for interpretation, annotation, or preparing it for further analyses like phylogenetic tree
construction, conserved motif discovery, or domain prediction.
   Why View or Edit MSA?
   •   To manually correct misaligned regions
   •   Highlight conserved sequences or motifs
   •   Annotate functional or structural features
   •   Trim or remove poorly aligned regions
   •   Export in desired formats (FASTA, Clustal, Phylip)
Popular Tools for Viewing & Editing MSA
1. Jalview (Desktop Application)
   •   GUI-based tool for visualizing and editing MSA.
   •   Supports color-coding, annotations, trees, and structure overlay.
   •   Can fetch sequences from databases (UniProt, EMBL).
   •   Compatible with Clustal, FASTA, Stockholm formats.
   Website: https://www.jalview.org
2. AliView
   •   Lightweight, fast MSA editor and viewer.
   •   Suitable for large datasets (e.g., viral genomes).
   •   Allows quick manual adjustments, trimming, and exporting.
   Website: http://ormbunkar.se/aliview/
3. UGENE
   •   Bioinformatics suite that includes MSA editing.
   •   Integrates with tools like ClustalW, MAFFT, MUSCLE.
   •   Great for annotation and local analysis.
   Website: https://ugene.net/
4. MEGA (Molecular Evolutionary Genetics Analysis)
   •   Primarily used for phylogenetic analysis, but allows MSA viewing/editing.
   •   Integrates MSA tools and supports tree construction.
   Website: https://www.megasoftware.net/
5. Web-based Viewers
Tool                    Features
MAFFT Viewer            Online visualization after alignment
Clustal Omega Viewer    View alignments with colored conservation
Wasabi                  Interactive MSA + phylogenetic tree viewer
   Common Features in MSA Viewers
Feature              Description
Color Coding         Based on amino acid properties or conservation
Gap Editing          Manually add/delete gaps in specific regions
Consensus View       Show residues most conserved across sequences
Annotation           Add structural, functional, or domain features
Format Export        Save as FASTA, Clustal, Stockholm, etc.
   Common File Formats
Format        Extension   Description
FASTA         .fasta      Basic format for sequences
Clustal       .aln        Alignment format used by ClustalW
Stockholm     .sto        Annotated alignment format
Phylip        .phy        Input for phylogenetic tools
   Tips for Editing MSA
   •      Use color schemes like Zappo or Taylor for proteins.
   •      Trim low-confidence regions at sequence ends.
   •      Remove redundant or low-quality sequences.
   •      Always save a backup of the original alignment.
Scoring Function in Multiple Sequence Alignment (MSA)
Scoring functions in MSA are used to evaluate the quality of the alignment by measuring how
well the sequences are conserved across aligned columns. A higher score usually means better
biological relevance, reflecting evolutionary, structural, or functional relationships.
   Key Scoring Functions in MSA
1. Sum-of-Pairs (SP) Score
   Most common scoring function for MSA.
How it works:
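   •   For every column in the alignment, score every pair of residues in that column.
   •   Use a substitution matrix (e.g., PAM, BLOSUM) for amino acids or match/mismatch
       scores for nucleotides, plus a penalty for residue-gap pairs.
   •   Add the column scores together to get the score of the whole alignment.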
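A minimal sketch of the sum-of-pairs calculation follows. It assumes simple match/mismatch/gap scores instead of a full substitution matrix, and it ignores gap-gap pairs. The scoring values and the function name are illustrative assumptions.

```python
from itertools import combinations

def sum_of_pairs(alignment, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score for a list of equal-length aligned sequences."""
    def pair_score(a, b):
        if a == "-" and b == "-":
            return 0                  # gap-gap pairs are ignored here
        if a == "-" or b == "-":
            return gap                # residue aligned to a gap
        return match if a == b else mismatch

    total = 0
    for column in zip(*alignment):    # walk the alignment column by column
        total += sum(pair_score(a, b) for a, b in combinations(column, 2))
    return total

print(sum_of_pairs(["ACG-", "ACGT", "A-GT"]))
```

Swapping `pair_score` for a lookup in a substitution matrix gives the protein version of the same score.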
2. Weighted Sum-of-Pairs Score
   •   Improves SP score by applying weights to reduce redundancy (e.g., multiple similar
       sequences).
   •   Helps avoid overrepresentation of closely related sequences.
3. Entropy Score (Information Content)
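Used to evaluate the variability at each column: H = −Σ p(a)·log2 p(a), where p(a) is the
frequency of residue a in the column. A fully conserved column has H = 0, and H rises as the
column becomes more variable.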
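As a small sketch, the Shannon entropy of a single column can be computed as below. Treating the gap character as just another symbol is an assumption made here for simplicity; real tools handle gaps in different ways.

```python
import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy (in bits) of one alignment column."""
    counts = Counter(column)          # gaps count as an ordinary symbol here
    n = len(column)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())

print(column_entropy("AAAA"))   # fully conserved column
print(column_entropy("ACGT"))   # maximally variable DNA column
```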
4. Consistency-Based Scoring
Used by advanced MSA tools like T-Coffee.
   •   Compares final alignment with a library of pairwise alignments.
   •   Scores columns based on how consistent they are with pairwise alignments.
 Scoring Function     Purpose                                     Used In
 Sum-of-Pairs (SP)    Basic scoring by adding all pairwise        ClustalW, MAFFT,
                      scores                                      MUSCLE
 Weighted SP          Reduces over-representation bias            T-Coffee, advanced aligners
 Entropy-based        Evaluates conservation at each position     Profile analysis, motif
                                                                  finding
 Consistency-based    Increases alignment reliability             T-Coffee
Database Similarity Searching:
BLAST (Basic Local Alignment Search Tool)
BLAST is one of the most widely used tools in bioinformatics for comparing a query sequence
(DNA, RNA, or protein) against a database of sequences to find regions of local similarity.
   What is BLAST?
   •   Full Form: Basic Local Alignment Search Tool
   •   Purpose: Find sequences in a database that closely match a query sequence
  •     Type: Local alignment algorithm
  •     Database search: Compares the query against large databases (e.g., NCBI, UniProt)
  How BLAST Works (Steps)
  1. Input Query (DNA or protein sequence)
  2. Word Matching:
            o    Breaks query into short sequences (called words, e.g., 3-mers for proteins,
                 11-mers for DNA)
  3. Database Scanning:
            o    Searches for exact or similar word matches in database sequences
  4. Extension:
            o    Matches are extended in both directions to find High-scoring Segment Pairs
                 (HSPs)
  5. Scoring & Ranking:
            o    Uses substitution matrices (e.g., BLOSUM) and gap penalties to compute
                 alignment scores
  6. Output:
            o    List of sequences with alignment scores, identities, e-values, and links to
                 database entries
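The word-matching and scanning steps can be sketched as follows. This is a simplification that finds exact word matches only and indexes a single subject sequence. BLAST itself also accepts similar words for proteins and works on whole databases. Function names are illustrative.

```python
from collections import defaultdict

def words(seq, k):
    """All overlapping k-letter words in seq, with their start positions."""
    return [(i, seq[i:i + k]) for i in range(len(seq) - k + 1)]

def find_seeds(query, subject, k=3):
    """Return (query_pos, subject_pos) for every exact shared word."""
    index = defaultdict(list)         # lookup table: word -> positions in subject
    for pos, word in words(subject, k):
        index[word].append(pos)
    return [(qpos, spos)
            for qpos, word in words(query, k)
            for spos in index.get(word, [])]

print(find_seeds("MKVLA", "GGKVLAGG"))
```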
  Scoring Terms in BLAST
Term               Meaning
Score              Numerical value based on alignment (match, mismatch, gaps)
E-value            Expected number of matches by chance. Lower = more significant
Bit Score          Normalized score that allows comparison between searches
% Identity         Percentage of identical residues in alignment
Query Cover        How much of the query is covered by the alignment
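The link between bit score and E-value can be sketched with the standard relation E = m × n × 2^(−S'), where S' is the bit score, m the query length and n the database length. This is a simplified sketch: BLAST uses corrected "effective" lengths, which are not modeled here, and the function name is illustrative.

```python
def e_value(bit_score, query_len, db_len):
    """Expected number of chance hits with at least this bit score."""
    return query_len * db_len * 2.0 ** (-bit_score)

# Each extra bit halves the E-value for the same search space.
print(e_value(40, 300, 1_000_000))
print(e_value(41, 300, 1_000_000))
```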
  Types of BLAST Programs
Program         Query Type   Database Type     Use Case
BLASTn          Nucleotide   Nucleotide        Finding similar DNA sequences
BLASTp          Protein      Protein           Protein homology and function prediction
BLASTx          Nucleotide   Protein           Translates the DNA query, then searches proteins
tBLASTn         Protein      Nucleotide        Searches a protein query against translated DNA
tBLASTx         Nucleotide   Nucleotide        Translates both query and database, then compares
   Applications of BLAST
   1. Gene/Protein Identification
   2. Function Prediction
   3. Homology Detection
   4. Annotation of Genomic Data
   5. SNP or Mutation Analysis
   6. Evolutionary Studies
   7. Drug Target Discovery
   BLAST Online Tool:
NCBI BLAST portal: https://blast.ncbi.nlm.nih.gov/Blast.cgi
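FASTA Format in Bioinformatics
FASTA is one of the simplest and most widely used file formats in bioinformatics for
representing nucleotide or protein sequences. Each record consists of a header line beginning
with ">" followed by one or more lines of sequence.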
                            PHI-BLAST (Pattern-Hit Initiated BLAST)
PHI-BLAST is a specialized version of the BLAST algorithm that combines pattern matching
with sequence similarity searching. It’s useful when you know a specific motif or conserved
pattern in your protein and want to find other sequences that both:
   1. Contain the same pattern, and
   2. Are homologous (similar) to your query sequence.
   What is PHI-BLAST?
   •   Full Form: Pattern Hit Initiated BLAST
   •   Purpose: Find protein sequences in a database that:
           o    Contain a predefined pattern (motif), and
           o    Are significantly similar to the query sequence in the surrounding regions.
   How PHI-BLAST Works
   1. User Inputs:
           o   A protein sequence (query)
           o   A motif/pattern (in PROSITE syntax)
   2. PHI-BLAST Search:
           o   Searches a protein database for sequences that match the pattern
           o   Among these, it performs local alignments to find statistically significant
               matches
   3. Output:
           o   List of protein sequences that contain the motif and have significant similarity to
               the query sequence.
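The pattern-matching half of this search can be sketched by converting a PROSITE-style pattern into a regular expression. The converter below is a minimal sketch covering the common elements (x, [..], {..}, repeat counts, and the < and > anchors outside brackets). The pattern is written in the style of a C2H2 zinc finger signature, and the test sequence is made up for illustration.

```python
import re

def prosite_to_regex(pattern):
    """Convert a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.replace("{", "[^").replace("}", "]")   # {P}  -> [^P]
        element = element.replace("(", "{").replace(")", "}")    # x(2) -> x{2}
        element = element.replace("x", ".")                      # x = any residue
        element = element.replace("<", "^").replace(">", "$")    # terminus anchors
        parts.append(element)
    return "".join(parts)

pattern = "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H"
regex = prosite_to_regex(pattern)
sequence = "AACAACAAALAAAAAAAAHAAAHAA"   # hypothetical sequence
match = re.search(regex, sequence)
print(regex, match.span() if match else None)
```

PHI-BLAST then aligns the regions around each pattern hit. That step is not shown here.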
 Application                     Description
 Functional annotation           Identifies proteins with similar functions
 Motif-based homology search     More specific than standard BLAST
 Protein family classification   Finds members of a protein family with conserved motifs
 Evolutionary analysis           Combines sequence conservation and motif preservation
 Domain-specific search          Focused analysis around functional sites
   PSI-BLAST (Position-Specific Iterated BLAST)
PSI-BLAST is an advanced BLAST variant that improves the detection of remote homologous
sequences by using a position-specific scoring matrix (PSSM), which gets refined over multiple
iterations. It’s especially useful for finding distant evolutionary relationships that standard
BLAST might miss.
   What is PSI-BLAST?
   •   Full Form: Position-Specific Iterated BLAST
   •   Purpose: Detect distant protein homologs by creating and refining a PSSM over
       multiple search rounds.
   •   Input: A protein sequence only.
   How PSI-BLAST Works (Step-by-Step)
   1. First Iteration:
           o   Performs a standard BLASTp search.
           o   Identifies sequences with significant similarity.
   2. PSSM Creation:
           o   From aligned hits, PSI-BLAST builds a Position-Specific Scoring Matrix.
           o   This matrix contains evolutionary information (which amino acids are conserved
               at each position).
   3. Subsequent Iterations:
           o   The PSSM is used to search again, detecting distant homologs that match the
               conserved profile, even if they have low overall identity.
           o   The user may choose to include/exclude hits for refining the matrix.
   4. Stopping Criteria:
           o   Iterations stop when no new significant matches are found (convergence) or
               a user-specified maximum number of iterations is reached.
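The PSSM creation step can be sketched as below. This is a deliberately simplified version: it assumes ungapped, equal-length aligned hits, a uniform background frequency, and a fixed pseudocount. PSI-BLAST itself uses sequence weighting and more refined pseudocounts. All names are illustrative.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(aligned_hits, pseudocount=1.0):
    """One dict of log-odds scores per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)          # assumed uniform background
    pssm = []
    for column in zip(*aligned_hits):
        counts = Counter(res for res in column if res in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm

def score(pssm, segment):
    """Score an ungapped segment of the same length as the PSSM."""
    return sum(col[res] for col, res in zip(pssm, segment))

pssm = build_pssm(["MKV", "MKI", "MRV"])
print(round(score(pssm, "MKV"), 2), round(score(pssm, "AAA"), 2))
```

Residues seen often at a position score above zero there, so a segment matching the profile outscores an unrelated one even when overall identity is low.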
   Why Use PSI-BLAST?
 Reason                    Explanation
 Improved sensitivity      Finds remote homologs missed by normal BLAST
 Profile-based searching   Uses biologically relevant conservation info
 Evolutionary insight      Detects protein families, domains, and functional motifs
   Applications of PSI-BLAST
   •   Detecting protein families and superfamilies
   •   Predicting function of unknown proteins
   •   Exploring evolutionary relationships
   •   Finding weak but meaningful sequence similarity
BLAST Algorithm (Basic Local Alignment Search Tool)
BLAST is a powerful and widely used algorithm for comparing a query biological sequence
(DNA, RNA, or protein) against a large database of sequences, to find regions of local similarity.
It is designed to be fast and sensitive, allowing researchers to identify homologous sequences
quickly.
   How the BLAST Algorithm Works: Step-by-Step
   1. Query Sequence Input:
      You provide a nucleotide or protein sequence as the query.
   2. Word (K-mer) Generation:
           o   The query is broken into short subsequences called words or k-mers (default
               length depends on the sequence type, e.g., 3 for proteins, 11 for DNA).
           o   These words serve as seeds for the search.
   3. Word Matching in Database:
           o   The database is scanned to find exact or similar matches to these words.
           o   BLAST uses a lookup table to find where words occur in the database sequences.
   4. Extension of Hits:
           o   Each word match is extended in both directions to find longer alignments called
               High-scoring Segment Pairs (HSPs).
           o   Extension stops when the running score falls more than a set amount below
               the best score seen so far.
   5. Scoring Alignments:
           o   Alignments are scored using substitution matrices (e.g., BLOSUM62 for
               proteins) and gap penalties.
           o   Only alignments above a certain score threshold are kept.
   6. Statistical Evaluation:
           o   The E-value (expectation value) estimates how many alignments with an equal
               or better score would be expected by chance in a database of that size.
               Lower E-values indicate more significant matches.
   7. Results Output:
           o   BLAST produces a ranked list of database sequences similar to the query,
               showing alignment details, scores, identities, and E-values.
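The hit-extension step can be sketched as an ungapped extension with a drop-off rule. The sketch extends to the right only (leftward extension is symmetric), and the match/mismatch scores and the drop-off value are illustrative assumptions, not BLAST defaults.

```python
def extend_right(query, subject, q_start, s_start, k,
                 x_drop=5, match=2, mismatch=-3):
    """Extend an exact seed of length k to the right without gaps.

    Returns (best_score, length_of_best_segment).
    """
    score = best = k * match          # the seed itself is k exact matches
    best_len = k
    i = k
    while q_start + i < len(query) and s_start + i < len(subject):
        same = query[q_start + i] == subject[s_start + i]
        score += match if same else mismatch
        i += 1
        if score > best:
            best, best_len = score, i
        elif best - score > x_drop:   # fell too far below the best: stop
            break
    return best, best_len

print(extend_right("ACGTACGTTT", "ACGTACGAAA", 0, 0, k=4))
```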
                                            UNIT-III
Phylogenetics Basics
Phylogenetics is the study of evolutionary relationships among biological species or entities
based on genetic, morphological, or molecular data. It helps to reconstruct the "family tree" or
phylogenetic tree showing how organisms are related through common ancestry.
Types of Phylogenetic Trees
 Type             Description
 Rooted Tree      Shows direction of evolution, with a single common ancestor at the root
 Unrooted Tree    Shows relationships but no information about common ancestor or direction
 Cladogram        Shows only the branching order, branch lengths not proportional to time or
                  changes
 Phylogram        Branch lengths proportional to evolutionary change or time
Steps in Phylogenetic Analysis
   1. Data Collection:
      Obtain sequences (DNA, RNA, or protein) or morphological data.
   2. Multiple Sequence Alignment (MSA):
      Align sequences to identify homologous positions.
   3. Model Selection:
      Choose an evolutionary model describing how sequences change over time.
   4. Tree Construction:
      Use methods like:
            o   Distance-based (e.g., Neighbor-Joining)
            o   Character-based (e.g., Maximum Parsimony, Maximum Likelihood, Bayesian
                Inference)
   5. Tree Evaluation:
      Assess reliability with methods like bootstrapping.
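For distance-based methods, the alignment is first turned into pairwise distances. A minimal sketch is shown below using the Jukes-Cantor model, chosen here as one simple example of an evolutionary model for DNA. The notes above do not prescribe a particular model, so this choice is an assumption.

```python
import math

def p_distance(a, b):
    """Proportion of differing sites, skipping positions with a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    diffs = sum(1 for x, y in pairs if x != y)
    return diffs / len(pairs)

def jukes_cantor(p):
    """Jukes-Cantor corrected distance; only defined for p < 0.75."""
    return -0.75 * math.log(1 - 4 * p / 3)

p = p_distance("ACGTACGT", "ACGTACGA")
print(p, round(jukes_cantor(p), 4))
```

The corrected distance is slightly larger than the raw proportion because it allows for multiple substitutions at the same site.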
Applications of Phylogenetics
   •    Understanding evolutionary relationships among species
   •    Tracing the origin and spread of pathogens
   •    Studying gene family evolution and duplication
   •    Conservation biology and species classification
   •    Molecular clock studies estimating divergence times
Molecular Clock Theory
The Molecular Clock Theory is a method used in molecular evolution to estimate the time of
divergence between species or genes based on the assumption that genetic mutations accumulate
at a relatively constant rate over time.
How Molecular Clock Works
   1. Mutation Rate:
           o   Assume a roughly constant rate of nucleotide or amino acid substitutions per
               unit time.
   2. Sequence Comparison:
           o   Count the number of differences (mutations) between two homologous
               sequences.
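   3. Time Estimation:
           o   Divide the genetic distance by twice the substitution rate, T = d / (2r),
               because changes accumulate along both lineages after they diverge.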
   4. Calibration:
           o   Use known fossil records or geological events to calibrate the clock.
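The calibration and time-estimation steps can be sketched with the relation T = d / (2r), where d is the distance between two sequences and r the substitution rate per lineage. All numbers below are hypothetical and serve only to show the arithmetic.

```python
def calibrate_rate(distance, known_time):
    """Rate per lineage from a pair whose divergence time is known."""
    return distance / (2 * known_time)

def divergence_time(distance, rate_per_lineage):
    """Estimated time since two sequences diverged."""
    return distance / (2 * rate_per_lineage)

# Hypothetical calibration: d = 0.10 substitutions/site for a split dated
# at 50 million years by fossils.
rate = calibrate_rate(0.10, 50.0)
# Apply the calibrated clock to another pair with d = 0.04.
print(rate, divergence_time(0.04, rate))
```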
Ultrametric Trees
An ultrametric tree is a special type of rooted phylogenetic tree where all the tips (leaves) are
equidistant from the root. This means that the distance from the root to any leaf (representing
present-day species or sequences) is the same across the tree, reflecting the idea that all
sequences have evolved for the same amount of time.
Distance Matrix Methods – UPGMA
What is UPGMA?
   •   UPGMA stands for Unweighted Pair Group Method with Arithmetic Mean.
   •   It is a simple hierarchical clustering method used in phylogenetics to construct
       ultrametric trees (rooted trees with equal distances from root to leaves).
   •   UPGMA builds a tree based on a distance matrix representing pairwise evolutionary
       distances between sequences or species.
How does UPGMA work? — Step-by-step
   1. Start with a distance matrix:
           o   The matrix contains pairwise distances between all sequences.
   2. Find the closest pair:
           o   Identify the two clusters (initially, each sequence is its own cluster) with the
               smallest distance.
   3. Merge clusters:
           o   Combine the two closest clusters into a new cluster.
   4. Update the distance matrix:
           o   Calculate the distance from the new cluster to every other cluster as the
               average of the merged clusters' distances, weighted by cluster size:
               d(AB, C) = (|A|·d(A, C) + |B|·d(B, C)) / (|A| + |B|)
   5. Repeat:
            o   Repeat steps 2-4 until all sequences are clustered into a single tree.
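The steps above can be sketched in a few lines. This minimal version returns only the tree topology (as nested tuples) and the root height. The data layout and function name are illustrative assumptions, and the example distances are made up.

```python
def upgma(labels, matrix):
    """Return (tree, root_height) from a symmetric distance matrix."""
    n = len(labels)
    clusters = {i: (labels[i], 1) for i in range(n)}        # id -> (tree, size)
    dist = {(i, j): matrix[i][j] for i in range(n) for j in range(i + 1, n)}
    next_id, height = n, 0.0
    while len(clusters) > 1:
        i, j = min(dist, key=dist.get)                      # step 2: closest pair
        height = dist[(i, j)] / 2                           # new node sits halfway
        (ti, ni), (tj, nj) = clusters.pop(i), clusters.pop(j)
        new_dist = {}
        for k in clusters:                                  # step 4: update distances
            dik = dist[(min(i, k), max(i, k))]
            djk = dist[(min(j, k), max(j, k))]
            new_dist[(k, next_id)] = (ni * dik + nj * djk) / (ni + nj)
        dist = {p: v for p, v in dist.items() if i not in p and j not in p}
        dist.update(new_dist)
        clusters[next_id] = ((ti, tj), ni + nj)             # step 3: merged cluster
        next_id += 1
    (tree, _), = clusters.values()
    return tree, height

matrix = [[0, 2, 6, 10],
          [2, 0, 6, 10],
          [6, 6, 0, 10],
          [10, 10, 10, 0]]
print(upgma(["A", "B", "C", "D"], matrix))
```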
Key Characteristics of UPGMA:
 Feature                     Description
 Produces                    Ultrametric (rooted) tree
 Assumes                     Constant molecular clock rate across lineages
 Clustering approach         Agglomerative hierarchical clustering
 Input                       Pairwise distance matrix
 Distance update method      Arithmetic mean of merged clusters' distances
 Output                      Phylogenetic tree with branch lengths proportional to time
Advantages of UPGMA:
   •     Simple and easy to implement.
   •     Fast and computationally efficient.
   •     Provides an explicit time scale due to ultrametric assumption.
         Distance Matrix Methods – Neighbor-Joining (NJ)
         What is Neighbor-Joining (NJ)?
   •     Neighbor-Joining is a popular distance-based method for reconstructing phylogenetic
         trees.
   •     Unlike UPGMA, NJ produces an unrooted tree and does not assume a constant
         molecular clock.
   •     It’s widely used because it is fast, efficient, and often produces more accurate trees when
         evolutionary rates vary among lineages.
         How does Neighbor-Joining work? — Step-by-step
   1. Start with a distance matrix:
            o   Contains pairwise distances between all sequences or taxa.
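   2. Calculate the Q-matrix:
           o    For n taxa, Q(i, j) = (n − 2)·d(i, j) − Σk d(i, k) − Σk d(j, k).
   3. Find the pair with the smallest Q-value:
           o    This pair is chosen to be neighbors (closest relatives) and will be joined in the
                tree.
   4. Join the pair into a new node u:
           o    Calculate branch lengths from the new node to each neighbor:
                δ(i, u) = d(i, j)/2 + [Σk d(i, k) − Σk d(j, k)] / [2(n − 2)]
                δ(j, u) = d(i, j) − δ(i, u)
   5. Update the distance matrix:
           o    Remove the joined taxa and add the new node.
           o    Calculate distances from the new node to each remaining taxon k:
                d(u, k) = [d(i, k) + d(j, k) − d(i, j)] / 2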
   6. Repeat steps 2-5:
           o    Continue until only two nodes remain, which are then connected.
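A compact sketch of the procedure is given below, using the standard neighbor-joining formulas for the Q-matrix, branch lengths and distance update. The tree is returned as a list of (node, neighbor, branch length) edges. Internal node names such as "node0" are generated here and are assumed not to clash with taxon labels. The example matrix is a small made-up data set.

```python
def neighbor_joining(labels, matrix):
    """Return a list of (node, joined_to, branch_length) edges."""
    nodes = list(labels)
    d = {(a, b): matrix[i][j]
         for i, a in enumerate(labels) for j, b in enumerate(labels) if i != j}
    edges, next_id = [], 0
    while len(nodes) > 2:
        n = len(nodes)
        r = {a: sum(d[(a, b)] for b in nodes if b != a) for a in nodes}
        # Steps 2-3: pair with the smallest Q(i, j) = (n-2)d(i,j) - r(i) - r(j)
        i, j = min(((a, b) for idx, a in enumerate(nodes) for b in nodes[idx + 1:]),
                   key=lambda p: (n - 2) * d[p] - r[p[0]] - r[p[1]])
        u = f"node{next_id}"
        next_id += 1
        # Step 4: branch lengths from the new node to each neighbor
        li = d[(i, j)] / 2 + (r[i] - r[j]) / (2 * (n - 2))
        edges += [(i, u, li), (j, u, d[(i, j)] - li)]
        # Step 5: distances from the new node to every remaining node
        for k in nodes:
            if k not in (i, j):
                d[(u, k)] = d[(k, u)] = (d[(i, k)] + d[(j, k)] - d[(i, j)]) / 2
        nodes = [k for k in nodes if k not in (i, j)] + [u]
    a, b = nodes
    edges.append((a, b, d[(a, b)]))       # connect the last two nodes
    return edges

matrix = [[0, 5, 9, 9, 8],
          [5, 0, 10, 10, 9],
          [9, 10, 0, 8, 7],
          [9, 10, 8, 0, 3],
          [8, 9, 7, 3, 0]]
for edge in neighbor_joining(["a", "b", "c", "d", "e"], matrix):
    print(edge)
```

Unlike UPGMA, the two branches leading to a joined pair can have different lengths, which is how unequal evolutionary rates are accommodated.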
       Key Characteristics of Neighbor-Joining:
        Feature                        Description
        Produces                       Unrooted phylogenetic tree
        Assumes                        No molecular clock (variable evolutionary rates allowed)
        Clustering approach            Agglomerative but uses corrected distance (Q-matrix)
        Input                          Pairwise distance matrix
        Output                         Tree with branch lengths proportional to evolutionary
                                       distances
        Accuracy                       More accurate than UPGMA when rates vary
Character-Based Methods – Maximum Parsimony (MP)
   What is Maximum Parsimony?
Maximum Parsimony (MP) is a character-based method used in phylogenetic analysis. It aims
to find the simplest tree—the one that explains the observed data with the fewest evolutionary
changes (mutations).
Principle of Parsimony:
"The simplest explanation is preferred."
In phylogenetics, this means choosing the tree with minimum total number of character state
changes.
   Key Features of Maximum Parsimony
 Feature         Description
 Data Type       Uses aligned characters (nucleotides, amino acids, etc.)
 Tree Type       Usually unrooted (can be rooted using an outgroup)
 Goal            Find the tree requiring minimum evolutionary steps
 Approach        Character-based, not dependent on distance matrices
 Output          One or more equally parsimonious trees
   How Does MP Work?
   1. Input:
             o    Aligned sequences (DNA, RNA, or protein).
             o    Each character (e.g., A, T, G, C) is analyzed independently.
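   2. Generate all possible tree topologies for the given taxa (for more than a handful of taxa
      the number of topologies grows too fast, so heuristic searches are used instead).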
   3. Evaluate each tree:
             o    For each character, determine the minimum number of changes needed to
                  explain its distribution on the tree.
             o    Sum over all characters to get the tree length.
   4. Select the tree with the least total changes (most parsimonious)
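The "evaluate each tree" step can be sketched with Fitch's algorithm, a standard way to count the minimum number of changes for one character on a binary tree. The notes above do not name a specific algorithm, so this choice is an assumption. Trees are written as nested tuples of taxon names, and the alignment is made up.

```python
def fitch_site(tree, states):
    """Return (possible_states, min_changes) for one character on a binary tree."""
    if isinstance(tree, str):                     # leaf: its observed state
        return {states[tree]}, 0
    left_set, left_cost = fitch_site(tree[0], states)
    right_set, right_cost = fitch_site(tree[1], states)
    common = left_set & right_set
    if common:                                    # children agree: no extra change
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1

def tree_length(tree, alignment):
    """Sum of minimum changes over all alignment columns."""
    n_sites = len(next(iter(alignment.values())))
    return sum(fitch_site(tree, {t: s[k] for t, s in alignment.items()})[1]
               for k in range(n_sites))

alignment = {"A": "AAG", "B": "AAG", "C": "ACT", "D": "GCT"}
for tree in [(("A", "B"), ("C", "D")), (("A", "C"), ("B", "D"))]:
    print(tree, tree_length(tree, alignment))
```

The topology with the smaller tree length is the more parsimonious of the two.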
Tools for MP Analysis
   •    MEGA
   •    PHYLIP
        Methods of Evaluating Phylogenetic Trees – Bootstrapping
             Why Evaluate Phylogenetic Trees?
        Phylogenetic trees are hypotheses about evolutionary relationships. Since multiple trees
        may explain the data similarly well, we need methods to assess the confidence or
        reliability of inferred trees or tree branches.
             What is Bootstrapping in Phylogenetics?
        Bootstrapping is a statistical resampling method used to assess the reliability of tree
        branches (clades) in phylogenetic analysis.
It estimates how consistently a particular branch (or grouping) appears across many replicated
analyses of resampled data.
           How Bootstrapping Works – Step-by-Step
   1. Start with aligned sequence data (e.g., DNA, protein alignment).
   2. Generate multiple replicate datasets:
           o    Each bootstrap replicate is created by randomly resampling (with replacement)
                columns from the original alignment.
           o    Each replicate has the same number of columns as the original.
   3. Reconstruct a tree for each replicate using a chosen method (e.g., MP, NJ, ML).
   4. Count how often each clade appears across all replicate trees.
   5. Assign bootstrap support values:
           o    For each clade in the original tree, the bootstrap value is the percentage of
                replicate trees in which that clade appears.
           o    Expressed as a percentage (e.g., 80% support means the clade appeared in 80
                out of 100 trees).
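The resampling and support-counting steps can be sketched as follows. Tree reconstruction itself is left out, so the clade sets passed to `clade_support` are hypothetical stand-ins for the clades found in each replicate tree.

```python
import random

def bootstrap_replicate(alignment, rng):
    """Resample alignment columns with replacement (same width as original)."""
    n_cols = len(alignment[0])
    picks = [rng.randrange(n_cols) for _ in range(n_cols)]
    return ["".join(seq[c] for c in picks) for seq in alignment]

def clade_support(replicate_clades, clade):
    """Percentage of replicate trees that contain the given clade."""
    hits = sum(1 for clades in replicate_clades if clade in clades)
    return 100.0 * hits / len(replicate_clades)

rng = random.Random(0)                       # fixed seed for repeatability
print(bootstrap_replicate(["ACGT", "ACGA", "TCGA"], rng))

# Hypothetical clades recovered from four replicate trees.
replicates = [{frozenset("AB")}, {frozenset("AB")},
              {frozenset("AC")}, {frozenset("AB")}]
print(clade_support(replicates, frozenset("AB")))
```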
           Interpreting Bootstrap Values
        Bootstrap Value (%)           Confidence Interpretation
        ≥ 90%                         Very strong support
        70% – 89%                     Moderate to strong support
        50% – 69%                     Weak support
        < 50%                         Clade not considered reliable
           Example
   •   If a branch grouping species A and B appears in 95 out of 100 bootstrap replicates, the
       branch is labeled 95% in the final tree.
   •   It means high confidence that A and B share a close evolutionary relationship.
           Advantages of Bootstrapping
   •   Provides a quantitative measure of tree reliability.
   •   Applicable to various phylogenetic methods: Maximum Parsimony, Maximum
       Likelihood, Neighbor-Joining, etc.
   •   Helps identify robust clades vs. uncertain relationships.
                        Methods of Evaluating Phylogenetic Trees – Jackknifing
           What is Jackknifing in Phylogenetics?
       Jackknifing is a resampling-based statistical method used to assess the stability and
       reliability of branches (clades) in a phylogenetic tree—similar in spirit to bootstrapping,
       but with a slightly different approach.
It involves systematically removing a subset of data (usually columns in an alignment) and then
reconstructing trees to evaluate how often a clade appears.
           How Jackknifing Works – Step-by-Step
   1. Start with a multiple sequence alignment (DNA, RNA, or protein).
   2. Create jackknife replicate datasets:
           o    Randomly delete a fixed proportion of characters (e.g., 30–50%) from the
                alignment without replacement.
           o    The remaining data (e.g., 50–70%) is used to reconstruct the tree.
   3. Reconstruct phylogenetic trees from each reduced dataset using a method like MP, NJ,
      or ML.
   4. Track how frequently each clade appears across all replicate trees.
   5. Assign jackknife support values to the nodes:
           o    The support value indicates the percentage of replicates in which the clade
                appears.
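Creating one jackknife replicate can be sketched as below. Columns are deleted at random without replacement and the kept columns stay in their original order. The deletion fraction and the example alignment are illustrative.

```python
import random

def jackknife_replicate(alignment, delete_fraction, rng):
    """Drop a fixed fraction of alignment columns, keeping the rest in order."""
    n_cols = len(alignment[0])
    n_keep = n_cols - int(round(delete_fraction * n_cols))
    kept = sorted(rng.sample(range(n_cols), n_keep))   # no column chosen twice
    return ["".join(seq[c] for c in kept) for seq in alignment]

rng = random.Random(1)                                 # fixed seed for repeatability
print(jackknife_replicate(["ACGTACGTAC", "ACGAACGTAA"], 0.3, rng))
```

Support values are then counted across replicate trees in the same way as for bootstrapping.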
           Interpreting Jackknife Support Values
        Jackknife Value (%)           Confidence Interpretation
        > 85%                         Strong support
        70% – 85%                     Moderate support
        < 70%                         Weak support
       Note: Jackknife values are usually slightly lower than bootstrap values.
           Advantages of Jackknifing
   •   Helps assess robustness of tree topology.
    •   Avoids potential biases of bootstrap resampling (e.g., overrepresentation of some
        characters).
    •   Simpler computation since it avoids repeated sampling of the same data.
            Jackknifing vs. Bootstrapping
        Feature                     Jackknifing                         Bootstrapping
        Data Resampling             Deletes a fraction of columns       Resamples columns with
                                                                        replacement
        Variability                 Less variation in replicates        More variation due to
                                                                        replacement
        Computation                 Slightly faster, fewer              Requires more replicates
                                    replicates needed                   for stability
        Use in Phylogenetics        Less common but still valid         More widely used
                                              UNIT-IV
What is Gene Prediction?
Gene prediction is the process of identifying the locations of genes (coding regions) in a genomic
DNA sequence. This is a crucial step in genome annotation, especially for newly sequenced
organisms.
It involves detecting features like exons, introns, promoters, start/stop codons, etc., to predict
protein-coding genes and non-coding RNAs.
    Types of Gene Prediction Methods
Gene prediction approaches can be broadly classified into two types:
Type                  Description
Ab initio (de novo)   Uses statistical models and signals in DNA sequence alone
Homology-based        Uses known genes or sequences from other organisms
   Ab Initio Gene Prediction
•   Based on sequence features: codon usage, GC content, exon/intron boundaries, ORFs
    (Open Reading Frames), etc.
•   Common algorithms:
         o   Hidden Markov Models (HMM)
         o   Neural networks and machine learning approaches
Tools:
•   GENSCAN
•   GeneMark
•   Glimmer
•   AUGUSTUS
Homology-Based Gene Prediction
•   Compares the input genome to known genes or protein sequences from related species.
•   Relies on alignment tools like BLAST, tBLASTn, or spliced alignment.
Tools:
•   BLAST
•   Exonerate
•   GeneWise
•   EST2GENOME
Steps in Gene Prediction
1. Input: Raw DNA sequence (e.g., entire genome or chromosome).
2. Scan for signals:
         o   Start codons (ATG)
         o   Stop codons (TAA, TAG, TGA)
         o   Splice sites (GT-AG rule)
         o   Promoter regions
3. Detect coding regions:
         o   Open Reading Frames (ORFs)
         o   Codon usage bias
4. Use prediction model/tool (ab initio or homology-based).
5. Output: Annotated gene structure:
           o   Exons, introns, UTRs
           o   Predicted protein sequence
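The signal-scanning and ORF-detection steps can be sketched with a simple open reading frame finder. This is a minimal sketch that scans the three forward frames only and ignores introns, so it is closer to prokaryotic gene finding. The minimum length, function name and test sequence are illustrative.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=3):
    """Return (start, end, frame) for ATG...stop ORFs on the forward strand.

    The length check counts the start and stop codons as well.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):       # step codon by codon
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                             # first start codon seen
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None                          # look for the next ORF
    return orfs

dna = "CCATGAAATTTTAGCC"                              # made-up sequence
for start, end, frame in find_orfs(dna):
    print(start, end, frame, dna[start:end])
```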
What is a Conserved Domain?
A conserved domain is a recurring structural or functional unit in a protein that has remained
relatively unchanged (conserved) during evolution.
These domains often correlate with specific biological functions, such as binding DNA,
catalyzing reactions, or interacting with other proteins.
   Purpose of Conserved Domain Analysis
Conserved domain analysis aims to:
   •   Identify known domains in a query protein sequence
   •   Predict the function of unknown proteins
   •   Study evolutionary relationships
   •   Understand protein structure-function relationships
   How Conserved Domain Analysis Works
   1. Input: A protein sequence (FASTA format).
   2. Search the sequence against domain databases.
   3. Align the sequence with known conserved domains.
   4. Output: Domain hits, positions, and functional annotations.
   Popular Tools for Conserved Domain Analysis
Tool / Database   Description
NCBI CD-Search    Compares protein sequences against the Conserved Domain Database (CDD)
InterPro          Integrates several databases (Pfam, SMART, TIGRFAMs, PROSITE, etc.)
Pfam              Database of protein families and domains using HMMs
SMART             Identifies signaling domains, regulatory motifs
ScanProsite       Scans for PROSITE motifs and profiles
HMMER             Uses Hidden Markov Models to search against domain profiles
       What is Protein Structure Visualization?
Protein structure visualization refers to using software tools to view and analyze the 3D
structure of proteins. It helps researchers understand protein folding, active sites, binding
interactions, and structure-function relationships.
       Why Visualize Protein Structures?
Purpose                            Explanation
Understanding protein function     Structure reveals mechanisms like catalysis or binding
Drug design                        Visualizing binding pockets for ligand docking
Mutation impact analysis           Locating mutation sites to assess structural effect
Education & communication          Visual tools aid in teaching molecular biology concepts
       Popular Protein Visualization Tools
Tool                         Description
PyMOL                        Widely used; powerful, scriptable, publication-quality images
Chimera / ChimeraX           Advanced analysis; good for large complexes and density maps
RasMol                       Lightweight; good for quick 3D visualization
Jmol                         Java-based, web-friendly visualization
iCn3D (NCBI)                 Web-based viewer for 3D structure + annotations
Mol* Viewer (RCSB PDB)       Modern, fast web tool from RCSB for visualizing PDB files
       Common File Formats
Format     Description
.pdb       Protein Data Bank format (atomic coordinates)
.cif       mmCIF format (successor to the PDB format)
.mol2      Molecular structures with charges/bonds
.sdf       Structure-data file for small molecules
       What Can You Visualize?
Feature                          Description
Backbone/secondary structure     α-helices, β-sheets, loops
Surface and volume               Solvent-accessible and molecular surfaces
Ligand interactions              Binding sites and non-covalent interactions
Electrostatics                   Charge distribution over protein surface
Mutations and domains            Highlight regions/domains/mutations
                               What is Protein Secondary Structure?
Protein secondary structure refers to localized, repetitive structural motifs formed by hydrogen
bonding in the polypeptide backbone.
There are three major types:
Structure        Description
Alpha-helix (α)  Right-handed coil stabilized by H-bonds
Beta-sheet (β)   Extended strands connected by hydrogen bonds
Coil/Loop        Irregular or flexible regions (non-α/β structures)
   Why Predict Secondary Structure?
   •   To infer protein function when 3D structure is unknown
   •   To aid in tertiary structure prediction
   •   To design mutations or modifications in proteins
   •   To support structural annotations in genome projects
   Basic Principles of Prediction
Secondary structure prediction typically relies on:
   •   Amino acid propensities (likelihood of forming α/β/coils)
   •   Sliding window approaches (local sequence analysis)
   •   Multiple sequence alignments (evolutionary conservation)
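The first two principles above can be sketched in a few lines of Python. This is a minimal illustration only: the propensity numbers are hypothetical toy values chosen for the sketch (they are not the published Chou-Fasman parameters), and the window size and threshold are likewise assumptions.

```python
# Toy propensity scores (hypothetical values for illustration only;
# NOT the published Chou-Fasman parameters).
HELIX = {"A": 1.4, "E": 1.5, "L": 1.2, "M": 1.4, "K": 1.1,
         "V": 1.0, "S": 0.8, "G": 0.6, "P": 0.6}
STRAND = {"V": 1.7, "I": 1.6, "Y": 1.5, "F": 1.4, "L": 1.3,
          "M": 1.0, "A": 0.8, "S": 0.8, "G": 0.8, "K": 0.7,
          "P": 0.6, "E": 0.4}


def window_mean(seq, table, center, half):
    """Average propensity over a window of up to 2*half+1 residues."""
    lo, hi = max(0, center - half), min(len(seq), center + half + 1)
    values = [table.get(aa, 1.0) for aa in seq[lo:hi]]  # 1.0 = neutral default
    return sum(values) / len(values)


def predict(seq, half=2, threshold=1.05):
    """Assign H, E or C to each residue from local average propensities."""
    states = []
    for i in range(len(seq)):
        helix = window_mean(seq, HELIX, i, half)
        strand = window_mean(seq, STRAND, i, half)
        if helix >= threshold and helix >= strand:
            states.append("H")
        elif strand >= threshold:
            states.append("E")
        else:
            states.append("C")
    return "".join(states)


print(predict("AAAAEELLVVVVIYGPGP"))
```

Real predictors replace the toy tables with trained parameters and add information from multiple sequence alignments, which is where most of the accuracy comes from.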
   Common Methods & Algorithms
Method                              Description
Chou-Fasman                         Early method using amino acid propensities
GOR Method                          Uses information theory and probability
Neural Networks (e.g., PSIPRED)     Most accurate, uses MSA + machine learning
    Popular Tools for Secondary Structure Prediction
Tool / Server Features
PSIPRED        High-accuracy; uses neural networks + MSA
JPred          Web-based; reliable secondary structure prediction
SOPMA          Fast, basic method using statistical analysis
PORTER         Deep learning-based predictor
PHYRE2         Combines secondary + 3D structure prediction
    Example Workflow: Predicting with PSIPRED
    1. Go to: http://bioinf.cs.ucl.ac.uk/psipred/
    2. Paste your protein sequence (FASTA format).
    3. Submit and wait for results.
    4. Output:
           o    Sequence with predicted secondary structures (H = Helix, E = Strand, C = Coil)
           o    Confidence scores
           o    Graphical visualization
    Typical Output Format
AA A A A A V V L L E E E G G
SS H H H H H H C C E E E C C
Symbol Meaning
H        Alpha-helix
E        Beta-strand
C        Coil or loop
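A prediction string in this H/E/C notation is easy to summarize programmatically. The sketch below (the function name `summarize` is an assumption made for this example) computes the fraction of each state and the lengths of consecutive segments for the example string shown above.

```python
from collections import Counter
from itertools import groupby


def summarize(ss):
    """Return (fractions, segments) for a secondary-structure string."""
    ss = ss.replace(" ", "")
    counts = Counter(ss)
    fractions = {state: counts.get(state, 0) / len(ss) for state in "HEC"}
    # groupby collapses runs of identical symbols into (state, run length)
    segments = [(state, len(list(run))) for state, run in groupby(ss)]
    return fractions, segments


fractions, segments = summarize("H H H H H H C C E E E C C")
print(fractions)
print(segments)
```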
What is Tertiary Structure Prediction?
Tertiary structure prediction involves determining the 3D structure of a protein based on its
amino acid sequence. Among the various methods, homology modeling (also known as
comparative modeling) is the most widely used when a similar structure (template) is already
known.
   What is Homology Modeling?
Homology modeling predicts the 3D structure of a target protein using an experimentally
determined structure of a homologous protein (template) with similar sequence.
Key assumption: Proteins with similar sequences adopt similar structures.
   Steps in Homology Modeling
Step No. Step Description
1         Template Identification (via BLAST or HHblits)
2         Sequence Alignment (align target to template)
3         Model Building (build 3D model from alignment)
4         Model Refinement (correct side chains, loops)
5         Model Validation (check stereochemistry, clashes)
   Popular Tools for Homology Modeling
Tool / Server     Features
SWISS-MODEL       Fully automated, web-based; good for beginners
Modeller          Command-line tool; flexible and scriptable
Phyre2            Uses fold recognition when homology is weak
I-TASSER          Hybrid method; includes threading + ab initio if needed
RaptorX           Handles low-homology sequences well using deep learning
   Example Workflow – Using SWISS-MODEL
   1. Go to https://swissmodel.expasy.org
   2. Submit your protein sequence (FASTA format)
   3. It automatically:
           o    Finds the best template
           o    Aligns the sequence
           o    Builds a 3D model
   4. Download/view the predicted 3D structure (PDB format)
   5. Use structure viewers like PyMOL or Mol* to analyze
       What is Threading?
       Threading, also called fold recognition, is a protein structure prediction method used
       when:
   •   There is low sequence similarity (<30%) between the query protein and known
       structures.
   •   Homology modeling fails due to lack of a suitable close template.
It tries to "thread" the amino acid sequence of the unknown protein onto known folds (3D
templates) in structural databases, even when there's no clear sequence homology.
           How Does Threading Work?
       Step          Description
       1             Compare the target sequence against a library of known structures
       2             Try fitting (threading) the sequence onto each fold
       3             Score each model using energy functions, compatibility & alignment
       4             Choose the best-scoring fold as the predicted 3D model
       Threading evaluates both sequence compatibility and structural environment (e.g.,
       burial, secondary structure match, etc.).
           Popular Threading Tools
       Tool / Server         Features
       I-TASSER              Combines threading + ab initio; high accuracy
       Phyre2                Fold recognition with profile matching; very user-friendly
       RaptorX               Deep learning-based threading; handles remote homologs
       LOMETS                Meta-server combining multiple threading algorithms
       SPARKS-X              Improved scoring with neural networks and statistical energy functions
           When to Use Threading?
        Situation                               Method
        Sequence identity > 30%                 Homology modeling
        Identity 20–30%                         Threading
        Identity < 20%, no known fold           Ab initio
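The identity values used in a table like this come from a pairwise alignment of the target with a candidate template. A minimal sketch of the calculation is shown below. It assumes the two sequences are already aligned (equal length, gaps written as "-") and uses the number of gap-free columns as the denominator, which is one of several conventions in use.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b)
             if a != gap and b != gap]
    if not pairs:
        return 0.0
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs)


# Toy alignment: 6 gap-free columns, 5 of them identical
print(round(percent_identity("ACDE-FG", "ACDQWFG"), 1))
```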
       Example Workflow – Using Phyre2
   1. Visit: http://www.sbg.bio.ic.ac.uk/phyre2
   2. Paste your protein sequence (FASTA format).
   3. Choose Intensive mode for best threading predictions.
   4. Wait for results:
           o     Predicted 3D model
           o     Secondary structure
           o     Confidence scores
           o     Functional insights
What is Ab-initio Prediction?
Ab-initio (Latin for “from the beginning”) prediction refers to predicting a protein’s 3D
structure using only its amino acid sequence, without relying on template structures from
known databases.
It is based on biophysical principles and energetics, not sequence similarity or known folds.
   How Does It Work?
Ab-initio methods predict structure by:
   1. Sampling a large number of possible conformations of the protein (exhaustive
      enumeration is not feasible for real proteins).
   2. Using energy functions to evaluate each conformation.
   3. Selecting structures with lowest free energy (most stable state).
   Steps in Ab-initio Prediction
Step No. Step Description
1          Use the amino acid sequence as input
2          Generate many possible structures (decoys) randomly
3          Evaluate them with physics-based or knowledge-based energy functions
4          Refine and select the lowest-energy, most stable model
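The generate, evaluate, select loop can be illustrated with a deliberately simplified toy model. The sketch below is not how Rosetta or QUARK work internally; it assumes the 2D "HP" lattice model (H = hydrophobic, P = polar), where a conformation is a self-avoiding walk on a grid and the energy is minus the number of non-bonded H-H contacts. All function names are hypothetical.

```python
import random

MOVES = [(1, 0), (-1, 0), (0, 1), (0, -1)]


def random_walk(n, rng):
    """Random self-avoiding walk with n lattice points, or None if stuck."""
    path, occupied = [(0, 0)], {(0, 0)}
    for _ in range(n - 1):
        x, y = path[-1]
        free = [(x + dx, y + dy) for dx, dy in MOVES
                if (x + dx, y + dy) not in occupied]
        if not free:
            return None
        nxt = rng.choice(free)
        path.append(nxt)
        occupied.add(nxt)
    return path


def energy(seq, path):
    """-1 for every pair of H residues that touch but are not chain neighbours."""
    index_at = {pos: i for i, pos in enumerate(path)}
    total = 0
    for i, (x, y) in enumerate(path):
        if seq[i] != "H":
            continue
        for dx, dy in ((1, 0), (0, 1)):  # look right and up: each contact once
            j = index_at.get((x + dx, y + dy))
            if j is not None and abs(i - j) > 1 and seq[j] == "H":
                total -= 1
    return total


def best_decoy(seq, n_decoys=2000, seed=0):
    """Generate random decoys and keep the lowest-energy one."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_decoys):
        path = random_walk(len(seq), rng)
        if path is None:
            continue
        e = energy(seq, path)
        if best is None or e < best[0]:
            best = (e, path)
    return best


best_energy, best_path = best_decoy("HPPHPPHH")
print(best_energy, best_path)
```

Practical methods replace the random walk with guided sampling (for example fragment assembly) and the contact count with far richer energy functions, but the overall loop has the same shape.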
   Popular Tools for Ab-initio Prediction
Tool / Server            Features
Rosetta                  Highly accurate; uses fragment assembly and energy minimization
QUARK                    Designed specifically for ab-initio prediction
AlphaFold (DeepMind)     Combines deep learning and physics; works even without templates
I-TASSER (hybrid)        Starts with threading but can switch to ab-initio if needed
    Note: AlphaFold goes beyond traditional ab-initio prediction: it uses deep learning trained on
known structures, together with multiple sequence alignments, for highly accurate predictions.
   When to Use Ab-initio?
Use Case                                  Suitability of Ab-initio
No similar template in PDB                Preferred method
Short proteins (<100 residues)            More suitable
Novel proteins from new organisms         Useful
Experimental 3D structure not available   Essential
Workflow Using QUARK (Web-Based, Easy)
   Website: https://zhanggroup.org/QUARK/
   Steps:
   1. Go to the QUARK server
            o   Open the link above.
   2. Input your protein sequence
           o   Paste a FASTA-format sequence (≤200 residues for best results).
           o   Example:
>MyProtein
MSEQNNTEMTFQIQRIYTKDISFEAPNAPHVFQKDWMA...
   3. Provide email (optional)
           o   You’ll receive a link to results when the job is finished.
   4. Submit the job
           o   Click Submit. Wait time may vary (from 1 hour to a day).
   5. Download results
           o   You will get:
                   ▪   Predicted 3D structure (.pdb)
                   ▪   Visualization
                   ▪   Confidence score
What is a Ramachandran Plot?
A Ramachandran plot is a graphical representation of the phi (φ) and psi (ψ) backbone dihedral
angles of amino acids in a protein structure.
It is used to validate the stereochemical quality of predicted 3D protein structures.
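Each point on the plot is a pair of dihedral (torsion) angles, and each angle is defined by four consecutive backbone atoms. The sketch below shows one standard way to compute a dihedral angle from four 3D points. The helper names are assumptions for this example, and the coordinates in the usage lines are made-up test points rather than real protein atoms.

```python
import math


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees about the p1-p2 bond."""
    b1, b2, b3 = sub(p1, p0), sub(p2, p1), sub(p3, p2)
    n1, n2 = cross(b1, b2), cross(b2, b3)  # normals of the two planes
    x = dot(n1, n2)
    y = math.sqrt(dot(b2, b2)) * dot(b1, n2)
    return math.degrees(math.atan2(y, x))


# For residue i, with atoms taken from a structure file:
#   phi = dihedral(C[i-1], N[i], CA[i], C[i])
#   psi = dihedral(N[i], CA[i], C[i], N[i+1])
print(dihedral((0, 1, 0), (0, 0, 0), (1, 0, 0), (1, 1, 0)))   # cis arrangement
print(dihedral((0, 1, 0), (0, 0, 0), (1, 0, 0), (1, -1, 0)))  # trans arrangement
```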
   Purpose of the Ramachandran Plot
   •   Assesses structural validity of a predicted protein model.
   •   Highlights allowed vs. disallowed regions for torsion angles.
   •   Helps detect steric clashes or modeling errors.
   Workflow for Ramachandran Plot-Based Validation
   Step 1: Obtain the Predicted Structure
   •   Format: .pdb file
   •   Source: AlphaFold, Rosetta, QUARK, I-TASSER, etc.
   Step 2: Choose a Validation Tool
Tool                     Type          Website / Access
PROCHECK (PDBsum)        Web-based     https://www.ebi.ac.uk/thornton-srv/software/PROCHECK/
MolProbity               Web-based     http://molprobity.biochem.duke.edu/
SAVES server             Web-based     https://saves.mbi.ucla.edu/
PyMOL / ChimeraX         Desktop       For local visual generation of plots
   Step 3: Upload the Structure
   1. Visit a server (e.g., SAVES).
   2. Upload your .pdb file.
   3. Select Ramachandran Plot (PROCHECK).
   4. Run analysis.
   Step 4: Interpret the Ramachandran Plot
The plot divides the φ-ψ space into:
Region Type            Description
Most favored regions   Conformations frequently observed in known protein structures
Allowed regions        Less common but still acceptable conformations
Disallowed regions     Sterically unfavorable, indicate possible errors
   Ideal Results
Parameter                            Acceptable Range
Residues in most favored regions     > 90% (ideal: 95–98%)
Residues in disallowed regions       0% (or very minimal < 1–2%)
Example Output (Typical AlphaFold Result)
Region                         % Residues
Most Favored Regions           96.2%
Additional Allowed Regions     2.8%
Generously Allowed Regions     0.7%
Disallowed Regions             0.3%
What Are Stereochemical Properties?
Stereochemical properties refer to the geometrical and chemical correctness of a protein
structure at the atomic level, including:
   •     Bond lengths
   •     Bond angles
   •     Planarity of peptide bonds
   •     Chirality of amino acids
   •     Side chain conformations
   •     Non-bonded atomic interactions (steric clashes)
Why Are They Important?
   •     Ensure the predicted structure obeys chemical and physical constraints.
   •     Detect unrealistic distortions or errors introduced during modeling.
   •     Confirm that the model is physiologically plausible for biological interpretation.
Tools for Stereochemical Validation
 Tool                   Features
 MolProbity             Comprehensive analysis including clashes, bond geometry, rotamers
 WHAT_CHECK             Checks bond lengths, angles, and other stereochemical properties
 PROCHECK               Focuses on stereochemistry and geometry
What is Structure-Structure Alignment?
Structure-structure alignment is the process of comparing and superimposing two or more
protein 3D structures to identify their similarities and differences in spatial conformation.
Unlike sequence alignment, it focuses on the 3D coordinates of atoms, usually backbone atoms,
to analyze:
   •     Overall fold similarity
   •     Conserved structural motifs
   •     Evolutionary relationships
   •     Functional inference
Why Perform Structure-Structure Alignment?
   •     To assess the quality of a predicted protein model by comparing it with experimentally
         solved structures (e.g., from PDB).
   •     To study protein evolution by comparing structural conservation.
   •     To identify functional sites conserved in structure but not necessarily in sequence.
   •     To assist in drug design and protein engineering by comparing target structures.
Key Metrics in Structure Alignment
Metric                                    Description                                                   Interpretation
RMSD (Root Mean Square Deviation)         Average distance between aligned atoms after superposition    Lower RMSD (e.g., <2 Å) indicates high similarity
TM-score (Template Modeling Score)        Normalized score reflecting structural similarity             Score between 0 and 1; >0.5 means similar fold
GDT-TS (Global Distance Test - Total      Percentage of residues aligned within a certain distance      Higher GDT-TS (closer to 100) means better alignment
Score)
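The three metrics can be written down compactly once two structures have been superimposed and their residues paired. The sketch below assumes exactly that: two equal-length lists of already superimposed coordinates (for example C-alpha atoms). Real tools such as TM-align also search for the best superposition, which this sketch does not do. The TM-score length scale d0 = 1.24 * (L - 15)^(1/3) - 1.8 and the GDT-TS cutoffs of 1, 2, 4 and 8 Å follow the usual definitions; the floor of 0.5 Å on d0 for short chains is an assumption of this sketch.

```python
import math


def distances(model, ref):
    """Per-residue distances between paired, already superimposed atoms."""
    return [math.dist(a, b) for a, b in zip(model, ref)]


def rmsd(model, ref):
    d = distances(model, ref)
    return math.sqrt(sum(x * x for x in d) / len(d))


def tm_score(model, ref):
    """TM-score for a fixed superposition, normalised by the reference length."""
    length = len(ref)
    d0 = 1.24 * (length - 15) ** (1 / 3) - 1.8 if length > 15 else 0.5
    d0 = max(d0, 0.5)  # assumed floor for short chains
    return sum(1 / (1 + (d / d0) ** 2) for d in distances(model, ref)) / length


def gdt_ts(model, ref, cutoffs=(1, 2, 4, 8)):
    """Mean percentage of residues within each distance cutoff (in Å)."""
    d = distances(model, ref)
    fractions = [sum(x <= c for x in d) / len(d) for c in cutoffs]
    return 100 * sum(fractions) / len(cutoffs)


# Made-up example: a straight 100-residue "reference" and a copy shifted by 3 Å
ref = [(float(i), 0.0, 0.0) for i in range(100)]
model = [(x, y + 3.0, z) for x, y, z in ref]
print(rmsd(model, ref), tm_score(model, ref), gdt_ts(model, ref))
```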
Common Tools for Structure-Structure Alignment
Tool                  Features                                                         Website/Access
TM-align              Fast structural alignment, provides TM-score and RMSD            TM-align
DALI                  Aligns based on distance matrices, detects structural homologs   DALI server
PyMOL                 Visualization and manual alignment                               PyMOL
Chimera/ChimeraX      Advanced visualization with alignment capabilities               Chimera
Workflow for Structure Alignment
    1. Obtain protein structures in PDB format (e.g., predicted model and reference structure).
    2. Upload or open structures in a chosen tool.
    3. Perform alignment to superimpose the structures.
    4. Analyze RMSD, TM-score, or GDT-TS for quantitative similarity.
    5. Visually inspect the superposition for matching folds and deviations.
Example Interpretation
    •   RMSD < 2 Å: High structural similarity, often same fold.
    •   TM-score > 0.5: Generally considered significant fold similarity.
    •   High GDT-TS (closer to 100): Most residues lie close to their positions in the reference
        structure, indicating a very good model or alignment.
                                             UNIT- V
Introduction to Systems Biology
Systems Biology is an interdisciplinary field that focuses on the systematic study of complex
biological systems through integration of experimental and computational methods.
Key Points:
    •   Holistic approach: Instead of studying individual genes or proteins separately, systems
        biology investigates how components interact as a network to understand the behavior
        of the entire system (cells, tissues, organisms).
    •   Data integration: Combines data from genomics, proteomics, metabolomics, and other
        omics technologies.
    •   Modeling and simulation: Uses mathematical and computational models to simulate
        biological processes and predict system responses under different conditions.
    •   Applications: Understanding disease mechanisms, drug discovery, metabolic
        engineering, personalized medicine.
Introduction to Synthetic Biology
Synthetic Biology is an emerging discipline that involves the design and construction of new
biological parts, devices, and systems, or the re-design of existing natural biological systems for
useful purposes.
Key Points:
    •   Engineering principles: Applies engineering concepts like modularity, standardization,
        and abstraction to biology.
    •   Creation of synthetic circuits: Designs gene circuits, metabolic pathways, or entire
        organisms with novel functions.
    •   Tools used: DNA synthesis, genome editing (e.g., CRISPR), computational design.
    •   Applications: Production of biofuels, pharmaceuticals, biosensors, environmental
        remediation.
Microarray Data Analysis – Overview
Microarray technology is used to measure the expression levels of thousands of genes
simultaneously. Microarray data analysis involves processing and interpreting this data to
identify patterns of gene expression under different conditions (e.g., disease vs. healthy, treated
vs. untreated).
   Key Steps in Microarray Data Analysis
1. Data Acquisition
    •   Obtain raw data from microarray experiments (e.g., .CEL files from Affymetrix chips).
    •   Sources: Lab experiments or public repositories like GEO (Gene Expression Omnibus)
        or ArrayExpress.
2. Preprocessing
Involves cleaning and standardizing the data:
    •   Background correction: Adjusts for noise and non-specific binding.
    •   Normalization: Ensures data from different arrays or conditions are comparable (e.g.,
        Quantile Normalization).
    •   Filtering: Removes low-quality or non-informative genes.
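Quantile normalization, mentioned above, forces every array to share the same distribution of values. The sketch below is a minimal pure-Python version for illustration; in practice this is done with R packages such as limma or affy. It assumes each array is given as a list of intensities in the same gene order, and it breaks ties by position rather than averaging them, which is a simplification.

```python
def quantile_normalize(arrays):
    """Quantile-normalize a list of arrays (each a list of intensities)."""
    n_genes = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    # Mean of the values that share the same rank across all arrays
    rank_means = [sum(a[r] for a in sorted_arrays) / len(arrays)
                  for r in range(n_genes)]
    normalized = []
    for a in arrays:
        order = sorted(range(n_genes), key=lambda i: a[i])
        out = [0.0] * n_genes
        for rank, gene in enumerate(order):
            out[gene] = rank_means[rank]  # replace value by its rank mean
        normalized.append(out)
    return normalized


# Two toy arrays with three genes each (made-up intensities)
print(quantile_normalize([[2, 4, 6], [10, 30, 20]]))
```

After normalization both arrays contain the same set of values (6, 12, 18); only the assignment of values to genes differs.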
3. Gene Expression Calculation
    •   Compute expression levels for each gene.
    •   Tools like R packages (limma, affy, oligo) are commonly used.
4. Differential Expression Analysis
    •   Identify genes that are significantly upregulated or downregulated between conditions.
   •   Use statistical tests (e.g., t-test, ANOVA, moderated t-test in limma).
   •   Correct for multiple testing (e.g., using FDR or Benjamini-Hochberg method).
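The Benjamini-Hochberg correction named above is short enough to write out. The sketch below takes a list of raw p-values (one per gene) and returns adjusted p-values in the same order; the example p-values are made up. In a real analysis the equivalent result would normally come from limma or a similar package.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.005]
adj = benjamini_hochberg(raw)
significant = [i for i, q in enumerate(adj) if q < 0.05]
print(adj, significant)
```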
5. Clustering and Visualization
   •   Hierarchical clustering or k-means to group genes with similar expression patterns.
   •   Use heatmaps, volcano plots, and PCA (Principal Component Analysis) for visualization.
6. Functional Annotation
   •   Perform Gene Ontology (GO) enrichment or pathway analysis using:
           o   DAVID
           o   Enrichr
           o   GSEA (Gene Set Enrichment Analysis)
           o   KEGG pathway mapping
7. Biological Interpretation
   •   Relate identified gene expression changes to biological processes, diseases, or treatments.
   •   Validate with literature or further experiments like qPCR.
Applications of Microarray Analysis
   •   Cancer gene expression profiling
   •   Drug response analysis
   •   Biomarker discovery
   •   Toxicogenomics
   •   Comparative genomics
   DNA Computing – An Introduction
DNA computing is a branch of unconventional computing that uses DNA molecules and
biochemical reactions to perform computations, rather than traditional electronic circuits.
It is an interdisciplinary field combining molecular biology, computer science, mathematics, and
nanotechnology.
   Key Concept
DNA computing uses:
   •   Strands of DNA as information carriers (analogous to bits in digital computers)
   •   Enzymes and chemical reactions as logic gates and processors
The goal is to solve complex computational problems by harnessing the massive parallelism and
storage capacity of DNA molecules.
   Historical Background
   •   Leonard Adleman (1994): The field began when he used DNA to solve an instance of the
       Hamiltonian Path Problem (an NP-complete problem closely related to the traveling
       salesman problem).
   •   Showed that a combinatorial problem could be solved using DNA strands and biological
       operations.
   How DNA Computing Works – Basic Workflow
Step                               Description
Encoding                           Represent input data or problem instances using specific DNA sequences
Hybridization                      DNA strands naturally bind to complementary strands
Ligation                           Enzymes join DNA fragments to form new combinations
Amplification (PCR)                Duplicate strands that represent potential solutions
Separation (Gel Electrophoresis)   Filter correct-length strands (possible solutions)
Detection                          Identify the strand(s) representing the correct solution
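The workflow above can be mimicked in software to see why it solves a Hamiltonian path problem. The sketch below is an in-silico analogy, not a model of the chemistry: a list of vertex chains stands in for the test tube, and each laboratory step becomes a filter. The small graph and all function names are made up for this example.

```python
def all_walks(edges, max_len):
    """'Ligation': every edge-following chain with 2..max_len vertices."""
    walks = [[a, b] for a, b in edges]
    frontier = walks
    for _ in range(max_len - 2):
        frontier = [w + [b] for w in frontier for a, b in edges if a == w[-1]]
        walks = walks + frontier
    return walks


def hamiltonian_paths(vertices, edges, start, end):
    n = len(vertices)
    tube = all_walks(edges, n + 1)  # includes chains that are too long
    # 'PCR': keep chains with the right start and end
    tube = [w for w in tube if w[0] == start and w[-1] == end]
    # 'Gel electrophoresis': keep chains of exactly n vertices
    tube = [w for w in tube if len(w) == n]
    # 'Detection': keep chains that contain every vertex
    tube = [w for w in tube if set(w) == set(vertices)]
    return sorted(tube)


vertices = [0, 1, 2, 3]
edges = [(0, 1), (1, 2), (2, 3), (0, 2), (1, 3), (2, 1)]
print(hamiltonian_paths(vertices, edges, start=0, end=3))
```

On a computer the candidate chains are built one after another, so the cost grows quickly with graph size. In the test tube all chains form at the same time, which is the "massive parallelism" the text refers to.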
   Applications of DNA Computing
   •   Combinatorial problem solving (e.g., optimization problems like TSP)
   •   Cryptography
   •   Biological sensors and molecular diagnostics
   •   Smart drug delivery systems
   •   DNA-based logic gates for nano-computing
   •   Molecular robotics
   Bioinformatics Approaches for Drug Discovery
Bioinformatics plays a central role in modern drug discovery, enabling scientists to identify and
design potential therapeutic agents faster and more cost-effectively. By leveraging biological
data and computational tools, it allows researchers to understand disease mechanisms, identify
drug targets, and screen potential compounds.
   Key Stages of Drug Discovery Involving Bioinformatics
Stage                             Bioinformatics Role
1. Target Identification          Find disease-related genes/proteins using genomics, transcriptomics, etc.
2. Target Validation              Use expression data and literature mining to confirm target involvement.
3. Lead Compound Identification   Screen compounds (ligands) that may bind the target using virtual tools.
4. Optimization                   Predict drug-likeness, toxicity, and improve molecular binding properties.
5. Preclinical Studies            Model interactions, pathways, and simulate biological responses.
   Bioinformatics Techniques Used
1. Genomic & Transcriptomic Analysis
   •    Use microarrays and RNA-Seq to identify genes involved in diseases.
   •    Analyze single nucleotide polymorphisms (SNPs) for genetic variations linked to drug
        response.
2. Protein Structure Prediction
   •    Predict 3D structures of target proteins using:
           o   Homology modeling (SWISS-MODEL)
           o   Threading (Phyre2)
           o   Template-free prediction (ab-initio methods such as Rosetta; deep learning
               methods such as AlphaFold)
3. Molecular Docking
   •    Simulate how drug molecules bind to a target protein.
   •    Tools: AutoDock, PyRx, DockThor, SwissDock
4. Virtual Screening
   •    Screen large libraries of compounds in silico to find potential binders.
   •   Reduces cost and time compared to high-throughput lab screening.
5. Pharmacophore Modeling
   •   Identify the essential features in a molecule that ensure biological activity.
6. ADMET Prediction
   •   Predict:
           o   Absorption
           o   Distribution
           o   Metabolism
           o   Excretion
           o   Toxicity
   •   Tools: SwissADME, pkCSM, ADMETlab
7. Pathway and Network Analysis
   •   Understand the role of a gene/protein in biological pathways.
   •   Databases: KEGG, Reactome, BioCyc
8. Drug Repositioning
   •   Use data mining and similarity analysis to find new uses for existing drugs.
   •   Example: Using gene expression correlation via Connectivity Map (CMap).
Applications of Bioinformatics in Genomics and Proteomics
Bioinformatics provides powerful computational tools and techniques to analyze, interpret, and
visualize the massive datasets generated in genomics (study of genomes) and proteomics (study
of proteins). Below is a structured explanation of its applications in both fields:
   A. Applications in Genomics
Application                         Description
1. Genome Sequencing & Assembly     Bioinformatics helps in assembling short DNA reads from NGS into full genomes. Tools: SPAdes, Velvet, SOAPdenovo
2. Gene Prediction                  Algorithms predict the location and structure of genes. Tools: Glimmer, GENSCAN
3. Genome Annotation                Assigning biological information to gene sequences (function, structure, etc.). Databases: GenBank, Ensembl
4. Comparative Genomics             Comparing genomes across species to identify conserved and unique elements. Tools: BLAST, ClustalW, OrthoMCL
5. SNP Analysis                     Detecting single nucleotide polymorphisms to study variation and disease. Tools: SAMtools, GATK
6. Transcriptomics (RNA-Seq)        Analyzing gene expression using sequencing of mRNA. Tools: HISAT2, Cufflinks, DESeq2
7. Epigenomics                      Studying DNA methylation, histone modification using ChIP-seq and bisulfite data.
8. Metagenomics                     Analyzing genetic material from environmental samples. Tools: QIIME, MetaPhlAn
   B. Applications in Proteomics
Application                             Description
1. Protein Structure Prediction         Predicting 2D/3D protein structures from sequences. Tools: AlphaFold, SWISS-MODEL
2. Protein Sequence Analysis            Identifying motifs, domains, signal peptides. Tools: Pfam, InterPro, PROSITE
3. Protein Identification               Identifying proteins using mass spectrometry data. Tools: Mascot, SEQUEST
4. Protein-Protein Interaction (PPI)    Predicting and visualizing interactions. Databases: STRING, BioGRID
5. Protein Function Annotation          Predicting protein functions using sequence and structural data. Tools: Blast2GO, InterProScan
6. Post-Translational Modification      Identifying phosphorylation, glycosylation, etc. Tools: ModPred, NetPhos
7. Proteomic Pathway Analysis           Mapping proteins to pathways to understand biological processes. Databases: KEGG, Reactome
   Summary Table
Field         Key Applications                                                        Key Tools/Databases
Genomics      Gene prediction, SNP analysis, genome assembly, RNA-Seq                 BLAST, GenBank, GATK, Ensembl, HISAT2
Proteomics    Structure prediction, protein identification, PPI, function annotation  AlphaFold, STRING, Mascot, Pfam, InterPro
   Real-World Applications
   •    Personalized medicine: Predicting patient-specific responses based on genome/proteome.
   •    Drug discovery: Identifying new targets and biomarkers.
   •    Disease diagnosis: Through expression profiling and proteomic signatures.
   •    Agricultural genomics: Enhancing traits like drought resistance, productivity in crops.
Assembling the Genome – An Overview
Genome assembly is the process of reconstructing the complete DNA sequence of an organism
from short DNA fragments (reads) generated by sequencing technologies such as Illumina,
PacBio, or Oxford Nanopore.
   Why Genome Assembly Is Needed
Sequencers don't read entire chromosomes in one go. Instead, they produce millions of small
fragments (reads). Genome assembly puts these fragments back together in the correct order to
recreate the original genome.
   Types of Genome Assembly
Type                        Description
De novo Assembly            Building the genome from scratch, without a reference genome.
Reference-based Assembly    Aligning reads to an existing reference genome to build the sequence.
   Steps in De Novo Genome Assembly
   1. Read Preprocessing
           o    Remove low-quality reads and contaminants
           o    Tools: FastQC, Trimmomatic
   2. Error Correction
           o    Fix sequencing errors before assembly
           o   Tools: BayesHammer, Lighter
   3. Assembly Algorithm
           o   Overlap Layout Consensus (OLC) for long reads
           o   De Bruijn Graph for short reads
           o   Tools:
                    ▪   SPAdes (short reads)
                    ▪   Canu, Flye (long reads)
                    ▪   Velvet, SOAPdenovo
   4. Scaffolding
           o   Connecting contigs (contiguous sequences) into larger scaffolds using mate-pair
               or paired-end reads.
   5. Gap Filling & Polishing
           o   Closing gaps and correcting misassemblies.
           o   Tools: Pilon, GapCloser, Racon
   6. Annotation
           o   Identify genes, regulatory elements, and functional regions.
           o   Tools: Prokka, Augustus, GENSCAN
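The De Bruijn graph approach named in step 3 can be illustrated on a toy example. The sketch below is a teaching simplification, not how SPAdes or Velvet are implemented: it assumes error-free reads, a genome without repeated (k-1)-mers, and a single strand. It breaks the reads into k-mers, links each k-mer's prefix to its suffix, and walks an Eulerian path through the graph. The reads and function name are made up for this example.

```python
from collections import Counter, defaultdict


def assemble(reads, k):
    """Assemble reads via a De Bruijn graph and an Eulerian path walk."""
    kmers = {r[i:i + k] for r in reads for i in range(len(r) - k + 1)}
    graph = defaultdict(list)
    indeg, outdeg = Counter(), Counter()
    for kmer in sorted(kmers):
        prefix, suffix = kmer[:-1], kmer[1:]  # nodes are (k-1)-mers
        graph[prefix].append(suffix)
        outdeg[prefix] += 1
        indeg[suffix] += 1
    nodes = set(indeg) | set(outdeg)
    # The path starts at the node with one more outgoing than incoming edge
    starts = sorted(n for n in nodes if outdeg[n] - indeg[n] == 1)
    start = starts[0] if starts else min(nodes)
    # Hierholzer's algorithm for an Eulerian path
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if graph[node]:
            stack.append(graph[node].pop())
        else:
            path.append(stack.pop())
    path.reverse()
    return path[0] + "".join(node[-1] for node in path[1:])


reads = ["ATGGCGT", "GGCGTGC", "CGTGCA"]
print(assemble(reads, k=4))
```

Real genomes contain repeats and sequencing errors, so production assemblers add error correction, graph simplification, and paired-read information on top of this basic idea.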
STS Content Mapping for Clone Contigs – Overview
STS content mapping (Sequence Tagged Site mapping) is a physical mapping technique used to
determine the relative positions of DNA clones in a genomic library by identifying the presence
or absence of known DNA markers (STS sites) in each clone.
It played a key role in the Human Genome Project before high-throughput sequencing became
dominant.
   What is an STS?
   •   STS (Sequence Tagged Site): A short, unique DNA sequence (200–500 bp) that occurs
       only once in the genome, and its sequence and location are known.
   •   Detected by PCR, making STSs a reliable way to mark locations on DNA.
   Purpose of STS Content Mapping
To order overlapping DNA clones into contigs without sequencing them fully, by identifying
which STS markers are present in which clones.
   Process of STS Content Mapping
   1. Generate a genomic library
           o    Use BACs, YACs, or cosmids to clone fragments of the genome.
   2. Design STS markers
           o    Select known unique sequences across the genome.
   3. Screen clones for STS markers
           o    Use PCR to test which STS markers are present in each clone.
   4. Create a binary matrix
           o    Rows = clones
           o    Columns = STS markers
           o    Entry = 1 (marker present) or 0 (marker absent)
   5. Assemble clone contigs
           o    Clones sharing common STS markers are likely to overlap.
           o    Overlapping clones are merged into contigs.
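Steps 4 and 5 above translate directly into a small program. The sketch below uses a made-up binary matrix (clone and marker names are hypothetical) and groups clones into contigs by treating any two clones that share at least one STS marker as overlapping. Ordering the clones within each contig is a further step that is not shown here.

```python
def clone_contigs(markers, matrix):
    """Group clones into contigs from a clone x STS presence/absence matrix."""
    # Convert each 0/1 row into the set of markers present in that clone
    present = {clone: {m for m, bit in zip(markers, row) if bit}
               for clone, row in matrix.items()}
    clones = sorted(present)
    seen, contigs = set(), []
    for clone in clones:
        if clone in seen:
            continue
        seen.add(clone)
        stack, group = [clone], []
        while stack:  # collect all clones connected through shared markers
            current = stack.pop()
            group.append(current)
            for other in clones:
                if other not in seen and present[current] & present[other]:
                    seen.add(other)
                    stack.append(other)
        contigs.append(sorted(group))
    return contigs


markers = ["STS1", "STS2", "STS3", "STS4", "STS5"]
matrix = {
    "cloneA": [1, 1, 0, 0, 0],
    "cloneB": [0, 1, 1, 0, 0],
    "cloneC": [0, 0, 0, 1, 1],
    "cloneD": [0, 0, 0, 0, 1],
}
print(clone_contigs(markers, matrix))
```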
Functional Annotation – Explained
Functional annotation is the process of identifying and assigning biological functions to genes or
proteins based on sequence data. It is a critical step after gene prediction or protein
identification, especially in genome and transcriptome studies.
   Purpose of Functional Annotation
To determine:
   •   What a gene or protein does
   •   Where it functions (cellular localization)
   •   How it interacts with other molecules
   •   Biological processes it’s involved in
   Key Steps in Functional Annotation
Step                               Description
1. Sequence Similarity Search      Compare predicted genes/proteins to known sequences using tools like BLAST
2. Gene Ontology (GO) Mapping      Assign standardized terms for biological process, molecular function, and cellular component
3. Protein Domain Analysis         Identify conserved domains using Pfam, InterPro, SMART
4. Pathway Mapping                 Map genes to biological pathways (e.g., KEGG, Reactome)
5. Subcellular Localization        Predict where in the cell a protein functions (e.g., nucleus, mitochondria)
6. Enzyme Commission (EC) Number   Assign catalytic function to enzymes based on reaction class
   Tools and Databases
Tool/Database         Function
BLAST                 Sequence similarity search
InterProScan          Protein domain and family identification
GO (Gene Ontology)    Function, process, localization terms
Pfam, SMART           Domain identification
KEGG                  Pathway mapping
UniProt               Curated protein function database
   Peptide Mass Fingerprinting (PMF) – Explained
Peptide Mass Fingerprinting (PMF) is a mass spectrometry-based technique used to identify
proteins by measuring the masses of peptides generated from enzymatic digestion (usually by
trypsin) of the protein.
   Principle of PMF
   1. A protein is enzymatically digested into smaller peptides.
   2. The mass spectrum of these peptides is obtained using a mass spectrometer (typically
      MALDI-TOF).
   3. The experimental peptide mass list is compared with theoretical peptide masses in a
      protein database.
   4. The best matching protein is identified.
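The theoretical side of this comparison can be sketched as an in-silico tryptic digest followed by a mass calculation. The example below is a simplified illustration under stated assumptions: trypsin is modeled as cleaving after K or R unless the next residue is P, no missed cleavages or modifications are considered, and masses are monoisotopic. The toy sequence and the function names are hypothetical.

```python
# Monoisotopic residue masses in daltons for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # added once per peptide for the free N- and C-termini
PROTON = 1.00728   # added to give the singly charged [M+H]+ ion


def tryptic_digest(sequence):
    """Cleave after K or R, except when the next residue is P."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])  # C-terminal peptide
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[residue] for residue in peptide) + WATER


def theoretical_peaks(sequence):
    """List of (peptide, [M+H]+ m/z) pairs for a protein sequence."""
    return [(p, round(peptide_mass(p) + PROTON, 4)) for p in tryptic_digest(sequence)]


toy_sequence = "AKRPGKM"  # hypothetical sequence, chosen to show the proline rule
peaks = theoretical_peaks(toy_sequence)
for peptide, mz in peaks:
    print(peptide, mz)
```

The toy sequence is split into AK, RPGK and M: the R is not cleaved because it is followed by P. Search engines build such theoretical peak lists for every protein in the database and compare them with the measured spectrum.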
   PMF Workflow
Step                        Description
1. Protein Isolation        Protein is purified from cells or tissues.
2. Enzymatic Digestion      Protein is digested with trypsin, which cleaves on the C-terminal side of lysine (K) and arginine (R), generally not when the next residue is proline (P).
3. Mass Spectrometry (MS)   Peptide mixture is analyzed using MALDI-TOF or ESI-MS.
4. Peak List Generation     A list of peptide m/z (mass-to-charge) values is created.
5. Database Search          Experimental mass values are compared with theoretical digests using software such as Mascot, ProteinProspector, or PeptIdent.
6. Protein Identification   The protein with the highest match score is identified.
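Steps 5 and 6 can be illustrated with a deliberately simple scoring scheme: count how many observed peaks fall within a mass tolerance of any theoretical peptide mass, then rank the candidate proteins by that count. This is only a sketch. Real search engines such as Mascot use probability-based scores rather than a raw match count, and the 50 ppm tolerance, the function names, the protein names and all mass values below are assumptions for the example.

```python
def count_matches(observed, theoretical, tol_ppm=50.0):
    """Number of observed peaks within tol_ppm of any theoretical mass."""
    matched = 0
    for mz in observed:
        if any(abs(mz - t) / t * 1e6 <= tol_ppm for t in theoretical):
            matched += 1
    return matched


def rank_candidates(observed, database, tol_ppm=50.0):
    """Return (protein, matched_peak_count) pairs, best match first."""
    scores = {
        name: count_matches(observed, masses, tol_ppm)
        for name, masses in database.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical peak list and theoretical digests (m/z values of [M+H]+ ions).
observed_peaks = [842.51, 1045.56, 1479.80, 2211.10]
theoretical_db = {
    "protein_X": [842.5100, 1045.5640, 1479.7950, 1638.8600],
    "protein_Y": [842.5094, 927.4900, 1300.2000],
}
ranking = rank_candidates(observed_peaks, theoretical_db)
print(ranking)
```

With these made-up numbers protein_X matches three of the four observed peaks and protein_Y matches one, so protein_X is ranked first. A tolerance in ppm is used because the absolute mass error of an instrument typically grows with m/z.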
   Tools and Databases
Tool                   Purpose
Mascot                 PMF database search engine
ProteinProspector      Protein ID from MS data
PeptIdent (ExPASy)     Match peptide masses to proteins
SwissProt, NCBI        Protein sequence databases
   Key Characteristics of PMF
   •   Highly specific: the set of peptide masses produced from a protein by a given protease forms a characteristic "mass fingerprint".
   •   Quick and cost-effective for known proteins.
   •   Usually used with MALDI-TOF instruments.
   •   Requires the protein to be present in databases for identification.
   Advantages
   •   High-throughput protein identification
   •   Sensitive and accurate
   •   Automated and fast when integrated with databases