Bioinformatics 1

Bioinformatics is an interdisciplinary field that integrates biology, computer science, mathematics, and statistics to analyze biological data, utilizing tools and databases such as NCBI and BLAST. It differs from computational biology, which focuses more on modeling and simulations. The document also covers various types of biological databases, sequence alignment methods, and applications in gene expression, metabolic pathways, and motif analysis.

Uploaded by

buiikauwow

Bioinformatics is an interdisciplinary field that combines biology, computer science,
mathematics, and statistics to analyze and interpret biological data.


Example Tools & Databases
• Databases: NCBI, UniProt, Ensembl, KEGG, Pfam
• Tools: BLAST, Clustal Omega, FastQC, GROMACS, HMMER
Computational Biology is a broad field that applies mathematical models, computational
simulations, and algorithmic techniques to study biological systems. While closely related to
bioinformatics, computational biology places a stronger emphasis on developing theoretical
models and simulations to understand biological behavior.

Bioinformatics | Computational Biology
Focuses on data management and analysis | Focuses on modeling and simulation
Works with biological databases | Builds predictive biological models
Emphasizes tools and pipelines | Emphasizes theory and mathematical models
Data-driven (e.g., genome sequencing) | Hypothesis-driven (e.g., simulating evolution)

Biological sequences are linear arrangements of biological molecules that carry the information
required for life. They are fundamental to molecular biology and bioinformatics.
Biological databases are organized collections of biological data, essential for storing, retrieving,
and analyzing information such as DNA sequences, protein structures, gene expression data,
and more.
CLASSIFICATION OF DATABASES

1. Nucleotide Sequence Databases


Store DNA or RNA sequences.
• GenBank (NCBI, USA)
• ENA (European Nucleotide Archive, hosted by EMBL-EBI, Europe)
• DDBJ (Japan)
These three collaborate and exchange data daily as the INSDC (International Nucleotide
Sequence Database Collaboration).
Protein Sequence Databases
Store amino acid sequences of proteins.
• UniProt (Universal Protein Resource)
o UniProtKB/Swiss-Prot: Manually curated, high-quality data
o UniProtKB/TrEMBL: Computationally annotated, not manually reviewed
• PIR (Protein Information Resource)
• PDB (Protein Data Bank: primarily a 3D structure archive that also holds the sequences of the deposited molecules)
Applications
• Gene identification and annotation
• Phylogenetic analysis
• Comparative genomics
• Protein structure and function prediction
• Drug target discovery

2. Structure Databases in Bioinformatics

• Structure databases store three-dimensional (3D) structural data of biological
macromolecules such as proteins, nucleic acids (DNA/RNA), and complexes. These
structures are crucial for understanding biological function, drug design, enzyme
mechanisms, and protein-ligand interactions.

Examples: PDB (accessed through RCSB PDB), CATH, SCOP

Visualization Tools
• PyMOL
• Chimera/ChimeraX
• Jmol

Applications
• Drug design: Understanding how molecules bind to proteins
• Molecular docking: Predicting binding interactions
• Enzyme engineering
• Protein folding and dynamics
• Structure-function relationship analysis

3. Genome-Specific Databases in Bioinformatics


Genome-specific databases store and organize genomic information of a
particular species, group of organisms, or model organisms. These databases are
essential for studying gene function, genome organization, evolution, mutations,
and comparative genomics.
Examples: Ensembl, FlyBase, WormBase

Applications
• Studying gene regulation and structure
• Comparative genomics and evolution
• Understanding model organisms in biomedical research
• Finding candidate genes for diseases or traits
• CRISPR guide RNA design and gene editing

4. Specialized Databases in Bioinformatics


Specialized databases (or special databases) are focused repositories that contain specific types
of biological data—such as pathways, gene expression, protein families, diseases, or molecular
interactions—rather than general sequences or structures. These are critical for functional
genomics, systems biology, and translational research.
Examples: KEGG, Pfam, ExPASy ENZYME
Applications of Special Databases
• Functional annotation of genes/proteins
• Disease-gene association studies
• Drug discovery and toxicology
• Biomarker identification
• Systems biology and network modelling

Microarray

Microarray Technology in Bioinformatics


Microarray is a high-throughput technique used to measure the expression levels of thousands
of genes simultaneously or to genotype multiple regions of a genome. It plays a key role in
transcriptomics, diagnostics, and disease research.

What is a Microarray?

A microarray is a small chip (glass or silicon) onto which DNA probes are fixed in a grid
pattern. These probes hybridize with complementary DNA (cDNA) or RNA from a sample, and
the intensity of the signal indicates the expression level of each gene.
Basic Steps of a Microarray Experiment

1. Sample Preparation
o Extract RNA from cells or tissue
o Convert it to cDNA and label it with fluorescent dyes (e.g., Cy3 and Cy5)
2. Hybridization
o Apply the labeled cDNA to the microarray chip
o cDNA binds (hybridizes) to complementary DNA probes on the chip
3. Scanning
o Use a laser scanner to detect fluorescence intensity at each spot
4. Data Analysis
o Convert fluorescence signals into gene expression values
o Normalize and compare data across samples (e.g., normal vs. diseased)
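
The data analysis step can be sketched in a few lines. This is a minimal illustration, assuming a two-colour array in which Cy5 labels the test sample and Cy3 the reference; that assignment, the function names, and the intensity values are made up for the example and do not come from a real experiment. It converts paired intensities into log2 ratios and then median-centres them, which is one simple form of normalization.

```python
import math
from statistics import median


def log2_ratios(cy5, cy3):
    """Per-spot log2(Cy5 / Cy3); positive means higher signal in the Cy5 channel."""
    return [math.log2(red / green) for red, green in zip(cy5, cy3)]


def median_center(values):
    """Subtract the median so the typical spot has a log ratio of 0."""
    mid = median(values)
    return [v - mid for v in values]


# Hypothetical background-corrected intensities for four spots.
cy5 = [800.0, 400.0, 200.0, 1600.0]
cy3 = [400.0, 400.0, 400.0, 400.0]

ratios = log2_ratios(cy5, cy3)
normalized = median_center(ratios)
print(ratios)
print(normalized)
```

Real pipelines usually add background correction and intensity-dependent normalization before comparing samples; those steps are omitted here.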

Applications of Microarray
Area | Applications
Gene expression profiling | Compare expression in healthy vs. diseased tissues
Cancer research | Identify tumor-specific gene signatures
Drug discovery | Evaluate drug impact on gene expression
Diagnostics | Identify infections, genetic disorders

Microarray Data Repositories

Database | Description
GEO (Gene Expression Omnibus, NCBI) | Repository for microarray and RNA-seq datasets
ArrayExpress (EMBL-EBI) | Public archive for functional genomics data

Metabolic Pathway

Metabolic Pathway in Bioinformatics and Biology


A metabolic pathway is a series of interconnected biochemical reactions that transform a
starting molecule into a final product, catalyzed by enzymes. These pathways are essential for
energy production, growth, and maintenance of cellular functions in living organisms.

Types of Metabolic Pathways


Pathway Type | Function | Example
Catabolic | Break down molecules to release energy | Glycolysis, β-oxidation
Anabolic | Build complex molecules using energy | Protein synthesis, Gluconeogenesis
Amphibolic | Serve both anabolic and catabolic roles | Citric Acid Cycle (TCA/Krebs cycle)

Examples of Key Metabolic Pathways

Pathway | Description
Glycolysis | Converts glucose to pyruvate, generating ATP
TCA Cycle (Krebs cycle) | Oxidizes acetyl-CoA to CO₂, producing NADH, FADH₂
Electron Transport Chain | Uses electrons from NADH/FADH₂ to produce ATP
Pentose Phosphate Pathway | Generates NADPH and ribose sugars
Fatty Acid Metabolism | Includes β-oxidation (breakdown) and synthesis
Amino Acid Metabolism | Transamination, deamination, urea cycle
Photosynthesis (plants) | Converts light energy into chemical energy
Nitrogen Fixation (microbes/plants) | Converts atmospheric nitrogen to ammonia

Bioinformatics Resources for Metabolic Pathways

Database | Description
KEGG Pathway | Interactive maps of metabolic pathways with gene-enzyme-compound links
MetaCyc | Curated database of experimentally validated metabolic pathways
BioCyc | Collection of organism-specific pathway/genome databases

Components of a Metabolic Pathway


• Substrates: Starting molecules (e.g., glucose)
• Products: Final molecules (e.g., pyruvate)
• Enzymes: Biological catalysts for each step
• Intermediates: Compounds formed between start and end
• Coenzymes: NAD⁺, FAD, ATP – help transfer energy or atoms
Applications of Metabolic Pathway Analysis
• Drug target identification
• Metabolic engineering (e.g., in synthetic biology)
• Understanding disease mechanisms (e.g., cancer metabolism)

Motif

Motif in Bioinformatics and Molecular Biology


A motif is a short, conserved sequence pattern in DNA, RNA, or protein molecules that has a
biological function. Motifs are often involved in important roles such as binding sites, active
sites, structural regions, or regulatory elements.

Types of Motifs

Molecule | Motif Type | Function
DNA | Regulatory motifs | Act as transcription factor binding sites (e.g., TATA box, CAAT box)
RNA | RNA motifs | Secondary structure motifs (e.g., hairpin loops, riboswitches)
Protein | Protein motifs | Structural/functional units (e.g., zinc finger, helix-turn-helix, EF-hand)

Examples of Biological Motifs

DNA Motifs

• TATA Box: Promoter region in eukaryotes for transcription initiation


• CpG Island: Regions rich in CG nucleotides, often near promoters
• Enhancer motifs: Bind activator proteins to boost gene expression
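
Because a sequence motif is a short pattern, a regular expression is often enough to scan for one. The sketch below assumes the commonly quoted TATA-box consensus TATA(A/T)A(A/T); the consensus, the function name, and the test sequence are illustrative assumptions. Real promoters vary, and tools such as those listed later in this section use position weight matrices rather than a fixed pattern.

```python
import re

# Assumed consensus TATA(A/T)A(A/T). The lookahead lets overlapping hits be reported.
TATA_LIKE = re.compile(r"(?=(TATA[AT]A[AT]))")


def find_motif(sequence, pattern=TATA_LIKE):
    """Return (0-based start, matched text) for every occurrence of the pattern."""
    return [(m.start(), m.group(1)) for m in pattern.finditer(sequence.upper())]


# Hypothetical DNA fragment containing one TATA-like site.
hits = find_motif("GGCTATAAAAGGCGC")
print(hits)
```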

Protein Motifs

• Zinc Finger: Binds DNA; involved in gene regulation


• Leucine Zipper: Mediates protein-protein interactions
• SH2 Domain: Binds phosphorylated tyrosines in signaling pathways
• Walker A/B: ATP binding motifs in enzymes

Motif Discovery Tools

Tool | Purpose
MEME Suite | Discover novel motifs in DNA/protein sequences
PROSITE | Database of protein motifs and patterns
Pfam | Protein families and domains (some include motifs)
JASPAR | DNA binding motifs for transcription factors
MotifScan | Scans sequences for known motifs

Applications of Motif Analysis

• Predicting gene regulatory elements


• Identifying binding sites in proteins and DNA
• Understanding evolutionary conservation
• Classifying proteins into functional families
• Designing mutational studies or synthetic biology parts

Domain Databases
Domain Databases in Bioinformatics
Domains are structurally and functionally distinct units within proteins that can evolve,
function, and exist independently. Domain databases collect and annotate these conserved
regions, helping to predict protein function, structure, and evolutionary relationships.

Major Domain Databases

Database | Description | Key Features
Pfam | Protein families and domains | Uses multiple sequence alignments and hidden Markov models (HMMs)
InterPro | Integrative resource of protein domains | Combines Pfam, SMART, PROSITE, TIGRFAMs, and more
PROSITE | Protein domains, families, and functional sites | Uses patterns and profiles for detection
SCOP | Structural classification of protein domains | Hierarchical classification based on structure
CATH | Protein domain classification by structure | Class, Architecture, Topology, Homologous superfamily
Difference Between Motif and Domain

Motif | Domain
Short, conserved sequence | Larger, functional unit
May not fold independently | Can fold/function independently
Often a small part of a domain | Can consist of several motifs

Data file formats


1. Sequence Data Formats

Format | Used For | Description
FASTA (.fa/.fasta) | DNA/RNA/protein sequences | Plain text; each record starts with a ">" header line, followed by the sequence
FASTQ (.fq/.fastq) | Raw sequencing reads with quality | Includes quality scores per base (e.g., from Illumina)
GenBank (.gb/.gbk) | Annotated nucleotide sequences | Includes features, genes, CDS, organism
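
To make the FASTA layout concrete, here is a minimal parser sketch. The function name and the two records are made up for illustration; in practice a library such as Biopython would normally be used, and this sketch ignores edge cases such as duplicate IDs.

```python
def parse_fasta(lines):
    """Return {record_id: sequence} from an iterable of FASTA-formatted lines."""
    records = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # The ID is the first whitespace-delimited token after '>'.
            header = line[1:].split()
            current_id = header[0] if header else ""
            records[current_id] = ""
        elif current_id is not None:
            # Sequence lines may be wrapped; join them into one string.
            records[current_id] += line.upper()
    return records


# Hypothetical two-record file: one nucleotide and one protein sequence.
example = """>seq1 hypothetical example record
ACGTAC
GTTA
>seq2
MKVL
""".splitlines()

records = parse_fasta(example)
print(records)
```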

2. Protein Structure Formats

Format | Used For | Description
PDB (.pdb) | 3D structure of biomolecules | Atomic coordinates from X-ray, NMR, cryo-EM
mmCIF (.cif) | Alternative to PDB format | Richer metadata, used by RCSB PDB now
DSSP | Secondary structure assignments | Assigns alpha-helix, beta-sheet from PDB files

3. Expression and Microarray Formats

Format | Used For | Description
CEL files | Raw microarray data (Affymetrix) | Contains probe intensities
CHP files | Processed microarray data | Results from Affymetrix analysis

4. Phylogenetic and Multiple Sequence Alignments

Format | Used For | Description
CLUSTAL (.aln/.clustal) | Multiple sequence alignments | From tools like ClustalW/Clustal Omega
PHYLIP (.phy) | Phylogenetic analysis | Input for PHYLIP programs
NEXUS (.nex) | Phylogeny + character data | Used in PAUP*, MrBayes

UNIT-II
Sequence Alignment in Bioinformatics
Sequence alignment is the process of arranging DNA, RNA, or protein sequences to identify
regions of similarity. These similarities may indicate functional, structural, or evolutionary
relationships between the sequences.
Difference Between Homology and Similarity

Feature | Homology | Similarity
Definition | Indicates a common evolutionary origin | Measures how alike two sequences are
Type | Qualitative (yes or no) | Quantitative (measured in %)
Expression | "Two genes are homologous" or "are not homologous" | "Sequences are 80% similar"
Basis | Inferred from sequence similarity, structure, or function | Calculated from aligned sequence data
Types | Orthologs (speciation); Paralogs (duplication) | Sequence similarity; Structural similarity
Measurement Tool | Not directly measurable; inferred | Directly measurable via tools like BLAST, Clustal
Implication | Shared ancestry | Possible shared function or structure

Aspect | Identity | Similarity
Definition | Exact match of residues at the same position | Degree of resemblance between residues (including similar ones)
Type of Match | Strict: only identical residues | Flexible: includes chemically similar residues
Applicable To | DNA, RNA, and protein sequences | Mostly protein sequences
Tools | BLAST, Clustal, MAFFT | BLAST with substitution matrices (e.g., BLOSUM, PAM)
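
The difference between identity and similarity can be seen by computing both for one aligned pair. The sketch below is illustrative only: the groups of "chemically similar" residues are a rough assumption chosen for the example (real tools derive similarity from a substitution matrix such as BLOSUM), and the percentages use ungapped columns as the denominator, which is one of several conventions in use.

```python
# Assumed groupings of chemically similar amino acids (illustrative, not a standard).
SIMILAR_GROUPS = [set("ILVM"), set("FYW"), set("KRH"), set("DE"), set("ST"), set("NQ")]


def identity_and_similarity(aligned_a, aligned_b):
    """Return (% identity, % similarity) over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    identical = similar = compared = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            continue  # gap columns are excluded from both percentages
        compared += 1
        if x == y:
            identical += 1
            similar += 1  # identical residues also count as similar
        elif any(x in group and y in group for group in SIMILAR_GROUPS):
            similar += 1
    if compared == 0:
        return 0.0, 0.0
    return 100 * identical / compared, 100 * similar / compared


# Hypothetical aligned protein fragments.
result = identity_and_similarity("MKV-LD", "MRVALE")
print(result)
```

Here three of the five compared columns are identical, and the other two (K/R and D/E) are conservative substitutions, so similarity is higher than identity.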

Types of sequence alignment


1. Pairwise Sequence Alignment
Pairwise sequence alignment is a fundamental bioinformatics method used to compare two
biological sequences — either DNA, RNA, or protein — to identify regions where they are
similar or different. The goal of pairwise alignment is to arrange the sequences in such a way
that equivalent or related residues (nucleotides or amino acids) are aligned to each other,
highlighting evolutionary, structural, or functional relationships.
There are two main types of pairwise alignment:
1. Global Alignment: This method attempts to align the entire length of both sequences
from beginning to end. It is most effective when the sequences are of roughly the same
length and are expected to be similar throughout. The global alignment algorithm
systematically scores all possible alignments and finds the best overall match, including
gaps introduced to maximize alignment. The classic algorithm used for global alignment
is the Needleman-Wunsch algorithm.
2. Local Alignment: This method focuses on finding the best matching region(s) or
subsequences within the two sequences, rather than aligning them from end to end.
Local alignment is useful when sequences may only share small regions of similarity,
such as conserved functional domains within otherwise divergent sequences. The Smith-
Waterman algorithm is a popular method for local alignment. A widely used practical
tool that performs fast local alignments is BLAST.
In pairwise alignment, matches, mismatches, and gaps are scored using specific scoring
schemes, often guided by substitution matrices (for proteins) like BLOSUM or PAM. Matches
add positive scores, while mismatches and gaps usually incur penalties. The alignment with the
highest score represents the best alignment under the scoring criteria.
2. Multiple Sequence Alignment (MSA)
Multiple Sequence Alignment (MSA) is an extension of pairwise alignment where three or more
biological sequences (DNA, RNA, or proteins) are aligned simultaneously. The objective is to
arrange the sequences so that similar regions, conserved motifs, or functional domains are
aligned across all sequences in the set.

Why is MSA important?


• Detect conserved regions: MSA helps identify sequences or motifs that have been
preserved throughout evolution, suggesting they have important structural or functional
roles.
• Infer evolutionary relationships: By comparing multiple sequences, MSA supports the
construction of phylogenetic trees, which show the evolutionary history of genes or
species.
• Predict structure and function: Conserved regions highlighted by MSA can indicate
critical parts of a protein or gene involved in binding, catalysis, or regulation.
• Guide experimental design: MSA can help design primers for PCR or identify target
sites for mutagenesis.

Common Tools for MSA


• Clustal Omega: Widely used, good balance between speed and accuracy.
• MUSCLE: Faster and often more accurate for large datasets.
• MAFFT: Efficient for very large numbers of sequences.
• T-Coffee: Provides consensus alignments combining results from different methods

3. Global Alignment
Global alignment is a method used in bioinformatics to align two biological sequences
(DNA, RNA, or protein) across their entire lengths. It attempts to match the sequences
from the beginning of both sequences to the end, even if there are mismatches or gaps.

When is Global Alignment Used?

• When the two sequences are of similar length


• When the sequences are closely related (e.g., from the same gene family or species)
• When a full-length comparison is important — such as comparing two homologous
genes or complete protein sequences

Key Characteristics

Feature | Description
Alignment Scope | Full-length (start to end of both sequences)
Gaps | Allowed and inserted to optimize overall alignment
Match/Mismatch Handling | All positions are considered, even mismatches
Common Algorithm | Needleman–Wunsch Algorithm
Tools Used | EMBOSS Needle, Clustal Omega (for MSA), MAFFT

4. Local Alignment in Bioinformatics


Local alignment is a sequence alignment technique used to find the most similar region(s)
between two biological sequences — such as DNA, RNA, or proteins. Unlike global alignment, it
does not try to align the sequences from end to end, but instead focuses on aligning the best
matching subsequences within the larger sequences.

Key Features of Local Alignment

Feature | Description
Scope | Aligns only the most similar region(s) (subsequences)
Gaps | Allowed, only within the aligned region
Algorithm Used | Smith–Waterman algorithm (dynamic programming)
Common Tools | BLAST (heuristic local alignment), EMBOSS Water
Best For | Dissimilar sequences or domain-level comparisons
Output | A high-scoring segment pair showing best local match

Dot Plot in Bioinformatics


A dot plot is a graphical method used in bioinformatics to compare two biological sequences
(DNA, RNA, or protein). It helps visually identify regions of similarity, such as conserved
motifs, repeats, or alignments, by placing a dot wherever residues (nucleotides or amino acids)
in the two sequences match.

What is a Dot Plot?

A dot plot is a 2D matrix where:


• One sequence is plotted along the horizontal (x-axis).
• The other sequence is plotted along the vertical (y-axis).
• A dot is placed at (i, j) if the character at position i in sequence 1 matches the character
at position j in sequence 2.

Purpose of a Dot Plot

• To visually detect similarity between two sequences


• To identify repeats, inversions, or palindromes
• To give an intuitive overview before running full alignments

How to Interpret a Dot Plot

Pattern Seen | Meaning
A diagonal line (↘) | Good match or alignment between sequences
Breaks/gaps in the diagonal | Mismatches or insertions/deletions
Parallel diagonals | Repeats or duplications
Horizontal/vertical offsets between diagonal segments | Gaps (insertions/deletions) in one of the sequences
Inverted diagonals (↙) | Inversions or palindromic sequences

Tools for Creating Dot Plots

Tool Features

EMBOSS Dotmatcher Creates dot plots with customizable window size

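Example
Let’s compare two sequences:
• Sequence 1 (rows): ACGTG
• Sequence 2 (columns): ACGTC

    A C G T C
  -----------
A | ●
C |   ●     ●
G |     ●
T |       ●
G |     ●        ← no dot at the last diagonal position (G ≠ C)

The first four positions form an unbroken diagonal. The two dots away from the diagonal are
chance single-residue matches (the C in row 2 with the final C, and the final G with the earlier G).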
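
A dot plot is simple enough to compute directly. The following sketch builds the match matrix for the two example sequences and prints it as text; the function names are chosen for this illustration, and no window or threshold filtering is applied (tools such as EMBOSS Dotmatcher add that to suppress chance matches).

```python
def dot_matrix(seq_rows, seq_cols):
    """matrix[i][j] is True when residue i of seq_rows equals residue j of seq_cols."""
    return [[r == c for c in seq_cols] for r in seq_rows]


def render(seq_rows, seq_cols):
    """Text rendering: '*' marks a match, '.' marks no match."""
    lines = ["  " + " ".join(seq_cols)]
    for residue, row in zip(seq_rows, dot_matrix(seq_rows, seq_cols)):
        lines.append(residue + " " + " ".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


matrix = dot_matrix("ACGTG", "ACGTC")
print(render("ACGTG", "ACGTC"))
```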

Alignment Algorithms
Sequence alignment algorithms are at the core of bioinformatics and computational biology.
They allow us to compare biological sequences (DNA, RNA, or proteins) to find similarities,
differences, and evolutionary relationships. These algorithms use mathematical and
computational techniques to optimally align two or more sequences, considering matches,
mismatches, and gaps.

Types of Alignment Algorithms

Type | Purpose | Common Algorithms
Global Alignment | Aligns sequences from end to end | Needleman–Wunsch
Local Alignment | Aligns best matching regions within sequences | Smith–Waterman
Multiple Alignment | Aligns 3 or more sequences simultaneously | Clustal, MUSCLE, MAFFT, T-Coffee

Needleman–Wunsch Algorithm – Global Sequence Alignment


The Needleman–Wunsch algorithm is a dynamic programming method used for global
alignment of two biological sequences (DNA, RNA, or proteins). Published in 1970, it was one of
the first algorithms developed for biological sequence alignment and remains a foundational
concept in bioinformatics.

Purpose

To find the best alignment of two sequences across their entire lengths, including matches,
mismatches, and gaps, in a way that maximizes an alignment score.

Key Concepts

Term | Meaning
Global alignment | Aligns sequences from beginning to end
Dynamic programming | Breaks problem into subproblems and builds solution step by step
Scoring scheme | Rewards matches (+), penalizes mismatches (−) and gaps (−)

Scoring Example

• Match = +1
• Mismatch = –1
• Gap (insertion/deletion) = –2

Step-by-Step Procedure

Sequence A: G A C
Sequence B: G A T

Step 1: Initialization
Create a matrix with dimensions (len(A)+1) × (len(B)+1), and initialize the first row and column
with gap penalties.

Step 2: Fill the Matrix


Score(i,j) = max( Score(i–1, j–1) + match/mismatch, Score(i–1, j) + gap, Score(i, j–1) + gap)
Step 3: Traceback
Start from the bottom-right cell and trace back to the top-left, following the path that gave the
optimal score (diagonal = match/mismatch, up = gap in seq B, left = gap in seq A).
This gives you the optimal global alignment.
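
The three steps can be sketched as follows, using the scoring example above (match +1, mismatch −1, gap −2) and a single linear gap penalty. This is a minimal illustration rather than the implementation used by any particular tool; the function name and the tie-breaking order in the traceback (diagonal, then up, then left) are choices made for this example.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of a and b; returns (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1

    # Step 1: initialise the first row and column with cumulative gap penalties.
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    # Step 2: fill each cell from its diagonal, upper, and left neighbours.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Step 3: trace back from the bottom-right cell to the top-left cell.
    i, j = len(a), len(b)
    out_a, out_b = [], []
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])  # diagonal move
            i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")       # up: gap in seq B
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])       # left: gap in seq A
            j -= 1
    return score[len(a)][len(b)], "".join(reversed(out_a)), "".join(reversed(out_b))


result = needleman_wunsch("GAC", "GAT")
print(result)
```

For GAC and GAT the best global alignment has no gaps: two matches and one mismatch give 1 + 1 − 1 = 1.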
Applications
• Comparing full gene or protein sequences
• Studying closely related species
• Detecting mutations, insertions, deletions
• Building tools like EMBOSS Needle
Tools That Use Needleman–Wunsch
• EMBOSS Needle (online & command-line)
• Biopython and BioPerl (programmatic implementation)

Smith–Waterman Algorithm – Local Sequence Alignment


The Smith–Waterman algorithm is a dynamic programming method used for local alignment of
two biological sequences (DNA, RNA, or protein). Unlike the Needleman–Wunsch algorithm
(which aligns entire sequences), Smith–Waterman identifies the highest-scoring pair of
subsequences, i.e., the best matching region.

Purpose

To find the most similar local region (subsequence) between two sequences by maximizing the
alignment score, allowing matches, mismatches, and gaps.

Key Features

Aspect | Smith–Waterman
Type | Local alignment
Approach | Dynamic programming
Best for | Comparing distantly related sequences
Output | Best matching subsequences, not full-length alignment
Algorithm Basis | Recurrence relation with zero as minimum

Scoring System

Element | Example Score
Match | +2
Mismatch | –1
Gap (indel) | –2

Step-by-Step: Smith–Waterman Algorithm


Let’s align these sequences:
• Sequence A = G A C T
• Sequence B = G A T

Step 1: Initialize Matrix

Create a (m+1) × (n+1) matrix, where m and n are lengths of the sequences. Initialize the first
row and column to 0 (important for local alignment!).

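Step 2: Fill the Matrix

Score(i,j) = max( 0, Score(i–1, j–1) + match/mismatch, Score(i–1, j) + gap, Score(i, j–1) + gap)
The extra 0 means a cell's score never goes negative, so a poorly matching stretch resets the
alignment instead of dragging the score down.
Step 3: Traceback
Start from the cell with the highest score anywhere in the matrix and trace back until a cell with
score 0 is reached, following the path that gave each score (diagonal = match/mismatch,
up = gap in seq B, left = gap in seq A).
This gives you the optimal local alignment.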
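
A minimal sketch of the local-alignment procedure, using the scoring system above (match +2, mismatch −1, gap −2). Two things distinguish it from the global case: every cell is floored at zero, and the traceback starts at the highest-scoring cell and stops at the first zero. The function name and the handling of ties (the first maximum found in row-by-row order is used) are choices made for this illustration.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Local alignment of a and b; returns (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1

    # Step 1: first row and column stay at 0.
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Step 2: fill the matrix, never letting a cell drop below 0.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(0,
                              score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Step 3: trace back from the best cell until a zero is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])  # diagonal move
            i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")       # up: gap in seq B
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])       # left: gap in seq A
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


result = smith_waterman("GACT", "GAT")
print(result)
```

For GACT and GAT the maximum score is 4, reached first by the local match GA/GA. The gapped alignment GACT/GA-T reaches the same score (2 + 2 − 2 + 2), so a different tie-breaking rule would report that one instead.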
Applications
• Finding conserved domains or motifs within otherwise divergent sequences
• Comparing distantly related sequences
• Sensitive database searching (BLAST is a faster heuristic for the same task)

Tools Using Smith–Waterman

Tool Usage

EMBOSS Water Command-line & web alignment

Substitution Matrices – PAM (Point Accepted Mutation)


In bioinformatics, substitution matrices are used to score alignments between protein
sequences by assigning values to amino acid substitutions. PAM is one of the earliest and
most widely used substitution matrices in sequence alignment.
What is PAM?

PAM stands for Point Accepted Mutation. It is a scoring matrix used in protein sequence
alignment to estimate the likelihood of one amino acid being replaced by another during
evolution.
• Developed by Margaret Dayhoff in the 1970s.
• Based on observed mutations in closely related protein families.
• Measures evolutionary distance between proteins.

Concept of 1 PAM

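• 1 PAM = an average of 1 accepted point mutation per 100 amino acids (1% of positions changed).
• Constructed from alignments of closely related proteins.
• A PAM1 matrix shows the probabilities of amino acid substitutions after 1% sequence
divergence.
To model larger evolutionary distances, PAM matrices are extrapolated by multiplying PAM1 by
itself (PAMn is PAM1 raised to the n-th power):

Matrix | Meaning
PAM1 | 1 accepted mutation per 100 residues (closely related sequences)
PAM250 | 250 accepted mutations per 100 residues (more distant sequences); because the same site can mutate repeatedly, such sequences still share roughly 20% identity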
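
The extrapolation from PAM1 to a higher-numbered matrix is repeated matrix multiplication of the mutation probability matrix. The sketch below uses a toy two-letter alphabet rather than the real 20 × 20 PAM1 (the values are invented for illustration, and rows are taken to be the original residue), just to show that 250 rounds of 1% change do not erase all similarity.

```python
def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(m, exponent):
    """Raise a square matrix to a non-negative integer power by repeated multiplication."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(exponent):
        result = mat_mul(result, m)
    return result


# Toy "PAM1": each of two residues has a 1% chance of changing per step (hypothetical).
toy_pam1 = [[0.99, 0.01],
            [0.01, 0.99]]

toy_pam250 = mat_power(toy_pam1, 250)
print(toy_pam250[0])
```

In this toy model the probability of still seeing the original residue after 250 steps stays just above 0.5, the chance level for two letters, because repeated and reverse changes at the same site are counted in the 250.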

Structure of a PAM Matrix

It is a 20 × 20 matrix (for the 20 amino acids), where:


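• Each cell (i, j) contains a log-odds score comparing how often the substitution is observed
with how often it would occur by chance:
o Positive → substitution occurs more often than expected by chance (likely)
o Negative → substitution occurs less often than expected by chance (unlikely)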
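
A log-odds score is the logarithm of the ratio between the observed frequency of a residue pair and the frequency expected if the two residues paired by chance. The sketch below shows the calculation with made-up frequencies; the function name, the use of log base 2, and the `scale` parameter are assumptions for illustration (published matrices apply their own scaling and round to integers).

```python
import math


def log_odds(observed_pair_freq, freq_a, freq_b, scale=1.0):
    """scale * log2(observed / expected), where expected = freq_a * freq_b."""
    expected = freq_a * freq_b
    return scale * math.log2(observed_pair_freq / expected)


# Hypothetical frequencies: both residues occur at 10%, so chance pairing is 1%.
favoured = log_odds(0.02, 0.1, 0.1)      # pair seen twice as often as chance
disfavoured = log_odds(0.005, 0.1, 0.1)  # pair seen half as often as chance
print(favoured, disfavoured)
```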

Substitution Matrices – BLOSUM (BLOcks SUbstitution Matrix)


BLOSUM is another widely used substitution matrix in bioinformatics, especially for
protein sequence alignments. It helps score amino acid substitutions based on evolutionary
conservation, similar to PAM, but is constructed using a different strategy and is more
effective for local alignments and distantly related sequences.

What is BLOSUM?

• BLOSUM = BLOcks SUbstitution Matrix


• Developed by Henikoff & Henikoff in 1992
• Based on observed substitutions in conserved protein blocks (ungapped regions of
multiple alignments)
• Unlike PAM (which is extrapolated), BLOSUM is directly derived from real sequence
alignments

Key Concept

• BLOSUM matrices are labeled as BLOSUMx, where x is the percentage identity
threshold used to cluster sequences.

BLOSUM Matrix | Best For
BLOSUM80 | Closely related sequences
BLOSUM62 | Moderately divergent sequences (default in BLAST)
BLOSUM45 | Distantly related sequences

Lower BLOSUM number → greater evolutionary distance.

Structure of a BLOSUM Matrix

Like PAM, BLOSUM is a 20×20 matrix (for amino acids) with log-odds scores:
• Positive = more likely substitution
• Negative = less likely substitution

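Feature | PAM | BLOSUM
Full form | Point Accepted Mutation | BLOcks SUbstitution Matrix
Based on | Extrapolated mutations in families | Observed substitutions in blocks
Suited for | Closely related sequences (low-numbered matrices; high-numbered ones such as PAM250 target distant sequences) | Distantly related sequences (lower-numbered matrices for greater divergence)
Label meaning | PAM250 = 250 accepted mutations per 100 residues | BLOSUM62 = built from blocks in which sequences at least 62% identical are clustered
Common usage | Older tools, full alignments | BLAST, protein alignment tools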

Applications of Multiple Sequence Alignment (MSA)


Multiple Sequence Alignment (MSA) is a core technique in bioinformatics used to compare
three or more biological sequences (DNA, RNA, or protein) simultaneously. Its applications
span evolutionary biology, genomics, drug design, and functional annotation.

1. Identification of Conserved Regions

• Conserved sequences often indicate important functional or structural roles (e.g., active
sites, binding domains).
• Helps identify motifs or signatures characteristic of a protein family.

Example: Finding conserved catalytic residues in enzymes across different organisms.

2. Phylogenetic Tree Construction

• MSA is the starting point for building evolutionary trees.


• It helps trace common ancestry and divergence between species or genes.

Example: Studying evolutionary relationships among coronavirus spike proteins.

3. Protein Structure and Function Prediction

• Conserved regions suggest functional importance and structural stability.


• Aligning unknown proteins with known structures may reveal 3D folding patterns.

Example: Predicting zinc finger domain in a newly discovered transcription factor.

4. Primer and Probe Design

• Helps in designing universal primers or probes that bind to conserved regions across
species.
• Critical for PCR, qPCR, microarray, or diagnostic kits.

Example: Designing a primer to detect conserved rRNA genes in bacteria.

5. Annotation of New Sequences

• Annotate newly sequenced DNA or proteins based on alignment with well-annotated
homologs.
• Assign gene function, exon-intron boundaries, or domain labels.

Example: Assigning function to a novel gene based on alignment with known kinase family
genes.

6. Drug Target Discovery and Vaccine Design

• Identify conserved drug targets across multiple pathogenic strains.


• Use conserved epitopes to design broad-spectrum vaccines.

Example: Conserved epitopes in the influenza virus HA protein used in universal flu vaccine
design.

7. Detecting Mutations and SNPs

• Compare aligned sequences to detect point mutations, insertions, or deletions.


• Useful in cancer genomics, personalized medicine, and evolution studies.

Example: Identifying a pathogenic SNP in the BRCA1 gene.

Viewing and Editing Multiple Sequence Alignments (MSA)


Once you perform a Multiple Sequence Alignment (MSA), viewing and editing it effectively is
crucial for interpretation, annotation, or preparing it for further analyses like phylogenetic tree
construction, conserved motif discovery, or domain prediction.

Why View or Edit MSA?

• To manually correct misaligned regions


• Highlight conserved sequences or motifs
• Annotate functional or structural features
• Trim or remove poorly aligned regions
• Export in desired formats (FASTA, Clustal, Phylip)
Popular Tools for Viewing & Editing MSA

1. Jalview (Desktop Application)

• GUI-based tool for visualizing and editing MSA.

• Supports color-coding, annotations, trees, and structure overlay.


• Can fetch sequences from databases (UniProt, EMBL).
• Compatible with Clustal, FASTA, Stockholm formats.

Website: https://www.jalview.org

2. AliView
• Lightweight, fast MSA editor and viewer.
• Suitable for large datasets (e.g., viral genomes).
• Allows quick manual adjustments, trimming, and exporting.

Website: http://ormbunkar.se/aliview/

3. UGENE
• Bioinformatics suite that includes MSA editing.
• Integrates with tools like ClustalW, MAFFT, MUSCLE.
• Great for annotation and local analysis.

Website: https://ugene.net/

4. MEGA (Molecular Evolutionary Genetics Analysis)


• Primarily used for phylogenetic analysis, but allows MSA viewing/editing.
• Integrates MSA tools and supports tree construction.

Website: https://www.megasoftware.net/

5. Web-based Viewers

Tool | Features
MAFFT Viewer | Online visualization after alignment
Clustal Omega Viewer | View alignments with colored conservation
Wasabi | Interactive MSA + phylogenetic tree viewer


Common Features in MSA Viewers

Feature | Description
Color Coding | Based on amino acid properties or conservation
Gap Editing | Manually add/delete gaps in specific regions
Consensus View | Show residues most conserved across sequences
Annotation | Add structural, functional, or domain features
Format Export | Save as FASTA, Clustal, Stockholm, etc.

Common File Formats

Format Extension Description

FASTA .fasta Basic format for sequences

Clustal .aln Used by ClustalW, supports alignment

Stockholm .sto Annotated alignment format

Phylip .phy Input for phylogenetic tools

Tips for Editing MSA

• Use color schemes like Zappo or Taylor for proteins.


• Trim low-confidence regions at sequence ends.
• Remove redundant or low-quality sequences.
• Always save a backup of the original alignment.
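
Trimming poorly aligned regions can also be scripted. The sketch below is one possible approach, not the method of any particular editor: it drops every column whose gap fraction exceeds a cutoff. The function name, the 0.5 default cutoff, and the example sequences are assumptions made for illustration.

```python
def trim_gappy_columns(alignment, max_gap_fraction=0.5, gap_chars="-."):
    """Return a copy of the alignment without columns that are mostly gaps.

    alignment: dict mapping sequence name -> aligned sequence (equal lengths).
    """
    if not alignment:
        return {}
    seqs = list(alignment.values())
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all aligned sequences must have the same length")

    keep = []
    for col in range(length):
        gaps = sum(1 for s in seqs if s[col] in gap_chars)
        if gaps / len(seqs) <= max_gap_fraction:
            keep.append(col)  # column is informative enough to keep

    return {name: "".join(seq[c] for c in keep) for name, seq in alignment.items()}


# Hypothetical three-sequence alignment with ragged ends.
aln = {
    "seq1": "--MKTAYIAK",
    "seq2": "-AMKT-YIAK",
    "seq3": "--MKTAYI--",
}
trimmed = trim_gappy_columns(aln, max_gap_fraction=0.5)
```

Working on a copy keeps the original alignment untouched, in line with the backup tip above.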

Scoring Function in Multiple Sequence Alignment (MSA)


Scoring functions in MSA are used to evaluate the quality of the alignment by measuring how
well the sequences are conserved across aligned columns. A higher score usually means better
biological relevance, reflecting evolutionary, structural, or functional relationships.

Key Scoring Functions in MSA

1. Sum-of-Pairs (SP) Score

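The most common scoring function for MSA.

How it works:
• For every column in the alignment, score every pair of residues in that column.
• Use a substitution matrix (e.g., PAM, BLOSUM) for amino acids or match/mismatch scores for nucleotides; a pair that contains a gap receives a gap penalty.
• Add the pairwise scores within each column, then add the column scores to obtain the SP score of the whole alignment.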
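
As a minimal sketch of the Sum-of-Pairs idea, the code below scores a small nucleotide alignment with a simple match/mismatch/gap scheme. The scoring values (+1, -1, -2) and the choice to score gap-gap pairs as 0 are assumptions for illustration; real aligners typically use a substitution matrix and more elaborate gap models.

```python
from itertools import combinations


def sum_of_pairs(alignment, match=1, mismatch=-1, gap=-2):
    """Sum-of-Pairs score of a list of equal-length aligned sequences."""
    length = len(alignment[0])
    if any(len(s) != length for s in alignment):
        raise ValueError("all aligned sequences must have the same length")

    total = 0
    for col in range(length):
        column = [s[col] for s in alignment]
        # Score every unordered pair of residues in this column.
        for a, b in combinations(column, 2):
            if a == "-" and b == "-":
                continue            # gap-gap pairs contribute nothing here
            if a == "-" or b == "-":
                total += gap
            elif a == b:
                total += match
            else:
                total += mismatch
    return total


score = sum_of_pairs(["ACG", "A-G", "ATG"])
```

In this example the two conserved columns contribute +3 each and the middle column contributes -5, so the total is 1.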
2. Weighted Sum-of-Pairs Score
• Improves SP score by applying weights to reduce redundancy (e.g., multiple similar
sequences).
• Helps avoid overrepresentation of closely related sequences.

3. Entropy Score (Information Content)


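Used to evaluate the variability at each column. For a column in which residue a occurs with frequency p(a), the Shannon entropy is H = -Σ p(a) log2 p(a). H is 0 for a perfectly conserved column and grows as the column becomes more variable.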
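
A short sketch of per-column entropy follows, using the Shannon entropy H = -Σ p log2 p over the residues observed in each column. Ignoring gaps when computing frequencies is an assumption made here; other conventions treat the gap as an extra symbol.

```python
import math
from collections import Counter


def column_entropy(column, ignore_gaps=True):
    """Shannon entropy (in bits) of one alignment column."""
    residues = [c for c in column if not (ignore_gaps and c == "-")]
    if not residues:
        return 0.0
    n = len(residues)
    counts = Counter(residues)
    return -sum((k / n) * math.log2(k / n) for k in counts.values())


def alignment_entropies(alignment):
    """Entropy of every column of a list of equal-length aligned sequences."""
    return [column_entropy(col) for col in zip(*alignment)]


# First column fully conserved, second column maximally variable for DNA.
entropies = alignment_entropies(["AC", "AG", "AT", "AA"])
```

Low-entropy columns are candidates for conserved motifs, while high-entropy columns mark variable positions.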
4. Consistency-Based Scoring
Used by advanced MSA tools like T-Coffee.
• Compares final alignment with a library of pairwise alignments.
• Scores columns based on how consistent they are with pairwise alignments.

Scoring Function     Purpose                                       Used In

Sum-of-Pairs (SP)    Basic scoring by adding all pairwise scores   ClustalW, MAFFT, MUSCLE

Weighted SP          Reduces over-representation bias              T-Coffee, advanced aligners

Entropy-based        Evaluates conservation at each position       Profile analysis, motif finding

Consistency-based    Increases alignment reliability               T-Coffee

Database Similarity Searching:


BLAST (Basic Local Alignment Search Tool)
BLAST is one of the most widely used tools in bioinformatics for comparing a query sequence
(DNA, RNA, or protein) against a database of sequences to find regions of local similarity.

What is BLAST?

• Full Form: Basic Local Alignment Search Tool


• Purpose: Find sequences in a database that closely match a query sequence
• Type: Local alignment algorithm
• Database search: Compares the query against large databases (e.g., NCBI, UniProt)

How BLAST Works (Steps)

1. Input Query (DNA or protein sequence)


2. Word Matching:
o Breaks query into short sequences (called words, e.g., 3-mers for proteins, 11-
mers for DNA)
3. Database Scanning:
o Searches for exact or similar word matches in database sequences
4. Extension:
o Matches are extended in both directions to find High-scoring Segment Pairs
(HSPs)
5. Scoring & Ranking:
o Uses substitution matrices (e.g., BLOSUM) and gap penalties to compute
alignment scores
6. Output:
o List of sequences with alignment scores, identities, e-values, and links to
database entries

Scoring Terms in BLAST

Term Meaning

Score Numerical value based on alignment (match, mismatch, gaps)


E-value Expected number of hits with an equal or better score occurring by chance in a database of this size. Lower = more significant

Bit Score Normalized score that allows comparison between searches

% Identity Percentage of identical residues in alignment

Query Cover How much of the query is covered by the alignment
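
The relation between raw score, bit score, and E-value can be sketched with the Karlin-Altschul formulas: bit score S' = (λS - ln K) / ln 2 and E = m · n · 2^(-S'), where m and n are the query and database lengths. The λ and K values below are illustrative placeholders only; in practice they depend on the scoring matrix and gap penalties, and BLAST uses effective (edge-corrected) lengths.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalize a raw alignment score to a bit score."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(bits, query_len, db_len):
    """Expected number of chance hits scoring at least `bits`."""
    return query_len * db_len * 2.0 ** (-bits)


# Illustrative parameters (assumed, not taken from a real search).
LAMBDA, K = 0.267, 0.041
bits = bit_score(raw_score=100, lam=LAMBDA, k=K)
expect = e_value(bits, query_len=250, db_len=50_000_000)
```

Because the bit score already absorbs λ and K, bit scores from searches with different scoring systems can be compared directly, whereas the E-value also depends on the database size.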

Types of BLAST Programs

Program Query Type Database Type Use Case

BLASTn Nucleotide Nucleotide Finding similar DNA sequences

BLASTp Protein Protein Protein homology and function prediction

BLASTx Nucleotide Protein Translates DNA → Protein, then searches


tBLASTn Protein Nucleotide Protein → translated DNA

tBLASTx Nucleotide Nucleotide Translates both and compares

Applications of BLAST

1. Gene/Protein Identification
2. Function Prediction
3. Homology Detection
4. Annotation of Genomic Data
5. SNP or Mutation Analysis
6. Evolutionary Studies
7. Drug Target Discovery

BLAST Online Tool:

NCBI BLAST portal: https://blast.ncbi.nlm.nih.gov/Blast.cgi

FASTA Format in Bioinformatics


FASTA is one of the simplest and most widely used file formats in bioinformatics for
representing nucleotide or protein sequences.

PHI-BLAST (Pattern-Hit Initiated BLAST)

PHI-BLAST is a specialized version of the BLAST algorithm that combines pattern matching
with sequence similarity searching. It’s useful when you know a specific motif or conserved
pattern in your protein and want to find other sequences that both:
1. Contain the same pattern, and
2. Are homologous (similar) to your query sequence.

What is PHI-BLAST?

• Full Form: Pattern-Hit Initiated BLAST
• Purpose: Find protein sequences in a database that:
o Contain a predefined pattern (motif), and
o Are significantly similar to the query sequence in the surrounding regions.
How PHI-BLAST Works

1. User Inputs:
o A protein sequence (query)
o A motif/pattern (in PROSITE syntax)
2. PHI-BLAST Search:
o Searches a protein database for sequences that match the pattern
o Among these, it performs local alignments to find statistically significant
matches
3. Output:
o List of protein sequences that contain the motif and have significant similarity to
the query sequence.
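
To make the pattern step concrete, the sketch below converts a simple PROSITE-style pattern into a Python regular expression and scans a sequence with it. It covers only the basic syntax (x, [..], {..}, repeat counts, and the < and > anchors) and is not how PHI-BLAST itself is implemented; the helper name and the test sequence are assumptions.

```python
import re


def prosite_to_regex(pattern):
    """Translate a simple PROSITE pattern into a Python regular expression."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        prefix = "^" if element.startswith("<") else ""   # N-terminal anchor
        suffix = "$" if element.endswith(">") else ""     # C-terminal anchor
        element = element.lstrip("<").rstrip(">")
        # Split an optional repeat count such as x(2,4) or A(3) from the core.
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = m.groups()
        if core.lower() == "x":
            core = "."                                    # any residue
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"                # excluded residues
        if low is not None:
            core += "{%s}" % low if high is None else "{%s,%s}" % (low, high)
        parts.append(prefix + core + suffix)
    return "".join(parts)


# N-glycosylation site pattern written in PROSITE syntax.
regex = prosite_to_regex("N-{P}-[ST]-{P}")
hit = re.search(regex, "AANGSAA")   # made-up sequence for illustration
```

A sequence that passes this pattern filter would then still need a significant local alignment to the query to be reported.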

Application Description

Functional annotation Identifies proteins with similar functions

Motif-based homology search More specific than standard BLAST


Protein family classification Finds members of a protein family with conserved motifs

Evolutionary analysis Combines sequence conservation and motif preservation

Domain-specific search Focused analysis around functional sites

PSI-BLAST (Position-Specific Iterated BLAST)

PSI-BLAST is an advanced BLAST variant that improves the detection of remote homologous
sequences by using a position-specific scoring matrix (PSSM), which gets refined over multiple
iterations. It’s especially useful for finding distant evolutionary relationships that standard
BLAST might miss.

What is PSI-BLAST?

• Full Form: Position-Specific Iterated BLAST


• Purpose: Detect distant protein homologs by creating and refining a PSSM over
multiple search rounds.
• Input: A protein sequence only.

How PSI-BLAST Works (Step-by-Step)

1. First Iteration:
o Performs a standard BLASTp search.
o Identifies sequences with significant similarity.
2. PSSM Creation:
o From aligned hits, PSI-BLAST builds a Position-Specific Scoring Matrix.
o This matrix contains evolutionary information (which amino acids are conserved
at each position).
3. Subsequent Iterations:
o The PSSM is used to search again, detecting distant homologs that match the
conserved profile, even if they have low overall identity.
o The user may choose to include/exclude hits for refining the matrix.
4. Stopping Criteria:
o Iterations stop when no new significant matches are found (convergence) or when a
user-set maximum number of iterations is reached.
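
The PSSM idea can be sketched as follows: count residues per column of the aligned hits, add pseudocounts, and store log-odds scores against background frequencies. This is a deliberately simplified version; PSI-BLAST additionally weights sequences and uses data-dependent pseudocounts. The uniform background, the pseudocount of 1, and the toy hits are assumptions.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned_hits, pseudocount=1.0, background=None):
    """Build a log-odds PSSM (list of dicts, one per column) from aligned hits."""
    background = background or {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
    pssm = []
    for column in zip(*aligned_hits):
        counts = Counter(c for c in column if c in background)  # skip gaps
        total = sum(counts.values()) + pseudocount * len(background)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background[aa])
            for aa in background
        })
    return pssm


def score_with_pssm(pssm, segment):
    """Score an ungapped segment of the same length as the PSSM."""
    return sum(row.get(aa, 0.0) for row, aa in zip(pssm, segment))


hits = ["MKV", "MKI", "MRV"]          # toy aligned hits from a first iteration
pssm = build_pssm(hits)
```

Residues that are conserved in a column receive positive scores, so a distant homolog that keeps those key positions can score well even if its overall identity is low.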

Why Use PSI-BLAST?

Reason Explanation
Improved sensitivity Finds remote homologs missed by normal BLAST

Profile-based searching Uses biologically relevant conservation info

Evolutionary insight Detects protein families, domains, and functional motifs

Applications of PSI-BLAST

• Detecting protein families and superfamilies


• Predicting function of unknown proteins
• Exploring evolutionary relationships
• Finding weak but meaningful sequence similarity

BLAST Algorithm (Basic Local Alignment Search Tool)


BLAST is a powerful and widely used algorithm for comparing a query biological sequence
(DNA, RNA, or protein) against a large database of sequences, to find regions of local similarity.
It is designed to be fast and sensitive, allowing researchers to identify homologous sequences
quickly.

How the BLAST Algorithm Works: Step-by-Step

1. Query Sequence Input:
o You provide a nucleotide or protein sequence as the query.
2. Word (K-mer) Generation:
o The query is broken into short subsequences called words or k-mers (default
length depends on the sequence type, e.g., 3 for proteins, 11 for DNA).
o These words serve as seeds for the search.
3. Word Matching in Database:
o The database is scanned to find exact or similar matches to these words.
o BLAST uses a lookup table to find where words occur in the database sequences.
4. Extension of Hits:
o Each word match is extended in both directions to find longer alignments called
High-scoring Segment Pairs (HSPs).
o Extension stops when the running score falls more than a set amount below the best
score seen so far.
5. Scoring Alignments:
o Alignments are scored using substitution matrices (e.g., BLOSUM62 for
proteins) and gap penalties.
o Only alignments above a certain score threshold are kept.
6. Statistical Evaluation:
o The E-value (expectation value) estimates how many alignments with an equal or better
score would be expected by chance. Lower E-values indicate more significant matches.
7. Results Output:
o BLAST produces a ranked list of database sequences similar to the query,
showing alignment details, scores, identities, and E-values.
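
The seed-and-extend steps above can be sketched in a few lines. This toy version uses exact word matches and ungapped extension with a simple drop-off rule; it omits neighborhood words, gapped extension, and statistics, so it should be read as an illustration of the idea rather than BLAST's actual implementation. All parameter values are assumptions.

```python
def find_seeds(query, subject, k=3):
    """Return (query_pos, subject_pos) pairs where a k-mer matches exactly."""
    index = {}
    for i in range(len(query) - k + 1):
        index.setdefault(query[i:i + k], []).append(i)   # lookup table of words
    seeds = []
    for j in range(len(subject) - k + 1):
        for i in index.get(subject[j:j + k], []):
            seeds.append((i, j))
    return seeds


def extend_seed(query, subject, i, j, k=3, match=1, mismatch=-1, x_drop=2):
    """Extend a seed without gaps in both directions.

    Returns (score, query_start, subject_start, length) of the best segment.
    """
    def extend(qi, sj, step):
        score = best = best_len = n = 0
        while 0 <= qi < len(query) and 0 <= sj < len(subject):
            score += match if query[qi] == subject[sj] else mismatch
            n += 1
            if score > best:
                best, best_len = score, n
            elif best - score > x_drop:
                break                      # score dropped too far: stop
            qi += step
            sj += step
        return best, best_len

    right_score, right_len = extend(i + k, j + k, +1)
    left_score, left_len = extend(i - 1, j - 1, -1)
    score = k * match + left_score + right_score
    return score, i - left_len, j - left_len, left_len + k + right_len


seeds = find_seeds("GATTACA", "CCGATTACATT")
hsp = extend_seed("GATTACA", "CCGATTACATT", *seeds[0])
```

Here the first seed extends to cover the whole query, giving a segment of length 7 that starts at position 2 of the subject (0-based).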

UNIT-III

Phylogenetics Basics
Phylogenetics is the study of evolutionary relationships among biological species or entities
based on genetic, morphological, or molecular data. It helps to reconstruct the "family tree" or
phylogenetic tree showing how organisms are related through common ancestry.
Types of Phylogenetic Trees

Type Description

Rooted Tree Shows direction of evolution, with a single common ancestor at the root

Unrooted Tree Shows relationships but no information about common ancestor or direction

Cladogram Shows only the branching order; branch lengths are not proportional to time or changes

Phylogram Branch lengths proportional to evolutionary change or time

Steps in Phylogenetic Analysis


1. Data Collection:
Obtain sequences (DNA, RNA, or protein) or morphological data.
2. Multiple Sequence Alignment (MSA):
Align sequences to identify homologous positions.
3. Model Selection:
Choose an evolutionary model describing how sequences change over time.
4. Tree Construction:
Use methods like:
o Distance-based (e.g., Neighbor-Joining)
o Character-based (e.g., Maximum Parsimony, Maximum Likelihood, Bayesian
Inference)
5. Tree Evaluation:
Assess reliability with methods like bootstrapping.
Applications of Phylogenetics
• Understanding evolutionary relationships among species
• Tracing the origin and spread of pathogens
• Studying gene family evolution and duplication
• Conservation biology and species classification
• Molecular clock studies estimating divergence times

Molecular Clock Theory


The Molecular Clock Theory is a method used in molecular evolution to estimate the time of
divergence between species or genes based on the assumption that genetic mutations accumulate
at a relatively constant rate over time.
How Molecular Clock Works
1. Mutation Rate:
o Assume a roughly constant rate of nucleotide or amino acid substitutions per
unit time.
2. Sequence Comparison:
o Count the number of differences (mutations) between two homologous
sequences.
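3. Time Estimation:
o Divide the genetic distance d between the two sequences by twice the substitution
rate r, because both lineages accumulate changes after the split: T = d / (2r).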
4. Calibration:
o Use known fossil records or geological events to calibrate the clock.
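
A minimal worked example of the clock calculation, assuming a strict clock: a calibration point gives the rate r = d / (2T), and the same rate then converts another distance into a divergence time T = d / (2r). The distances and the calibration date below are invented for illustration.

```python
def calibrate_rate(distance, known_time):
    """Substitution rate per site per year per lineage from a dated split."""
    return distance / (2.0 * known_time)


def divergence_time(distance, rate_per_lineage):
    """Time since two lineages split, given their pairwise distance."""
    return distance / (2.0 * rate_per_lineage)


# Hypothetical calibration: 0.10 substitutions/site for a split dated at 5 My.
rate = calibrate_rate(distance=0.10, known_time=5e6)
# Apply the calibrated rate to a pair with 0.20 substitutions/site.
estimated_time = divergence_time(distance=0.20, rate_per_lineage=rate)
```

The factor of 2 reflects that the observed distance is the sum of the changes along both lineages since their common ancestor.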

Ultrametric Trees
An ultrametric tree is a special type of rooted phylogenetic tree where all the tips (leaves) are
equidistant from the root. This means that the distance from the root to any leaf (representing
present-day species or sequences) is the same across the tree, reflecting the idea that all
sequences have evolved for the same amount of time.

Distance Matrix Methods – UPGMA

What is UPGMA?
• UPGMA stands for Unweighted Pair Group Method with Arithmetic Mean.
• It is a simple hierarchical clustering method used in phylogenetics to construct
ultrametric trees (rooted trees with equal distances from root to leaves).
• UPGMA builds a tree based on a distance matrix representing pairwise evolutionary
distances between sequences or species.

How does UPGMA work? — Step-by-step


1. Start with a distance matrix:
o The matrix contains pairwise distances between all sequences.
2. Find the closest pair:
o Identify the two clusters (initially, each sequence is its own cluster) with the
smallest distance.
3. Merge clusters:
o Combine the two closest clusters into a new cluster.
4. Update the distance matrix:
o Calculate distances from the new cluster to all other clusters as the arithmetic
mean of the distances of the merged clusters, weighted by cluster size:
d(A∪B, X) = (|A|·d(A, X) + |B|·d(B, X)) / (|A| + |B|)
5. Repeat:
o Repeat steps 2-4 until all sequences are clustered into a single tree.
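
The steps above can be sketched as a compact UPGMA implementation that returns a Newick string. The distance update uses the size-weighted average d(A∪B, X) = (|A|·d(A,X) + |B|·d(B,X)) / (|A| + |B|), and each merge is placed at height d/2. The function name, the two-decimal branch formatting, and the five-taxon example matrix are assumptions for illustration.

```python
def upgma(names, dist):
    """Build an ultrametric tree (Newick string) from a symmetric distance matrix."""
    n = len(names)
    # cluster id -> (newick text, number of leaves, height of the cluster node)
    clusters = {i: (name, 1, 0.0) for i, name in enumerate(names)}
    d = {(i, j): dist[i][j] for i in range(n) for j in range(i + 1, n)}
    next_id = n

    def get(a, b):
        return d[(a, b) if a < b else (b, a)]

    while len(clusters) > 1:
        a, b = min(d, key=d.get)                  # closest pair of clusters
        height = d[(a, b)] / 2.0
        (na, sa, ha), (nb, sb, hb) = clusters[a], clusters[b]
        newick = "(%s:%.2f,%s:%.2f)" % (na, height - ha, nb, height - hb)
        for c in clusters:
            if c not in (a, b):                   # size-weighted average distance
                d[(c, next_id)] = (sa * get(a, c) + sb * get(b, c)) / (sa + sb)
        d = {pair: v for pair, v in d.items() if a not in pair and b not in pair}
        del clusters[a], clusters[b]
        clusters[next_id] = (newick, sa + sb, height)
        next_id += 1

    (newick, _, _), = clusters.values()
    return newick + ";"


matrix = [
    [0, 17, 21, 31, 23],
    [17, 0, 30, 34, 21],
    [21, 30, 0, 28, 39],
    [31, 34, 28, 0, 43],
    [23, 21, 39, 43, 0],
]
tree = upgma(["a", "b", "c", "d", "e"], matrix)
```

Every leaf in the resulting tree lies at the same distance from the root, which is the ultrametric property that UPGMA assumes.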

Key Characteristics of UPGMA:

Feature Description
Produces Ultrametric (rooted) tree

Assumes Constant molecular clock rate across lineages

Clustering approach Agglomerative hierarchical clustering

Input Pairwise distance matrix

Distance update method Arithmetic mean of merged clusters' distances

Output Phylogenetic tree with branch lengths proportional to time

Advantages of UPGMA:
• Simple and easy to implement.
• Fast and computationally efficient.
• Provides an explicit time scale due to ultrametric assumption.

Distance Matrix Methods – Neighbor-Joining (NJ)

What is Neighbor-Joining (NJ)?


• Neighbor-Joining is a popular distance-based method for reconstructing phylogenetic
trees.
• Unlike UPGMA, NJ produces an unrooted tree and does not assume a constant
molecular clock.
• It’s widely used because it is fast, efficient, and often produces more accurate trees when
evolutionary rates vary among lineages.

How does Neighbor-Joining work? — Step-by-step


1. Start with a distance matrix:
o Contains pairwise distances between all sequences or taxa.
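2. Calculate the Q-matrix:
o For n taxa, Q(i,j) = (n - 2)·d(i,j) - Σk d(i,k) - Σk d(j,k).
3. Find the pair with the smallest Q-value:
o This pair is chosen to be neighbors (closest relatives) and will be joined in the
tree.
4. Join the pair into a new node u:
o Calculate branch lengths from the new node to each neighbor:
δ(i,u) = d(i,j)/2 + [Σk d(i,k) - Σk d(j,k)] / (2(n - 2)), and δ(j,u) = d(i,j) - δ(i,u).
5. Update the distance matrix:
o Remove the joined taxa and add the new node.
o Calculate distances from u to every remaining taxon k:
d(u,k) = [d(i,k) + d(j,k) - d(i,j)] / 2.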
6. Repeat steps 2-5:
o Continue until only two nodes remain, which are then connected.
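
One iteration of Neighbor-Joining can be sketched as below, using the standard formulas Q(i,j) = (n - 2)·d(i,j) - Σk d(i,k) - Σk d(j,k) for pair selection, δ(i,u) = d(i,j)/2 + (Σk d(i,k) - Σk d(j,k)) / (2(n - 2)) for branch lengths, and d(u,k) = (d(i,k) + d(j,k) - d(i,j)) / 2 for the new node. The function names and the five-taxon example matrix are assumptions; a full implementation would repeat this step until two nodes remain.

```python
def q_matrix(dist):
    """Q-matrix for a symmetric distance matrix with n > 2 taxa."""
    n = len(dist)
    sums = [sum(row) for row in dist]
    return [
        [(n - 2) * dist[i][j] - sums[i] - sums[j] if i != j else 0.0
         for j in range(n)]
        for i in range(n)
    ]


def nj_step(dist):
    """Perform one join: return (pair, branch_lengths, distances_to_new_node)."""
    n = len(dist)
    q = q_matrix(dist)
    pairs = ((a, b) for a in range(n) for b in range(a + 1, n))
    i, j = min(pairs, key=lambda p: q[p[0]][p[1]])      # smallest Q-value
    sums = [sum(row) for row in dist]
    len_i = dist[i][j] / 2.0 + (sums[i] - sums[j]) / (2.0 * (n - 2))
    len_j = dist[i][j] - len_i
    new_dists = {
        k: (dist[i][k] + dist[j][k] - dist[i][j]) / 2.0
        for k in range(n) if k not in (i, j)
    }
    return (i, j), (len_i, len_j), new_dists


matrix = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
pair, lengths, new_dists = nj_step(matrix)
```

Note that the first pair joined here (taxa 0 and 1, distance 5) is not the pair with the smallest raw distance (taxa 3 and 4, distance 3); the Q-matrix correction is what lets NJ cope with unequal rates.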

Key Characteristics of Neighbor-Joining:

Feature Description

Produces Unrooted phylogenetic tree

Assumes No molecular clock (variable evolutionary rates allowed)

Clustering approach Agglomerative but uses corrected distance (Q-matrix)

Input Pairwise distance matrix

Output Tree with branch lengths proportional to evolutionary distances

Accuracy More accurate than UPGMA when rates vary

Character-Based Methods – Maximum Parsimony (MP)

What is Maximum Parsimony?

Maximum Parsimony (MP) is a character-based method used in phylogenetic analysis. It aims
to find the simplest tree: the one that explains the observed data with the fewest evolutionary
changes (mutations).
Principle of Parsimony:
"The simplest explanation is preferred."
In phylogenetics, this means choosing the tree with minimum total number of character state
changes.
Key Features of Maximum Parsimony

Feature Description

Data Type Uses aligned characters (nucleotides, amino acids, etc.)

Tree Type Usually unrooted (can be rooted using an outgroup)

Goal Find the tree requiring minimum evolutionary steps

Approach Character-based, not dependent on distance matrices


Output One or more equally parsimonious trees

How Does MP Work?

1. Input:
o Aligned sequences (DNA, RNA, or protein).
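o Each character (an alignment column, with states such as A, T, G, C) is analyzed independently.
2. Generate candidate tree topologies for the given taxa (all of them for a handful of taxa; heuristic searches for larger datasets).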
3. Evaluate each tree:
o For each character, determine the minimum number of changes needed to
explain its distribution on the tree.
o Sum over all characters to get the tree length.
4. Select the tree with the least total changes (most parsimonious)
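
The "evaluate each tree" step is commonly done with Fitch's algorithm, sketched below for rooted binary trees written as nested tuples. The source does not name a specific algorithm, so treating Fitch's method as the scoring routine is an assumption, as are the toy alignment and tree encodings.

```python
def fitch_score(tree, states):
    """Minimum number of state changes for one character on a binary tree.

    tree: nested 2-tuples of leaf names, e.g. (("A", "B"), ("C", "D")).
    states: dict mapping leaf name -> character state at this site.
    """
    changes = 0

    def visit(node):
        nonlocal changes
        if isinstance(node, str):
            return {states[node]}
        left, right = (visit(child) for child in node)
        common = left & right
        if common:
            return common           # children agree: no change needed here
        changes += 1                # children disagree: one change required
        return left | right

    visit(tree)
    return changes


def tree_length(tree, alignment):
    """Sum of Fitch scores over all alignment columns (the tree length)."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch_score(tree, {name: seq[i] for name, seq in alignment.items()})
        for i in range(n_sites)
    )


alignment = {"A": "AAG", "B": "AAG", "C": "TAC", "D": "TAC"}
length_1 = tree_length((("A", "B"), ("C", "D")), alignment)
length_2 = tree_length((("A", "C"), ("B", "D")), alignment)
```

The first topology needs 2 changes and the second needs 4, so the first is the more parsimonious of the two.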

Tools for MP Analysis


• MEGA
• PHYLIP

Methods of Evaluating Phylogenetic Trees – Bootstrapping

Why Evaluate Phylogenetic Trees?

Phylogenetic trees are hypotheses about evolutionary relationships. Since multiple trees
may explain the data similarly well, we need methods to assess the confidence or
reliability of inferred trees or tree branches.

What is Bootstrapping in Phylogenetics?

Bootstrapping is a statistical resampling method used to assess the reliability of tree
branches (clades) in phylogenetic analysis.
It estimates how consistently a particular branch (or grouping) appears across many replicated
analyses of resampled data.

How Bootstrapping Works – Step-by-Step

1. Start with aligned sequence data (e.g., DNA, protein alignment).


2. Generate multiple replicate datasets:
o Each bootstrap replicate is created by randomly resampling (with replacement)
columns from the original alignment.
o Each replicate has the same number of columns as the original.
3. Reconstruct a tree for each replicate using a chosen method (e.g., MP, NJ, ML).
4. Count how often each clade appears across all replicate trees.
5. Assign bootstrap support values:
o For each clade in the original tree, the bootstrap value is the percentage of
replicate trees in which that clade appears.
o Expressed as a percentage (e.g., 80% support means the clade appeared in 80
out of 100 trees).
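
The resampling and counting steps can be sketched as follows. Tree reconstruction itself is left out: the sketch assumes some tree-building method has already turned each replicate into a set of clades, each clade represented as a frozenset of taxon names. Function names and example data are assumptions.

```python
import random


def bootstrap_replicate(alignment, rng):
    """Resample alignment columns with replacement (same number of columns)."""
    n_cols = len(next(iter(alignment.values())))
    cols = [rng.randrange(n_cols) for _ in range(n_cols)]
    return {name: "".join(seq[c] for c in cols) for name, seq in alignment.items()}


def clade_support(replicate_clades, clade):
    """Percentage of replicate trees that contain the given clade."""
    hits = sum(1 for clades in replicate_clades if clade in clades)
    return 100.0 * hits / len(replicate_clades)


aln = {"A": "ACGTACGT", "B": "ACGTTCGA"}
replicate = bootstrap_replicate(aln, random.Random(0))

# Clade sets from four hypothetical replicate trees.
ab = frozenset({"A", "B"})
support = clade_support([{ab}, set(), {ab}, {ab}], ab)
```

Because whole columns are resampled, every taxon keeps its residues aligned with the others in each replicate.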

Interpreting Bootstrap Values

Bootstrap Value (%) Confidence Interpretation

≥ 90% Very strong support

70% – 89% Moderate to strong support

50% – 69% Weak support

< 50% Clade not considered reliable

Example

• If a branch grouping species A and B appears in 95 out of 100 bootstrap replicates, the
branch is labeled 95% in the final tree.
• It means high confidence that A and B share a close evolutionary relationship.

Advantages of Bootstrapping

• Provides a quantitative measure of tree reliability.


• Applicable to various phylogenetic methods: Maximum Parsimony, Maximum
Likelihood, Neighbor-Joining, etc.
• Helps identify robust clades vs. uncertain relationships.

Methods of Evaluating Phylogenetic Trees – Jackknifing

What is Jackknifing in Phylogenetics?

Jackknifing is a resampling-based statistical method used to assess the stability and
reliability of branches (clades) in a phylogenetic tree. It is similar in spirit to bootstrapping,
but resamples the data differently.
It involves removing a random subset of the data (usually columns in an alignment) and then
reconstructing trees to evaluate how often a clade appears.

How Jackknifing Works – Step-by-Step

1. Start with a multiple sequence alignment (DNA, RNA, or protein).


2. Create jackknife replicate datasets:
o Randomly delete a fixed proportion of characters (e.g., 30–50%) from the
alignment without replacement.
o The remaining data (e.g., 50–70%) is used to reconstruct the tree.
3. Reconstruct phylogenetic trees from each reduced dataset using a method like MP, NJ,
or ML.
4. Track how frequently each clade appears across all replicate trees.
5. Assign jackknife support values to the nodes:
o The support value indicates the percentage of replicates in which the clade
appears.
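
A jackknife replicate differs from a bootstrap replicate only in how columns are chosen: a fraction is deleted and the rest are kept once each, in their original order. The sketch below shows that step; the 30% default and the toy alignment are assumptions.

```python
import random


def jackknife_replicate(alignment, delete_fraction=0.3, rng=None):
    """Delete a random fraction of alignment columns (without replacement)."""
    rng = rng or random.Random()
    n_cols = len(next(iter(alignment.values())))
    n_keep = n_cols - round(delete_fraction * n_cols)
    kept = sorted(rng.sample(range(n_cols), n_keep))   # keep original order
    return {name: "".join(seq[c] for c in kept) for name, seq in alignment.items()}


aln = {"A": "ABCDEFGHIJ", "B": "abcdefghij"}   # letters mark column positions
replicate = jackknife_replicate(aln, delete_fraction=0.3, rng=random.Random(1))
```

Each replicate is shorter than the original alignment and never contains the same column twice.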

Interpreting Jackknife Support Values

Jackknife Value (%) Confidence Interpretation

> 85% Strong support

70% – 85% Moderate support

< 70% Weak support

Note: Jackknife values are usually slightly lower than bootstrap values.

Advantages of Jackknifing

• Helps assess robustness of tree topology.


• Avoids potential biases of bootstrap resampling (e.g., overrepresentation of some
characters).
• Simpler computation since it avoids repeated sampling of the same data.

Jackknifing vs. Bootstrapping

Feature                 Jackknifing                                Bootstrapping

Data Resampling         Deletes a fraction of columns              Resamples columns with replacement

Variability             Less variation in replicates               More variation due to replacement

Computation             Slightly faster, fewer replicates needed   Requires more replicates for stability

Use in Phylogenetics    Less common but still valid                More widely used

UNIT-IV

What is Gene Prediction?


Gene prediction is the process of identifying the locations of genes (coding regions) in a genomic
DNA sequence. This is a crucial step in genome annotation, especially for newly sequenced
organisms.
It involves detecting features like exons, introns, promoters, start/stop codons, etc., to predict
protein-coding genes and non-coding RNAs.

Types of Gene Prediction Methods

Gene prediction approaches can be broadly classified into two types:

Type Description

Ab initio (de novo) Uses statistical models and signals in DNA sequence alone

Homology-based Uses known genes or sequences from other organisms

Ab Initio Gene Prediction


• Based on sequence features: codon usage, GC content, exon/intron boundaries, ORFs
(Open Reading Frames), etc.
• Common algorithms:
o Hidden Markov Models (HMM)
o Neural networks and machine learning approaches

Tools:

• GENSCAN
• GeneMark
• Glimmer
• AUGUSTUS

Homology-Based Gene Prediction

• Compares the input genome to known genes or protein sequences from related species.
• Relies on alignment tools like BLAST, tBLASTn, or spliced alignment.

Tools:

• BLAST
• Exonerate
• GeneWise
• EST2GENOME

Steps in Gene Prediction

1. Input: Raw DNA sequence (e.g., entire genome or chromosome).


2. Scan for signals:
o Start codons (ATG)
o Stop codons (TAA, TAG, TGA)
o Splice sites (GT-AG rule)
o Promoter regions
3. Detect coding regions:
o Open Reading Frames (ORFs)
o Codon usage bias
4. Use prediction model/tool (ab initio or homology-based).
5. Output: Annotated gene structure:
o Exons, introns, UTRs
o Predicted protein sequence
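
The signal-scanning and ORF-detection steps can be illustrated with a very small ORF finder. It looks only at the three forward reading frames and ignores introns, the reverse strand, and codon usage, so it is closer to a prokaryote-style first pass than to a real gene predictor. The function name, the minimum length, and the example sequence are assumptions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Find ATG...stop ORFs on the forward strand.

    Returns a list of (start, end, frame) with 0-based start and exclusive end;
    the stop codon is included in the reported span.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos                          # first start codon in frame
            elif start is not None and codon in STOP_CODONS:
                if (pos + 3 - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3, frame))
                start = None                         # look for the next ORF
    return orfs


orfs = find_orfs("CCATGAAATTTTAGGG")
```

For this sequence the only ORF is ATG AAA TTT TAG in frame 2, spanning positions 2 to 14.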
What is a Conserved Domain?
A conserved domain is a recurring structural or functional unit in a protein that has remained
relatively unchanged (conserved) during evolution.
These domains often correlate with specific biological functions, such as binding DNA,
catalyzing reactions, or interacting with other proteins.

Purpose of Conserved Domain Analysis

Conserved domain analysis aims to:


• Identify known domains in a query protein sequence
• Predict the function of unknown proteins
• Study evolutionary relationships
• Understand protein structure-function relationships

How Conserved Domain Analysis Works

1. Input: A protein sequence (FASTA format).


2. Search the sequence against domain databases.
3. Align the sequence with known conserved domains.
4. Output: Domain hits, positions, and functional annotations.

Popular Tools for Conserved Domain Analysis

Tool / Database Description

NCBI CD-Search Compares protein sequences against the Conserved Domain Database (CDD)

InterPro Integrates several databases (Pfam, SMART, TIGRFAMs, PROSITE, etc.)

Pfam Database of protein families and domains using HMMs

SMART Identifies signaling domains, regulatory motifs

ScanProsite Scans for PROSITE motifs and profiles

HMMER Uses Hidden Markov Models to search against domain profiles


What is Protein Structure Visualization?

Protein structure visualization refers to using software tools to view and analyze the 3D
structure of proteins. It helps researchers understand protein folding, active sites, binding
interactions, and structure-function relationships.

Why Visualize Protein Structures?

Purpose Explanation

Understanding protein function Structure reveals mechanisms like catalysis or binding

Drug design Visualizing binding pockets for ligand docking

Mutation impact analysis Locating mutation sites to assess structural effect

Education & communication Visual tools aid in teaching molecular biology concepts

Popular Protein Visualization Tools

Tool Description

PyMOL Widely used; powerful, scriptable, publication-quality images

Chimera / ChimeraX Advanced analysis; good for large complexes and density maps

RasMol Lightweight; good for quick 3D visualization

Jmol Java-based, web-friendly visualization

iCn3D (NCBI) Web-based viewer for 3D structure + annotations

Mol* Viewer (RCSB PDB) Modern, fast web tool from RCSB for visualizing PDB files

Common File Formats

Format Description

.pdb Protein Data Bank format (atomic coordinates)

.cif mmCIF format (successor to the legacy PDB format)

.mol2 Molecular structures with charges/bonds

.sdf Structure-data file for small molecules

What Can You Visualize?


Feature Description

Backbone/secondary structure α-helices, β-sheets, loops

Surface and volume Solvent-accessible and molecular surfaces

Ligand interactions Binding sites and non-covalent interactions

Electrostatics Charge distribution over protein surface

Mutations and domains Highlight regions/domains/mutations

What is Protein Secondary Structure?


Protein secondary structure refers to localized, repetitive structural motifs formed by hydrogen
bonding in the polypeptide backbone.
There are three major types:

Structure Description

Alpha-helix (α) Right-handed coil stabilized by H-bonds

Beta-sheet (β) Extended strands connected by hydrogen bonds

Coil/Loop Irregular or flexible regions (non-α/β structures)

Why Predict Secondary Structure?

• To infer protein function when 3D structure is unknown


• To aid in tertiary structure prediction
• To design mutations or modifications in proteins
• To support structural annotations in genome projects

Basic Principles of Prediction

Secondary structure prediction typically relies on:


• Amino acid propensities (likelihood of forming α/β/coils)
• Sliding window approaches (local sequence analysis)
• Multiple sequence alignments (evolutionary conservation)

Common Methods & Algorithms


Method Description

Chou-Fasman Early method using amino acid propensities

GOR Method Uses information theory and probability

Neural Networks (e.g., PSIPRED) Most accurate, uses MSA + machine learning

Popular Tools for Secondary Structure Prediction

Tool / Server Features

PSIPRED High-accuracy; uses neural networks + MSA

JPred Web-based; reliable secondary structure prediction

SOPMA Fast, basic method using statistical analysis

PORTER Deep learning-based predictor

PHYRE2 Combines secondary + 3D structure prediction

Example Workflow: Predicting with PSIPRED

1. Go to: http://bioinf.cs.ucl.ac.uk/psipred/
2. Paste your protein sequence (FASTA format).
3. Submit and wait for results.
4. Output:
o Sequence with predicted secondary structures (H = Helix, E = Strand, C = Coil)
o Confidence scores
o Graphical visualization

Typical Output Format

AA A A A A V V L L E E E G G
SS H H H H H H C C E E E C C

Symbol Meaning

H Alpha-helix

E Beta-strand

C Coil or loop
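
A predicted H/E/C string like the one above is often easier to read as a list of segments. The helper below is a small sketch of that post-processing step; it does not reproduce any server's actual output parser, and the 1-based residue numbering is a choice made here.

```python
from itertools import groupby


def ss_segments(ss):
    """Collapse a secondary-structure string into (state, start, end) segments.

    Positions are 1-based and inclusive.
    """
    segments = []
    pos = 0
    for state, run in groupby(ss):
        length = len(list(run))
        segments.append((state, pos + 1, pos + length))
        pos += length
    return segments


segments = ss_segments("HHHHHHCCEEECC")
```

For the example string this yields a helix at residues 1 to 6, a strand at residues 9 to 11, and coil elsewhere.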
What is Tertiary Structure Prediction?
Tertiary structure prediction involves determining the 3D structure of a protein based on its
amino acid sequence. Among the various methods, homology modeling (also known as
comparative modeling) is the most widely used when a similar structure (template) is already
known.

What is Homology Modeling?

Homology modeling predicts the 3D structure of a target protein using an experimentally
determined structure of a homologous protein (template) with a similar sequence.
Key assumption: Proteins with similar sequences adopt similar structures.

Steps in Homology Modeling

Step No. Step Description

1 Template Identification (via BLAST or HHblits)

2 Sequence Alignment (align target to template)

3 Model Building (build 3D model from alignment)

4 Model Refinement (correct side chains, loops)

5 Model Validation (check stereochemistry, clashes)

Popular Tools for Homology Modeling

Tool / Server Features

SWISS-MODEL Fully automated, web-based; good for beginners

Modeller Command-line tool; flexible and scriptable

Phyre2 Uses fold recognition when homology is weak

I-TASSER Hybrid method; includes threading + ab initio if needed

RaptorX Handles low-homology sequences well using deep learning

Example Workflow – Using SWISS-MODEL

1. Go to https://swissmodel.expasy.org
2. Submit your protein sequence (FASTA format)
3. It automatically:
o Finds the best template
o Aligns the sequence
o Builds a 3D model
4. Download/view the predicted 3D structure (PDB format)
5. Use structure viewers like PyMOL or Mol* to analyze

What is Threading?
Threading, also called fold recognition, is a protein structure prediction method used
when:
• There is low sequence similarity (<30%) between the query protein and known
structures.
• Homology modeling fails due to lack of a suitable close template.
It tries to "thread" the amino acid sequence of the unknown protein onto known folds (3D
templates) in structural databases, even when there's no clear sequence homology.

How Does Threading Work?

Step Description

1 Compare the target sequence against a library of known structures

2 Try fitting (threading) the sequence onto each fold

3 Score each model using energy functions, compatibility & alignment

4 Choose the best-scoring fold as the predicted 3D model

Threading evaluates both sequence compatibility and structural environment (e.g.,
burial, secondary structure match, etc.).

Popular Threading Tools

Tool / Server Features

I-TASSER Combines threading + ab initio; high accuracy

Phyre2 Fold recognition with profile matching; very user-friendly

RaptorX Deep learning-based threading; handles remote homologs

LOMETS Meta-server combining multiple threading algorithms

SPARKS-X Improved scoring with neural networks and statistical energy functions

When to Use Threading?

Situation Method

Sequence identity > 30% Homology modeling

Identity 20–30% Threading

Identity < 20%, no known fold Ab initio

Example Workflow – Using Phyre2


1. Visit: http://www.sbg.bio.ic.ac.uk/phyre2
2. Paste your protein sequence (FASTA format).
3. Choose Intensive mode for best threading predictions.
4. Wait for results:
o Predicted 3D model
o Secondary structure
o Confidence scores
o Functional insights

What is Ab-initio Prediction?


Ab-initio (Latin for “from the beginning”) prediction refers to predicting a protein’s 3D
structure using only its amino acid sequence, without relying on template structures from
known databases.
It is based on biophysical principles and energetics, not sequence similarity or known folds.

How Does It Work?

Ab-initio methods predict structure by:


1. Sampling a large number of possible conformations of the protein (an exhaustive search is not feasible).
2. Using energy functions to evaluate each conformation.
3. Selecting structures with lowest free energy (most stable state).
Steps in Ab-initio Prediction

Step No. Step Description

1 Use the amino acid sequence as input

2 Generate many possible structures (decoys) randomly

3 Evaluate them with physics-based or knowledge-based energy functions

4 Refine and select the lowest-energy, most stable model

Popular Tools for Ab-initio Prediction

Tool / Server Features

Rosetta Highly accurate; uses fragment assembly and energy minimization

QUARK Designed specifically for ab-initio prediction

AlphaFold (DeepMind) Combines deep learning and physics; works even without templates

I-TASSER (hybrid) Starts with threading but can switch to ab-initio if needed

Note: AlphaFold is more than traditional ab-initio — it uses AI + structural databases for
highly accurate predictions.

When to Use Ab-initio?

Use Case Preferred Method

No similar template in PDB Ab-initio

Short proteins (<100 residues) More suitable

Novel proteins from new organisms Useful

Experimental 3D structure not available Essential

Workflow Using QUARK (Web-Based, Easy)

Website: https://zhanggroup.org/QUARK/

Steps:

1. Go to the QUARK server


o Open the link above.
2. Input your protein sequence
o Paste a FASTA-format sequence (≤200 residues for best results).
o Example:
>MyProtein
MSEQNNTEMTFQIQRIYTKDISFEAPNAPHVFQKDWMA...
3. Provide email (optional)
o You’ll receive a link to results when the job is finished.
4. Submit the job
o Click Submit. Wait time may vary (from 1 hour to a day).
5. Download results
o You will get:
▪ Predicted 3D structure (.pdb)
▪ Visualization
▪ Confidence score

What is a Ramachandran Plot?

A Ramachandran plot is a graphical representation of the phi (φ) and psi (ψ) backbone dihedral
angles of amino acids in a protein structure.
It is used to validate the stereochemical quality of predicted 3D protein structures.

Purpose of the Ramachandran Plot

• Assesses structural validity of a predicted protein model.


• Highlights allowed vs. disallowed regions for torsion angles.
• Helps detect steric clashes or modeling errors.
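
The φ and ψ values plotted in a Ramachandran plot are ordinary dihedral angles: φ is defined by the atoms C(i-1), N(i), CA(i), C(i), and ψ by N(i), CA(i), C(i), N(i+1). The sketch below computes a dihedral angle from four 3D points in plain Python. Reading the coordinates from a .pdb file is not shown, and the test points are made up to give easily checked angles.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Dihedral angle in degrees (-180..180) defined by four points."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    b1 = tuple(x / norm for x in b1)                      # unit vector of the axis
    # Project b0 and b2 onto the plane perpendicular to the central bond.
    v = _sub(b0, tuple(_dot(b0, b1) * x for x in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * x for x in b1))
    return math.degrees(math.atan2(_dot(_cross(b1, v), w), _dot(v, w)))


# Central bond along z; the outer points are placed to give 0, 90 and 180 degrees.
cis = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1))
quarter = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1))
trans = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (-1, 0, 1))
```

Applying this function to consecutive backbone atoms of every residue yields the (φ, ψ) pairs that validation tools then compare against the favored and disallowed regions.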

Workflow for Ramachandran Plot-Based Validation

Step 1: Obtain the Predicted Structure

• Format: .pdb file


• Source: AlphaFold, Rosetta, QUARK, I-TASSER, etc.
Step 2: Choose a Validation Tool

Tool Type Website / Access

PROCHECK (PDBsum) Web-based https://www.ebi.ac.uk/thornton-srv/software/PROCHECK/

MolProbity Web-based http://molprobity.biochem.duke.edu/

SAVES server Web-based https://saves.mbi.ucla.edu/

PyMOL / ChimeraX Desktop For local visual generation of plots

Step 3: Upload the Structure

1. Visit a server (e.g., SAVES).


2. Upload your .pdb file.
3. Select Ramachandran Plot (PROCHECK).
4. Run analysis.

Step 4: Interpret the Ramachandran Plot

The plot divides the φ-ψ space into:

Region Type Description

Most favored regions Conformations frequently observed in known protein structures

Allowed regions Less common but still acceptable conformations

Disallowed regions Sterically unfavorable, indicate possible errors

Ideal Results

Parameter Acceptable Range

Residues in most favored regions > 90% (ideal: 95–98%)

Residues in disallowed regions 0% (or very minimal < 1–2%)
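
The thresholds in this table can be applied automatically once each residue has a region label. The sketch below assumes the labels ("favored", "allowed", "generous", "disallowed") have already been assigned by a validation tool; the label names and the summarize function are hypothetical, and the cut-offs are simply the ones listed above.

```python
from collections import Counter

REGIONS = ("favored", "allowed", "generous", "disallowed")


def summarize(labels, min_favored=90.0, max_disallowed=2.0):
    """Return (percent per region, pass/fail) for a list of per-residue labels."""
    if not labels:
        raise ValueError("no residues to summarize")
    unknown = set(labels) - set(REGIONS)
    if unknown:
        raise ValueError(f"unknown region labels: {sorted(unknown)}")
    counts = Counter(labels)
    total = len(labels)
    percent = {r: 100.0 * counts.get(r, 0) / total for r in REGIONS}
    # Pass when favored > 90% and disallowed stays below the upper bound of 2%.
    passes = percent["favored"] > min_favored and percent["disallowed"] < max_disallowed
    return percent, passes
```

Glycine and proline residues are usually evaluated separately by validation tools, so they should be excluded or handled by the tool before labels are passed in.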


Example Output (Typical AlphaFold Result)

Region % Residues

Most Favored Regions 96.2%

Additional Allowed Regions 2.8%

Generously Allowed Regions 0.7%

Disallowed Regions 0.3%

What Are Stereochemical Properties?


Stereochemical properties refer to the geometrical and chemical correctness of a protein
structure at the atomic level, including:
• Bond lengths
• Bond angles
• Planarity of peptide bonds
• Chirality of amino acids
• Side chain conformations
• Non-bonded atomic interactions (steric clashes)

Why Are They Important?


• Ensure the predicted structure obeys chemical and physical constraints.
• Detect unrealistic distortions or errors introduced during modeling.
• Confirm that the model is physically plausible before it is used for biological interpretation.

Tools for Stereochemical Validation

Tool Features

MolProbity Comprehensive analysis including clashes, bond geometry, rotamers


WHAT_CHECK Checks bond lengths, angles, and other stereochemical properties

PROCHECK Focuses on stereochemistry and geometry


What is Structure-Structure Alignment?
Structure-structure alignment is the process of comparing and superimposing two or more
protein 3D structures to identify their similarities and differences in spatial conformation.
Unlike sequence alignment, it focuses on the 3D coordinates of atoms, usually backbone atoms,
to analyze:
• Overall fold similarity
• Conserved structural motifs
• Evolutionary relationships
• Functional inference

Why Perform Structure-Structure Alignment?


• To assess the quality of a predicted protein model by comparing it with experimentally
solved structures (e.g., from PDB).
• To study protein evolution by comparing structural conservation.
• To identify functional sites conserved in structure but not necessarily in sequence.
• To assist in drug design and protein engineering by comparing target structures.

Key Metrics in Structure Alignment

Metric Description Interpretation

RMSD (Root Mean Square Deviation) Square root of the mean squared distance between aligned atoms after superposition Lower RMSD (e.g., <2 Å) indicates high similarity

TM-score (Template Modeling Score) Length-normalized score reflecting structural similarity Score between 0 and 1; >0.5 generally means the same fold

GDT-TS (Global Distance Test - Total Score) Average percentage of residues aligned within 1, 2, 4, and 8 Å of the reference Higher GDT-TS (closer to 100) means better agreement

Common Tools for Structure-Structure Alignment

Tool Features Website/Access

TM-align Fast structural alignment, provides TM-score and RMSD TM-align

DALI Aligns based on distance matrices, detects structural homologs DALI server

PyMOL Visualization and manual alignment PyMOL

Chimera/ChimeraX Advanced visualization with alignment capabilities Chimera


Workflow for Structure Alignment
1. Obtain protein structures in PDB format (e.g., predicted model and reference structure).
2. Upload or open structures in a chosen tool.
3. Perform alignment to superimpose the structures.
4. Analyze RMSD, TM-score, or GDT-TS for quantitative similarity.
5. Visually inspect the superposition for matching folds and deviations.

Example Interpretation
• RMSD < 2 Å: High structural similarity, often same fold.
• TM-score > 0.5: Generally considered significant fold similarity.
• High GDT-TS (closer to 100): Most residues superimpose within a few ångströms of the
reference, indicating a close structural match.
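
To make the three metrics concrete, the sketch below computes them from two lists of matched CA coordinates. It assumes the structures are already superimposed and aligned residue by residue; real tools such as TM-align also search for the best superposition, which is not done here. The TM-score uses the usual d0 = 1.24 (L - 15)^(1/3) - 1.8 scale with a 0.5 Å floor for short chains, and all function names are illustrative.

```python
import math


def distances(model, reference):
    """Per-residue distances between matched (x, y, z) coordinates."""
    if len(model) != len(reference):
        raise ValueError("coordinate lists must have the same length")
    return [math.dist(a, b) for a, b in zip(model, reference)]


def rmsd(d):
    """Root mean square deviation of the matched distances."""
    return math.sqrt(sum(x * x for x in d) / len(d))


def tm_score(d, target_length=None):
    """TM-score for a fixed superposition, normalized by the target length."""
    length = target_length or len(d)
    d0 = 1.24 * (length - 15) ** (1 / 3) - 1.8 if length > 15 else 0.5
    d0 = max(d0, 0.5)  # floor for short chains (assumption, see lead-in)
    return sum(1.0 / (1.0 + (x / d0) ** 2) for x in d) / length


def gdt_ts(d):
    """Mean percentage of residues within 1, 2, 4, and 8 Å."""
    n = len(d)
    fractions = [sum(1 for x in d if x <= cutoff) / n for cutoff in (1, 2, 4, 8)]
    return 100.0 * sum(fractions) / 4
```

For two identical coordinate sets the functions return RMSD 0, TM-score 1, and GDT-TS 100; shifting every atom by 3 Å gives RMSD 3 and GDT-TS 50.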

UNIT- V

Introduction to Systems Biology


Systems Biology is an interdisciplinary field that focuses on the systematic study of complex
biological systems through integration of experimental and computational methods.
Key Points:
• Holistic approach: Instead of studying individual genes or proteins separately, systems
biology investigates how components interact as a network to understand the behavior
of the entire system (cells, tissues, organisms).
• Data integration: Combines data from genomics, proteomics, metabolomics, and other
omics technologies.
• Modeling and simulation: Uses mathematical and computational models to simulate
biological processes and predict system responses under different conditions.
• Applications: Understanding disease mechanisms, drug discovery, metabolic
engineering, personalized medicine.

Introduction to Synthetic Biology


Synthetic Biology is an emerging discipline that involves the design and construction of new
biological parts, devices, and systems, or the re-design of existing natural biological systems for
useful purposes.
Key Points:
• Engineering principles: Applies engineering concepts like modularity, standardization,
and abstraction to biology.
• Creation of synthetic circuits: Designs gene circuits, metabolic pathways, or entire
organisms with novel functions.
• Tools used: DNA synthesis, genome editing (e.g., CRISPR), computational design.
• Applications: Production of biofuels, pharmaceuticals, biosensors, environmental
remediation.

Microarray Data Analysis – Overview


Microarray technology is used to measure the expression levels of thousands of genes
simultaneously. Microarray data analysis involves processing and interpreting this data to
identify patterns of gene expression under different conditions (e.g., disease vs. healthy, treated
vs. untreated).

Key Steps in Microarray Data Analysis

1. Data Acquisition
• Obtain raw data from microarray experiments (e.g., .CEL files from Affymetrix chips).
• Sources: Lab experiments or public repositories like GEO (Gene Expression Omnibus)
or ArrayExpress.

2. Preprocessing
Involves cleaning and standardizing the data:
• Background correction: Adjusts for noise and non-specific binding.
• Normalization: Ensures data from different arrays or conditions are comparable (e.g.,
Quantile Normalization).
• Filtering: Removes low-quality or non-informative genes.
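
Quantile normalization, mentioned above, forces every array to share the same distribution of values. A minimal sketch of the idea, assuming each array is a plain list of intensities in the same gene order and breaking ties by position rather than by averaging (packages such as limma or affy handle ties and missing values more carefully):

```python
def quantile_normalize(arrays):
    """Quantile-normalize a list of equal-length intensity lists (one per array)."""
    n = len(arrays[0])
    if any(len(a) != n for a in arrays):
        raise ValueError("all arrays must have the same number of genes")
    # Mean of the k-th smallest value across arrays, for every rank k.
    sorted_arrays = [sorted(a) for a in arrays]
    rank_means = [sum(column) / len(arrays) for column in zip(*sorted_arrays)]
    result = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])  # gene indices by increasing value
        normalized = [0.0] * n
        for rank, gene_index in enumerate(order):
            normalized[gene_index] = rank_means[rank]
        result.append(normalized)
    return result
```

After normalization, every array contains exactly the same set of values; only the assignment of values to genes differs between arrays.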

3. Gene Expression Calculation


• Compute expression levels for each gene.
• Tools like R packages (limma, affy, oligo) are commonly used.

4. Differential Expression Analysis


• Identify genes that are significantly upregulated or downregulated between conditions.
• Use statistical tests (e.g., t-test, ANOVA, moderated t-test in limma).
• Correct for multiple testing (e.g., using FDR or Benjamini-Hochberg method).
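
Because thousands of genes are tested at once, raw p-values must be adjusted. The sketch below implements the Benjamini-Hochberg procedure on a plain list of p-values; the function name is illustrative, and in practice the adjustment is usually done by the analysis package (for example p.adjust in R).

```python
def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the same order as the input."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])  # indices by increasing p
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonic adjusted values.
    for rank in range(n, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * n / rank)
        adjusted[index] = running_min
    return adjusted
```

Genes whose adjusted p-value falls below the chosen FDR threshold (0.05 is a common choice) are reported as differentially expressed.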

5. Clustering and Visualization


• Hierarchical clustering or k-means to group genes with similar expression patterns.
• Use heatmaps, volcano plots, and PCA (Principal Component Analysis) for visualization.

6. Functional Annotation
• Perform Gene Ontology (GO) enrichment or pathway analysis using:
o DAVID
o Enrichr
o GSEA (Gene Set Enrichment Analysis)
o KEGG pathway mapping
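
Many over-representation tools are based on a hypergeometric (one-sided Fisher) test: given how many genes carry a GO term overall, how surprising is the number found in the differentially expressed list? A minimal sketch of that calculation, with hypothetical parameter names; tools such as DAVID apply their own variations and multiple-testing corrections, and GSEA uses a different, rank-based statistic.

```python
from math import comb


def enrichment_pvalue(total_genes, annotated, selected, overlap):
    """P(X >= overlap) when `selected` genes are drawn from `total_genes`,
    of which `annotated` carry the term (hypergeometric upper tail)."""
    upper = min(annotated, selected)
    favourable = sum(
        comb(annotated, k) * comb(total_genes - annotated, selected - k)
        for k in range(overlap, upper + 1)
    )
    return favourable / comb(total_genes, selected)
```

For example, with 10 genes in total, 4 annotated with the term, and 3 selected genes of which 2 are annotated, the p-value is 40/120, about 0.33, so this overlap would not be considered enriched.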

7. Biological Interpretation
• Relate identified gene expression changes to biological processes, diseases, or treatments.
• Validate with literature or further experiments like qPCR.

Applications of Microarray Analysis


• Cancer gene expression profiling
• Drug response analysis
• Biomarker discovery
• Toxicogenomics
• Comparative genomics

DNA Computing – An Introduction

DNA computing is a branch of unconventional computing that uses DNA molecules and
biochemical reactions to perform computations, rather than traditional electronic circuits.
It is an interdisciplinary field combining molecular biology, computer science, mathematics, and
nanotechnology.

Key Concept
DNA computing uses:
• Strands of DNA as information carriers (analogous to bits in digital computers)
• Enzymes and chemical reactions as logic gates and processors
The goal is to solve complex computational problems by harnessing the massive parallelism and
storage capacity of DNA molecules.

Historical Background

• Leonard Adleman (1994): The field began when he used DNA to solve a seven-vertex instance
of the directed Hamiltonian Path Problem (an NP-complete problem closely related to the
traveling salesman problem).
• Showed that a combinatorial problem could be solved using DNA strands and biological
operations.

How DNA Computing Works – Basic Workflow

Step Description

Encoding Represent input data or problem instances using specific DNA sequences

Hybridization DNA strands naturally bind to complementary strands

Ligation Enzymes join DNA fragments to form new combinations

Amplification (PCR) Duplicate strands that represent potential solutions

Separation (Gel Electrophoresis) Filter correct-length strands (possible solutions)

Detection Identify the strand(s) representing the correct solution
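
The workflow can be mimicked in software to see why it solves the Hamiltonian path problem. In the sketch below, which is only a simulation and involves no real chemistry, exhaustive enumeration of walks stands in for the random, massively parallel hybridization and ligation, and each filter corresponds to one laboratory step. The function names and the small example graph are hypothetical.

```python
def all_walks(edges, max_vertices):
    """Every directed walk with 2..max_vertices vertices (stands in for ligation)."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, []).append(v)
    vertices = sorted({x for edge in edges for x in edge})
    walks = [[v] for v in vertices]
    pool = []
    for _ in range(max_vertices - 1):
        walks = [w + [nxt] for w in walks for nxt in adjacency.get(w[-1], [])]
        pool.extend(walks)
    return pool


def hamiltonian_paths(edges, start, end):
    """Filter the pool of walks the way the lab protocol filters DNA strands."""
    vertices = {x for edge in edges for x in edge}
    n = len(vertices)
    pool = all_walks(edges, n + 1)                               # hybridization + ligation
    pool = [w for w in pool if w[0] == start and w[-1] == end]   # PCR with start/end primers
    pool = [w for w in pool if len(w) == n]                      # gel: keep correct length
    pool = [w for w in pool if set(w) == vertices]               # detection: every vertex present
    return pool
```

With edges [(0, 1), (1, 2), (2, 3), (0, 2), (1, 3), (3, 1)], the call hamiltonian_paths(edges, 0, 3) keeps only the walk 0, 1, 2, 3. The software version enumerates walks one by one, whereas in the test tube all candidate strands form at the same time.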

Applications of DNA Computing

• Combinatorial problem solving (e.g., optimization problems like TSP)


• Cryptography
• Biological sensors and molecular diagnostics
• Smart drug delivery systems
• DNA-based logic gates for nano-computing
• Molecular robotics

Bioinformatics Approaches for Drug Discovery

Bioinformatics plays a central role in modern drug discovery, enabling scientists to identify and
design potential therapeutic agents faster and more cost-effectively. By leveraging biological
data and computational tools, it allows researchers to understand disease mechanisms, identify
drug targets, and screen potential compounds.

Key Stages of Drug Discovery Involving Bioinformatics

Stage Bioinformatics Role

1. Target Identification Find disease-related genes/proteins using genomics, transcriptomics, etc.

2. Target Validation Use expression data and literature mining to confirm target involvement.

3. Lead Compound Identification Screen compounds (ligands) that may bind the target using virtual tools.

4. Optimization Predict drug-likeness, toxicity, and improve molecular binding properties.

5. Preclinical Studies Model interactions, pathways, and simulate biological responses.

Bioinformatics Techniques Used

1. Genomic & Transcriptomic Analysis


• Use microarrays and RNA-Seq to identify genes involved in diseases.
• Analyze single nucleotide polymorphisms (SNPs) for genetic variations linked to drug
response.
2. Protein Structure Prediction
• Predict 3D structures of target proteins using:
o Homology modeling (SWISS-MODEL)
o Threading (Phyre2)
o Ab-initio prediction (Rosetta, QUARK) and deep-learning prediction (AlphaFold)
3. Molecular Docking
• Simulate how drug molecules bind to a target protein.
• Tools: AutoDock, PyRx, DockThor, SwissDock
4. Virtual Screening
• Screen large libraries of compounds in silico to find potential binders.
• Reduces cost and time compared to high-throughput lab screening.
5. Pharmacophore Modeling
• Identify the essential features in a molecule that ensure biological activity.
6. ADMET Prediction
• Predict:
o Absorption
o Distribution
o Metabolism
o Excretion
o Toxicity
• Tools: SwissADME, pkCSM, ADMETlab
7. Pathway and Network Analysis
• Understand the role of a gene/protein in biological pathways.
• Databases: KEGG, Reactome, BioCyc
8. Drug Repositioning
• Use data mining and similarity analysis to find new uses for existing drugs.
• Example: Using gene expression correlation via Connectivity Map (CMap).

Applications of Bioinformatics in Genomics and Proteomics

Bioinformatics provides powerful computational tools and techniques to analyze, interpret, and
visualize the massive datasets generated in genomics (study of genomes) and proteomics (study
of proteins). Below is a structured explanation of its applications in both fields:

A. Applications in Genomics

Application Description

1. Genome Sequencing & Assembly Bioinformatics helps in assembling short DNA reads from NGS into full genomes. Tools: SPAdes, Velvet, SOAPdenovo

2. Gene Prediction Algorithms predict the location and structure of genes. Tools: Glimmer, GENSCAN

3. Genome Annotation Assigning biological information to gene sequences (function, structure, etc.). Databases: GenBank, Ensembl

4. Comparative Genomics Comparing genomes across species to identify conserved and unique elements. Tools: BLAST, ClustalW, OrthoMCL

5. SNP Analysis Detecting single nucleotide polymorphisms to study variation and disease. Tools: SAMtools, GATK

6. Transcriptomics (RNA-Seq) Analyzing gene expression using sequencing of mRNA. Tools: HISAT2, Cufflinks, DESeq2

7. Epigenomics Studying DNA methylation and histone modification using ChIP-seq and bisulfite sequencing data.

8. Metagenomics Analyzing genetic material from environmental samples. Tools: QIIME, MetaPhlAn

B. Applications in Proteomics

Application Description

1. Protein Structure Prediction Predicting 2D/3D protein structures from sequences. Tools: AlphaFold, SWISS-MODEL

2. Protein Sequence Analysis Identifying motifs, domains, signal peptides. Tools: Pfam, InterPro, PROSITE

3. Protein Identification Identifying proteins using mass spectrometry data. Tools: Mascot, SEQUEST

4. Protein-Protein Interaction (PPI) Predicting and visualizing interactions. Databases: STRING, BioGRID

5. Protein Function Annotation Predicting protein functions using sequence and structural data. Tools: Blast2GO, InterProScan

6. Post-Translational Modification Identifying phosphorylation, glycosylation, etc. Tools: ModPred, NetPhos

7. Proteomic Pathway Analysis Mapping proteins to pathways to understand biological processes. Databases: KEGG, Reactome

Summary Table

Field Key Applications Key Tools/Databases

Genomics Gene prediction, SNP analysis, genome assembly, RNA-Seq BLAST, GenBank, GATK, Ensembl, HISAT2

Proteomics Structure prediction, protein identification, PPI, function annotation AlphaFold, STRING, Mascot, Pfam, InterPro

Real-World Applications

• Personalized medicine: Predicting patient-specific responses based on genome/proteome.


• Drug discovery: Identifying new targets and biomarkers.
• Disease diagnosis: Through expression profiling and proteomic signatures.
• Agricultural genomics: Enhancing traits like drought resistance, productivity in crops.

Assembling the Genome – An Overview


Genome assembly is the process of reconstructing the complete DNA sequence of an organism
from sequenced DNA fragments (reads): short reads from platforms such as Illumina, or long
reads from PacBio or Oxford Nanopore.

Why Genome Assembly Is Needed

Sequencers don't read entire chromosomes in one go. Instead, they produce millions of small
fragments (reads). Genome assembly puts these fragments back together in the correct order to
recreate the original genome.

Types of Genome Assembly

Type Description

De novo Assembly Building the genome from scratch, without a reference genome.

Reference-based Assembly Aligning reads to an existing reference genome to build the sequence.

Steps in De Novo Genome Assembly

1. Read Preprocessing
o Remove low-quality reads and contaminants
o Tools: FastQC (quality assessment), Trimmomatic (adapter and quality trimming)
2. Error Correction
o Fix sequencing errors before assembly
o Tools: BayesHammer, Lighter
3. Assembly Algorithm
o Overlap Layout Consensus (OLC) for long reads
o De Bruijn Graph for short reads
o Tools:
▪ SPAdes, Velvet, SOAPdenovo (short reads)
▪ Canu, Flye (long reads)
4. Scaffolding
o Connecting contigs (contiguous sequences) into larger scaffolds using mate-pair
or paired-end reads.
5. Gap Filling & Polishing
o Closing gaps and correcting misassemblies.
o Tools: Pilon, GapCloser, Racon
6. Annotation
o Identify genes, regulatory elements, and functional regions.
o Tools: Prokka, Augustus, GENSCAN
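
The de Bruijn graph approach named in step 3 can be illustrated on a tiny example. The sketch below is a teaching toy, not how SPAdes or Velvet are implemented: it assumes error-free reads from one strand, no repeats longer than k-1, and full coverage, so that a single Eulerian path spells the genome. Function names are illustrative.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Nodes are (k-1)-mers; each distinct k-mer contributes one edge."""
    kmers = {read[i:i + k] for read in reads for i in range(len(read) - k + 1)}
    graph = defaultdict(list)
    for kmer in sorted(kmers):
        graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm: a path that uses every edge exactly once."""
    balance = defaultdict(int)  # out-degree minus in-degree
    for node, successors in graph.items():
        balance[node] += len(successors)
        for nxt in successors:
            balance[nxt] -= 1
    # Start at the node with one extra outgoing edge, if there is one.
    start = next((n for n, b in balance.items() if b == 1), next(iter(graph)))
    remaining = {node: list(successors) for node, successors in graph.items()}
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if remaining.get(node):
            stack.append(remaining[node].pop())
        else:
            path.append(stack.pop())
    return path[::-1]


def assemble(reads, k):
    """Spell the sequence along the Eulerian path of the de Bruijn graph."""
    path = eulerian_path(de_bruijn(reads, k))
    return path[0] + "".join(node[-1] for node in path[1:])
```

For the reads ATGGCG, GGCGTG, and CGTGCA with k = 4, assemble returns ATGGCGTGCA. Real assemblers additionally handle sequencing errors, reverse complements, uneven coverage, and repeats, which is why the error correction, scaffolding, and polishing steps exist.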

STS Content Mapping for Clone Contigs – Overview

STS content mapping (Sequence Tagged Site mapping) is a physical mapping technique used to
determine the relative positions of DNA clones in a genomic library by identifying the presence
or absence of known DNA markers (STS sites) in each clone.
It played a key role in the Human Genome Project before high-throughput sequencing became
dominant.

What is an STS?

• STS (Sequence Tagged Site): A short, unique DNA sequence (200–500 bp) that occurs
only once in the genome, and its sequence and location are known.
• Detected by PCR, making STSs a reliable way to mark locations on DNA.

Purpose of STS Content Mapping

The goal is to arrange overlapping DNA clones into correctly ordered contigs without sequencing
them fully, by identifying which STS markers are present in which clones.

Process of STS Content Mapping

1. Generate a genomic library


o Use BACs, YACs, or cosmids to clone fragments of the genome.
2. Design STS markers
o Select known unique sequences across the genome.
3. Screen clones for STS markers
o Use PCR to test which STS markers are present in each clone.
4. Create a binary matrix
o Rows = clones
o Columns = STS markers
o Entry = 1 (marker present) or 0 (marker absent)
5. Assemble clone contigs
o Clones sharing common STS markers are likely to overlap.
o Overlapping clones are merged into contigs.
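
Steps 4 and 5 can be sketched in a few lines. The example assumes the PCR screening results are available as a dictionary mapping each clone name to the set of STS markers it contains; the clone and marker names are made up. Clones that share at least one marker are grouped into the same contig using a small union-find structure.

```python
def presence_matrix(clone_hits, markers):
    """Binary matrix: one row per clone, one column per STS marker."""
    return {clone: [1 if m in hits else 0 for m in markers]
            for clone, hits in clone_hits.items()}


def build_contigs(clone_hits):
    """Group clones that are linked by shared STS markers."""
    clones = sorted(clone_hits)
    parent = {c: c for c in clones}

    def find(c):
        # Follow parent links to the group representative (with path halving).
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    for i, a in enumerate(clones):
        for b in clones[i + 1:]:
            if clone_hits[a] & clone_hits[b]:  # shared marker implies overlap
                parent[find(a)] = find(b)

    groups = {}
    for c in clones:
        groups.setdefault(find(c), []).append(c)
    return sorted(groups.values())
```

This only decides which clones belong together. Ordering the clones inside a contig requires arranging the markers so that the 1s in every row of the matrix are consecutive, and real data also contain false positive and false negative PCR results.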

Functional Annotation – Explained


Functional annotation is the process of identifying and assigning biological functions to genes or
proteins based on sequence data. It is a critical step after gene prediction or protein
identification, especially in genome and transcriptome studies.

Purpose of Functional Annotation

To determine:
• What a gene or protein does
• Where it functions (cellular localization)
• How it interacts with other molecules
• Biological processes it’s involved in

Key Steps in Functional Annotation

Step Description

1. Sequence Similarity Search Compare predicted genes/proteins to known sequences using tools like BLAST

2. Gene Ontology (GO) Mapping Assign standardized terms for biological process, molecular function, and cellular component

3. Protein Domain Analysis Identify conserved domains using Pfam, InterPro, SMART

4. Pathway Mapping Map genes to biological pathways (e.g., KEGG, Reactome)

5. Subcellular Localization Predict where in the cell a protein functions (e.g., nucleus, mitochondria)

6. Enzyme Commission (EC) Number Assign catalytic function to enzymes based on reaction class

Tools and Databases

Tool/Database Function

BLAST Sequence similarity search

InterProScan Protein domain and family identification

GO (Gene Ontology) Function, process, localization terms

Pfam, SMART Domain identification

KEGG Pathway mapping

UniProt Curated protein function database

Peptide Mass Fingerprinting (PMF) – Explained

Peptide Mass Fingerprinting (PMF) is a mass spectrometry-based technique used to identify
proteins by measuring the masses of peptides generated from enzymatic digestion (usually by
trypsin) of the protein.

Principle of PMF

1. A protein is enzymatically digested into smaller peptides.


2. The mass spectrum of these peptides is obtained using a mass spectrometer (typically
MALDI-TOF).
3. The experimental peptide mass list is compared with theoretical peptide masses in a
protein database.
4. The best matching protein is identified.
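
The matching principle can be sketched as follows. The example performs an in-silico tryptic digest (cleaving after K or R unless the next residue is P), computes neutral monoisotopic peptide masses from standard residue masses, and scores each database protein by how many observed masses it explains. The function names, the tiny database, and the 0.2 Da tolerance are assumptions for illustration; real search engines such as Mascot work with [M+H]+ ions, allow missed cleavages and modifications, and use probabilistic scoring.

```python
import re

# Monoisotopic residue masses in daltons.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water is added per free peptide


def tryptic_digest(protein):
    """Split after K or R, except when the next residue is P (no missed cleavages)."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", protein) if p]


def peptide_mass(peptide):
    """Neutral monoisotopic mass of a peptide."""
    return sum(MONO[aa] for aa in peptide) + WATER


def score(observed, protein, tolerance=0.2):
    """Number of observed masses matched by a theoretical tryptic peptide."""
    theoretical = [peptide_mass(p) for p in tryptic_digest(protein)]
    return sum(1 for m in observed
               if any(abs(m - t) <= tolerance for t in theoretical))


def identify(observed, database, tolerance=0.2):
    """Name of the database protein with the highest match count."""
    return max(database, key=lambda name: score(observed, database[name], tolerance))
```

For example, with database = {"protA": "AKRPGKM", "protB": "GGGKAAAR"} and the observed masses of the peptides AK and RPGK, identify returns "protA".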

PMF Workflow

Step Description

1. Protein Isolation Protein is purified from cells or tissues.

2. Enzymatic Digestion Protein is digested with trypsin, which cleaves on the C-terminal side of lysine (K) and arginine (R), generally not before proline.

3. Mass Spectrometry (MS) Peptide mixture analyzed using MALDI-TOF or ESI-MS.

4. Peak List Generation List of m/z values (mass/charge) of peptides is created.

5. Database Search Compare experimental mass values with theoretical digests using software such as Mascot, ProteinProspector, or PeptIdent.

6. Protein Identification Protein with the highest match score is identified.

Tools and Databases

Tool Purpose

Mascot PMF database search engine

ProteinProspector Protein ID from MS data

PeptIdent (ExPASy) Match peptide masses to proteins

SwissProt, NCBI Protein sequence databases

Key Characteristics of PMF

• Highly specific: Each protein has a unique peptide "mass fingerprint".


• Quick and cost-effective for known proteins.
• Usually used with MALDI-TOF instruments.
• Requires the protein to be present in databases for identification.

Advantages

• High-throughput protein identification


• Sensitive and accurate
• Automated and fast when integrated with databases
