Introduction to Bioinformatics
Definition:
Bioinformatics is an interdisciplinary field that combines biology, computer science, and
information technology to analyze and interpret biological data.
It involves solving biological science problems through computation.
It also handles biological data collected through experimental techniques and documents them in
databases for scientific use.
Origins of the Term:
First used by Paulien Hogeweg (Dutch theoretical biologist) and Ben Hesper in the early 1970s.
From the late 1980s, it primarily referred to the computational analysis of genome data.
    Goals of Bioinformatics:
    •   Data Organization: Efficient storage, management, and retrieval of biological data.
    •   Tool Development: Creating algorithms and statistical tools for interpreting complex
        biological data.
    •   Biological Insight: Deriving meaningful conclusions from raw data—such as gene function,
        evolutionary relationships, and disease mechanisms.
Scientific Disciplines Associated with Bioinformatics:
    1. Traditional and Advanced Sciences:
            1. Plant sciences
            2. Animal sciences
            3. Molecular biology
            4. Genetics
            5. Evolutionary biology
    2. Other Fields:
            1. Pharmaceuticals
            2. Mathematical and statistical sciences
            3. Omics (e.g., genomics, proteomics)
Support Systems for Bioinformatics:
    1. Technology and Infrastructure:
            1. Computer science
            2. Information technology (IT)
            3. Computational resources (serve as the backbone)
    Scope of Bioinformatics:
   •     Molecular Biology: Genome sequencing, gene expression analysis.
   •     Structural Biology: Protein structure prediction, molecular modeling.
   •     Systems Biology: Pathway mapping, systems modeling.
   •     Drug Discovery: Target identification, virtual screening.
   •     Evolutionary Biology: Phylogenetic analysis, genome evolution.
   Applications:
   1. Genomics – Sequence analysis, gene prediction, comparative genomics.
   2. Proteomics – Protein structure/function analysis, interactions.
   3. Transcriptomics – Gene expression profiling using microarrays or RNA-seq.
   4. Metabolomics – Metabolic pathway analysis.
   5. Pharmacogenomics – Personalizing medicine based on genetic profile.
   6. Systems Biology – Understanding complex biological systems through modeling.
   7. Biological Databases – Storing and accessing biological information in resources such as
      GenBank, PDB, and Swiss-Prot.
   8. Agriculture – Using genetic, genomic, and proteomic information to develop crops that are
      resistant, nutritionally enhanced, and more profitable.
   9. Drug Discovery – Applying the same data and techniques to the discovery of therapeutic drugs.
   Across these areas, core functions include data storage, mining, and processing; structural and
   functional annotation of genes and proteins; system modeling; and drug discovery.
   Central Dogma in Bioinformatics:
The central dogma describes the flow of genetic information: DNA → RNA → Protein.
Bioinformatics tools assist in analyzing each step:
   •     DNA: Genome assembly, mutation detection
   •     RNA: Transcript quantification
   •     Protein: Function prediction, interaction modelling
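The DNA → RNA → Protein flow can be illustrated with a minimal Python sketch. It assumes the input is the coding strand read in frame from its first base and uses the standard genetic code; the function names `transcribe` and `translate` are illustrative and do not come from any particular library.

```python
# Standard genetic code in compact form; bases are ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def transcribe(dna_coding_strand):
    """DNA coding strand -> mRNA: same sequence with T replaced by U."""
    return dna_coding_strand.upper().replace("T", "U")


def translate(mrna):
    """mRNA -> protein (one-letter codes), stopping at the first stop codon."""
    protein = []
    for pos in range(0, len(mrna) - 2, 3):
        codon = mrna[pos:pos + 3].replace("U", "T")  # table is keyed by DNA codons
        amino_acid = CODON_TABLE[codon]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


dna = "ATGGCCAAGTAA"          # toy sequence: Met-Ala-Lys followed by a stop codon
mrna = transcribe(dna)
print(mrna)                   # AUGGCCAAGUAA
print(translate(mrna))        # MAK
```

Real pipelines also have to deal with reading frames, introns, and ambiguous bases, which this sketch deliberately ignores.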
   Major Bioinformatics Databases and Tools:
 Database / Tool             Description
 GenBank                     Nucleotide sequence database maintained by NCBI
 EMBL                        European Molecular Biology Laboratory nucleotide archive
 DDBJ                        DNA Data Bank of Japan
 ZINC                        Database of commercially available molecules for computational
                             screening
 TrEMBL                      Computer-annotated supplement to Swiss-Prot
 PDB (Protein Data Bank)     3D structures of proteins and nucleic acids
 Pfam                        Protein family and domain database
 KEGG                        Kyoto Encyclopedia of Genes and Genomes – pathways and function
 BLAST                       Basic Local Alignment Search Tool – a tool for comparing sequences
                             (a search tool rather than a database)
 GEO                         Gene Expression Omnibus – a functional genomics database containing
                             gene expression and microarray data
                                   Introduction to Chemoinformatics
    Definition:
Chemoinformatics (or cheminformatics) is the use of computer and informational techniques to solve
chemical problems. It involves data storage, structure searching, and data analysis relevant to
chemical and biological information.
Scope:
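Applied to collections of small molecules that range from tens or thousands of compounds
(N ~ 10, 100, 1,000) through millions (10⁶) up to the estimated size of drug-like chemical
space (about 10⁶⁰).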
Fields of Application:
Medical science: Developing novel and effective drugs.
Material science: Creating new and superior materials.
Allied fields: Agrochemicals and biotechnology.
    Goals of Chemoinformatics:
    •    Convert chemical data into a digital form that is easily accessible and analyzable.
    •    Facilitate drug design and discovery using computational tools.
    •    Develop chemical databases and virtual libraries.
    •    Predict physicochemical and biological properties of molecules.
    Key Concepts:
    1. Molecular Representations – SMILES, InChI, 2D & 3D formats
    2. Descriptors – Numeric values that describe molecular features
    3. QSAR Modeling – Quantitative Structure-Activity Relationship to predict biological activity
    4. Virtual Screening – In silico filtering of large libraries for active compounds
    5. Scaffold Hopping – Identifying new chemical cores with same activity
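The QSAR idea listed above can be sketched in its simplest form as a least-squares line relating one descriptor to a measured activity. The descriptor (logP), the activity scale (pIC50), and every numeric value below are made-up illustrative numbers, not experimental data; real QSAR models use many descriptors and proper validation.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(descriptor_value, slope, intercept):
    """Predicted activity for a new compound from its descriptor value."""
    return slope * descriptor_value + intercept


# Hypothetical training set: one descriptor value and one activity per compound.
logp_values = [1.0, 2.0, 3.0, 4.0]
pic50_values = [5.1, 5.9, 7.1, 7.9]

slope, intercept = fit_line(logp_values, pic50_values)
print(round(slope, 2), round(intercept, 2))          # 0.96 4.1
print(round(predict(2.5, slope, intercept), 2))      # 6.5
```

The fitted slope and intercept are the "model"; predicting the activity of an untested compound then only requires computing its descriptor value.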
Open-source software:
Free and open-source software (FOSS) tools are programs whose source code anyone can download, use,
and modify. Under copyleft licenses such as the GNU Lesser General Public License (LGPL), anyone
who distributes a modified version of the licensed code must make those changes available under
the same terms.
Key Areas:
1. Molecular Representations:
       o     SMILES (Simplified Molecular Input Line Entry System)
       o     InChI (IUPAC International Chemical Identifier)
       o     2D & 3D structures used in modeling and visualization.
2. Chemical Databases:
       o     PubChem
       o     ChEMBL
       o     ZINC database
       o     DrugBank
3. Structure-Activity Relationship (SAR):
       o     Examines the relationship between a compound’s structure and its biological activity.
       o     QSAR (Quantitative SAR): Uses statistical models to predict activity quantitatively.
4. Virtual Screening:
       o     Computational technique to screen large libraries of compounds.
       o     Saves time and cost compared to traditional lab screening.
5. Drug Design Tools:
       o     Ligand-based design (pharmacophore modeling)
       o     Structure-based design (docking studies, molecular dynamics)
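One common way to carry out ligand-based virtual screening is to rank library compounds by fingerprint similarity to a known active. The sketch below uses the Tanimoto coefficient on fingerprints represented as sets of "on" bit positions. The bit sets and compound names are hypothetical placeholders; in practice the fingerprints would be generated from structures by a toolkit such as RDKit.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient: shared on-bits divided by total distinct on-bits."""
    a, b = set(fp_a), set(fp_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def screen(query_fp, library, threshold=0.5):
    """Return (name, similarity) pairs at or above the threshold, best first."""
    scored = [(name, tanimoto(query_fp, fp)) for name, fp in library.items()]
    hits = [(name, score) for name, score in scored if score >= threshold]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Hypothetical fingerprints given as sets of on-bit indices.
query = {1, 4, 7, 9}
library = {
    "cpd_1": {1, 4, 7, 9, 12},   # 4 shared / 5 total = 0.8
    "cpd_2": {1, 4, 20, 21},     # 2 shared / 6 total = 0.33
    "cpd_3": {7, 9},             # 2 shared / 4 total = 0.5
}

print(screen(query, library))    # [('cpd_1', 0.8), ('cpd_3', 0.5)]
```

The threshold of 0.5 is an arbitrary choice for the example; suitable cut-offs depend on the fingerprint type and the screening goal.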
Applications:
•   Lead identification and optimization in drug discovery.
    •    Prediction of ADME-Tox properties (Absorption, Distribution, Metabolism, Excretion,
         Toxicity).
    •    Design of novel compounds with desired bioactivity.
    •    Chemical data mining and knowledge discovery.
    •    Tasks such as spectra simulation, structure elucidation, reaction modeling, synthesis planning.
    Cheminformatics tools
JChemPaint (JCP):
         Open-source tool for drawing and editing 2D chemical structures.
         Developed using the Chemistry Development Kit (CDK).
RDKit:
         Open-source cheminformatics toolkit for reading, manipulating, and depicting molecules.
         Includes a Python API for integration into data analysis workflows.
ChemicalToolbox:
         Provides a user-friendly graphical interface for cheminformatics analysis.
         Built on the Galaxy platform for reproducible workflows.
ChemDraw:
         A widely used commercial software for creating chemical structure diagrams.
         Supports advanced features like reaction mechanisms and spectral analysis.
Avogadro:
         Open-source molecular editor and visualization tool.
         Ideal for 3D modeling and quantum chemistry calculations.
MayaChemTools:
         Includes cheminformatics utilities for molecular design and analysis.
         Offers command-line tools but integrates with graphical environments.
Commercial Cheminformatics Tools for Chemical Structure Creation and Editing:
ChemDraw:
A leading software for drawing and editing chemical structures.
Offers advanced features like reaction mechanisms, spectral analysis, and 3D modeling.
MarvinSketch:
A powerful tool for creating and visualizing chemical structures.
Supports a wide range of file formats and chemical calculations.
BIOVIA Draw:
Provides tools for creating publication-quality chemical drawings.
Integrates with other BIOVIA software for cheminformatics and molecular modeling.
ACD/ChemSketch:
A comprehensive tool for drawing chemical structures and reactions.
Includes features for property prediction and molecular modeling.
ACD/ChemSketch is also available as freeware for drawing chemical structures, including organics,
organometallics, polymers, and Markush structures. It has options for structure cleaning, viewing
and naming, InChI conversion, stereo descriptors, etc. The freeware version comes with no technical
support and has fewer functions than the commercial version, which adds structure search
capabilities.
Molecular Operating Environment (MOE):
A versatile platform for molecular modeling and cheminformatics.
Supports structure creation, visualization, and analysis.
     Major Chemoinformatics Databases
 Database             Description
 PubChem              Open chemistry database by NCBI; includes structure and bioactivity data
 ChEMBL               Bioactivity database of drug-like small molecules
 ZINC Database        Free database of commercially available compounds for virtual screening
 DrugBank             Comprehensive drug and drug target information
 ChemSpider           Free chemical structure database by the Royal Society of Chemistry
 BindingDB            Database of binding affinities between protein targets and small molecules
 PDBbind              Contains experimentally measured binding affinities for protein-ligand
                      complexes
                                         ADME DATABASES
ADME databases are specialized resources that provide information on the Absorption, Distribution,
Metabolism, and Excretion (ADME) properties of drugs. These databases are crucial for drug
discovery and development, as they help predict how a drug will behave in the human body. Here are
a few notable ADME databases:
Fujitsu's ADME Database: This database contains over 130,000 entries related to pharmacokinetics,
including data on drug-metabolizing enzymes (like cytochrome P450s) and transporters. It provides
both in vitro and human clinical drug interaction data.
WangLab's ADME Databases: These include multiple datasets such as water solubility, Caco-2
permeability, blood-brain permeability, and oral bioavailability. They are useful for benchmarking
experiments and building predictive models.
ADME@NCATS: Developed by the National Center for Advancing Translational Sciences (NCATS), part of
the U.S. National Institutes of Health, this resource offers in silico prediction models for
various ADME properties. It allows users to input molecular data and receive predictions with
confidence scores.
   Key ADME Databases and Tools
 Database/Tool            Description
 ADMETlab                 An online platform for comprehensive ADMET property prediction (covers
                          absorption, distribution, metabolism, excretion, and toxicity).
 SwissADME                Free web tool by the Swiss Institute of Bioinformatics for evaluating
                          pharmacokinetics, drug-likeness, and medicinal chemistry friendliness of
                          small molecules.
 pkCSM                    Predicts pharmacokinetic properties using graph-based signatures. Offers
                          data on absorption, BBB permeability, metabolism, and more.
 PreADMET                 Web-based application for predicting ADME and toxicity properties based
                          on molecular structure.
 admetSAR                 A structure-activity relationship-based database and prediction tool for
                          over 60 ADMET properties.
 QikProp (Schrödinger)    Commercial software that provides accurate predictions of ADME
                          properties. Useful in virtual screening and lead optimization.
 Toxtree                  Open-source application to estimate toxic hazard using decision tree
                          approaches; includes some ADME-related predictions.
 eTOX (eTRANSAFE)         A collaborative project integrating data from pharmaceutical industries
                          to build a shared ADMET database.
 OCHEM                    Online Chemical Modeling Environment – supports QSAR modeling for
                          ADMET and toxicity prediction.
 FAF-Drugs4               Free web service for filtering chemical libraries based on
                          physicochemical properties and ADMET rules.
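Filtering a library on physicochemical properties, as several of the tools above do, can be illustrated with Lipinski's rule of five, a widely used drug-likeness heuristic. This is a stand-in chosen for the example and not the specific rule set of any listed tool. The `Descriptors` class, the compound names, and all descriptor values are hypothetical; in practice the values would be computed from structures.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    mol_weight: float        # molecular weight in Da
    logp: float              # octanol-water partition coefficient (log P)
    h_bond_donors: int
    h_bond_acceptors: int


def rule_of_five_violations(d):
    """Count how many of Lipinski's four criteria the compound violates."""
    checks = [
        d.mol_weight > 500,
        d.logp > 5,
        d.h_bond_donors > 5,
        d.h_bond_acceptors > 10,
    ]
    return sum(checks)


def passes_filter(d, max_violations=1):
    """Common convention: tolerate at most one violation."""
    return rule_of_five_violations(d) <= max_violations


# Hypothetical library with made-up descriptor values.
library = {
    "small_cpd": Descriptors(180.2, 1.2, 1, 4),
    "large_cpd": Descriptors(720.0, 6.3, 6, 12),
}

kept = [name for name, d in library.items() if passes_filter(d)]
print(kept)    # ['small_cpd']
```

Filters of this kind are cheap to run, so they are typically applied before more expensive steps such as docking.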
   Commonly Predicted ADME Properties:
   •   Absorption: Caco-2 permeability, HIA (Human Intestinal Absorption), P-gp substrate
   •   Distribution: Volume of distribution, BBB (Blood-Brain Barrier) penetration, plasma protein
       binding
   •   Metabolism: CYP450 enzyme inhibition/induction (e.g., CYP3A4, CYP2D6)
   •   Excretion: Renal clearance, half-life prediction
   •   Toxicity (Tox): Ames mutagenicity, hepatotoxicity, cardiotoxicity (hERG inhibition)
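Two of the properties above, volume of distribution and clearance, together determine the elimination half-life. The sketch below assumes a one-compartment model with first-order elimination, where t½ = ln(2) × Vd / CL; the numeric values are illustrative only and do not describe any real drug.

```python
import math


def elimination_half_life(volume_of_distribution_l, clearance_l_per_h):
    """Half-life in hours for a one-compartment, first-order model."""
    return math.log(2) * volume_of_distribution_l / clearance_l_per_h


def fraction_remaining(time_h, volume_of_distribution_l, clearance_l_per_h):
    """Fraction of the dose still in the body after time_h hours."""
    k_elimination = clearance_l_per_h / volume_of_distribution_l  # rate constant, 1/h
    return math.exp(-k_elimination * time_h)


vd = 40.0   # hypothetical volume of distribution in litres
cl = 4.0    # hypothetical clearance in litres per hour

t_half = elimination_half_life(vd, cl)
print(round(t_half, 2))                                  # 6.93
print(round(fraction_remaining(t_half, vd, cl), 2))      # 0.5
```

A larger volume of distribution or a lower clearance both lengthen the half-life, which is why the two properties are usually predicted together.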
                                            Databases
Chemical Databases
Chemical databases store information about chemical compounds, including their:
   •   Structures (2D & 3D)
   •   Molecular formulas
   •   Physical and chemical properties
   •   Spectral data
   •   Synthetic routes
   Applications:
   •   Structure searching (e.g., substructure or similarity)
   •   Virtual screening
   •   Property prediction (e.g., solubility, boiling point)
   •   Compound procurement or sourcing
   1. PubChem: A comprehensive repository for chemical molecules and their activities.
   2. ChemSpider: A free chemical structure database with over 130 million structures.
   3. ChEMBL: Focuses on bioactive molecules with drug-like properties.
These contain information on chemical structures, properties, and identifiers.
 Database       Description
 PubChem        Open chemistry database maintained by NCBI; contains compounds,
                substances, and bioassays.
 ChemSpider     Free database by the Royal Society of Chemistry with data on chemical
                structures, spectra, properties.
 ZINC           Contains purchasable chemical compounds for virtual screening; widely
                used in docking studies.
 ChEBI          Chemical Entities of Biological Interest – focuses on small molecules of
                biological relevance.
 MolPort        Commercial compound sourcing database for screening and synthesis.
 Reaxys         Commercial database for chemical reaction and substance data, including
                experimental conditions.
 eMolecules     Offers searchable information on commercially available molecules.
Biochemical Databases
Biochemical databases contain data related to biological macromolecules, such as:
   •   Proteins
   •   DNA/RNA sequences
   •   Enzymes
   •   Pathways
   •   Biological interactions
   Applications:
   •    Protein structure prediction
   •    Sequence alignment and comparison
   •    Functional annotation
   •    Understanding metabolic and signaling pathways
   1. GenBank: A nucleotide sequence database maintained by NCBI.
   2. Protein Data Bank (PDB): Contains 3D structural data of biomolecules.
   3. UniProt: A protein sequence and functional information database.
 Database                     Description
 UniProt                      Protein sequence and functional information (Swiss-Prot & TrEMBL).
 PDB (Protein Data Bank)      3D structures of proteins, DNA, RNA, and protein-ligand complexes.
 KEGG                         Kyoto Encyclopedia of Genes and Genomes – pathways, genes, and
                              chemical functions.
 BioCyc                       Provides data on metabolic pathways and genomes from various
                              organisms.
 STRING                       Database of known and predicted protein-protein interactions.
 Pfam                         Protein families database based on conserved domains.
 NCBI Gene                    Gene-specific information including sequences, variants, and expression.
Pharmaceutical Databases
Pharmaceutical databases provide detailed data on drugs and drug targets, including:
   •    Approved and experimental drugs
   •    Drug structures
   •    Mechanisms of action
   •    Pharmacokinetics (ADME)
   •    Side effects
   •    Clinical trial data
   Applications:
   •    Drug repurposing and discovery
   •    Target identification
   •    Pharmacogenomics (gene-drug interactions)
  •   ADMET prediction
  •   Safety and efficacy evaluation
  1. DrugBank: Offers detailed drug and drug target information.
  2. FDA Drug Databases: Includes drug approvals, safety data, and more.
  3. Pharma Data: Provides insights into clinical trials, market trends, and patient demographics.
Database                             Description
DrugBank                             Detailed chemical, pharmacological, and pharmaceutical data on
                                     drugs and their targets.
ChEMBL                               Bioactivity database with drug-like small molecules and their
                                     activity on biological targets.
ClinicalTrials.gov                   Registry of clinical trials worldwide, including drug
                                     interventions and outcomes.
PharmGKB                             Focuses on how genetic variation affects drug response
                                     (pharmacogenomics).
Drugs@FDA                            U.S. FDA database of approved drugs, including labels and
                                     regulatory status.
DailyMed                             Provides FDA-approved drug labeling and dosage information.
RxList                               Offers detailed drug descriptions and uses, mainly for consumer
                                     health reference.
SIDER                                Side Effect Resource – information on adverse drug reactions.
TTD (Therapeutic Target Database)    Information on known and explored therapeutic protein and
                                     nucleic acid targets.