PRACTICAL-1
PDB DATABASE
The Protein Data Bank (PDB) is a global archive of experimentally-determined
3D structures of biological macromolecules such as proteins, nucleic acids, and
complex assemblies.
The Research Collaboratory for Structural Bioinformatics (RCSB) is one of
the three organizations that manage the PDB archive.
RCSB PDB provides access to 208,844 structures from the PDB archive
and 1,068,577 Computed Structure Models (CSM) from AlphaFold DB and
ModelArchive. It can be accessed at rcsb.org.
RCSB PDB offers a range of tools for exploration, visualization, and analysis of
these structures. These include:
• Deposit: A tool for depositing new structures into the PDB archive.
• Search: A tool for searching the PDB archive using various criteria such
as sequence, structure, and ligands.
• Analyze: A tool for analyzing structures using various algorithms such as
Ramachandran plots and electrostatic potential maps.
• Learn: A resource for learning about structural biology and the PDB
archive.
Fig.: Home page of the RCSB PDB database
How to find a protein of interest in the RCSB PDB database?
• Search for the name of the protein.
• Under the Refinements panel, narrow the results using the options under
Scientific Name of Source Organism, Taxonomy, Experimental Method,
Polymer Entity Type, and Refinement Resolution (Å).
• One can also use Advanced Search Query Builder provided by the RCSB
PDB website. The Advanced Search Query Builder provides a powerful
search interface to build complex scientific queries with multiple search
conditions, that combine different attributes, inputs, operators, and
groupings.
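The same kind of multi-condition query can also be written as JSON and sent to the RCSB PDB Search API. The sketch below only builds and prints such a query; it does not contact the server. The endpoint URL, the attribute names, and the example values (protein name, organism, method, resolution cut-off) are assumptions chosen for illustration and should be checked against the current RCSB PDB Search API documentation.

```python
import json

# Assumed endpoint of the RCSB PDB Search API (not stated in this manual).
SEARCH_URL = "https://search.rcsb.org/rcsbsearch/v2/query"


def text_terminal(attribute, operator, value):
    """Return one attribute condition (a 'terminal' node) of a search query."""
    return {
        "type": "terminal",
        "service": "text",
        "parameters": {"attribute": attribute, "operator": operator, "value": value},
    }


def build_query(protein_name, organism, method, max_resolution):
    """Combine a full-text search with three refinements using a logical AND."""
    nodes = [
        {"type": "terminal", "service": "full_text",
         "parameters": {"value": protein_name}},
        # Attribute names below are assumptions taken from the public schema.
        text_terminal("rcsb_entity_source_organism.ncbi_scientific_name",
                      "exact_match", organism),
        text_terminal("exptl.method", "exact_match", method),
        text_terminal("rcsb_entry_info.resolution_combined",
                      "less_or_equal", max_resolution),
    ]
    return {
        "query": {"type": "group", "logical_operator": "and", "nodes": nodes},
        "return_type": "entry",
    }


# Illustrative values only.
query = build_query("insulin receptor", "Homo sapiens", "X-RAY DIFFRACTION", 2.5)
print(json.dumps(query, indent=2))
```

The printed JSON could then be sent as the body of a POST request to SEARCH_URL; each node corresponds to one condition in the Advanced Search Query Builder.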
Image & PDB id of the Protein
How to find the sequence length of the protein?
• After searching for the protein, click on the name of the target entry. A new
page will open; scroll to the Macromolecules section, where the
amino acid sequence and its length can be found.
• In this section, one can also check whether a ligand is already bound
to the target protein. In many cases, multiple ligands are found
bound to one protein.
Fig. Macromolecules section in the RCSB PDB database, used to check the amino acid sequence length of
the protein; the highlighted area summarizes the information available for the protein
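The sequence length can also be checked from a downloaded FASTA file. The following is a minimal sketch that parses FASTA text and reports the length of each chain. The two records in the example are made up for illustration and do not correspond to a real PDB entry.

```python
def read_fasta(text):
    """Parse FASTA text into a dict mapping each header to its sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            # Sequence lines may be wrapped, so they are concatenated.
            records[header] += line
    return records


# Made-up two-chain example; not a real PDB entry.
example = """>XXXX_1|Chain A|hypothetical protein
MKTAYIAKQR
QISFVKSHFS
>XXXX_2|Chain B|hypothetical peptide
GSHMLE
"""

lengths = {header: len(seq) for header, seq in read_fasta(example).items()}
for header, length in lengths.items():
    print(f"{header}: {length} residues")
```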
PRACTICAL-2
UNIPROT DATABASE
The UniProt database is a freely accessible resource of protein sequence and
functional information. It is maintained by the UniProt consortium, which consists
of several European bioinformatics organizations and a foundation from
Washington, DC, United States. The UniProt database comprises three databases:
• The UniProt Knowledgebase (UniProtKB)
• The UniProt Reference Clusters (UniRef)
• The UniProt Archive (UniParc).
The UniProt Knowledgebase (UniProtKB) is a richly curated protein database
consisting of two sections: UniProtKB/Swiss-Prot (manually annotated and
reviewed) and UniProtKB/TrEMBL (automatically annotated and not reviewed). It
contains a large amount of information about the biological function of proteins
derived from the research literature.
The UniProtKB can be used to retrieve information about a protein of interest, such
as its sequence, function, structure, interactions, and more.
• You can search for a protein of interest by typing its name or accession
number in the search bar on the UniProt website.
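The same search can be expressed as a URL for the UniProt REST service. The sketch below only constructs the URLs and does not download anything. The base URL, the query syntax, and the field names are assumptions based on the public UniProt REST interface and should be verified against its documentation; the accession P01308 (human insulin) is used purely as an example.

```python
from urllib.parse import urlencode

# Assumed base URL of the UniProtKB REST service.
BASE = "https://rest.uniprot.org/uniprotkb"


def search_url(query, reviewed_only=True, size=5):
    """Build a search URL; optionally restrict results to reviewed entries."""
    if reviewed_only:
        query = f"({query}) AND (reviewed:true)"
    params = {
        "query": query,
        "format": "tsv",
        # Column names are assumptions; see the UniProt return-field list.
        "fields": "accession,protein_name,organism_name,length",
        "size": size,
    }
    return f"{BASE}/search?{urlencode(params)}"


def fasta_url(accession):
    """Build the URL of the FASTA sequence for one accession number."""
    return f"{BASE}/{accession}.fasta"


print(search_url("insulin"))
print(fasta_url("P01308"))
```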
Fig.: Homepage of UniProt
Figure annotations: filters used for finding the target protein; UniProt ID of the protein
How to find the active/binding sites of a protein in UniProt?
• Open the entry of the protein, go to the Function section, and then scroll down to the Features table.
Fig. Features tab on the homepage of UniProt showing the active and binding sites of the protein
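When an entry is downloaded in JSON format, the same features can be filtered programmatically. The sketch below assumes the entry has a "features" list whose items carry a "type" and a "location" with "start" and "end" values; this layout is an assumption about the UniProt JSON format, and the example entry is invented for illustration.

```python
def sites(entry, wanted=("Active site", "Binding site")):
    """Return (type, start, end, description) for each wanted feature."""
    found = []
    for feature in entry.get("features", []):
        if feature.get("type") in wanted:
            location = feature["location"]
            found.append((
                feature["type"],
                location["start"]["value"],
                location["end"]["value"],
                feature.get("description", ""),
            ))
    return found


# Invented entry for illustration; positions do not describe a real protein.
example_entry = {
    "primaryAccession": "X00000",
    "features": [
        {"type": "Signal", "description": "",
         "location": {"start": {"value": 1}, "end": {"value": 20}}},
        {"type": "Active site", "description": "Proton acceptor",
         "location": {"start": {"value": 57}, "end": {"value": 57}}},
        {"type": "Binding site", "description": "",
         "location": {"start": {"value": 102}, "end": {"value": 104}}},
    ],
}

for site in sites(example_entry):
    print(site)
```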
PRACTICAL-3
NCBI COBALT DATABASE
COBALT is a multiple sequence alignment tool that finds a collection of pairwise
constraints derived from conserved domain database, protein motif database, and
sequence similarity, using RPS-BLAST, BLASTP, and PHI-BLAST. Pairwise
constraints are then incorporated into a progressive multiple alignment.
Fig. Homepage of COBALT Database
COBALT stands for Constraint-based Multiple Alignment Tool. In this practical, it
is used to align two sequences and find the similarities between them.
Fig. Advanced parameters filters under COBALT sequence alignment
Fig.: Alignment results of 2 proteins in COBALT Database
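Once two sequences have been aligned, their similarity is often summarized as percent identity. The sketch below is one simple way to compute it from two aligned rows; it is not COBALT's own scoring. Columns where both rows contain a gap are ignored, and the denominator is the number of remaining columns, which is an assumption, since other definitions of percent identity exist. The two sequences are made up.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of alignment columns in which both rows hold the same residue."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    matches = 0
    compared = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap and b == gap:
            continue  # column with gaps in both rows carries no information
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Made-up aligned rows: 6 identical columns out of 8.
print(percent_identity("MKT-AYIA", "MKTGAY-A"))  # 75.0
```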
PRACTICAL-4
DISCOVERY STUDIO VISUALISER SOFTWARE
The Discovery Studio Visualizer is a free molecular modeling application that
allows users to view, share, and analyze protein and small molecule data. It is
designed to offer an interactive environment for viewing and editing molecular
structures, sequences, X-ray reflection data, scripts, and other data.
How to visualise 2D interactions in Discovery Studio Visualiser?
To view 2D receptor-ligand interactions, first load the receptor and
ligand in the window. Then open the ‘Receptor-Ligand Interactions’ tab in
the middle of the top toolbar and click ‘Show 2D Diagram’ under the
‘Tools’ section in the left panel.
Some of the uses and applications of the Discovery Studio Visualizer include:
• Visualizing proteins and small molecules.
• Analyzing Vina docking results.
• Modifying structures by building and editing nucleic acids and proteins.
• Superimposing proteins.
• Searching side-chain rotamers.
• Running short simulations for structures as well as for small molecules.
• Analyzing protein-ligand interactions.
• Creating pharmacophores.
• Sketching and browsing small molecules.
• Building fragments in small molecules.
• Aligning small molecules.
• Editing X-ray protein structures.
• Storing users’ tools under the ‘My Tools’ tab.
• Creating 2D interaction plots, heat maps, 3D point plots, etc.
PRACTICAL-5
NCBI GENBANK SERVER
• GenBank is an open access, annotated collection of all publicly available
nucleotide sequences and their protein translations.
• It is maintained by the National Center for Biotechnology Information
(NCBI) and is part of the International Nucleotide Sequence Database
Collaboration.
How to use GenBank?
• Search GenBank for sequence identifiers and annotations with Entrez
Nucleotide.
• Search and align GenBank sequences to a query sequence
using BLAST (Basic Local Alignment Search Tool).
• Search, link, and download sequences programmatically using NCBI
E-utilities.
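As an illustration of programmatic access, the sketch below builds E-utilities URLs for searching the nucleotide database and for fetching one record. It only constructs the URLs; nothing is downloaded. The accession and search term are examples, and NCBI's usage guidelines (request rate, API key) should be consulted before sending real requests.

```python
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(term, db="nuccore", retmax=20):
    """URL that searches a database and returns matching record IDs."""
    params = {"db": db, "term": term, "retmax": retmax}
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"


def efetch_url(accession, rettype="gb", retmode="text", db="nuccore"):
    """URL that fetches one record, e.g. as GenBank flat file or FASTA."""
    params = {"db": db, "id": accession, "rettype": rettype, "retmode": retmode}
    return f"{EUTILS}/efetch.fcgi?{urlencode(params)}"


# Example accession and search term, used only to show the URL layout.
print(efetch_url("U00096", rettype="fasta"))
print(esearch_url("insulin[Title] AND Homo sapiens[Organism]"))
```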
How to submit data to GenBank?
• GenBank accepts mRNA or genomic sequence data directly determined by
the submitter. The submission must include information about the source
organism and annotation provided by the submitter. Submission types
include:
• mRNA Sequences
• Prokaryotic Genes
• Eukaryotic Genes
• rRNA and/or ITS
• Viral Sequences
• Transposon or Insertion Sequences
• Microsatellite Sequences
• Pseudogenes
• Cloning Vectors
• Phylogenetic or Population Sets
What NOT to submit to GenBank?
• Sequences <200 bp long.
• Unassembled sequences from next-generation sequencing platforms; these
should be submitted to the Sequence Read Archive (SRA).
• A genomic sequence of multiple exons joined together without the
sequence of the intervening introns or without a 'gap' of internal nnns
representing the missing sequence.
• Primer-only sequences (these can be submitted directly to
NCBI’s Probe database).
• Protein-only sequences.
• Sequences containing a mix of genomic and mRNA sequence represented
as a single sequence.
• Sequences without a physical counterpart (consensus sequences).
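Two of the rules above (minimum length and no protein-only sequences) are easy to pre-check before submission. The sketch below is a hypothetical helper, not an NCBI tool; it flags sequences shorter than 200 bp and sequences containing characters outside the IUPAC nucleotide codes. It cannot detect the other problems listed above, such as joined exons or consensus sequences.

```python
# IUPAC nucleotide codes (including U and ambiguity codes).
NUCLEOTIDE_CODES = set("ACGTUNRYSWKMBDHV")


def presubmission_issues(sequence, min_length=200):
    """Return a list of problems found in a candidate nucleotide sequence."""
    seq = "".join(sequence.split()).upper()  # drop whitespace and line breaks
    issues = []
    if len(seq) < min_length:
        issues.append(f"sequence is {len(seq)} bp, shorter than {min_length} bp")
    unexpected = sorted(set(seq) - NUCLEOTIDE_CODES)
    if unexpected:
        issues.append(
            "non-nucleotide characters (possibly a protein sequence): "
            + "".join(unexpected)
        )
    return issues


print(presubmission_issues("ACGT" * 60))        # 240 bp, no issues
print(presubmission_issues("ACGT"))             # too short
print(presubmission_issues("MKTAYIAKQR" * 30))  # contains I and Q
```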
What GenBank Tools can be used?
• Submission Portal, a unified system for multiple submission types.
• Web-based submission tools, whose submissions are sent to GenBank automatically.
• BankIt, a web-based submission tool with wizards to guide the
submission process. The following data can be submitted using BankIt:
✓ SARS-CoV-2, Ribosomal RNA (rRNA) or rRNA-ITS, Metazoan
(multicellular animal) COX1, Eukaryotic nuclear mRNA, Influenza
virus, Norovirus, Dengue virus, Eukaryotic and Prokaryotic Genomes
(WGS or Complete), Transcriptome Shotgun Assembly (TSA),
Unassembled sequence reads (SRA).
✓ Sequence data not listed above (through BankIt): genomic DNA,
organelle, ncRNA, plasmids, other viruses, phages, other mRNA,
synthetic constructs.
• Currently, only ribosomal RNA (rRNA), rRNA-ITS, metazoan
mitochondrial COX1, eukaryotic nuclear mRNA, Influenza, Norovirus,
Dengue, or SARS-CoV-2 sequences can be submitted with the GenBank
component of the Submission Portal. Genome and Transcriptome Assemblies
can be submitted through the Genomes and TSA portals, respectively.
• This will be expanded in the future to include other types of GenBank
submissions.
• Submission preparation tools which require uploading via the Submission
Portal or email to gb-sub@ncbi.nlm.nih.gov when relevant:
✓ table2asn: a command-line program that replaces the older tool
tbl2asn and automates the creation of sequence records for submission to
GenBank. It is used primarily for submission of annotated genomes and
large batches of sequences, and is available by FTP for use on Mac,
PC, and Unix platforms.
✓ Genome Workbench: offers a rich set of integrated tools for studying
and analyzing genetic data. Its Submission Wizard option allows you to
prepare submissions of single eukaryotic and prokaryotic genomes.
You can also use Genome Workbench to edit and visualize an ASN.1
file created by table2asn.
PRACTICAL-6
KEGG DATABASE & PATHWAY
KEGG stands for Kyoto Encyclopedia of Genes and Genomes. It is a
comprehensive database resource that provides information on high-level
functions and utilities of biological systems, such as cells, organisms, and
ecosystems, from molecular-level information. It is especially useful for
large-scale molecular datasets generated by genome sequencing and other
high-throughput experimental technologies.
The KEGG Database:
The KEGG model is implemented as an integrated database resource consisting
of sixteen databases shown below. They are broadly categorized into systems
information, genomic information, chemical information and health information,
which are distinguished by color coding of web pages.
Fig.: Different types of KEGG Database
KEGG is a collection of databases that deal with genomes, biological pathways,
diseases, drugs, and chemical substances. The KEGG database project was
initiated in 1995 under the Japanese Human Genome Project to enable
understanding of biological systems from genome sequence data. Major efforts
have been undertaken to represent the biological systems in terms of molecular
networks (molecular wiring diagrams), especially in the form of KEGG pathway
maps that are manually created by capturing knowledge from published literature.
KEGG has become one of the most utilized biological databases, accessed by
millions of visitors per month. It is developed by Kanehisa Laboratories.
Fig.: Homepage of KEGG pathway Database
Pathway Identifiers:
Each pathway map is identified by the combination of a 2-4 letter prefix code
and a 5-digit number (see KEGG Identifier). The prefix has the following
meaning:
• map- manually drawn reference pathway
• ko- reference pathway highlighting KOs
• ec- reference metabolic pathway highlighting EC numbers
• rn- reference metabolic pathway highlighting reactions
• <org>- organism-specific pathway generated by converting KOs to gene
identifiers
The leading digits of the number indicate the type of map, as follows:
• 011- global map (lines linked to KOs)
• 012- overview map (lines linked to KOs)
• 010- chemical structure map (no KO expansion)
• 07- drug structure map (no KO expansion)
• Other- regular map (boxes linked to KOs)
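The identifier rules above can be applied mechanically. The sketch below splits a pathway identifier into its prefix and number and looks up the meaning of each part. It assumes lowercase prefixes of 2-4 letters, as KEGG writes them (for example map01100 or hsa04930), and treats any prefix other than map, ko, ec, and rn as an organism code. The function name is hypothetical.

```python
import re

PATTERN = re.compile(r"^([a-z]{2,4})(\d{5})$")

REFERENCE_PREFIXES = {
    "map": "manually drawn reference pathway",
    "ko": "reference pathway highlighting KOs",
    "ec": "reference metabolic pathway highlighting EC numbers",
    "rn": "reference metabolic pathway highlighting reactions",
}

# Leading digits of the 5-digit number and the map type they indicate.
MAP_TYPES = [
    ("011", "global map"),
    ("012", "overview map"),
    ("010", "chemical structure map"),
    ("07", "drug structure map"),
]


def describe_pathway(identifier):
    """Return (prefix, number, prefix meaning, map type) for an identifier."""
    match = PATTERN.match(identifier)
    if match is None:
        raise ValueError(f"not a KEGG pathway identifier: {identifier!r}")
    prefix, number = match.groups()
    meaning = REFERENCE_PREFIXES.get(
        prefix, f"organism-specific pathway ({prefix})")
    map_type = "regular map"
    for leading_digits, name in MAP_TYPES:
        if number.startswith(leading_digits):
            map_type = name
            break
    return prefix, number, meaning, map_type


print(describe_pathway("map01100"))
print(describe_pathway("hsa04930"))
```

Here hsa04930 is the human pathway map for Type II diabetes mellitus, and map01100 is a global map because its number starts with 011.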
KEGG PATHWAY is integrated with MODULE and NETWORK databases
as indicated below.
• M - module
• R - reaction module
• N - network
Fig.: KEGG pathway for Type II Diabetes Mellitus
Database of KEGG:
KEGG maintains 6 main databases:
1. KEGG Pathway
2. KEGG Genes
3. KEGG Genome
4. KEGG Ligand
5. KEGG BRITE
6. KEGG Cancer
KEGG: Kyoto Encyclopedia of Genes & Genomes
KEGG GENES Database:
KEGG GENES is a collection of genes and proteins in complete genomes of cellular organisms
and viruses generated from publicly available resources, mostly from NCBI RefSeq and
GenBank, and annotated by KEGG in the form of KO (KEGG Orthology) assignment. The
collection is supplemented with a KEGG original collection of functionally characterized
proteins from published literature. Protein sequences and RNA sequences of all GENES entries
are subject to SSDB computation and KO assignment by KOALA tools.
Fig.: KEGG of Genes and Genomes (Different Modules and Tools)
PRACTICAL-7
STRING DATABASE
The STRING database is a comprehensive resource for protein-protein
interaction networks and functional enrichment analysis. It contains information
on over 59.3 million proteins from 12,535 organisms and more than 20 billion
interactions. The interactions include both direct (physical) and indirect
(functional) associations, and are derived from computational prediction,
knowledge transfer between organisms, and interactions aggregated from other
primary databases.
The database content is pre-computed, stored in a relational database, and
available for separate download. All interaction evidence that contributes to a
given network is benchmarked and scored, and the scores are integrated into a
final “combined score”.
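To illustrate how several evidence scores can be merged into one combined score, the sketch below treats each channel score as a probability and computes the probability that at least one channel supports the interaction, after removing and then re-adding a prior for random association. This is a simplified sketch of the general approach. The prior value of 0.041 is an assumption and may not match the value used by a given STRING version, and the real procedure includes further details not shown here.

```python
def combined_score(channel_scores, prior=0.041):
    """Combine per-channel scores (each between 0 and 1) into one score.

    The prior is the assumed probability of a random association; it is
    removed from every channel before combining and added back once.
    """
    def without_prior(score):
        return max(score - prior, 0.0) / (1.0 - prior)

    probability_none = 1.0
    for score in channel_scores:
        probability_none *= 1.0 - without_prior(score)
    combined = 1.0 - probability_none
    return combined * (1.0 - prior) + prior


# Illustrative channel scores, e.g. co-expression and text mining.
print(round(combined_score([0.4, 0.7]), 3))
# With no prior, the formula reduces to 1 - (1 - 0.5) * (1 - 0.5).
print(combined_score([0.5, 0.5], prior=0.0))  # 0.75
```

With this scheme the combined score is never lower than the strongest single channel, and a single channel is returned unchanged.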
Fig. Homepage of STRING Database
To build a network (a ‘string’) between proteins and interpret their possible
interactions, enter the protein names in the ‘Multiple proteins’ tab; the
network is then generated as shown in the figure below.
Fig. Homepage of String Database
Fig. NDRG1 protein interactions with various targets
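The same multi-protein query can be written as a URL for the STRING API. The sketch below only constructs the URL. The API path, the parameter names, and the score threshold are assumptions based on the public STRING API and should be checked against its documentation; TP53 is added next to NDRG1 purely as a hypothetical second protein, and 9606 is the NCBI taxonomy ID for human.

```python
from urllib.parse import urlencode

# Assumed base URL of the STRING API.
STRING_API = "https://string-db.org/api"


def network_url(proteins, species=9606, output_format="tsv", required_score=400):
    """Build a URL requesting the interaction network of several proteins."""
    params = {
        # STRING expects identifiers separated by a carriage return (%0D).
        "identifiers": "\r".join(proteins),
        "species": species,
        # Minimum combined score on a 0-1000 scale (assumed parameter).
        "required_score": required_score,
    }
    return f"{STRING_API}/{output_format}/network?{urlencode(params)}"


print(network_url(["NDRG1", "TP53"]))
```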
STRING allows inspection of the interaction evidence for any given network in
the following ways:
Fig. Gene Co-expression of the above interpreted STRING results
PRACTICAL-8
NCBI- GTR & TREE VIEWER
A. NCBI- GTR
The Genetic Testing Registry (GTR) is a free online resource developed by the
National Institutes of Health (NIH) that centralizes comprehensive information
about genetic tests offered in the USA and abroad. The registry includes
information about health-related clinical and research tests for germline variation,
including pharmacogenetic tests, and soon will expand to include tests for
somatic variants. The database provides details of each test, such as its purpose,
target populations, methods, what it measures, analytical validity, clinical
validity, clinical utility, ordering information, and laboratory location and contact
information.
The GTR is a valuable resource for clinicians and researchers who need to make
informed decisions about the use of genetic tests for patient care. It enables access
to comprehensive information about testing offered worldwide for disorders with
a genetic basis. Related information in the NIH Genetic Testing Registry (GTR),
MedGen, Gene, OMIM, PubMed and other sources is accessible through
hyperlinks on the records.
Fig.: GTR website server page showing results for Diabetes mellitus related genes
B. TREE VIEWER
NCBI Tree Viewer is a graphical display for phylogenetic trees that can visualize
trees in ASN (text and binary), Newick, and Nexus formats. It is a free online
resource developed by the National Center for Biotechnology Information
(NCBI) that enables researchers to view and manipulate phylogenetic trees.
The tool allows users to perform the following actions with a tree:
• Zooming and navigation
• Displaying in different layouts, either as a rectangular or slanted cladogram
or as a circular or radial phylogenetic tree
• Selecting branches and viewing an overview of the selection
• Collapsing/Expanding branches
• Rooting at midpoint
• Re-rooting at nodes
• Sorting
• Uploading/Downloading
• Creating PDF
Fig. Rectangular Cladogram representation of species to interpret phylogenetic relationships
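Of the formats accepted by Tree Viewer, Newick is the simplest to write by hand: nested parentheses describe the branching, and an optional number after a colon gives the branch length. The sketch below lists the leaf (species) names in a Newick string using a regular expression. It is a minimal illustration that assumes unquoted names; the tree and its branch lengths are made up.

```python
import re


def leaf_names(newick):
    """Return leaf labels of a Newick tree in left-to-right order.

    A leaf label directly follows '(' or ','; labels after ')' belong to
    internal nodes and are skipped. Quoted labels are not supported.
    """
    candidates = re.findall(r"[(,]\s*([^(),:;]+)", newick)
    return [name.strip() for name in candidates if name.strip()]


# Made-up tree with illustrative branch lengths.
tree = "((Human:0.1,Chimp:0.1):0.2,Mouse:0.5);"
print(leaf_names(tree))  # ['Human', 'Chimp', 'Mouse']
```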