Biological Databases

The document provides an overview of bioinformatics, focusing on sequence and genome analysis, and highlights key resources such as EMBnet and NCBI. It details the structure, mission, and services offered by EMBnet, including education, software development, and support for over 40,000 users globally. Additionally, it discusses various nucleic acid sequence databases and specialized genomic resources that facilitate research in molecular biology.

Uploaded by

Shashi Ranjan

Bioinformatics

Information Resources And Networks


Title : Bioinformatics
Subtitle : Sequence and genome analysis
Author : Mount, David W.
Publ.Plc : New York
Publ. : Cold Spring Harbor
Pages : xii, 564p.
ISBN : 0-87969-608-7

Title : Introduction to bioinformatics
Author : Attwood, Teresa K.; Parry-Smith, D. J.
Publ.Plc : New Delhi
Publ. : Pearson Education
Pages : xvi, 218p.
Ser Note : Cell and molecular biology
ISBN : 81-7808-507-0
Outline
Bioinformatics Information Resources And Networks


• EMBnet – European Molecular Biology Network
  • DBs and Tools
• NCBI – National Center For Biotechnology Information
  • DBs and Tools
• Nucleic Acid Sequence Databases
• Protein Information Resources
• Metabolic Databases
• Mapping Databases
• Databases concerning Mutations
• Literature Databases
EMBnet – European Molecular
Biology Network

 Founded in 1988
 Network that links European laboratories that use biocomputing and bioinformatics in molecular biology research
 A science-based group of collaborating nodes throughout Europe, plus nodes outside Europe
 Provides information, services and training to its users
 Works to increase the availability and accessibility of data resources and computing tools
 Increases knowledge and proficiency in bioinformatics through education and training
EMBnet - Nodes http://www.embnet.org/

EMBnet (41 nodes)
• National Nodes (18): governmental
• Specialist Nodes (9): academic, industrial research centers
• Associate Nodes (11): biocomputing centers from non-European countries
EMBnet - Nodes
National Nodes
 Appointed by the governments
 Provide on-line services, user support and training

Vienna Biocenter - Austria    BEN - Belgium
CSC - Finland                 INFOBIOGEN - France
DKFZ - Germany                HEN - Hungary
INCBI - Ireland               INN - Israel
IEN-AdR - Italy               CMBI - Netherlands
Bio - Norway                  IBB - Poland
PEN - Portugal                GeneBee - Russia
CNB-CSIC - Spain              BMC - Sweden
SIB - Switzerland             SEQNET - UK
EMBnet - Nodes
Specialist Nodes
 Academic, industrial or research centers in specific areas of bioinformatics
 Largely responsible for maintenance of biological databases and software

MIPS (Munich Information Center for Protein Sequences)
ICGEB
Pharmacia
F.Hoffmann – La Roche
EBI
HGMP - RC
Sanger
UCL

 EBI: important key specialist node at Hinxton Hall (Cambridge UK) and home of the EMBL, SWISS-PROT and TrEMBL databases
EMBnet - Nodes
Associate Nodes
 Centers from non-European countries

IBBM - Argentina      ANGIS - Australia
CBI - China           CIGB - Cuba
CDFD - India          SANBI – South Africa
EMBnet - Brazil       CBR - Canada
EMBnet - Chile        EMBnet - Colombia
CIFN - Mexico
EMBnet’s Mission

 Assist in biotechnological and bioinformatics related research

 Provide training and education

 Exploit network infrastructures

 Investigate and develop new technologies

 Bridge between commercial and academic sectors


What does EMBnet do?
 Education and training
 Software development
 Computing resources
 Technical support
 Help desk in local languages
 Publications
Who are EMBnet’s Users?
 > 40,000 registered users from all over
the world as well as a larger number of
Internet users
 All scientists working in Life Sciences,
from undergraduate students to top level
scientists, in academia as well as
industry, can get support from EMBnet
EMBnet – SRS
Sequence Retrieval System - SRS
• result of a research project within EMBnet, built to interrogate all the resources gathered together
• SRS is a network browser for DBs in molecular biology
• SRS allows any flat-file DB to be indexed to any other
• queries across a range of different DB types via a single interface
• independent of underlying data structures or query languages
http://srs.ebi.ac.uk/
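The idea of indexing one flat-file DB against another can be sketched in a few lines of Python. This is not SRS code, only a minimal illustration under stated assumptions: the records are invented, the layout imitates the EMBL/Swiss-Prot convention of two-letter line codes with "//" closing each entry, and the function names are hypothetical.

```python
from collections import defaultdict

# Invented records in an EMBL/Swiss-Prot-like layout: a two-letter line code,
# then data, with "//" closing each entry. None of these IDs are real.
PROTEIN_FLATFILE = """\
ID   PROT_A
AC   P00001;
DR   EMBL; X00001;
//
ID   PROT_B
AC   P00002;
DR   EMBL; X00002;
DR   PDB; 1ABC;
//
"""


def parse_entries(text):
    """Yield one dict per entry, mapping line code -> list of data strings."""
    entry = defaultdict(list)
    for line in text.splitlines():
        if line.startswith("//"):
            if entry:
                yield dict(entry)
            entry = defaultdict(list)
        elif line.strip():
            code, _, data = line.partition(" ")
            entry[code].append(data.strip())


def build_link_index(text):
    """Map (target databank, target accession) -> IDs of entries citing it."""
    index = defaultdict(list)
    for entry in parse_entries(text):
        entry_id = entry["ID"][0]
        for xref in entry.get("DR", []):
            fields = [f.strip() for f in xref.rstrip(";").split(";")]
            databank, accession = fields[0], fields[1]
            index[(databank, accession)].append(entry_id)
    return dict(index)


LINKS = build_link_index(PROTEIN_FLATFILE)
```

Looking up `LINKS[("PDB", "1ABC")]` then returns the protein entries that cite that structure, which is the kind of cross-databank hop a single query interface can build on.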
Sequence Retrieval System
Network Browser for Databanks in Molecular Biology

Data Bank        No Entries   Indexing Date   Group             Availability
SWISSPROT        163235       10-Jun-2005     Sequence          ok
SWISSNEW         81134        22-Mar-2006     Sequence          ok
NRDB             2269647      29-Mar-2006     Sequence          ok
SWALL            3022528      22-Mar-2006     Sequence          ok
UNIPROT_SPROT    212425       22-Mar-2006     Sequence          ok
UNIPROT_TREMBL   2666963      23-Mar-2006     Sequence          ok
TREMBLNEW        624819       12-Dec-2005     Sequence          ok
TREMBL           2576118      04-Oct-2005     Sequence          ok
SPTREMBL         1449374      16-Jun-2005     Sequence          ok
SPTREMBLNEW      143140       17-Jun-2005     Sequence          ok
REMTREMBL        92182        20-Jun-2005     Sequence          ok
PIR              283416       16-Jun-2005     Sequence          ok
WORMPEP          19538        16-Jun-2005     Sequence          ok
DROSOPHILA       14100        16-Jun-2005     Sequence          ok
EMBLNEW          4035816      21-Nov-2005     Sequence          ok
EMBL             20343598     30-Dec-2005     Sequence          ok
EMBLEST          31990232     06-Jan-2006     Sequence          ok
EMBLWGS          11106060     24-Sep-2005     Sequence          ok
GENBANK          19233264     18-Nov-2005     Sequence          ok
GENBANKEST       31008556     23-Feb-2006     Sequence          ok
REFSEQP          8006         16-Jun-2005     Sequence          ok
SUBTILIST        1            16-Jun-2005     Sequence          ok
PROSITE          1935         22-Mar-2006     SeqRelated        ok
PROSITEDOC       1407         22-Mar-2006     SeqRelated        ok
BLOCKS           4034         16-Jun-2005     SeqRelated        ok
EPD              1375         16-Jun-2005     SeqRelated        ok
ENZYME           4173         16-Jun-2005     SeqRelated        ok
PRINTS           865          16-Jun-2005     SeqRelated        ok
TFSITE           4342         07-Apr-2003     TransFac          ok
TFFACTOR         1799         07-Apr-2003     TransFac          ok
TFCELL           816          07-Apr-2003     TransFac          ok
TFCLASS          27           07-Apr-2003     TransFac          ok
TFMATRIX         246          07-Apr-2003     TransFac          ok
TFGENE           1035         07-Apr-2003     TransFac          ok
PDB              34927        08-Feb-2006     Protein3DStruct   ok
DSSP             30832        22-Nov-2005     Protein3DStruct   ok
HSSP             30369        08-Feb-2006     Protein3DStruct   ok
PDBFINDER        35701        28-Mar-2006     Protein3DStruct   ok
NRL3D            6063         16-Jun-2005     Protein3DStruct   ok
FLYGENES         7556         16-Jun-2005     Genome            ok
FLYREFS          0            07-Apr-2003     Genome            ok
OMIM             17004        18-Oct-2005     Mutations         ok
REPTILIA         8364         18-Jan-2006     Others            ok
EMBnet - EMBOSS
 The European Molecular Biology Open Software Suite
 EMBOSS is a free Open Source software analysis package
specially developed for the needs of the molecular biology (e.g.
EMBnet) user community.
 The software automatically copes with data in a variety of
formats and even allows transparent retrieval of sequence data
from the web.
 Also, as extensive libraries are provided with the package, it is
a platform to allow other scientists to develop and release
software in true open source spirit.
 EMBOSS also integrates a range of currently available
packages and tools for sequence analysis into a seamless
whole.
What can EMBOSS do for you?

 Within EMBOSS you will find hundreds of programs (applications) covering areas such as:
• Sequence alignment,
• Rapid database searching with sequence patterns,
• Protein motif identification, including domain analysis,
• Nucleotide sequence pattern analysis,
• Codon usage analysis for small genomes,
• Rapid identification of sequence patterns in large scale sequence sets,
• Presentation tools for publication,
and much more. Check:
http://emboss.sourceforge.net/
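To make one item on this list concrete, codon usage analysis amounts to counting in-frame triplets in a coding sequence. The sketch below is plain Python and does not call or reproduce any EMBOSS program; the function names and the example sequence are invented for illustration, and it assumes the sequence starts in frame.

```python
from collections import Counter


def codon_usage(cds):
    """Count codons in a coding sequence read in frame from its first base."""
    seq = cds.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    return Counter(seq[i:i + 3] for i in range(0, usable, 3))


def codon_frequencies(cds):
    """Return each codon's share of all codons counted."""
    counts = codon_usage(cds)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {codon: n / total for codon, n in counts.items()}


# Invented example: ATG GCT GCT GCA TAA
EXAMPLE_CDS = "ATGGCTGCTGCATAA"
USAGE = codon_usage(EXAMPLE_CDS)
FREQUENCIES = codon_frequencies(EXAMPLE_CDS)
```

Comparing such frequency tables between genes or genomes is the usual next step; a production tool would also handle ambiguous bases and alternative genetic codes, which this sketch ignores.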

NCBI – National Center For Biotechnology Information
 Leading American information provider
 Established in 1988 as a division of the National Library of Medicine (NLM)
• Located on the campus of the National Institutes of Health (NIH – Bethesda/Maryland)

Mission:
 Development of new information technologies to aid our understanding of the molecular and genetic processes that underlie health and disease
 Creation of systems for storing and analysing biological information
 Development of advanced methods of computer-based information processing
 Facilitation of user access to DBs and software
 Co-ordination of efforts to gather biotechnology information worldwide
NCBI
 Since 1992: maintenance of GenBank and collaboration with the international nucleotide DBs EMBL and DDBJ (Japan)
 Provides Entrez, which facilitates access to biological DBs (similar to SRS, which is provided by EMBnet)
 gquery (https://www.ncbi.nlm.nih.gov/gquery/)
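Entrez can also be queried from a program through NCBI's E-utilities web interface, which the slides do not cover. As a hedged sketch, the snippet below only builds an ESearch request URL and performs no network call; the helper name and the search term are illustrative assumptions.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, retmax=20):
    """Build an ESearch request URL; fetching it is left to the caller."""
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return f"{EUTILS_BASE}/esearch.fcgi?{query}"


# Illustrative query: human nucleotide records mentioning insulin.
URL = esearch_url("nucleotide", "Homo sapiens[Organism] AND insulin", retmax=5)
```

The resulting URL can be opened in a browser or fetched with any HTTP client; the response lists matching record identifiers for the chosen database.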
NCBI - Responsibilities
 administers research on biomedical problems at the molecular
level using mathematical and computational methods
 maintains collaborations with several NIH (National Institutes of
Health) institutes, academia, industry, and other governmental
agencies
 promotes scientific communication by sponsoring meetings,
workshops, and lecture series
 supports training on basic and applied research in
computational biology for postdoctoral fellows through the NIH
Intramural Research Program
 engages members of the international scientific community in
informatics research and training through the Scientific Visitors
Program
 develops, distributes, supports, and coordinates access to a
variety of databases and software for the scientific and medical
communities
 develops and promotes standards for databases, data
deposition and exchange, and biological nomenclature
Nucleic Acid Sequence Databases
• the principal nucleic acid sequence databases are GenBank, EMBL and DDBJ, which each collect a portion of the total sequence data reported world-wide, and exchange new and updated entries on a daily basis

Nucleic acid sequence Databases


EMBL (Europe)
GenBank (USA)
DDBJ (Japan)
ENSEMBL (project between EMBL - EBI and the Sanger Institute)
dbEST (division of GenBank)
GSDB (division of GenBank)
EMBL
source: http://www3.ebi.ac.uk/Services/DBStats/

Nucleic Acid Sequence Databases - EMBL


This week the EMBL Database contained 301,588,430,608 nucleotides in
199,575,971 entries
Breakdown by entry type:

Entry Type                      Entries        Nucleotides
Standard                        128,262,666    120,603,334,814
Constructed (CON)               6,381,010      225,047,233,405
Third Party Annotation (TPA)    6,894          385,832,010
Whole Genome Shotgun (WGS)      64,925,118     180,599,264,067
The EMBL Nucleotide Sequence Database (also known as EMBL-Bank)
constitutes Europe's primary nucleotide sequence resource. Main sources
for DNA and RNA sequences are direct submissions from individual
researchers, genome sequencing projects and patent applications. The
database is produced in an international collaboration with GenBank (USA)
and the DNA Database of Japan (DDBJ). Each of the three groups collects a
portion of the total sequence data reported worldwide, and all new and
updated database entries are exchanged between the groups on a daily
basis.
Nucleic Acid Sequence Databases - EMBL
[Figure: growth of the EMBL database. Panels: number of entries (current 69,666,551) and total nucleotides (current 127,450,085,130)]

Ref: EMBL Nucleotide Sequence Database: developments in 2005, Nucleic Acids Research, 2006, Vol. 34, D10–D15
Nucleic Acid Sequence Databases - EMBL
[Figure: EMBL content by nucleotide count per organism: Homo sapiens, Mus musculus, Rattus norvegicus, Pan troglodytes, Bos taurus, Canis familiaris, Monodelphis domestica, Danio rerio, Macaca mulatta, Loxodonta africana, Other]
Nucleic Acid Sequence
Databases – GenBank
 GenBank, which is produced at NCBI, is split into smaller, discrete divisions.
 This facilitates fast, specific searches by restricting queries to particular database subsets
 During 1992-1997, the level of EST and STS data within GenBank grew 10-fold.
 Even so, the overall sequence information contributed by such partial data was still less than that of the higher quality sequences in the other major divisions
Specialised Genomic Resources

 In addition to the comprehensive DNA sequence DBs, there is a variety of more specialised genomic resources.
 These so-called boutique DBs bring focus to species-specific genomics and to particular sequencing techniques.

Specialised Genomic Resources


SGD – Saccharomyces Genome Database
UniGene - gene-oriented clusters from GenBank
TIGR - Databases of The Institute for Genomic
Research
ACeDB – A C. elegans DataBase
Specialised Genomic Databases
 SGD
http://www.yeastgenome.org/ (baker's yeast)
 AceDB
http://www.acedb.org (C. elegans)
 FlyBase
http://flybase.org/ (fruit fly)
 MGD
http://www.informatics.jax.org (mouse)
Protein Information Resources

Levels of protein sequence and structural organisation:

primary: The primary structure of a protein is its amino acid sequence.

secondary: The secondary structure of a protein corresponds to regions of local regularity (e.g., α-helices and β-strands).

tertiary: The tertiary structure of a protein arises from the packing of its secondary structure elements, which may form discrete domains within a fold.
Protein Information Resources
Levels of protein sequence and structural organisation:

primary: sequence (AVILDRYFH), held in a primary database
secondary: motif ([AS]-[IL]2-X[DE]-R-[FYW]2-H), held in a secondary database
tertiary: domain or module, held in a structure database
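A motif of this kind is, in effect, a regular expression over residues. The sketch below translates a PROSITE-style pattern into a Python regular expression. It assumes the slide's shorthand can be written in PROSITE syntax as `[AS]-[IL](2)-x-[DE]-R-[FYW](2)-H`; that mapping, the helper name and the test sequence are assumptions for illustration, and only the basic syntax elements are handled.

```python
import re


def prosite_to_regex(pattern):
    """Translate a basic PROSITE-style pattern into a Python regex string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.replace("{", "[^").replace("}", "]")  # {P} -> [^P]
        element = element.replace("(", "{").replace(")", "}")   # (2) -> {2}
        element = element.replace("x", ".")                     # x = any residue
        element = element.replace("<", "^").replace(">", "$")   # terminus anchors
        parts.append(element)
    return "".join(parts)


# Assumed PROSITE-style rendering of the motif shown on the slide.
MOTIF = "[AS]-[IL](2)-x-[DE]-R-[FYW](2)-H"
REGEX = prosite_to_regex(MOTIF)

# Invented sequence containing one occurrence of the motif.
MATCH = re.search(REGEX, "MKAILKDRYFHG")
```

`MATCH.group()` gives the matching stretch of the sequence and `MATCH.start()` its position, so the same function can be used to scan many sequences for one signature.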
Primary Protein Databases

• The primary structure of a protein is its amino acid sequence


• these are stored in primary databases as linear alphabets that
denote the constituent residues

Protein sequence Databases


SWISS-PROT - Protein knowledgebase
TrEMBL - Computer-annotated supplement to Swiss-Prot
PIR – Protein Information Resource
MIPS – Munich Information Centre for Protein Sequences
NRL-3D - produced by PIR
Protein Sequence Databases
 Swiss-Prot contains 197,228 sequence entries, comprising 71,501,181 amino acids abstracted from 135,257 references
 Total number of species represented in Swiss-Prot: 9,520
 The average sequence length in Swiss-Prot is 362 amino acids.
 Swiss-Prot is the most highly annotated protein sequence DB
 http://expasy.org/sprot/

Table of the most represented species

No.   Frequ.   Species
1     13049    Homo sapiens (Human)
2     10132    Mus musculus (Mouse)
3     5189     Saccharomyces cerevisiae (Baker's yeast)
4     4847     Escherichia coli
5     4669     Rattus norvegicus (Rat)
6     3665     Arabidopsis thaliana (Mouse-ear cress)
7     2863     Schizosaccharomyces pombe (Fission yeast)
8     2814     Bacillus subtilis
9     2750     Caenorhabditis elegans
10    2286     Drosophila melanogaster (Fruit fly)
Composite Protein Sequence
Databases
 Composite databases amalgamate a variety of
different primary databases
 They render sequence searching much more
efficient, because they obviate the need to
interrogate multiple resources
 Different composite databases use different
primary sources and different redundancy
criteria in their amalgamation procedures
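As a rough illustration of amalgamation, the sketch below merges several sources in priority order and keeps only the first copy of each identical sequence. This is one simple redundancy criterion chosen for the example, not the rule used by any particular composite database; the source names, entry IDs and sequences are invented.

```python
def build_composite(sources):
    """Merge sources in priority order, keeping the first copy of each sequence.

    sources: list of (source_name, {entry_id: sequence}) pairs.
    Returns a list of (source_name, entry_id, sequence) tuples.
    """
    seen = set()
    composite = []
    for source_name, entries in sources:
        for entry_id, sequence in entries.items():
            key = sequence.upper()
            if key in seen:
                continue  # identical sequence already taken from an earlier source
            seen.add(key)
            composite.append((source_name, entry_id, sequence))
    return composite


# Invented data: entry B1 duplicates A1 and is dropped from the composite.
SOURCES = [
    ("db_one", {"A1": "MKTAYIAK", "A2": "MSSHEGGK"}),
    ("db_two", {"B1": "MKTAYIAK", "B2": "MLLPWQRT"}),
]
COMPOSITE = build_composite(SOURCES)
```

Changing the order of `SOURCES` changes which copy survives, which is one reason composites built from the same primary databases can still differ.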
Composite Protein Sequence Databases

Composite DB and its primary sources:
NRDB (Non-Redundant DB): PDB, SWISS-PROT, PIR, GenPept, SWISS-PROTupdate, GenPeptupdate
OWL: SWISS-PROT, PIR, GenBank, NRL-3D
MIPSX: PIR1-4, MIPSOwn, MIPSTrn, MIPSH, PIRMOD, NRL-3D, SWISS-PROT, EMTrans, GBTrans, Kabat, PseqIP
SP+TrEMBL (SwissProt + TrEMBL): SWISS-PROT, TrEMBL
Secondary databases
 Secondary databases contain pattern data, i.e., diagnostic signatures for protein families. These signatures encode the most highly conserved features of multiply aligned sequences, which are often crucial to the structure or function of the protein.
 The secondary structure of a protein corresponds to regions of local regularity (e.g., α-helices and β-strands), which, in sequence alignments, are often apparent as well-conserved motifs.
 Patterns are regular expressions, fingerprints, blocks, profiles, etc.
Secondary databases

Secondary DB   Primary source    Stored information
PROSITE        SWISS-PROT        Regular expressions (patterns)
Profiles       SWISS-PROT        Weighted matrices (profiles)
PRINTS         OWL               Aligned motifs (fingerprints)
BLOCKS         PROSITE/PRINTS    Aligned motifs (blocks)
IDENTIFY       BLOCKS/PRINTS     Fuzzy regular expressions (patterns)
Secondary databases

 TRANSFAC
http://transfac.gbf.de
 EPD
http://www.epd.isb-sib.ch
 InterPro
http://www.ebi.ac.uk/interpro/
 PROSITE
http://www.expasy.ch/prosite
 BLOCKS
http://blocks.fhcrc.org
 PRINTS
ftp://ftp.seqnet.dl.ac.uk/pub/database/prints
 PFAM
http://www.sanger.ac.uk/Software/Pfam/index.shtml
 ProDom
http://www.toulouse.inra.fr/prodom.html
 GeneCards
http://bioinformatics.weizmann.ac.il/cards
 ENSEMBL
http://www.ensembl.org
 EcoCyc
http://ecocyc.panbio.com/ecocyc/ecocyc.html
Secondary databases
 There is some overlap in content between the secondary databases
 PDBsum alone has 35,291 entries
 Pattern DB growth is slow because adding detailed family annotation is very time-consuming.
 PROSITE and PRINTS are the only comprehensively, manually annotated secondary DBs
 To address the annotation bottleneck, the secondary database curators have together created a unified database of protein families known as InterPro
Structure Classification DBs
 Contain 3D structures available from
crystallographic and spectroscopic studies

Structure Classification Databases


PDBsum – Protein Data Bank
CATH – Class, Architecture, Topology, Homology
SCOP – Structural Classification of Proteins
Structure Classification DBs
 PDB
http://www.rcsb.org
 SCOP
http://scop.mrc-lmb.cam.ac.uk/scop
 CATH
http://www.cathdb.info/
 DSSP
http://swift.cmbi.ru.nl/gv/dssp/
 FSSP
http://www.ebi.ac.uk/dali/fssp
 HSSP
http://swift.cmbi.kun.nl/swift/hssp/
Metabolic Databases

A number of metabolic databases are available electronically, some with features for querying and visualizing metabolic pathways and regulatory networks.

 KEGG (Kyoto Encyclopedia of Genes and Genomes)
http://www.genome.ad.jp/kegg
 ENZYME (Enzyme nomenclature database)
http://www.expasy.ch/enzyme
 BRENDA (Enzyme Information System)
http://www.brenda-enzymes.org/
 EMP (Enzymes and Metabolic Pathways database)
http://www.metacyc.org/
Mapping Databases

 OMIM
http://www.ncbi.nlm.nih.gov/omim

 GDB (The GDB Human Genome Data Base: a source of integrated genetic mapping and disease data.)
http://morissardjerome.free.fr/infobiogen/www.gdb.org/gdb/
Databases concerning
Mutations

 dbSNP
http://www.ncbi.nlm.nih.gov/SNP

 The SNP Consortium (TSC)
http://snp.cshl.org

 PASNP
http://www4a.biotec.or.th/PASNP
Literature Databases

 PubMed
http://www.ncbi.nlm.nih.gov/entrez/query

 Bioinformatics Online
http://www.bioinformatics.oupjournals.org

 Nature
http://www.nature.com

 Science
http://www.sciencemag.org
In 2003 scientists in the Human Genome
Project obtained the DNA sequence of the 3
billion base pairs making up the human
genome
Sequencing the Human Genome:
A Landmark in the History of Mankind
What we’ve learned so far from
the Human Genome Project

The human genome is nearly the same (99.9%) in all people

Only about 2% of the human genome contains genes, which are the instructions for making proteins
Other Lessons from the
Human Genome Project

Humans have an estimated 30,000 genes; the functions of more than half of them are unknown

Almost half of all human proteins share similarities with those of other organisms, underscoring the unity of life
Sequence Alignment Logic
Evaluation of the alignment is a biological concept (significance)
Are you ready for the revolution?

If biologists do not adapt to the powerful computational tools needed to exploit huge data sets, says Declan Butler, they could find themselves floundering in the wake of advances in genomics.
Need to understand better from Human Genome sequence:

•Gene number, exact locations, and functions
•Gene regulation
•DNA sequence organization
•Chromosomal structure and organization
•Noncoding DNA types, amount, distribution, information content, and
functions
•Coordination of gene expression, protein synthesis, and post-translational
events
•Interaction of proteins in complex molecular machines
•Predicted vs experimentally determined gene function
•Evolutionary conservation among organisms
•Protein conservation (structure and function)
•Proteomes (total protein content and function) in organisms
•Correlation of SNPs (single-base DNA variations among individuals) with
health and disease
•Disease-susceptibility prediction based on gene sequence variation
•Genes involved in complex traits and multigene diseases
•Complex systems biology including microbial consortia useful for
environmental restoration
•Developmental genetics, genomics
Fast Forward to 2020: What to Expect in Molecular Medicine?

 Docs to tailor dose by your smart card
 Personalized Medicine
 More Effective Pharmaceuticals
 Societal Implications
 Genetic Testing, Therapy
 Understanding Life
 Challenges
Outlook – coming lecture
 Introduction to sequence alignment
 pairwise sequence alignment
• The Dot Matrix
• Dynamic Programming
• Scoring Matrices
 local alignment
 Alignment tools
• BLAST
• FASTA
