Bioinformatics
Information Resources And Networks
Title : Bioinformatics
Subtitle : Sequence and genome analysis
Author : Mount, David W.
Publ.Plc : New York
Publ. : Cold Spring Harbor
Pages : xii, 564p.
ISBN : 0-87969-608-7
Title : Introduction to bioinformatics
Author : Attwood, Teresa K.
Parry-Smith, D. J.
Publ.Plc : New Delhi
Publ. : Pearson Education
Pages : xvi, 218p.
Ser Note : Cell and molecular biology
ISBN : 81-7808-507-0
Outline
Bioinformatics Information Resources And Networks
• EMBnet – European Molecular Biology Network
• DBs and Tools
• NCBI – National Center For Biotechnology Information
• DBs and Tools
• Nucleic Acid Sequence Databases
• Protein Information Resources
• Metabolic Databases
• Mapping Databases
• Databases concerning Mutations
• Literature Databases
EMBnet – European Molecular Biology Network
• Founded in 1988
• Network that links European laboratories that use biocomputing and bioinformatics in molecular biology research
• A science-based group of collaborating nodes throughout Europe, with further nodes outside Europe
• Provides information, services and training to its users
• Works to increase the availability and accessibility of data resources and computing tools
• Works to increase knowledge and proficiency in bioinformatics through education and training
EMBnet - Nodes http://www.embnet.org/
EMBnet (41 nodes)
• National Nodes (18): governmental
• Specialist Nodes (9): academic, industrial research centers
• Associate Nodes (11): biocomputing centers from non European countries
EMBnet - Nodes
National Nodes
• Appointed by the governments
• Provide on-line services, user support and training
Vienna Biocenter - Austria
BEN - Belgium
CSC - Finland
INFOBIOGEN - France
DKFZ - Germany
HEN - Hungary
INCBI - Ireland
INN - Israel
IEN-AdR - Italy
CMBI - Netherlands
Bio - Norway
IBB - Poland
PEN - Portugal
GeneBee - Russia
CNB-CSIC - Spain
BMC - Sweden
SIB - Switzerland
SEQNET - UK
EMBnet - Nodes
Specialist Nodes
• Academic, industrial or research centers in specific areas of bioinformatics
• Largely responsible for maintenance of biological databases and software
MIPS (Munich Information Center for Protein Sequences)
ICGEB
Pharmacia
F. Hoffmann – La Roche
EBI, Hinxton Hall (Cambridge UK): important key specialist node and home of the EMBL, SWISS-PROT and TrEMBL databases
HGMP - RC
Sanger
UCL
EMBnet - Nodes
Associate Nodes
• Centers from non European countries
IBBM - Argentina
ANGIS - Australia
CBI - China
CIGB - Cuba
CDFD - India
SANBI – South Africa
EMBnet - Brazil
CBR - Canada
EMBnet - Chile
EMBnet - Colombia
CIFN - Mexico
EMBnet’s Mission
Assist in biotechnological and bioinformatics related
research
Provide training and education
Exploit network infrastructures
Investigate and develop new technologies
Bridge between commercial and academic sectors
What does EMBnet do?
Education and training
Software development
Computing resources
Technical support
Help desk in local languages
Publications
Who are EMBnet’s Users?
> 40,000 registered users from all over
the world as well as a larger number of
Internet users
All scientists working in Life Sciences,
from undergraduate students to top level
scientists, in academia as well as
industry, can get support from EMBnet
EMBnet – SRS
Sequence Retrieval System - SRS
• result of a research project within EMBnet to interrogate all the resources gathered together
• SRS is a network browser for DBs in molecular biology
• SRS allows any flat-file DB to be indexed to any other
• queries across a range of different DB types via a single interface
• independent of underlying data structures or query languages
http://srs.ebi.ac.uk/
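The indexing idea described for SRS can be illustrated with a toy sketch in Python. This is not SRS itself: the two-letter line codes (ID, DE, DR), the databank names PROT and NUCL, all entries and all function names are assumptions made up for this illustration. It only shows how cross-reference lines in one flat file can be indexed so that a single query function reaches entries in another databank.

```python
from collections import defaultdict

# Two toy flat-file databanks. The layout is an assumption for this sketch:
# "ID" starts a record, "DR" links to "<databank>; <id>", "//" ends the record.
PROTEIN_FLATFILE = """\
ID   P001
DE   Example kinase
DR   NUCL; N100
//
ID   P002
DE   Example transporter
DR   NUCL; N200
//
"""

NUCLEOTIDE_FLATFILE = """\
ID   N100
DE   Example kinase mRNA
//
ID   N200
DE   Example transporter gene
//
"""


def parse_flatfile(text):
    """Return {entry_id: {"DE": description, "DR": [(databank, id), ...]}}."""
    index = {}
    current = None
    for line in text.splitlines():
        if line.startswith("//"):
            current = None
            continue
        code, value = line[:2], line[5:].strip()
        if code == "ID":
            current = {"DE": "", "DR": []}
            index[value] = current
        elif current is not None and code == "DE":
            current["DE"] = value
        elif current is not None and code == "DR":
            bank, target = [part.strip() for part in value.split(";")]
            current["DR"].append((bank, target))
    return index


def build_link_index(source_name, source_index):
    """Index cross-references in both directions between databanks."""
    links = defaultdict(set)
    for entry_id, record in source_index.items():
        for bank, target in record["DR"]:
            links[(source_name, entry_id)].add((bank, target))
            links[(bank, target)].add((source_name, entry_id))
    return links


databanks = {
    "PROT": parse_flatfile(PROTEIN_FLATFILE),
    "NUCL": parse_flatfile(NUCLEOTIDE_FLATFILE),
}
links = build_link_index("PROT", databanks["PROT"])


def linked_descriptions(bank, entry_id):
    """Single query interface: follow links from one entry to other databanks."""
    return sorted(databanks[b][i]["DE"] for b, i in links[(bank, entry_id)])


print(linked_descriptions("NUCL", "N100"))
```

Because the links are stored in both directions, a nucleotide entry can be used to find its protein even though only the protein flat file carries the cross-reference line.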
Sequence Retrieval System
Network Browser for Databanks in Molecular Biology
Data Bank No Entries Indexing Date Group Availability
SWISSPROT 163235 10-Jun-2005 Sequence ok
SWISSNEW 81134 22-Mar-2006 Sequence ok
NRDB 2269647 29-Mar-2006 Sequence ok
SWALL 3022528 22-Mar-2006 Sequence ok
UNIPROT_SPROT 212425 22-Mar-2006 Sequence ok
UNIPROT_TREMBL 2666963 23-Mar-2006 Sequence ok
TREMBLNEW 624819 12-Dec-2005 Sequence ok
TREMBL 2576118 04-Oct-2005 Sequence ok
SPTREMBL 1449374 16-Jun-2005 Sequence ok
SPTREMBLNEW 143140 17-Jun-2005 Sequence ok
REMTREMBL 92182 20-Jun-2005 Sequence ok
PIR 283416 16-Jun-2005 Sequence ok
WORMPEP 19538 16-Jun-2005 Sequence ok
DROSOPHILA 14100 16-Jun-2005 Sequence ok
EMBLNEW 4035816 21-Nov-2005 Sequence ok
EMBL 20343598 30-Dec-2005 Sequence ok
EMBLEST 31990232 06-Jan-2006 Sequence ok
EMBLWGS 11106060 24-Sep-2005 Sequence ok
GENBANK 19233264 18-Nov-2005 Sequence ok
GENBANKEST 31008556 23-Feb-2006 Sequence ok
REFSEQP 8006 16-Jun-2005 Sequence ok
SUBTILIST 1 16-Jun-2005 Sequence ok
PROSITE 1935 22-Mar-2006 SeqRelated ok
PROSITEDOC 1407 22-Mar-2006 SeqRelated ok
BLOCKS 4034 16-Jun-2005 SeqRelated ok
EPD 1375 16-Jun-2005 SeqRelated ok
ENZYME 4173 16-Jun-2005 SeqRelated ok
PRINTS 865 16-Jun-2005 SeqRelated ok
TFSITE 4342 07-Apr-2003 TransFac ok
TFFACTOR 1799 07-Apr-2003 TransFac ok
TFCELL 816 07-Apr-2003 TransFac ok
TFCLASS 27 07-Apr-2003 TransFac ok
TFMATRIX 246 07-Apr-2003 TransFac ok
TFGENE 1035 07-Apr-2003 TransFac ok
PDB 34927 08-Feb-2006 Protein3DStruct ok
DSSP 30832 22-Nov-2005 Protein3DStruct ok
HSSP 30369 08-Feb-2006 Protein3DStruct ok
PDBFINDER 35701 28-Mar-2006 Protein3DStruct ok
NRL3D 6063 16-Jun-2005 Protein3DStruct ok
FLYGENES 7556 16-Jun-2005 Genome ok
FLYREFS 0 07-Apr-2003 Genome ok
OMIM 17004 18-Oct-2005 Mutations ok
REPTILIA 8364 18-Jan-2006 Others ok
EMBnet - EMBOSS
The European Molecular Biology Open Software Suite
EMBOSS is a free Open Source software analysis package
specially developed for the needs of the molecular biology (e.g.
EMBnet) user community.
The software automatically copes with data in a variety of
formats and even allows transparent retrieval of sequence data
from the web.
Also, as extensive libraries are provided with the package, it is
a platform to allow other scientists to develop and release
software in true open source spirit.
EMBOSS also integrates a range of currently available
packages and tools for sequence analysis into a seamless
whole.
What can EMBOSS do for you?
Within EMBOSS you will find hundreds of programs (applications) covering areas such as:
• Sequence alignment,
• Rapid database searching with sequence patterns,
• Protein motif identification, including domain analysis,
• Nucleotide sequence pattern analysis,
• Codon usage analysis for small genomes,
• Rapid identification of sequence patterns in large scale
sequence sets,
• Presentation tools for publication,
and much more. Check:
http://emboss.sourceforge.net/
NCBI – National Center For Biotechnology Information
• Leading American information provider
• Established in 1988 as a division of the National Library of Medicine (NLM)
• Located on the campus of the National Institutes of Health (NIH) in Bethesda, Maryland
Mission:
• Development of new information technologies to aid our understanding of the molecular and genetic processes that underlie health and disease
• Creation of systems for storing and analysing biological information
• Development of advanced methods of computer-based information processing
• Facilitation of user access to DBs and software
• Co-ordination of efforts to gather biotechnology information worldwide
NCBI
Since 1992 – maintenance of GenBank and collaboration
with international nucleotide DBs: EMBL and DDBJ
(Japan)
Provides Entrez, which facilitates access to biological
DBs (similar to SRS, which is provided by EMBnet)
gquery (https://www.ncbi.nlm.nih.gov/gquery/)
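As an illustration of querying Entrez from a program, the sketch below builds search URLs for the NCBI E-utilities, the programmatic interface to Entrez. The slides themselves only mention the web interface, so the choice of E-utilities, the helper name esearch_url and the example query are assumptions for this sketch. No network request is made; the code only prints the URLs.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities (programmatic access to Entrez).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, retmax=20):
    """Build an ESearch URL that asks one Entrez database for matching record IDs."""
    params = {"db": db, "term": term, "retmax": retmax}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


# The same query text sent to two different Entrez databases.
for database in ("nuccore", "protein"):
    print(esearch_url(database, "BRCA1[Gene Name] AND Homo sapiens[Organism]"))
```

Keeping the database name as a parameter mirrors the point made on the slide: one interface, many underlying DBs.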
NCBI - Responsibilities
• administers research on biomedical problems at the molecular level using mathematical and computational methods
• maintains collaborations with several NIH (National Institutes of Health) institutes, academia, industry, and other governmental agencies
• promotes scientific communication by sponsoring meetings, workshops, and lecture series
• supports training on basic and applied research in computational biology for postdoctoral fellows through the NIH Intramural Research Program
• engages members of the international scientific community in informatics research and training through the Scientific Visitors Program
• develops, distributes, supports, and coordinates access to a variety of databases and software for the scientific and medical communities
• develops and promotes standards for databases, data deposition and exchange, and biological nomenclature
Nucleic Acid Sequence Databases
• the principal nucleic acid sequence databases are GenBank,
EMBL and DDBJ, which each collect a portion of the total sequence
data reported world-wide, and exchange new and updated entries
on a daily basis
Nucleic acid sequence Databases
EMBL (Europe)
GenBank (USA)
DDBJ (Japan)
ENSEMBL (project between EMBL - EBI and the Sanger Institute)
dbEST (division of GenBank)
GSDB (division of GenBank)
EMBL
source: http://www3.ebi.ac.uk/Services/DBStats/
Nucleic Acid Sequence Databases - EMBL
This week the EMBL Database contained 301,588,430,608 nucleotides in
199,575,971 entries
Breakdown by entry type:
Entry Type Entries Nucleotides
Standard 128,262,666 120,603,334,814
Constructed (CON) 6,381,010 225,047,233,405
Third Party Annotation (TPA) 6,894 385,832,010
Whole Genome Shotgun (WGS) 64,925,118 180,599,264,067
The EMBL Nucleotide Sequence Database (also known as EMBL-Bank)
constitutes Europe's primary nucleotide sequence resource. Main sources
for DNA and RNA sequences are direct submissions from individual
researchers, genome sequencing projects and patent applications. The
database is produced in an international collaboration with GenBank (USA)
and the DNA Database of Japan (DDBJ). Each of the three groups collects a
portion of the total sequence data reported worldwide, and all new and
updated database entries are exchanged between the groups on a daily
basis.
Nucleic Acid Sequence Databases - EMBL
[Figure: growth of the EMBL database. Number of entries (current 69,666,551) and total nucleotides (current 127,450,085,130)]
Ref: EMBL Nucleotide Sequence Database: developments in 2005,
Nucleic Acids Research, 2006, Vol. 34, D10–D15
Nucleic Acid Sequence Databases - EMBL
[Figure: EMBL content by nucleotide count, by organism: Homo sapiens, Mus musculus, Rattus norvegicus, Pan troglodytes, Bos taurus, Canis familiaris, Monodelphis domestica, Danio rerio, Macaca mulatta, Loxodonta africana, Other]
Nucleic Acid Sequence Databases – GenBank
• GenBank, which is produced at NCBI, is split into smaller, discrete divisions.
• This facilitates fast, specific searches by restricting queries to particular database subsets.
• During 1992-1997, the level of EST and STS data within GenBank grew 10-fold.
• However, the overall sequence information contributed by such partial data was still less than that of the higher quality sequences in the other major divisions.
Specialised Genomic Resources
In addition to the comprehensive DNA sequence DBs, there
is a variety of more specialised genomic resources.
These so called boutique DBs bring focus to species-
specific genomics and to particular sequencing techniques.
Specialised Genomic Resources
SGD – Saccharomyces Genome Database
UniGene - gene-oriented clusters from GenBank
TIGR - Databases of The Institute for Genomic
Research
ACeDB – A C.elegans DataBase
Specialised Genomic Databases
SGD
http://www.yeastgenome.org/ (baker's yeast)
AceDB
http://www.acedb.org (C. elegans)
FlyBase
http://flybase.org/ (fruit fly)
MGD
http://www.informatics.jax.org (Mouse)
Protein Information Resources
Levels of protein sequence and structural organisation:
• primary: The primary structure of a protein is its amino acid sequence.
• secondary: The secondary structure of a protein corresponds to regions of local regularity (e.g., α-helices and β-strands).
• tertiary: The tertiary structure of a protein arises from the packing of its secondary structure elements, which may form discrete domains within a fold.
Protein Information Resources
Levels of protein sequence and structural organisation:
• primary: sequence (AVILDRYFH), stored in a primary database
• secondary: motif ([AS]-[IL]2-X[DE]-R-[FYW]2-H), stored in a secondary database
• tertiary: domain, module, stored in a structure database
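To show how a motif such as the one above is used as a search pattern, the sketch below converts a PROSITE-style pattern into a Python regular expression. It assumes the slide's motif can be written in standard PROSITE notation as [AS]-[IL](2)-x-[DE]-R-[FYW](2)-H, reading X and [DE] as two separate positions. The function name and the test sequence are hypothetical, and only the common syntax elements (residue sets, x, exclusions in braces, repeat counts) are handled.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.split("-"):
        # Split off an optional repeat count, e.g. "[IL](2)" or "x(2,4)".
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = match.groups()
        if core in ("x", "X"):
            core = "."                      # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"  # {P} means any residue except P
        if low and high:
            core += "{%s,%s}" % (low, high)
        elif low:
            core += "{%s}" % low
        parts.append(core)
    return "".join(parts)


motif = "[AS]-[IL](2)-x-[DE]-R-[FYW](2)-H"
regex = prosite_to_regex(motif)
print(regex)

# Hypothetical sequence constructed to contain one match.
hit = re.search(regex, "MKAILKDRFWHG")
print(hit.group(), hit.start())
```

A regular expression either matches or it does not, which is why such patterns are described as diagnostic signatures: a single substitution at a conserved position is enough to lose the hit.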
Primary Protein Databases
• The primary structure of a protein is its amino acid sequence
• these are stored in primary databases as linear alphabets that
denote the constituent residues
Protein sequence Databases
SWISS-PROT - Protein knowledgebase
TrEMBL - Computer-annotated supplement to Swiss-Prot
PIR – Protein Information Resource
MIPS – Munich Information Centre for Protein Sequences
NRL-3D - produced by PIR
Protein Sequence Databases
• Swiss-Prot contains 197,228 sequence entries, comprising 71,501,181 amino acids abstracted from 135,257 references
• Total number of species represented in Swiss-Prot: 9,520
• The average sequence length in Swiss-Prot is 362 amino acids.
• Swiss-Prot is the most highly annotated protein sequence DB
http://expasy.org/sprot/
Table of the most represented species
No. Frequ. Species
1 13049 Homo sapiens (Human)
2 10132 Mus musculus (Mouse)
3 5189 Saccharomyces cerevisiae (Baker's yeast)
4 4847 Escherichia coli
5 4669 Rattus norvegicus (Rat)
6 3665 Arabidopsis thaliana (Mouse-ear cress)
8 2863 Schizosaccharomyces pombe (Fission yeast)
7 2814 Bacillus subtilis
9 2750 Caenorhabditis elegans
10 2286 Drosophila melanogaster (Fruit fly)
Composite Protein Sequence
Databases
Composite databases amalgamate a variety of
different primary databases
They render sequence searching much more
efficient, because they obviate the need to
interrogate multiple resources
Different composite databases use different
primary sources and different redundancy
criteria in their amalgamation procedures
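The amalgamation step can be sketched as follows. This is a minimal illustration, assuming the simplest possible redundancy criterion (two entries are redundant only if their sequences are identical) and a fixed priority order among the sources. The database names, entries and function name are hypothetical; real composite databases apply their own, more elaborate criteria.

```python
def build_composite(sources, priority):
    """Merge primary databases, keeping one entry per identical sequence.

    sources:  {database_name: {entry_id: sequence}}
    priority: database names, most trusted first; the first database
              that contributes a sequence provides the retained entry.
    """
    composite = {}  # sequence -> (database_name, entry_id)
    for name in priority:
        for entry_id, sequence in sources[name].items():
            composite.setdefault(sequence, (name, entry_id))
    return composite


# Hypothetical primary sources sharing one identical sequence.
sources = {
    "DB_A": {"A1": "MKTAYIAK", "A2": "MSLLTEVE"},
    "DB_B": {"B1": "MKTAYIAK", "B2": "MGDVEKGK"},
}

composite = build_composite(sources, priority=["DB_A", "DB_B"])
for sequence, (name, entry_id) in composite.items():
    print(sequence, name, entry_id)
```

Changing the priority list or the redundancy test changes which entries survive, which is one reason different composite databases built from overlapping sources do not have identical contents.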
Composite Protein Sequence Databases
Composite DBs and their primary sources:
• NRDB (Non-Redundant DB): PDB, SWISS-PROT, PIR, GenPept, SWISS-PROTupdate, GenPeptupdate
• OWL: SWISS-PROT, PIR, GenBank, NRL-3D
• MIPSX: PIR1-4, MIPSOwn, MIPSTrn, MIPSH, PIRMOD, NRL-3D, SWISS-PROT, EMTrans, GBTrans, Kabat, PseqIP
• SP+TrEMBL (SwissProt + TrEMBL): SWISS-PROT, TrEMBL
Secondary databases
Secondary databases contain pattern data, i.e., diagnostic
signatures for protein families. These signatures encode the
most highly conserved features of multiply aligned sequences,
which are often crucial to the structure or function of the protein.
The secondary structure of a protein corresponds to regions of
local regularity (e.g., α-helices and β-strands), which in sequence
alignments, are often apparent as well-conserved motifs.
Patterns are regular expressions, fingerprints, blocks, profiles,
etc.
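The relation between an aligned motif, a regular expression and a profile can be sketched in a few lines. The example below is an illustration under simplifying assumptions: the aligned motifs are invented, the profile holds raw residue counts per column, and the score is a plain sum of column frequencies. Profile databases use weighted scoring schemes that are not reproduced here.

```python
from collections import Counter


def frequency_profile(aligned_motifs):
    """Return one Counter of residue counts per alignment column."""
    length = len(aligned_motifs[0])
    if any(len(motif) != length for motif in aligned_motifs):
        raise ValueError("motifs must be aligned to the same length")
    return [Counter(column) for column in zip(*aligned_motifs)]


def score(profile, segment):
    """Sum, over columns, the fraction of motifs that share the segment's residue."""
    depth = sum(profile[0].values())
    return sum(col[res] / depth for col, res in zip(profile, segment))


def consensus_expression(profile):
    """Collapse the columns into a regular-expression-like pattern."""
    parts = []
    for col in profile:
        residues = "".join(sorted(col))
        parts.append(residues if len(residues) == 1 else "[" + residues + "]")
    return "-".join(parts)


# Hypothetical motif taken from three aligned sequences.
motifs = ["ADRF", "SDRY", "AERF"]
profile = frequency_profile(motifs)

print(consensus_expression(profile))
print(score(profile, "ADRF"), score(profile, "GGGG"))
```

The regular expression keeps only which residues were seen in each column, while the profile also keeps how often, so a profile can rank near misses that a regular expression would simply reject.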
Secondary databases
Secondary DB | Primary source | Stored information
PROSITE | SWISS-PROT | Regular expressions (patterns)
Profiles | SWISS-PROT | Weighted matrices (profiles)
PRINTS | OWL | Aligned motifs (fingerprints)
BLOCKS | PROSITE/PRINTS | Aligned motifs (blocks)
IDENTIFY | BLOCKS/PRINTS | Fuzzy regular expressions (patterns)
Secondary databases
TRANSFAC
http://transfac.gbf.de
EPD
http://www.epd.isb-sib.ch
InterPro
http://www.ebi.ac.uk/interpro/
PROSITE
http://www.expasy.ch/prosite
BLOCKS
http://blocks.fhcrc.org
PRINTS
ftp://ftp.seqnet.dl.ac.uk/pub/database/prints
PFAM
http://www.sanger.ac.uk/Software/Pfam/index.shtml
ProDom
http://www.toulouse.inra.fr/prodom.html
GeneCards
http://bioinformatics.weizmann.ac.il/cards
ENSEMBL
http://www.ensembl.org
EcoCyc
http://ecocyc.panbio.com/ecocyc/ecocyc.html
Secondary databases
There is some overlap in content between the secondary
databases
PDBsum alone has 35,291 entries
Pattern DB growth is slow because the addition of
detailed family annotation is very time consuming.
PROSITE and PRINTS are the only comprehensively,
manually annotated secondary DBs
To address the annotation bottleneck, the curators of the
secondary databases have jointly created a unified
database of protein families known as InterPro
Structure Classification DBs
Contain 3D structures available from
crystallographic and spectroscopic studies
Structure Classification Databases
PDBsum – summaries of Protein Data Bank (PDB) structures
CATH – Class, Architecture, Topology, Homology
SCOP – Structural Classification of Proteins
Structure Classification DBs
PDB
http://www.rcsb.org
SCOP
http://scop.mrc-lmb.cam.ac.uk/scop
CATH
http://www.cathdb.info/
DSSP
http://swift.cmbi.ru.nl/gv/dssp/
FSSP
http://www.ebi.ac.uk/dali/fssp
HSSP
http://swift.cmbi.kun.nl/swift/hssp/
Metabolic Databases
A number of metabolic databases are available electronically,
some with features for querying and visualizing metabolic
pathways and regulatory networks.
KEGG (Kyoto Encyclopedia of Genes and Genomes)
http://www.genome.ad.jp/kegg
ENZYME (Enzyme nomenclature database)
http://www.expasy.ch/enzyme
BRENDA (Enzyme Information System)
http://www.brenda-enzymes.org/
EMP (Enzymes and Metabolic Pathways database)
http://www.metacyc.org/
Mapping Databases
OMIM
http://www.ncbi.nlm.nih.gov/omim
GDB (The GDB Human Genome Data Base: a source of
integrated genetic mapping and disease data.)
http://morissardjerome.free.fr/infobiogen/www.gdb.org/gdb/
Databases concerning
Mutations
dbSNP
http://www.ncbi.nlm.nih.gov/SNP
The SNP Consortium (TSC)
http://snp.cshl.org
http://www4a.biotec.or.th/PASNP
Literature Databases
PubMed
http://www.ncbi.nlm.nih.gov/entrez/query
Bioinformatics Online
http://www.bioinformatics.oupjournals.org
Nature
http://www.nature.com
Science
http://www.sciencemag.org
In 2003 scientists in the Human Genome
Project obtained the DNA sequence of the 3
billion base pairs making up the human
genome
Sequencing the Human Genome:
A Landmark in the History of Mankind
What we’ve learned so far from
the Human Genome Project
The human genome is nearly the same
(99.9%) in all people
Only about 2% of the human genome
contains genes, which are the
instructions for making proteins
Other Lessons from the
Human Genome Project
Humans have an estimated 30,000 genes;
the functions of more than half of them
are unknown
Almost half of all human proteins share
similarities with proteins of other organisms,
underscoring the unity of life
Sequence Alignment Logic
Evaluation of the alignment is a biological concept (significance)
Are you ready for the revolution?
If biologists do not adapt to the powerful computational tools
needed to exploit huge data sets, says Declan Butler, they could find
themselves floundering in the wake of advances in genomics.
Need to understand better from Human Genome sequence:
•Gene number, exact locations, and functions
•Gene regulation
•DNA sequence organization
•Chromosomal structure and organization
•Noncoding DNA types, amount, distribution, information content, and
functions
•Coordination of gene expression, protein synthesis, and post-translational
events
•Interaction of proteins in complex molecular machines
•Predicted vs experimentally determined gene function
•Evolutionary conservation among organisms
•Protein conservation (structure and function)
•Proteomes (total protein content and function) in organisms
•Correlation of SNPs (single-base DNA variations among individuals) with
health and disease
•Disease-susceptibility prediction based on gene sequence variation
•Genes involved in complex traits and multigene diseases
•Complex systems biology including microbial consortia useful for
environmental restoration
•Developmental genetics, genomics
Fast Forward to 2020: What to Expect in Molecular Medicine?
• Docs to tailor Dose by your Smart Card
• Personalized Medicine
• More Effective Pharmaceuticals
• Genetic Testing, Therapy
• Societal Implications
• Understanding Life
• Challenges
Outlook – coming lecture
Introduction to sequence alignment
pairwise sequence alignment
• The Dot Matrix
• Dynamic Programming
• Scoring Matrices
local alignment
Alignment tools
• BLAST
• FASTA