Introduction to Bioinformatics
Questions & Help
• Amir Mitchell – lecturer.
• Itay Mayros, Einat Hazkani-Covo, and Shira Mintz – teaching assistants.
• Emails: mitchel@post.tau.ac.il, itaymay@post.tau.ac.il, einat@kimura.tau.ac.il, mintzshi@post.tau.ac.il
• Course site
Course Layout
• Eleven lessons – eleven weeks.
• Lecture, exercise, discussion.
• Presentations and exercises.
• Books and additional material.
• Missing lessons or exercises.
• Consultation hour.
• Personal gene/protein.
Final grade
• Final exam (80%):
– Multiple choice questions
– Open questions
– No online part
• Home assignment (20%)
Bioinformatics
• Buzzword …
Nanotechnology, Biotechnology …
Definition: Bioinformatics is the branch of computer science
that focuses on sub-domains of biology: research on genes and
proteins. Researchers in this field must use powerful computers and
special calculation methods to process the large body of complex data
generated by genetics. Using these tools, it was possible to sequence
the human genome.
(Source: Lexicon-encyclobio)
Two separate approaches
• Computer science - inventing tools,
developing algorithms.
• Biology - using the tools for biological research.
1. Purely bioinformatics (comparing exon/intron structure in human and mouse).
2. “Fairly” bioinformatics (locating the active site of an enzyme by identifying conserved residues in the protein sequence).
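The second example (finding conserved residues) can be illustrated with a minimal sketch. It assumes the protein sequences have already been aligned to the same length and uses "-" for gaps. The function name and the toy alignment are invented for illustration and are not part of the course material. Real analyses usually score conservation by degree rather than requiring identical residues.

```python
def conserved_positions(aligned_seqs):
    """Return 0-based column indices where all sequences share the same residue.

    Assumes the sequences are already aligned (equal length, '-' marks a gap).
    Columns that contain a gap are never reported as conserved.
    """
    if not aligned_seqs:
        return []
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("all aligned sequences must have the same length")

    conserved = []
    for i in range(length):
        column = {seq[i] for seq in aligned_seqs}  # distinct residues in column i
        if len(column) == 1 and "-" not in column:
            conserved.append(i)
    return conserved


# Toy alignment, invented for illustration only.
toy_alignment = [
    "MKHSLG-A",
    "MRHSIGTA",
    "MKHSVGSA",
]
print(conserved_positions(toy_alignment))
```

Fully conserved columns are only candidates for functionally important residues such as an active site. Confirming them still requires the lab or the literature.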
Research outline
[Flowchart: Databases (public, local) → Retrieve data → Analysis → Results, alongside Lab (wet biology) and Literature]
Databases & Tools
• Free shared databases (online, bioinformatics unit)
• Internet-based tools (PC)
• GCG package tools (Unix)
GCG
• Commercial DNA and protein sequence
analysis package.
• Written by the Genetics Computer Group (University of Wisconsin).
• Includes more than 130 separate tools.
GCG
• GCG runs in a Unix environment (OS)
• Same principles apply to all GCG programs
• On-line help
Divided work
               PC (1)    Unix (2)          Web
Databases      -         main ones only    all
Data storage   yes       yes               -
Tools          yes       yes               yes
(1) Access (Unix and web)
(2) Advanced analysis, user databases, web site
Lesson 1 – Introduction, Unix environment
1. Administration
2. Introduction to Bioinformatics.
3. NCBI
4. Working in Unix environment
Lesson 2 – Databases and text-based searching
1. Databases: organization and entries.
2. Database problems.
3. Principles of database searching.
4. Unix and GCG.
Lesson 3 – pairwise alignment
1. Comparing two sequences.
2. Scoring: good and bad alignments.
3. Comparison methods.
4. Comparison programs.
5. Unix.
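As a preview of what "scoring" an alignment means, here is a minimal sketch of global alignment scoring in the style of Needleman and Wunsch. The scoring values (match +1, mismatch -1, gap -2) and the function name are assumptions chosen for illustration. Real programs use substitution matrices and more elaborate gap penalties. The sketch returns only the best score, not the alignment itself.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of sequences a and b (dynamic programming).

    Only two rows of the score table are kept in memory, because each cell
    depends only on its left, upper, and upper-left neighbours.
    """
    # Row 0: aligning the empty prefix of a against prefixes of b costs gaps only.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # column 0: prefix of a aligned against nothing
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            curr.append(max(diagonal, up, left))
        prev = curr
    return prev[-1]


print(global_alignment_score("GATTACA", "GATTACA"))
print(global_alignment_score("ACGT", "AGT"))
```

Under this scheme a "good" alignment is simply one with a higher score. Changing the match, mismatch, and gap values changes which alignment wins, so the choice of scoring scheme matters.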
Lesson 4 – Sequence-based searching
1. DNA or protein sequences as search queries.
2. Problems with sequence search.
3. Methods for searching (FASTA, BLAST).
Lesson 5 – Multiple sequence alignment
1. Comparing multiple sequences.
2. Uses of multiple alignment.
3. Methods for multiple alignment, efficiency
and limitations.
4. Profiles and consensus sequences.
Lesson 6 – Phylogenies
1. Introduction to phylogeny.
2. Methods for constructing evolutionary trees.
3. Statistical analysis of constructed trees.
Lesson 7 – Protein families, secondary databases
1. Dividing proteins into families.
2. Patterns.
3. Different approaches: motifs, fingerprints.
4. Different databases.
5. ConSurf.
Lesson 8 – DNA sequence analysis
1. Gene structure.
2. Gene finding.
3. Predicting gene features.
4. ConSurf.
Lesson 9 – Genomes
• Genome features.
• Prokaryotic and Eukaryotic genomes.
• Genome viewers
• Model organisms
Lesson 10 – Various tools
• Making things easy: useful tools for lab work.
Lesson 11 – Summary
• Overview, Q&A before the exam.
Last comments
• Introduction only.
• Finding sites: links and Google.
• Biology background.
• Unix accounts.
• Terminology
Milestones in bioinformatics
1965 Theory of molecular evolution (Zuckerkandl & Pauling)
1967 Atlas of protein sequences (Dayhoff)
1970 Global alignment algorithm (Needleman, Wunsch)
1981 Local alignment algorithm (Smith, Waterman)
1981 Sequence motif concept (Doolittle)
1982 GenBank made public
1982 Phage lambda genome fully sequenced
1983 Database search algorithm (Wilbur, Lipman)
1985 Fast sequence similarity searching
1990 BLAST
1991 ESTs
* 1953 Watson and Crick
Milestones in bioinformatics
1995 First bacterial genome fully sequenced (H. influenzae)
1996 Yeast genome fully sequenced
1998 C. elegans genome fully sequenced
2000 Fruit fly genome fully sequenced
2000 Human genome fully sequenced (draft)
Today …
• Over 1500 fully sequenced genomes from
all domains of life.
• Numerous databases.
• Numerous tools.
Today …
Archaea (16)
Eukarya (20)
Bacteria (139)
Viruses (1500)
Examples
• Human, mouse, rat, zebrafish, Drosophila, yeast, Anopheles, tomato, rice, wheat.
• E. coli (4 strains), M. tuberculosis, M. leprae.
• Mitochondria, chloroplast, plasmids.
Public interest:
Human Genome Project
• 2000 - Working draft of the genome, the work of 20 groups worldwide (http://www.ncbi.nlm.nih.gov).
• 2003 - Obtain a complete, high-quality genomic sequence.
• Determine the sequences of the 3 billion bases.
• Identify all the estimated 30,000 genes in human DNA.
Human Genome Project
• Chromosome 21 – 9 May, 2000
• Chromosome 22 – 2 Dec, 1999
• Initial analysis – 15 Feb, 2001
NCBI – at a glance
The biggest and most comprehensive site!
Includes numerous tools and databases!
NCBI - overview
[Diagram: NCBI at the center, linked to PubMed, OMIM, Books, Expression profiles, Structure, Nucleotides, Proteins, Genomes, Taxonomy, and Domains]
* Cross references between the databases
NCBI
PubMed
• Citations, abstracts, full articles.
Books
• Online books, full text from books (Cell, Introduction to Genetic Analysis).
NCBI
OMIM
• Online Mendelian Inheritance in Man. A comprehensive database of human genes and genetic disorders.
• Entries include textual information and, most importantly, references to literature and sequences.
NCBI
GEO
• Gene Expression Omnibus. Results from high-throughput experiments: mRNA, DNA, and protein arrays.
NCBI
Genomes, Nucleotides, Proteins
• Sequence databases. Divided into sections and sub-sections.
Domains
• Protein domains, both conserved sequence
domains and 3D domains.
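The NCBI databases can also be queried from a program rather than through the web pages. The sketch below is an illustration only and is not part of the original course material. It assumes NCBI's E-utilities service (the esearch and efetch endpoints) and only builds the query URLs, without contacting the network. The search term and the record ID are hypothetical placeholders.

```python
from urllib.parse import urlencode

# Base address of NCBI E-utilities (an assumption of this sketch; the slides
# themselves only describe the web interface).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, retmax=5):
    """Build a URL that searches one NCBI database for a text term."""
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return f"{EUTILS_BASE}/esearch.fcgi?{query}"


def efetch_fasta_url(db, record_id):
    """Build a URL that fetches one sequence record in FASTA format."""
    query = urlencode(
        {"db": db, "id": record_id, "rettype": "fasta", "retmode": "text"}
    )
    return f"{EUTILS_BASE}/efetch.fcgi?{query}"


# Hypothetical query and placeholder ID, for illustration only.
print(esearch_url("protein", "hemoglobin AND human[Organism]"))
print(efetch_fasta_url("protein", "12345"))
```

Opening the first URL in a browser would return a list of matching record IDs. One of those IDs could then be passed to the second function to retrieve the sequence.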
NCBI
Structure
• 3D structure of proteins (~20,000 entries).
Taxonomy
• Taxonomy of all organisms found in NCBI
NCBI - Interconnectivity
[Diagram: NCBI at the center, linked to PubMed, OMIM, Books, Expression profiles, Structure, Nucleotides, Proteins, Genomes, Taxonomy, and Domains]
* Cross references between the databases