Lecture 2

The document outlines homework assignments focused on the differences between DNA and protein sequencing, including their definitions, building blocks, and sequencing challenges. It also includes tasks related to gene analysis, such as extracting gene information, predicting gene structures, and exploring genomic databases like NCBI and Ensembl. The homework emphasizes understanding gene functions, their implications in cancer and immunity, and the methodologies used in sequencing and gene prediction.

HOMEWORK DAY 1

Problems and Solutions?

Note for HOMEWORK 1
Homework 1: DNA sequencing vs Protein sequencing
a. What is the difference between DNA sequencing and protein sequencing?
Answer 1?
Feature | DNA sequence | Protein sequence
Definition | A series of deoxyribonucleotides | A series of amino acids
Building block | Deoxyribonucleotides | Amino acids
Types of monomers | Four types of deoxyribonucleotides | Twenty different amino acids
Bonds between monomers | Phosphodiester bonds | Peptide bonds
Function | Mainly stores the genetic information used to make proteins in a cell | Important in the structure, function, and regulation of the body's tissues and organs
Variety | One DNA sequence (in a given reading frame) translates into only one protein sequence | One protein sequence can be encoded by more than one DNA sequence
Deduce | The protein sequence can be deduced from it | The DNA sequence cannot be uniquely deduced from it

Answer 2?

DNA sequencing | Protein sequencing
Relies heavily on PCR primers, which works well for model species => proves difficult for non-annotated genomes | Is de novo, meaning it does not rely on a database => can sequence any protein of any isotype
Requires access to the intact original cell line => when the hybridoma is lost, DNA sequencing is no longer feasible | Uses the protein itself => can sequence without access to the original cell line or hybridoma
Is blind to post-translational modifications, which may have implications for protein functionality | Can objectively uncover post-translational modifications

Missing information: Principle and techniques?


- DNA sequencing: Traditional Sanger sequencing and next-generation sequencing
- Protein sequencing: two major direct methods (mass spectrometry & Edman
degradation using a protein sequenator (sequencer))

b. Why don't we sequence proteins the way we sequence DNA?

- Because if we sequenced proteins the way we sequence DNA, the result could include both introns and exons, which would reduce its accuracy.

- Because the structural components and the nature of the sequencing process differ. DNA sequencing relies on DNA polymerase and a primer, taking advantage of DNA replication. Protein sequencing uses the protein itself, so the position and structure of each amino acid must be determined directly.

- DNA sequencing is blind to post-translational modifications, which may have implications for protein functionality. Protein sequencing can objectively uncover post-translational modifications such as N-terminal pyroglutamate formation, glycosylation sites, and deamidation.

b. Why don't we sequence proteins the way we sequence DNA?

Missing points:
- The technique lacks high-throughput capabilities
- Cost:
> Protein sequencing cost: $600 for the first 5 amino acids, then $50 for each additional amino acid
> DNA sequencing cost: $1000 for a whole-exome sequence of the human genome (30 × 10^6 bp)
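
The cost gap is easier to see per unit sequenced. The sketch below is only arithmetic on the prices quoted on this slide. The 300-residue protein is a hypothetical example, and the function assumes the $600 base price applies even to proteins shorter than 5 residues.

```python
# Prices quoted on the slide.
PROTEIN_FIRST_FIVE_USD = 600   # first 5 amino acids
PROTEIN_EXTRA_USD = 50         # each additional amino acid
EXOME_USD = 1000               # one human whole-exome sequence
EXOME_BP = 30 * 10**6          # 30 x 10^6 bp


def protein_sequencing_cost(n_residues):
    """Cost in USD of sequencing a protein of n_residues amino acids.

    Assumption: the base price is charged in full even below 5 residues.
    """
    if n_residues <= 0:
        raise ValueError("n_residues must be positive")
    extra = max(0, n_residues - 5)
    return PROTEIN_FIRST_FIVE_USD + PROTEIN_EXTRA_USD * extra


# Hypothetical 300-residue protein, chosen only for illustration.
n = 300
protein_total = protein_sequencing_cost(n)      # 600 + 50 * 295 = 15350
protein_per_residue = protein_total / n
dna_per_bp = EXOME_USD / EXOME_BP

print(f"Protein: ${protein_total} total, ${protein_per_residue:.2f} per amino acid")
print(f"DNA:     ${dna_per_bp:.6f} per base pair")
print(f"Ratio:   about {protein_per_residue / dna_per_bp:,.0f}x per unit")
```

With these numbers, one amino acid costs on the order of a million times more to read than one base pair.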

Note for HOMEWORK 2
Figure out how the genes assigned to each of you are implicated in cancers and/or immunity
(File: Gene List.xlsx)

Requirements: get the following information about each of the 3 genes assigned to you
• Gene symbol, full name, reviewed by RefSeq
• Summary of its function
• Location on the human genome (based on GRCh38)
– e.g. chromosome, start, end, strand
• How this gene is related to cancer
– Get one open-access reference that is most relevant to cancers and/or immunity in your
opinion. Please list the article title, the authors, their institutions, publication year, journal
name.
• Any situations (mutations, over-expression, etc.) of this gene associated with other (non-cancer
and non-immune) diseases
• Extract the DNA sequences of these genes, translate them in 3 frames, and determine which reading frame contains an open reading frame (ORF).
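
The last requirement can be sketched in a few lines of Python. This is an illustration, not the course's prescribed tool: it assumes the standard genetic code, translates the forward strand only, and uses a made-up 16-base toy sequence instead of a real gene. The function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna, frame):
    """Translate dna starting at offset frame (0, 1 or 2); '*' marks a stop."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    # Codons with ambiguous bases (e.g. N) become 'X'.
    return "".join(CODON_TABLE.get(c, "X") for c in codons)


def three_frame_translation(dna):
    """Return {1: ..., 2: ..., 3: ...} with the three forward translations."""
    return {frame + 1: translate(dna, frame) for frame in range(3)}


def longest_orf(protein):
    """Longest stretch that starts with M and is closed by a stop codon."""
    best = ""
    # The last piece after split("*") has no stop codon, so it is skipped.
    for segment in protein.split("*")[:-1]:
        start = segment.find("M")
        if start != -1 and len(segment) - start > len(best):
            best = segment[start:]
    return best


toy = "CCATGGCTAAATGAAC"  # made-up sequence, not a real gene
frames = three_frame_translation(toy)
for number, protein in frames.items():
    print(number, protein, "ORF:", longest_orf(protein) or "none")
```

For the toy sequence only frame 3 (MAK*) contains a complete ORF. For a real transcript, the frame with the longest complete ORF is the usual candidate, but that choice should be checked against the annotated CDS.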

Using NCBI RefSeqGene

https://www.ncbi.nlm.nih.gov/gene/?term=akt1

RefSeqGene - AKT1

• Gene symbol, full name, reviewed by RefSeq


• Summary of its function
RefSeqGene - AKT1

• Location on the human genome (based on GRCh38)


e.g. chromosome, start, end, strand
• How this gene is related to cancer

RefSeqGene - AKT1

• How this gene is related to cancer
– Get one open-access reference that is most relevant to cancers and/or
immunity in your opinion. Please list the article title, the authors, their
institutions, publication year, journal name.

• Any situations (mutations, over-expression, etc.) of this gene associated
with other (non-cancer and non-immune) diseases

RefSeqGene – AKT1

From NCBI RefSeqGene to ClinVar

Extract the DNA sequences of these genes, translate them in 3 frames, and determine which reading frame contains an open reading frame (ORF).

GenBank Record Fields

RefSeqGene - AKT1 transcript

Extract the DNA sequence of a transcript of the AKT1 gene

Searching for ORFs


a. Missing protocol
- Which program? Website?
- Parameters: strand? Initiation codons? Genetic code? Minimum ORF size? ...
b. Conclusion: which ORF should be chosen for further study?
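
To show why these parameters belong in the protocol, here is a minimal ORF search in which each of them is an explicit argument. It is a sketch, not the algorithm of any particular ORF-finding website: the function names and defaults are hypothetical, the stop codons are those of the standard genetic code, and the input is a made-up toy sequence.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}          # standard genetic code
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    return dna.upper().translate(COMPLEMENT)[::-1]


def find_orfs(dna, start_codons=("ATG",), min_codons=3, both_strands=True):
    """Return ORFs sorted by length, longest first.

    start and end are 0-based, end-exclusive offsets on the scanned strand.
    min_codons counts the start codon but not the stop codon.
    """
    strands = [("+", dna.upper())]
    if both_strands:
        strands.append(("-", reverse_complement(dna)))
    orfs = []
    for strand, seq in strands:
        for frame in range(3):
            start = None
            for i in range(frame, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if start is None and codon in start_codons:
                    start = i                  # first start codon opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    if (i - start) // 3 >= min_codons:
                        orfs.append({"strand": strand, "frame": frame + 1,
                                     "start": start, "end": i + 3,
                                     "sequence": seq[start:i + 3]})
                    start = None               # look for the next ORF
    return sorted(orfs, key=lambda o: len(o["sequence"]), reverse=True)


toy = "CCATGGCTAAATGAAC"  # made-up sequence, not a real gene
for orf in find_orfs(toy):
    print(orf)
```

Changing `min_codons`, `start_codons` or `both_strands` changes the list of reported ORFs, which is why a result reported without these settings cannot be reproduced.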
Structure of a eukaryotic gene

How is gene structure determined?

• Experiments
– Reverse transcription PCR (RT-PCR) -> sequencing
– 5’ rapid amplification of cDNA ends (5’ RACE) -> finding the 5’-most exon -> sequencing
– Transcriptome library -> single-pass sequencing
• Expressed sequence tags (EST)
• RNA-seq

• Computational prediction

How can a computer predict gene structure?

• Signal sites of transcription and translation elements.
• Sequence homology to known genes/proteins.
Strategy: Splice site recognition

GT-AG rule

Donor splice site: the splice site at the beginning of an intron (the intron's 5' end, on the left).
Acceptor splice site: the splice site at the end of an intron (the intron's 3' end, on the right).
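
Under the GT-AG rule an intron starts with GT at its donor site and ends with AG at its acceptor site. A minimal sketch of signal scanning follows. It is a toy, not how geneid or the other programs score sites: the function names, the `min_length` default and the sequence are all made up for illustration.

```python
def candidate_splice_sites(dna):
    """Return 0-based positions of candidate donor and acceptor sites.

    A donor position is the index of the G in GT (first base of the intron).
    An acceptor position is the index of the G in AG (last base of the intron).
    """
    dna = dna.upper()
    donors = [i for i in range(len(dna) - 1) if dna[i:i + 2] == "GT"]
    acceptors = [i + 1 for i in range(len(dna) - 1) if dna[i:i + 2] == "AG"]
    return donors, acceptors


def candidate_introns(dna, min_length=4):
    """Pair every donor with every downstream acceptor.

    Returns (start, end) tuples, 0-based and end-exclusive.
    """
    donors, acceptors = candidate_splice_sites(dna)
    return [(d, a + 1) for d in donors for a in acceptors
            if a + 1 - d >= min_length]


# Made-up toy gene: exon "ATGGCC", intron "GTAAGTCCTTAG", exon "GCTTAA".
toy = "ATGGCC" + "GTAAGTCCTTAG" + "GCTTAA"
donors, acceptors = candidate_splice_sites(toy)
introns = candidate_introns(toy)
print("donors:", donors)
print("acceptors:", acceptors)
for start, end in introns:
    print(start, end, toy[start:end])
```

Even in this 24-base toy the true intron (6, 18) is only one of three candidates. GT and AG dinucleotides are far too common for the rule to be sufficient on its own, which is why gene finders combine site scores with coding statistics and homology.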
Programs for gene prediction

• geneid: https://genome.crg.es/software/geneid/geneid.html
- Available organisms: Homo sapiens (human), Drosophila melanogaster (fruit fly), Tetraodon nigroviridis (puffer fish), Oryza sativa (rice), …

• GenScan: http://hollywood.mit.edu/GENSCAN.html
- Available organisms: vertebrates, Arabidopsis, maize

• Augustus: http://bioinf.uni-greifswald.de/augustus/submission.php
- Available organisms: animals, alveolata, plants and algae, fungi, bacteria, archaea

• Other gene finders: FGENESH, GRAIL, GLIMMERM, GENEID, GENEFINDER, GENEMARK, …
EXERCISE BREAK
Exploring ab initio gene prediction
1. Extract the FASTA sequence of the genomic region of the AKT1 gene (NCBI Reference
Sequence: NG_012188.1)
2. Predict gene structure of this DNA sequence
- Search for the signals of the first exon with geneid: select acceptors, donors, start and stop codons, then look for them in the real annotation of the sequence
- Search for exons with at least two gene prediction programs (geneid plus GenScan or Augustus)
> Select "All exons" and try to find the real ones
> Find the gene
> Compare the predicted gene with the GenBank record of the gene from NCBI
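
The final comparison step can be made systematic once the exon coordinates from the prediction and from the GenBank record are written down. The sketch below is one possible way to do it. The function name and the coordinates are hypothetical and are not taken from NG_012188.1.

```python
def overlaps(a, b):
    """True if two inclusive (start, end) intervals share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def compare_exons(predicted, annotated):
    """Compare two lists of (start, end) exons in the same coordinate system."""
    predicted, annotated = set(predicted), set(annotated)
    exact = predicted & annotated
    partial = {p for p in predicted - exact
               if any(overlaps(p, a) for a in annotated)}
    wrong = predicted - exact - partial
    missed = {a for a in annotated
              if not any(overlaps(a, p) for p in predicted)}
    return {
        "exact": sorted(exact),        # both boundaries correct
        "partial": sorted(partial),    # overlaps a real exon, boundary off
        "wrong": sorted(wrong),        # no real exon at this position
        "missed": sorted(missed),      # real exon with no prediction at all
        "fraction_of_annotated_found":
            len(exact) / len(annotated) if annotated else 0.0,
        "fraction_of_predicted_correct":
            len(exact) / len(predicted) if predicted else 0.0,
    }


# Hypothetical coordinates, for illustration only.
annotated_exons = [(100, 200), (400, 520), (800, 950)]
predicted_exons = [(100, 200), (410, 520), (600, 650)]
report = compare_exons(predicted_exons, annotated_exons)
for key, value in report.items():
    print(key, value)
```

Running the same comparison for each program gives a like-for-like basis for the observations and conclusion.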

One gene
=> multiple (alternatively spliced) transcripts
=> multiple proteins (with distinct functions)

http://commons.wikimedia.org/wiki/File:Transformer_splicing.gif
Browsing genes and genomes
with Ensembl

Contents

• Introduction to Ensembl database and browser

• EXERCISE: A light exploration of the Ensembl genome browser with the AKT1 gene
NCBI databases are not the ultimate
solution to the knowledge of genomes

Introduction
Why do we need/have genome browsers? So many!

The Human Genome Project (HGP)

• Draft
– Published on June 26, 2000
– Coverage: 90 %
– Error rate: 1 %

• Finished
– Published in 2003
– Coverage: > 99 %
– Error rate: 0.01 %
Anything new for the human genome?
The truth is that what we do not know is much more than what we know…

Once, nearly everyone believed that only 3% of the human genome is functional:
1.5% protein-coding regions
1.5% regulatory elements
97% junk DNA
Nature (2001), 409(6822): 860-921

This is no longer true since the Encyclopedia of DNA Elements (ENCODE) Consortium found new evidence.
Non-coding RNA: It’s Not Junk

• About 3/4 (~75%) of the human genome can be transcribed …, functionally unknown!

• >20,000 non-coding RNAs, functionally unknown!
Djebali, S., et al. (2012). "Landscape of transcription in human cells." Nature 489 (7414): 101-108.
Genomic sequences must be
annotated with functions
• Human Genome Project: reference genome GRCh38.p4 (June 29, 2015)
• Annotation of gene structures
• Advanced annotation: population variations, gene regulation, pathways, variation and diseases
The Ensembl project

• The goal of Ensembl was to automatically annotate the genome, integrate this annotation with other available biological data, and make all this publicly available via the web (since 1999).

www.ensembl.org
Ensembl Features

EXERCISE BREAK

Exercise 2: A light exploration of the Ensembl genome browser with the AKT1 gene
- Extracting genomic information from Ensembl:
• Gene ID, gene name, Ensembl gene ID (gene stable ID), NCBI gene ID, UniProt/Swiss-Prot ID
• What is the description of this gene? Where is it located in the genome?
• How many contigs cover the gene region? Is the AKT1 gene on the forward strand or on the reverse strand? How many transcripts are annotated for AKT1? How many of them code for protein?
• SNPs or variants within the gene of interest: what SNPs are found in my gene, and are they located in introns, promoters or exons?
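
Some of these questions can also be answered programmatically rather than through the browser. The sketch below assumes the Ensembl REST lookup endpoint (`/lookup/symbol/...` with `expand=1`) and the JSON field names it returns as I understand them, so check them against the current REST documentation before relying on it. The example record is made up, and the sketch does not cover contigs or variants.

```python
import json
import urllib.request

ENSEMBL_REST = "https://rest.ensembl.org"


def fetch_gene(symbol, species="homo_sapiens"):
    """Download one gene record, with its transcripts, from Ensembl REST.

    Needs network access; it is not called in this sketch.
    """
    url = f"{ENSEMBL_REST}/lookup/symbol/{species}/{symbol}?expand=1"
    request = urllib.request.Request(
        url, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def summarize_gene(record):
    """Pull the exercise answers out of a lookup record (a dict)."""
    transcripts = record.get("Transcript", [])
    coding = [t for t in transcripts if t.get("biotype") == "protein_coding"]
    return {
        "gene_stable_id": record.get("id"),
        "name": record.get("display_name"),
        "description": record.get("description"),
        "location": "{}:{}-{}".format(record.get("seq_region_name"),
                                      record.get("start"), record.get("end")),
        "strand": "forward" if record.get("strand") == 1 else "reverse",
        "n_transcripts": len(transcripts),
        "n_protein_coding": len(coding),
    }


# Made-up record in the assumed response shape, for illustration only.
example_record = {
    "id": "ENSG_EXAMPLE", "display_name": "GENE1",
    "description": "example gene", "seq_region_name": "1",
    "start": 1000, "end": 5000, "strand": -1,
    "Transcript": [{"id": "T1", "biotype": "protein_coding"},
                   {"id": "T2", "biotype": "retained_intron"}],
}
summary = summarize_gene(example_record)
print(summary)
```

With network access, `summarize_gene(fetch_gene("AKT1"))` would return the same fields for AKT1, and the result should be cross-checked against what the Ensembl browser shows.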

HOMEWORK Day 2
- Revise your Homework 2 from Day 1.

- Extract the FASTA sequence of the genomic region of each of your genes (from Homework Day 1) and predict the gene structure of these DNA sequences using one gene prediction program. Summarize the exons and introns from your prediction, and write up your observations and conclusion.

- Find transcript information about each gene using NCBI and Ensembl, and compare it with the prediction from the gene prediction program.

- Explore the genomic information of your genes (from Homework Day 1) using Ensembl (see Exercise 2 for details).

- Between Ensembl and NCBI, which one would you prefer when searching for information on human genes? Why?

DEADLINE: 10am Thursday 15th 2021
Take-home message?
• Sequencing: primary data
• ORF finder, gene prediction
• NCBI, Ensembl
END
