THE INTERNATIONAL UNIVERSITY (IU) – VIETNAM NATIONAL UNIVERSITY – HCMC
MID-TERM EXAMINATION – CLASS
Date: 5/11/2021
Duration: 90 minutes
Student ID: BTBTIU19107    Name: Lê Phước Quyền
SUBJECT: BIOINFORMATICS
Dean of School of Biotechnology
Signature:
Full name:
Lecturer
Signature:
Full name: Dr. Nguyen Minh Thanh
Proctor 1:
Proctor 2:
(Sign and write full name)
Score:
Instructions:
1. This is an open-book examination.
2. Students must write each answer directly after its question.
Part 1 – Paper-based exam (50 points)
A. Short answer (8 points)
1. Arrange the E-values in ascending order (lowest to highest): 8e-146, 7e-52, 0.0, 3e-45. (2 points)
0.0, 8e-146, 7e-52, 3e-45
2. PAM195 is a matrix where an average of __75%__ of amino acids have changed during evolution.
(Write a percentage in the blank.) (2 points)
3. N50 is the length of the contig/scaffold at which __50%__ of the bases in a given assembly reside.
(Write a number or percentage in the blank.) (2 points)
4. Name the assembly method that merges short reads to create a novel full-length DNA
sequence when no prior reference sequence is available. (2 points)
De novo assembly
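The N50 definition in question 3 can be illustrated with a short sketch. The function name `n50` and the contig lengths below are hypothetical, chosen only for illustration; the sketch assumes the common convention that N50 is the length of the contig at which the running total, taken from longest to shortest, first reaches at least half of all assembled bases.

```python
def n50(lengths):
    """Return the N50 of a list of contig/scaffold lengths (in bases)."""
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    # Walk contigs from longest to shortest until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical contig lengths (bp), not taken from any real assembly.
contigs = [80, 70, 50, 40, 30, 20]
print(n50(contigs))  # 70: 80 + 70 = 150 bases, at least half of 290
```

Comparing `2 * running` with `total` keeps the arithmetic in integers, which avoids rounding surprises when the total is odd.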
B. Multiple choice (20 points) (Please highlight your choice in yellow)
5. A heterotrimer contains
a. One subunit
b. Two identical subunits
c. Two identical subunits & one different subunit
d. Three different subunits
e. c&d
6. Which of the following is incorrect about next-generation sequencing (NGS)
technologies?
a. Fast
b. NGS generates a huge number of reads per run
c. NGS reduces the cost of sequencing dramatically
d. NGS reads are typically long
e. Lower accuracy in comparison with Sanger technology
7. The correct order of steps in a genome study based on NGS technologies is
a. DNA library preparation – sequencing – trimming – de novo assembly – BUSCO
performance – chromosome assembly – annotation.
b. Sequencing – DNA library preparation – trimming – de novo assembly – BUSCO
performance – chromosome assembly – annotation.
c. DNA library preparation – annotation – trimming – de novo assembly – BUSCO
performance – chromosome assembly – sequencing.
d. DNA library preparation – sequencing – trimming – chromosome assembly – de
novo assembly – BUSCO performance – annotation.
e. DNA library preparation – trimming – sequencing – de novo assembly – BUSCO
performance – chromosome assembly – annotation.
8. Which of the following is incorrect about a primary database?
a. Raw sequence data with some basic information
b. Redundancy
c. Contains only protein sequences
d. Majority of protein sequences derived from computational translation.
e. All of the above
9. You have two distantly related proteins. Which BLOSUM or PAM matrix is best suited to
compare them?
a. BLOSUM45 or PAM250
b. BLOSUM45 or PAM1
c. BLOSUM80 or PAM250
d. BLOSUM80 or PAM1
e. None of the above options is best suited.
10. Which of the following is incorrect about a typical basic BLAST output?
a. The subject sequences are listed from the highest similarity at the top to
progressively lower similarities going down the list.
b. The subject sequences are listed from the highest similarity at the top together with
the highest E-values.
c. The subject sequences are listed from the highest similarity at the top together with
the highest bit scores.
d. The E-values are listed from lowest value at the top to increasingly higher values
going down the list.
e. The bit scores are listed from the highest value at the top to progressively lower
values going down the list.
11. What is the difference between RefSeq and GenBank?
a. RefSeq includes publicly available DNA sequences submitted from individual
laboratories and sequencing projects.
b. GenBank provides nonredundant curated data.
c. GenBank sequences are derived from RefSeq.
d. RefSeq sequences are derived from GenBank and provide nonredundant curated
data.
e. There is no difference between the two databases.
12. Which of the following is correct about normalized BLAST scores (also called bit scores):
a. are unitless;
b. are not related to the scoring matrix that is used;
c. can be compared between different BLAST searches, even if different scoring
matrices are used;
d. can be compared between different BLAST searches, but only if the same scoring
matrices are used.
e. cannot be compared between different BLAST searches, if different scoring
matrices are used;
13. It is extremely difficult for intrinsic (ab initio) gene‐finding algorithms to predict protein‐
coding genes in eukaryotic genomic DNA. What is the main problem?
a. exon/intron borders are hard to predict;
b. introns may be many kilobases in length;
c. the GC content of coding regions is not always differentiated from the GC content
of noncoding regions;
d. All of the above.
e. None of the above.
14. Most sequencing technologies produce raw data in what format?
a. FASTA;
b. FASTQ;
c. FASTG;
d. FASTX;
e. FASTQC.
C. Matching (match an appropriate term with each definition) (10 points)
Terms:
DNA sequencing Accession number Contig
Chromosome Consensus sequence Coding sequence
Paralogs Orthologs Read
Definitions and matched terms:
15. A unique identifier assigned to mark the entry of a sequence (protein or nucleic acid) in a primary or secondary database.
Term: Accession number
16. A method that determines the nucleotide sequence of a DNA molecule.
Term: DNA sequencing
17. A contiguous segment of DNA that was generated by joining overlapping reads.
Term: Contig
18. The part of the DNA that is transcribed into mRNA during transcription and then translated into protein.
Term: Coding sequence
19. Homologous proteins that perform the same function in different species.
Term: Orthologs
D. Calculation (12 points)
Consider the following alignment (| is perfect match, : is similar match, . is dissimilar match):
Question Answer
20. What is the length of the alignment? 17
21. What is the percent identity? 5/17 = 29.4%
22. What is the percent similarity? 7/17 = 41.2%
23. What is the percent gap? 8/17 = 47.1%
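The arithmetic in the answers above can be checked with a small sketch. The alignment itself is not reproduced here, so the match line below is hypothetical: it is built only to have the counts implied by the answers (5 identical, 2 similar, 2 dissimilar, and 8 gap columns, 17 columns in total). The sketch also assumes that gap columns appear as spaces in the match line and that similarity counts identical plus similar positions.

```python
def alignment_stats(match_line):
    """Compute percent identity, similarity and gap from a match line.

    Symbols follow the question: '|' identical, ':' similar, '.' dissimilar.
    A space is assumed to mark a column that contains a gap.
    """
    length = len(match_line)
    identical = match_line.count("|")
    similar = match_line.count(":")
    gaps = match_line.count(" ")
    return {
        "length": length,
        "identity": round(100 * identical / length, 1),
        # Similarity is taken as identical + similar columns (an assumption).
        "similarity": round(100 * (identical + similar) / length, 1),
        "gap": round(100 * gaps / length, 1),
    }


# Hypothetical match line; the column order is arbitrary, only counts matter.
match_line = "|" * 5 + ":" * 2 + "." * 2 + " " * 8
print(alignment_stats(match_line))
# {'length': 17, 'identity': 29.4, 'similarity': 41.2, 'gap': 47.1}
```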
Part 2 – Computer-based exam (50 points)
Question 24: (14 points)
Set up an appropriate BLAST search for the sequence in the file Unknown sequence_A, then answer
the following questions.
Question Answer
a. What is the accession number of the closest match to the query sequence? NP_001187174
b. Protein name: Somatotropin precursor
c. Length of peptide: 200 aa
d. The common name of the species: Channel catfish
e. The scientific name of the species: Ictalurus punctatus
f. Function of the protein: Stimulates the liver and other tissues to secrete IGF-1, which
stimulates both the differentiation and proliferation of myoblasts.
Question 25: (36 points)
Run ORF Finder (https://www.ncbi.nlm.nih.gov/orffinder/) to predict potential genes in the unknown
sequence in the file Unknown sequence_B. Answer:
a. How many open reading frames (ORFs) are found? (2 points)
b. Give the following information for the largest ORF: (7 points)
Question Answer
Label ORF143
Strand (+ or -) +
Frame number 2
Position of start nucleotide 39014
Position of stop nucleotide 42259
Length of ORF 3246 nt
Length of polypeptide translated from this frame 1081 aa
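The lengths and frame in the table above follow from the start and stop coordinates, which the sketch below works through. The function name `orf_summary` is hypothetical. The sketch assumes 1-based inclusive coordinates on the plus strand, an ORF span that includes the stop codon, and a frame number derived from the start position modulo 3; these conventions are consistent with the values reported here but are not confirmed against the ORF Finder documentation.

```python
def orf_summary(start, stop):
    """Derive ORF length (nt), polypeptide length (aa) and frame number
    from 1-based inclusive coordinates on the plus strand."""
    length_nt = stop - start + 1
    if length_nt % 3 != 0:
        raise ValueError("ORF length is not a multiple of 3")
    length_aa = length_nt // 3 - 1  # the stop codon adds no residue
    frame = (start - 1) % 3 + 1     # frames 1 to 3 on the plus strand
    return length_nt, length_aa, frame


print(orf_summary(39014, 42259))  # (3246, 1081, 2)
```

The same function applied to the ORF200 coordinates later in this exam, `orf_summary(32361, 34049)`, gives `(1689, 562, 3)`.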
Navigate to ORF200 and perform a protein BLAST (BLASTP) for the polypeptide sequence
translated from ORF200, using the Non-redundant protein sequences (nr) database. Answer:
c. Give the following information for ORF200 from ORF Finder: (11 points)
Question Answer
Strand (+ or -) +
Frame number 3
Position of start nucleotide 32361
Position of stop nucleotide 34049
Length of ORF 1689 nt
Length of polypeptide translated from this frame 562 aa
Write the first five amino acids MRGCV
Write the nucleotide sequence of the coding strand that corresponds to the first five amino acids ATGCGCGGGTGCGTA
Write the nucleotide sequence of the template strand that corresponds to the first five amino acids TACGCGCCCACGCAT
Note: the template strand is the strand that is complementary to the coding strand
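The relationship between the coding strand, the template strand, and the first five amino acids can be checked with a short sketch. The helper names `translate` and `complement` are hypothetical, and the sketch assumes the standard genetic code. The complement is computed position by position, so if the coding strand is read 5' to 3', the printed template strand reads 3' to 5'.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(coding):
    """Translate a coding-strand DNA sequence codon by codon."""
    codons = (coding[i:i + 3] for i in range(0, len(coding) - 2, 3))
    return "".join(CODON_TABLE[codon] for codon in codons)


def complement(seq):
    """Return the base-by-base complement (not reversed)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))


coding = "ATGCGCGGGTGCGTA"
print(translate(coding))   # MRGCV
print(complement(coding))  # TACGCGCCCACGCAT
```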
d. Results of the best hit from BLASTP: (16 points)
Question Answer
Accession number AEI75106
Max score 1033
E-value 0.0
Percent identity 100.00 %
Length of polypeptide 518 aa
Protein name Putative 30S ribosomal protein S1
Organism name Candidatus Tremblaya princeps PCIT
GOOD LUCK!