Bioinformatics

The document provides definitions and explanations of various biological concepts and tools, including KEGG, ORF, and Sanger sequencing. It discusses the importance of databases like STRING and PDB for protein analysis, as well as the functionalities of different BLAST types for sequence alignment. Additionally, it covers RNA and DNA chemistry, secondary structures, and the significance of NGS in modern biological research.

Uploaded by

Tamanna Jena
Copyright
© © All Rights Reserved

4. Define KEGG? What is BRITE in KEGG database?

Ans: KEGG (Kyoto Encyclopedia of Genes and Genomes) is a collection of databases
dealing with genomes, biological pathways, diseases, drugs, and chemical substances. KEGG
is utilized for bioinformatics research and education, including data analysis in genomics,
metagenomics, metabolomics and other omics studies, modeling and simulation in systems
biology, and translational research in drug development.
The KEGG BRITE database is a collection of BRITE hierarchy files,
called htext (hierarchical text) files, with additional files for binary relations.

5. Book page 98 and 105.

6. Define ORF with a valid explanation? Which codons are called initiation and
termination codons?
Ans: An open reading frame (ORF) is a portion of a DNA sequence, read as consecutive
codons, that begins with a start codon and is not interrupted by a stop codon (which
functions as a stop signal). Potential coding regions can be detected by looking at ORFs:
– In each reading frame, a genome of length n comprises about n/3 codons
– Stop codons (TAA, TAG or TGA) break the genome into segments
between consecutive stop codons
– The subsegments of these that start from the start codon (ATG) are ORFs
– ORFs in different frames may overlap
A start codon interacts with initiation factors or nearby sequences to initiate the translation
process, whereas a stop codon alone is sufficient to trigger termination. The standard start
(initiation) codon is AUG. The standard stop (termination) codons are UAG, UGA and UAA.
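
The scan described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: it looks at the three forward reading frames only, uses ATG as the sole start codon, and the function name find_orfs and its min_codons parameter are made up for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"


def find_orfs(dna, min_codons=1):
    """Return (frame, start, end, sequence) for ORFs on the forward strand.

    start is the 0-based index of the ATG, end is exclusive and includes
    the stop codon. min_codons counts the codons before the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None  # index of the first ATG since the last stop codon
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == START_CODON:
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((frame, start, i + 3, dna[start:i + 3]))
                start = None  # keep scanning for the next ORF in this frame
    return orfs


for orf in find_orfs("CCATGAAATAGGATGCCCTGA"):
    print(orf)
```

A fuller ORF finder would also scan the reverse complement, giving six frames in total.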
7.
Some tRNAs can form base pairs with more than one codon.
Atypical base pairs (between nucleotides other than A-U and G-C) can form at the third
position of the codon, a phenomenon known as wobble. Wobble pairing doesn't follow the
normal rules, but it does have rules of its own. For instance, a G in the anticodon can pair
with a C or U (but not an A or G) in the third position of the codon. Rules like this
ensure codons are read correctly despite wobble.

A likely reason for wobble is that it allows fewer tRNAs to cover all the codons of the
genetic code, while still making sure that the code is read accurately.
[A wobble base pair is a pairing between two nucleotides in RNA molecules that does not
follow Watson-Crick base pair rules] (1 mark question).

8. Define terminal and internal nodes in a phylogenetic tree structure?


Ans:
Terminal nodes - represent the data (e.g. sequences) under comparison (A, B, C, D, E), also
known as OTUs (Operational Taxonomic Units).

Internal nodes - represent inferred ancestral units (usually without empirical
data), also known as HTUs (Hypothetical Taxonomic Units).

9. Which confidence measure does AlphaFold2 use?

Ans: AlphaFold2 uses the predicted local-distance difference test (pLDDT) as its confidence
measure. Its authors report that pLDDT reliably predicts the Cα local-distance difference
test (lDDT-Cα) accuracy of the corresponding prediction, and that side-chain accuracy is
high when the backbone prediction is accurate.

10. Which amino acids act as helix breakers and helix formers?
Ans: Proline and glycine - helix breakers
Alanine - helix former

11. State the formula for systematic conformational search?


Ans: Systematic (deterministic) search procedures
● Grid Scan
● Custom Search
● Cyclic Modelling

There are two ways to perform a systematic search, Grid Scan and Custom Search.
In a Grid Scan search, each specified torsion angle is varied over a grid of equally spaced
values. If more than one torsion angle is involved, the variation of the torsion angles is
nested. If there are two angles a and b, then for each value of a, angle b runs through its
whole grid of values; that is, b is the faster torsion angle and the b loop is inside the a loop.
Although the application is capable of handling up to 10 grid torsions, it is impractical in
most cases to employ grid scan for more than four torsion angles.
In a Custom Search, torsion angles are assigned specific values. These values do not need to
be equally spaced. This is an advantage in those cases where favorable states of a torsion
angle are known from previous modeling studies and the intent is to restrict the systematic
search to these values. A further advantage of Custom Search is that it can handle, if so
desired, simultaneous changes in several torsion angles. As in Grid Scan, these changes may
also be nested.
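
The nesting of torsion angles, and the reason a grid scan becomes impractical beyond a few torsions, can be illustrated with a short Python sketch. It only enumerates angle combinations and does not build or score any conformer; the function names and the 30-degree step are assumptions chosen for the example.

```python
from itertools import product


def grid_scan(n_torsions, step_deg):
    """Yield every combination of torsion angles on an equally spaced grid.

    product() behaves like nested loops: the last torsion varies fastest,
    which corresponds to the innermost loop.
    """
    grid = range(0, 360, step_deg)
    yield from product(grid, repeat=n_torsions)


def custom_search(allowed_values):
    """Yield combinations from user-chosen values (one list per torsion).

    The values do not need to be equally spaced.
    """
    yield from product(*allowed_values)


# Two torsions a and b on a 120-degree grid: b is the inner (faster) loop.
print(list(grid_scan(2, 120)))

# The number of conformers grows as (360 / step) ** n_torsions.
for n in range(1, 5):
    print(n, "torsions:", (360 // 30) ** n, "conformers at a 30-degree step")

# Custom search restricted to known favorable states of two torsions.
print(list(custom_search([[60, 180, 300], [0, 180]])))
```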

12. Write down different databases used for constructing functional association
networks for proteins?
Ans: STRING, HumanNet, GeneMANIA, HumanBase, IMP, I2D, and ConsensusPathDB.

13. Define SANGER? What is it used for?

Ans: Sanger sequencing, also known as the “chain termination method”, is a method for
determining the nucleotide sequence of DNA. It involves electrophoresis and is
based on the random incorporation of chain-terminating dideoxynucleotides by DNA
polymerase during in vitro DNA replication.
Sanger sequencing was used in the Human
Genome Project to determine the sequences of relatively small fragments of human DNA
(900 bp or less). These fragments were used to assemble larger DNA fragments and,
eventually, entire chromosomes.

14. Give a difference between BLAST and BLAST+.

Ans: The Basic Local Alignment Search Tool (BLAST) finds regions of local similarity
between sequences. The program compares nucleotide or protein sequences to sequence
databases and calculates the statistical significance of matches. BLAST can be used to infer
functional and evolutionary relationships between sequences as well as help identify
members of gene families.
BLAST+ is the suite of command-line tools that the NCBI provides for running BLAST. It
allows users to perform BLAST searches on their own server without size, volume and
database restrictions, and because it is run from the command line it can be integrated
directly into a workflow.

15. What are the E and S values used in BLAST?

E value:
• The E-value (expect value) is a statistical measure that BLAST reports for each pair of
aligned sequences; it gives an idea of whether the alignment is good and whether
the two sequences really match.

• The BLAST E-value is the number of hits of similar quality (score) that could be
expected to be found just by chance; an E-value of 10 means that up to 10 hits can be
expected to be found by chance.

• The E-value provides the information about the likelihood that a given sequence
match is purely by chance and is used as the first quality filter for the BLAST search
result.

• The lower the E-value the better the match which means if E is less than 1e-50, then
there is high confidence that the database match is a result of homologous
relationships.

• If the value of E is between 0.01 and 10 then the match is considered to be non-
significant but may have a weak homology relationship.

• Similarly, if the value of E is greater than 10, then the sequence under consideration is
either unrelated or if related then has an extremely distant relationship.

• The E-value (expected value) is derived from the bit score by adjusting it to the size of
the sequence database, so it depends on the size of the database used.

• The same sequence hit gets a better (lower) E-value when it is found in a smaller database.

S value:
• Once a similar sequence has been found for the query sequence in the database
through BLAST, it becomes essential to know whether the alignment
is good and whether it reflects a possible biological relationship. BLAST therefore
uses statistical theory to produce a bit score for each alignment pair.

• The bit score indicates how good the alignment is: the higher
the score, the better the alignment.

• Generally, this score is calculated by taking into consideration the alignment of the
similar or identical residues and the gaps introduced while aligning the sequences.

• A “substitution matrix” supplies the score for aligning any possible pair of residues.

• For most of the BLAST programs, the BLOSUM62 matrix is the default with the
exception of BLASTn and MegaBLAST as these are the programs that perform
nucleotide-nucleotide comparisons and do not use protein-specific matrices.

• Bit scores from different alignments can be compared, even when different
scoring matrices were used.

• Bit score is not dependent on the size of the database and gives the same value for hits
in databases of different sizes.
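
The relationship between the two values can be made concrete with the Karlin-Altschul formulas, E = m · n · 2^(-S'), where S' is the bit score, m the query length and n the database length. The Python sketch below is only an approximation of what BLAST reports, since BLAST uses effective (edge-corrected) lengths; the lengths and the bit score in the example are made-up numbers.

```python
import math


def bit_score(raw_score, lam, k):
    """Convert a raw alignment score S to a bit score S'.

    lam (lambda) and k are the statistical parameters of the scoring system.
    """
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(bits, query_len, db_len):
    """Expected number of chance hits with at least this bit score."""
    return query_len * db_len * 2.0 ** (-bits)


# The same hit (bit score 50, query of 100 residues) in two databases:
small_db = e_value(50, 100, 1_000_000)
large_db = e_value(50, 100, 100_000_000)
print(f"small database: E = {small_db:.2e}")
print(f"large database: E = {large_db:.2e}")
```

The bit score is identical in both searches, while the E-value grows in proportion to the database size.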

16. Same as 6.

17. Name some secondary protein structures. Which tool can be used to visualize
these structures?

Ans: The most common types of secondary structures are the α helix and the β pleated
sheet. Both structures are held in shape by hydrogen bonds, which form between the carbonyl
O of one amino acid and the amino H of another.

Structures deposited in the PDB can be viewed in 3D on the RCSB PDB website, which
shows the secondary structures of proteins.

18. Same as 7.

19. Name the databases for metabolic pathways, and protein structure information.

Ans: Metabolic pathway databases: KEGG

PDB for protein structure information.

20. Same as 8.

21. What is AlphaFold2?

Ans: AlphaFold2 is an artificial intelligence (AI) tool that predicts the
3D shape of proteins from their amino acid sequence with, for the most part, pinpoint
accuracy.

Q. Give a plausible explanation on the chemistry of RNA and DNA? Describe the
components of RNA secondary structure? Which database is used for RNA 3D
structure prediction?

Ans: DNA (deoxyribonucleic acid) is the genomic material in cells that contains the genetic
information used in the development and functioning of all known living organisms. DNA,
along with RNA and proteins, is one of the three major macromolecules that are essential for
life. In eukaryotic cells, most of the DNA is located in the nucleus, where it is
organized into structures called chromosomes, although a small amount can be found in
mitochondria (mitochondrial DNA). DNA consists of two long polymers of simple
units called nucleotides, with backbones made of sugars and phosphate groups joined by ester
(phosphodiester) bonds. These two strands run in opposite directions to each other and are
therefore anti-parallel. Attached to each sugar is one of four types of molecules called
nucleobases (bases). It is the sequence of these four bases along the backbone that encodes
information. Read according to the genetic code, the sequence of bases specifies the
sequence of the amino acids within proteins. The ends of DNA strands are called the 5′ (five
prime) and 3′ (three prime) ends. The 5′ end has a terminal phosphate group and the 3′ end a
terminal hydroxyl group. Bases are classified into two types: the purines, A and G, and the
pyrimidines, the six-membered rings C, T and U. Uracil (U) takes the place of thymine in
RNA and differs from thymine by lacking a methyl group on its ring. Uracil is not usually
found in DNA, occurring only as a breakdown product of cytosine.
Levels of DNA:
1) Primary
2) Secondary
3) Tertiary
4) Quaternary

RNA is another macromolecule essential for all known forms of life. Like DNA, RNA is
made up of nucleotides. Once thought to play ancillary roles, RNAs are now understood to be
among a cell’s key regulatory players: they catalyze biological reactions, control and
modulate gene expression, sense and communicate responses to cellular signals, etc. The
chemical structure of RNA is very similar to that of DNA: each nucleotide consists of a
nucleobase, a ribose sugar, and a phosphate group. There are two differences that distinguish
DNA from RNA: (a) RNA contains the sugar ribose, while DNA contains the slightly
different sugar deoxyribose (a type of ribose that lacks one oxygen atom), and (b) RNA has
the nucleobase uracil while DNA contains thymine. Unlike DNA, most RNA molecules are
single-stranded and can adopt very complex three-dimensional structures.

The RNA secondary structure is also called the stem-and-loop structure. As long as all the
paired bases of an RNA sequence are determined, the secondary structure of the entire RNA
can be determined.
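
As a small illustration of that last point, the Python sketch below recovers the list of paired bases from dot-bracket notation, a common text encoding of RNA secondary structure in which matching parentheses mark paired positions and dots mark unpaired ones. The notation is an assumption brought in for this example (it is not used elsewhere in these notes), and pseudoknots are not handled.

```python
def paired_bases(dot_bracket):
    """Return the (i, j) index pairs encoded by a dot-bracket string."""
    stack = []  # positions of '(' still waiting for a partner
    pairs = []
    for i, symbol in enumerate(dot_bracket):
        if symbol == "(":
            stack.append(i)
        elif symbol == ")":
            if not stack:
                raise ValueError("unbalanced structure: extra ')'")
            pairs.append((stack.pop(), i))
    if stack:
        raise ValueError("unbalanced structure: extra '('")
    return sorted(pairs)


# A hairpin: a stem of two base pairs closing a loop of three unpaired bases.
sequence = "GCAAAGC"
structure = "((...))"
for i, j in paired_bases(structure):
    print(f"{sequence[i]}{i} pairs with {sequence[j]}{j}")
```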
Levels of RNA:
1) Primary: the sequence of nucleotides (i.e., four bases A, C, G, and
U) in the single-stranded polymer of RNA.
2) Secondary: hairpins, bulges and internal loops.
3) Tertiary: A-minor motif, 3-way junction, pseudoknot, etc.
4) Quaternary: supramolecular organisation.

Chemically speaking, DNA and RNA are very similar. Nucleic acid structure is often divided
into four different levels: primary, secondary, tertiary, and quaternary.

The database used for RNA 3D structure prediction is RNArchitecture.

Q. Give a brief description on the different types of BLAST and describe their functionalities.
Ans:
BLASTN
• The query is a nucleotide sequence
• The database is a nucleotide database
• No conversion is done on the query or database
• DNA :: DNA homology
• Mapping oligos to a genome
• Annotating genomic DNA with transcriptome data from ESTs and RNA-Seq
• Annotating untranslated regions

BLASTP
• The query is an amino acid sequence
• The database is an amino acid database
• No conversion is done on the query or database
• Protein :: Protein homology
• Protein function exploration
• For a novel gene, make the parameters more sensitive

BLASTX
• The query is a nucleotide sequence
• The database is an amino acid database
• The query is translated in all six reading frames and the translations are used to search the database
• Coding nucleotide seq :: Protein homology
• Gene finding in genomic DNA
• Annotating ESTs and transcripts assembled from RNA-Seq data

TBLASTN
• The query is an amino acid sequence
• The database is a nucleotide database
• The database is translated in all six frames and searched with the protein
sequence
• Protein :: Coding nucleotide DB homology
• Mapping a protein to a genome
• Mining ESTs and RNA-Seq data for protein similarities

TBLASTX
• The query is a nucleotide sequence
• The database is a nucleotide database
• Both the query and the database are translated in all six frames
• Coding :: Coding homology
• Searching distantly-related species
• Sensitive but expensive
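
The six-frame translation that BLASTX, TBLASTN and TBLASTX rely on can be illustrated with a minimal Python sketch. It is not how BLAST itself is implemented; the function names are made up for this example, the standard genetic code is assumed, and codons containing anything other than A, C, G or T are written as X.

```python
BASES = "TCAG"
# Standard genetic code, listed in TCAG order for all three codon positions.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TABLE = dict(zip(CODONS, AMINO_ACIDS))


def reverse_complement(dna):
    """Return the reverse complement of an A/C/G/T sequence."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons from position 0; '*' marks a stop codon."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Translate three frames on the given strand and three on its reverse complement."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


for name, protein in six_frame_translation("ATGGCC").items():
    print(name, protein)
```

Translating both the query and the database six ways, as TBLASTX does, multiplies the number of comparisons, which is why it is the most expensive of the five programs.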

Q. Suppose you have two sequences, and you suspect that they diverged from a common
ancestor. What possible events might have occurred during the evolution process? Draw a
schematic to represent the evolution process? State the differences between homology,
orthology, paralogy, xenology, analogy and cenancestor? (Sequence alignment concepts : Pg
4-6)
Ans: Divergence from the common ancestor can be due to either duplication or speciation.
The mutational events that occur during evolution are:
● Substitutions
● Deletions
● Insertions

Definitions:
● Homology: the two sequences diverged from a common ancestor. The same organ
under every variety of form and function. Homology is the relationship of any two
characters that have descended, usually with divergence, from a common ancestral
character.

● Analogy: relationship of two characters that have developed convergently from
unrelated ancestors.

● Orthology: relationship of any two homologous characters whose common ancestor
lies in the cenancestor of the taxa from which the two sequences were obtained.

● Paralogy: relationship of two characters arising from a duplication of the gene for
that character.

● Xenology: relationship of any two characters whose history, since their common
ancestor, involves interspecies (horizontal) transfer of the genetic material for at least
one of those characters.

● Cenancestor: the most recent common ancestor of the taxa under consideration.

Q. Describe NGS. What are the various techniques used to carry out NGS? Give brief
elaboration.
Ans: Next-generation sequencing (NGS) is a massively parallel sequencing technology that
offers ultra-high throughput, scalability, and speed. The technology is used to determine the
order of nucleotides in entire genomes or targeted regions of DNA or RNA. NGS has
revolutionized the biological sciences, allowing labs to perform a wide variety of applications
and study biological systems at a level never before possible.

Q. Suppose you are given a particular protein named ‘Keratin’. How will you retrieve its
(a) Nucleic acid sequence (b) Protein sequence (c) Carbohydrate binding site, if present
(d) Protein chains (e) Amino acid frequency etc.? Describe briefly.

All the required information concerning any protein (i.e., keratin) can be obtained from the
appropriate databases.
(a) Nucleic acid sequence encoding keratin (gene and cDNA or mRNA) can be obtained from
the NCBI and Ensembl databases. These services contain the complete sequences of all human
genes, as well as genes present in other organisms.
(b) Protein sequence can also be found in these databases (NCBI, Ensembl), as well as in the
UniProt database. Alternatively, the protein sequence can be obtained by a simple
translation of the cDNA or mRNA sequence using the ExPASy translation tool. In general, the
reading frame that gives the longest translation product corresponds to the correct protein
sequence.
(c) / (d) Both the carbohydrate-binding site and the protein chains are structural
features of the protein that can be retrieved from the RCSB PDB database, which contains 164174
biological macromolecular structures, as well as their structural and functional features.
(e) Amino acid frequency can be calculated using the ExPASy ProtParam tool that calculates
the percentage of each amino acid in the protein while the one-letter amino acid sequence is
used as an input.
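
The percentage calculation in (e) is simple enough to sketch directly in Python. This is not ProtParam's implementation, only an illustration of the same idea; the function name and the short test sequence are made up.

```python
from collections import Counter


def amino_acid_percentages(protein):
    """Return the percentage of each residue in a one-letter protein sequence."""
    protein = protein.upper()
    counts = Counter(protein)
    return {aa: 100 * n / len(protein) for aa, n in sorted(counts.items())}


# A made-up peptide, not a real keratin sequence.
for aa, pct in amino_acid_percentages("GSGSGYGGKS").items():
    print(f"{aa}: {pct:.1f}%")
```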

Q. Give an overview of High-throughput sequencing?

Ans: High-throughput sequencing is sequencing that processes many DNA molecules in
parallel, enabling hundreds of millions of DNA molecules to be sequenced at a time.

Q. Perform the Needleman-Wunsch algorithm with explanation (tabular chart) and algorithm for
the following sequences.
Sequence 1: GATTACA
Sequence 2: GTCGACGCA
Match score 2
Mismatch score -2
Gap score -5
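
One way to check a hand-filled table is the minimal Python sketch below of Needleman-Wunsch global alignment with a linear gap penalty. The function name and the tie-breaking order in the traceback (diagonal first, then up, then left) are choices made for this example; other orders can give a different but equally scoring alignment.

```python
def needleman_wunsch(a, b, match=2, mismatch=-2, gap=-5):
    """Globally align a and b; return (score, aligned_a, aligned_b, table)."""
    n, m = len(a), len(b)
    # table[i][j] is the best score for aligning a[:i] with b[:j].
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        table[i][0] = i * gap
    for j in range(1, m + 1):
        table[0][j] = j * gap

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            table[i][j] = max(
                table[i - 1][j - 1] + sub(i, j),  # align a[i-1] with b[j-1]
                table[i - 1][j] + gap,            # gap in b
                table[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and table[i][j] == table[i - 1][j - 1] + sub(i, j):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and table[i][j] == table[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return table[n][m], "".join(reversed(top)), "".join(reversed(bottom)), table


score, row1, row2, table = needleman_wunsch("GATTACA", "GTCGACGCA")
for row in table:  # the tabular chart: rows follow GATTACA, columns GTCGACGCA
    print(" ".join(f"{value:4d}" for value in row))
print(row1)
print(row2)
print("score:", score)
```

Each cell takes the maximum of the diagonal, upper and left neighbours plus the corresponding match, mismatch or gap score, and the bottom-right cell holds the optimal alignment score.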

Q. How to identify a biomarker?

● Bioinformatics plays a key role in the biomarker discovery
process, bridging the gap between initial discovery phases
such as experimental design, clinical study execution, and
bioanalytics, including sample preparation, separation and
high-throughput profiling and independent validation of
identified candidate biomarkers.

● Once a biomarker cohort study has been set up, and sample
collection, preparation, separation and MS analysis have
been carried out, an extensive technical review of
generated data is essential to ensure a high degree of
consistency, completeness and reproducibility in the data.

● Data preprocessing, as a preliminary data mining practice
performed on the raw data, is necessary to transform data
into a format that will be more easily and effectively
processed for the purpose of targeted analyses. There are a
number of methods used for data preprocessing, including
data transformation (e.g. logarithmic scaling of data) and
normalization, e.g. using z-transformation, data sampling or
outlier detection.
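
The preprocessing steps named in the last point can be sketched with Python's standard library. The intensity values are invented, the function names are illustrative, and the outlier cut-off on the absolute z-score is an assumption that would need tuning for real data.

```python
import math
import statistics


def log2_transform(values):
    """Logarithmic scaling: compresses the range of strictly positive intensities."""
    return [math.log2(v) for v in values]


def z_transform(values):
    """z-transformation: centre on the mean and scale by the sample standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


def flag_outliers(values, threshold=3.0):
    """Return the values whose absolute z-score exceeds the threshold."""
    return [v for v, z in zip(values, z_transform(values)) if abs(z) > threshold]


intensities = [10, 11, 9, 10, 12, 10, 11, 9, 10, 100]  # made-up measurements
print(log2_transform([1, 2, 4, 8]))
print([round(z, 2) for z in z_transform(intensities)])
print(flag_outliers(intensities, threshold=2.5))
```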

Q. How to download fasta sequence of protein?

1. Open the NCBI website (http://www.ncbi.nlm.nih.gov/)
2. Select the Protein database (in place of All Databases) and type the name of the protein.
3. From the list obtained, choose the specific protein and click on it.
4. Just below the name of the protein, FASTA is written; click on it.
5. Download in the .txt format.
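
The same download can also be scripted. The Python sketch below uses the NCBI E-utilities efetch service with db=protein and rettype=fasta; the accession passed in is something you would look up first (ABC123 is only a placeholder, not a real record), and the helper names are made up for this example.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accession):
    """Build the efetch URL that returns a protein record as FASTA text."""
    params = {"db": "protein", "id": accession, "rettype": "fasta", "retmode": "text"}
    return EFETCH_URL + "?" + urlencode(params)


def fetch_fasta(accession):
    """Download the FASTA text for one accession (needs network access)."""
    with urlopen(build_efetch_url(accession)) as response:
        return response.read().decode("utf-8")


def parse_fasta(text):
    """Split FASTA text into (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# "ABC123" is a placeholder accession used only to show the URL shape.
print(build_efetch_url("ABC123"))
print(parse_fasta(">example protein\nMKT\nAAG\n"))
```

Calling fetch_fasta with a real accession and writing the returned text to a file gives the same content as the manual download.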
