COVID-19 DNA sequence data using Python.
Major Modules Used:
Biopython
Squiggle
Pandas
Importing Modules:
from __future__ import division
from Bio.SeqUtils import ProtParam
import warnings
import pandas as pd
from Bio import SeqIO
from Bio.Data import CodonTable
We will use Bio.SeqIO from Biopython to parse the
DNA sequence data (FASTA). It provides a simple,
uniform interface for reading and writing assorted
sequence file formats.
for sequence in SeqIO.parse(r'Covid.fna', "fasta"):
    print(sequence.seq)
    print(len(sequence), 'nucleotides')
DNAsequence = SeqIO.read(r'Covid.fna', "fasta")
print(DNAsequence)
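To make the parsing step less of a black box, here is a minimal pure-Python sketch of what reading a FASTA file involves: a header line starting with ">" followed by one or more sequence lines. The read_fasta helper and the demo record are hypothetical illustrations, not Biopython's implementation and not the real contents of Covid.fna.

```python
from io import StringIO


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical record for illustration only
example = StringIO(">demo_record example\nATTAAAGG\nTT\n")
records = list(read_fasta(example))
print(records)
```

In practice SeqIO.parse is the better choice, since it also handles the other formats and returns record objects rather than plain tuples.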
Since the input sequence is FASTA (DNA) and the
coronavirus is an RNA virus, we need to:
Transcribe DNA to RNA (ATTAAAGGTT… =>
AUUAAAGGUU…)
Translate RNA to an amino acid sequence
(AUUAAAGGUU… => IKGLYLPR*Q…)
In the current scenario, the .fna file starts with
ATTAAAGGTT. Calling transcribe() replaces each T
(thymine) with U (uracil), so we get an RNA
sequence that starts with AUUAAAGGUU.
The transcribe() method converts the DNA to
mRNA.
DNA = DNAsequence.seq
mRNA = DNA.transcribe()
print(mRNA)
print('Size : ', len(mRNA))
The difference between the DNA and the mRNA is
just that the bases T (for Thymine) are replaced
with U (for Uracil).
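As a rough illustration of that substitution, the same result can be reproduced on a plain string. This is only a sketch of the idea, assuming an uppercase-normalised coding-strand sequence; the transcribe function below is a hypothetical stand-in, not Biopython's method.

```python
def transcribe(dna):
    """Return the RNA string for a coding-strand DNA string (T -> U)."""
    return dna.upper().replace("T", "U")


rna = transcribe("ATTAAAGGTT")
print(rna)  # AUUAAAGGUU
```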
Next, we translate the mRNA sequence into an
amino acid sequence using the translate() method.
We get something like IKGLYLPR*Q (* marks a
so-called STOP codon, which effectively acts as a
separator between proteins).
Amino_Acid = mRNA.translate(table=1, cds=False)
print('Amino Acid', Amino_Acid)
print("Length of Protein:", len(Amino_Acid))
print("Length of Original mRNA:", len(mRNA))
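To see roughly what translation does, the sketch below reads an mRNA string three bases at a time and looks each codon up in the standard genetic code. It is a simplified stand-in for translate(), not Biopython's implementation: the names BASES, AMINO_ACIDS, CODON_TABLE and this translate function are our own, and we assume an unambiguous A/C/G/U sequence, dropping any trailing partial codon.

```python
from itertools import product

# Standard genetic code, codons ordered with U, C, A, G at each position
BASES = "UCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna):
    """Translate an mRNA string codon by codon; '*' marks a stop codon."""
    usable = len(mrna) - len(mrna) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE[mrna[i:i + 3]] for i in range(0, usable, 3))


print(translate("AUUAAAGGUU"))  # IKG
```

Since every three nucleotides yield one symbol, the protein string is about one third the length of the mRNA, which is what the two length printouts above show.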
The standard genetic code is traditionally
represented as an RNA codon table because, when
proteins are made in a cell by ribosomes, it is
mRNA that directs protein synthesis. The mRNA
sequence is determined by the sequence of
genomic DNA. Here are some features of codons:
Most codons specify an amino acid
Three “stop” codons mark the end of a protein
One “start” codon, AUG, marks the beginning of a
protein and also encodes the amino acid
methionine.
A series of codons in part of a messenger RNA
(mRNA) molecule. Each codon consists of three
nucleotides, usually corresponding to a single
amino acid. The nucleotides are abbreviated with
the letters A, U, G, and C. This is mRNA, which
uses U (uracil). DNA uses T (thymine) instead. This
mRNA molecule will instruct a ribosome to
synthesize a protein according to this code.
print(CodonTable.unambiguous_rna_by_name['Standard'])
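The codon features listed above can be checked with a few lines of plain Python. This is a sketch that rebuilds the standard table by hand; the variable names are ours and the code does not rely on Biopython.

```python
from itertools import product

# Standard genetic code, codons ordered with U, C, A, G at each position
BASES = "UCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
codon_table = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

stop_codons = sorted(c for c, aa in codon_table.items() if aa == "*")
coding_codons = [c for c, aa in codon_table.items() if aa != "*"]

print("Stop codons:", stop_codons)
print("Coding codons:", len(coding_codons), "of", len(codon_table))
print("AUG encodes:", codon_table["AUG"])  # M (methionine)
```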
Now we extract the proteins (chains of amino
acids) by splitting the sequence at each stop
codon, marked by * (asterisk). Then we remove
any chain shorter than 20 amino acids, since that
is the length of the smallest known functional
protein.
Proteins = Amino_Acid.split('*')
# Store plain strings so each chain occupies a single cell
df = pd.DataFrame([str(p) for p in Proteins])
df.describe()
print('Total proteins:', len(df))
def conv(item):
    return len(item)
def to_str(item):
    return str(item)
df['sequence_str'] = df[0].apply(to_str)
df['length'] = df[0].apply(conv)
df.rename(columns={0: "sequence"}, inplace=True)
df.head()
# Keep only chains with at least 20 amino acids
functional_proteins = df.loc[df['length'] >= 20]
print('Total functional proteins:', len(functional_proteins))
print(functional_proteins.describe())
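The split-and-filter step can also be sketched without pandas, which makes the logic easier to follow. The functional_candidates helper and the demo string below are hypothetical; only the 20-residue cutoff comes from the text above.

```python
def functional_candidates(protein_string, min_length=20):
    """Split at stop codons ('*') and keep chains of at least min_length."""
    chains = protein_string.split("*")
    return [chain for chain in chains if len(chain) >= min_length]


# Made-up sequence: four chains, two of them long enough to keep
demo = "IKGLYLPR*" + "M" * 25 + "*QTNQ*" + "A" * 20
kept = functional_candidates(demo)
print(len(demo.split("*")), "chains,", len(kept), "kept")  # 4 chains, 2 kept
```

Note that two adjacent stop codons produce an empty chain, which the length filter drops as well.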
Protein analysis with the ProtParam module in
Biopython.
poi_list = []
MW_list = []
for record in Proteins[:]:
    print("\n")
    X = ProtParam.ProteinAnalysis(str(record))
    POI = X.count_amino_acids()
    poi_list.append(POI)
    MW = X.molecular_weight()
    MW_list.append(MW)
    print("Protein of Interest = ", POI)
    # Empty chains (between adjacent stop codons) have length 0,
    # so the ratio-based properties are guarded against division by zero
    try:
        print("Amino acids percent = ", str(X.get_amino_acids_percent()))
    except ZeroDivisionError:
        pass
    print("Molecular weight = ", MW)
    try:
        print("Aromaticity = ", X.aromaticity())
    except ZeroDivisionError:
        pass
    print("Flexibility = ", X.flexibility())
    try:
        print("Secondary structure fraction = ", X.secondary_structure_fraction())
    except ZeroDivisionError:
        pass
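To give a feel for what two of these properties mean, here is a rough pure-Python sketch of amino acid composition and aromaticity, taken here as the fraction of F, W and Y residues. It is an illustration under our own assumptions, not ProtParam's code: the composition helper is hypothetical, it only reports residues that actually occur, and it returns None for the aromaticity of an empty chain instead of raising ZeroDivisionError.

```python
from collections import Counter

AROMATIC = "FWY"  # phenylalanine, tryptophan, tyrosine


def composition(protein):
    """Return (counts, fractions, aromaticity) for a protein string."""
    counts = Counter(protein)
    total = len(protein)
    if total == 0:
        # An empty chain has no meaningful ratios
        return counts, {}, None
    fractions = {aa: n / total for aa, n in counts.items()}
    aromaticity = sum(counts[aa] for aa in AROMATIC) / total
    return counts, fractions, aromaticity


counts, fractions, aromaticity = composition("IKGLYLPR")
print(counts["L"], fractions["L"], aromaticity)  # 2 0.25 0.125
```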
Because the code above produces output for all
775 proteins, we have attached only one of the
output screens.