BioPython
Edited by
Python Programming
in bioinformatics
Why Python?
Python can be installed and used on different platforms,
including Windows, Mac, and Linux.
Python has several built-in features that make it well-
suited for bioinformatics applications.
Python's dynamic and modular nature allows researchers
to reuse and share code, reducing development time and
increasing productivity.
Python has a relatively simple syntax, making it easy to
learn and use.
Python is a high-level language that offers advanced data
structures and functions that make it easy to work with
complex biological data.
Tools for Python Programming in
Bioinformatics
1. Biopython
Biopython is one of the most widely used bioinformatics packages for Python. It is an
open-source collection of Python modules that provides powerful, easy-to-use tools
for performing biological computations.
Biopython lets you do a lot with very little code. Some of its tasks are:
Biopython provides tools for working with DNA, RNA, and protein sequences,
including sequence alignment, motif and pattern matching, and translation between
nucleotide and protein sequences.
Biopython includes tools for working with protein structures, such as parsing and
manipulating PDB files and performing structure comparisons.
Biopython supports file formats commonly used in bioinformatics, such as FASTA,
GenBank, and BLAST output.
Biopython includes tools for visualizing biological data, such as sequence alignment
plots and phylogenetic trees.
BioSQL − Standard set of SQL tables for storing sequences plus features and
annotations.
2. PyMOL
PyMOL is a free and open-source molecular
visualization software used in bioinformatics. It creates
high-quality images and animations of molecular
structures, which can be useful in a variety of applications
including drug discovery, protein engineering, and
molecular biology research.
PyMOL is scriptable through Python and can be integrated with
other Python-based tools and libraries.
3. Scikit-learn
Scikit-learn is a Python library for machine learning. It offers
a wide range of algorithms that can be used to analyze complex
biological datasets and make predictions about biological
systems.
Some uses of Scikit-learn in bioinformatics are:
It can be used to classify biological samples based on gene
expression data or proteomics data.
It can be used to cluster biological samples or reduce the
dimensionality of large datasets.
It can be used to develop machine learning models to predict
the structure of proteins and protein-protein interactions
based on their amino acid sequences.
4. NumPy (Numerical Python)
NumPy is a Python library for working with numerical
data. It is used extensively by Pandas, SciPy, Matplotlib,
Scikit-learn, and many other scientific Python packages.
NumPy provides a multidimensional array object called
'ndarray' and can be used to perform a wide range of
mathematical operations on arrays.
To install and import Biopython:
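A minimal sketch of the install-and-import step, assuming pip is available. The package is installed under the name "biopython" but imported as "Bio". The check with importlib is only there so the snippet also runs on a machine where Biopython is missing; the variable name biopython_version is illustrative.

```python
# Install from the command line first (run in a terminal, not in Python):
#     pip install biopython
import importlib.util

# The package is installed as "biopython" but imported under the name "Bio".
if importlib.util.find_spec("Bio") is not None:
    import Bio
    biopython_version = Bio.__version__
    print("Biopython version:", biopython_version)
else:
    biopython_version = None
    print("Biopython is not installed; run: pip install biopython")
```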
What are the input data types for Biopython?
Text file:
1. Sequence file (sequence.txt)
2. Cell Microarray
CSV file:
FASTA File:
Other file formats, such as:
Blast output
GenBank
PubMed and Medline
SCOP, including 'dom' and 'lin' files
UniGene
SwissProt
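To make the FASTA format concrete, here is a rough sketch of reading it in plain Python. This is not Biopython's parser: in Biopython the same job is done by Bio.SeqIO.parse(handle, "fasta"). The function read_fasta and the two records are made up for illustration.

```python
import io

# Hypothetical FASTA content, made up for illustration only.
fasta_text = """>seq1 example record
gatcccccgatatt
atttgc
>seq2 another example record
atgatgatg
"""

def read_fasta(handle):
    """Return a dict mapping each record ID to its sequence."""
    records = {}
    current_id = None
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Header line: the ID is the first word after '>'
            current_id = line[1:].split()[0]
            records[current_id] = ""
        elif current_id is not None:
            # Sequence line: append it to the current record
            records[current_id] += line
    return records

records = read_fasta(io.StringIO(fasta_text))
for name, seq in records.items():
    print(name, len(seq), seq)
```

A sequence may span several lines in a FASTA file, which is why the sketch joins all lines that follow a header.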
Overview of
Some key notes in Python
Data Types: Number Types
int, float, complex
1. Integer numbers:
>>> type(4)
<class 'int'>
2. Real numbers:
>>> type(4.5)
<class 'float'>
3. Complex numbers:
>>> type(3+2j)
<class 'complex'>
>>> (2+1j)**2
(3+4j)
>>> 17/5 - In Python 3, / is true division
3.4
>>> 17//5 - // is floor division (in Python 2, 17/5 also gave 3)
3
Data Types: Strings
Single quote:
>>> 'atg'
'atg'
Double quote:
>>> "atg"
'atg'
>>> 'This is a codon, isn't it?'
SyntaxError
>>> "This is a codon, isn't it?" # Or >>> 'This is a codon, isn\'t it?'
"This is a codon, isn't it?"
String Operators
Operator    Meaning
+           Concatenate
*           Copy or replicate
in          Checks if the first string IS in the second string
not in      Checks if the first string IS NOT in the second string
Escape character: the backslash '\' gives special meaning to the
character that follows it.
Construct   Meaning
\n          Newline
\t          Tab
\\          Backslash
\"          Double quote
To produce more readable output, use print().
>>> 'atg' + 'gcc'
'atggcc'
>>> 'atg' * 3
'atgatgatg'
>>> 'tg' in 'atgatgatg'
True
>>> 'tc' in 'atgatgatg'
False
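The operators and escape sequences above can be tried together in a short script. This is only an illustrative sketch; the strings and the variable names codon and joined are arbitrary.

```python
codon = 'atg'

print('codon:\t' + codon)        # \t inserts a tab between label and value
print('line one\nline two')      # \n starts a new line
print('a backslash: \\')         # \\ prints a single backslash
print("a \"quoted\" codon")      # \" prints a double quote inside double quotes

# 'not in' is the negation of 'in'
print('tc' not in 'atgatgatg')

# * is applied before +, so only 'gcc' is repeated
joined = 'atg' + 'gcc' * 2
print(joined)
```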
Variables
Variables are containers that store numbers, strings,
and other data types and structures.
Variables are names given to values that can be changed.
Variables are assigned values using the equal sign (=).
>>> codon = 'tag'
>>> dna_sequence = "gtcgcctaaccgtatatttttcccgt"
A variable cannot be used before it is assigned a
value; otherwise an error occurs.
>>> dna
NameError: name 'dna' is not defined
Variables
Naming
Select meaningful names: dnaSequence is better than s.
Follow the naming rules:
Names are case-sensitive, so these are three different variables:
DnaSequence = 1
DNASEQUENCE = 2
Dnasequence = 3
Names consist of letters, digits, and underscores.
Valid: Dna1, dna_1, dnaSeq
A name must not start with a digit.
Invalid: 1dna
No special characters.
Invalid: dna#, dna@1
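The naming rules can be checked programmatically. The sketch below uses the built-in str.isidentifier() method together with the standard keyword module; the helper is_valid_name and the candidate list are illustrative, and the reserved word 'class' is an extra case added here to show that keywords are rejected too.

```python
import keyword

candidates = ["dnaSequence", "dna_1", "Dna1", "1dna", "dna#", "dna@1", "class"]

def is_valid_name(name):
    # str.isidentifier() checks the letter/digit/underscore rules;
    # keyword.iskeyword() rejects reserved words such as 'class'.
    return name.isidentifier() and not keyword.iskeyword(name)

for name in candidates:
    print(name, "->", "valid" if is_valid_name(name) else "invalid")
```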
String Operators
[i] : returns the character at index i in a string. (index)
[i:j] : returns the substring from index i up to, but not including, index j. (slice)
>>> dna = "gatcccccgatattatttgc"
>>> dna[0]
'g' - The first position in a string is position 0
>>> dna[-1]
'c' - Counting from the right uses negative indices, beginning with -1
>>> dna[-2]
'g'
>>> dna[0:3]
'gat' - In slices: start index included, end index excluded
>>> dna[:3]
'gat' - Omitting the start index means use the default, 0
>>> dna[2:]
'tcccccgatattatttgc' - Omitting the end index means use the default, the end of the string
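As a small application of slicing, the sketch below cuts the same sequence into codons (non-overlapping triplets). Starting at index 0 is an assumption made for illustration; a real reading frame may start elsewhere. The names codons and leftover are illustrative.

```python
dna = "gatcccccgatattatttgc"

# Number of bases that fit into complete triplets
usable = len(dna) - len(dna) % 3

# Slice [i:i+3] for i = 0, 3, 6, ... gives one codon per step
codons = [dna[i:i + 3] for i in range(0, usable, 3)]

# Whatever does not fill a complete codon is left over at the end
leftover = dna[usable:]

print(codons)
print(leftover)
```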
Strings as Objects
• String variables are objects that can perform specific
actions using built-in functions and methods:
>>> dna = "gatcccccgatattatttgc"
>>> len(dna)
20
>>> dna.count('t') - Count characters
7
>>> dna.count('ga') - Count substrings
2
String Functions
>>> dna = "gatcccccgatattatttgc"
>>> dna.upper() - Converts all to upper case; lower() converts to lower case
'GATCCCCCGATATTATTTGC'
>>> dna.find('ga') - Returns the index of the first occurrence of 'ga', -1 if not found
0
>>> dna.find('at', 5) - Returns the index of the first occurrence of 'at' starting from index 5
9
>>> dna.rfind('ga') - Returns the index of the last occurrence of 'ga', -1 if not found
8
>>> dna.islower() - True if all letters are lower case
True
>>> dna.isupper() - True if all letters are upper case
False
>>> dna.replace('a','A') - Replaces all 'a' with 'A'
'gAtcccccgAtAttAtttgc'
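The string methods above combine naturally into small sequence utilities. The following is a sketch only: gc_content and find_all are hypothetical helper names, not part of Python or Biopython. The first uses count() to compute the fraction of G and C bases, and the second calls find() repeatedly to collect every position of a pattern.

```python
def gc_content(seq):
    """Fraction of bases that are g or c (0.0 for an empty sequence)."""
    seq = seq.lower()
    if len(seq) == 0:
        return 0.0
    return (seq.count('g') + seq.count('c')) / len(seq)

def find_all(seq, pattern):
    """Start indices of every occurrence of pattern, found with find()."""
    positions = []
    start = seq.find(pattern)
    while start != -1:
        positions.append(start)
        # Continue the search one position after the last hit
        start = seq.find(pattern, start + 1)
    return positions

dna = "gatcccccgatattatttgc"
print(gc_content(dna))
print(find_all(dna, 'ga'))
```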
Inputs
>>> dna = input("Enter a DNA sequence, please: ")
Enter a DNA sequence, please: agtagcatgaggagggacttc
>>> dna
'agtagcatgaggagggacttc'
Examples:
Create a random DNA sequence of length 10
import random
alphabet = "AGCT"
sequence = ""
for i in range(10):
    # randint(0, 3) returns 0, 1, 2, or 3 (both ends included)
    index = random.randint(0, 3)
    sequence = sequence + alphabet[index]
print(sequence)
Read from a text file
read(), readline(), and readlines()
• read(x): reads up to x characters from the file (bytes in binary mode). If you
don't supply a size, it reads the entire file and returns it as one string.
• readline(): reads one line, that is, all the data until it reaches a newline (\n)
or the end of the file.
• readlines(): reads all remaining lines and returns them as a list of strings,
one per line.
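A short sketch contrasting the reading calls on the same file. The file name sequence.txt and its three-line content are made up, and the file is created in a temporary directory so the example is self-contained.

```python
import os
import tempfile

# Create a small throwaway file so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), "sequence.txt")
with open(path, "w") as f:
    f.write("atg\ngcc\ntaa\n")

with open(path, "r") as f:
    whole = f.read()            # entire file as one string

with open(path, "r") as f:
    first_four = f.read(4)      # at most 4 characters

with open(path, "r") as f:
    first_line = f.readline()   # one line, including its newline

with open(path, "r") as f:
    lines = f.readlines()       # list with one string per line

print(repr(whole))
print(repr(first_four))
print(repr(first_line))
print(lines)
```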
Write a text file
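A minimal sketch of writing a text file with write() and writelines(). The file name output.txt and the sequences are illustrative. Neither method adds a newline, so each "\n" is written explicitly.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "output.txt")
sequences = ["atg", "gcc", "taa"]

# Mode "w" creates the file, or empties it if it already exists.
with open(path, "w") as f:
    for seq in sequences:
        f.write(seq + "\n")            # write() takes a single string
    f.writelines(["ggg\n", "ccc\n"])   # writelines() takes a list of strings

# Read the file back to check what was written.
with open(path, "r") as f:
    written = f.read()

print(written)
```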
Notes about file modes
What does + mean in open()?
• The + adds the missing direction (reading or writing) to an open mode; this is called update mode.
• r means reading a file; r+ means reading and writing. The file must already exist and is not emptied.
• w means writing a file; w+ means reading and writing. In both cases the file is created if missing and emptied if it exists.
• a means writing in append mode; a+ means reading and writing in append mode. Writes always go to the end of the file.
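Examples:
Difference between r and r+ in open()
Starting content of file.txt:
welcome to python 1
welcome to python 2
welcome to python 3
welcome to python 4
Mode r:
with open('file.txt', 'r') as f:
    print(f.read())
Output on terminal:
welcome to python 1
welcome to python 2
welcome to python 3
welcome to python 4
Mode r+:
with open('file.txt', 'r+') as f:
    f.write("new line \n")
Output in file.txt: the file now starts with "new line ". Note that r+ writes
from position 0 and overwrites the characters already there; it does not insert
a new line, so the beginning of the old first line is lost while lines 2 to 4
are unchanged.
Writing in mode r fails:
with open('file.txt', 'r') as f:
    f.write("test \n")
io.UnsupportedOperation: not writable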
Examples:
Difference between w and w+ in open()
Mode w:
with open('file.txt', 'w') as f:
    f.write("test 1\n")
    f.write("test 2\n")
    f.write("test 3\n")
Output in file.txt:
test 1
test 2
test 3
Mode w+:
with open('file.txt', 'w+') as f:
    f.write("test 1\n")
    f.write("test 2\n")
    f.write("test 3\n")
    f.seek(0)
    lines = f.read()
    print(lines)
Output on terminal:
test 1
test 2
test 3
Note: f.seek(0) moves the file pointer to the beginning of the file.
Examples:
Difference between a and a+ in open()
Mode a:
with open('file.txt', 'a') as f:
    f.write("3")
Output in file.txt (assuming the file ended with a newline):
welcome to python 1
welcome to python 2
welcome to python 3
welcome to python 4
3
Mode a+:
with open('file.txt', 'a+') as f:
    f.seek(0)
    lines = f.readlines()
    f.write("\n" + str(len(lines)))
Output in file.txt (assuming the file did not end with a newline):
welcome to python 1
welcome to python 2
welcome to python 3
welcome to python 4
4
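The behaviour of the three update modes can be checked in one self-contained script. This is a sketch using a temporary file with two made-up lines; the variable names after_r_plus, after_a_plus, and after_w_plus are illustrative.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "file.txt")
with open(path, "w") as f:
    f.write("welcome to python 1\nwelcome to python 2\n")

# r+ : the write starts at position 0 and overwrites existing characters.
with open(path, "r+") as f:
    f.write("NEW")
with open(path) as f:
    after_r_plus = f.read()

# a+ : the file can be read after seek(0), but writes always go to the end.
with open(path, "a+") as f:
    f.seek(0)
    n_lines = len(f.readlines())
    f.write(str(n_lines) + "\n")
with open(path) as f:
    after_a_plus = f.read()

# w+ : the file is emptied on open; seek(0) lets us read back what we wrote.
with open(path, "w+") as f:
    f.write("test 1\n")
    f.seek(0)
    after_w_plus = f.read()

print(repr(after_r_plus))
print(repr(after_a_plus))
print(repr(after_w_plus))
```

The r+ result shows why this mode is suited to editing in place rather than to adding lines: nothing is shifted, the new characters simply replace the old ones.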
Assignment
Apply all the functions discussed above to a text file
that you create yourself.