Overview
day one
0. getting set up
1. text output and manipulation
today
2. reading and writing files
3. lists and loops
day three
4. writing functions
5. conditional statements
day four
6. regular expressions
7. dictionaries
day five
8. files, programs and user input
day six
9. biopython
This course (apart from chapter 9) is based on the book "Python for Biologists":
http://pythonforbiologists.com/
from scratch
A primer for scientists working with Next-Generation-Sequencing data
Chapter 2
Reading and writing files
Why are files important?
● NGS data – as most other biological data – is stored in
files, e.g.
– FASTA: DNA/protein sequences
– FASTQ: sequencing reads
– SAM: sequences mapped to a reference
– VCF: variant calls (like SNPs)
● most of the time we deal with text files
(files you can open and read in a text editor)
→ Handling text files is essential when working with NGS
data.
In this unit you will learn
● how to open a file for reading and writing
● reading text input from a file
● writing text output to a file
Reading from a file
● the open function takes a filename and returns a file object
my_file = open("dna.txt")
● interaction with the file mainly through methods
● file content is accessible through the read method
my_dna = my_file.read()
● don't confuse:
(a) file name: just a string
file_name = "dna.txt"
(b) file object: providing methods
dna_file = open(file_name)
(c) file contents: a string (potentially very large)
dna_sequence = dna_file.read()
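The three things above can be told apart by inspecting them. This is a minimal sketch; the temporary folder and the file content "ACTTGAC" are assumptions made only so the example runs anywhere (writing files is covered later in this chapter).

```python
import os
import tempfile

# Set-up (assumption): create a small dna.txt in a temporary folder.
tmp_dir = tempfile.mkdtemp()
file_name = os.path.join(tmp_dir, "dna.txt")   # (a) file name: just a string
setup = open(file_name, "w")
setup.write("ACTTGAC\n")
setup.close()

dna_file = open(file_name)        # (b) file object: provides methods
dna_sequence = dna_file.read()    # (c) file contents: a string
dna_file.close()

print(type(file_name))       # a string
print(type(dna_file))        # a file object
print(repr(dna_sequence))    # the text stored in the file
```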
Reading a file line by line
If you want to read just a single line, use the readline
method:
my_file = open("dna.fasta")
fasta_header = my_file.readline()
seq_line1 = my_file.readline()
When the end of the file has been reached, readline will
return an empty string ("").
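A short sketch of what readline gives back for each line and once nothing is left. The two-line file (header ">seq1", sequence "ACTG") is a made-up example written to a temporary folder so the code runs anywhere.

```python
import os
import tempfile

# Set-up (assumption): a tiny FASTA file with one header and one sequence line.
fasta_name = os.path.join(tempfile.mkdtemp(), "dna.fasta")
setup = open(fasta_name, "w")
setup.write(">seq1\nACTG\n")
setup.close()

my_file = open(fasta_name)
fasta_header = my_file.readline()   # first line, including the newline
seq_line1 = my_file.readline()      # second line, including the newline
past_end = my_file.readline()       # nothing left to read: empty string
my_file.close()

print(repr(fasta_header))
print(repr(seq_line1))
print(repr(past_end))
```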
Dealing with line breaks
● every line of a file is terminated by a newline ("\n") character
>>> my_file = open("dna.txt")
>>> my_dna = my_file.read()
>>> my_dna
"ACTTGAC\n"
● when reading from a file it's usually a good idea to remove
the newline with the strip method:
>>> my_file = open("dna.txt")
>>> my_dna = my_file.read().strip("\n")
>>> my_dna
"ACTTGAC"
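The effect of removing the newline can be checked on a plain string, without opening a file. A small sketch with a made-up sequence; rstrip and the argument-free strip are related string methods that are not mentioned on the slide.

```python
raw_line = "ACTTGAC\n"                  # what read() or readline() returns

cleaned = raw_line.strip("\n")          # removes newlines at both ends
right_only = raw_line.rstrip("\n")      # removes newlines at the right end only
no_whitespace = "  ACTTGAC \n".strip()  # no argument: removes all surrounding whitespace

print(len(raw_line))   # the newline counts as one character
print(len(cleaned))
```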
Creating and writing to a file
● the open function can also open files for writing
outfile = open("out.txt","w")
● second argument to open determines the mode the file is
opened in:
– "r": reading (default)
– "w": writing
– "a": appending
● write contents to file with the write method
outfile.write("my output")
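The difference between the "w" and "a" modes shows up when the same file is opened more than once. A minimal sketch, assuming a throwaway file in a temporary folder. Note that write does not add a newline by itself, and that opening an existing file with "w" discards its previous contents.

```python
import os
import tempfile

# Assumption: a throwaway output file in a temporary folder.
out_name = os.path.join(tempfile.mkdtemp(), "out.txt")

outfile = open(out_name, "w")       # "w": create the file (or empty an existing one)
outfile.write("first line\n")       # the newline has to be written explicitly
outfile.close()

outfile = open(out_name, "a")       # "a": keep the contents, add to the end
outfile.write("second line\n")
outfile.close()

infile = open(out_name)             # default mode "r"
after_append = infile.read()
infile.close()

outfile = open(out_name, "w")       # "w" again: the old contents are gone
outfile.write("replaced\n")
outfile.close()

infile = open(out_name)
after_overwrite = infile.read()
infile.close()

print(repr(after_append))
print(repr(after_overwrite))
```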
Closing files
● a file is closed using the close method
outfile.close()
● especially important after writing files, as closing makes sure
that all written contents are saved to the file
● files are closed automatically when a script terminates
(BUT: within the script, you may not be able to read the newly
written contents of the file if you didn't close it first)
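A sketch of the write, close, re-open pattern described above, using a throwaway file in a temporary folder (an assumption for illustration). The with statement at the end is not covered on these slides; it is a common Python alternative that closes the file automatically when the indented block ends.

```python
import os
import tempfile

out_name = os.path.join(tempfile.mkdtemp(), "out.txt")   # hypothetical file

outfile = open(out_name, "w")
outfile.write("ACTG")
outfile.close()                 # contents are now safely in the file
print(outfile.closed)           # the file object knows it has been closed

infile = open(out_name)         # re-open for reading
contents = infile.read()
infile.close()

# Alternative (not on the slides): "with" closes the file for you.
with open(out_name, "a") as outfile2:
    outfile2.write("TT")

infile = open(out_name)
contents_after_with = infile.read()
infile.close()
```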
Paths and folders
● files can be opened from any location on your file system
using absolute paths:
– Linux:
my_file = open("/home/harry/dna.txt")
– Mac:
my_file = open("/Users/harry/Desktop/dna.txt")
– Windows:
my_file = open(r"C:\Windows\Desktop\dna.txt")
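The r in front of the Windows path marks a raw string, in which backslashes are kept literally. A small sketch of why that matters; the path "C:\temp\new.txt" is made up for illustration and no file is opened.

```python
# Without r: "\t" becomes a tab and "\n" becomes a newline.
plain_path = "C:\temp\new.txt"

# With r: every backslash stays a backslash.
raw_path = r"C:\temp\new.txt"

print(repr(plain_path))
print(repr(raw_path))
print(len(plain_path), len(raw_path))   # the raw string is two characters longer
```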
Recap
● working with files is always a two-step process:
1. open file (for reading or writing)
2. read from or write to file
→ file objects have a mode
● read method returns the contents of the file as a string
● write method writes a string to the file
● close method closes the file
● More sophisticated ways of handling file contents will
follow later, so stay tuned... ;-)
Exercise 2-1: Splitting genomic DNA
The file "genomic_dna.txt" contains the same DNA
sequence as in exercise 1-4.
As in ex. 1-4, the sequence has exons (coding regions) at
base pair positions [start – 63] and [91 – end].
Write a program that will split the genomic DNA into coding
and non-coding parts and write these sequences to two
separate files.
Hint: Use your solution from ex. 1-4 and modify it to handle
the file input/output
expected result:
If your script works as intended, you will end up with two files:
coding.txt: ATCGAT...TACTAT
noncoding.txt: TCGATC...CATGCT
"..." means that there are more bases in the middle.
You can compare the start and end of your sequences with
the ones shown here
Exercise 2-2: Writing a FASTA file
● The FASTA format is designed to store DNA/RNA and
protein sequences and is structured as follows:
>sequence_name (">" followed by sequence identifier)
ACCATTAGCGAGGCT... (actual DNA/RNA/protein sequence)
● One file can hold multiple sequences:
>sequence_one
ATCGATCGATCGATCGAT
>sequence_two
ACTAGCTAGCTAGCATCG
...
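To make the two-line structure concrete, here is a sketch that writes a single record and reads it back. It reuses the sequence_one record shown above; the file name example.fasta and the temporary folder are assumptions for illustration, and the slicing header[1:] drops the leading ">".

```python
import os
import tempfile

fasta_name = os.path.join(tempfile.mkdtemp(), "example.fasta")   # hypothetical file

# Write one record: header line, then sequence line, each ending in a newline.
out = open(fasta_name, "w")
out.write(">sequence_one\n")
out.write("ATCGATCGATCGATCGAT\n")
out.close()

# Read the record back line by line and remove the newlines.
fasta = open(fasta_name)
header = fasta.readline().strip("\n")
sequence = fasta.readline().strip("\n")
fasta.close()

identifier = header[1:]     # everything after the ">"
print(identifier)
print(sequence)
```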
Write a program that will create a FASTA file for the following
three sequences – make sure that all sequences are in upper
case and only contain the bases A, T, G and C (ignore other
characters).
sequence header   DNA sequence
ABC123            ATCGTACGATCGATCGATCGCTAGACGTATCG
DEF456            actgatcgacgatcgatcgatcacgact
HIJ789            ACTGAC-ACTGT--ACTGTA----CATGTG
expected result:
If your script works as intended, you will end up with one file:
sequences.fasta
>ABC123
ATCGTACGATCGATCGATCGCTAGACGTATCG
>DEF456
ACTGATCGACGATCGATCGATCACGACT
>HIJ789
ACTGACACTGTACTGTACATGTG
Exercise 2-3: Writing multiple FASTA files
Using the data from ex. 2-2, write one FASTA file for each
sequence.
The file names should be identical to the sequence
identifiers, followed by the extension ".fasta".
expected result:
If your script works as intended, you will end up with three
files:
ABC123.fasta
>ABC123
ATCGTACGATCGATCGATCGCTAGACGTATCG
DEF456.fasta
>DEF456
ACTGATCGACGATCGATCGATCACGACT
HIJ789.fasta
>HIJ789
ACTGACACTGTACTGTACATGTG