02 Handling Files

The document outlines a course on Python for Biologists, focusing on reading and writing files, particularly in the context of Next-Generation Sequencing (NGS) data. It covers essential file operations such as opening, reading, writing, and closing files, along with practical exercises for manipulating genomic DNA and creating FASTA files. The course is structured over six days, introducing various programming concepts relevant to biological data analysis.

Uploaded by

faranpourali1383
Overview

day one
0. getting set up
1. text output and manipulation

today
2. reading and writing files
3. lists and loops

day three
4. writing functions
5. conditional statements

day four
6. regular expressions
7. dictionaries

day five
8. files, programs and user input

day six
9. biopython

This course (apart from chapter 9) is based on the book "Python for Biologists":
http://pythonforbiologists.com/
from scratch
A primer for scientists working with Next-Generation-Sequencing data

Chapter 2

Reading and writing files


Why are files important?

● NGS data – like most other biological data – is stored in files, e.g.
– FASTA: DNA/protein sequences
– FASTQ: sequencing reads
– SAM: sequences mapped to a reference
– VCF: variant calls (like SNPs)
● most of the time we deal with text files
(files you can open and read in a text editor)
→ Handling text files is essential when working with NGS data.
Chapter 2: reading and writing files

In this unit you will learn

● how to open a file for reading and writing
● how to read text input from a file
● how to write text output to a file
Reading from a file

● the open function takes a filename and returns a file object:
    my_file = open("dna.txt")
● interaction with the file happens mainly through methods
● file content is accessible through the read method:
    my_dna = my_file.read()
● don't confuse:
(a) file name: just a string
    file_name = "dna.txt"
(b) file object: provides methods
    dna_file = open(file_name)
(c) file contents: a string (potentially very large)
    dna_sequence = dna_file.read()
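
The three things listed above can be told apart in a short script. This is only a sketch: the temporary folder and the sequence written into dna.txt are made up so that the example runs on its own.

```python
import os
import tempfile

# Create a small example file first so the sketch is self-contained.
# The folder and the sequence are placeholders, not course data.
work_dir = tempfile.mkdtemp()
file_name = os.path.join(work_dir, "dna.txt")   # (a) file name: just a string

setup_file = open(file_name, "w")
setup_file.write("ACTTGAC\n")
setup_file.close()

dna_file = open(file_name)        # (b) file object: provides methods
dna_sequence = dna_file.read()    # (c) file contents: a single string
dna_file.close()

print(type(file_name).__name__)      # the name is a string
print(type(dna_sequence).__name__)   # the contents are a string as well
print(len(dna_sequence))             # counts the bases plus the newline
```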
Reading a file line by line

If you want to read just a single line, use the readline method:

my_file = open("dna.fasta")
fasta_header = my_file.readline()
seq_line1 = my_file.readline()

When the end of the file has been reached, readline will
return an empty string ("").
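
A minimal sketch of this behaviour, assuming a two-line FASTA file whose name and contents are invented for the example:

```python
import os
import tempfile

# Write a tiny two-line FASTA file (placeholder content).
work_dir = tempfile.mkdtemp()
fasta_path = os.path.join(work_dir, "dna.fasta")
setup_file = open(fasta_path, "w")
setup_file.write(">example_seq\nACTTGAC\n")
setup_file.close()

my_file = open(fasta_path)
fasta_header = my_file.readline()   # first line, including its newline
seq_line1 = my_file.readline()      # second line, including its newline
past_end = my_file.readline()       # nothing left: an empty string
my_file.close()

print(repr(fasta_header))
print(repr(seq_line1))
print(repr(past_end))
```

Because an empty string counts as false in Python, a test such as `if past_end == "":` can be used to detect the end of the file.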
Dealing with line breaks

● every line of a file is terminated by a newline ("\n") character:

>>> my_file = open("dna.txt")
>>> my_dna = my_file.read()
>>> my_dna
"ACTTGAC\n"

● when reading from a file it's usually a good idea to remove
the newline with the strip method:

>>> my_file = open("dna.txt")
>>> my_dna = my_file.read().strip("\n")
>>> my_dna
"ACTTGAC"
Creating and writing to a file


● the open function can also open files for writing:
    outfile = open("out.txt","w")
● the second argument to open determines the mode the file is
opened in:
– "r": reading (default)
– "w": writing (creates the file, or overwrites an existing one)
– "a": appending (adds to the end of an existing file)
● write contents to the file with the write method:
    outfile.write("my output")
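
The difference between the "w" and "a" modes is easiest to see by writing to the same file several times. In this sketch the file lives in a temporary folder, and read_all is a hypothetical helper defined only for the example.

```python
import os
import tempfile

out_path = os.path.join(tempfile.mkdtemp(), "out.txt")

def read_all(path):
    """Hypothetical helper: return the whole content of a text file."""
    infile = open(path)          # mode defaults to "r"
    text = infile.read()
    infile.close()
    return text

# "w" creates the file (or empties an existing one) before writing
outfile = open(out_path, "w")
outfile.write("first line\n")
outfile.close()

# "a" keeps the existing content and adds to the end
outfile = open(out_path, "a")
outfile.write("second line\n")
outfile.close()
after_append = read_all(out_path)

# opening with "w" again discards what was there
outfile = open(out_path, "w")
outfile.write("replaced\n")
outfile.close()
after_overwrite = read_all(out_path)

print(repr(after_append))
print(repr(after_overwrite))
```

Note that write does not add a newline by itself; the "\n" has to be part of the string.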
Closing files

● a file is closed using the close method:
    outfile.close()
● especially important after writing files, as closing makes sure
all written content is actually saved to the file
● files are closed automatically when a script terminates
(BUT: until the file is closed, newly written content may not yet
be readable from the file within the same script, so close it first)
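
A short sketch of the write, close, then read order described above. The output path is a temporary file chosen for the example.

```python
import os
import tempfile

out_path = os.path.join(tempfile.mkdtemp(), "out.txt")

outfile = open(out_path, "w")
outfile.write("my output")
outfile.close()              # closing saves the content to the file
print(outfile.closed)        # the file object records that it is closed

# Only now reopen the file to read back what was written
infile = open(out_path)
contents = infile.read()
infile.close()
print(contents)
```

Python also provides the `with open(...) as outfile:` form, which closes the file automatically at the end of the indented block. It is not covered in these slides and is mentioned here only as an alternative.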
Paths and folders

Files can be opened from any location on your file system
using absolute paths:

– Linux:
my_file = open("/home/harry/dna.txt")
– Mac:
my_file = open("/Users/harry/Desktop/dna.txt")
– Windows:
my_file = open(r"C:\Windows\Desktop\dna.txt")
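
The paths above are specific to one machine. As a hedged sketch that goes slightly beyond the slides, the standard os.path module can show how a plain file name relates to an absolute path on whatever system the script runs on. No file is opened here, so dna.txt does not need to exist.

```python
import os

file_name = "dna.txt"

# A bare file name is a relative path: it refers to the current folder
is_relative = not os.path.isabs(file_name)

# abspath prepends the current working directory
absolute_path = os.path.abspath(file_name)

# Building a path in the user's home folder; join picks the right
# separator ("/" or "\") for the operating system
home_path = os.path.join(os.path.expanduser("~"), file_name)

print(is_relative)
print(absolute_path)
print(home_path)
```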
Recap

● working with files is always a two-step process:
1. open file (for reading or writing)
2. read from or write to file
→ file objects have a mode
● the read method returns the contents of the file as a string
● the write method writes a string to the file
● the close method closes the file
● More sophisticated ways of handling file contents will
follow later, so stay tuned... ;-)
Exercise 2-1: Splitting genomic DNA

The file "genomic_dna.txt" contains the same DNA sequence as in exercise 1-4.
As in ex. 1-4, the sequence has exons (coding regions) at
base pair positions [start – 63] and [91 – end].
Write a program that will split the genomic DNA into coding
and non-coding parts and write these sequences to two
separate files.

Hint: Use your solution from ex. 1-4 and modify it to handle
the file input/output.
Exercise 2-1: Splitting genomic DNA

expected result:
If your script works as intended, you will end up with two files:

coding.txt:
ATCGAT...TACTAT

noncoding.txt:
TCGATC...CATGCT

"..." means that there are more bases in the middle.
You can compare the start and end of your sequences with
the ones shown here.
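
One possible approach to this exercise is sketched below. It is not the course's reference solution. It assumes the positions are 1-based and inclusive, and it creates its own genomic_dna.txt with a made-up placeholder sequence of 100 bases, so its output will not match the expected result shown above.

```python
import os
import tempfile

# Placeholder input: NOT the course sequence, just 100 made-up bases
work_dir = tempfile.mkdtemp()
genomic_path = os.path.join(work_dir, "genomic_dna.txt")
setup_file = open(genomic_path, "w")
setup_file.write("A" * 63 + "T" * 27 + "G" * 10 + "\n")
setup_file.close()

# Read the sequence and remove the trailing newline
dna_file = open(genomic_path)
genomic_dna = dna_file.read().strip("\n")
dna_file.close()

# Assumption: positions are 1-based, so bases 1-63 are characters 0-62
exon1 = genomic_dna[0:63]      # [start - 63]
intron = genomic_dna[63:90]    # bases 64-90, between the two exons
exon2 = genomic_dna[90:]       # [91 - end]

coding_dna = exon1 + exon2
noncoding_dna = intron

coding_path = os.path.join(work_dir, "coding.txt")
noncoding_path = os.path.join(work_dir, "noncoding.txt")

coding_file = open(coding_path, "w")
coding_file.write(coding_dna)
coding_file.close()

noncoding_file = open(noncoding_path, "w")
noncoding_file.write(noncoding_dna)
noncoding_file.close()
```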
Exercise 2-2: Writing a FASTA file

● The FASTA format is designed to store DNA/RNA and
protein sequences and is structured as follows:

>sequence_name         (">" followed by sequence identifier)
ACCATTAGCGAGGCT...     (actual DNA/RNA/protein sequence)

● One file can hold multiple sequences:

>sequence_one
ATCGATCGATCGATCGAT
>sequence_two
ACTAGCTAGCTAGCATCG
...
Exercise 2-2: Writing a FASTA file

Write a program that will create a FASTA file for the following
three sequences – make sure that all sequences are in upper
case and only contain the bases A, T, G and C (ignore other
characters).

sequence header    DNA sequence

ABC123             ATCGTACGATCGATCGATCGCTAGACGTATCG
DEF456             actgatcgacgatcgatcgatcacgact
HIJ789             ACTGAC-ACTGT--ACTGTA----CATGTG
Exercise 2-2: Writing a FASTA file

expected result:
If your script works as intended, you will end up with one file:
sequences.fasta

>ABC123
ATCGTACGATCGATCGATCGCTAGACGTATCG
>DEF456
ACTGATCGACGATCGATCGATCACGACT
>HIJ789
ACTGACACTGTACTGTACATGTG
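
A sketch of one way to produce this file, not the official solution. It assumes that "-" is the only unwanted character, which holds for the three sequences given, and it writes into a temporary folder. The for loop belongs to the next chapter; three repeated blocks of write calls would work just as well.

```python
import os
import tempfile

# Headers and raw sequences from the exercise table
records = [
    ("ABC123", "ATCGTACGATCGATCGATCGCTAGACGTATCG"),
    ("DEF456", "actgatcgacgatcgatcgatcacgact"),
    ("HIJ789", "ACTGAC-ACTGT--ACTGTA----CATGTG"),
]

fasta_path = os.path.join(tempfile.mkdtemp(), "sequences.fasta")
outfile = open(fasta_path, "w")

for header, raw_sequence in records:
    # Upper-case the sequence and drop the gap characters
    clean_sequence = raw_sequence.upper().replace("-", "")
    outfile.write(">" + header + "\n")
    outfile.write(clean_sequence + "\n")

outfile.close()

# Read the file back to check the result
infile = open(fasta_path)
fasta_text = infile.read()
infile.close()
print(fasta_text)
```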
Exercise 2-3: Writing multiple FASTA files

Using the data from ex. 2-2, write one FASTA file for each
sequence.
The file names should be identical to the sequence
identifiers, followed by the extension ".fasta".
Exercise 2-3: Writing multiple FASTA files

expected result:
If your script works as intended, you will end up with three
files:

ABC123.fasta:
>ABC123
ATCGTACGATCGATCGATCGCTAGACGTATCG

DEF456.fasta:
>DEF456
ACTGATCGACGATCGATCGATCACGACT

HIJ789.fasta:
>HIJ789
ACTGACACTGTACTGTACATGTG
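
A possible sketch for this variant, under the same assumptions as before: "-" is the only character to remove, and the files go into a temporary folder rather than the current one. The file name is built by joining the header and the extension with "+".

```python
import os
import tempfile

records = [
    ("ABC123", "ATCGTACGATCGATCGATCGCTAGACGTATCG"),
    ("DEF456", "actgatcgacgatcgatcgatcacgact"),
    ("HIJ789", "ACTGAC-ACTGT--ACTGTA----CATGTG"),
]

out_dir = tempfile.mkdtemp()

for header, raw_sequence in records:
    clean_sequence = raw_sequence.upper().replace("-", "")
    # File name = sequence identifier + ".fasta"
    file_path = os.path.join(out_dir, header + ".fasta")
    outfile = open(file_path, "w")
    outfile.write(">" + header + "\n")
    outfile.write(clean_sequence + "\n")
    outfile.close()          # close each file before opening the next

created_files = sorted(os.listdir(out_dir))
print(created_files)
```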
