02 Handling Files

The document outlines a course on Python for Biologists, focusing on reading and writing files, particularly in the context of Next-Generation Sequencing (NGS) data. It covers essential file operations such as opening, reading, writing, and closing files, along with practical exercises for manipulating genomic DNA and creating FASTA files. The course is structured over six days, introducing various programming concepts relevant to biological data analysis.

Uploaded by

faranpourali1383

Overview

day one
0. getting set up
1. text output and manipulation

today
2. reading and writing files
3. lists and loops

day three
4. writing functions
5. conditional statements

day four
6. regular expressions
7. dictionaries

day five
8. files, programs and user input

day six
9. biopython

This course (apart from chapter 9) is based on the book "Python for Biologists":
http://pythonforbiologists.com/
from scratch
A primer for scientists working with Next-Generation-Sequencing data

Chapter 2

Reading and writing files

Why are files important?

● NGS data – like most other biological data – is stored in files, e.g.
  – FASTA: DNA/protein sequences
  – FASTQ: sequencing reads
  – SAM: sequences mapped to a reference
  – VCF: variant calls (like SNPs)
● most of the time we deal with text files (files you can open and read in a text editor)
→ Handling text files is essential when working with NGS data.
Chapter 2: reading and writing files

In this unit you will learn

● how to open a file for reading and writing
● how to read text input from a file
● how to write text output to a file
Reading from a file

● the open function takes a filename and returns a file object
    my_file = open("dna.txt")
● interaction with the file happens mainly through methods
● file content is accessible through the read method
    my_dna = my_file.read()
● don't confuse:
  (a) file name: just a string
    file_name = "dna.txt"
  (b) file object: providing methods
    dna_file = open(file_name)
  (c) file contents: a string (potentially very large)
    dna_sequence = dna_file.read()
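
The difference between file name, file object and file contents can be checked in a short script. This is only a minimal sketch: to be runnable anywhere it first creates a small "dna.txt" in a temporary folder, and both the folder and the example sequence are assumptions made for illustration (writing and closing files are covered later in this chapter).

```python
import os
import tempfile

# Assumption for illustration: create a small example file in a temporary folder
folder = tempfile.mkdtemp()
file_name = os.path.join(folder, "dna.txt")   # (a) file name: just a string
setup_file = open(file_name, "w")
setup_file.write("ACTTGAC\n")
setup_file.close()

dna_file = open(file_name)                    # (b) file object: provides methods
dna_sequence = dna_file.read()                # (c) file contents: a string
dna_file.close()

print(type(file_name).__name__)      # the file name is a str
print(type(dna_sequence).__name__)   # the contents are a str as well
print(len(dna_sequence))             # number of characters, newline included
```

Note that the file name and the file contents are both strings, so mixing them up does not raise an error by itself; using clear variable names helps here.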
Reading a file line by line

If you want to read just a single line, use the readline method:

my_file = open("dna.fasta")
fasta_header = my_file.readline()
seq_line1 = my_file.readline()

When the end of the file has been reached, readline returns an empty string (""), not None.
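
What readline returns at the end of a file can be tried out with a tiny example. The sketch below assumes a hypothetical FASTA file with one header and two sequence lines, created in a temporary folder just for this demonstration.

```python
import os
import tempfile

# Assumption for illustration: a tiny FASTA file with a header and two sequence lines
folder = tempfile.mkdtemp()
file_name = os.path.join(folder, "dna.fasta")
setup_file = open(file_name, "w")
setup_file.write(">example_sequence\nACTG\nGGCA\n")
setup_file.close()

my_file = open(file_name)
fasta_header = my_file.readline()   # first line, including its newline
seq_line1 = my_file.readline()      # second line
seq_line2 = my_file.readline()      # third (and last) line
end_of_file = my_file.readline()    # nothing left to read: empty string
my_file.close()

print(repr(fasta_header))
print(repr(end_of_file))
```

Each call to readline continues where the previous one stopped, so the same call returns a different line every time.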
Dealing with line breaks

● every line of a file is usually terminated by a newline ("\n") character:

>>> my_file = open("dna.txt")
>>> my_dna = my_file.read()
>>> my_dna
'ACTTGAC\n'

● when reading from a file it's usually a good idea to remove the trailing newline with the strip method:

>>> my_file = open("dna.txt")
>>> my_dna = my_file.read().strip("\n")
>>> my_dna
'ACTTGAC'
Creating and writing to a file

● the open function can also open files for writing
    outfile = open("out.txt", "w")
● the second argument to open determines the mode the file is opened in:
  – "r": reading (default)
  – "w": writing (creates the file, or overwrites an existing one)
  – "a": appending
● write contents to the file with the write method
    outfile.write("my output")
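
The difference between the "w" and "a" modes is easiest to see by using both on the same file. This is a minimal sketch under the assumption that "out.txt" lives in a temporary folder; the two text lines are made up for illustration.

```python
import os
import tempfile

# Assumption for illustration: write into a temporary folder
folder = tempfile.mkdtemp()
file_name = os.path.join(folder, "out.txt")

outfile = open(file_name, "w")      # "w": start with an empty file
outfile.write("my output\n")        # write does not add a newline by itself
outfile.close()

outfile = open(file_name, "a")      # "a": keep what is there, add to the end
outfile.write("one more line\n")
outfile.close()

infile = open(file_name)            # no mode given: "r" is the default
contents = infile.read()
infile.close()

print(contents)
```

Opening the file with "w" a second time instead of "a" would have discarded the first line.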
Closing files

● a file is closed using the close method
    outfile.close()
● especially important after writing files: output is buffered, and closing makes sure everything is actually written to the file
● files are closed automatically when a script terminates
  (BUT: within the script, you may not be able to read the newly written contents of the file unless you close it first)
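
Whether a file object is still open can be checked through its closed attribute. The sketch below (file location and sequence are assumptions for illustration) writes a file, closes it, and only then opens it again for reading.

```python
import os
import tempfile

# Assumption for illustration: write into a temporary folder
folder = tempfile.mkdtemp()
file_name = os.path.join(folder, "out.txt")

outfile = open(file_name, "w")
outfile.write("ATGC")
open_before_close = outfile.closed   # False: the file is still open
outfile.close()                      # makes sure the text reaches the file
open_after_close = outfile.closed    # True: the file is closed now

infile = open(file_name)             # safe to read: writing is finished
saved = infile.read()
infile.close()

print(open_before_close, open_after_close)
print(saved)
```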
Paths and folders

Files can be opened from any location on your file system using absolute paths:

– Linux:
my_file = open("/home/harry/dna.txt")
– Mac:
my_file = open("/Users/harry/Desktop/dna.txt")
– Windows:
my_file = open(r"C:\Windows\Desktop\dna.txt")
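
Because the path format differs between operating systems, it can help to let Python build the path. The following is a sketch only, using the standard os.path module, which is not part of the slides; the temporary folder stands in for a real location such as a home directory.

```python
import os
import tempfile

# Assumption for illustration: a temporary folder stands in for a real location
folder = os.path.abspath(tempfile.mkdtemp())
file_path = os.path.join(folder, "dna.txt")   # uses the right separator for this OS

outfile = open(file_path, "w")
outfile.write("ACTTGAC\n")
outfile.close()

is_absolute = os.path.isabs(file_path)        # True for an absolute path

my_file = open(file_path)                     # open the file via its absolute path
my_dna = my_file.read()
my_file.close()

print(file_path)
print(is_absolute)
```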
Recap

● working with files is always a two-step process:
  1. open file (for reading or writing)
  2. read from or write to file
  → file objects have a mode
● the read method returns the contents of the file as a string
● the write method writes a string to the file
● the close method closes the file
● More sophisticated ways of handling file contents will follow later, so stay tuned... ;-)
Exercise 2-1: Splitting genomic DNA

The file "genomic_dna.txt" contains the same DNA sequence as in exercise 1-4.
As in ex. 1-4, the sequence has exons (coding regions) at base pair positions [start – 63] and [91 – end].
Write a program that splits the genomic DNA into coding and non-coding parts and writes these sequences to two separate files.

Hint: Use your solution from ex. 1-4 and modify it to handle the file input/output.
Exercise 2-1: Splitting genomic DNA

expected result:
If your script works as intended, you will end up with two files:

coding.txt
ATCGAT...TACTAT

noncoding.txt
TCGATC...CATGCT

"..." means that there are more bases in the middle.
You can compare the start and end of your sequences with the ones shown here.
Exercise 2-2: Writing a FASTA file

● The FASTA format is designed to store DNA/RNA and protein sequences and is structured as follows:

>sequence_name          ">" followed by sequence identifier
ACCATTAGCGAGGCT...      actual DNA/RNA/protein sequence

● One file can hold multiple sequences:

>sequence_one
ATCGATCGATCGATCGAT
>sequence_two
ACTAGCTAGCTAGCATCG
...
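
As a rough illustration of the format, the two example records above can be written with plain write calls. This is a sketch, not a solution to the exercises: the file name "example.fasta" and the temporary folder are assumptions, and the sequences are already clean.

```python
import os
import tempfile

# Assumption for illustration: write into a temporary folder
folder = tempfile.mkdtemp()
fasta_name = os.path.join(folder, "example.fasta")

fasta_file = open(fasta_name, "w")
fasta_file.write(">sequence_one\n")          # header line: ">" plus identifier
fasta_file.write("ATCGATCGATCGATCGAT\n")     # sequence line
fasta_file.write(">sequence_two\n")
fasta_file.write("ACTAGCTAGCTAGCATCG\n")
fasta_file.close()

# Read the first record back to check the structure
check_file = open(fasta_name)
first_header = check_file.readline().strip("\n")
first_sequence = check_file.readline().strip("\n")
check_file.close()

print(first_header)
print(first_sequence)
```

Every header and every sequence needs its own "\n", because write does not start a new line automatically.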
Exercise 2-2: Writing a FASTA file

Write a program that will create a FASTA file for the following three sequences – make sure that all sequences are in upper case and only contain the bases A, T, G and C (ignore other characters).

sequence header    DNA sequence
ABC123             ATCGTACGATCGATCGATCGCTAGACGTATCG
DEF456             actgatcgacgatcgatcgatcacgact
HIJ789             ACTGAC-ACTGT--ACTGTA----CATGTG
Exercise 2-2: Writing a FASTA file

expected result:
If your script works as intended, you will end up with one file:
sequences.fasta

>ABC123
ATCGTACGATCGATCGATCGCTAGACGTATCG
>DEF456
ACTGATCGACGATCGATCGATCACGACT
>HIJ789
ACTGACACTGTACTGTACATGTG
Exercise 2-3: Writing multiple FASTA files

Using the data from ex. 2-2, write one FASTA file for each sequence.
The file names should be identical to the sequence identifiers, followed by the extension ".fasta".
Exercise 2-3: Writing multiple FASTA files

expected result:
If your script works as intended, you will end up with three files:

ABC123.fasta
>ABC123
ATCGTACGATCGATCGATCGCTAGACGTATCG

DEF456.fasta
>DEF456
ACTGATCGACGATCGATCGATCACGACT

HIJ789.fasta
>HIJ789
ACTGACACTGTACTGTACATGTG
