Reading Files
Chapter 7
Python for Informatics: Exploring Information
www.py4inf.com
Unless otherwise noted, the content of this course material is licensed under a Creative
Commons Attribution 3.0 License.
http://creativecommons.org/licenses/by/3.0/.
Copyright 2010, 2011, Charles Severance
[Figure: computer architecture diagram. Labels: Software, Input and Output Devices, Central Processing Unit, Main Memory, Secondary Memory ("Files R Us"). Callout: "What Next? It is time to go find some Data to mess with!" The diagram also shows a code fragment ("if x < 3: print") and the sample mail data reproduced under "File Processing".]
File Processing
• A text file can be thought of as a sequence of lines
From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008
Return-Path: <postmaster@collab.sakaiproject.org>
Date: Sat, 5 Jan 2008 09:12:18 -0500
To: source@collab.sakaiproject.org
From: stephen.marquard@uct.ac.za
Subject: [sakai] svn commit: r39772 - content/branches/
Details: http://source.sakaiproject.org/viewsvn/?view=rev&rev=39772
http://www.py4inf.com/code/mbox-short.txt
Opening a File
• Before we can read the contents of the file we must tell Python which
file we are going to work with and what we will be doing with the file
• This is done with the open() function
• open() returns a “file handle” - a variable used to perform operations
on the file
• Kind of like “File -> Open” in a Word Processor
Using open()
• handle = open(filename, mode)
• returns a handle used to manipulate the file
• filename is a string
• mode is optional and should be 'r' if we are planning to read the file
and 'w' if we are going to write to the file.
fhand = open('mbox.txt', 'r')
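The two modes described above can be tried out with a small sketch. It is written for Python 3 (where print is a function), not the Python 2 used in these slides, and the file name demo.txt and its contents are made up for illustration.

```python
import os
import tempfile

# Hypothetical file name, placed in a fresh temporary directory for this sketch.
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')

# Mode 'w' opens the file for writing (it is created, or emptied if it exists).
fout = open(path, 'w')
fout.write('first line\n')
fout.write('second line\n')
fout.close()

# Mode 'r' opens the file for reading; it is also the default when mode is omitted.
fhand = open(path, 'r')
contents = fhand.read()
fhand.close()

print(contents)
```

Closing the write handle before reopening the file makes sure the written text is actually on disk when it is read back.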
http://docs.python.org/lib/built-in-funcs.html
What is a Handle?
>>> fhand = open('mbox.txt')
>>> print fhand
<open file 'mbox.txt', mode 'r' at 0x1005088b0>
When Files are Missing
>>> fhand = open('stuff.txt')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
IOError: [Errno 2] No such file or directory: 'stuff.txt'
The newline Character
• We use a special character to indicate when a line ends
called the "newline"
• We represent it as \n in strings
• Newline is still one character - not two
>>> stuff = 'Hello\nWorld!'
>>> stuff
'Hello\nWorld!'
>>> print stuff
Hello
World!
>>> stuff = 'X\nY'
>>> print stuff
X
Y
>>> len(stuff)
3
File Processing
• A text file has newlines at the end of each line
From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008\n
Return-Path: <postmaster@collab.sakaiproject.org>\n
Date: Sat, 5 Jan 2008 09:12:18 -0500\n
To: source@collab.sakaiproject.org\n
From: stephen.marquard@uct.ac.za\n
Subject: [sakai] svn commit: r39772 - content/branches/\n
Details: http://source.sakaiproject.org/viewsvn/?view=rev&rev=39772\n
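One way to see the newline at the end of each line is to print the repr() of every line read from a file. This is a minimal sketch for Python 3; the temporary file sample.txt is a made-up stand-in holding two of the header lines shown above.

```python
import os
import tempfile

# Hypothetical two-line file standing in for the mail data.
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
fout = open(path, 'w')
fout.write('From: stephen.marquard@uct.ac.za\n')
fout.write('To: source@collab.sakaiproject.org\n')
fout.close()

shown = []
fhand = open(path)
for line in fhand:
    # repr() displays the newline as the two visible characters \n
    shown.append(repr(line))
    print(repr(line))
fhand.close()
```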
File Handle as a Sequence
• A file handle open for read can be treated as a sequence of strings
where each line in the file is a string in the sequence
• We can use the for statement to iterate through a sequence
• Remember - a sequence is an ordered set
xfile = open('mbox.txt', 'r')
for cheese in xfile:
    print cheese
Counting Lines in a File
• Open a file read-only
• Use a for loop to read each line
• Count the lines and print out the number of lines
fhand = open('mbox.txt')
count = 0
for line in fhand:
    count = count + 1
print 'Line Count:', count

python open.py
Line Count: 132045
Reading the *Whole* File
• We can read the whole file (newlines and all) into a single string.
>>> fhand = open('mbox-short.txt')
>>> inp = fhand.read()
>>> print len(inp)
94626
>>> print inp[:20]
From stephen.marquar
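To check that read() really keeps the newlines, the string it returns can be inspected directly. The sketch below assumes Python 3 and uses a small made-up file (sample.txt) rather than mbox-short.txt.

```python
import os
import tempfile

# Hypothetical sample text: two lines, each ending in a newline.
text = 'From: stephen.marquard@uct.ac.za\nTo: source@collab.sakaiproject.org\n'
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
fout = open(path, 'w')
fout.write(text)
fout.close()

fhand = open(path)
inp = fhand.read()      # the whole file as one string
fhand.close()

newline_count = inp.count('\n')   # one newline per line in the file
print(len(inp))
print(newline_count)
print(inp[:5])
```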
Searching Through a File
• We can put an if statement in our for loop to only print
lines that meet some criteria
fhand = open('mbox-short.txt')
for line in fhand:
    if line.startswith('From:') :
        print line
OOPS!
What are all these blank lines doing here?
From: stephen.marquard@uct.ac.za

From: louis@media.berkeley.edu

From: zqian@umich.edu

From: rjlowe@iupui.edu

...
OOPS!
What are all these blank lines doing here?
The print statement adds a newline to each line.
Each line from the file also has a newline at the end.
From: stephen.marquard@uct.ac.za\n
\n
From: louis@media.berkeley.edu\n
\n
From: zqian@umich.edu\n
\n
From: rjlowe@iupui.edu\n
...
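The doubled newline can be made visible by capturing what print produces. This sketch assumes Python 3, where print is a function that accepts a file argument; in the Python 2 used on these slides the print statement behaves the same way with respect to the trailing newline.

```python
import io

# A line as it would come from the file, with its own newline at the end.
line = 'From: zqian@umich.edu\n'

buf = io.StringIO()
print(line, file=buf)        # print appends a second newline
doubled = buf.getvalue()

print(repr(doubled))         # 'From: zqian@umich.edu\n\n'
```

The second \n is what shows up on screen as a blank line after each match.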
Searching Through a File (fixed)
• We can strip the whitespace from the right hand side of
the string using rstrip() from the string library
• The newline is considered "white space" and is stripped
fhand = open('mbox-short.txt')
for line in fhand:
    line = line.rstrip()
    if line.startswith('From:') :
        print line

From: stephen.marquard@uct.ac.za
From: louis@media.berkeley.edu
From: zqian@umich.edu
From: rjlowe@iupui.edu
....
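As a small illustration of what rstrip() removes, the sketch below compares it with the related string methods lstrip() and strip(). It assumes Python 3, and the extra spaces in the sample line are added purely for the demonstration.

```python
# Sample line with leading spaces, a trailing space, and a newline.
line = '  From: rjlowe@iupui.edu \n'

right = line.rstrip()   # removes whitespace (including \n) on the right only
left = line.lstrip()    # removes whitespace on the left only
both = line.strip()     # removes whitespace on both sides

print(repr(right))
print(repr(left))
print(repr(both))
```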
Skipping with continue
• We can conveniently skip a line by using the
continue statement
fhand = open('mbox-short.txt')
for line in fhand:
    line = line.rstrip()
    # Skip 'uninteresting lines'
    if not line.startswith('From:') :
        continue
    # Process our 'interesting' line
    print line
Using in to select lines
• We can look for a string anywhere in a line as our
selection criteria
fhand = open('mbox-short.txt')
for line in fhand:
    line = line.rstrip()
    if not '@uct.ac.za' in line :
        continue
    print line
From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008
X-Authentication-Warning: set sender to stephen.marquard@uct.ac.za using -f
From: stephen.marquard@uct.ac.za
Author: stephen.marquard@uct.ac.za
From david.horwitz@uct.ac.za Fri Jan 4 07:02:32 2008
X-Authentication-Warning: set sender to david.horwitz@uct.ac.za using -f
...
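The same selection can be tried without a file by looping over a list of strings. This is a sketch for Python 3 using three lines taken from the examples in these slides; it spells the test as "not in", which is equivalent to the "not ... in" form used above.

```python
# Three sample lines; only two of them mention the uct.ac.za domain.
lines = [
    'From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008',
    'From: louis@media.berkeley.edu',
    'Author: stephen.marquard@uct.ac.za',
]

matches = []
for line in lines:
    # Skip lines that do not contain the string anywhere
    if '@uct.ac.za' not in line:
        continue
    matches.append(line)
    print(line)
```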
Prompt for File Name
fname = raw_input('Enter the file name: ')
fhand = open(fname)
count = 0
for line in fhand:
    if line.startswith('Subject:') :
        count = count + 1
print 'There were', count, 'subject lines in', fname
Enter the file name: mbox.txt
There were 1797 subject lines in mbox.txt
python search6.py
Enter the file name: mbox-short.txt
There were 27 subject lines in mbox-short.txt
Bad File Names
fname = raw_input('Enter the file name: ')
try:
    fhand = open(fname)
except:
    print 'File cannot be opened:', fname
    exit()
count = 0
for line in fhand:
    if line.startswith('Subject:') :
        count = count + 1
print 'There were', count, 'subject lines in', fname
Enter the file name: mbox.txt
There were 1797 subject lines in mbox.txt
Enter the file name: na na boo boo
File cannot be opened: na na boo boo
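A variation on the program above, sketched for Python 3, wraps the logic in a function and catches only OSError (the Python 3 name that covers IOError) instead of using a bare except. The function name count_subject_lines, the file sample.txt, and its contents are hypothetical and not part of the original slides.

```python
import os
import tempfile

def count_subject_lines(fname):
    """Count lines starting with 'Subject:'; return None if the file cannot be opened."""
    try:
        fhand = open(fname)
    except OSError:
        # Covers missing files, permission problems, and similar failures
        print('File cannot be opened:', fname)
        return None
    count = 0
    for line in fhand:
        if line.startswith('Subject:'):
            count = count + 1
    fhand.close()
    return count

# Hypothetical sample file with two subject lines.
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
fout = open(path, 'w')
fout.write('Subject: one\nFrom: stephen.marquard@uct.ac.za\nSubject: two\n')
fout.close()

print(count_subject_lines(path))
```

Returning None rather than calling exit() lets the caller decide what to do about a bad file name, for example asking the user again.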
Summary
• Secondary storage
• Opening a file - file handle
• File structure - newline character
• Reading a file line-by-line with a for loop
• Reading the whole file as a string
• Searching for lines
• Stripping white space
• Using continue
• Using in as an operator
• Reading a file and splitting lines
• Reading file names
• Dealing with bad files