420-SN1-RE Programming in Science - Lab Exercise 15
March 16, 2025
Introduction
Goals for this lab:
• Working with files
• Reading files, iterating through data contained in files
• Printing formatted strings
Introduction to files and folders
Files
A file is a collection of related information stored as a unit on the permanent storage device of your
computer (e.g., solid state drive or hard disk drive). All files have names (the filename) that are used to
identify the contents of file. Filenames typically have a file extension appended to the end. The extension
identifies the file type and your computer uses the file extension to decide which program should be used
to open the file. Files are useful for storing information on your computer long-term.
In general, Python can open files using the following syntax:
f = open(" myfilename .txt") # .txt is file the extension
Importantly, this statement will only work if the file myfilename.txt is in the same folder as
Python is working from. Otherwise, you would have to provide more information to Python about
where your files are stored.
Folders and paths
Permanent storage devices are organized in a hierarchical manner using folders or directories. A folder
on your disk is a named storage area that can contain files and even other folders. All storage devices
have a root folder. The root folder holds all other files and folders on the device.
A path specifies the location of a file (or folder) on your disk. Using this path, programs (including
the programs you write in Python) can open files to read contents from or write contents to these files.
Paths come in two flavors: relative and absolute. For the most part, we will focus on relative paths.
Check the Appendix if you wish to learn about absolute paths. In everyday usage, we don’t have to think
much about folder paths. However, when programming with files, we will need to understand how to
specify the location of the files using paths.
Relative paths and working directory
The Python interpreter runs from a particular folder on the computer, which may not be the same folder
where your files are located. However, we can tell Python to run from a particular folder (the working
directory). Once we have Python in the right working directory, we can specify filenames relative to that
directory.
The easiest way to set the working directory of Python in IDLE and Thonny is the following:
1
1. Make sure you know where the files you wish to access are saved. Put the files in a folder for the
particular project or lab you are working on. For example:
(a) Create a Lab_15 folder.
(b) Move any files related to Lab 15 into this folder.
2. Now, create a new .py program and save it in the folder from Step 1.
3. With the file open in IDLE or Thonny, run the Module (even if the file is empty).
4. When running the module, IDLE and Thonny will switch the working directory to the location of
the .py file. Now, your shell is in the correct location, and any paths you refer to in this .py file will
be referenced relative to the .py file location.
You can also change directory manually using the os.chdir() function after importing the os module.
The current working directory can be viewed using the os.getcwd() function. Look these functions up
using help() if you want to learn how to use them.
Opening files in Python
Once the working directory is set, you can open() files in the same folder by supplying open() the
filename and extension as an argument(e.g., open("file.txt")). You can also open files contained in
subfolders by writing the name of the subfolder(s) separated by a forward slash (e.g.,
open("subfolder/file.txt")).
Exercise 1: Opening files with Python (exercise1.py)
1. Unzip the contents of the lab and move all files to a folder called Lab_15.
2. Create a new exercise1.py file inside this folder. Run the file to get Python working inside the
Lab_15 folder.
3. Add the following Python statements to the program you created, then run the program.
FILE = open(" my_first_open .txt")
contents = FILE.read ()
print(type(FILE ))
print(type( contents ))
print( contents )
If it worked, you should see the contents of the file printed by Python. If something went wrong,
try to verify that your Python file and the text file are in the same folder. What type is the object
FILE (you can check it out by using the function type()).
What type is the object contents?
What did the FILE.read() command do?
4. Add to your exercise1.py similar code to open dna.txt contained inside the dna subfolder.
2
Exercise 2: Calculating AT content for each DNA string in a file
(at_content_file.py)
The file dna.txt contains 12 different DNA fragments on each line1 . The fragments vary in length from
700 to 3000 nucleotides. You will write a program to analyze the AT content for each of the 12 strings.
Your program must have at least two functions: one function to read a file containing DNA fragments
that returns a list of DNA fragments, and one function that computes AT content for a single fragment.
Follow the steps below:
1. Create a new program at_content_file.py, save it in the folder for this lab.
2. Write a function at_content(dna) that computes the AT content (ratio of A’s and T’s to the length
of a DNA string). The function should accept one DNA string as a parameter and return the AT
content of the string as a ratio.
In a Lab 8, you wrote a program to calculate AT content. Go find this program and turn it into a
function.
3. In your program, write a new function, read_dna(filename), which accepts a filename parameter
(string) and returns a list of DNA fragments.
Function logic: Open the file of the given parameter and read the contents of the file to a variable.
After using the .read() method, the type of the object referred to by the variable will be str. Use
the string method .split() to convert the string into a list by using the white space (i.e., newlines)
as a delimiter. Return the resulting list.
Test that your read_dna() function works correctly. The result of calling this function on the
supplied dna.txt should be a list of length 12.
4. Use your two functions (read_dna and at_content) and your knowledge of Python to compute
and display the AT content for each of 12 DNA fragments in the file. The displayed results should
be nicely formatted (see example below). DNA fragments should be truncated to the first 10 nu-
cleotides. AT content ratios.
Place the main driver code (i.e., the code that calls your functions) inside a special conditional
statement, if __name__ == "__main__".
The skeleton of your code should look like the following, where pass keywords are replaced with
your function and program logic.
def read_dna ( filename ):
pass # replace pass with your function code
def at_content (dna ):
pass # replace pass with your function code
if __name__ == " __main__ ":
pass # replace pass with your program code
1
In a text file, a line contains 0 or more characters followed by the newline character "\n" which denotes the end of a line.
In other words, the newline character can be used as a delimiter to split lines of a file into list elements.
3
The output of your program should resemble the output below. Each fragment should be displayed,
truncated to the first 10 characters followed by the AT content of that string, displayed as a per-
centage and rounded to two decimals.
AT content analysis:
tcataggaag...: 51.80%
aagcgtagga...: 52.86%
cgacgttctg...: 51.71%
cacatacata...: 54.00%
acgcggttct...: 52.00%
taggtgatga...: 48.43%
taggcgcaaa...: 49.14%
gtccctaacc...: 49.71%
aatgcgccgg...: 50.86%
attttagacg...: 51.57%
tatcggtcac...: 49.71%
atacgctgcg...: 48.50%
Exercise 3: The case of the missing grant money
(forensic_analysis.py)
You are part of a special digital forensic team tasked with solving a mystery involving missing grant
money. A research team has depleted its entire budget after two months, and all members have disap-
peared without a trace. The researchers left important research notes hidden across several files in a
mystery folder. Your team has provided you with the forensic copy of the folder mystery_folder. Your
team has determined that a specific keyword ("formula") is crucial to unlocking the next step in your
investigation.
Objective
Your mission is to write a Python program forensic_analysis.py that will:
a. Search through all text files in the folder mystery_folder, including its subfolders clues, notes,
and experiments.
b. Count how many times the keyword "formula" appears in each file.
c. Identify the filename of the file with the most occurrences of the keyword.
d. Print a report using string formatting. Display each filename and the number of occurrences in
that filename. This output should look like a nicely formatted table. At the end of your output,
display the filename with the most occurrences and print the text of the file with the most keyword
occurrences. Make this report readable to the average human.
e. Add a comment to your program stating what most likely happened to the missing grant money.
4
Requirements
• Enumerate the relative path of all the files you must search.
– Non-enriched: These paths can be written manually and stored be stored in a Python list.
In other words, you can hard-code a list of pathnames to the files, relative to the Lab 15 folder.
– Enriched class requirement: Write a function to enumerate the paths of all files of the
mystery_folder across all subfolders. Hint: this is a naturally recursive process because
filesystems are tree-based structures; see recursion tips below.
• Use Python’s file handling capabilities to read text data from the files.
• Ensure your program is concise in handling the various file paths. In other words, you should avoid
repetitious code.
• Making use of pre-defined string methods may help you comb through the files.
• Make sure to print your results in a clear and organized format using string formatting.
General Tips
• In this exercise, the text files are embedded in subfolders of the folder containing the Lab 15 files.
You can have Python read a file in a subfolder by specifying a "relative path" to get to that file.
To get you started, if Python is working from the folder containing the Lab 15 files, you should be
able to run:
FILE=open(" mystery_folder /notes / research_notes .txt")
#open the file research_notes .txt contained in mystery_folder , then notes
Recursion tips:
The os module (import os) contains many functions for working with paths. Here are some helpful ones
that might work well in your recursive function:
• os.listdir() - list all files and folders in a path
• os.path.isdir() - Return True if a path is a folder
• os.path.isfile() - Return True if a path is a file
• os.path.isabs() - Return True if a path is absolute (i.e., full path starting at the root folder)
• os.path.join() - Join two or more pathname components, inserting ’/’ as needed.
Having trouble thinking about the recursive procedure?
1. Iterate the contents (files and folders) in the current folder.
(a) Append any file to an accumulator list.
(b) Recursively call the function on any subfolder, passing in the accumulator list. The recursion
stops (base case) if there are no subfolders in the contents of the current folder.
2. Return the accumulator list.
5
Appendix
Absolute paths
Absolute paths specify the location of a file starting from the root folder of your device. For example, the
following is a possible absolute path specification on a PC for a file with the name dna.txt.
C:\Users\jason\Documents\Programming\dna.txt
This path begins at the root folder of the storage device named C and gives a list of folders separated
by backslashes (\). The path terminates with the filename of the intended file.
If you are on a Mac, your paths may look like the following.
/Users/jason/Documents/Programming/dna.txt
Both paths above can be read roughly as: start at the root folder of your storage device, go into the
folder called "Users", next go into the folder called "jason", next go into the folder called "Documents",
next go into the folder called "Programming", then look for the file called "at_content.py". Again, paths
on your computer may look different from the above examples!
In Python, if the above path were valid for your computer, you could run following command to open
the file dna.txt
x = open("C:\ Users \ jason \ Documents \ Programming \dna.txt")
Absolute paths have limitations. Absolute paths tend to be quite long, especially for files that are
deeply embedded on your storage device. Moreover, programs that rely on absolute paths won’t work
well on different computers, since every computer may have a different folder structure. For example,
your computer likely does not have a folder called "jason". Relative paths are often better.