
Code project management

Neal W Morton edited this page Jun 11, 2023 · 30 revisions

Managing Python and Bash code projects for data analysis

Code is often much easier to develop and test locally, where you can install and run any software you want. This workflow sets things up so that you develop code locally and then run it on a cluster such as Lonestar 6. By working locally, you can use an integrated development environment (IDE) to help you develop your code. An IDE makes many things easier, with features like checking your code syntax and looking up documentation for you.

Part of the workflow is setting up a Python package that can quickly be installed anywhere you want. All the code will be tracked on GitHub, so you'll always be able to look up what changes have been made and when; this can be really useful during large analysis projects like fMRI data analysis, which can take a long time. This also makes code easy to share, so that you can invite other people to contribute code or give feedback. And it makes transferring work between clusters (sometimes necessary when there is downtime for maintenance or cluster decommissioning) much easier.

See the test project for a simple example of a project following the organization outlined in this workflow.

These instructions assume basic familiarity with Bash, Git, and GitHub.

This workflow was developed by Neal Morton on 2023-06-01 based on experience with many complex Python analysis packages and work on multiple clusters, along with recommendations from various Python development experts. Python development best practices are still evolving, but this procedure should remain sensible for some time.

Initial setup

  • First, create a new repository on GitHub. This can be either under your personal account or under the lab's prestonlab organization. Choose to add a README file; this is a good place to put basic documentation for your project. The purpose of your README is both to help you keep track of your project and to allow naive users to run your code and achieve the same results you did. A README that meets both goals takes some care to write; an example of a thorough README can be found in the XMaze repository. Also select a license to indicate what others are allowed to do with your code (for example, you can forbid others from using your code to make a closed-source program for commercial purposes). You may also want to specify a .gitignore file template. The .gitignore file specifies files that you don't want to track using Git. Often there are temporary files or local configuration files that you don't want to include in your repository. GitHub has nice templates for different programming languages that you can select.
  • Download an IDE (integrated development environment) like PyCharm or Visual Studio Code to whatever local computer you're working on. These programs usually have free licenses for people who sign up with a .edu email address.
  • Get a clone of the repository you made on GitHub (you can do this through PyCharm if you link your GitHub account, and Visual Studio Code may have a similar tool). If you are new to GitHub, you can also download GitHub's official visual tool, GitHub Desktop. While this tool has some limitations, it makes cloning, pushing, pulling, and viewing history very simple, with no need to use the command line tools.
  • Install Python. A simple way to do this is by going to python.org and downloading an installer for your operating system. A good rule of thumb is to use the previous version of Python. For example, if the most recent version is 3.11, install 3.10. This helps to avoid problems with dependencies that may not support the latest version of Python yet.
  • Set up a Python environment. First, check that you're running the version of Python that you expect, by running python --version. If that looks right, then change directory to the main directory of your local clone of your repository. Then run python -m venv venv to create a Python virtual environment in a directory named venv. You only need to create it once. To activate the environment, run the command . venv/bin/activate (the leading dot is shorthand for source). To deactivate it, run deactivate.
  • Now you're ready to get started developing Python code for your project.
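Before installing anything, it can help to confirm from within Python that the interpreter and environment are the ones you expect. The following is a minimal sketch rather than part of the workflow itself; the function names in_virtualenv and describe_python are made up for this example, and the check assumes an environment created with python -m venv.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter is running inside a virtual environment."""
    # Inside a venv, sys.prefix points at the environment directory, while
    # sys.base_prefix still points at the Python installation it was created from.
    return sys.prefix != sys.base_prefix


def describe_python() -> str:
    """Summarize the interpreter version and whether a venv is active."""
    version = ".".join(str(part) for part in sys.version_info[:3])
    status = "active" if in_virtualenv() else "not active"
    return f"Python {version} (virtual environment {status})"


if __name__ == "__main__":
    print(describe_python())
```

Running this after activating the environment should report that the virtual environment is active; running it after deactivate should report that it is not.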

Developing code

There are three general ways to write Python code:

  • Scripts to be called from the command line. You probably already use some scripts that are written in Python (for example, launch is a Python program). If you're running some analysis on a cluster like Lonestar 6, you probably need to write a script.
  • Notebooks that are run using Jupyter/Jupyter Lab. This is a good way to explore and visualize data. It's also a great way to handle the "last mile" of analysis, where you've already done most of the computation necessary and just want to do things like run statistical tests and make figures for presenting your results.
  • Modules that contain Python functions and classes. A good rule of thumb is that any code that's used multiple times should be in a function, and generally functions should be placed in modules for easy reuse. This allows you to write a function once and call it from multiple scripts or notebooks. Once you set up a package, you can call functions in your module just like you do from third-party libraries, using commands like from mypackage import module; result = module.myfunction(x, y, z).
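As a minimal sketch of the third approach, suppose a module in your package defines a small helper. The module name stats.py and the function zscore are hypothetical, chosen only to illustrate the pattern.

```python
# Hypothetical contents of src/myproject/stats.py
from statistics import mean, pstdev


def zscore(values):
    """Standardize a sequence of numbers to zero mean and unit variance."""
    center = mean(values)
    spread = pstdev(values)  # population standard deviation
    if spread == 0:
        raise ValueError("Cannot z-score values with zero variance.")
    return [(value - center) / spread for value in values]
```

Once the package is installed, any script or notebook could reuse the function with from myproject import stats followed by stats.zscore([1, 2, 3]), instead of copying the calculation into each file.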

Project directory

Here's a structure that has worked well for multiple projects:

myproject
├── LICENSE
├── README.md
├── bin [optional]
│   ├── myproject-script1
│   └── myproject-script2
├── jupyter
│   └── mynotebook1.ipynb
├── pyproject.toml
├── setup.py [optional]
├── src
│   └── myproject
│       ├── __init__.py
│       ├── mymodule1.py
│       └── mymodule2.py
└── venv

The __init__.py can be just an empty file; it marks myproject as a Python package. The pyproject.toml file gives metadata about the project and specifies how to install it. See the SetupTools documentation for details. For example:

[project]
name = "myproject"
description = "MyProject: A sentence about what this project is designed to do."
version = "0.0.1"
readme = "README.md"
requires-python = ">=3.8"
license = {file = "LICENSE"}
authors = [
    {name = "Firstname Lastname", email = "myemail@example.com"}
]
keywords = ["keyword1", "keyword2", "etc."]
classifiers = [
    "Programming Language :: Python :: 3",
    "License :: OSI Approved :: GNU General Public License v3 or later (GPLv3+)",
    "Operating System :: OS Independent"
]
urls = {project = "https://github.com/prestonlab/myproject"}
dependencies = [
    "numpy",
    "pandas"
]

[build-system]
requires = ["setuptools", "wheel"]
build-backend = "setuptools.build_meta"

Much of the metadata isn't necessary for running your code. But it's a good idea to establish at the beginning what the project is for, who is writing it, what other people are licensed to do with your code, etc. If you selected a license when setting up your GitHub project, the pyproject.toml code shown above will point to that license file. Authorship and licensing information will be especially important if you publish your code later on. See the Publishing code section below.

Installing your Python package

There are multiple tools for installing Python packages. The most common is called SetupTools, and it's what we'll use here. The [build-system] section in the example above tells Python how to take code for a package and install it into a Python environment.

Once your build system and dependencies are specified in pyproject.toml, you can install your project by running the command pip install -e . (including the final dot, which refers to the current directory). This will install your package, if you've created one, and any dependencies you listed in the pyproject.toml file. There isn't really anything to install yet, but we'll get to that next. The -e flag stands for editable. It's really useful when developing code, because edits to existing code take effect automatically without you having to rerun pip install. You still need to rerun pip install, however, when you add a new script or module.
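One way to confirm that the install worked is to ask Python which version of the package it sees. This is a small sketch using the standard library's importlib.metadata (available in Python 3.8 and later); the helper name installed_version is made up for this example.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(distribution_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # "myproject" matches the name field in the example pyproject.toml.
    print(installed_version("myproject"))
```

After a successful install, this should report the version field from pyproject.toml; None suggests the package was installed into a different environment or not at all.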

Bash scripts

If you will also be writing Bash scripts, you can include a bin subdirectory to hold the scripts and a setup.py file to indicate how to install them. The setup.py script is an older standard used by SetupTools. Because it contains Python code, it can be used to do things that aren't supported by the pyproject.toml configuration. This example file will install any scripts found in the bin directory whose names start with myproject-, including shell scripts:

import setuptools
import glob

scripts = glob.glob('bin/myproject-*')
setuptools.setup(scripts=scripts)
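To see which files that glob pattern would pick up, the sketch below applies the same pattern to a throwaway directory. The helper name matching_scripts and the file names are made up for illustration.

```python
import glob
import os
import tempfile
from pathlib import Path


def matching_scripts(project_dir):
    """Return the names of files in bin/ that match the myproject-* pattern."""
    pattern = os.path.join(project_dir, "bin", "myproject-*")
    return sorted(os.path.basename(path) for path in glob.glob(pattern))


if __name__ == "__main__":
    # Build a temporary project layout with two matching files and one other file.
    with tempfile.TemporaryDirectory() as project_dir:
        bin_dir = Path(project_dir) / "bin"
        bin_dir.mkdir()
        for name in ["myproject-script1", "myproject-script2", "helper.sh"]:
            (bin_dir / name).touch()
        print(matching_scripts(project_dir))
```

Only the two files whose names start with myproject- are listed; helper.sh is skipped, so using a consistent prefix is what marks a script for installation.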

Python modules and scripts

Python modules contain code that may be reused in multiple scripts or notebooks. Modules go in the src/myproject directory. This can be split up into multiple modules. For example, you might have a raw.py module that contains functions to help with reading raw data and converting it to BIDS format, and a task.py module with functions to load and analyze behavioral data.

Python scripts can be placed in any of these modules. The click package is really useful for setting up Python command line scripts. Take this script from the Sardonyx project for example:

# sardonyx/src/sardonyx/raw.py
from pathlib import Path
import click

@click.command()
@click.argument("raw_dir", type=click.Path())
def garnet_rename_subjects(raw_dir):
    """Rename garnet participants."""
    raw_dir = Path(raw_dir)
    for old_path in raw_dir.iterdir():
        old_name = str(old_path.name)
        new_name = "_".join(old_name.split("_")[:2])
        new_path = old_path.parent / new_name
        old_path.rename(new_path)

This simple function takes in a directory and renames its subdirectories to remove subject initials. The @click decorators at the top indicate that this function should also be callable as a command line script, and define which arguments the user must supply.
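The renaming rule can be tried on its own, without touching any files. The sketch below isolates the string expression from the script; the helper name strip_initials and the sample directory names are hypothetical, since the actual naming scheme of the raw data is not shown here.

```python
def strip_initials(directory_name):
    """Keep only the first two underscore-separated fields of a name."""
    return "_".join(directory_name.split("_")[:2])


if __name__ == "__main__":
    # Hypothetical names: the third field stands in for subject initials.
    for name in ["garnet_01_ab", "garnet_02"]:
        print(name, "->", strip_initials(name))
```

A name with three fields loses its last field, while a name that already has only two fields is left unchanged.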

To make this script work, we need to also specify in the pyproject.toml file that it should be installed as a script, by adding/editing a section called [project.scripts]:

[project.scripts]
garnet-rename-subjects = "sardonyx.raw:garnet_rename_subjects"

As shown in this example, the location of the function to use is given through package.module:function syntax. On the left side is the name that we want the installed script to have. Now, we need to reinstall the package by running:

pip install -e .

After installation, you should be able to run the script:

garnet-rename-subjects --help

All scripts set up using click support the --help option. That should print out information about how to call the script.
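The package.module:function syntax used in [project.scripts] can be read as "import this module, then look up this name in it". The sketch below mimics that lookup with the standard library. It is only a rough approximation for illustration, not how the installed launcher is actually implemented, and resolve_entry_point is a made-up name.

```python
from importlib import import_module


def resolve_entry_point(spec):
    """Look up the object named by a 'package.module:function' string."""
    module_name, _, function_name = spec.partition(":")
    module = import_module(module_name)
    return getattr(module, function_name)


if __name__ == "__main__":
    # A standard-library function stands in for a project function here.
    basename = resolve_entry_point("os.path:basename")
    print(basename("/corral/utexas/prestonlab/sardonyx"))
```

With the package installed, the same lookup on "sardonyx.raw:garnet_rename_subjects" would return the click command defined in raw.py.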

Once you've installed your package, scripts and notebooks can call functions defined in your modules, using the same syntax as for third-party packages. For example, the Sardonyx function get_subjects can be called by installing Sardonyx using pip, importing the module (e.g., from sardonyx import task), and then calling the function (e.g., subjects = task.get_subjects("garnet")). It's generally best to place function definitions in a module, rather than in a notebook or a standalone script, so that you can easily call the functions from anywhere by importing your module.

Running code on TACC (or anywhere)

First, get a clone of your project on Lonestar or whichever cluster you're using. You can place the project anywhere, but we have had some problems with running git on the WORK filesystem. It's a specialized distributed filesystem designed to work well on a cluster, and it hasn't always played well with git. Cloning to your home directory should work well, and Corral will probably also work without issues.

For example:

mkdir -p ~/analysis
cd ~/analysis
git clone git@github.com:prestonlab/sardonyx.git

There are multiple ways to clone projects, depending on what your authentication settings are on GitHub. One nice option is to set up passwordless authentication using SSH keys. First, print your public key on the computer or cluster you want to clone to by running cat ~/.ssh/id_rsa.pub (if you don't have a key yet, you can create one with ssh-keygen). Then go to your account settings on GitHub and add that key in the "SSH and GPG Keys" section. After it's been added, you should be able to clone a project using the SSH method (e.g., git clone git@github.com:prestonlab/sardonyx.git) without signing in.

Next, activate the version of Python you want to use. If possible, make this close to the version in the local development environment that you set up previously. For example, run module load python3/3.9.7 on Lonestar 6 to load version 3.9.7. Use this version of Python to set up a virtual environment, this time on the cluster, for example python -m venv venv. The virtual environment is just a directory, which can be placed anywhere and moved around. Place it wherever is convenient for your project. If you're collaborating on a project, one option is to place it on WORK so that your collaborators can also activate it and you can use the same Python setup. As with your local virtual environment, activate the environment using . venv/bin/activate and deactivate using deactivate.

Finally, you can install your project and its dependencies using pip.

pip install -e [path to my cloned project]

You should now be ready to run your libraries and scripts on the cluster.

Making code changes

To continue developing code, make changes on your local computer and test them out there. Copying sample data to your local computer is often helpful for this; for example, you can copy one participant's imaging data to your computer and test out analyses on that. It's a good idea to write your scripts to avoid "hard coding". For example, if you have a shell script that hard-codes the location of the data (e.g., /corral/utexas/prestonlab/sardonyx), then this script will need to be edited before you can test it out locally. For flexibility, you can instead write the script so it takes the data directory as an input. For example:

#!/bin/bash
#
# myscript.sh

# data_dir=/corral/utexas/prestonlab/sardonyx
data_dir="$1"
...

This indicates that the data_dir variable will be set to the first input argument supplied when calling the script. You can run the script with something like myscript.sh /corral/utexas/prestonlab/sardonyx on TACC and something like myscript.sh ~/data/sardonyx on your local computer, so that it works in either location without having to modify the script. This also makes code much easier to share, because the person you're sharing with can just indicate where the data are stored on their system when running each of your scripts instead of having to edit each script individually.
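The same idea applies to Python scripts. The sketch below uses the standard library's argparse instead of click, purely so that the example has no third-party dependencies; the function names and the description text are made up for illustration.

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    """Parse the command line; the data directory is a required argument."""
    parser = argparse.ArgumentParser(
        description="Run an analysis on a data directory."
    )
    parser.add_argument(
        "data_dir", type=Path, help="directory containing the project data"
    )
    return parser.parse_args(argv)


def main(argv=None):
    """Resolve the data directory given on the command line."""
    args = parse_args(argv)
    data_dir = args.data_dir.expanduser()  # expand a leading ~ to the home directory
    print(f"Reading data from {data_dir}")
    return data_dir


# Example call with an explicit argument list. A real script would instead
# call main() with no arguments so that the command line is used.
main(["~/data/sardonyx"])
```

Because the path comes from the caller, the same file can be pointed at /corral/utexas/prestonlab/sardonyx on the cluster and at ~/data/sardonyx on a laptop without any edits.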

After testing out your code changes locally, commit and push code changes to GitHub (your IDE should have features for doing this). Then, on the cluster, change directory to your cloned project and run git pull to get your latest changes. If you've added any new modules or scripts, you'll need to run pip install -e [path to project] again. If the dependencies haven't changed, you can add the --no-deps flag to just install the new code without checking on the dependencies, so it will run faster.

In this way, you can use the best tools available for local code development (e.g., an IDE like PyCharm or VS Code), and you never run any code that isn't tracked on GitHub. This can give you confidence knowing exactly how each analysis was run. If you later discover a bug in your code, you'll be able to tell exactly when that bug was introduced and when it was fixed. You'll also be in a good position to publish your code alongside any papers you publish for the project.

Using Jupyter notebooks for final statistics and plots

Notebooks are really useful for the "last mile" of analysis, after the computationally demanding processing has already been done on a cluster. They make it easier to organize results such as statistics and figures, along with the code used to create them. They can be committed to your GitHub project alongside your other code, and GitHub will render your notebooks so that others can see them right from the project page. Some clusters make it possible to run Jupyter notebooks remotely, but usually it's easiest to run Jupyter from your local computer.

If you followed the directions above, you should already have a Python virtual environment set up locally. We can use that environment to run Jupyter notebooks. First, however, we must install that environment as a kernel:

pip install -U jupyterlab  # make sure Jupyter is installed
python -m ipykernel install --user --name myproject

To launch Jupyter Lab, run jupyter lab &. You should see a Jupyter window pop up in your browser. When you create a new notebook, you should be able to select your Python environment as the kernel. This will make it so you can import modules from your Python package.
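If an import fails inside a notebook, a quick check is to see which interpreter the kernel is using and whether it can find your package. This is a minimal sketch that could be pasted into a notebook cell; is_importable is a made-up helper name, and myproject stands in for your own package name.

```python
import sys
from importlib.util import find_spec


def is_importable(module_name):
    """Return True if the current interpreter can find the named top-level module."""
    return find_spec(module_name) is not None


if __name__ == "__main__":
    # The interpreter path should point inside your venv directory.
    print("Interpreter:", sys.executable)
    print("myproject importable:", is_importable("myproject"))
```

If the interpreter path is outside your virtual environment, the notebook is probably running a different kernel than the one you installed.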

I recommend keeping code in each notebook to a minimum, and putting more complex code in a module within your Python package. That will make it easier to read the notebook, while also making the complicated code easier to test. It's also worth keeping in mind that it's harder to keep track of changes to notebooks, as they're in a more complicated format than Python source code. If most of your code is stored in modules, it will be easier to go back later and see what changes you've made. When developing an analysis, often I'll make a test notebook first to explore the data and develop code. Then I'll move most of the code to a module within my Python package, and make a new notebook that is more streamlined because most of the work is done by the module code.

Publishing code

Publishing code alongside published papers can dramatically increase the reproducibility of your work, by making the exact calculations available. Planning from the beginning to publish code can also encourage using best coding practices and making sure to test and document your code properly. Using best practices helps build confidence in your research, both for yourself and among other members of the scientific community. Having a well-organized and documented project can also make things much easier for you when it comes time to revise a paper or plan a follow-up project.

The simplest method of publishing is to just change the GitHub project's visibility from private to public in Settings, and share the GitHub project URL in your paper.

It's best, however, to avoid relying on commercial companies for hosting code that a researcher might want to access in the future. The OSF and Zenodo projects are both designed for long-term archiving of code and data, and these projects have tools for importing and archiving code from GitHub.

Managing Python and Bash code projects for data analysis by Neal W Morton is licensed under CC BY 4.0
