Data Backup
This section explains the backup resources used in the lab and how they can be used generally. See the Instructions section for specific backup procedures.
Data can be lost in a few different ways:
- drive failure
- accidental deletion
- being misplaced
All of these can be avoided with the right organization.
At a minimum, there should always be two copies of raw data from a study. The standard backup in the lab is to have a copy on Ranch and a copy on UTBox. While data collection/analysis is ongoing, you may also have a copy of the raw data on Corral. This applies to all data, including behavior-only studies or behavioral pilots for fMRI studies. A standard organization and a corresponding spreadsheet with study information are used to help keep data from being lost as people leave the lab.
Ongoing studies should be backed up to Corral. Completed studies should be moved to Ranch and UT Box. Each system has the same organization, with raw, archive, and bids top-level directories.
These subdirectories hold data for different types of studies. This could include different datasets under a given project. For example, behavioral piloting for a scan study should go in the behav directory, while the scanned data (including behavior) should go in the fmri directory.
/corral-repl/utexas/prestonlab/raw/behav
/corral-repl/utexas/prestonlab/raw/fmri
/corral-repl/utexas/prestonlab/raw/ecog
The raw directory is intended for raw data, including raw DICOM files and behavioral logs. The archive directory can be used for other types of data, such as analysis results. The bids directory is reserved for BIDS-compliant datasets.
All archived data must be logged in the Data archive spreadsheet. Ask the lab manager if you need access to edit that spreadsheet.
Raw data are backed up after a scan at the BIC, but only for a limited time. Data may be on the scanner computer for up to about a week; then data are moved to BIC servers for up to about 6 months, depending on usage. You can't rely on BIC copies for more than a few months, so make sure to back up your data before then.
Archive files created on Corral in /corral-repl/utexas/prestonlab/raw are part of the Preston Lab data repository and will be kept backed up on Ranch. Files on Corral are mirrored, so there is protection from drive failure, but there is no protection from accidental file deletion. If a file on Corral is deleted, it will immediately also be deleted in the mirrored copy of Corral. Therefore, it is important to keep another copy of data on Ranch.
An easy way to set up a data archive for a study is to make archive files of individual subject directories.
tar -czf archive_file.tar.gz dir_to_archive # make an archive from a directory
tar -tf archive_file.tar.gz # list all files in an archive
Ranch has a huge amount of storage, but is harder to use. To make backing up there easier, the lab has a designated data manager to handle copying data from Corral to Ranch. Data on Corral should be staged in the /corral-repl/utexas/prestonlab/raw directory and can be uploaded using the backup-raw scripts. This is not currently configured to run automatically, so you're responsible for running this backup script. There are plans to implement an automated system for scheduling these backups in the future.
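The per-subject archiving approach described above can be sketched as a short loop. This is only an illustration: the study layout, the subject names, and the temporary directories are made up for the demo and are not the lab's actual paths.

```shell
set -eu

# Hypothetical layout: a study directory with one subdirectory per subject.
study_dir=$(mktemp -d)
mkdir -p "$study_dir/sub_01" "$study_dir/sub_02"
echo "trial,response" > "$study_dir/sub_01/log.csv"
echo "trial,response" > "$study_dir/sub_02/log.csv"

archive_dir=$(mktemp -d)

# Make one .tar.gz per subject directory, then list each archive to
# confirm that it can be read back.
for subject_path in "$study_dir"/*/; do
    subject=$(basename "$subject_path")
    tar -czf "$archive_dir/$subject.tar.gz" -C "$study_dir" "$subject"
    tar -tzf "$archive_dir/$subject.tar.gz" > /dev/null
    echo "archived $subject"
done
```

Using -C keeps the paths inside each archive relative to the study directory, so an archive unpacks into a single subject folder wherever it is extracted.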
The psychology department server is useful for sharing things like stimulus sets, experiment code not on GitHub, etc. However, as a long-term data storage solution, it has disadvantages compared to the Corral/Ranch setup, which stores all data in a single place that's easier to track. Going forward, make sure that all experiment data, including behavior-only experiments, is on Corral or UTBox.
UT Box comes with unlimited storage, which can be accessed using a desktop app, the website, or through FTP. Using FTP requires initial setup: see instructions.
- You need to set up a password first, as FTP doesn't support single sign-on. This can be done in your UT Box account through the web interface. See Account Settings>Authentication.
- Get a program that can run FTPS (not to be confused with FTP or SFTP).
  - For transferring files from a desktop computer, you can use the free program FileZilla.
  - For transferring files from a TACC account, you can use LFTP.
LFTP allows syncing a local directory with a remote directory:
lftp -c "open -u [your eid]@eid.utexas.edu,[your Box password] ftp.box.com; mirror -R [local directory] [destination directory]"
Every time that command is run, it checks each file for updates and pushes them to Box as needed.
For LFTP transfers, you may want to tar and gzip your subject directories, as large numbers of files make the transfer very slow. The max file size is 15GB, so if your combined subject directory is larger than that, tar and gzip the subfolders separately instead. Alternatively, if you have large archive files that need to be transferred, you can split them using split-files:
split-files file1 file2 ...
or
split-files *.tar.gz
to split all archive files in the current directory.
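How split-files works internally is not shown here. As a rough illustration of the general idea only, the sketch below uses the standard split and cat utilities as a stand-in: it cuts a file into fixed-size chunks and checks that the chunks reassemble into the original. The file names and the tiny chunk size are arbitrary demo values.

```shell
set -eu

work_dir=$(mktemp -d)
cd "$work_dir"

# Stand-in for a large archive: about 1 MB of random bytes.
head -c 1000000 /dev/urandom > big_archive.tar.gz

# Split into chunks of at most 400 KB each (an arbitrary size for this
# demo; a real upload limit would be far larger).
split -b 400000 big_archive.tar.gz big_archive.tar.gz.part_

# To restore, concatenate the chunks in order and compare to the original.
cat big_archive.tar.gz.part_* > restored.tar.gz
cmp big_archive.tar.gz restored.tar.gz && echo "chunks reassemble correctly"
```

The chunk suffixes that split generates (aa, ab, ac, ...) sort alphabetically, which is why a plain glob puts them back in the right order.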
All raw data from all experiment types (behavioral, fMRI, ECoG) should be stored on Corral, unless it's already been archived on both Ranch and UTBox.
During data collection for a study, push raw files to your work or scratch directory. From there, you can place each subject's raw files in the standard location on Corral for backup:
module use /work/IRC/ls5/opt/local/modules
module load archive
backup_raw.sh [subject_dicom_dir] [studytype] [study] [subject]
For example, to back up subject remind_202a from the remind fMRI study:
backup_raw.sh remind_202a fmri remind remind_202a
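The study type argument corresponds to one of the raw subdirectories on Corral (behav, fmri, or ecog). The helper below is a hypothetical sketch, not part of the lab's archive module, that checks the study type and prints the study's raw directory. How backup_raw.sh arranges files inside that directory is not covered here.

```shell
# Hypothetical helper (not part of the lab's archive module): print the
# raw-data directory for a study, rejecting unknown study types.
raw_study_dir() {
    studytype=$1
    study=$2
    case "$studytype" in
        behav|fmri|ecog) ;;
        *)
            echo "unknown study type: $studytype" >&2
            return 1
            ;;
    esac
    echo "/corral-repl/utexas/prestonlab/raw/$studytype/$study"
}

raw_study_dir fmri remind  # prints /corral-repl/utexas/prestonlab/raw/fmri/remind
```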
Once a study has completed and the raw files are not currently needed for analysis, the raw files may be deleted from Corral to save space. Before doing this, check with the lab's TACC administrator to make sure that all the raw files on Corral have been backed up. The raw data archives for each subject will be pushed to both Ranch and UTBox.
You can sync your analysis results to UTBox at any time (see the "UT Box" section above for instructions on setting up your account for this). You can "mirror" a directory on Corral, work, or scratch with UTBox using lftp:
lftp -c "open -u [your eid]@eid.utexas.edu,[your Box password] ftp.box.com; mirror -R [local directory] Preston_Lab_Data/[data_type]/[study_type]/[study_name]"
There are different directories for raw, archive, or BIDS data. Raw data archives should ideally only include the data and nothing else. Backups of analysis results should be stored in the archive directory, and may or may not include raw data. For example, to back up analysis results from the fMRI study bender that are stored in your work directory, you might run:
lftp -c "open -u [your eid]@eid.utexas.edu,[your Box password] ftp.box.com; mirror -R $WORK/bender Preston_Lab_Data/archive/fmri/bender"
When syncing analysis results to UTBox, you might want to first archive the files using tar. This will make file transfers run faster, and this also means that you can copy the same archive to both UTBox and Ranch without changes (Ranch strongly discourages sending small files, so directories should be compressed into .tar.gz files). You can archive all files together in one file (see below for directions), or you can have a separate file for different parts, such as one file for each participant. Keep in mind that UTBox has a file size limit of 50GB (as of 2022-05-20), so if you have a large archive you will have to split it into chunks (see below).
On the other hand, if you do not make tar archive files at this step and instead sync the files as they are, you can re-run the sync to send new files as needed. This would not be possible if the files are archived.
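If you re-run the same mirror regularly, it may help to build the lftp command string from variables rather than retyping it. The sketch below is hypothetical: BOX_USER and BOX_PASS are assumed variable names, the account and password are placeholders, and the function only prints the command so it can be inspected before being passed to lftp -c. Paths containing spaces are not handled.

```shell
# Hypothetical wrapper around the mirror command shown above. BOX_USER and
# BOX_PASS are assumed variable names, not something the lab defines.
box_mirror_command() {
    local_dir=$1
    remote_dir=$2
    printf 'open -u %s,%s ftp.box.com; mirror -R %s %s\n' \
        "$BOX_USER" "$BOX_PASS" "$local_dir" "$remote_dir"
}

# Placeholder credentials for the demo only.
BOX_USER="abc123@eid.utexas.edu"
BOX_PASS="example-password"

# Print the command for inspection. To run it for real:
#   lftp -c "$(box_mirror_command [local directory] [destination directory])"
box_mirror_command "${WORK:-$HOME}/bender" Preston_Lab_Data/archive/fmri/bender
```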
Either way, add an entry to the lab data archive spreadsheet to record information about your archive and what is in it (ask the lab manager or TACC delegate if you don't have access to the spreadsheet). This step is critical for long-term management of the lab's data.
Note that only people with active EIDs can use UTBox. The lab manager or TACC delegate can help you copy files if you need a dataset but do not have access to UTBox.
After results have been published, results files can be archived to Ranch. Give your archive a name with the project name followed by the name of the journal; for example, bender_jneurosci. To archive on Ranch, you'll need an archive file or multiple archive files. For example, to archive a directory called "bender" in your current directory:
mkdir $SCRATCH/bender_jneurosci
tar -cvzf $SCRATCH/bender_jneurosci/bender_archive.tar.gz bender
This example temporarily stores the .tar.gz file in your $SCRATCH directory so it doesn't count against your quota.
Once the archive has been created, check its size using, e.g., ls -lh $SCRATCH/bender_jneurosci. If the archive is larger than 50GB, you will have to split it up. To do this, run split-files $SCRATCH/bender_jneurosci/bender_archive.tar.gz to split the single archive into smaller chunks.
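The size check can also be scripted. The function below is a minimal sketch with hypothetical names; it compares a file's size in bytes against a limit. The demo uses a small file and a small limit. For the 50GB limit, the threshold would be 50 * 1024^3 bytes if the limit is counted in binary gigabytes, which is an assumption.

```shell
set -eu

# Return success if the file is larger than the given limit in bytes.
exceeds_limit() {
    file=$1
    limit_bytes=$2
    size_bytes=$(wc -c < "$file" | tr -d ' ')
    [ "$size_bytes" -gt "$limit_bytes" ]
}

# Demo: a 2048-byte file checked against a 1024-byte limit.
demo_file=$(mktemp)
head -c 2048 /dev/zero > "$demo_file"

if exceeds_limit "$demo_file" 1024; then
    echo "too large: split before uploading"
else
    echo "small enough to upload as is"
fi
```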
Finally, set permissions correctly and push the directory to Ranch using scp:
chmod -R 770 $SCRATCH/bender_jneurosci # give access to users in our group, but not others
scp -r $SCRATCH/bender_jneurosci [username]@ranch.tacc.utexas.edu:stornext/archive/fmri
In this example, we're storing an archive of fmri results, so we place the directory under stornext/archive/fmri.
If you haven't already archived your files on Box, you can mirror the .tar.gz file(s) there:
lftp -c "open -u [your eid]@eid.utexas.edu,[your Box password] ftp.box.com; mirror -R $SCRATCH/bender_jneurosci Preston_Lab_Data/archive/fmri/bender_jneurosci"
To send an individual file to a directory:
lftp -c "open -u [your eid]@eid.utexas.edu,[your Box password] ftp.box.com; put -O Preston_Lab_Data/[destination directory] [file to send]"
You now have two remote copies of the archived results. After you confirm that the transfers to Ranch and Box were successful (you can check the Box results by logging into your account on a web browser, while Ranch can be accessed via ssh), you may now delete the local copy of the archive. For example:
cd $SCRATCH
rm -rf bender_jneurosci
If you no longer need the results files locally, you can also delete the original directory to save space.
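One way to confirm that a copy is intact before deleting anything is to record checksums before the transfer and verify them against a retrieved copy. This is a sketch under assumptions: it relies on sha256sum from GNU coreutils being available, and a second temporary directory stands in for a copy downloaded from Ranch or Box. The file names are examples only.

```shell
set -eu

local_dir=$(mktemp -d)
cd "$local_dir"
echo "example results" > bender_archive.tar.gz

# Record a checksum for the archive before transferring it.
sha256sum bender_archive.tar.gz > checksums.sha256

# Stand-in for a copy retrieved from Ranch or Box.
copy_dir=$(mktemp -d)
cp bender_archive.tar.gz checksums.sha256 "$copy_dir/"

# Verify the copy against the recorded checksums.
if (cd "$copy_dir" && sha256sum -c checksums.sha256); then
    echo "copy verified"
else
    echo "checksum mismatch: keep the local copy" >&2
fi
```

Storing the checksum file alongside the archive means the same check can be repeated later, whenever the archive is retrieved.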
Finally, add an entry to the lab's data archive spreadsheet, indicating what type of archive it is, what project and paper it corresponds to, etc.
You can use tar to create an archive of behavioral data. This can either be done separately for each subject, or in one archive for the whole dataset. For example:
tar -czvf flexo.tar.gz flexo
After the archive is finished, place the data on Corral in the standard location. If this is a purely behavioral study (or a pilot for an fMRI study with no fMRI data collected):
ssh [username]@data.tacc.utexas.edu "mkdir /corral-repl/utexas/prestonlab/raw/behav/flexo"
scp flexo.tar.gz [username]@data.tacc.utexas.edu:/corral-repl/utexas/prestonlab/raw/behav/flexo
If this is behavior for a scanning study, instead place the archive under /corral-repl/utexas/prestonlab/raw/fmri/[studyname].