umccr/genocrate

genocrate

Overview

Genocrate defines RO-Crate profiles and provides validators for packaging genomic datasets for research studies. In a typical genomics project, data for multiple participants arrives in batches, with each batch grouped by participant identifier and reflecting when the data was sequenced or transferred. Multiple batches or sequencing runs usually belong to the same study. Genocrate helps you build an RO-Crate for each batch so that its files are well documented, and an overarching crate that organizes and tracks all files across the batches in a study.

Expected Output Structure

Below is an example of the output folder structure generated by Genocrate. For more details, see the layout description:

./tests/fixtures/test-batches
├── batch-001
│   ├── bag-info.txt
│   ├── bagit.txt
│   ├── data
│   │   ├── (genomic file set ...)
│   │   └── ro-crate-metadata.json
│   ├── manifest-md5.txt
│   └── tagmanifest-md5.txt
├── batch-002
│   ├── bag-info.txt
│   ├── bagit.txt
│   ├── data
│   │   ├── (genomic file set ...)
│   │   └── ro-crate-metadata.json
│   ├── manifest-md5.txt
│   └── tagmanifest-md5.txt
├── batch-003
│   ├── bag-info.txt
│   ├── bagit.txt
│   ├── data
│   │   ├── (genomic file set ...)
│   │   └── ro-crate-metadata.json
│   ├── manifest-md5.txt
│   └── tagmanifest-md5.txt
├── ro-crate-metadata.json
└── ro-crate-preview.html
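
The layout above can be checked with a few lines of Python before running any validator. This is a minimal sketch, not part of Genocrate: the function names `missing_batch_entries` and `summarize_study` are hypothetical, and the list of expected entries is taken only from the example tree, so it may not match every rule the profiles enforce.

```python
from pathlib import Path

# Entries that every batch folder contains in the example tree above.
EXPECTED_ENTRIES = (
    "bagit.txt",
    "bag-info.txt",
    "manifest-md5.txt",
    "tagmanifest-md5.txt",
    "data/ro-crate-metadata.json",
)


def missing_batch_entries(batch_dir):
    """Return the expected entries that are absent from one batch folder."""
    root = Path(batch_dir)
    return [name for name in EXPECTED_ENTRIES if not (root / name).is_file()]


def summarize_study(study_dir):
    """Map each batch sub-folder name to its list of missing entries."""
    root = Path(study_dir)
    return {
        child.name: missing_batch_entries(child)
        for child in sorted(root.iterdir())
        if child.is_dir()
    }
```

Calling `summarize_study("./tests/fixtures/test-batches")` would return one key per batch folder, with an empty list for each batch that has the full layout.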

Profiles

Two profiles describe genomic file sets:

  • batch-submission: Describes a smaller set of genomic files submitted as a batch, typically representing data generated or transferred together as part of the same study.
  • study-dataset: Describes the complete dataset for a study, aggregating information from multiple batch RO-Crates to provide an overview of all files and participants in the study.

CLI suite

  • build: Create a root RO-Crate, or merge an existing root RO-Crate with a new batch RO-Crate, so that the result conforms to the study-dataset profile. This command reads through batch submission crates to assemble or update the study-level crate.
  • csv2genocrate: Convert a CSV manifest file into an RO-Crate that conforms to the batch-submission profile within the batch submission folder. This command also validates checksums defined in the CSV.
  • diff: Show differences between the root study-dataset RO-Crate and a new batch RO-Crate before merging, helping you review changes prior to running the build command.
  • validate-batch: Validate that a folder conforms to the batch-submission RO-Crate profile, including checks for checksums and BagIt specification compliance.
  • validate-dataset: Validate the study-level RO-Crate against the study-dataset profile. Skips content / integrity checks (e.g., checksums, BagIt) handled by validate-batch.

For more details on each command, see the CLI documentation.

Installation

Install this tool using pip:

pip install genocrate

Usage

For help, run:

genocrate --help

You can also use:

python -m genocrate --help

CLI Documentation

Detailed command-line documentation is available in the CLI docs.

Example

Example commands for validating an RO-Crate batch:

Successful validation (batch-submission with BagIt)

genocrate validate-batch ./tests/fixtures/test-batches/batch-001 -t bagit
RO-Crate metadata is valid!
Validating for BagIt compliance (./tests/fixtures/test-batches/batch-001)
Validation successful!
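
To run the same validation from a script, for example over several batch folders, the invocation above can be wrapped with `subprocess`. This is a sketch under stated assumptions: the helper names `validate_batch_command` and `run_validate_batch` are hypothetical, only the `validate-batch` subcommand and the `-t` option shown above are used, and the meaning of the exit code is not documented here, so the caller should inspect the captured output rather than rely on it.

```python
import subprocess
import sys


def validate_batch_command(batch_dir, integrity_type="bagit"):
    """Build the argument list for the validate-batch invocation shown above."""
    return [
        sys.executable, "-m", "genocrate",
        "validate-batch", str(batch_dir),
        "-t", integrity_type,
    ]


def run_validate_batch(batch_dir, integrity_type="bagit"):
    """Run validate-batch and return (exit_code, combined_output).

    Assumption: genocrate is installed in the current interpreter's environment.
    """
    result = subprocess.run(
        validate_batch_command(batch_dir, integrity_type),
        capture_output=True,
        text=True,
        check=False,  # do not raise; let the caller decide how to react
    )
    return result.returncode, result.stdout + result.stderr
```

Using `sys.executable -m genocrate` keeps the call inside the active virtual environment instead of depending on whichever `genocrate` happens to be on `PATH`.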

Failure: File present in directory but missing from RO-Crate metadata

genocrate validate-batch ./tests/fixtures/batch-004 --skip-integrity-validation
Detected issue of severity REQUIRED with check "batch-submission_1.1": Physical file 'A001.vcf.tbi' present in directory but not declared as File entity in ro-crate-metadata.json
RO-Crate metadata is invalid!
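
The kind of check reported above can be illustrated by comparing the files on disk with the `File` entities in `ro-crate-metadata.json`. This is a minimal sketch, not Genocrate's implementation: the function names are hypothetical, and it assumes each `File` entity's `@id` is a plain relative path from the folder that holds the metadata file (no `./` prefix or URL encoding).

```python
import json
from pathlib import Path

METADATA_NAME = "ro-crate-metadata.json"


def declared_files(crate_dir):
    """Return the @id of every entity typed as File in the crate metadata."""
    text = (Path(crate_dir) / METADATA_NAME).read_text(encoding="utf-8")
    declared = set()
    for entity in json.loads(text).get("@graph", []):
        types = entity.get("@type", [])
        if isinstance(types, str):  # @type may be a string or a list
            types = [types]
        if "File" in types:
            declared.add(entity["@id"])
    return declared


def undeclared_files(crate_dir):
    """Return files present on disk but not declared as File entities."""
    root = Path(crate_dir)
    on_disk = {
        path.relative_to(root).as_posix()
        for path in root.rglob("*")
        if path.is_file() and path.name != METADATA_NAME
    }
    return sorted(on_disk - declared_files(root))
```

In the batch layout used here the metadata file lives under `data`, so `crate_dir` would be the batch's `data` folder.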

Failure: Integrity check fails due to bad MD5 manifest

Note: The actual batch-004 in the repository is intentionally a bad RO-Crate. For this example it was modified to have valid RO-Crate metadata but an invalid MD5 manifest, as shown in the example below.

genocrate validate-batch ./tests/fixtures/batch-004/bad-manifest-md5.txt -t md5
RO-Crate metadata is valid!
Validating files from md5sum (./tests/fixtures/batch-004/bad-manifest-md5.txt)
[
    "File data/A001.vcf.tbi not found in manifest",
    "File data/A001.vcf not found in manifest",
    "File data/A001.bam.bai not found in manifest",
    "MD5 mismatch for data/ro-crate-metadata.json: expected b7410ecf0fd48789ab7b618a519cdaaf, calculated a9650294eb3d10acbff3926769ae800e",
    "MD5 mismatch for data/A001.bam: expected d6d0c756fb8abfb33e652a20XXXXXXXX, calculated d6d0c756fb8abfb33e652a20e85b70bc"
]
MD5 checksum validation failed.
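
The two kinds of problem in this output, files missing from the manifest and checksum mismatches, can be reproduced with a short standard-library sketch. It is an illustration, not Genocrate's code: the function names are hypothetical, and it assumes an md5sum-style manifest (`<checksum>  <relative path>` per line) with payload files under `data`.

```python
import hashlib
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_manifest(manifest_path):
    """Parse '<checksum>  <path>' lines into a {path: checksum} mapping."""
    expected = {}
    for line in Path(manifest_path).read_text(encoding="utf-8").splitlines():
        parts = line.strip().split(maxsplit=1)
        if len(parts) != 2:  # skip blank or malformed lines
            continue
        checksum, name = parts
        expected[name.lstrip("*")] = checksum.lower()
    return expected


def check_manifest(bag_dir, manifest_path):
    """Return a list of problems for payload files under <bag_dir>/data."""
    root = Path(bag_dir)
    expected = parse_manifest(manifest_path)
    problems = []
    for path in sorted((root / "data").rglob("*")):
        if not path.is_file():
            continue
        rel = path.relative_to(root).as_posix()
        if rel not in expected:
            problems.append(f"File {rel} not found in manifest")
            continue
        actual = file_md5(path)
        if actual != expected[rel]:
            problems.append(
                f"MD5 mismatch for {rel}: expected {expected[rel]}, calculated {actual}"
            )
    return problems
```

An empty list means every payload file is listed and matches its checksum. The sketch does not report manifest entries whose files are absent from disk.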

Development

To contribute to this tool, first check out the code. Then create a new virtual environment:

uv venv
source .venv/bin/activate

Now install the dependencies and test dependencies:

pip install -e '.[test]'

To run the tests:

python -m pytest

To update the CLI docs:

python ./scripts/generate_cli_docs.py

Testing

Tests are located in the tests/ folder. Run them with pytest -rA, which prints a summary such as:

================================================================= short test summary info ==================================================================
PASSED tests/test_build.py::test_build_root_ro_crate - Validates that the 'build' command generates a correct RO-Crate
PASSED tests/test_csv2crate.py::test_csv2genocrate - Test 'csv2genocrate' command output a valid RO-Crate for a given csv manifest file.
PASSED tests/test_diff.py::test_diff_crate - Test 'diff' command is picking up the changes correctly
PASSED tests/test_genocrate.py::test_version
PASSED tests/test_validate.py::test_validate_batch_valid_bagit - Test 'validate-batch' command succeeds for a batch conforming to the BagIt specification.
PASSED tests/test_validate.py::test_validate_batch_invalid_bagit - Test 'validate-batch' command fails for a batch NOT conforming to the BagIt specification.
PASSED tests/test_validate.py::test_validate_batch_valid_md5 - Test 'validate-batch' command succeeds when the manifest is valid and all files are listed in the manifest.
PASSED tests/test_validate.py::test_validate_batch_invalid_md5 - Test 'validate-batch' when the manifest failed (incorrect checksum or files not listed).
PASSED tests/test_validate.py::test_validate_invalid_ro_crate - Test 'validate-batch' command fails when the RO-Crate metadata does not list all files.
