Genocrate defines RO-Crate profiles and provides validators for packaging genomic datasets for research studies. In typical genomics projects, data for multiple participants is often organized into batches, with the files in each batch grouped by participant identifier and each batch reflecting a sequencing run or a data transfer. Multiple batches or sequencing runs usually belong to the same study. Genocrate helps you build an RO-Crate for each batch so that its files are well documented, and an overarching crate that organizes and tracks all files across the batches in a study.
Below is an example of the output folder structure generated by Genocrate. For more details, see the layout description:
./tests/fixtures/test-batches
├── batch-001
│ ├── bag-info.txt
│ ├── bagit.txt
│ ├── data
│ │ ├── (genomic file set ...)
│ │ └── ro-crate-metadata.json
│ ├── manifest-md5.txt
│ └── tagmanifest-md5.txt
├── batch-002
│ ├── bag-info.txt
│ ├── bagit.txt
│ ├── data
│ │ ├── (genomic file set ...)
│ │ └── ro-crate-metadata.json
│ ├── manifest-md5.txt
│ └── tagmanifest-md5.txt
├── batch-003
│ ├── bag-info.txt
│ ├── bagit.txt
│ ├── data
│ │ ├── (genomic file set ...)
│ │ └── ro-crate-metadata.json
│ ├── manifest-md5.txt
│ └── tagmanifest-md5.txt
├── ro-crate-metadata.json
└── ro-crate-preview.html
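The per-batch layout shown above can be checked with a few lines of Python. This is a minimal sketch and not Genocrate's own implementation: the function name `missing_batch_files` and the list of expected files are assumptions read off the example tree.

```python
from pathlib import Path

# Assumption: these paths are taken from the example tree in this README,
# not from Genocrate's source code.
EXPECTED_BATCH_FILES = [
    "bagit.txt",
    "bag-info.txt",
    "manifest-md5.txt",
    "tagmanifest-md5.txt",
    "data/ro-crate-metadata.json",
]


def missing_batch_files(batch_dir):
    """Return the expected relative paths that are absent from batch_dir."""
    root = Path(batch_dir)
    return [rel for rel in EXPECTED_BATCH_FILES if not (root / rel).is_file()]
```

An empty list means every file from the example layout is present; it says nothing about whether the contents are valid, which is what the validate-batch command is for.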
Genocrate provides two profiles for describing genomic file sets:
- batch-submission: Describes a smaller set of genomic files submitted as a batch, typically representing data generated or transferred together as part of the same study.
- study-dataset: Describes the complete dataset for a study, aggregating information from multiple batch RO Crates to provide an overview of all files and participants in the study.
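The validation output later in this README reports a physical file that is present in a batch directory but not declared as a File entity in ro-crate-metadata.json. The sketch below illustrates that kind of check. It is not Genocrate's code: the helper names are hypothetical, and it assumes that each File entity's `@id` is a plain relative path from the data directory.

```python
import json
from pathlib import Path


def declared_file_ids(metadata_path):
    """Collect the @id of every entity typed as File in an RO-Crate metadata file."""
    graph = json.loads(Path(metadata_path).read_text(encoding="utf-8"))["@graph"]
    ids = set()
    for entity in graph:
        types = entity.get("@type", [])
        if isinstance(types, str):  # @type may be a string or a list
            types = [types]
        if "File" in types:
            ids.add(entity["@id"])
    return ids


def undeclared_files(data_dir):
    """List files under data_dir that have no matching File entity."""
    data_dir = Path(data_dir)
    declared = declared_file_ids(data_dir / "ro-crate-metadata.json")
    found = []
    for path in sorted(data_dir.rglob("*")):
        rel = path.relative_to(data_dir).as_posix()
        # The metadata file itself is not expected to be declared.
        if path.is_file() and rel != "ro-crate-metadata.json" and rel not in declared:
            found.append(rel)
    return found
```

Calling `undeclared_files` on a batch's data folder returns the relative paths that would need a File entity under this simplified reading of the profile.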
Genocrate provides the following commands:
- build: Create a root RO-Crate that conforms to the study-dataset profile, or merge a new batch RO-Crate into an existing root RO-Crate. This command reads through batch submission crates to assemble or update the study-level crate.
- csv2genocrate: Convert a CSV manifest file into an RO-Crate that conforms to the batch-submission profile within the batch submission folder. This command also validates checksums defined in the CSV.
- diff: Show differences between the root study-dataset RO-Crate and a new batch RO-Crate before merging, helping you review changes prior to running the build command.
- validate-batch: Validate that a folder conforms to the batch-submission RO-Crate profile, including checks for checksums and BagIt specification compliance.
- validate-dataset: Validate the study-level RO-Crate against the study-dataset profile. Skips content and integrity checks (e.g., checksums, BagIt) handled by validate-batch.
For more details on each command, see the CLI documentation.
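To make the checksum part of validate-batch more concrete, here is a minimal sketch of checking a batch's data folder against an MD5 manifest. It is an illustration under stated assumptions, not Genocrate's implementation: the function names are hypothetical, the manifest is assumed to hold one `checksum path` pair per line (as in a BagIt manifest-md5.txt), and the message wording simply mirrors the sample output shown in this README.

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_manifest(bag_dir, manifest_name="manifest-md5.txt"):
    """Return a list of problems found when checking data/ against the manifest."""
    bag_dir = Path(bag_dir)
    expected = {}
    for line in (bag_dir / manifest_name).read_text(encoding="utf-8").splitlines():
        if line.strip():
            checksum, rel_path = line.split(maxsplit=1)
            # md5sum marks binary mode with a leading "*" before the path.
            expected[rel_path.strip().lstrip("*")] = checksum.lower()

    problems = []
    for path in sorted((bag_dir / "data").rglob("*")):
        if not path.is_file():
            continue
        rel = path.relative_to(bag_dir).as_posix()
        if rel not in expected:
            problems.append(f"File {rel} not found in manifest")
            continue
        actual = md5_of(path)
        if actual != expected[rel]:
            problems.append(
                f"MD5 mismatch for {rel}: expected {expected[rel]}, calculated {actual}"
            )
    return problems
```

An empty list means every file under data/ is listed in the manifest with a matching checksum. The sketch does not report manifest entries whose files are missing on disk, which a full BagIt validation would also cover.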
Install this tool using pip:
pip install genocrate
For help, run:
genocrate --help
You can also use:
python -m genocrate --help
Detailed command-line documentation is available in the CLI docs.
Example commands for validating an RO-Crate batch:
genocrate validate-batch ./tests/fixtures/test-batches/batch-001 -t bagit
RO-Crate metadata is valid!
Validating for BagIt compliance (./tests/fixtures/test-batches/batch-001)
Validation successful!
genocrate validate-batch ./tests/fixtures/batch-004 --skip-integrity-validation
Detected issue of severity REQUIRED with check "batch-submission_1.1": Physical file 'A001.vcf.tbi' present in directory but not declared as File entity in ro-crate-metadata.json
RO-Crate metadata is invalid!
Note: The actual batch-004 in the repository is intentionally a bad RO-Crate. For this example it was modified to have valid RO-Crate metadata but an invalid MD5 manifest, as shown in the example below.
genocrate validate-batch ./tests/fixtures/batch-004/bad-manifest-md5.txt -t md5
RO-Crate metadata is valid!
Validating files from md5sum (./tests/fixtures/batch-004/bad-manifest-md5.txt)
[
"File data/A001.vcf.tbi not found in manifest",
"File data/A001.vcf not found in manifest",
"File data/A001.bam.bai not found in manifest",
"MD5 mismatch for data/ro-crate-metadata.json: expected b7410ecf0fd48789ab7b618a519cdaaf, calculated a9650294eb3d10acbff3926769ae800e",
"MD5 mismatch for data/A001.bam: expected d6d0c756fb8abfb33e652a20XXXXXXXX, calculated d6d0c756fb8abfb33e652a20e85b70bc"
]
MD5 checksum validation failed.
To contribute to this tool, first check out the code. Then create a new virtual environment:
uv venv
source .venv/bin/activate
Now install the dependencies and test dependencies:
pip install -e '.[test]'
To run the tests:
python -m pytest
To update the CLI docs:
python ./scripts/generate_cli_docs.py
Tests are located in the tests/ folder. Run them using pytest -rA:
================================================================= short test summary info ==================================================================
PASSED tests/test_build.py::test_build_root_ro_crate - Validates that the 'build' command generates a correct RO-Crate
PASSED tests/test_csv2crate.py::test_csv2genocrate - Test 'csv2genocrate' command output a valid RO-Crate for a given csv manifest file.
PASSED tests/test_diff.py::test_diff_crate - Test 'diff' command is picking up the changes correctly
PASSED tests/test_genocrate.py::test_version
PASSED tests/test_validate.py::test_validate_batch_valid_bagit - Test 'validate-batch' command succeeds for a batch conforming to the BagIt specification.
PASSED tests/test_validate.py::test_validate_batch_invalid_bagit - Test 'validate-batch' command fails for a batch NOT conforming to the BagIt specification.
PASSED tests/test_validate.py::test_validate_batch_valid_md5 - Test 'validate-batch' command succeeds when the manifest is valid and all files are listed in the manifest.
PASSED tests/test_validate.py::test_validate_batch_invalid_md5 - Test 'validate-batch' when the manifest failed (incorrect checksum or files not listed).
PASSED tests/test_validate.py::test_validate_invalid_ro_crate - Test 'validate-batch' command fails when the RO-Crate metadata does not list all files.