Snakemake Modularization and Harmonization#2667
Conversation
Format Snakemake workflow files
…akemake_module I am not sure why this merge is needed
Reduce redundancy
Fixing merge conflicts
…r}/params.json, returns schema dict, no-op if file absent.
…it__.py; WorkflowSuperClass.init() now derives them from params.json schema automatically
…d general_params.extend + default_config.update from contigs/__init__.py.
…ypes at config load time for any workflow with a params.json.
There was a problem hiding this comment.
Holds the rules' params in echophylo. This file can be added to or edited as desired!
There was a problem hiding this comment.
Rules related to long read assembly
|
Workflow retested after changes with component tests, and all are still passing except for ecophylo since the component tests are broken. |
|
@FlorianTrigodet Also made this which may be of use (have the svg if you wanted that) |
|
With this Everything should be good to go in this PR and is fully finished (pending reviewer review) |
|
Hey @Kekananen, and others, this is quite a substantial undertaking. Pulling this together in a few days is an amazing feat by itself. Even though I already have commented on or dealt with many of the PRs from last week, I didn't have a chance to go through this one carefully (as (1) this requires much more care and (2) I am still trying to organize my thoughts regarding how to do this best). But I just wanted to post this so you know it by no means has fallen through the cracks if you have been getting notifications from other activities in the repo :) It will likely take a few days, and the luxury of testing and merging this will probably fall on @FlorianTrigodet given his familiarity with our snakemake workflows and my looming administrative duties, but we will get there! HI FLORIAN! AND THANK YOU AGAIN @Kekananen, @btrouche, @katysloz, and @l-gallucci! |
|
This is such a great PR, and I will carefully go through everything (with my friend Claude as well). For now, I'm exploring the rest of the branch and I'll push any fix if needed :) |
Yes! You are correct. This was a quick fix I did for an issue (#2606) with numpy and gene calls. I wasn't able to test it however as the issue wasn't able to be reproduced / I unfortunately didn't have time to generate a test dataset mimicking this problem. I'll make a smaller separate PR for this code and remove it from snakemake PR. |
This reverts commit 8ed0185.
|
Okay! The changes relating to the gene calling numpy issue should be reverted and no longer in this PR. |
The metagenomics params.json calls the QC-report rule as `gen_qc_report`, but the actual Snakemake rule (qc/rules/sr_filter.smk) and the QCModule rule registration use `metagenomics_gen_qc_report`. As a result, the config produced by `anvi-run-workflow -w metagenomics --get-default-config` contained a `gen_qc_report` that is not in self.rules, so ConfigError.
Older config files will have "" for that flag, but the new default is None, so the check ""!=None is not returning what we expect and triggers a warning message. So it is mostly a cosmetic change.
|
The newly added unit test break the anvi'o convention:
I can see that they could be useful but I don't know if they are worth breaking the current tests philosophy. They are also not wired to anything (as far as I can tell), so they are a bit orphan. Either we go full in, we wire it to the daily tests and we update the ARCHITECTURE.md and the convention, or we move/remove them for another PR about how tests are run in anvi'o. But the current format has to be addressed. For everyone associated with that branch: let me know what you think :D |
Good point, @FlorianTrigodet. I agree with the idea that we either go full, or refrain from partial coverage of code with unit tests that will -almost certainly- not be maintained by anyone other than LLMs. I would vote for carefully designed and up-to-date component tests than boilerplate unit tests that checks the box "we have tests" but does not add any real value to maintain a codebase that works. My 2 cents. |
|
I went through the newly added long-read support for the workflow and I want to spend some careful time testing it and changing some things / discuss features with @Kekananen @katysloz @btrouche who worked on it (let me know if I forgot any contributors). But fundamentally, it is separate from the snakemake refactor. I don't want to delay the review and merge of the refactor since it has timely consequences with other PRs. So I will move the LR changes to a new branch/PR and remove the unit test for now (and open an issue to discuss the need for a different test philosophy). |
The workflows were initially hard to maintain for developers as well as not as easy to add additional workflows to. The workflow passes the component test for each workflow in the
anvio/tests/directory.Related to issue #2611 and #2606.
General Change Summary:
- Example:
metagenomics/Snakefilenow includes files likerules/assembly_short_reads.smk,rules/read_filtering.smk,rules/taxonomy.smk.- makes workflows easier to review, test, and extend without editing one massive file.
- Example:
anvio/workflows/config.py,paths.py,registry.py,snakemake_utils.py.- moves pure helpers out of
__init__.py, reducing clutter and making ownership clearer.- Example: Snakefiles can still call
w.get_workflow_rule_file_path(...).- existing workflow code does not need broad rewrites while internals become cleaner.
- Example:
anvio/workflows/metagenomics/params.jsondeclares accepted params, types, and defaults.- workflow configuration becomes data-driven instead of scattered through Python classes.
- Example:
get_default_config()reads defaults from params.json.- reduces duplication between “allowed params” and “default params.”
- Example: "threads": "many" now fails early if schema expects int.
- users get clear config errors before Snakemake builds a DAG.
- Example: metagenomics can inherit contigs/read-recruitment-style params while adding its own.
- shared rules can travel with shared config contracts.
- Example: contigs params moved from
ContigsDBWorkflow.__init__()intocontigs/params.json.- reduces drift between implementation, defaults, and validation.
- Example:
anvio/workflows/read_recruitment/rules/main.smk.- mapping, BAM initialization, profiling, and merging can be reused by workflows instead of rewritten.
- ample: assembly, binning, taxonomy, read filtering, and postprocessing now live in separate .smk files.
- makes metagenomics workflow changes less risky and easier to isolate.
- Example:
anvio/workflows/qc/with SR/LR QC rules.- QC can become shared workflow infrastructure rather than metagenomics-only code.
- Example: LongQC and Filtlong rules in
workflows/qc/rules/lr.smk.- mixed short/long-read workflows can support long-read preprocessing more cleanly.
- Example: samples_txt can distinguish long-read technologies for appropriate tool flags.
- avoids hard-coded assumptions about long-read data type.
- Example: group/readset logic tracks SR and LR readsets separately.
- avoids mixing incompatible read types in assembly and mapping steps.
- Example: minimap2 presets can vary by readset technology.
- long-read mapping uses more appropriate parameters.
- Example:
readset=READSETS_REin read-recruitment rules.- prevents Snakemake recursion like sample-RAW-RAW-RAW.bam.
- Example: metagenomics references mode preserves .fa.gz input paths.
- prevents missing input errors where Snakemake looked for .fa instead of .fa.gz.
- Example: ignores comments, shell keywords, and {params.env_prefix} placeholders.
- dependency reports no longer claim # or {params.env_prefix} are missing programs.
- Example:
workflows/tests/unit/test_biological_workflow_contracts.py.- catches missing rule files, broken includes, and structural workflow regressions earlier.
- Example:
workflows/tests/unit/test_dependencies.py.- protects shell dependency parsing behavior from regressions.
manifest/logtests- Example:
test_metagenomics_workflow_logs_manifest.py.- verifies workflow logging/manifest behavior as workflows become more modular.
- Example: helper functions and rules now describe their purpose.
- makes the new modular structure easier for future developers to navigate.
- Example: simple run: shell(...) blocks converted to shell: where possible.
- makes Snakemake rules more declarative and easier for Snakemake to inspect.
- Example:
sra_download/rules/download.smk.- separates download/checksum/compression logic from workflow setup.
- Example:
trnaseq/rules/identification.smk,qc.smk,merge_taxonomy.smk.- makes tRNA-seq stages easier to understand and modify independently.