Skip to content

Snakemake Modularization and Harmonization#2667

Open
Kekananen wants to merge 70 commits into
merenlab:masterfrom
Kekananen:snakemake_module
Open

Snakemake Modularization and Harmonization#2667
Kekananen wants to merge 70 commits into
merenlab:masterfrom
Kekananen:snakemake_module

Conversation

@Kekananen

@Kekananen Kekananen commented May 21, 2026

Copy link
Copy Markdown
Contributor

The workflows were initially hard to maintain for developers as well as not as easy to add additional workflows to. The workflow passes the component test for each workflow in the anvio/tests/ directory.

Related to issue #2611 and #2606.

General Change Summary:

  • Split large Snakefiles into rule modules
    - Example: metagenomics/Snakefile now includes files like rules/assembly_short_reads.smk, rules/read_filtering.smk, rules/taxonomy.smk.
    - makes workflows easier to review, test, and extend without editing one massive file.
  • Added reusable workflow helper modules
    - Example: anvio/workflows/config.py, paths.py, registry.py, snakemake_utils.py.
    - moves pure helpers out of __init__.py, reducing clutter and making ownership clearer.
  • Kept old anvio.workflows as w API working
    - Example: Snakefiles can still call w.get_workflow_rule_file_path(...).
    - existing workflow code does not need broad rewrites while internals become cleaner.
  • Added params.json schemas
    - Example: anvio/workflows/metagenomics/params.json declares accepted params, types, and defaults.
    - workflow configuration becomes data-driven instead of scattered through Python classes.
  • Default configs now come from schema
    - Example: get_default_config() reads defaults from params.json.
    - reduces duplication between “allowed params” and “default params.”
  • Added config type validation
    - Example: "threads": "many" now fails early if schema expects int.
    - users get clear config errors before Snakemake builds a DAG.
  • Added inherited schema merging
    - Example: metagenomics can inherit contigs/read-recruitment-style params while adding its own.
    - shared rules can travel with shared config contracts.
  • Removed duplicated config dictionaries from workflow classes
    - Example: contigs params moved from ContigsDBWorkflow.__init__() into contigs/params.json.
    - reduces drift between implementation, defaults, and validation.
  • Added reusable read-recruitment module
    - Example: anvio/workflows/read_recruitment/rules/main.smk.
    - mapping, BAM initialization, profiling, and merging can be reused by workflows instead of rewritten.
  • Modularized metagenomics rules
    - ample: assembly, binning, taxonomy, read filtering, and postprocessing now live in separate .smk files.
    - makes metagenomics workflow changes less risky and easier to isolate.
  • Added reusable QC module
    - Example: anvio/workflows/qc/ with SR/LR QC rules.
    - QC can become shared workflow infrastructure rather than metagenomics-only code.
  • Added long-read QC support
    - Example: LongQC and Filtlong rules in workflows/qc/rules/lr.smk.
    - mixed short/long-read workflows can support long-read preprocessing more cleanly.
  • Added lr_technology sample metadata
    - Example: samples_txt can distinguish long-read technologies for appropriate tool flags.
    - avoids hard-coded assumptions about long-read data type.
  • Improved mixed SR/LR metagenomics handling
    - Example: group/readset logic tracks SR and LR readsets separately.
    - avoids mixing incompatible read types in assembly and mapping steps.
  • Added per-readset minimap2 preset dispatch
    - Example: minimap2 presets can vary by readset technology.
    - long-read mapping uses more appropriate parameters.
  • Constrained readset wildcards
    - Example: readset=READSETS_RE in read-recruitment rules.
    - prevents Snakemake recursion like sample-RAW-RAW-RAW.bam.
  • Fixed gzipped reference FASTA handling
    - Example: metagenomics references mode preserves .fa.gz input paths.
    - prevents missing input errors where Snakemake looked for .fa instead of .fa.gz.
  • Improved shell dependency parsing
    - Example: ignores comments, shell keywords, and {params.env_prefix} placeholders.
    - dependency reports no longer claim # or {params.env_prefix} are missing programs.
  • Added workflow contract tests
    - Example: workflows/tests/unit/test_biological_workflow_contracts.py.
    - catches missing rule files, broken includes, and structural workflow regressions earlier.
  • Added dependency parser tests
    - Example: workflows/tests/unit/test_dependencies.py.
    - protects shell dependency parsing behavior from regressions.
  • Added metagenomics manifest/log tests
    - Example: test_metagenomics_workflow_logs_manifest.py.
    - verifies workflow logging/manifest behavior as workflows become more modular.
  • Added comments/docstrings
    - Example: helper functions and rules now describe their purpose.
    - makes the new modular structure easier for future developers to navigate.
  • Harmonized simple shell rules
    - Example: simple run: shell(...) blocks converted to shell: where possible.
    - makes Snakemake rules more declarative and easier for Snakemake to inspect.
  • Modularized SRA download workflow
    - Example: sra_download/rules/download.smk.
    - separates download/checksum/compression logic from workflow setup.
  • Modularized tRNA-seq workflow
    - Example: trnaseq/rules/identification.smk, qc.smk, merge_taxonomy.smk.
    - makes tRNA-seq stages easier to understand and modify independently.

Kekananen and others added 30 commits May 19, 2026 02:39
Format Snakemake workflow files
…akemake_module

I am not sure why this merge is needed
…r}/params.json, returns schema dict, no-op if file absent.
…it__.py; WorkflowSuperClass.init() now derives them from params.json schema automatically
…d general_params.extend + default_config.update from contigs/__init__.py.
…ypes at config load time for any workflow with a

  params.json.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Holds the rules' params in echophylo. This file can be added to or edited as desired!

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rules related to long read assembly

@ivagljiva ivagljiva added the hackathon 2026 ideas, code, and plans related to the 2026 anvi'o hackathon label May 22, 2026
@Kekananen Kekananen marked this pull request as ready for review May 22, 2026 09:44
@Kekananen Kekananen marked this pull request as draft May 22, 2026 09:47
@Kekananen Kekananen marked this pull request as ready for review May 22, 2026 10:21
@Kekananen

Copy link
Copy Markdown
Contributor Author

Workflow retested after changes with component tests, and all are still passing except for ecophylo since the component tests are broken.

@Kekananen Kekananen marked this pull request as draft May 22, 2026 10:50
@Kekananen Kekananen marked this pull request as ready for review May 22, 2026 11:17
@Kekananen

Kekananen commented May 22, 2026

Copy link
Copy Markdown
Contributor Author

@FlorianTrigodet Also made this which may be of use (have the svg if you wanted that)
anvio_workflows_layout

@Kekananen

Copy link
Copy Markdown
Contributor Author

With this Everything should be good to go in this PR and is fully finished (pending reviewer review)

@meren

meren commented May 25, 2026

Copy link
Copy Markdown
Member

Hey @Kekananen, and others, this is quite a substantial undertaking. Pulling this together in a few days is an amazing feat by itself.

Even though I already have commented on or dealt with many of the PRs from last week, I didn't have a chance to go through this one carefully (as (1) this requires much more care and (2) I am still trying to organize my thoughts regarding how to do this best). But I just wanted to post this so you know it by no means has fallen through the cracks if you have been getting notifications from other activities in the repo :)

It will likely take a few days, and the luxury of testing and merging this will probably fall on @FlorianTrigodet given his familiarity with our snakemake workflows and my looming administrative duties, but we will get there! HI FLORIAN! AND THANK YOU AGAIN @Kekananen, @btrouche, @katysloz, and @l-gallucci!

@meren meren requested a review from FlorianTrigodet May 25, 2026 09:56
@FlorianTrigodet

Copy link
Copy Markdown
Contributor

This is such a great PR, and I will carefully go through everything (with my friend Claude as well).
But first, I noticed the changes to numpy integers in external gene calls. This is not related to this PR and should be a small, dedicated PR instead.

For now, I'm exploring the rest of the branch and I'll push any fix if needed :)

@Kekananen

Copy link
Copy Markdown
Contributor Author

But first, I noticed the changes to numpy integers in external gene calls. This is not related to this PR and should be a small, dedicated PR instead.

Yes! You are correct. This was a quick fix I did for an issue (#2606) with numpy and gene calls. I wasn't able to test it however as the issue wasn't able to be reproduced / I unfortunately didn't have time to generate a test dataset mimicking this problem.

I'll make a smaller separate PR for this code and remove it from snakemake PR.

@Kekananen

Copy link
Copy Markdown
Contributor Author

Okay! The changes relating to the gene calling numpy issue should be reverted and no longer in this PR.

The metagenomics params.json calls the QC-report rule as `gen_qc_report`,
but the actual Snakemake rule (qc/rules/sr_filter.smk) and the QCModule rule
registration use `metagenomics_gen_qc_report`. As a result, the config produced
by `anvi-run-workflow -w metagenomics --get-default-config` contained a
`gen_qc_report` that is not in self.rules, so ConfigError.
Older config files will have "" for that flag,
but the new default is None, so the check ""!=None
is not returning what we expect and triggers a
warning message. So it is mostly a cosmetic change.
@FlorianTrigodet

Copy link
Copy Markdown
Contributor

The newly added unit test break the anvi'o convention:

There are no pytest unit tests in the traditional sense — testing is integration-based via shell scripts.
(from the ARCHITECTURE.md).

I can see that they could be useful but I don't know if they are worth breaking the current tests philosophy. They are also not wired to anything (as far as I can tell), so they are a bit orphan.

Either we go full in, we wire it to the daily tests and we update the ARCHITECTURE.md and the convention, or we move/remove them for another PR about how tests are run in anvi'o. But the current format has to be addressed.

For everyone associated with that branch: let me know what you think :D

@meren

meren commented Jun 4, 2026

Copy link
Copy Markdown
Member

Either we go full in, we wire it to the daily tests and we update the ARCHITECTURE.md and the convention, or we move/remove them for another PR about how tests are run in anvi'o. But the current format has to be addressed.

Good point, @FlorianTrigodet. I agree with the idea that we either go full, or refrain from partial coverage of code with unit tests that will -almost certainly- not be maintained by anyone other than LLMs.

I would vote for carefully designed and up-to-date component tests than boilerplate unit tests that checks the box "we have tests" but does not add any real value to maintain a codebase that works.

My 2 cents.

@FlorianTrigodet

Copy link
Copy Markdown
Contributor

I went through the newly added long-read support for the workflow and I want to spend some careful time testing it and changing some things / discuss features with @Kekananen @katysloz @btrouche who worked on it (let me know if I forgot any contributors).

But fundamentally, it is separate from the snakemake refactor. I don't want to delay the review and merge of the refactor since it has timely consequences with other PRs.

So I will move the LR changes to a new branch/PR and remove the unit test for now (and open an issue to discuss the need for a different test philosophy).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

hackathon 2026 ideas, code, and plans related to the 2026 anvi'o hackathon

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants