Skip to content

Add PyPOLCA polishing and use polished assemblies for long-read binning#1048

Merged
Harshita-sriv merged 26 commits into
nf-core:devfrom
Harshita-sriv:add-polishing-module
Jun 19, 2026
Merged

Add PyPOLCA polishing and use polished assemblies for long-read binning#1048
Harshita-sriv merged 26 commits into
nf-core:devfrom
Harshita-sriv:add-polishing-module

Conversation

@Harshita-sriv

@Harshita-sriv Harshita-sriv commented Jun 2, 2026

Copy link
Copy Markdown

Summary

This PR adds a PyPOLCA polishing module and integrates polished assemblies into the long-read MAG workflow.

Changes

  • Added a new PyPOLCA module and conda environment.
  • Added publication of PyPOLCA outputs.
  • Added the run_pypolca parameter.
  • Modified the workflow so that long-read binning uses ch_polished_assemblies instead of raw ch_longread_assemblies.

@dialvarezs

dialvarezs commented Jun 2, 2026

Copy link
Copy Markdown
Member

Hi @Harshita-sriv, thanks for this!
For the meantime, can you join the nf-core org? That way you will able to run the ci without explicit approval.
https://nf-co.re/join

@dialvarezs dialvarezs left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general I think it looks right, but needs to migrate to the nf-core module, and of course, pass the ci.

Comment thread modules/local/pypolca/main.nf Outdated
Comment thread workflows/mag.nf Outdated
Comment thread workflows/mag.nf
@Harshita-sriv

Copy link
Copy Markdown
Author

Thanks for the review. I investigated the CI failures and found that the current implementation does not define ch_polished_assemblies when run_pypolca is not enabled, which causes the test workflows to fail. I'll address this together with the requested migration to the official nf-core PyPOLCA module, removal of debug statements, and formatting fixes

@Harshita-sriv Harshita-sriv requested a review from dialvarezs June 3, 2026 09:58
@dialvarezs

Copy link
Copy Markdown
Member

Hi @Harshita-sriv! I think this looks fine, but it would be ideal to wire it up in some test. I'll think what would be the best option here and I'll come back to you.
For the meantime, can you update the changelog?
Thanks!

@d4straub d4straub left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the addition, I think thats absolutely desired!
Looks good but I have some concerns with multiple samples.

Comment thread nextflow_schema.json Outdated
Comment thread workflows/mag.nf Outdated
Comment thread conf/modules.config Outdated
@Harshita-sriv

Copy link
Copy Markdown
Author

I noticed that the remaining lint failure is due to modifications in modules/nf-core/pypolca/run, which no longer matches the upstream module.

The changes were introduced because PyPOLCA was failing during testing, and with the modifications the pipeline completes successfully on the test dataset.

Would it be acceptable to keep these modifications to the PyPOLCA module, or would you prefer that the implementation uses the upstream module unchanged and handles the required logic elsewhere?

@d4straub

d4straub commented Jun 11, 2026

Copy link
Copy Markdown
Collaborator

Would it be acceptable to keep these modifications to the PyPOLCA module, or would you prefer that the implementation uses the upstream module unchanged and handles the required logic elsewhere?

Some of the changes are not acceptable: the output block and the versioning must be kept as it was originally. Additionally the line ch_versions = ch_versions.mix(PYPOLCA_RUN.out.versions) in workflows/mag.nf has to be removed.
The incoming data (contigs) should be compressed as in the original code. Even if it would be necessary to allow non-compressed input (which may be ok in same cases), copying the files when already uncompressed seems like an unnecessary burden on IO and file sizes to me (assemblies in nf-core/mag can be huge!).

Generally, small changes on nf-core modules can be done and accepted by the linting via https://nf-co.re/docs/nf-core-tools/cli/modules/patch

@Harshita-sriv

Copy link
Copy Markdown
Author

@d4straub I restored the PyPOLCA module to the upstream nf-core implementation and moved the coassembly-specific logic into the workflow layer (workflows/mag.nf). The workflow now groups short reads appropriately when --coassemble_group is enabled before passing them to PyPOLCA.

The remaining CI issue appears to be the check_local_copy lint check for pypolca/run. Locally, the module matches the installed nf-core module definition (nf-core modules info pypolca/run), and the workflow compiles and reaches PYPOLCA_RUN in stub runs. All other CI checks pass.

Could you advise whether there is an additional module sync/patch step expected for this module?

@d4straub

Copy link
Copy Markdown
Collaborator

Looks good.
I think the easiest way of satisfying the linting is to re-install the module.

@dialvarezs dialvarezs left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi again @Harshita-sriv! This is looking good!

A few last things:

Regarding the testing setup:
I think the best option is to use the longreadonly_alternatives but changing the samplesheet so it uses the hybrid one that has LR and SR data. Then disable SPAdes, SPAdes hybrid and MEGAHIT at least. That way we can get this tested in that profile.

Also I left a comment regarding the pypolca module requiring gzipped input (that will need updating the nf-core module).

Comment thread modules/nf-core/pypolca/run/main.nf Outdated
@Harshita-sriv

Copy link
Copy Markdown
Author

Hey @dialvarezs

I'm still seeing the remaining CI failures, and I haven't been able to reproduce/update the snapshots locally. Since you made the latest changes, could you take a look and let me know whether there's anything else I should update on my side?

@dialvarezs

Copy link
Copy Markdown
Member

Hi @Harshita-sriv, it just needed to regenerate the snapshot and exclude the VCF, because it has the date.
I also realized that we missed one point: Nextflow channels are not guaranteed to be emited in the same order. CI pass because it's only one sample. I fixed that with a combine and multiMap mix.

@dialvarezs

Copy link
Copy Markdown
Member

For me this looks good to merge now. Can you give it a final look @d4straub?

@Harshita-sriv

Copy link
Copy Markdown
Author

Hey, @dialvarezs!
Thanks for fixing those issues and for the explanation! I'll take a look through the changes to understand them better. Thanks again for the review.

@d4straub d4straub left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I still found some relevant inconsistencies.

Comment thread tests/test_longreadonly_alternatives.nf.test.snap Outdated
Comment thread tests/test_longreadonly_alternatives.nf.test.snap Outdated
Comment thread workflows/mag.nf
Comment thread conf/modules.config Outdated
@Harshita-sriv Harshita-sriv force-pushed the add-polishing-module branch from 2056f30 to 5b6382a Compare June 17, 2026 15:23
@Harshita-sriv Harshita-sriv force-pushed the add-polishing-module branch from 5b6382a to c92845b Compare June 17, 2026 15:32
Comment thread tests/test_longreadonly_alternatives.nf.test.snap Outdated
@Harshita-sriv

Harshita-sriv commented Jun 18, 2026

Copy link
Copy Markdown
Author

Hey @d4straub and @dialvarezs,

I've addressed the review comments, updated MAG to the latest merged PyPOLCA module revision, and regenerated the affected snapshots. The longreadonly_alternatives test passes locally with Docker.

I had to rerun the CI jobs after canceling a workflow while investigating a long-running NF-test shard. The CI checks have now been completed successfully. 🎉

Do you think this is good to go? Please say yes! 😄

Comment on lines +53 to +55
"Assembly/PYPOLCA/minigut/minigut/minigut.report",
"Assembly/PYPOLCA/minigut/minigut/minigut.vcf",
"Assembly/PYPOLCA/minigut/minigut/minigut_pypolca.fasta"

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am still a bit concerned here: In this current test, only one assembler (FLYE) is run, produces "Assembly/FLYE/FLYE-minigut.assembly.fasta.gz" and corrected to file "Assembly/PYPOLCA/minigut/minigut/minigut_pypolca.fasta". But if two assemblers are run, FLYE and metaMDBG, both corrected assemblies will be named to "Assembly/PYPOLCA/minigut/minigut/minigut_pypolca.fasta", overwriting each other.
I think the file name must include the assembler. Or did I get something wrong?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done with the changes. meta2.assembler came in handy! I've updated the naming to preserve the assembler in the PyPOLCA outputs and regenerated the snapshots.

Comment thread conf/modules.config
Comment thread workflows/mag.nf Outdated
Comment on lines +258 to +269
ch_polished_assemblies = PYPOLCA_RUN.out.polished
.map { meta, polished ->
[meta.id, polished]
}
.join(
ch_longread_assemblies.map { meta, assembly ->
[meta.id, meta]
}
)
.map { id, polished, meta ->
[meta, polished]
}

@dialvarezs dialvarezs Jun 18, 2026

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there are multiple assemblers enabled, this join could be problematic. We're not using the assembler for the join, so, if we have two assemblies for the same sample, the join will do a cross product, producing 4 outputs instead of 2 (each combination of original assembler + poslished one).

The problem is that pypolca emits the reads meta, not the assembly one, so we can't get that info easily.

In my opinion, we should update the module so it emits the meta from the assembly, not the one from the reads. The output is more closely related to the assembly than the reads, so it makes more sense. If we have that change, we can remove this whole block, because we will have the meta from the assembly here already, and will prevent the issue I'm describing.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, I guess in this case I would just swap the inputs: make the assembly the first, because it's the primary one and leave the reads as the second input.
The other option is using meta2 in the outputs, but I think this feels like a poorer design.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've refactored the module to use the assembly metadata as the primary input, updated the MAG integration/tests, and verified the output structure locally. CI is now passing.

@Harshita-sriv Harshita-sriv force-pushed the add-polishing-module branch from 8fab6cd to 03b1a05 Compare June 19, 2026 04:43

@d4straub d4straub left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No further comments, looks perfect now!
That was quite a journey, but a successful one ;)

@dialvarezs dialvarezs left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @Harshita-sriv!

As a finishing touches I updated some variable names and moved some things around to improve semantics, and also updated output docs.
I think this is good to merge now.

You did a wonderful job! Thanks for contributing this!

Feel free to push the merge button :D

@Harshita-sriv Harshita-sriv merged commit e7d60f5 into nf-core:dev Jun 19, 2026
24 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants