Add PyPOLCA polishing and use polished assemblies for long-read binning#1048
Add PyPOLCA polishing and use polished assemblies for long-read binning#1048Harshita-sriv wants to merge 13 commits into
Conversation
|
Hi @Harshita-sriv, thanks for this! |
|
Thanks for the review. I investigated the CI failures and found that the current implementation does not define ch_polished_assemblies when run_pypolca is not enabled, which causes the test workflows to fail. I'll address this together with the requested migration to the official nf-core PyPOLCA module, removal of debug statements, and formatting fixes |
Merge remote-tracking branch 'upstream/dev' into add-polishing-module
|
Hi @Harshita-sriv! I think this looks fine, but it would be ideal to wire it up in some test. I'll think what would be the best option here and I'll come back to you. |
d4straub
left a comment
There was a problem hiding this comment.
Thanks for the addition, I think thats absolutely desired!
Looks good but I have some concerns with multiple samples.
|
I noticed that the remaining lint failure is due to modifications in The changes were introduced because PyPOLCA was failing during testing, and with the modifications the pipeline completes successfully on the test dataset. Would it be acceptable to keep these modifications to the PyPOLCA module, or would you prefer that the implementation uses the upstream module unchanged and handles the required logic elsewhere? |
Some of the changes are not acceptable: the output block and the versioning must be kept as it was originally. Additionally the line Generally, small changes on nf-core modules can be done and accepted by the linting via https://nf-co.re/docs/nf-core-tools/cli/modules/patch |
|
@d4straub I restored the PyPOLCA module to the upstream nf-core implementation and moved the coassembly-specific logic into the workflow layer ( The remaining CI issue appears to be the Could you advise whether there is an additional module sync/patch step expected for this module? |
|
Looks good. |
dialvarezs
left a comment
There was a problem hiding this comment.
Hi again @Harshita-sriv! This is looking good!
A few last things:
Regarding the testing setup:
I think the best option is to use the longreadonly_alternatives but changing the samplesheet so it uses the hybrid one that has LR and SR data. Then disable SPAdes, SPAdes hybrid and MEGAHIT at least. That way we can get this tested in that profile.
Also I left a comment regarding the pypolca module requiring gzipped input (that will need updating the nf-core module).
| def read_files = reads instanceof List ? reads : [reads] | ||
| def read_file_arg = read_files.size() > 1 ? "-1 ${read_files[0]} -2 ${read_files[1]}" : "-1 ${read_files[0]}" | ||
| """ | ||
| gzip -cdf $contigs > contigs_uncompressed |
There was a problem hiding this comment.
Can you update the module so it only decompresses when the input has a .gz extension? I was testing this locally, and pypolca received the uncompressed assembly, throwing an error.
And also I think it's a better design not assuming that the input is compressed.
There was a problem hiding this comment.
Yes, I also faced this problem, but I didn't change the module because I assumed it wouldn't be accepted. However, for the testing setup, could you confirm that for test_longreadonly_alternatives, I should replace samplesheet.long_read.v4.csv with samplesheet.hybrid.v4.csv, enable run_pypolca, and keep only Flye active by disabling SPAdes, SPAdes-hybrid, MEGAHIT, and MetaMDBG?
There was a problem hiding this comment.
It's a totally reasonable change if you execute gzip based on the extension, so it works with compressed and uncompressed input. So don't worry about opening a PR on the module. I can review it if you ping me.
And yes, that's the setup I'm proposing. MetaMDBG is already disabled in that profile, so you'll need to disable both SPAdes and MEGAHIT.
Summary
This PR adds a PyPOLCA polishing module and integrates polished assemblies into the long-read MAG workflow.
Changes
run_pypolcaparameter.ch_polished_assembliesinstead of rawch_longread_assemblies.Validation
Tested on one Nanopore metagenomic sample.
Verified that:
sample_polished.fasta.LONGREAD_BINNING_PREPARATION:MINIMAP2_ASSEMBLY_INDEXusedsample_polished.fasta.This ensures MAG generation is performed using polished assemblies rather than raw Flye assemblies.