Document use of SquashFS image as gtdbtk database.#793

Merged
jfy133 merged 11 commits into nf-core:dev from muniheart:dev
Apr 28, 2025

Conversation

@muniheart

@muniheart muniheart commented Apr 18, 2025

Contributor

PR checklist

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - have you followed the pipeline conventions in the contribution docs
  • If necessary, also make a PR on the nf-core/mag branch on the nf-core/test-datasets repository.
  • Make sure your code lints (nf-core pipelines lint).
  • Ensure the test suite passes (nextflow run . -profile test,docker --outdir <OUTDIR>).
  • Check for unexpected warnings in debug mode (nextflow run . -profile debug,test,docker --outdir <OUTDIR>).
  • Usage Documentation in docs/usage.md is updated.
  • Output Documentation in docs/output.md is updated.
  • CHANGELOG.md is updated.
  • README.md is updated (including new tool citations and authors/contributors).

This PR documents a feature of the GTDBTK subworkflow. Users with limited storage can economize on inodes (the uncompressed database requires >200K inodes!) by providing the GTDB-Tk database as a SquashFS image.
This feature is only available with the Singularity and Apptainer container engines.
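
To see how many inodes an unpacked database occupies, counting the entries under its directory gives a reasonable approximation. The following is a minimal sketch, assuming a POSIX shell with `find`; the helper name `count_inodes` and the throwaway demo tree are hypothetical and not part of the pipeline.

```shell
#!/bin/sh
# Approximate the number of inodes used by a directory tree by counting
# every entry (files, directories, symlinks) that find reports.
# count_inodes is a hypothetical helper, not part of nf-core/mag.
count_inodes() {
    find "$1" | wc -l | tr -d ' '
}

# Demo on a throwaway tree: 1 root dir + 1 subdir + 3 files = 5 entries.
demo=$(mktemp -d)
mkdir "$demo/taxonomy"
touch "$demo/taxonomy/a" "$demo/taxonomy/b" "$demo/metadata.txt"
demo_count=$(count_inodes "$demo")
echo "entries in demo tree: $demo_count"
rm -rf "$demo"
```

Pointing `count_inodes` at the unpacked GTDB-Tk directory would show the cost that a single image file avoids. Hard-linked files are counted once per name, so the figure is an upper bound.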

This is a replacement of #785.

** Why is a PR necessary to use a SquashFS image? Why can't I just mount the image and pass its path?

Mounting a file image requires permission to mount a loop device. If you have sufficient privilege on your system to mount a loop device then you don't need this PR. Singularity and apptainer allow an unprivileged user to bind-mount a file system image to a container's file system. This PR documents the configuration of the bind-mount options.

** Why is a PR necessary to bind-mount an image? Why not just configure the parameters in containerOptions?

Setting containerOptions is sufficient. This PR provides usage details.

** Why not just pass the database path in input path(db)?

Prior to #788, the database was passed to process GTDBTK_CLASSIFYWF as

    input:
        path( 'database/*' )

This allowed for the configuration

process {
    withName: GTDBTK_CLASSIFYWF {
        containerOptions = "-B $params.gtdb_db:\$NXF_TASK_WORKDIR/database:image-src=/release220"
    }
}

This worked because the shell function nxf_stage in the .command.run script created a directory named database, and the image was mounted there. With #788, nxf_stage instead creates a symbolic link to the database. The solution is to mount the image file system somewhere outside the process's workDir and pass that location in path(db).
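
The difference between the two staging behaviours can be reproduced in plain shell. This is a simplified imitation of what is described above, not the actual `nxf_stage` code; all directory names and the `is_real_dir` helper are made up for illustration.

```shell
#!/bin/sh
# Simplified imitation of the two staging behaviours described above.
# This is NOT the real nxf_stage function; all paths are hypothetical.
work=$(mktemp -d)
db="$work/gtdb_release"          # stands in for the host database path
mkdir "$db"

# Old input, path('database/*'): a real directory named "database" is
# created in the task directory and the input is linked inside it.
old_task="$work/old_task"
mkdir -p "$old_task/database"
ln -s "$db" "$old_task/database/gtdb_release"

# New input, path(db): the task directory only holds a symbolic link.
new_task="$work/new_task"
mkdir -p "$new_task"
ln -s "$db" "$new_task/gtdb_release"

# Per the description above, the bind mount onto the task directory worked
# only when a real directory (not a symlink) existed there.
is_real_dir() { [ -d "$1" ] && [ ! -L "$1" ]; }

if is_real_dir "$old_task/database"; then old_ok=yes; else old_ok=no; fi
if is_real_dir "$new_task/gtdb_release"; then new_ok=yes; else new_ok=no; fi
echo "old staging leaves a real directory: $old_ok"
echo "new staging leaves a real directory: $new_ok"
rm -rf "$work"
```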

** Why not specify a distinguished absolute path at which to mount the image when the container runs and pass that as path(db)?

process {
    withName: GTDBTK_CLASSIFYWF {
        containerOptions = "-B $params.gtdb_db:/some/absolute/path:image-src=/release220"
    }
}

and call the process

    GTDBTK_CLASSIFYWF( [ 'gtdb', '/some/absolute/path' ] )

Yes! That does work. See the notes on using SquashFS as the GTDB-Tk database in docs/usage.md.
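
The three colon-separated fields of the bind option are easy to mix up: the image file on the host, the mount point inside the container, and then `image-src`. Below is a small sketch that assembles them in the order used in this PR; `bind_spec` is a hypothetical helper, and the image path is a placeholder.

```shell
#!/bin/sh
# bind_spec IMAGE MOUNTPOINT [IMAGE_SRC]
# Assemble the -B argument in the order used in this PR:
#   <image file on host>:<mount point in container>:image-src=<dir inside image>
# bind_spec is a hypothetical helper for illustration only.
bind_spec() {
    image=$1
    mountpoint=$2
    image_src=${3:-/}     # default: mount the root of the image
    printf '%s %s:%s:image-src=%s\n' -B "$image" "$mountpoint" "$image_src"
}

spec=$(bind_spec /some/gtdb.squashfs /some/absolute/path /release220)
echo "$spec"
```

The printed string is what would go inside `containerOptions`, with the mount point matching the path handed to the process.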

I used nf-core tools in docker container.

$ nf-core modules update -f gtdbtk/classifywf

I was user 'root' in container, so I had to change ownership and permissions of modified
files afterward outside the container.

@muniheart muniheart changed the title from "Allow squashfs image as gtdbtk database." to "Document use of SquashFS image as gtdbtk database." Apr 20, 2025

@jfy133 jfy133 left a comment

Member

Hi @muniheart ! Thanks for persisting with your experimentation on this one!

I've done a quick read through, but I will do a pass to update the text tomorrow (I think it's a bit too unnecessarily technical at points)

But so if I understand correctly in very simple terms:

  • No code changes are necessary
  • You generate your squashfs image
  • You need to make an 'empty' directory (so to say) somewhere on your filesystem (and you give this to --gtdb_db)
  • You then pass to the container options the squashfs image to then mount at the location of --gtdb_db (and the name of the top level directory)

Is that correct?

@muniheart

muniheart commented Apr 24, 2025

Contributor Author

> Hi @muniheart ! Thanks for persisting with your experimentation on this one!
>
> I've done a quick read through, but I will do a pass to update the text tomorrow (I think it's a bit too unnecessarily technical at points)
>
> But so if I understand correctly in very simple terms:
>
> * No code changes are necessary
> * You generate your squashfs image
> * You need to make an 'empty' directory (so to say) somewhere on your filesystem (and you give this to `--gtdb_db`)
> * You then pass to the container options the squashfs image to then mount at the location of `--gtdb_db` (and the name of the top level directory)
>
> Is that correct?

Hi @jfy133. That's it! The top-level-directory bit can even be dropped if you like. It would be resolved by find.

 process {
      withName: GTDBTK_CLASSIFYWF {
              containerOptions = "-B /path/to/gtdb.squashfs:${params.gtdb_db}:image-src=/"
      }
  }
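
The steps confirmed above can be laid out as a dry run. This is only a sketch: it prints the commands instead of executing them, every path is a placeholder, `custom.config` is an assumed name for the file holding the `containerOptions` block, and the `nextflow` line leaves out the pipeline's other required parameters.

```shell
#!/bin/sh
# Dry-run sketch of the steps confirmed above. Nothing is executed: each
# command is only printed, so the script is safe to run anywhere.
# All paths are placeholders; custom.config is an assumed file name.
db_dir=/path/to/gtdbtk_db            # unpacked database (placeholder)
image=/path/to/gtdb.squashfs         # SquashFS image to create (placeholder)
mount_dir=/path/to/empty_dir         # empty directory passed to --gtdb_db

plan() {
    # 1. Pack the database directory into a single image file.
    echo "mksquashfs $db_dir $image"
    # 2. List the image to see its top-level directory (for image-src).
    echo "unsquashfs -ls $image"
    # 3. Create the empty directory that serves as the mount point.
    echo "mkdir -p $mount_dir"
    # 4. Run the pipeline with a config that sets containerOptions.
    echo "nextflow run nf-core/mag -profile singularity --gtdb_db $mount_dir -c custom.config"
}

steps=$(plan)
echo "$steps"
```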

Thanks for your patience in considering this 'enhancement'.

@jfy133

jfy133 commented Apr 25, 2025

Member

@muniheart please have a look and check this looks OK - we can always roll back or take bits from your previous commit if necessary

@jfy133

jfy133 commented Apr 25, 2025

Member

@nf-core-bot fix linting

Comment thread docs/usage.md
Comment on lines +419 to +425
```nextflow
process {
    withName: GTDBTK_CLASSIFYWF {
        containerOptions = "-B /<path>/<to>/<empty_dir>/gtdbtk_r220.squashfs:${params.gtdb_db}:image-src=/<output_from_unsquashfs_ls>"
    }
}
```

@muniheart muniheart Apr 25, 2025

Contributor Author

It looks good, @jfy133 . I have 2 comments:

  1. The /<path>/<to>/<empty_dir>/ should be passed in params.gtdb_db. That is mentioned above.
  2. It may look cleaner as
 process {
      withName: GTDBTK_CLASSIFYWF {
              containerOptions = "-B <path_to_image>:${params.gtdb_db}:image-src=/"
      }
  }

Using <path_to_image> may be closer to the convention used elsewhere in the doc.

Member

I prefer to use /<path>/<to> where possible as it's a more explicit example (and tells the user what to expect). In the case of <output_from_unsquas> bit I can't predict what this will look like so prefer to make it more generic, if that makes sense?

Contributor Author

There is still a problem with arguments to -B, @jfy133.

It should be,

process {
      withName: GTDBTK_CLASSIFYWF {
              containerOptions = "-B /<path>/<to>/gtdbtk_r220.squashfs:${params.gtdb_db}:image-src=/"
      }
  }

The value /<path>/<to>/<empty_dir> is passed in params.gtdb_db.

Member

Ah, got it. Sorry, my mistake!

Comment thread docs/usage.md Outdated
Co-authored-by: muniheart <52059779+muniheart@users.noreply.github.com>
Comment thread docs/usage.md
@jfy133

jfy133 commented Apr 25, 2025

Member

How about now @muniheart ?

@muniheart

Contributor Author

> How about now @muniheart ?

Hi @jfy133. I suggested a change above.

Comment thread docs/usage.md Outdated
Comment thread docs/usage.md Outdated
@jfy133

jfy133 commented Apr 28, 2025

Member

Hopefully that's everything now @muniheart ?

@jfy133

jfy133 commented Apr 28, 2025

Member

@nf-core-bot fix linting

@jfy133

jfy133 commented Apr 28, 2025

Member

Ignore the failing tests for now @muniheart we can merge in if you're happy with the instructions :)

@muniheart

Contributor Author

LGTM, @jfy133. Thank you.

Comment thread CHANGELOG.md Outdated
@prototaxites

Contributor

Hi @muniheart, no comments but just wanted to say thank you for persevering with this, despite the difficulties finding the best way to incorporate it! I think this makes a really nice addition to the documentation, and will definitely be helpful to other users of the pipeline.

@jfy133

jfy133 commented Apr 28, 2025

Member

I +1 that too, thank you @muniheart !

Comment thread CHANGELOG.md Outdated
### `Added`

- [#784](https://github.com/nf-core/mag/pull/784) - Added `--bin_min_size` and `--bin_max_size` parameters to filter out bins based on size (requested by @maxibor, @alexhbnr, added by @jfy133, @prototaxites).
- [#793](https://github.com/nf-core/mag/pull/793) - Document use of a SquashFS image with `--gtdb_db` (by @muniheart).

Member

praise:

Suggested change
- [#793](https://github.com/nf-core/mag/pull/793) - Document use of a SquashFS image with `--gtdb_db` (by @muniheart).
- [#793](https://github.com/nf-core/mag/pull/793) - Document use of a SquashFS image with `--gtdb_db`, useful for limited inode infrastructure (by @muniheart).

Comment thread CHANGELOG.md Outdated
@jfy133 jfy133 merged commit fd0f79c into nf-core:dev Apr 28, 2025
muabnezor pushed a commit to muabnezor/mag that referenced this pull request Apr 28, 2025
Document use of SquashFS image as gtdbtk database.