Sketching optimisations #57

@vrbouza

This issue collects data on improvements to the sketching methodology for DNA (not AA/3Di).

Datasets

The tests focus on reads (the most complicated use case), using one metagenomic and one isolate example.

Code used

  • Vanilla: sketchlib.rust master branch (commit b9a9310) of this repo, with one small change in src/hashing/nthash_iterator.rs so that min_qual==0 skips the checks entirely.
  • SIMD: sketchlib.rust simdsketch_crate branch (commit 87192a5) of my fork. It is an implementation of sketchlib.rust, with the following caveats:
    • Not optimised yet; improvements are still to be made, and this is almost certainly not the best way of doing it.
    • Bin size changed from 14 to 16 to accommodate simd_sketch.
    • Counts are not supported beyond the 0/1 (duplicate) argument.
    • Coverage is set manually to 30 when the input file contains reads.
    • Densification is not applied.
  • SIMD with final code: sketchlib.rust simdsketch_crate branch (commit fa8e72d) of my fork. It is the code from the previous point, after a cargo update run on 04/12/2025, with some improvements:
    • A minimum count can now be set.
    • Estimated coverage is now an argument. We use 15000 as the default for the numbers shown below.

Comparisons

  • Default (equivalent to --min-count 5 --min-qual 20).
  • Remove all count checks (--min-count 0).
  • --min-count 2.
  • Remove all quality checks (--min-qual 0).
  • Remove all count and quality checks (--min-count 0, --min-qual 0).

Other arguments

  • k = 21.
  • sketch_size = 1024.
  • threads = 1.

Commands

# Default 
sketchlib sketch -k $k -o test -f $tsvreadfile --threads $nthreads --verbose

# w/o counts
sketchlib sketch -k $k --min-count 0 -o test -f $tsvreadfile --threads $nthreads --verbose

# w/ counts = 2
sketchlib sketch -k $k --min-count 2 -o test -f $tsvreadfile --threads $nthreads --verbose

# w/o quality 
sketchlib sketch -k $k --min-qual 0 -o test -f $tsvreadfile --threads $nthreads --verbose

# w/o counts w/o quality 
sketchlib sketch -k $k --min-qual 0 --min-count 0 -o test -f $tsvreadfile --threads $nthreads --verbose

For the final SIMD version (commit fa8e72d), the argument --est-coverage $ecov is added to each command, with ecov=15000.
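The five commands above differ only in their --min-count and --min-qual flags, so they can be generated from one template. The following is a minimal sketch, not the script actually used for the timings: the helper name sketch_cmd, the RUN switch, and the input file name reads.tsv are illustrative assumptions, and the position of --est-coverage in the command line is a guess. By default it only prints the commands; setting RUN=1 executes them, which requires sketchlib on the PATH and an input path without spaces.

```sh
#!/bin/sh
# Dry run by default: prints each command. Set RUN=1 to execute them.
k=21
nthreads=1
tsvreadfile="${tsvreadfile:-reads.tsv}"  # hypothetical input list
ecov="${ecov:-}"                         # set to 15000 for the final SIMD build only

sketch_cmd() {
  # Print the sketchlib command for one comparison; "$@" holds its extra flags.
  if [ -n "$ecov" ]; then
    set -- "$@" --est-coverage "$ecov"
  fi
  set -- sketchlib sketch -k "$k" "$@" -o test -f "$tsvreadfile" --threads "$nthreads" --verbose
  echo "$*"
}

# One entry per comparison; the empty string is the default run.
for flags in "" "--min-count 0" "--min-count 2" "--min-qual 0" "--min-qual 0 --min-count 0"; do
  # $flags is deliberately unquoted so that it splits into separate arguments.
  line=$(sketch_cmd $flags)
  echo "$line"
  if [ "${RUN:-0}" = 1 ]; then
    $line
  fi
done
```

Running it with ecov=15000 in the environment appends --est-coverage 15000 to the flags of every comparison.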

Hardware

  • Intel Xeon Gold 6230 CPU @ 2.10GHz (2x, 80 threads).
  • 800GB RAM.

Results

Times are those reported by sketchlib for the whole run (the "sketchlib done in Xs" message) and are shown in the last two columns, in seconds.

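| Comparison | Code | Metagenomics (s) | Isolate (s) |
|---|---|---|---|
| Default | Vanilla | 161 | 6 |
| Default | SIMD | 66 | 4 |
| Default | SIMD & N removal | 61 | 4 |
| Default | SIMD final tests | 42.2 | 2.6 |
| --min-count 0 | Vanilla | 159 | 6 |
| --min-count 0 | SIMD | 66 | 2 |
| --min-count 0 | SIMD & N removal | 62 | 2 |
| --min-count 0 | SIMD final tests | 41.5 | 2.2 |
| --min-count 2 | Vanilla | 159 | 6 |
| --min-count 2 | SIMD | 67 | 4 |
| --min-count 2 | SIMD & N removal | 62 | 4 |
| --min-count 2 | SIMD final tests | 41.3 | 2.7 |
| --min-qual 0 | Vanilla | 155 | 6 |
| --min-qual 0 | SIMD | 66 | 4 |
| --min-qual 0 | SIMD & N removal | 56 | 5 |
| --min-qual 0 | SIMD final tests | 43.2 | 1.6 |
| --min-counts/qual 0 | Vanilla | 154 | 6 |
| --min-counts/qual 0 | SIMD | 66 | 2 |
| --min-counts/qual 0 | SIMD & N removal | 55 | 2 |
| --min-counts/qual 0 | SIMD final tests | 43.3 | 1.8 |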
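
One way to read these numbers is as a speed-up factor, that is, the Vanilla time divided by the time of the code being compared. The snippet below is a small sketch of that arithmetic using the default-settings rows; the function name speedup is illustrative, and the inputs are the timings reported above.

```sh
#!/bin/sh
speedup() {
  # Ratio of baseline time ($1) to optimised time ($2), to two decimals.
  LC_ALL=C awk -v base="$1" -v new="$2" 'BEGIN { printf "%.2f\n", base / new }'
}

speedup 161 42.2   # metagenomics, Default: Vanilla vs SIMD final tests -> 3.82
speedup 6 2.6      # isolate, Default: Vanilla vs SIMD final tests -> 2.31
```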
