Labels: enhancement (New feature or request)
Description
[This issue is to collect data regarding improvements to the sketching methodology for DNA (not AA/3Di)].
Datasets
The benchmarks focus on reads (the most complicated use case), using one metagenomic and one isolate example.
- Isolate: reference ERR9281752.
- Metagenomic: reference SRR5918785.
Code used
- Vanilla: sketchlib.rust master branch (commit b9a9310) of this repo, with one small change: when min_qual==0, the quality checks are skipped entirely (in src/hashing/nthash_iterator.rs).
- SIMD: sketchlib.rust simdsketch_crate branch (commit 87192a5) of my fork. It is an implementation of sketchlib.rust, with the following caveats:
  - Not optimised yet; improvements are still to be made, and this is most likely not the best way of doing it.
  - Bin size changed from 14 to 16, to accommodate simd_sketch.
  - No count filtering is possible apart from the 0/1 argument (duplicate).
  - Coverage is set manually to 30 when the input is a read file.
  - Densification is not applied.
- SIMD with final code: sketchlib.rust simdsketch_crate branch (commit fa8e72d) of my fork. It is the code from the previous point, with a cargo update run on 04/12/2025 and some improvements:
  - A minimum number of counts can now be set.
  - Estimated coverage is now an argument. We are using 15000 as the default for the numbers shown below.
Comparisons
- Default (equivalent to --min-count 5 --min-qual 20).
- Remove all count checks (--min-count 0).
- --min-count 2.
- Remove all quality checks (--min-qual 0).
- Remove all count and quality checks (--min-count 0, --min-qual 0).
Other arguments
- k = 21.
- sketch_size = 1024.
- threads = 1.
Commands
# Default
sketchlib sketch -k $k -o test -f $tsvreadfile --threads $nthreads --verbose
# w/o counts
sketchlib sketch -k $k --min-count 0 -o test -f $tsvreadfile --threads $nthreads --verbose
# w/ counts = 2
sketchlib sketch -k $k --min-count 2 -o test -f $tsvreadfile --threads $nthreads --verbose
# w/o quality
sketchlib sketch -k $k --min-qual 0 -o test -f $tsvreadfile --threads $nthreads --verbose
# w/o counts w/o quality
sketchlib sketch -k $k --min-qual 0 --min-count 0 -o test -f $tsvreadfile --threads $nthreads --verbose
For the final SIMD code (commit fa8e72d), the argument --est-coverage $ecov is added to each command, with ecov=15000.
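The five commands above differ only in their filter flags, so they can be generated from a single template. The following is a minimal sketch under stated assumptions: it uses POSIX sh, the input path reads.tsv is a placeholder (not from the benchmark), build_cmd is a hypothetical helper name, and the script only prints the commands rather than running sketchlib.

```shell
#!/bin/sh
# Sketch only: prints the five benchmark command lines, it does not run them.
k=21                      # from "Other arguments"
nthreads=1                # from "Other arguments"
tsvreadfile="reads.tsv"   # placeholder input list (assumption)

# Print the sketchlib command line for one set of extra filter flags.
# An empty argument yields the default command.
build_cmd() {
  extra="$1"
  echo "sketchlib sketch -k $k ${extra:+$extra }-o test -f $tsvreadfile --threads $nthreads --verbose"
}

# One line per comparison; the first (empty) line is the default run.
while IFS= read -r extra; do
  build_cmd "$extra"
done <<'EOF'

--min-count 0
--min-count 2
--min-qual 0
--min-qual 0 --min-count 0
EOF
```

To actually run a comparison, the printed line can be executed in a shell where sketchlib is on the PATH. For the final SIMD code, --est-coverage would be appended to the template in the same way.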
Hardware
- Intel Xeon Gold 6230 CPU @ 2.10GHz (2x, 80 threads).
- 800GB RAM.
Results
Time is the total reported by sketchlib for the whole run (the "sketchlib done in Xs" message) and is shown in the last two columns, in seconds.
| Comparison | Code | Metagenomic (s) | Isolate (s) |
|---|---|---|---|
| Default | Vanilla | 161 | 6 |
| Default | SIMD | 66 | 4 |
| Default | SIMD & N removal | 61 | 4 |
| Default | SIMD final tests | 42.2 | 2.6 |
| --min-count 0 | Vanilla | 159 | 6 |
| --min-count 0 | SIMD | 66 | 2 |
| --min-count 0 | SIMD & N removal | 62 | 2 |
| --min-count 0 | SIMD final tests | 41.5 | 2.2 |
| --min-count 2 | Vanilla | 159 | 6 |
| --min-count 2 | SIMD | 67 | 4 |
| --min-count 2 | SIMD & N removal | 62 | 4 |
| --min-count 2 | SIMD final tests | 41.3 | 2.7 |
| --min-qual 0 | Vanilla | 155 | 6 |
| --min-qual 0 | SIMD | 66 | 4 |
| --min-qual 0 | SIMD & N removal | 56 | 5 |
| --min-qual 0 | SIMD final tests | 43.2 | 1.6 |
| --min-count 0 --min-qual 0 | Vanilla | 154 | 6 |
| --min-count 0 --min-qual 0 | SIMD | 66 | 2 |
| --min-count 0 --min-qual 0 | SIMD & N removal | 55 | 2 |
| --min-count 0 --min-qual 0 | SIMD final tests | 43.3 | 1.8 |
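
To put the table in relative terms, the speedup of a SIMD run over Vanilla is the ratio of the two times. A minimal sketch, assuming POSIX sh with awk available; speedup is a hypothetical helper name, and the inputs are the Default rows (Vanilla vs. SIMD final tests) copied from the table:

```shell
#!/bin/sh
# Print the speedup of a faster run over a baseline, to two decimals.
# $1 = baseline time in seconds, $2 = new time in seconds.
speedup() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

# Default settings, Vanilla vs. SIMD final tests.
echo "Metagenomic: $(speedup 161 42.2)x"   # 161 / 42.2 -> 3.82
echo "Isolate:     $(speedup 6 2.6)x"      # 6 / 2.6    -> 2.31
```

The isolate timings are only a few seconds each, so their ratios are more sensitive to rounding in the reported times than the metagenomic ones.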