kmerust
A fast, parallel k-mer counter for DNA sequences in FASTA and FASTQ files.
Features
Fast parallel processing using rayon and dashmap
FASTA and FASTQ support with automatic format detection from file extension
Canonical k-mers - outputs the lexicographically smaller of each k-mer and its reverse complement
Flexible k-mer lengths from 1 to 32
Handles N bases by skipping invalid k-mers
Jellyfish-compatible output format for easy integration with existing pipelines
Tested for accuracy against Jellyfish
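The canonical k-mer rule and the N-skipping behaviour listed above can be illustrated with a small standard-library sketch. This is not kmerust's implementation: the function names `reverse_complement`, `canonical`, and `count_canonical_kmers` are hypothetical, and the sketch is single-threaded for clarity.

```rust
use std::collections::HashMap;

/// Reverse complement of a DNA k-mer given as ASCII bytes.
/// Returns None if any base is not A, C, G, or T (for example N).
fn reverse_complement(kmer: &[u8]) -> Option<Vec<u8>> {
    kmer.iter()
        .rev()
        .map(|base| match base.to_ascii_uppercase() {
            b'A' => Some(b'T'),
            b'C' => Some(b'G'),
            b'G' => Some(b'C'),
            b'T' => Some(b'A'),
            _ => None, // invalid base: the whole k-mer is rejected
        })
        .collect()
}

/// Canonical form: the lexicographically smaller of the k-mer
/// and its reverse complement (hypothetical helper).
fn canonical(kmer: &[u8]) -> Option<String> {
    let forward = kmer.to_ascii_uppercase();
    let rc = reverse_complement(&forward)?;
    let smaller = if rc < forward { rc } else { forward };
    String::from_utf8(smaller).ok()
}

/// Count canonical k-mers in one sequence, skipping every window
/// that contains an invalid base.
fn count_canonical_kmers(seq: &[u8], k: usize) -> HashMap<String, u64> {
    let mut counts = HashMap::new();
    if k == 0 {
        return counts; // `windows(0)` would panic
    }
    for window in seq.windows(k) {
        if let Some(kmer) = canonical(window) {
            *counts.entry(kmer).or_insert(0) += 1;
        }
    }
    counts
}
```

For the sequence ACGTNACG with k = 3, the windows ACG, CGT, and ACG all collapse to the canonical k-mer ACG, while the three windows overlapping the N are skipped.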
Installation
From crates.io
cargo install kmerust
From source
git clone https://github.com/suchapalaver/kmerust.git
cd kmerust
cargo install --path .
Usage
kmerust <k> <path>
Arguments
<k> - K-mer length (1-32)
<path> - Path to a FASTA or FASTQ file (use - or omit for stdin)
Options
-f, --format <FORMAT> - Output format: fasta (default), tsv, json, or histogram
-i, --input-format <FORMAT> - Input format: auto (default), fasta, or fastq
-m, --min-count <N> - Minimum count threshold (default: 1)
-Q, --min-quality <N> - Minimum Phred quality score for FASTQ (0-93); bases below this are skipped
--save <PATH> - Save k-mer counts to a binary index file for fast querying
-q, --quiet - Suppress informational output
-h, --help - Print help information
-V, --version - Print version information
Examples
Count 21-mers in a FASTA file:
kmerust 21 sequences.fa > kmers.txt
Count 21-mers in a FASTQ file (format auto-detected):
kmerust 21 reads.fq > kmers.txt
Count 5-mers:
kmerust 5 sequences.fa > kmers.txt
Unix Pipeline Integration
kmerust supports reading from stdin, enabling seamless integration with Unix pipelines:
# Pipe from another command
cat genome.fa | kmerust 21
# Decompress and count
zcat large.fa.gz | kmerust 21 > counts.tsv
# Sample reads and count
seqtk sample reads.fa 0.1 | kmerust 17
# Explicit stdin marker
cat genome.fa | kmerust 21 -
# FASTQ from stdin (specify format explicitly)
cat reads.fq | kmerust 21 --input-format fastq
zcat reads.fq.gz | kmerust 21 -i fastq > counts.tsv
Use --format to choose the output format:
# TSV format (tab-separated)
kmerust 21 sequences.fa --format tsv
# JSON format
kmerust 21 sequences.fa --format json
# FASTA-like format (default)
kmerust 21 sequences.fa --format fasta
# Histogram format (k-mer frequency spectrum)
kmerust 21 sequences.fa --format histogram
Histogram Output
The histogram format outputs the k-mer frequency spectrum (count of counts), useful for genome size estimation and error detection:
kmerust 21 genome.fa --format histogram > spectrum.tsv
Output is tab-separated with columns count and frequency:
1	1523456    # 1.5M k-mers appear exactly once (likely errors)
2	234567     # 234K k-mers appear twice
10	45678      # 45K k-mers appear 10 times
...
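To make "count of counts" concrete, here is a minimal sketch of how a frequency spectrum could be derived from a map of k-mer counts. The function names `frequency_spectrum` and `spectrum_to_tsv` are hypothetical and are not part of the kmerust API. The `HashMap<String, u64>` input type is also an assumption made for illustration.

```rust
use std::collections::{BTreeMap, HashMap};

/// Build a frequency spectrum from per-k-mer counts:
/// key = occurrence count, value = number of distinct k-mers with that count.
fn frequency_spectrum(counts: &HashMap<String, u64>) -> BTreeMap<u64, u64> {
    let mut spectrum = BTreeMap::new();
    for &count in counts.values() {
        *spectrum.entry(count).or_insert(0) += 1;
    }
    spectrum
}

/// Render the spectrum as tab-separated `count<TAB>frequency` lines,
/// ordered by count (BTreeMap iterates in key order).
fn spectrum_to_tsv(spectrum: &BTreeMap<u64, u64>) -> String {
    spectrum
        .iter()
        .map(|(count, frequency)| format!("{count}\t{frequency}\n"))
        .collect()
}
```

With two k-mers seen once and one k-mer seen three times, the spectrum has two rows: count 1 with frequency 2, and count 3 with frequency 1.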
Quality Filtering (FASTQ)
For FASTQ files, use --min-quality to filter out k-mers containing low-quality bases:
# Skip k-mers with any base below Q20
kmerust 21 reads.fq --min-quality 20
# Higher threshold for stricter filtering
kmerust 21 reads.fq -Q 30 --format tsv
K-mers containing bases with Phred quality scores below the threshold are skipped entirely.
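The filtering rule can be sketched as a check over each k-mer window of the quality string. This sketch assumes the common Phred+33 encoding, in which printable ASCII 33 to 126 maps to scores 0 to 93, matching the range accepted by --min-quality. The names `phred_score` and `passing_windows` are hypothetical, and kmerust's internals may differ.

```rust
/// ASCII offset of the Phred+33 encoding (an assumption of this sketch).
const PHRED_OFFSET: u8 = 33;

/// Decode one FASTQ quality character into a Phred score.
fn phred_score(quality_char: u8) -> u8 {
    quality_char.saturating_sub(PHRED_OFFSET)
}

/// Return the start positions of all k-mer windows in which every base
/// has a Phred score of at least `min_quality`.
fn passing_windows(quality: &[u8], k: usize, min_quality: u8) -> Vec<usize> {
    if k == 0 {
        return Vec::new(); // `windows(0)` would panic
    }
    quality
        .windows(k)
        .enumerate()
        .filter(|(_, window)| window.iter().all(|&q| phred_score(q) >= min_quality))
        .map(|(start, _)| start)
        .collect()
}
```

A single low-quality base removes every window that overlaps it, so one bad base can discard up to k k-mers from a read.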
Index Serialization
For large genomes, save k-mer counts to a binary index file to avoid re-counting:
# Count and save to index
kmerust 21 genome.fa --save counts.kmix
# Counts are also written to stdout as usual
kmerust 21 genome.fa --save counts.kmix > counts.tsv
The index file uses a compact binary format with CRC32 checksums for integrity verification. Gzip compression is auto-detected from the .gz extension:
# Save with gzip compression
kmerust 21 genome.fa --save counts.kmix.gz
Querying a Saved Index
Use the query subcommand to look up k-mer counts from a saved index:
# Query a single k-mer
kmerust query counts.kmix ACGTACGTACGTACGTACGTA
# Output: 42 (or 0 if not found)
# Queries are case-insensitive and canonicalized
kmerust query counts.kmix acgtacgtacgtacgtacgta # Same result
kmerust query counts.kmix TACGTACGTACGTACGTACGT # Reverse complement, same result
The query k-mer length must match the index's k value.
Sequence Readers
kmerust supports two sequence readers via feature flags, rust-bio (the default) and needletail, both supporting FASTA and FASTQ.
To use needletail instead:
cargo run --release --no-default-features --features needletail -- 21 sequences.fa
With needletail, format is auto-detected from file content. With rust-bio, format is detected from file extension (.fa, .fasta, .fna for FASTA; .fq, .fastq for FASTQ).
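Extension-based detection of the kind described for rust-bio can be sketched as a simple lookup. The enum `DetectedFormat` and the function `detect_format_from_extension` are hypothetical names, not kmerust's API. Case-insensitive matching is also an assumption of this sketch.

```rust
use std::path::Path;

/// Input formats distinguished by this sketch (hypothetical type).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum DetectedFormat {
    Fasta,
    Fastq,
}

/// Map a file extension to a format using the extensions listed above.
/// Returns None for a missing or unrecognised extension, leaving the
/// fallback policy to the caller.
fn detect_format_from_extension(path: &Path) -> Option<DetectedFormat> {
    let extension = path.extension()?.to_str()?.to_ascii_lowercase();
    match extension.as_str() {
        "fa" | "fasta" | "fna" => Some(DetectedFormat::Fasta),
        "fq" | "fastq" => Some(DetectedFormat::Fastq),
        _ => None,
    }
}
```

Note that this sketch looks only at the final extension, so a name such as sequences.fa.gz is not recognised without additional handling.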
Production Features
Enable production features for additional capabilities:
cargo build --release --features production
Or enable individual features:
gzip - Read gzip-compressed FASTA files (.fa.gz)
mmap - Memory-mapped I/O for large files
tracing - Structured logging and diagnostics
With the gzip feature, kmerust can directly read gzip-compressed files:
cargo run --release --features gzip -- 21 sequences.fa.gz
Tracing/Logging
With the tracing feature, use the RUST_LOG environment variable for diagnostic output:
RUST_LOG=kmerust=debug cargo run --features tracing -- 21 sequences.fa
By default, k-mer counts are written to stdout in FASTA-like format:
>{count}
{canonical_kmer}
Example output:
>114928
ATGCC
>289495
AATCA
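Because each record is a count header followed by a k-mer line, the default output is easy to read back in downstream tools. The sketch below parses that layout into (k-mer, count) pairs. `parse_kmer_output` is a hypothetical helper and is not provided by kmerust.

```rust
/// Parse FASTA-like `>{count}` / `{kmer}` records into (kmer, count) pairs.
/// Blank lines are ignored. Malformed input yields a descriptive error.
fn parse_kmer_output(text: &str) -> Result<Vec<(String, u64)>, String> {
    let mut records = Vec::new();
    let mut lines = text.lines().filter(|line| !line.trim().is_empty());
    while let Some(header) = lines.next() {
        // The header line carries the count after the '>' marker.
        let count_text = header
            .strip_prefix('>')
            .ok_or_else(|| format!("expected '>' header, got {header:?}"))?;
        let count: u64 = count_text
            .trim()
            .parse()
            .map_err(|err| format!("invalid count {count_text:?}: {err}"))?;
        // The next non-blank line is the canonical k-mer itself.
        let kmer = lines
            .next()
            .ok_or_else(|| format!("missing k-mer line after count {count}"))?;
        records.push((kmer.trim().to_string(), count));
    }
    Ok(records)
}
```

Applied to the example output above, this returns the pairs (ATGCC, 114928) and (AATCA, 289495) in file order.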
Library Usage
kmerust can also be used as a library:
use kmerust::run::count_kmers;
use std::path::PathBuf;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Works with both FASTA and FASTQ (format auto-detected)
    let path = PathBuf::from("sequences.fa");
    let counts = count_kmers(&path, 21)?;
    for (kmer, count) in counts {
        println!("{kmer}: {count}");
    }
    Ok(())
}
When using the builder API, you can explicitly specify the input format:
use kmerust::builder::KmerCounter;
use kmerust::format::SequenceFormat;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let counts = KmerCounter::new()
        .k(21)?
        .input_format(SequenceFormat::Fastq)
        .count("reads.fq")?;
    Ok(())
}
Progress Reporting
Monitor progress during long-running operations:
use kmerust::run::count_kmers_with_progress;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let counts = count_kmers_with_progress("genome.fa", 21, |progress| {
        eprintln!(
            "Processed {} sequences ({} bases)",
            progress.sequences_processed,
            progress.bases_processed
        );
    })?;
    Ok(())
}
Memory-Mapped I/O
For large files, use memory-mapped I/O (requires mmap feature):
use kmerust::run::count_kmers_mmap;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let counts = count_kmers_mmap("large_genome.fa", 21)?;
    println!("Found {} unique k-mers", counts.len());
    Ok(())
}
Streaming API
For memory-efficient processing:
use kmerust::streaming::count_kmers_streaming;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let counts = count_kmers_streaming("genome.fa", 21)?;
    println!("Found {} unique k-mers", counts.len());
    Ok(())
}
Reading from Any Source
Count k-mers from any BufRead source, including stdin or in-memory data:
use kmerust::streaming::count_kmers_from_reader;
use std::io::BufReader;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // From in-memory data
    let fasta_data = b">seq1\nACGTACGT\n>seq2\nTGCATGCA\n";
    let reader = BufReader::new(&fasta_data[..]);
    let counts = count_kmers_from_reader(reader, 4)?;
    // From stdin
    // use kmerust::streaming::count_kmers_stdin;
    // let counts = count_kmers_stdin(21)?;
    Ok(())
}
kmerust uses parallel processing to efficiently count k-mers:
Sequences are processed in parallel using rayon
A concurrent hash map (dashmap) allows concurrent updates from multiple threads using sharded, fine-grained locking
FxHash provides fast hashing for 64-bit packed k-mers
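The 32-base upper limit follows from packing: at two bits per base, 32 bases fill exactly one 64-bit integer. The sketch below shows one possible encoding (A=00, C=01, G=10, T=11). The bit layout kmerust actually uses is not stated here, so both the layout and the names `pack_kmer` and `unpack_kmer` are assumptions.

```rust
/// Pack a k-mer of 1 to 32 bases into a u64, two bits per base.
/// Returns None for an empty or over-long k-mer, or for any base
/// other than A, C, G, or T (for example N).
fn pack_kmer(kmer: &[u8]) -> Option<u64> {
    if kmer.is_empty() || kmer.len() > 32 {
        return None;
    }
    let mut packed = 0u64;
    for &base in kmer {
        let bits = match base.to_ascii_uppercase() {
            b'A' => 0,
            b'C' => 1,
            b'G' => 2,
            b'T' => 3,
            _ => return None,
        };
        // Shift earlier bases left and append the new base in the low bits.
        packed = (packed << 2) | bits;
    }
    Some(packed)
}

/// Decode a packed k-mer of length `k` back into its base string.
fn unpack_kmer(packed: u64, k: usize) -> String {
    assert!(k <= 32, "a u64 holds at most 32 two-bit bases");
    (0..k)
        .rev()
        .map(|i| match (packed >> (2 * i)) & 0b11 {
            0 => 'A',
            1 => 'C',
            2 => 'G',
            _ => 'T',
        })
        .collect()
}
```

A packed k-mer is a plain integer, so hashing and comparing it is much cheaper than hashing a string, which is the kind of key a fast integer hasher such as FxHash is suited to.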
License
MIT License - see LICENSE for details.