A lightweight (it runs on my potato 🥔) and fast compression-based k-nearest neighbors (k-NN) classifier using Normalized Compression Distance (NCD) for nucleotide (and maybe AA too?) sequence classification. The implementation is based on the Less is More: Parameter-Free Text Classification with Gzip arXiv.2212.09410, introducing an alternative approach for resource hungry methods like DNNs and other algorithms, like Naive-Bayes.
It compares query sequences against reference FASTA datasets; during tests, the Silva S16 rDNA datasets were used.
Kompa leverages compression-based similarity as a universal measure of sequence relatedness, rooted in information theory and Kolmogorov complexity. Here's why this approach is probably effective and efficient for biological sequence classification than the currently popular k-mer based methods:
Kolmogorov Complexity Approximation: The Kolmogorov complexity of a string is the length of the shortest program that can produce it. While theoretically uncomputable, practical compression algorithms (like LZ4) approximate this by finding patterns and redundancies. The Normalized Compression Distance (NCD) uses compression to measure similarity:
NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))
Where similar sequences share common patterns, leading to better compression of their concatenation C(xy) relative to their individual compressed sizes.
-
Pattern Recognition Without Feature Engineering: DNA/RNA sequences contain inherent patterns (repeats, motifs, conserved regions). Compression algorithms automatically detect these patterns without requiring manual feature extraction or domain knowledge.
-
Universal Similarity Metric: Unlike alignment-based methods that require scoring matrices or gap penalties, compression-based distance works universally across different sequence types (nucleotide, protein, text) without parameter tuning.
-
Conserved Region Detection: Related biological sequences (e.g., 16S rRNA from similar species) share conserved regions. When concatenated, these shared patterns compress efficiently, yielding low NCD scores and indicating similarity.
-
Parameter-Free Classification: The method requires no pre-training (like k-mer based Naive Bayes), or fine-tuning. This makes it ideal for:
- Low-resource scenarios (limited labeled data)
- Quick exploratory analysis
-
Robustness to Noise: Compression naturally filters out random noise while preserving signal. Small sequencing errors or variations don't drastically affect the overall compression ratio, making the method relatively robust (But the general audience will decide if this method is useful or utterly useless...).
Works Well For:
- ✅ Highly conserved sequences (16S/18S rRNA, housekeeping genes)
- ✅ Taxonomic classification at genus/family level
- ✅ Few-shot learning scenarios with limited examples
- ✅ Quick screening of large sequence databases
Limitations:
⚠️ Short sequences (< 500 bp) may lack sufficient information for reliable compression patterns⚠️ Fine-grained classification (strain/species level) may require alignment-based approaches for better resolution⚠️ Highly divergent sequences with low overall similarity benefit less from compression-based methods⚠️ Horizontal gene transfer or recombination events may confound classification
Optimal Use Cases:
- Classifying environmental samples against known taxonomic databases
- Rapid metagenomic profiling
- Exploratory analysis before more computationally intensive methods
The approach can be a fast, parameter-free, universal classifier alternative for many practical biological sequence classification tasks, particularly when labeled data is scarce or computational resources are limited.
Here are some minimal benchmark results (could not do bigger, I have a potato machine 🥔): The used datasets you can find in the test_data directory.
- Reference sequences: 213,119 sequences (Silva S16 rRNA v132)
- Reference avg length: 1,474 bp
- Query sequences: 2 sequences
- Query avg length: 1,379 bp
- Average execution time: ~8.4 seconds (over 10 runs)
- Fastest run: 6.80 seconds
- Memory usage: ~332 MB RAM
- CPU cores used: 6 workers (75% of 8 cores)
- Test system: 8-core CPU
Benchmark command: kompa test_data/query.fa test_data/silva_nr_v132_train_set.fa.gz
- LZ4 Compression: Ultra-fast compression (10x faster than zlib)
- Parallel Processing: Multi-process execution across all CPU cores (Multi-threading is out of question, the tasks are CPU bound, so Python's GIL is a problem)
- Memory Efficient: Streaming architecture with Iterators / Generator, and temporary disk cache storage used to minimize the overhead during the argument pickling for child processes.
- I/O Optimized: Memory-mapped files and 8MB buffers
pip install kompagit clone https://github.com/heloint/kompa.git
cd kompa
pip install .git clone https://github.com/heloint/kompa.git
cd kompa
pip install -e .After installation, you can use the kompa command:
kompa [OPTIONS] queries_file reference_files [reference_files ...]Alternatively, you can run it as a Python module:
python3 -m kompa [OPTIONS] queries_file reference_files [reference_files ...]Positional Arguments:
queries_file- Path to the input queries file (FASTA format, can be gzipped)reference_files- One or more paths to reference database files (FASTA format, can be gzipped)
Optional Arguments:
--max-workers N- Maximum number of worker processes for parallel processing (default: dynamically calculated as 75% of available CPU cores)--no-cpu-limit- Use all available CPU cores (100%) instead of the default 75%. Overrides--max-workers-o, --output PATH- Path to output file. If not specified, results are printed to console--json- Output results in JSON format instead of TSV--cache-file-path PATH- Path to cache file for storing processed reference sequences. If not specified, a temporary file is used and deleted after execution--k-nearest N- Number of nearest neighbors to consider for classification. Must be a positive integer. Default: 5--spreading-limit FLOAT- Maximum spreading threshold for NCD values among k-nearest neighbors. Must be non-negative. Default: 0.05
Basic usage with default settings:
kompa queries.fasta silva_nr_v132_train_set.fa.gzWith custom worker count:
kompa queries.fasta silva_nr_v132_train_set.fa.gz --max-workers 8Max out CPU usage (use all cores):
kompa queries.fasta silva_nr_v132_train_set.fa.gz --no-cpu-limitSave results to TSV file:
kompa queries.fasta silva_nr_v132_train_set.fa.gz -o results.tsvSave results to JSON file:
kompa queries.fasta silva_nr_v132_train_set.fa.gz -o results.json --jsonWith persistent cache file:
kompa queries.fasta silva_nr_v132_train_set.fa.gz --cache-file-path ./my_cache.txtWith custom k-nearest and spreading limit:
kompa queries.fasta silva_nr_v132_train_set.fa.gz --k-nearest 10 --spreading-limit 0.1Multiple reference files with all options:
kompa queries.fasta ref1.fa.gz ref2.fa.gz ref3.fa.gz --max-workers 8 --cache-file-path ./cache.txt --k-nearest 5 --spreading-limit 0.05Display help:
kompa --helpNormalized Compression Distance (NCD): The implementation uses NCD as a similarity metric between sequences. NCD is calculated as:
NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))
where C(x) is the compressed size of sequence x, C(y) is the compressed size of sequence y, and C(xy) is the compressed size of the concatenation of both sequences.
Compression-based Classification: The approach leverages the insight that similar sequences compress better together than dissimilar ones. By using the ultra-fast LZ4 compression algorithm, the classifier can measure semantic similarity without requiring explicit feature engineering.
k-NN Classification: For each query sequence, the tool:
- Computes the NCD between the query and all reference sequences in the reference FASTA dataset
- Selects the k nearest neighbors (sequences with smallest NCD values)
- Classifies the query based on the most common taxonomic label among the neighbors
Spreading Limit: The spreading limit parameter controls classification confidence. When the spread (difference between max and min NCD values) among k-nearest neighbors exceeds this threshold, the classifier selects the single best match instead of voting among neighbors. This helps handle ambiguous cases where neighbors are not tightly clustered.
Kompa supports two output formats: TSV (default) and JSON (with --json flag).
| Field | Description |
|---|---|
query_header |
Header/name of the query sequence |
best_reference_header |
Header of the best matching reference sequence |
normalized_compression_distance |
NCD score (lower is more similar) |
frequency |
Number of times the best reference appears in k-nearest neighbors |
max_k_nearest |
The k value used for classification |
spreading_limit |
The spreading limit threshold used |
Tab-separated values with header row:
query_header best_reference_header normalized_compression_distance frequency max_k_nearest spreading_limit
should-be-Candidatus_Regiella Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacteriales;Enterobacteriaceae;Candidatus_Regiella; 0.4934086629001883 3 5 0.05
should-be-Gloeotrichia_PYH6 Bacteria;Cyanobacteria;Oxyphotobacteria;Nostocales;Nostocaceae;Gloeotrichia_PYH6; 0.6254826254826255 2 5 0.05
Array of result objects:
[
{
"query_header": "should-be-Candidatus_Regiella",
"best_reference_header": "Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacteriales;Enterobacteriaceae;Candidatus_Regiella;",
"normalized_compression_distance": 0.4934086629001883,
"frequency": 3,
"max_k_nearest": 5,
"spreading_limit": 0.05
},
{
"query_header": "should-be-Gloeotrichia_PYH6",
"best_reference_header": "Bacteria;Cyanobacteria;Oxyphotobacteria;Nostocales;Nostocaceae;Gloeotrichia_PYH6;",
"normalized_compression_distance": 0.6254826254826255,
"frequency": 2,
"max_k_nearest": 5,
"spreading_limit": 0.05
}
]- Python 3.9+
- lz4 >= 4.0.0 (automatically installed)
- MIT: See LICENSE file for details.