check_build
A fast, memory-efficient tool to verify VCF files against hg19 and hg38 reference genomes. Also available as a library for general-purpose use beyond VCF.
Quick Start
What build is my file?
check_build --detect my_variants.vcf
# Output: Hg38 (100.0% match, high confidence)
Full verification:
check_build my_variants.vcf
Installation
cargo install check_build
Or from source:
git clone https://github.com/SauersML/check_build.git
cd check_build
cargo build --release
Usage
CLI
# Simple build detection
check_build --detect sample.vcf
# Full verification with summary
check_build sample.vcf
# Quiet mode (no progress bars)
check_build -q sample.vcf
# Summary only (no mismatch details)
check_build -s sample.vcf
# Single reference
check_build --hg38-only sample.vcf
# Custom reference paths
check_build --hg19-path /data/hg19.fa --hg38-path /data/hg38.fa sample.vcf
Library
Add to Cargo.toml:
[dependencies]
check_build = { git = "https://github.com/SauersML/check_build" }
Simple usage:
use check_build::detect_build;
let result = detect_build("sample.vcf")?;
println!("{}", result); // "Hg38 (100.0% match, high confidence)"
Full control:
// BuildDetection is assumed to be exported at the crate root, like Verifier and Reference.
use check_build::{BuildDetection, Reference, Verifier};
let result = Verifier::new("sample.vcf")
    .quiet()
    .verify_both()?;
println!("hg19: {:.1}% match", result.match_rate(Reference::Hg19));
println!("hg38: {:.1}% match", result.match_rate(Reference::Hg38));
// Detailed detection with edge case handling
match result.detect() {
    BuildDetection::Detected { build, confidence, .. } => {
        println!("Build: {:?} ({} confidence)", build, confidence);
    }
    BuildDetection::Ambiguous { reason, .. } => {
        println!("Cannot determine: {}", reason);
    }
    BuildDetection::Unknown { reason, .. } => {
        println!("Problem with file: {}", reason);
    }
    BuildDetection::NoData => {
        println!("No valid variants found");
    }
}
Features
Fast: Parallel verification of hg19/hg38 using rayon
Memory-efficient: Streams references, processes one contig at a time
Auto-download: Fetches reference FASTAs if not present
Edge case handling: Detects ambiguous, unknown, or corrupt files
Dual interface: Both CLI and library
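To illustrate the memory-efficient point, here is a minimal sketch of reading a FASTA file one contig at a time using only the standard library. It is not the crate's actual code: the function name `for_each_contig` and the callback shape are hypothetical, and the real implementation may differ (for example in how it handles compressed or indexed references).

```rust
use std::io::{self, BufRead};

/// Streams a FASTA source and calls `on_contig(name, sequence)` once per record.
/// Only the contig currently being read is held in memory.
/// (Hypothetical helper for illustration; not part of the check_build API.)
fn for_each_contig<R, F>(reader: R, mut on_contig: F) -> io::Result<()>
where
    R: BufRead,
    F: FnMut(&str, &[u8]),
{
    let mut name: Option<String> = None;
    let mut seq: Vec<u8> = Vec::new();

    for line in reader.lines() {
        let line = line?;
        if let Some(header) = line.strip_prefix('>') {
            // A new header closes the previous record, if there was one.
            if let Some(prev) = name.take() {
                on_contig(&prev, &seq);
            }
            // Reuse the same buffer for the next contig.
            seq.clear();
            // The contig name is the header text up to the first whitespace.
            name = Some(header.split_whitespace().next().unwrap_or("").to_string());
        } else if name.is_some() {
            // Uppercase the bases so soft-masked (lowercase) regions compare equal.
            seq.extend(line.trim_end().bytes().map(|b| b.to_ascii_uppercase()));
        }
    }
    // Emit the final record, which has no following header to close it.
    if let Some(last) = name {
        on_contig(&last, &seq);
    }
    Ok(())
}
```

With this shape, peak memory is bounded by the longest contig rather than by the whole genome, at the cost of requiring per-contig work to happen inside the callback.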
How It Works
Splits VCF by contig into temp files
Streams each reference FASTA (never loads full genome)
Verifies REF alleles match reference bases
Reports match rates and infers build
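The grouping, REF check, and match-rate steps above can be sketched roughly as follows. This is an in-memory illustration under stated assumptions, not the tool's implementation: the `Record` struct and the functions `group_by_contig`, `ref_matches`, and `match_rate` are hypothetical names, and the real tool works with temp files and streamed references instead of vectors.

```rust
use std::collections::BTreeMap;

/// Minimal stand-in for one VCF data line (hypothetical type, not the crate's).
struct Record {
    contig: String,
    pos: usize, // 1-based position, as in the VCF POS column
    ref_allele: String,
}

/// Groups records by contig so each contig can be checked in a single pass.
fn group_by_contig(records: Vec<Record>) -> BTreeMap<String, Vec<Record>> {
    let mut groups: BTreeMap<String, Vec<Record>> = BTreeMap::new();
    for r in records {
        groups.entry(r.contig.clone()).or_default().push(r);
    }
    groups
}

/// Returns true if the REF allele equals the reference bases starting at POS.
fn ref_matches(contig_seq: &[u8], rec: &Record) -> bool {
    // Convert the 1-based VCF position to a 0-based index; POS 0 cannot match.
    let start = match rec.pos.checked_sub(1) {
        Some(s) => s,
        None => return false,
    };
    let end = start + rec.ref_allele.len();
    match contig_seq.get(start..end) {
        Some(bases) => bases.eq_ignore_ascii_case(rec.ref_allele.as_bytes()),
        None => false, // the allele runs past the end of the contig
    }
}

/// Percentage of records whose REF allele matches; None if there are no records.
fn match_rate(contig_seq: &[u8], records: &[Record]) -> Option<f64> {
    if records.is_empty() {
        return None;
    }
    let hits = records.iter().filter(|r| ref_matches(contig_seq, r)).count();
    Some(100.0 * hits as f64 / records.len() as f64)
}
```

A file aligned to the correct build should score close to 100% on that reference and noticeably lower on the other, which is what makes the match rate usable for build inference.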
Exit Codes
Code  Meaning
0     Success (build detected or verification passed)
1     Error (file not found, download failed, etc.)
2     Ambiguous (matches both builds similarly)
3     Unknown (low match on both, possibly corrupt)
4     No data (VCF had no valid variants)
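For callers that script the CLI, the exit codes can be mapped to a small enum. The sketch below assumes `check_build` is on the PATH and that `--detect` uses the same exit codes as full verification; the `Outcome` enum and both function names are hypothetical and not part of the crate.

```rust
use std::io;
use std::process::Command;

/// Outcome categories mirroring the documented exit codes (hypothetical type).
#[derive(Debug, PartialEq)]
enum Outcome {
    Success,
    Error,
    Ambiguous,
    Unknown,
    NoData,
}

/// Maps an exit code to its documented meaning; None for undocumented codes.
fn outcome_from_exit_code(code: i32) -> Option<Outcome> {
    match code {
        0 => Some(Outcome::Success),
        1 => Some(Outcome::Error),
        2 => Some(Outcome::Ambiguous),
        3 => Some(Outcome::Unknown),
        4 => Some(Outcome::NoData),
        _ => None,
    }
}

/// Runs `check_build --detect <vcf_path>` and interprets its exit status.
/// Returns Ok(None) if the process was killed by a signal or exited with
/// a code that is not listed in the README.
fn run_detect(vcf_path: &str) -> io::Result<Option<Outcome>> {
    let status = Command::new("check_build")
        .arg("--detect")
        .arg(vcf_path)
        .status()?;
    Ok(status.code().and_then(outcome_from_exit_code))
}
```

Keeping codes 2 to 4 distinct from the generic error code 1 lets a pipeline treat an ambiguous or empty VCF differently from a missing file or failed download.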
License
MIT