AI4S-YB/fastqc-rs

FastQC-RS

A Rust implementation of FastQC, a quality control tool for high-throughput sequence data. This is a 1:1 rewrite of FastQC v0.12.1 with the same output format and analysis algorithms.

Features

  • Built-in Trim Galore support — adapter/quality trimming via fastqc-rs trim-galore, a Rust reimplementation wrapping Cutadapt

  • 12 analysis modules with identical algorithms and pass/warn/fail thresholds:

    1. Basic Statistics
    2. Per Base Sequence Quality
    3. Per Tile Sequence Quality
    4. Per Sequence Quality Scores
    5. Per Base Sequence Content
    6. Per Sequence GC Content
    7. Per Base N Content
    8. Sequence Length Distribution
    9. Sequence Duplication Levels
    10. Overrepresented Sequences
    11. Adapter Content
    12. Kmer Content
  • Input formats: FASTQ (plain, gzip, bzip2), BAM, SAM

  • Output: HTML report with SVG graphs, ZIP archive, fastqc_data.txt, summary.txt

  • Output compatibility: Text reports match Java FastQC output (identical PASS/WARN/FAIL, near-identical data values)

  • Multi-file parallel processing via rayon

  • Single binary with embedded configuration files
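The input formats listed above can usually be told apart by their leading bytes. The following is a minimal sketch of that idea, not the detection logic fastqc-rs actually uses (the README does not describe it); `Container` and `sniff_container` are hypothetical names introduced for illustration.

```rust
/// Compression container guessed from the first bytes of a file.
/// Hypothetical type, not part of fastqc-rs.
#[derive(Debug, PartialEq, Eq)]
pub enum Container {
    /// gzip magic bytes; BAM files use BGZF framing, which starts the same way.
    Gzip,
    /// bzip2 streams start with the ASCII bytes "BZh".
    Bzip2,
    /// Anything else is treated as uncompressed text (FASTQ or SAM).
    Plain,
}

/// Inspect the head of a file and guess its container.
pub fn sniff_container(head: &[u8]) -> Container {
    if head.starts_with(&[0x1f, 0x8b]) {
        Container::Gzip
    } else if head.starts_with(b"BZh") {
        Container::Bzip2
    } else {
        Container::Plain
    }
}
```

A sniffer like this would only be a fallback; the `-f, --format` option remains the explicit way to force a format.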

Installation

cargo install --path .

Or build from source:

cargo build --release

Usage

# Basic usage
fastqc-rs input.fastq

# Multiple files with parallel processing
fastqc-rs -t 4 sample1.fastq.gz sample2.fastq.gz

# Specify output directory
fastqc-rs -o results/ input.fastq

# BAM/SAM files
fastqc-rs input.bam
fastqc-rs -f sam_mapped input.sam   # Only mapped reads, with soft-clip removal

# Extract results from ZIP
fastqc-rs --extract input.fastq

# Quiet mode (suppress progress)
fastqc-rs -q input.fastq

CLI Options

| Option | Description |
| --- | --- |
| -o, --outdir <DIR> | Output directory (must exist) |
| --extract | Unzip output after creation |
| --noextract | Don't unzip output (default) |
| --delete | Delete ZIP after extraction |
| -f, --format <FMT> | Force format: fastq, bam, sam, bam_mapped, sam_mapped |
| -c, --contaminants <FILE> | Custom contaminant list |
| -a, --adapters <FILE> | Custom adapter list |
| -l, --limits <FILE> | Custom pass/warn/fail thresholds |
| -t, --threads <N> | Number of files to process simultaneously (default: 1) |
| -k, --kmers <SIZE> | Kmer length (default: 7) |
| -q, --quiet | Suppress progress messages |
| --casava | CASAVA mode (filter flagged reads) |
| --nogroup | Disable base position grouping |
| --expgroup | Use exponential base grouping |
| --min-length <BP> | Minimum sequence length for grouping |
| --dup-length <BP> | Truncation length for duplication analysis |
| --svg | Output SVG graphs |

Output

For each input file sample.fastq.gz, produces:

  • sample_fastqc.html — Interactive HTML report
  • sample_fastqc.zip — Archive containing:
    • fastqc_report.html
    • fastqc_data.txt — Tab-delimited analysis data
    • summary.txt — PASS/WARN/FAIL per module
    • Icons/ — Status icons
    • Images/ — SVG graphs
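The naming rule above (sample.fastq.gz becomes sample_fastqc.html and sample_fastqc.zip) can be sketched as a small helper. This assumes that one compression suffix and one format suffix are stripped; the suffix lists and the names `output_stem` and `report_names` are illustrative and not taken from the fastqc-rs source.

```rust
/// Strip one compression suffix, then one sequence-format suffix.
/// The suffix lists are assumptions made for this sketch.
pub fn output_stem(file_name: &str) -> &str {
    let mut stem = file_name;
    for ext in [".gz", ".bz2"] {
        if let Some(s) = stem.strip_suffix(ext) {
            stem = s;
            break;
        }
    }
    for ext in [".fastq", ".fq", ".bam", ".sam"] {
        if let Some(s) = stem.strip_suffix(ext) {
            stem = s;
            break;
        }
    }
    stem
}

/// Return the (HTML report, ZIP archive) names for an input file name.
pub fn report_names(file_name: &str) -> (String, String) {
    let stem = output_stem(file_name);
    (
        format!("{}_fastqc.html", stem),
        format!("{}_fastqc.zip", stem),
    )
}
```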

Trim Galore

A built-in Rust reimplementation of Trim Galore (by Felix Krueger), wrapping Cutadapt for adapter and quality trimming. Requires Cutadapt to be installed separately.

# Single-end with adapter auto-detection
fastqc-rs trim-galore reads.fq.gz

# Paired-end with custom cutadapt path
fastqc-rs trim-galore --paired --path-to-cutadapt /opt/bin/cutadapt \
  R1.fq.gz R2.fq.gz

# Illumina adapter, quality 30, min length 50, 4 cores
fastqc-rs trim-galore --illumina -q 30 --length 50 -j 4 -o trimmed/ reads.fq.gz

# Hard-trim to first 75bp
fastqc-rs trim-galore --hardtrim5 75 reads.fq.gz

# See all options
fastqc-rs trim-galore --help

Key options:

| Option | Description |
| --- | --- |
| --paired | Paired-end mode |
| -a, --adapter <SEQ> | Custom adapter sequence |
| --illumina / --nextera / --small-rna / --bgiseq | Adapter presets |
| -q, --quality <INT> | Quality cutoff (default: 20) |
| --length <INT> | Minimum read length (default: 20) |
| -j, --cores <N> | Number of Cutadapt cores |
| --path-to-cutadapt <PATH> | Path to cutadapt executable |
| --clip_R1 / --clip_R2 <INT> | Clip N bp from 5' end |
| --three_prime_clip_R1 / --three_prime_clip_R2 <INT> | Clip N bp from 3' end |
| --hardtrim5 / --hardtrim3 <INT> | Hard-trim to N bp from 5'/3' end |
| --rrbs | RRBS mode (MspI-digested) |
| --fastqc | Run FastQC after trimming |
| -o, --output_dir <DIR> | Output directory |

Performance

Benchmarked on a paired-end Illumina dataset (~1.15 GB / ~1.20 GB gzipped FASTQ, ~9.9M reads x 150bp):

Baseline (direct Rust rewrite, no optimization)

| File | FastQC v0.12.1 (Java) | fastqc-rs (Rust) | Speedup |
| --- | --- | --- | --- |
| SPL1E1_raw_1.fastq.gz (1.15 GB) | 48.6s | 46.8s | 1.04x |
| SPL1E1_raw_2.fastq.gz (1.20 GB) | 47.6s | 45.7s | 1.04x |

Optimized v1 (zlib-rs + 2-thread pipeline + ahash + LTO)

Optimizations applied: zlib-rs decompression backend, reader/processor pipeline (overlapping I/O with compute), AHashMap for hot-path modules, in-place ASCII uppercase, 256KB I/O buffer, LTO + codegen-units=1.
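Two of these optimizations, the 256KB I/O buffer and in-place ASCII uppercasing, need only the standard library. The sketch below illustrates the technique rather than the fastqc-rs code: `count_bases` is a hypothetical function, and it assumes strict four-line FASTQ records.

```rust
use std::io::{self, BufRead, BufReader, Read};

/// 256 KB read buffer, matching the size named in the list above.
const IO_BUFFER_BYTES: usize = 256 * 1024;

/// Count A, C, G, T and "other" bases across all sequence lines.
/// Assumes strict 4-line FASTQ records (illustrative only).
pub fn count_bases<R: Read>(input: R) -> io::Result<[u64; 5]> {
    let mut reader = BufReader::with_capacity(IO_BUFFER_BYTES, input);
    // One line buffer reused for every line, so no per-line allocation.
    let mut line: Vec<u8> = Vec::new();
    let mut counts = [0u64; 5];
    let mut line_no = 0u64;
    loop {
        line.clear();
        if reader.read_until(b'\n', &mut line)? == 0 {
            break; // end of input
        }
        // The second line of each 4-line record is the sequence.
        if line_no % 4 == 1 {
            while matches!(line.last().copied(), Some(b'\n') | Some(b'\r')) {
                line.pop();
            }
            // Uppercase in place instead of building a new String.
            line.make_ascii_uppercase();
            for &b in &line {
                let slot = match b {
                    b'A' => 0,
                    b'C' => 1,
                    b'G' => 2,
                    b'T' => 3,
                    _ => 4,
                };
                counts[slot] += 1;
            }
        }
        line_no += 1;
    }
    Ok(counts)
}
```

Any `Read` source works here, so the same function could sit behind a gzip decoder or a plain `File`.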

| File | FastQC v0.12.1 (Java) | fastqc-rs (optimized) | Speedup vs Java | Speedup vs baseline |
| --- | --- | --- | --- | --- |
| SPL1E1_raw_1.fastq.gz (1.15 GB) | 48.6s | 38.8s | 1.25x | 1.21x |
| SPL1E1_raw_2.fastq.gz (1.20 GB) | 47.6s | 39.1s | 1.22x | 1.17x |

Tested on Linux 6.6 (WSL2). Pipeline uses 2 threads (reader + processor). CPU utilization ~117%. Bottleneck is module processing (~85% of wall time).

Optimized v2 (data-parallel multi-threaded processing, output bug) #0458a08

Added data-parallel architecture: 1 reader thread + 4 worker threads, each with independent module copies. Workers process sequence subsets in parallel, results merged after completion.
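The shape of this architecture can be sketched with standard-library threads and channels. This is a simplified illustration under stated assumptions, not the fastqc-rs implementation: `BaseCounts` stands in for a real analysis module, the calling thread plays the reader role, and batches are handed out round-robin.

```rust
use std::sync::mpsc;
use std::thread;

/// Stand-in for one analysis module: state that can be merged across workers.
#[derive(Default, Debug, PartialEq, Eq)]
pub struct BaseCounts {
    pub reads: u64,
    pub bases: u64,
    pub n_bases: u64,
}

impl BaseCounts {
    fn process(&mut self, seq: &[u8]) {
        self.reads += 1;
        self.bases += seq.len() as u64;
        self.n_bases += seq.iter().filter(|&&b| b == b'N').count() as u64;
    }

    fn merge(&mut self, other: BaseCounts) {
        self.reads += other.reads;
        self.bases += other.bases;
        self.n_bases += other.n_bases;
    }
}

/// Distribute batches of sequences to `workers` threads and merge their results.
pub fn analyze_parallel<I>(sequences: I, workers: usize, batch_size: usize) -> BaseCounts
where
    I: IntoIterator<Item = Vec<u8>>,
{
    assert!(workers > 0 && batch_size > 0);
    let mut senders = Vec::new();
    let mut handles = Vec::new();
    for _ in 0..workers {
        // Bounded channel so the reader cannot run far ahead of the workers.
        let (tx, rx) = mpsc::sync_channel::<Vec<Vec<u8>>>(4);
        senders.push(tx);
        handles.push(thread::spawn(move || {
            // Each worker owns an independent copy of the module state.
            let mut module = BaseCounts::default();
            for batch in rx {
                for seq in &batch {
                    module.process(seq);
                }
            }
            module
        }));
    }

    // Reader role: batch the input and hand batches out round-robin.
    let mut batch = Vec::with_capacity(batch_size);
    let mut next = 0;
    for seq in sequences {
        batch.push(seq);
        if batch.len() == batch_size {
            senders[next]
                .send(std::mem::take(&mut batch))
                .expect("worker exited early");
            next = (next + 1) % workers;
        }
    }
    if !batch.is_empty() {
        senders[next].send(batch).expect("worker exited early");
    }
    drop(senders); // closing the channels lets the workers finish

    // Merge the per-worker results after all workers complete.
    let mut total = BaseCounts::default();
    for handle in handles {
        total.merge(handle.join().expect("worker panicked"));
    }
    total
}
```

Merging is only safe for modules whose result does not depend on the order in which reads are seen; simple counters like these qualify.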

| File | FastQC v0.12.1 (Java) | fastqc-rs (multi-threaded) | Speedup vs Java | Speedup vs baseline |
| --- | --- | --- | --- | --- |
| SPL1E1_raw_1.fastq.gz (1.15 GB) | 48.6s | 9.4s | 5.2x | 5.0x |
| SPL1E1_raw_2.fastq.gz (1.20 GB) | 47.6s | 9.4s | 5.1x | 4.9x |

This version was fast, but not output-compatible. Multi-threaded duplication tracking changed the original FastQC semantics, and per-sequence quality bucketing still used Rust-side rounding instead of Java truncation.
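The bucketing difference is easy to reproduce. The sketch below assumes that Java FastQC derives a read's mean quality with integer division (truncation), as the paragraph above implies; both function names are hypothetical.

```rust
/// Mean Phred quality of a read, truncated toward zero (integer division).
/// Assumed here to mirror the Java-style bucketing described above.
pub fn mean_quality_truncated(qual: &[u8], offset: u8) -> Option<u32> {
    if qual.is_empty() {
        return None;
    }
    let sum: u32 = qual.iter().map(|&q| u32::from(q.saturating_sub(offset))).sum();
    Some(sum / qual.len() as u32)
}

/// Same mean, but rounded to the nearest integer.
pub fn mean_quality_rounded(qual: &[u8], offset: u8) -> Option<u32> {
    if qual.is_empty() {
        return None;
    }
    let sum: u32 = qual.iter().map(|&q| u32::from(q.saturating_sub(offset))).sum();
    Some((f64::from(sum) / qual.len() as f64).round() as u32)
}
```

For a Phred+33 quality string `IIH` (scores 40, 40, 39; mean 39.67), truncation puts the read in bucket 39 while rounding puts it in bucket 40, so the two approaches produce different per-sequence quality histograms.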

Optimized v2 fixed (reader-ordered duplication + Java-compatible bucketing) #3fdd3b9

Keeps the 1 reader + 4 worker architecture, but moves duplication tracking back to the reader thread so it preserves original file order. Also restores Java-compatible per-sequence quality bucketing and more closely matches Java number formatting.
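To see why file order matters for duplication tracking, consider a tracker that stops admitting new sequences once a cap is reached. FastQC's documentation describes analysing only sequences that first appear within the first 100,000 sequences of a file; the sketch below simplifies that to a cap on distinct sequences, which is an assumption, and `DupTracker` is a hypothetical type.

```rust
use std::collections::HashMap;

/// Order-dependent duplicate counter: only the first `limit` distinct
/// sequences are ever tracked. Illustrative simplification.
pub struct DupTracker {
    limit: usize,
    counts: HashMap<Vec<u8>, u64>,
}

impl DupTracker {
    pub fn new(limit: usize) -> Self {
        DupTracker { limit, counts: HashMap::new() }
    }

    /// Record one sequence, in the order it appears in the file.
    pub fn observe(&mut self, seq: &[u8]) {
        if let Some(count) = self.counts.get_mut(seq) {
            *count += 1;
        } else if self.counts.len() < self.limit {
            self.counts.insert(seq.to_vec(), 1);
        }
        // Otherwise the cap was reached and the new sequence is ignored.
    }

    /// Number of times a tracked sequence was seen (0 if never tracked).
    pub fn count(&self, seq: &[u8]) -> u64 {
        self.counts.get(seq).copied().unwrap_or(0)
    }
}
```

With a cap of 2, the stream A, B, C, C, A tracks A and B and never counts C, while the same reads ordered C, C, A, B, A track C and A instead. Workers that each see an arbitrary subset of reads would hit the cap with different sequences, which is consistent with moving this module back to the reader thread.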

| File | FastQC v0.12.1 (Java) | fastqc-rs (fixed multi-threaded) | Speedup vs Java | Speedup vs baseline |
| --- | --- | --- | --- | --- |
| SPL1E1_raw_1.fastq.gz (1.15 GB) | 48.6s | 13.0s | 3.7x | 3.6x |
| SPL1E1_raw_2.fastq.gz (1.20 GB) | 47.6s | 13.7s | 3.5x | 3.3x |

Tested on Linux 6.6 (WSL2). Uses 5 threads total (1 reader + 4 workers). summary.txt matches Java FastQC exactly, the Per Sequence Quality Scores data matches exactly, and Sequence Duplication Levels now differs only in trailing floating-point digits.

Comprehensive Benchmark (cold cache, Illumina + ONT)

Tested on Intel i9-13900K (8C/16T), Linux 6.6 (WSL2), all runs with cold disk cache (echo 3 > /proc/sys/vm/drop_caches).

Illumina short reads (~9.9M reads x 150bp per file):

| Test | FastQC v0.12.1 (Java) | fastqc-rs | Speedup |
| --- | --- | --- | --- |
| Single file (1.15 GB gz) | 52.3s | 24.8s | 2.1x |
| Two files -t 2 (2.35 GB gz) | 50.5s | 14.1s | 3.6x |

ONT long reads (500K reads, variable length 200bp–150kbp, 3.99 GB gz):

| Test | FastQC v0.12.1 (Java) | fastqc-rs | Speedup |
| --- | --- | --- | --- |
| Single file | 154.7s | 32.9s | 4.7x |

Note on Java memory: FastQC (Java) defaults to a 512 MB heap and will run out of memory (OOM) on large ONT datasets. The benchmark used --memory 5120 (5 GB) for the ONT test. fastqc-rs has no fixed heap limit; it allocates memory as the data requires.

Compatibility

Output format is compatible with Java FastQC v0.12.1:

  • summary.txt — identical (PASS/WARN/FAIL per module match exactly)
  • fastqc_data.txt — nearly identical with minor differences:
    • Per sequence quality scores: Matches Java FastQC exactly after restoring Java-style truncation for per-read mean quality.
    • Sequence Duplication Levels: PASS/WARN/FAIL status matches exactly; remaining differences are limited to floating-point tail digits in the deduplicated percentage and related percentages.
    • Number formatting: Small and large values are formatted closer to Java style, including scientific notation where applicable.
    • Floating-point precision: Some modules still differ only in the last 1-2 digits due to f64 vs Java double rounding.
  • Modules with identical or effectively identical data: Basic Statistics, Per Base Sequence Quality, Per Sequence Quality Scores, Per Sequence GC Content, Sequence Length Distribution, Overrepresented Sequences, Sequence Duplication Levels
  • Same module ordering and threshold logic
  • Same CLI flags (with minor naming convention differences for multi-word flags)
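The summary.txt claim above can be checked mechanically. A minimal sketch, assuming each line of summary.txt holds tab-separated status, module name, and file name (the layout Java FastQC writes); `parse_summary` and `status_mismatches` are hypothetical helpers, not part of fastqc-rs.

```rust
use std::collections::HashMap;

/// Parse summary.txt into a module -> status map.
/// Lines with fewer than two tab-separated fields are skipped.
pub fn parse_summary(text: &str) -> HashMap<&str, &str> {
    text.lines()
        .filter_map(|line| {
            let mut fields = line.split('\t');
            let status = fields.next()?;
            let module = fields.next()?;
            Some((module, status))
        })
        .collect()
}

/// Modules from the Java summary whose status differs, or is missing,
/// in the Rust summary. Returned sorted by module name.
pub fn status_mismatches<'a>(java: &'a str, rust: &'a str) -> Vec<&'a str> {
    let java = parse_summary(java);
    let rust = parse_summary(rust);
    let mut diff: Vec<&str> = java
        .iter()
        .filter(|&(module, status)| rust.get(module) != Some(status))
        .map(|(module, _)| *module)
        .collect();
    diff.sort_unstable();
    diff
}
```

An empty result means every module listed by Java FastQC has the same PASS/WARN/FAIL status in the fastqc-rs output.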

Citation

If you use this project in academic work, please cite the original FastQC release/publication.

License

This project is licensed under the GNU General Public License v3.0 (GPL-3.0), consistent with the original FastQC project it rewrites.

Acknowledgements
