Skip to content

CUDA port of bwa aln#457

Open
teepean wants to merge 29 commits into
lh3:masterfrom
teepean:cuda-port
Open

CUDA port of bwa aln#457
teepean wants to merge 29 commits into
lh3:masterfrom
teepean:cuda-port

Conversation

@teepean

@teepean teepean commented Jun 1, 2026

Copy link
Copy Markdown

This fork adds gpualn, a CUDA implementation of BWA-backtrack (bwa aln) for single-end reads, targeting ancient DNA (aDNA) where seeding is disabled (-l 1024) — the regime where seed-and-extend aligners (and the older BarraCUDA) struggle. It produces a bit-exact .sai/SAM (byte-identical to CPU bwa aln/samse) while running ~5x faster than 16 CPU threads on a single NVIDIA RTX 3090 (full 3.95M-read aDNA file: 161 s vs 895 s; .sai md5 identical).

teepean and others added 29 commits June 1, 2026 10:29
Phase 0: instrument bwt_match_gap (ALN_PROFILE) to measure aDNA search-tree
sizes. On real data (AVA1B, hs37d5, -l 1024 -n 0.01 -o 2): 0.52% mapping,
99.3% zero-hit reads dominate, no read hits the 2M max_entries cap, tree size
spans 6 orders of magnitude. Key consequence recorded in CUDA_PORT_PLAN.md:
for zero-hit reads DFS and best-first visit the identical node set, so the
per-thread priority queue (BarraCUDA's memory wall) is unneeded for the bulk;
target a persistent-thread work-pool + per-read iterative DFS, reconciling the
~0.5% hit-finding reads via the exact CPU path for a bit-exact .sai.

Phase 1: cuda/fm_device.cuh mirrors bwt_occ4/bwt_2occ4/bwt_occ/bwt_match_exact
on the existing 64-byte-bucket FM-index layout (cnt_table/L2 in constant mem,
.bwt in global mem, __ldg reads); the suffix array is not needed on the GPU.
cuda/fmtest.cu validates against the CPU: 10M random Occ4 probes and full
backward exact-search of 10k real reads both PASS with 0 mismatches at
2.28 G-Occ4/s on an RTX 3090. Added `make fmtest` target.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
cuda/dfstest.cu + device DFS in fm_device.cuh. Host reuses bwa's real read
I/O, bwt_cal_width, complement and per-read max_diff so the device DFS sees
exactly the (seq, width, opt) that CPU bwt_match_gap sees. The DFS reproduces
the bounded search-tree node set and reports has_hit per read; the ~0.5% reads
that find a hit are reconciled via the exact CPU bwt_match_gap.

Validated on sub10k vs the golden .sai (md5 54410c75...):
- CPU-only .sai == golden (harness replicates the driver)
- GPU has_hit matches CPU n_aln>0 with ZERO false negatives
- hybrid .sai (GPU-gated, CPU-reconciled) == golden  -> BIT-EXACT

Performance is intentionally a baseline only: the naive one-read-per-thread
grid-stride with a global-memory stack runs at ~100 reads/s, ~33x slower than
the 16-core CPU -- reproducing the BarraCUDA failure mode. Causes diagnosed in
cuda/PROGRESS.md (warp divergence from 6-orders tree-size variance, low
occupancy, uncoalesced per-thread stacks, no 2occ4 fast path). The whole run
is ~227x off the measured FM-index throughput ceiling, so the loss is entirely
scheduling/memory -- step 2 (work-pool + coalescing + divergence fixes) is the
speedup lever.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Optimized the device DFS engine while preserving bit-exact .sai:
- coalesced SoA stack ([sp*P+slot]) and occupancy-sized persistent-thread
  work-pool (atomic work counter); large runs reconcile only flagged/hit
  reads from stored flat data (the real production hybrid; full CPU reference
  is now opt-in via DFS_FULLCPU).
- per-read WORK BUDGET (DFS_BUDGET, default 2M pops): the decisive fix. The
  naive ~100s was dominated by a single read doing 33.1M node-pops on one
  thread; the CPU's max_entries=2M cap bails on such reads, the GPU did not.
  Capping GPU work and flagging overflow reads to the exact CPU path is free
  for correctness (those reads are reconciled on CPU regardless).

Measured on RTX 3090, hs37d5, -l 1024 -n 0.01 -o 2:
- sub2k:  100s -> 4.5s after budget (2 reads flagged), bit-exact.
- sub100k: 12.08s (8276 reads/s) + 2.37s 1-thread reconcile of 520 reads;
  hybrid .sai md5 == golden (eecf35c1). CPU bwa -t 16 = 23.47s (4261 r/s)
  on the same box -> ~1.9x the 16-core CPU, ~83x over the naive GPU baseline.
  338M pops/s ~= 38% of the FM-index occ4 ceiling; headroom remains.

Next: 2occ4 shared-bucket fast path, lower register pressure for occupancy,
cheaper exact-match tail, full-file streaming + multithreaded reconcile.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…exact

Profiling-guided optimization (consulted CUDA Ampere tuning guide, NVIDIA
forums, Volkov latency-hiding, and the 2025 arXiv N-Queens GPU-DFS paper):
- Added bwt_2occ4 shared-bucket fast path: NO change -> not occ-load bound.
- Swept occupancy via maxrregcount: MORE warps were SLOWER (40w 6445 r/s,
  48w 5243 r/s vs 28w 8276) -> NOT occupancy/latency bound. The kernel is
  bound by the global-memory DFS stack (per-thread live stack working set
  exceeds the 6 MB L2 as threads scale).
- Fix: in-register-continue DFS -- keep the current node in registers, push
  only siblings, read the global stack only on backtrack. Halves stack
  round-trips and removes the push->immediately-pop dependency. 8276 -> 9934
  reads/s (+20%), 69 registers.

sub100k stays bit-exact (hybrid .sai md5 == golden eecf35c1). Standing:
~2.3x the 16-core CPU (4261 r/s), ~27x one core, 99x over naive GPU baseline.
~45% of the FM-index occ4 throughput ceiling.

Records: cuda/PROGRESS.md (optimization table) and cuda/REFERENCES.md (docs/
forum findings + next levers: subtree-per-warp with a shared-memory stack).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Records the 6 proposed optimizations with an adopt/reject analysis in
cuda/OPTIMIZATION_IDEAS.md. Adopting lh3#1 (warp-level parallel-branch DFS) +
lh3#2 (register+shared-memory stack) as the next big lever, lh3#5 (stream-overlapped
multithreaded reconcile) for the full-file engine, lh3#4 (vectorized checkpoint
load) as a minor follow-up.

Flagged lh3#3 (exact-seed zero-hit fast path) as REJECTED: it breaks correctness
under -l 1024. With seeding disabled (the aDNA case) a read may map with
mismatches in the first 32 bp, so an exact-seed filter would skip mappable
reads -> false negatives -> non-bit-exact .sai. A conservative q-gram lower
bound could be safe but is deferred.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Added DFS stack-DEPTH instrumentation: sub100k mean 111, max 386, so a
512-entry stack covers every read (cap 1024->512, 0 overflow). Tried
block-local stack striding to improve locality: NO win (9165 vs 9934 r/s,
bit-exact). Negative result confirms the global stack is bound by traffic
VOLUME, not layout -> only removing global stack traffic (shared-memory
stack) or sharing it (warp cooperation) can help. A per-thread full shared
stack can't fit (~1 MB/block), so the shared stack requires one-read-per-warp
-> implement the warp-cooperative engine (Idea lh3#1+lh3#2) next.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ernel, bit-exact

Add DFS_ENGINE=warp: one read per warp, 32 lanes co-explore the tree as a
frontier, per-warp stack in SHARED memory (SoA k/l/packed). Each wave pops up
to 32 top nodes, expands in parallel, warp-scan compacts children, pushes back
to the shared stack; hit via __any_sync ballot; budget/overflow -> CPU.
Removes global stack traffic entirely (the diagnosed bottleneck).

sub100k: GPU kernel 3366 ms = 29,708 reads/s -- 3x the thread engine (9934),
~7x the 16-core CPU (4261); bit-exact (md5 eecf35c1). The monster reads that
serialized on one thread now spread across 32 lanes.

Known issue to fix next: CAP=640 shared stack overflows for 4.6% of reads
(warp frontier is wider than the serial DFS depth 386), flagging them to the
CPU -> 84 s single-thread reconcile that dwarfs the GPU time. Fixes: larger
CAP to cut overflow, and multithreaded/overlapped reconcile (Idea lh3#5).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…exact

Resolved the overflow-flag bottleneck:
- CAP sweep: 640->4.6% flagged, 1024->0.28%, 1216->0.065%. Larger CAP keeps
  bushy reads on the GPU instead of punting them to the CPU. Default CAP=1024.
- Multithreaded the CPU reconcile (std::thread, per-thread gap_stack;
  bwt_match_gap is thread-safe): 12.1 s -> 1.57 s wall on 16 threads.
- Added DFS_NORECON for perf sweeps.

sub100k (RTX 3090, bit-exact md5 eecf35c1):
- warp GPU kernel 21,432 reads/s (CAP=1024)
- end-to-end GPU + 16-thread reconcile (sequential) 6.24 s = 16,026 reads/s
- streaming-overlapped would be GPU-bound ~21,432 r/s
- vs CPU bwa -t 16 = 4,261 r/s -> ~3.8x (sequential) to ~5x (overlapped),
  ~210x over the naive GPU baseline.

Bottleneck documented in cuda/PROGRESS.md. Remaining levers: shared+global
two-level stack for higher occupancy, and a full-file streaming engine with
reconcile overlapping the next GPU batch.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…o-end ceiling

Implement dynamic pop-count in the warp engine (Lever 1):
- WAVE mode: pop up to 32 nodes, but only while room=(CAP-sp)/9 fits every
  node's <=9 children -> wave mode never overflows.
- DRAIN mode (near full): pop 1, keep a child in registers (carry) so
  single-child chains never touch the stack. Gated to sp>=CAP/2 to preserve
  the wave ramp-up for normal reads. MAX_CHILDREN confirmed = 9.
- All bit-exact (sub2k false-pos 90 -> 1).

CAP sweep (sub100k, carry-drain): 256->27% flag/59k r/s (16 warps/SM),
384->5.6%/35.7k, 512->2.2%/21k, 768->0.34%/22.2k (4 warps/SM). With the
overlapped 16-thread reconcile, end-to-end is GPU-bound ~22k for CAP>=768 and
reconcile-bound for small CAP. KEY FINDING: small-CAP high-occupancy speed is
illusory because it punts the bushy reads (2-5%) to the CPU; those reads cost
similarly on GPU (low occ) or CPU (reconcile), pinning end-to-end at ~22k r/s
(~5x the 16-core CPU) regardless of CAP. Default CAP=768.

To break the ceiling, bushy reads must run on the GPU at high occupancy ->
next build is the two-level shared(small)+global-backing stack.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ndex ceiling

Add DFS_ENGINE=warp2: per-warp shared top-window (CAP_SM) + pre-allocated
per-warp GLOBAL backing, with warp-parallel coalesced spill/unspill of
128-entry chunks (global=oldest bottom, shared=newest top). Bushy reads spill
their deep frontier to global and STAY ON THE GPU instead of flagging to CPU.

sub100k, bit-exact (md5 eecf35c1): flag rate driven to ~0.015% (vs 0.34%
single-level) -> CPU reconcile negligible (0.61 s). CAP_SM sweep:
256->15.9k r/s (16 w/SM), 384->18.8k, 512->21.5k (8 w/SM), 768->22.1k (4 w/SM).

KEY FINDING: occupancy is NOT the limiter -- 4..16 warps/SM all plateau at
~22k reads/s. We are at the FM-index occ4 ceiling: ~8e9 occ4 for this work at
the measured 2.28 G-occ4/s = ~3.6 s floor vs ~4.5 s actual (~80% of ceiling,
~5x the 16-core CPU). 40-50k is not reachable on this GPU; the kernel is bound
by FM-index probe throughput, not latency hiding. Further raw speedup needs
fewer occ4/node (bit-exactness-constrained) or a faster FM-index.

warp2 ships as the engine (ceiling speed + ~0 reconcile tail -> clean for the
streaming engine). Default CAP_SM=512. Next: full-file streaming (deliverable).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Extract the warp2 two-level engine into cuda/dfs_engine.cuh and add
cuda/aln_gpu.cu = bwa-aln-gpu: a full-file GPU bwa-aln. Streams the FASTQ in
bwa's native 0x40000-read chunks (chunking matches the CPU driver), per chunk:
MT host preprocessing (bwt_cal_width + complement + per-read max_diff) ->
warp2 GPU has_hit -> MT CPU reconcile of the flagged/hit reads via exact
bwt_match_gap -> write records in read order. One-time global-backing alloc
and reusable (grown-on-demand) device buffers; SAI header + records match
bwa_aln_core exactly. Options -l/-n/-o/-t/-f.

sub100k end-to-end: 5.3 s = 18,785 reads/s (preprocess 0.0s MT, gpu 4.7s,
reconcile 0.6s, io 0.0s), .sai md5 == golden (eecf35c1). Bit-exact.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Ran bwa-aln-gpu on the full AVA1B.combined.fq.gz (3,948,528 reads, hs37d5,
-l 1024 -n 0.01 -o 2) and compared to CPU `bwa aln -t 16` on the same file:

  GPU bwa-aln-gpu : 175.9 s  (22,445 reads/s)
  CPU bwa aln -t16: 875   s  ( 4,512 reads/s)
  => 4.97x end-to-end speedup

Both .sai are 16,852,676 bytes with identical md5 a09a26dd894690031da42a00862f7a3d
-> BIT-EXACT over the whole 3.95M-read file.

GPU breakdown: preprocess 1.4s (MT), kernel 156.5s, reconcile 16.7s (16 thr),
io 0.1s, gzip read ~1.2s -> GPU-bound at the FM-index occ4 ceiling, as analyzed.
0.531% of reads reconciled on the CPU (mostly real hits). Goal achieved: GPU
BWA-backtrack for aDNA with seeding disabled, ~5x the 16-core CPU, byte-exact.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…, bit-exact

Add -S (SAM output) and -r RG to bwa-aln-gpu: fuse aln + samse (GeoGenetics-
style alnse). No .sai on disk; FASTQ read ONCE (samse reuses the loaded
seqs[]); bns/SA/pac loaded once. Per chunk after GPU aln + reconcile:
serial bwa_aln2seq_core (preserves the only samse RNG, drand48 repeat-hit
order; lrand48 @bntseq.c:266 is index-build only) -> MT bwa_cal_pac_pos_core
SA-lookup (RNG-free) -> serial bwa_refine_gapped + bwa_print_sam1 (ordered,
bit-exact). Header via bwa_print_sam_hdr + a @pg for bwa-aln-gpu.

Validated bit-exact: fused SAM alignment records byte-identical to
`bwa samse` on the matching .sai -- sub100k md5 ee2dd6ef, full 3.95M-read
file md5 e2a6e1c9 (header differs only in the expected @pg CL).
`bwa-aln-gpu -S ... | samtools sort` produces a valid BAM.

Full file (3,948,528 reads, RTX 3090): 174.2 s = 22,672 reads/s; even faster
than aln-only(176s)+separate samse(20s) since samse folds in for free. vs CPU
bwa aln -t16 + samse = 895 s -> ~5.1x end-to-end, one command, no .sai.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add FMTEST_KRANGE to fmtest to probe occ4 over a capped k-range (working-set
size). Sweep on hs37d5 (RTX 3090):
  0.1 MB (L1) : 9568 M-occ4/s
  1   MB (L2) : 9887 M-occ4/s
  16.8 MB     : 2936 M-occ4/s
  268 MB (HBM): 2324 M-occ4/s
  full (HBM)  : 2308 M-occ4/s
=> a 4.3x locality ceiling exists IN CACHE, but the warp2 kernel already runs
at ~1.8 G occ4/s (~the HBM-random rate), NOT the L2 rate. The FM-index DFS is
pure scatter: ~40k probes/read over ~49M 64-byte buckets => ~0.04% within-read
block reuse (birthday bound). So a per-warp SMEM BWT cache (and address sort,
bit-matrix popc) cannot capture the ceiling; the kernel is at the HBM random
BANDWIDTH wall. Remaining real wins: CPU/GPU overlap (~10% wall-clock) and
multi-GPU (~2x bandwidth, needs a 2nd GPU). Recorded in cuda/REFERENCES.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
lh3#5 real)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…, bit-exact

Restructure bwa-aln-gpu into a producer/consumer pipeline: the main thread
owns the GPU (read -> MT preprocess -> upload -> kernel -> download has_hit,
serial on one stream so the shared per-warp global backing is never raced);
each chunk is handed to ONE in-order finisher thread doing the CPU reconcile +
output (.sai or fused samse), concurrent with the next chunk's GPU work. A
single ordered consumer preserves drand48 (repeat-hit) and output order ->
bit-exact. Structured multi-GPU-ready (replicate the GPU stage per device +
round-robin chunks; keep one ordered finisher) for when a 2nd card is added.

Full file (3.95M reads, fused alnse SAM): 174.2 s -> 160.9 s = 24,541 reads/s
(overlap hid ~13s of the ~18s CPU tail), records BIT-EXACT (md5 e2a6e1c9).
Both modes re-verified bit-exact under the pipeline (.sai eecf35c1, SAM
records identical). vs CPU bwa aln -t16 + samse = 895 s -> ~5.56x end-to-end.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
aln_gpu.cu's main is now `extern "C" int bwa_alnse_gpu(int,char**)`; main.c
dispatches `bwa gpualn` under #ifdef HAVE_CUDA, with a usage line. Builds:
  make            -> CPU-only bwa (no CUDA dependency; gpualn absent)
  make bwa-gpu    -> CUDA bwa with `gpualn` (main.c -DHAVE_CUDA + nvcc
                     aln_gpu.o, linked via nvcc; reuses existing AOBJS)
  make bwa-aln-gpu-> standalone tool (now -DALN_GPU_MAIN)
The CUDA-enabled main.c sets the @pg, so gpualn output gets proper bwa
provenance (ID:bwa). As a subcommand it skips its own @pg.

Verified: `bwa gpualn` .sai md5 == golden (eecf35c1); `bwa gpualn -S` SAM
records identical to `bwa samse`; default CPU-only bwa still builds without
CUDA and has no gpualn (regression-checked).

Usage:
  bwa gpualn [-l 1024 -n 0.01 -o 2 -t 16] ref.fa reads.fq.gz > out.sai
  bwa gpualn -S -r '@rg\t...' ref.fa reads.fq.gz | samtools sort -O bam -o out.bam -

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add a "GPU acceleration: bwa gpualn" section to the README: what it is
(CUDA BWA-backtrack for single-end aDNA, seeding-disabled), bit-exact vs CPU
bwa, ~5x the 16-core CPU, how to build (make bwa-gpu, no CUDA dependency in the
default make), .sai and fused alnse (-S) usage with samtools, options, and a
short note on the bit-exact GPU-DFS + CPU-reconcile design (links cuda/PROGRESS.md).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Compared `bwa gpualn -S` against the production pipeline BAM
(/home/dnastorage/aDNApipeline/AVA1B_aln/AVA1B_aln.short.bam = bwa aln -l 1024
-n 0.01 -o 2 + samse + sort, built with bwa 0.7.18-r1243-dirty on the same
3,948,528 reads). Mapped records extracted, RG:Z stripped (production used a
different @rg ID), sorted by QNAME:
- mapped set identical (20,701 == 20,701 QNAMEs; no read changed status)
- ALL mapped records + ALL tags (FLAG/RNAME/POS/MAPQ/CIGAR, XT/NM/X0/X1/XM/XO/
  XG/MD) byte-identical, even across the 0.7.18-dirty -> 0.7.19 version gap.
Only @rg ID and SAM @pg differ. End-to-end equivalence to production confirmed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Reference aDNA mapping pipeline (AdapterRemoval3 -> short/long split -> align
-> merge -> optional dedup). Adds a `-g` flag that routes SHORT reads through
`bwa gpualn` (GPU BWA-backtrack) instead of CPU bwa aln+samse; long reads use
-a mem|mem3|fq2bam (incl. Parabricks). Default is CPU (existing behavior); -g
is opt-in. Binary path overridable via $BWA_GPU.

Verified bit-exact in-pipeline: single-end (system bwa 0.7.18-dirty, all mapped
records identical) and paired-end SOL3B (~1M pairs: GPU 218s vs CPU 953s = 4.4x,
both 52,577 mapped, ALL mapped records identical). gpualn is a drop-in for the
short-read aligner. (This is the user's working pipeline; paths reflect that env.)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add a Performance subsection to the gpualn section with measured short-read
alignment timings (gpualn vs CPU bwa aln+samse, single RTX 3090 vs 16 threads,
-l 1024 -n 0.01 -o 2, hs37d5):
- 3,948,528 reads: 174 s vs ~895 s (~5.1x)
- 40,571,012 reads: 1,947 s vs ~9,790 s wall (~5.0x); all 944,922 mapped
  records byte-identical to bwa aln (bit-exact at scale).
Notes throughput ~20-28k reads/s (tracks per-read backtracking tree size) and
that the kernel is at ~80% of the FM-index Occ random-access bandwidth ceiling.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
New cuda/LLM_AND_REFERENCES.md documents the human-directed, AI-assisted
development: teepean (lead, direction, real-data validation), Claude
(Opus 4.8: implementation, profiling, measurement, data-driven pushback) and
z.ai GLM-5.1 (architectural sparring / optimization proposals relayed by
teepean). Consolidates all references -- teepean-provided (start + during) and
those found during the work -- and notes the bit-exact validation/disclosure.
Linked from the README gpualn section.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The gpualn port is a derivative work of bwa's GPLv3 core (links bwtgap.c
bwt_match_gap, bwtaln.c bwt_cal_width, bwase.c samse functions, bwaseqio.c --
none MIT-headered, so GPLv3 per COPYING). Add the standard GPLv3 per-file notice
(gnu.org/licenses/gpl-3.0) to cuda/fm_device.cuh, dfs_engine.cuh, aln_gpu.cu,
dfstest.cu, fmtest.cu; copyright holder placeholder "teepean" (edit as desired).
Full license already present in COPYING. Documented the GPLv3 inheritance in
cuda/LLM_AND_REFERENCES.md. Comment-only change; bwa-gpu rebuilds unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
adna_aligner.sh -s now defaults to "auto": after adapter trimming it reads
AdapterRemoval3's post-processing read-length histogram (output.*.lengths in
{sample}.json) and picks the short/long cutoff from bwa's own max_diff-vs-length
curve. Reads whose max_diff would reach 5 (~64 bp at -n 0.01) are kept out of the
backtracking aln path -- that boundary is exactly the tree-cost cliff (cumulative
aln cost jumps ~18x from <64 to <70 on a real 66M-read sample). A tree-cost budget
(PICK_SPLIT_BUDGET, default 2e14) bounds the aln runtime; on that sample auto picks
51 bp, matching the hand-tuned -s 50 (26 min) that replaced an earlier 12 h run.

- new cuda/pick_split.py: JSON length-hist + bwa_cal_maxdiff -> cutoff (cliff cap
  + cost budget), sums all non-discarded output histograms (SE + PE merged), falls
  back to 50 on any error so the pipeline never breaks.
- -s still accepts a fixed integer for a manual cutoff; help documents both.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The short/long split piped awk to two single-threaded `gzip` processes, using ~3
of 32 cores. Swapped both to parallel BGZF (`bgzip -@ THREADS/2 -l 2`): on a 3M-read
real sample the split goes 25.3s -> 4.0s (119k -> 752k reads/s, 6.3x). BGZF is valid
gzip, so bwa / mem3 / fq2bam read the split fastqs natively. The files are transient
(deleted after alignment), so level 2 is fine.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Short/long split now defaults to a fixed 64 bp (was data-driven 'auto'). Keeps the precision-critical 30-60bp band (bulk of aDNA, where bwa aln is most precise and mem -k19 -r2.5 inflates reference bias -- Oliva et al.) on aln/gpualn, and sits at backtrack's d=5 cost cliff. '-s auto' is retained but clamped to a floor of 60 so the runtime guard can never push that band onto mem. Real VILA01 test: raising the cutoff (64->70->75) monotonically lost mappings and added wall time.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Behavior-preserving when off. With GPUALN_HISTO=1, gpualn reads back the per-read node-pop count and a new per-read flag bit (added to k_dfs_warp2), buckets them by read length, and prints a histogram (count, hit%, flag%, mean/max pops) plus GPU-kernel-vs-CPU-reconcile wall times. This is the instrumentation behind the d=5-band and cutoff measurements.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…d engine self-tests

README: new subsections on (a) pipeline integration and the 64bp/floor-60 cutoff with the VILA01 evidence and Oliva-et-al. precision rationale, and (b) the GPU alignment-generation investigation -- order-sensitivity == multimapping (c1>1, validated with gaps), unique-best offload fractions, and the finding that a warp-cooperative collect-and-sort is bit-exact but throughput-counterproductive (per-thread best-first is the right, modest win). Adds GPU_ALIGN_GENERATION_SCOPING.md, the domain-neutral GPU_BOUNDED_DFS_REPORT.md, and two self-contained nvcc self-tests (cuda/fmdfs_selftest.cu, cuda/bestfirst_selftest.cu).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ad aligner)

New -A flag bypasses the short/long split and the BWA-backtrack (aln/gpualn) pass entirely, sending every read to the -a aligner (mem/mem3/fq2bam). A fast first-pass for a new sample with no prior data. Ignores -s and -g; updates the banner and pipeline.log summary accordingly. Default behaviour (split + aln) is unchanged.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant