CUDA port of bwa aln by teepean · Pull Request #457 · lh3/bwa

teepean · 2026-06-01T14:39:49Z

This fork adds gpualn, a CUDA implementation of BWA-backtrack (bwa aln) for single-end reads, targeting ancient DNA (aDNA) where seeding is disabled (-l 1024) — the regime where seed-and-extend aligners (and the older BarraCUDA) struggle. It produces a bit-exact .sai/SAM (byte-identical to CPU bwa aln/samse) while running ~5x faster than 16 CPU threads on a single NVIDIA RTX 3090 (full 3.95M-read aDNA file: 161 s vs 895 s; .sai md5 identical).

Phase 0: instrument bwt_match_gap (ALN_PROFILE) to measure aDNA search-tree sizes. On real data (AVA1B, hs37d5, -l 1024 -n 0.01 -o 2): 0.52% mapping, 99.3% zero-hit reads dominate, no read hits the 2M max_entries cap, tree size spans 6 orders of magnitude. Key consequence recorded in CUDA_PORT_PLAN.md: for zero-hit reads DFS and best-first visit the identical node set, so the per-thread priority queue (BarraCUDA's memory wall) is unneeded for the bulk; target a persistent-thread work-pool + per-read iterative DFS, reconciling the ~0.5% hit-finding reads via the exact CPU path for a bit-exact .sai. Phase 1: cuda/fm_device.cuh mirrors bwt_occ4/bwt_2occ4/bwt_occ/bwt_match_exact on the existing 64-byte-bucket FM-index layout (cnt_table/L2 in constant mem, .bwt in global mem, __ldg reads); the suffix array is not needed on the GPU. cuda/fmtest.cu validates against the CPU: 10M random Occ4 probes and full backward exact-search of 10k real reads both PASS with 0 mismatches at 2.28 G-Occ4/s on an RTX 3090. Added `make fmtest` target. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

cuda/dfstest.cu + device DFS in fm_device.cuh. Host reuses bwa's real read I/O, bwt_cal_width, complement and per-read max_diff so the device DFS sees exactly the (seq, width, opt) that CPU bwt_match_gap sees. The DFS reproduces the bounded search-tree node set and reports has_hit per read; the ~0.5% reads that find a hit are reconciled via the exact CPU bwt_match_gap. Validated on sub10k vs the golden .sai (md5 54410c75...): - CPU-only .sai == golden (harness replicates the driver) - GPU has_hit matches CPU n_aln>0 with ZERO false negatives - hybrid .sai (GPU-gated, CPU-reconciled) == golden -> BIT-EXACT Performance is intentionally a baseline only: the naive one-read-per-thread grid-stride with a global-memory stack runs at ~100 reads/s, ~33x slower than the 16-core CPU -- reproducing the BarraCUDA failure mode. Causes diagnosed in cuda/PROGRESS.md (warp divergence from 6-orders tree-size variance, low occupancy, uncoalesced per-thread stacks, no 2occ4 fast path). The whole run is ~227x off the measured FM-index throughput ceiling, so the loss is entirely scheduling/memory -- step 2 (work-pool + coalescing + divergence fixes) is the speedup lever. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Optimized the device DFS engine while preserving bit-exact .sai: - coalesced SoA stack ([sp*P+slot]) and occupancy-sized persistent-thread work-pool (atomic work counter); large runs reconcile only flagged/hit reads from stored flat data (the real production hybrid; full CPU reference is now opt-in via DFS_FULLCPU). - per-read WORK BUDGET (DFS_BUDGET, default 2M pops): the decisive fix. The naive ~100s was dominated by a single read doing 33.1M node-pops on one thread; the CPU's max_entries=2M cap bails on such reads, the GPU did not. Capping GPU work and flagging overflow reads to the exact CPU path is free for correctness (those reads are reconciled on CPU regardless). Measured on RTX 3090, hs37d5, -l 1024 -n 0.01 -o 2: - sub2k: 100s -> 4.5s after budget (2 reads flagged), bit-exact. - sub100k: 12.08s (8276 reads/s) + 2.37s 1-thread reconcile of 520 reads; hybrid .sai md5 == golden (eecf35c1). CPU bwa -t 16 = 23.47s (4261 r/s) on the same box -> ~1.9x the 16-core CPU, ~83x over the naive GPU baseline. 338M pops/s ~= 38% of the FM-index occ4 ceiling; headroom remains. Next: 2occ4 shared-bucket fast path, lower register pressure for occupancy, cheaper exact-match tail, full-file streaming + multithreaded reconcile. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…exact Profiling-guided optimization (consulted CUDA Ampere tuning guide, NVIDIA forums, Volkov latency-hiding, and the 2025 arXiv N-Queens GPU-DFS paper): - Added bwt_2occ4 shared-bucket fast path: NO change -> not occ-load bound. - Swept occupancy via maxrregcount: MORE warps were SLOWER (40w 6445 r/s, 48w 5243 r/s vs 28w 8276) -> NOT occupancy/latency bound. The kernel is bound by the global-memory DFS stack (per-thread live stack working set exceeds the 6 MB L2 as threads scale). - Fix: in-register-continue DFS -- keep the current node in registers, push only siblings, read the global stack only on backtrack. Halves stack round-trips and removes the push->immediately-pop dependency. 8276 -> 9934 reads/s (+20%), 69 registers. sub100k stays bit-exact (hybrid .sai md5 == golden eecf35c1). Standing: ~2.3x the 16-core CPU (4261 r/s), ~27x one core, 99x over naive GPU baseline. ~45% of the FM-index occ4 throughput ceiling. Records: cuda/PROGRESS.md (optimization table) and cuda/REFERENCES.md (docs/ forum findings + next levers: subtree-per-warp with a shared-memory stack). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Records the 6 proposed optimizations with an adopt/reject analysis in cuda/OPTIMIZATION_IDEAS.md. Adopting lh3#1 (warp-level parallel-branch DFS) + lh3#2 (register+shared-memory stack) as the next big lever, lh3#5 (stream-overlapped multithreaded reconcile) for the full-file engine, lh3#4 (vectorized checkpoint load) as a minor follow-up. Flagged lh3#3 (exact-seed zero-hit fast path) as REJECTED: it breaks correctness under -l 1024. With seeding disabled (the aDNA case) a read may map with mismatches in the first 32 bp, so an exact-seed filter would skip mappable reads -> false negatives -> non-bit-exact .sai. A conservative q-gram lower bound could be safe but is deferred. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Added DFS stack-DEPTH instrumentation: sub100k mean 111, max 386, so a 512-entry stack covers every read (cap 1024->512, 0 overflow). Tried block-local stack striding to improve locality: NO win (9165 vs 9934 r/s, bit-exact). Negative result confirms the global stack is bound by traffic VOLUME, not layout -> only removing global stack traffic (shared-memory stack) or sharing it (warp cooperation) can help. A per-thread full shared stack can't fit (~1 MB/block), so the shared stack requires one-read-per-warp -> implement the warp-cooperative engine (Idea lh3#1+lh3#2) next. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ernel, bit-exact Add DFS_ENGINE=warp: one read per warp, 32 lanes co-explore the tree as a frontier, per-warp stack in SHARED memory (SoA k/l/packed). Each wave pops up to 32 top nodes, expands in parallel, warp-scan compacts children, pushes back to the shared stack; hit via __any_sync ballot; budget/overflow -> CPU. Removes global stack traffic entirely (the diagnosed bottleneck). sub100k: GPU kernel 3366 ms = 29,708 reads/s -- 3x the thread engine (9934), ~7x the 16-core CPU (4261); bit-exact (md5 eecf35c1). The monster reads that serialized on one thread now spread across 32 lanes. Known issue to fix next: CAP=640 shared stack overflows for 4.6% of reads (warp frontier is wider than the serial DFS depth 386), flagging them to the CPU -> 84 s single-thread reconcile that dwarfs the GPU time. Fixes: larger CAP to cut overflow, and multithreaded/overlapped reconcile (Idea lh3#5). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…exact Resolved the overflow-flag bottleneck: - CAP sweep: 640->4.6% flagged, 1024->0.28%, 1216->0.065%. Larger CAP keeps bushy reads on the GPU instead of punting them to the CPU. Default CAP=1024. - Multithreaded the CPU reconcile (std::thread, per-thread gap_stack; bwt_match_gap is thread-safe): 12.1 s -> 1.57 s wall on 16 threads. - Added DFS_NORECON for perf sweeps. sub100k (RTX 3090, bit-exact md5 eecf35c1): - warp GPU kernel 21,432 reads/s (CAP=1024) - end-to-end GPU + 16-thread reconcile (sequential) 6.24 s = 16,026 reads/s - streaming-overlapped would be GPU-bound ~21,432 r/s - vs CPU bwa -t 16 = 4,261 r/s -> ~3.8x (sequential) to ~5x (overlapped), ~210x over the naive GPU baseline. Bottleneck documented in cuda/PROGRESS.md. Remaining levers: shared+global two-level stack for higher occupancy, and a full-file streaming engine with reconcile overlapping the next GPU batch. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…o-end ceiling Implement dynamic pop-count in the warp engine (Lever 1): - WAVE mode: pop up to 32 nodes, but only while room=(CAP-sp)/9 fits every node's <=9 children -> wave mode never overflows. - DRAIN mode (near full): pop 1, keep a child in registers (carry) so single-child chains never touch the stack. Gated to sp>=CAP/2 to preserve the wave ramp-up for normal reads. MAX_CHILDREN confirmed = 9. - All bit-exact (sub2k false-pos 90 -> 1). CAP sweep (sub100k, carry-drain): 256->27% flag/59k r/s (16 warps/SM), 384->5.6%/35.7k, 512->2.2%/21k, 768->0.34%/22.2k (4 warps/SM). With the overlapped 16-thread reconcile, end-to-end is GPU-bound ~22k for CAP>=768 and reconcile-bound for small CAP. KEY FINDING: small-CAP high-occupancy speed is illusory because it punts the bushy reads (2-5%) to the CPU; those reads cost similarly on GPU (low occ) or CPU (reconcile), pinning end-to-end at ~22k r/s (~5x the 16-core CPU) regardless of CAP. Default CAP=768. To break the ceiling, bushy reads must run on the GPU at high occupancy -> next build is the two-level shared(small)+global-backing stack. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ndex ceiling Add DFS_ENGINE=warp2: per-warp shared top-window (CAP_SM) + pre-allocated per-warp GLOBAL backing, with warp-parallel coalesced spill/unspill of 128-entry chunks (global=oldest bottom, shared=newest top). Bushy reads spill their deep frontier to global and STAY ON THE GPU instead of flagging to CPU. sub100k, bit-exact (md5 eecf35c1): flag rate driven to ~0.015% (vs 0.34% single-level) -> CPU reconcile negligible (0.61 s). CAP_SM sweep: 256->15.9k r/s (16 w/SM), 384->18.8k, 512->21.5k (8 w/SM), 768->22.1k (4 w/SM). KEY FINDING: occupancy is NOT the limiter -- 4..16 warps/SM all plateau at ~22k reads/s. We are at the FM-index occ4 ceiling: ~8e9 occ4 for this work at the measured 2.28 G-occ4/s = ~3.6 s floor vs ~4.5 s actual (~80% of ceiling, ~5x the 16-core CPU). 40-50k is not reachable on this GPU; the kernel is bound by FM-index probe throughput, not latency hiding. Further raw speedup needs fewer occ4/node (bit-exactness-constrained) or a faster FM-index. warp2 ships as the engine (ceiling speed + ~0 reconcile tail -> clean for the streaming engine). Default CAP_SM=512. Next: full-file streaming (deliverable). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Extract the warp2 two-level engine into cuda/dfs_engine.cuh and add cuda/aln_gpu.cu = bwa-aln-gpu: a full-file GPU bwa-aln. Streams the FASTQ in bwa's native 0x40000-read chunks (chunking matches the CPU driver), per chunk: MT host preprocessing (bwt_cal_width + complement + per-read max_diff) -> warp2 GPU has_hit -> MT CPU reconcile of the flagged/hit reads via exact bwt_match_gap -> write records in read order. One-time global-backing alloc and reusable (grown-on-demand) device buffers; SAI header + records match bwa_aln_core exactly. Options -l/-n/-o/-t/-f. sub100k end-to-end: 5.3 s = 18,785 reads/s (preprocess 0.0s MT, gpu 4.7s, reconcile 0.6s, io 0.0s), .sai md5 == golden (eecf35c1). Bit-exact. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Ran bwa-aln-gpu on the full AVA1B.combined.fq.gz (3,948,528 reads, hs37d5, -l 1024 -n 0.01 -o 2) and compared to CPU `bwa aln -t 16` on the same file: GPU bwa-aln-gpu : 175.9 s (22,445 reads/s) CPU bwa aln -t16: 875 s ( 4,512 reads/s) => 4.97x end-to-end speedup Both .sai are 16,852,676 bytes with identical md5 a09a26dd894690031da42a00862f7a3d -> BIT-EXACT over the whole 3.95M-read file. GPU breakdown: preprocess 1.4s (MT), kernel 156.5s, reconcile 16.7s (16 thr), io 0.1s, gzip read ~1.2s -> GPU-bound at the FM-index occ4 ceiling, as analyzed. 0.531% of reads reconciled on the CPU (mostly real hits). Goal achieved: GPU BWA-backtrack for aDNA with seeding disabled, ~5x the 16-core CPU, byte-exact. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@pg

…, bit-exact Add -S (SAM output) and -r RG to bwa-aln-gpu: fuse aln + samse (GeoGenetics- style alnse). No .sai on disk; FASTQ read ONCE (samse reuses the loaded seqs[]); bns/SA/pac loaded once. Per chunk after GPU aln + reconcile: serial bwa_aln2seq_core (preserves the only samse RNG, drand48 repeat-hit order; lrand48 @bntseq.c:266 is index-build only) -> MT bwa_cal_pac_pos_core SA-lookup (RNG-free) -> serial bwa_refine_gapped + bwa_print_sam1 (ordered, bit-exact). Header via bwa_print_sam_hdr + a @pg for bwa-aln-gpu. Validated bit-exact: fused SAM alignment records byte-identical to `bwa samse` on the matching .sai -- sub100k md5 ee2dd6ef, full 3.95M-read file md5 e2a6e1c9 (header differs only in the expected @pg CL). `bwa-aln-gpu -S ... | samtools sort` produces a valid BAM. Full file (3,948,528 reads, RTX 3090): 174.2 s = 22,672 reads/s; even faster than aln-only(176s)+separate samse(20s) since samse folds in for free. vs CPU bwa aln -t16 + samse = 895 s -> ~5.1x end-to-end, one command, no .sai. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Add FMTEST_KRANGE to fmtest to probe occ4 over a capped k-range (working-set size). Sweep on hs37d5 (RTX 3090): 0.1 MB (L1) : 9568 M-occ4/s 1 MB (L2) : 9887 M-occ4/s 16.8 MB : 2936 M-occ4/s 268 MB (HBM): 2324 M-occ4/s full (HBM) : 2308 M-occ4/s => a 4.3x locality ceiling exists IN CACHE, but the warp2 kernel already runs at ~1.8 G occ4/s (~the HBM-random rate), NOT the L2 rate. The FM-index DFS is pure scatter: ~40k probes/read over ~49M 64-byte buckets => ~0.04% within-read block reuse (birthday bound). So a per-warp SMEM BWT cache (and address sort, bit-matrix popc) cannot capture the ceiling; the kernel is at the HBM random BANDWIDTH wall. Remaining real wins: CPU/GPU overlap (~10% wall-clock) and multi-GPU (~2x bandwidth, needs a 2nd GPU). Recorded in cuda/REFERENCES.md. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

lh3#5 real) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…, bit-exact Restructure bwa-aln-gpu into a producer/consumer pipeline: the main thread owns the GPU (read -> MT preprocess -> upload -> kernel -> download has_hit, serial on one stream so the shared per-warp global backing is never raced); each chunk is handed to ONE in-order finisher thread doing the CPU reconcile + output (.sai or fused samse), concurrent with the next chunk's GPU work. A single ordered consumer preserves drand48 (repeat-hit) and output order -> bit-exact. Structured multi-GPU-ready (replicate the GPU stage per device + round-robin chunks; keep one ordered finisher) for when a 2nd card is added. Full file (3.95M reads, fused alnse SAM): 174.2 s -> 160.9 s = 24,541 reads/s (overlap hid ~13s of the ~18s CPU tail), records BIT-EXACT (md5 e2a6e1c9). Both modes re-verified bit-exact under the pipeline (.sai eecf35c1, SAM records identical). vs CPU bwa aln -t16 + samse = 895 s -> ~5.56x end-to-end. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@pg

aln_gpu.cu's main is now `extern "C" int bwa_alnse_gpu(int,char**)`; main.c dispatches `bwa gpualn` under #ifdef HAVE_CUDA, with a usage line. Builds: make -> CPU-only bwa (no CUDA dependency; gpualn absent) make bwa-gpu -> CUDA bwa with `gpualn` (main.c -DHAVE_CUDA + nvcc aln_gpu.o, linked via nvcc; reuses existing AOBJS) make bwa-aln-gpu-> standalone tool (now -DALN_GPU_MAIN) The CUDA-enabled main.c sets the @pg, so gpualn output gets proper bwa provenance (ID:bwa). As a subcommand it skips its own @pg. Verified: `bwa gpualn` .sai md5 == golden (eecf35c1); `bwa gpualn -S` SAM records identical to `bwa samse`; default CPU-only bwa still builds without CUDA and has no gpualn (regression-checked). Usage: bwa gpualn [-l 1024 -n 0.01 -o 2 -t 16] ref.fa reads.fq.gz > out.sai bwa gpualn -S -r '@rg\t...' ref.fa reads.fq.gz | samtools sort -O bam -o out.bam - Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Add a "GPU acceleration: bwa gpualn" section to the README: what it is (CUDA BWA-backtrack for single-end aDNA, seeding-disabled), bit-exact vs CPU bwa, ~5x the 16-core CPU, how to build (make bwa-gpu, no CUDA dependency in the default make), .sai and fused alnse (-S) usage with samtools, options, and a short note on the bit-exact GPU-DFS + CPU-reconcile design (links cuda/PROGRESS.md). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@rg

Compared `bwa gpualn -S` against the production pipeline BAM (/home/dnastorage/aDNApipeline/AVA1B_aln/AVA1B_aln.short.bam = bwa aln -l 1024 -n 0.01 -o 2 + samse + sort, built with bwa 0.7.18-r1243-dirty on the same 3,948,528 reads). Mapped records extracted, RG:Z stripped (production used a different @rg ID), sorted by QNAME: - mapped set identical (20,701 == 20,701 QNAMEs; no read changed status) - ALL mapped records + ALL tags (FLAG/RNAME/POS/MAPQ/CIGAR, XT/NM/X0/X1/XM/XO/ XG/MD) byte-identical, even across the 0.7.18-dirty -> 0.7.19 version gap. Only @rg ID and SAM @pg differ. End-to-end equivalence to production confirmed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Reference aDNA mapping pipeline (AdapterRemoval3 -> short/long split -> align -> merge -> optional dedup). Adds a `-g` flag that routes SHORT reads through `bwa gpualn` (GPU BWA-backtrack) instead of CPU bwa aln+samse; long reads use -a mem|mem3|fq2bam (incl. Parabricks). Default is CPU (existing behavior); -g is opt-in. Binary path overridable via $BWA_GPU. Verified bit-exact in-pipeline: single-end (system bwa 0.7.18-dirty, all mapped records identical) and paired-end SOL3B (~1M pairs: GPU 218s vs CPU 953s = 4.4x, both 52,577 mapped, ALL mapped records identical). gpualn is a drop-in for the short-read aligner. (This is the user's working pipeline; paths reflect that env.) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Add a Performance subsection to the gpualn section with measured short-read alignment timings (gpualn vs CPU bwa aln+samse, single RTX 3090 vs 16 threads, -l 1024 -n 0.01 -o 2, hs37d5): - 3,948,528 reads: 174 s vs ~895 s (~5.1x) - 40,571,012 reads: 1,947 s vs ~9,790 s wall (~5.0x); all 944,922 mapped records byte-identical to bwa aln (bit-exact at scale). Notes throughput ~20-28k reads/s (tracks per-read backtracking tree size) and that the kernel is at ~80% of the FM-index Occ random-access bandwidth ceiling. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

New cuda/LLM_AND_REFERENCES.md documents the human-directed, AI-assisted development: teepean (lead, direction, real-data validation), Claude (Opus 4.8: implementation, profiling, measurement, data-driven pushback) and z.ai GLM-5.1 (architectural sparring / optimization proposals relayed by teepean). Consolidates all references -- teepean-provided (start + during) and those found during the work -- and notes the bit-exact validation/disclosure. Linked from the README gpualn section. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The gpualn port is a derivative work of bwa's GPLv3 core (links bwtgap.c bwt_match_gap, bwtaln.c bwt_cal_width, bwase.c samse functions, bwaseqio.c -- none MIT-headered, so GPLv3 per COPYING). Add the standard GPLv3 per-file notice (gnu.org/licenses/gpl-3.0) to cuda/fm_device.cuh, dfs_engine.cuh, aln_gpu.cu, dfstest.cu, fmtest.cu; copyright holder placeholder "teepean" (edit as desired). Full license already present in COPYING. Documented the GPLv3 inheritance in cuda/LLM_AND_REFERENCES.md. Comment-only change; bwa-gpu rebuilds unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

adna_aligner.sh -s now defaults to "auto": after adapter trimming it reads AdapterRemoval3's post-processing read-length histogram (output.*.lengths in {sample}.json) and picks the short/long cutoff from bwa's own max_diff-vs-length curve. Reads whose max_diff would reach 5 (~64 bp at -n 0.01) are kept out of the backtracking aln path -- that boundary is exactly the tree-cost cliff (cumulative aln cost jumps ~18x from <64 to <70 on a real 66M-read sample). A tree-cost budget (PICK_SPLIT_BUDGET, default 2e14) bounds the aln runtime; on that sample auto picks 51 bp, matching the hand-tuned -s 50 (26 min) that replaced an earlier 12 h run. - new cuda/pick_split.py: JSON length-hist + bwa_cal_maxdiff -> cutoff (cliff cap + cost budget), sums all non-discarded output histograms (SE + PE merged), falls back to 50 on any error so the pipeline never breaks. - -s still accepts a fixed integer for a manual cutoff; help documents both. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The short/long split piped awk to two single-threaded `gzip` processes, using ~3 of 32 cores. Swapped both to parallel BGZF (`bgzip -@ THREADS/2 -l 2`): on a 3M-read real sample the split goes 25.3s -> 4.0s (119k -> 752k reads/s, 6.3x). BGZF is valid gzip, so bwa / mem3 / fq2bam read the split fastqs natively. The files are transient (deleted after alignment), so level 2 is fine. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Short/long split now defaults to a fixed 64 bp (was data-driven 'auto'). Keeps the precision-critical 30-60bp band (bulk of aDNA, where bwa aln is most precise and mem -k19 -r2.5 inflates reference bias -- Oliva et al.) on aln/gpualn, and sits at backtrack's d=5 cost cliff. '-s auto' is retained but clamped to a floor of 60 so the runtime guard can never push that band onto mem. Real VILA01 test: raising the cutoff (64->70->75) monotonically lost mappings and added wall time. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Behavior-preserving when off. With GPUALN_HISTO=1, gpualn reads back the per-read node-pop count and a new per-read flag bit (added to k_dfs_warp2), buckets them by read length, and prints a histogram (count, hit%, flag%, mean/max pops) plus GPU-kernel-vs-CPU-reconcile wall times. This is the instrumentation behind the d=5-band and cutoff measurements. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…d engine self-tests README: new subsections on (a) pipeline integration and the 64bp/floor-60 cutoff with the VILA01 evidence and Oliva-et-al. precision rationale, and (b) the GPU alignment-generation investigation -- order-sensitivity == multimapping (c1>1, validated with gaps), unique-best offload fractions, and the finding that a warp-cooperative collect-and-sort is bit-exact but throughput-counterproductive (per-thread best-first is the right, modest win). Adds GPU_ALIGN_GENERATION_SCOPING.md, the domain-neutral GPU_BOUNDED_DFS_REPORT.md, and two self-contained nvcc self-tests (cuda/fmdfs_selftest.cu, cuda/bestfirst_selftest.cu). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…ad aligner) New -A flag bypasses the short/long split and the BWA-backtrack (aln/gpualn) pass entirely, sending every read to the -a aligner (mem/mem3/fq2bam). A fast first-pass for a new sample with no prior data. Ignores -s and -g; updates the banner and pipeline.log summary accordingly. Default behaviour (split + aln) is unchanged. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

teepean and others added 29 commits June 1, 2026 10:29

docs: record round-2 optimization verdicts (HBM wall; lh3#1-3 rejected,

0da4aca

lh3#5 real) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CUDA port of bwa aln#457

CUDA port of bwa aln#457
teepean wants to merge 29 commits into
lh3:masterfrom
teepean:cuda-port

teepean commented Jun 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

teepean commented Jun 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant