I build LLM systems that sit close to the metal: MoE architectures, attention kernels, speculative decoding. The kind of work where a misaligned memory access costs you a day.
Note
The workflow: read the paper, implement it, fix what broke. The CUDA race conditions weren't in the abstract.
Keiro: Sparse MoE on Qwen2.5-3B
Retrofitted Sparse Mixture-of-Experts into Qwen2.5-3B. A Top-2 router activates 2 of 8 LoRA experts per transformer block, leaving the frozen FFN untouched and routing through Rank-16 adapters instead. Active compute stays identical to the dense baseline. The model adds 19.46M trainable parameters (0.63% of total) and retains 95.4% of GSM8K performance.
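The routing step described above can be sketched in plain Python. This is a minimal illustration, not Keiro's actual code: the function names (`top2_route`, `lora_delta`, `moe_lora_forward`) are hypothetical, and two details are assumptions rather than facts from the project: that the two gate weights are renormalized with a softmax over the selected logits, and that the expert deltas are added to the frozen FFN output.

```python
import math

def top2_route(logits):
    """Return [(expert_index, gate_weight), ...] for the two highest router logits."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:2]
    # Assumption: softmax over only the selected logits, so the two gates sum to 1.
    m = max(logits[i] for i in top)
    exps = [math.exp(logits[i] - m) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def lora_delta(x, A, B, scale=1.0):
    """Low-rank update scale * B @ (A @ x); A has shape (r, d), B has shape (d, r)."""
    h = [sum(a * xi for a, xi in zip(row, x)) for row in A]             # length r
    return [scale * sum(b * hi for b, hi in zip(row, h)) for row in B]  # length d

def moe_lora_forward(x, ffn_out, router_logits, experts):
    """Add gate-weighted deltas from the top-2 experts to the frozen FFN output.

    experts is a list of (A, B) pairs, one per expert (8 in the setup above).
    Assumption: deltas are summed onto ffn_out; the FFN weights are never touched.
    """
    out = list(ffn_out)
    for idx, gate in top2_route(router_logits):
        A, B = experts[idx]
        delta = lora_delta(x, A, B)
        out = [o + gate * d for o, d in zip(out, delta)]
    return out
```

Only the two selected experts run their rank-r matrix products, which is why the per-token cost of the adapters stays small relative to the frozen FFN.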
What actually needed fixing:
├── CUDA race condition in index_add_ with duplicate Top-K indices
├── BFloat16 cumsum upcast mismatch in the coalesce path
└── 4.7× autoregressive inference bottleneck, resolved by bypassing
    capacity buffers during single-token generation
lm-evaluation-harness results vs. base model:
├── HellaSwag: within 0.13% of base
├── ARC-Challenge: within 0.17% of base
└── GSM8K: 3.19% below base
Prolepsis: Speculative Decoding
A Qwen 1.7B draft model generates candidate tokens; a Qwen 8B target verifies them in a single parallel pass. A rejection sampling pipeline ensures the output distribution is mathematically identical to running the target model alone.
| Metric | Result |
|---|---|
| Speedup on A100 | 1.30× |
| Acceptance Rate | ~56.5% across mixed-domain prompts |
| Output Distribution | Identical to target |
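The "identical to target" property comes from the standard speculative-sampling acceptance rule, sketched here for a single drafted token over a toy vocabulary. This is an illustration of the general technique, not Prolepsis's code; `verify_token` and `empirical_distribution` are hypothetical names, and `p` and `q` stand for the target and draft probabilities at one position.

```python
import random

def verify_token(token, p, q, rng):
    """Accept or replace one drafted token.

    p: target-model probabilities, q: draft-model probabilities (same vocabulary).
    token must have been sampled from q, so q[token] > 0.
    Returns (final_token, accepted).
    """
    # Accept with probability min(1, p(x) / q(x)).
    if rng.random() < min(1.0, p[token] / q[token]):
        return token, True
    # On rejection, resample from the normalized residual max(0, p - q).
    # A rejection implies p[token] < q[token], so some residual weight is positive.
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    return rng.choices(range(len(p)), weights=residual)[0], False

def empirical_distribution(p, q, trials, rng):
    """Draft from q, verify against p, and return the observed token frequencies."""
    counts = [0] * len(p)
    for _ in range(trials):
        drafted = rng.choices(range(len(q)), weights=q)[0]
        token, _accepted = verify_token(drafted, p, q, rng)
        counts[token] += 1
    return [c / trials for c in counts]
```

Running `empirical_distribution` with enough trials should give frequencies close to `p` regardless of how poor `q` is; a worse draft only lowers the acceptance rate, not the correctness of the output.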
FlashTile: Flash Attention V1/V2
Implements block-wise tiling, online softmax, and recomputation-based backward passes to cut attention storage from O(N²) to O(N). Covers GQA and MQA variants, with a forward-only Triton kernel included for benchmarking.
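The online-softmax idea can be shown for a single query row in plain Python. This is a scalar sketch of the general technique, not FlashTile's kernels; the function names and the `block_size` parameter are hypothetical. The streaming version keeps only a running max, a running denominator, and a running output, so it never materializes the full row of attention weights.

```python
import math

def attention_row_online(scores, values, block_size=2):
    """softmax(scores) @ values for one query, streaming over key/value blocks.

    Assumes scores is non-empty and values[i] is the value vector for key i.
    """
    m = float("-inf")              # running max of the scores seen so far
    l = 0.0                        # running softmax denominator, relative to m
    acc = [0.0] * len(values[0])   # running unnormalized output, relative to m
    for start in range(0, len(scores), block_size):
        s_blk = scores[start:start + block_size]
        v_blk = values[start:start + block_size]
        m_new = max(m, max(s_blk))
        # Rescale the previous state to the new max (exp(-inf) is 0.0 on the first block).
        correction = math.exp(m - m_new)
        l *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(s_blk, v_blk):
            w = math.exp(s - m_new)
            l += w
            acc = [a + w * vj for a, vj in zip(acc, v)]
        m = m_new
    return [a / l for a in acc]

def attention_row_reference(scores, values):
    """Same result computed the usual way, with the whole weight row in memory."""
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    return [sum(wi * v[j] for wi, v in zip(w, values)) / z
            for j in range(len(values[0]))]
```

The two functions should agree up to floating-point rounding for any block size, which is a convenient check when porting the same recurrence to a tiled GPU kernel.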
Substrata9: Linux Introspection Toolkit
Pure Bash. No compilation, no dependencies. Reads /proc to surface memory maps, file descriptors, process hierarchies, and runtime anomalies. Outputs JSON, so it slots into observability, debugging, and forensics pipelines without modification.
Mission Cipher: GraphRAG App
Combines cosine-similarity search over semantic embeddings with a live knowledge graph (NetworkX) to answer questions with richer contextual grounding than plain RAG. Deployed on GCE behind NGINX, with Flask and Gunicorn communicating over a Unix socket.
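One plausible way to combine the two retrieval signals is to pick seed nodes by cosine similarity and then pull in their graph neighbors as extra context. The sketch below assumes that design; it is not taken from the app. It uses a plain adjacency dict in place of NetworkX to stay dependency-free, and `cosine` and `retrieve` are hypothetical names.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_vec, embeddings, graph, k=2):
    """Return the top-k nodes by cosine similarity, followed by their one-hop neighbors.

    embeddings: {node: vector}; graph: {node: [neighbor, ...]} (assumed adjacency dict).
    """
    ranked = sorted(embeddings,
                    key=lambda n: cosine(query_vec, embeddings[n]),
                    reverse=True)
    seeds = ranked[:k]
    context = list(seeds)
    for node in seeds:
        for neighbor in graph.get(node, []):
            if neighbor not in context:
                context.append(neighbor)
    return context
```

The neighbor expansion is what lets a graph-backed retriever surface nodes that are related to the query by structure rather than by embedding similarity alone.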
Tip
neuralnets.dev: LLM architecture, inference, GPU programming, and occasionally the math underneath all of it. The goal is precision over vibe: the writeups get into what the papers skip and what the code alone won't tell you.