Making Qwen3 go Brrr from First Principles
Deep dive into optimizing Qwen3 inference from the ground up. Learn how to build a high-performance inference engine with custom kernels and optimization techniques.
I'm an AI Systems Engineer who spans the entire stack. I don't just train models; I build the high-performance infrastructure that makes them fly. I specialize in distributed training, inference optimization, and custom CUDA/Triton kernels.
In the world of AI, performance is everything. But true performance doesn't come from just one layer of the stack. It comes from understanding the entire pipeline—from the mathematical intuition in a research paper to the memory access patterns on the GPU.
I operate across this entire spectrum. I read papers to understand the algorithms, then I dive into the systems level to optimize them. Whether it's writing custom CUDA kernels to squeeze out 9x speedups, designing distributed training systems for massive scale, or architecting production-ready inference engines, I build systems that are fast, scalable, and robust.
Presenting RetinaSys, an innovative framework designed to overcome critical limitations in medical image analysis through strong generalization, real-time performance, and clinical interpretability. Leveraging self-supervised MoCo v3 pre-training on diverse datasets, RetinaSys employs a ConvNeXt backbone fine-tuned in a multi-task paradigm, integrating lesion-centric attention, ordinal grade consistency, and domain adaptation. Explainable AI (XAI) techniques such as attention maps, integrated gradients, SHAP, and Monte Carlo dropout enhance interpretability, while model optimization ensures efficient inference on standard CPU hardware.
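As a rough illustration of the Monte Carlo dropout component, here is a minimal generic PyTorch sketch, not RetinaSys code; the model, input, and sample count are placeholders:

```python
import torch
import torch.nn as nn

def mc_dropout_predict(model: nn.Module, x: torch.Tensor, n_samples: int = 20):
    """Estimate predictive uncertainty by keeping dropout active at
    inference time and averaging several stochastic forward passes."""
    model.eval()
    # Re-enable only the dropout layers; batch norm etc. stay in eval mode.
    for module in model.modules():
        if isinstance(module, nn.Dropout):
            module.train()
    with torch.no_grad():
        probs = torch.stack(
            [torch.softmax(model(x), dim=-1) for _ in range(n_samples)]
        )
    # Mean gives the prediction; the spread across passes is the uncertainty.
    return probs.mean(dim=0), probs.std(dim=0)
```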
Deep dive into optimizing memory-bound kernels to maximize performance.
A high-performance extension for PyTorch. Implemented custom CUDA kernels and distributed training primitives to supercharge deep learning workflows.
Optimized inference engine for Qwen3 0.6B. Leveraged custom Triton kernels and FlashAttention to dramatically outperform the baseline Hugging Face implementation.
Scalable distributed training library for JAX. Provides high-level abstractions for data and model parallelism, simplifying large-scale model training.
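The library's own API isn't shown here, but as a minimal sketch of the primitive such abstractions wrap, this is plain-JAX synchronous data parallelism with pmap (the loss function, learning rate, and axis name are illustrative):

```python
from functools import partial

import jax
import jax.numpy as jnp

# Illustrative least-squares loss standing in for a real model.
def loss_fn(params, x, y):
    pred = x @ params["w"] + params["b"]
    return jnp.mean((pred - y) ** 2)

# pmap runs the step on every device over its shard of the batch;
# pmean averages gradients across the device axis, which is the core
# of synchronous data parallelism.
@partial(jax.pmap, axis_name="devices")
def train_step(params, x, y):
    grads = jax.grad(loss_fn)(params, x, y)
    grads = jax.lax.pmean(grads, axis_name="devices")
    return jax.tree_util.tree_map(lambda p, g: p - 0.01 * g, params, grads)
```

To run this, parameters are replicated across devices and each batch is split along a leading device axis, e.g. x shaped (n_devices, per_device_batch, features).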
A comprehensive library of hand-written GPU kernels in CUDA and Triton. Demonstrates advanced optimization strategies like tiling, shared memory caching, and warp-level primitives.
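Not code from the repo itself, but a minimal sketch of the tiling pattern such kernels are built on: a Triton matmul where each program instance computes one output tile, streaming K-slices of the inputs through on-chip memory (block sizes and names here are illustrative):

```python
import torch
import triton
import triton.language as tl

# Each program instance owns one BLOCK_M x BLOCK_N tile of C and walks
# the K dimension in BLOCK_K steps; Triton stages the tiles of A and B
# through shared memory and maps tl.dot onto warp-level MMA units.
@triton.jit
def matmul_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                  stride_am, stride_ak, stride_bk, stride_bn,
                  stride_cm, stride_cn,
                  BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr,
                  BLOCK_K: tl.constexpr):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    rk = tl.arange(0, BLOCK_K)
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for k in range(0, K, BLOCK_K):
        a = tl.load(a_ptr + rm[:, None] * stride_am + (k + rk)[None, :] * stride_ak,
                    mask=(rm[:, None] < M) & ((k + rk)[None, :] < K), other=0.0)
        b = tl.load(b_ptr + (k + rk)[:, None] * stride_bk + rn[None, :] * stride_bn,
                    mask=((k + rk)[:, None] < K) & (rn[None, :] < N), other=0.0)
        acc += tl.dot(a, b)
    tl.store(c_ptr + rm[:, None] * stride_cm + rn[None, :] * stride_cn, acc,
             mask=(rm[:, None] < M) & (rn[None, :] < N))

def matmul(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    M, K = a.shape
    _, N = b.shape
    c = torch.empty((M, N), device=a.device, dtype=torch.float32)
    grid = (triton.cdiv(M, 64), triton.cdiv(N, 64))
    matmul_kernel[grid](a, b, c, M, N, K,
                        a.stride(0), a.stride(1), b.stride(0), b.stride(1),
                        c.stride(0), c.stride(1),
                        BLOCK_M=64, BLOCK_N=64, BLOCK_K=32)
    return c
```

In hand-written CUDA the same structure would stage tiles in __shared__ memory explicitly; Triton handles that staging and the warp-level scheduling automatically.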
I'm actively seeking remote full-time roles in AI systems and performance engineering. If you need someone who can bridge the gap between research and high-performance production code, let's talk.
Get in Touch