Available for Remote Roles

From Research Papers to Fused Kernels.

I'm an AI Systems Engineer who spans the entire stack. I don't just train models; I build the high-performance infrastructure that makes them fly. I specialize in distributed training, inference optimization, and custom CUDA/Triton kernels.

The Full Spectrum Engineer

In the world of AI, performance is everything. But true performance doesn't come from just one layer of the stack. It comes from understanding the entire pipeline—from the mathematical intuition in a research paper to the memory access patterns on the GPU.

I operate across this entire spectrum. I read papers to understand the algorithms, then I dive into the systems level to optimize them. Whether it's writing custom CUDA kernels to squeeze out 9x speedups, designing distributed training systems for massive scale, or architecting production-ready inference engines, I build systems that are fast, scalable, and robust.

Core Competencies

CUDA / C++
Triton
PyTorch Internals
Distributed Systems
Model Optimization
Computer Architecture
JAX
LLM Inference

Selected Research

Research Paper

Medical Imaging

Presenting RetinaSys, a framework designed to overcome critical limitations in medical image analysis through strong generalization, real-time performance, and clinical interpretability. Leveraging self-supervised MoCo v3 pre-training on diverse datasets, RetinaSys employs a ConvNeXt backbone fine-tuned in a multi-task paradigm that integrates lesion-centric attention, ordinal grade consistency, and domain adaptation. Explainable AI (XAI) techniques, including attention maps, integrated gradients, SHAP, and Monte Carlo dropout, enhance interpretability, while model optimization ensures efficient inference on standard CPU hardware.

Machine Learning · Research · Theory

Latest Posts

December 22, 2025 · 13 min read

Making Qwen3 go Brrr from First Principles

Deep dive into optimizing Qwen3 inference from the ground up. Learn how to build a high-performance inference engine with custom kernels and optimization techniques.

LLM · Optimization · Inference
Read More →
December 27, 2025 · 12 min read

Optimizing Memory Bound Kernel to Maximize Performance

Deep dive into optimizing memory-bound kernels to maximize performance.

CUDA · C++ · Optimization
Read More →
View All Posts

Selected Engineering

9.39x Inference Speedup

Torch++

A high-performance extension for PyTorch. Implemented custom CUDA kernels and distributed training primitives to supercharge deep learning workflows.

CUDA · C++ · PyTorch
9.25x Latency Reduction

FastQwen3

Optimized inference engine for Qwen3 0.6B. Leveraged custom Triton kernels and FlashAttention to dramatically outperform the baseline Hugging Face implementation.

Triton · CUDA · LLM
Distributed Training

DistJax

Scalable distributed training library for JAX. Provides high-level abstractions for data and model parallelism, simplifying large-scale model training.

JAX · Python · Distributed
Kernel Optimization

KernelLab

A comprehensive library of hand-written GPU kernels in CUDA and Triton. Demonstrates advanced optimization strategies like tiling, shared memory caching, and warp-level primitives.

CUDA · Triton · HPC
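Tiling, the first of the strategies named above, is simple to sketch outside of CUDA: partition the computation into small blocks so each block of data is reused many times while it sits in fast memory. As a minimal illustration (plain Python standing in for a GPU kernel, with CPU cache playing the role of shared memory; `blocked_matmul` and its `tile` parameter are hypothetical names for this sketch, not KernelLab's API), a tiled matrix multiply looks like:

```python
def blocked_matmul(A, B, tile=2):
    """Blocked (tiled) matrix multiply over nested Python lists.

    Instead of computing each C[i][j] with one long pass over A and B,
    we walk tile x tile sub-blocks: every loaded tile of A and B is
    reused across the whole inner block while it is still "hot" --
    the same reuse argument that motivates staging tiles in GPU
    shared memory.
    """
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for k0 in range(0, k, tile):
                # Accumulate one tile of A times one tile of B
                # into the corresponding tile of C.
                for i in range(i0, min(i0 + tile, n)):
                    for kk in range(k0, min(k0 + tile, k)):
                        a = A[i][kk]
                        for j in range(j0, min(j0 + tile, m)):
                            C[i][j] += a * B[kk][j]
    return C
```

On a GPU the same loop structure maps to thread blocks cooperatively loading tiles into `__shared__` memory before computing, with warp-level primitives handling the final reductions; the arithmetic is unchanged, only the data movement is restructured.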
View All Projects

Ready to Scale?

I'm actively seeking remote full-time roles in AI systems and performance engineering. If you need someone who can bridge the gap between research and high-performance production code, let's talk.

Get in Touch
GitHub · LinkedIn · Twitter/X