Making Qwen3 go Brrr from First Principles
Deep dive into optimizing Qwen3 inference from the ground up. Learn how to build a high-performance inference engine with custom kernels and optimization techniques.
I'm an AI Systems Engineer who spans the entire stack. I don't just train models; I build the high-performance infrastructure that makes them fly. I specialize in distributed training, inference optimization, and custom CUDA/Triton kernels.
In the world of AI, performance is everything. But true performance doesn't come from just one layer of the stack. It comes from understanding the entire pipeline—from the mathematical intuition in a research paper to the memory access patterns on the GPU.
I operate across this entire spectrum. I read papers to understand the algorithms, then I dive into the systems level to optimize them. Whether it's writing custom CUDA kernels to squeeze out 9x speedups, designing distributed training systems for massive scale, or architecting production-ready inference engines, I build systems that are fast, scalable, and robust.
Presenting RetinaSys, an innovative framework designed to overcome critical limitations in medical image analysis through strong generalization, real-time performance, and clinical interpretability. Leveraging self-supervised MoCo v3 pre-training on diverse datasets, RetinaSys employs a ConvNeXt backbone fine-tuned in a multi-task paradigm, integrating lesion-centric attention, ordinal grade consistency, and domain adaptation. Explainable AI (XAI) techniques such as attention maps, integrated gradients, SHAP, and Monte Carlo dropout enhance interpretability, while model optimization ensures efficient inference on standard CPU hardware.
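As a rough illustration of the Monte Carlo dropout component, here is a minimal generic PyTorch sketch, not RetinaSys code; the model, input, and sample count are placeholders:

```python
import torch
import torch.nn as nn

def mc_dropout_predict(model: nn.Module, x: torch.Tensor, n_samples: int = 20):
    """Estimate predictive uncertainty by keeping dropout active at
    inference time and averaging several stochastic forward passes."""
    model.eval()
    # Re-enable only the dropout layers; batch norm etc. stay in eval mode.
    for module in model.modules():
        if isinstance(module, nn.Dropout):
            module.train()
    with torch.no_grad():
        probs = torch.stack(
            [torch.softmax(model(x), dim=-1) for _ in range(n_samples)]
        )
    # Mean gives the prediction; the spread across passes is the uncertainty.
    return probs.mean(dim=0), probs.std(dim=0)
```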
Deep dive into optimizing memory-bound kernels to maximize performance.
A high-performance extension for PyTorch. Implemented custom CUDA kernels and distributed training primitives to supercharge deep learning workflows.
Optimized inference engine for Qwen3 0.6B. Leveraged custom Triton kernels and FlashAttention to dramatically outperform the baseline Hugging Face implementation.
Scalable distributed training library for JAX. Provides high-level abstractions for data and model parallelism, simplifying large-scale model training.
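The library's own API isn't shown here, but as a minimal sketch of the primitive such abstractions wrap, this is plain-JAX synchronous data parallelism with pmap (the loss function, learning rate, and axis name are illustrative):

```python
from functools import partial

import jax
import jax.numpy as jnp

# Illustrative least-squares loss standing in for a real model.
def loss_fn(params, x, y):
    pred = x @ params["w"] + params["b"]
    return jnp.mean((pred - y) ** 2)

# pmap runs the step on every device over its shard of the batch;
# pmean averages gradients across the device axis, which is the core
# of synchronous data parallelism.
@partial(jax.pmap, axis_name="devices")
def train_step(params, x, y):
    grads = jax.grad(loss_fn)(params, x, y)
    grads = jax.lax.pmean(grads, axis_name="devices")
    return jax.tree_util.tree_map(lambda p, g: p - 0.01 * g, params, grads)
```

To run this, parameters are replicated across devices and each batch is split along a leading device axis, e.g. x shaped (n_devices, per_device_batch, features).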
A comprehensive library of hand-written GPU kernels in CUDA and Triton. Demonstrates advanced optimization strategies like tiling, shared memory caching, and warp-level primitives.
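Not code from the repo itself, but a minimal sketch of the tiling pattern such kernels are built on: a Triton matmul where each program instance computes one output tile, streaming K-slices of the inputs through on-chip memory (block sizes and names here are illustrative):

```python
import torch
import triton
import triton.language as tl

# Each program instance owns one BLOCK_M x BLOCK_N tile of C and walks
# the K dimension in BLOCK_K steps; Triton stages the tiles of A and B
# through shared memory and maps tl.dot onto warp-level MMA units.
@triton.jit
def matmul_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                  stride_am, stride_ak, stride_bk, stride_bn,
                  stride_cm, stride_cn,
                  BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr,
                  BLOCK_K: tl.constexpr):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    rk = tl.arange(0, BLOCK_K)
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for k in range(0, K, BLOCK_K):
        a = tl.load(a_ptr + rm[:, None] * stride_am + (k + rk)[None, :] * stride_ak,
                    mask=(rm[:, None] < M) & ((k + rk)[None, :] < K), other=0.0)
        b = tl.load(b_ptr + (k + rk)[:, None] * stride_bk + rn[None, :] * stride_bn,
                    mask=((k + rk)[:, None] < K) & (rn[None, :] < N), other=0.0)
        acc += tl.dot(a, b)
    tl.store(c_ptr + rm[:, None] * stride_cm + rn[None, :] * stride_cn, acc,
             mask=(rm[:, None] < M) & (rn[None, :] < N))

def matmul(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    M, K = a.shape
    _, N = b.shape
    c = torch.empty((M, N), device=a.device, dtype=torch.float32)
    grid = (triton.cdiv(M, 64), triton.cdiv(N, 64))
    matmul_kernel[grid](a, b, c, M, N, K,
                        a.stride(0), a.stride(1), b.stride(0), b.stride(1),
                        c.stride(0), c.stride(1),
                        BLOCK_M=64, BLOCK_N=64, BLOCK_K=32)
    return c
```

In hand-written CUDA the same structure would stage tiles in __shared__ memory explicitly; Triton handles that staging and the warp-level scheduling automatically.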
I'm actively seeking remote full-time roles in AI systems and performance engineering. If you need someone who can bridge the gap between research and high-performance production code, let's talk.
Get in Touch