WEI CHENG CHIU waynehacking8

working on ML inference & Agentic AI on the NVIDIA stack

tensor-core-from-scratch Public

From naive matmul to tensor cores on NVIDIA Blackwell — step by step. 8 self-contained CUDA kernels, each benchmarked against cuBLAS.

Cuda · 1 star · 1 fork

federated-learning-lab Public

From-scratch federated learning: FedAvg / FedProx / SCAFFOLD, DP-SGD & secure aggregation, plus FedPer / Byzantine-robust / FedAdam / FedLoRA. 50/50 tests + CI, literature-cross-validated, with hon…

Python

inference-kernel-cookbook Public

LLM inference techniques from scratch — Flash Attention, KV Cache, Paged Attention, each in one self-contained CUDA file. Benchmarked on Blackwell.

Cuda

llm-security-lab Public

Learn LLM security by building attacks and defenses from first principles. System prompt extraction, prompt injection, model extraction — each in one runnable Python file.

Python

nccl-collectives-bench Public

NCCL collective benchmarks on an 8×H100 NVSwitch host — busbw vs link budget, NVLS/Ring/Tree, small-message latency floors (eager vs CUDA Graph vs symmetric memory), and the TP-decode comms ceiling…

Python

trtllm-triton-serving Public

TensorRT-LLM vs vLLM controlled head-to-head on H100 — 12 studies including a knob-by-knob waterfall reproducing NVIDIA's published 27.7k tok/s (100.3%) and attributing the gap to real serving, plu…

Python