Pinned
-
tensor-core-from-scratch (Public): From naive matmul to tensor cores on NVIDIA Blackwell, step by step. 8 self-contained CUDA kernels, each benchmarked against cuBLAS.
-
federated-learning-lab (Public): From-scratch federated learning: FedAvg / FedProx / SCAFFOLD, DP-SGD & secure aggregation, plus FedPer / Byzantine-robust / FedAdam / FedLoRA. 50/50 tests + CI, literature-cross-validated, with hon…
Python
-
inference-kernel-cookbook (Public): LLM inference techniques from scratch: Flash Attention, KV Cache, Paged Attention, each in one self-contained CUDA file. Benchmarked on Blackwell.
Cuda
-
llm-security-lab (Public): Learn LLM security by building attacks and defenses from first principles. System prompt extraction, prompt injection, model extraction, each in one runnable Python file.
Python
-
nccl-collectives-bench (Public): NCCL collective benchmarks on an 8×H100 NVSwitch host: busbw vs link budget, NVLS/Ring/Tree, small-message latency floors (eager vs CUDA Graph vs symmetric memory), and the TP-decode comms ceiling…
Python
-
trtllm-triton-serving (Public): TensorRT-LLM vs vLLM controlled head-to-head on H100: 12 studies including a knob-by-knob waterfall reproducing NVIDIA's published 27.7k tok/s (100.3%) and attributing the gap to real serving, plu…
Python