yongwww
🐢 working
  • Redmond, WA

Highlights

  • Pro

Organizations

@apache @NVIDIA @octoml @flashinfer-ai


FlashMLA: Efficient Multi-head Latent Attention Kernels

C++ 12,526 1,005 Updated Feb 6, 2026

FlashInfer Bench @ MLSys 2026: Building AI agents to write high performance GPU kernels

Python 153 108 Updated Mar 20, 2026

Your own personal AI assistant. Any OS. Any Platform. The lobster way. 🦞

TypeScript 328,997 63,874 Updated Mar 22, 2026

Anthropic's original performance take-home, now open for you to try!

Python 3,714 844 Updated Jan 22, 2026

Our first fully AI generated deep learning system

Python 591 44 Updated Feb 2, 2026

Fast and memory-efficient exact attention

Python 22,892 2,537 Updated Mar 21, 2026

Autonomous GPU Kernel Generation & Optimization via Deep Agents

Python 321 51 Updated Mar 18, 2026

CUDA Tile IR is an MLIR-based intermediate representation and compiler infrastructure for CUDA kernel optimization, focusing on tile-based computation patterns and optimizations targeting NVIDIA te…

MLIR 879 63 Updated Mar 17, 2026

Building the Virtuous Cycle for AI-driven LLM Systems

Python 204 31 Updated Mar 19, 2026

KernelBench: Can LLMs Write GPU Kernels? - Benchmark + Toolkit with Torch -> CUDA (+ more DSLs)

Jupyter Notebook 877 145 Updated Mar 9, 2026

A Datacenter Scale Distributed Inference Serving Framework

Rust 6,364 943 Updated Mar 22, 2026

cuTile is a programming model for writing parallel kernels for NVIDIA GPUs

Python 1,983 127 Updated Mar 21, 2026

kernelboard is the webapp for https://www.gpumode.com

TypeScript 17 12 Updated Mar 16, 2026

Tile-Based Runtime for Ultra-Low-Latency LLM Inference

Python 687 40 Updated Mar 8, 2026

JAX support for tvm-ffi abi

Python 22 4 Updated Mar 14, 2026

Open Source Continuous Inference Benchmarking Qwen3.5, DeepSeek, GPTOSS - GB200 NVL72 vs MI355X vs B200 vs GB300 NVL72 vs H100 & soon™ TPUv6e/v7/Trainium2/3

Python 698 104 Updated Mar 22, 2026

a size profiler for cuda binary

Python 71 Updated Jan 15, 2026

FlashInfer+ROCm: ROCm port of FlashInfer

Cuda 6 8 Updated Mar 19, 2026

Terraform module for scalable GitHub action runners on AWS

TypeScript 3,029 714 Updated Mar 20, 2026

Open ABI and FFI for Machine Learning Systems

C++ 364 64 Updated Mar 21, 2026

NVIDIA NVSHMEM is a parallel programming interface for NVIDIA GPUs based on OpenSHMEM. NVSHMEM can significantly reduce multi-process communication and coordination overheads by allowing programmer…

C++ 483 68 Updated Mar 10, 2026

Tilus is a tile-level kernel programming language with explicit control over shared memory and registers.

Python 457 18 Updated Mar 21, 2026

Distributed Compiler based on Triton for Parallel Systems

Python 1,394 132 Updated Mar 11, 2026

An implementation of a deep learning recommendation model (DLRM)

Python 4,027 868 Updated Jan 12, 2026

Mirage Persistent Kernel: Compiling LLMs into a MegaKernel

C++ 2,161 184 Updated Mar 21, 2026

UCCL is an efficient communication library for GPUs, covering collectives, P2P (e.g., KV cache transfer, RL weight transfer), and EP (e.g., GPU-driven)

C++ 1,240 130 Updated Mar 21, 2026

verl: Volcano Engine Reinforcement Learning for LLMs

Python 20,097 3,475 Updated Mar 21, 2026
Python 173 96 Updated Mar 18, 2026