Yong Wu yongwww

🐢

working

MLSys Engineer @ Nvidia | FlashInfer and Machine Learning Compiler LLM co-design

100 followers · 86 following

@NVIDIA
Redmond, WA

Stars

deepseek-ai / FlashMLA

FlashMLA: Efficient Multi-head Latent Attention Kernels

C++ 12,526 1,005 Updated Feb 6, 2026

flashinfer-ai / flashinfer-bench-starter-kit

FlashInfer Bench @ MLSys 2026: Building AI agents to write high performance GPU kernels

Python 153 108 Updated Mar 20, 2026

openclaw / openclaw

Your own personal AI assistant. Any OS. Any Platform. The lobster way. 🦞

TypeScript 328,997 63,874 Updated Mar 22, 2026

anthropics / original_performance_takehome

Anthropic's original performance take-home, now open for you to try!

Python 3,714 844 Updated Jan 22, 2026

NVlabs / vibetensor

Our first fully AI generated deep learning system

Python 591 44 Updated Feb 2, 2026

Dao-AILab / flash-attention

Fast and memory-efficient exact attention

Python 22,892 2,537 Updated Mar 21, 2026

meta-pytorch / KernelAgent

Autonomous GPU Kernel Generation & Optimization via Deep Agents

Python 321 51 Updated Mar 18, 2026

NVIDIA / cuda-tile

CUDA Tile IR is an MLIR-based intermediate representation and compiler infrastructure for CUDA kernel optimization, focusing on tile-based computation patterns and optimizations targeting NVIDIA te…

MLIR 879 63 Updated Mar 17, 2026

flashinfer-ai / flashinfer-bench

Building the Virtuous Cycle for AI-driven LLM Systems

Python 204 31 Updated Mar 19, 2026

ScalingIntelligence / KernelBench

KernelBench: Can LLMs Write GPU Kernels? - Benchmark + Toolkit with Torch -> CUDA (+ more DSLs)

Jupyter Notebook 877 145 Updated Mar 9, 2026

ai-dynamo / dynamo

A Datacenter Scale Distributed Inference Serving Framework

Rust 6,364 943 Updated Mar 22, 2026

NVIDIA / cutile-python

cuTile is a programming model for writing parallel kernels for NVIDIA GPUs

Python 1,983 127 Updated Mar 21, 2026

gpu-mode / kernelboard

kernelboard is the webapp for https://www.gpumode.com

TypeScript 17 12 Updated Mar 16, 2026

tile-ai / TileRT

Tile-Based Runtime for Ultra-Low-Latency LLM Inference

Python 687 40 Updated Mar 8, 2026

deepseek-ai / DeepSeek-V3.2-Exp

Python 1,515 150 Updated Nov 18, 2025

NVIDIA / jax-tvm-ffi

JAX support for tvm-ffi abi

Python 22 4 Updated Mar 14, 2026

SemiAnalysisAI / InferenceX

Open Source Continuous Inference Benchmarking Qwen3.5, DeepSeek, GPTOSS - GB200 NVL72 vs MI355X vs B200 vs GB300 NVL72 vs H100 & soon™ TPUv6e/v7/Trainium2/3

Python 698 104 Updated Mar 22, 2026

flashinfer-ai / cubloaty

a size profiler for cuda binary

Python 71 Updated Jan 15, 2026

ROCm / flashinfer

Forked from flashinfer-ai/flashinfer

FlashInfer+ROCm: ROCm port of FlashInfer

Cuda 6 8 Updated Mar 19, 2026

github-aws-runners / terraform-aws-github-runner

Terraform module for scalable GitHub action runners on AWS

TypeScript 3,029 714 Updated Mar 20, 2026

flashinfer-ai / flashinfer-trace

Python 3 4 Updated Oct 29, 2025

apache / tvm-ffi

Open ABI and FFI for Machine Learning Systems

C++ 364 64 Updated Mar 21, 2026

NVIDIA / nvshmem

NVIDIA NVSHMEM is a parallel programming interface for NVIDIA GPUs based on OpenSHMEM. NVSHMEM can significantly reduce multi-process communication and coordination overheads by allowing programmer…

C++ 483 68 Updated Mar 10, 2026

NVIDIA / tilus

Tilus is a tile-level kernel programming language with explicit control over shared memory and registers.

Python 457 18 Updated Mar 21, 2026

ByteDance-Seed / Triton-distributed

Distributed Compiler based on Triton for Parallel Systems

Python 1,394 132 Updated Mar 11, 2026

facebookresearch / dlrm

An implementation of a deep learning recommendation model (DLRM)

Python 4,027 868 Updated Jan 12, 2026

mirage-project / mirage

Mirage Persistent Kernel: Compiling LLMs into a MegaKernel

C++ 2,161 184 Updated Mar 21, 2026

uccl-project / uccl

UCCL is an efficient communication library for GPUs, covering collectives, P2P (e.g., KV cache transfer, RL weight transfer), and EP (e.g., GPU-driven)

C++ 1,240 130 Updated Mar 21, 2026

verl-project / verl

verl: Volcano Engine Reinforcement Learning for LLMs

Python 20,097 3,475 Updated Mar 21, 2026

mlc-ai / relax

Python 173 96 Updated Mar 18, 2026
