yzhaiustc

Yujia Zhai yzhaiustc

216 followers · 15 following

@NVIDIA
Santa Clara, California
https://yzhaiustc.github.io/

Stars

zartbot / shallowsim

DeepSeek-V3/R1 inference performance simulator

Jupyter Notebook · 195 stars · 30 forks · Updated Mar 27, 2025

Dao-AILab / quack

A Quirky Assortment of CuTe Kernels

Python · 972 stars · 128 forks · Updated May 17, 2026

excalidraw / excalidraw

Virtual whiteboard for sketching hand-drawn like diagrams

TypeScript · 123,431 stars · 13,677 forks · Updated May 14, 2026

ByteDance-Seed / Seed-Thinking-v1.5

814 stars · 18 forks · Updated Jun 9, 2025

ai-dynamo / dynamo

A Datacenter Scale Distributed Inference Serving Framework

Rust · 6,803 stars · 1,113 forks · Updated May 18, 2026

deepseek-ai / DeepGEMM

DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling

Cuda · 7,265 stars · 983 forks · Updated May 13, 2026

deepseek-ai / DeepEP

DeepEP: an efficient expert-parallel communication library

Cuda · 9,632 stars · 1,245 forks · Updated May 13, 2026

huggingface / open-r1

Fully open reproduction of DeepSeek-R1

Python · 26,017 stars · 2,418 forks · Updated Apr 2, 2026

deepseek-ai / DeepSeek-R1

92,015 stars · 11,734 forks · Updated Jun 27, 2025

SiriusNEO / Triton-Puzzles-Lite

Puzzles for learning Triton, play it with minimal environment configuration!

Python · 699 stars · 97 forks · Updated Mar 17, 2026

triton-lang / triton

Development repository for the Triton language and compiler

MLIR · 19,206 stars · 2,855 forks · Updated May 17, 2026

ChenLiu-1996 / CitationMap

A simple pip-installable Python tool to generate your HTML citation world map from your Google Scholar ID.

Python · 705 stars · 63 forks · Updated Mar 14, 2026

mit-han-lab / omniserve

[MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention

C++ · 838 stars · 63 forks · Updated Mar 6, 2025

bytedance / flux

A fast communication-overlapping library for tensor/expert parallelism on GPUs.

C++ · 1,306 stars · 101 forks · Updated Aug 28, 2025

tinygrad / tinygrad

You like pytorch? You like micrograd? You love tinygrad! ❤️

Python · 32,707 stars · 4,112 forks · Updated May 17, 2026

xai-org / grok-1

Grok open release

Python · 51,649 stars · 8,478 forks · Updated Aug 30, 2024

volcengine / veScale

Byted PyTorch Distributed for Hyperscale Training of LLMs and RLs

Python · 1,010 stars · 61 forks · Updated Mar 3, 2026

IST-DASLab / marlin

FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batchsizes of 16-32 tokens.

Python · 1,076 stars · 88 forks · Updated Sep 4, 2024

google / heir

A compiler for homomorphic encryption

C++ · 727 stars · 133 forks · Updated May 17, 2026

NVIDIA / TensorRT-LLM

TensorRT LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. Tensor…

Python · 13,664 stars · 2,385 forks · Updated May 17, 2026

iree-org / iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.

C++ · 3,761 stars · 910 forks · Updated May 17, 2026

AlibabaResearch / flash-llm

Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity

Cuda · 244 stars · 24 forks · Updated Sep 24, 2023

tlc-pack / libflash_attn

Standalone Flash Attention v2 kernel without libtorch dependency

C++ · 113 stars · 15 forks · Updated Sep 10, 2024

Dao-AILab / flash-attention

Fast and memory-efficient exact attention

Python · 23,813 stars · 2,731 forks · Updated May 16, 2026

intel / xetla

C++ · 61 stars · 21 forks · Updated Dec 18, 2024

raulbehl / 100DaysOfRTL

100 Days of RTL

SystemVerilog · 411 stars · 111 forks · Updated Aug 15, 2024

vosen / ZLUDA

CUDA on non-NVIDIA GPUs

Rust · 14,209 stars · 904 forks · Updated May 14, 2026

hpcaitech / ColossalAI

Making large AI models cheaper, faster and more accessible

Python · 41,382 stars · 4,512 forks · Updated May 11, 2026

eth-cscs / spla

Specialized Parallel Linear Algebra, providing distributed GEMM functionality for specific matrix distributions with optional GPU acceleration.

C++ · 32 stars · 7 forks · Updated Jun 26, 2024

icl-utk-edu / slate

SLATE is a distributed, GPU-accelerated, dense linear algebra library targeting current and upcoming high-performance computing (HPC) systems. It is developed as part of the U.S. Department of Ene…

C++ · 132 stars · 29 forks · Updated Oct 21, 2025
