DeepSeek-V3/R1 inference performance simulator

Jupyter Notebook 195 30 Updated Mar 27, 2025

A Quirky Assortment of CuTe Kernels

Python 972 128 Updated May 17, 2026

Virtual whiteboard for sketching hand-drawn like diagrams

TypeScript 123,431 13,677 Updated May 14, 2026

A Datacenter Scale Distributed Inference Serving Framework

Rust 6,803 1,113 Updated May 18, 2026

DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling

Cuda 7,265 983 Updated May 13, 2026

DeepEP: an efficient expert-parallel communication library

Cuda 9,632 1,245 Updated May 13, 2026

Fully open reproduction of DeepSeek-R1

Python 26,017 2,418 Updated Apr 2, 2026

Puzzles for learning Triton; play with them using minimal environment configuration!

Python 699 97 Updated Mar 17, 2026

Development repository for the Triton language and compiler

MLIR 19,206 2,855 Updated May 17, 2026

A simple pip-installable Python tool to generate your HTML citation world map from your Google Scholar ID.

Python 705 63 Updated Mar 14, 2026

[MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention

C++ 838 63 Updated Mar 6, 2025

A fast communication-overlapping library for tensor/expert parallelism on GPUs.

C++ 1,306 101 Updated Aug 28, 2025

You like pytorch? You like micrograd? You love tinygrad! ❤️

Python 32,707 4,112 Updated May 17, 2026

Grok open release

Python 51,649 8,478 Updated Aug 30, 2024

Byted PyTorch Distributed for Hyperscale Training of LLMs and RLs

Python 1,010 61 Updated Mar 3, 2026

FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens.

Python 1,076 88 Updated Sep 4, 2024

A compiler for homomorphic encryption

C++ 727 133 Updated May 17, 2026

TensorRT LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. Tensor…

Python 13,664 2,385 Updated May 17, 2026

A retargetable MLIR-based machine learning compiler and runtime toolkit.

C++ 3,761 910 Updated May 17, 2026

Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity

Cuda 244 24 Updated Sep 24, 2023

Standalone Flash Attention v2 kernel without libtorch dependency

C++ 113 15 Updated Sep 10, 2024

Fast and memory-efficient exact attention

Python 23,813 2,731 Updated May 16, 2026
C++ 61 21 Updated Dec 18, 2024

100 Days of RTL

SystemVerilog 411 111 Updated Aug 15, 2024

CUDA on non-NVIDIA GPUs

Rust 14,209 904 Updated May 14, 2026

Making large AI models cheaper, faster and more accessible

Python 41,382 4,512 Updated May 11, 2026

Specialized Parallel Linear Algebra, providing distributed GEMM functionality for specific matrix distributions with optional GPU acceleration.

C++ 32 7 Updated Jun 26, 2024

SLATE is a distributed, GPU-accelerated, dense linear algebra library targeting current and upcoming high-performance computing (HPC) systems. It is developed as part of the U.S. Department of Ene…

C++ 132 29 Updated Oct 21, 2025