Python 185 29 Updated Jun 15, 2026

Microbenchmarking hyperparameter tuning for JAX functions.

Python 18 Updated Apr 15, 2026
Python 1 2 Updated Apr 16, 2026

GPUGrants - a list of GPU grants that I can think of

110 6 Updated Sep 13, 2025

DoubleAI’s hyperoptimised version of cuGraph

Cuda 60 6 Updated Mar 3, 2026

Automated High-Performance GPU Kernel Generation

Python 115 22 Updated Jun 1, 2026

SGLang is a high-performance serving framework for large language models and multimodal models.

Python 29,172 6,606 Updated Jun 19, 2026

A Quirky Assortment of CuTe Kernels

Python 1,019 136 Updated Jun 16, 2026

UCCL is an efficient communication library for GPUs, covering collectives, P2P (e.g., KV cache transfer, RL weight transfer), and EP (e.g., GPU-driven)

C++ 1,418 158 Updated Jun 19, 2026

A PyTorch-native inference engine with cache, parallelism, quantization and cpu offload for DiTs.

Python 1,205 75 Updated Jun 16, 2026

DFlash: Block Diffusion for Flash Speculative Decoding

Python 5,167 373 Updated May 10, 2026

Artifact for "Fail Fast, Win Big: Rethinking the Drafting Strategy in Speculative Decoding via Diffusion LLMs" [arXiv '25]

Python 20 4 Updated May 4, 2026

Codebase for Cuda Learning

Cuda 35 3 Updated Jul 13, 2024

A curriculum for learning about gpu performance engineering, from scratch to what the frontier AI labs do

829 102 Updated Apr 27, 2026

How to optimize some algorithms in CUDA.

Cuda 3,090 279 Updated Jun 9, 2026

System Intelligence Benchmark

TLA 54 11 Updated May 9, 2026

Efficient Long-context Language Model Training by Core Attention Disaggregation

Python 105 7 Updated Apr 7, 2026

A compact implementation of SGLang, designed to demystify the complexities of modern LLM serving systems.

Python 4,426 704 Updated May 17, 2026

Miles is an enterprise-facing reinforcement learning framework for LLM and VLM post-training, forked from and co-evolving with slime.

Python 1,586 267 Updated Jun 19, 2026

Nano vLLM

Python 14,094 2,231 Updated Apr 26, 2026

Mirage Persistent Kernel: Compiling LLMs into a MegaKernel

Cuda 2,320 221 Updated Jun 18, 2026

Accelerate inference without tears

Python 372 23 Updated Jan 23, 2026

TPU inference for vLLM, with unified JAX and PyTorch support.

Python 357 215 Updated Jun 19, 2026

NanoGPT (124M) in 90 seconds

Python 5,412 811 Updated Jun 13, 2026

Tenstorrent MLIR compiler

MLIR 280 131 Updated Jun 19, 2026

🤘 TT-NN operator library, and TT-Metalium low level kernel programming model.

C++ 1,533 507 Updated Jun 19, 2026

Universal LLM Deployment Engine with ML Compilation

Python 22,822 2,066 Updated May 11, 2026

Open Machine Learning Compiler Framework

Python 13,476 3,896 Updated Jun 18, 2026