cuTile is a programming model for writing parallel kernels for NVIDIA GPUs

Python 2,017 130 Updated Apr 11, 2026

vLLM’s reference system for K8S-native cluster-wide deployment with community-driven performance optimization

Python 2,272 387 Updated Apr 13, 2026

Achieve state of the art inference performance with modern accelerators on Kubernetes

Shell 2,975 406 Updated Apr 13, 2026

A collection of GPU experiments and benchmarks for my personal understanding and research.

Cuda 28 7 Updated Apr 9, 2026

QuTLASS: CUTLASS-Powered Quantized BLAS for Deep Learning

C++ 176 21 Updated Nov 11, 2025

Fast and memory-efficient exact attention

Python 23,334 2,612 Updated Apr 13, 2026

Quantized LLM training in pure CUDA/C++.

C++ 244 14 Updated Mar 6, 2026

A PyTorch native platform for training generative AI models

Python 5,234 782 Updated Apr 13, 2026

Learn CUDA with PyTorch

Cuda 270 37 Updated Apr 9, 2026

My tools for the Slurm HPC workload manager

Shell 574 112 Updated Apr 13, 2026

Write a fast kernel and see how you compare against the best humans and AI on gpumode.com

Python 91 28 Updated Apr 10, 2026

A fast AI Video Generator for the GPU Poor. Supports Wan 2.1/2.2, Qwen Image, Hunyuan Video, LTX Video and Flux.

Python 5,207 751 Updated Apr 11, 2026

TritonParse: A Compiler Tracer, Visualizer, and Reproducer for Triton Kernels

Python 200 23 Updated Apr 13, 2026
Python 2,858 613 Updated Apr 13, 2026

Genai-bench is a powerful benchmark tool designed for comprehensive token-level performance evaluation of large language model (LLM) serving systems.

Python 290 51 Updated Apr 2, 2026

Open Model Engine (OME) — Kubernetes operator for LLM serving, GPU scheduling, and model lifecycle management. Works with SGLang, vLLM, TensorRT-LLM, and Triton

Go 419 72 Updated Apr 11, 2026

A Quirky Assortment of CuTe Kernels

Python 924 110 Updated Apr 13, 2026

Allow torch tensor memory to be released and resumed later

Python 237 49 Updated Mar 10, 2026

An open-source AI agent that brings the power of Gemini directly into your terminal.

TypeScript 101,141 13,087 Updated Apr 13, 2026

CUDA Kernel Benchmarking Library

Cuda 849 102 Updated Apr 13, 2026

Train your Agent model via our easy and efficient framework

Python 1,732 163 Updated Dec 5, 2025

UCCL is an efficient communication library for GPUs, covering collectives, P2P (e.g., KV cache transfer, RL weight transfer), and EP (e.g., GPU-driven)

C++ 1,291 136 Updated Apr 13, 2026

From a+b to sparsemax(QK^T)V in Triton!

Jupyter Notebook 29 Updated Jun 19, 2025

FlashInfer: Kernel Library for LLM Serving

Python 5,383 893 Updated Apr 13, 2026

📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA Kernels, Tensor Cores, HGEMM, FA-2 MMA.🎉

Cuda 10,260 1,041 Updated Apr 12, 2026

LLM training in simple, raw C/CUDA

Cuda 29,552 3,517 Updated Jun 26, 2025

Tensors and Dynamic neural networks in Python with strong GPU acceleration

Python 99,094 27,478 Updated Apr 13, 2026

My tests and experiments with some popular dl frameworks.

Python 17 2 Updated Sep 11, 2025