
cuTile is a programming model for writing parallel kernels for NVIDIA GPUs

Python 2,009 130 Updated Apr 2, 2026

vLLM’s reference system for K8S-native cluster-wide deployment with community-driven performance optimization

Python 2,256 382 Updated Mar 31, 2026

Achieve state-of-the-art inference performance with modern accelerators on Kubernetes

Shell 2,889 385 Updated Apr 2, 2026

A collection of GPU experiments and benchmarks for my personal understanding and research.

Cuda 27 7 Updated Mar 18, 2026

QuTLASS: CUTLASS-Powered Quantized BLAS for Deep Learning

C++ 174 20 Updated Nov 11, 2025

Fast and memory-efficient exact attention

Python 23,107 2,576 Updated Apr 2, 2026
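The repository above is the FlashAttention kernel library. Mathematically it computes the same result as naive attention, softmax(QK^T/√d)V, but tiles the computation so the full score matrix never materializes in memory. A minimal pure-Python reference of the naive computation it matches (illustrative only, not the library's API; `naive_attention` is my name):

```python
import math

def naive_attention(Q, K, V):
    """Reference softmax(Q K^T / sqrt(d)) V over small lists of row vectors.
    FlashAttention produces the same output, but computed in tiles so the
    full score matrix is never stored."""
    d = len(Q[0])
    out = []
    for q in Q:
        # scaled dot-product scores against every key
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        m = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        # weighted sum of value rows
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out
```

Because the softmax weights sum to one, each output row is a convex combination of the value rows, which is the invariant the tiled kernel preserves.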

Quantized LLM training in pure CUDA/C++.

C++ 242 14 Updated Mar 6, 2026
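Quantized training of the kind the repo above describes hinges on a quantize/dequantize round trip. A sketch of symmetric per-tensor int8 quantization in plain Python (the function names are mine, not the repo's; real kernels do this per-block on the GPU):

```python
def quantize_int8(xs):
    """Symmetric per-tensor int8 quantization: map floats into [-127, 127].
    Returns the integer codes and the scale needed to reconstruct them."""
    scale = max(abs(x) for x in xs) / 127.0 or 1.0  # avoid div-by-zero on all-zeros
    q = [round(x / scale) for x in xs]
    return q, scale

def dequantize_int8(q, scale):
    """Reconstruct approximate floats; error is bounded by scale/2 per element."""
    return [qi * scale for qi in q]
```

The reconstruction error per element is at most half a quantization step, which is why the largest-magnitude value sets the scale.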

A PyTorch native platform for training generative AI models

Python 5,205 771 Updated Apr 2, 2026

Learn CUDA with PyTorch

Cuda 262 35 Updated Mar 27, 2026

My tools for the Slurm HPC workload manager

Shell 573 112 Updated Mar 30, 2026

Write a fast kernel and see how you compare against the best humans and AI on gpumode.com

Python 90 27 Updated Apr 2, 2026

A fast AI Video Generator for the GPU Poor. Supports Wan 2.1/2.2, Qwen Image, Hunyuan Video, LTX Video and Flux.

Python 4,915 703 Updated Mar 30, 2026

TritonParse: A Compiler Tracer, Visualizer, and Reproducer for Triton Kernels

Python 198 21 Updated Apr 1, 2026

Genai-bench is a powerful benchmark tool designed for comprehensive token-level performance evaluation of large language model (LLM) serving systems.

Python 287 51 Updated Apr 2, 2026

Open Model Engine (OME) — Kubernetes operator for LLM serving, GPU scheduling, and model lifecycle management. Works with SGLang, vLLM, TensorRT-LLM, and Triton

Go 412 68 Updated Apr 1, 2026

A Quirky Assortment of CuTe Kernels

Python 891 101 Updated Apr 2, 2026

Allow torch tensor memory to be released and resumed later

Python 229 46 Updated Mar 10, 2026

An open-source AI agent that brings the power of Gemini directly into your terminal.

TypeScript 100,020 12,836 Updated Apr 2, 2026

CUDA Kernel Benchmarking Library

Cuda 838 102 Updated Apr 2, 2026

Train your Agent model via our easy and efficient framework

Python 1,723 163 Updated Dec 5, 2025

UCCL is an efficient communication library for GPUs, covering collectives, P2P (e.g., KV cache transfer, RL weight transfer), and EP (e.g., GPU-driven)

C++ 1,269 131 Updated Apr 2, 2026

From a+b to sparsemax(QK^T)V in Triton!

Jupyter Notebook 29 Updated Jun 19, 2025
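The notebook series above builds from elementwise add up to sparsemax(QK^T)V in Triton. The sparsemax projection itself (Martins & Astudillo's Euclidean projection onto the simplex) can be sketched in plain Python; this is an illustrative reference, not code from the repo:

```python
def sparsemax(z):
    """Project z onto the probability simplex (sparsemax).
    Like softmax it returns a distribution summing to 1, but it can
    assign exact zeros to low-scoring entries."""
    zs = sorted(z, reverse=True)
    cumsum, k = 0.0, 0
    for i, zi in enumerate(zs, start=1):
        cumsum += zi
        if 1 + i * zi > cumsum:  # zi is still above the running threshold
            k = i                # support size: number of nonzero outputs
    tau = (sum(zs[:k]) - 1) / k  # threshold subtracted from every entry
    return [max(zi - tau, 0.0) for zi in z]
```

Unlike softmax, a sufficiently dominant score collapses the output to a one-hot vector, e.g. `sparsemax([2.0, 0.0])` gives `[1.0, 0.0]`.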

FlashInfer: Kernel Library for LLM Serving

Python 5,260 846 Updated Apr 2, 2026

📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA Kernels, Tensor Cores, HGEMM, FA-2 MMA.🎉

Cuda 10,095 1,023 Updated Mar 23, 2026

LLM training in simple, raw C/CUDA

Cuda 29,328 3,473 Updated Jun 26, 2025

Tensors and Dynamic neural networks in Python with strong GPU acceleration

Python 98,762 27,385 Updated Apr 2, 2026

My tests and experiments with some popular dl frameworks.

Python 17 2 Updated Sep 11, 2025