View coolhok's full-sized avatar


Puzzles for learning Triton

Jupyter Notebook 2,348 207 Updated Mar 18, 2026

Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance.

Python 333 77 Updated Mar 24, 2026

Achieve state of the art inference performance with modern accelerators on Kubernetes

Shell 2,700 364 Updated Mar 24, 2026
Python 52 3 Updated May 19, 2025

Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.

C++ 4,969 623 Updated Mar 24, 2026

[ICML2025] SpargeAttention: A training-free sparse attention that accelerates any model inference.

Cuda 966 88 Updated Feb 25, 2026
C++ 119 18 Updated May 16, 2025

Distributed Compiler based on Triton for Parallel Systems

Python 1,395 133 Updated Mar 11, 2026

Unified Communication X (mailing list - https://elist.ornl.gov/mailman/listinfo/ucx-group)

C 1,602 534 Updated Mar 23, 2026

A Datacenter Scale Distributed Inference Serving Framework

Rust 6,396 950 Updated Mar 24, 2026

My learning notes for ML SYS.

Python 5,763 373 Updated Mar 19, 2026

A bidirectional pipeline parallelism algorithm for computation-communication overlap in DeepSeek V3/R1 training.

Python 2,935 318 Updated Jan 14, 2026

Production-tested AI infrastructure tools for efficient AGI development and community-driven innovation

7,971 288 Updated May 15, 2025

verl: Volcano Engine Reinforcement Learning for LLMs

Python 20,170 3,489 Updated Mar 24, 2026

SGLang is a high-performance serving framework for large language models and multimodal models.

Python 24,965 4,974 Updated Mar 24, 2026

Infinity is a high-throughput, low-latency serving engine for text-embeddings, reranking models, clip, clap and colpali

Python 2,728 182 Updated Mar 24, 2026

A high-throughput and memory-efficient inference and serving engine for LLMs

Python 74,178 14,726 Updated Mar 24, 2026

FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens.

Python 1,041 86 Updated Sep 4, 2024

Code for the NeurIPS 2024 paper: QuaRot, end-to-end 4-bit inference of large language models.

Python 493 67 Updated Nov 26, 2024

[DEPRECATED] Moved to ROCm/rocm-libraries repo. NOTE: develop branch is maintained as a read-only mirror

C++ 525 278 Updated Mar 24, 2026

📚A curated list of Awesome LLM/VLM Inference Papers with Codes: Flash-Attention, Paged-Attention, WINT8/4, Parallelism, etc.🎉

Python 5,076 349 Updated Mar 19, 2026

A high-performance inference system for large language models, designed for production environments.

C++ 496 40 Updated Dec 19, 2025

TVM Documentation in Chinese Simplified / TVM 中文文档

TypeScript 3,585 713 Updated Mar 12, 2026

How to optimize some algorithms in CUDA.

Cuda 2,883 264 Updated Mar 24, 2026

QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference

Python 120 5 Updated Mar 6, 2024

This is a series of GPU optimization topics. Here we will introduce how to optimize the CUDA kernel in detail. I will introduce several basic kernel optimizations, including: elementwise, reduce, s…

Cuda 1,257 177 Updated Jul 29, 2023

Ring attention implementation with flash attention

Python 998 96 Updated Sep 10, 2025

Course to get into Large Language Models (LLMs) with roadmaps and Colab notebooks.

77,294 8,932 Updated Feb 5, 2026

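cube studio: an open-source, cloud-native, one-stop AI platform for machine learning, deep learning, and large models. Covers the full MLOps workflow, a compute rental platform, online notebook development, drag-and-drop pipeline orchestration, multi-node multi-GPU distributed training, hyperparameter search, inference serving with vGPU virtualization, edge computing, a labeling platform with automated annotation, SFT fine-tuning / reward model / reinforcement learning training for large models such as deepseek, multi-node large-model inference with vllm/ollama/mindie, a private knowledge base, an AI model marketplace…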

Python 4,909 867 Updated Feb 6, 2026