1am9trash
Taiwan

Organizations

@ROCm @NTU-ECLAB

Tile-Based Runtime for Ultra-Low-Latency LLM Inference

Python 1,407 87 Updated Jun 8, 2026

LeetGPU Challenges

Python 929 99 Updated Jun 14, 2026

A quick tool box for agent initialization

Python 1 Updated Jun 16, 2026

A tool to parse PyTorch profiler trace files for kernel-level analysis.

Python 6 1 Updated Jun 8, 2026

A benchmark of real-world DL kernel problems

Python 228 24 Updated May 28, 2026

FlyDSL is the Python front‑end of the project: Flexible LaYout DSL.

Python 204 65 Updated Jun 16, 2026

FlashInfer: Kernel Library for LLM Serving

Python 5,801 1,054 Updated Jun 16, 2026

AiTer Optimized Model

Python 114 71 Updated Jun 16, 2026

🚀 Efficient implementations for emerging model architectures

Python 5,224 556 Updated Jun 11, 2026

DeepSeek Native Sparse Attention pytorch implementation

Jupyter Notebook 118 11 Updated Dec 17, 2025

Generating Efficient AI-Centric Kernels

Python 104 28 Updated Jun 16, 2026

Fast and Furious AMD Kernels

C++ 433 66 Updated Jun 13, 2026

Modular RDMA Interface

C++ 136 51 Updated Jun 16, 2026

TorchBench is a collection of open source benchmarks used to evaluate PyTorch performance.

Python 1 Updated Mar 10, 2023

Flash Attention from Scratch on CUDA Ampere

Assembly 181 29 Updated Sep 1, 2025

Utility scripts for PyTorch (e.g. Make Perfetto show some disappearing kernels, Memory profiler that understands more low-level allocations such as NCCL, ...)

Python 111 8 Updated Sep 11, 2025

AI Tensor Engine for ROCm

Python 463 353 Updated Jun 16, 2026

Flash Attention 2 CUDA implementations

Cuda 13 1 Updated Apr 29, 2025

LLM notes, including model inference, transformer model structure, and LLM framework code analysis.

Python 882 87 Updated May 10, 2026

Fast and memory-efficient exact attention

Python 24,166 2,835 Updated Jun 16, 2026

super repo for rocm systems projects

C++ 405 267 Updated Jun 16, 2026

[DEPRECATED] Moved to ROCm/rocm-libraries repo. NOTE: develop branch is maintained as a read-only mirror

C++ 537 300 Updated Jun 16, 2026

My learning notes for ML SYS.

Python 6,532 443 Updated Jun 8, 2026

CUDA Core Compute Libraries

C++ 2,383 412 Updated Jun 16, 2026

This repository hosts configuration files for HPC Toolkit, ROCprof, NVprof and ERT, and scripts to help us create roofline and instruction based roofline diagrams (performance models) for applications

C++ 4 3 Updated Jan 23, 2024

A GPU benchmark tool for evaluating GPUs and CPUs on mixed operational intensity kernels (CUDA, OpenCL, HIP, SYCL, OpenMP)

C++ 463 73 Updated May 31, 2026

SGLang is a high-performance serving framework for large language models and multimodal models.

Python 29,084 6,558 Updated Jun 16, 2026

Material for gpu-mode lectures

Jupyter Notebook 6,185 623 Updated Jun 15, 2026