1am9trash

Thomas Wang 1am9trash

It's my life. It's now or never.

41 followers · 35 following

Taiwan

Stars

zufayu / ai-coding-deepdive-public

Forked from Jasen2201/ai-coding-deepdive-public

1 Updated Mar 4, 2026

tile-ai / TileRT

Tile-Based Runtime for Ultra-Low-Latency LLM Inference

Python 1,407 87 Updated Jun 8, 2026

AlphaGPU / leetgpu-challenges

LeetGPU Challenges

Python 929 99 Updated Jun 14, 2026

yichiche / agent-box

A quick tool box for agent initialization

Python 1 Updated Jun 16, 2026

yichiche / torch-profiler-parser

A tool to parse PyTorch profiler trace files for kernel-level analysis.

Python 6 1 Updated Jun 8, 2026

NVIDIA / SOL-ExecBench

A benchmark of real-world DL kernel problems

Python 228 24 Updated May 28, 2026

ROCm / FlyDSL

FlyDSL is the Python front‑end of the project: Flexible LaYout DSL.

Python 204 65 Updated Jun 16, 2026

flashinfer-ai / flashinfer

FlashInfer: Kernel Library for LLM Serving

Python 5,801 1,054 Updated Jun 16, 2026

ROCm / ATOM

AiTer Optimized Model

Python 114 71 Updated Jun 16, 2026

fla-org / flash-linear-attention

🚀 Efficient implementations for emerging model architectures

Python 5,224 556 Updated Jun 11, 2026

dhcode-cpp / NSA-pytorch

DeepSeek Native Sparse Attention pytorch implementation

Jupyter Notebook 118 11 Updated Dec 17, 2025

AMD-AGI / GEAK

Generating Efficient AI-Centric Kernels

Python 104 28 Updated Jun 16, 2026

HazyResearch / HipKittens

Fast and Furious AMD Kernels

C++ 433 66 Updated Jun 13, 2026

ROCm / mori

Modular RDMA Interface

C++ 136 51 Updated Jun 16, 2026

sogalin / benchmark

Forked from pytorch/benchmark

TorchBench is a collection of open source benchmarks used to evaluate PyTorch performance.

Python 1 Updated Mar 10, 2023

sonnyli / flash_attention_from_scratch

Flash Attention from Scratch on CUDA Ampere

Assembly 181 29 Updated Sep 1, 2025

fzyzcjy / torch_utils

Utility scripts for PyTorch (e.g. Make Perfetto show some disappearing kernels, Memory profiler that understands more low-level allocations such as NCCL, ...)

Python 111 8 Updated Sep 11, 2025

ROCm / aiter

AI Tensor Engine for ROCm

Python 463 353 Updated Jun 16, 2026

rishisankar / flashattention2

Flash Attention 2 CUDA implementations

Cuda 13 1 Updated Apr 29, 2025

harleyszhang / llm_note

LLM notes, including model inference, transformer model structure, and llm framework code analysis notes.

Python 882 87 Updated May 10, 2026

Dao-AILab / flash-attention

Fast and memory-efficient exact attention

Python 24,166 2,835 Updated Jun 16, 2026

ROCm / rocm-systems

super repo for rocm systems projects

C++ 405 267 Updated Jun 16, 2026

ROCm / composable_kernel

[DEPRECATED] Moved to ROCm/rocm-libraries repo. NOTE: develop branch is maintained as a read-only mirror

C++ 537 300 Updated Jun 16, 2026

zhaochenyang20 / Awesome-ML-SYS-Tutorial

My learning notes for ML SYS.

Python 6,532 443 Updated Jun 8, 2026

NVIDIA / cccl

CUDA Core Compute Libraries

C++ 2,383 412 Updated Jun 16, 2026

PawseySC / performance-modelling-tools

This repository hosts configuration files for HPC Toolkit, ROCprof, NVprof and ERT, and scripts to help us create roofline and instruction based roofline diagrams (performance models) for applications

C++ 4 3 Updated Jan 23, 2024

ekondis / mixbench

A GPU benchmark tool for evaluating GPUs and CPUs on mixed operational intensity kernels (CUDA, OpenCL, HIP, SYCL, OpenMP)

C++ 463 73 Updated May 31, 2026

sgl-project / sglang

SGLang is a high-performance serving framework for large language models and multimodal models.

Python 29,084 6,558 Updated Jun 16, 2026

gpu-mode / lectures

Material for gpu-mode lectures

Jupyter Notebook 6,185 623 Updated Jun 15, 2026
