Stars
Fast and memory-efficient exact attention
A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal AI, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
Efficient Triton Kernels for LLM Training
Efficient vision foundation models for high-resolution generation and perception.
ByteDance's PyTorch Distributed framework for hyperscale training of LLMs and RL workloads
A Python-embedded DSL that makes it easy to write fast, scalable ML kernels with minimal boilerplate.
AMD Ryzen™ AI Software includes the tools and runtime libraries for optimizing and deploying AI inference on AMD Ryzen™ AI-powered PCs.
Tilus is a tile-level kernel programming language with explicit control over shared memory and registers.
Repository hosting and maintaining the code for SCALE-Sim, a systolic-array accelerator simulator
VisionLLaMA: A Unified LLaMA Backbone for Vision Tasks
Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance.
[ICML 2025] XAttention: Block Sparse Attention with Antidiagonal Scoring
A Simulation Framework for Memristive Deep Learning Systems
SparseTIR: Sparse Tensor Compiler for Deep Learning
Training neural networks in TensorFlow 2.0 with 5x less memory
Artifact for paper "PIM is All You Need: A CXL-Enabled GPU-Free System for LLM Inference", ASPLOS 2025
Automatic Mapping Generation, Verification, and Exploration for ISA-based Spatial Accelerators
Benchmark harness and baseline results for the NeuroBench algorithm track.
H2-LLM: Hardware-Dataflow Co-Exploration for Heterogeneous Hybrid-Bonding-based Low-Batch LLM Inference
PyTorch implementation of the sparse attention from the paper "Generating Long Sequences with Sparse Transformers" (a minimal mask sketch follows this list)
A scheduler for spatial DNN accelerators that generates high-performance schedules in one shot using mixed integer programming (MIP); see the toy MIP sketch after this list
[ASPLOS 2019] PUMA-simulator provides a detailed simulation model of a dataflow architecture built with NVM (non-volatile memory) and runs ML models compiled using the PUMA compiler.
Artifact material for [HPCA 2025] #2108 "UniNDP: A Unified Compilation and Simulation Tool for Near DRAM Processing Architectures"
[ASPLOS 2024] CIM-MLC: A Multi-level Compilation Stack for Computing-In-Memory Accelerators
The UPMEM LLM Framework allows profiling PyTorch layers and functions and simulating them with a given hardware profile.
Next-Generation AI-Assisted Kernel Engineering for Multi-Chip Systems
AccelOpt: Self-improving Agents for AI Accelerator Kernel Optimization
Simulator for LLM inference on an abstract 3D AIMC-based accelerator
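For the sparse-attention entry above, here is a minimal, self-contained PyTorch sketch of the "strided" pattern from Child et al., "Generating Long Sequences with Sparse Transformers". It only illustrates the mask semantics, not the repository's code; the stride value and tensor shapes are assumptions, and a dense mask like this gives none of the paper's memory savings, which come from block-sparse kernels that skip the masked entries entirely.

```python
import torch
import torch.nn.functional as F

def strided_sparse_mask(seq_len: int, stride: int) -> torch.Tensor:
    # Position i may attend to the previous `stride` positions (local band)
    # and to any earlier position j with (i - j) % stride == 0 (strided band).
    i = torch.arange(seq_len).unsqueeze(1)  # query indices, column vector
    j = torch.arange(seq_len).unsqueeze(0)  # key indices, row vector
    causal = j <= i
    local = (i - j) < stride
    strided = (i - j) % stride == 0
    return causal & (local | strided)

def sparse_attention(q, k, v, stride=8):
    # q, k, v: (batch, heads, seq_len, head_dim)
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    mask = strided_sparse_mask(q.shape[-2], stride).to(q.device)
    scores = scores.masked_fill(~mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

# Toy usage with assumed shapes: batch 1, 4 heads, 64 tokens, head_dim 32.
q, k, v = (torch.randn(1, 4, 64, 32) for _ in range(3))
out = sparse_attention(q, k, v, stride=8)  # -> shape (1, 4, 64, 32)
```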
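And for the MIP scheduler entry, a toy one-shot formulation in the same spirit, written with PuLP: binary variables assign each layer to one processing element, and a single solve minimizes the makespan. The layer names, PE count, and cycle costs are all hypothetical; the actual repository explores a far richer mapping space than this assignment problem.

```python
from pulp import LpProblem, LpMinimize, LpVariable, lpSum, LpBinary, PULP_CBC_CMD

layers = ["conv1", "conv2", "conv3", "fc"]  # hypothetical workload
pes = ["pe0", "pe1"]                        # hypothetical spatial PEs
cost = {                                    # assumed cycle cost per (layer, PE)
    ("conv1", "pe0"): 40, ("conv1", "pe1"): 55,
    ("conv2", "pe0"): 60, ("conv2", "pe1"): 45,
    ("conv3", "pe0"): 30, ("conv3", "pe1"): 35,
    ("fc", "pe0"): 20,    ("fc", "pe1"): 15,
}

prob = LpProblem("one_shot_schedule", LpMinimize)
x = {(l, p): LpVariable(f"x_{l}_{p}", cat=LpBinary) for l in layers for p in pes}
makespan = LpVariable("makespan", lowBound=0)
prob += makespan                            # objective: minimize the makespan
for l in layers:                            # each layer runs on exactly one PE
    prob += lpSum(x[l, p] for p in pes) == 1
for p in pes:                               # each PE's total load bounds the makespan
    prob += lpSum(cost[l, p] * x[l, p] for l in layers) <= makespan

prob.solve(PULP_CBC_CMD(msg=False))
schedule = {l: p for (l, p), var in x.items() if var.value() == 1}
print(schedule, makespan.value())           # e.g. one PE per layer, minimal peak load
```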