Stars
AccelOpt: Self-improving Agents for AI Accelerator Kernel Optimization
A Reconfigurable Accelerator with Data Reordering Support for Low-Cost On-Chip Dataflow Switching
SymEngine is a fast symbolic manipulation library, written in C++
Reference Code Implementation of paper "Evolution of Kernels: Automated RISC-V Kernel Optimization with Large Language Models"
Simulator for LLM inference on an abstract 3D AIMC-based accelerator
[HPCA'21] SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning
A Simulation Framework for Memristive Deep Learning Systems
Memory Array Simulation Testbed for Organization, Data, Operations, and Networks
Verilog used to evaluate the FASED dot product hardware unit [IEEE CAL 2026]
🤘 TT-NN operator library and TT-Metalium low-level kernel programming model.
Tilus is a tile-level kernel programming language with explicit control over shared memory and registers.
A flexible and efficient deep neural network (DNN) compiler that generates high-performance executables from a DNN model description.
Automatic Mapping Generation, Verification, and Exploration for ISA-based Spatial Accelerators
[ICML2025] SpargeAttention: A training-free sparse attention that accelerates any model inference.
Find shape errors before you run your code!
Artifact of the MICRO'25 paper "Characterizing and Optimizing Realistic Workloads on a Commercial Compute-in-SRAM Device"
An end-to-end Transformer fusion framework that integrates DAG-based pipeline scheduling with whole-encoder and whole-decoder fusion.
A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
Official repo for the paper "An Effective Training Framework for Light-Weight Automatic Speech Recognition Models" accepted at InterSpeech 2025.
Efficient vision foundation models for high-resolution generation and perception.
[PACT'24] GraNNDis: A fast and unified distributed graph neural network (GNN) training framework for both full-batch (full-graph) and mini-batch training.