Stars
A modern, C++-native, test framework for unit-tests, TDD and BDD - using C++14, C++17 and later (C++11 support is in v2.x branch, and C++03 on the Catch1.x branch)
TensorRT LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. Tensor…
CUDA Templates and Python DSLs for High-Performance Linear Algebra
lightweight, standalone C++ inference engine for Google's Gemma models.
Performance-portable, length-agnostic SIMD with runtime dispatch
Domain-specific language designed to streamline the development of high-performance GPU/CPU/Accelerators kernels
The official repository for the gem5 computer-system architecture simulator.
Widelands is a free, open source real-time strategy game with singleplayer campaigns and a multiplayer mode. The game was inspired by Settlers II™ (© Bluebyte) but has significantly more variety an…
Mirage Persistent Kernel: Compiling LLMs into a MegaKernel
TinyChatEngine: On-Device LLM Inference Library
[MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention
NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs
SystemVerilog 2017 Pre-processor, Parser, Elaborator, UHDM Compiler. Provides IEEE Design/TB C/C++ VPI and Python AST & UHDM APIs. Compiles on Linux gcc, Windows msys2-gcc & msvc, OsX
Universal Hardware Data Model. A complete modeling of the IEEE SystemVerilog Object Model with VPI Interface, Elaborator, Serialization, Visitor and Listener. Used as a compiled interchange format …
Xplace 3.0: An Extremely Fast, Extensible and Deterministic Placement Framework with Detailed-Routability and Timing Optimization
An FPGA accelerator for general-purpose Sparse-Matrix Dense-Matrix Multiplication (SpMM).