Stars
Flash Attention from Scratch on CUDA Ampere
[ICLR2025] Codebase for "ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing", built on Megatron-LM.
Domain-specific language designed to streamline the development of high-performance GPU/CPU/accelerator kernels
This repository is responsible for the LLVM-related parts of Jeandle.
Jeandle is a Just-in-Time compiler for Java. It is built on OpenJDK and leverages the LLVM compiler infrastructure to generate machine code, aiming to provide powerful compilation optimizations and…
A streamlined and customizable framework for efficient large model (LLM, VLM, AIGC) evaluation and performance benchmarking.
Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
Genai-bench is a powerful benchmark tool designed for comprehensive token-level performance evaluation of large language model (LLM) serving systems.
Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.
LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
How to optimize some algorithms in CUDA.
A high-throughput and memory-efficient inference and serving engine for LLMs
From a nobody to a large language model (LLM) hero. Stay tuned for updates!
SGLang is a fast serving framework for large language models and vision language models.
FlashInfer: Kernel Library for LLM Serving
My learning notes for ML SYS.
Optimized primitives for collective multi-GPU communication
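PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (PaddlePaddle core framework: high-performance single-machine and distributed training and cross-platform deployment for deep learning and machine learning)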
brpc is an industrial-grade RPC framework written in C++, often used in high-performance systems such as search, storage, machine learning, advertisement, and recommendation. "brpc" mea…
📝 A simple and elegant markdown editor, available for Linux, macOS and Windows.
A Vector Database Tutorial (over CMU-DB's BusTub system)
The official home of the Presto distributed SQL query engine for big data
A lightweight RPC implementation of the Google protobuf RPC framework.
An industrial-grade C++ implementation of the Raft consensus algorithm based on brpc, widely used inside Baidu to build highly available distributed systems.
🚧 Build a SQL optimizer in 1000 lines of Rust using egg.