Starred repositories
A fast communication-overlapping library for tensor/expert parallelism on GPUs.
A flexible and efficient deep neural network (DNN) compiler that generates a high-performance executable from a DNN model description.
UCCL is an efficient communication library for GPUs, covering collectives, P2P (e.g., KV cache transfer, RL weight transfer), and EP (e.g., GPU-driven)
Representation and Reference Lowering of ONNX Models in MLIR Compiler Infrastructure
[代码随想录 知识星球] Project share: a Raft-based key-value storage database 🔥
C++11/14/17 std::optional with functional-style extensions and reference support
asyncio is a C++20 library for writing concurrent code using the async/await syntax.
A common bricks library for building scalable and portable distributed machine learning.
Parameter server framework for distributed machine learning
C++ implementation of a fast hash map and hash set using hopscotch hashing
The Tensor Algebra SuperOptimizer for Deep Learning
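Fast and memory-efficient C++ flat hash table/map/set
Easy-Reactor is a high-performance TCP server framework for Linux in C++, based on the Reactor pattern; it supports single-threaded and multi-threaded Reactors, as well as UDP services.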
A high-performance inference system for large language models, designed for production environments.
Optimized BERT transformer inference on NVIDIA GPUs. https://arxiv.org/abs/2210.03052
Antares: an automatic engine for multi-platform kernel generation and optimization. Supporting CPU, CUDA, ROCm, DirectX12, GraphCore, SYCL for CPU/GPU, OpenCL for AMD/NVIDIA, Android CPU/GPU backends.
Collection of benchmarks to measure basic GPU capabilities
A high-performance, extensible Python AOT compiler.
MSCCL++: A GPU-driven communication stack for scalable AI applications
Demonstration of various hardware effects on CUDA GPUs.
Code Examples from "C++ Software Design: Design Principles and Patterns for High-Quality Software" (ISBN: 1098113160)
Conversion to/from half-precision floating point formats
Data Processing benchmark featuring Rust, Go, Swift, Zig, Julia etc.
KV cache store for distributed LLM inference
FlashFFTConv: Efficient Convolutions for Long Sequences with Tensor Cores