Zhejiang University
Stars
FlashMLA: Efficient Multi-head Latent Attention Kernels
CUDA Templates and Python DSLs for High-Performance Linear Algebra
High-speed Large Language Model Serving for Local Deployment
A GPU-accelerated library containing highly optimized building blocks and an execution engine for data processing to accelerate deep learning training and inference applications.
Adds AMD FSR 3 Frame Generation to games by replacing Nvidia DLSS Frame Generation (nvngx_dlssg).
Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.
A fast single-producer, single-consumer lock-free queue for C++
Lightning fast C++/CUDA neural network framework
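fastllm is a high-performance large-model inference library with no backend dependencies. It supports both tensor-parallel inference for dense models and hybrid-mode inference for MoE models; any GPU with 10 GB or more of memory can run the full DeepSeek model. A dual-socket 9004/9005 server with a single GPU can deploy the original full-size, full-precision DeepSeek model at 20 tps with a single concurrent request; the INT4-quantized model reaches 30 tps with a single concurrent request and 60+ tps with multiple concurrent requests.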
C/C++/ObjC language server supporting cross references, hierarchies, completion and semantic highlighting
A retargetable MLIR-based machine learning compiler and runtime toolkit.
LightSeq: A High Performance Library for Sequence Processing and Generation
Postmodern immutable and persistent data structures for C++ — value semantics at scale
Mirage Persistent Kernel: Compiling LLMs into a MegaKernel
Automatically Discovering Fast Parallelization Strategies for Distributed Deep Neural Network Training
GPGPU-Sim provides a detailed simulation model of contemporary NVIDIA GPUs running CUDA and/or OpenCL workloads. It includes support for features such as TensorCores and CUDA Dynamic Parallelism as…
An efficient C++20 GPU numerical computing library with Python-like syntax
Collective communications library with various primitives for multi-machine training.
Userspace eBPF runtime for Observability, Network, GPU & General Extensions Framework
A fast communication-overlapping library for tensor/expert parallelism on GPUs.
High-Performance Rendering Framework on Stream Architectures
VUDA is a header-only library based on Vulkan that provides a CUDA Runtime API interface for writing GPU-accelerated applications.
Collection of benchmarks to measure basic GPU capabilities
MSCCL++: A GPU-driven communication stack for scalable AI applications
C++ library for reading and writing of numpy's .npy files