FlashInfer: Kernel Library for LLM Serving
Examples of how to optimize various algorithms in CUDA.
Sample code for my CUDA programming book
[MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving
The CUDA version of the RWKV language model ( https://github.com/BlinkDL/RWKV-LM )
REEF is a GPU-accelerated DNN inference serving system that enables instant kernel preemption and biased concurrent execution in GPU scheduling.
Artifact for OSDI'23: MGG: Accelerating Graph Neural Networks with Fine-grained intra-kernel Communication-Computation Pipelining on Multi-GPU Platforms.
Aaron: Compile-time Kernel Adaptation for Multi-DNN Inference Acceleration on Edge GPU [SenSys'22 Best Poster]
A software technique using Persistent Threads and SM partitioning to improve GPU resource utilization (see the sketch after this list).
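
For context on the last entry: a minimal sketch of the general persistent-threads pattern, assuming the common formulation (this is an illustration of the technique, not code from the repo above). Instead of launching one block per work item, the kernel launches only enough blocks to occupy the GPU once; each block then loops, claiming tiles from a global work counter until the queue is drained.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Persistent-threads sketch: each resident block repeatedly claims a tile
// index from a global counter and processes it, instead of the usual
// one-block-per-tile launch.
__global__ void persistent_scale(const float* in, float* out,
                                 int num_tiles, int tile_size,
                                 int* next_tile) {
    __shared__ int tile;  // tile currently claimed by this block
    while (true) {
        if (threadIdx.x == 0) {
            tile = atomicAdd(next_tile, 1);  // claim the next tile
        }
        __syncthreads();
        if (tile >= num_tiles) {
            return;  // work queue drained; the persistent block retires
        }
        // Process the claimed tile (a trivial elementwise scale here).
        for (int i = threadIdx.x; i < tile_size; i += blockDim.x) {
            int idx = tile * tile_size + i;
            out[idx] = in[idx] * 2.0f;
        }
        __syncthreads();  // all threads finish before the next claim
    }
}

int main() {
    const int num_tiles = 1024, tile_size = 256;
    const int n = num_tiles * tile_size;

    float *in, *out;
    int *next_tile;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    cudaMalloc(&next_tile, sizeof(int));
    cudaMemset(in, 0, n * sizeof(float));
    cudaMemset(next_tile, 0, sizeof(int));

    // "Persistent" launch: block count sized from the device's SM count
    // rather than the problem size (2 blocks per SM is a rough heuristic).
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    int blocks = prop.multiProcessorCount * 2;

    persistent_scale<<<blocks, 128>>>(in, out, num_tiles, tile_size, next_tile);
    cudaDeviceSynchronize();
    printf("done: %s\n", cudaGetErrorString(cudaGetLastError()));

    cudaFree(in); cudaFree(out); cudaFree(next_tile);
    return 0;
}
```

Because the blocks stay resident for the whole computation, work distribution is decoupled from the launch configuration, which is what lets such schemes partition SMs among workloads and keep the GPU busy under irregular load.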