Lightweight, portable, flexible distributed/mobile deep learning with a dynamic, mutation-aware dataflow dependency scheduler; for Python, R, Julia, Scala, Go, JavaScript and more
TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs.
OneFlow is a deep learning framework designed to be user-friendly, scalable and efficient.
Transformer-related optimizations, including BERT and GPT
Header-only, dependency-free deep learning framework in C++14
Optimized primitives for collective multi-GPU communication
Domain-specific language designed to streamline the development of high-performance GPU/CPU/accelerator kernels
CV-CUDA™ is an open-source, GPU-accelerated library for cloud-scale image processing and computer vision.
TinySTL is a subset of the STL (omitting some containers and algorithms) and also a superset of it (adding other containers and algorithms)
[ARCHIVED] The C++ Standard Library for your entire system. See https://github.com/NVIDIA/cccl
Automatically Discovering Fast Parallelization Strategies for Distributed Deep Neural Network Training
A flexible and efficient deep neural network (DNN) compiler that generates high-performance executables from a DNN model description.
Notes about modern C++, C++11, C++14 and C++17, Boost Libraries, ABI, foreign function interface and reference cards.
cudnn_frontend provides a C++ wrapper for the cuDNN backend API, along with samples showing how to use it
Some CMake templates (examples): Qt, Boost, OpenCV, C++11, etc.
Demonstration of various hardware effects on CUDA GPUs.
Distributed LR and FM models on a Parameter Server, with FTRL and SGD optimization algorithms.
[MLSys 2021] IOS: Inter-Operator Scheduler for CNN Acceleration
Google Colab Notebooks for Udacity CS344 - Intro to Parallel Programming
Accelerating DNN Convolutional Layers with Micro-batches
Canvas: End-to-End Kernel Architecture Search in Neural Networks