Zijie Yan yanring

LLM Training System at NVIDIA Megatron Core MoE

139 followers · 52 following

Achievements

Stars

27 stars written in C++

apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more

C++ · 20,831 stars · 6,753 forks · Updated Oct 25, 2023

NVIDIA / TensorRT-LLM

TensorRT LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. Tensor…

C++ · 12,051 stars · 1,843 forks · Updated Nov 6, 2025

Oneflow-Inc / oneflow

OneFlow is a deep learning framework designed to be user-friendly, scalable and efficient.

C++ · 9,368 stars · 1,010 forks · Updated Aug 20, 2025

linyacool / WebServer

A C++ High Performance Web Server

C++ · 8,130 stars · 2,138 forks · Updated Sep 27, 2023

NVIDIA / FasterTransformer

Transformer related optimization, including BERT, GPT

C++ · 6,343 stars · 920 forks · Updated Mar 27, 2024

tiny-dnn / tiny-dnn

header only, dependency-free deep learning framework in C++14

C++ · 5,995 stars · 1,398 forks · Updated Apr 17, 2022

NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

C++ · 4,211 stars · 1,063 forks · Updated Nov 6, 2025

tile-ai / tilelang

Domain-specific language designed to streamline the development of high-performance GPU/CPU/Accelerators kernels

C++ · 3,865 stars · 301 forks · Updated Nov 6, 2025

CVCUDA / CV-CUDA

CV-CUDA™ is an open-source, GPU accelerated library for cloud-scale image processing and computer vision.

C++ · 2,597 stars · 241 forks · Updated May 21, 2025

zouxiaohang / TinySTL

TinySTL is a subset of STL (some containers and algorithms are cut) and also a superset of STL (some other containers and algorithms are added)

C++ · 2,487 stars · 655 forks · Updated Oct 27, 2018

NVIDIA / libcudacxx

[ARCHIVED] The C++ Standard Library for your entire system. See https://github.com/NVIDIA/cccl

C++ · 2,309 stars · 191 forks · Updated Feb 7, 2024

flexflow / flexflow-train

Automatically Discovering Fast Parallelization Strategies for Distributed Deep Neural Network Training

C++ · 1,844 stars · 245 forks · Updated Nov 4, 2025

OpenPPL / ppl.nn

A primitive library for neural network

C++ · 1,367 stars · 222 forks · Updated Nov 24, 2024

microsoft / nnfusion

A flexible and efficient deep neural network (DNN) compiler that generates high-performance executable from a DNN model description.

C++ · 995 stars · 165 forks · Updated Sep 19, 2024

caiorss / C-Cpp-Notes

Notes about modern C++, C++11, C++14 and C++17, Boost Libraries, ABI, foreign function interface and reference cards.

C++ · 750 stars · 146 forks · Updated Feb 16, 2025

NVIDIA / cudnn-frontend

cudnn_frontend provides a C++ wrapper for the cuDNN backend API and samples showing how to use it

C++ · 638 stars · 135 forks · Updated Nov 6, 2025

district10 / cmake-templates

Some CMake templates (examples): Qt, Boost, OpenCV, C++11, etc.

C++ · 544 stars · 142 forks · Updated Dec 7, 2023

Kobzol / hardware-effects-gpu

Demonstration of various hardware effects on CUDA GPUs.

C++ · 388 stars · 30 forks · Updated Nov 22, 2023

xswang / xflow

Distributed LR and FM models on a parameter server, with FTRL and SGD optimization algorithms.

C++ · 225 stars · 83 forks · Updated Mar 14, 2018

mit-han-lab / inter-operator-scheduler

[MLSys 2021] IOS: Inter-Operator Scheduler for CNN Acceleration

C++ · 200 stars · 34 forks · Updated Apr 27, 2022

depctg / udacity-cs344-colab

Google Colab Notebooks for Udacity CS344 - Intro to Parallel Programming

C++ · 134 stars · 52 forks · Updated Apr 14, 2021

xswang / ffm_mpi

LR and FM models solved by FTRL and SGD in parallel on MPI

C++ · 111 stars · 50 forks · Updated Dec 3, 2017

abcdabcd987 / libfabric-efa-demo

C++ · 69 stars · 9 forks · Updated Jan 5, 2025

spcl / ucudnn

Accelerating DNN Convolutional Layers with Micro-batches

C++ · 63 stars · 9 forks · Updated Apr 30, 2020

tsinghua-ideal / Canvas

Canvas: End-to-End Kernel Architecture Search in Neural Networks

C++ · 26 stars · 4 forks · Updated Nov 18, 2024

kaiyux / CaffeBean

A deep learning framework in cpp

C++ · 6 stars · 1 fork · Updated Sep 17, 2020

kaiyux / CTC-decoder

A cpp reimplementation for CTC decoder

C++ · 5 stars · Updated Mar 15, 2021