jundaf2
  • Bytedance Company
  • Shanghai


An efficient video loader for deep learning with smart shuffling that's super easy to digest

C++ 2,384 216 Updated Jul 17, 2024

A CPU+GPU Profiling library that provides access to timeline traces and hardware performance counters.

HTML 908 217 Updated Dec 23, 2025

Tutorials for NVIDIA CUPTI samples

C++ 45 9 Updated Nov 3, 2025

AMD RAD's multi-GPU Triton-based framework for seamless multi-GPU programming

Python 140 27 Updated Dec 22, 2025

Modular RDMA Interface

C++ 67 15 Updated Dec 24, 2025

Scalene: a high-performance, high-precision CPU, GPU, and memory profiler for Python with AI-powered optimization proposals

Python 13,170 431 Updated Dec 22, 2025

zInjector is a simple tool for injecting dynamic link libraries into arbitrary processes

C++ 23 6 Updated Oct 18, 2023

Flash-Muon: An Efficient Implementation of Muon Optimizer

Python 223 13 Updated Jun 15, 2025

Distributed Compiler based on Triton for Parallel Systems

Python 1,289 114 Updated Dec 16, 2025

Efficient Deep Learning Systems course materials (HSE, YSDA)

Jupyter Notebook 927 142 Updated Nov 28, 2025
JavaScript 42 3 Updated Jan 28, 2025

SGLang is a fast serving framework for large language models and vision language models.

Python 21,934 3,856 Updated Dec 24, 2025

A Quirky Assortment of CuTe Kernels

Python 718 64 Updated Dec 23, 2025

Simple and efficient PyTorch-native transformer text generation in <1000 LOC of Python.

Python 6,169 568 Updated Aug 22, 2025

DeepEP: an efficient expert-parallel communication library

Cuda 8,829 1,036 Updated Dec 24, 2025

Lightning-Fast RL for LLM Reasoning and Agents. Made Simple & Flexible.

Python 3,271 257 Updated Dec 24, 2025

verl: Volcano Engine Reinforcement Learning for LLMs

Python 17,763 2,887 Updated Dec 24, 2025

My CS notes

Jupyter Notebook 56 4 Updated Oct 14, 2024

Composable Kernel: Performance Portable Programming Model for Machine Learning Tensor Operators

C++ 500 258 Updated Dec 24, 2025

ByteCheckpoint: A Unified Checkpointing Library for LFMs

Python 256 18 Updated Dec 8, 2025

[ICLR 2024] Efficient Streaming Language Models with Attention Sinks

Python 7,161 395 Updated Jul 11, 2024

FlagGems is an operator library for large language models implemented in the Triton Language.

Python 809 182 Updated Dec 24, 2025

AI Tensor Engine for ROCm

Python 327 164 Updated Dec 24, 2025

Best practices for training DeepSeek, Mixtral, Qwen and other MoE models using Megatron Core.

Python 142 29 Updated Dec 19, 2025

TensorRT LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. Tensor…

Python 12,463 1,977 Updated Dec 24, 2025

Scalable radix top-k selection on GPUs.

Cuda 19 2 Updated Jan 27, 2025

An acceleration library that supports arbitrary bit-width combinatorial quantization operations

C++ 238 21 Updated Sep 30, 2024

A fast communication-overlapping library for tensor/expert parallelism on GPUs.

C++ 1,212 85 Updated Aug 28, 2025

NVIDIA Resiliency Extension is a Python package for framework developers and users to implement fault-tolerant features. It improves the effective training time by minimizing the downtime due to fa…

Python 240 40 Updated Dec 23, 2025

A Distributed Attention Towards Linear Scalability for Ultra-Long Context, Heterogeneous Data Training

Python 592 32 Updated Dec 20, 2025