Stars
An efficient video loader for deep learning with smart shuffling that's super easy to digest
A CPU+GPU Profiling library that provides access to timeline traces and hardware performance counters.
AMD RAD's Triton-based framework for seamless multi-GPU programming
Scalene: a high-performance, high-precision CPU, GPU, and memory profiler for Python with AI-powered optimization proposals
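A minimal sketch of how Scalene is typically used: it is driven from the command line rather than imported, so the example below is just a toy workload to point it at (the file name profile_target.py is a placeholder).

```python
# Save as profile_target.py (placeholder name) and run under Scalene from the shell:
#   scalene profile_target.py        # line-level CPU/GPU/memory report
import numpy as np

def slow_sum(n: int) -> float:
    # Pure-Python loop: shows up as Python (vs. native) CPU time in the report
    total = 0.0
    for i in range(n):
        total += i * 0.5
    return total

def fast_sum(n: int) -> float:
    # Vectorized NumPy version for comparison in the profile
    return float(np.arange(n).sum() * 0.5)

if __name__ == "__main__":
    slow_sum(5_000_000)
    fast_sum(5_000_000)
```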
zInjector is a simple tool for injecting dynamic link libraries into arbitrary processes
Flash-Muon: An Efficient Implementation of the Muon Optimizer
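For context, a minimal plain-PyTorch sketch of the Newton-Schulz orthogonalization step at the core of Muon; it illustrates the math that Flash-Muon accelerates, not its actual kernels, and the coefficients and learning rate below are commonly cited reference values, not anything taken from this repo.

```python
import torch

def newton_schulz_orthogonalize(g: torch.Tensor, steps: int = 5) -> torch.Tensor:
    # Quintic Newton-Schulz iteration used by Muon to approximately orthogonalize
    # a 2-D (momentum) gradient before applying it as an update.
    a, b, c = 3.4445, -4.7750, 2.0315  # coefficients from the reference Muon implementation
    x = g / (g.norm() + 1e-7)          # normalize so the iteration converges
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T                        # keep the Gram matrix as small as possible
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * s @ s) @ x
    return x.T if transposed else x

# Illustrative usage on a random weight matrix (hyperparameters are placeholders)
w = torch.randn(256, 128)
grad = torch.randn_like(w)
w = w - 0.02 * newton_schulz_orthogonalize(grad)
```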
Distributed Compiler based on Triton for Parallel Systems
Efficient Deep Learning Systems course materials (HSE, YSDA)
SGLang is a fast serving framework for large language models and vision language models.
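A sketch of one common way to use SGLang: launch a server (e.g. `python -m sglang.launch_server --model-path <model> --port 30000`) and query its OpenAI-compatible endpoint. The model name and port below are illustrative assumptions.

```python
# Sketch: query a locally running SGLang server through its OpenAI-compatible API.
# Assumes a server was started separately with sglang.launch_server on port 30000.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model name
    messages=[{"role": "user", "content": "What does an LLM serving framework optimize for?"}],
    max_tokens=64,
)
print(resp.choices[0].message.content)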
Simple and efficient pytorch-native transformer text generation in <1000 LOC of python.
DeepEP: an efficient expert-parallel communication library
Lightning-Fast RL for LLM Reasoning and Agents. Made Simple & Flexible.
verl: Volcano Engine Reinforcement Learning for LLMs
Composable Kernel: Performance Portable Programming Model for Machine Learning Tensor Operators
ByteCheckpoint: A Unified Checkpointing Library for LFMs
[ICLR 2024] Efficient Streaming Language Models with Attention Sinks
FlagGems is an operator library for large language models implemented in the Triton Language.
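A minimal sketch of the intended usage pattern, assuming the `flag_gems.enable()` entry point described in the project's README: once enabled, supported aten operators are dispatched to the Triton implementations transparently.

```python
# Sketch: route eligible PyTorch operators through FlagGems' Triton kernels.
# flag_gems.enable() as the entry point is an assumption based on the README.
import torch
import flag_gems

flag_gems.enable()                      # patch supported aten ops with Triton versions
x = torch.randn(1024, 1024, device="cuda")
y = torch.nn.functional.gelu(x @ x.T)   # now runs on FlagGems kernels where supported
print(y.shape)
```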
Best practices for training DeepSeek, Mixtral, Qwen and other MoE models using Megatron Core.
TensorRT LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. Tensor…
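A sketch of the high-level Python `LLM` API that this description refers to; the class locations, parameters, and model name are assumptions based on the project's documented examples, not a verified snippet.

```python
# Sketch of TensorRT LLM's high-level Python API (names assumed from its docs).
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")      # builds/loads a TensorRT engine
params = SamplingParams(max_tokens=32, temperature=0.8)
for out in llm.generate(["The key to fast GPU inference is"], params):
    print(out.outputs[0].text)
```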
An acceleration library that supports arbitrary bit-width combinatorial quantization operations
A fast communication-overlapping library for tensor/expert parallelism on GPUs.
NVIDIA Resiliency Extension is a python package for framework developers and users to implement fault-tolerant features. It improves the effective training time by minimizing the downtime due to fa…
A distributed attention mechanism targeting linear scalability for ultra-long-context training on heterogeneous data