Stars
An efficient video loader for deep learning with smart shuffling that's super easy to digest
A CPU+GPU Profiling library that provides access to timeline traces and hardware performance counters.
AMD RAD's Triton-based framework for seamless multi-GPU programming
Scalene: a high-performance, high-precision CPU, GPU, and memory profiler for Python with AI-powered optimization proposals
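A minimal sketch of how Scalene is typically used: it is driven from the command line rather than imported, so the example below is just a toy workload to point it at (the file name profile_target.py is a placeholder).

```python
# Save as profile_target.py (placeholder name) and run under Scalene from the shell:
#   scalene profile_target.py        # line-level CPU/GPU/memory report
import numpy as np

def slow_sum(n: int) -> float:
    # Pure-Python loop: shows up as Python (vs. native) CPU time in the report
    total = 0.0
    for i in range(n):
        total += i * 0.5
    return total

def fast_sum(n: int) -> float:
    # Vectorized NumPy version for comparison in the profile
    return float(np.arange(n).sum() * 0.5)

if __name__ == "__main__":
    slow_sum(5_000_000)
    fast_sum(5_000_000)
```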
zInjector is a simple tool for injecting dynamic link libraries into arbitrary processes
Flash-Muon: An Efficient Implementation of the Muon Optimizer
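For context, a minimal plain-PyTorch sketch of the Newton-Schulz orthogonalization step at the core of Muon; it illustrates the math that Flash-Muon accelerates, not its actual kernels, and the coefficients and learning rate below are commonly cited reference values, not anything taken from this repo.

```python
import torch

def newton_schulz_orthogonalize(g: torch.Tensor, steps: int = 5) -> torch.Tensor:
    # Quintic Newton-Schulz iteration used by Muon to approximately orthogonalize
    # a 2-D (momentum) gradient before applying it as an update.
    a, b, c = 3.4445, -4.7750, 2.0315  # coefficients from the reference Muon implementation
    x = g / (g.norm() + 1e-7)          # normalize so the iteration converges
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T                        # keep the Gram matrix as small as possible
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * s @ s) @ x
    return x.T if transposed else x

# Illustrative usage on a random weight matrix (hyperparameters are placeholders)
w = torch.randn(256, 128)
grad = torch.randn_like(w)
w = w - 0.02 * newton_schulz_orthogonalize(grad)
```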
Distributed Compiler based on Triton for Parallel Systems
Efficient Deep Learning Systems course materials (HSE, YSDA)
SGLang is a fast serving framework for large language models and vision language models.
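A sketch of one common way to use SGLang: launch a server (e.g. `python -m sglang.launch_server --model-path <model> --port 30000`) and query its OpenAI-compatible endpoint. The model name and port below are illustrative assumptions.

```python
# Sketch: query a locally running SGLang server through its OpenAI-compatible API.
# Assumes a server was started separately with sglang.launch_server on port 30000.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model name
    messages=[{"role": "user", "content": "What does an LLM serving framework optimize for?"}],
    max_tokens=64,
)
print(resp.choices[0].message.content)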
Simple and efficient pytorch-native transformer text generation in <1000 LOC of python.
DeepEP: an efficient expert-parallel communication library
Lightning-Fast RL for LLM Reasoning and Agents. Made Simple & Flexible.
verl: Volcano Engine Reinforcement Learning for LLMs
Composable Kernel: Performance Portable Programming Model for Machine Learning Tensor Operators
ByteCheckpoint: A Unified Checkpointing Library for LFMs
[ICLR 2024] Efficient Streaming Language Models with Attention Sinks
FlagGems is an operator library for large language models implemented in the Triton Language.
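A minimal sketch of the intended usage pattern, assuming the `flag_gems.enable()` entry point described in the project's README: once enabled, supported aten operators are dispatched to the Triton implementations transparently.

```python
# Sketch: route eligible PyTorch operators through FlagGems' Triton kernels.
# flag_gems.enable() as the entry point is an assumption based on the README.
import torch
import flag_gems

flag_gems.enable()                      # patch supported aten ops with Triton versions
x = torch.randn(1024, 1024, device="cuda")
y = torch.nn.functional.gelu(x @ x.T)   # now runs on FlagGems kernels where supported
print(y.shape)
```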
Best practices for training DeepSeek, Mixtral, Qwen and other MoE models using Megatron Core.
TensorRT LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. Tensor…
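A sketch of the high-level Python `LLM` API that this description refers to; the class locations, parameters, and model name are assumptions based on the project's documented examples, not a verified snippet.

```python
# Sketch of TensorRT LLM's high-level Python API (names assumed from its docs).
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")      # builds/loads a TensorRT engine
params = SamplingParams(max_tokens=32, temperature=0.8)
for out in llm.generate(["The key to fast GPU inference is"], params):
    print(out.outputs[0].text)
```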
An acceleration library that supports arbitrary bit-width combinatorial quantization operations
A fast communication-overlapping library for tensor/expert parallelism on GPUs.
NVIDIA Resiliency Extension is a python package for framework developers and users to implement fault-tolerant features. It improves the effective training time by minimizing the downtime due to fa…
A distributed attention mechanism targeting linear scalability for ultra-long-context training on heterogeneous data