cuJSON Public
Forked from AutomataLab/cuJSON
cuJSON: A Highly Parallel JSON Parser for GPUs
C++ MIT License Updated Dec 12, 2025

AMD RAD's multi-GPU Triton-based framework for seamless multi-GPU programming
Python MIT License Updated Dec 10, 2025

TileGym Public
Forked from NVIDIA/TileGym
Helpful kernel tutorials and examples for tile-based GPU programming
Python Other Updated Dec 5, 2025

cutile-python Public
Forked from NVIDIA/cutile-python
cuTile is a programming model for writing parallel kernels for NVIDIA GPUs
Python Other Updated Dec 4, 2025

fouroversix Public
Forked from mit-han-lab/fouroversix
Code for the paper “Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling”
Python MIT License Updated Dec 2, 2025

nsight-python Public
Forked from NVIDIA/nsight-python
Nsight Python is a Python kernel profiling interface based on NVIDIA Nsight Tools
Python Apache License 2.0 Updated Nov 27, 2025

asystem-amem Public
Forked from inclusionAI/asystem-amem
An NCCL extension library designed to efficiently offload GPU memory allocated by the NCCL communication library.
C++ Apache License 2.0 Updated Nov 27, 2025

flex-block-attn Public
Forked from Tencent-Hunyuan/flex-block-attn
flex-block-attn: an efficient block sparse attention computation library
Jupyter Notebook Other Updated Nov 20, 2025

tilelang-dsa Public
Forked from lemyx/tilelang-dsa
DeepSeek-V3.2-Exp DSA Warmup Lightning Indexer training operator based on tilelang
Python Other Updated Nov 19, 2025

NexVenusCL Public
Forked from nex-agi/NexVenusCL
Nex Venus Communication Library
C++ Apache License 2.0 Updated Nov 17, 2025

flash-moba Public
Forked from mit-han-lab/flash-moba

nanotrace Public
Forked from aikitoria/nanotrace
Low overhead tracing library and trace visualizer for pipelined CUDA kernels
C MIT License Updated Nov 9, 2025

pplx-garden Public
Forked from perplexityai/pplx-garden
Perplexity open source garden for inference technology
Rust MIT License Updated Nov 5, 2025

flashpack Public
Forked from fal-ai/flashpack
High-throughput tensor loading for PyTorch
Python MIT License Updated Oct 27, 2025

Penny Public
Forked from SzymonOzog/Penny
Hand-Rolled GPU communications library
Cuda Updated Oct 23, 2025

hoti-2025-gpu-comms-tutorial Public
Forked from NVIDIA/hoti-2025-gpu-comms-tutorial
Tutorial Exercises and Code for GPU Communications Tutorial at HOT Interconnects 2025
C++ Other Updated Oct 22, 2025

ai-performance-engineering Public
Forked from cfregly/ai-performance-engineering
Python Apache License 2.0 Updated Oct 22, 2025

DeepSeek-OCR Public
Forked from deepseek-ai/DeepSeek-OCR
Contexts Optical Compression
Python MIT License Updated Oct 20, 2025

reasoning-from-scratch Public
Forked from rasbt/reasoning-from-scratch
Implement a reasoning LLM in PyTorch from scratch, step by step
Jupyter Notebook Apache License 2.0 Updated Oct 10, 2025

i_am_dsp Public
Forked from IAMMRGODIE/i_am_dsp
A simple DSP crate
Rust Mozilla Public License 2.0 Updated Oct 6, 2025

gpu-experiments Public
Forked from StuartSul/gpu-experiments
A collection of GPU tests and benchmarks for my own research.
Cuda Updated Oct 5, 2025

gpunetio Public
Forked from NVIDIA-DOCA/gpunetio
Open source version of DOCA GPUNetIO and DOCA Verbs libraries (limited features) to enable GDAKI technology on RDMA (IB and RoCE)
C++ Other Updated Oct 2, 2025

FlashMoE Public
Forked from osayamenja/FlashMoE
Distributed MoE in a Single Kernel [NeurIPS '25]

DLSlime Public
Forked from DeepLink-org/DLSlime
DLSlime: Flexible & Efficient Heterogeneous Transfer Toolkit
C++ BSD 3-Clause "New" or "Revised" License Updated Sep 18, 2025