-
flex-block-attn Public
Forked from Tencent-Hunyuan/flex-block-attn
flex-block-attn: an efficient block sparse attention computation library
Jupyter Notebook Other Updated Nov 20, 2025 -
tilelang-dsa Public
Forked from lemyx/tilelang-dsa
DeepSeek-V3.2-Exp DSA Warmup Lightning Indexer training operator based on tilelang
Python Other Updated Nov 19, 2025 -
NexVenusCL Public
Forked from nex-agi/NexVenusCL
Nex Venus Communication Library
C++ Apache License 2.0 Updated Nov 17, 2025 -
flash-moba Public
Forked from mit-han-lab/flash-moba -
nanotrace Public
Forked from aikitoria/nanotrace
Low overhead tracing library and trace visualizer for pipelined CUDA kernels
C MIT License Updated Nov 9, 2025 -
pplx-garden Public
Forked from perplexityai/pplx-garden
Perplexity open source garden for inference technology
Rust MIT License Updated Nov 5, 2025 -
flashpack Public
Forked from fal-ai/flashpack
High-throughput tensor loading for PyTorch
Python MIT License Updated Oct 27, 2025 -
Penny Public
Forked from SzymonOzog/Penny
Hand-Rolled GPU communications library
Cuda Updated Oct 23, 2025 -
hoti-2025-gpu-comms-tutorial Public
Forked from NVIDIA/hoti-2025-gpu-comms-tutorial
Tutorial Exercises and Code for GPU Communications Tutorial at HOT Interconnects 2025
C++ Other Updated Oct 22, 2025 -
ai-performance-engineering Public
Forked from cfregly/ai-performance-engineering
Python Apache License 2.0 Updated Oct 22, 2025 -
DeepSeek-OCR Public
Forked from deepseek-ai/DeepSeek-OCR
Contexts Optical Compression
Python MIT License Updated Oct 20, 2025 -
reasoning-from-scratch Public
Forked from rasbt/reasoning-from-scratch
Implement a reasoning LLM in PyTorch from scratch, step by step
Jupyter Notebook Apache License 2.0 Updated Oct 10, 2025 -
i_am_dsp Public
Forked from IAMMRGODIE/i_am_dsp
A simple DSP crate
Rust Mozilla Public License 2.0 Updated Oct 6, 2025 -
gpu-experiments Public
Forked from StuartSul/gpu-experiments
A collection of GPU tests and benchmarks for my own research.
Cuda Updated Oct 5, 2025 -
gpunetio Public
Forked from NVIDIA-DOCA/gpunetio
Open source version of DOCA GPUNetIO and DOCA Verbs libraries (limited features) to enable GDAKI technology on RDMA (IB and RoCE)
C++ Other Updated Oct 2, 2025 -
FlashMoE Public
Forked from osayamenja/FlashMoE
Distributed MoE in a Single Kernel [NeurIPS '25]
-
DLSlime Public
Forked from DeepLink-org/DLSlime
DLSlime: Flexible & Efficient Heterogeneous Transfer Toolkit
C++ BSD 3-Clause "New" or "Revised" License Updated Sep 18, 2025 -
checkpoint-engine Public
Forked from MoonshotAI/checkpoint-engine
Checkpoint-engine is a simple middleware to update model weights in LLM inference engines
Python MIT License Updated Sep 10, 2025 -
batch_invariant_ops Public
Forked from thinking-machines-lab/batch_invariant_ops
Python MIT License Updated Sep 10, 2025 -
NVSHMEM Public
Forked from NVIDIA/nvshmem
NVIDIA NVSHMEM is a parallel programming interface for NVIDIA GPUs based on OpenSHMEM. NVSHMEM can significantly reduce multi-process communication and coordination overheads by allowing programmer…
C++ Other Updated Sep 6, 2025 -
flash_attention_from_scratch Public
Forked from sonnyli/flash_attention_from_scratch
Flash Attention from Scratch on CUDA Ampere
-
uccl Public
Forked from uccl-project/uccl
Ultra and Unified CCL
C++ Apache License 2.0 Updated Aug 15, 2025 -
VeOmni Public
Forked from ByteDance-Seed/VeOmni
VeOmni: Scaling any Modality Model Training to any Accelerators with PyTorch native Training Framework
Python Apache License 2.0 Updated Aug 12, 2025