StuartSul

Organizations

@HazyResearch @anysphere @xai-org


Delta-debugging minimizer for CUDA register spills.

Cuda 9 Updated Mar 21, 2026

Fast and memory-efficient exact attention

Python 24,216 2,851 Updated Jun 22, 2026

experimental torch program roofline tracing

Python 4 1 Updated Mar 4, 2026

Tile primitives for speedy kernels

Cuda 3,466 299 Updated Jun 15, 2026

Open ABI and FFI for Machine Learning Systems

C++ 418 80 Updated Jun 21, 2026

Accelerating MoE with IO and Tile-aware Optimizations

Python 719 90 Updated Jun 15, 2026

all class materials for 140e

C 75 13 Updated Apr 1, 2025

all class materials for 340lx

C 5 2 Updated Dec 5, 2025

NVIDIA NVSHMEM is a parallel programming interface for NVIDIA GPUs based on OpenSHMEM. NVSHMEM can significantly reduce multi-process communication and coordination overheads by allowing programmer…

C++ 552 90 Updated Jun 15, 2026

A collection of GPU experiments and benchmarks for my personal understanding and research.

Cuda 31 8 Updated Jun 15, 2026

A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit and 4-bit floating point (FP8 and FP4) precision on Hopper, Ada and Blackwell GPUs, to provide better performance…

Python 3,400 754 Updated Jun 23, 2026

CUDA Templates and Python DSLs for High-Performance Linear Algebra

C++ 9,939 1,919 Updated Jun 23, 2026

Implementation for FP8/INT8 Rollout for RL training without performance drop.

Python 304 23 Updated Nov 7, 2025

Jupyter Notebook 125 15 Updated Mar 18, 2026

Infiniband Verbs Performance Tests

C 989 408 Updated Jun 22, 2026

example code for using DC QP for providing RDMA READ and WRITE operations to remote GPU memory

C 157 37 Updated Jul 30, 2024

A fast communication-overlapping library for tensor/expert parallelism on GPUs.

C++ 1,331 104 Updated Aug 28, 2025

A PyTorch native platform for training generative AI models

Python 5,456 868 Updated Jun 23, 2026

C++ extensions in PyTorch

Python 1,192 249 Updated Jan 13, 2026

cs240lx stanford 2025 spring

C 17 3 Updated Jun 11, 2025

Large Context Attention

Python 773 53 Updated Oct 13, 2025

JavaScript 1 Updated May 18, 2024

A free, source-available and fair-code licensed mac app cleaner

Swift 13,700 338 Updated Jun 22, 2026

Optimized primitives for collective multi-GPU communication

C++ 4,826 1,305 Updated Jun 23, 2026

Co-Chuck: WebChucK IDE with Multi-User Collaboration and Synchronized ChucK Shreds

TypeScript 4 Updated Dec 4, 2024

An OSX print to pdf-file printer driver

Swift 1,173 95 Updated Sep 9, 2025

A frontend Framework for single-page applications on top of REST/GraphQL APIs, using TypeScript, React and Material Design

TypeScript 26,797 5,454 Updated Jun 22, 2026

This boilerplate contains terraform configurations for the rapid deployment of a Kubernetes cluster, supporting services, and the underlying infrastructure in AWS.

HCL 629 108 Updated Sep 2, 2025

Approximate Nearest Neighbors in C++/Python optimized for memory usage and loading/saving to disk

C++ 14,252 1,223 Updated Oct 29, 2025