Ammar-Alnagar
Deciphering the GPU manuscript.....

A 120-day CUDA learning plan covering daily concepts, exercises, pitfalls, and references (including “Programming Massively Parallel Processors”). Features six capstone projects to solidify GPU par…

Shell 865 100 Updated Mar 29, 2025

NVIDIA curated collection of educational resources related to general purpose GPU programming.

Jupyter Notebook 1,166 209 Updated Feb 9, 2026
Rust 95 17 Updated Feb 9, 2026

Solve puzzles. Learn CUDA.

Jupyter Notebook 11,940 927 Updated Sep 1, 2024

CUDA Templates and Python DSLs for High-Performance Linear Algebra

C++ 9,247 1,670 Updated Feb 4, 2026

Material for gpu-mode lectures

Jupyter Notebook 5,709 571 Updated Feb 1, 2026

Godot Engine – Multi-platform 2D and 3D game engine

C++ 106,504 24,294 Updated Feb 9, 2026

📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA Kernels, Tensor Cores, HGEMM, FA-2 MMA.🎉

Cuda 9,636 952 Updated Feb 5, 2026

Samples for CUDA developers that demonstrate features in the CUDA Toolkit

C 8,826 2,265 Updated Jan 6, 2026

OpenVEX Specification

167 20 Updated Jan 16, 2026

A curated collection of resources, tutorials, and best practices for learning and mastering NVIDIA CUTLASS

251 12 Updated May 6, 2025

High-performance C++ tensor library with NumPy/PyTorch-like API, SIMD vectorization, BLAS acceleration, and Metal GPU support.

C++ 38 1 Updated Feb 10, 2026

TensorRT LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. Tensor…

Python 12,849 2,095 Updated Feb 10, 2026

DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling

Cuda 6,169 817 Updated Feb 3, 2026

cuTile is a programming model for writing parallel kernels for NVIDIA GPUs

Python 1,916 112 Updated Feb 3, 2026

CUDA Tile IR is an MLIR-based intermediate representation and compiler infrastructure for CUDA kernel optimization, focusing on tile-based computation patterns and optimizations targeting NVIDIA te…

MLIR 827 60 Updated Jan 14, 2026

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.

C++ 12,680 2,317 Updated Feb 9, 2026

Optimized primitives for collective multi-GPU communication

C++ 4,443 1,131 Updated Feb 3, 2026

CUDA Core Compute Libraries

C++ 2,163 336 Updated Feb 9, 2026

Claude Code for CUDA. Free AI assistant that actually understands GPU architecture

Python 68 13 Updated Oct 10, 2025

A tool for bandwidth measurements on NVIDIA GPUs.

C++ 620 69 Updated Apr 15, 2025

A book for competitive programming exams

Typst 29 1 Updated Feb 5, 2026

MLIR For Beginners tutorial

C++ 1,221 116 Updated Jul 18, 2025

Hands-On Practical MLIR Tutorial

C++ 718 106 Updated Oct 20, 2023

The Torch-MLIR project aims to provide first class support from the PyTorch ecosystem to the MLIR ecosystem.

C++ 1,743 649 Updated Feb 9, 2026
C++ 18 1 Updated May 14, 2024

SGLang is a high-performance serving framework for large language models and multimodal models.

Python 23,463 4,378 Updated Feb 10, 2026

NCCL Tests

Cuda 1,427 347 Updated Feb 9, 2026

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.

Python 41,590 4,710 Updated Feb 10, 2026