View ardacoskunses's full-sized avatar


DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling

Cuda 5,977 778 Updated Dec 8, 2025

The Modular Platform (includes MAX & Mojo)

Mojo 25,366 2,743 Updated Dec 18, 2025

Xiao's CUDA Optimization Guide [NO LONGER ADDING NEW CONTENT]

321 22 Updated Nov 8, 2022

Google Drive CLI Client

Go 8,980 1,171 Updated Apr 19, 2023

MathBox-based conference talks

JavaScript 341 40 Updated May 12, 2016

TLB Benchmarks

Cuda 35 10 Updated Sep 11, 2017
Jupyter Notebook 91 6 Updated Feb 29, 2024

Fastest kernels written from scratch

Cuda 499 62 Updated Sep 18, 2025

Solve puzzles. Learn CUDA.

Jupyter Notebook 11,838 908 Updated Sep 1, 2024

A 120-day CUDA learning plan covering daily concepts, exercises, pitfalls, and references (including “Programming Massively Parallel Processors”). Features six capstone projects to solidify GPU par…

Shell 821 94 Updated Mar 29, 2025

Algorithms implemented in CUDA + resources about GPGPU

Cuda 62 15 Updated Jan 18, 2022

CUDA Templates and Python DSLs for High-Performance Linear Algebra

C++ 8,985 1,587 Updated Dec 19, 2025

This is a series of GPU optimization topics. Here we will introduce how to optimize the CUDA kernel in detail. I will introduce several basic kernel optimizations, including: elementwise, reduce, s…

Cuda 1,207 177 Updated Jul 29, 2023

How to optimize some algorithms in CUDA.

Cuda 2,696 244 Updated Dec 6, 2025

📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA Kernels, Tensor Cores, HGEMM, FA-2 MMA.🎉

Cuda 8,979 877 Updated Dec 4, 2025

Step by step implementation of a fast softmax kernel in CUDA

Cuda 59 6 Updated Jan 6, 2025
Python 86 8 Updated Nov 11, 2025

Fast CUDA matrix multiplication from scratch

Cuda 980 148 Updated Sep 2, 2025

LeetGPU Challenges

Python 544 42 Updated Dec 11, 2025

Fast and memory-efficient exact attention

Python 21,186 2,230 Updated Dec 18, 2025

A collection of free miscellaneous Windows tools

C# 1 Updated Mar 29, 2023

View ETW Provider manifest

C# 556 77 Updated Nov 1, 2024

Event Tracing For Windows (ETW) Resources

Python 412 78 Updated Oct 30, 2025

Frame profiler

C++ 14,842 971 Updated Dec 18, 2025

KrabsETW provides a modern C++ wrapper and a .NET wrapper around the low-level ETW trace consumption functions.

C++ 734 159 Updated Dec 15, 2025

A library containing utilities for mapping higher-level graphics work to D3D12

C++ 349 51 Updated May 21, 2025

The OpenCL-on-D3D12 mapping layer

C++ 122 17 Updated Aug 26, 2025

AMD Research Instruction Based Sampling Toolkit

C 94 17 Updated Apr 29, 2021

Read and modify memory timings on the fly

C++ 4 Updated Mar 25, 2019

Read and modify memory timings on the fly

C++ 310 79 Updated Mar 7, 2023