Fast inference engine for Transformer models

C++ 4,199 434 Updated Dec 22, 2025

Fast and memory-efficient exact attention

Python 21,237 2,241 Updated Dec 22, 2025

Original reference implementation of "3D Gaussian Splatting for Real-Time Radiance Field Rendering"

Python 20,004 2,830 Updated Oct 17, 2025

GPTQ inference TVM kernel

Cuda 41 1 Updated Apr 25, 2024

HierarchicalKV is a part of NVIDIA Merlin and provides hierarchical key-value storage to meet RecSys requirements. The key capability of HierarchicalKV is to store key-value feature-embeddings on h…

Cuda 185 31 Updated Nov 2, 2025

Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity

Cuda 230 22 Updated Sep 24, 2023

DeepEP: an efficient expert-parallel communication library

Cuda 8,824 1,035 Updated Dec 5, 2025

Material for gpu-mode lectures

Jupyter Notebook 5,443 552 Updated Dec 8, 2025

Some C++ codes for computing a 1D and 2D convolution product using the FFT implemented with the GSL or FFTW

C 60 17 Updated May 9, 2013

FlashFFTConv: Efficient Convolutions for Long Sequences with Tensor Cores

C++ 337 33 Updated Dec 28, 2024

Implementation of 1D, 2D, and 3D FFT convolutions in PyTorch. Much faster than direct convolutions for large kernel sizes.

Python 514 62 Updated Sep 28, 2023

Solve puzzles. Learn CUDA.

Jupyter Notebook 11,841 909 Updated Sep 1, 2024

Cross-platform text editor, written in Free Pascal

Python 2,887 188 Updated Dec 20, 2025

Lightning fast C++/CUDA neural network framework

C++ 4,359 536 Updated Dec 14, 2025

Instant neural graphics primitives: lightning fast NeRF and more

Cuda 17,151 2,039 Updated Dec 14, 2025

CUDA accelerated rasterization of gaussian splatting

Cuda 4,180 641 Updated Nov 18, 2025

[ICLR2025, ICML2025, NeurIPS2025 Spotlight] Quantized Attention achieves speedup of 2-5x compared to FlashAttention, without losing end-to-end metrics across language, image, and video models.

Cuda 2,887 291 Updated Dec 22, 2025

Code from the "CUDA Crash Course" YouTube series by CoffeeBeforeArch

Cuda 907 177 Updated Jul 19, 2023

cuGraph - RAPIDS Graph Analytics Library

Cuda 2,092 342 Updated Dec 19, 2025

Sample codes for my CUDA programming book

Cuda 1,955 378 Updated Dec 14, 2025

[ARCHIVED] Cooperative primitives for CUDA C++. See https://github.com/NVIDIA/cccl

Cuda 1,809 464 Updated Oct 9, 2023

Learn CUDA Programming, published by Packt

Cuda 1,217 262 Updated Dec 30, 2023

Fast CUDA matrix multiplication from scratch

Cuda 983 148 Updated Sep 2, 2025

RAFT contains fundamental widely-used algorithms and primitives for machine learning and information retrieval. The algorithms are CUDA-accelerated and form building blocks for more easily writing …

Cuda 963 220 Updated Dec 20, 2025

Automatically exported from code.google.com/p/cuda-convnet2

Cuda 816 293 Updated Dec 3, 2015

Hand-written CUDA operators and interview guide

Cuda 736 81 Updated Aug 23, 2025

CUDA Kernel Benchmarking Library

Cuda 781 97 Updated Dec 10, 2025

CUDA-accelerated GIS and spatiotemporal algorithms

Cuda 695 163 Updated Jul 28, 2025

Graphics Processing Units Molecular Dynamics

Cuda 700 164 Updated Dec 21, 2025