cocoshe
  • MAC Lab, XMU
  • Fujian, China
17 results for source starred repositories written in Cuda

Instant neural graphics primitives: lightning fast NeRF and more

Cuda 17,036 2,018 Updated Oct 8, 2025

DeepEP: an efficient expert-parallel communication library

Cuda 8,696 973 Updated Nov 6, 2025

📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA Kernels, Tensor Cores, HGEMM, FA-2 MMA.🎉

Cuda 8,332 826 Updated Nov 6, 2025

DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling

Cuda 5,863 738 Updated Oct 15, 2025

Official implementation of the CVPR 2019 paper "Deep High-Resolution Representation Learning for Human Pose Estimation"

Cuda 4,445 930 Updated Aug 30, 2024

How to optimize some algorithms in CUDA.

Cuda 2,597 235 Updated Oct 30, 2025

Sample codes for my CUDA programming book

Cuda 1,924 375 Updated Feb 15, 2025

Deformable ConvNets V2 (DCNv2) in PyTorch

Cuda 1,484 231 Updated Nov 18, 2022

This is a series of GPU optimization topics. Here we will introduce how to optimize the CUDA kernel in detail. I will introduce several basic kernel optimizations, including: elementwise, reduce, s…

Cuda 1,183 172 Updated Jul 29, 2023

A simple high performance CUDA GEMM implementation.

Cuda 415 42 Updated Jan 4, 2024

Optimizing SGEMM kernel functions on NVIDIA GPUs to a close-to-cuBLAS performance.

Cuda 387 52 Updated Jan 2, 2025

Efficient GPU support for LLM inference with x-bit quantization (e.g., FP6, FP5).

Cuda 271 22 Updated Jul 16, 2025

Tutorials for writing high-performance GPU operators in AI frameworks.

Cuda 133 15 Updated Aug 12, 2023

Implement Flash Attention using CuTe.

Cuda 96 8 Updated Dec 17, 2024

Parallel Prefix Sum (Scan) with CUDA

Cuda 27 3 Updated Jun 22, 2024

Some HPC projects for learning.

Cuda 24 4 Updated Aug 28, 2024