cocoshe

💤

Sleeping

coco cocoshe

59 followers · 242 following

MAC Lab, XMU
Fujian, China
23:23 (UTC +08:00)

Achievements

x3 x2

Highlights

Lists (21)

Stars

17 results for source starred repositories written in Cuda

NVlabs / instant-ngp

Instant neural graphics primitives: lightning fast NeRF and more

Cuda · 17,036 stars · 2,018 forks · Updated Oct 8, 2025

deepseek-ai / DeepEP

DeepEP: an efficient expert-parallel communication library

Cuda · 8,696 stars · 973 forks · Updated Nov 6, 2025

xlite-dev / LeetCUDA

📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA Kernels, Tensor Cores, HGEMM, FA-2 MMA.🎉

Cuda · 8,332 stars · 826 forks · Updated Nov 6, 2025

deepseek-ai / DeepGEMM

DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling

Cuda · 5,863 stars · 738 forks · Updated Oct 15, 2025

leoxiaobin / deep-high-resolution-net.pytorch

The project is an official implementation of our CVPR2019 paper "Deep High-Resolution Representation Learning for Human Pose Estimation"

Cuda · 4,445 stars · 930 forks · Updated Aug 30, 2024

Tony-Tan / CUDA_Freshman

Cuda · 2,602 stars · 495 forks · Updated Jan 16, 2024

BBuf / how-to-optim-algorithm-in-cuda

how to optimize some algorithm in cuda.

Cuda · 2,597 stars · 235 forks · Updated Oct 30, 2025

brucefan1983 / CUDA-Programming

Sample codes for my CUDA programming book

Cuda · 1,924 stars · 375 forks · Updated Feb 15, 2025

chengdazhi / Deformable-Convolution-V2-PyTorch

Deformable ConvNets V2 (DCNv2) in PyTorch

Cuda · 1,484 stars · 231 forks · Updated Nov 18, 2022

Liu-xiandong / How_to_optimize_in_GPU

This is a series of GPU optimization topics. Here we will introduce how to optimize the CUDA kernel in detail. I will introduce several basic kernel optimizations, including: elementwise, reduce, s…

Cuda · 1,183 stars · 172 forks · Updated Jul 29, 2023

Cjkkkk / CUDA_gemm

A simple high performance CUDA GEMM implementation.

Cuda · 415 stars · 42 forks · Updated Jan 4, 2024

yzhaiustc / Optimizing-SGEMM-on-NVIDIA-Turing-GPUs

Optimizing SGEMM kernel functions on NVIDIA GPUs to a close-to-cuBLAS performance.

Cuda · 387 stars · 52 forks · Updated Jan 2, 2025

usyd-fsalab / fp6_llm

An efficient GPU support for LLM inference with x-bit quantization (e.g. FP6,FP5).

Cuda · 271 stars · 22 forks · Updated Jul 16, 2025

openmlsys / openmlsys-cuda

Tutorials for writing high-performance GPU operators in AI frameworks.

Cuda · 133 stars · 15 forks · Updated Aug 12, 2023

luliyucoordinate / cute-flash-attention

Implement Flash Attention using Cute.

Cuda · 96 stars · 8 forks · Updated Dec 17, 2024

li199603 / parallel_prefix_sum

Parallel Prefix Sum (Scan) with CUDA

Cuda · 27 stars · 3 forks · Updated Jun 22, 2024

xgqdut2016 / hpc_project

some hpc project for learning

Cuda · 24 stars · 4 forks · Updated Aug 28, 2024

Lists (21)

3dres

3dvg

agent

CIR

dataset

diffusion

infer

LLM

music

ov

PaperReading

poi

proj

rec

res

retrieval

RL

search

useful

video

work
