crcrpar
  • NVIDIA
  • Tokyo

17 stars written in Cuda

Code and data for paper "Deep Painterly Harmonization": https://arxiv.org/abs/1804.03189

Cuda · 6,057 stars · 615 forks · Updated Aug 2, 2021

FlashInfer: Kernel Library for LLM Serving

Cuda · 4,347 stars · 614 forks · Updated Dec 24, 2025

Fast parallel CTC.

Cuda · 4,078 stars · 1,036 forks · Updated Mar 4, 2024

Squeeze-and-Excitation Networks

Cuda · 3,592 stars · 851 forks · Updated Feb 25, 2019

Fully Convolutional Instance-aware Semantic Segmentation

Cuda · 1,566 stars · 411 forks · Updated Sep 27, 2021

Efficient GPU kernels for block-sparse matrix multiplication and convolution

Cuda · 1,063 stars · 198 forks · Updated Jun 8, 2023

RAFT contains fundamental widely-used algorithms and primitives for machine learning and information retrieval. The algorithms are CUDA-accelerated and form building blocks for more easily writing …

Cuda · 963 stars · 220 forks · Updated Dec 23, 2025

Training materials associated with NVIDIA's CUDA Training Series (www.olcf.ornl.gov/cuda-training-series/)

Cuda · 925 stars · 339 forks · Updated Aug 19, 2024

CUDA Kernel Benchmarking Library

Cuda · 786 stars · 97 forks · Updated Dec 10, 2025

Reference implementation of real-time autoregressive wavenet inference

Cuda · 744 stars · 126 forks · Updated Jan 19, 2021

Efficient GPU support for LLM inference with x-bit quantization (e.g. FP6, FP5).

Cuda · 277 stars · 22 forks · Updated Jul 16, 2025

Parrot is a C++ library for fused array operations using CUDA/Thrust. It provides efficient GPU-accelerated operations with lazy evaluation semantics, allowing for chaining of operations without un…

Cuda · 240 stars · 14 forks · Updated Dec 18, 2025

PyTorch bindings for CUTLASS grouped GEMM.

Cuda · 176 stars · 46 forks · Updated Dec 16, 2025

WholeGraph - large-scale Graph Neural Networks

Cuda · 106 stars · 37 forks · Updated Nov 25, 2024

Cuda · 43 stars · 10 forks · Updated Dec 10, 2025

Optimized parallel tiled approach to matrix multiplication that takes advantage of the lower-latency, higher-bandwidth shared memory within GPU thread blocks.

Cuda · 16 stars · 1 fork · Updated Sep 24, 2017
Cuda · 2 stars · Updated Sep 26, 2025
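The tiled matrix-multiplication entry above names the classic shared-memory tiling technique. A minimal, self-contained sketch of that idea follows; the kernel, the `TILE` size, and the test values are illustrative and are not taken from the starred repository.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define TILE 16  // tile edge length; an illustrative choice, not from the repo

// C = A * B for N x N row-major matrices. Each block computes one
// TILE x TILE tile of C, staging matching tiles of A and B in shared
// memory so each global-memory element is read once per tile instead
// of once per inner-product term.
__global__ void tiledMatMul(const float *A, const float *B, float *C, int N) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < (N + TILE - 1) / TILE; ++t) {
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;
        // Stage one tile of each operand in low-latency shared memory.
        As[threadIdx.y][threadIdx.x] = (row < N && aCol < N) ? A[row * N + aCol] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (bRow < N && col < N) ? B[bRow * N + col] : 0.0f;
        __syncthreads();  // whole tile staged before anyone reads it

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // everyone done before the tile is overwritten
    }
    if (row < N && col < N)
        C[row * N + col] = acc;
}

int main() {
    const int N = 64;
    const size_t bytes = N * N * sizeof(float);
    float *hA = (float *)malloc(bytes), *hB = (float *)malloc(bytes),
          *hC = (float *)malloc(bytes);
    for (int i = 0; i < N * N; ++i) { hA[i] = 1.0f; hB[i] = 2.0f; }

    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid((N + TILE - 1) / TILE, (N + TILE - 1) / TILE);
    tiledMatMul<<<grid, block>>>(dA, dB, dC, N);
    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);

    // With all-ones times all-twos, every entry of C should be 2 * N.
    printf("C[0] = %f (expected %f)\n", hC[0], 2.0f * N);

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(hA); free(hB); free(hC);
    return 0;
}
```

Compile with `nvcc -o tiled tiled.cu` on a machine with a CUDA-capable GPU. The two `__syncthreads()` barriers are the load-bearing part of the pattern: the first guarantees the tile is fully staged before any thread reads it, the second that all threads are finished with it before the next iteration overwrites it.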