🔍 Analyze CUDA matrix multiplication performance and power consumption on NVIDIA Jetson Orin Nano across multiple implementations and settings.
A deep learning framework written in Rust. Explore the principles of deep learning by implementing it from scratch in Rust!
CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication through Reinforcement Learning
Safe rust wrapper around CUDA toolkit
High-performance GPU-accelerated linear algebra library for scientific computing. Custom kernels outperform cuBLAS+cuSPARSE by 2.4x in iterative solvers. Built for circuit simulation workloads.
Framework, toolkit and ready-to-use applications for numerical linear algebra dependent machine learning algorithms.
Scientific CUDA benchmarking framework: 4 implementations x 3 power modes x 5 matrix sizes on Jetson Orin Nano. 1,282 GFLOPS peak, 90% performance @ 88% power (25W mode), 99.5% accuracy validation, edge AI deployment guide.
Multiple GEMM operators built with CUTLASS to support LLM inference.
🚀🚀🚀 This repository lists some awesome public CUDA, cuda-python, cuBLAS, cuDNN, CUTLASS, TensorRT, TensorRT-LLM, Triton, TVM, MLIR, PTX and High Performance Computing (HPC) projects.
SCUDA is a GPU over IP bridge allowing GPUs on remote machines to be attached to CPU-only machines.
VTensor, a C++ library, facilitates tensor manipulation on GPUs, emulating the python-numpy style for ease of use. It leverages RMM (RAPIDS Memory Manager) for efficient device memory management. It also supports xtensor for host memory operations.
SwiftCUBLAS is a wrapper for cuBLAS APIs with extra utilities for ease of usage, along with a suite of tests. The repository is tested on the newest (v12.5) CUDA runtime API on both Linux and Windows.
A Deep Learning framework with very few dependencies, Written in Rust
CUDA kernel functions
This project utilizes CUDA and cuBLAS to optimize matrix multiplication, achieving up to a 5x speedup on large matrices by leveraging GPU acceleration. It also improves memory efficiency and reduces data transfer times between CPU and GPU.
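Several of the entries above report matrix-multiplication throughput in GFLOPS (e.g. the 1,282 GFLOPS peak on the Jetson Orin Nano). A minimal sketch of how that figure is conventionally derived, assuming square n×n matrices so a GEMM costs 2n³ floating-point operations (NumPy stands in here for a CUDA kernel; the function and variable names are illustrative, not from any listed project):

```python
import time
import numpy as np

def gemm_gflops(n: int, seconds: float) -> float:
    """Throughput of one n x n x n GEMM: 2*n^3 flops (n^3 multiplies + n^3 adds)."""
    return (2.0 * n**3) / seconds / 1e9

# Time a single matrix multiply; on a GPU you would time the kernel
# (and synchronize) instead of a NumPy call.
n = 512
a = np.random.rand(n, n).astype(np.float32)
b = np.random.rand(n, n).astype(np.float32)
start = time.perf_counter()
c = a @ b
elapsed = time.perf_counter() - start
print(f"{n}x{n} GEMM: {gemm_gflops(n, elapsed):.1f} GFLOPS")
```

Note that for GPU benchmarks the timing must exclude host-to-device transfers (or report them separately), which is exactly the data-transfer overhead the project above tries to reduce.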