cublas
Here are 94 public repositories matching this topic...
🔍 Analyze CUDA matrix multiplication performance and power consumption on NVIDIA Jetson Orin Nano across multiple implementations and settings.
Updated Dec 18, 2025 - Python
High-performance CUDA implementation of Muon optimizer for LLM training. Features Newton-Schulz polar decomposition, cuBLAS acceleration, and transpose optimization for 8x FLOP savings on transformer FFN layers. Benchmarked on NVIDIA A100 with Llama 3.1 8B architectures (4096×11008 weights).
Updated Dec 18, 2025 - Python
A deep learning framework written in Rust. Explore the principles of deep learning by implementing it from scratch in Rust!
Updated Dec 18, 2025 - Rust
CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication through Reinforcement Learning
Updated Dec 15, 2025 - Cuda
Safe Rust wrapper around the CUDA toolkit
Updated Dec 11, 2025 - Rust
High-performance GPU-accelerated linear algebra library for scientific computing. Custom kernels outperform cuBLAS+cuSPARSE by 2.4x in iterative solvers. Built for circuit simulation workloads.
Updated Dec 6, 2025 - Cuda
Framework, toolkit, and ready-to-use applications for machine learning algorithms that depend on numerical linear algebra.
Updated Nov 11, 2025 - C++
Scientific CUDA benchmarking framework: 4 implementations x 3 power modes x 5 matrix sizes on Jetson Orin Nano. 1,282 GFLOPS peak, 90% performance @ 88% power (25W mode), 99.5% accuracy validation, edge AI deployment guide.
Updated Oct 14, 2025 - Python
Multiple GEMM operators built with CUTLASS to support LLM inference.
Updated Aug 3, 2025 - C++
🚀🚀🚀 This repository lists some awesome public CUDA, cuda-python, cuBLAS, cuDNN, CUTLASS, TensorRT, TensorRT-LLM, Triton, TVM, MLIR, PTX and High Performance Computing (HPC) projects.
Updated Aug 2, 2025
SCUDA is a GPU-over-IP bridge that allows GPUs on remote machines to be attached to CPU-only machines.
Updated Jun 16, 2025 - C++
VTensor is a C++ library for tensor manipulation on GPUs that emulates the Python NumPy style for ease of use. It uses RMM (RAPIDS Memory Manager) for efficient device memory management and supports xtensor for host memory operations.
Updated Apr 1, 2025 - C++
SwiftCUBLAS is a wrapper for the cuBLAS APIs with extra utilities for ease of use, along with a suite of tests. The repository is tested against the newest (v12.5) CUDA runtime API on both Linux and Windows.
Updated Feb 22, 2025 - Swift
A deep learning framework with very few dependencies, written in Rust
Updated Feb 14, 2025 - Rust
CUDA kernel functions
Updated Dec 2, 2024 - Cuda