6 releases
| 0.3.1 | Mar 27, 2026 |
|---|---|
| 0.3.0 | Mar 18, 2026 |
| 0.2.0 | Feb 17, 2026 |
| 0.1.0 | Dec 30, 2025 |
| 0.0.0 |
|
#2052 in Hardware support
Used in optirs
5MB
112K
SLoC
OptiRS GPU
GPU acceleration and multi-GPU optimization for the OptiRS machine learning optimization library.
Overview
OptiRS-GPU provides hardware acceleration for machine learning optimization workloads across multiple GPU backends. This crate enables high-performance, parallel optimization on CUDA, Metal, OpenCL, and WebGPU platforms, with automatic device selection and memory management.
Features
- Multi-Backend Support: CUDA, Metal, OpenCL, and WebGPU compatibility
- Automatic Device Detection: Intelligent GPU selection and fallback mechanisms
- Multi-GPU Training: Distributed optimization across multiple GPUs
- Memory Management: Efficient GPU memory allocation and transfer
- Async Operations: Non-blocking GPU kernel execution
- Cross-Platform: Unified API across different GPU ecosystems
- Performance Monitoring: GPU utilization and memory usage tracking
Supported Backends
CUDA (NVIDIA)
- Full CUDA runtime integration via cudarc
- Support for CUDA 12.0+ features
- Optimized kernels for tensor operations
- Multi-GPU scaling with NCCL integration
Metal (Apple Silicon)
- Native Metal Performance Shaders integration
- Optimized for Apple M1/M2/M3 architecture
- Unified memory architecture support
- Metal Performance Shaders Graph integration
OpenCL (Cross-platform)
- OpenCL 3.0 support for maximum compatibility
- Vendor-agnostic GPU acceleration
- Support for Intel, AMD, and other OpenCL devices
- Custom kernel compilation and caching
WebGPU (Web/Native)
- Cross-platform GPU acceleration via wgpu
- WebAssembly compatibility for web deployment
- Vulkan, DirectX, and Metal backend support
- Portable compute shaders
Installation
Add this to your Cargo.toml:
[dependencies]
optirs-gpu = "0.3.1"
scirs2-core = "0.4.0" # Required foundation
Feature Selection
Enable specific GPU backends:
[dependencies]
optirs-gpu = { version = "0.3.1", features = ["cuda", "metal"] }
Available features:
cuda: NVIDIA CUDA supportmetal: Apple Metal supportopencl: OpenCL supportwgpu: WebGPU support (enabled by default)
Usage
Basic GPU Optimization
use optirs_gpu::{GpuOptimizer, DeviceManager};
use optirs_core::optimizers::Adam;
// Initialize GPU device manager
let device_manager = DeviceManager::new().await?;
let device = device_manager.select_best_device()?;
// Create GPU-accelerated optimizer
let mut optimizer = GpuOptimizer::new(device)
.with_optimizer(Adam::new(0.001))
.build()?;
// Your model parameters (automatically transferred to GPU)
let mut params = optimizer.create_tensor(&[1024, 512])?;
let grads = optimizer.create_tensor_from_slice(&gradient_data)?;
// Perform optimization step on GPU
optimizer.step(&mut params, &grads).await?;
Multi-GPU Training
use optirs_gpu::{MultiGpuOptimizer, DataParallelStrategy};
// Setup multi-GPU training
let mut multi_gpu = MultiGpuOptimizer::new()
.with_strategy(DataParallelStrategy::AllReduce)
.with_devices(&device_manager.available_devices())
.build()?;
// Distribute model across GPUs
multi_gpu.distribute_model(&model_parameters).await?;
// Synchronized optimization across all GPUs
multi_gpu.step_synchronized(&gradients).await?;
Backend-Specific Features
CUDA-Specific Operations
use optirs_gpu::cuda::{CudaContext, CudaStream};
let cuda_ctx = CudaContext::new(device_id)?;
let stream = CudaStream::new(&cuda_ctx)?;
// Custom CUDA kernels
let result = stream.launch_kernel("custom_optimizer", ¶ms, &config).await?;
Metal-Specific Operations
use optirs_gpu::metal::{MetalDevice, MetalCommandQueue};
let metal_device = MetalDevice::system_default()?;
let command_queue = metal_device.new_command_queue();
// Metal Performance Shaders integration
let mps_optimizer = command_queue.create_mps_optimizer(&config)?;
Architecture
Device Management
- Automatic GPU detection and selection
- Device capability querying
- Memory and compute capability assessment
- Fallback mechanism for unsupported operations
Memory Management
- Efficient GPU memory allocation
- Automatic CPU-GPU data transfer
- Memory pool management
- Garbage collection for GPU resources
Kernel Management
- Optimized compute kernels for common operations
- JIT compilation and caching
- Kernel fusion for performance optimization
- Platform-specific optimizations
Performance Considerations
Memory Transfer Optimization
- Asynchronous CPU-GPU transfers
- Memory pinning for faster transfers
- Batch operations to reduce transfer overhead
- Smart caching of frequently accessed data
Compute Optimization
- Kernel fusion for reduced memory bandwidth
- Optimal thread block sizing
- Memory coalescing patterns
- Shared memory utilization
Multi-GPU Scaling
- Efficient gradient synchronization
- Load balancing across devices
- Topology-aware communication
- Overlap communication and computation
Error Handling
OptiRS-GPU provides comprehensive error handling for GPU operations:
use optirs_gpu::error::{GpuError, GpuResult};
match optimizer.step(&mut params, &grads).await {
Ok(()) => println!("Optimization successful"),
Err(GpuError::OutOfMemory) => {
// Handle GPU memory exhaustion
optimizer.clear_cache()?;
}
Err(GpuError::DeviceNotAvailable) => {
// Fallback to CPU optimization
fallback_to_cpu_optimizer()?;
}
Err(e) => return Err(e.into()),
}
Benchmarking
OptiRS-GPU includes built-in benchmarking tools:
use optirs_gpu::benchmarks::{GpuBenchmark, BenchmarkConfig};
let benchmark = GpuBenchmark::new()
.with_config(BenchmarkConfig::default())
.with_optimizer(optimizer)
.build()?;
let results = benchmark.run_performance_suite().await?;
println!("GPU throughput: {:.2} GFLOPS", results.throughput);
Platform Support
| Platform | CUDA | Metal | OpenCL | WebGPU |
|---|---|---|---|---|
| Linux | ✅ | ❌ | ✅ | ✅ |
| macOS | ❌ | ✅ | ✅ | ✅ |
| Windows | ✅ | ❌ | ✅ | ✅ |
| Web | ❌ | ❌ | ❌ | ✅ |
Contributing
OptiRS follows the Cool Japan organization's development standards. See the main OptiRS repository for contribution guidelines.
License
This project is licensed under the Apache License, Version 2.0.
Dependencies
~101MB
~2M SLoC