
GPU Architecture: Multi-Backend GPU Acceleration for VSL #236

@ulises-jeremias

Description

Context

VSL Role in VTL/VSL GPU Architecture

VSL (V Scientific Library) is the low-level math foundation for VTL. VSL provides:

  • vsl.la — Linear algebra (Matrix, Vector, BLAS/LAPACK wrappers)
  • vsl.vcl — OpenCL data transport (Device, Vector[T], Kernel, async transfers)
  • vsl.plot — Visualization
  • vsl.random — Random number generation
  • vsl.ml — Machine learning primitives

For GPU acceleration, VSL's role is:

  1. GPU-accelerated linear algebra (la.gemm, la.matmul, etc.) on Vulkan, CUDA, and OpenCL
  2. Compute infrastructure for VTL — VTL's la/ module delegates to VSL; VSL's GPU compute flows back to VTL

VTL Phase Mapping to VSL

| VTL Phase | VSL Equivalent | Scope |
| --- | --- | --- |
| Phase 1 (#58): Vulkan foundation | Phase A (#237): VSL Vulkan backend | GPU GEMM, element-wise, broadcast for VSL Matrix |
| Phase 2 (#59): NN forward | N/A | VSL is not an NN library; VTL handles NN |
| Phase 3 (#60): VSL + CUDA | Phase B (#238): VSL CUDA backend | GPU GEMM, element-wise, broadcast for VSL Matrix |
| Phase 4 (#61): GPU autograd | N/A | VTL handles autograd; VSL needs GPU forward only |
| Phase 5 (#62): OpenCL | Phase C (#239): VSL VCL compute | Extend existing VCL from transport to compute |
| Phase 6 (#63): ARM | Covered by Phase A | Vulkan on ARM (Android) is the same code |
| Phase 7+ (#64): Performance | Covered by all | Kernel fusion, mixed precision, async |

Reference Repositories

The V ecosystem has existing GPU/compute infrastructure used as reference for this plan:

antono2/vulkan — Full Raw Vulkan Bindings (Reference)

  • What: Complete, auto-generated Vulkan 1.0–1.4 API bindings (~1.3 MB, weekly auto-regeneration from Khronos XML)
  • Coverage: Every struct, enum, handle, and function from Vulkan core + extensions
  • Handles: Instance, PhysicalDevice, Device, Queue, CommandBuffer, Buffer, DeviceMemory, ShaderModule, Pipeline, PipelineLayout, DescriptorSetLayout, DescriptorPool, DescriptorSet, Fence, Semaphore, etc.
  • Compute support: Full — ComputePipelineCreateInfo, create_compute_pipelines, cmd_dispatch, compute queue flags, ShaderStageFlags.compute
  • C→V mapping pattern: @[typedef] pub struct C.VkXXX, fn C.vkCreateInstance(...) int, etc.
  • Reference value: Use as the pattern for VSL's own Vulkan bindings (vsl/vk/). Study the C→V type mapping, struct layout, and API signatures.
  • License: MIT
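
To make the pattern concrete, here is a minimal sketch of that C→V mapping as it could appear in vsl/vk/ (the Vulkan names are real; the V declarations follow antono2/vulkan's conventions and are illustrative, not final):

```v
module vk

#flag -lvulkan
#include <vulkan/vulkan.h>

// Opaque Vulkan handles and structs map to typedef'd C structs.
@[typedef]
pub struct C.VkInstance {}

@[typedef]
pub struct C.VkInstanceCreateInfo {}

// C function declarations mirror the Vulkan prototypes;
// VkResult comes back as a plain int on the V side.
fn C.vkCreateInstance(p_create_info &C.VkInstanceCreateInfo, p_allocator voidptr, p_instance &C.VkInstance) int

fn C.vkDestroyInstance(instance C.VkInstance, p_allocator voidptr)
```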

antono2/v_vulkan_bindings — Python Generator (Reference)

  • What: Python tool that translates Khronos vk.xml registry → V code
  • Reference value: Fork to generate a vsl-specific subset of Vulkan bindings (compute-only, stripped of graphics APIs for smaller footprint)
  • License: MIT

antono2/vulkan_memory_allocator — GPU Memory Allocator (Reference)

  • What: Pool-based GPU memory allocator using dlmalloc for CPU-side bookkeeping
  • API: Allocator.new(), allocate(), create_buffer(), map(), unmap(), destroy()
  • Reference value: Use as architectural reference for VSL's Vulkan memory allocator
  • License: MIT
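
A usage sketch of that allocator API (the call names come from the list above; Buffer, the parameter shapes, and the error handling are assumptions, so check the repository for the real signatures):

```v
// Hypothetical staging upload built on the allocator API above.
// `Allocator`, `Buffer`, and every signature here are assumptions.
fn upload_f64(mut alloc Allocator, src []f64) !Buffer {
	size := u64(src.len * int(sizeof(f64)))
	buf := alloc.create_buffer(size)! // VkBuffer + bound DeviceMemory
	ptr := alloc.map(buf)!            // CPU-visible pointer for staging
	unsafe { vmemcpy(ptr, src.data, int(size)) }
	alloc.unmap(buf)
	return buf
}
```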

vsl.vcl — Existing OpenCL Compute (Foundation)

  • What: Mature OpenCL wrapper with Device, Vector[T], Kernel, async transfers
  • License: MIT
  • Reference value: Use as architectural reference for compute abstraction pattern
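
For reference, this is the compute pattern VCL already supports end to end, condensed from the vsl.vcl examples (treat the exact signatures as approximate; the source in vcl/ is authoritative):

```v
import vsl.vcl

// Trivial OpenCL kernel: add one to each element.
const kernel_source = '
__kernel void add_one(__global float* data) {
	const int i = get_global_id(0);
	data[i] += 1;
}'

fn main() {
	mut device := vcl.get_default_device()!
	defer { device.release() or { panic(err) } }

	mut v := device.vector[f32](16)! // device-side buffer
	defer { v.release() or { panic(err) } }

	// async host -> device transfer
	load_err := <-v.load([]f32{len: 16, init: f32(index)})
	if load_err !is none {
		panic(load_err)
	}

	// compile program, fetch kernel, dispatch over 16 work items
	device.add_program(kernel_source)!
	k := device.kernel('add_one')!
	run_err := <-k.global(16).local(1).run(v)
	if run_err !is none {
		panic(run_err)
	}

	println(v.data()!) // device -> host readback
}
```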

Implementation Pattern: Self-Contained Wrappers (Like BLAS/LAPACK/VCL)

VSL already maintains self-contained wrappers for all external math libraries:

```
vsl/
├── blas/           ← Pure-V BLAS fallback (own implementation)
├── lapack/         ← Pure-V LAPACK fallback (own implementation)
├── vcl/            ← OpenCL wrapper (self-contained, ~12 .v files)
├── vk/             ← Vulkan bindings + compute (NEW — follow same pattern)
└── compute/        ← VSL compute abstraction (NEW)
```

Decision: Do NOT import antono2/vulkan as a dependency. Instead, maintain VSL's own Vulkan bindings within vsl/vk/, following the same pattern as BLAS, LAPACK, and VCL. Use antono2/vulkan, antono2/v_vulkan_bindings, and antono2/vulkan_memory_allocator as reference implementations for:

  • C→V type mapping patterns
  • Struct layout and handle definitions
  • API function signatures
  • Memory allocator design

This keeps VSL self-contained, avoids external dependency on a third-party module, and gives full control over the Vulkan API surface exposed to VTL.
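
A minimal sketch of what the compute/ dispatch layer could look like (the enum, function, and error messages are illustrative, not a committed API; la.Matrix is the struct described further below):

```v
module compute

import vsl.la

// Illustrative backend selector for the dispatch layer.
pub enum Backend {
	blas   // CPU fallback (current behavior)
	vulkan // vsl/vk
	cuda
	vcl    // OpenCL via vsl.vcl
}

// gemm dispatches c := alpha*a*b + beta*c to the chosen backend,
// falling back to the existing CPU BLAS path.
pub fn gemm(backend Backend, alpha f64, a la.Matrix, b la.Matrix, beta f64, mut c la.Matrix) ! {
	match backend {
		.vulkan { return error('vulkan backend not yet implemented') }
		.cuda { return error('cuda backend not yet implemented') }
		.vcl { return error('vcl compute not yet implemented') }
		.blas {
			// delegate to the existing CPU path in la/
		}
	}
}
```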

Reference: VSL Current State

VSL has no GPU compute — all operations run on CPU:

| Component | Status | Location |
| --- | --- | --- |
| vsl.vcl.Vector[T] | ✅ Data transport | vcl/vector.c.v |
| vsl.vcl.Device | ✅ Device/context/queue | vcl/device.c.v |
| vsl.vcl.Kernel | ✅ Kernel loading/execution | vcl/kernel.c.v |
| vsl.vcl async transfers | ✅ load/data async | vcl/vector.c.v, vcl/buffer.c.v |
| la.gemm | ❌ CPU BLAS only | la/*.v (filename-dispatched BLAS) |
| la.matmul | ❌ CPU BLAS only | la/*.v |
| la.lstsq | ❌ CPU LAPACK only | la/extra.v |
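
The "filename-dispatched BLAS" note refers to V's compile-time file-suffix convention, which lets VSL swap the pure-V fallback for CBLAS without runtime checks. A simplified illustration (these file names are examples, not VSL's actual file list):

```
la/
├── gemm_d_cblas.v      ← compiled only when built with `v -d cblas .` (links CBLAS)
└── gemm_notd_cblas.v   ← compiled otherwise (pure-V fallback)
```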

VSL Matrix vs VTL Tensor

VSL uses vsl.la.Matrix and vsl.la.Vector — different from VTL's Tensor[T]:

```v
// VSL Matrix (column-major, f64 only)
pub struct Matrix {
    data []f64
    m    int
    n    int
}

// VTL Tensor (generic, row/col-major)
@[heap]
pub struct Tensor[T] {
    data    &storage.CpuStorage[T]
    memory  MemoryFormat
    shape   []int
    strides []int
}
```

VSL compute needs to work with VSL Matrix/Vector types, not VTL Tensor types. The bridge lives in VTL's la/ module, which converts Tensor[T] → []f64 → vsl.la.Matrix, calls VSL, and converts the result back.
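
A hedged sketch of that bridge from the VTL side (the Tensor accessor and constructor names are hypothetical, and the Matrix literal follows the struct shown above rather than VSL's actual constructors):

```v
// Hypothetical VTL-side bridge helpers. `t.to_f64_array()` and
// `tensor_from_array` stand in for VTL's real accessors/constructors.
fn to_vsl_matrix(t Tensor) la.Matrix {
	return la.Matrix{
		data: t.to_f64_array() // flatten tensor storage to []f64
		m: t.shape[0]
		n: t.shape[1]
	}
}

fn from_vsl_matrix(m la.Matrix) Tensor {
	// wrap the flat []f64 result back into a Tensor
	return tensor_from_array(m.data, [m.m, m.n])
}
```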

Architecture: VSL Vulkan + Compute Layer

```
vsl/
├── vk/                 ← Vulkan bindings (self-contained, like vcl/)
│   ├── vk.c.v          ← C function declarations (vkCreateInstance, vkCmdDispatch, etc.)
│   ├── vk.ctypes.v     ← C type definitions (VkInstance, VkDevice, VkBuffer, etc.)
│   ├── vk.device.v     ← Device, physical device, instance management
│   ├── vk.buffer.v     ← Buffer creation, memory binding
│   ├── vk.memory.v     ← Memory allocation, map/unmap, staging
│   ├── vk.shader.v     ← ShaderModule from SPIR-V
│   ├── vk.pipeline.v   ← Compute pipeline, pipeline layout
│   ├── vk.descriptor.v ← Descriptor set layout, pool, allocation
│   ├── vk.command.v    ← Command buffer, submit, wait
│   └── vk.kernels.v    ← GLSL compute shader sources (string constants)
├── vcl/                ← Existing OpenCL (data transport + future compute)
├── compute/            ← VSL compute abstraction (dispatch: Vulkan / CUDA / VCL / BLAS)
│   ├── gemm.v
│   ├── elementwise.v
│   └── broadcast.v
└── la/
    ├── la.v            ← Updated: dispatches to compute/ for GPU
    ├── vulkan.v        ← Vulkan GEMM for vsl.la.Matrix
    ├── cuda.v          ← CUDA GEMM for vsl.la.Matrix
    └── extra.v         ← Existing: trace, norm, lstsq, qr, lu (CPU only)
```
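
As a taste of what vk.kernels.v would hold, here is a naive GLSL compute GEMM as a V string constant. This is illustrative only: row-major f32 with one invocation per output element; the real kernels would be tiled and would match Matrix's f64, column-major layout:

```v
module vk

// Naive GEMM: c[i][j] = sum over p of a[i][p] * b[p][j].
pub const gemm_shader_src = '
#version 450
layout(local_size_x = 16, local_size_y = 16) in;
layout(push_constant) uniform Dims { uint m; uint n; uint k; } dims;
layout(std430, binding = 0) readonly buffer A { float a[]; };
layout(std430, binding = 1) readonly buffer B { float b[]; };
layout(std430, binding = 2) buffer C { float c[]; };
void main() {
    uint i = gl_GlobalInvocationID.y;
    uint j = gl_GlobalInvocationID.x;
    if (i >= dims.m || j >= dims.n) return;
    float acc = 0.0;
    for (uint p = 0u; p < dims.k; p++) {
        acc += a[i * dims.k + p] * b[p * dims.n + j];
    }
    c[i * dims.n + j] = acc;
}'
```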

VSL vs VTL Phases Summary

Phase VTL Issue VSL Issue VSL Scope
1 #58 Vulkan foundation #237 VSL Vulkan backend GPU GEMM/element-wise for VSL Matrix
2 #59 NN forward on GPU N/A Not applicable to VSL
3 #60 VSL + CUDA #238 VSL CUDA backend GPU GEMM/element-wise for VSL Matrix
4 #61 GPU autograd N/A VTL handles autograd
5 #62 OpenCL VCL #239 VSL VCL compute Extend VCL from transport to compute
6 #63 ARM Covered by #237 Vulkan on ARM is same code
7+ #64 Performance Covered by all Kernel fusion, mixed precision
