
GPU Architecture: Multi-Backend GPU Acceleration for VSL #236

@ulises-jeremias

Description

Context

VSL Role in VTL/VSL GPU Architecture

VSL (V Scientific Library) is the low-level math foundation for VTL. VSL provides:

  • vsl.la — Linear algebra (Matrix, Vector, BLAS/LAPACK wrappers)
  • vsl.vcl — OpenCL data transport (Device, Vector[T], Kernel, async transfers)
  • vsl.plot — Visualization
  • vsl.random — Random number generation
  • vsl.ml — Machine learning primitives

For GPU acceleration, VSL's role is:

  1. GPU-accelerated linear algebra (la.gemm, la.matmul, etc.) on Vulkan, CUDA, and OpenCL
  2. Compute infrastructure for VTL — VTL's la/ module delegates to VSL; VSL's GPU compute flows back to VTL

VTL Phase Mapping to VSL

| VTL Phase | VSL Equivalent | Scope |
| --- | --- | --- |
| Phase 1 (#58): Vulkan foundation | Phase A (#237): VSL Vulkan backend | GPU GEMM, element-wise, broadcast for VSL Matrix |
| Phase 2 (#59): NN forward | N/A | VSL is not an NN library; VTL handles NN |
| Phase 3 (#60): VSL + CUDA | Phase B (#238): VSL CUDA backend | GPU GEMM, element-wise, broadcast for VSL Matrix |
| Phase 4 (#61): GPU autograd | N/A | VTL handles autograd; VSL needs GPU forward only |
| Phase 5 (#62): OpenCL | Phase C (#239): VSL VCL compute | Extend existing VCL from transport to compute |
| Phase 6 (#63): ARM | Covered by Phase A | Vulkan on ARM (Android) is the same code |
| Phase 7+ (#64): Performance | Covered by all | Kernel fusion, mixed precision, async |

Reference Repositories

The V ecosystem has existing GPU/compute infrastructure used as reference for this plan:

antono2/vulkan — Full Raw Vulkan Bindings (Reference)

  • What: Complete, auto-generated Vulkan 1.0–1.4 API bindings (~1.3 MB, weekly auto-regeneration from Khronos XML)
  • Coverage: Every struct, enum, handle, and function from Vulkan core + extensions
  • Handles: Instance, PhysicalDevice, Device, Queue, CommandBuffer, Buffer, DeviceMemory, ShaderModule, Pipeline, PipelineLayout, DescriptorSetLayout, DescriptorPool, DescriptorSet, Fence, Semaphore, etc.
  • Compute support: Full — ComputePipelineCreateInfo, create_compute_pipelines, cmd_dispatch, compute queue flags, ShaderStageFlags.compute
  • C→V mapping pattern: @[typedef] pub struct C.VkXXX, fn C.vkCreateInstance(...) int, etc.
  • Reference value: Use as the pattern for VSL's own Vulkan bindings (vsl/vk/). Study the C→V type mapping, struct layout, and API signatures.
  • License: MIT
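
To make the pattern concrete, here is a minimal sketch of that C→V mapping as it could appear in vsl/vk/ (the Vulkan names are real; the V declarations follow antono2/vulkan's conventions and are illustrative, not final):

```v
module vk

#flag -lvulkan
#include <vulkan/vulkan.h>

// Opaque Vulkan handles and structs map to typedef'd C structs.
@[typedef]
pub struct C.VkInstance {}

@[typedef]
pub struct C.VkInstanceCreateInfo {}

// C function declarations mirror the Vulkan prototypes;
// VkResult comes back as a plain int on the V side.
fn C.vkCreateInstance(p_create_info &C.VkInstanceCreateInfo, p_allocator voidptr, p_instance &C.VkInstance) int

fn C.vkDestroyInstance(instance C.VkInstance, p_allocator voidptr)
```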

antono2/v_vulkan_bindings — Python Generator (Reference)

  • What: Python tool that translates Khronos vk.xml registry → V code
  • Reference value: Fork to generate a vsl-specific subset of Vulkan bindings (compute-only, stripped of graphics APIs for smaller footprint)
  • License: MIT

antono2/vulkan_memory_allocator — GPU Memory Allocator (Reference)

  • What: Pool-based GPU memory allocator using dlmalloc for CPU-side bookkeeping
  • API: Allocator.new(), allocate(), create_buffer(), map(), unmap(), destroy()
  • Reference value: Use as architectural reference for VSL's Vulkan memory allocator
  • License: MIT
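
A usage sketch of that allocator API (the call names come from the list above; Buffer, the parameter shapes, and the error handling are assumptions, so check the repository for the real signatures):

```v
// Hypothetical staging upload built on the allocator API above.
// `Allocator`, `Buffer`, and every signature here are assumptions.
fn upload_f64(mut alloc Allocator, src []f64) !Buffer {
	size := u64(src.len * int(sizeof(f64)))
	buf := alloc.create_buffer(size)! // VkBuffer + bound DeviceMemory
	ptr := alloc.map(buf)!            // CPU-visible pointer for staging
	unsafe { vmemcpy(ptr, src.data, int(size)) }
	alloc.unmap(buf)
	return buf
}
```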

vsl.vcl — Existing OpenCL Compute (Foundation)

  • What: Mature OpenCL wrapper with Device, Vector[T], Kernel, async transfers
  • License: MIT
  • Reference value: Use as architectural reference for compute abstraction pattern
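
For reference, this is the compute pattern VCL already supports end to end, condensed from the vsl.vcl examples (treat the exact signatures as approximate; the source in vcl/ is authoritative):

```v
import vsl.vcl

// Trivial OpenCL kernel: add one to each element.
const kernel_source = '
__kernel void add_one(__global float* data) {
	const int i = get_global_id(0);
	data[i] += 1;
}'

fn main() {
	mut device := vcl.get_default_device()!
	defer { device.release() or { panic(err) } }

	mut v := device.vector[f32](16)! // device-side buffer
	defer { v.release() or { panic(err) } }

	// async host -> device transfer
	load_err := <-v.load([]f32{len: 16, init: f32(index)})
	if load_err !is none {
		panic(load_err)
	}

	// compile program, fetch kernel, dispatch over 16 work items
	device.add_program(kernel_source)!
	k := device.kernel('add_one')!
	run_err := <-k.global(16).local(1).run(v)
	if run_err !is none {
		panic(run_err)
	}

	println(v.data()!) // device -> host readback
}
```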

Implementation Pattern: Self-Contained Wrappers (Like BLAS/LAPACK/VCL)

VSL already maintains self-contained wrappers for all external math libraries:

```
vsl/
├── blas/           ← Pure-V BLAS fallback (own implementation)
├── lapack/         ← Pure-V LAPACK fallback (own implementation)
├── vcl/            ← OpenCL wrapper (self-contained, ~12 .v files)
├── vk/             ← Vulkan bindings + compute (NEW — follow same pattern)
└── compute/        ← VSL compute abstraction (NEW)
```

Decision: Do NOT import antono2/vulkan as a dependency. Instead, maintain VSL's own Vulkan bindings within vsl/vk/, following the same pattern as BLAS, LAPACK, and VCL. Use antono2/vulkan, antono2/v_vulkan_bindings, and antono2/vulkan_memory_allocator as reference implementations for:

  • C→V type mapping patterns
  • Struct layout and handle definitions
  • API function signatures
  • Memory allocator design

This keeps VSL self-contained, avoids external dependency on a third-party module, and gives full control over the Vulkan API surface exposed to VTL.
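
A minimal sketch of what the compute/ dispatch layer could look like (the enum, function, and error messages are illustrative, not a committed API; la.Matrix is the struct described further below):

```v
module compute

import vsl.la

// Illustrative backend selector for the dispatch layer.
pub enum Backend {
	blas   // CPU fallback (current behavior)
	vulkan // vsl/vk
	cuda
	vcl    // OpenCL via vsl.vcl
}

// gemm dispatches c := alpha*a*b + beta*c to the chosen backend,
// falling back to the existing CPU BLAS path.
pub fn gemm(backend Backend, alpha f64, a la.Matrix, b la.Matrix, beta f64, mut c la.Matrix) ! {
	match backend {
		.vulkan { return error('vulkan backend not yet implemented') }
		.cuda { return error('cuda backend not yet implemented') }
		.vcl { return error('vcl compute not yet implemented') }
		.blas {
			// delegate to the existing CPU path in la/
		}
	}
}
```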

Reference: VSL Current State

VSL has no GPU compute — all operations run on CPU:

| Component | Status | Location |
| --- | --- | --- |
| vsl.vcl.Vector[T] | ✅ Data transport | vcl/vector.c.v |
| vsl.vcl.Device | ✅ Device/context/queue | vcl/device.c.v |
| vsl.vcl.Kernel | ✅ Kernel loading/execution | vcl/kernel.c.v |
| vsl.vcl async transfers | ✅ load/data async | vcl/vector.c.v, vcl/buffer.c.v |
| la.gemm | ❌ CPU BLAS only | la/*.v (filename-dispatched BLAS) |
| la.matmul | ❌ CPU BLAS only | la/*.v |
| la.lstsq | ❌ CPU LAPACK only | la/extra.v |
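
The "filename-dispatched BLAS" note refers to V's compile-time file-suffix convention, which lets VSL swap the pure-V fallback for CBLAS without runtime checks. A simplified illustration (these file names are examples, not VSL's actual file list):

```
la/
├── gemm_d_cblas.v      ← compiled only when built with `v -d cblas .` (links CBLAS)
└── gemm_notd_cblas.v   ← compiled otherwise (pure-V fallback)
```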

VSL Matrix vs VTL Tensor

VSL uses vsl.la.Matrix and vsl.la.Vector — different from VTL's Tensor[T]:

```v
// VSL Matrix (column-major, f64 only)
pub struct Matrix {
    data []f64
    m    int
    n    int
}

// VTL Tensor (generic, row/col-major)
@[heap]
pub struct Tensor[T] {
    data    &storage.CpuStorage[T]
    memory  MemoryFormat
    shape   []int
    strides []int
}
```

VSL compute needs to work with VSL Matrix/Vector types, not VTL Tensor types. The bridge lives in VTL's la/ module, which converts Tensor[T] → []f64 → vsl.la.Matrix, calls VSL, and converts the result back.
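
A hedged sketch of that bridge from the VTL side (the Tensor accessor and constructor names are hypothetical, and the Matrix literal follows the struct shown above rather than VSL's actual constructors):

```v
// Hypothetical VTL-side bridge helpers. `t.to_f64_array()` and
// `tensor_from_array` stand in for VTL's real accessors/constructors.
fn to_vsl_matrix(t Tensor) la.Matrix {
	return la.Matrix{
		data: t.to_f64_array() // flatten tensor storage to []f64
		m: t.shape[0]
		n: t.shape[1]
	}
}

fn from_vsl_matrix(m la.Matrix) Tensor {
	// wrap the flat []f64 result back into a Tensor
	return tensor_from_array(m.data, [m.m, m.n])
}
```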

Architecture: VSL Vulkan + Compute Layer

```
vsl/
├── vk/                 ← Vulkan bindings (self-contained, like vcl/)
│   ├── vk.c.v          ← C function declarations (vkCreateInstance, vkCmdDispatch, etc.)
│   ├── vk.ctypes.v     ← C type definitions (VkInstance, VkDevice, VkBuffer, etc.)
│   ├── vk.device.v     ← Device, physical device, instance management
│   ├── vk.buffer.v     ← Buffer creation, memory binding
│   ├── vk.memory.v     ← Memory allocation, map/unmap, staging
│   ├── vk.shader.v     ← ShaderModule from SPIR-V
│   ├── vk.pipeline.v   ← Compute pipeline, pipeline layout
│   ├── vk.descriptor.v ← Descriptor set layout, pool, allocation
│   ├── vk.command.v    ← Command buffer, submit, wait
│   └── vk.kernels.v    ← GLSL compute shader sources (string constants)
├── vcl/                ← Existing OpenCL (data transport + future compute)
├── compute/            ← VSL compute abstraction (dispatch: Vulkan / CUDA / VCL / BLAS)
│   ├── gemm.v
│   ├── elementwise.v
│   └── broadcast.v
└── la/
    ├── la.v            ← Updated: dispatches to compute/ for GPU
    ├── vulkan.v        ← Vulkan GEMM for vsl.la.Matrix
    ├── cuda.v          ← CUDA GEMM for vsl.la.Matrix
    └── extra.v         ← Existing: trace, norm, lstsq, qr, lu (CPU only)
```
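
As a taste of what vk.kernels.v would hold, here is a naive GLSL compute GEMM as a V string constant. This is illustrative only: row-major f32 with one invocation per output element; the real kernels would be tiled and would match Matrix's f64, column-major layout:

```v
module vk

// Naive GEMM: c[i][j] = sum over p of a[i][p] * b[p][j].
pub const gemm_shader_src = '
#version 450
layout(local_size_x = 16, local_size_y = 16) in;
layout(push_constant) uniform Dims { uint m; uint n; uint k; } dims;
layout(std430, binding = 0) readonly buffer A { float a[]; };
layout(std430, binding = 1) readonly buffer B { float b[]; };
layout(std430, binding = 2) buffer C { float c[]; };
void main() {
    uint i = gl_GlobalInvocationID.y;
    uint j = gl_GlobalInvocationID.x;
    if (i >= dims.m || j >= dims.n) return;
    float acc = 0.0;
    for (uint p = 0u; p < dims.k; p++) {
        acc += a[i * dims.k + p] * b[p * dims.n + j];
    }
    c[i * dims.n + j] = acc;
}'
```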

VSL vs VTL Phases Summary

Phase VTL Issue VSL Issue VSL Scope
1 #58 Vulkan foundation #237 VSL Vulkan backend GPU GEMM/element-wise for VSL Matrix
2 #59 NN forward on GPU N/A Not applicable to VSL
3 #60 VSL + CUDA #238 VSL CUDA backend GPU GEMM/element-wise for VSL Matrix
4 #61 GPU autograd N/A VTL handles autograd
5 #62 OpenCL VCL #239 VSL VCL compute Extend VCL from transport to compute
6 #63 ARM Covered by #237 Vulkan on ARM is same code
7+ #64 Performance Covered by all Kernel fusion, mixed precision
