
aruminium

the lightest metal.

pure Rust Apple Metal GPU driver. zero external dependencies. direct objc_msgSend FFI to Metal.framework — no objc crate, no Swift, no generated headers.

standard Rust Metal bindings (objc2-metal) add retain/release overhead on every call. aruminium pre-resolves ObjC method implementations at construction and dispatches through raw function pointers on the hot path — bypassing objc_msgSend entirely in inference loops. the result: 1.79x faster pipelined throughput on identical workloads.

experimental. API unstable.

let n: usize = 1 << 20; // element count
let device = aruminium::Gpu::open()?;
let queue = device.new_command_queue()?;

let lib = device.compile(r#"
    #include <metal_stdlib>
    kernel void add(device float *a [[buffer(0)]],
                    device float *b [[buffer(1)]],
                    device float *c [[buffer(2)]],
                    uint id [[thread_position_in_grid]]) {
        c[id] = a[id] + b[id];
    }
"#)?;

let pipe = device.pipeline(&lib.function("add")?)?;
let a = device.buffer(n * 4)?;
let b = device.buffer(n * 4)?;
let c = device.buffer(n * 4)?;

let cmd = queue.commands()?;
let enc = cmd.encoder()?;
enc.bind(&pipe);
enc.bind_buffer(&a, 0, 0);
enc.bind_buffer(&b, 0, 1);
enc.bind_buffer(&c, 0, 2);
enc.launch((n, 1, 1), (256, 1, 1));
enc.finish();
cmd.submit();
cmd.wait();

numbers

M1 Pro 16-core:

buffer create (1 MB):     0.01 ms
shader compile:           0.01 ms
dispatch overhead:        0.21 ms CPU / 0.18 ms GPU
SAXPY 16M floats:         143 GB/s
fp16 conversion:          58-72 GB/s

vs objc2-metal (standard Rust Metal binding):

batch encode:    1.13x faster
pipelined:       1.79x faster
inference sim:   1.11x faster (300 layers)
SAXPY:           143 vs 140 GB/s

same GPU, same work. aruminium is lighter.

zero-copy memory

GPU buffers can wrap unimem::Block — IOSurface-backed pinned memory shared with CPU (acpu) and ANE (rane):

use aruminium::{Gpu, Block};

let block = Block::open(n * 4)?;
let gpu = Gpu::open()?;
let buf = gpu.wrap(&block)?;  // MTLBuffer over same physical pages
// GPU reads/writes block's memory directly — zero copies

one allocation. three devices. no copies.

api

// device
Gpu::open() -> Result<Gpu>
device.name() -> String
device.has_unified_memory() -> bool

// buffers
device.buffer(bytes) -> Result<Buffer>
device.buffer_with_data(&[u8]) -> Result<Buffer>
buf.read(|&[u8]|)
buf.write_f32(|&mut [f32]|)

// shaders
device.compile(&str) -> Result<ShaderLib>
lib.function(&str) -> Result<Shader>
device.pipeline(&Shader) -> Result<Pipeline>

// dispatch
queue.commands() -> Result<Commands>
cmd.encoder() -> Result<Encoder>
enc.bind(&pipe)
enc.bind_buffer(&buf, offset, index)
enc.launch_groups(grid, group)
enc.finish()
cmd.submit()
cmd.wait()

// sync
Fence, Event, SharedEvent

// profiling
cmd.gpu_time() -> f64
pipeline.static_threadgroup_memory_length() -> usize

// fp16
aruminium::f32_to_fp16(f32) -> u16
aruminium::fp16_to_f32(u16) -> f32
aruminium::cast_f32_f16(&mut [u16], &[f32])  // bulk NEON
aruminium::cast_f16_f32(&mut [f32], &[u16])  // bulk NEON

build

cargo build --release
cargo run --example vecadd
cargo run --example matmul
cargo run --release -p metal-benches --bin bench

requires macOS with Metal-capable GPU.

contributing

we welcome pull requests. especially:

  • performance — faster dispatch, better batching, tighter FFI
  • safety — soundness fixes, better error handling, UB elimination
  • hardware — tested only on M1 Pro. if you have M2/M3/M4 or any other Apple Silicon — your benchmarks and fixes are gold

fork, branch, PR. keep it simple: cargo fmt, cargo clippy, cargo test, examples run.

license

don't trust. don't fear. don't beg.
