SIMBA (SIMD Binary Accelerator) is a high-performance runtime and tooling layer that lets Go binaries call Rust SIMD intrinsics without CGO.
Whether you're building data-intensive pipelines, number-crunching algorithms, or real-time systems, SIMBA lets your Go code roar with the speed of native vectorized instructions, without sacrificing code clarity or portability.
- Simple interface: access powerful SIMD instructions from Go via intuitive wrappers.
- Powered by Rust: leverages Rust's mature SIMD support for portability and safety.
- No CGO needed: one tiny assembly shim per function, no external linker or cc tool-chain.
- Tooling included: optional CLI tooling to build, inspect, and test SIMD-accelerated modules.
- Modular: use SIMD intrinsics where you need them, and fall back to pure Go when you don't.
SIMBA compiles Rust functions into a position-independent static library, renames it to *.syso, and the Go linker treats it like a native object. A 3-instruction assembly trampoline (one per function) bridges Go's internal ABI to the System V / AAPCS64 calling convention: no cgo, no dynamic loader, ~2 ns overhead.
[ Go Code ] --asm shim--> [ .syso object ] --> [ Rust SIMD ]
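To make the shape of the FFI layer concrete, here is a minimal sketch of what a Go-side binding could look like; the file name, the sumU8AVX2 stub, and its signature are illustrative assumptions, not SIMBA's actual source.

```go
// internal/ffi/sum_u8.go (illustrative sketch; names are assumptions)
package ffi

import "unsafe"

// sumU8AVX2 has no Go body: it is implemented by a hand-written assembly
// trampoline that forwards the pointer and length to the Rust symbol
// linked in from the .syso archive.
func sumU8AVX2(ptr unsafe.Pointer, n uintptr) uint32

// SumU8 adapts a Go slice to the raw pointer/length pair the Rust kernel expects.
func SumU8(b []byte) uint32 {
	if len(b) == 0 {
		return 0
	}
	return sumU8AVX2(unsafe.Pointer(&b[0]), uintptr(len(b)))
}
```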
SIMBA trampolines have fundamental stack-usage limitations that determine which Rust functions can be safely called.

Safe to call:

- Simple SIMD operations (sum_u8, crc32, is_ascii, etc.)
- Predictable, small stack usage (<2KB total)
- Direct computations without deep library calls
- Leaf-like functions with minimal call depth

Not safe to call:

- Complex library operations (CSV parsing, JSON, file I/O)
- Deep recursive calls or nested function chains
- Large local buffers or complex data structures
- Standard library I/O with unpredictable stack usage
Go's NOSPLIT directive marks functions that cannot trigger stack growth and limits them to roughly 2KB of guaranteed stack space. SIMBA trampolines are declared with a $0 frame size on the assumption that the callee needs minimal stack; Rust functions that exceed this limit cause segmentation faults.
Technical Background:
- Go Assembly Manual - NOSPLIT directive documentation
- Stack Management in Go - Runtime stack constraints
- rustgo Project Analysis - Similar FFI approach and limitations
# Analyze Rust function stack usage at compile time
RUSTFLAGS="-Z emit-stack-sizes" cargo build --release --target x86_64-unknown-linux-gnu
objdump -s -j .stack_sizes target/x86_64-unknown-linux-gnu/release/libmycrate.a
# Alternative: Use cargo-call-stack (embedded-focused)
cargo install cargo-call-stack
cargo call-stack --bin mybin

Tool References:
- emit-stack-sizes documentation - Rust compiler flag for stack analysis
- cargo-call-stack - Call graph and stack usage analysis tool
- Stack Size Analysis Guide - Community discussion on measuring stack usage
- Keep SIMBA functions simple - direct computations only
- Use CGO for complex operations - file I/O, parsing, etc.
- Test thoroughly before deploying SIMBA trampolines
- Measure stack usage during development with tools above
- When in doubt, use CGO - it's safer for complex operations
Remember: SIMBA excels at simple, predictable SIMD kernels. For complex library operations, stick with CGO.
- A Guide to Go Assembly - Deep dive into Go assembly internals
- Go Internal ABI Documentation - Go's internal calling convention
- System V ABI - x86-64 calling convention used by Rust
- AAPCS64 Specification - ARM64 calling convention
git submodule add https://github.com/yourname/simba

#[no_mangle]
pub extern "C" fn sum_u8_avx2(ptr: *const u8, len: usize) -> u32 {
    // Rust SIMD code using AVX2 intrinsics
}

package algo
//go:generate go run ./internal/ffi // rebuilds *.syso archive
// SumU8 adds a slice of bytes via SIMD.
func SumU8(b []byte) uint32 {
if len(b) == 0 {
return 0
}
return ffi.SumU8(b) // ~2 ns call-return
}

No build tags needed: SIMBA always builds with CGO disabled. The
go generate ./internal/ffi step produces two files:

- libsimba_darwin_amd64.syso
- libsimba_darwin_arm64.syso

They are auto-linked by the Go tool-chain on any platform.
SIMBA now ships two lane widths for every byte-wise primitive:
| Operation | 32-lane symbol | 64-lane symbol |
|---|---|---|
| Sum of bytes | sum_u8_32 | sum_u8_64 |
| ASCII check | is_ascii32 | is_ascii64 |
| Validate via LUT | validate_u8_lut32 | validate_u8_lut64 |
| Map via LUT | map_u8_lut32 | map_u8_lut64 |
The intrinsics layer (pkg/intrinsics) automatically picks the widest
kernel that amortises its 0.3 ns FFI cost:
// ≥64 B → 64-lane kernel, else 32-lane
if len(b) >= 64 {
return ffi.SumU8_64(b)
}
return ffi.SumU8_32(b)

The algo layer adds a scalar fallback for tiny slices where pure Go still beats SIMD. Current thresholds (Apple M-series):

- Generic helpers (SumU8, LUT ops): 16 B
- ASCII check: 32 B

These cut-offs are recorded in pkg/algo/threshold_*.go and can be tuned per platform; early experiments on AWS Graviton look similar.
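As a rough illustration of how the threshold and the scalar fallback fit together, here is a minimal sketch; the 16 B constant mirrors the figure above, but the package layout and the intrinsics.SumU8 wrapper name are assumptions rather than SIMBA's actual code.

```go
package algo

import "github.com/yourname/simba/pkg/intrinsics" // import path is illustrative

// sumU8Threshold mirrors the Apple M-series cut-off quoted above; the real
// values live in pkg/algo/threshold_*.go.
const sumU8Threshold = 16

// SumU8 uses a scalar loop for tiny slices and otherwise defers to the
// intrinsics layer, which picks the 32- or 64-lane kernel.
func SumU8(b []byte) uint32 {
	if len(b) < sumU8Threshold {
		var s uint32
		for _, c := range b {
			s += uint32(c)
		}
		return s
	}
	return intrinsics.SumU8(b)
}
```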
Calling a single SIMD kernel is cheap once the data size amortises the fixed FFI cost (~2 ns via the syso trampoline). The moment you chain two kernels back-to-back you pay that gateway latency twice, which can wipe out the win for small/medium slices.
Design options:
1. Custom merged kernels (recommended). Write the exact combination you need (e.g. lower-case + validate). Rust's generics/macros make adding a new symbol trivial, and the call still costs one hop.
2. Batch API. Pass a tiny op-code list to one exported function so multiple operations run inside one call. Keeps Go in charge but needs a small "mini-VM" on the Rust side.
3. Handle / pipeline builder. Build the op list once, get back an opaque handle (u64), then execute it many times. Saves parameter marshaling but adds lifetime management.
We currently expose low-level primitives (validate_u8_lut, map_u8_lut) that you can stitch together in Go for rapid prototyping. For production paths, prefer option 1: generate a bespoke kernel and export it. It scales linearly with the number of unique pipelines and keeps the public API intuitive.
(Waiting for "native Go SIMD" isn't part of the near-term plan; the proposal has been open for years and still lacks a stable design.)
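To make the trade-off concrete, here is a hedged sketch of the two paths on the Go side; the wrapper names (intrinsics.MapU8LUT, intrinsics.ValidateU8LUT) and the merged ffi.LowerAndValidate export are illustrative assumptions, not shipped symbols.

```go
// Prototyping path: stitch two low-level primitives together in Go.
// Each call pays the fixed FFI gateway cost, so this is two hops.
func lowerAndValidatePrototype(b, lowerLUT, asciiLUT []byte) bool {
	lowered := intrinsics.MapU8LUT(b, lowerLUT)        // hop 1
	return intrinsics.ValidateU8LUT(lowered, asciiLUT) // hop 2
}

// Production path (option 1): a bespoke merged kernel exported from Rust
// (e.g. a hypothetical lower_and_validate symbol) wrapped once on the Go
// side, so the whole pipeline costs a single hop.
func lowerAndValidateMerged(b []byte) bool {
	return ffi.LowerAndValidate(b)
}
```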
- High-performance parsing (e.g., JSON, CSV, binary protocols)
- Fast image or video preprocessing
- Bitwise vector math
- Custom hashing or compression
- Filtering, mapping, scanning of large datasets
- Platform-independent vector dispatch
- Optional fallback to Go implementation
- Generator for wrappers from Rust β Go
- CLI: simba build, simba inspect, simba bench
- Docs site with examples
Contributions are welcome! If you have ideas for performance improvements, target architecture support, or want to help with the CLI, open an issue or pull request.
SIMBA's goal is to democratize low-level performance for Go developers, without forcing them to write unsafe, unreadable code. You should be able to think in Go, and roar with SIMD.
This project is licensed under the Apache License, Version 2.0.
You may obtain a copy of the license at http://www.apache.org/licenses/LICENSE-2.0.
SIMBA ships a trampoline-sanity unit-test that exercises the FFI layer with
seven mixed-width arguments (pointer, usize, 8-/32-/64-bit ints, raw
float32/float64 bit-patterns). On amd64 the last argument spills to the
stack; on arm64 all fit in registers. The Rust side recomputes a simple FNV
hash and Go asserts equality, so any future stub width/offset error fails
instantly in CI.
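For orientation, a minimal sketch of what such a guard might look like follows; the exported ffi.TrampolineSanity stub, the argument order, and the exact FNV-1a recipe are assumptions, not SIMBA's actual test.

```go
func TestTrampolineSanity(t *testing.T) {
	buf := []byte("simba")
	args := []uint64{
		uint64(uintptr(unsafe.Pointer(&buf[0]))), // pointer
		uint64(len(buf)),                         // usize
		0x5A,                                     // 8-bit int
		0xDEADBEEF,                               // 32-bit int
		0x0123456789ABCDEF,                       // 64-bit int
		uint64(math.Float32bits(1.5)),            // raw float32 bits
		math.Float64bits(-2.25),                  // raw float64 bits
	}

	// Go recomputes the FNV-1a hash the Rust side is expected to return.
	want := uint64(0xcbf29ce484222325)
	for _, a := range args {
		for i := 0; i < 8; i++ {
			want ^= (a >> (8 * i)) & 0xFF
			want *= 0x100000001b3
		}
	}

	got := ffi.TrampolineSanity(
		unsafe.Pointer(&buf[0]), uintptr(len(buf)),
		uint8(0x5A), uint32(0xDEADBEEF), uint64(0x0123456789ABCDEF),
		math.Float32bits(1.5), math.Float64bits(-2.25),
	)
	if got != want {
		t.Fatalf("trampoline mismatch: got %#x, want %#x", got, want)
	}
}
```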
Run just this guard:
go test ./internal/ffi -run TestTrampolineSanity

go generate ./internal/ffi regenerates the assembly stubs; the test must stay green on both amd64 and arm64.