axiomhq/simd-go

simd-go

SIMD-accelerated numeric operations for Go, optimized for ARM64 (NEON and SVE).

Features

  • Hand-tuned ARM64 assembly for NEON and SVE instruction sets
  • Automatic CPU detection - detects Graviton3/4, uses optimal thresholds
  • Threshold-based dispatch - uses scalar for small arrays, SIMD for large
  • Scalar fallbacks for non-ARM64 platforms
  • Zero allocations in hot paths
  • Fuzz tested for correctness
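
The zero-allocation claim can be checked with testing.AllocsPerRun from the standard library. The following is a minimal sketch that measures a plain Go loop; sumScalar and allocsPerCall are hypothetical stand-ins, not part of this package.

```go
package main

import (
	"fmt"
	"testing"
)

// sink keeps the compiler from optimizing the call away.
var sink float64

// sumScalar is a hypothetical stand-in for a kernel under test.
func sumScalar(vals []float64) float64 {
	var s float64
	for _, v := range vals {
		s += v
	}
	return s
}

// allocsPerCall reports the average number of heap allocations
// made by one call to sumScalar on the given input.
func allocsPerCall(vals []float64) float64 {
	return testing.AllocsPerRun(100, func() {
		sink = sumScalar(vals)
	})
}

func main() {
	vals := make([]float64, 10000)
	fmt.Println("allocs per call:", allocsPerCall(vals))
}
```

To check the library itself, replace sumScalar with a call such as simd.SumFloat64 inside the closure.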

Installation

go get github.com/axiomhq/simd-go

Usage

package main

import (
    "fmt"

    simd "github.com/axiomhq/simd-go"
)

func main() {
    vals := []float64{1.0, 2.0, 3.0, 4.0, 5.0}

    sum := simd.SumFloat64(vals)
    lo := simd.MinFloat64(vals) // named lo/hi to avoid shadowing the min/max builtins (Go 1.21+)
    hi := simd.MaxFloat64(vals)

    fmt.Printf("Sum: %v, Min: %v, Max: %v\n", sum, lo, hi)
    // Output: Sum: 15, Min: 1, Max: 5
}

Supported Operations

Float64

Function                                   Description
SumFloat64(vals []float64) float64         Sum of all values
MinFloat64(vals []float64) float64         Minimum value
MaxFloat64(vals []float64) float64         Maximum value
DotProductFloat64(a, b []float64) float64  Dot product of two vectors

Float32

Function                                   Description
SumFloat32(vals []float32) float32         Sum of all values
MinFloat32(vals []float32) float32         Minimum value
MaxFloat32(vals []float32) float32         Maximum value
DotProductFloat32(a, b []float32) float32  Dot product of two vectors

Int64

Function                                               Description
SumInt64(vals []int64) int64                           Sum of all values
MinInt64(vals []int64) int64                           Minimum value
MaxInt64(vals []int64) int64                           Maximum value
DotProductInt64(a, b []int64) int64                    Dot product of two vectors
SumSqInt64(vals []int64) int64                         Sum of squares (Σv²)
AnyAbsGreaterThan(vals []int64, threshold int64) bool  Check if any |v| > threshold

Int32

Function                                                    Description
SumInt32(vals []int32) int64                                Sum of all values (returns int64 to avoid overflow)
MinInt32(vals []int32) int32                                Minimum value
MaxInt32(vals []int32) int32                                Maximum value
DotProductInt32(a, b []int32) int64                         Dot product of two vectors (returns int64 to avoid overflow)
SumSqInt32(vals []int32) int64                              Sum of squares (returns int64 to avoid overflow)
AnyAbsGreaterThanInt32(vals []int32, threshold int32) bool  Check if any |v| > threshold

Int16

Function                                                    Description
SumInt16(vals []int16) int64                                Sum of all values (returns int64 to avoid overflow)
MinInt16(vals []int16) int16                                Minimum value
MaxInt16(vals []int16) int16                                Maximum value
DotProductInt16(a, b []int16) int64                         Dot product of two vectors (returns int64 to avoid overflow)
SumSqInt16(vals []int16) int64                              Sum of squares (returns int64 to avoid overflow)
AnyAbsGreaterThanInt16(vals []int16, threshold int16) bool  Check if any |v| > threshold
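
The Int32 and Int16 sums, dot products, and sums of squares return int64 because an accumulator of the element's own width wraps around quickly. The sketch below shows the difference with two plain Go loops; sumInt32Narrow and sumInt32Wide are hypothetical helpers written for this illustration, not functions from this package.

```go
package main

import (
	"fmt"
	"math"
)

// sumInt32Narrow accumulates in int32 and silently wraps on overflow.
func sumInt32Narrow(vals []int32) int32 {
	var s int32
	for _, v := range vals {
		s += v
	}
	return s
}

// sumInt32Wide widens each element to int64 before adding,
// which mirrors the int64 return type of SumInt32.
func sumInt32Wide(vals []int32) int64 {
	var s int64
	for _, v := range vals {
		s += int64(v)
	}
	return s
}

func main() {
	vals := []int32{math.MaxInt32, math.MaxInt32}
	fmt.Println("int32 accumulator:", sumInt32Narrow(vals)) // wraps to -2
	fmt.Println("int64 accumulator:", sumInt32Wide(vals))   // 4294967294
}
```

Widening makes overflow much less likely but not impossible: a single int32 square can be as large as 2^62, so a dot product or sum of squares over extreme values can still exceed int64.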

CPU Feature Detection

Function          Description
HasSVE() bool     Returns true if CPU supports SVE
HasNEON() bool    Returns true if CPU supports NEON
IsARM64() bool    Returns true if running on ARM64
CPUName() string  Returns detected CPU name (e.g., "AWS Graviton4 (Neoverse-V2)")

Performance

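All benchmarks were run with n = 10,000 elements. Each cell shows time per call with throughput in parentheses; DotProduct throughput counts both input slices.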

Apple M3 (NEON only)

Type     Operation   Scalar               NEON                 Speedup
Float64  Sum         2.83µs (28.3 GB/s)   551ns (145.3 GB/s)   5.1x
Float64  Min         11.8µs (6.8 GB/s)    556ns (143.9 GB/s)   21.2x
Float64  Max         11.9µs (6.7 GB/s)    561ns (142.7 GB/s)   21.1x
Float64  DotProduct  2.96µs (54.1 GB/s)   1.04µs (154.4 GB/s)  2.9x
Float32  Sum         3.02µs (13.3 GB/s)   289ns (138.5 GB/s)   10.4x
Float32  Min         11.7µs (3.4 GB/s)    285ns (140.2 GB/s)   41.1x
Float32  Max         11.8µs (3.4 GB/s)    289ns (138.4 GB/s)   40.8x
Float32  DotProduct  3.01µs (26.6 GB/s)   520ns (153.9 GB/s)   5.8x
Int64    Sum         2.97µs (26.9 GB/s)   567ns (141.0 GB/s)   5.2x
Int64    Min         3.00µs (26.6 GB/s)   806ns (99.3 GB/s)    3.7x
Int64    Max         2.97µs (26.9 GB/s)   823ns (97.2 GB/s)    3.6x
Int64    SumSq       2.98µs (26.9 GB/s)   1.55µs (51.6 GB/s)   1.9x
Int32    Sum         2.96µs (13.5 GB/s)   376ns (106.3 GB/s)   7.9x
Int32    Min         3.00µs (13.3 GB/s)   292ns (137.1 GB/s)   10.3x
Int32    Max         3.03µs (13.2 GB/s)   285ns (140.3 GB/s)   10.6x
Int32    DotProduct  2.97µs (27.0 GB/s)   748ns (107.0 GB/s)   4.0x
Int32    SumSq       2.96µs (13.5 GB/s)   743ns (53.8 GB/s)    4.0x
Int16    Sum         2.97µs (6.7 GB/s)    284ns (70.5 GB/s)    10.5x
Int16    Min         2.97µs (6.7 GB/s)    148ns (134.9 GB/s)   20.0x
Int16    Max         2.98µs (6.7 GB/s)    149ns (134.2 GB/s)   20.0x
Int16    DotProduct  2.98µs (13.4 GB/s)   560ns (71.4 GB/s)    5.3x
Int16    SumSq       2.96µs (6.8 GB/s)    560ns (35.7 GB/s)    5.3x

AWS Graviton3 (Neoverse-V1, SVE 256-bit)

Type     Operation   Scalar               NEON                 SVE                  Best
Float64  Sum         3.87µs (20.7 GB/s)   1.37µs (58.2 GB/s)   1.26µs (63.7 GB/s)   3.1x SVE
Float64  Min         15.4µs (5.2 GB/s)    1.56µs (51.3 GB/s)   1.61µs (49.7 GB/s)   9.9x NEON
Float64  Max         15.3µs (5.2 GB/s)    1.56µs (51.3 GB/s)   1.61µs (49.7 GB/s)   9.8x NEON
Float64  DotProduct  3.87µs (41.4 GB/s)   1.96µs (81.5 GB/s)   1.76µs (91.0 GB/s)   2.2x SVE
Float32  Sum         3.87µs (10.3 GB/s)   550ns (72.7 GB/s)    391ns (102.3 GB/s)   9.9x SVE
Float32  Min         15.4µs (2.6 GB/s)    570ns (70.2 GB/s)    574ns (69.7 GB/s)    27.0x NEON
Float32  Max         15.3µs (2.6 GB/s)    586ns (68.5 GB/s)    574ns (69.7 GB/s)    26.7x SVE
Float32  DotProduct  3.87µs (20.7 GB/s)   780ns (102.6 GB/s)   702ns (114.0 GB/s)   5.5x SVE
Int64    Sum         3.87µs (20.7 GB/s)   1.38µs (58.1 GB/s)   1.24µs (64.3 GB/s)   3.1x SVE
Int64    Min         3.91µs (20.5 GB/s)   1.99µs (40.2 GB/s)   1.61µs (49.6 GB/s)   2.4x SVE
Int64    Max         3.86µs (20.7 GB/s)   2.00µs (40.1 GB/s)   1.61µs (49.7 GB/s)   2.4x SVE
Int64    DotProduct  3.87µs (41.4 GB/s)   -                    2.34µs (68.2 GB/s)   1.6x SVE
Int64    SumSq       3.87µs (20.7 GB/s)   3.18µs (25.2 GB/s)   2.19µs (36.6 GB/s)   1.8x SVE
Int32    Sum         3.86µs (10.4 GB/s)   765ns (52.3 GB/s)    1.27µs (31.5 GB/s)   5.1x NEON
Int32    Min         3.87µs (10.3 GB/s)   564ns (71.0 GB/s)    392ns (102.1 GB/s)   9.9x SVE
Int32    Max         3.91µs (10.2 GB/s)   564ns (71.0 GB/s)    393ns (101.9 GB/s)   9.9x SVE
Int32    DotProduct  3.87µs (20.7 GB/s)   1.76µs (45.4 GB/s)   2.76µs (29.0 GB/s)   2.2x NEON
Int32    SumSq       3.86µs (10.4 GB/s)   1.43µs (28.0 GB/s)   2.11µs (19.0 GB/s)   2.7x NEON
Int16    Sum         3.86µs (5.2 GB/s)    542ns (36.9 GB/s)    1.51µs (13.2 GB/s)   7.1x NEON
Int16    Min         3.87µs (5.2 GB/s)    287ns (69.8 GB/s)    201ns (99.3 GB/s)    19.2x SVE
Int16    Max         3.86µs (5.2 GB/s)    287ns (69.8 GB/s)    201ns (99.3 GB/s)    19.2x SVE
Int16    DotProduct  3.87µs (10.3 GB/s)   1.25µs (31.9 GB/s)   2.12µs (18.9 GB/s)   3.1x NEON
Int16    SumSq       3.86µs (5.2 GB/s)    1.12µs (17.8 GB/s)   1.87µs (10.7 GB/s)   3.4x NEON

AWS Graviton4 (Neoverse-V2, SVE2 128-bit)

Type     Operation   Scalar               NEON                 SVE                  SVE2                 Best
Float64  Sum         3.59µs (22.3 GB/s)   1.00µs (79.7 GB/s)   1.00µs (79.6 GB/s)   -                    3.6x NEON
Float64  Min         14.3µs (5.6 GB/s)    1.22µs (65.7 GB/s)   1.23µs (65.1 GB/s)   -                    11.7x NEON
Float64  Max         14.3µs (5.6 GB/s)    1.22µs (65.6 GB/s)   1.23µs (65.0 GB/s)   -                    11.7x NEON
Float64  DotProduct  4.06µs (40.0 GB/s)   1.71µs (93.6 GB/s)   1.66µs (96.6 GB/s)   -                    2.5x SVE
Float32  Sum         3.90µs (10.4 GB/s)   461ns (86.8 GB/s)    444ns (90.1 GB/s)    -                    8.8x SVE
Float32  Min         14.3µs (2.8 GB/s)    563ns (71.1 GB/s)    565ns (70.8 GB/s)    -                    25.4x NEON
Float32  Max         14.2µs (2.8 GB/s)    563ns (71.1 GB/s)    565ns (70.8 GB/s)    -                    25.3x NEON
Float32  DotProduct  3.59µs (22.3 GB/s)   759ns (105.4 GB/s)   770ns (103.9 GB/s)   -                    4.7x NEON
Int64    Sum         4.06µs (20.0 GB/s)   1.00µs (79.9 GB/s)   1.00µs (79.8 GB/s)   -                    4.1x NEON
Int64    Min         3.90µs (20.8 GB/s)   1.58µs (50.7 GB/s)   1.23µs (65.2 GB/s)   -                    3.2x SVE
Int64    Max         3.58µs (22.3 GB/s)   1.51µs (53.2 GB/s)   1.23µs (65.1 GB/s)   -                    2.9x SVE
Int64    DotProduct  3.59µs (44.6 GB/s)   -                    1.96µs (81.6 GB/s)   -                    1.8x SVE
Int64    SumSq       3.59µs (22.3 GB/s)   1.97µs (40.6 GB/s)   1.80µs (44.4 GB/s)   -                    2.0x SVE
Int32    Sum         3.59µs (11.2 GB/s)   612ns (65.4 GB/s)    1.02µs (39.2 GB/s)   657ns (60.9 GB/s)    5.9x NEON
Int32    Min         4.19µs (9.7 GB/s)    558ns (71.7 GB/s)    444ns (90.2 GB/s)    -                    9.5x SVE
Int32    Max         3.65µs (11.0 GB/s)   558ns (71.7 GB/s)    445ns (89.8 GB/s)    -                    8.2x SVE
Int32    DotProduct  3.59µs (22.3 GB/s)   1.38µs (57.9 GB/s)   2.17µs (36.9 GB/s)   1.35µs (59.5 GB/s)   2.7x SVE2
Int32    SumSq       3.59µs (11.2 GB/s)   1.12µs (35.6 GB/s)   1.90µs (21.1 GB/s)   1.12µs (35.6 GB/s)   3.2x NEON
Int16    Sum         3.78µs (5.3 GB/s)    408ns (49.1 GB/s)    1.26µs (15.9 GB/s)   339ns (59.0 GB/s)    11.1x SVE2
Int16    Min         3.90µs (5.2 GB/s)    344ns (58.2 GB/s)    224ns (89.2 GB/s)    -                    17.4x SVE
Int16    Max         3.90µs (5.2 GB/s)    342ns (58.5 GB/s)    224ns (89.2 GB/s)    -                    17.4x SVE
Int16    DotProduct  3.90µs (10.4 GB/s)   965ns (41.5 GB/s)    2.01µs (19.9 GB/s)   746ns (53.7 GB/s)    5.2x SVE2
Int16    SumSq       3.59µs (5.6 GB/s)    896ns (22.3 GB/s)    1.75µs (11.4 GB/s)   587ns (34.1 GB/s)    6.1x SVE2

Notes

  • DotProductInt64 has no NEON implementation (NEON lacks a vector 64-bit integer multiply), so it is absent from the Apple M3 table and has no NEON result on Graviton
  • Graviton4 uses 128-bit SVE vectors; SVE2 provides additional instructions for better Int16/Int32 performance
  • Apple M3 has no SVE support

Run Your Own Benchmarks

go test -bench=. -benchmem -count=5 ./... | tee bench.txt
benchstat bench.txt

To compare implementations side by side, pivot the results on the sub-benchmark keys:

benchstat -row /fn,/n -col /impl bench.txt
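
The -row /fn,/n -col /impl projection relies on sub-benchmark names that carry key=value parts. The sketch below shows one way to produce such names. The keys fn, n, and impl come from the command above, while benchName, sumScalar, and BenchmarkSum are hypothetical; this repository's actual benchmark code may be organized differently.

```go
package main

import (
	"fmt"
	"testing"
)

// sink keeps the compiler from optimizing the benchmarked call away.
var sink float64

// benchName builds a sub-benchmark name from key=value parts that
// benchstat can address as /fn, /n and /impl.
func benchName(fn string, n int, impl string) string {
	return fmt.Sprintf("fn=%s/n=%d/impl=%s", fn, n, impl)
}

// sumScalar is a stand-in kernel for the sketch.
func sumScalar(vals []float64) float64 {
	var s float64
	for _, v := range vals {
		s += v
	}
	return s
}

// BenchmarkSum would normally live in a _test.go file and run via go test.
func BenchmarkSum(b *testing.B) {
	vals := make([]float64, 10000)
	b.Run(benchName("Sum", len(vals), "scalar"), func(b *testing.B) {
		b.SetBytes(int64(len(vals)) * 8) // lets go test report throughput
		for i := 0; i < b.N; i++ {
			sink = sumScalar(vals)
		}
	})
}

func main() {
	fmt.Println(benchName("Sum", 10000, "scalar"))
}
```

With names of this shape, each fn and n pair becomes a row and each impl becomes a column in the benchstat output.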

Architecture

┌─────────────────────────────────────────────────────────┐
│                    Public API (simd.go)                  │
│  SumFloat64, MinFloat64, MaxFloat64, DotProductFloat64  │
│  SumInt64, MinInt64, MaxInt64, DotProductInt64, ...     │
└─────────────────────────────┬───────────────────────────┘
                              │
              ┌───────────────┴───────────────┐
              │                               │
    ┌─────────▼─────────┐         ┌──────────▼──────────┐
    │  impl_arm64.go    │         │   impl_stub.go      │
    │  (ARM64 dispatch) │         │   (other platforms) │
    │                   │         │                     │
    │  1. Detect CPU    │         │  → scalar.go        │
    │     (G3/G4/other) │         │                     │
    │  2. Check size    │         └─────────────────────┘
    │     vs threshold  │
    │  3. Dispatch to:  │
    │     scalar/NEON/  │
    │     SVE           │
    └─────────┬─────────┘
              │
    ┌─────────┼─────────┐
    │         │         │
┌───▼───┐ ┌───▼───┐ ┌───▼───┐
│Scalar │ │ NEON  │ │  SVE  │
│ (.go) │ │ (.s)  │ │ (.s)  │
└───────┘ └───────┘ └───────┘
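
The three dispatch steps in the diagram can be sketched in portable Go. Everything in this example is an assumption made for illustration: cpuFeatures, chooseImpl, the threshold value, and the unrolled loop standing in for an assembly kernel are not the package's actual code. The sketch also prefers SVE whenever it is present, although the benchmark tables show that the faster path varies by operation.

```go
package main

import "fmt"

// cpuFeatures is a hypothetical stand-in for the result of CPU detection.
type cpuFeatures struct {
	NEON bool
	SVE  bool
}

// sumScalar is the portable fallback.
func sumScalar(vals []float64) float64 {
	var s float64
	for _, v := range vals {
		s += v
	}
	return s
}

// sumUnrolled stands in for an assembly kernel: four independent
// accumulators mimic the lanes of a vector register.
func sumUnrolled(vals []float64) float64 {
	var a0, a1, a2, a3 float64
	i := 0
	for ; i+4 <= len(vals); i += 4 {
		a0 += vals[i]
		a1 += vals[i+1]
		a2 += vals[i+2]
		a3 += vals[i+3]
	}
	for ; i < len(vals); i++ { // leftover tail elements
		a0 += vals[i]
	}
	return (a0 + a1) + (a2 + a3)
}

// chooseImpl names the path taken for an input of length n.
func chooseImpl(cpu cpuFeatures, n, threshold int) string {
	switch {
	case n < threshold || (!cpu.NEON && !cpu.SVE):
		return "scalar" // small input or no SIMD support
	case cpu.SVE:
		return "sve"
	default:
		return "neon"
	}
}

// sum dispatches to the scalar loop or to the SIMD stand-in.
func sum(cpu cpuFeatures, threshold int, vals []float64) float64 {
	if chooseImpl(cpu, len(vals), threshold) == "scalar" {
		return sumScalar(vals)
	}
	return sumUnrolled(vals)
}

func main() {
	cpu := cpuFeatures{NEON: true}
	vals := make([]float64, 100)
	for i := range vals {
		vals[i] = float64(i + 1)
	}
	fmt.Println(chooseImpl(cpu, len(vals), 32), sum(cpu, 32, vals))
}
```

Small inputs take the scalar path because the fixed cost of entering a SIMD kernel can outweigh its benefit below some per-CPU length.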

Safety Notes

SumSqInt64 Overflow

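SumSqInt64 returns an int64, so it can overflow when input values are large. Use AnyAbsGreaterThan to check the inputs first. The threshold below only guarantees that each individual square fits in an int64; the total can still overflow when several values are close to the limit, so choose a smaller threshold if your data can contain them:

const sqrtMaxInt64 = 3037000499 // floor(sqrt(math.MaxInt64))

if simd.AnyAbsGreaterThan(vals, sqrtMaxInt64) {
    // Handle potential overflow - use float64 or arbitrary precision,
    // and return here so the call below is skipped
}
sum := simd.SumSqInt64(vals)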
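
A stricter guard scales the threshold with the input length so that the total, not just each square, fits in an int64. The sketch below uses plain Go loops as stand-ins for simd.AnyAbsGreaterThan and simd.SumSqInt64; safeThreshold and sumSqChecked are hypothetical helpers, not part of this package.

```go
package main

import (
	"fmt"
	"math"
)

// anyAbsGreaterThan is a scalar stand-in for simd.AnyAbsGreaterThan.
// It assumes threshold >= 0.
func anyAbsGreaterThan(vals []int64, threshold int64) bool {
	for _, v := range vals {
		if v > threshold || v < -threshold {
			return true
		}
	}
	return false
}

// sumSqInt64 is a scalar stand-in for simd.SumSqInt64.
func sumSqInt64(vals []int64) int64 {
	var s int64
	for _, v := range vals {
		s += v * v
	}
	return s
}

// safeThreshold returns the largest b with n*b*b <= math.MaxInt64,
// so n values with |v| <= b cannot overflow the sum of squares.
func safeThreshold(n int) int64 {
	if n <= 0 {
		return math.MaxInt64
	}
	q := math.MaxInt64 / int64(n)
	b := int64(math.Sqrt(float64(q)))
	for b > 0 && b > q/b { // correct any float rounding upwards
		b--
	}
	return b
}

// sumSqChecked reports ok=false instead of returning a wrapped result.
func sumSqChecked(vals []int64) (sum int64, ok bool) {
	if anyAbsGreaterThan(vals, safeThreshold(len(vals))) {
		return 0, false
	}
	return sumSqInt64(vals), true
}

func main() {
	fmt.Println(sumSqChecked([]int64{3, 4}))
	fmt.Println(sumSqChecked([]int64{3037000499, 3037000499}))
}
```

This check is conservative: it can reject inputs whose actual sum would fit, in exchange for a single cheap pass over the data.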

License

MIT

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
