
proposal: simd: architecture-specific SIMD intrinsics under a GOEXPERIMENT #73787

@cherrymui

Description

Update (08/20/2025): A preliminary implementation of an AMD64 low-level SIMD package is being developed on the dev.simd branch. You're welcome to check it out and try it with your use cases. Feedback welcome! See #73787 (comment).


Proposal Details

SIMD is crucial for achieving high performance in many modern workloads. While Go currently allows access to SIMD via hand-written assembly, this approach has significant drawbacks: it is difficult to write, prevents asynchronous preemption, and hinders inlining for small kernels.

Adding SIMD support in Go (without requiring writing assembly code) has been requested, see e.g. #35307, #53171, #64634, #67520. Here we propose a SIMD API and intrinsics without a language change. It has some similarities with #53171, #64634, and #67520, with differences in the details.

Two-level approach

Go APIs are generally simple and portable. However, the SIMD operations that the hardware supports are inherently nonportable and complex: different CPU architectures support different vector sizes and operations, and sometimes use different representations.

A portable SIMD API would be nice. As mentioned on #67520, the work on Highway (a portable SIMD implementation for C++) demonstrated that it is possible to achieve a portable and performant SIMD implementation. For C++, it is built on top of architecture-specific compiler intrinsics, which we don't yet have for Go.

Our plan is to take a two-level approach: Low-level architecture-specific API and intrinsics, and a high-level portable vector API. The low-level intrinsics will closely resemble the machine instructions (most intrinsics will compile to a single instruction), and will serve as building blocks for the high-level API.

It is expected that most data-processing code can just use the high-level portable API and achieve good performance. When uncommon architecture-specific operations are needed, the low-level API is available for power users to perform them.

Another way to look at them, as mentioned in #67520 (comment), is that the low-level API is analogous to the syscall package, whereas the high-level one is analogous to the os package. Most code that interacts with the system will use the os package, which works on nearly all platforms, and occasionally some code will reach into the syscall package for some uncommon system-specific operations.

In this proposal, we focus on the low-level architecture-specific fixed-size vectors for now, using AMD64 as a concrete example. Support for variable-size vectors (scalable vectors, e.g. ARM64 SVE) and a portable high-level API will be addressed later. We propose adding this under GOEXPERIMENT=simd for a preview.

Design goals

Here are some design goals for the low-level architecture-specific API.

  • Expressive: Being an architecture-specific API, the low-level package can cover most of the common and useful operations that the hardware supports.
  • Relatively easy to use: despite being a low-level API intended for power users, it should be approachable. General users should be able to read and understand code using this API without digging into the hardware details.
  • Best-effort portability: When an operation is supported on multiple platforms, we intend to have a portable API for it. However, strict or maximal portability is not the goal for this level. In most cases, we don't plan to emulate operations that are not supported by the hardware.
  • It will be a building block for the high-level portable API.

Portable(-ish) vector types with possibly architecture-specific methods

The SIMD vector types will be defined as opaque structs. The internal representation contains the elements of the proper type and count (we may include other zero-width unexported tagging fields if necessary).

package simd

type Uint32x4 struct { a0, a1, a2, a3 uint32 }
type Uint32x8 struct { a0, a1, a2, a3, a4, a5, a6, a7 uint32 }
type Float64x4 struct { a0, a1, a2, a3 float64 }
type Float64x8 struct { a0, a1, a2, a3, a4, a5, a6, a7 float64 }

etc.

The vector types will be defined on the architectures that support them. The compiler will recognize them as special types, using the vector registers to represent and pass them.

We do not define them as arrays, because the hardware often does not support element access with a dynamic index.

Operations will be defined as methods on the vector types, e.g.

// Add adds the corresponding elements of two vectors.
//
// Equivalent to x86 instruction VPADDD.
func (Uint32x4) Add(Uint32x4) Uint32x4

which performs an element-wise add operation. The compiler will recognize it as an intrinsic and compile it to the corresponding machine instruction; e.g. on AMD64, it will be compiled to the VPADDD instruction.
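
In use, a call like the following compiles to a single instruction (a sketch; a and b are Uint32x4 values):

c := a.Add(b) // compiles to one VPADDD after inlining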

The operations will be defined in an architecture-specific way (controlled by build tags). This allows different architectures to have different operations. Common operations (like Add) will be defined on almost all architectures.

Naming

We choose the names of the operations to be easy to understand, and not tied to a specific architecture. The goal is that general readers will understand the code without digging into the CPU details.

Common operations (like Add) will have the same names and signatures across architectures. If an operation is not supported by the hardware on some architectures, it will not be defined there. We will avoid methods with the same name and signature but different semantics on different architectures.

The machine instruction name (like VPADDD) will be included in the comment, for ease of lookup.

Load and store

The machine instructions will load/store from/to a pointer. For type safety it should be a pointer to a properly sized array type.

func LoadUint32x4(*[4]uint32) Uint32x4
func (Uint32x4) Store(*[4]uint32)

Loading from and storing to a slice are expected to be common, useful operations, which will be defined in terms of the above, e.g.

func LoadUint32x4FromSlice(s []uint32) Uint32x4 {
	return LoadUint32x4((*[4]uint32)(s))
}

(LoadUint32x4FromSlice is long. Maybe we name it LoadUint32x4, and name the array pointer form LoadUint32x4Ptr?)
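
To give a sense of how these compose, here is a sketch of an element-wise add kernel over slices, assuming the package is imported as simd; the helper name, the scalar tail loop, and the length assumption are illustrative, not part of the proposal:

// addSlices sets dst[i] = a[i] + b[i], 4 elements per iteration.
// It assumes len(a) and len(b) are at least len(dst).
func addSlices(dst, a, b []uint32) {
	i := 0
	for ; i+4 <= len(dst); i += 4 {
		va := simd.LoadUint32x4FromSlice(a[i:])
		vb := simd.LoadUint32x4FromSlice(b[i:])
		va.Add(vb).Store((*[4]uint32)(dst[i:])) // slice-to-array-pointer conversion (Go 1.17+)
	}
	for ; i < len(dst); i++ { // scalar tail for the remainder
		dst[i] = a[i] + b[i]
	}
}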

Another option is for the load operation to take a placeholder receiver, like

func (Uint32x4) Load(*[4]uint32) Uint32x4
func (Uint32x4) LoadFromSlice([]uint32) Uint32x4

The receiver's value will not be used; it only provides the type. This form keeps the name short. But it is confusing, in that x.Load may appear to load into x (updating x) when it actually returns the loaded value as a result, and it is not idiomatic as a Go API.

Mask types

As pointed out in #67520, the internal representation of masks is different from architecture to architecture, and even different at different levels of CPU features. E.g. on AVX512, a mask is one bit per element of the vector and stored in a mask register (K register); on AVX2 it is stored in a regular vector register with one element per vector element; and on ARM64 SVE, the mask is one bit per byte of the vector.

To handle this situation, we represent masks as opaque types. The compiler will choose the representation that is appropriate for the program. A mask can be used in an operation that supports masking out some elements, and in logical operations between masks. It can also be explicitly converted to a vector. E.g.

func (Uint32x4) AddMasked(Uint32x4, Mask32x4) Uint32x4 // VPADDD.Z
func (Uint32x4) Equal(Uint32x4) Mask32x4 // VPCMPEQD or VPCMPD
func (Mask32x4) And(Mask32x4) Mask32x4 // VPANDD or KANDB
func (Mask32x4) AsVector() Int32x4 // VPMOVM2D or no-op

Besides operations that produce a mask (like comparison), masks can also be created from a vector, or from a bit pattern (on AMD64, it will be AVX512-style, one bit per element; on ARM64 SVE, a different function will produce a mask from a one-bit-per-byte pattern).

func (Uint32x4) AsMask() Mask32x4 // no-op or VPMOVD2M
func Mask32x4FromBits(uint8) Mask32x4

The compiler will choose the representation depending on how the mask is consumed, and will remove conversions between different representations when possible. E.g. for an Equal operation followed by AsVector, the compiler will choose VPCMPEQD, which produces the result directly in a vector register. For an Equal operation followed by AddMasked, it will choose VPCMPD, which produces the result in a K register that can be consumed directly by a masked VPADDD instruction.

It is not recommended to use the masks directly in other ways.
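
To illustrate both consumption patterns, here is a sketch using the methods above (x, y, z are Uint32x4 values):

m := x.Equal(y)            // consumed by AddMasked: compiled as VPCMPD, mask in a K register
r := z.AddMasked(x, m)     // masked add: VPADDD.Z
v := x.Equal(y).AsVector() // consumed as a vector: compiled as VPCMPEQD, result directly in a vector register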

Alternatively, we could have separate, explicit types and methods for different representations of the masks, e.g.

func (Uint32x4) EqualToVec(Uint32x4) Uint32x4 // VPCMPEQD
func (Uint32x4) EqualToMask(Uint32x4) Mask32x4 // VPCMPD

This way, the methods correspond more directly to machine instructions.

Note that the opaque mask representation doesn't preclude the addition of the more explicit representation and methods. If later it turns out that the explicit methods are needed, we can add them under different names.

Some masked operations (e.g. in AVX512) support two modes of masking: zero masking, where the unselected elements are set to 0, and merging masking, where the unselected elements remain unchanged in the destination register. Merging masking is more complicated, in that the destination register is actually both an input and an output; zero masking has a simpler API. Therefore we choose the zero masking version (thus VPADDD.Z above). To get merging behavior, one can follow the zero-masked operation with a blend using the same mask, and the compiler can optimize the pair into a merging masked operation. For example, x.AddMasked(y, m).Blend(z, m) can be optimized to a single merging masked VPADDD instruction.
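
As a sketch of the pattern just described (x, y, z are Uint32x4 values, m is a Mask32x4):

// Zero-masked add, then blend with the same mask: unselected lanes
// take their values from z, which is exactly merging-mask semantics.
// The compiler can fuse the pair into one merging masked VPADDD.
r := x.AddMasked(y, m).Blend(z, m)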

Conversions

Vector elements can be extended, truncated, or converted between integer and floating point types. E.g.

func (Uint32x4) TruncateToUint16() Uint16x8 // VPMOVDW
func (Uint32x4) ExtendToUint64() Uint64x4 // VPMOVZXDQ
func (Uint32x4) ConvertToFloat32() Float32x4 // VCVTUDQ2PS

In some cases it may be useful to truncate to fewer or expand to more elements of the same type, or to "reinterpret" a vector as another vector with the same bits but different element type and arrangement. Some such conversions don't need to generate a machine instruction.

func (Uint32x8) AsUint32x4() Uint32x4 // truncate
func (Uint32x4) AsUint32x8() Uint32x8 // expand, zero high elements
func (Uint32x4) AsInt32x4() Int32x4 // unsigned to signed
func (Uint32x4) AsFloat32x4() Float32x4 // interpret the bits as float32
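
For example (a sketch; x is a Uint32x4, y is a Uint32x8):

f32 := x.AsFloat32x4()    // reinterpretation: the same 128 bits viewed as float32 lanes, no instruction
u64 := x.ExtendToUint64() // widening: VPMOVZXDQ zero-extends each element to 64 bits
lo := y.AsUint32x4()      // truncation: keeps the low 4 elements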

Constant operands

Some machine instructions require a constant operand, or have a form with a constant operand. For example, on AMD64, getting or setting a specific element of a vector (VPEXTRD/VPINSRD instruction) requires a constant index. For shifts, one form is shifting all the elements by the same amount, which has to be a constant (the other form being shifting a variable amount for each element). Namely,

func (Uint32x4) GetElem(int) uint32 // VPEXTRD
func (Uint32x4) SetElem(int, uint32) Uint32x4 // VPINSRD
func (Uint32x4) ShiftLeftConst(uint8) Uint32x4 // VPSLLD

It is recommended (and will be documented) that these methods be called with constant arguments, so the compiler can generate efficient code. What happens if an argument is not constant? The corresponding C intrinsics may cause a compilation failure. We could instead choose to emulate the operation, or generate a table switch (as the range of the constant operand is usually small), as a fallback path.
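
For example (a sketch; v is a Uint32x4, i is a non-constant int):

x := v.GetElem(0)        // constant index: compiles to VPEXTRD $0
w := v.ShiftLeftConst(3) // constant shift amount: compiles to VPSLLD $3
y := v.GetElem(i)        // non-constant index: may take a slower fallback path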

An (incomplete) API list

To give a concrete example, here is the proposed API for Uint32x4 on AMD64.

type Uint32x4 struct { a0, a1, a2, a3 uint32 }
func LoadUint32x4(*[4]uint32) Uint32x4 // VMOVDQU
func (Uint32x4) Store(*[4]uint32) // VMOVDQU
func (Uint32x4) Add(Uint32x4) Uint32x4 // VPADDD
func (Uint32x4) Sub(Uint32x4) Uint32x4 // VPSUBD
func (Uint32x4) Mul(Uint32x4) Uint32x4 // VPMULLD
func (Uint32x4) Min(Uint32x4) Uint32x4 // VPMINUD
func (Uint32x4) Max(Uint32x4) Uint32x4 // VPMAXUD
func (Uint32x4) And(Uint32x4) Uint32x4 // VPAND
func (Uint32x4) Or(Uint32x4) Uint32x4 // VPOR
func (Uint32x4) Xor(Uint32x4) Uint32x4 // VPXOR
func (Uint32x4) AndNot(Uint32x4) Uint32x4 // VPANDN
func (Uint32x4) ShiftLeft(Uint32x4) Uint32x4 // VPSLLVD
func (Uint32x4) ShiftLeftConst(uint8) Uint32x4 // VPSLLD
func (Uint32x4) ShiftRight(Uint32x4) Uint32x4 // VPSRLVD
func (Uint32x4) ShiftRightConst(uint8) Uint32x4 // VPSRLD
func (Uint32x4) RotateLeft(Uint32x4) Uint32x4 // VPROLVD
func (Uint32x4) RotateLeftConst(uint8) Uint32x4 // VPROLD
func (Uint32x4) RotateRight(Uint32x4) Uint32x4 // VPRORVD
func (Uint32x4) RotateRightConst(uint8) Uint32x4 // VPRORD
func (Uint32x4) PairwiseAdd(Uint32x4) Uint32x4 // VPHADDD
func (Uint32x4) MulEvenWiden(Uint32x4) Uint64x2 // VPMULUDQ
func (Uint32x4) OnesCount() Uint32x4 // VPOPCNTD
func (Uint32x4) LeadingZeros() Uint32x4 // VPLZCNTD
func (Uint32x4) Equal(Uint32x4) Mask32x4 // VPCMPEQD or VPCMPD $0
func (Uint32x4) GreaterThan(Uint32x4) Mask32x4 // VPCMPGTD or VPCMPD $6
func (Uint32x4) Blend(Uint32x4, Mask32x4) Uint32x4 // VPBLENDMD
func (Uint32x4) Compress(Mask32x4) Uint32x4 // VPCOMPRESSD
func (Uint32x4) Expand(Mask32x4) Uint32x4 // VPEXPANDD
func (Uint32x4) Permute(Uint32x4) Uint32x4 // VPERMD
func (Uint32x4) Broadcast() Uint32x4 // VPBROADCASTD
func (Uint32x4) GetElem(int) uint32 // VPEXTRD
func (Uint32x4) SetElem(int, uint32) Uint32x4 // VPINSRD
func (Uint32x4) TruncateToUint16() Uint16x8 // VPMOVDW
func (Uint32x4) TruncateToUint8() Uint8x16 // VPMOVDB
func (Uint32x4) ExtendToUint64() Uint64x4 // VPMOVZXDQ
func (Uint32x4) ConvertToFloat32() Float32x4 // VCVTUDQ2PS
func (Uint32x4) ConvertToFloat64() Float64x4 // VCVTUDQ2PD
func (Uint32x4) AsInt32x4() Int32x4 // no-op
func (Uint32x4) AsFloat32x4() Float32x4 // no-op

For the operations that support masking, we will also have masked versions, like AddMasked above.

Some operations are supported only on some types. For example, a vector with floating point elements (e.g. Float64x4) supports Div, Reciprocal, and Sqrt, but does not support shifts or OnesCount.
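
For instance, a small floating point kernel might look like this; the sketch assumes Float64x4 defines Div and Sqrt with signatures analogous to the Uint32x4 methods listed above (VDIVPD, VSQRTPD), and the helper name is hypothetical:

// invSqrtScale divides each element of v by the square root of the
// corresponding element of n.
func invSqrtScale(v, n simd.Float64x4) simd.Float64x4 {
	return v.Div(n.Sqrt())
}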

Note that this list is not comprehensive; e.g. it doesn't yet include operations like Gather, Scatter, Intersect, or Galois Field Affine Transformation. We do plan to support these operations in the API. This proposal only sets the direction we're heading in; we can add more operations later.

Note also that we intentionally don't want to include some forms of the machine instructions in the API, and instead leave them to the optimizer. For example, on AMD64, many arithmetic operations support a memory operand. Just like the language has a + operator and a dereference (unary *) operator but not an operator for a + *b, we don't provide an API for the same operation on vectors. Instead, if a Load operation is followed by an Add, the compiler can optimize it to the memory form of the ADD instruction.
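
For example (a sketch; p is a *[4]uint32 and w is a Uint32x4):

v := simd.LoadUint32x4(p).Add(w) // the compiler may fold the load into a memory-operand VPADDD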

CPU features

SIMD operations often require certain CPU features. One may want to check if specific CPU features are available on the target hardware, and guard the SIMD code with it. CPU feature check functions will be provided, e.g.

func HasAVX512() bool
func HasAVX512VL() bool

The compiler will treat them as pure functions, as their results never change after runtime initialization. That said, it is still recommended to check CPU features up front, before starting a sequence of SIMD operations, rather than in the middle of one.
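
For example, a function might dispatch on a feature check up front (a sketch; sumAVX512 is a hypothetical vectorized implementation, not part of the proposal):

func sum(xs []uint32) uint32 {
	if simd.HasAVX512() { // treated as pure; never changes after startup
		return sumAVX512(xs) // hypothetical SIMD path
	}
	var s uint32 // scalar fallback for CPUs without AVX512
	for _, x := range xs {
		s += x
	}
	return s
}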

It is an open question whether we want to enforce that a CPU feature check must be performed before using a vector intrinsic, through static or dynamic analysis. Required checks would encourage portable code (across machines of the same architecture with different CPU features).

AVX vs. SSE

On AMD64, there are two "flavors" of SIMD instructions: SSE and AVX. Almost all SSE operations have equivalent AVX instructions. Mixing the two forms is not recommended, as it can incur a performance penalty. Therefore, we want to stick to one form, and the initial version of this API will always generate instructions in AVX form. One goal of this API is to support writing performant code on evolving hardware, making use of advanced CPU features. SSE operations may be added later, explicitly or transparently, if there is a strong need.

Discussion

Alternatives

We have considered a few alternative APIs.

Instead of methods, the operations could be defined as top-level functions, like AddUint32x4. This name is unnecessarily long, and repetitive if a piece of code operates on the same type of vectors over and over. We also considered defining them as generic functions, like Add[T Vec](T, T) T. This keeps the name short, but in user code the package prefix (simd.) will likely still be repeated. The main difficulty with generic functions is that it is hard to express relationships between types; e.g. MulEvenWiden returns a vector with half as many elements, each twice as wide. Besides, the implementation would be more complex. Overall this approach doesn't provide much benefit over methods.

We also considered defining the vector types as generic types, like Vec4[T]. This approach has similar difficulties as generic functions. It is also hard to express irregularity: for example, on AMD64, the SaturatedAdd operation is supported for 8- and 16-bit integer elements, but not wider ones.

Future work

Scalable vectors and high-level portable API

A number of architectures have chosen to adopt scalable vectors, such as ARM64 SVE and RISC-V Vector Extension. For scalable vectors, the size cannot be determined at compile time, and it theoretically can be quite large. We plan to add support for scalable vectors in Go, although currently we're not ready to propose a concrete design.

On top of that, we plan to add a high-level portable API for vector operations. Existing portable SIMD implementations such as Highway will be a source of inspiration. To support various architectures and CPU features, the API will probably be based on scalable vectors. On platforms like AMD64, it may be lowered to a fixed-size vector representation, depending on the hardware features.

It is expected that for the majority of use cases in data processing and AI infrastructure, it will be possible to write code using just the high-level API, achieving both portability and performance. We also hope that the low-level and high-level APIs will be interoperable: if some code is mostly portable but needs an operation or two that is very architecture-specific, one can write it mostly with the high-level API and drop down to the low-level API just for those operations.
