ane

Bare-metal ARM SME2 bytecode interpreter for Apple Silicon (M4/M5).

A C++20 static library providing 127 custom opcodes executed by a hand-written ARM64 assembly bytecode interpreter. All matrix operations run directly on the SME2 ZA tiles, the same hardware that Apple's "Neural Engine" software stack targets, bypassing CoreML and BNNS entirely. For the full story on that discovery, see docs/DISCOVERY.md.

Benchmarks

Raw SME smopa achieves roughly 2x the throughput of Apple's own BNNS INT8 path on the same hardware (8.5 vs. 4.2 TOPS):

| Test | GPU | SME | BNNS INT8 | Total |
| --- | --- | --- | --- | --- |
| SME alone | | 8.5 TOPS | | 8.5 |
| BNNS alone | | | 4.2 TOPS | 4.2 |
| SME + BNNS | | 5.2 | 1.6 | 6.9 |
| GPU + SME | 37.4 | 8.5 | | 45.9 |

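When SME and BNNS run simultaneously, both drop (SME from 8.5 to 5.2, BNNS from 4.2 to 1.6), which strongly suggests they share hardware. The GPU does not contend: SME still reaches 8.5 with the GPU running alongside it. Full results and methodology.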

bash tests/run_full_throughput_tests.sh    # Run it yourself

Bytecode Interpreter?

smstart and smstop each come with a bit of overhead. They call it streaming mode for a reason: it is designed so that several streaming operations happen within a single streaming session. How many exactly? I'm not sure yet. Right now bytecode programs are limited to one loop that can repeat at most 255 times. We'll see how that goes.
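
To make the amortization argument concrete, here is a minimal cost-model sketch. It is not part of ane and measures nothing: `CostModel`, its two fields, and all three function names are hypothetical, and the costs are in arbitrary illustrative units. It only compares paying one smstart/smstop pair per operation against paying it once per bytecode program, using the 255-repeat loop limit mentioned above.

```cpp
#include <cstdint>

// Hypothetical cost model in arbitrary units (assumed values, not measurements).
// switch_cost stands in for one smstart/smstop pair,
// op_cost for one streaming operation.
struct CostModel {
    double switch_cost;
    double op_cost;
};

// One streaming session per operation: every operation pays the mode switch.
constexpr double cost_per_op_sessions(CostModel m, std::uint32_t ops) {
    return ops * (m.switch_cost + m.op_cost);
}

// One streaming session for the whole program: the mode switch is paid once.
constexpr double cost_single_session(CostModel m, std::uint32_t ops) {
    return m.switch_cost + ops * m.op_cost;
}

// With a single loop that repeats at most 255 times, a loop body of
// body_ops operations executes at most 255 * body_ops operations.
constexpr std::uint32_t max_ops_per_program(std::uint32_t body_ops) {
    return 255u * body_ops;
}
```

With an assumed switch cost of 100 and operation cost of 10, a four-operation loop body at the 255-repeat limit runs 1020 operations: 112200 units with one session per operation versus 10300 units in a single session. The real ratio depends entirely on the actual smstart/smstop cost, which is not given here.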

Quick Start

mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release -GNinja
cmake --build .
./bin/test_mnist

This builds the library and runs a multi-threaded MNIST training demo (784 -> 128 -> 10, Hogwild SGD) with all compute dispatched through the SME bytecode interpreter. See tests/test_mnist.cpp for the full implementation.

Installation

ane is a C++20 static library. The header (ane.hpp) provides the dispatch template and helper types; the assembly interpreter (bytecode_interpreter.s) is compiled into libane.a.

mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
cmake --build .
sudo cmake --install .

Then link via CMake:

find_package(ane REQUIRED)
target_link_libraries(my_app PRIVATE ane::ane)

Usage

All operations are encoded as ane::Op bytecodes and executed by the assembly interpreter via ane::dispatch:

#include <ane/ane.hpp>
#include <cstring>

int main() {
    int batch_size = 50;
    int HIDDEN_DIM = 128;
    int INPUT_DIM = 784;

    // All memory should be 64-byte aligned and padded to multiples of 16 for SME vector bounds
    float* hidden = ... // allocation
    float* xi     = ... // input data
    float* W1     = ... // model parameters

    // Forward pass: dense matmul with fused 8-bit quantization and ReLU
    memset(hidden, 0, batch_size * HIDDEN_DIM * sizeof(float));

    ane::dispatch(
        ane::Op::dense_fused_i8,
        batch_size,     // Output rows (M)
        HIDDEN_DIM,     // Output cols (N)
        INPUT_DIM,      // Inner dim (K)
        1.0f,           // Float scaling
        true,           // Fuse ReLU
        xi,             // Matrix A (batch_size x INPUT_DIM)
        W1,             // Matrix B (INPUT_DIM x HIDDEN_DIM)
        hidden          // Matrix C (Output)
    );

    return 0;
}

See docs/API.md for the full opcode reference.
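
The usage example leaves buffer allocation elided. As one possible way to meet the alignment note in its comment, the sketch below allocates a zeroed float matrix on a 64-byte boundary. `round_up`, `FreeDeleter`, `AlignedFloats`, and `alloc_padded` are hypothetical names, not part of ane.hpp, and padding each dimension to a multiple of 16 is an assumption about what "padded to multiples of 16" means; check docs/API.md for the actual layout requirements.

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <memory>

// Round n up to the next multiple of `multiple` (multiple must be > 0).
constexpr std::size_t round_up(std::size_t n, std::size_t multiple) {
    return (n + multiple - 1) / multiple * multiple;
}

// Memory from std::aligned_alloc must be released with std::free.
struct FreeDeleter {
    void operator()(float* p) const noexcept { std::free(p); }
};
using AlignedFloats = std::unique_ptr<float[], FreeDeleter>;

// Hypothetical helper: allocates a zero-filled rows x cols float buffer.
// Assumption: each dimension is padded up to a multiple of 16 elements.
// The base address is 64-byte aligned. Returns nullptr on failure.
inline AlignedFloats alloc_padded(std::size_t rows, std::size_t cols) {
    const std::size_t count = round_up(rows, 16) * round_up(cols, 16);
    // count is a multiple of 256 floats, so bytes is a multiple of 64,
    // as std::aligned_alloc requires.
    const std::size_t bytes = count * sizeof(float);
    void* p = std::aligned_alloc(64, bytes);
    if (p == nullptr) return nullptr;
    std::memset(p, 0, bytes);
    return AlignedFloats(static_cast<float*>(p));
}
```

Under these assumptions, `auto hidden = alloc_padded(batch_size, HIDDEN_DIM);` followed by passing `hidden.get()` to the dispatch call would replace both the elided allocation and the memset, since the buffer is already zeroed. With batch_size = 50, the row count is padded to 64.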

Dashboard

An interactive React dashboard that runs live hardware probes and benchmarks against your SME hardware:

./dashboard.sh

See docs/TOOLS.md for details on the probe scripts, throughput benchmarks, and dashboard development.

Requirements

Apple M4 or M5 (or any ARM processor with SME2)
CMake 3.19+
C++20 compiler (Apple Clang recommended)
Node.js 20+ and npm (optional, for the dashboard)
Python 3.10+ with torch and torchvision (optional, for PyTorch comparisons)

Documentation


| File | Contents |
| --- | --- |
| docs/API.md | Opcode reference, dispatch API, helper types |
| docs/DISCOVERY.md | How we found that SME2 is the hardware behind Apple's "Neural Engine" |
| docs/TOOLS.md | Hardware probes, throughput benchmarks, and dashboard |

Contributors

Thanks to Anthropic's Claude models (Haiku, Sonnet, and Opus 4.5 and 4.6 were all used at some point). Thanks to Gemini 3.1 Pro for assembling the React dashboards, benchmark integration, and keeping things clean!

License

MIT. See LICENSE for details.
