NUMAlloc


A blazing-fast, NUMA-aware memory allocator written in pure Rust.

NUMAlloc is a drop-in replacement for the global allocator, purpose-built for Non-Uniform Memory Access (NUMA) machines. It pins threads and memory to NUMA nodes, routes freed objects back to their origin node, and shares huge pages incrementally -- delivering fewer remote memory accesses, fewer TLB misses, and lower latency than general-purpose allocators.

Note: This project has not been tested in production. Any feedback is welcome — please use with caution.

Features

  • Zero-cost NUMA awareness - O(1) origin-node lookup via pointer arithmetic, no syscalls on the hot path
  • Origin-aware deallocation - freed objects return to their origin node's freelist, eliminating remote reuse
  • Lock-free concurrency - Treiber stacks with ABA-safe generation tags for inter-thread communication
  • Incremental huge page sharing - threads on the same node share 2 MB transparent huge pages in 32 KB–256 KB increments
  • Memory safe - built with Rust's ownership model; every unsafe block is documented with safety invariants
  • Minimal dependencies - only libc for POSIX syscalls, no proc macros, no heavy frameworks
  • Drop-in GlobalAlloc - one line to replace your allocator

Quick Start

Add to your Cargo.toml:

[dependencies]
numalloc = "0.1"

Set as global allocator:

use numalloc::NumaAlloc;

#[global_allocator]
static ALLOC: NumaAlloc = NumaAlloc::new();

fn main() {
    // All allocations now go through NUMAlloc.
    let v: Vec<u64> = vec![1, 2, 3];
    println!("{v:?}");
}

Architecture

NUMAlloc uses a five-layer allocation hierarchy, from fastest to slowest:

graph LR
    A["Thread Cache<br/>(lock-free)"] --> B["Node Heap<br/>(lock-free CAS)"]
    B --> C["Region<br/>(atomic bump)"]
    C --> D["Large Cache<br/>(per-thread)"]
    D --> E["OS mmap"]

Allocation path (small objects, <= 256 KB)

graph TD
    A["malloc(size)"] --> B["Per-Thread Freelist"]
    B -- hit --> C["return pointer<br/>(zero synchronization)"]
    B -- miss --> D["Per-Node Treiber Stack"]
    D -- hit --> E["batch-pop up to 64 objects,<br/>return one"]
    D -- miss --> F["Region Bump Allocator"]
    F --> G["carve bag (32 KB–256 KB),<br/>fill freelist, return one"]

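
The final miss step above, carving a bag out of a region with an atomic bump pointer, can be sketched in a few lines. This is an illustrative model only; the struct, field names, and sizes are assumptions, not NUMAlloc's actual API.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Hypothetical region that hands out fixed-size "bags" by bumping an
// atomic offset. A single fetch_add claims the space; no locks needed.
struct Region {
    base: usize,         // start address of the region (illustrative)
    size: usize,         // total bytes available in the region
    offset: AtomicUsize, // bytes already carved out
}

impl Region {
    // Carve `bag_size` bytes; returns the bag's start address relative to
    // `base`, or None once the region is exhausted.
    fn carve_bag(&self, bag_size: usize) -> Option<usize> {
        let start = self.offset.fetch_add(bag_size, Ordering::Relaxed);
        if start + bag_size <= self.size {
            Some(self.base + start)
        } else {
            None // exhausted; a real allocator would fall back to the OS
        }
    }
}

fn main() {
    let region = Region { base: 0x1000_0000, size: 256 * 1024, offset: AtomicUsize::new(0) };
    let a = region.carve_bag(32 * 1024).unwrap();
    let b = region.carve_bag(32 * 1024).unwrap();
    assert_eq!(b - a, 32 * 1024); // consecutive bags are adjacent
    println!("bag a at {a:#x}, bag b at {b:#x}");
}
```

Because the bump is a single `fetch_add`, concurrent threads on the same node can carve bags without ever blocking each other.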
Allocation path (large objects, > 256 KB)

graph TD
    A["malloc(size)"] --> B["Per-Thread Large Cache"]
    B -- hit --> C["reuse cached mmap region<br/>(~7.5 ns)"]
    B -- miss --> D["OS mmap"]
    D --> E["fresh mapping,<br/>mbind to thread's node"]

Deallocation path

graph TD
    A["free(ptr)"] --> B["Compute origin node<br/>(ptr - base) / region_size<br/>O(1), no syscall"]
    B --> C{"Object type?"}
    C -- "small, local" --> D["push to per-thread freelist<br/>(no locks)"]
    C -- "small, remote" --> E["push to origin node's<br/>Treiber stack (lock-free CAS)"]
    C -- "large (> 256KB)" --> F["cache mmap region for reuse<br/>(evict old if full)"]

Heap layout

A single contiguous virtual region is mapped at init and divided equally among NUMA nodes. Each sub-region is bound to its physical node via mbind. This design enables origin-node identification through simple integer division on any pointer.
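
The origin-node lookup described above reduces to one subtraction and one division. A minimal sketch, with illustrative base address and per-node size (not NUMAlloc's actual layout constants):

```rust
// Given the start of the contiguous mapping and the size of each node's
// equal share, any pointer's origin node falls out of integer division.
fn origin_node(ptr: usize, base: usize, per_node_size: usize) -> usize {
    (ptr - base) / per_node_size
}

fn main() {
    let base = 0x7f00_0000_0000;      // illustrative mapping start
    let per_node = 1usize << 34;      // assume 16 GiB per node for the example
    let p = base + 3 * per_node + 4096; // an address inside node 3's sub-region
    assert_eq!(origin_node(p, base, per_node), 3);
}
```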

block-beta
    columns 8
    block:n0["Node 0"]:2
        s0["small"]
        b0["big"]
    end
    block:n1["Node 1"]:2
        s1["small"]
        b1["big"]
    end
    block:n2["Node 2"]:2
        s2["small"]
        b2["big"]
    end
    block:nN["Node N"]:2
        sN["small"]
        bN["big"]
    end
    bp0["bpSmall"] space bp1["bpSmall"] space bp2["bpSmall"] space bpN["bpSmall"] space
    bp0 --> s0
    bp1 --> s1
    bp2 --> s2
    bpN --> sN

For the full design document with Mermaid diagrams and benchmark details, see docs/architecture_design.md.

Benchmarks

Single-threaded alloc+dealloc (steady state, lower is better):

| Size   | numalloc | system (glibc) | mimalloc | jemalloc |
|--------|----------|----------------|----------|----------|
| 8 B    | 7.4 ns   | 13.8 ns        | 10.9 ns  | 9.2 ns   |
| 64 B   | 7.4 ns   | 15.4 ns        | 11.1 ns  | 9.4 ns   |
| 1 KB   | 7.4 ns   | 31.2 ns        | 15.7 ns  | 9.8 ns   |
| 16 KB  | 7.3 ns   | 31.7 ns        | 19.3 ns  | 23.2 ns  |
| 64 KB  | 7.7 ns   | 692 ns         | 19.4 ns  | 104 ns   |
| 256 KB | 8.4 ns   | 707 ns         | 956 ns   | 105 ns   |

Bulk alloc+free (1000 items, single-threaded, lower is better):

| Size   | numalloc | system  | mimalloc | jemalloc |
|--------|----------|---------|----------|----------|
| 64 B   | 8.2 us   | 16.8 us | 8.0 us   | 17.1 us  |
| 4 KB   | 24.3 us  | 88.1 us | 24.4 us  | 46.5 us  |
| 64 KB  | 57.2 us  | 709 us  | 43.5 us  | 508 us   |
| 256 KB | 37.4 us  | 704 us  | 155 us   | 545 us   |

Multi-threaded alloc+dealloc (10,000 ops/thread, lower is better):

| Config           | numalloc | system | mimalloc | jemalloc |
|------------------|----------|--------|----------|----------|
| 64 B, 4 threads  | 153 us   | 1.2 ms | 172 us   | 158 us   |
| 1 KB, 8 threads  | 213 us   | 2.9 ms | 293 us   | 221 us   |
| 4 KB, 8 threads  | 213 us   | 3.3 ms | 337 us   | 245 us   |

Axum HTTP server benchmark (4 threads, 100 connections, 10s, higher is better):

| Endpoint (response size) | numalloc   | system     | mimalloc   |
|--------------------------|------------|------------|------------|
| /small (~32 B)           | 49,124 rps | 48,286 rps | 48,310 rps |
| /medium (~256 B)         | 49,135 rps | 48,892 rps | 49,017 rps |
| /large (~16 KB)          | 46,600 rps | 46,244 rps | 46,511 rps |
| /bulk (~64 KB)           | 37,558 rps | 35,121 rps | 37,155 rps |

Resident set size after the same load:

| Allocator | RSS after load |
|-----------|----------------|
| system    | 11 MB          |
| numalloc  | 16 MB          |
| mimalloc  | 27 MB          |

Run benchmarks yourself with cargo bench or cd examples/axum-bench && bash bench.sh.

Design Principles

  • Hot path = zero synchronization. Per-thread freelists are single-owner, no atomics, no syscalls.
  • Lock-free over locks. Shared per-node heaps use Treiber stacks with CAS, not mutexes.
  • Batch everything. Drain and refill operations move objects in bulk via chain-insert, amortizing CAS cost.
  • Adaptive caching. Per-thread cache thresholds scale with object size -- small objects cache up to 2048 items, large objects cache 64.
  • Intrusive data structures. Free blocks store their next pointer in the freed memory itself -- zero extra overhead.
  • Explicit safety. Every unsafe block carries a // SAFETY: comment. NonNull<T> over *mut T. UnsafeCell for interior mutability.
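
The intrusive-freelist principle can be shown in miniature: each free block's first word holds the pointer to the next free block, so the list itself occupies no extra memory. This is an illustrative sketch, not NUMAlloc's internals.

```rust
use std::ptr;

// A freelist whose links live inside the freed blocks themselves.
struct FreeList { head: *mut u8 }

impl FreeList {
    fn new() -> Self { Self { head: ptr::null_mut() } }

    // Push a freed block: stash the current head in the block's first word.
    // SAFETY: caller guarantees `block` is valid, unused, pointer-aligned,
    // and at least one pointer wide.
    unsafe fn push(&mut self, block: *mut u8) {
        (block as *mut *mut u8).write(self.head);
        self.head = block;
    }

    // Pop a block: read the next-pointer back out of the block.
    unsafe fn pop(&mut self) -> Option<*mut u8> {
        if self.head.is_null() { return None; }
        let block = self.head;
        self.head = (block as *const *mut u8).read();
        Some(block)
    }
}

fn main() {
    // Carve two 64-byte "blocks" out of one aligned buffer to stand in
    // for heap memory.
    let mut buf = [0usize; 16]; // 128 bytes, pointer-aligned
    let b0 = buf.as_mut_ptr() as *mut u8;
    let b1 = unsafe { b0.add(64) };

    let mut list = FreeList::new();
    unsafe {
        list.push(b0);
        list.push(b1);
        assert_eq!(list.pop(), Some(b1)); // LIFO: last freed, first reused
        assert_eq!(list.pop(), Some(b0));
        assert_eq!(list.pop(), None);
    }
}
```

The LIFO order is also cache-friendly: the most recently freed block, which is likeliest to still be in cache, is the first one handed back out.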

When to Use NUMAlloc

Good fit:

  • Multi-threaded server applications on NUMA hardware (2+ sockets)
  • Workloads with many small allocations and high thread counts
  • Environments with transparent huge pages enabled

Not ideal for:

  • Single-threaded applications
  • Heavy cross-node producer-consumer patterns
  • Asymmetric or heterogeneous memory topologies

Contributing

Contributions are welcome! Please ensure your changes pass:

cargo fmt
cargo clippy -- -D warnings
cargo test

License

NUMAlloc is released under the MIT license. See LICENSE for details.

References

Based on: Yang et al., "NUMAlloc: A Faster NUMA Memory Allocator," ISMM 2023, ACM.
