# NUMAlloc
A blazing-fast, NUMA-aware memory allocator written in pure Rust.
NUMAlloc is a drop-in replacement for the global allocator, purpose-built for Non-Uniform Memory Access (NUMA) machines. It pins threads and memory to NUMA nodes, routes freed objects back to their origin node, and shares huge pages incrementally -- delivering fewer remote memory accesses, fewer TLB misses, and lower latency than general-purpose allocators.
Note: This project has not been tested in production. Any feedback is welcome — please use with caution.
## Features
- Zero-cost NUMA awareness - O(1) origin-node lookup via pointer arithmetic, no syscalls on the hot path
- Origin-aware deallocation - freed objects return to their origin node's freelist, eliminating remote reuse
- Lock-free concurrency - Treiber stacks with ABA-safe generation tags for inter-thread communication
- Incremental huge page sharing - threads on the same node share 2 MB transparent huge pages in 32 KB–256 KB increments
- Memory safe - built with Rust's ownership model; every `unsafe` block is documented with safety invariants
- Minimal dependencies - only `libc` for POSIX syscalls, no proc macros, no heavy frameworks
- Drop-in `GlobalAlloc` - one line to replace your allocator
## Quick Start

Add to your `Cargo.toml`:

```toml
[dependencies]
numalloc = "0.1"
```
Set it as the global allocator:

```rust
use numalloc::NumaAlloc;

#[global_allocator]
static ALLOC: NumaAlloc = NumaAlloc::new();

fn main() {
    // All allocations now go through NUMAlloc.
    let v: Vec<u64> = vec![1, 2, 3];
    println!("{v:?}");
}
```
## Architecture

NUMAlloc uses a five-layer allocation hierarchy, from fastest to slowest:

```mermaid
graph LR
    A["Thread Cache<br/>(lock-free)"] --> B["Node Heap<br/>(lock-free CAS)"]
    B --> C["Region<br/>(atomic bump)"]
    C --> D["Large Cache<br/>(per-thread)"]
    D --> E["OS mmap"]
```
### Allocation path (small objects, <= 256 KB)

```mermaid
graph TD
    A["malloc(size)"] --> B["Per-Thread Freelist"]
    B -- hit --> C["return pointer<br/>(zero synchronization)"]
    B -- miss --> D["Per-Node Treiber Stack"]
    D -- hit --> E["batch-pop up to 64 objects,<br/>return one"]
    D -- miss --> F["Region Bump Allocator"]
    F --> G["carve bag (32 KB–256 KB),<br/>fill freelist, return one"]
```
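In code, the small-object path is a three-tier fallback. The following is a minimal sketch; `ThreadCache`, `NodeHeap`, and `Region` are hypothetical stand-ins for the crate's internals, and the real names and signatures may differ:

```rust
use std::ptr::NonNull;

const NUM_CLASSES: usize = 8; // illustrative size-class count

struct ThreadCache {
    freelists: [Vec<NonNull<u8>>; NUM_CLASSES],
}
struct NodeHeap; // per-node Treiber stacks, one per size class
struct Region;   // per-node bump allocator over the mapped area

impl NodeHeap {
    fn batch_pop(&self, _class: usize) -> Option<Vec<NonNull<u8>>> {
        None // stand-in: a real hit pops up to 64 linked objects with one CAS
    }
}
impl Region {
    fn carve_bag(&self, class: usize) -> Vec<NonNull<u8>> {
        // Stand-in: carve a 32 KB bag and split it into class-sized objects.
        let obj = 8usize << class;
        let bag: &'static mut [u8] = Box::leak(vec![0u8; 32 * 1024].into_boxed_slice());
        bag.chunks_exact_mut(obj)
            .map(|c| NonNull::new(c.as_mut_ptr()).unwrap())
            .collect()
    }
}

fn alloc_small(
    cache: &mut ThreadCache,
    heap: &NodeHeap,
    region: &Region,
    class: usize,
) -> NonNull<u8> {
    // 1. Per-thread freelist: single-owner, no atomics, no syscalls.
    if let Some(p) = cache.freelists[class].pop() {
        return p;
    }
    // 2. Per-node Treiber stack: one lock-free batch pop refills the cache.
    if let Some(mut batch) = heap.batch_pop(class) {
        let p = batch.pop().expect("hit implies a non-empty batch");
        cache.freelists[class].extend(batch);
        return p;
    }
    // 3. Region bump allocator: carve a fresh bag, keep the rest, return one.
    let mut bag = region.carve_bag(class);
    let p = bag.pop().expect("a bag always yields at least one object");
    cache.freelists[class].extend(bag);
    p
}
```

The point of the batch pop is amortization: one CAS on the shared stack pays for up to 64 subsequent allocations served with zero synchronization.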
### Allocation path (large objects, > 256 KB)

```mermaid
graph TD
    A["malloc(size)"] --> B["Per-Thread Large Cache"]
    B -- hit --> C["reuse cached mmap region<br/>(~7.5 ns)"]
    B -- miss --> D["OS mmap"]
    D --> E["fresh mapping,<br/>mbind to thread's node"]
```
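A sketch of the large-object path under the same caveat: `LargeCache` is an illustrative type, while the `mmap` call is the standard one from the `libc` crate (the `mbind` step is elided here):

```rust
use std::collections::HashMap;
use std::ptr::NonNull;

// Hypothetical per-thread cache of retired mmap regions, keyed by size.
struct LargeCache {
    slots: HashMap<usize, Vec<NonNull<u8>>>,
}

impl LargeCache {
    fn alloc(&mut self, size: usize) -> NonNull<u8> {
        // Hit: reuse a cached mapping, no syscall (~7.5 ns in the benchmarks).
        if let Some(p) = self.slots.get_mut(&size).and_then(|v| v.pop()) {
            return p;
        }
        // Miss: fresh anonymous mapping; NUMAlloc additionally binds it
        // to the calling thread's node via mbind (omitted).
        let ptr = unsafe {
            libc::mmap(
                std::ptr::null_mut(),
                size,
                libc::PROT_READ | libc::PROT_WRITE,
                libc::MAP_PRIVATE | libc::MAP_ANONYMOUS,
                -1,
                0,
            )
        };
        assert_ne!(ptr, libc::MAP_FAILED, "mmap failed");
        NonNull::new(ptr.cast::<u8>()).expect("mmap returned non-null")
    }

    fn free(&mut self, ptr: NonNull<u8>, size: usize) {
        // Cache the mapping for reuse; a real implementation would evict
        // and munmap old entries once the cache is full.
        self.slots.entry(size).or_default().push(ptr);
    }
}
```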
### Deallocation path

```mermaid
graph TD
    A["free(ptr)"] --> B["Compute origin node<br/>(ptr - base) / region_size<br/>O(1), no syscall"]
    B --> C{"Object type?"}
    C -- "small, local" --> D["push to per-thread freelist<br/>(no locks)"]
    C -- "small, remote" --> E["push to origin node's<br/>Treiber stack (lock-free CAS)"]
    C -- "large (> 256 KB)" --> F["cache mmap region for reuse<br/>(evict old if full)"]
```
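Because the heap is one contiguous mapping split evenly across nodes (see the heap layout below), the origin lookup is two integer operations. A minimal sketch with illustrative names:

```rust
use std::ptr::NonNull;

// Layout constants fixed at init (names are illustrative).
struct HeapLayout {
    base: usize,        // start of the single contiguous mapping
    region_size: usize, // bytes assigned to each NUMA node
}

impl HeapLayout {
    // O(1) origin lookup: pure pointer arithmetic, no syscall.
    fn origin_node(&self, ptr: NonNull<u8>) -> usize {
        (ptr.as_ptr() as usize - self.base) / self.region_size
    }
}

fn free(layout: &HeapLayout, ptr: NonNull<u8>, size: usize, my_node: usize) {
    if size > 256 * 1024 {
        // Large: return the mapping to the per-thread large cache.
        // large_cache.free(ptr, size);
        return;
    }
    let origin = layout.origin_node(ptr);
    if origin == my_node {
        // Local small object: push onto the per-thread freelist, no locks.
        // thread_cache.push(ptr);
    } else {
        // Remote small object: lock-free CAS push onto the origin node's
        // Treiber stack, so the block is reused on its home node.
        // node_heaps[origin].push(ptr);
    }
}
```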
### Heap layout

A single contiguous virtual region is mapped at init and divided equally among the NUMA nodes. Each sub-region is bound to its physical node via `mbind`. This design enables origin-node identification through simple integer division on any pointer; a sketch of the binding loop follows the diagram below.
```mermaid
block-beta
columns 8
block:n0["Node 0"]:2
s0["small"]
b0["big"]
end
block:n1["Node 1"]:2
s1["small"]
b1["big"]
end
block:n2["Node 2"]:2
s2["small"]
b2["big"]
end
block:nN["Node N"]:2
sN["small"]
bN["big"]
end
bp0["bpSmall"] space bp1["bpSmall"] space bp2["bpSmall"] space bpN["bpSmall"] space
bp0 --> s0
bp1 --> s1
bp2 --> s2
bpN --> sN
```
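The binding loop might look roughly like the following sketch, using `mbind` from the `libc` crate on Linux. This is not the crate's actual code; error handling is simplified and the maxnode convention is glossed in a comment:

```rust
// Bind each node's slice of the big mapping to that node's physical memory.
// MPOL_BIND restricts page placement to the nodes set in the mask.
unsafe fn bind_regions(base: *mut u8, region_size: usize, num_nodes: usize) {
    for node in 0..num_nodes {
        // SAFETY: caller guarantees the mapping covers num_nodes regions.
        let addr = unsafe { base.add(node * region_size) }.cast::<libc::c_void>();
        let nodemask: libc::c_ulong = 1 << node; // single-node mask
        // SAFETY: addr/len lie inside our own anonymous mapping.
        let rc = unsafe {
            libc::mbind(
                addr,
                region_size as libc::c_ulong,
                libc::MPOL_BIND,
                &nodemask,
                // maxnode is the bit width of the mask; must exceed the
                // highest node number.
                (num_nodes as libc::c_ulong) + 1,
                0, // no MPOL_MF_* flags: pages bind on first touch
            )
        };
        assert_eq!(rc, 0, "mbind failed for node {node}");
    }
}
```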
For the full design document with Mermaid diagrams and benchmark details, see `docs/architecture_design.md`.
## Benchmarks
Single-threaded alloc+dealloc (steady state, lower is better):
| Size | numalloc | system (glibc) | mimalloc | jemalloc |
|---|---|---|---|---|
| 8 B | 7.4 ns | 13.8 ns | 10.9 ns | 9.2 ns |
| 64 B | 7.4 ns | 15.4 ns | 11.1 ns | 9.4 ns |
| 1 KB | 7.4 ns | 31.2 ns | 15.7 ns | 9.8 ns |
| 16 KB | 7.3 ns | 31.7 ns | 19.3 ns | 23.2 ns |
| 64 KB | 7.7 ns | 692 ns | 19.4 ns | 104 ns |
| 256 KB | 8.4 ns | 707 ns | 956 ns | 105 ns |
Bulk alloc+free (1000 items, single-threaded, lower is better):
| Size | numalloc | system | mimalloc | jemalloc |
|---|---|---|---|---|
| 64 B | 8.2 us | 16.8 us | 8.0 us | 17.1 us |
| 4 KB | 24.3 us | 88.1 us | 24.4 us | 46.5 us |
| 64 KB | 57.2 us | 709 us | 43.5 us | 508 us |
| 256 KB | 37.4 us | 704 us | 155 us | 545 us |
Multi-threaded alloc+dealloc (10,000 ops/thread, lower is better):
| Config | numalloc | system | mimalloc | jemalloc |
|---|---|---|---|---|
| 64 B, 4 threads | 153 us | 1.2 ms | 172 us | 158 us |
| 1 KB, 8 threads | 213 us | 2.9 ms | 293 us | 221 us |
| 4 KB, 8 threads | 213 us | 3.3 ms | 337 us | 245 us |
Axum HTTP server benchmark (4 threads, 100 connections, 10s, higher is better):
| Endpoint | numalloc | system | mimalloc |
|---|---|---|---|
| `/small` (~32 B) | 49,124 rps | 48,286 rps | 48,310 rps |
| `/medium` (~256 B) | 49,135 rps | 48,892 rps | 49,017 rps |
| `/large` (~16 KB) | 46,600 rps | 46,244 rps | 46,511 rps |
| `/bulk` (~64 KB) | 37,558 rps | 35,121 rps | 37,155 rps |
Memory footprint after the same load:

| Allocator | RSS after load |
|---|---|
| system | 11 MB |
| numalloc | 16 MB |
| mimalloc | 27 MB |
Run the benchmarks yourself with `cargo bench`, or `cd examples/axum-bench && bash bench.sh` for the HTTP benchmark.
## Design Principles
- Hot path = zero synchronization. Per-thread freelists are single-owner, no atomics, no syscalls.
- Lock-free over locks. Shared per-node heaps use Treiber stacks with CAS, not mutexes; a sketch follows this list.
- Batch everything. Drain and refill operations move objects in bulk via chain-insert, amortizing CAS cost.
- Adaptive caching. Per-thread cache thresholds scale with object size -- small objects cache up to 2048 items, large objects cache 64.
- Intrusive data structures. Free blocks store their `next` pointer in the freed memory itself -- zero extra overhead.
- Explicit safety. Every `unsafe` block carries a `// SAFETY:` comment. `NonNull<T>` over `*mut T`. `UnsafeCell` for interior mutability.
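To make the lock-free and intrusive-structure points concrete, here is a minimal sketch of a tagged Treiber stack. It packs a 16-bit generation counter into the upper bits of the head word so that a stale compare-exchange cannot succeed after the top block has been popped and re-pushed (the ABA problem). Types and bit layout are illustrative, not the crate's actual internals; it assumes 48-bit user pointers as on x86-64:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Intrusive freelist node: the first word of a freed block holds `next`.
#[repr(C)]
struct FreeBlock {
    next: *mut FreeBlock,
}

// Head word = 48-bit pointer | 16-bit generation tag.
struct TaggedStack {
    head: AtomicU64,
}

const TAG_SHIFT: u32 = 48;
const PTR_MASK: u64 = (1 << TAG_SHIFT) - 1;

fn pack(ptr: *mut FreeBlock, tag: u64) -> u64 {
    (ptr as u64 & PTR_MASK) | (tag << TAG_SHIFT)
}
fn unpack(word: u64) -> (*mut FreeBlock, u64) {
    ((word & PTR_MASK) as *mut FreeBlock, word >> TAG_SHIFT)
}

impl TaggedStack {
    fn push(&self, block: *mut FreeBlock) {
        let mut cur = self.head.load(Ordering::Acquire);
        loop {
            let (top, tag) = unpack(cur);
            // Link the new block above the current top...
            unsafe { (*block).next = top };
            // ...and bump the generation so stale CASes fail.
            let next = pack(block, tag.wrapping_add(1) & 0xFFFF);
            match self.head.compare_exchange_weak(
                cur, next, Ordering::AcqRel, Ordering::Acquire,
            ) {
                Ok(_) => return,
                Err(actual) => cur = actual,
            }
        }
    }

    fn pop(&self) -> Option<*mut FreeBlock> {
        let mut cur = self.head.load(Ordering::Acquire);
        loop {
            let (top, tag) = unpack(cur);
            if top.is_null() {
                return None;
            }
            // SAFETY: freed blocks stay mapped inside the allocator's
            // region, so this read is valid; the tag makes it ABA-safe.
            let next_ptr = unsafe { (*top).next };
            let next = pack(next_ptr, tag.wrapping_add(1) & 0xFFFF);
            match self.head.compare_exchange_weak(
                cur, next, Ordering::AcqRel, Ordering::Acquire,
            ) {
                Ok(_) => return Some(top),
                Err(actual) => cur = actual,
            }
        }
    }
}
```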
## When to Use NUMAlloc
Good fit:
- Multi-threaded server applications on NUMA hardware (2+ sockets)
- Workloads with many small allocations and high thread counts
- Environments with transparent huge pages enabled
Not ideal for:
- Single-threaded applications
- Heavy cross-node producer-consumer patterns
- Asymmetric or heterogeneous memory topologies
## Contributing
Contributions are welcome! Please ensure your changes pass:
```sh
cargo fmt
cargo clippy -- -D warnings
cargo test
```
## License
See LICENSE for details.
## References
Based on: Yang et al., "NUMAlloc: A Faster NUMA Memory Allocator," ISMM 2023, ACM.