A deterministic, O(1) slab allocator with a provable latency ceiling β
for systems where the worst case matters more than the average.
TL;DR Β· Benchmarks Β· When to use PMAD Β· Architecture Β· quicx integration
Quick start Β· Usage & API Β· Limitations Β· Reproducibility
PMAD is a specialized memory allocator in C. It reserves one contiguous pool
via a single mmap at startup, splits it into user-defined size classes, and
serves every allocation as an O(1) free-list pop β zero system calls at
runtime, a fixed memory footprint, and latency that is independent of block
size. There is no slow path in the code, so the worst case is bounded by
construction.
It does not try to beat jemalloc/tcmalloc on raw average. It optimizes for worst-case determinism, and the benchmarks below are head-to-head, on one machine, and fully reproducible.
At a glance (Apple M4 Pro, clang -O3, measured β see REPORT.md):
- Hot-path latency P50 2.59 ns / P99.9 6.50 ns β the flattest body of any allocator tested
- 2.59 ns at every size from 16 B to 4096 B β latency independent of block size
- Under sustained churn, PMAD's worst case stays at ~40 Β΅s while the system allocator blows out to 6.95 ms
- Throughput 691 Mops/s @ 64 B (small-object hot path)
- 19/19 correctness checks pass
Standard allocators optimize for average-case throughput. PMAD optimizes for predictable, bounded worst-case latency.
Per-operation latency on the hot path @ 64 B (block-amortised body, K=32; medians, ns/op). The right-most column is the spread from median to three-nines β lower is flatter, flatter is more deterministic.
| Allocator | mean | P50 | P99 | P99.9 | P99.9 / P50 |
|---|---|---|---|---|---|
| PMAD | 2.31 | 2.59 | 3.91 | 6.50 | 2.5Γ |
| tcmalloc | 3.83 | 3.91 | 6.50 | 7.81 | 2.0Γ |
| mimalloc | 3.32 | 2.62 | 5.19 | 9.09 | 3.5Γ |
| jemalloc | 7.69 | 7.81 | 16.91 | 144.53 | 18.5Γ |
| system (libmalloc) | 16.74 | 15.62 | 20.84 | 239.59 | 15.3Γ |
PMAD also has the fewest slow operations on the hot path β only 83 ppm of ops exceed 100 ns, ~20Γ fewer than the system allocator and ~9Γ fewer than jemalloc.
The point in one line: under sustained 1024 B churn, the system allocator's worst case explodes to 6.95 ms and mimalloc's to 73 Β΅s β PMAD stays at 40 Β΅s, because the slow path that produces those spikes does not exist in its code.
Apple M4 Pro (8P+4E), macOS, clang -O3 -march=native, no LTO. jemalloc 5.3.0,
tcmalloc 2.18.1, mimalloc 3.3.2, system libmalloc β all measured head-to-head on
the same machine. Full methodology, raw data, and every table:
benchmarks/v2/REPORT.md.
| β Reach for PMAD when | β Look elsewhere when |
|---|---|
| Objects are small & fixed-size, β€ 4096 B, sizes known ahead of time | You need a general-purpose, multithreaded shared heap (use jemalloc / mimalloc) |
| You need a provable latency ceiling (RTOS, HFT, frame budgets, packet pools) | Allocations are large or variable (> 4 KB, unbounded payloads) |
| Ownership is single-threaded or shared-nothing per-core | You can't predeclare your size classes |
| A fixed, capped memory footprint is acceptable β or desirable | You need runtime double-free / use-after-free detection |
PMAD is a special-purpose allocator and says so. Honest limitations are listed in their own section.
Five workloads, each measured head-to-head against jemalloc, tcmalloc, mimalloc, and the macOS system allocator. Summaries below; the full method (13 rigor principles), per-rep spread, and committed raw samples are in REPORT.md.
On naming: macOS ships Apple's
libmalloc, not glibc'sptmalloc. The harness is portable and calls plainmalloc/freefor thesystembackend, so it measures glibc when compiled on Linux. Below, the macOS system allocator is labelledsystem, not glibc.
The flat curve above. PMAD barely moves from P50 to P99.9 (2.5Γ) while jemalloc and
the system allocator fan out 15β18Γ; it also has the fewest ops β₯ 100 ns (83 ppm).
The extreme tail (max, ~30β50 Β΅s) is OS-scheduling-bound and roughly equal for
every allocator β it is not an allocator discriminator on un-isolated macOS, and
we say so rather than claiming it.
PMAD's latency is independent of block size β the O(1) guarantee, demonstrated:
| Allocator | 16 B | 64 B | 256 B | 1024 B | 4096 B | (P50 / P99.9, ns/op) |
|---|---|---|---|---|---|---|
| PMAD | 2.59 / 6.50 | 2.59 / 6.50 | 2.59 / 6.50 | 2.59 / 5.19 | 2.59 / 6.50 | |
| tcmalloc | 3.91 / 7.81 | 3.91 / 7.81 | 3.91 / 13.03 | 3.91 / 7.81 | 3.91 / 29.94 | |
| mimalloc | 2.62 / 10.41 | 2.62 / 9.09 | 3.91 / 6.53 | 3.91 / 10.44 | 7.81 / 110.69 | |
| jemalloc | 7.81 / 52.06 | 7.81 / 144.53 | 7.81 / 143.22 | 7.81 / 154.94 | 9.09 / 15.62 | |
| system | 15.62 / 222.66 | 15.62 / 239.59 | 15.62 / 239.59 | 19.53 / 237.00 | 14.34 / 239.56 |
PMAD's P50 is 2.59 ns at every size; mimalloc's grows with size and its 4096 B tail reaches 110 ns.
Random interleaved alloc/free against a live working set of 262 144 objects,
64 M operations per run β this is where general allocators fragment and their tails
degrade. Single-op tail @ 1024 B (fraction of ops over threshold; max in ns):
| Allocator | β₯ 100 ns | β₯ 1 Β΅s | β₯ 10 Β΅s | max |
|---|---|---|---|---|
| PMAD | 2.9 % | 150.9 ppm | 5.9 ppm | 39 875 ns (40 Β΅s) |
| jemalloc | 0.1 % | 52.7 ppm | 2.2 ppm | 34 166 ns |
| tcmalloc | 1.6 % | 115.6 ppm | 3.7 ppm | 48 833 ns |
| mimalloc | 82 % | 572.4 ppm | 23.9 ppm | 73 375 ns |
| system | 95 % | 2208.7 ppm | 44.9 ppm | 6 950 250 ns (6.95 ms) |
Per-window P99 across the full 64 M-op run (left = op 0 β right = op 64 M) β flat lines mean no upward fragmentation drift:
PMAD ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ flat
jemalloc ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ flat
tcmalloc ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ flat
mimalloc ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ flat-ish
system βββ
βββ
ββ
βββββββββββββ
βββ
ββββββββββββ
β
βββββββββββββββββββββββ volatile
Read it honestly: PMAD, jemalloc, and tcmalloc all stay flat and bounded; the system allocator and mimalloc fan out catastrophically at 1024 B (95 % and 82 % of ops over 100 ns, millisecond and 73 Β΅s worst cases). PMAD's worst case is 40 Β΅s β 174Γ tighter than the system allocator. Honest caveat: at the smaller 64 B churn, jemalloc and tcmalloc actually have fewer ops β₯ 100 ns than PMAD β their sharded thread caches have better cache locality than PMAD's single global free list on a large random working set. PMAD's class is "top-tier determinism, dramatically better than the system allocator and mimalloc, competitive with the best on the extreme tail."
Sustained operations/second, uninstrumented batch loop (Mops/s):
| Allocator | 16 B | 64 B | 256 B | 1024 B | 4096 B |
|---|---|---|---|---|---|
| PMAD | 748.9 | 690.6 | 96.1 | 84.5 | 95.9 |
| mimalloc | 467.9 | 408.1 | 297.3 | 84.6 | 30.4 |
| tcmalloc | 267.8 | 211.8 | 111.7 | 93.8 | 37.5 |
| jemalloc | 133.6 | 125.9 | 119.1 | 94.3 | 48.1 |
| system | 134.9 | 124.0 | 95.8 | 38.9 | 19.7 |
PMAD wins decisively for small objects (16β64 B fit in cache β register-speed free-list pops). Honest weakness: beyond ~256 B the working set spills out of cache and PMAD's intrusive global free list has weaker spatial locality than mimalloc's segment-local design β throughput drops to the ~85β96 Mops band. Throughput is reported second on purpose; determinism is the story.
PMAD keeps metadata in-band β a fixed 16-byte header per block. The cost is exact and predictable, not statistical:
| Block size | 16 B | 32 B | 64 B | 128 B | 256 B | 512 B | 1024 B | 2048 B | 4096 B |
|---|---|---|---|---|---|---|---|---|---|
| Header overhead | 50.0 % | 33.3 % | 20.0 % | 11.1 % | 5.88 % | 3.03 % | 1.54 % | 0.78 % | 0.39 % |
The 16-byte header is significant for tiny objects (50 % at 16 B) and negligible for medium/large ones (β€ 1.5 % at β₯ 1024 B). The malloc-family keep metadata out-of-band, so their measured per-object RSS overhead is low for sizes that land on a size class (0.1β2.5 % here) β but they round to fixed classes and you can't predict the cost. Second honesty point: PMAD writes a header into every block at init, so it reserves and faults its entire pool upfront β RSS equals the configured pool from boot, regardless of utilisation. That fixed footprint is a feature for hard real-time and a cost if you over-provision.
quicx is a high-performance task-queue engine in C (epoll/kqueue event loop, custom binary protocol, slab allocator, official Java client), evolving toward a shared-nothing multithreaded dispatcher driven by a reinforcement-learning scheduler. It's a clean fit for PMAD:
- Shared-nothing per-core β one PMAD pool per worker, no locks needed. PMAD's single-threaded design stops being a limitation and becomes the right shape.
- User-configured message length maps directly to a PMAD size class β PMAD is built around predeclared sizes, so the engine hands it exactly what it needs.
- A bounded pool is natural backpressure: exhaustion returns
NULL, which a task queue must handle anyway β and the RL dispatcher can use per-core pool occupancy as a routing signal, steering away from cores near capacity. - Deterministic per-op latency β predictable queue SLAs.
Caveat, stated up front: PMAD caps a block at 4096 B. For larger
user-configured payloads, raise MAX_SIZE_OF_SIZE_CLASS (cheap up to ~64 KB) or
pair PMAD-for-small-buffers with a fallback allocator for large payloads.
The architecture diagram at the top of this README shows the full picture: the
static PMAD struct holds the O(1) lookup table and two pointers into a single
mmap pool, whose front holds the MemoryPool header and the SizeClass[]
metadata, with the block arena filling the rest.
The layered call path (top β bottom):
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β User Program β
β pmad_alloc(size) / pmad_free(ptr) β
ββββββββββββββββββββββββββ¬βββββββββββββββββββββββββββββββββ
β
ββββββββββββΌβββββββββββ
β Public API Layer β incPMAD.h
β (Singleton facade) β incPMAD.c
ββββββββββββ¬βββββββββββ
β
ββββββββββββΌβββββββββββ
β PMAD Allocator β PMAD.h / PMAD.c
β Lookup Table β O(1)β
β Size Class β Free β
β List pop/push β
ββββββββββββ¬βββββββββββ
β
ββββββββββββΌβββββββββββ
β Memory Pool β MemoryPool.h/.c
β mmap'd at init β BlockHeader.h/.c
β Split by % config β SizeClass.h
βββββββββββββββββββββββ
- Initialization β A single
mmapreserves the pool, which is split into size classes by your percentages and sizes (fully configurable). - Allocation β A lookup table maps the requested size to a class index in O(1); a block is popped from that class's free list.
- Deallocation β The block header identifies its size class; the block is pushed back onto the free list.
- Destruction β A single
munmapreleases all memory at once.
Alloc = lookup-table index + free-list pop. Free = header read + free-list push. That is the entire worst case β there is no slow path to audit. For the full deep-dive see
Documentation_v1.0.pdf.
| Requirement | Minimum |
|---|---|
| C Compiler | GCC or Clang with C99 support |
| OS | Linux or macOS (POSIX mmap) |
| Build Tool | GNU Make |
git clone https://github.com/anastassow/PMAD.git
cd PMAD
make # build the demo
./main # run itcd benchmarks/v2
make && make test # build + correctness suite β 19/19 must pass first
./run_all.sh # full head-to-head suite -> raw/
python3 aggregate.py # raw/ -> results/tables.mdCompetitor allocators (jemalloc, tcmalloc/gperftools, mimalloc) are auto-detected; missing ones are skipped. See REPORT.md for the exact versions and environment.
#include "incPMAD.h"
int main(void) {
/* Size classes and each class's share of the pool (%). Shares must sum to 100. */
size_t classes[] = { 16, 32, 64, 128, 256 };
size_t percentages[] = { 10, 20, 20, 20, 30 };
/* One mmap; zero further syscalls. Returns a PmadStatus. */
if (pmad_init(classes, 5, percentages, 1024 * 1024) != PMAD_OK)
return 1;
int* data = pmad_alloc(sizeof(int) * 4); /* O(1); NULL if class exhausted */
if (!data) { pmad_destroy(); return 1; }
for (int i = 0; i < 4; i++)
data[i] = i * 10;
pmad_free(data); /* O(1) */
pmad_destroy(); /* single munmap */
return 0;
}| Function | Returns | Description |
|---|---|---|
pmad_init(class_sizes, num_classes, percentages, pool_size) |
PmadStatus |
Initialize with custom size classes + pool shares (must sum to 100). One mmap. |
pmad_alloc(size) |
void* |
Allocate β₯ size bytes β O(1). Returns NULL on exhaustion or size > 4096. |
pmad_free(ptr) |
PmadStatus |
Return a block to its free list β O(1). |
pmad_destroy() |
void |
Release the whole pool back to the OS (single munmap). |
pmad_get_stats(out, max_classes) |
int |
Fill PmadClassStats[] (block size, total/allocated blocks); returns count. |
PmadStatus: PMAD_OK, PMAD_ERR_INIT_FAILED, PMAD_ERR_MAP_FAILED,
PMAD_ERR_INCOMPLETE_PERCENTAGE, PMAD_ERR_NULL_PTR, PMAD_ERR_INVALID_PTR,
PMAD_ERR_CORRUPT_HEADER, PMAD_ERR_OOM.
Limits: alignment 16 B; max block size 4096 B (MAX_SIZE_OF_SIZE_CLASS); up to
32 size classes (MAX_PMAD_CLASSES).
Unlike jemalloc (fixed classes), tcmalloc (fixed bands), or ptmalloc (dynamic bins), PMAD lets you define exactly which sizes exist and how much memory each gets β then computes every block count exactly before a line of application code runs. Everything is set at initialization: size classes, per-class pool share, total pool size.
The included interactive infographics dashboard has a Live Pool Configurator β adjust size classes, percentages, and pool size and watch block counts, usable bytes, and per-class utilisation update instantly. Every number is mathematically exact, computed before any code runs.
Example configuration recipes (size classes + split) for common domains β tune to your own workload; verify with
benchmarks/v2:
| Profile | Size classes (B) | Split (%) | Target |
|---|---|---|---|
| Max throughput | {16} |
100 |
Small-object velocity |
| Min overhead | {4096} |
100 |
Bulk data density |
| Balanced | {64, 256, 1024} |
{60, 30, 10} |
Mixed workloads |
| Latency-optimised | {32, 128} |
{80, 20} |
Critical signaling |
| HFT / network | {32, 128, 512, β¦} |
{60, 20, β¦} |
L3 packet processing |
| Embedded / RTOS | {8, 16, 32, β¦} |
{30, 30, β¦} |
Deterministic control |
- Single allocation, no statistical fragmentation β the whole pool is reserved
upfront via one
mmap. No runtime heap growth; metadata overhead is exact and predictable (see Bench 5), not a statistical unknown. - O(1) guaranteed β both alloc and free are a lookup-table index and a free-list pop/push. No fallback paths, no locking, no slow path.
- User-defined memory layout β size classes and pool shares are fully configurable for known workload profiles.
- Minimal, predictable metadata β a single 16-byte
BlockHeader(next pointer- class ID) per block.
- No external dependencies β pure C99 with POSIX
mmap.
PMAD is a special-purpose allocator. These are real constraints, not footnotes:
- Single-threaded β one global free list, no locks. Safe only per-thread or in a shared-nothing per-core design. (Roadmap: per-thread pools.)
- Max block 4096 B (
MAX_SIZE_OF_SIZE_CLASS) β larger blocks need a recompile or a fallback allocator. - No double-free / use-after-free detection β there is no per-block free bit, so
a second
pmad_freereturnsPMAD_OKand corrupts the free list. Documented in the test suite rather than hidden. - Reserves & faults the whole pool upfront β RSS equals the configured pool from boot, regardless of utilisation (a feature for hard real-time, a cost if over-provisioned).
- Throughput cliff beyond cache β for large-block batches the global free list's spatial locality is weaker than sharded thread-cache allocators (Bench 4).
- Not the most deterministic under small-block churn β jemalloc and tcmalloc edge PMAD on β₯ 100 ns op count at 64 B (Bench 3).
Every number in this README is reproducible. The benchmark harness is one source
compiled once per allocator (so swapping allocators changes exactly one
variable); raw samples are committed under
benchmarks/v2/raw/, and every table derives from them via
aggregate.py. The full 13-principle methodology β timer-resolution probing,
instrument-cost subtraction, optimizer defeat, warmup/pre-faulting, single-op vs
block-amortised timing, and the platform caveats β is documented in
REPORT.md.
cd benchmarks/v2 && make && make test && ./run_all.sh && python3 aggregate.pyPMAD/
βββ include/ # Public & internal headers
β βββ PMAD.h # Core struct & API; ALIGNMENT, MAX_SIZE_OF_SIZE_CLASS
β βββ incPMAD.h # Public singleton API + PmadClassStats
β βββ structures/ # BlockHeader.h, MemoryPool.h, SizeClass.h
β
βββ src/ # Implementation
β βββ PMAD.c # init, alloc, free, lookup table, pool split
β βββ incPMAD.c # singleton wrapper β public API
β βββ MemoryPool.c # pool attachment
β βββ BlockHeader.c # block creation & free-list insertion
β βββ SIzeClass.c # (reserved for size-class utilities)
β
βββ benchmarks/
β βββ v2/ # β current head-to-head suite (use this)
β βββ bench.c # one harness, compiled once per allocator
β βββ pmad_test.c # correctness suite (run first)
β βββ pmad_mem.c # exact metadata overhead
β βββ run_all.sh # orchestration -> raw/
β βββ aggregate.py # raw/ -> results/tables.md
β βββ raw/ # committed raw samples (every number traces here)
β βββ results/ # generated tables + overhead CSV
β βββ REPORT.md # full methodology + results
β
βββ allocator_info_graphics/ # Interactive dashboard + Live Pool Configurator
βββ docs/images/ # Architecture diagrams
βββ main.c # Example / demo entry point
βββ Makefile # Build system
βββ Documentation_v1.0.pdf # Full technical documentation
βββ LICENSE # MIT
PMAD is free for anyone to use, study, and build on. If you've done something with it β an improvement, a port, a bug fix, or a project built on top β you're very welcome to send it my way: open a pull request or reach out to Dimitar Anastasov. I'd genuinely love to see what you make with it.
The full technical write-up is in Documentation_v1.0.pdf β
this is the v1 documentation. The implementation has changed a little since then
(for example, the size-class metadata now lives inside the pool), and an updated
version is currently in the works. In the meantime, the most up-to-date details are
the benchmark methodology and results in benchmarks/v2/REPORT.md
and the interactive dashboard at
allocator_info_graphics/allocator_infographics.html.
Licensed under the MIT License β see LICENSE.
Built by Dimitar Anastasov Β· 2026