Skip to content

anastassow/PMAD

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

63 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

PMAD Architecture Overview

PMAD β€” Predictive Memory Allocator by Dimitar Anastasov

A deterministic, O(1) slab allocator with a provable latency ceiling β€”
for systems where the worst case matters more than the average.

Quick Start Reproducible benchmarks Tests 19/19 License C Platform


Contents

TL;DR Β· Benchmarks Β· When to use PMAD Β· Architecture Β· quicx integration
Quick start Β· Usage & API Β· Limitations Β· Reproducibility


TL;DR

PMAD is a specialized memory allocator in C. It reserves one contiguous pool via a single mmap at startup, splits it into user-defined size classes, and serves every allocation as an O(1) free-list pop β€” zero system calls at runtime, a fixed memory footprint, and latency that is independent of block size. There is no slow path in the code, so the worst case is bounded by construction.

It does not try to beat jemalloc/tcmalloc on raw average. It optimizes for worst-case determinism, and the benchmarks below are head-to-head, on one machine, and fully reproducible.

At a glance (Apple M4 Pro, clang -O3, measured β€” see REPORT.md):

  • Hot-path latency P50 2.59 ns / P99.9 6.50 ns β€” the flattest body of any allocator tested
  • 2.59 ns at every size from 16 B to 4096 B β€” latency independent of block size
  • Under sustained churn, PMAD's worst case stays at ~40 Β΅s while the system allocator blows out to 6.95 ms
  • Throughput 691 Mops/s @ 64 B (small-object hot path)
  • 19/19 correctness checks pass

Standard allocators optimize for average-case throughput. PMAD optimizes for predictable, bounded worst-case latency.


The headline: a flat latency curve

Per-operation latency on the hot path @ 64 B (block-amortised body, K=32; medians, ns/op). The right-most column is the spread from median to three-nines β€” lower is flatter, flatter is more deterministic.

Allocator mean P50 P99 P99.9 P99.9 / P50
PMAD 2.31 2.59 3.91 6.50 2.5Γ—
tcmalloc 3.83 3.91 6.50 7.81 2.0Γ—
mimalloc 3.32 2.62 5.19 9.09 3.5Γ—
jemalloc 7.69 7.81 16.91 144.53 18.5Γ—
system (libmalloc) 16.74 15.62 20.84 239.59 15.3Γ—

PMAD also has the fewest slow operations on the hot path β€” only 83 ppm of ops exceed 100 ns, ~20Γ— fewer than the system allocator and ~9Γ— fewer than jemalloc.

The point in one line: under sustained 1024 B churn, the system allocator's worst case explodes to 6.95 ms and mimalloc's to 73 Β΅s β€” PMAD stays at 40 Β΅s, because the slow path that produces those spikes does not exist in its code.

Apple M4 Pro (8P+4E), macOS, clang -O3 -march=native, no LTO. jemalloc 5.3.0, tcmalloc 2.18.1, mimalloc 3.3.2, system libmalloc β€” all measured head-to-head on the same machine. Full methodology, raw data, and every table: benchmarks/v2/REPORT.md.


When to use PMAD β€” and when not to

βœ… Reach for PMAD when ❌ Look elsewhere when
Objects are small & fixed-size, ≀ 4096 B, sizes known ahead of time You need a general-purpose, multithreaded shared heap (use jemalloc / mimalloc)
You need a provable latency ceiling (RTOS, HFT, frame budgets, packet pools) Allocations are large or variable (> 4 KB, unbounded payloads)
Ownership is single-threaded or shared-nothing per-core You can't predeclare your size classes
A fixed, capped memory footprint is acceptable β€” or desirable You need runtime double-free / use-after-free detection

PMAD is a special-purpose allocator and says so. Honest limitations are listed in their own section.


Benchmarks

Five workloads, each measured head-to-head against jemalloc, tcmalloc, mimalloc, and the macOS system allocator. Summaries below; the full method (13 rigor principles), per-rep spread, and committed raw samples are in REPORT.md.

On naming: macOS ships Apple's libmalloc, not glibc's ptmalloc. The harness is portable and calls plain malloc/free for the system backend, so it measures glibc when compiled on Linux. Below, the macOS system allocator is labelled system, not glibc.

1 Β· Tail-latency distribution (the headline)

The flat curve above. PMAD barely moves from P50 to P99.9 (2.5Γ—) while jemalloc and the system allocator fan out 15–18Γ—; it also has the fewest ops β‰₯ 100 ns (83 ppm). The extreme tail (max, ~30–50 Β΅s) is OS-scheduling-bound and roughly equal for every allocator β€” it is not an allocator discriminator on un-isolated macOS, and we say so rather than claiming it.

2 Β· Latency by block size (the cleanest win)

PMAD's latency is independent of block size β€” the O(1) guarantee, demonstrated:

Allocator 16 B 64 B 256 B 1024 B 4096 B (P50 / P99.9, ns/op)
PMAD 2.59 / 6.50 2.59 / 6.50 2.59 / 6.50 2.59 / 5.19 2.59 / 6.50
tcmalloc 3.91 / 7.81 3.91 / 7.81 3.91 / 13.03 3.91 / 7.81 3.91 / 29.94
mimalloc 2.62 / 10.41 2.62 / 9.09 3.91 / 6.53 3.91 / 10.44 7.81 / 110.69
jemalloc 7.81 / 52.06 7.81 / 144.53 7.81 / 143.22 7.81 / 154.94 9.09 / 15.62
system 15.62 / 222.66 15.62 / 239.59 15.62 / 239.59 19.53 / 237.00 14.34 / 239.56

PMAD's P50 is 2.59 ns at every size; mimalloc's grows with size and its 4096 B tail reaches 110 ns.

3 Β· Churn / fragmentation (the most important real-world test)

Random interleaved alloc/free against a live working set of 262 144 objects, 64 M operations per run β€” this is where general allocators fragment and their tails degrade. Single-op tail @ 1024 B (fraction of ops over threshold; max in ns):

Allocator β‰₯ 100 ns β‰₯ 1 Β΅s β‰₯ 10 Β΅s max
PMAD 2.9 % 150.9 ppm 5.9 ppm 39 875 ns (40 Β΅s)
jemalloc 0.1 % 52.7 ppm 2.2 ppm 34 166 ns
tcmalloc 1.6 % 115.6 ppm 3.7 ppm 48 833 ns
mimalloc 82 % 572.4 ppm 23.9 ppm 73 375 ns
system 95 % 2208.7 ppm 44.9 ppm 6 950 250 ns (6.95 ms)

Per-window P99 across the full 64 M-op run (left = op 0 β†’ right = op 64 M) β€” flat lines mean no upward fragmentation drift:

PMAD      ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁  flat
jemalloc  ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁  flat
tcmalloc  ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁  flat
mimalloc  ▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▂▃▂▂▃▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁  flat-ish
system    β–‡β–‡β–…β–„β–‡β–…β–‡β–…β–ƒβ–‚β–β–‚β–‚β–β–β–„β–„β–„β–„β–„β–…β–„β–„β–…β–‚β–β–β–β–β–‚β–β–‚β–„β–†β–†β–…β–…β–†β–‡β–‡β–ˆβ–‚β–β–β–β–β–β–β–ƒβ–‡β–†β–‡β–‡β–‡β–‡β–‡β–†β–„β–β–  volatile

Read it honestly: PMAD, jemalloc, and tcmalloc all stay flat and bounded; the system allocator and mimalloc fan out catastrophically at 1024 B (95 % and 82 % of ops over 100 ns, millisecond and 73 Β΅s worst cases). PMAD's worst case is 40 Β΅s β€” 174Γ— tighter than the system allocator. Honest caveat: at the smaller 64 B churn, jemalloc and tcmalloc actually have fewer ops β‰₯ 100 ns than PMAD β€” their sharded thread caches have better cache locality than PMAD's single global free list on a large random working set. PMAD's class is "top-tier determinism, dramatically better than the system allocator and mimalloc, competitive with the best on the extreme tail."

4 Β· Throughput (the secondary story)

Sustained operations/second, uninstrumented batch loop (Mops/s):

Allocator 16 B 64 B 256 B 1024 B 4096 B
PMAD 748.9 690.6 96.1 84.5 95.9
mimalloc 467.9 408.1 297.3 84.6 30.4
tcmalloc 267.8 211.8 111.7 93.8 37.5
jemalloc 133.6 125.9 119.1 94.3 48.1
system 134.9 124.0 95.8 38.9 19.7

PMAD wins decisively for small objects (16–64 B fit in cache β†’ register-speed free-list pops). Honest weakness: beyond ~256 B the working set spills out of cache and PMAD's intrusive global free list has weaker spatial locality than mimalloc's segment-local design β€” throughput drops to the ~85–96 Mops band. Throughput is reported second on purpose; determinism is the story.

5 Β· Memory overhead by block size (choose your size)

PMAD keeps metadata in-band β€” a fixed 16-byte header per block. The cost is exact and predictable, not statistical:

Block size 16 B 32 B 64 B 128 B 256 B 512 B 1024 B 2048 B 4096 B
Header overhead 50.0 % 33.3 % 20.0 % 11.1 % 5.88 % 3.03 % 1.54 % 0.78 % 0.39 %

The 16-byte header is significant for tiny objects (50 % at 16 B) and negligible for medium/large ones (≀ 1.5 % at β‰₯ 1024 B). The malloc-family keep metadata out-of-band, so their measured per-object RSS overhead is low for sizes that land on a size class (0.1–2.5 % here) β€” but they round to fixed classes and you can't predict the cost. Second honesty point: PMAD writes a header into every block at init, so it reserves and faults its entire pool upfront β€” RSS equals the configured pool from boot, regardless of utilisation. That fixed footprint is a feature for hard real-time and a cost if you over-provision.


Real-world integration: quicx

quicx is a high-performance task-queue engine in C (epoll/kqueue event loop, custom binary protocol, slab allocator, official Java client), evolving toward a shared-nothing multithreaded dispatcher driven by a reinforcement-learning scheduler. It's a clean fit for PMAD:

  • Shared-nothing per-core β†’ one PMAD pool per worker, no locks needed. PMAD's single-threaded design stops being a limitation and becomes the right shape.
  • User-configured message length maps directly to a PMAD size class β€” PMAD is built around predeclared sizes, so the engine hands it exactly what it needs.
  • A bounded pool is natural backpressure: exhaustion returns NULL, which a task queue must handle anyway β€” and the RL dispatcher can use per-core pool occupancy as a routing signal, steering away from cores near capacity.
  • Deterministic per-op latency β†’ predictable queue SLAs.

Caveat, stated up front: PMAD caps a block at 4096 B. For larger user-configured payloads, raise MAX_SIZE_OF_SIZE_CLASS (cheap up to ~64 KB) or pair PMAD-for-small-buffers with a fallback allocator for large payloads.


How it works

The architecture diagram at the top of this README shows the full picture: the static PMAD struct holds the O(1) lookup table and two pointers into a single mmap pool, whose front holds the MemoryPool header and the SizeClass[] metadata, with the block arena filling the rest.

The layered call path (top β†’ bottom):

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                      User Program                       β”‚
β”‚         pmad_alloc(size)  /  pmad_free(ptr)              β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                         β”‚
              β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
              β”‚   Public API Layer  β”‚    incPMAD.h
              β”‚  (Singleton facade) β”‚    incPMAD.c
              β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                         β”‚
              β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
              β”‚    PMAD Allocator   β”‚    PMAD.h / PMAD.c
              β”‚  Lookup Table β†’ O(1)β”‚
              β”‚  Size Class β†’ Free  β”‚
              β”‚    List pop/push    β”‚
              β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                         β”‚
              β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
              β”‚    Memory Pool      β”‚    MemoryPool.h/.c
              β”‚  mmap'd at init     β”‚    BlockHeader.h/.c
              β”‚  Split by % config  β”‚    SizeClass.h
              β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
  1. Initialization β€” A single mmap reserves the pool, which is split into size classes by your percentages and sizes (fully configurable).
  2. Allocation β€” A lookup table maps the requested size to a class index in O(1); a block is popped from that class's free list.
  3. Deallocation β€” The block header identifies its size class; the block is pushed back onto the free list.
  4. Destruction β€” A single munmap releases all memory at once.

Alloc = lookup-table index + free-list pop. Free = header read + free-list push. That is the entire worst case β€” there is no slow path to audit. For the full deep-dive see Documentation_v1.0.pdf.


Quick start

Prerequisites

Requirement Minimum
C Compiler GCC or Clang with C99 support
OS Linux or macOS (POSIX mmap)
Build Tool GNU Make

Build, test, run

git clone https://github.com/anastassow/PMAD.git
cd PMAD

make            # build the demo
./main          # run it

Reproduce the benchmarks (correctness first)

cd benchmarks/v2
make && make test          # build + correctness suite β€” 19/19 must pass first
./run_all.sh               # full head-to-head suite -> raw/
python3 aggregate.py       # raw/ -> results/tables.md

Competitor allocators (jemalloc, tcmalloc/gperftools, mimalloc) are auto-detected; missing ones are skipped. See REPORT.md for the exact versions and environment.


Usage

#include "incPMAD.h"

int main(void) {
    /* Size classes and each class's share of the pool (%). Shares must sum to 100. */
    size_t classes[]     = { 16, 32, 64, 128, 256 };
    size_t percentages[] = { 10, 20, 20, 20, 30 };

    /* One mmap; zero further syscalls. Returns a PmadStatus. */
    if (pmad_init(classes, 5, percentages, 1024 * 1024) != PMAD_OK)
        return 1;

    int* data = pmad_alloc(sizeof(int) * 4);   /* O(1); NULL if class exhausted */
    if (!data) { pmad_destroy(); return 1; }

    for (int i = 0; i < 4; i++)
        data[i] = i * 10;

    pmad_free(data);     /* O(1) */
    pmad_destroy();      /* single munmap */
    return 0;
}

API reference

Function Returns Description
pmad_init(class_sizes, num_classes, percentages, pool_size) PmadStatus Initialize with custom size classes + pool shares (must sum to 100). One mmap.
pmad_alloc(size) void* Allocate β‰₯ size bytes β€” O(1). Returns NULL on exhaustion or size > 4096.
pmad_free(ptr) PmadStatus Return a block to its free list β€” O(1).
pmad_destroy() void Release the whole pool back to the OS (single munmap).
pmad_get_stats(out, max_classes) int Fill PmadClassStats[] (block size, total/allocated blocks); returns count.

PmadStatus: PMAD_OK, PMAD_ERR_INIT_FAILED, PMAD_ERR_MAP_FAILED, PMAD_ERR_INCOMPLETE_PERCENTAGE, PMAD_ERR_NULL_PTR, PMAD_ERR_INVALID_PTR, PMAD_ERR_CORRUPT_HEADER, PMAD_ERR_OOM.

Limits: alignment 16 B; max block size 4096 B (MAX_SIZE_OF_SIZE_CLASS); up to 32 size classes (MAX_PMAD_CLASSES).


Fully customizable β€” designed around your workload

Unlike jemalloc (fixed classes), tcmalloc (fixed bands), or ptmalloc (dynamic bins), PMAD lets you define exactly which sizes exist and how much memory each gets β€” then computes every block count exactly before a line of application code runs. Everything is set at initialization: size classes, per-class pool share, total pool size.

The included interactive infographics dashboard has a Live Pool Configurator β€” adjust size classes, percentages, and pool size and watch block counts, usable bytes, and per-class utilisation update instantly. Every number is mathematically exact, computed before any code runs.

Example configuration recipes (size classes + split) for common domains β€” tune to your own workload; verify with benchmarks/v2:

Profile Size classes (B) Split (%) Target
Max throughput {16} 100 Small-object velocity
Min overhead {4096} 100 Bulk data density
Balanced {64, 256, 1024} {60, 30, 10} Mixed workloads
Latency-optimised {32, 128} {80, 20} Critical signaling
HFT / network {32, 128, 512, …} {60, 20, …} L3 packet processing
Embedded / RTOS {8, 16, 32, …} {30, 30, …} Deterministic control

Design principles

  • Single allocation, no statistical fragmentation β€” the whole pool is reserved upfront via one mmap. No runtime heap growth; metadata overhead is exact and predictable (see Bench 5), not a statistical unknown.
  • O(1) guaranteed β€” both alloc and free are a lookup-table index and a free-list pop/push. No fallback paths, no locking, no slow path.
  • User-defined memory layout β€” size classes and pool shares are fully configurable for known workload profiles.
  • Minimal, predictable metadata β€” a single 16-byte BlockHeader (next pointer
    • class ID) per block.
  • No external dependencies β€” pure C99 with POSIX mmap.

Limitations & honest tradeoffs

PMAD is a special-purpose allocator. These are real constraints, not footnotes:

  • Single-threaded β€” one global free list, no locks. Safe only per-thread or in a shared-nothing per-core design. (Roadmap: per-thread pools.)
  • Max block 4096 B (MAX_SIZE_OF_SIZE_CLASS) β€” larger blocks need a recompile or a fallback allocator.
  • No double-free / use-after-free detection β€” there is no per-block free bit, so a second pmad_free returns PMAD_OK and corrupts the free list. Documented in the test suite rather than hidden.
  • Reserves & faults the whole pool upfront β€” RSS equals the configured pool from boot, regardless of utilisation (a feature for hard real-time, a cost if over-provisioned).
  • Throughput cliff beyond cache β€” for large-block batches the global free list's spatial locality is weaker than sharded thread-cache allocators (Bench 4).
  • Not the most deterministic under small-block churn β€” jemalloc and tcmalloc edge PMAD on β‰₯ 100 ns op count at 64 B (Bench 3).

Reproducibility

Every number in this README is reproducible. The benchmark harness is one source compiled once per allocator (so swapping allocators changes exactly one variable); raw samples are committed under benchmarks/v2/raw/, and every table derives from them via aggregate.py. The full 13-principle methodology β€” timer-resolution probing, instrument-cost subtraction, optimizer defeat, warmup/pre-faulting, single-op vs block-amortised timing, and the platform caveats β€” is documented in REPORT.md.

cd benchmarks/v2 && make && make test && ./run_all.sh && python3 aggregate.py

Repository structure

PMAD/
β”œβ”€β”€ include/                    # Public & internal headers
β”‚   β”œβ”€β”€ PMAD.h                  #   Core struct & API; ALIGNMENT, MAX_SIZE_OF_SIZE_CLASS
β”‚   β”œβ”€β”€ incPMAD.h               #   Public singleton API + PmadClassStats
β”‚   └── structures/             #   BlockHeader.h, MemoryPool.h, SizeClass.h
β”‚
β”œβ”€β”€ src/                        # Implementation
β”‚   β”œβ”€β”€ PMAD.c                  #   init, alloc, free, lookup table, pool split
β”‚   β”œβ”€β”€ incPMAD.c               #   singleton wrapper β€” public API
β”‚   β”œβ”€β”€ MemoryPool.c            #   pool attachment
β”‚   β”œβ”€β”€ BlockHeader.c           #   block creation & free-list insertion
β”‚   └── SIzeClass.c             #   (reserved for size-class utilities)
β”‚
β”œβ”€β”€ benchmarks/
β”‚   └── v2/                     # ← current head-to-head suite (use this)
β”‚       β”œβ”€β”€ bench.c             #     one harness, compiled once per allocator
β”‚       β”œβ”€β”€ pmad_test.c         #     correctness suite (run first)
β”‚       β”œβ”€β”€ pmad_mem.c          #     exact metadata overhead
β”‚       β”œβ”€β”€ run_all.sh          #     orchestration -> raw/
β”‚       β”œβ”€β”€ aggregate.py        #     raw/ -> results/tables.md
β”‚       β”œβ”€β”€ raw/                #     committed raw samples (every number traces here)
β”‚       β”œβ”€β”€ results/            #     generated tables + overhead CSV
β”‚       └── REPORT.md           #     full methodology + results
β”‚
β”œβ”€β”€ allocator_info_graphics/    # Interactive dashboard + Live Pool Configurator
β”œβ”€β”€ docs/images/                # Architecture diagrams
β”œβ”€β”€ main.c                      # Example / demo entry point
β”œβ”€β”€ Makefile                    # Build system
β”œβ”€β”€ Documentation_v1.0.pdf      # Full technical documentation
└── LICENSE                     # MIT

Contributing

PMAD is free for anyone to use, study, and build on. If you've done something with it β€” an improvement, a port, a bug fix, or a project built on top β€” you're very welcome to send it my way: open a pull request or reach out to Dimitar Anastasov. I'd genuinely love to see what you make with it.


Documentation & license

The full technical write-up is in Documentation_v1.0.pdf β€” this is the v1 documentation. The implementation has changed a little since then (for example, the size-class metadata now lives inside the pool), and an updated version is currently in the works. In the meantime, the most up-to-date details are the benchmark methodology and results in benchmarks/v2/REPORT.md and the interactive dashboard at allocator_info_graphics/allocator_infographics.html.

Licensed under the MIT License β€” see LICENSE.


Built by Dimitar Anastasov Β· 2026

About

Predictice Memory Allocator

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors