Cardinal

A horizontally-shardable time-series database built as two cleanly separated systems that speak a framed binary protocol over a Unix socket:

carde — a C++17 storage engine: Gorilla compression, a CRC-checked write-ahead log with crash recovery, in-memory head blocks that seal into immutable compressed chunks, and range queries with server-side aggregation.
cardinal — a Go control plane: an HTTP/JSON ingest+query API, the binary-protocol engine client, a consistent-hash router, backpressure / load-shedding, a cardinality guard, and Prometheus self-metrics.

flowchart LR
  client["clients / cardctl"]

  subgraph coordinator["cardinal — Go control plane"]
    api["API: admission, backpressure, cardinality"]
    router["consistent-hash router (RF / W / R)"]
    ec["engine client (UDS)"]
    shm["shm ring producer"]
    api --> router
    router --> ec
    router --> shm
  end

  subgraph nodes["carde — C++ engine nodes"]
    eng0["carded #0: Gorilla + WAL + chunks"]
    eng1["carded #1"]
    eng2["carded #2"]
  end

  client -->|HTTP JSON| api
  ec -->|"framed binary / UDS (query + remote ingest)"| eng0
  shm -->|"lock-free SPSC shared memory (local ingest)"| eng0
  ec -->|UDS| eng1
  ec -->|UDS| eng2

The language split is deliberate and load-bearing: C++ owns the bytes-and-cache-lines hot path (compression, memory layout, the lock-free ring); Go owns the I/O-bound, concurrency-heavy, operationally-complex control plane. See docs/design.md for the full rationale and docs/protocol.md for the wire format.

Why two processes instead of cgo?

Running the engine as its own daemon (rather than linking it into Go via cgo) buys crash isolation (an engine fault can't take down the API, and the WAL makes restarts safe), independent profiling/tuning of each runtime, and a clean seam to later swap the transport for a shared-memory ring buffer without touching the control-plane logic.

Status

A replicated, sharded, multi-node store with a complete tested vertical slice end to end: ingest → routing/replication → binary protocol → compression

WAL → query → aggregation, with quorum writes, tunable read consistency, a lock-free shared-memory ingest fast path (~1.9× a batched socket), and AVX2-vectorized aggregation with a scalar fallback (5–11× on cache-resident reductions). The roadmap's remaining items (OTel traces across the boundary, compaction/rollups) are documented as next.

Quickstart

Local (no Docker)

# 1. build + test the engine (C++)
make test-engine          # cmake build + ctest
make engine-asan          # also builds/runs under ASan+UBSan

# 2. test the control plane (Go)
make test-go              # go test -race ./...

# 3. run the stack
make run-engine &         # carded on /tmp/carde.sock, WAL at /tmp/carde.wal
make run-control &        # cardinal on :8080

# 4. write and query
curl -XPOST localhost:8080/api/v1/write -d '{"series":[
  {"metric":"cpu","labels":{"host":"a"},"samples":[
    {"ts":1000,"value":10},{"ts":2000,"value":20},{"ts":3000,"value":30}]}]}'

curl -XPOST localhost:8080/api/v1/query -d '{
  "metric":"cpu","labels":{"host":"a"},"start":0,"end":9999,"agg":"avg"}'
# => {"agg":"avg","scalar":20,"series_id":...}

Docker

docker compose up --build                          # single node
docker compose -f docker-compose.cluster.yml up --build   # 3-node cluster, RF=2

Clustering (replication, sharding, quorum)

Run one coordinator in front of several engine nodes:

cardinal \
  --peer node-0=/tmp/c0.sock --peer node-1=/tmp/c1.sock --peer node-2=/tmp/c2.sock \
  --rf 2 --write-quorum 2 --read-consistency quorum

Placement — a series' replica set is the first RF distinct nodes clockwise from its id on the consistent-hash ring. Different series start at different positions, so they shard across the cluster while each is replicated RF ways.
Write quorum (W) — a write fans out to all RF replicas in parallel and succeeds once W acks return; otherwise it fails so the client can retry.
Read consistency — one reads the nearest live replica using the engine's server-side aggregation (low latency); quorum reads RAW from a read-quorum of replicas, reconciles them by union (read-repair), and aggregates at the coordinator (strong).
With W + R > RF the read set overlaps the write set → read-your-writes.

The CAP trade-off is real and demonstrable. With RF=2, W=2 and one replica killed:

read consistency	result after node loss
`one`	still serves (from the surviving replica)
`quorum`	refuses with `502` (needs 2/2 replicas)

HTTP API

method	path	body
`GET`	`/healthz`	— (pings the engine)
`GET`	`/metrics`	Prometheus exposition
`POST`	`/api/v1/write`	`{"series":[{"metric","labels","samples":[{"ts","value"}]}]}`
`POST`	`/api/v1/query`	`{"metric","labels","start","end","agg"}`

agg ∈ raw, sum, min, max, avg, count, rate.

CLI

cardctl write  --metric cpu --value 42
cardctl query  --metric cpu --agg avg
cardctl status                       # dumps /metrics
cardctl bench  --metric load --n 50000

Performance: the shared-memory ingest fast path

Ingest can take two routes to a co-located engine: the framed binary protocol over the Unix socket, or a lock-free single-producer/single-consumer ring buffer in shared memory (/dev/shm) — the Go control plane writes slots, the C++ engine drains them, and no syscall or serialization happens per sample. Correctness rests on a release/acquire pairing that holds across processes (MAP_SHARED + 8-byte-aligned atomics, compatible between Go sync/atomic and C++ __atomic); it's validated by a threaded SPSC stress test under TSan on both sides.

Measured end-to-end (samples confirmed applied by the engine, WAL off to isolate transport from fsync), 1,000,000 samples per run, on a single core:

path	samples/sec	vs unbatched socket
socket, batch=1	42,581	1×
socket, batch=10	400,578	9×
socket, batch=100	2,597,966	61×
socket, batch=1000	6,172,678	145×
shared-memory ring	11,606,554	272×

The honest reading: batching amortizes the socket's per-frame syscall cost, so a well-batched client already does ~6M/s. Shared memory removes that cost entirely — ~1.9× faster than the best socket case, ~272× faster than unbatched (streaming one sample at a time). And this is understated: on a single core the producer and consumer time-share one CPU, so they can't run in parallel; on multi-core hardware with a dedicated consumer the gap widens.

What it costs (the part worth saying out loud): shared memory is single-machine only, so it's the fast path for the local node while remote replicas and all queries/admin stay on the socket. It's fire-and-forget — no per-sample durability ack — so a production wiring would have the producer read the consumer's committed index to confirm WAL durability. And the binary layout is hand-managed and must stay byte-identical across two languages. With WAL on, both transports become fsync-bound and the gap shrinks (group commit is the relevant optimization there). Cardinal therefore keeps both and uses each where it wins.

Reproduce:

make bench-shm     # builds, starts carded --no-wal --shm, runs the sweep

Performance: SIMD aggregation

Query aggregations (sum/min/max/avg) reduce over the decompressed values with an AVX2 path (four 4-wide accumulators for instruction-level parallelism) selected at runtime via __builtin_cpu_supports, falling back to scalar on CPUs without AVX2. Reductions don't need time order, so that branch also skips the sort and feeds the reducer a contiguous buffer.

Measured (make bench), two regimes:

reduction	scalar	AVX2	speedup
sum, L2-cache-resident	1,623 Melem/s	9,286 Melem/s	5.7×
max, L2-cache-resident	807 Melem/s	9,323 Melem/s	11.6×
sum, 400 MiB from RAM	1,020 Melem/s	1,476 Melem/s	1.4×
max, 400 MiB from RAM	611 Melem/s	1,562 Melem/s	2.6×

The nuance worth stating: decompressed chunks are small and cache-resident, so the 5–11× cache regime is the one that matters for real aggregation. Over a huge out-of-cache buffer the reduction is memory-bandwidth-bound (~11 GB/s) and SIMD only buys 1.4–2.6× — knowing which regime a workload is in is the point. (sum differs from the scalar sum in the last ULPs because lane-parallel addition reorders the floating-point rounding; min/max are bit-exact.)

Design highlights

Gorilla compression — delta-of-delta timestamps + XOR'd floats with a bit-level reader/writer. Regular-interval gauges compress to ~0.9 bytes/sample; a pessimistic high-entropy random walk lands around 7 bytes/sample. Verified by a randomized round-trip property test.
Crash safety — every acked write is appended (and fsync'd) to the WAL before the engine returns; replay stops cleanly at the first torn or CRC-mismatched record. Demonstrated end to end by SIGKILL-ing the engine mid-run and recovering on restart.
Backpressure — a bounded in-flight semaphore sheds load with 429 instead of unbounded queueing; a cardinality guard caps active series.
Consistent hashing — 256 virtual nodes per physical node with a splitmix64-finalized hash for even placement; removing a node relocates only ~1/N of keys (tested).

Two real bugs these tests caught

The test suite isn't decorative — it found two genuine defects during development:

Sign-extension boundary in delta-of-delta. The timestamp encoder used the Gorilla paper's asymmetric ranges ([-63, 64]), which don't round-trip under two's-complement sign extension, so dod == 64 decoded as -64. The randomized property test hit the boundary the hand-picked cases missed; fix was symmetric ranges ([-64, 63]).
Untrusted length → OOM. The decoder reserve()d based on a 32-bit count read straight from the (possibly corrupt) stream, so a garbage frame could request a ~64 GB allocation. AddressSanitizer flagged it via the garbage-input fuzz test; fix bounds the allocation to the input size.
Drop-counter conflated backpressure with loss. The ring's tryPush bumped a "dropped" counter on every full-ring return — but a producer that retries on full hasn't dropped anything. The threaded SPSC test caught a nonzero drop count under a lossless retry loop; fix made tryPush a pure probe and moved drop accounting into the caller's policy.

Repository layout

engine/      C++ storage engine (carded)
  include/carde/   bitstream, compression, wal, engine, protocol, shmring, simd headers
  src/             implementations + daemon (UDS server + shm consumer thread)
  tests/           dependency-free harness: compression/wal/engine/shmring suites
  bench/           compression + ingest microbenchmarks
control/     Go control plane
  cmd/cardinal/    API server / coordinator    cmd/cardctl/   CLI
  cmd/shmbench/    socket-vs-shared-memory ingest benchmark
  internal/        engineclient, api, router, cluster, metrics, shmring
docs/        design.md (full spec), protocol.md
.github/     CI: C++ release + ASan/UBSan + TSan matrix, Go race/vet, e2e

Testing

make test        # engine ctest + go test -race

CI runs the C++ suite under a sanitizer matrix (Release, ASan+UBSan, TSan), the Go suite with -race and a gofmt gate, and a full end-to-end roundtrip.

License

See LICENSE.

Name		Name	Last commit message	Last commit date
Latest commit History 15 Commits
.github/workflows		.github/workflows
control		control
docs		docs
engine		engine
.gitignore		.gitignore
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
docker-compose.cluster.yml		docker-compose.cluster.yml
docker-compose.yml		docker-compose.yml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Cardinal

Why two processes instead of cgo?

Status

Quickstart

Local (no Docker)

Docker

Clustering (replication, sharding, quorum)

HTTP API

CLI

Performance: the shared-memory ingest fast path

Performance: SIMD aggregation

Design highlights

Two real bugs these tests caught

Repository layout

Testing

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Cardinal

Why two processes instead of cgo?

Status

Quickstart

Local (no Docker)

Docker

Clustering (replication, sharding, quorum)

HTTP API

CLI

Performance: the shared-memory ingest fast path

Performance: SIMD aggregation

Design highlights

Two real bugs these tests caught

Repository layout

Testing

License

About

Resources

License

Security policy

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages