A horizontally-shardable time-series database built as two cleanly separated systems that speak a framed binary protocol over a Unix socket:
carde— a C++17 storage engine: Gorilla compression, a CRC-checked write-ahead log with crash recovery, in-memory head blocks that seal into immutable compressed chunks, and range queries with server-side aggregation.cardinal— a Go control plane: an HTTP/JSON ingest+query API, the binary-protocol engine client, a consistent-hash router, backpressure / load-shedding, a cardinality guard, and Prometheus self-metrics.
flowchart LR
client["clients / cardctl"]
subgraph coordinator["cardinal — Go control plane"]
api["API: admission, backpressure, cardinality"]
router["consistent-hash router (RF / W / R)"]
ec["engine client (UDS)"]
shm["shm ring producer"]
api --> router
router --> ec
router --> shm
end
subgraph nodes["carde — C++ engine nodes"]
eng0["carded #0: Gorilla + WAL + chunks"]
eng1["carded #1"]
eng2["carded #2"]
end
client -->|HTTP JSON| api
ec -->|"framed binary / UDS (query + remote ingest)"| eng0
shm -->|"lock-free SPSC shared memory (local ingest)"| eng0
ec -->|UDS| eng1
ec -->|UDS| eng2
The language split is deliberate and load-bearing: C++ owns the
bytes-and-cache-lines hot path (compression, memory layout, the lock-free ring);
Go owns the I/O-bound, concurrency-heavy, operationally-complex control plane.
See docs/design.md for the full rationale and
docs/protocol.md for the wire format.
Running the engine as its own daemon (rather than linking it into Go via cgo) buys crash isolation (an engine fault can't take down the API, and the WAL makes restarts safe), independent profiling/tuning of each runtime, and a clean seam to later swap the transport for a shared-memory ring buffer without touching the control-plane logic.
A replicated, sharded, multi-node store with a complete tested vertical slice end to end: ingest → routing/replication → binary protocol → compression
- WAL → query → aggregation, with quorum writes, tunable read consistency, a lock-free shared-memory ingest fast path (~1.9× a batched socket), and AVX2-vectorized aggregation with a scalar fallback (5–11× on cache-resident reductions). The roadmap's remaining items (OTel traces across the boundary, compaction/rollups) are documented as next.
# 1. build + test the engine (C++)
make test-engine # cmake build + ctest
make engine-asan # also builds/runs under ASan+UBSan
# 2. test the control plane (Go)
make test-go # go test -race ./...
# 3. run the stack
make run-engine & # carded on /tmp/carde.sock, WAL at /tmp/carde.wal
make run-control & # cardinal on :8080
# 4. write and query
curl -XPOST localhost:8080/api/v1/write -d '{"series":[
{"metric":"cpu","labels":{"host":"a"},"samples":[
{"ts":1000,"value":10},{"ts":2000,"value":20},{"ts":3000,"value":30}]}]}'
curl -XPOST localhost:8080/api/v1/query -d '{
"metric":"cpu","labels":{"host":"a"},"start":0,"end":9999,"agg":"avg"}'
# => {"agg":"avg","scalar":20,"series_id":...}docker compose up --build # single node
docker compose -f docker-compose.cluster.yml up --build # 3-node cluster, RF=2Run one coordinator in front of several engine nodes:
cardinal \
--peer node-0=/tmp/c0.sock --peer node-1=/tmp/c1.sock --peer node-2=/tmp/c2.sock \
--rf 2 --write-quorum 2 --read-consistency quorum- Placement — a series' replica set is the first
RFdistinct nodes clockwise from its id on the consistent-hash ring. Different series start at different positions, so they shard across the cluster while each is replicatedRFways. - Write quorum (
W) — a write fans out to allRFreplicas in parallel and succeeds onceWacks return; otherwise it fails so the client can retry. - Read consistency —
onereads the nearest live replica using the engine's server-side aggregation (low latency);quorumreads RAW from a read-quorum of replicas, reconciles them by union (read-repair), and aggregates at the coordinator (strong). - With
W + R > RFthe read set overlaps the write set → read-your-writes.
The CAP trade-off is real and demonstrable. With RF=2, W=2 and one replica
killed:
| read consistency | result after node loss |
|---|---|
one |
still serves (from the surviving replica) |
quorum |
refuses with 502 (needs 2/2 replicas) |
| method | path | body |
|---|---|---|
GET |
/healthz |
— (pings the engine) |
GET |
/metrics |
Prometheus exposition |
POST |
/api/v1/write |
{"series":[{"metric","labels","samples":[{"ts","value"}]}]} |
POST |
/api/v1/query |
{"metric","labels","start","end","agg"} |
agg ∈ raw, sum, min, max, avg, count, rate.
cardctl write --metric cpu --value 42
cardctl query --metric cpu --agg avg
cardctl status # dumps /metrics
cardctl bench --metric load --n 50000Ingest can take two routes to a co-located engine: the framed binary protocol
over the Unix socket, or a lock-free single-producer/single-consumer ring
buffer in shared memory (/dev/shm) — the Go control plane writes slots, the
C++ engine drains them, and no syscall or serialization happens per sample.
Correctness rests on a release/acquire pairing that holds across processes
(MAP_SHARED + 8-byte-aligned atomics, compatible between Go sync/atomic and
C++ __atomic); it's validated by a threaded SPSC stress test under TSan on
both sides.
Measured end-to-end (samples confirmed applied by the engine, WAL off to isolate transport from fsync), 1,000,000 samples per run, on a single core:
| path | samples/sec | vs unbatched socket |
|---|---|---|
| socket, batch=1 | 42,581 | 1× |
| socket, batch=10 | 400,578 | 9× |
| socket, batch=100 | 2,597,966 | 61× |
| socket, batch=1000 | 6,172,678 | 145× |
| shared-memory ring | 11,606,554 | 272× |
The honest reading: batching amortizes the socket's per-frame syscall cost, so a well-batched client already does ~6M/s. Shared memory removes that cost entirely — ~1.9× faster than the best socket case, ~272× faster than unbatched (streaming one sample at a time). And this is understated: on a single core the producer and consumer time-share one CPU, so they can't run in parallel; on multi-core hardware with a dedicated consumer the gap widens.
What it costs (the part worth saying out loud): shared memory is single-machine only, so it's the fast path for the local node while remote replicas and all queries/admin stay on the socket. It's fire-and-forget — no per-sample durability ack — so a production wiring would have the producer read the consumer's committed index to confirm WAL durability. And the binary layout is hand-managed and must stay byte-identical across two languages. With WAL on, both transports become fsync-bound and the gap shrinks (group commit is the relevant optimization there). Cardinal therefore keeps both and uses each where it wins.
Reproduce:
make bench-shm # builds, starts carded --no-wal --shm, runs the sweepQuery aggregations (sum/min/max/avg) reduce over the decompressed
values with an AVX2 path (four 4-wide accumulators for instruction-level
parallelism) selected at runtime via __builtin_cpu_supports, falling back to
scalar on CPUs without AVX2. Reductions don't need time order, so that branch
also skips the sort and feeds the reducer a contiguous buffer.
Measured (make bench), two regimes:
| reduction | scalar | AVX2 | speedup |
|---|---|---|---|
| sum, L2-cache-resident | 1,623 Melem/s | 9,286 Melem/s | 5.7× |
| max, L2-cache-resident | 807 Melem/s | 9,323 Melem/s | 11.6× |
| sum, 400 MiB from RAM | 1,020 Melem/s | 1,476 Melem/s | 1.4× |
| max, 400 MiB from RAM | 611 Melem/s | 1,562 Melem/s | 2.6× |
The nuance worth stating: decompressed chunks are small and cache-resident, so
the 5–11× cache regime is the one that matters for real aggregation. Over a
huge out-of-cache buffer the reduction is memory-bandwidth-bound (~11 GB/s) and
SIMD only buys 1.4–2.6× — knowing which regime a workload is in is the point.
(sum differs from the scalar sum in the last ULPs because lane-parallel
addition reorders the floating-point rounding; min/max are bit-exact.)
- Gorilla compression — delta-of-delta timestamps + XOR'd floats with a bit-level reader/writer. Regular-interval gauges compress to ~0.9 bytes/sample; a pessimistic high-entropy random walk lands around 7 bytes/sample. Verified by a randomized round-trip property test.
- Crash safety — every acked write is appended (and fsync'd) to the WAL
before the engine returns; replay stops cleanly at the first torn or
CRC-mismatched record. Demonstrated end to end by
SIGKILL-ing the engine mid-run and recovering on restart. - Backpressure — a bounded in-flight semaphore sheds load with
429instead of unbounded queueing; a cardinality guard caps active series. - Consistent hashing — 256 virtual nodes per physical node with a
splitmix64-finalized hash for even placement; removing a node relocates only
~
1/Nof keys (tested).
The test suite isn't decorative — it found two genuine defects during development:
- Sign-extension boundary in delta-of-delta. The timestamp encoder used
the Gorilla paper's asymmetric ranges (
[-63, 64]), which don't round-trip under two's-complement sign extension, sodod == 64decoded as-64. The randomized property test hit the boundary the hand-picked cases missed; fix was symmetric ranges ([-64, 63]). - Untrusted length → OOM. The decoder
reserve()d based on a 32-bit count read straight from the (possibly corrupt) stream, so a garbage frame could request a ~64 GB allocation. AddressSanitizer flagged it via the garbage-input fuzz test; fix bounds the allocation to the input size. - Drop-counter conflated backpressure with loss. The ring's
tryPushbumped a "dropped" counter on every full-ring return — but a producer that retries on full hasn't dropped anything. The threaded SPSC test caught a nonzero drop count under a lossless retry loop; fix madetryPusha pure probe and moved drop accounting into the caller's policy.
engine/ C++ storage engine (carded)
include/carde/ bitstream, compression, wal, engine, protocol, shmring, simd headers
src/ implementations + daemon (UDS server + shm consumer thread)
tests/ dependency-free harness: compression/wal/engine/shmring suites
bench/ compression + ingest microbenchmarks
control/ Go control plane
cmd/cardinal/ API server / coordinator cmd/cardctl/ CLI
cmd/shmbench/ socket-vs-shared-memory ingest benchmark
internal/ engineclient, api, router, cluster, metrics, shmring
docs/ design.md (full spec), protocol.md
.github/ CI: C++ release + ASan/UBSan + TSan matrix, Go race/vet, e2e
make test # engine ctest + go test -raceCI runs the C++ suite under a sanitizer matrix (Release, ASan+UBSan, TSan),
the Go suite with -race and a gofmt gate, and a full end-to-end roundtrip.
See LICENSE.