Skip to content

codejupiter/cardinal

Repository files navigation

Cardinal

ci C++17 Go 1.22 license MIT

A horizontally-shardable time-series database built as two cleanly separated systems that speak a framed binary protocol over a Unix socket:

  • carde — a C++17 storage engine: Gorilla compression, a CRC-checked write-ahead log with crash recovery, in-memory head blocks that seal into immutable compressed chunks, and range queries with server-side aggregation.
  • cardinal — a Go control plane: an HTTP/JSON ingest+query API, the binary-protocol engine client, a consistent-hash router, backpressure / load-shedding, a cardinality guard, and Prometheus self-metrics.
flowchart LR
  client["clients / cardctl"]

  subgraph coordinator["cardinal — Go control plane"]
    api["API: admission, backpressure, cardinality"]
    router["consistent-hash router (RF / W / R)"]
    ec["engine client (UDS)"]
    shm["shm ring producer"]
    api --> router
    router --> ec
    router --> shm
  end

  subgraph nodes["carde — C++ engine nodes"]
    eng0["carded #0: Gorilla + WAL + chunks"]
    eng1["carded #1"]
    eng2["carded #2"]
  end

  client -->|HTTP JSON| api
  ec -->|"framed binary / UDS (query + remote ingest)"| eng0
  shm -->|"lock-free SPSC shared memory (local ingest)"| eng0
  ec -->|UDS| eng1
  ec -->|UDS| eng2
Loading

The language split is deliberate and load-bearing: C++ owns the bytes-and-cache-lines hot path (compression, memory layout, the lock-free ring); Go owns the I/O-bound, concurrency-heavy, operationally-complex control plane. See docs/design.md for the full rationale and docs/protocol.md for the wire format.

Why two processes instead of cgo?

Running the engine as its own daemon (rather than linking it into Go via cgo) buys crash isolation (an engine fault can't take down the API, and the WAL makes restarts safe), independent profiling/tuning of each runtime, and a clean seam to later swap the transport for a shared-memory ring buffer without touching the control-plane logic.

Status

A replicated, sharded, multi-node store with a complete tested vertical slice end to end: ingest → routing/replication → binary protocol → compression

  • WAL → query → aggregation, with quorum writes, tunable read consistency, a lock-free shared-memory ingest fast path (~1.9× a batched socket), and AVX2-vectorized aggregation with a scalar fallback (5–11× on cache-resident reductions). The roadmap's remaining items (OTel traces across the boundary, compaction/rollups) are documented as next.

Quickstart

Local (no Docker)

# 1. build + test the engine (C++)
make test-engine          # cmake build + ctest
make engine-asan          # also builds/runs under ASan+UBSan

# 2. test the control plane (Go)
make test-go              # go test -race ./...

# 3. run the stack
make run-engine &         # carded on /tmp/carde.sock, WAL at /tmp/carde.wal
make run-control &        # cardinal on :8080

# 4. write and query
curl -XPOST localhost:8080/api/v1/write -d '{"series":[
  {"metric":"cpu","labels":{"host":"a"},"samples":[
    {"ts":1000,"value":10},{"ts":2000,"value":20},{"ts":3000,"value":30}]}]}'

curl -XPOST localhost:8080/api/v1/query -d '{
  "metric":"cpu","labels":{"host":"a"},"start":0,"end":9999,"agg":"avg"}'
# => {"agg":"avg","scalar":20,"series_id":...}

Docker

docker compose up --build                          # single node
docker compose -f docker-compose.cluster.yml up --build   # 3-node cluster, RF=2

Clustering (replication, sharding, quorum)

Run one coordinator in front of several engine nodes:

cardinal \
  --peer node-0=/tmp/c0.sock --peer node-1=/tmp/c1.sock --peer node-2=/tmp/c2.sock \
  --rf 2 --write-quorum 2 --read-consistency quorum
  • Placement — a series' replica set is the first RF distinct nodes clockwise from its id on the consistent-hash ring. Different series start at different positions, so they shard across the cluster while each is replicated RF ways.
  • Write quorum (W) — a write fans out to all RF replicas in parallel and succeeds once W acks return; otherwise it fails so the client can retry.
  • Read consistencyone reads the nearest live replica using the engine's server-side aggregation (low latency); quorum reads RAW from a read-quorum of replicas, reconciles them by union (read-repair), and aggregates at the coordinator (strong).
  • With W + R > RF the read set overlaps the write set → read-your-writes.

The CAP trade-off is real and demonstrable. With RF=2, W=2 and one replica killed:

read consistency result after node loss
one still serves (from the surviving replica)
quorum refuses with 502 (needs 2/2 replicas)

HTTP API

method path body
GET /healthz — (pings the engine)
GET /metrics Prometheus exposition
POST /api/v1/write {"series":[{"metric","labels","samples":[{"ts","value"}]}]}
POST /api/v1/query {"metric","labels","start","end","agg"}

aggraw, sum, min, max, avg, count, rate.

CLI

cardctl write  --metric cpu --value 42
cardctl query  --metric cpu --agg avg
cardctl status                       # dumps /metrics
cardctl bench  --metric load --n 50000

Performance: the shared-memory ingest fast path

Ingest can take two routes to a co-located engine: the framed binary protocol over the Unix socket, or a lock-free single-producer/single-consumer ring buffer in shared memory (/dev/shm) — the Go control plane writes slots, the C++ engine drains them, and no syscall or serialization happens per sample. Correctness rests on a release/acquire pairing that holds across processes (MAP_SHARED + 8-byte-aligned atomics, compatible between Go sync/atomic and C++ __atomic); it's validated by a threaded SPSC stress test under TSan on both sides.

Measured end-to-end (samples confirmed applied by the engine, WAL off to isolate transport from fsync), 1,000,000 samples per run, on a single core:

path samples/sec vs unbatched socket
socket, batch=1 42,581
socket, batch=10 400,578
socket, batch=100 2,597,966 61×
socket, batch=1000 6,172,678 145×
shared-memory ring 11,606,554 272×

The honest reading: batching amortizes the socket's per-frame syscall cost, so a well-batched client already does ~6M/s. Shared memory removes that cost entirely — ~1.9× faster than the best socket case, ~272× faster than unbatched (streaming one sample at a time). And this is understated: on a single core the producer and consumer time-share one CPU, so they can't run in parallel; on multi-core hardware with a dedicated consumer the gap widens.

What it costs (the part worth saying out loud): shared memory is single-machine only, so it's the fast path for the local node while remote replicas and all queries/admin stay on the socket. It's fire-and-forget — no per-sample durability ack — so a production wiring would have the producer read the consumer's committed index to confirm WAL durability. And the binary layout is hand-managed and must stay byte-identical across two languages. With WAL on, both transports become fsync-bound and the gap shrinks (group commit is the relevant optimization there). Cardinal therefore keeps both and uses each where it wins.

Reproduce:

make bench-shm     # builds, starts carded --no-wal --shm, runs the sweep

Performance: SIMD aggregation

Query aggregations (sum/min/max/avg) reduce over the decompressed values with an AVX2 path (four 4-wide accumulators for instruction-level parallelism) selected at runtime via __builtin_cpu_supports, falling back to scalar on CPUs without AVX2. Reductions don't need time order, so that branch also skips the sort and feeds the reducer a contiguous buffer.

Measured (make bench), two regimes:

reduction scalar AVX2 speedup
sum, L2-cache-resident 1,623 Melem/s 9,286 Melem/s 5.7×
max, L2-cache-resident 807 Melem/s 9,323 Melem/s 11.6×
sum, 400 MiB from RAM 1,020 Melem/s 1,476 Melem/s 1.4×
max, 400 MiB from RAM 611 Melem/s 1,562 Melem/s 2.6×

The nuance worth stating: decompressed chunks are small and cache-resident, so the 5–11× cache regime is the one that matters for real aggregation. Over a huge out-of-cache buffer the reduction is memory-bandwidth-bound (~11 GB/s) and SIMD only buys 1.4–2.6× — knowing which regime a workload is in is the point. (sum differs from the scalar sum in the last ULPs because lane-parallel addition reorders the floating-point rounding; min/max are bit-exact.)

Design highlights

  • Gorilla compression — delta-of-delta timestamps + XOR'd floats with a bit-level reader/writer. Regular-interval gauges compress to ~0.9 bytes/sample; a pessimistic high-entropy random walk lands around 7 bytes/sample. Verified by a randomized round-trip property test.
  • Crash safety — every acked write is appended (and fsync'd) to the WAL before the engine returns; replay stops cleanly at the first torn or CRC-mismatched record. Demonstrated end to end by SIGKILL-ing the engine mid-run and recovering on restart.
  • Backpressure — a bounded in-flight semaphore sheds load with 429 instead of unbounded queueing; a cardinality guard caps active series.
  • Consistent hashing — 256 virtual nodes per physical node with a splitmix64-finalized hash for even placement; removing a node relocates only ~1/N of keys (tested).

Two real bugs these tests caught

The test suite isn't decorative — it found two genuine defects during development:

  1. Sign-extension boundary in delta-of-delta. The timestamp encoder used the Gorilla paper's asymmetric ranges ([-63, 64]), which don't round-trip under two's-complement sign extension, so dod == 64 decoded as -64. The randomized property test hit the boundary the hand-picked cases missed; fix was symmetric ranges ([-64, 63]).
  2. Untrusted length → OOM. The decoder reserve()d based on a 32-bit count read straight from the (possibly corrupt) stream, so a garbage frame could request a ~64 GB allocation. AddressSanitizer flagged it via the garbage-input fuzz test; fix bounds the allocation to the input size.
  3. Drop-counter conflated backpressure with loss. The ring's tryPush bumped a "dropped" counter on every full-ring return — but a producer that retries on full hasn't dropped anything. The threaded SPSC test caught a nonzero drop count under a lossless retry loop; fix made tryPush a pure probe and moved drop accounting into the caller's policy.

Repository layout

engine/      C++ storage engine (carded)
  include/carde/   bitstream, compression, wal, engine, protocol, shmring, simd headers
  src/             implementations + daemon (UDS server + shm consumer thread)
  tests/           dependency-free harness: compression/wal/engine/shmring suites
  bench/           compression + ingest microbenchmarks
control/     Go control plane
  cmd/cardinal/    API server / coordinator    cmd/cardctl/   CLI
  cmd/shmbench/    socket-vs-shared-memory ingest benchmark
  internal/        engineclient, api, router, cluster, metrics, shmring
docs/        design.md (full spec), protocol.md
.github/     CI: C++ release + ASan/UBSan + TSan matrix, Go race/vet, e2e

Testing

make test        # engine ctest + go test -race

CI runs the C++ suite under a sanitizer matrix (Release, ASan+UBSan, TSan), the Go suite with -race and a gofmt gate, and a full end-to-end roundtrip.

License

See LICENSE.

About

No description, website, or topics provided.

Resources

License

Security policy

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors