Cardinal
What does it take to make a database's hot path honest — to measure the cost of a syscall, then earn it back in shared memory?
Most full-stack work stops at the database. Cardinal is what happened when I decided to build the database.
The split is deliberate, not decorative. C++ owns compression, the write-ahead log, chunk storage, and SIMD aggregation — the places where memory layout and cache behavior decide throughput. Go owns the HTTP API, consistent-hash sharding, quorum replication, backpressure, and metrics — the places where developer velocity and concurrency ergonomics win. The two run as separate processes so an engine fault can't take down the API, and so each can be profiled on its own terms.
Running as two processes costs an inter-process hop on every request, so the next move was to measure that cost and buy it back where it mattered. The ingest fast path became a lock-free single-producer/single-consumer ring in shared memory: Go writes slots, C++ drains them, and no syscall or serialization happens per sample. End-to-end it moved ~1.9× faster than an aggressively-batched socket and ~270× faster than an unbatched one — and that is understated, because the benchmark ran on a single core where producer and consumer can't run in parallel.
The second optimization is in the read path. Aggregations reduce over decompressed values with an AVX2 path chosen at runtime, falling back to scalar where the instruction set is absent. On cache-resident chunks — the case that actually matters, since chunks are small — sum runs ~5.7× and max ~11.6× over scalar; on a huge out-of-cache scan the win shrinks to memory-bandwidth limits. Knowing which regime a workload is in is the whole point.
“Running as two processes costs a syscall on every request. The interesting work is measuring that cost, then earning it back.”
The boundary is a flat binary frame
Queries and remote ingest cross a length-prefixed little-endian protocol over a Unix socket; the same record layout maps directly onto shared-memory ring slots, so the fast path reuses the wire format instead of inventing a new one.
Consistency is a knob, and the failure is correct
Writes need a quorum of replica acks; reads are ONE (nearest live replica) or QUORUM (read-quorum plus read-repair). With one replica down, a quorum read returns a 502 by design — refusing to answer is the right behavior, not a bug.
The tests found real bugs
A property test caught a delta-of-delta sign-extension boundary; a fuzz test caught a decoder that trusted a corrupt length field into a 64GB allocation; a threaded stress test under ThreadSanitizer caught a drop counter that conflated backpressure with loss.