Consistency & anti-entropy
Implemented · some Plannedmarekvs is eventually consistent, but bounded-eventually: the requirement is that stale records never live for long. This page is the mechanism and the math behind the 15 s worst-case, milliseconds-typical staleness bound the overview promises.
Anti-entropy (AE) runs in two layers, log-first with Merkle as a backstop. Both run only between home replicas — interest replicas get their bound from connection-scoped leases (replication) and never participate in AE.
#Layer 1 — sequence-cursor catch-up
Every node persists applied_seq[origin_node] in the meta CF, updated as
ReplBatches are applied (each batch carries the origin's first_seq). On
(re)connect to a peer, a node sends ResumeFrom{origin: self, seq}; if the
peer's replication ring still
holds that seq, it replays from there.
This is the log-first resume: it covers restarts and brief blips at near-zero cost, and it is exact — no digests, no scanning. Merkle exchange only runs when the log has already rolled past the gap.
#Layer 2 — per-partition Merkle exchange
Per owned partition, a 2-level digest:
- 256 leaf buckets:
bucket = xxh3_64(ikey) & 0xFF. - Bucket digest = XOR-fold of
xxh3(ikey ‖ hlc ‖ value_hash)over the bucket's records. XOR is commutative, so no sorting is needed. Digests are content-aware (value_hash = xxh3(stored bytes)), not just version-aware. - Root = xxh3 over the 256 bucket digests.
Why content-aware. Merged CRDT records (PN counters, HLL) can hold
different slot sets under the same envelope version (symmetric max). An
(ikey, hlc)-only digest would call two divergent replicas identical and repair
would never fire — the chaos-suite clock-skew finding (design/10). Keying the
digest on value_hash too makes equal-version / different-content records repair
in both directions.
Buckets are dirty-marked by the commit hook (a bit set, nothing else) and recomputed lazily by prefix scan at sync time on the shard thread. A bucket is ~1/256 of a partition (≈1/1M of the keyspace), so scan cost is bounded and the hook never needs old values.
#Round protocol
Every ae_round = 5 s (plus jitter), each node walks its owned partitions in
rotation and, for each, picks a random other owner (pairwise, Dynamo-style):
for each owned pid this round:
peer = random other member of owners(pid)
→ MerkleRoot{pid, root}
if roots differ:
← MerkleBuckets{pid, [256 × u64]}
→ per differing bucket: BucketKeys{pid, bucket, [(ikey_hash, hlc, value_hash)]}
↔ push/pull ReplOps for keys whose hlc or content differs or that are missing
Runs on the bulk connection, lz4-framed. Steady-state cost is one 12-byte root
per (pid, round-participation) — tens of KB/s per node even on a 50-node
cluster. Repair cost is proportional to the actual diff, which per-element keys
keep minimal.
#Staleness bound
A committed write can be missing on some home only if its replication push failed (peer down, ring overrun, or membership-view divergence). Then:
staleness ≤ AE rotation delay + round exchange time
rotation delay ≤ ae_round × ceil(owned_pids / ae_partitions_per_round)
Owned pids shrink with cluster size (≈ 4096 × N / n), so the bound is sized
for the worst supported small cluster:
- n ≥ 24: all owned pids fit in one round → bound ≈
2 × ae_round + 1 s ≈ 11 s. - n = 3 (floor): with a per-round cap the design restores ≤ 2 rounds for any cluster size, holding the bound.
Published bound: 15 s worst case (2 rounds + exchange + margin);
typical: replication-push latency, single-digit milliseconds. The single
knob to tighten it is ae_round, and cost scales linearly.
The ae_partitions_per_round per-round cap / auto-scale
(max(512, ceil(owned_pids / 2))) is a design target and is not implemented.
Today every round walks all owned pids, so the ≤ 2-round rotation bound holds
trivially and the published 15 s figure stands — the cap only matters as an
optimization at large partition counts per node.
For interest replicas, the bound is ≤ the home bound plus push latency while connected; after a disconnect it is bounded by liveness detection plus revalidation on the next read. The pathological ceiling (a wedged-open connection) is the 60 s lease timer.
#Tombstone lifecycle & GC safe point
A delete is never a raw storage delete at write time — it is an envelope with the tombstone flag set (and, for element removes, the observed dots). GC works off the storage engine's per-key TTL:
- Every tombstone carries ondaDB per-key TTL =
gc_grace = 1 h, so the engine purges it automatically at the safe point — no sweep needed. - Safety invariant. A replica partitioned or down longer than
gc_gracemay hold data whose covering tombstone was already purged elsewhere; merging it back would resurrect the delete.
The pull-only-until-synced rejoin rule is designed but not yet enforced.
The intended enforcement: on rejoin, if now − last_alive > gc_grace, the
node's home partitions become pull-only — it receives AE repairs but never
pushes, until each partition completes a full Merkle sync against a current home
(its local data is a warm base; only the diff is pulled). Only then does it
regain push eligibility. This is the precise rule that prevents resurrection
across a long absence; today it is not wired into the rejoin path.
Interest replicas cannot resurrect by construction: they never push AE, and lease-gated reads revalidate against homes.
#TTL convergence
- Deadlines are absolute milliseconds, set once at the origin, shipped in
the envelope, never recomputed per hop. Every replica evaluates
now ≥ deadlinelocally → identical convergence modulo NTP-level skew. - An expiry is an implicit tombstone with
hlc = HLC(deadline, 0): any stale pre-expiry version loses the merge against expiry, so no expiry messages are needed and replicas can't disagree for longer than skew. EXPIRE/PERSIST/EXPIREATare ordinary LWW envelope writes and replicate like any write.
ttl_skew_grace (design 5 s) — excluding expired records from Merkle digests
only after deadline + ttl_skew_grace, so skewed replicas don't ping-pong
repairs around the deadline — is unimplemented. Today expiry is materialized
by the sweep as an ordinary tombstone write, with no digest-exclusion grace.
#HLC discipline
The full layout is in the data model; the rules that matter for convergence:
- One process-wide HLC (an atomic
u64). Local event:max(prev + 1, wall << 16). Receive:max(local, remote) + 1. - The receive rule runs at the replication apply point (
apply_opin marekvs-repl): every ingested record's HLC is observed before it is merged. This is load-bearing. Without it, a node with a lagging wall clock that reads a value and then overwrites it stamps the overwrite below the value it causally read, and the overwrite loses LWW everywhere. - LWW total order is
(hlc, origin); equal pairs denote the identical write. - A remote HLC more than
max_clock_drift = 5 sahead of local wall clock is clamped with a loud log. NTP is assumed on k8s nodes; the receive-max rule keeps causally-related updates ordered even under skew.
The clock-skew failure above was found in practice on Apple containers
(per-container VMs with skewed clocks). Docker Compose shares one VM clock and
cannot reproduce it, so the apple-container cluster test (just apple-test) is
also the clock-skew regression test.
#Defaults table
This is the single source of truth for every tunable — other pages reference it and never restate values.
The Where set column is the current-vs-planned oracle:
env VAR— startup environment variable (restart to change) → implemented.const (crate)— compile-time constant (rebuild to change) → implemented.manifest— the k8s pod spec → implemented.design— a design target not yet implemented; the Notes column says what the code does instead. These rows are marked Planned and collected in the callout below the table.
Runtime = adjustable on a live node without a restart. CONFIG SET applies
three live keys — requirepass, lua-time-limit (alias
busy-reply-threshold), and loglevel — and accepts-but-ignores everything
else. All runtime changes are ephemeral: the env is the source of truth again
after a restart (CONFIG REWRITE is a no-op; the k8s manifest is the durable
config).
| Parameter | Default | Where set | Notes |
|---|---|---|---|
| replicas N | 3 | env MAREKVS_REPLICAS_N | per-key homes; also the minimum node floor; must match cluster-wide |
| partitions P | 4096 | const (marekvs-core PARTITIONS) | fixed at cluster creation; u16 prefix, 12 bits used |
| shard threads | cores − 2, min 2 | env MAREKVS_SHARDS | storage/execution threads per node |
| gossip interval | 500 ms | const (marekvs-server) | chitchat |
| failure detection | ~5 s | chitchat defaults | phi-accrual |
| gossip dead-node grace | 1 h | const (marekvs-cluster) | chitchat marked_for_deletion_grace_period |
| ae_round | 5 s + 0–2 s jitter | const (marekvs-repl AE_ROUND) | jitter is uniform 0–2 s |
| ae_partitions_per_round | all owned pids | design | Planned — per-round cap max(512, owned/2) unimplemented; every round walks all owned pids, so the ≤ 2-round rotation bound holds trivially |
| stranded-record AE | every 3rd round | const (marekvs-repl) | push-only roots for non-owned pids with local data (chaos finding, design/10) |
| published staleness bound | 15 s worst / ms typical | derived | derivation above |
| merkle buckets / partition | 256 | const (marekvs-repl BUCKETS) | content-aware digests: (ikey, hlc, value_hash) |
| interest_lease | 60 s | const (marekvs-repl INTEREST_LEASE) | connection-scoped |
| interest renew interval | — | design (15 s) | Planned — InterestRenew msg exists and is handled but never sent; leases refresh by re-fetch on expiry |
| read-through fetch timeout | 300 ms | const (marekvs-repl FETCH_TIMEOUT) | miss → serve local/empty, AE reconciles |
| peer heartbeat / timeout | — | design (1 s / 3 s) | Planned — not implemented; peer liveness = TCP disconnect + gossip failure detection |
| interest_escalate | — | design (4096 keys/pid) | Planned — whole-partition escalation unimplemented |
| interest_max_entries | — | design (1,000,000) | Planned — no cap/LRU on the interest map; expired entries GC'd each AE round |
| replication ring | 128 MiB / 262,144 ops | const (marekvs-repl RING_MAX_*) | overrun → ring gap warning + AE backstop |
| repl batch | 256 ops / pump on notify or 50 ms tick | const (marekvs-repl BATCH_MAX_OPS) | Planned — design byte cap (256 KiB) + 2 ms linger unimplemented |
| per-peer unacked window | — | design (4 MiB) | Planned — AckSeq is received and ignored; no send-window flow control |
| ring high-water persist | 1 s | const (marekvs-repl) | restart resumes seq space +1,000,000 above persisted HW |
| mesh writer queue | 4096 msgs | const (marekvs-repl) | per-peer, per-lane |
| mesh reconnect backoff | 100 ms → 5 s | const (marekvs-repl) | exponential |
| gc_grace | 1 h | const (marekvs-engine GC_GRACE) | tombstone TTL; Planned — the pull-only-until-synced rejoin rule is not yet enforced |
| ttl_skew_grace | — | design (5 s) | Planned — expiry is materialized by the sweep as an ordinary tombstone write; digest-exclusion grace unimplemented |
| expiry sweep budget | 128 records | const (marekvs-engine) | incremental cursor walk between shard jobs |
| max_clock_drift | 5 s | const (marekvs-core MAX_CLOCK_DRIFT_MS) | remote HLC clamp + loud log |
| repair_delay | — | design (30 s + jitter) | Planned — unimplemented; AE repairs fire on the next round |
| bootstrap chunking | 256 ops/chunk, sequential | const (marekvs-repl) | lz4 bulk lane; Planned — design 8 streams / 64 MiB/s rate cap unimplemented |
| cold_purge_delay | — | design (15 m) | Planned — unimplemented; data kept after losing ownership (feeds stranded-record AE) |
| terminationGracePeriodSeconds | 60 | manifest (k8s/statefulset.yaml) | drain typically completes in ~3 s |
| listen addresses | :6379 / :7373 / :7946 / :9121 | env MAREKVS_{RESP,MESH,GOSSIP,METRICS}_ADDR | RESP / mesh / gossip(UDP) / metrics+probes |
| node identity | hostname ordinal, else 0 | env MAREKVS_NODE_ID | marekvs-3 → 3; StatefulSet needs no per-pod config |
| data dir | .data/n0 | env MAREKVS_DATA_DIR | |
| seeds | empty | env MAREKVS_SEEDS | chitchat re-resolves DNS names continuously |
| advertise IP | 127.0.0.1 | env MAREKVS_ADVERTISE_IP | auto = self-detect the pod IP |
| cluster name | marekvs | env MAREKVS_CLUSTER | gossip cluster isolation |
| requirepass | off | env MAREKVS_REQUIREPASS | runtime via CONFIG SET requirepass; new connections need the new password, authenticated sessions stay |
| upstream Redis master | none | env MAREKVS_REPLICAOF | runtime via REPLICAOF/SLAVEOF; live-migration ingest, node stays writable |
| script time limit | 20 ms | env MAREKVS_SCRIPT_TIME_LIMIT_MS | runtime via CONFIG SET lua-time-limit (alias busy-reply-threshold); applies from the next EVAL |
| Lua allocator limit | 16 MiB | const (marekvs-engine) | per script VM |
| blocking-list poll | 50 ms | const (marekvs-engine POLL_MS) | BLPOP/BRPOP wakeup granularity |
| ondaDB sync_mode | Interval, 128 ms | ondadb default | durability window per node |
| log level | info,chitchat=warn | env RUST_LOG | runtime via CONFIG SET loglevel; Redis levels map to tracing, any other value is a raw filter spec |
Planned parameters — designed, not yet in code. Collected here so the table above can stay factual. Each does something specific today (see the row's Notes); the design target is what is missing.
ae_partitions_per_round— per-round cap / auto-scale. Every round walks all owned pids instead.- interest renew interval (15 s) —
InterestRenewis handled but never sent; leases refresh by re-fetch on expiry. - peer heartbeat / timeout (1 s / 3 s) — liveness is TCP disconnect + gossip failure detection.
interest_escalate(4096 keys/pid) — no whole-partition escalation.interest_max_entries(1,000,000) — no cap/LRU on the interest map.- per-peer unacked window /
AckSeqflow control (4 MiB) —AckSeqis received and ignored; no send-window backpressure. - repl batch byte cap (256 KiB) + 2 ms linger — batching is by op count (256) and a 50 ms tick.
ttl_skew_grace(5 s) — no digest-exclusion grace around deadlines.repair_delay(30 s + jitter) — repairs fire on the next AE round.- bootstrap 8 streams / 64 MiB/s rate cap — chunking is 256 ops/chunk, sequential.
cold_purge_delay(15 m) — cold data is kept after ownership loss (feeds stranded-record AE).gc_gracepull-only-until-synced rejoin rule — the resurrection-prevention gate is not yet enforced on rejoin.
#Where to go next
- How writes are pushed before AE ever runs: Replication.
- Join, leave, and failure detection: Cluster membership.
- The guarantees this all backs: Overview.