marekvs

Consistency & anti-entropy

Implemented · some Planned

marekvs is eventually consistent, but bounded-eventually: the requirement is that stale records never live for long. This page describes the mechanism and the math behind the staleness bound the overview promises: 15 s worst case, milliseconds typical.

Anti-entropy (AE) runs in two layers, log-first with Merkle as a backstop. Both run only between home replicas — interest replicas get their bound from connection-scoped leases (replication) and never participate in AE.

#Layer 1 — sequence-cursor catch-up

Every node persists applied_seq[origin_node] in the meta CF, updated as ReplBatches are applied (each batch carries the origin's first_seq). On (re)connect to a peer, a node sends ResumeFrom{origin: self, seq}; if the peer's replication ring still holds that seq, it replays from there.

This is the log-first resume: it covers restarts and brief blips at near-zero cost, and it is exact — no digests, no scanning. Merkle exchange only runs when the log has already rolled past the gap.
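The resume decision can be sketched as follows. This is a minimal Python model, not the marekvs-repl code: `ReplRing`, `append`, and `resume_from` are hypothetical names, the ring is capped by op count only, and the cursor is assumed to be the last applied seq, so replay starts at the next one.

```python
from collections import deque


class ReplRing:
    """Bounded in-memory log of (seq, op) pairs from one origin (hypothetical model)."""

    def __init__(self, max_ops):
        self.ops = deque(maxlen=max_ops)  # oldest entries fall off when full
        self.next_seq = 1

    def append(self, op):
        self.ops.append((self.next_seq, op))
        self.next_seq += 1

    def resume_from(self, applied_seq):
        """Return the ops after applied_seq, or None if the ring rolled past the gap."""
        want = applied_seq + 1
        if want == self.next_seq:
            return []  # the peer is already caught up
        if not self.ops or self.ops[0][0] > want:
            return None  # gap: only the Merkle layer can repair this
        return [(seq, op) for seq, op in self.ops if seq >= want]


ring = ReplRing(max_ops=4)
for i in range(6):
    ring.append(f"op{i}")

# The ring now holds seqs 3..6.
recent = ring.resume_from(applied_seq=4)   # seqs 5 and 6 are replayed exactly
too_old = ring.resume_from(applied_seq=1)  # seq 2 was overwritten, so None
```

A `None` result corresponds to the case where the log has rolled past the gap and Layer 2 has to take over.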

#Layer 2 — per-partition Merkle exchange

Per owned partition, a 2-level digest:

  • 256 leaf buckets: bucket = xxh3_64(ikey) & 0xFF.
  • Bucket digest = XOR-fold of xxh3(ikey ‖ hlc ‖ value_hash) over the bucket's records. XOR is commutative, so no sorting is needed. Digests are content-aware (value_hash = xxh3(stored bytes)), not just version-aware.
  • Root = xxh3 over the 256 bucket digests.
Note

Why content-aware. Merged CRDT records (PN counters, HLL) can hold different slot sets under the same envelope version (symmetric max). An (ikey, hlc)-only digest would call two divergent replicas identical and repair would never fire — the chaos-suite clock-skew finding (design/10). Keying the digest on value_hash too makes equal-version / different-content records repair in both directions.
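The difference between a version-only digest and a content-aware one can be illustrated with a small Python sketch. It is not the marekvs implementation: `h64` uses `hashlib.blake2b` as a stand-in for xxh3, the byte layout of `ikey ‖ hlc ‖ value_hash` is assumed, and all function names are hypothetical.

```python
import hashlib


def h64(data: bytes) -> int:
    """Stand-in 64-bit hash; the real digest uses xxh3."""
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "little")


def bucket_of(ikey: bytes) -> int:
    return h64(ikey) & 0xFF  # 256 leaf buckets


def record_hash(ikey: bytes, hlc: int, value: bytes, content_aware: bool = True) -> int:
    parts = ikey + hlc.to_bytes(8, "little")
    if content_aware:
        parts += h64(value).to_bytes(8, "little")  # value_hash over the stored bytes
    return h64(parts)


def bucket_digests(records, content_aware=True):
    """records: iterable of (ikey, hlc, value). Returns 256 XOR-folded digests."""
    digests = [0] * 256
    for ikey, hlc, value in records:
        digests[bucket_of(ikey)] ^= record_hash(ikey, hlc, value, content_aware)
    return digests


def root(digests) -> int:
    return h64(b"".join(d.to_bytes(8, "little") for d in digests))


# Two replicas hold the same key at the same envelope version with different slot sets.
replica_a = [(b"counter:1", 42, b"slots={n1:3}")]
replica_b = [(b"counter:1", 42, b"slots={n2:5}")]

version_only_equal = (
    root(bucket_digests(replica_a, False)) == root(bucket_digests(replica_b, False))
)
content_aware_equal = root(bucket_digests(replica_a)) == root(bucket_digests(replica_b))
```

With the version-only digest the two roots match and no repair would start; with `value_hash` folded in, the roots differ and the exchange proceeds.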

Buckets are dirty-marked by the commit hook (a bit set, nothing else) and recomputed lazily by prefix scan at sync time on the shard thread. A bucket is ~1/256 of a partition (≈1/1M of the keyspace), so scan cost is bounded and the hook never needs old values.
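A rough model of the dirty-bit scheme, assuming a plain dict in place of the storage engine and a full filter in place of the prefix scan. `PartitionDigests` and its methods are hypothetical names, and `h64` is a blake2b stand-in for xxh3. The 1/1M figure follows from 4096 partitions × 256 buckets = 1,048,576 buckets.

```python
import hashlib

BUCKETS = 256


def h64(data: bytes) -> int:
    """Stand-in 64-bit hash; marekvs uses xxh3."""
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "little")


class PartitionDigests:
    """Hypothetical per-partition digest cache with dirty-bit tracking."""

    def __init__(self, store):
        self.store = store                 # dict: ikey -> (hlc, value)
        self.digests = [0] * BUCKETS
        self.dirty = (1 << BUCKETS) - 1    # everything is dirty until the first sync
        self.scans = 0

    def on_commit(self, ikey: bytes) -> None:
        # Commit hook: set one bit. Only the key is needed, never the old value.
        self.dirty |= 1 << (h64(ikey) & 0xFF)

    def refresh(self) -> None:
        # Sync time: rescan only the buckets whose bit is set.
        for bucket in range(BUCKETS):
            if (self.dirty >> bucket) & 1:
                self.digests[bucket] = self._scan(bucket)
        self.dirty = 0

    def _scan(self, bucket: int) -> int:
        self.scans += 1
        acc = 0
        for ikey, (hlc, value) in self.store.items():
            if (h64(ikey) & 0xFF) == bucket:
                rec = ikey + hlc.to_bytes(8, "little") + h64(value).to_bytes(8, "little")
                acc ^= h64(rec)
        return acc


store = {b"a": (1, b"x"), b"b": (2, b"y")}
pd = PartitionDigests(store)
pd.refresh()                  # initial build: 256 bucket scans
before = list(pd.digests)

store[b"a"] = (3, b"z")
pd.on_commit(b"a")
pd.refresh()                  # only the bucket holding b"a" is rescanned
```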

#Round protocol

Every ae_round (5 s plus jitter), each node walks its owned partitions in rotation and, for each, picks a random other owner (pairwise, Dynamo-style):

text

for each owned pid this round:
    peer = random other member of owners(pid)
    → MerkleRoot{pid, root}
    if roots differ:
        ← MerkleBuckets{pid, [256 × u64]}
        → per differing bucket: BucketKeys{pid, bucket, [(ikey_hash, hlc, value_hash)]}
        ↔ push/pull ReplOps for keys whose hlc or content differs or that are missing
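The exchange above can be simulated in a few lines of Python. This is a sketch under simplifying assumptions, not the wire protocol: both replicas are in-process dicts, the key listing uses the full ikey instead of ikey_hash, repair is plain last-writer-wins on hlc (CRDT merges of equal-hlc records are left out), and `h64` is a blake2b stand-in for xxh3.

```python
import hashlib


def h64(data: bytes) -> int:
    """Stand-in 64-bit hash; marekvs uses xxh3."""
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "little")


def bucket_of(ikey: bytes) -> int:
    return h64(ikey) & 0xFF


def digests(store):
    out = [0] * 256
    for ikey, (hlc, value) in store.items():
        rec = ikey + hlc.to_bytes(8, "little") + h64(value).to_bytes(8, "little")
        out[bucket_of(ikey)] ^= h64(rec)
    return out


def root(ds) -> int:
    return h64(b"".join(d.to_bytes(8, "little") for d in ds))


def ae_exchange(local, peer) -> int:
    """One pairwise exchange for a single pid. Returns the number of repaired keys."""
    mine, theirs = digests(local), digests(peer)
    if root(mine) == root(theirs):
        return 0                                              # MerkleRoot matched
    differing = {i for i in range(256) if mine[i] != theirs[i]}   # MerkleBuckets
    keys = {k for s in (local, peer) for k in s if bucket_of(k) in differing}
    repaired = 0
    for k in keys:                                            # BucketKeys comparison
        local_rec, peer_rec = local.get(k), peer.get(k)
        if local_rec == peer_rec:
            continue
        # Plain LWW on hlc; a missing record always loses.
        present = [r for r in (local_rec, peer_rec) if r is not None]
        winner = max(present, key=lambda r: r[0])
        local[k] = peer[k] = winner                           # push or pull the op
        repaired += 1
    return repaired


replica_a = {b"k1": (10, b"v1"), b"k2": (20, b"old")}
replica_b = {b"k2": (25, b"new"), b"k3": (30, b"v3")}
repaired = ae_exchange(replica_a, replica_b)
```

After the call both replicas hold the same three records, and a second exchange stops at the root comparison.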

Runs on the bulk connection, lz4-framed. Steady-state cost is one 12-byte root per (pid, round-participation) — tens of KB/s per node even on a 50-node cluster. Repair cost is proportional to the actual diff, which per-element keys keep minimal.

#Staleness bound

A committed write can be missing on some home only if its replication push failed (peer down, ring overrun, or membership-view divergence). Then:

text

staleness ≤ AE rotation delay + round exchange time
rotation delay ≤ ae_round × ceil(owned_pids / ae_partitions_per_round)

Owned pids per node shrink with cluster size (≈ 4096 × N / n, for N replicas and n nodes), so the bound is sized for the smallest supported cluster:

  • n ≥ 24: all owned pids fit in one round → bound ≈ 2 × ae_round + 1 s ≈ 11 s.
  • n = 3 (floor): each node owns all 4096 pids, so the design's per-round cap of max(512, ceil(owned_pids / 2)) = 2048 keeps the rotation at ≤ 2 rounds, which holds the bound for any cluster size.

Published bound: 15 s worst case (2 rounds + exchange + margin); typical: replication-push latency, single-digit milliseconds. The single knob to tighten it is ae_round, and cost scales linearly.
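The arithmetic behind these figures can be checked with a short Python calculation. It assumes N = 3, rounds owned pids up with `ceil` (the page only gives an approximation), uses the planned cap formula max(512, ceil(owned_pids / 2)), and takes the 1 s exchange time from the "+ 1 s" term above. The function names are illustrative only.

```python
import math

PARTITIONS = 4096
REPLICAS_N = 3
AE_ROUND_S = 5.0
EXCHANGE_S = 1.0   # assumed exchange time, from the "+ 1 s" term in the bound


def owned_pids(nodes: int) -> int:
    return math.ceil(PARTITIONS * REPLICAS_N / nodes)


def design_cap(owned: int) -> int:
    # Planned auto-scale; today every round walks all owned pids instead.
    return max(512, math.ceil(owned / 2))


def rotation_rounds(owned: int, per_round: int) -> int:
    return math.ceil(owned / per_round)


table = {}
for n in (3, 6, 24, 50):
    owned = owned_pids(n)
    cap = design_cap(owned)
    table[n] = (owned, cap, rotation_rounds(owned, cap))

single_round_bound_s = 2 * AE_ROUND_S + EXCHANGE_S   # the ≈ 11 s figure for n ≥ 24
```

For n = 3 this gives 4096 owned pids, a cap of 2048, and 2 rotation rounds; from n = 24 upward the owned pids fit within the 512 floor and the rotation is a single round.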

Planned

The ae_partitions_per_round per-round cap / auto-scale (max(512, ceil(owned_pids / 2))) is a design target and is not implemented. Today every round walks all owned pids, so the ≤ 2-round rotation bound holds trivially and the published 15 s figure stands — the cap only matters as an optimization at large partition counts per node.

For interest replicas, the bound is ≤ the home bound plus push latency while connected; after a disconnect it is bounded by liveness detection plus revalidation on the next read. The pathological ceiling (a wedged-open connection) is the 60 s lease timer.

#Tombstone lifecycle & GC safe point

A delete is never a raw storage delete at write time — it is an envelope with the tombstone flag set (and, for element removes, the observed dots). GC works off the storage engine's per-key TTL:

  • Every tombstone carries an ondaDB per-key TTL of gc_grace = 1 h, so the engine purges it automatically at the safe point; no sweep is needed.
  • Safety invariant. A replica partitioned or down longer than gc_grace may hold data whose covering tombstone was already purged elsewhere; merging it back would resurrect the deleted data.
Planned

The pull-only-until-synced rejoin rule is designed but not yet enforced.

The intended enforcement: on rejoin, if now − last_alive > gc_grace, the node's home partitions become pull-only — it receives AE repairs but never pushes, until each partition completes a full Merkle sync against a current home (its local data is a warm base; only the diff is pulled). Only then does it regain push eligibility. This is the precise rule that prevents resurrection across a long absence; today it is not wired into the rejoin path.
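Since the rule is not wired in yet, the following is only one possible shape for it, written as a Python sketch. `RejoinGate` and its methods are hypothetical, and time is passed in as plain seconds.

```python
from dataclasses import dataclass, field

GC_GRACE_S = 3600  # gc_grace = 1 h


@dataclass
class RejoinGate:
    """Hypothetical gate for the planned pull-only-until-synced rule."""

    home_pids: set
    pull_only: set = field(default_factory=set)

    def on_rejoin(self, now_s: float, last_alive_s: float) -> None:
        if now_s - last_alive_s > GC_GRACE_S:
            # Tombstones may have been purged elsewhere: stop pushing for every home pid.
            self.pull_only = set(self.home_pids)

    def on_full_merkle_sync(self, pid: int) -> None:
        self.pull_only.discard(pid)   # this partition regains push eligibility

    def may_push(self, pid: int) -> bool:
        return pid not in self.pull_only


gate = RejoinGate(home_pids={1, 2})
gate.on_rejoin(now_s=10_000, last_alive_s=1_000)   # absent 9000 s, longer than gc_grace
blocked = not gate.may_push(1)
gate.on_full_merkle_sync(1)                        # pid 1 completed a full sync
```

Eligibility is tracked per partition, so one slow partition does not hold back the others.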

Interest replicas cannot resurrect by construction: they never push AE, and lease-gated reads revalidate against homes.

#TTL convergence

  • Deadlines are absolute milliseconds, set once at the origin, shipped in the envelope, never recomputed per hop. Every replica evaluates now ≥ deadline locally → identical convergence modulo NTP-level skew.
  • An expiry is an implicit tombstone with hlc = HLC(deadline, 0): any stale pre-expiry version loses the merge against expiry, so no expiry messages are needed and replicas can't disagree for longer than skew.
  • EXPIRE / PERSIST / EXPIREAT are ordinary LWW envelope writes and replicate like any write.
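The implicit-tombstone rule can be illustrated with a small Python sketch. It assumes the HLC packs wall-clock milliseconds above a 16-bit counter (consistent with the wall << 16 rule in the HLC section), represents records as dicts, and compares on hlc alone, leaving out the origin tie-break. `effective` and `lww_merge` are hypothetical helpers.

```python
def hlc(wall_ms: int, counter: int = 0) -> int:
    # Assumed layout: wall clock in the high bits, 16-bit counter in the low bits.
    return (wall_ms << 16) | counter


def effective(record, now_ms):
    """Return what a replica acts on at now_ms: the stored record, or an
    implicit tombstone stamped HLC(deadline, 0) once the deadline has passed."""
    deadline = record["deadline_ms"]
    if deadline is not None and now_ms >= deadline:
        return {"hlc": hlc(deadline, 0), "value": None,
                "deadline_ms": None, "tombstone": True}
    return record


def lww_merge(a, b):
    return a if a["hlc"] >= b["hlc"] else b


written = {"hlc": hlc(1_000, 3), "value": b"v",
           "deadline_ms": 5_000, "tombstone": False}

before = effective(written, now_ms=4_999)   # still live
after = effective(written, now_ms=5_000)    # implicit tombstone
winner = lww_merge(written, after)          # the stale pre-expiry version loses
```

Because the tombstone's stamp is derived from the deadline alone, every replica computes the same value without exchanging an expiry message.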
Planned

ttl_skew_grace (design 5 s) is unimplemented. The design excludes expired records from Merkle digests only after deadline + ttl_skew_grace, so that skewed replicas don't ping-pong repairs around the deadline. Today expiry is materialized by the sweep as an ordinary tombstone write, with no digest-exclusion grace.

#HLC discipline

The full layout is in the data model; the rules that matter for convergence:

  • One process-wide HLC (an atomic u64). Local event: max(prev + 1, wall << 16). Receive: max(local, remote) + 1.
  • The receive rule runs at the replication apply point (apply_op in marekvs-repl): every ingested record's HLC is observed before it is merged. This is load-bearing. Without it, a node with a lagging wall clock that reads a value and then overwrites it stamps the overwrite below the value it causally read, and the overwrite loses LWW everywhere.
  • LWW total order is (hlc, origin); equal pairs denote the identical write.
  • A remote HLC more than max_clock_drift = 5 s ahead of local wall clock is clamped with a loud log. NTP is assumed on k8s nodes; the receive-max rule keeps causally-related updates ordered even under skew.
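The local and receive rules, and the lagging-clock case they protect against, can be sketched in Python. This is an illustration rather than the marekvs-core code: a lock stands in for the atomic u64, the wall clock is injected as a callable, and the clamp target (local wall clock + max_clock_drift) is an assumption, since the page does not say what the value is clamped to.

```python
import threading

MAX_CLOCK_DRIFT_MS = 5_000
COUNTER_BITS = 16


class Hlc:
    """Sketch of a process-wide hybrid logical clock packed into one integer."""

    def __init__(self, wall_ms):
        self._wall_ms = wall_ms          # callable returning wall-clock milliseconds
        self._value = 0
        self._lock = threading.Lock()    # stands in for the atomic u64

    def now(self) -> int:
        """Local event: max(prev + 1, wall << 16)."""
        with self._lock:
            self._value = max(self._value + 1, self._wall_ms() << COUNTER_BITS)
            return self._value

    def observe(self, remote: int) -> int:
        """Receive: max(local, remote) + 1, with the remote clamped first."""
        limit = (self._wall_ms() + MAX_CLOCK_DRIFT_MS) << COUNTER_BITS
        if remote > limit:
            print(f"WARN remote HLC {remote} exceeds max_clock_drift, clamping")
            remote = limit               # assumption: clamp to wall + max_clock_drift
        with self._lock:
            self._value = max(self._value, remote) + 1
            return self._value


remote_stamp = 3_000 << COUNTER_BITS          # written by a node whose clock is 2 s ahead

lagging = Hlc(wall_ms=lambda: 1_000)
lagging.observe(remote_stamp)                 # the apply point sees the record first
overwrite = lagging.now()                     # stamped above what was causally read

naive = Hlc(wall_ms=lambda: 1_000).now()      # same node without the receive rule
```

`overwrite` sorts after `remote_stamp`, so the overwrite wins LWW; `naive` sorts before it, which is the failure the receive rule prevents.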
Note

The clock-skew failure above was found in practice on Apple containers (per-container VMs with skewed clocks). Docker Compose shares one VM clock and cannot reproduce it, so the apple-container cluster test (just apple-test) is also the clock-skew regression test.

#Defaults table

This is the single source of truth for every tunable — other pages reference it and never restate values.

The Where set column is the current-vs-planned oracle:

  • env VAR — startup environment variable (restart to change) → implemented.
  • const (crate) — compile-time constant (rebuild to change) → implemented.
  • manifest — the k8s pod spec → implemented.
  • design — a design target not yet implemented; the Notes column says what the code does instead. These rows are marked Planned and collected in the callout below the table.

Runtime = adjustable on a live node without a restart. CONFIG SET applies three live keys — requirepass, lua-time-limit (alias busy-reply-threshold), and loglevel — and accepts-but-ignores everything else. All runtime changes are ephemeral: the env is the source of truth again after a restart (CONFIG REWRITE is a no-op; the k8s manifest is the durable config).
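The runtime rules in the previous paragraph can be modelled in a few lines of Python. `RuntimeConfig` is a hypothetical illustration of the described behaviour, not the server's config code; only the three key names and the alias come from this page.

```python
LIVE_KEYS = {"requirepass", "lua-time-limit", "loglevel"}
ALIASES = {"busy-reply-threshold": "lua-time-limit"}


class RuntimeConfig:
    """Hypothetical model of the CONFIG SET rules described above."""

    def __init__(self, env_defaults):
        self.env_defaults = dict(env_defaults)  # values derived from the environment
        self.live = {}                          # ephemeral runtime overrides

    def config_set(self, key, value):
        key = ALIASES.get(key.lower(), key.lower())
        if key in LIVE_KEYS:
            self.live[key] = value
        return "OK"  # every other key is accepted but ignored

    def get(self, key):
        return self.live.get(key, self.env_defaults.get(key))

    def restart(self):
        self.live.clear()  # the environment is the source of truth again


cfg = RuntimeConfig({"loglevel": "info"})
cfg.config_set("loglevel", "debug")            # applied live
cfg.config_set("busy-reply-threshold", "50")   # alias for lua-time-limit
ignored = cfg.config_set("maxmemory", "1gb")   # accepted, but nothing changes
```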

| Parameter | Default | Where set | Notes |
|---|---|---|---|
| replicas N | 3 | env MAREKVS_REPLICAS_N | per-key homes; also the minimum node floor; must match cluster-wide |
| partitions P | 4096 | const (marekvs-core PARTITIONS) | fixed at cluster creation; u16 prefix, 12 bits used |
| shard threads | cores − 2, min 2 | env MAREKVS_SHARDS | storage/execution threads per node |
| gossip interval | 500 ms | const (marekvs-server) | chitchat |
| failure detection | ~5 s | chitchat defaults | phi-accrual |
| gossip dead-node grace | 1 h | const (marekvs-cluster) | chitchat marked_for_deletion_grace_period |
| ae_round | 5 s + 0–2 s jitter | const (marekvs-repl AE_ROUND) | jitter is uniform 0–2 s |
| ae_partitions_per_round | all owned pids | design | Planned: per-round cap max(512, owned/2) unimplemented; every round walks all owned pids, so the ≤ 2-round rotation bound holds trivially |
| stranded-record AE | every 3rd round | const (marekvs-repl) | push-only roots for non-owned pids with local data (chaos finding, design/10) |
| published staleness bound | 15 s worst / ms typical | derived | derivation above |
| merkle buckets / partition | 256 | const (marekvs-repl BUCKETS) | content-aware digests: (ikey, hlc, value_hash) |
| interest_lease | 60 s | const (marekvs-repl INTEREST_LEASE) | connection-scoped |
| interest renew interval | | design (15 s) | Planned: InterestRenew msg exists and is handled but never sent; leases refresh by re-fetch on expiry |
| read-through fetch timeout | 300 ms | const (marekvs-repl FETCH_TIMEOUT) | miss → serve local/empty, AE reconciles |
| peer heartbeat / timeout | | design (1 s / 3 s) | Planned: not implemented; peer liveness = TCP disconnect + gossip failure detection |
| interest_escalate | | design (4096 keys/pid) | Planned: whole-partition escalation unimplemented |
| interest_max_entries | | design (1,000,000) | Planned: no cap/LRU on the interest map; expired entries GC'd each AE round |
| replication ring | 128 MiB / 262,144 ops | const (marekvs-repl RING_MAX_*) | overrun → ring gap warning + AE backstop |
| repl batch | 256 ops / pump on notify or 50 ms tick | const (marekvs-repl BATCH_MAX_OPS) | Planned: design byte cap (256 KiB) + 2 ms linger unimplemented |
| per-peer unacked window | | design (4 MiB) | Planned: AckSeq is received and ignored; no send-window flow control |
| ring high-water persist | 1 s | const (marekvs-repl) | restart resumes seq space +1,000,000 above persisted HW |
| mesh writer queue | 4096 msgs | const (marekvs-repl) | per-peer, per-lane |
| mesh reconnect backoff | 100 ms → 5 s | const (marekvs-repl) | exponential |
| gc_grace | 1 h | const (marekvs-engine GC_GRACE) | tombstone TTL; Planned: the pull-only-until-synced rejoin rule is not yet enforced |
| ttl_skew_grace | | design (5 s) | Planned: expiry is materialized by the sweep as an ordinary tombstone write; digest-exclusion grace unimplemented |
| expiry sweep budget | 128 records | const (marekvs-engine) | incremental cursor walk between shard jobs |
| max_clock_drift | 5 s | const (marekvs-core MAX_CLOCK_DRIFT_MS) | remote HLC clamp + loud log |
| repair_delay | | design (30 s + jitter) | Planned: unimplemented; AE repairs fire on the next round |
| bootstrap chunking | 256 ops/chunk, sequential | const (marekvs-repl) | lz4 bulk lane; Planned: design 8 streams / 64 MiB/s rate cap unimplemented |
| cold_purge_delay | | design (15 m) | Planned: unimplemented; data kept after losing ownership (feeds stranded-record AE) |
| terminationGracePeriodSeconds | 60 | manifest (k8s/statefulset.yaml) | drain typically completes in ~3 s |
| listen addresses | :6379 / :7373 / :7946 / :9121 | env MAREKVS_{RESP,MESH,GOSSIP,METRICS}_ADDR | RESP / mesh / gossip(UDP) / metrics+probes |
| node identity | hostname ordinal, else 0 | env MAREKVS_NODE_ID | marekvs-3 → 3; StatefulSet needs no per-pod config |
| data dir | .data/n0 | env MAREKVS_DATA_DIR | |
| seeds | empty | env MAREKVS_SEEDS | chitchat re-resolves DNS names continuously |
| advertise IP | 127.0.0.1 | env MAREKVS_ADVERTISE_IP | auto = self-detect the pod IP |
| cluster name | marekvs | env MAREKVS_CLUSTER | gossip cluster isolation |
| requirepass | off | env MAREKVS_REQUIREPASS | runtime via CONFIG SET requirepass; new connections need the new password, authenticated sessions stay |
| upstream Redis master | none | env MAREKVS_REPLICAOF | runtime via REPLICAOF/SLAVEOF; live-migration ingest, node stays writable |
| script time limit | 20 ms | env MAREKVS_SCRIPT_TIME_LIMIT_MS | runtime via CONFIG SET lua-time-limit (alias busy-reply-threshold); applies from the next EVAL |
| Lua allocator limit | 16 MiB | const (marekvs-engine) | per script VM |
| blocking-list poll | 50 ms | const (marekvs-engine POLL_MS) | BLPOP/BRPOP wakeup granularity |
| ondaDB sync_mode | Interval, 128 ms | ondadb default | durability window per node |
| log level | info,chitchat=warn | env RUST_LOG | runtime via CONFIG SET loglevel; Redis levels map to tracing, any other value is a raw filter spec |

Planned

Planned parameters — designed, not yet in code. Collected here so the table above can stay factual. Each does something specific today (see the row's Notes); the design target is what is missing.

  • ae_partitions_per_round — per-round cap / auto-scale. Every round walks all owned pids instead.
  • interest renew interval (15 s) — InterestRenew is handled but never sent; leases refresh by re-fetch on expiry.
  • peer heartbeat / timeout (1 s / 3 s) — liveness is TCP disconnect + gossip failure detection.
  • interest_escalate (4096 keys/pid) — no whole-partition escalation.
  • interest_max_entries (1,000,000) — no cap/LRU on the interest map.
  • per-peer unacked window / AckSeq flow control (4 MiB) — AckSeq is received and ignored; no send-window backpressure.
  • repl batch byte cap (256 KiB) + 2 ms linger — batching is by op count (256) and a 50 ms tick.
  • ttl_skew_grace (5 s) — no digest-exclusion grace around deadlines.
  • repair_delay (30 s + jitter) — repairs fire on the next AE round.
  • bootstrap 8 streams / 64 MiB/s rate cap — chunking is 256 ops/chunk, sequential.
  • cold_purge_delay (15 m) — cold data is kept after ownership loss (feeds stranded-record AE).
  • gc_grace pull-only-until-synced rejoin rule — the resurrection-prevention gate is not yet enforced on rejoin.

#Where to go next