Data model
ImplementedEverything marekvs stores lives in the ondaDB data column family as
internal key → envelope + payload. This page defines those byte layouts and
the merge rules that make replication converge — the machinery behind the
published guarantees.
#Partitioning
pid: u16 = (xxh3_64(userkey) >> 52) as u16 // top 12 bits → 0..4095
P = 4096 fixed partitions, chosen at cluster creation and never changed. Every
internal key begins with the big-endian pid, so one partition is one
contiguous ondaDB key range — which makes bootstrap streaming, Merkle digesting,
and handoff cheap prefix operations. For collections, pid derives from the
user key only (not the field or member), so a whole collection lives in one
partition and on one shard thread.
Redis hash tags are implemented and match Redis exactly: if a key contains
{...}, only the bytes inside the first non-empty brace pair are hashed — so
rate:{user1}:count and rate:{user1}:window land on the same partition (and
therefore the same shard thread). This is what lets a multi-key MULTI block or
a Lua script co-locate its keys.
#Internal key layouts
All fields are concatenated with integers big-endian, so memcmp order is key
order. klen is a varint length of the user key, so an element key can never
collide with a user key that happens to embed another.
string [pid:u16] [b's'] [userkey]
collection head [pid:u16] [b'M'] [klen] [userkey]
hash field [pid:u16] [b'h'] [klen] [userkey] [field]
set member [pid:u16] [b'S'] [klen] [userkey] [member]
zset member [pid:u16] [b'z'] [klen] [userkey] [member]
zset score idx [pid:u16] [b'Z'] [klen] [userkey] [score_be:u64] [member]
list element [pid:u16] [b'q'] [klen] [userkey] [pos:u64]
hll register [pid:u16] [b'H'] [klen] [userkey] [bucket:u16]
stream entry [pid:u16] [b'x'] [klen] [userkey] [id_ms:u64] [id_seq:u64]
Per-element keys (hash fields, set/zset members, list elements, HLL registers,
stream entries) pay off three ways: a mutation touches O(element) in the LSM,
not O(collection); every replicated op is naturally a delta, so one HSET
ships one field; and HGETALL / SMEMBERS / ZRANGE become prefix scans that
LSM iterators handle well.
The tag b'l' (a whole-list LWW blob) is retired. Lists are now per-element
records under b'q' — see Lists below. The old tag stays reserved, and
old blobs are neither read nor migrated.
The zset score index is a second key per member, maintained in the same
ondaDB transaction as the member key: the member key holds the score (the LWW
source of truth) and the index key is derived, carrying no payload. The score is
encoded as an order-preserving u64, so ZRANGEBYSCORE is a prefix scan over
b'Z' keys.
#Envelope
A fixed 19-byte header is prefixed to every stored value:
offset size field
0 1 flags: bit0 tombstone
bit1 collection-head
bits 2..4 record type (0 string, 1 hash-field, 2 set-member,
3 zset-member, 4 list, 5 stream-entry,
6 counter, 7 hll-register)
1 8 hlc: u64 big-endian = [phys_ms:48 | logical:16]
9 2 origin: NodeId (u16)
11 8 ttl_deadline_ms: u64 absolute wall-clock ms; 0 = no TTL
19 … payload
The origin node writes the envelope once, and replication ships it
byte-for-byte with no re-stamping, so every replica agrees on the record's
identity. (hlc, origin) is the record version — and, for element adds, its
dot.
A collection head key carries its own small payload: the collection type (hash / set / zset / stream / list / HLL), a whole-collection delete clock, and, for streams, the last id and configuration.
#Hybrid logical clock
marekvs timestamps every write with a Kulkarni hybrid logical clock packed
into a u64: 48 bits of physical milliseconds since the Unix epoch, plus a
16-bit logical counter (65k events per millisecond per node before it borrows a
millisecond).
- Local event:
hlc = max(hlc_prev + 1, wall_ms << 16). - On receive:
hlc = max(hlc_local, hlc_remote) + 1. - A remote HLC more than
max_clock_drift(5 s) ahead of local wall clock is clamped and logged loudly — NTP is assumed on the nodes. - The total order for last-writer-wins is
(hlc, origin); theu16origin breaks exact ties deterministically.
There is one HLC per process (an atomic u64 behind a CAS loop), shared by every
shard.
#Merge rules
The apply path — whether the write is a local command or an incoming replication
op — always reads the current envelope and merges. A blind overwrite never
happens on the replication path. Every merge is commutative, associative, and
idempotent, and those laws are enforced by property tests in
crates/marekvs-core/tests/merge_laws.rs.
#LWW registers
Strings, hash-field values, zset scores, list elements, and collection heads are
last-writer-wins: higher (hlc, origin) wins, equal versions are the same write
(a no-op). A tombstone is just a version with the tombstone flag set, so deletes
and writes race symmetrically. EXPIRE / PERSIST are LWW writes of the
envelope with an unchanged payload.
#Observed-remove elements
Set members, hash fields, and zset members use ORSWOT semantics at per-element granularity. Each element record carries two capped dot lattices:
element payload:
[nlive u8] nlive × [origin u16][hlc u64][vlen varint][value] # live adds
[ncov u8] ncov × [origin u16][hlc u64] # covered (removed) dots
A local add contributes (dot, value) to live; a remove contributes the dots
it observed to covered. Merge is the lattice join —
covered' = cov_a ∪ cov_b, live' = (live_a ∪ live_b) \ covered' — and the
element is dead when live' is empty. A lagging replica pushing an old add
carries the old dot, which covered still holds, so deletes never resurrect;
a concurrent add the remove never observed survives.
The caps keep the top-4 live dots and the top-255 covered dots (top-N-of-union is itself commutative, associative, and idempotent, so the caps preserve the merge laws).
This two-lattice design superseded an earlier single-dot sketch that property testing proved non-associative: a remove covering a dot the local record never held lost its covering information on merge. The test suite caught it.
#Whole-collection delete
DEL myhash writes one LWW tombstone on the head key with a delete clock. An
element is live only if its version is newer than that clock — so a whole
collection dies in one op, with no per-element tombstone fan-out. Stale elements
are filtered on read and reclaimed by the expiry sweeper and anti-entropy.
#Counters (PN counters)
INCR / DECR / INCRBY / DECRBY on a string key produce a PN-counter
record, not a plain string:
- a base register
(base_hlc, base_origin, base_value)set by the last explicitSET(or(0,0,0)for a fresh counter), merged LWW; - per-node delta slots
(pos, neg)— each node only grows its own slot, so slots merge by pointwise max. The value isbase + Σpos − Σneg.
When two nodes share a base version, their slots join and concurrent increments
on different nodes all survive. A different base version (from an explicit
SET) wins wholesale — matching Redis's "SET resets the counter". Reads
materialize the counter as a decimal string, so GET / STRLEN / TYPE behave
normally; APPEND / SETRANGE / GETSET / INCRBYFLOAT freeze it back into a
plain string.
HINCRBY is last-writer-wins on the result, and INCRBYFLOAT is LWW — counter
semantics inside hash fields and float slots are future work. Same-node
concurrency is always exact thanks to shard serialization.
#HyperLogLog
PFADD / PFCOUNT / PFMERGE store a HyperLogLog decomposed into one record
per touched register under a head-gated collection, merged by a one-byte
max-lattice: higher rank wins with its own envelope; equal ranks resolve to the
lower envelope version, so a duplicate PFADD is a strict no-op — no write, no
replication, no anti-entropy churn. Sparse cardinalities materialize only the
registers they touch; DEL writes one head tombstone whose delete clock gates
old registers, so resurrection safety is identical to sets.
The parameters P = 14 (m = 16384) and the xxh3_64 element hash are a frozen
wire format: every node must map an element to the identical (bucket, rank)
forever. Changing either in a rolling upgrade silently corrupts every estimate.
TYPE reports HLL keys as string for tooling compatibility, but GET on one
is a WRONGTYPE error (a documented divergence — Redis exposes the raw sketch
bytes; marekvs has no single blob).
#Streams
XADD generates the entry id at the origin: the millisecond comes from the HLC's
physical clock, and the sequence embeds the origin node id
(seq = (origin << 20) | local_counter), so ids are cluster-unique without
coordination. Merge is a union by id (entries are immutable — a duplicate id is
the same entry). XDEL / XTRIM write per-entry LWW tombstones.
XGROUP / XREADGROUP / XACK and the rest of the consumer-group surface are
not implemented. Streams currently expose only the raw entry operations:
XADD, XLEN, XRANGE, XREVRANGE, XREAD, XDEL, XTRIM.
#Lists
Lists are per-element position-keyed LWW registers: a list is a collection head (like a set or zset) plus one element record per position.
list element [pid][b'q'][klen][userkey][pos:u64 BE] payload = raw value bytes
- Positions.
posis an unsignedu64whose big-endian key suffix makes memcmp order equal list order. The first element lands atLIST_CENTER = 1<<63;LPUSHallocateshead − 1,RPUSHallocatestail + 1, leaving ~2⁶³ of headroom each way. While only the ends are touched, push and pop areO(1)point writes andLRANGEisO(range). - Elements are plain LWW registers, so every push/pop replicates as one delta through the shared write path — no list-specific merge code, no whole-value shipping.
- The head gates the collection: TTL and the whole-collection delete clock
ride the head exactly like a hash or set, and
DELis one head tombstone. - Interior ops rebuild.
LSEToverwrites one position in place (O(index)).LINSERT/LREM/LTRIMcannot open room between adjacent integer positions, so they read the live values, tombstone the old positions, and rewrite the sequence compacted fromLIST_CENTER(O(n), and rare).
Two nodes pushing concurrently can allocate the same position; the element records then merge LWW and one push is lost — but only that one colliding element, not the whole push set. This is a bounded, per-collision loss, and strictly better than the retired whole-list blob (which dropped an entire concurrent push set). Single-node order is exact (shard serialization); ordering across concurrent cross-node writers is best-effort. A true sequence CRDT (RGA) remains future work.
#TTL representation
A Redis TTL is stored as the envelope's absolute ttl_deadline_ms, set once at
the origin and shipped verbatim. It is evaluated locally at read time: once
now >= deadline, the record is treated as a tombstone whose clock is the
deadline, so a stale pre-expiry value loses the merge against expiry and TTL
converges cluster-wide without any expiry message. The record also gets an
ondaDB per-key TTL of deadline + gc_grace as a compaction backstop.
Because hash fields, set members, and zset members are separate records with
their own envelopes, per-member TTL (EXPIREMEMBER, a KeyDB extension) falls out
naturally — each element carries its own deadline.
#What a TYPE check reads
Type resolves from the first matching key: a string key reports string; a head
key reports its collection type. Writing a string where a hash exists (or vice
versa) returns WRONGTYPE from this lookup — one extra point-read per command on
the same shard, usually served from the memtable or block cache.