snapshotting for indexers #13

Open
mohammed-deepinfra wants to merge 1 commit into deep-main-v1.3.0rc16 from kv-indexer-snapshot

@mohammed-deepinfra

Description

The engine's KV-event ZMQ tee (KvZmqPublisher) broadcasts KV-cache events on a zmq.PUB socket so a Dynamo standalone indexer can reconstruct the cache tree. PUB/SUB is lossy, so a subscriber that detects a sequence gap asks the replay ROUTER socket to resend from start_seq. Until now replay could only answer from the in-memory ring buffer (the last buffer_steps batches).
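To make the gap-detection step concrete, here is a minimal subscriber-side sketch. `GapDetector` and `observe` are hypothetical names used for illustration only; the real indexer's bookkeeping may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GapDetector:
    """Tracks the next expected sequence number on a lossy PUB/SUB stream."""

    expected_seq: int = 0

    def observe(self, seq: int) -> Optional[int]:
        """Return the start_seq to request via replay, or None if no replay is needed."""
        if seq < self.expected_seq:
            return None  # stale or duplicate batch, nothing to do
        if seq > self.expected_seq:
            # Gap: batches [expected_seq, seq) were lost. Ask for a resend
            # from the first missing one and leave expected_seq unchanged
            # until the replay has filled the hole.
            return self.expected_seq
        self.expected_seq = seq + 1  # in-order batch
        return None
```

A cold-joining indexer starts with `expected_seq = 0`, so the first live batch it sees (for example seq 432793) yields a replay request from 0.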

The bug this fixes: when the requested start_seq had already been evicted from the ring, the old code streamed only whatever was left in the buffer — leaving a hole at the front. This is exactly what happens when an indexer cold-joins (empty tree, requests replay from seq 0) or falls behind the ring window. Every block whose parent landed in the missing range was then rejected with ParentBlockNotFound, and the subscriber's tree collapsed in a cascade.

The fix: when start_seq predates the ring buffer, the publisher now serves a full snapshot of its current tree instead of a hole.

  • The single consumer thread maintains a compact reconstructed tree alongside the ring buffer: kv_snapshot: {block_hash -> (tokens_hash, parent_hash, depth)}, kept in lockstep with the live stream (BlockStored adds, BlockRemoved / AllBlocksCleared prune). It lives on the same thread that owns the sockets and the buffer, so no lock is needed.
  • _service_replay now branches: start_seq still in the ring → incremental replay as before; start_seq below the ring floor → _send_snapshot.
  • _send_snapshot streams a new replay sub-protocol: a SNAPSHOT_SEQ (-2) header carrying the snapshot version S (= last published seq), then one AllBlocksCleared batch to reset the worker, then the whole tree as depth-ordered BlockStored batches (parents strictly before children), then the existing END_SEQ (-1) sentinel. Orphans (non-root blocks whose parent was evicted) are skipped rather than emitted, because an emitted orphan would be rejected with ParentBlockNotFound and strand its subtree; they are re-learned from the live stream.
  • Snapshot blocks are packed _SNAPSHOT_BATCH_BLOCKS = 1000 per batch. A ROUTER socket in the mute state drops messages silently, so sending one event per frame would overrun the send HWM and truncate a large tree, which is the very failure this feature exists to prevent. The replay socket's HWM is also held >= buffer_steps.
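The bullets above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: the helper names (`apply_stored`, `snapshot_batches`, `needs_snapshot`), the use of `None` as the root's parent, and the way depth is derived are all hypothetical. Only the `kv_snapshot` node shape, the 1000-block batch size, the parents-before-children ordering, and the orphan skip come from the description.

```python
from typing import Dict, Iterator, List, Optional, Tuple

SNAPSHOT_BATCH_BLOCKS = 1000  # mirrors _SNAPSHOT_BATCH_BLOCKS in the description

# block_hash -> (tokens_hash, parent_hash, depth); parent_hash is None for a root
Snapshot = Dict[int, Tuple[int, Optional[int], int]]
StoredBlock = Tuple[int, int, Optional[int]]  # (block_hash, tokens_hash, parent_hash)


def apply_stored(snap: Snapshot, block_hash: int, tokens_hash: int,
                 parent_hash: Optional[int]) -> None:
    """BlockStored: add a node. Assumed rule: depth = parent depth + 1, else 0."""
    parent = snap.get(parent_hash) if parent_hash is not None else None
    depth = parent[2] + 1 if parent is not None else 0
    snap[block_hash] = (tokens_hash, parent_hash, depth)


def apply_removed(snap: Snapshot, block_hash: int) -> None:
    """BlockRemoved: prune one node; its children become orphans."""
    snap.pop(block_hash, None)


def apply_cleared(snap: Snapshot) -> None:
    """AllBlocksCleared: drop the whole tree."""
    snap.clear()


def snapshot_batches(snap: Snapshot,
                     batch_blocks: int = SNAPSHOT_BATCH_BLOCKS
                     ) -> Iterator[List[StoredBlock]]:
    """Yield depth-ordered batches, skipping orphans and their subtrees."""
    emitted = set()
    batch: List[StoredBlock] = []
    by_depth = sorted(snap.items(), key=lambda item: item[1][2])
    for block_hash, (tokens_hash, parent_hash, _depth) in by_depth:
        if parent_hash is not None and parent_hash not in emitted:
            continue  # orphan (or descendant of one): leave it to the live stream
        emitted.add(block_hash)
        batch.append((block_hash, tokens_hash, parent_hash))
        if len(batch) == batch_blocks:
            yield batch
            batch = []
    if batch:
        yield batch


def needs_snapshot(start_seq: int, ring_floor: int) -> bool:
    """True when start_seq was already evicted from the ring buffer."""
    return start_seq < ring_floor
```

Because a block is emitted only after its parent has been emitted, a skipped orphan automatically takes its descendants with it, which matches the intent of never sending a block the subscriber would reject.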

Coupled wire-format change (required to keep each snapshot node compact): the engine now sends one precomputed tokens_hash per block instead of the raw token_ids array. The hash is XXH3-64 with XXH3_SEED = 1337, computed to be byte-identical to Dynamo's compute_block_hash_for_seq (LoRA name folded into the seed; sorted multimodal mm_hash values appended). The router no longer hashes tokens. This shrinks the wire payload and lets each kv_snapshot node store a single u64 rather than a block's worth of token ids. publish_stored loses its token_ids / num_block_tokens params and gains tokens_hashes.

This changes the engine↔indexer ZMQ wire format (token_ids → tokens_hashes, plus the new snapshot frames). The matching Dynamo standalone-indexer build must be deployed together. Internal tee protocol only; no public TRT-LLM API change.
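For reviewers of the consumer side, the snapshot frames might be applied along these lines. This is a hedged sketch: the frame model (`(seq, payload)` tuples), the event tuples, and `apply_snapshot_replay` are hypothetical stand-ins for the real msgpack/ZMQ framing, which is not shown here. Only the SNAPSHOT_SEQ / AllBlocksCleared / BlockStored / END_SEQ ordering comes from the description.

```python
from typing import Any, Dict, Iterable, Optional, Tuple

SNAPSHOT_SEQ = -2  # header frame carrying the snapshot version S
END_SEQ = -1       # end-of-replay sentinel

# block_hash -> (tokens_hash, parent_hash)
Tree = Dict[int, Tuple[int, Optional[int]]]


def apply_snapshot_replay(frames: Iterable[Tuple[int, Any]],
                          tree: Tree) -> Optional[int]:
    """Apply replay frames to tree; return S if a snapshot header was seen."""
    version: Optional[int] = None
    for seq, payload in frames:
        if seq == SNAPSHOT_SEQ:
            version = payload  # assumed: header payload is the version S
        elif seq == END_SEQ:
            break
        else:
            for event in payload:
                if event[0] == "cleared":  # AllBlocksCleared resets the tree
                    tree.clear()
                elif event[0] == "stored":  # BlockStored
                    _, block_hash, tokens_hash, parent_hash = event
                    if parent_hash is not None and parent_hash not in tree:
                        raise KeyError("ParentBlockNotFound")
                    tree[block_hash] = (tokens_hash, parent_hash)
    return version
```

After a snapshot, a consumer would presumably resume the live stream from S + 1; that resume rule is an assumption of this sketch rather than something stated above.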

Test Coverage

  • Hash parity (the hard invariant): _compute_block_tokens_hash must be byte-identical to Dynamo's compute_block_hash_for_seq, or every indexer query misses. Guarded by tokens_hash_parity_test.py against the Dynamo binding (lives in the consumer/dynamo repo, since it tests both sides).
  • End-to-end on real hardware (gpt-oss-20b, live-mirrored prod traffic at ~183 events/s, engine launched with buffer_steps=16 to force the snapshot path): restarted all 3 cold-joining indexers (perfect / routing / reality). Each detected the gap (expected seq 0, got ~432793), requested replay from 0, received a snapshot of ~89.7k blocks, and rebuilt in < 1 s with zero ParentBlockNotFound, even while ~3,250 live events arrived during/after the rebuild. A functional prefix query after rebuild returned the full longest-match on all three. Runbook + captured logs: standalone_indexer/snapshot_e2e_test.md.