diff --git a/CLAUDE.md b/CLAUDE.md index f70448113c..aad37d14e0 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -369,6 +369,7 @@ Load these only when the task touches the topic. - **[NAPI bridge](docs-internal/engine/napi-bridge.md)** — TSF callback slots, `ActorContextShared` cache reset, `#[napi(object)]` payload rules, cancellation token bridging, error prefix encoding. Read before touching `rivetkit-napi`. - **[Envoy load balancing](docs-internal/engine/envoy-load-balancing.md)** — Hash-ring layout, virtual nodes, allocator flow, stale-envoy expiry, and tuning. Read before touching pegboard envoy allocation. - **[BARE protocol crates](docs-internal/engine/bare-protocol-crates.md)** — vbare schema ordering, identity converters, `build.rs` TS codec generation pattern. Read before adding/changing protocol crates. +- **[Depot SQLite overview](docs-internal/engine/depot/overview.md)** — high-level map of the per-actor SQLite storage system: VFS↔depot-client↔depot, deltas/PIDX/shards, the read/write/commit path (inline vs remote envoy), compaction, GC, forking/pinning, and PITR. Start here, then drill into the `depot/` reference docs. - **[SQLite VFS parity](docs-internal/engine/sqlite-vfs.md)** — native Rust VFS ↔ WASM TypeScript VFS 1:1 parity rule, v2 storage keys, chunk layout, delete/truncate strategy. Read before touching either VFS. - **[SQLite optimizations](docs-internal/engine/SQLITE_OPTIMIZATIONS.md)** — brief tracker for SQLite cold-read, VFS, storage, preload, and benchmark optimization ideas. - **[TLS trust roots](docs-internal/engine/tls-trust-roots.md)** — rustls native+webpki union rationale, which clients use which backend. diff --git a/docs-internal/engine/depot.md b/docs-internal/engine/depot.md deleted file mode 100644 index a1daabcf19..0000000000 --- a/docs-internal/engine/depot.md +++ /dev/null @@ -1,72 +0,0 @@ -# Depot Crash Course - -How the Depot SQLite backend reads, writes, compacts, and fences branchable database storage. 
Read this before changing anything in `engine/packages/depot/`. - -For VFS-side parity rules, see [sqlite-vfs.md](sqlite-vfs.md). For exact key formats, see [sqlite/storage-structure.md](sqlite/storage-structure.md). - -## Storage Model - -Depot stores SQLite pages in UDB/FDB. OSS Depot does not include object-backed cold storage. - -| Row family | Holds | Owner | -|---|---|---| -| `DBPTR` / `BUCKET_PTR` | Current database and bucket branch pointers | Conveyer branch APIs | -| `BUCKET_CATALOG` | Database membership facts in bucket branches | Conveyer branch APIs | -| `BRANCHES` / `BUCKET_BRANCH` | Branch records, refcounts, pin floors, lifecycle generations | Conveyer, GC, workflow checks | -| `BR/{branch}/META/head` | Current database head | Commit path | -| `BR/{branch}/COMMITS` and `BR/{branch}/VTX` | Commit metadata and versionstamp-to-txid lookup | Commit path | -| `BR/{branch}/PIDX` and `BR/{branch}/DELTA` | Recent page-owner index and LTX delta chunks | Commit path | -| `BR/{branch}/SHARD` | Reader-visible hot shard versions | Workflow manager and reclaimer | -| `BR/{branch}/CMP/*` | Workflow root and staged hot output | Workflow manager and companions | -| `BR/{branch}/PITR_INTERVAL` | Automatic PITR interval coverage rows | Workflow hot install and reclaim | -| `RESTORE_POINT` and `DB_PIN` | User retained restore points and exact history pins | Restore point APIs and workflow proof | - -The main invariant is simple: **commits write deltas directly to UDB; workflow compaction is the only publish/delete authority for compaction output.** - -## Read Path - -Reads resolve the database pointer to a database branch, build a branch-aware read plan, and fetch each page through FDB-backed coverage: - -1. Read branch head or fork head metadata. -2. Return missing for pages above EOF. -3. Check PIDX and DELTA first. -4. If the DELTA is absent or reclaimed, fall back to the newest SHARD at or below the read cap. -5. Zero-fill only valid gaps inside the database size. 
- -Missing required DELTA/SHARD coverage below EOF is a storage error. The in-process PIDX and branch ancestry caches are perf caches only; correctness comes from UDB rows and workflow revalidation. - -## Write Path - -SQLite commits call Depot through the conveyer path: - -1. Resolve DBPTR and read the current branch head in the UDB transaction. -2. Encode dirty pages into LTX DELTA chunks. -3. Write COMMITS, VTX, DELTA, and PIDX rows. -4. Update META/head and quota counters. -5. After commit, update SQLITE_CMP_DIRTY and send a throttled DeltasAvailable wake when hot lag crosses thresholds. - -The commit path does **not** publish SHARD rows or delete old history. It records new committed history and wakes workflow compaction. - -## Workflow Compaction - -Each active database branch has one DB manager workflow plus hot and reclaimer companions, all unique by database branch id. - -- Hot jobs stage LTX shard blobs under `CMP/stage/{job_id}/hot_shard`; the manager validates the active job, copies output to reader-visible `SHARD`, advances `CMP/root`, writes selected `PITR_INTERVAL` rows, and compare-clears matching PIDX. -- Reclaim jobs delete hot rows only after the manager proves replacement coverage against branch pins, restore points, PITR intervals, PIDX, SHARD rows, lifecycle generation, and current branch state. - -`CMP/root` watermarks are scheduling summaries, not deletion proof by themselves. `CompactionRoot` retains legacy cold watermark fields for persisted compatibility, but OSS Depot does not update or act on them. - -## PITR And Restore - -Automatic timestamp restore coverage is stored as `PITR_INTERVAL` rows selected during hot compaction from commit wall-clock timestamps and the effective bucket/database PITR policy. Expired interval rows are soft pins until reclaim compare-clears them. - -Restore points are retained user tokens. 
Creating a restore point resolves a `SnapshotSelector` to exact branch, txid, versionstamp, and wall-clock metadata, then writes a `RestorePointRecord` and `DB_PIN(kind=RestorePoint)`. Deleting it removes that hard pin and recomputes branch pin floors. - -Fork and restore use the same primitive: resolve a snapshot selector, derive a branch at that exact point, and let the caller decide whether to keep a fork or move the database pointer. - -## Cross-References - -- Key layout: [sqlite/storage-structure.md](sqlite/storage-structure.md) -- Component ownership: [sqlite/components.md](sqlite/components.md) -- VFS parity rules: [sqlite-vfs.md](sqlite-vfs.md) -- Storage metrics: [SQLITE_METRICS.md](SQLITE_METRICS.md) diff --git a/docs-internal/engine/sqlite/comparison-to-other-systems.md b/docs-internal/engine/depot/comparison-to-other-systems.md similarity index 91% rename from docs-internal/engine/sqlite/comparison-to-other-systems.md rename to docs-internal/engine/depot/comparison-to-other-systems.md index 7dd016808a..5b882af9d1 100644 --- a/docs-internal/engine/sqlite/comparison-to-other-systems.md +++ b/docs-internal/engine/depot/comparison-to-other-systems.md @@ -1,14 +1,14 @@ # SQLite PITR Comparison To Other Systems -This design borrows proven ideas from adjacent systems, but the constraints are different: Rivet has single-writer database ownership, no local SQLite files, FDB as the source of truth, and storage-level fork primitives instead of storage-level rollback. +This design borrows proven ideas from adjacent systems, but the constraints are different: Rivet has single-writer database ownership, no local SQLite files, UDB as the source of truth, and storage-level fork primitives instead of storage-level rollback. | System | What We Share | What We Diverge On | Why | |---|---|---|---| -| Neon | Layer model, branching, dependency-graph GC. | Rough PITR by default instead of exact PITR everywhere; FDB is the durable page store instead of a pageserver. 
| Exact PITR is valuable for Postgres workloads but too expensive as the default for these database databases. | -| Cloudflare Durable Objects SQLite | RestorePoint-like time tokens and the idea that snapshots can be built from log state. | Durable Objects use a follower quorum and do not expose fork primitives. | FDB replaces the multi-replica WAL quorum. Forking and bucket cloning are first-class goals here. | +| Neon | Layer model, branching, dependency-graph GC. | Rough PITR by default instead of exact PITR everywhere; UDB is the durable page store instead of a pageserver. | Exact PITR is valuable for Postgres workloads but too expensive as the default for these databases. | +| Cloudflare Durable Objects SQLite | RestorePoint-like time tokens and the idea that snapshots can be built from log state. | Durable Objects use a follower quorum and do not expose fork primitives. | UDB replaces the multi-replica WAL quorum. Forking and bucket cloning are first-class goals here. | | Snowflake | Time travel and zero-copy clone by metadata. | Snowflake is OLAP/table-oriented; this storage layer is per-SQLite-database and exposes lower-level primitives to the engine. | The metadata-only clone idea carries over, but the unit of identity is a database branch, not a warehouse/table abstraction. | | LiteFS | LTX file format and high-water-mark pending markers. | LiteFS uses local SQLite files and WAL replication. This design forbids local database files and builds PITR around branches. | Stateless database hosting cannot depend on local files. Branchable storage needs graph retention, not only replica catch-up. | -| Litestream | LTX-style incremental backup and rolling post-apply checksum. | Litestream backs up one SQLite database stream. It has no branch graph, bucket fork, or FDB tier. | Litestream answers "can I restore this database?" This design answers "can I fork this database or bucket cheaply?" 
| +| Litestream | LTX-style incremental backup and rolling post-apply checksum. | Litestream backs up one SQLite database stream. It has no branch graph, bucket fork, or UDB tier. | Litestream answers "can I restore this database?" This design answers "can I fork this database or bucket cheaply?" | | mvSQLite | Versionstamp awareness as a concept. | mvSQLite's multi-writer PLCC/DLCC/MPC machinery and content-addressed dedup are deliberately not adopted. | Pegboard already guarantees a single writer per database. Multi-writer conflict machinery would add cost without buying correctness. | | Turso/libSQL | Point-in-time fork/branch as a user-facing primitive. | Turso uses local SQLite files with replication and treats rollback as a storage operation. This design pushes rollback to the engine layer and exposes only fork/delete/restore_point primitives. | Keeping rollback out of storage removes mutable pointer swaps, pointer history, frozen states, and commit-vs-rollback races. | diff --git a/docs-internal/engine/sqlite/components.md b/docs-internal/engine/depot/components.md similarity index 90% rename from docs-internal/engine/sqlite/components.md rename to docs-internal/engine/depot/components.md index 184f76b01f..fbfdf5dfbf 100644 --- a/docs-internal/engine/sqlite/components.md +++ b/docs-internal/engine/depot/components.md @@ -15,9 +15,9 @@ Responsibilities: - Maintain `META/head`, quota counters, and access-touch manifest fields. - Update `SQLITE_CMP_DIRTY/{database_branch_id}` and send throttled `DeltasAvailable` workflow wakeups when hot lag crosses compaction thresholds. - Create buckets, create databases, fork buckets, fork databases, and write branch records/catalog markers. -- Create and resolve restore points. Pinned restore points write FDB pins directly and start as `PinStatus::Ready`. +- Create and resolve restore points. Pinned restore points write UDB pins directly and start as `PinStatus::Ready`. -Lease ownership: none. 
Correctness relies on Pegboard single-writer exclusivity for a live database plus FDB transaction fences. The conveyer must not take compactor leases. +Lease ownership: none. Correctness relies on Pegboard single-writer exclusivity for a live database plus UDB transaction fences. The conveyer must not take compactor leases. ## Workflow Compaction @@ -26,11 +26,11 @@ The workflow compaction path uses one persistent DB manager plus hot and reclaim Responsibilities: - Coalesce commit wakeups through `SQLITE_CMP_DIRTY/{database_branch_id}` and `DeltasAvailable` signals. -- Plan hot jobs from current FDB state instead of trusting signal payloads. +- Plan hot jobs from current UDB state instead of trusting signal payloads. - Carry the branch lifecycle generation through planned jobs and reject stale stage, publish, or reclaim work after branch deletion or recreation. - Have the hot companion write staged shard blobs under `CMP/stage/{job_id}/hot_shard`. - Install matching hot job output by copying staged blobs to reader-visible `SHARD`, advancing `CMP/root`, and compare-and-clearing expected PIDX rows. -- Have the reclaimer delete only manager-authorized FDB rows and stale staged output. +- Have the reclaimer delete only manager-authorized UDB rows and stale staged output. - Keep automatic PITR interval coverage and retained restore point pins live until reclaim can prove they are no longer needed. - Stop the manager and companion workflows through `DestroyDatabaseBranch` when a database branch is no longer live. @@ -42,6 +42,6 @@ Lease ownership: none. 
Gasoline workflow uniqueness uses only the database branc |---|---|---| | Conveyer | `META/head`, `COMMITS`, `VTX`, `PIDX`, `DELTA`, branch records, restore points | None | | Workflow DB manager | `CMP/root`, live `SHARD`, `PITR_INTERVAL`, matching PIDX clears | None | -| Workflow companions | Staged hot output and manager-authorized FDB cleanup | None | +| Workflow companions | Staged hot output and manager-authorized UDB cleanup | None | The components share branch metadata and pin counters, but each mutable manifest field has one owner. diff --git a/docs-internal/engine/sqlite/constraints-and-design-decisions.md b/docs-internal/engine/depot/constraints-and-design-decisions.md similarity index 93% rename from docs-internal/engine/sqlite/constraints-and-design-decisions.md rename to docs-internal/engine/depot/constraints-and-design-decisions.md index f5fa076ffd..9af09b07c1 100644 --- a/docs-internal/engine/sqlite/constraints-and-design-decisions.md +++ b/docs-internal/engine/depot/constraints-and-design-decisions.md @@ -5,28 +5,28 @@ This page records the constraints that shape the PITR/forking storage design. Th ## Binding Constraints - **Single writer per database.** Pegboard exclusivity is the release-mode concurrency fence. Storage does not implement multi-writer conflict resolution. -- **No local SQLite files.** The durable database state is in FDB. Local files would make storage stateful and non-migratable. -- **Lazy reads.** Forks do not copy data. Reads walk branch ancestry and hydrate from FDB DELTA/SHARD rows only when needed. +- **No local SQLite files.** The durable database state is in UDB. Local files would make storage stateful and non-migratable. +- **Lazy reads.** Forks do not copy data. Reads walk branch ancestry and hydrate from UDB DELTA/SHARD rows only when needed. - **Per-commit granularity.** PITR targets commits/versionstamps, not individual WAL frames inside a commit. 
-- **FDB is the source of truth.** OSS Depot has no object-backed cold tier. +- **UDB is the source of truth.** OSS Depot has no object-backed cold tier. - **Branches are immutable.** A bucket id is its bucket branch id, and a database id is its database branch id. - **Rollback is engine-owned.** Storage exposes fork primitives; the engine decides which database id a database currently uses. - **Persisted wire/storage records use vbare.** Raw fixed-width bytes are reserved for atomic counters and simple indexes such as `VTX`. ## Rough PITR By Default -The design keeps rough PITR cheap by preserving enough FDB history for branch-at-position recovery without writing a full image for every commit. Exact recovery is opt-in through restore points, which write FDB history pins that workflow compaction must preserve. +The design keeps rough PITR cheap by preserving enough UDB history for branch-at-position recovery without writing a full image for every commit. Exact recovery is opt-in through restore points, which write UDB history pins that workflow compaction must preserve. Compared with Neon's exact-PITR posture, this trades precision for lower steady-state cost. That fits Rivet Database-style workloads where "fork near this point" is usually enough, and exact restore points can be created explicitly for critical moments. ## Pages Are Self-Describing -LTX layers carry page numbers and checksums. That lets the system move bytes between DELTA and SHARD rows without a separate opaque page map. FDB PIDX remains the hot routing index. +LTX layers carry page numbers and checksums. That lets the system move bytes between DELTA and SHARD rows without a separate opaque page map. UDB PIDX remains the hot routing index. The result is an LSM-shaped flow: -- L0: DELTAs in FDB. -- L1: versioned SHARDs in FDB. +- L0: DELTAs in UDB. +- L1: versioned SHARDs in UDB. 
## Why Versioned SHARDs @@ -127,7 +127,7 @@ align to are durable rows recorded per database branch: - `DB_PIN` rows: concrete `(txid, versionstamp)` for restore points, database forks, and bucket forks. -Snapping then compares **FoundationDB versionstamps** (monotonic, globally +Snapping then compares **UDB versionstamps** (monotonic, globally ordered commit tokens) against those recorded rows: it picks the covered row with the largest txid whose `versionstamp <= fork_versionstamp`. No clock is consulted in that decision. A bucket fork carries one `fork_versionstamp`, and diff --git a/docs-internal/engine/depot/overview.md b/docs-internal/engine/depot/overview.md new file mode 100644 index 0000000000..a6bd90b974 --- /dev/null +++ b/docs-internal/engine/depot/overview.md @@ -0,0 +1,469 @@ +# Depot SQLite Overview + +High-level map of how Rivet stores, reads, compacts, forks, and time-travels per-actor +SQLite databases. **UniversalDB ("UDB")** is the source of truth; there is no local SQLite +file. + +This doc stays high level and links out for detail: + +- Exact key/byte layout: [storage-structure.md](storage-structure.md) +- Design rationale: [constraints-and-design-decisions.md](constraints-and-design-decisions.md) +- Component ownership: [components.md](components.md) +- Native↔wasm VFS parity rules: [sqlite-vfs.md](../sqlite-vfs.md) +- Comparison to other PITR systems: [comparison-to-other-systems.md](comparison-to-other-systems.md) + +## Design constraints + +These shape everything below; full statements in +[constraints-and-design-decisions.md](constraints-and-design-decisions.md) and the +`engine/packages/depot/CLAUDE.md` "Hard Constraints" section. + +- **Single writer per database.** Pegboard guarantees at most one actor instance touches a + database's storage at a time. + - Storage does no multi-writer conflict resolution. + - A generation + head fence guards the brief failover window. 
+- **No local SQLite files, ever.** The VFS speaks to Depot; Depot speaks to UDB. Nothing on + disk or tmpfs. +- **Lazy reads.** Pages are fetched on demand, never bulk-preloaded. Forks copy no data. +- **Per-commit granularity.** PITR targets commits/versionstamps, not individual WAL frames. +- **Branches are immutable.** + - A database id *is* its database-branch id; a bucket id *is* its bucket-branch id. + - Rollback is engine-owned (fork + swap the pointer). +- **Persisted records use vbare**, except atomic counters and simple indexes (e.g. `VTX`). + +## Glossary + +### Layers & components + +- **VFS** — SQLite's pluggable I/O layer. Our VFS replaces file I/O with calls to Depot. +- **depot-client** — the crate that implements the VFS and the transport to Depot. +- **depot** — the storage engine: owns branches, the read/write/compaction logic over UDB. +- **conveyer** — Depot's commit/read path (the data-plane code in `depot/src/conveyer/`). +- **pegboard-envoy** — the engine-side service that hosts an actor's storage access and + validates it at the trust boundary. +- **envoy** — the actor↔engine bridge; "inline" vs "remote" SQLite are two modes of it. + +### Storage primitives + +- **page** — fixed-size (4 KiB) unit SQLite reads/writes. See the pages primer below. +- **delta** — one commit's changed pages, encoded as an LTX blob under `DELTA`. +- **LTX** — the on-disk delta/shard blob format (LTX V3). +- **PIDX** — page-index: maps a page number to the `DELTA` txid that currently owns it. +- **shard** — a 64-page group; a *shard version* is the full page set of that group as-of a + txid, written during compaction. + +### Commits & retention + +- **txid** — per-branch monotonic commit counter; the newest is the **head**. +- **versionstamp** — UDB's 16-byte globally-ordered commit token (see primer below). +- **head** — the newest committed txid on a branch. 
+- **hot watermark** — the txid up to which deltas have been folded into shards; the + retention/GC frontier. + +### Branching & history + +- **bucket / database** — a bucket groups databases; a Rivet Actor's SQLite database *is* a + `database_id`. Forks operate on either. +- **branch** — an immutable version of a database or bucket, with a parent link. +- **pointer / catalog** — `DBPTR`/`BUCKET_PTR` map ids→current branch; `BUCKET_CATALOG` + records database membership (lazily inherited across bucket forks). +- **pin** — a `DB_PIN` row at a concrete `(txid, versionstamp)` that keeps history alive. + Restore points (bookmarks), database forks, and bucket forks are all pins. +- **PITR** — point-in-time recovery: periodic auto-pins so you can restore to a past time. + +## UDB data structure (reference) + +Sketch only; exact bytes in [storage-structure.md](storage-structure.md). The sections below +recall the slice of this layout they touch, so you don't have to scroll back here. + +```text +BR/{database_branch_id}/ # per-database-branch data (one subtree per branch) + META/head # current head (txid, db_size, checksum) + META/quota # storage accounting (atomic i64) + COMMITS/{txid} # commit metadata (wall clock, versionstamp, size) + VTX/{versionstamp} # versionstamp -> txid + DELTA/{txid}/{chunk} # LTX delta blob (the commit's changed pages) + PIDX/{pgno} # pgno -> owning delta txid + SHARD/{shard_id}/{as_of} # compacted full-shard snapshot at txid `as_of` + PITR_INTERVAL/{bucket_ms} # one representative commit per time bucket + CMP/root, CMP/stage/... # compaction watermark + staged output +DBPTR/{bucket_branch}/{name}/cur # database name -> current database branch +BUCKET_PTR/{bucket}/cur # bucket -> current bucket branch +BUCKET_CATALOG/... # database membership (inherited on fork) +BRANCHES/..., BUCKET_BRANCH/... # immutable branch records + parent links +DB_PIN/{database_branch}/... # pins: restore points, db forks, bucket forks +RESTORE_POINT/... 
# user restore-point tokens +``` + +## Workflows + +Compaction, GC, and retention run as Gasoline workflows — one set per database branch, not yet +enabled in the production registry: + +- **`DbManagerWorkflow`** (the **manager**) — the authority: owns compaction state and + plans/dispatches every publish and delete; the companions only do what it authorizes. +- **`DbHotCompacterWorkflow`** (the **hot-compacter**) — stages compacted `SHARD` output for an + install and reports back. +- **`DbReclaimerWorkflow`** (the **reclaimer**) — runs GC: deletes manager-authorized rows and + stale staged output. + +Later sections use the short names (**manager**, **hot-compacter**, **reclaimer**) for these. + +The commit path is not a workflow — it runs inline in the conveyer transaction. Workflows own +only the background compaction/GC lifecycle. + +## Database pages (primer) + +- SQLite stores a database as a flat array of fixed-size **pages** (4 KiB here). +- Every read or write is page-granular; the page is the unit Depot stores and versions. +- We do not re-explain the page format here — see the SQLite documentation. + +## SQLite VFS and depot client + +**What a VFS is.** SQLite delegates all file I/O (open, read, write, sync, lock, size) to a +pluggable VFS. Ours (`depot-client/src/vfs.rs`) replaces "read/write a file on disk" with +"read/write pages through Depot." There is no backing file — the *no local files* invariant. + +**Only two operations cross the boundary.** Although our VFS implements the full VFS surface, +exactly two calls reach Depot: + +- **`get_pages`** — page reads. On an `xRead` cache miss the VFS requests the missing page + numbers (lazy: only what's touched). +- **`commit`** — page writes. SQLite runs in batch-atomic mode; dirty pages are buffered in + memory and flushed as one delta on `xFileControl(COMMIT_ATOMIC_WRITE)` (or `xSync` for + non-atomic flushes). 
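The two-operation boundary can be sketched as a small page cache. This is a minimal illustration, not the `depot-client` implementation: only the operation names `get_pages` and `commit` come from the overview, while `FakeDepot`, `SketchVfs`, and their method names are hypothetical stand-ins so the example runs on its own.

```python
from typing import Dict, Iterable, List

PAGE_SIZE = 4096  # 4 KiB pages, as stated in the overview


class FakeDepot:
    """Hypothetical in-memory stand-in for the Depot transport."""

    def __init__(self) -> None:
        self.pages: Dict[int, bytes] = {}
        self.get_calls: List[List[int]] = []
        self.commit_calls = 0

    def get_pages(self, pgnos: Iterable[int]) -> Dict[int, bytes]:
        pgnos = list(pgnos)
        self.get_calls.append(pgnos)
        # The fake simply returns zeros for pages it has never seen.
        return {p: self.pages.get(p, bytes(PAGE_SIZE)) for p in pgnos}

    def commit(self, dirty_pages: Dict[int, bytes]) -> None:
        self.commit_calls += 1
        self.pages.update(dirty_pages)


class SketchVfs:
    """Illustrative VFS core: reads fetch lazily, writes buffer until commit."""

    def __init__(self, transport: FakeDepot) -> None:
        self.transport = transport
        self.cache: Dict[int, bytes] = {}
        self.dirty: Dict[int, bytes] = {}

    def x_read(self, pgno: int) -> bytes:
        if pgno in self.dirty:  # an uncommitted write wins over the cache
            return self.dirty[pgno]
        if pgno not in self.cache:  # cache miss: fetch only this page
            self.cache.update(self.transport.get_pages([pgno]))
        return self.cache[pgno]

    def x_write(self, pgno: int, data: bytes) -> None:
        if len(data) != PAGE_SIZE:
            raise ValueError("writes are page-granular")
        self.dirty[pgno] = data  # buffered in memory only

    def commit_atomic_write(self) -> None:
        if self.dirty:
            self.transport.commit(dict(self.dirty))  # one delta per commit
            self.cache.update(self.dirty)
            self.dirty.clear()
```

In this model, any number of `x_write` calls produce exactly one `commit` call at the atomic-write boundary, and a repeated `x_read` of the same page never reaches the transport twice.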
+ +**Lock callbacks are no-ops** — single-writer is enforced by Pegboard exclusivity plus +fencing, not by SQLite's lock state machine. + +**Sequence (query → pages):** + +1. `SQL → SQLite → xRead(pgno)`. +2. Cache miss → `get_pages(pgnos)`. +3. Depot resolves from PIDX/DELTA/SHARD. +4. Pages returned → cached → SQLite continues. +5. Writes mirror this: buffered `xWrite`s → `commit(dirty_pages)` at the atomic-write + boundary. + +**Inline vs remote (envoy).** Two independent axes: + +- *Where SQLite runs:* + - **LocalNative** (common): SQLite + VFS run in the actor process; the two page ops are + tunneled over the envoy websocket to pegboard-envoy, which calls Depot against UDB. + - **RemoteEnvoy**: the actor ships whole SQL strings to pegboard-envoy, which runs SQLite + there with an embedded (in-process) Depot transport straight to UDB. +- *How the VFS reaches Depot:* + - **embedded** (`depot-client-embedded`): calls the Depot `Db` directly in-process (used by + pegboard-envoy's remote-SQL executor and the depot CLI). + - **websocket** (`EnvoySqliteTransport`): marshals the two ops over the envoy tunnel. + +Either way, pegboard-envoy is the trust boundary: it validates namespace, actor existence, +and generation before any request reaches Depot. + +**Fencing on read & write.** Every op carries `(generation, expected_head_txid)`: + +- pegboard-envoy CAS-checks the generation against UDB. +- Depot checks `expected_head_txid` against the branch head inside the same serializable + transaction and raises `HeadFenceMismatch` on a mismatch (`conveyer/read.rs`, + `conveyer/commit/apply.rs`). +- This catches the rare two-instances-writing case during actor failover. + +## How pages are stored and read + +(Forks are deferred to [Forking & pinning](#forking--pinning); this section assumes a single +linear branch.) 
+ +**Data structure (the keys a read touches):** + +```text +PIDX/{pgno} # pgno -> owning DELTA txid +DELTA/{txid}/{chunk} # the owning commit's changed pages (LTX) +SHARD/{shard_id}/{as_of} # compacted full-shard snapshot at txid `as_of` +``` + +- A commit writes its changed pages as an **LTX delta** under `DELTA/{txid}` (plus `COMMITS` + and `VTX` rows). Deltas are append-only. +- **PIDX** maps each page number to the `DELTA` txid that last wrote it. +- **Shards** are compacted full snapshots of a 64-page group as-of a txid (built later by + compaction). A read uses them when the owning delta has been reclaimed. + +**LTX (the delta/shard blob format) primer.** Both `DELTA` and `SHARD` blobs are LTX V3 (the +LiteFS "lite transaction" format; V3 is our variant). One blob stores: + +- **A set of pages** — each as `(pgno, page bytes)`, plus a small header (page size, the txid + range it covers, db page count after the commit). A delta holds one commit's changed pages; a + shard holds a full 64-page group folded as-of a txid. +- **A page index** mapping `pgno → (offset, size)` in the blob, so the blob is + **frame-addressable**: a reader parses just the header + index and decompresses only the one + page it needs, never the whole blob. This is what keeps lazy `get_pages` cheap. + +More on the format: see the upstream LTX spec (its "File +Format" section); our V3 byte layout lives in `conveyer/ltx.rs`. + +**Read path** — for each requested page: + +1. Consult `PIDX/{pgno}` for the owning txid, then load that `DELTA` and return the page. +2. If the delta is absent ([reclaimed](#gc)), fall back to a shard. Scan + `SHARD/{shard_id}/{as_of}` for the newest `as_of` at or below the read cap. + - The shard id is pure arithmetic from the page number: `shard_id = pgno / 64` + (`SHARD_SIZE`), so shard `N` owns pages `N*64 .. N*64+63`. +3. If neither a delta nor a shard provides an in-range page, it reads back as **zeros**. 
That + is a legitimate gap — a page below the database size that was never written, or one absent + from its covering shard — and SQLite expects zeros there. +4. If `PIDX` named a delta that's gone *and* no shard covers the page, the required content is + unrecoverable, so the read raises a storage error (`ShardCoverageMissing`) rather than + zero-filling. + +## Committing SQLite pages (conveyer) + +**Data structure (the keys a commit writes):** + +```text +COMMITS/{txid} # commit metadata (wall clock, versionstamp, size) +VTX/{versionstamp} # versionstamp -> txid +DELTA/{txid}/{chunk} # the dirty pages, as LTX +PIDX/{pgno} # repointed to this txid for each changed page +META/head, META/quota # advanced +``` + +A commit runs the conveyer commit path in one UDB transaction (`conveyer/commit/apply.rs`): + +1. Read `META/head` serializably and **fence** on `expected_head_txid` (reject on mismatch). +2. Encode the dirty pages as an LTX `DELTA` (chunked) and write `COMMITS`, `VTX`, `DELTA`, + `PIDX`. +3. Advance `META/head` and update `META/quota`. +4. Wake workflow compaction (a throttled signal) when delta lag crosses a threshold. + +The commit path **only records new history** — it never publishes shards or deletes anything. +Shards and deletion are compaction's job. + +## Compacting deltas to shards + +### Why compaction is needed + +`PIDX` makes a *current* read cheap — it routes each page straight to its one owning delta, no +replay. But two costs grow with every commit: + +- **Deltas accumulate without bound.** A delta can't be dropped while it still holds the only + copy of some page, so raw history grows forever. +- **Point-in-time reads get expensive.** A read pinned to an earlier point in history (a fork's + view of its parent, or a PITR/restore target — explained under [PITR](#pitr)) can't use + `PIDX`; it walks the delta chain backward to that point, which is O(deltas). 
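To make the second cost concrete, here is a toy model of a current read routed through a `PIDX`-like map versus an as-of read that walks deltas backward from a cap. It is a sketch under simplifying assumptions (in-memory dictionaries, no shards, no truncation); the `Delta` type and all three function names are hypothetical and do not mirror Depot's code.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class Delta:
    txid: int
    pages: Dict[int, bytes]  # pgno -> page bytes changed by this commit


def build_pidx(deltas: Dict[int, Delta]) -> Dict[int, int]:
    """Map each pgno to the txid of the delta that last wrote it."""
    pidx: Dict[int, int] = {}
    for txid in sorted(deltas):  # oldest first, so the newest write wins
        for pgno in deltas[txid].pages:
            pidx[pgno] = txid
    return pidx


def read_current(
    deltas: Dict[int, Delta], pidx: Dict[int, int], pgno: int
) -> Optional[bytes]:
    """Current read: one index lookup, then one delta fetch."""
    txid = pidx.get(pgno)
    return None if txid is None else deltas[txid].pages[pgno]


def read_as_of(
    deltas: Dict[int, Delta], pgno: int, cap_txid: int
) -> Tuple[Optional[bytes], int]:
    """As-of read: walk deltas newest-to-oldest starting at the cap.

    Returns the page (or None) and how many deltas were visited.
    """
    visited = 0
    for txid in sorted((t for t in deltas if t <= cap_txid), reverse=True):
        visited += 1
        page = deltas[txid].pages.get(pgno)
        if page is not None:
            return page, visited
    return None, visited
```

The visited count from `read_as_of` grows with the number of deltas at or below the cap, while `read_current` stays constant.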
+ +Compaction fixes both: it folds deltas into full **shard snapshots** at the points reads can +land on, so an as-of read is a single shard fetch, and the folded deltas become reclaimable +(see [GC](#gc)). + +### The compaction process + +Compaction works like a two-phase commit: plan and stage first, then install atomically. It runs in +three steps: + +1. **Plan** (manager) — scan the batch and decide the work. +2. **Stage** (hot-compacter) — build and stage the shard snapshots. +3. **Install** (manager) — atomically promote them live and advance the watermark. + +**Plan.** Scan the batch range once and decide the work: which deltas exist, and the **coverage +points** to snapshot — the txids a later read can be anchored to, so each must stay readable +after the deltas below it are reclaimed. + +```text +commits, deltas, pidx = scan(hot_watermark+1 ..= head) # single range scan + +coverage_points = { head } # always +for pin in pins: # db/bucket fork, restore point + if pin.txid in batch: + add pin.txid +for rep in pitr_reps(commits): # commits bucketed by wall-clock + add rep.txid +``` + +**Stage.** Loop the coverage points, folding the in-memory deltas (no re-scan). At each point, +snapshot every shard that changed — overlaying the folded pages on the shard's previous +version — and write it as a *pending* blob. + +```text +for as_of in sorted(coverage_points): + # fold in-memory deltas <= as_of, grouped by shard. only changed shards + # appear; newest write per page wins; truncate-aware (shrunk pages dropped). + pages_by_shard = fold(deltas where txid <= as_of) + + for (shard_id, pages) in pages_by_shard: + base = prev_shard_version(shard_id, as_of) # newest snapshot, or empty + blob = encode(base overlaid with pages) # complete 64-page snapshot + stage(SHARD/{shard_id}/{as_of} = blob) # pending, not yet live +``` + +**Install.** In one atomic transaction, revalidate the plan, promote the staged blobs to live +`SHARD` rows, and advance the watermark. 
+ +```text +if plan_fingerprint_changed(): + abort_and_replan() +promote staged SHARD blobs -> live SHARD rows # publish +hot_watermark = head # advance the frontier +``` + +- **Truncate-aware fold:** a page removed by a later truncate is dropped, not resurrected, so an + as-of read never sees pages a shrink had already freed. +- **Atomicity:** the publish and the watermark advance are in the *same* transaction, so "below + the watermark = covered" is always true — never a window where the watermark moved but the + shards are missing. +- **Authority:** the **manager** owns all publish/delete; the companions only stage or delete + what it authorizes. + +Garbage collection of the now-redundant deltas and superseded shards is handled by reclaim — +see [GC](#gc). + +## GC + +Reclaim is the unified collector. Once the watermark passes a txid, that txid's deltas are +redundant (the covered points have shards), so reclaim collects them. In one pass it: + +1. deletes `DELTA` rows at or below the watermark (see below), +2. deletes `COMMITS`/`VTX` below the watermark, except a **keep-set** (see below), +3. clears **stale `PIDX`** (see below), +4. deletes **superseded `SHARD` versions** (see below), and +5. drops **expired `PITR_INTERVAL` rows** (see below). + +It is inert while the watermark is 0. + +**Delta retention.** Retain a `DELTA` if and only if its txid is above the hot watermark. +Everything at or below is reclaimable with no per-shard proof, because the install published +shard coverage for every covered point in the same transaction that advanced the watermark. +This simple rule is sound *only because forks are constrained to covered points* (see +[Forking & pinning](#forking--pinning)): the alignment fence makes every reachable read cap a +covered point or the head, so a reclaimed below-watermark delta can never be the only source for +a read. Drop either half — the rule or the fork constraint — and the other breaks. 
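
The delta retention rule above can be sketched as a single predicate. This is a minimal illustration, not the depot implementation: `retain_delta` and `reclaimable_deltas` are hypothetical names, and txids are modeled as plain integers rather than stored keys.

```python
def retain_delta(delta_txid: int, hot_watermark: int) -> bool:
    """Keep a DELTA if and only if its txid is above the hot watermark."""
    return delta_txid > hot_watermark


def reclaimable_deltas(delta_txids: list[int], hot_watermark: int) -> list[int]:
    """Return the delta txids a reclaim pass may delete, in ascending order."""
    if hot_watermark == 0:
        # Reclaim is inert while the watermark is 0 (hypothetical guard
        # mirroring that statement; nothing has been covered yet).
        return []
    return sorted(t for t in delta_txids if not retain_delta(t, hot_watermark))
```

With deltas at txids 3, 5, and 8 and a watermark of 5, the sketch marks 3 and 5 as reclaimable and retains 8. No per-shard check appears because, as described above, the install that advanced the watermark already published shard coverage for every covered point.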
+ +**Keep-set.** Below the watermark, `COMMITS`/`VTX` are normally collected — but a commit must +stay readable if something still points at it. The keep-set is exactly those survivors: the +txids referenced by a **pin** (`DB_PIN`: database fork, bucket fork, or restore point) or a +**retained PITR interval representative**. Everything else below the watermark is provably +unreachable and is collected. + +**Stale `PIDX`.** A `PIDX` entry is stale once its owning txid is at or below the watermark: the +delta it names has been folded into a shard (and may already be deleted), leaving the entry a +dangling routing hint. Reclaim clears it with compare-and-clear; reads already fall back to the +shard in the meantime. + +**Superseded `SHARD` versions.** A shard version is superseded once no covered txid reads +through it — it is not the newest version at or below any covered point, and not above the +watermark. Reads resolve "newest version at or below the cap" and every reachable cap is a +covered point or the head, so such a version is unreachable and is deleted. + +**Expired `PITR_INTERVAL` rows.** Each interval representative carries a retention TTL +(`expires_at_ms`). Once it passes, the row is reclaimable, and it is deleted in the same pass +that drops it from coverage — so the `COMMITS`/`VTX` it was keeping never lose their last +reference before being collected. + +## Forking & pinning + +**Versionstamps & VTX (primer).** Forks, pins, and PITR are all defined *as of a versionstamp*, +never a txid: + +- **Why not txid:** a txid is a *per-branch* counter, so it can't order or compare commits + across branches. +- **versionstamp** — UDB's 16-byte, globally-ordered commit token, assigned at commit time. It + gives a total order over every commit on every branch *without a clock*. +- Every commit writes one (recorded as `VTX/{versionstamp}` → txid) because any commit might + later become a fork/pin/PITR target, and the global order has to be fixed at commit time. 
+- `VTX` is the reverse index that resolves a target versionstamp back to its branch txid. +- The whole alignment/retention fence compares versionstamps (not wall-clock times) to decide + what a read can reach. + +**Data structure (forks + pins):** + +```text +BRANCHES/{id} # immutable branch record: parent + parent_versionstamp +DBPTR/{bucket_branch}/{name}/cur, BUCKET_PTR/{bucket}/cur # id -> current branch +BUCKET_CATALOG/... # database membership (inherited lazily on bucket fork) +DB_PIN/{database_branch}/… # pins: restore points, db forks, bucket forks +``` + +**Forking is effectively free.** + +- A fork is just a new immutable `BRANCHES` record with a parent link and the fork + versionstamp; **no data is copied**. +- All the real work happens in compaction (which stages shard coverage at fork/pin points) and + the read path (which walks branch ancestry). + +**Where you can fork from.** Only: + +- a point **above the watermark** (its deltas still exist; the new pin makes the next + compaction stage coverage for it), or +- an **already-covered point** (the watermark, a PITR interval representative, or an existing + pin). +- A caller-supplied versionstamp between covered points is **snapped down** to the newest + covered point at or below it. +- This constraint is exactly what makes GC sound. + +**Reads with forks.** A read resolves across the branch and its ancestors: + +1. Start at the branch. For each `parent` link (`BRANCHES` carries `parent` + + `parent_versionstamp`), include the ancestor **capped** at the versionstamp it was forked at. +2. Resolve each requested page per source — PIDX/DELTA/SHARD, identical to the basic read path. +3. The most-specific branch that has the page wins. + +**Pins are one unified thing.** + +- Creating a restore point (a "bookmark"), a database fork, or a bucket fork all write a + `DB_PIN` row holding a concrete `(txid, versionstamp)`. 
+- Both the compaction coverage-staging and the reclaim keep-set read all `DB_PIN` rows, so a + pin both gets shard coverage staged at its txid and is kept by GC. +- There is no separate "bookmark" store — a bookmark *is* a pin of kind `RestorePoint`. + +**The three pin shapes:** + +- **Database fork.** Fork one database at a versionstamp → a new database branch with a parent + link and a `DatabaseFork` pin on the source. (An actor's database is a `database_id`, so + forking an actor's DB is a database fork.) +- **Bucket fork.** Forks a whole bucket metadata-only (catalog is inherited, not copied). + - A database inherited through the fork is **materialized lazily** on first access. + - Its first read/write derives a capped database fork at the fork point, so reads freeze at + the fork and writes build on the inherited state instead of leaking into the source. +- **Restore / rollback.** Reuse the same primitive: (1) resolve a snapshot selector to a + covered `(txid, versionstamp)`, (2) fork there, and (3) for rollback, move the engine-owned + pointer to the new branch. + +## PITR + +**Data structure:** + +```text +PITR_INTERVAL/{bucket_ms} # one representative commit (txid, versionstamp) per time bucket +``` + +**PITR coverage is just periodic auto-pins.** + +- During compaction we bucket the batch's commits by wall-clock time (default 5-minute + intervals). +- We record one representative commit per bucket as a `PITR_INTERVAL` row holding its + `(txid, versionstamp)`. + +**Why you can't restore to an arbitrary past point.** + +- We deliberately do *not* keep every delta — reclaim deletes them once the watermark passes. +- So only covered points survive, and a timestamp restore **floors** to the nearest interval + representative at or before your timestamp. +- To restore to an *exact* point, create a restore point (bookmark) while that point is still + reachable; that pin then survives reclaim. 
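
The flooring behavior for a timestamp restore can be sketched as a sorted lookup. This is an illustrative sketch under stated assumptions: `IntervalRep` and `floor_to_representative` are hypothetical names, the representatives are modeled as an in-memory list sorted by time, and comparing on a single `at_ms` field is an assumption (the real rows are keyed by bucket and read from UDB).

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class IntervalRep:
    at_ms: int            # assumed comparison key: wall-clock time of the representative
    txid: int             # branch-local commit id of the representative
    versionstamp: bytes   # globally ordered commit token


def floor_to_representative(
    reps: list[IntervalRep], target_ms: int
) -> Optional[IntervalRep]:
    """Return the newest representative at or before target_ms, or None.

    `reps` must be sorted by at_ms ascending.
    """
    keys = [r.at_ms for r in reps]
    idx = bisect_right(keys, target_ms)  # first index strictly after target_ms
    return reps[idx - 1] if idx > 0 else None
```

A target that falls between two representatives resolves to the earlier one, which is why a timestamp restore cannot land on an arbitrary commit. A target older than every retained representative has nothing to floor to; the sketch returns `None` there, and how the real resolver reports that case is not specified here.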
+ +**Reading at a past point (the point-in-time read).** Restoring or forking to a past point +gives you a branch whose reads are *as of* that point: each page resolves to the newest write +at or below the target — the same capped ancestry read described under +[Forking & pinning](#forking--pinning). With a shard published at that covered point the read is +a single fetch; without one it would have to walk the delta chain backward to the target. That +walk is the cost compaction exists to remove, and why every restorable point is a covered +point with a shard. + +**PITR is just a fork.** An `AtTimestamp` restore: + +1. resolves the timestamp through the `PITR_INTERVAL` rows to a representative's + `(txid, versionstamp)`, +2. forks there, and +3. for an in-place restore, moves the engine-owned pointer to the new branch. + +Same fork primitive, same alignment rules — the only difference is how the target point is +chosen. diff --git a/docs-internal/engine/sqlite/storage-structure.md b/docs-internal/engine/depot/storage-structure.md similarity index 87% rename from docs-internal/engine/sqlite/storage-structure.md rename to docs-internal/engine/depot/storage-structure.md index 5daa5d4b61..968585be06 100644 --- a/docs-internal/engine/sqlite/storage-structure.md +++ b/docs-internal/engine/depot/storage-structure.md @@ -1,6 +1,6 @@ # SQLite Storage Structure -This is the key-format reference for the branchable Depot layer. Update it whenever FDB layout changes. +This is the key-format reference for the branchable Depot layer. Update it whenever UDB layout changes. ## Identity Model @@ -9,11 +9,11 @@ Depot has two external ids: - `BucketId`: the bucket branch id. There is no separate bucket pointer id. - `DatabaseId`: the database branch id. There is no separate database pointer id. -Branch records are append-only for ancestry fields. 
Database branch records also carry mutable lifecycle state and a monotonic lifecycle generation used to reject stale workflow compaction work. Forks allocate a new id and write a parent pointer to the source branch plus the fork versionstamp. Engine-layer rollback is implemented outside this crate by forking a database and changing the engine's database-to-database mapping. +Branch records are append-only for ancestry fields. Database branch records also carry mutable lifecycle state, a monotonic lifecycle generation used to reject stale workflow compaction work, and a denormalized policy scope (`policy_bucket_id`, `policy_database_id`, record version 2) so compaction resolves the effective PITR policy with point reads. Forks allocate a new id and write a parent pointer to the source branch plus the fork versionstamp. Engine-layer rollback is implemented outside this crate by forking a database and changing the engine's database-to-database mapping. -Bucket database membership is stored in `BUCKET_CATALOG`. Bucket forks do not copy catalog entries. Reads walk bucket parents and accept inherited entries only when the entry versionstamp is at or before the walking branch's `parent_versionstamp`. +Bucket database membership is stored in `BUCKET_CATALOG`. Bucket forks do not copy catalog entries. Reads walk bucket parents and accept inherited entries only when the entry versionstamp is at or before the walking branch's `parent_versionstamp`. The first data access through a forked bucket materializes a capped database fork (newest covered point at or below the fork chain's versionstamp cap) and writes a local pointer plus catalog marker. -## FDB Prefixes +## UDB Prefixes All Depot keys live under the crate-owned `[0x02]` prefix. The next byte is the partition. 
@@ -50,7 +50,7 @@ Database pointer resolution walks bucket parents when a current bucket branch do ```text BUCKET_CATALOG/{bucket_id_uuid_be:16}/{database_id_uuid_be:16} - -> 16-byte FDB versionstamp via SetVersionstampedValue + -> 16-byte UDB versionstamp via SetVersionstampedValue ``` The value is the database membership versionstamp. Parent walks use it as the AS-OF cap for `fork_bucket`. Database tombstones on the bucket branch hide matching inherited catalog entries. @@ -106,7 +106,7 @@ BR/{database_id_be:16}/PITR_INTERVAL/{bucket_start_ms_be:8} `COMMITS` stores commit metadata, including wall-clock time, captured versionstamp, size in pages, and post-apply checksum. `VTX` maps a versionstamp back to txid for restore point resolution and GC. `PIDX` maps a page number to the DELTA txid that currently owns it. -`SHARD` is versioned by `as_of_txid`. Reads choose the largest `as_of_txid <= read_txid`. Hot compaction writes new SHARD versions and does not overwrite older ones. +`SHARD` is versioned by `as_of_txid`. Reads choose the largest `as_of_txid <= read_txid`. Hot compaction writes new SHARD versions at coverage txids, truncate publishes pruned versions at the truncating txid, and workflow reclaim deletes versions once they are not the newest at or below any covered txid. ## Workflow Compaction Metadata diff --git a/docs-internal/engine/sqlite/vfs-brief.md b/docs-internal/engine/sqlite/vfs-brief.md deleted file mode 100644 index 06e85ef1ae..0000000000 --- a/docs-internal/engine/sqlite/vfs-brief.md +++ /dev/null @@ -1,31 +0,0 @@ -# SQLite VFS Brief - -This page is intentionally brief. Full VFS rules live in [../sqlite-vfs.md](../sqlite-vfs.md), and the storage backend crash course lives in [../depot.md](../depot.md). - -## Boundary - -The VFS presents SQLite page reads and commits to the storage conveyer. It does not own PITR, fork metadata, or FDB compaction/reclaim. Those are storage-layer responsibilities under `engine/packages/depot/`. 
- -## Read Shape - -For page reads, the VFS asks storage for pages by database id and generation. Storage: - -1. Resolves the database branch ancestry. -2. Checks database size and the current head. -3. Uses PIDX to find recent DELTA owners. -4. Falls through to the latest SHARD version at or below the read txid. -5. Zero-fills valid database gaps. - -The VFS should treat missing pages above EOF differently from storage misses below EOF. - -## Commit Shape - -For commits, the VFS passes dirty pages to storage. Storage encodes the pages into LTX chunks, writes DELTA/PIDX rows, updates `COMMITS` and `VTX`, and advances `META/head` in one FDB transaction. - -The VFS does not write local SQLite database files. Local files would break the stateless storage invariant and bypass the branch machinery. - -## Reference Links - -- [SQLite VFS](../sqlite-vfs.md) -- [Depot crash course](../depot.md) -- VFS source: `engine/packages/depot-client/src/` diff --git a/engine/CLAUDE.md b/engine/CLAUDE.md index 51a1be16c6..ff2f1c63f6 100644 --- a/engine/CLAUDE.md +++ b/engine/CLAUDE.md @@ -75,7 +75,7 @@ rivet-engine udb -q 'ls 0/1/2/workflow/by_name_and_tag/pegboard_actor/str:actor_ ## Depot tests -- For Depot key layout, component responsibilities, VFS interaction, design constraints, and prior-art comparisons, read `docs-internal/engine/sqlite/`. +- For Depot key layout, component responsibilities, VFS interaction, design constraints, and prior-art comparisons, read `docs-internal/engine/depot/` (start with `overview.md`). - `depot` tests live in `engine/packages/depot/tests/`; do not add inline module test blocks. - Run `depot` tests against temp RocksDB-backed UniversalDB via `test_db()`, `checkpoint_test_db(...)`, and `reopen_test_db(...)` instead of mocked storage paths. - `depot` PIDX entries are stored as the PIDX key prefix plus a big-endian `u32` page number, with the value encoded as a raw big-endian `u64` txid. 
@@ -86,8 +86,9 @@ rivet-engine udb -q 'ls 0/1/2/workflow/by_name_and_tag/pegboard_actor/str:actor_ - `depot` LTX decoders should validate the varint page index against the actual page-frame layout instead of trusting footer offsets alone. - `depot` `get_pages(...)` should keep `/META/head`, cold PIDX loads, and DELTA/SHARD blob fetches inside one UDB transaction, then decode each unique blob once and evict stale cached PIDX rows that now need SHARD fallback. - `depot` fast-path commits should update an already-cached PIDX in memory after the store write, but must not load PIDX from store just to mutate it or the one-RTT path is gone. -- `depot` shrink writes must delete above-EOF PIDX rows and fully-above-EOF SHARD blobs inside the same commit/takeover transaction; compaction only cleans partial shards by filtering pages at or below `head.db_size_pages`. -- `depot` compaction should choose shard passes from the live PIDX scan, then delete DELTA blobs by comparing all existing delta keys against the remaining global PIDX references so multi-shard and overwritten deltas only disappear when every page ref is gone. +- `depot` shrink writes delete above-EOF PIDX rows and publish pruned SHARD versions at the truncating txid inside the same commit transaction; historical SHARD versions are never deleted or rewritten by truncate because pins and PITR coverage read through them. +- `depot` reclaim deletes DELTA rows at or below the hot watermark with no per-shard proof; COMMITS/VTX below the watermark survive only as keep-set islands (pins plus retained PITR interval representatives), and superseded SHARD versions are deleted once no covered txid reads through them. +- `depot` snapshot targets (pins, forks, restores) must land on covered txids or above the hot watermark; creation paths fence on `CMP/root` serializably and versionstamp targets snap down to the newest covered point. 
- `depot` metrics should record compaction pass duration and totals in `compactor/worker.rs`, while shard outcome metrics such as folded pages, deleted deltas, delta gauge updates, and lag stay in `compactor/shard.rs` to avoid double counting. - `depot` quota accounting bills `/META/head`, COMMITS, VTX, DELTA, and PIDX keys at commit time and credits them when installs or reclaim delete them; SHARD versions and PITR interval rows are never billed. `/META/quota` tracks the sum with signed atomic-add deltas. - `depot` latency tests that depend on `UDB_SIMULATED_LATENCY_MS` should live in a dedicated integration test binary, because UniversalDB caches that env var once per process with `OnceLock`. diff --git a/engine/packages/depot/CLAUDE.md b/engine/packages/depot/CLAUDE.md index b7d91479e4..d5e81b3d00 100644 --- a/engine/packages/depot/CLAUDE.md +++ b/engine/packages/depot/CLAUDE.md @@ -1,11 +1,11 @@ # Depot Package Notes -The per-database Depot engine. FDB is the authoritative store for OSS SQLite state. LTX V3 file format is used throughout commit and compaction payloads. +The per-database Depot engine. UDB is the authoritative store for OSS SQLite state. LTX V3 file format is used throughout commit and compaction payloads. ## Hard Constraints -- **No local SQLite files. Ever.** Not on disk, not on tmpfs, not as a hydrated cache file. The VFS speaks to Depot, and Depot speaks to FDB. -- **Lazy reads only.** Do not bulk pre-load database pages at open. Fetch pages on demand through PIDX/DELTA and FDB SHARD coverage. +- **No local SQLite files. Ever.** Not on disk, not on tmpfs, not as a hydrated cache file. The VFS speaks to Depot, and Depot speaks to UDB. +- **Lazy reads only.** Do not bulk pre-load database pages at open. Fetch pages on demand through PIDX/DELTA and UDB SHARD coverage. 
- **No OSS cold tier.** Do not reintroduce `cold_tier`, S3 object storage, cold manifests, cold compacter workflows, cold read fallback, or shard-cache fill workers in this package. - **Unsupported cold config must fail clearly.** Do not silently disable old `workflow_cold_storage` config shapes. - **Workflow compaction is the only compaction authority.** Do not reintroduce standalone compactor modules or tests. @@ -14,7 +14,7 @@ The per-database Depot engine. FDB is the authoritative store for OSS SQLite sta ## Read Path - PIDX/DELTA wins. -- FDB SHARD fallback is next. +- UDB SHARD fallback is next. - Valid database gaps are zero-filled. - Missing required source coverage returns a storage error. - `debug::read_at` cannot trust PIDX alone; it scans DELTA history up to the target txid before falling through to SHARD rows and zero-fill. @@ -26,24 +26,29 @@ The per-database Depot engine. FDB is the authoritative store for OSS SQLite sta - `DbManagerState.active_jobs` stores concrete hot/reclaim active jobs. - `ForceCompactionWork` supports hot, reclaim, and final settle work. - `CompactionRoot` retains cold watermark fields for legacy persisted compatibility only. -- Reclaimer planning owns deletion eligibility from current manifest, pins, PIDX, SHARD, and staged hot output. +- The install that advances `hot_watermark_txid` is the coverage proof: DELTA rows at or below the watermark are reclaimable with no per-shard or PIDX proof. +- COMMITS/VTX below the watermark survive only as keep-set islands (pins plus retained PITR interval representatives); superseded SHARD versions die once no covered txid reads through them. +- Snapshot targets (pins, forks, restores) must be covered or above the watermark; creation paths fence on `CMP/root` serializably and snap versionstamp targets down to the newest covered point. +- Ambiguous bucket fork proofs fail-safe only commit/VTX and shard-version deletes; they never block installs or delta reclaim. 
+- Truncate publishes pruned SHARD versions at the truncating txid and never deletes or rewrites historical versions. ## Keys And Types - Conveyer type domains live behind the `conveyer/types.rs` facade. - Keep branch, restore point, compaction, history-pin, storage, page, and id payloads in focused `conveyer/types/*.rs` files. - Do not add raw `serde_bare` persisted encodings; use versioned BARE (`vbare`) for persisted/wire-format data. -- When changing FDB key layout, branch metadata, or compactor responsibilities, update `docs-internal/engine/sqlite/{storage-structure,components,constraints-and-design-decisions}.md` in the same change. +- When changing UDB key layout, branch metadata, or compactor responsibilities, update `docs-internal/engine/depot/{storage-structure,components,constraints-and-design-decisions}.md` in the same change. ## Tests - Put Rust tests under `tests/`, not inline `#[cfg(test)] mod tests` in `src/`, unless private-module access is truly required. -- Keep test fixtures FDB-backed. Do not add filesystem object-storage stand-ins for OSS Depot. +- Keep test fixtures UDB-backed. Do not add filesystem object-storage stand-ins for OSS Depot. - Fault tests should use surviving commit, read, hot compaction, and reclaim fault points. 
## Reference Docs -- `docs-internal/engine/depot.md` -- `docs-internal/engine/sqlite/storage-structure.md` -- `docs-internal/engine/sqlite/components.md` -- `docs-internal/engine/sqlite/constraints-and-design-decisions.md` +- `docs-internal/engine/depot/overview.md` (start here: high-level system overview) +- `docs-internal/engine/depot/storage-structure.md` +- `docs-internal/engine/depot/components.md` +- `docs-internal/engine/depot/constraints-and-design-decisions.md` +- `docs-internal/engine/depot/comparison-to-other-systems.md` diff --git a/engine/packages/depot/README.md b/engine/packages/depot/README.md index 1d0cb0f5d0..c09d13b6e1 100644 --- a/engine/packages/depot/README.md +++ b/engine/packages/depot/README.md @@ -1,6 +1,6 @@ # depot -Per-database storage engine for Rivet's SQLite-on-FDB system. Depot owns FDB-backed durability, branch/fork metadata, restore points, PITR interval bookkeeping, hot compaction, and FDB cleanup. +Per-database storage engine for Rivet's SQLite-on-UDB system. Depot owns UDB-backed durability, branch/fork metadata, restore points, PITR interval bookkeeping, hot compaction, and UDB cleanup. OSS Depot does not include S3-backed cold storage. Configs that still specify SQLite workflow cold storage should be treated as unsupported instead of silently downgrading. @@ -8,23 +8,23 @@ OSS Depot does not include S3-backed cold storage. Configs that still specify SQ ```text src/ - conveyer/ commit/read paths, FDB keys, branch metadata, quotas + conveyer/ commit/read paths, UDB keys, branch metadata, quotas workflows/ DB manager, hot compacter, reclaimer compaction/ shared planning, payloads, workflow helpers gc/ branch/refcount/restore-point pin calculations doctor.rs storage diagnostics debug.rs historical debug reads and metadata dumps - inspect.rs raw FDB inspection helpers + inspect.rs raw UDB inspection helpers ``` ## Storage Model -Commits are durable after the FDB transaction commits. 
Depot stores dirty pages as LTX V3 DELTA chunks plus PIDX owner rows, then wakes workflow compaction when hot lag crosses thresholds. +Commits are durable after the UDB transaction commits. Depot stores dirty pages as LTX V3 DELTA chunks plus PIDX owner rows, then wakes workflow compaction when hot lag crosses thresholds. Reads resolve pages in this order: 1. PIDX-owned DELTA chunks. -2. Reader-visible FDB SHARD rows written by hot compaction. +2. Reader-visible UDB SHARD rows written by hot compaction. 3. Zero-fill only for valid gaps inside the database size. Missing DELTA/SHARD coverage is a storage error. Reads do not fall through to object storage in OSS. @@ -34,15 +34,15 @@ Missing DELTA/SHARD coverage is a storage error. Reads do not fall through to ob Each active database branch has: - `DbManagerWorkflow` — owns compaction state and dispatches manager-authorized work. -- `DbHotCompacterWorkflow` — stages compacted FDB SHARD output and reports completion. -- `DbReclaimerWorkflow` — deletes manager-authorized FDB rows and stale staged hot output. +- `DbHotCompacterWorkflow` — stages compacted UDB SHARD output and reports completion. +- `DbReclaimerWorkflow` — deletes manager-authorized UDB rows and stale staged hot output. Hot compaction is signal-driven by `DeltasAvailable` and explicit `ForceCompaction { hot: true }`. Reclaim runs from its own manager deadline. The manager never dispatches cold upload work in OSS. ## Important Invariants -- No local SQLite database files. The VFS talks to Depot; Depot talks to FDB. -- FDB remains the authoritative store for live and retained OSS history. +- No local SQLite database files. The VFS talks to Depot; Depot talks to UDB. +- UDB remains the authoritative store for live and retained OSS history. - Hot compaction output is staged first, then installed by the manager after revalidation. 
- Reclaim deletes only rows that manager planning proved safe against branch pins, restore-point pins, PITR coverage, and current branch state. - `CompactionRoot` keeps legacy cold watermark fields for persisted decode compatibility, but OSS code does not update or act on them. @@ -50,7 +50,7 @@ Hot compaction is signal-driven by `DeltasAvailable` and explicit `ForceCompacti ## Reference Docs -- `docs-internal/engine/depot.md` — system overview. -- `docs-internal/engine/sqlite/storage-structure.md` — FDB key layout. -- `docs-internal/engine/sqlite/components.md` — component responsibilities. -- `docs-internal/engine/sqlite/constraints-and-design-decisions.md` — design constraints and rationale. +- `docs-internal/engine/depot/overview.md` — high-level system overview (start here). +- `docs-internal/engine/depot/storage-structure.md` — UDB key layout. +- `docs-internal/engine/depot/components.md` — component responsibilities. +- `docs-internal/engine/depot/constraints-and-design-decisions.md` — design constraints and rationale.