Context / Problem
With the upcoming libcartesi version 0.20.0, machines can be loaded/stored very quickly. This unlocks a potentially simpler and cheaper operational model: keep the cartesi-jsonrpc-machine offline for all running applications and instead load/store machine state as part of the input processing loop.
Today, the node architecture assumes an always-on remote machine service for running applications. The current input feed / execution loop and snapshotting strategy were not designed with fast load/store in mind, so adopting this capability will require design work, experimentation, and likely some refactoring.
Primary goal: reduce cloud footprint and cost by removing the need for a continuously running cartesi-jsonrpc-machine while preserving correctness, determinism, and the same externally observable results.
Goals
- Allow running applications with cartesi-jsonrpc-machine offline, using load/store of machine state on demand.
- Maintain deterministic execution and consistent outputs/hashes.
- Keep a safe path for incremental rollout and fallback to the current mode if needed.
Non-goals
- Full rewrite of the execution engine.
- Introducing new emulator features beyond what is necessary for load/store integration.
- Changing external APIs unless required (prefer internal implementation changes).
Proposed work (design + prototype)
1) Understand and validate the new primitives
- Identify the exact libcartesi APIs/semantics for fast load/store (file format, performance profile, atomicity guarantees, compatibility).
- Define error handling and recovery expectations (partial writes, corrupted snapshots, incompatible versions).
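One concrete piece of the error-handling question is detecting corrupted snapshots on load. A minimal sketch, assuming nothing about the actual snapshot format: prefix each stored blob with its own SHA-256 so a truncated or bit-flipped file is rejected instead of silently loaded (`store_with_checksum` and `load_with_checksum` are hypothetical names, not libcartesi APIs):

```python
import hashlib

def store_with_checksum(blob: bytes) -> bytes:
    # Prefix the snapshot with its SHA-256 so corruption is detectable on load.
    return hashlib.sha256(blob).digest() + blob

def load_with_checksum(data: bytes) -> bytes:
    # First 32 bytes are the digest, the rest is the snapshot payload.
    digest, blob = data[:32], data[32:]
    if hashlib.sha256(blob).digest() != digest:
        raise ValueError("corrupted snapshot")
    return blob
```

Whether libcartesi's own file format already provides an integrity check is exactly the kind of question step 1 should answer; if it does, this layer is unnecessary.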
2) Define the target runtime model
Design how the node processes inputs without a remote machine service:
- Where machine state lives (local disk, volume, object storage, etc.).
- When machine state is loaded (per input, per epoch, etc.).
- When machine state is stored (after each input, after each epoch, etc.).
- Concurrency rules (single-writer per app; read-only access patterns; parallelism across apps).
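The decisions above can be sketched as a single-writer processing loop per application: load the machine, process a batch of inputs, store, and let the machine go away. Everything here is a toy stand-in for the eventual libcartesi bindings (the real machine state is not a hash chain, and `load_machine`/`store_machine` are hypothetical names):

```python
import hashlib
from dataclasses import dataclass

@dataclass
class Machine:
    """Toy stand-in for a Cartesi machine: its 'state' is just a hash chain."""
    state: bytes = b"genesis"

    def apply_input(self, data: bytes) -> bytes:
        # Deterministic transition: next state depends only on (state, input).
        self.state = hashlib.sha256(self.state + data).digest()
        return self.state

# Hypothetical persistence layer: one snapshot blob per application.
snapshots: dict[str, bytes] = {}

def load_machine(app: str) -> Machine:
    return Machine(state=snapshots.get(app, b"genesis"))

def store_machine(app: str, machine: Machine) -> None:
    snapshots[app] = machine.state

def process_batch(app: str, inputs: list[bytes]) -> bytes:
    """Load -> process the whole batch -> store -> machine goes offline."""
    machine = load_machine(app)      # machine exists only for this call
    for data in inputs:
        machine.apply_input(data)
    store_machine(app, machine)      # persist before releasing resources
    return machine.state
```

Because the transition is deterministic, splitting the input stream across many load/process/store cycles must yield the same final state as one uninterrupted run; that invariant is what the batching granularity ("per input, per epoch") can be tuned against.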
3) Revisit snapshot strategy
Decide whether to:
- Keep the current snapshot approach unchanged and only alter how machines are executed, OR
- Replace the current snapshot approach with a new design.
Consider special cases:
- Initial syncing / catch-up: do we skip intermediate stores and only persist at the end, or store periodically to support restarts?
- Shutdown / restart semantics: what must be persisted to resume safely? Just the machine hash?
- Disk usage vs recovery time tradeoffs.
- Usage during disputes.
- Garbage collect old images from accepted epochs?
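The garbage-collection question can be made concrete with a simple retention rule: snapshots at or before the last accepted epoch are no longer needed (modulo keeping a few as restart points), while newer ones must survive in case of disputes. A sketch under those assumptions, with purely illustrative names:

```python
def collect_garbage(snapshot_epochs: list[int],
                    last_accepted_epoch: int,
                    keep_last: int = 1) -> tuple[list[int], list[int]]:
    """Split snapshot epochs into (kept, deleted).

    Snapshots newer than the last accepted epoch are always kept (they may
    be needed for disputes); of the older ones, only `keep_last` are
    retained as restart points.
    """
    old = sorted(e for e in snapshot_epochs if e <= last_accepted_epoch)
    recent = [e for e in snapshot_epochs if e > last_accepted_epoch]
    cut = max(0, len(old) - keep_last)   # how many old snapshots to drop
    return sorted(old[cut:] + recent), old[:cut]
```

Tuning `keep_last` is the disk-usage vs. recovery-time tradeoff mentioned above: more retained snapshots mean faster recovery to an arbitrary point, at the cost of disk.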
4) Decide how to phase out the always-online cartesi-jsonrpc-machine mode
- Optional flag/config to enable “offline machine mode” per environment.
- Explicit fallback strategy: if load/store fails, can we fall back to the remote server (or fail fast)?
- Observability requirements: metrics/logs to compare performance and detect regressions.
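One way to express these rollout knobs is a small config plus an execution wrapper that falls back to the remote server when the offline path fails. All names here are illustrative, not the node's actual configuration:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class MachineExecConfig:
    offline_machine_mode: bool = False   # feature flag, set per environment
    fallback_to_remote: bool = True      # on load/store failure, use remote

def run_input(cfg: MachineExecConfig, data: bytes,
              offline: Callable[[bytes], bytes],
              remote: Callable[[bytes], bytes]) -> bytes:
    """Route one input through the offline path, the remote path, or both."""
    if not cfg.offline_machine_mode:
        return remote(data)
    try:
        return offline(data)
    except OSError:                      # e.g. missing or corrupted snapshot
        if cfg.fallback_to_remote:
            # Emit a metric/log here so regressions in offline mode show up.
            return remote(data)
        raise                            # fail fast when fallback is disabled
```

Because outputs must be identical in both modes, the fallback is transparent to everything downstream; the observability hooks are what make silent fallbacks visible.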
5) Prototype
- Implement a minimal prototype behind a feature flag that can:
  - Load the machine for an app,
  - Process a small input batch,
  - Store the machine,
  - Restart and resume from stored state,
  - Produce identical outputs/hashes compared to the current mode.
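The last bullet is really an equivalence test between the two modes, and it can be stated independently of the real machine: run every input through one long-lived machine (current mode), then rerun with a store/restart between every input (offline mode), and require identical final hashes. A toy sketch of that harness, with `step` standing in for one machine transition:

```python
import hashlib

def step(state: bytes, data: bytes) -> bytes:
    # Deterministic toy transition standing in for "feed input, run machine".
    return hashlib.sha256(state + data).digest()

def run_always_on(inputs: list[bytes]) -> bytes:
    """Current mode: one long-lived machine processes every input."""
    state = b"genesis"
    for data in inputs:
        state = step(state, data)
    return state

def run_offline(inputs: list[bytes]) -> bytes:
    """Prototype mode: store after each input, 'restart', load, continue."""
    stored = b"genesis"
    for data in inputs:
        state = stored           # load from the last stored snapshot
        state = step(state, data)
        stored = state           # store; the machine is then discarded
    return stored
```

In the real prototype, `step` is the emulator and `stored` is a snapshot on disk, but the acceptance check is the same assertion.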
Deliverables
Acceptance criteria
Notes / Open questions to resolve in the design
- How do we ensure atomic stores and prevent corruption on crash/power loss?
- How do we bound disk growth across many apps and long-running nodes? A retention policy?
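For the atomicity question, the standard POSIX answer is write-to-temp, fsync, then atomically rename over the final path, so a crash at any point leaves either the old snapshot or the new one, never a partial write. A sketch (the snapshot file format itself is out of scope here):

```python
import os
import tempfile

def atomic_store(path: str, blob: bytes) -> None:
    """Write `blob` to `path` so a crash leaves either the old or new file.

    os.replace() maps to POSIX rename(), which atomically replaces the
    destination, so readers never observe a partially written snapshot.
    """
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".snapshot-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(blob)
            f.flush()
            os.fsync(f.fileno())     # data durable before it becomes visible
        os.replace(tmp, path)        # atomic on POSIX filesystems
        dfd = os.open(directory, os.O_RDONLY)
        try:
            os.fsync(dfd)            # make the rename itself durable
        finally:
            os.close(dfd)
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)           # clean up the temp file on failure
        raise
```

Note the temp file must live on the same filesystem as the destination (hence `dir=directory`), since rename is only atomic within one filesystem; this constrains the "where machine state lives" decision in step 2.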