Skip to content

[slopfix] test(universaldb): postgres leader failover + abort resolver task on driver drop#5336

Open
MasterPtato wants to merge 1 commit into
stack/slop-claude-opus-4-8-high-feat-ups-table-backed-postgres-transport-with-coalesced-doorbell-xutxrzrvfrom
stack/slopfix-test-universaldb-postgres-leader-failover-abort-resolver-task-on-driver-drop-tvmntvqo
Open

[slopfix] test(universaldb): postgres leader failover + abort resolver task on driver drop#5336
MasterPtato wants to merge 1 commit into
stack/slop-claude-opus-4-8-high-feat-ups-table-backed-postgres-transport-with-coalesced-doorbell-xutxrzrvfrom
stack/slopfix-test-universaldb-postgres-leader-failover-abort-resolver-task-on-driver-drop-tvmntvqo

Conversation

@MasterPtato

Copy link
Copy Markdown
Contributor

No description provided.

@MasterPtato

MasterPtato commented Jun 25, 2026

Copy link
Copy Markdown
Contributor Author

Stack for rivet-dev/rivet

Current stack:

Dependencies:

Get stack: forklift get 5336
Push local edits: forklift submit
Merge when ready: forklift merge 5336

@claude

claude Bot commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

Code Review

This PR adds a Postgres leader failover integration test for universaldb, fixes the resolver task to be properly aborted when the driver drops (preventing zombie lease renewal), refactors the universalpubsub Postgres driver from per-subscriber heartbeats to a single per-node heartbeat, and introduces a no-responders fast-fail path for request/reply patterns. The core correctness fix (aborting the resolver's JoinHandle in Drop) is well-targeted.


Findings

1. engine/packages/perf/src/lib.rs — PerfMeasure drop level change breaks a CI assertion

The PR demotes the dropped-without-finish log from tracing::warn! to tracing::debug!. The existing test in engine/packages/perf/tests/perf_measure.rs at line 201 asserts Level::WARN for exactly this event:

assert_eq!(Level::WARN, events[0].level);

This assertion will now fail. Beyond the broken test, demoting this to debug silently hides violations of the CLAUDE.md invariant that every PerfMeasure must end with perf_finish! or perf_abandon!, making it easy for future callers to accidentally discard measurements undetected.


2. engine/packages/universalpubsub/src/driver/postgres/mod.rs — Extra DB roundtrip on every request-reply publish

has_responders() is called unconditionally on every publish(subject, payload, Some(reply_subject)). This adds a two-subquery Postgres round-trip to every request-reply publish, doubling the DB load on what is the most latency-sensitive publish path (actor action dispatch, pegboard coordination, etc.).

A targeted alternative: only probe has_responders on the first attempt and use a shorter timeout for retry, or cache the subscriber count per subject with a short TTL.


3. engine/packages/universalpubsub/src/driver/postgres/mod.rs — Queue-subscriber branch in has_responders filters by hash only, not full subject

The broadcast-subscriber branch guards correctly with WHERE s.subject_hash = $1 AND s.subject = $2, but the queue-subscriber branch uses only WHERE s.subject_hash = $1 because ups_queue_subs stores no raw subject string. A hash collision between two different subjects would cause has_responders to return true for a subject with no actual subscriber, causing the request to publish a message that sits unclaimed until GC rather than returning an immediate NoResponders. With a 64-bit hash this is low-probability but the asymmetry is worth closing by adding a subject column to ups_queue_subs.


4. engine/packages/universaldb/tests/failover.rs — Hardcoded 4-second sleep before first connection attempt

tokio::time::sleep(Duration::from_secs(4)).await is the only guard between docker_config.start() and the first connect_raw / make_db call. On a loaded CI machine, Postgres may not be accepting connections in 4 seconds; make_db will then fail and the result will be misattributed to a product regression. The scripts/run/engine-postgres.sh change in this same PR replaced nc -z with pg_isready for exactly this reason. The test should do the same: retry the connection in a loop instead of sleeping a fixed amount.


5. engine/packages/universalpubsub/src/driver/postgres/mod.rstokio::spawn in Drop impls may not run during shutdown

Both PostgresSubscriber::drop and PostgresQueueSubscriber::drop spawn a cleanup task to DELETE the subscriber row. If the Tokio runtime is shutting down when the subscriber is dropped, the spawned future may never be polled, leaving stale rows in ups_subs / ups_queue_subs. The node-level GC eventually removes them, but the window (up to NODE_TTL_SECS = 30 s) means messages can be queued for dead subscribers on normal process shutdown. This is best-effort cleanup that is probably acceptable given GC coverage, but worth a comment explaining the trade-off.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant