
Make world-postgres stream delivery resilient to missed NOTIFY events#1881

Open
Pom4H wants to merge 3 commits into vercel:main from Pom4H:botify/world-postgres-self-healing-v3

Conversation


Pom4H (Contributor) commented Apr 30, 2026

Summary

Fixes #1855.

Make @workflow/world-postgres stream delivery resilient when the dedicated PostgreSQL LISTEN connection disconnects or a NOTIFY event is missed.

LISTEN/NOTIFY is treated as a wake-up signal only; the durable streams table remains the source of truth.

Changes

  • Make the dedicated LISTEN workflow_event_chunk client self-healing

    • attach error / end handlers
    • reconnect with bounded exponential backoff
    • re-run LISTEN after reconnect
    • stop reconnect attempts after close()
  • Add a polling fallback in readFromStream

    • track each reader's lastChunkId
    • query streams WHERE chunk_id > lastChunkId
    • use NOTIFY as the fast path
    • periodically poll as the recovery path
    • dedupe and preserve ordering by chunk_id
    • stop polling on EOF / cancel / controller close
  • Add streamPollIntervalMs configuration for PostgresWorldConfig

    • default keeps polling enabled
    • 0 disables the polling fallback
  • Add regression coverage for reconnect, missed notifications, dedupe, and polling behavior
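The dedupe-and-ordering rule in the polling fallback can be modeled as a small pure function. This is an illustrative sketch, not the PR's actual API: the `Chunk` shape and `deliverNew` name are assumptions; only the `chunk_id > lastChunkId` predicate and the ordering guarantee come from the description above.

```typescript
interface Chunk {
  chunkId: number;
  payload: string;
}

// Given the reader's high-water mark and a freshly polled batch (which may
// overlap chunks already delivered via NOTIFY), return only the new chunks
// in chunk_id order, plus the advanced high-water mark.
function deliverNew(
  lastChunkId: number,
  polled: Chunk[]
): { fresh: Chunk[]; lastChunkId: number } {
  const fresh = polled
    .filter((c) => c.chunkId > lastChunkId) // dedupe against already-delivered
    .sort((a, b) => a.chunkId - b.chunkId); // preserve chunk_id ordering
  const next = fresh.length > 0 ? fresh[fresh.length - 1].chunkId : lastChunkId;
  return { fresh, lastChunkId: next };
}
```

Because the filter is keyed on the high-water mark, the same function serves both the NOTIFY fast path and the periodic recovery poll without double delivery.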

Why

PostgreSQL NOTIFY is not a durable backlog. If the listener is disconnected, chunks may still be inserted successfully into the streams table while active readers stop receiving live updates.

This change makes stream reads durable by always re-querying persisted chunks newer than the reader's last delivered chunk_id.
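A sketch of what one recovery-path poll tick might look like, with the query function injected so the shape is testable without a live database. The table and column names (`streams`, `chunk_id`) follow the PR description; `pollTick`, the `stream_id`/`data` columns, and the exact SQL are illustrative assumptions, not the PR's actual code.

```typescript
type QueryFn = (
  sql: string,
  params: unknown[]
) => Promise<{ rows: { chunk_id: number; data: string }[] }>;

// Re-query persisted chunks newer than the reader's last delivered chunk_id
// and hand each one to the reader, advancing the high-water mark as we go.
// Returns the new high-water mark for the next tick.
async function pollTick(
  query: QueryFn,
  streamId: string,
  lastChunkId: number,
  deliver: (chunkId: number, data: string) => void
): Promise<number> {
  const { rows } = await query(
    "SELECT chunk_id, data FROM streams WHERE stream_id = $1 AND chunk_id > $2 ORDER BY chunk_id",
    [streamId, lastChunkId]
  );
  for (const row of rows) {
    deliver(row.chunk_id, row.data);
    lastChunkId = row.chunk_id;
  }
  return lastChunkId;
}
```

Since the predicate is `chunk_id > $2`, running the same tick from both a NOTIFY wake-up and the interval timer is idempotent: a chunk delivered by one path is filtered out by the other.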

Notes

This is compatible with higher-level stream reconnect logic. Core-level reconnect can recover client transport interruptions, but it cannot recover a PostgreSQL notification that was never delivered to a disconnected LISTEN client. The Postgres world still needs to treat the table as the durable source of truth.

Pom4H added 3 commits April 27, 2026 18:07

The dedicated `pg.Client` used for `LISTEN/NOTIFY` is long-lived and
will eventually be dropped by the server (idle TCP timeout, pgbouncer
rotation, k8s CNI eviction). Previously a single drop stopped all
stream delivery until process restart.

Two changes make delivery durable:

1. `listenChannel` now reconnects with bounded exponential backoff
   (250ms → 30s cap). The initial connect must succeed; subsequent
   reconnects are best-effort and logged.

2. `streams.get` runs a periodic `SELECT ... WHERE chunk_id > lastChunkId`
   as a safety net for chunks delivered while the LISTEN socket was
   reconnecting. The poll dedupes against in-band notifications via the
   existing `enqueue` ordering check. Configurable via
   `PostgresWorldConfig.streamPollIntervalMs` (default 5000ms; 0 to
   disable).

Tracks vercel#1855.
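The bounded backoff in point 1 could be as simple as the sketch below. Only the 250ms floor and 30s cap come from the commit message; the function name and the doubling factor are assumptions.

```typescript
// Bounded exponential backoff: attempt 0 → 250ms, doubling per retry,
// capped at 30 seconds so a long outage doesn't push retries out forever.
function backoffMs(attempt: number): number {
  return Math.min(250 * 2 ** attempt, 30_000);
}
```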

Tests cover three failure modes via testcontainers:
- polling fallback delivers chunks inserted with NOTIFY suppressed
- reader still receives chunks after pg_terminate_backend kills LISTEN
- listenChannel itself reconnects and delivers post-reconnect notifies

`enqueue` previously decremented `offset` and returned without updating
`lastChunkId`. The new polling fallback re-queries
`chunk_id > lastChunkId` every tick, so chunks intentionally skipped for
`startIndex` would come back on the next poll and be skipped again —
double-decrementing `offset` and eventually mis-delivering them once
`offset` hit zero.

Move the high-water mark update to the top of `enqueue`, before the skip
branch. Adds a regression test that pre-seeds two chunks, opens the
reader with `startIndex=2`, lets several poll ticks fire (none should
deliver), then writes a third chunk and asserts only the third reaches
the reader.
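A minimal model of the fixed ordering check: the high-water mark advances before the skip branch, so a polled re-query can never re-present a chunk that was intentionally skipped for `startIndex`. The state shape here is illustrative, not the PR's actual implementation.

```typescript
interface ReaderState {
  lastChunkId: number; // high-water mark, advanced for every chunk seen
  offset: number;      // chunks still to skip to honor startIndex
  delivered: number[];
}

function enqueue(state: ReaderState, chunkId: number): void {
  // Dedupe: a chunk arriving via both NOTIFY and the poll shows up twice.
  if (chunkId <= state.lastChunkId) return;
  // Advance the high-water mark *before* the skip branch. With the old
  // order, skipped chunks came back on the next poll and double-decremented
  // `offset`, eventually mis-delivering them once `offset` hit zero.
  state.lastChunkId = chunkId;
  if (state.offset > 0) {
    state.offset--;
    return;
  }
  state.delivered.push(chunkId);
}
```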

Two reliability issues surfaced on review:

1. After natural EOF, `streams.get` set `closed = true` and closed the
   controller but never cleared the polling `setInterval` or removed the
   EventEmitter listener. The timer kept ticking (no-op via the `closed`
   guard) and the listener stayed attached for the lifetime of the
   process. Extracted an idempotent `stop()` that clears both, called
   from `cancel()` and from the EOF branch in `enqueue`. As a side
   benefit, the polling timer is no longer started at all if the initial
   chunk batch already delivered EOF.

2. `listenChannel.close()` called during an in-flight `connect()` could
   race: `closed = true` was set while `await next.connect()` /
   `LISTEN` was still resolving, after which the just-connected client
   would attach its notification listener and persist past close. Added
   a `closed` re-check after the awaits — if close raced ahead, end the
   client immediately and bail.

Test: a regression test spies on `setInterval`/`clearInterval` and
asserts that every interval the streamer scheduled at the configured
poll cadence is cleared by the time the consumer reads `done: true`,
without the consumer needing to call `cancel()`.
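The idempotent cleanup in point 1 can be sketched as follows, assuming a Node-style interval timer plus an EventEmitter; all names here are illustrative rather than the PR's actual code.

```typescript
import { EventEmitter } from "node:events";

function makeReader(emitter: EventEmitter, pollIntervalMs: number) {
  const onChunk = () => {
    /* fast path: re-query chunks newer than the high-water mark */
  };
  emitter.on("chunk", onChunk);
  // pollIntervalMs = 0 disables the polling fallback entirely.
  const timer: ReturnType<typeof setInterval> | null =
    pollIntervalMs > 0 ? setInterval(onChunk, pollIntervalMs) : null;

  let stopped = false;
  // Idempotent: safe to call from both cancel() and the EOF branch.
  // Clears the interval AND detaches the listener, so neither outlives
  // the stream for the rest of the process lifetime.
  function stop(): void {
    if (stopped) return;
    stopped = true;
    if (timer) clearInterval(timer);
    emitter.removeListener("chunk", onChunk);
  }
  return { stop };
}
```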

changeset-bot Bot commented Apr 30, 2026

🦋 Changeset detected

Latest commit: 7f40c82

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 1 package
Name Type
@workflow/world-postgres Patch



vercel Bot commented Apr 30, 2026

@Pom4H is attempting to deploy a commit to the Vercel Labs Team on Vercel.

A member of the Team first needs to authorize it.



Development

Successfully merging this pull request may close these issues.

world-postgres: stream readers can stall after LISTEN disconnects or missed NOTIFY event
