
Make world-postgres stream delivery resilient to missed NOTIFY events#1881

Open
Pom4H wants to merge 3 commits into vercel:main from Pom4H:botify/world-postgres-self-healing-v3

Conversation


Pom4H (Contributor) commented Apr 30, 2026

Summary

Fixes #1855.

Make @workflow/world-postgres stream delivery resilient when the dedicated PostgreSQL LISTEN connection disconnects or a NOTIFY event is missed.

LISTEN/NOTIFY is treated as a wake-up signal only; the durable streams table remains the source of truth.

Changes

  • Make the dedicated LISTEN workflow_event_chunk client self-healing

    • attach error / end handlers
    • reconnect with bounded exponential backoff
    • re-run LISTEN after reconnect
    • stop reconnect attempts after close()
  • Add a polling fallback in readFromStream

    • track each reader's lastChunkId
    • query streams WHERE chunk_id > lastChunkId
    • use NOTIFY as the fast path
    • periodically poll as the recovery path
    • dedupe and preserve ordering by chunk_id
    • stop polling on EOF / cancel / controller close
  • Add streamPollIntervalMs configuration for PostgresWorldConfig

    • default keeps polling enabled
    • 0 disables the polling fallback
  • Add regression coverage for reconnect, missed notifications, dedupe, and polling behavior
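The dedupe-and-ordering rule in the polling fallback can be modeled as a small pure function. This is an illustrative sketch, not the PR's actual API: the `Chunk` shape and `deliverNew` name are assumptions; only the `chunk_id > lastChunkId` predicate and the ordering guarantee come from the description above.

```typescript
interface Chunk {
  chunkId: number;
  payload: string;
}

// Given the reader's high-water mark and a freshly polled batch (which may
// overlap chunks already delivered via NOTIFY), return only the new chunks
// in chunk_id order, plus the advanced high-water mark.
function deliverNew(
  lastChunkId: number,
  polled: Chunk[]
): { fresh: Chunk[]; lastChunkId: number } {
  const fresh = polled
    .filter((c) => c.chunkId > lastChunkId) // dedupe against already-delivered
    .sort((a, b) => a.chunkId - b.chunkId); // preserve chunk_id ordering
  const next = fresh.length > 0 ? fresh[fresh.length - 1].chunkId : lastChunkId;
  return { fresh, lastChunkId: next };
}
```

Because the filter is keyed on the high-water mark, the same function serves both the NOTIFY fast path and the periodic recovery poll without double delivery.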

Why

PostgreSQL NOTIFY is not a durable backlog. If the listener is disconnected, chunks may still be inserted successfully into the streams table while active readers stop receiving live updates.

This change makes stream reads durable by always re-querying persisted chunks newer than the reader's last delivered chunk_id.
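A sketch of what one recovery-path poll tick might look like, with the query function injected so the shape is testable without a live database. The table and column names (`streams`, `chunk_id`) follow the PR description; `pollTick`, the `stream_id`/`data` columns, and the exact SQL are illustrative assumptions, not the PR's actual code.

```typescript
type QueryFn = (
  sql: string,
  params: unknown[]
) => Promise<{ rows: { chunk_id: number; data: string }[] }>;

// Re-query persisted chunks newer than the reader's last delivered chunk_id
// and hand each one to the reader, advancing the high-water mark as we go.
// Returns the new high-water mark for the next tick.
async function pollTick(
  query: QueryFn,
  streamId: string,
  lastChunkId: number,
  deliver: (chunkId: number, data: string) => void
): Promise<number> {
  const { rows } = await query(
    "SELECT chunk_id, data FROM streams WHERE stream_id = $1 AND chunk_id > $2 ORDER BY chunk_id",
    [streamId, lastChunkId]
  );
  for (const row of rows) {
    deliver(row.chunk_id, row.data);
    lastChunkId = row.chunk_id;
  }
  return lastChunkId;
}
```

Since the predicate is `chunk_id > $2`, running the same tick from both a NOTIFY wake-up and the interval timer is idempotent: a chunk delivered by one path is filtered out by the other.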

Notes

This is compatible with higher-level stream reconnect logic. Core-level reconnect can recover client transport interruptions, but it cannot recover a PostgreSQL notification that was never delivered to a disconnected LISTEN client. The Postgres world still needs to treat the table as the durable source of truth.

Pom4H added 3 commits April 27, 2026 18:07

The dedicated `pg.Client` used for `LISTEN/NOTIFY` is long-lived and
will eventually be dropped by the server (idle TCP timeout, pgbouncer
rotation, k8s CNI eviction). Previously a single drop stopped all
stream delivery until process restart.

Two changes make delivery durable:

1. `listenChannel` now reconnects with bounded exponential backoff
   (250ms → 30s cap). The initial connect must succeed; subsequent
   reconnects are best-effort and logged.

2. `streams.get` runs a periodic `SELECT ... WHERE chunk_id > lastChunkId`
   as a safety net for chunks delivered while the LISTEN socket was
   reconnecting. The poll dedupes against in-band notifications via the
   existing `enqueue` ordering check. Configurable via
   `PostgresWorldConfig.streamPollIntervalMs` (default 5000ms; 0 to
   disable).

Tracks vercel#1855.
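The bounded backoff in point 1 could be as simple as the sketch below. Only the 250ms floor and 30s cap come from the commit message; the function name and the doubling factor are assumptions.

```typescript
// Bounded exponential backoff: attempt 0 → 250ms, doubling per retry,
// capped at 30 seconds so a long outage doesn't push retries out forever.
function backoffMs(attempt: number): number {
  return Math.min(250 * 2 ** attempt, 30_000);
}
```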

Tests cover three failure modes via testcontainers:
- polling fallback delivers chunks inserted with NOTIFY suppressed
- reader still receives chunks after pg_terminate_backend kills LISTEN
- listenChannel itself reconnects and delivers post-reconnect notifies

`enqueue` previously decremented `offset` and returned without updating
`lastChunkId`. The new polling fallback re-queries
`chunk_id > lastChunkId` every tick, so chunks intentionally skipped for
`startIndex` would come back on the next poll and be skipped again —
double-decrementing `offset` and eventually mis-delivering them once
`offset` hit zero.

Move the high-water mark update to the top of `enqueue`, before the skip
branch. Adds a regression test that pre-seeds two chunks, opens the
reader with `startIndex=2`, lets several poll ticks fire (none should
deliver), then writes a third chunk and asserts only the third reaches
the reader.
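A minimal model of the fixed ordering check: the high-water mark advances before the skip branch, so a polled re-query can never re-present a chunk that was intentionally skipped for `startIndex`. The state shape here is illustrative, not the PR's actual implementation.

```typescript
interface ReaderState {
  lastChunkId: number; // high-water mark, advanced for every chunk seen
  offset: number;      // chunks still to skip to honor startIndex
  delivered: number[];
}

function enqueue(state: ReaderState, chunkId: number): void {
  // Dedupe: a chunk arriving via both NOTIFY and the poll shows up twice.
  if (chunkId <= state.lastChunkId) return;
  // Advance the high-water mark *before* the skip branch. With the old
  // order, skipped chunks came back on the next poll and double-decremented
  // `offset`, eventually mis-delivering them once `offset` hit zero.
  state.lastChunkId = chunkId;
  if (state.offset > 0) {
    state.offset--;
    return;
  }
  state.delivered.push(chunkId);
}
```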

Two reliability issues surfaced on review:

1. After natural EOF, `streams.get` set `closed = true` and closed the
   controller but never cleared the polling `setInterval` or removed the
   EventEmitter listener. The timer kept ticking (no-op via the `closed`
   guard) and the listener stayed attached for the lifetime of the
   process. Extracted an idempotent `stop()` that clears both, called
   from `cancel()` and from the EOF branch in `enqueue`. As a side
   benefit, the polling timer is no longer started at all if the initial
   chunk batch already delivered EOF.

2. `listenChannel.close()` called during an in-flight `connect()` could
   race: `closed = true` was set while `await next.connect()` /
   `LISTEN` was still resolving, after which the just-connected client
   would attach its notification listener and persist past close. Added
   a `closed` re-check after the awaits — if close raced ahead, end the
   client immediately and bail.

Test: a regression test spies on `setInterval`/`clearInterval` and
asserts that every interval the streamer scheduled at the configured
poll cadence is cleared by the time the consumer reads `done: true`,
without the consumer needing to call `cancel()`.
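The idempotent cleanup in point 1 can be sketched as follows, assuming a Node-style interval timer plus an EventEmitter; all names here are illustrative rather than the PR's actual code.

```typescript
import { EventEmitter } from "node:events";

function makeReader(emitter: EventEmitter, pollIntervalMs: number) {
  const onChunk = () => {
    /* fast path: re-query chunks newer than the high-water mark */
  };
  emitter.on("chunk", onChunk);
  // pollIntervalMs = 0 disables the polling fallback entirely.
  const timer: ReturnType<typeof setInterval> | null =
    pollIntervalMs > 0 ? setInterval(onChunk, pollIntervalMs) : null;

  let stopped = false;
  // Idempotent: safe to call from both cancel() and the EOF branch.
  // Clears the interval AND detaches the listener, so neither outlives
  // the stream for the rest of the process lifetime.
  function stop(): void {
    if (stopped) return;
    stopped = true;
    if (timer) clearInterval(timer);
    emitter.removeListener("chunk", onChunk);
  }
  return { stop };
}
```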

changeset-bot Bot commented Apr 30, 2026

🦋 Changeset detected

Latest commit: 7f40c82

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 1 package
Name Type
@workflow/world-postgres Patch



vercel Bot commented Apr 30, 2026

@Pom4H is attempting to deploy a commit to the Vercel Labs Team on Vercel.

A member of the Team first needs to authorize it.



Development

Successfully merging this pull request may close these issues.

world-postgres: stream readers can stall after LISTEN disconnects or missed NOTIFY event
