Skip to content

[FLINK-37663][connector-base] Fix lost wakeup in FutureCompletingBlockingQueue#28463

Merged
MartijnVisser merged 2 commits into
apache:masterfrom
MartijnVisser:FLINK-37663-fcbq-lost-wakeup
Jun 17, 2026
Merged

[FLINK-37663][connector-base] Fix lost wakeup in FutureCompletingBlockingQueue#28463
MartijnVisser merged 2 commits into
apache:masterfrom
MartijnVisser:FLINK-37663-fcbq-lost-wakeup

Conversation

@MartijnVisser

@MartijnVisser MartijnVisser commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

What is the purpose of the change

FutureCompletingBlockingQueue (the split-fetcher to source-reader handover) has a lost-wakeup bug.
A putter that blocks in put() registers its Condition in the internal notFull queue, but that
condition is not always removed when the putter stops waiting:

  • when the putter is gracefully woken via wakeUpPuttingThread(int) (and returns false without
    enqueuing), and
  • when the putter's await() returns spuriously (permitted by the Condition contract) and it
    re-blocks, re-adding its condition.

The leftover (stale or duplicated) entry makes a later signalNextPutter(), fired when a consumer
frees a slot, signal a thread that is no longer waiting. A genuinely waiting putter is then never
woken, so a split fetcher can stall indefinitely.

This supersedes the stale #26446 by @herbherbherb, which first diagnosed this issue and proposed
removing the condition inside wakeUpPuttingThread. That covers the graceful-wakeup case but not the
duplicate-condition case (remove() deletes only one occurrence), so the lost wakeup can still
occur; this PR instead removes the condition on every await() return.

Brief change log

  • FutureCompletingBlockingQueue#waitOnPut: remove the putter's condition from notFull once
    await() returns (in a finally), restoring the invariant that notFull only holds conditions of
    currently waiting putters. The removal is O(number of concurrently blocked putters) and is off the
    per-record path (it runs only when a putter unblocks).
  • Added a @VisibleForTesting accessor (getNumberOfQueuedPutters()) so the regression test can
    deterministically sequence the two contending putters without depending on Thread.State. Like the
    existing @VisibleForTesting take() in the same class, this trips the blanket connector "depend
    only on public API" ArchUnit rule (because @VisibleForTesting is @Internal), so it is recorded
    in the flink-architecture-tests-production violation store alongside the existing entry.
  • Added FutureCompletingBlockingQueueTest#testWakeUpDoesNotStrandAnotherPutter, a deterministic
    regression test (first commit) that reproduces the stranded-putter stall and fails on the current
    code. The 10s latch wait is the functional assertion; the method-level @Timeout(60s) is a hang
    backstop for this deadlock-regression test.

The change is split into two commits: the first adds the failing reproduction test together with the
small @VisibleForTesting accessor it needs to sequence the putters deterministically; the second
applies the fix. (The accessor is grouped with the test it enables rather than with the behavioural
change.)

Verifying this change

This change added tests and can be verified as follows:

  • FutureCompletingBlockingQueueTest#testWakeUpDoesNotStrandAnotherPutter blocks two putters, wakes
    the first gracefully, then frees one slot and asserts the second (still-waiting) putter is
    signalled. It is split into a reproduction commit and a fix commit for red-green verification: it
    fails on the code before the fix (the still-waiting putter is stranded, the latch times out) and
    passes after the fix.
  • Existing tests in the module remain green (FutureCompletingBlockingQueueTest, SplitFetcherTest,
    SplitFetcherManagerTest, SourceReaderBaseTest).
  • Additionally validated out-of-tree with the Fray controlled-concurrency tester (not added as a
    dependency): on the unfixed code Fray deterministically reproduces the deadlock; with PR [FLINK-37663] [connector-base] Address potential sync issue in FutureCompletingBlockingQueue wakeup #26446's
    one-line change it still deadlocks (duplicate-condition case); with this fix it ran 5000 explored
    interleavings with no deadlock.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no (FutureCompletingBlockingQueue is @Internal)
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? no
  • If yes, how is the feature documented? not applicable

Was generative AI tooling used to co-author this PR?
  • Yes (Claude Code, Claude Opus 4.8)

Generated-by: Claude Code (Claude Opus 4.8)

@flinkbot

flinkbot commented Jun 16, 2026

Copy link
Copy Markdown
Collaborator

CI report:

Bot commands The @flinkbot bot supports the following commands:
  • @flinkbot run azure re-run the last Azure build

…ockingQueue lost wakeup

A freed slot signals a putter that wakeUpPuttingThread() already released instead of the genuinely waiting putter; the test reproduces the stall and fails without the fix. Its @VisibleForTesting accessor is recorded in the architecture violation store, like the existing take().

Generated-by: Claude Code (Claude Opus 4.8)
… wakeup

Remove a blocked putter's condition from the notFull queue when its await() returns, so signalNextPutter() no longer signals a putter that has stopped waiting after a graceful wakeUpPuttingThread() or a spurious wakeup.

Generated-by: Claude Code (Claude Opus 4.8)
@MartijnVisser MartijnVisser force-pushed the FLINK-37663-fcbq-lost-wakeup branch from cca0d9d to 71031be Compare June 16, 2026 22:15

@spuru9 spuru9 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@spuru9

spuru9 commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

CI report:

CI has passed. There is some issue with the comment reflecting pass.

@github-actions github-actions Bot added the community-reviewed PR has been reviewed by the community. label Jun 17, 2026
@MartijnVisser MartijnVisser merged commit 22c9986 into apache:master Jun 17, 2026
@MartijnVisser MartijnVisser deleted the FLINK-37663-fcbq-lost-wakeup branch June 17, 2026 14:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

community-reviewed PR has been reviewed by the community.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants