[FLINK-39946][runtime] Unaligned checkpoint restore can mis-map union input channel state after reordering UIDed sources#28459

Open
bowenli86 wants to merge 1 commit into apache:master from bowenli86:dev/bowenli/codex/flink-39946-stable-union-input-order

bowenli86 commented Jun 16, 2026

Member

What is the purpose of the change

When a Flink job restores from an unaligned checkpoint, input channel state is restored by positional input gate/channel indexes rather than by a stable edge identity.

This can break when a job has multiple source operators feeding a union and the source declarations are reordered between job versions. Even if the source operators have stable UIDs, the generated JobEdge / input gate order can still change because initial source traversal is based on transient stream node ordering. After restore, in-flight channel state from one logical source may be applied to another logical input gate.

The PR makes UID-backed union sources connect in stable UID-derived JobVertexID order, so unaligned checkpoint restore sees the same input-gate positions even if source declaration order changes.
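
To illustrate the failure mode and the fix described above, here is a minimal sketch in plain Java. It is not Flink code: `SourceHead`, `gatesByDeclaration`, and `gatesByStableId` are hypothetical names, and `UUID.nameUUIDFromBytes` is only a stand-in for Flink's UID-derived `JobVertexID` (any deterministic function of the UID serves the purpose here).

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.UUID;

public class StableUnionOrderSketch {

    /** Hypothetical stand-in for a UID-backed source feeding a union; not a Flink class. */
    static final class SourceHead {
        final String uid;

        SourceHead(String uid) {
            this.uid = uid;
        }

        /** Assumption: a deterministic ID derived only from the UID, standing in for JobVertexID. */
        String stableId() {
            return UUID.nameUUIDFromBytes(uid.getBytes(StandardCharsets.UTF_8)).toString();
        }
    }

    /** Gate position i is given to the i-th declared source, so positions move when declarations move. */
    static List<String> gatesByDeclaration(List<SourceHead> declared) {
        List<String> gates = new ArrayList<>();
        for (SourceHead head : declared) {
            gates.add(head.uid);
        }
        return gates;
    }

    /** Gate positions follow the stable ID, so they do not depend on declaration order. */
    static List<String> gatesByStableId(List<SourceHead> declared) {
        return declared.stream()
                .sorted(Comparator.comparing(SourceHead::stableId))
                .map(head -> head.uid)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<SourceHead> v1 = List.of(new SourceHead("orders"), new SourceHead("payments"));
        List<SourceHead> v2 = List.of(new SourceHead("payments"), new SourceHead("orders"));

        // Channel state saved for gate 0 in v1 would be restored into gate 0 in v2.
        System.out.println("declaration order v1: " + gatesByDeclaration(v1));
        System.out.println("declaration order v2: " + gatesByDeclaration(v2));
        System.out.println("stable order v1:      " + gatesByStableId(v1));
        System.out.println("stable order v2:      " + gatesByStableId(v2));
    }
}
```

With declaration-based ordering the two job versions assign gate 0 to different sources, which is the condition under which positionally restored channel state lands on the wrong input. With the stable ordering both versions produce the same gate assignment.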

Brief change log

  • Updated job graph generation to order UID-backed union source entry points by stable UID-derived JobVertexID instead of declaration/node-id order.
  • Applied the same stable ordering in the adaptive graph reconnect path so lazily created downstream vertices get consistent input-gate order too.
  • Preserved legacy ordering for non-UID and uidHash-only source heads.
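
The interaction between stable ordering and the preserved legacy ordering can be sketched as follows. This is an assumed policy for illustration only; the description above does not spell out how mixed cases are resolved. In this sketch, heads with a user-set UID are sorted by a stable ID and written back into the slots that UID-backed heads occupied, while every other head (no UID, or `uidHash` only, modelled here as a `null` UID) keeps its declared slot. `Head` and `order` are hypothetical names, and `UUID.nameUUIDFromBytes` again stands in for the UID-derived `JobVertexID`.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.UUID;

public class UnionHeadOrderingPolicySketch {

    /** Hypothetical source head; uid is null when no user-set UID exists (assumption of this sketch). */
    static final class Head {
        final String name;
        final String uid;

        Head(String name, String uid) {
            this.name = name;
            this.uid = uid;
        }

        boolean hasUid() {
            return uid != null;
        }

        /** Stand-in for a UID-derived JobVertexID; only called on UID-backed heads. */
        String stableId() {
            return UUID.nameUUIDFromBytes(uid.getBytes(StandardCharsets.UTF_8)).toString();
        }
    }

    /** Returns head names in connection order under the assumed policy. */
    static List<String> order(List<Head> declared) {
        // Collect and sort only the UID-backed heads.
        List<Head> uidHeads = new ArrayList<>();
        for (Head head : declared) {
            if (head.hasUid()) {
                uidHeads.add(head);
            }
        }
        uidHeads.sort(Comparator.comparing(Head::stableId));

        // Refill UID-backed slots in stable order; legacy heads stay where they were declared.
        List<String> result = new ArrayList<>();
        int next = 0;
        for (Head head : declared) {
            result.add(head.hasUid() ? uidHeads.get(next++).name : head.name);
        }
        return result;
    }
}
```

Under this policy, swapping two UID-backed declarations leaves the result unchanged, and a job with no UID-backed heads gets exactly its declaration order.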

Verifying this change

This change added tests and can be verified as follows:

  • Added regression coverage showing that reversing the declaration order of UID-backed union sources produces the same sink input order.
  • Added regression coverage for mixed UID-backed union inputs plus an unrelated source.
  • Added coverage that uidHash-only union inputs keep declaration order.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? (no)

Was generative AI tooling used to co-author this PR?
  • Yes (please specify the tool below)
    codex

bowenli86 marked this pull request as ready for review June 16, 2026 00:09

flinkbot commented Jun 16, 2026

Collaborator

CI report:

Bot commands

The @flinkbot bot supports the following commands:
  • @flinkbot run azure: re-run the last Azure build

bowenli86 changed the title from "[FLINK-39946][runtime] Stabilize UID-backed union input order" to "[FLINK-39946][runtime] Unaligned checkpoint restore can mis-map union input channel state after reordering UIDed sources" Jun 16, 2026
bowenli86 requested a review from pnowojski June 16, 2026 03:16
@bowenli86

Member Author

This is related to unaligned checkpoints. @pnowojski, maybe you can help review?
