Conversation

@wlwilliamx
Collaborator

What problem does this PR solve?

Issue Number: close #3411

What is changed and how it works?

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Questions

Will it cause performance regression or break compatibility?

None

Do you need to update user documentation, design documentation or monitoring documentation?

None

Release note

Please refer to [Release Notes Language Style Guide](https://pingcap.github.io/tidb-dev-guide/contribute-to-tidb/release-notes-style-guide.html) to write a quality release note.

If you don't think this PR needs a release note then fill it with `None`.

ti-chi-bot added the do-not-merge/needs-triage-completed and release-note labels Dec 18, 2025

ti-chi-bot bot commented Dec 18, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign sdojjy for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

ti-chi-bot added the size/XXL label and removed the do-not-merge/needs-triage-completed label Dec 18, 2025
@gemini-code-assist

Summary of Changes

Hello @wlwilliamx, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical consistency issue within the scheduler where the state of active dispatcher operators could be lost or become inconsistent following a maintainer restart. By enhancing the Protocol Buffer definitions to include operator-specific metadata and implementing robust tracking mechanisms within the DispatcherManager, the system can now accurately persist and restore the state of these operators. This ensures that operations on table spans, such as adding, removing, or moving dispatchers, are correctly resumed and maintained, leading to a more resilient and consistent scheduling process.

Highlights

  • Operator State Persistence: Introduced mechanisms to persist and restore the state of in-progress dispatcher operators (add, remove, move, split, merge) across maintainer restarts, ensuring consistency.
  • Protocol Buffer Updates: Added a new OperatorType enum and fields to DispatcherConfig (enabledSplit) and ScheduleDispatcherRequest (operatorType) in the heartbeatpb protobuf definitions to support operator tracking.
  • Dispatcher Manager Enhancements: The DispatcherManager now uses sync.Map to track currentOperatorMap and redoCurrentOperatorMap, preventing duplicate operator scheduling and ensuring proper cleanup.
  • Bootstrap Operator Restoration: During maintainer bootstrap, existing operators are now restored from the MaintainerBootstrapResponse, allowing the system to resume operations on previously active table spans.
  • Operator Type Integration: All relevant operator creation and message sending functions across maintainer/operator and maintainer/replica packages have been updated to utilize the new OperatorType.
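The bootstrap-restoration flow described in the highlights can be sketched in Go. This is a minimal illustration under stated assumptions, not the PR's actual code: the struct layout and the restoreOperators helper are stand-ins, and only the names OperatorType, ScheduleDispatcherRequest, and MaintainerBootstrapResponse come from the summary above.

```go
package main

import "fmt"

// OperatorType mirrors the new enum added to heartbeatpb
// (the values here are illustrative, not the generated constants).
type OperatorType int

const (
	OpAdd OperatorType = iota
	OpRemove
	OpMove
	OpSplit
	OpMerge
)

// scheduleDispatcherRequest is a simplified stand-in for the
// heartbeatpb.ScheduleDispatcherRequest entries carried in a
// MaintainerBootstrapResponse.
type scheduleDispatcherRequest struct {
	SpanID       string
	OperatorType OperatorType
}

// restoreOperators rebuilds the in-flight operator map from the
// operators reported during bootstrap, so work on those spans is
// resumed rather than re-scheduled from scratch.
func restoreOperators(ops []*scheduleDispatcherRequest) map[string]OperatorType {
	restored := make(map[string]OperatorType, len(ops))
	for _, op := range ops {
		restored[op.SpanID] = op.OperatorType
	}
	return restored
}

func main() {
	resp := []*scheduleDispatcherRequest{
		{SpanID: "span-1", OperatorType: OpMove},
		{SpanID: "span-2", OperatorType: OpMerge},
	}
	restored := restoreOperators(resp)
	fmt.Println(len(restored), restored["span-1"] == OpMove) // 2 true
}
```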


wlwilliamx changed the title from "fix(scheduler): ensure span consistency for operators after maintainer restart" to "fix(spanController): ensure span consistency for operators after maintainer restart" Dec 18, 2025

gemini-code-assist bot left a comment

Code Review

This pull request introduces changes to ensure span consistency for operators after a maintainer restart. The main changes involve adding an enabledSplit flag to dispatchers and tracking ongoing operators to restore them during bootstrap. My review focuses on ensuring correctness, consistency, and robustness of these new mechanisms. I've identified a few areas for improvement, including a typo, a misleading comment, a copy-paste error in a log message, and an unhandled error. Overall, the changes are well-structured and address the intended problem.

@wlwilliamx
Collaborator Author

/test pull-integration-test


ti-chi-bot bot commented Dec 18, 2025

@wlwilliamx: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:

/test pull-build
/test pull-cdc-kafka-integration-heavy
/test pull-cdc-kafka-integration-light
/test pull-cdc-mysql-integration-heavy
/test pull-cdc-mysql-integration-light
/test pull-cdc-storage-integration-heavy
/test pull-cdc-storage-integration-light
/test pull-check
/test pull-error-log-review
/test pull-unit-test

The following commands are available to trigger optional jobs:

/test pull-build-next-gen
/test pull-cdc-kafka-integration-heavy-next-gen
/test pull-cdc-kafka-integration-light-next-gen
/test pull-cdc-mysql-integration-heavy-next-gen
/test pull-cdc-mysql-integration-light-next-gen
/test pull-cdc-pulsar-integration-light
/test pull-cdc-pulsar-integration-light-next-gen
/test pull-cdc-storage-integration-heavy-next-gen
/test pull-cdc-storage-integration-light-next-gen
/test pull-unit-test-next-gen

Use /test all to run the following jobs that were automatically triggered:

pull-build
pull-build-next-gen
pull-check
pull-error-log-review
pull-unit-test
pull-unit-test-next-gen

In response to this:

/test pull-integration-test

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@wlwilliamx
Collaborator Author

/test pull-cdc-mysql-integration-heavy

@wlwilliamx
Collaborator Author

/test pull-cdc-mysql-integration-light

@wlwilliamx
Collaborator Author

/test pull-cdc-mysql-integration-heavy

@wlwilliamx
Collaborator Author

/test pull-cdc-mysql-integration-heavy

// or just a part of the table (span). When true, the dispatcher handles the entire table;
// when false, it only handles a portion of the table.
isCompleteTable bool
enabledSplit bool
Collaborator

Why do we need this field?

@wlwilliamx
Collaborator Author

/test pull-cdc-mysql-integration-heavy

)
}

response.Operators = append(response.Operators, &heartbeatpb.ScheduleDispatcherRequest{
Collaborator

Please do not use ScheduleDispatcherRequest here. ScheduleDispatcherRequest is for the add/remove scheduler actions sent from the maintainer to dispatchers. Please use a new message type here.

Collaborator Author

Perhaps I could rename ScheduleDispatcherRequest to ScheduleDispatcherOperator; what do you think? Since the fields needed in the response are the same as those in the request, there's no need to create a separate one.

Collaborator

I still think it's better to use a separate message type. Using a separate message for each usage would make the code more readable and the message's meaning clearer.

Collaborator Author

If we use a separate struct, the code would require many unnecessary struct conversions. I think simply renaming the existing ScheduleDispatcherRequest to indicate that it's a struct used to store operator information would be sufficient.

continue
}
// If there is already an operator for the span, skip this request.
_, exists := dispatcherManager.currentOperatorMap.Load(operatorKey)
Collaborator

Consider this scenario: a dispatcher is performing a merge operation, but before it's finished, a drop table DDL causes the dispatcher to be removed. In this situation, could it happen that the dispatcher holding the operator has now received a new operator?

Collaborator

When do these continue branches happen? They cause the create or remove action to be skipped. Is there no correctness issue here?

Collaborator Author

The merge operator stores its data in another map, which does not conflict with this one. You can refer to the other merge operator's PR.
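The skip-if-in-flight check discussed in this thread can be sketched with Go's sync.Map. This is an illustrative sketch, not the PR's code: the method names tryTrack/untrack are invented, and the separate map for merge operators follows the explanation above (its real name in the PR may differ).

```go
package main

import (
	"fmt"
	"sync"
)

// Simplified stand-in for the dispatcher manager under discussion.
// Ordinary operators and merge operators are tracked in separate
// maps, so a pending merge does not collide with this check.
type dispatcherManager struct {
	currentOperatorMap sync.Map // span key -> in-flight add/remove/move operator
	mergeOperatorMap   sync.Map // span key -> in-flight merge operator (kept separately)
}

// tryTrack records an operator for the span unless one is already in
// flight; callers skip the request (continue) when it returns false.
func (m *dispatcherManager) tryTrack(key, opType string) bool {
	_, loaded := m.currentOperatorMap.LoadOrStore(key, opType)
	return !loaded
}

// untrack removes the span's entry once the operator finishes, so the
// span can be scheduled again later.
func (m *dispatcherManager) untrack(key string) {
	m.currentOperatorMap.Delete(key)
}

func main() {
	m := &dispatcherManager{}
	fmt.Println(m.tryTrack("span-1", "move")) // true: first request proceeds
	fmt.Println(m.tryTrack("span-1", "add"))  // false: duplicate skipped
	m.untrack("span-1")
	fmt.Println(m.tryTrack("span-1", "add")) // true: allowed after cleanup
}
```

LoadOrStore makes the check-then-record step atomic, which is why it is used here instead of a separate Load followed by Store.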

}
},
)
if ok := c.operatorController.AddOperator(op); !ok {
Collaborator

Could we just use pushOperator instead of AddOperator?

Collaborator Author

Are there any problems using AddOperator?

Collaborator

pushOperator is more complete: it handles metrics and logging, and also calls the start() function.

Collaborator Author

AddOperator does some checks and then calls pushOperator.
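The resolution of this thread, a checked entry point that delegates to an internal routine doing the bookkeeping, can be sketched as follows. Only the names AddOperator and pushOperator come from the discussion; the types, bodies, and the duplicate check are assumptions for illustration.

```go
package main

import (
	"fmt"
	"log"
)

// operator is a simplified stand-in for the scheduler operator type.
type operator struct {
	id      string
	started bool
}

type operatorController struct {
	operators map[string]*operator
}

// pushOperator does the bookkeeping the thread attributes to it:
// record the operator, emit a log line (metrics omitted here), and
// start it.
func (c *operatorController) pushOperator(op *operator) {
	c.operators[op.id] = op
	log.Printf("pushed operator %s", op.id)
	op.started = true // stands in for op.Start()
}

// AddOperator performs validity checks and, if they pass, delegates
// to pushOperator, matching the behaviour described in the thread.
func (c *operatorController) AddOperator(op *operator) bool {
	if _, dup := c.operators[op.id]; dup {
		return false // reject duplicates before pushing
	}
	c.pushOperator(op)
	return true
}

func main() {
	c := &operatorController{operators: map[string]*operator{}}
	op := &operator{id: "move-span-1"}
	fmt.Println(c.AddOperator(op), op.started) // first add succeeds and starts the operator
	fmt.Println(c.AddOperator(op))             // duplicate add is rejected
}
```

Calling AddOperator rather than pushOperator directly keeps the validation in one place while still getting the logging and start() behaviour.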

@wlwilliamx
Collaborator Author

/test pull-cdc-mysql-integration-heavy

@wlwilliamx
Collaborator Author

/test pull-cdc-mysql-integration-light

@wlwilliamx
Collaborator Author

/retest

@wlwilliamx
Collaborator Author

/test all

@wlwilliamx
Collaborator Author

/retest


ti-chi-bot bot commented Jan 5, 2026

@wlwilliamx: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Required | Rerun command
pull-error-log-review | 7e8b71a | true | /test pull-error-log-review
pull-cdc-storage-integration-heavy | 7e8b71a | true | /test pull-cdc-storage-integration-heavy

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Development

Successfully merging this pull request may close these issues.

add index can not sync to downstream and data inconsistency between upstream and downstream after ticdc rolling restart
