fix(healthcheck): use immediate clear() to ensure k8s endpoint change…#13029
fix(healthcheck): use immediate clear() to ensure k8s endpoint change…#13029tyq010101 wants to merge 1 commit intoapache:masterfrom
Conversation
03b48d9 to
3af6251
Compare
|
#12803 |
|
The current method of clearing all data causes a temporary loss of health status with each upstream configuration update. The correct approach is to precisely remove nodes that no longer exist in the new upstream configuration. This clears the old stale IPs while preserving the historical health status of unchanged nodes. |
|
3af6251 to
3ee8d5d
Compare
…s take effect Replace delayed_clear with immediate clear() in healthcheck_manager to fix an issue where k8s endpoint changes would not take effect immediately. The delayed clear could cause healthcheck to continue using stale IP addresses after endpoint updates. Old K8s Endpoint ip has destroyed, but healthCheck manager always check old ip address
3ee8d5d to
7476c07
Compare
|
Thanks for the explanation. I agree that delayed_clear has a problem in this scenario -- you're right that stale IPs can persist. But switching to clear() introduces a different issue: it wipes the health status of all nodes (including the ones that haven't changed), causing a temporary health status reset on every upstream update.
The correct approach is a diff-based removal: compare the old target list with the new node list, and only remove the nodes that no longer exist (C, D), while preserving A and B's health status. This removes only the stale IPs while preserving health history for unchanged nodes. |
moonming
left a comment
There was a problem hiding this comment.
Hi @tyq010101, thank you for looking into the Kubernetes healthcheck issue!
The change: Replacing delayed_clear() with immediate clear() to ensure k8s endpoint changes take effect immediately.
My concern: The original code likely uses delayed_clear() intentionally to avoid race conditions during configuration updates. Switching to immediate clear() could introduce issues:
- Multiple workers might try to clear simultaneously
- In-flight health checks could be disrupted
- The clear might happen while new endpoints are still being processed
Could you help clarify:
- What specific scenario led you to discover this issue?
- Have you observed the race condition I mentioned in testing?
- Why was
delayed_clear()used originally — was there a comment or commit message explaining the choice?
If you can provide more context on the failure scenario, we can work together on a solution that's both immediate and safe. Thank you!
|
Hi @tyq010101, following up on the previous review comments. Please let us know if you have any updates. Thank you. |
…s take effect
Replace delayed_clear() with immediate clear() in healthcheck_manager to fix an issue where k8s endpoint changes would not take effect immediately. The delayed clear could cause healthcheck to continue using stale IP addresses after endpoint updates.
Description
Which issue(s) this PR fixes:
Fixes #
Checklist