fix(healthcheck): use immediate clear() to ensure k8s endpoint change… by tyq010101 · Pull Request #13029 · apache/apisix

tyq010101 · 2026-02-25T03:12:05Z

…s take effect

Replace delayed_clear() with immediate clear() in healthcheck_manager to fix an issue where k8s endpoint changes would not take effect immediately. The delayed clear could cause healthcheck to continue using stale IP addresses after endpoint updates.

Description

Which issue(s) this PR fixes:

Fixes #

Checklist

I have explained the need for this PR and the problem it solves
I have explained the changes or the new features added to this PR
I have added tests corresponding to this change
I have updated the documentation to reflect this change
I have verified that this change is backward compatible (If not, please discuss on the APISIX mailing list first)

tyq010101 · 2026-02-25T06:35:31Z

#12803
I think this PR can fix this issue, @Baoyuantop.

Baoyuantop · 2026-02-27T07:45:27Z

The current method of clearing all data causes a temporary loss of health status with each upstream configuration update. The correct approach is to precisely remove nodes that no longer exist in the new upstream configuration. This clears the old stale IPs while preserving the historical health status of unchanged nodes.

tyq010101 · 2026-03-02T07:40:13Z

@Baoyuantop

You are correct, but delayed_clear() also causes a temporary update. The current problem is:

When delayed_clear() performs its cleanup, the K8s endpoint is no longer in the cache, so even when delayed_clear() runs, the old endpoint IP is not deleted. The current approach is an optimization of the previous implementation and does not make the result worse.

The old K8s endpoint IP has been destroyed, but the healthcheck manager keeps checking the old IP address.

Baoyuantop · 2026-03-12T06:56:58Z

Thanks for the explanation. I agree that delayed_clear has a problem in this scenario -- you're right that stale IPs can persist. But switching to clear() introduces a different issue: it wipes the health status of all nodes (including the ones that haven't changed), causing a temporary health status reset on every upstream update.
Here's what happens with clear():

Upstream has nodes A, B, C, D (all with accumulated health status)
K8s scales down to A, B
clear() removes all target data from shm (A, B, C, D)
New checker re-adds A, B as healthy by default
If A was actually unhealthy, traffic briefly flows to a broken node until the next health check cycle detects the failure again
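The sequence above can be replayed with a minimal sketch in plain Lua. The `store` table and the `add_target` and `clear` helpers are hypothetical stand-ins for the shared-memory target list, not the lua-resty-healthcheck API; the sketch only assumes that a newly added target starts out healthy, as described in the steps.

```lua
-- Hypothetical in-memory stand-in for the health status kept in shm.
local store = { status = {} }

-- Add a target; a target that is not yet known starts out healthy.
local function add_target(key)
    if store.status[key] == nil then
        store.status[key] = true
    end
end

-- Full clear: drop the status of every target, changed or not.
local function clear()
    store.status = {}
end

-- Upstream has nodes A, B, C, D.
for _, key in ipairs({ "A", "B", "C", "D" }) do
    add_target(key)
end
-- Health checks have marked A as unhealthy.
store.status["A"] = false

-- K8s scales down to A, B and the manager performs a full clear.
clear()
for _, key in ipairs({ "A", "B" }) do
    add_target(key)
end

print(store.status["A"]) --> true (the unhealthy mark was lost)
print(store.status["C"]) --> nil  (the stale node is gone)
```

In this sketch the stale nodes are removed as intended, but A is reported healthy again until a later health check marks it down.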

The correct approach is a diff-based removal: compare the old target list with the new node list, and only remove the nodes that no longer exist (C, D), while preserving A and B's health status.
Something like this in timer_create_checker:

local existing_checker = working_pool[resource_path]
if existing_checker then
    -- diff-based removal: only remove nodes that no longer exist
    -- build a set of "host:port" keys for the nodes in the new upstream
    local new_nodes_set = {}
    for _, node in ipairs(upstream.nodes) do
        new_nodes_set[node.host .. ":" .. (node.port or "")] = true
    end
    -- remove every existing target whose key is missing from the new set
    for _, target in ipairs(existing_checker.checker.targets) do
        local key = target.ip .. ":" .. (target.port or "")
        if not new_nodes_set[key] then
            existing_checker.checker:remove_target(target.ip, target.port, target.hostname)
        end
    end
    -- stop the old checker without clearing the remaining targets
    existing_checker.checker:stop()
end

This removes only the stale IPs while preserving health history for unchanged nodes.
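The diff step can also be tried outside APISIX with plain tables. The following is only a sketch: `node_key` and `stale_targets` are hypothetical helper names, and the field names (`host`/`port` on nodes, `ip`/`port` on targets) are assumed from the snippet above.

```lua
-- Build the "host:port" key used to compare nodes and targets.
local function node_key(host, port)
    return host .. ":" .. tostring(port or "")
end

-- Return the old targets that do not appear in the new node list.
local function stale_targets(old_targets, new_nodes)
    local keep = {}
    for _, node in ipairs(new_nodes) do
        keep[node_key(node.host, node.port)] = true
    end
    local stale = {}
    for _, target in ipairs(old_targets) do
        if not keep[node_key(target.ip, target.port)] then
            stale[#stale + 1] = target
        end
    end
    return stale
end

-- Example: four old targets, the upstream scales down to two nodes.
local old_targets = {
    { ip = "10.0.0.1", port = 80 },
    { ip = "10.0.0.2", port = 80 },
    { ip = "10.0.0.3", port = 80 },
    { ip = "10.0.0.4", port = 80 },
}
local new_nodes = {
    { host = "10.0.0.1", port = 80 },
    { host = "10.0.0.2", port = 80 },
}

local stale = stale_targets(old_targets, new_nodes)
for _, target in ipairs(stale) do
    print(node_key(target.ip, target.port))
end
--> 10.0.0.3:80
--> 10.0.0.4:80
```

Collecting the stale targets into a separate list before removing them avoids modifying a list while iterating over it, which could matter if the removal call changes the target list in place.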

moonming

Hi @tyq010101, thank you for looking into the Kubernetes healthcheck issue!

The change: Replacing delayed_clear() with immediate clear() to ensure k8s endpoint changes take effect immediately.

My concern: The original code likely uses delayed_clear() intentionally to avoid race conditions during configuration updates. Switching to immediate clear() could introduce issues:

Multiple workers might try to clear simultaneously
In-flight health checks could be disrupted
The clear might happen while new endpoints are still being processed

Could you help clarify:

What specific scenario led you to discover this issue?
Have you observed the race condition I mentioned in testing?
Why was delayed_clear() used originally? Was there a comment or commit message explaining the choice?

If you can provide more context on the failure scenario, we can work together on a solution that's both immediate and safe. Thank you!

Baoyuantop · 2026-03-18T06:55:53Z

Hi @tyq010101, following up on the previous review comments. Please let us know if you have any updates. Thank you.

dosubot bot added the size:S (this PR changes 10-29 lines, ignoring generated files) and bug (something isn't working) labels Feb 25, 2026

tyq010101 force-pushed the fix/k8s_ipaddress_no_update branch from 03b48d9 to 3af6251 on February 25, 2026 03:24

tyq010101 force-pushed the fix/k8s_ipaddress_no_update branch from 3af6251 to 3ee8d5d on March 2, 2026 07:45

tyq010101 force-pushed the fix/k8s_ipaddress_no_update branch from 3ee8d5d to 7476c07 on March 2, 2026 10:07

Baoyuantop added the wait for update (wait for the author's response in this issue/PR) label Mar 12, 2026

moonming requested changes Mar 16, 2026

tyq010101 wants to merge 1 commit into apache:master from tyq010101:fix/k8s_ipaddress_no_update


3 participants
