
Fix adaptive metrics decay when provider metrics are not updated#16048

Open
SURYAS1306 wants to merge 3 commits into apache:3.3 from SURYAS1306:fix-adaptive-metrics-decay

Conversation

@SURYAS1306
Contributor

What is the purpose of the change?

This PR fixes an issue in AdaptiveLoadBalance / AdaptiveMetrics where latency decay behaves incorrectly when provider metrics are not updated for a period of time.

Currently, when no new provider metrics arrive, getLoad() may repeatedly apply the penalty branch or aggressively right-shift lastLatency, which can result in stale or extreme values dominating EWMA. This makes adaptive load balancing unstable, especially in low-QPS or intermittent-update scenarios.

This PR ensures that latency decays safely and progressively instead of collapsing or being stuck at penalty values.

Fixes #15810

What is changed?

1. Improved decay logic in AdaptiveMetrics#getLoad()

  • Prevents lastLatency from collapsing to zero.
  • Ensures decay happens smoothly when provider metrics are not refreshed.
  • Avoids repeatedly applying the penalty path when timestamps are equal.
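The intended shape of the improved decay can be sketched like this (a minimal Python sketch of the idea, mirroring the right-shift in getLoad() rather than the exact Java patch; the floor value of 1 is illustrative):

```python
def decayed_latency(last_latency, elapsed_ms, timeout_ms, floor=1):
    """Halve lastLatency once per elapsed timeout window, but keep a
    non-zero floor so the EWMA never sees a collapsed-to-zero sample."""
    multiple = elapsed_ms // timeout_ms + 1
    return max(floor, last_latency >> multiple)

# 200ms latency after a 2s idle period (timeout=100ms):
# a plain right-shift gives 200 >> 21 == 0; the floor keeps it at 1.
print(decayed_latency(200, 2000, 100))
```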

2. Added unit test

Added testAdaptiveMetricsDecayWithoutProviderUpdate

Verifies that when provider metrics are not updated:

  • latency decays over time
  • penalty value is not stuck
  • EWMA continues to evolve

Why is this needed?

Adaptive load balancing relies on EWMA latency to reflect recent performance trends.

Without this fix:

  • old latency values can dominate indefinitely
  • penalty values may be repeatedly re-applied
  • low-traffic services become unfairly weighted

This change makes adaptive load balancing more stable, realistic, and robust under real-world traffic patterns.

Verifying this change

  • Added new unit test covering the decay scenario
  • All tests pass locally:
mvn -pl dubbo-cluster -am test

Checklist

@codecov-commenter

codecov-commenter commented Jan 25, 2026

Codecov Report

❌ Patch coverage is 50.00000% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 60.73%. Comparing base (f5d6436) to head (2113910).
⚠️ Report is 33 commits behind head on 3.3.

Files with missing lines Patch % Lines
...ain/java/org/apache/dubbo/rpc/AdaptiveMetrics.java 50.00% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##                3.3   #16048      +/-   ##
============================================
- Coverage     60.75%   60.73%   -0.03%     
+ Complexity    11757    11752       -5     
============================================
  Files          1952     1952              
  Lines         89012    89012              
  Branches      13421    13421              
============================================
- Hits          54079    54059      -20     
- Misses        29367    29382      +15     
- Partials       5566     5571       +5     
Flag Coverage Δ
integration-tests-java21 32.19% <0.00%> (+0.01%) ⬆️
integration-tests-java8 32.32% <0.00%> (-0.02%) ⬇️
samples-tests-java21 32.06% <0.00%> (-0.06%) ⬇️
samples-tests-java8 29.71% <0.00%> (-0.01%) ⬇️
unit-tests-java11 59.01% <50.00%> (-0.01%) ⬇️
unit-tests-java17 58.52% <50.00%> (+0.01%) ⬆️
unit-tests-java21 58.51% <50.00%> (-0.02%) ⬇️
unit-tests-java25 58.46% <50.00%> (-0.02%) ⬇️
unit-tests-java8 59.01% <50.00%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown.


@SURYAS1306
Contributor Author

Hi maintainers,

This PR fixes the adaptive metrics decay issue when provider metrics are not updated and adds a unit test covering the scenario.

All checks are green. I’d really appreciate your review when you have time.

Thanks!

@zrlw
Contributor

zrlw commented Jan 27, 2026

You'd better add a comparison test to ensure that this PR also performs well under high-QPS conditions.

@SURYAS1306
Contributor Author

Hi @zrlw, thanks for the suggestion.
That makes sense. I’ll add a comparison test to cover high-QPS scenarios and update the PR soon.

@SURYAS1306
Contributor Author

Hi @zrlw, thanks for the suggestion.
I’ve added a high-frequency-style test covering the adaptive metrics decay behavior and verified it by running the dubbo-cluster module tests locally.
All tests are passing now. I’d appreciate it if you could take another look.


@zrlw
Contributor

This comment was marked as outdated.

We still need to evaluate this PR carefully, as we should draw on relatively mature industry solutions to refactor the adaptive algorithm.

@SURYAS1306
Contributor Author

Hi @zrlw, thanks for the feedback.

Understood - this PR focuses on fixing the incorrect decay behavior when provider metrics are not updated. The added high-QPS style test is intended to validate the correctness and stability of the existing adaptive logic under more realistic conditions, without altering the overall strategy.

I agree that the adaptive algorithm itself is an important topic and could benefit from deeper discussion and comparison with more mature industry solutions. I’m happy to participate in that discussion or help explore alternative designs if there is a preferred direction.

For now, this PR intentionally keeps the change minimal and low-risk, addressing the concrete issue of unstable decay behavior without introducing broader algorithmic refactoring. Please let me know how you’d like to proceed.

Thanks.

@Wubabalala

Thanks for working on this fix. I spent some time tracing through the full call chain on main to verify the bug and evaluate this PR. Sharing my findings below — the bug is real and more impactful than it might appear.

Bug Verification — Code-Level Trace

I traced the complete request lifecycle to confirm the root cause:

Step 1. AdaptiveLoadBalanceFilter.onResponse() computes client-side RT and calls setProviderMetrics() asynchronously:

// AdaptiveLoadBalanceFilter.java L117-L121
metricsMap.put("rt", String.valueOf(System.currentTimeMillis() - startTime));
getExecutor().execute(() -> {
    adaptiveMetrics.setProviderMetrics(getServiceKey(invocation), metricsMap);
});

Step 2. setProviderMetrics() sets both timestamps to the same value:

// AdaptiveMetrics.java L93-L94
metrics.currentProviderTime = serviceTime;
metrics.currentTime = serviceTime;          // ← same value as above

Step 3. On the next request, getLoad() evaluates the penalty condition:

// AdaptiveMetrics.java L58-L60
if (metrics.currentProviderTime == metrics.currentTime) {  // ← always true after Step 2
    metrics.lastLatency = timeout * 2L;                     // ← real RT overwritten
}

Since Step 2 guarantees currentProviderTime == currentTime, the penalty branch fires on every normal response cycle. The real RT (set at L98 in setProviderMetrics) is immediately overwritten by timeout * 2L.

Step 4. chooseLowLoadInvoker() uses the corrupted value directly:

// AdaptiveLoadBalance.java L93-L96
long load1 = Double.doubleToLongBits(
        adaptiveMetrics.getLoad(getServiceKey(invoker1, invocation), weight1, timeout1));

No clamping, no normalization, no outlier detection. The polluted EWMA feeds straight into the routing decision.
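For illustration only, a guard at this boundary might look like the following (a hypothetical sanitized_load helper; nothing like it exists in the current code, and the clamp range is arbitrary):

```python
import math

def sanitized_load(raw_load, timeout_ms):
    # Hypothetical guard: reject non-finite values and clamp the
    # EWMA-derived load to a plausible range before routing on it.
    if math.isnan(raw_load) or math.isinf(raw_load):
        return float(timeout_ms) * 10      # treat as worst case
    return min(max(raw_load, 0.0), float(timeout_ms) * 10)

print(sanitized_load(float("nan"), 100))  # worst-case 1000.0
print(sanitized_load(-5.0, 100))          # clamped to 0.0
```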

Simulation Results

I ported AdaptiveMetrics.java faithfully to Python and ran three scenarios. Scripts are attached at the bottom.

Scenario 1: Normal Traffic — Penalty Overwrites Real RT

3 servers (A=10ms, B=50ms, C=200ms), timeout=100ms:

Round  Server       Real RT    lastLatency    EWMA       Branch
1      A (10ms)     10         200            102.5      PENALTY
1      B (50ms)     50         200            112.5      PENALTY
1      C (200ms)    200        200            150.0      PENALTY
...
10     A (10ms)     10         200            136.7      PENALTY
10     B (50ms)     50         200            150.0      PENALTY
10     C (200ms)    200        200            200.0      PENALTY

Every round hits PENALTY. The 10ms server's EWMA (136.7) is only about 1.5x lower than the 200ms server's (200.0), when the real latency gap is 20x. Adaptive load balancing degrades to near-random distribution.

Scenario 2: Node Degradation — Invisible to Algorithm

Server A degrades from 10ms → 500ms at round 6 (GC pause, downstream timeout):

Round  Event              Server   Real RT    lastLat    EWMA       Branch
5                         A        10         200        136.5      PENALTY
6      A DEGRADES→500ms   A        500        200        259.1      PENALTY
...
12                        A        500        200        300.0      PENALTY
12                        B        50         200        150.0      PENALTY

A's lastLatency remains 200 (penalty value), not 500. The algorithm is blind to the degradation. Traffic continues flowing to the degraded node — cascading failure risk with no circuit breaker to catch it (onError() at AdaptiveLoadBalanceFilter.java L131 only increments a counter).

Scenario 3: Low QPS — Decay to Zero

Server C (200ms) idle for 2 seconds, then getLoad() is called:

multiple = 2000ms / 100ms + 1 = 21
lastLatency = 200 >> 21 = 0

The slowest server's latency decays to zero — it now appears "fastest". Traffic floods the worst-performing node.
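The arithmetic is easy to check directly:

```python
timeout_ms = 100
idle_ms = 2000
multiple = idle_ms // timeout_ms + 1  # 21 shift steps
assert multiple == 21
print(200 >> multiple)  # 0: the latency signal is gone
```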

Safety Net Audit

I checked whether any compensating mechanism mitigates these bugs:

  • P2C random selection (AdaptiveLoadBalance.java L68-L78): present. Probabilistic buffer only; the polluted node isn't picked every time, but still gets roughly an equal share.
  • pickTime forced selection (AdaptiveMetrics.java L51-L53): present. Prevents starvation, but never triggers when all nodes have similar (polluted) EWMA.
  • Load value clamping / normalization: absent.
  • Circuit breaker / outlier detection: absent; onError() only increments a counter.
  • Fallback to another LB strategy: absent; LB strategies are mutually exclusive.
  • EWMA range validation: absent.

For comparison, ShortestResponseLoadBalance in the same codebase uses a sliding window over RpcStatus (real success counts and elapsed time) — a more robust design that doesn't suffer from this class of bug.

Comments on This PR's Approach

1. Changing == to > effectively disables the penalty path

After getLoad() executes, it sets currentTime = System.currentTimeMillis() (L65). For currentProviderTime > currentTime to hold, the provider's reported time would need to be in the future — essentially impossible under normal clock behavior.

This swings from "penalty always fires" to "penalty never fires". A more targeted fix: track whether new provider metrics have arrived since the last getLoad() call, e.g., a boolean providerUpdated flag that setProviderMetrics() sets to true and getLoad() resets to false.
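That suggestion could look roughly like this (a Python sketch of the hypothetical providerUpdated flag; names and structure are illustrative, not the proposed Java change):

```python
class MetricsSketch:
    def __init__(self):
        self.last_latency = 0
        self.provider_updated = False  # hypothetical flag

    def set_provider_metrics(self, rt_ms):
        self.last_latency = rt_ms
        self.provider_updated = True   # fresh sample arrived

    def latency_for_load(self, timeout_ms):
        if self.provider_updated:
            self.provider_updated = False  # consume the fresh sample
            return self.last_latency       # real RT, no penalty
        # no update since the last getLoad(): penalize once
        self.last_latency = timeout_ms * 2
        return self.last_latency

m = MetricsSketch()
m.set_provider_metrics(10)
print(m.latency_for_load(100))  # 10: real RT survives
print(m.latency_for_load(100))  # 200: penalty only when truly stale
```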

2. Math.max(1L, ...) — right idea, arbitrary floor

Preventing decay-to-zero is correct. But 1L is meaningless for a service with timeout=5000ms — a 1ms floor is effectively zero relative to the scale. Consider deriving the floor from timeout (e.g., timeout / 100) or making it configurable.
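A timeout-relative floor could be as simple as the following (illustrative; the divisor of 100 is just the example from above, and the right value would need tuning):

```python
def decay_floor(timeout_ms, divisor=100):
    # Floor scales with the service's timeout instead of a fixed 1ms.
    return max(1, timeout_ms // divisor)

print(decay_floor(100))   # 1
print(decay_floor(5000))  # 50: still visible against a 5000ms timeout
```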

3. Test robustness

  • Reflection (setAccessible) to manipulate private fields couples the test to internal field names. Driving the test through the public API (setProviderMetrics + getLoad) with controlled inputs would be more resilient.
  • Thread.sleep(150) is a classic source of CI flakiness. A Clock abstraction or time-independent assertions would be more reliable.
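The clock idea, sketched in Python (in the Java test this would be an injectable Clock or long supplier; everything here is illustrative):

```python
class FakeClock:
    """Deterministic stand-in for System.currentTimeMillis()."""
    def __init__(self, start_ms=0):
        self.ms = start_ms
    def now(self):
        return self.ms
    def advance(self, delta_ms):
        self.ms += delta_ms

def decayed_latency(last_latency, last_time_ms, clock, timeout_ms):
    multiple = (clock.now() - last_time_ms) // timeout_ms + 1
    return max(1, last_latency >> multiple)

clock = FakeClock()
t0 = clock.now()
clock.advance(2000)  # "idle" for 2s without any real sleeping
print(decayed_latency(200, t0, clock, 100))  # 1: decay fully exercised
```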

Verification Scripts

dubbo_adaptive_bug_sim.py — Basic penalty reproduction
"""
Dubbo AdaptiveMetrics Bug Simulation — Issue #15810
Reproduces: penalty branch fires on every normal response,
corrupting real latency data.
"""
import time

class AdaptiveMetrics:
    """Faithful port of Dubbo's AdaptiveMetrics.java (main branch)"""
    def __init__(self):
        self.current_provider_time = 0
        self.provider_cpu_load = 0.0
        self.last_latency = 0
        self.current_time = 0
        self.pick_time = int(time.time() * 1000)
        self.beta = 0.5
        self.ewma = 0.0
        self.consumer_req = 0
        self.consumer_success = 0

    def _now(self):
        return int(time.time() * 1000)

    def get_load(self, weight, timeout):
        if self._now() - self.pick_time > timeout * 2:
            return 0, "FORCED"
        branch = "NO_DATA"
        if self.current_time > 0:
            multiple = (self._now() - self.current_time) // timeout + 1
            if multiple > 0:
                if self.current_provider_time == self.current_time:
                    self.last_latency = timeout * 2
                    branch = "PENALTY"
                else:
                    self.last_latency = self.last_latency >> multiple
                    branch = "DECAY"
                self.ewma = self.beta * self.ewma + (1 - self.beta) * self.last_latency
                self.current_time = self._now()
        inflight = max(0, self.consumer_req - self.consumer_success)
        load = (self.provider_cpu_load * (self.ewma ** 0.5 + 1)
                * (inflight + 1)
                / (((self.consumer_success / (self.consumer_req + 1)) * weight) + 1))
        return load, branch

    def set_provider_metrics(self, service_time, rt, cpu_load):
        if self.current_provider_time > service_time:
            return
        self.current_provider_time = service_time
        self.current_time = service_time
        self.provider_cpu_load = cpu_load
        self.last_latency = rt
        self.ewma = self.beta * self.ewma + (1 - self.beta) * self.last_latency

    def add_req(self): self.consumer_req += 1
    def add_success(self): self.consumer_success += 1

timeout = 100
servers = {
    "A (10ms)":  {"m": AdaptiveMetrics(), "rt": 10},
    "B (50ms)":  {"m": AdaptiveMetrics(), "rt": 50},
    "C (200ms)": {"m": AdaptiveMetrics(), "rt": 200},
}
for r in range(1, 11):
    for name, s in servers.items():
        m = s["m"]
        now = int(time.time() * 1000)
        m.set_provider_metrics(now, s["rt"], 0.5)
        m.add_req(); m.add_success()
        time.sleep(0.001)
        load, branch = m.get_load(100, timeout)
        print(f"Round {r}: {name} real_rt={s['rt']} lastLat={m.last_latency} ewma={m.ewma:.1f} [{branch}]")

dubbo_adaptive_distributed_sim.py — Node degradation + Low QPS decay
"""
Dubbo AdaptiveMetrics Bug — Distributed Scenarios
Scenario 1: Node degradation (10ms -> 500ms), algorithm blind to change
Scenario 2: Low QPS decay-to-zero, slowest server looks "fastest"
"""
import time

class AdaptiveMetrics:
    def __init__(self):
        self.current_provider_time = 0
        self.provider_cpu_load = 0.0
        self.last_latency = 0
        self.current_time = 0
        self.pick_time = int(time.time() * 1000)
        self.beta = 0.5
        self.ewma = 0.0
        self.consumer_req = 0
        self.consumer_success = 0

    def _now(self):
        return int(time.time() * 1000)

    def get_load(self, weight, timeout):
        if self._now() - self.pick_time > timeout * 2:
            return 0, "FORCED"
        branch = "NO_DATA"
        if self.current_time > 0:
            multiple = (self._now() - self.current_time) // timeout + 1
            if multiple > 0:
                if self.current_provider_time == self.current_time:
                    self.last_latency = timeout * 2
                    branch = "PENALTY"
                else:
                    self.last_latency = self.last_latency >> multiple
                    branch = f"DECAY(>>{multiple})"
                self.ewma = self.beta * self.ewma + (1 - self.beta) * self.last_latency
                self.current_time = self._now()
        return 0, branch

    def set_provider_metrics(self, service_time, rt, cpu_load):
        if self.current_provider_time > service_time:
            return
        self.current_provider_time = service_time
        self.current_time = service_time
        self.provider_cpu_load = cpu_load
        self.last_latency = rt
        self.ewma = self.beta * self.ewma + (1 - self.beta) * self.last_latency

    def add_req(self): self.consumer_req += 1
    def add_success(self): self.consumer_success += 1

# Scenario 1: Node Degradation
print("=== Scenario 1: Node Degradation ===")
timeout = 100
servers = {"A": {"m": AdaptiveMetrics(), "rt": 10},
           "B": {"m": AdaptiveMetrics(), "rt": 50},
           "C": {"m": AdaptiveMetrics(), "rt": 80}}
for r in range(1, 13):
    if r == 6: servers["A"]["rt"] = 500
    for name in ["A", "B", "C"]:
        s = servers[name]; m = s["m"]
        now = int(time.time() * 1000)
        m.set_provider_metrics(now, s["rt"], 0.5)
        m.add_req(); m.add_success()
        time.sleep(0.001)
        _, branch = m.get_load(100, timeout)
        marker = " <- DEGRADED" if r >= 6 and name == "A" else ""
        print(f"R{r} {name}: rt={s['rt']}ms lastLat={m.last_latency} ewma={m.ewma:.1f} [{branch}]{marker}")

# Scenario 2: Low QPS Decay
print("\n=== Scenario 2: Low QPS Decay to Zero ===")
m = AdaptiveMetrics()
now = int(time.time() * 1000)
m.set_provider_metrics(now, 200, 0.5)
m.add_req(); m.add_success()
time.sleep(0.001)
m.get_load(100, timeout)
print(f"Before idle: lastLat={m.last_latency}, ewma={m.ewma:.1f}")
m.current_time = int(time.time() * 1000) - 2000
m.current_provider_time = m.current_time - 1
m.get_load(100, timeout)
print(f"After 2s idle: lastLat={m.last_latency}, ewma={m.ewma:.1f}")
print(f"200 >> 21 = {200 >> 21} -- slowest server now looks 'fastest'")



Development

Successfully merging this pull request may close these issues.

[BUG] Questions about Dubbo Adaptive Load Balance

5 participants