
Fix adaptive metrics decay when provider metrics are not updated#16048

Open
SURYAS1306 wants to merge 3 commits into apache:3.3 from SURYAS1306:fix-adaptive-metrics-decay

Conversation

@SURYAS1306
Contributor

What is the purpose of the change?

This PR fixes an issue in AdaptiveLoadBalance / AdaptiveMetrics where latency decay behaves incorrectly when provider metrics are not updated for a period of time.

Currently, when no new provider metrics arrive, getLoad() may repeatedly apply the penalty branch or aggressively right-shift lastLatency, which can result in stale or extreme values dominating EWMA. This makes adaptive load balancing unstable, especially in low-QPS or intermittent-update scenarios.

This PR ensures that latency decays safely and progressively instead of collapsing or being stuck at penalty values.

Fixes #15810

What is changed?

1. Improved decay logic in AdaptiveMetrics#getLoad()

  • Prevents lastLatency from collapsing to zero.
  • Ensures decay happens smoothly when provider metrics are not refreshed.
  • Avoids repeatedly applying the penalty path when timestamps are equal.
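The intended shape of the improved decay can be sketched like this (a minimal Python sketch of the idea, mirroring the right-shift in getLoad() rather than the exact Java patch; the floor value of 1 is illustrative):

```python
def decayed_latency(last_latency, elapsed_ms, timeout_ms, floor=1):
    """Halve lastLatency once per elapsed timeout window, but keep a
    non-zero floor so the EWMA never sees a collapsed-to-zero sample."""
    multiple = elapsed_ms // timeout_ms + 1
    return max(floor, last_latency >> multiple)

# 200ms latency after a 2s idle period (timeout=100ms):
# a plain right-shift gives 200 >> 21 == 0; the floor keeps it at 1.
print(decayed_latency(200, 2000, 100))
```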

2. Added unit test

Added testAdaptiveMetricsDecayWithoutProviderUpdate

Verifies that when provider metrics are not updated:

  • latency decays over time
  • penalty value is not stuck
  • EWMA continues to evolve

Why is this needed?

Adaptive load balancing relies on EWMA latency to reflect recent performance trends.

Without this fix:

  • old latency values can dominate indefinitely
  • penalty values may be repeatedly re-applied
  • low-traffic services become unfairly weighted

This change makes adaptive load balancing more stable, realistic, and robust under real-world traffic patterns.

Verifying this change

  • Added new unit test covering the decay scenario
  • All tests pass locally:
mvn -pl dubbo-cluster -am test

Checklist

@codecov-commenter

codecov-commenter commented Jan 25, 2026

Codecov Report

❌ Patch coverage is 50.00000% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 60.73%. Comparing base (f5d6436) to head (2113910).
⚠️ Report is 33 commits behind head on 3.3.

Files with missing lines Patch % Lines
...ain/java/org/apache/dubbo/rpc/AdaptiveMetrics.java 50.00% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##                3.3   #16048      +/-   ##
============================================
- Coverage     60.75%   60.73%   -0.03%     
+ Complexity    11757    11752       -5     
============================================
  Files          1952     1952              
  Lines         89012    89012              
  Branches      13421    13421              
============================================
- Hits          54079    54059      -20     
- Misses        29367    29382      +15     
- Partials       5566     5571       +5     
Flag Coverage Δ
integration-tests-java21 32.19% <0.00%> (+0.01%) ⬆️
integration-tests-java8 32.32% <0.00%> (-0.02%) ⬇️
samples-tests-java21 32.06% <0.00%> (-0.06%) ⬇️
samples-tests-java8 29.71% <0.00%> (-0.01%) ⬇️
unit-tests-java11 59.01% <50.00%> (-0.01%) ⬇️
unit-tests-java17 58.52% <50.00%> (+0.01%) ⬆️
unit-tests-java21 58.51% <50.00%> (-0.02%) ⬇️
unit-tests-java25 58.46% <50.00%> (-0.02%) ⬇️
unit-tests-java8 59.01% <50.00%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown.


@SURYAS1306
Contributor Author

Hi maintainers,

This PR fixes the adaptive metrics decay issue when provider metrics are not updated and adds a unit test covering the scenario.

All checks are green. I’d really appreciate your review when you have time.

Thanks!

@zrlw
Contributor

zrlw commented Jan 27, 2026

You'd better add a comparison test to ensure that this PR also performs well under high-QPS conditions.

@SURYAS1306
Contributor Author

Hi @zrlw, thanks for the suggestion.
That makes sense. I’ll add a comparison test to cover high-QPS scenarios and update the PR soon.

@SURYAS1306
Contributor Author

Hi @zrlw, thanks for the suggestion.
I’ve added a high-frequency-style test covering the adaptive metrics decay behavior and verified it by running the dubbo-cluster module tests locally.
All tests are passing now. I’d appreciate it if you could take another look.


@zrlw
Contributor

This comment was marked as outdated.

We still need to evaluate this PR carefully, as we should draw on relatively mature industry solutions to refactor the adaptive algorithm.

@SURYAS1306
Contributor Author

Hi @zrlw, thanks for the feedback.

Understood - this PR focuses on fixing the incorrect decay behavior when provider metrics are not updated. The added high-QPS style test is intended to validate the correctness and stability of the existing adaptive logic under more realistic conditions, without altering the overall strategy.

I agree that the adaptive algorithm itself is an important topic and could benefit from deeper discussion and comparison with more mature industry solutions. I’m happy to participate in that discussion or help explore alternative designs if there is a preferred direction.

For now, this PR intentionally keeps the change minimal and low-risk, addressing the concrete issue of unstable decay behavior without introducing broader algorithmic refactoring. Please let me know how you’d like to proceed.

Thanks.

@Wubabalala

Thanks for working on this fix. I spent some time tracing through the full call chain on main to verify the bug and evaluate this PR. Sharing my findings below — the bug is real and more impactful than it might appear.

Bug Verification — Code-Level Trace

I traced the complete request lifecycle to confirm the root cause:

Step 1. AdaptiveLoadBalanceFilter.onResponse() computes client-side RT and calls setProviderMetrics() asynchronously:

// AdaptiveLoadBalanceFilter.java L117-L121
metricsMap.put("rt", String.valueOf(System.currentTimeMillis() - startTime));
getExecutor().execute(() -> {
    adaptiveMetrics.setProviderMetrics(getServiceKey(invocation), metricsMap);
});

Step 2. setProviderMetrics() sets both timestamps to the same value:

// AdaptiveMetrics.java L93-L94
metrics.currentProviderTime = serviceTime;
metrics.currentTime = serviceTime;          // ← same value as above

Step 3. On the next request, getLoad() evaluates the penalty condition:

// AdaptiveMetrics.java L58-L60
if (metrics.currentProviderTime == metrics.currentTime) {  // ← always true after Step 2
    metrics.lastLatency = timeout * 2L;                     // ← real RT overwritten
}

Since Step 2 guarantees currentProviderTime == currentTime, the penalty branch fires on every normal response cycle. The real RT (set at L98 in setProviderMetrics) is immediately overwritten by timeout * 2L.

Step 4. chooseLowLoadInvoker() uses the corrupted value directly:

// AdaptiveLoadBalance.java L93-L96
long load1 = Double.doubleToLongBits(
        adaptiveMetrics.getLoad(getServiceKey(invoker1, invocation), weight1, timeout1));

No clamping, no normalization, no outlier detection. The polluted EWMA feeds straight into the routing decision.
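For illustration only, a guard at this boundary might look like the following (a hypothetical sanitized_load helper; nothing like it exists in the current code, and the clamp range is arbitrary):

```python
import math

def sanitized_load(raw_load, timeout_ms):
    # Hypothetical guard: reject non-finite values and clamp the
    # EWMA-derived load to a plausible range before routing on it.
    if math.isnan(raw_load) or math.isinf(raw_load):
        return float(timeout_ms) * 10      # treat as worst case
    return min(max(raw_load, 0.0), float(timeout_ms) * 10)

print(sanitized_load(float("nan"), 100))  # worst-case 1000.0
print(sanitized_load(-5.0, 100))          # clamped to 0.0
```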

Simulation Results

I ported AdaptiveMetrics.java faithfully to Python and ran three scenarios. Scripts are attached at the bottom.

Scenario 1: Normal Traffic — Penalty Overwrites Real RT

3 servers (A=10ms, B=50ms, C=200ms), timeout=100ms:

Round  Server       Real RT    lastLatency    EWMA       Branch
1      A (10ms)     10         200            102.5      PENALTY
1      B (50ms)     50         200            112.5      PENALTY
1      C (200ms)    200        200            150.0      PENALTY
...
10     A (10ms)     10         200            136.7      PENALTY
10     B (50ms)     50         200            150.0      PENALTY
10     C (200ms)    200        200            200.0      PENALTY

Every round hits PENALTY. The 10ms server's EWMA (136.7) is only about 1.5x lower than the 200ms server's (200.0), when the real latency gap is 20x. Adaptive load balancing degrades to near-random distribution.

Scenario 2: Node Degradation — Invisible to Algorithm

Server A degrades from 10ms → 500ms at round 6 (GC pause, downstream timeout):

Round  Event              Server   Real RT    lastLat    EWMA       Branch
5                         A        10         200        136.5      PENALTY
6      A DEGRADES→500ms   A        500        200        259.1      PENALTY
...
12                        A        500        200        300.0      PENALTY
12                        B        50         200        150.0      PENALTY

A's lastLatency remains 200 (penalty value), not 500. The algorithm is blind to the degradation. Traffic continues flowing to the degraded node — cascading failure risk with no circuit breaker to catch it (onError() at AdaptiveLoadBalanceFilter.java L131 only increments a counter).

Scenario 3: Low QPS — Decay to Zero

Server C (200ms) idle for 2 seconds, then getLoad() is called:

multiple = 2000ms / 100ms + 1 = 21
lastLatency = 200 >> 21 = 0

The slowest server's latency decays to zero — it now appears "fastest". Traffic floods the worst-performing node.
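The arithmetic is easy to check directly:

```python
timeout_ms = 100
idle_ms = 2000
multiple = idle_ms // timeout_ms + 1  # 21 shift steps
assert multiple == 21
print(200 >> multiple)  # 0: the latency signal is gone
```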

Safety Net Audit

I checked whether any compensating mechanism mitigates these bugs:

  • P2C random selection (AdaptiveLoadBalance.java L68-L78): present. Probabilistic buffer only; the polluted node isn't picked every time, but still gets roughly an equal share.
  • pickTime forced selection (AdaptiveMetrics.java L51-L53): present. Prevents starvation, but never triggers when all nodes have similar (polluted) EWMA.
  • Load value clamping / normalization: absent.
  • Circuit breaker / outlier detection: absent; onError() only increments a counter.
  • Fallback to another LB strategy: absent; LB strategies are mutually exclusive.
  • EWMA range validation: absent.

For comparison, ShortestResponseLoadBalance in the same codebase uses a sliding window over RpcStatus (real success counts and elapsed time) — a more robust design that doesn't suffer from this class of bug.

Comments on This PR's Approach

1. Changing == to > effectively disables the penalty path

After getLoad() executes, it sets currentTime = System.currentTimeMillis() (L65). For currentProviderTime > currentTime to hold, the provider's reported time would need to be in the future — essentially impossible under normal clock behavior.

This swings from "penalty always fires" to "penalty never fires". A more targeted fix: track whether new provider metrics have arrived since the last getLoad() call, e.g., a boolean providerUpdated flag that setProviderMetrics() sets to true and getLoad() resets to false.
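That suggestion could look roughly like this (a Python sketch of the hypothetical providerUpdated flag; names and structure are illustrative, not the proposed Java change):

```python
class MetricsSketch:
    def __init__(self):
        self.last_latency = 0
        self.provider_updated = False  # hypothetical flag

    def set_provider_metrics(self, rt_ms):
        self.last_latency = rt_ms
        self.provider_updated = True   # fresh sample arrived

    def latency_for_load(self, timeout_ms):
        if self.provider_updated:
            self.provider_updated = False  # consume the fresh sample
            return self.last_latency       # real RT, no penalty
        # no update since the last getLoad(): penalize once
        self.last_latency = timeout_ms * 2
        return self.last_latency

m = MetricsSketch()
m.set_provider_metrics(10)
print(m.latency_for_load(100))  # 10: real RT survives
print(m.latency_for_load(100))  # 200: penalty only when truly stale
```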

2. Math.max(1L, ...) — right idea, arbitrary floor

Preventing decay-to-zero is correct. But 1L is meaningless for a service with timeout=5000ms — a 1ms floor is effectively zero relative to the scale. Consider deriving the floor from timeout (e.g., timeout / 100) or making it configurable.
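A timeout-relative floor could be as simple as the following (illustrative; the divisor of 100 is just the example from above, and the right value would need tuning):

```python
def decay_floor(timeout_ms, divisor=100):
    # Floor scales with the service's timeout instead of a fixed 1ms.
    return max(1, timeout_ms // divisor)

print(decay_floor(100))   # 1
print(decay_floor(5000))  # 50: still visible against a 5000ms timeout
```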

3. Test robustness

  • Reflection (setAccessible) to manipulate private fields couples the test to internal field names. Driving the test through the public API (setProviderMetrics + getLoad) with controlled inputs would be more resilient.
  • Thread.sleep(150) is a classic source of CI flakiness. A Clock abstraction or time-independent assertions would be more reliable.
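The clock idea, sketched in Python (in the Java test this would be an injectable Clock or long supplier; everything here is illustrative):

```python
class FakeClock:
    """Deterministic stand-in for System.currentTimeMillis()."""
    def __init__(self, start_ms=0):
        self.ms = start_ms
    def now(self):
        return self.ms
    def advance(self, delta_ms):
        self.ms += delta_ms

def decayed_latency(last_latency, last_time_ms, clock, timeout_ms):
    multiple = (clock.now() - last_time_ms) // timeout_ms + 1
    return max(1, last_latency >> multiple)

clock = FakeClock()
t0 = clock.now()
clock.advance(2000)  # "idle" for 2s without any real sleeping
print(decayed_latency(200, t0, clock, 100))  # 1: decay fully exercised
```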

Verification Scripts

dubbo_adaptive_bug_sim.py — Basic penalty reproduction
"""
Dubbo AdaptiveMetrics Bug Simulation — Issue #15810
Reproduces: penalty branch fires on every normal response,
corrupting real latency data.
"""
import time

class AdaptiveMetrics:
    """Faithful port of Dubbo's AdaptiveMetrics.java (main branch)"""
    def __init__(self):
        self.current_provider_time = 0
        self.provider_cpu_load = 0.0
        self.last_latency = 0
        self.current_time = 0
        self.pick_time = int(time.time() * 1000)
        self.beta = 0.5
        self.ewma = 0.0
        self.consumer_req = 0
        self.consumer_success = 0

    def _now(self):
        return int(time.time() * 1000)

    def get_load(self, weight, timeout):
        if self._now() - self.pick_time > timeout * 2:
            return 0, "FORCED"
        branch = "NO_DATA"
        if self.current_time > 0:
            multiple = (self._now() - self.current_time) // timeout + 1
            if multiple > 0:
                if self.current_provider_time == self.current_time:
                    self.last_latency = timeout * 2
                    branch = "PENALTY"
                else:
                    self.last_latency = self.last_latency >> multiple
                    branch = "DECAY"
                self.ewma = self.beta * self.ewma + (1 - self.beta) * self.last_latency
                self.current_time = self._now()
        inflight = max(0, self.consumer_req - self.consumer_success)
        load = (self.provider_cpu_load * (self.ewma ** 0.5 + 1)
                * (inflight + 1)
                / (((self.consumer_success / (self.consumer_req + 1)) * weight) + 1))
        return load, branch

    def set_provider_metrics(self, service_time, rt, cpu_load):
        if self.current_provider_time > service_time:
            return
        self.current_provider_time = service_time
        self.current_time = service_time
        self.provider_cpu_load = cpu_load
        self.last_latency = rt
        self.ewma = self.beta * self.ewma + (1 - self.beta) * self.last_latency

    def add_req(self): self.consumer_req += 1
    def add_success(self): self.consumer_success += 1

timeout = 100
servers = {
    "A (10ms)":  {"m": AdaptiveMetrics(), "rt": 10},
    "B (50ms)":  {"m": AdaptiveMetrics(), "rt": 50},
    "C (200ms)": {"m": AdaptiveMetrics(), "rt": 200},
}
for r in range(1, 11):
    for name, s in servers.items():
        m = s["m"]
        now = int(time.time() * 1000)
        m.set_provider_metrics(now, s["rt"], 0.5)
        m.add_req(); m.add_success()
        time.sleep(0.001)
        load, branch = m.get_load(100, timeout)
        print(f"Round {r}: {name} real_rt={s['rt']} lastLat={m.last_latency} ewma={m.ewma:.1f} [{branch}]")

dubbo_adaptive_distributed_sim.py — Node degradation + Low QPS decay
"""
Dubbo AdaptiveMetrics Bug — Distributed Scenarios
Scenario 1: Node degradation (10ms -> 500ms), algorithm blind to change
Scenario 2: Low QPS decay-to-zero, slowest server looks "fastest"
"""
import time

class AdaptiveMetrics:
    def __init__(self):
        self.current_provider_time = 0
        self.provider_cpu_load = 0.0
        self.last_latency = 0
        self.current_time = 0
        self.pick_time = int(time.time() * 1000)
        self.beta = 0.5
        self.ewma = 0.0
        self.consumer_req = 0
        self.consumer_success = 0

    def _now(self):
        return int(time.time() * 1000)

    def get_load(self, weight, timeout):
        if self._now() - self.pick_time > timeout * 2:
            return 0, "FORCED"
        branch = "NO_DATA"
        if self.current_time > 0:
            multiple = (self._now() - self.current_time) // timeout + 1
            if multiple > 0:
                if self.current_provider_time == self.current_time:
                    self.last_latency = timeout * 2
                    branch = "PENALTY"
                else:
                    self.last_latency = self.last_latency >> multiple
                    branch = f"DECAY(>>{multiple})"
                self.ewma = self.beta * self.ewma + (1 - self.beta) * self.last_latency
                self.current_time = self._now()
        return 0, branch

    def set_provider_metrics(self, service_time, rt, cpu_load):
        if self.current_provider_time > service_time:
            return
        self.current_provider_time = service_time
        self.current_time = service_time
        self.provider_cpu_load = cpu_load
        self.last_latency = rt
        self.ewma = self.beta * self.ewma + (1 - self.beta) * self.last_latency

    def add_req(self): self.consumer_req += 1
    def add_success(self): self.consumer_success += 1

# Scenario 1: Node Degradation
print("=== Scenario 1: Node Degradation ===")
timeout = 100
servers = {"A": {"m": AdaptiveMetrics(), "rt": 10},
           "B": {"m": AdaptiveMetrics(), "rt": 50},
           "C": {"m": AdaptiveMetrics(), "rt": 80}}
for r in range(1, 13):
    if r == 6: servers["A"]["rt"] = 500
    for name in ["A", "B", "C"]:
        s = servers[name]; m = s["m"]
        now = int(time.time() * 1000)
        m.set_provider_metrics(now, s["rt"], 0.5)
        m.add_req(); m.add_success()
        time.sleep(0.001)
        _, branch = m.get_load(100, timeout)
        marker = " <- DEGRADED" if r >= 6 and name == "A" else ""
        print(f"R{r} {name}: rt={s['rt']}ms lastLat={m.last_latency} ewma={m.ewma:.1f} [{branch}]{marker}")

# Scenario 2: Low QPS Decay
print("\n=== Scenario 2: Low QPS Decay to Zero ===")
m = AdaptiveMetrics()
now = int(time.time() * 1000)
m.set_provider_metrics(now, 200, 0.5)
m.add_req(); m.add_success()
time.sleep(0.001)
m.get_load(100, timeout)
print(f"Before idle: lastLat={m.last_latency}, ewma={m.ewma:.1f}")
m.current_time = int(time.time() * 1000) - 2000
m.current_provider_time = m.current_time - 1
m.get_load(100, timeout)
print(f"After 2s idle: lastLat={m.last_latency}, ewma={m.ewma:.1f}")
print(f"200 >> 21 = {200 >> 21} -- slowest server now looks 'fastest'")



Development

Successfully merging this pull request may close these issues.

[BUG] Questions about Dubbo Adaptive Load Balance

5 participants