Fix adaptive metrics decay when provider metrics are not updated #16048

SURYAS1306 wants to merge 3 commits into apache:3.3 from
Conversation
Codecov Report

❌ Patch coverage is
Additional details and impacted files

```diff
@@             Coverage Diff              @@
##                3.3    #16048     +/-   ##
============================================
- Coverage     60.75%    60.73%    -0.03%
+ Complexity    11757     11752        -5
============================================
  Files          1952      1952
  Lines         89012     89012
  Branches      13421     13421
============================================
- Hits          54079     54059       -20
- Misses        29367     29382       +15
- Partials       5566      5571        +5
```
Flags with carried forward coverage won't be shown.
Hi maintainers, this PR fixes the adaptive metrics decay issue when provider metrics are not updated and adds a unit test covering the scenario. All checks are green. I'd really appreciate your review when you have time. Thanks!
You'd better add a comparison test to ensure that this PR also performs well under high-QPS circumstances.
Hi @zrlw, thanks for the suggestion.
zrlw left a comment
We still need to evaluate this PR carefully as we should draw on relatively mature industry solutions to refactor the adaptive algorithm.
Hi @zrlw, thanks for the feedback. Understood: this PR focuses on fixing the incorrect decay behavior when provider metrics are not updated. The added high-QPS style test is intended to validate the correctness and stability of the existing adaptive logic under more realistic conditions, without altering the overall strategy.

I agree that the adaptive algorithm itself is an important topic and could benefit from deeper discussion and comparison with more mature industry solutions. I'm happy to participate in that discussion or help explore alternative designs if there is a preferred direction.

For now, this PR intentionally keeps the change minimal and low-risk, addressing the concrete issue of unstable decay behavior without introducing broader algorithmic refactoring. Please let me know how you'd like to proceed. Thanks.
Thanks for working on this fix. I spent some time tracing through the full call chain.

Bug Verification — Code-Level Trace

I traced the complete request lifecycle to confirm the root cause.

Step 1. After each invocation, the filter records the RT and publishes it to the metrics asynchronously:

```java
// AdaptiveLoadBalanceFilter.java L117-L121
metricsMap.put("rt", String.valueOf(System.currentTimeMillis() - startTime));
getExecutor().execute(() -> {
    adaptiveMetrics.setProviderMetrics(getServiceKey(invocation), metricsMap);
});
```

Step 2. setProviderMetrics stamps both timestamp fields with the same value:

```java
// AdaptiveMetrics.java L93-L94
metrics.currentProviderTime = serviceTime;
metrics.currentTime = serviceTime; // ← same value as above
```

Step 3. On the next request, getLoad() compares the two timestamps and takes the penalty branch:

```java
// AdaptiveMetrics.java L58-L60
if (metrics.currentProviderTime == metrics.currentTime) { // ← always true after Step 2
    metrics.lastLatency = timeout * 2L; // ← real RT overwritten
}
```

Since Step 2 guarantees that currentProviderTime equals currentTime, the penalty branch fires on every request that follows a metrics update: the real RT is thrown away and replaced with timeout * 2.

Step 4. The polluted value feeds directly into invoker selection:

```java
// AdaptiveLoadBalance.java L93-L96
long load1 = Double.doubleToLongBits(
        adaptiveMetrics.getLoad(getServiceKey(invoker1, invocation), weight1, timeout1));
```

No clamping, no normalization, no outlier detection: the polluted EWMA feeds straight into the routing decision.

Simulation Results

I ported the decay logic into a standalone simulation and ran three scenarios.

Scenario 1: Normal Traffic — Penalty Overwrites Real RT. Three servers (A=10ms, B=50ms, C=200ms), timeout=100ms: every round hits PENALTY. The 10ms server's EWMA (136.7) is only 1.5x lower than the 200ms server's (200.0) when it should be 20x lower. Adaptive load balancing degrades to near-random distribution.

Scenario 2: Node Degradation — Invisible to the Algorithm. Server A degrades from 10ms → 500ms at round 6 (GC pause, downstream timeout), yet A's load score barely moves, because the penalty value already dominates its EWMA; the degradation never becomes visible to the algorithm.

Scenario 3: Low QPS — Decay to Zero. Server C (200ms) sits idle for 2 seconds; by then its lastLatency has been right-shifted down to nothing. The slowest server's latency decays to zero — it now appears "fastest". Traffic floods the worst-performing node.
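To make Scenario 3 concrete, here is a minimal standalone sketch of the right-shift decay. The constants mirror the scenario above; the mapping from idle time to shift amount is an assumption for illustration, not the project's exact code:

```java
// Illustrative reproduction of the Scenario 3 decay; not Dubbo's actual code.
public class DecaySketch {
    public static void main(String[] args) {
        long timeoutMillis = 100; // scenario timeout
        long lastLatency = 200;   // server C's real latency in ms
        long idleMillis = 2000;   // 2 s with no provider updates

        // Assumed shape of the decay: the whole idle span is converted into
        // one shift amount and applied in a single step.
        long multiple = idleMillis / timeoutMillis + 1; // = 21
        lastLatency >>= multiple;                       // 200 >> 21 == 0

        // The slowest server now reports zero latency and looks "fastest".
        System.out.println("decayed lastLatency = " + lastLatency);
    }
}
```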
Safety Net Audit

I checked whether any compensating mechanism mitigates these bugs: as shown in Step 4, there is no clamping, normalization, or outlier detection anywhere in the selection path.

For comparison,

Comments on This PR's Approach

1.
What is the purpose of the change?
This PR fixes an issue in AdaptiveLoadBalance / AdaptiveMetrics where latency decay behaves incorrectly when provider metrics are not updated for a period of time.
Currently, when no new provider metrics arrive, getLoad() may repeatedly apply the penalty branch or aggressively right-shift lastLatency, which can result in stale or extreme values dominating the EWMA. This makes adaptive load balancing unstable, especially in low-QPS or intermittent-update scenarios.
This PR ensures that latency decays safely and progressively instead of collapsing or being stuck at penalty values.
Fixes #15810
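As an illustration of what "decays safely and progressively" could mean, here is a hedged sketch of a bounded decay step. It is not the exact change in this PR; the shift cap and the latency floor are assumptions chosen for the example:

```java
// Sketch of a bounded, progressive decay step (illustrative, not the PR's diff).
// Rather than right-shifting lastLatency by the full idle span at once, cap the
// per-call reduction and keep a non-zero floor so the value cannot collapse.
static long decayLatency(long lastLatency, long idleMillis, long timeoutMillis) {
    long windows = idleMillis / timeoutMillis;  // elapsed timeout windows
    int shift = (int) Math.min(windows, 3);     // cap: at most an 8x reduction per call
    return Math.max(lastLatency >> shift, 1L);  // floor of 1 ms
}
```

With timeout=100ms and 2 seconds idle, this yields 200 >> 3 = 25 rather than 0, so an idle slow provider still scores worse than a genuinely fast one.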
What is changed?
1. Improved decay logic in AdaptiveMetrics#getLoad()
2. Added unit test
Added testAdaptiveMetricsDecayWithoutProviderUpdate, which verifies the decay behavior when provider metrics are not updated for a period of time.
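A sketch of what such a test could look like is below. The method name comes from the PR description; the import for AdaptiveMetrics, the sleep-based idle window, and the exact assertion are assumptions, since the real diff is not shown in this thread:

```java
import static org.junit.jupiter.api.Assertions.assertTrue;

import java.util.HashMap;
import java.util.Map;

import org.junit.jupiter.api.Test;
// import of AdaptiveMetrics omitted: its package is not shown in this thread

class AdaptiveMetricsDecayTest {

    @Test
    void testAdaptiveMetricsDecayWithoutProviderUpdate() throws Exception {
        AdaptiveMetrics metrics = new AdaptiveMetrics();
        String key = "demoService"; // hypothetical service key

        // Seed one real observation, then stop reporting provider metrics.
        Map<String, String> report = new HashMap<>();
        report.put("rt", "200");
        metrics.setProviderMetrics(key, report);

        // Let several timeout windows elapse with no further updates.
        Thread.sleep(250);

        double load = metrics.getLoad(key, /* weight */ 100, /* timeout */ 100);

        // Per the PR description: the load should decay progressively and stay
        // finite, not collapse to 0 or lock at the timeout*2 penalty value.
        assertTrue(Double.isFinite(load) && load >= 0);
    }
}
```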
Why is this needed?
Adaptive load balancing relies on EWMA latency to reflect recent performance trends.
Without this fix, the penalty branch can overwrite real RT samples and idle providers can decay toward zero latency, so routing decisions reflect these artifacts rather than recent performance.
This change makes adaptive load balancing more stable, realistic, and robust under real-world traffic patterns.
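For intuition about why a single penalty sample destabilizes routing, the EWMA the algorithm maintains follows the standard update below; alpha = 0.5 is an illustrative assumption, not necessarily the coefficient Dubbo uses:

```java
// Standard exponentially weighted moving average update.
static double ewma(double previous, double sample, double alpha) {
    return alpha * sample + (1 - alpha) * previous;
}
// With alpha = 0.5, one timeout*2 = 200 ms penalty sample drags a 10 ms
// history to 105 ms: ewma(10, 200, 0.5) == 105.0
```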
Verifying this change
```
mvn -pl dubbo-cluster -am test
```

Checklist