StatsAccessLogger: fixes connection gauge underflow crashes when decrementing metrics after Scope evictions.#43812
StatsAccessLogger: fixes connection gauge underflow crashes when decrementing metrics after Scope evictions.#43812TAOXUY wants to merge 7 commits intoenvoyproxy:mainfrom
Conversation
ggreenway
left a comment
There was a problem hiding this comment.
I don't think your fix is quite right.
I ran the integration test you added without your code changes, and it fails in an assertion ASSERT(used() || amount == 0); in sub(). I think either the assertion is no longer valid in the case of evicted stats, or the stat is being set to unused incorrectly.
if (scope->evictable_) {
MetricBag metrics(scope->scope_id_);
CentralCacheEntrySharedPtr& central_cache = scope->centralCacheMutableNoThreadAnalysis();
auto filter_unused = []<typename T>(StatNameHashMap<T>& unused_metrics) {
return [&unused_metrics](std::pair<StatName, T> kv) {
const auto& [name, metric] = kv;
if (metric->used()) {
metric->markUnused();
return false;
} else {
unused_metrics.try_emplace(name, metric);
return true;
}
};
};
The above code assumes that a stat is only ever held by a single scope (or other holder of a reference), which isn't correct. cc @kyessenov .
I think the use of std::min around all the sub() calls means that it's likely the counter could be incorrect. Even if this change prevents it from going negative, I think it is still an incorrect count.
/wait
When evicting unused stats from the central cache, we need to ensure that gauges actively referenced by components like AccessLogState are not evicted. The use_count() > 1 check prevents this, but a previous bug in evictUnused where the lambda parameter std::pair<StatName, T> kv was captured by value caused artificial inflation of the use_count due to the deep copy. This broke eviction entirely across the codebase. This commit fixes evictUnused by taking const auto& kv by reference, avoiding the deep copy and correctly applying the use_count() > 1 safeguard. Furthermore, AccessLogState now properly holds a GaugeSharedPtr in its State struct so its active references prevent premature eviction by evictUnused. The erroneous std::min safeguard during gauge subtractions is also removed as AccessLogState gauges will no longer be unfairly cleared. Signed-off-by: Xuyang Tao <taoxuy@google.com>
Updated with a interface to not evict per metric. We need to keep gauge not evicted in the scope as that it can be looked-up and then dec/inc on the same gauge. @kyessenov |
|
/retest |
Description: Fixes connection gauge underflow crashes in the Stats Access Logger when decrementing metrics after Scope evictions.
The original code correctly attempted to prevent "zombie" gauges by re-resolving metrics against the central store (via scope_->gaugeFromStatNameWithTags) during request destruction. However, it tried to reconstruct the gauge's identity using gauge_->tagExtractedStatName(). This failed because dynamic access-log tags (like %REQUEST_HEADER(...)%) are not registered with Envoy's global extractors. The extraction process returned a mangled base name and empty tags, forcing Scope to create a new 0-valued gauge. Subtracting 1 from it immediately crashed Envoy with a counter underflow.
Fix: Essentially we need to keep the gauge in the scope so that its value can be referenced to dec/inc. This PR introduce metric-level eviction disablement which allows us to safely decrement the gauge.
Risk Level: Low
Testing: Added StatsAccessLogIntegrationTest.ActiveRequestsGaugeScopeEviction, which synthetically forces an asynchronous scope eviction while a connection is still inflight. Verified that the gauge successfully decrements to 0 in the central store identically to a normal request finish.
Docs: NA
Release: NA
Platform Specific Features: no