Skip to content

StatsAccessLogger: fixes connection gauge underflow crashes when decrementing metrics after Scope evictions. #43812

Open
TAOXUY wants to merge 7 commits into envoyproxy:main from TAOXUY:fixStatDestructor

@TAOXUY (Contributor) commented Mar 6, 2026

Description: Fixes connection gauge underflow crashes in the Stats Access Logger when decrementing metrics after Scope evictions.

The original code correctly attempted to prevent "zombie" gauges by re-resolving metrics against the central store (via scope_->gaugeFromStatNameWithTags) during request destruction. However, it tried to reconstruct the gauge's identity from gauge_->tagExtractedStatName(). This failed because dynamic access-log tags (such as %REQUEST_HEADER(...)%) are not registered with Envoy's global tag extractors. Extraction therefore returned a mangled base name and empty tags, so Scope created a new gauge with value 0. Subtracting 1 from that gauge immediately crashed Envoy with a gauge underflow.
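The failure mode above can be reproduced in a small toy model. This is a sketch only: ToyScope, ToyGauge, extractWithoutExtractors and the flattened name format are hypothetical and do not mirror Envoy's real stats classes. It assumes a scope keyed by (name, tags) that creates a 0-valued gauge on a miss, and an extractor that, knowing no rule for the dynamic tag, returns the whole flattened name with no tags.

```cpp
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Toy model only: none of these names are Envoy APIs.
using Tags = std::vector<std::pair<std::string, std::string>>;

struct ToyGauge {
  unsigned long value = 0;
  // Returns false where Envoy would hit its underflow assertion.
  bool sub(unsigned long amount) {
    if (amount > value) return false;
    value -= amount;
    return true;
  }
};

class ToyScope {
 public:
  // Returns the gauge keyed by (name, tags), creating a 0-valued one if absent.
  std::shared_ptr<ToyGauge> gaugeFromNameWithTags(const std::string& name, const Tags& tags) {
    auto& slot = gauges_[std::make_pair(name, tags)];
    if (!slot) slot = std::make_shared<ToyGauge>();
    return slot;
  }

 private:
  std::map<std::pair<std::string, Tags>, std::shared_ptr<ToyGauge>> gauges_;
};

// With no extractor registered for a dynamic tag, the flattened name cannot be
// split back into base name + tags: the whole string comes back as the "name".
std::pair<std::string, Tags> extractWithoutExtractors(const std::string& flattened) {
  return {flattened, Tags{}};
}
```

Re-resolving a gauge created as ("active_requests", {x_user=alice}) through extractWithoutExtractors("active_requests.x_user.alice") yields a different key, so the scope returns a fresh gauge at 0 and sub(1) fails, while the original gauge (still at 1) is never decremented.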

Fix: The gauge must stay in the scope so that a later lookup resolves to the same gauge, whose value can then be incremented or decremented. This PR introduces metric-level eviction disablement, which keeps such gauges in the scope and allows them to be decremented safely.

Risk Level: Low

Testing: Added StatsAccessLogIntegrationTest.ActiveRequestsGaugeScopeEviction, which synthetically forces an asynchronous scope eviction while a connection is still in flight. Verified that the gauge decrements to 0 in the central store, exactly as it does when a request finishes normally.

Docs: NA

Release: NA

Platform Specific Features: no

Signed-off-by: Xuyang Tao <taoxuy@google.com>
Signed-off-by: Xuyang Tao <taoxuy@google.com>
TAOXUY changed the title from "StatsAccessLogger:" to "StatsAccessLogger: fixes connection gauge underflow crashes when decrementing metrics after Scope evictions." Mar 6, 2026
ggreenway self-assigned this Mar 6, 2026

@ggreenway (Member) left a comment

I don't think your fix is quite right.

I ran the integration test you added without your code changes, and it fails on the assertion `ASSERT(used() || amount == 0);` in `sub()`. I think either the assertion is no longer valid in the case of evicted stats, or the stat is being set to unused incorrectly.

```cpp
if (scope->evictable_) {
  MetricBag metrics(scope->scope_id_);
  CentralCacheEntrySharedPtr& central_cache = scope->centralCacheMutableNoThreadAnalysis();
  auto filter_unused = []<typename T>(StatNameHashMap<T>& unused_metrics) {
    return [&unused_metrics](std::pair<StatName, T> kv) {
      const auto& [name, metric] = kv;
      if (metric->used()) {
        metric->markUnused();
        return false;
      } else {
        unused_metrics.try_emplace(name, metric);
        return true;
      }
    };
  };
  // ... (rest of the eviction pass not quoted)
}
```

The above code assumes that a stat is only ever held by a single scope (or other holder of a reference), which isn't correct. cc @kyessenov
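The assertion failure follows directly from the quoted filter. In this hypothetical toy (not Envoy's Gauge class), a gauge is incremented by in-flight request state; an eviction pass then sees it as used, keeps it, and marks it unused. When the request later calls sub(1), the invariant `used || amount == 0` no longer holds even though the gauge's value is 1.

```cpp
// Hypothetical toy mirroring the quoted logic; not Envoy's Gauge class.
struct ToyGauge {
  unsigned long value = 0;
  bool used = false;
  void add(unsigned long n) { value += n; used = true; }
  void markUnused() { used = false; }
  // The invariant from the review: ASSERT(used() || amount == 0) in sub().
  bool subPreconditionHolds(unsigned long amount) const { return used || amount == 0; }
};

// The quoted filter: a used metric is kept but marked unused; an unused metric
// is reported as evictable. Returns true if the metric should be evicted.
bool filterUnused(ToyGauge& g) {
  if (g.used) {
    g.markUnused();
    return false;
  }
  return true;
}
```

The pass only "owns" the used flag if nothing else can touch the gauge between passes; a second holder that decrements later breaks that assumption.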

I think wrapping all the sub() calls in std::min means the gauge value is likely to be incorrect. Even if this change prevents it from going negative, I think it is still an incorrect count.
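A small hypothetical scenario shows why clamping hides the error rather than fixing it. Assume request 1 was counted on a gauge that has since been evicted, and the scope hands out a fresh 0-valued replacement. ClampedGauge is a toy that mimics a `std::min` guard around the decrement.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical toy gauge whose decrement is clamped with std::min.
struct ClampedGauge {
  std::uint64_t value = 0;
  void inc() { ++value; }
  void decClamped() { value -= std::min<std::uint64_t>(1, value); }
};

// Returns the replacement gauge's value at the point where exactly one
// request (request 2) is really in flight.
std::uint64_t observedWithOneRequestInFlight() {
  ClampedGauge replacement;  // fresh gauge created after the original was evicted
  replacement.inc();         // request 2 starts: value 1
  replacement.decClamped();  // request 1 ends and decrements the replacement: value 0
  return replacement.value;  // 0, although request 2 is still in flight
}
```

The gauge never goes negative, but it reports 0 active requests while one is active, and when request 2 finishes its decrement is silently dropped.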

/wait

TAOXUY added 2 commits March 8, 2026 18:09
When evicting unused stats from the central cache, we need to ensure that
gauges actively referenced by components like AccessLogState are not evicted.
The use_count() > 1 check prevents this, but a previous bug in evictUnused
defeated it: the lambda parameter std::pair<StatName, T> kv was taken by value,
so every call copied the map entry, and the copied pointer artificially
inflated use_count(). As a result the use_count() > 1 check passed for every
metric, which broke eviction entirely across the codebase.

This commit fixes evictUnused by taking the parameter as const auto& kv, which
binds to the map entry directly, avoiding the copy so that the use_count() > 1
safeguard works as intended.

Furthermore, AccessLogState now holds a GaugeSharedPtr in its State struct, so
its active reference prevents evictUnused from evicting the gauge prematurely.
The erroneous std::min safeguard around gauge subtractions is also removed,
since AccessLogState gauges are no longer evicted while still in use.

Signed-off-by: Xuyang Tao <taoxuy@google.com>
Signed-off-by: Xuyang Tao <taoxuy@google.com>
@TAOXUY (Contributor, Author) commented Mar 8, 2026


Replying to @ggreenway's review: updated with an interface that disables eviction per metric. The gauge must stay in the scope (not be evicted) so that a later lookup returns the same gauge, which can then be incremented or decremented. @kyessenov

TAOXUY added 3 commits March 8, 2026 21:12
Signed-off-by: Xuyang Tao <taoxuy@google.com>
Signed-off-by: Xuyang Tao <taoxuy@google.com>
Signed-off-by: Xuyang Tao <taoxuy@google.com>
@TAOXUY (Contributor, Author) commented Mar 9, 2026

/retest
