Tighten tracer.metrics defaults to protect tight-heap JVMs#11500
Draft
dougqh wants to merge 1 commit into
Draft
Conversation
Cut the implicit TRACER_METRICS_MAX_PENDING default from 2048 (logical) to 128 on normal heap and to 64 at Xmx < 128 MB, and the implicit TRACER_METRICS_MAX_AGGREGATES default from 2048 to 256 at tight heap. Why --- The metrics inbox is an MpscArrayQueue<SpanSnapshot> sized to maxPending * LEGACY_BATCH_SIZE (64). With one ~120 B SpanSnapshot per slot, the prior 131072-slot default pinned ~15 MB worst-case in-flight when the aggregator stalled. At Xmx <= ~128 MB the G1 survivor region is too small to absorb that footprint -- observed catastrophically at Xmx 64 MB on spring-petclinic where the inbox overflowed young gen and triggered To-space Exhausted → Full GC storms (0 r/s in the worst case). New defaults bound the worst-case in-flight footprint at ~1 MB on normal heap and ~500 KB at tight heap, comfortably below typical survivor sizes and large enough to absorb the sub-second consumer stalls we actually see in practice (~0.8 s of buffer at 10 K spans/s on the normal-heap default). Customers who explicitly configure TRACER_METRICS_MAX_PENDING are unaffected; the LEGACY_BATCH_SIZE multiplier still applies to overrides. Only the implicit defaults shrink. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This comment has been minimized.
This comment has been minimized.
Contributor
🟢 Java Benchmark SLOs — All performance SLOs passed
PR vs. master resultsStartup Time
Commit: Load and DaCapo benchmarks can be triggered manually in the GitLab pipeline. Results will appear in the Benchmarking Platform UI after completion. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What Does This Do
Cut the implicit
tracer.metrics.max.pendingdefault from 2048 (logical) to 128 on normal heap and to 64 at Xmx<128MB, and the implicittracer.metrics.max.aggregatesdefault from 2048 to 256 at tight heap. Customers who explicitly configured either property keep their value.Motivation
The metrics inbox is an
MpscArrayQueue<SpanSnapshot>sized tomaxPending * LEGACY_BATCH_SIZE(64). With one ~120 BSpanSnapshotper slot, the prior 131,072-slot default pinned ~15 MB worst-case in-flight when the aggregator stalled.At Xmx ≤ ~128 MB the G1 survivor region is too small to absorb that footprint. Observed catastrophically at Xmx 64m on spring-petclinic —
SpanSnapshots overflow young gen and trigger To-space Exhausted → Full GC storms (0 r/s in the worst case).JFR allocation profile at Xmx 64 m attributes this to
SpanSnapshotbeing the #2 datadog allocator (~280 MB sampled bytes over 90 s) since #11381 introduced the producer/consumer split. The inbox amplifies the per-publish allocation into a heap-pressure problem only at tight heap.New defaults
Both are large enough to absorb the sub-second aggregator stalls we observe in practice (~0.8 s of buffer at 10 K spans/s on the normal-heap default).
What this is not
MpscArrayQueue<SpanSnapshot>.SpanSnapshotper metrics-eligible span.onStatsInboxFull.It's purely a bound on the inbox's worst-case footprint, sized for the survivor-region constraint that #11381's per-span allocation pattern made load-bearing.
Test plan
:internal-api:test—ConfigTest*passes:dd-trace-core:test—datadog.trace.common.metrics.*passes (92/92 locally)/tmp/petclinic-bench/run.shreproduces the bomb)🤖 Generated with Claude Code