
Tighten tracer.metrics defaults to protect tight-heap JVMs #11500

Draft
dougqh wants to merge 1 commit into master from dougqh/css-tight-heap-defaults


Contributor

@dougqh dougqh commented May 29, 2026

What Does This Do

Cut the implicit tracer.metrics.max.pending default from 2048 (logical) to 128 on normal heap and to 64 on tight heap (Xmx < 128 MB), and cut the implicit tracer.metrics.max.aggregates default from 2048 to 256 on tight heap only. Customers who explicitly configured either property keep their value.
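The selection rule described above can be sketched as follows. This is a minimal illustration, not the code in this PR: the class name, method names, and the use of `Runtime.getRuntime().maxMemory()` to detect heap size are all assumptions (the description does not say how the tracer reads Xmx, and `maxMemory()` can report slightly less than the configured `-Xmx` on some collectors). Only the numeric defaults and the 128 MB threshold come from the description.

```java
// Hypothetical sketch of "explicit value wins, otherwise pick by heap size".
final class MetricsDefaults {
    // Threshold from the description: heaps below 128 MB count as "tight".
    static final long TIGHT_HEAP_THRESHOLD_BYTES = 128L * 1024 * 1024;

    // explicit == null means the property was not configured by the user.
    static int maxPending(Integer explicit, long maxHeapBytes) {
        if (explicit != null) {
            return explicit; // explicit configuration is left untouched
        }
        return maxHeapBytes < TIGHT_HEAP_THRESHOLD_BYTES ? 64 : 128;
    }

    static int maxAggregates(Integer explicit, long maxHeapBytes) {
        if (explicit != null) {
            return explicit;
        }
        // Only the tight-heap default shrinks; normal heap keeps 2048.
        return maxHeapBytes < TIGHT_HEAP_THRESHOLD_BYTES ? 256 : 2048;
    }

    public static void main(String[] args) {
        // Assumption: maxMemory() is used here as a stand-in for the real heap probe.
        long maxHeap = Runtime.getRuntime().maxMemory();
        System.out.println("maxPending default:    " + maxPending(null, maxHeap));
        System.out.println("maxAggregates default: " + maxAggregates(null, maxHeap));
    }
}
```

Passing the heap size in as a parameter keeps the rule testable without launching JVMs at different Xmx values.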

Motivation

The metrics inbox is an MpscArrayQueue<SpanSnapshot> sized to maxPending * LEGACY_BATCH_SIZE (64). With one ~120 B SpanSnapshot per slot, the prior 131,072-slot default pinned ~15 MB worst-case in-flight when the aggregator stalled.
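The arithmetic behind those figures can be checked with a small sketch. The class and method names are illustrative only, and the 120 B per-snapshot size is the approximate figure quoted above, not something measured here.

```java
// Illustrative arithmetic only: slots = maxPending * 64, bytes = slots * ~120.
final class InboxFootprint {
    static final int LEGACY_BATCH_SIZE = 64;
    // Approximate SpanSnapshot size taken from the description (an estimate).
    static final long APPROX_SNAPSHOT_BYTES = 120;

    static long inboxSlots(int maxPending) {
        return (long) maxPending * LEGACY_BATCH_SIZE;
    }

    static long worstCaseBytes(int maxPending) {
        return inboxSlots(maxPending) * APPROX_SNAPSHOT_BYTES;
    }

    public static void main(String[] args) {
        for (int maxPending : new int[] {2048, 128, 64}) {
            System.out.println("maxPending=" + maxPending
                + " slots=" + inboxSlots(maxPending)
                + " worstCaseBytes=" + worstCaseBytes(maxPending));
        }
    }
}
```

With the old default of 2048 this gives 131,072 slots and 15,728,640 bytes (15 MiB); with 128 it gives 8,192 slots and 983,040 bytes; with 64 it gives 4,096 slots and 491,520 bytes.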

At Xmx ≤ ~128 MB the G1 survivor region is too small to absorb that footprint. We observed this catastrophically at Xmx 64m on spring-petclinic: SpanSnapshots overflowed young gen and triggered To-space Exhausted → Full GC storms (0 r/s in the worst case).

A JFR allocation profile at Xmx 64m attributes this to SpanSnapshot, which has been the #2 datadog allocator (~280 MB sampled bytes over 90 s) since #11381 introduced the producer/consumer split. The inbox amplifies that per-publish allocation into a heap-pressure problem only at tight heap.

New defaults

| Heap | maxAggregates | maxPending (logical) | Inbox slots | Worst-case in-flight |
|---|---|---|---|---|
| Normal (≥ 128 MB) | 2048 (unchanged) | 128 | 8,192 | ~1 MB |
| Tight (< 128 MB) | 256 | 64 | 4,096 | ~500 KB |

Both are large enough to absorb the sub-second aggregator stalls we observe in practice (~0.8 s of buffer at 10 K spans/s on the normal-heap default).
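The buffer-time estimate is just inbox slots divided by the span rate. A minimal sketch, with hypothetical names and the 10 K spans/s rate taken from the sentence above:

```java
// Illustrative only: how long a stalled aggregator can be absorbed before the inbox fills.
final class StallHeadroom {
    static final int LEGACY_BATCH_SIZE = 64;

    // Seconds until a full stall fills the inbox at a steady publish rate.
    static double secondsOfBuffer(int maxPending, double spansPerSecond) {
        if (spansPerSecond <= 0) {
            throw new IllegalArgumentException("spansPerSecond must be positive");
        }
        return (maxPending * (double) LEGACY_BATCH_SIZE) / spansPerSecond;
    }

    public static void main(String[] args) {
        System.out.println("normal heap: " + secondsOfBuffer(128, 10_000) + " s");
        System.out.println("tight heap:  " + secondsOfBuffer(64, 10_000) + " s");
    }
}
```

At 10 K spans/s the normal-heap default works out to about 0.82 s and the tight-heap default to about 0.41 s. This assumes a steady rate and a fully stalled consumer; a partially draining aggregator would stretch both numbers.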

What this is not

  • Not a queue-mechanism change. The inbox stays MpscArrayQueue<SpanSnapshot>.
  • Not an allocation-profile change. Producers still allocate one SpanSnapshot per metrics-eligible span.
  • Not a feature change. Drops on overflow still flow through onStatsInboxFull.

It's purely a bound on the inbox's worst-case footprint, sized for the survivor-region constraint that #11381's per-span allocation pattern made load-bearing.
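To make the "bound plus drop on overflow" behavior concrete, here is a rough sketch. It is not the tracer's code: `java.util.concurrent.ArrayBlockingQueue` stands in for the JCTools `MpscArrayQueue` so the example needs no dependencies, `Object` stands in for `SpanSnapshot`, and the drop counter stands in for whatever `onStatsInboxFull` actually does. All class and method names are hypothetical.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

// Stand-in sketch: a fixed-capacity inbox that drops (and counts) on overflow.
final class BoundedInboxSketch {
    static final int LEGACY_BATCH_SIZE = 64;

    private final ArrayBlockingQueue<Object> inbox;
    private final AtomicLong dropped = new AtomicLong();

    BoundedInboxSketch(int maxPending) {
        // Capacity is fixed up front, so the worst-case footprint is bounded.
        this.inbox = new ArrayBlockingQueue<>(maxPending * LEGACY_BATCH_SIZE);
    }

    // Non-blocking publish: returns false and records a drop when the inbox is full.
    boolean publish(Object snapshot) {
        if (inbox.offer(snapshot)) {
            return true;
        }
        dropped.incrementAndGet(); // stand-in for the onStatsInboxFull path
        return false;
    }

    int pending() {
        return inbox.size();
    }

    long droppedCount() {
        return dropped.get();
    }

    public static void main(String[] args) {
        BoundedInboxSketch sketch = new BoundedInboxSketch(64); // tight-heap default
        for (int i = 0; i < 5_000; i++) {
            sketch.publish(new Object()); // no consumer: simulates a stalled aggregator
        }
        System.out.println("pending=" + sketch.pending() + " dropped=" + sketch.droppedCount());
    }
}
```

With the tight-heap default and no consumer draining, 5,000 publishes leave 4,096 entries pending and 904 dropped, so a smaller default trades retained memory for earlier drops during a stall.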

Test plan

  • :internal-api:testConfigTest* passes
  • :dd-trace-core:testdatadog.trace.common.metrics.* passes (92/92 locally)
  • Petclinic load test sweep at Xmx 64m to confirm the inbox no longer overflows survivors (gated on a fresh-system bench session; the existing harness at /tmp/petclinic-bench/run.sh reproduces the bomb)

🤖 Generated with Claude Code

Cut the implicit TRACER_METRICS_MAX_PENDING default from 2048 (logical) to 128
on normal heap and to 64 at Xmx < 128 MB, and the implicit
TRACER_METRICS_MAX_AGGREGATES default from 2048 to 256 at tight heap.

Why
---
The metrics inbox is an MpscArrayQueue<SpanSnapshot> sized to
maxPending * LEGACY_BATCH_SIZE (64). With one ~120 B SpanSnapshot per slot,
the prior 131072-slot default pinned ~15 MB worst-case in-flight when the
aggregator stalled.

At Xmx <= ~128 MB the G1 survivor region is too small to absorb that
footprint -- observed catastrophically at Xmx 64 MB on spring-petclinic
where the inbox overflowed young gen and triggered To-space Exhausted →
Full GC storms (0 r/s in the worst case).

New defaults bound the worst-case in-flight footprint at ~1 MB on normal
heap and ~500 KB at tight heap, comfortably below typical survivor sizes
and large enough to absorb the sub-second consumer stalls we actually see
in practice (~0.8 s of buffer at 10 K spans/s on the normal-heap default).

Customers who explicitly configure TRACER_METRICS_MAX_PENDING are
unaffected; the LEGACY_BATCH_SIZE multiplier still applies to overrides.
Only the implicit defaults shrink.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@dougqh added the labels type: enhancement, comp: core, tag: performance, tag: no release notes, and tag: ai generated on May 29, 2026
Contributor

dd-octo-sts Bot commented May 29, 2026

🟢 Java Benchmark SLOs — All performance SLOs passed

| Suite | Status |
|---|---|
| Startup | 🟢 pass |

SLO thresholds are defined here based on automatically generated metrics. A warning is raised when results are within 5% of the threshold.

PR vs. master results

Startup Time

| Scenario | This PR | master | Change |
|---|---|---|---|
| insecure-bank / iast | 13,961 ms | 13,955 ms | +0.0% |
| insecure-bank / tracing | 12,884 ms | 13,130 ms | -1.9% |
| petclinic / appsec | 16,646 ms | 16,434 ms | +1.3% |
| petclinic / iast | 16,567 ms | 16,552 ms | +0.1% |
| petclinic / profiling | 16,545 ms | 16,566 ms | -0.1% |
| petclinic / tracing | 15,767 ms | 14,915 ms | +5.7% |

Commit: 391704f7 · CI Pipeline · Benchmarking Platform UI


Load and DaCapo benchmarks can be triggered manually in the GitLab pipeline. Results will appear in the Benchmarking Platform UI after completion.
