Tighten tracer.metrics defaults to protect tight-heap JVMs by dougqh · Pull Request #11500 · DataDog/dd-trace-java

dougqh · 2026-05-29T13:07:31Z

What Does This Do

Cut the implicit tracer.metrics.max.pending default from 2048 (logical) to 128 on normal heap and to 64 when Xmx < 128 MB, and the implicit tracer.metrics.max.aggregates default from 2048 to 256 at tight heap. Customers who explicitly configured either property keep their value.
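As a rough illustration of the rule above (explicit values win, otherwise the default depends on heap size), here is a minimal sketch. The class and method names are hypothetical and are not taken from dd-trace-java; the sketch also assumes the heap limit is read from Runtime.maxMemory(), which may report a value slightly different from -Xmx on some collectors.

```java
import java.util.OptionalInt;

// Hypothetical sketch, not the tracer's actual Config code.
final class MetricsDefaultsSketch {
  // Threshold from the PR description: heaps below 128 MB count as "tight".
  static final long TIGHT_HEAP_BYTES = 128L * 1024 * 1024;

  // Explicit configuration always wins; only the implicit default shrinks.
  static int maxPending(OptionalInt explicit, long maxHeapBytes) {
    if (explicit.isPresent()) {
      return explicit.getAsInt();
    }
    return maxHeapBytes < TIGHT_HEAP_BYTES ? 64 : 128;
  }

  // maxAggregates keeps its 2048 default on normal heap and drops to 256 at tight heap.
  static int maxAggregates(OptionalInt explicit, long maxHeapBytes) {
    if (explicit.isPresent()) {
      return explicit.getAsInt();
    }
    return maxHeapBytes < TIGHT_HEAP_BYTES ? 256 : 2048;
  }

  public static void main(String[] args) {
    long heap = Runtime.getRuntime().maxMemory();
    System.out.println("maxPending default: " + maxPending(OptionalInt.empty(), heap));
    System.out.println("maxAggregates default: " + maxAggregates(OptionalInt.empty(), heap));
  }
}
```

Passing OptionalInt.of(2048) for either property returns 2048 regardless of heap size, which mirrors the "customers keep their value" guarantee.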

Motivation

The metrics inbox is an MpscArrayQueue<SpanSnapshot> sized to maxPending * LEGACY_BATCH_SIZE (64). With one ~120 B SpanSnapshot per slot, the prior 131,072-slot default pinned ~15 MB worst-case in-flight when the aggregator stalled.
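The footprint arithmetic can be checked with a few lines of Java. This is only a sketch: the class name is hypothetical, and the 120-byte figure is the approximate per-snapshot size quoted above, not a measured constant.

```java
// Hypothetical helper reproducing the sizing arithmetic from the description.
final class InboxFootprintSketch {
  static final int LEGACY_BATCH_SIZE = 64;       // multiplier named in the PR
  static final long APPROX_SNAPSHOT_BYTES = 120; // approximate size quoted in the PR

  // Physical queue slots for a given logical maxPending value.
  static long slots(int maxPending) {
    return (long) maxPending * LEGACY_BATCH_SIZE;
  }

  // Worst case: every slot holds a live SpanSnapshot while the aggregator is stalled.
  static long worstCaseBytes(int maxPending) {
    return slots(maxPending) * APPROX_SNAPSHOT_BYTES;
  }

  public static void main(String[] args) {
    for (int maxPending : new int[] {2048, 128, 64}) {
      double mib = worstCaseBytes(maxPending) / (1024.0 * 1024.0);
      System.out.println(String.format(
          "maxPending=%d slots=%d worstCase=%.2f MiB", maxPending, slots(maxPending), mib));
    }
  }
}
```

For the prior default, 2048 * 64 gives 131,072 slots and 131,072 * 120 B gives 15,728,640 bytes, which matches the ~15 MB figure.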

At Xmx ≤ ~128 MB the G1 survivor region is too small to absorb that footprint. This was observed catastrophically at Xmx 64m on spring-petclinic: SpanSnapshots overflow young gen and trigger To-space Exhausted → Full GC storms (0 r/s in the worst case).

A JFR allocation profile at Xmx 64m attributes this to SpanSnapshot being the #2 datadog allocator (~280 MB sampled bytes over 90 s) since #11381 introduced the producer/consumer split. The inbox amplifies the per-publish allocation into a heap-pressure problem only at tight heap.

New defaults

Heap	maxAggregates	maxPending (logical)	Inbox slots	Worst-case in-flight
Normal (≥ 128 MB)	2048 (unchanged)	128	8,192	~1 MB
Tight (< 128 MB)	256	64	4,096	~500 KB

Both are large enough to absorb the sub-second aggregator stalls we observe in practice (~0.8 s of buffer at 10 K spans/s on the normal-heap default).
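The stall budget quoted above is simply the inbox capacity divided by the span rate. A minimal sketch, with a hypothetical class name and assuming a steady publish rate with no draining during the stall:

```java
// Hypothetical helper: how long a stalled aggregator can be absorbed before drops begin.
final class StallBudgetSketch {
  // Seconds until a completely empty inbox fills at a steady span rate.
  static double secondsOfBuffer(long inboxSlots, double spansPerSecond) {
    if (spansPerSecond <= 0) {
      throw new IllegalArgumentException("spansPerSecond must be positive");
    }
    return inboxSlots / spansPerSecond;
  }

  public static void main(String[] args) {
    System.out.println("normal heap: " + secondsOfBuffer(8_192, 10_000) + " s");
    System.out.println("tight heap:  " + secondsOfBuffer(4_096, 10_000) + " s");
  }
}
```

At 10 K spans/s the 8,192-slot inbox fills in about 0.82 s, in line with the ~0.8 s figure; the same calculation gives about 0.41 s for the 4,096-slot tight-heap inbox.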

What this is not

Not a queue-mechanism change. The inbox stays MpscArrayQueue<SpanSnapshot>.
Not an allocation-profile change. Producers still allocate one SpanSnapshot per metrics-eligible span.
Not a feature change. Drops on overflow still flow through onStatsInboxFull.

It's purely a bound on the inbox's worst-case footprint, sized for the survivor-region constraint that #11381's per-span allocation pattern made load-bearing.
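To make the "bounded inbox, drop on overflow" behavior concrete, here is a small stand-in sketch. It deliberately uses java.util.concurrent.ArrayBlockingQueue from the JDK instead of the JCTools MpscArrayQueue the tracer uses, so that it runs without extra dependencies; the class name and the drop counter are hypothetical and only approximate the role of onStatsInboxFull.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

// Stand-in sketch: a fixed-capacity inbox that drops (and counts) on overflow.
// The real tracer uses MpscArrayQueue<SpanSnapshot>; this is NOT that code.
final class BoundedInboxSketch<T> {
  private final ArrayBlockingQueue<T> queue;
  private final AtomicLong dropped = new AtomicLong();

  BoundedInboxSketch(int capacity) {
    this.queue = new ArrayBlockingQueue<>(capacity);
  }

  // Non-blocking publish: returns false and records a drop when the inbox is full.
  boolean publish(T item) {
    if (queue.offer(item)) {
      return true;
    }
    dropped.incrementAndGet(); // plays the role of an onStatsInboxFull-style callback
    return false;
  }

  // Consumer side: returns null when the inbox is empty.
  T poll() {
    return queue.poll();
  }

  int size() {
    return queue.size();
  }

  long droppedCount() {
    return dropped.get();
  }
}
```

Because the capacity is fixed at construction, the number of items held live by the inbox can never exceed the slot count, which is the property the new defaults rely on.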

Test plan

:internal-api:test — ConfigTest* passes
:dd-trace-core:test — datadog.trace.common.metrics.* passes (92/92 locally)
Petclinic load test sweep at Xmx 64m to confirm the inbox no longer overflows survivors (gated on a fresh-system bench session; the existing harness at /tmp/petclinic-bench/run.sh reproduces the bomb)

🤖 Generated with Claude Code

Cut the implicit TRACER_METRICS_MAX_PENDING default from 2048 (logical) to 128 on normal heap and to 64 at Xmx < 128 MB, and the implicit TRACER_METRICS_MAX_AGGREGATES default from 2048 to 256 at tight heap.

Why
---

The metrics inbox is an MpscArrayQueue<SpanSnapshot> sized to maxPending * LEGACY_BATCH_SIZE (64). With one ~120 B SpanSnapshot per slot, the prior 131072-slot default pinned ~15 MB worst-case in-flight when the aggregator stalled.

At Xmx <= ~128 MB the G1 survivor region is too small to absorb that footprint -- observed catastrophically at Xmx 64 MB on spring-petclinic where the inbox overflowed young gen and triggered To-space Exhausted → Full GC storms (0 r/s in the worst case).

New defaults bound the worst-case in-flight footprint at ~1 MB on normal heap and ~500 KB at tight heap, comfortably below typical survivor sizes and large enough to absorb the sub-second consumer stalls we actually see in practice (~0.8 s of buffer at 10 K spans/s on the normal-heap default).

Customers who explicitly configure TRACER_METRICS_MAX_PENDING are unaffected; the LEGACY_BATCH_SIZE multiplier still applies to overrides. Only the implicit defaults shrink.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

dd-octo-sts · 2026-05-29T13:26:01Z

🟢 Java Benchmark SLOs — All performance SLOs passed

Suite	Status
Startup	🟢 pass

SLO thresholds are defined here based on automatically generated metrics. A warning is raised when results are within 5% of the threshold.

PR vs. master results

Startup Time

Scenario	This PR	master	Change
insecure-bank / iast	13,961 ms	13,955 ms	+0.0%
insecure-bank / tracing	12,884 ms	13,130 ms	-1.9%
petclinic / appsec	16,646 ms	16,434 ms	+1.3%
petclinic / iast	16,567 ms	16,552 ms	+0.1%
petclinic / profiling	16,545 ms	16,566 ms	-0.1%
petclinic / tracing	15,767 ms	14,915 ms	+5.7%

Commit: 391704f7 · CI Pipeline · Benchmarking Platform UI

Load and DaCapo benchmarks can be triggered manually in the GitLab pipeline. Results will appear in the Benchmarking Platform UI after completion.

dougqh added labels on May 29, 2026: type: enhancement, comp: core, tag: performance, tag: no release notes, tag: ai generated

dougqh wants to merge 1 commit into master from dougqh/css-tight-heap-defaults
