Add Additional OTel JVM Runtime Metrics and Gate "Developmental" Metrics by mhlidd · Pull Request #11411 · DataDog/dd-trace-java

mhlidd · 2026-05-18T21:43:47Z

What Does This Do

Follow-up to the parent PR for maximo/otlp-runtime-metrics that expands the OTLP JVM runtime metrics surface, splits jvm.thread.count by daemon/state attributes, and gates Development-status metrics behind a new opt-out flag.

New config

dd.metrics.otel.experimental.enabled (default: true) — mirrors OTel's otel.instrumentation.runtime-telemetry.emit-experimental-metrics. When false, only metrics marked Stable in the OTel JVM semantic conventions are emitted; Development-status metrics are suppressed. Settable via either env var:
- DD_METRICS_OTEL_EXPERIMENTAL_ENABLED (Datadog form)
- OTEL_INSTRUMENTATION_RUNTIME_TELEMETRY_EMIT_EXPERIMENTAL_METRICS (OTel-spec form, mapped through OtelEnvironmentConfigSource)
Both env vars are registered in metadata/supported-configurations.json.

Note on the true default (diverges from OTel's false): JMXFetch already emits many metrics that OTel classifies as Development, so defaulting experimental-on keeps the OTLP path at feature parity with the JMX path out of the box. Users who want only the Stable subset can opt out explicitly.

Metrics added, reclassified, or restructured (all under the datadog.jvm.runtime scope, OTel-native names)

| Metric | OTel status | When emitted |
| --- | --- | --- |
| `jvm.thread.count` | Stable | Always. Now bucketed by `jvm.thread.daemon` (Boolean) and `jvm.thread.state` (String) per OTel JVM semconv, replacing the previous single `Attributes.empty()` data point. Empty buckets are skipped. |
| `jvm.memory.used_after_last_gc` | Stable | Always (moved into the always-on memory group) |
| `jvm.gc.duration` | Stable | Always. The `jvm.gc.cause` attribute is gated on the experimental flag (the cause attribute is not in OTel's stable attribute set); `jvm.gc.name` and `jvm.gc.action` are always attached. |
| `jvm.memory.init` | Development | Only when experimental flag is on. Emits heap/non_heap aggregates and per-pool data points (with `jvm.memory.pool.name`), matching the structure of the always-on `jvm.memory.used` / `jvm.memory.committed` / `jvm.memory.limit`. |
| `jvm.buffer.memory.used` / `limit` / `count` | Development | Only when experimental flag is on |
| `jvm.system.cpu.utilization` | Development | Only when experimental flag is on |
| `jvm.system.cpu.load_1m` | Development | Only when experimental flag is on |
| `jvm.file_descriptor.count` / `limit` | Development | Only when experimental flag is on, and only on Unix-like JVMs (`UnixOperatingSystemMXBean`) |
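
To make the `jvm.gc.duration` row concrete, here is a minimal sketch of gating a single attribute on the experimental flag. It is not the PR's code: the class and method names are hypothetical, and a plain `Map` stands in for the OTel `Attributes` type so the example has no dependencies.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration; a Map stands in for OTel Attributes.
final class GcAttributesSketch {
  static Map<String, String> gcDurationAttributes(
      String name, String action, String cause, boolean experimentalEnabled) {
    Map<String, String> attrs = new LinkedHashMap<>();
    // Always attached, per the table above.
    attrs.put("jvm.gc.name", name);
    attrs.put("jvm.gc.action", action);
    // Attached only when the experimental flag is on.
    if (experimentalEnabled) {
      attrs.put("jvm.gc.cause", cause);
    }
    return attrs;
  }
}
```

With the flag off, the returned map carries only `jvm.gc.name` and `jvm.gc.action`, which is the behavior the forked test described later checks for.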

jvm.thread.count collector — JDK-version- and runtime-aware

The collector is chosen once at class load:

- Java 9+ (HotSpot/OpenJ9, non-native-image): uses ThreadMXBean.getThreadInfo(long[]) (the single-arg overload omits stack-trace capture) and reads ThreadInfo.isDaemon() via a MethodHandle resolved reflectively (the surrounding module is compiled against Java 8, so the symbol cannot be referenced directly).
- Java 8 or GraalVM native image: falls back to walking the root ThreadGroup. ThreadInfo.isDaemon() was added in Java 9, and ThreadMXBean is unsupported on GraalVM native images.
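
The reflective lookup on the Java 9+ path might look roughly like the sketch below. It assumes nothing about the PR's actual class layout: `ThreadInfoDaemonSketch`, `IS_DAEMON`, and `countDaemonThreads` are hypothetical names, and returning -1 when the method is missing is an illustrative choice rather than the PR's fallback.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical illustration of resolving a Java 9+ method from Java 8-compatible code.
final class ThreadInfoDaemonSketch {
  // Resolved once at class load; null when ThreadInfo.isDaemon() does not exist (Java 8).
  static final MethodHandle IS_DAEMON = resolveIsDaemon();

  private static MethodHandle resolveIsDaemon() {
    try {
      return MethodHandles.publicLookup()
          .findVirtual(ThreadInfo.class, "isDaemon", MethodType.methodType(boolean.class));
    } catch (NoSuchMethodException | IllegalAccessException e) {
      return null;
    }
  }

  /** Returns the number of live daemon threads, or -1 if isDaemon() is unavailable. */
  static int countDaemonThreads() {
    if (IS_DAEMON == null) {
      return -1;
    }
    ThreadMXBean bean = ManagementFactory.getThreadMXBean();
    // The single-argument overload returns ThreadInfo without stack traces.
    ThreadInfo[] infos = bean.getThreadInfo(bean.getAllThreadIds());
    int daemons = 0;
    try {
      for (ThreadInfo info : infos) {
        // Entries are null for threads that died between the two calls.
        if (info != null && (boolean) IS_DAEMON.invokeExact(info)) {
          daemons++;
        }
      }
    } catch (Throwable t) {
      return -1;
    }
    return daemons;
  }
}
```

Resolving the handle in a static initializer keeps the per-poll cost to a direct handle invocation, which fits the "chosen once at class load" description above.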

Both paths avoid Thread.getAllStackTraces(), which forces a safepoint and allocates a StackTraceElement[] per thread on every poll. The 12 (daemon, Thread.State) Attributes objects are precomputed in two Attributes[] lookup tables to avoid per-poll allocation.
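
As a rough illustration of the precomputed lookup tables, the sketch below buckets threads by (daemon, state) without allocating attribute objects per poll. It is an assumption-laden stand-in: strings replace the OTel `Attributes` objects, and `ThreadCountBucketsSketch`, `KEYS`, `bucket`, and `nonEmpty` are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration; strings stand in for precomputed OTel Attributes.
final class ThreadCountBucketsSketch {
  private static final Thread.State[] STATES = Thread.State.values();

  // Two tables (row 0: non-daemon, row 1: daemon), indexed by Thread.State ordinal.
  static final String[][] KEYS = {buildKeys(false), buildKeys(true)};

  private static String[] buildKeys(boolean daemon) {
    String[] keys = new String[STATES.length];
    for (Thread.State s : STATES) {
      keys[s.ordinal()] =
          "jvm.thread.daemon=" + daemon
              + ",jvm.thread.state=" + s.name().toLowerCase(Locale.ROOT);
    }
    return keys;
  }

  /** Counts the first n threads into counts[daemon ? 1 : 0][state ordinal]. */
  static long[][] bucket(Thread[] threads, int n) {
    long[][] counts = new long[2][STATES.length];
    for (int i = 0; i < n; i++) {
      Thread t = threads[i];
      if (t != null) {
        counts[t.isDaemon() ? 1 : 0][t.getState().ordinal()]++;
      }
    }
    return counts;
  }

  /** Maps each precomputed key to its count, skipping empty buckets. */
  static Map<String, Long> nonEmpty(long[][] counts) {
    Map<String, Long> out = new LinkedHashMap<>();
    for (int d = 0; d < 2; d++) {
      for (int s = 0; s < STATES.length; s++) {
        if (counts[d][s] > 0) {
          out.put(KEYS[d][s], counts[d][s]);
        }
      }
    }
    return out;
  }
}
```

`Thread.State` has six values, so the two tables hold the 12 combinations mentioned above.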

Value-guard alignment with OTel reference implementation

jvm.memory.limit and jvm.memory.init now skip recording only when getMax() / getInit() returns the documented -1 sentinel (was > 0, which incorrectly also skipped legitimate 0 values).
All other per-metric guards (>= 0, null checks) match the corresponding callbacks in io.opentelemetry.instrumentation.runtimetelemetry.internal.*.
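
The guard change can be sketched as follows. This is an illustration under stated assumptions, not the PR's code: `MemoryGuardSketch` and its methods are hypothetical, and a `LongConsumer` stands in for the OTel measurement callback. The only fact relied on is that `MemoryUsage.getMax()` and `getInit()` return -1 when the value is undefined.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;
import java.util.function.LongConsumer;

// Hypothetical illustration of the -1 sentinel guard.
final class MemoryGuardSketch {
  /** True unless the value is the documented "undefined" sentinel. */
  static boolean isDefined(long value) {
    return value != -1; // the previous guard, value > 0, also dropped a legitimate 0
  }

  /** Passes the value to the sink only when it is defined. */
  static void recordIfDefined(long value, LongConsumer sink) {
    if (isDefined(value)) {
      sink.accept(value);
    }
  }

  /** Example use against the heap aggregate; per-pool usage would follow the same shape. */
  static void observeHeapLimit(LongConsumer sink) {
    MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
    recordIfDefined(heap.getMax(), sink);
  }
}
```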

Review-comment fixes

isGcNotificationInfoAvailable() now catches Exception (was ClassNotFoundException) so any unexpected lookup failure degrades gracefully instead of failing class initialization.
GcNotificationListener.handleNotification now null-checks the GarbageCollectionNotificationInfo payload before recording, in case a notification arrives without a usable payload.
Removed the sunOsBean() helper and inlined the instanceof OperatingSystemMXBean cast at both call sites (registerCpuMetrics, registerSystemCpuMetrics). Makes the type-narrowing explicit at each use site and removes one layer of indirection (reviewer preference for readability over deduplication here).
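
For readers unfamiliar with the pattern, inlined type-narrowing at a use site looks roughly like this. The sketch uses the Unix file descriptor bean mentioned in the metrics table as its example; `FileDescriptorSketch` is a hypothetical name, and the -1 fallback is an illustrative choice (the PR's actual call sites are registerCpuMetrics and registerSystemCpuMetrics, which are not reproduced here).

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

// Hypothetical illustration of narrowing the platform bean at the point of use.
final class FileDescriptorSketch {
  /** Returns the open file descriptor count, or -1 when the Unix bean is unavailable. */
  static long openFileDescriptors() {
    OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
    // Narrow explicitly where the value is read, instead of going through a shared helper.
    if (os instanceof com.sun.management.UnixOperatingSystemMXBean) {
      return ((com.sun.management.UnixOperatingSystemMXBean) os).getOpenFileDescriptorCount();
    }
    return -1;
  }
}
```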

Test coverage

JvmOtlpRuntimeMetricsTest was extended to assert:
- All newly added metric names are registered (with platform-conditional checks for the Unix-only file descriptor metrics).
- jvm.gc.duration emission via System.gc().
- jvm.thread.count data points carry valid jvm.thread.daemon (boolean) and jvm.thread.state (lower-cased Thread.State name) attributes, empty buckets are skipped, the bucket sum is positive, and both daemon=true and daemon=false buckets are observed in a typical test JVM (jvmThreadCountIsBucketedByDaemonAndState, replacing the old jvmThreadCountIsPositive).
- jvm.memory.init emits both the heap/non_heap aggregates and per-pool data points (jvmMemoryInitHasHeapNonHeapAndPoolAttributes), and the heap aggregate value is positive (jvmMemoryInitHeapAggregateIsPositive).
New JvmOtlpRuntimeMetricsForkedTest runs in an isolated JVM, calls start(false), and verifies that Development-status instruments are absent and that jvm.gc.cause is not attached to jvm.gc.duration data points when experimental metrics are disabled. Forked because JvmOtlpRuntimeMetrics.start(...) is guarded by a process-wide AtomicBoolean and the registry / JMX listeners are JVM-global, so a single JVM cannot exercise both flag values.
Removed a weak startIsIdempotent test that only checked the metric-name Set size — it could not detect duplicate JMX listeners or duplicate observable callbacks under the same instrument name, which are the actual failure modes if the guard were removed.

Misc

Added debug logs when an MXBean isn't available so it's obvious why a metric didn't show up.

Motivation

The parent PR established the OTLP JVM runtime metrics pipeline but only emitted a subset of the OTel JVM semantic conventions, and jvm.thread.count was emitted as a single undifferentiated count. This follow-up brings the surface in line with what opentelemetry-java-instrumentation's runtime-telemetry library emits, splits jvm.thread.count by the spec-required daemon/state attributes, and adds the standard experimental-metrics opt-out so users who want only the Stable subset (smaller cardinality, fewer dashboard surprises) can disable Development metrics without losing the integration entirely.

Aligning the value guards with OTel's reference implementation prevents two real-world divergences:

- Without the 0 vs -1 fix, uncapped non-heap pools (where getMax() == 0 on some JVM/version combos) would silently produce no jvm.memory.limit data point; they should publish 0 to indicate "no limit observed."
- The experimental gate ensures that dashboards built against OTel's stable-only output do not differ between OTel SDK collection and DD-agent collection.

The JDK-version-aware jvm.thread.count collector matters because the module is compiled against Java 8 (so the Java 9+ ThreadInfo.isDaemon() cannot be referenced directly) but ships into JVMs ranging from Java 8 through current LTS releases, plus GraalVM native image where ThreadMXBean is unsupported. On every supported runtime, the chosen approach avoids calling Thread.getAllStackTraces() on each poll, which would otherwise dominate overhead.

Additional Notes

No change to JMXFetch behavior beyond passing the new flag through JvmOtlpRuntimeMetrics.start(...). The OTLP_JMX_CONFIG-skip path is unchanged.
The OTel-spec setting otel.instrumentation.runtime-telemetry.emit-experimental-metrics is captured in OtelEnvironmentConfigSource, so an unmodified OTel-style config picks up the flag automatically.

Contributor Checklist

Format the title according to the contribution guidelines
Assign the type: and (comp: or inst:) labels in addition to any other useful labels
Avoid using close, fix, or any linking keywords when referencing an issue
Use solves instead, and assign the PR milestone to the issue
Update the CODEOWNERS file on source file addition, migration, or deletion
Update public documentation with any new configuration flags or behaviors

Jira ticket: [PROJ-IDENT]

mhlidd · 2026-05-18T21:59:41Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 62d9b50d1d

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you:

- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review"

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

dd-octo-sts · 2026-05-20T19:53:34Z

Hi! 👋 Thanks for your pull request! 🎉

To help us review it, please make sure to:

Add at least one type, and one component or instrumentation label to the pull request

If you need help, please check our contributing guidelines.

mhlidd · 2026-05-20T20:13:49Z

@codex review

chatgpt-codex-connector · 2026-05-20T20:20:44Z

Codex Review: Didn't find any major issues. Another round soon, please!

ValentinZakharov

Could you clarify whether the following differences from the JVM semantic conventions are intentional?

- jvm.thread.count seems to be missing the recommended attributes thread.daemon and thread.state (spec)
- jvm.memory.init is not split by memory pool and seems to be missing the jvm.memory.pool.name attribute (spec)

mcculls · 2026-05-22T15:35:00Z

  static final int DEFAULT_METRICS_OTEL_TIMEOUT = 7_500; // ms
  static final int DEFAULT_METRICS_OTEL_CARDINALITY_LIMIT = 2_000;

+  public static final boolean DEFAULT_METRICS_OTEL_EXPERIMENTAL_ENABLED = true;


Default for this in OTel is false - do we want to match that?

https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/main/instrumentation/runtime-telemetry/README.md

This was an intentional decision. The idea is that JMX already emits many metrics that OTel considers "developmental", so enabling them by default is an easy win.

mcculls

One question about whether the default should really be true since OTel defaults it to false at the moment: https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/main/instrumentation/runtime-telemetry/README.md

Also a few cleanup / robustness comments to be addressed before merging - otherwise looks good.

mhlidd · 2026-05-26T16:19:33Z

Could you clarify whether the following differences from the JVM semantic conventions are intentional?

- jvm.thread.count seems to be missing the recommended attributes thread.daemon and thread.state (spec)
- jvm.memory.init is not split by memory pool and seems to be missing the jvm.memory.pool.name attribute (spec)

@ValentinZakharov Great catch! Adding these attributes now.

mhlidd · 2026-05-26T19:56:27Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8f0dd9c339

chatgpt-codex-connector · 2026-05-26T20:04:17Z

+    if (n == buffer.length) {
+      return buffer;
+    }


Retry thread enumeration until snapshot is complete

enumerateAllThreads() returns immediately when group.enumerate(buffer) fills the array, but ThreadGroup.enumerate silently drops extra threads when the buffer is too small. In the Java 8 / native-image path this causes jvm.thread.count buckets to under-report whenever activeCount() underestimates by more than the fixed +10 headroom (common under thread churn), so metric totals can be materially wrong. Grow-and-retry until the returned count is strictly less than buffer length to avoid truncation.
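
For reference, the grow-and-retry loop this suggestion describes could be sketched as below. This illustrates the reviewer bot's proposal only, not the PR's implementation; the class name, the doubling strategy, and the trimmed return value are assumptions, and only the initial +10 headroom is taken from the comment.

```java
import java.util.Arrays;

// Hypothetical illustration of grow-and-retry thread enumeration.
final class ThreadEnumerationSketch {
  static Thread[] enumerateAllThreads() {
    // Walk up to the root thread group.
    ThreadGroup root = Thread.currentThread().getThreadGroup();
    while (root.getParent() != null) {
      root = root.getParent();
    }
    // activeCount() is only an estimate, so start with some headroom.
    int size = root.activeCount() + 10;
    while (true) {
      Thread[] buffer = new Thread[size];
      // enumerate silently drops threads that do not fit in the buffer.
      int n = root.enumerate(buffer, true);
      if (n < buffer.length) {
        // Strictly fewer than capacity: nothing was truncated.
        return Arrays.copyOf(buffer, n);
      }
      // The buffer filled up, so the snapshot may be incomplete; retry with more room.
      size = buffer.length * 2;
    }
  }
}
```

The trade-off is one extra enumeration whenever the estimate turns out too small, in exchange for never under-reporting.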

This matches what OTel is doing

chatgpt-codex-connector Bot reviewed May 18, 2026

Comment thread ...utils/src/main/java/datadog/trace/bootstrap/config/provider/OtelEnvironmentConfigSource.java

Base automatically changed from maximo/otlp-runtime-metrics to master May 19, 2026 18:23

mhlidd added 4 commits May 20, 2026 14:18

init

82c7398

update checks to match OTel checks

b2a3ff6

adding jvm.gc.duration

cbdabdd

adding tests for experimental off

de166ab

mhlidd force-pushed the mhlidd/otlp_runtime_metrics_follow_up branch from 90ddfc2 to de166ab Compare May 20, 2026 18:24

removing unnecessary test and adding configs

40ef357

mhlidd changed the title ~~init~~ Add Additional OTel JVM Runtime Metrics and Gate "Developmental" Metrics May 20, 2026

mhlidd marked this pull request as ready for review May 20, 2026 19:53

mhlidd requested review from a team as code owners May 20, 2026 19:53

mhlidd requested review from ValentinZakharov, bric3 and mcculls and removed request for a team May 20, 2026 19:53

mhlidd added the type: enhancement, inst: opentelemetry, and tag: ai generated labels May 20, 2026

ValentinZakharov reviewed May 21, 2026

Comment thread ...a-agent/agent-jmxfetch/src/main/java/datadog/trace/agent/jmxfetch/JvmOtlpRuntimeMetrics.java

mcculls reviewed May 22, 2026

Comment thread ...a-agent/agent-jmxfetch/src/main/java/datadog/trace/agent/jmxfetch/JvmOtlpRuntimeMetrics.java Outdated

mcculls reviewed May 22, 2026

Comment thread ...a-agent/agent-jmxfetch/src/main/java/datadog/trace/agent/jmxfetch/JvmOtlpRuntimeMetrics.java Outdated

mcculls reviewed May 22, 2026

Comment thread ...a-agent/agent-jmxfetch/src/main/java/datadog/trace/agent/jmxfetch/JvmOtlpRuntimeMetrics.java Outdated

mcculls approved these changes May 22, 2026

pr comments and split jvm.thread.count by attributes and jdk version

8f0dd9c

chatgpt-codex-connector Bot reviewed May 26, 2026
