
Add Additional OTel JVM Runtime Metrics and Gate "Developmental" Metrics #11411

Open: mhlidd wants to merge 6 commits into master from mhlidd/otlp_runtime_metrics_follow_up

@mhlidd
Contributor

@mhlidd mhlidd commented May 18, 2026

What Does This Do

Follow-up to the parent PR for maximo/otlp-runtime-metrics that expands the OTLP JVM runtime metrics surface, splits jvm.thread.count by daemon/state attributes, and gates Development-status metrics behind a new opt-out flag.

New config

  • dd.metrics.otel.experimental.enabled (default: true) — mirrors OTel's otel.instrumentation.runtime-telemetry.emit-experimental-metrics. When false, only metrics marked Stable in the OTel JVM semantic conventions are emitted; Development-status metrics are suppressed. Settable via either env var:

    • DD_METRICS_OTEL_EXPERIMENTAL_ENABLED (Datadog form)
    • OTEL_INSTRUMENTATION_RUNTIME_TELEMETRY_EMIT_EXPERIMENTAL_METRICS (OTel-spec form, mapped through OtelEnvironmentConfigSource)

    Both env vars are registered in metadata/supported-configurations.json.

    Note on the true default (diverges from OTel's false): JMXFetch already emits many metrics that OTel classifies as Development, so defaulting experimental-on keeps the OTLP path at feature parity with the JMX path out of the box. Users who want only the Stable subset can opt out explicitly.
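The flag resolution described above could be sketched roughly as follows. The class and method names are hypothetical, and checking the Datadog-form variable before the OTel-form one is an assumption; the PR description does not state which takes precedence.

```java
import java.util.Map;

/** Hypothetical helper for illustration; not the tracer's actual config code. */
final class ExperimentalFlagSketch {
  static final String DD_ENV = "DD_METRICS_OTEL_EXPERIMENTAL_ENABLED";
  static final String OTEL_ENV =
      "OTEL_INSTRUMENTATION_RUNTIME_TELEMETRY_EMIT_EXPERIMENTAL_METRICS";

  // The PR defaults to true, diverging from OTel's false default.
  static final boolean DEFAULT_ENABLED = true;

  /** Resolves the flag from an environment map. Datadog-first precedence is an assumption. */
  static boolean resolve(Map<String, String> env) {
    String raw = env.get(DD_ENV);
    if (raw == null) {
      raw = env.get(OTEL_ENV);
    }
    return raw == null ? DEFAULT_ENABLED : Boolean.parseBoolean(raw.trim());
  }

  public static void main(String[] args) {
    System.out.println("experimental metrics enabled: " + resolve(System.getenv()));
  }
}
```

With neither variable set, this sketch returns true, which corresponds to the feature-parity default the note argues for.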

Metrics added, reclassified, or restructured (all under the datadog.jvm.runtime scope, OTel-native names)

| Metric | OTel status | When emitted |
| --- | --- | --- |
| jvm.thread.count | Stable | Always. Now bucketed by jvm.thread.daemon (Boolean) and jvm.thread.state (String) per OTel JVM semconv, replacing the previous single Attributes.empty() data point. Empty buckets are skipped. |
| jvm.memory.used_after_last_gc | Stable | Always (moved into the always-on memory group) |
| jvm.gc.duration | Stable | Always. The jvm.gc.cause attribute is gated on the experimental flag (the cause attribute is not in OTel's stable attribute set); jvm.gc.name and jvm.gc.action are always attached. |
| jvm.memory.init | Development | Only when experimental flag is on. Emits heap/non_heap aggregates and per-pool data points (with jvm.memory.pool.name), matching the structure of the always-on jvm.memory.used / jvm.memory.committed / jvm.memory.limit. |
| jvm.buffer.memory.used / limit / count | Development | Only when experimental flag is on |
| jvm.system.cpu.utilization | Development | Only when experimental flag is on |
| jvm.system.cpu.load_1m | Development | Only when experimental flag is on |
| jvm.file_descriptor.count / limit | Development | Only when experimental flag is on, and only on Unix-like JVMs (UnixOperatingSystemMXBean) |
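
As an illustration of the attribute gating listed for jvm.gc.duration, a minimal sketch might look like this. It uses a plain Map instead of the OTel Attributes type to stay dependency-free, and the class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative only; the real code builds OTel Attributes rather than a Map. */
final class GcAttributesSketch {
  /** jvm.gc.name and jvm.gc.action are always attached; jvm.gc.cause only when experimental. */
  static Map<String, String> gcDurationAttributes(
      String name, String action, String cause, boolean experimentalEnabled) {
    Map<String, String> attrs = new LinkedHashMap<>();
    attrs.put("jvm.gc.name", name);
    attrs.put("jvm.gc.action", action);
    if (experimentalEnabled) {
      // The cause attribute is not in OTel's stable attribute set, so it is gated.
      attrs.put("jvm.gc.cause", cause);
    }
    return attrs;
  }

  public static void main(String[] args) {
    System.out.println(gcDurationAttributes("G1 Young Generation", "end of minor GC", "System.gc()", false));
    System.out.println(gcDurationAttributes("G1 Young Generation", "end of minor GC", "System.gc()", true));
  }
}
```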

jvm.thread.count collector — JDK-version- and runtime-aware

The collector is chosen once at class load:

  • Java 9+ (HotSpot/OpenJ9, non-native-image): uses ThreadMXBean.getThreadInfo(long[]) (the single-arg overload omits stack-trace capture) and reads ThreadInfo.isDaemon() via a MethodHandle resolved reflectively (the surrounding module is compiled against Java 8, so the symbol cannot be referenced directly).
  • Java 8 or GraalVM native image: falls back to walking the root ThreadGroup. ThreadInfo.isDaemon() was added in Java 9, and ThreadMXBean is unsupported on GraalVM native images.

Both paths avoid Thread.getAllStackTraces(), which forces a safepoint and allocates a StackTraceElement[] per thread on every poll. The 12 (daemon, Thread.State) Attributes objects are precomputed in two Attributes[] lookup tables to avoid per-poll allocation.
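
The Java 9+ path described above might look roughly like the following sketch. It is not the PR's code: the class name is hypothetical, string labels stand in for the precomputed OTel Attributes objects, and returning null to signal "use the ThreadGroup fallback" is an assumption made for illustration.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.Locale;

/** Illustrative sketch; class and method names are hypothetical. */
final class ThreadCountSketch {
  private static final Thread.State[] STATES = Thread.State.values();

  // Precomputed labels indexed [daemon ? 1 : 0][state.ordinal()]: 2 x 6 = 12 entries.
  static final String[][] LABELS = buildLabels();

  // ThreadInfo.isDaemon() exists only on Java 9+, so it is resolved reflectively;
  // null means the method is missing and a fallback collector is needed.
  private static final MethodHandle IS_DAEMON = lookupIsDaemon();

  private static String[][] buildLabels() {
    String[][] labels = new String[2][STATES.length];
    for (int d = 0; d < 2; d++) {
      for (Thread.State s : STATES) {
        labels[d][s.ordinal()] =
            "jvm.thread.daemon=" + (d == 1)
                + ",jvm.thread.state=" + s.name().toLowerCase(Locale.ROOT);
      }
    }
    return labels;
  }

  private static MethodHandle lookupIsDaemon() {
    try {
      return MethodHandles.publicLookup()
          .findVirtual(ThreadInfo.class, "isDaemon", MethodType.methodType(boolean.class));
    } catch (ReflectiveOperationException e) {
      return null;
    }
  }

  /** Returns counts indexed like LABELS, or null when isDaemon() is unavailable. */
  static int[][] countByDaemonAndState() {
    if (IS_DAEMON == null) {
      return null;
    }
    ThreadMXBean bean = ManagementFactory.getThreadMXBean();
    // Single-argument overload: no stack traces are captured.
    ThreadInfo[] infos = bean.getThreadInfo(bean.getAllThreadIds());
    int[][] counts = new int[2][STATES.length];
    for (ThreadInfo info : infos) {
      if (info == null) {
        continue; // thread terminated between the two calls
      }
      boolean daemon;
      try {
        daemon = (boolean) IS_DAEMON.invoke(info);
      } catch (Throwable t) {
        continue; // skip threads whose daemon flag cannot be read
      }
      counts[daemon ? 1 : 0][info.getThreadState().ordinal()]++;
    }
    return counts;
  }

  public static void main(String[] args) {
    int[][] counts = countByDaemonAndState();
    if (counts == null) {
      System.out.println("isDaemon() unavailable; a ThreadGroup fallback would be used");
      return;
    }
    for (int d = 0; d < 2; d++) {
      for (int s = 0; s < STATES.length; s++) {
        if (counts[d][s] > 0) { // empty buckets are skipped
          System.out.println(LABELS[d][s] + " -> " + counts[d][s]);
        }
      }
    }
  }
}
```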

Value-guard alignment with OTel reference implementation

  • jvm.memory.limit and jvm.memory.init now skip recording only when getMax() / getInit() returns the documented -1 sentinel (was > 0, which incorrectly also skipped legitimate 0 values).
  • All other per-metric guards (>= 0, null checks) match the corresponding callbacks in io.opentelemetry.instrumentation.runtimetelemetry.internal.*.
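
The difference between the old and new guard for jvm.memory.limit and jvm.memory.init can be shown with a small sketch. The helper names are hypothetical; only the -1 sentinel from the MemoryUsage contract is taken as given.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

/** Illustrative guard comparison; method names are hypothetical. */
final class MemoryGuardSketch {
  /** New guard: skip only the documented "undefined" sentinel. */
  static boolean shouldRecord(long value) {
    return value != -1;
  }

  /** Old guard: also skipped a legitimate 0. */
  static boolean oldShouldRecord(long value) {
    return value > 0;
  }

  public static void main(String[] args) {
    for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
      MemoryUsage usage = pool.getUsage();
      if (usage == null) {
        continue; // pool is no longer valid
      }
      if (shouldRecord(usage.getMax())) {
        System.out.println(pool.getName() + " limit=" + usage.getMax());
      }
      if (shouldRecord(usage.getInit())) {
        System.out.println(pool.getName() + " init=" + usage.getInit());
      }
    }
  }
}
```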

Review-comment fixes

  • isGcNotificationInfoAvailable() now catches Exception (was ClassNotFoundException) so any unexpected lookup failure degrades gracefully instead of failing class initialization.
  • GcNotificationListener.handleNotification now null-checks the GarbageCollectionNotificationInfo payload before recording, in case a notification arrives without a usable payload.
  • Removed the sunOsBean() helper and inlined the instanceof OperatingSystemMXBean cast at both call sites (registerCpuMetrics, registerSystemCpuMetrics). Makes the type-narrowing explicit at each use site and removes one layer of indirection (reviewer preference for readability over deduplication here).
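
A minimal sketch of the broadened catch in the availability check, assuming the check is a Class.forName lookup (the PR description does not show the body, and the names below are hypothetical):

```java
/** Illustrative availability probe; not the PR's actual method. */
final class GcNotificationSupportSketch {
  static final String GC_INFO_CLASS = "com.sun.management.GarbageCollectionNotificationInfo";

  /** Returns false on any lookup Exception instead of letting it escape a static initializer. */
  static boolean isClassAvailable(String className) {
    try {
      // initialize=false: only checks that the class can be located and loaded.
      Class.forName(className, false, GcNotificationSupportSketch.class.getClassLoader());
      return true;
    } catch (Exception e) {
      // Covers ClassNotFoundException and any other unexpected checked or runtime exception.
      return false;
    }
  }

  public static void main(String[] args) {
    System.out.println("GC notifications available: " + isClassAvailable(GC_INFO_CLASS));
  }
}
```

Note that catching Exception does not cover Error subclasses such as NoClassDefFoundError; whether those need handling is outside what the description states.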

Test coverage

  • JvmOtlpRuntimeMetricsTest was extended to assert:
    • All newly added metric names are registered (with platform-conditional checks for the Unix-only file descriptor metrics).
    • jvm.gc.duration emission via System.gc().
    • jvm.thread.count data points carry valid jvm.thread.daemon (boolean) and jvm.thread.state (lower-cased Thread.State name) attributes, empty buckets are skipped, the bucket sum is positive, and both daemon=true and daemon=false buckets are observed in a typical test JVM (jvmThreadCountIsBucketedByDaemonAndState, replacing the old jvmThreadCountIsPositive).
    • jvm.memory.init emits both the heap/non_heap aggregates and per-pool data points (jvmMemoryInitHasHeapNonHeapAndPoolAttributes), and the heap aggregate value is positive (jvmMemoryInitHeapAggregateIsPositive).
  • New JvmOtlpRuntimeMetricsForkedTest runs in an isolated JVM, calls start(false), and verifies that Development-status instruments are absent and that jvm.gc.cause is not attached to jvm.gc.duration data points when experimental metrics are disabled. Forked because JvmOtlpRuntimeMetrics.start(...) is guarded by a process-wide AtomicBoolean and the registry / JMX listeners are JVM-global, so a single JVM cannot exercise both flag values.
  • Removed a weak startIsIdempotent test that only checked the metric-name Set size — it could not detect duplicate JMX listeners or duplicate observable callbacks under the same instrument name, which are the actual failure modes if the guard were removed.
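
The reason a single JVM cannot exercise both flag values can be seen in a reduced model of the start guard. This is a sketch under the assumption that the guard is a compare-and-set on a static AtomicBoolean; the class below is hypothetical and omits all metric registration.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Reduced model of a process-wide start guard; names are hypothetical. */
final class StartGuardSketch {
  private static final AtomicBoolean STARTED = new AtomicBoolean(false);
  private static volatile boolean experimentalEnabled;

  /** Returns true only for the first call in the JVM; later calls are no-ops. */
  static boolean start(boolean experimental) {
    if (!STARTED.compareAndSet(false, true)) {
      return false; // already started: the requested flag value is ignored
    }
    experimentalEnabled = experimental;
    return true;
  }

  static boolean experimentalEnabled() {
    return experimentalEnabled;
  }

  public static void main(String[] args) {
    System.out.println(start(true));            // first call wins
    System.out.println(start(false));           // ignored
    System.out.println(experimentalEnabled());  // still the first value
  }
}
```

Because the second call is ignored, a test that needs start(false) has to run in a JVM where start(true) has never been called, which is what forking provides.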

Misc

  • Added debug logs when an MXBean isn't available so it's obvious why a metric didn't show up.

Motivation

The parent PR established the OTLP JVM runtime metrics pipeline but only emitted a subset of the OTel JVM semantic conventions, and jvm.thread.count was emitted as a single undifferentiated count. This follow-up brings the surface in line with what opentelemetry-java-instrumentation's runtime-telemetry library emits, splits jvm.thread.count by the spec-required daemon/state attributes, and adds the standard experimental-metrics opt-out so users who want only the Stable subset (smaller cardinality, fewer dashboard surprises) can disable Development metrics without losing the integration entirely.

Aligning the value guards with OTel's reference implementation prevents two real-world divergences:

  1. Without the 0 vs -1 fix, uncapped non-heap pools (where getMax() == 0 on some JVM/version combos) would silently produce no jvm.memory.limit data point — they should publish 0 to indicate "no limit observed."
  2. The experimental gate ensures dashboards built against OTel's stable-only output won't differ between OTel SDK collection and DD-agent collection.

The JDK-version-aware jvm.thread.count collector matters because the module is compiled against Java 8 (so the Java 9+ ThreadInfo.isDaemon() cannot be referenced directly) but ships into JVMs ranging from Java 8 through current LTS releases, plus GraalVM native image where ThreadMXBean is unsupported. On every runtime it supports, the chosen approach avoids calling Thread.getAllStackTraces() on each poll, which would otherwise dominate overhead.

Additional Notes

  • No change to JMXFetch behavior beyond passing the new flag through JvmOtlpRuntimeMetrics.start(...). The OTLP_JMX_CONFIG-skip path is unchanged.
  • The OTel-spec setting otel.instrumentation.runtime-telemetry.emit-experimental-metrics (env var form OTEL_INSTRUMENTATION_RUNTIME_TELEMETRY_EMIT_EXPERIMENTAL_METRICS) is captured in OtelEnvironmentConfigSource so an unmodified OTel-style config picks up the flag automatically.

Contributor Checklist

Jira ticket: [PROJ-IDENT]

@mhlidd
Contributor Author

mhlidd commented May 18, 2026

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 62d9b50d1d

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Base automatically changed from maximo/otlp-runtime-metrics to master May 19, 2026 18:23
@mhlidd mhlidd force-pushed the mhlidd/otlp_runtime_metrics_follow_up branch from 90ddfc2 to de166ab Compare May 20, 2026 18:24
@mhlidd mhlidd changed the title from "init" to "Add Additional OTel JVM Runtime Metrics and Gate "Developmental" Metrics" May 20, 2026
@mhlidd mhlidd marked this pull request as ready for review May 20, 2026 19:53
@mhlidd mhlidd requested review from a team as code owners May 20, 2026 19:53
@mhlidd mhlidd requested review from ValentinZakharov, bric3 and mcculls and removed request for a team May 20, 2026 19:53
@dd-octo-sts
Contributor

dd-octo-sts Bot commented May 20, 2026

Hi! 👋 Thanks for your pull request! 🎉

To help us review it, please make sure to:

  • Add at least one type, and one component or instrumentation label to the pull request

If you need help, please check our contributing guidelines.

@mhlidd mhlidd added the labels type: enhancement (Enhancements and improvements), inst: opentelemetry (OpenTelemetry instrumentation), and tag: ai generated (Largely based on code generated by an AI or LLM) May 20, 2026
@mhlidd
Contributor Author

mhlidd commented May 20, 2026

@codex review

@chatgpt-codex-connector

Codex Review: Didn't find any major issues. Another round soon, please!

Contributor

@ValentinZakharov ValentinZakharov left a comment

Could you clarify whether the following differences from the JVM semantic conventions are intentional?

  • jvm.thread.count seems to be missing the recommended attributes thread.daemon and thread.state (spec)
  • jvm.memory.init is not split by memory pool and seems to be missing the jvm.memory.pool.name attribute (spec)

static final int DEFAULT_METRICS_OTEL_TIMEOUT = 7_500; // ms
static final int DEFAULT_METRICS_OTEL_CARDINALITY_LIMIT = 2_000;

public static final boolean DEFAULT_METRICS_OTEL_EXPERIMENTAL_ENABLED = true;
Contributor

Contributor Author

This was an intentional decision. The idea is that JMX already emits many metrics that OTel considers "developmental", so enabling them by default is an easy win.

Contributor

@mcculls mcculls left a comment

One question about whether the default should really be true since OTel defaults it to false at the moment: https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/main/instrumentation/runtime-telemetry/README.md

Also a few cleanup / robustness comments to be addressed before merging - otherwise looks good.

@mhlidd
Contributor Author

mhlidd commented May 26, 2026

Could you clarify whether the following differences from the JVM semantic conventions are intentional?

  • jvm.thread.count seems to be missing the recommended attributes thread.daemon and thread.state (spec)
  • jvm.memory.init is not split by memory pool and seems to be missing the jvm.memory.pool.name attribute (spec)

@ValentinZakharov Great catch! Adding these attributes now.

@mhlidd
Contributor Author

mhlidd commented May 26, 2026

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8f0dd9c339

Comment on lines +374 to +376
if (n == buffer.length) {
  return buffer;
}

P2: Retry thread enumeration until snapshot is complete

enumerateAllThreads() returns immediately when group.enumerate(buffer) fills the array, but ThreadGroup.enumerate silently drops extra threads when the buffer is too small. In the Java 8 / native-image path this causes jvm.thread.count buckets to under-report whenever activeCount() underestimates by more than the fixed +10 headroom (common under thread churn), so metric totals can be materially wrong. Grow-and-retry until the returned count is strictly less than buffer length to avoid truncation.
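
The grow-and-retry approach suggested in this comment could be sketched as below. This is not the PR's enumerateAllThreads(); the class is hypothetical, and the doubling growth factor and minimum size of 16 are arbitrary choices made for illustration.

```java
import java.util.Arrays;

/** Illustrative grow-and-retry enumeration; not the code under review. */
final class ThreadEnumerationSketch {
  static Thread[] enumerateAllThreads() {
    // Walk up to the root group so that every live thread is reachable.
    ThreadGroup root = Thread.currentThread().getThreadGroup();
    while (root.getParent() != null) {
      root = root.getParent();
    }
    // activeCount() is only an estimate, so start with some headroom.
    int size = Math.max(16, root.activeCount() + 10);
    while (true) {
      Thread[] buffer = new Thread[size];
      int n = root.enumerate(buffer, true); // silently truncates if the buffer is too small
      if (n < buffer.length) {
        // Strictly fewer threads than slots: nothing was dropped.
        return Arrays.copyOf(buffer, n);
      }
      size *= 2; // buffer was filled, so it may have been truncated; retry with more room
    }
  }

  public static void main(String[] args) {
    for (Thread t : enumerateAllThreads()) {
      System.out.println(t.getName() + " daemon=" + t.isDaemon() + " state=" + t.getState());
    }
  }
}
```

The loop trades an occasional extra enumeration for a complete snapshot; a full buffer is treated as ambiguous and always retried.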

Contributor Author

This matches what OTel is doing

Labels

inst: opentelemetry (OpenTelemetry instrumentation), tag: ai generated (Largely based on code generated by an AI or LLM), type: enhancement (Enhancements and improvements)

3 participants