Add Additional OTel JVM Runtime Metrics and Gate "Developmental" Metrics#11411
Add Additional OTel JVM Runtime Metrics and Gate "Developmental" Metrics#11411mhlidd wants to merge 6 commits into
Conversation
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 62d9b50d1d
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
This comment has been minimized.
This comment has been minimized.
90ddfc2 to
de166ab
Compare
|
Hi! 👋 Thanks for your pull request! 🎉 To help us review it, please make sure to:
If you need help, please check our contributing guidelines. |
|
@codex review |
|
Codex Review: Didn't find any major issues. Another round soon, please! ℹ️ About Codex in GitHubCodex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback". |
| static final int DEFAULT_METRICS_OTEL_TIMEOUT = 7_500; // ms | ||
| static final int DEFAULT_METRICS_OTEL_CARDINALITY_LIMIT = 2_000; | ||
|
|
||
| public static final boolean DEFAULT_METRICS_OTEL_EXPERIMENTAL_ENABLED = true; |
There was a problem hiding this comment.
Default for this in OTel is false - do we want to match that?
There was a problem hiding this comment.
This was an intentional decision. Idea being that JMX inherently emits many metrics that are considered as "developmental" in OTel, so it would be an easy win to enable these by default.
mcculls
left a comment
There was a problem hiding this comment.
One question about whether the default should really be true since OTel defaults it to false at the moment: https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/main/instrumentation/runtime-telemetry/README.md
Also a few cleanup / robustness comments to be addressed before merging - otherwise looks good.
@ValentinZakharov Great catch! Adding these attributes now. |
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8f0dd9c339
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| if (n == buffer.length) { | ||
| return buffer; | ||
| } |
There was a problem hiding this comment.
Retry thread enumeration until snapshot is complete
enumerateAllThreads() returns immediately when group.enumerate(buffer) fills the array, but ThreadGroup.enumerate silently drops extra threads when the buffer is too small. In the Java 8 / native-image path this causes jvm.thread.count buckets to under-report whenever activeCount() underestimates by more than the fixed +10 headroom (common under thread churn), so metric totals can be materially wrong. Grow-and-retry until the returned count is strictly less than buffer length to avoid truncation.
Useful? React with 👍 / 👎.
What Does This Do
Follow-up to the parent PR for
maximo/otlp-runtime-metricsthat expands the OTLP JVM runtime metrics surface, splitsjvm.thread.countby daemon/state attributes, and gates Development-status metrics behind a new opt-out flag.New config
dd.metrics.otel.experimental.enabled(default:true) — mirrors OTel'sotel.instrumentation.runtime-telemetry.emit-experimental-metrics. Whenfalse, only metrics marked Stable in the OTel JVM semantic conventions are emitted; Development-status metrics are suppressed. Settable via either env var:DD_METRICS_OTEL_EXPERIMENTAL_ENABLED(Datadog form)OTEL_INSTRUMENTATION_RUNTIME_TELEMETRY_EMIT_EXPERIMENTAL_METRICS(OTel-spec form, mapped throughOtelEnvironmentConfigSource)Both env vars are registered in
metadata/supported-configurations.json.Note on the
truedefault (diverges from OTel'sfalse): JMXFetch already emits many metrics that OTel classifies as Development, so defaulting experimental-on keeps the OTLP path at feature parity with the JMX path out of the box. Users who want only the Stable subset can opt out explicitly.Metrics added, reclassified, or restructured (all under the
datadog.jvm.runtimescope, OTel-native names)jvm.thread.countjvm.thread.daemon(Boolean) andjvm.thread.state(String) per OTel JVM semconv, replacing the previous singleAttributes.empty()data point. Empty buckets are skipped.jvm.memory.used_after_last_gcjvm.gc.durationjvm.gc.causeattribute is gated on the experimental flag (the cause attribute is not in OTel's stable attribute set);jvm.gc.nameandjvm.gc.actionare always attached.jvm.memory.initjvm.memory.pool.name), matching the structure of the always-onjvm.memory.used/jvm.memory.committed/jvm.memory.limit.jvm.buffer.memory.used/limit/countjvm.system.cpu.utilizationjvm.system.cpu.load_1mjvm.file_descriptor.count/limitUnixOperatingSystemMXBean)jvm.thread.countcollector — JDK-version- and runtime-awareThe collector is chosen once at class load:
ThreadMXBean.getThreadInfo(long[])(the single-arg overload omits stack-trace capture) and readsThreadInfo.isDaemon()via aMethodHandleresolved reflectively (the surrounding module is compiled against Java 8, so the symbol cannot be referenced directly).ThreadGroup.ThreadInfo.isDaemon()was added in Java 9, andThreadMXBeanis unsupported on GraalVM native images.Both paths avoid
Thread.getAllStackTraces(), which forces a safepoint and allocates aStackTraceElement[]per thread on every poll. The 12(daemon, Thread.State)Attributesobjects are precomputed in twoAttributes[]lookup tables to avoid per-poll allocation.Value-guard alignment with OTel reference implementation
jvm.memory.limitandjvm.memory.initnow skip recording only whengetMax()/getInit()returns the documented-1sentinel (was> 0, which incorrectly also skipped legitimate0values).>= 0, null checks) match the corresponding callbacks inio.opentelemetry.instrumentation.runtimetelemetry.internal.*.Review-comment fixes
isGcNotificationInfoAvailable()now catchesException(wasClassNotFoundException) so any unexpected lookup failure degrades gracefully instead of failing class initialization.GcNotificationListener.handleNotificationnow null-checks theGarbageCollectionNotificationInfopayload before recording, in case a notification arrives without a usable payload.sunOsBean()helper and inlined theinstanceof OperatingSystemMXBeancast at both call sites (registerCpuMetrics,registerSystemCpuMetrics). Makes the type-narrowing explicit at each use site and removes one layer of indirection (reviewer preference for readability over deduplication here).Test coverage
JvmOtlpRuntimeMetricsTestwas extended to assert:jvm.gc.durationemission viaSystem.gc().jvm.thread.countdata points carry validjvm.thread.daemon(boolean) andjvm.thread.state(lower-casedThread.Statename) attributes, empty buckets are skipped, the bucket sum is positive, and both daemon=true and daemon=false buckets are observed in a typical test JVM (jvmThreadCountIsBucketedByDaemonAndState, replacing the oldjvmThreadCountIsPositive).jvm.memory.initemits both the heap/non_heap aggregates and per-pool data points (jvmMemoryInitHasHeapNonHeapAndPoolAttributes), and the heap aggregate value is positive (jvmMemoryInitHeapAggregateIsPositive).JvmOtlpRuntimeMetricsForkedTestruns in an isolated JVM, callsstart(false), and verifies that Development-status instruments are absent and thatjvm.gc.causeis not attached tojvm.gc.durationdata points when experimental metrics are disabled. Forked becauseJvmOtlpRuntimeMetrics.start(...)is guarded by a process-wideAtomicBooleanand the registry / JMX listeners are JVM-global, so a single JVM cannot exercise both flag values.startIsIdempotenttest that only checked the metric-nameSetsize — it could not detect duplicate JMX listeners or duplicate observable callbacks under the same instrument name, which are the actual failure modes if the guard were removed.Misc
Motivation
The parent PR established the OTLP JVM runtime metrics pipeline but only emitted a subset of the OTel JVM semantic conventions, and
jvm.thread.countwas emitted as a single undifferentiated count. This follow-up brings the surface in line with whatopentelemetry-java-instrumentation'sruntime-telemetrylibrary emits, splitsjvm.thread.countby the spec-required daemon/state attributes, and adds the standard experimental-metrics opt-out so users who want only the Stable subset (smaller cardinality, fewer dashboard surprises) can disable Development metrics without losing the integration entirely.Aligning the value guards with OTel's reference implementation prevents two real-world divergences:
0vs-1fix, uncapped non-heap pools (wheregetMax() == 0on some JVM/version combos) would silently produce nojvm.memory.limitdata point — they should publish0to indicate "no limit observed."The JDK-version-aware
jvm.thread.countcollector matters because the module is compiled against Java 8 (so the Java 9+ThreadInfo.isDaemon()cannot be referenced directly) but ships into JVMs ranging from Java 8 through current LTS releases, plus GraalVM native image whereThreadMXBeanis unsupported. The chosen approach avoidsThread.getAllStackTraces()on every poll (which would otherwise dominate overhead) on every runtime it supports.Additional Notes
JvmOtlpRuntimeMetrics.start(...). TheOTLP_JMX_CONFIG-skip path is unchanged.otel.instrumentation.runtime-telemetry.emit-experimental-metricsis captured inOtelEnvironmentConfigSourceso an unmodified OTel-style config picks up the flag automatically.Contributor Checklist
type:and (comp:orinst:) labels in addition to any other useful labelsclose,fix, or any linking keywords when referencing an issueUse
solvesinstead, and assign the PR milestone to the issueJira ticket: [PROJ-IDENT]