Skip to content

Report Kafka client connection status in DSM payloads#11442

Open
piochelepiotr wants to merge 2 commits into
masterfrom
feat/kafka-config-connection-status
Open

Report Kafka client connection status in DSM payloads#11442
piochelepiotr wants to merge 2 commits into
masterfrom
feat/kafka-config-connection-status

Conversation

@piochelepiotr
Copy link
Copy Markdown
Contributor

Summary

  • Kafka client configs now carry a connectionStatus field (connected / failed) inside the DSM msgpack payload (Configs[*].ConnectionStatus, peer to Type, KafkaClusterId, ConsumerGroup, Config).
  • New advice on org.apache.kafka.clients.Metadata.failedUpdate in both kafka-clients-0.11 and kafka-clients-3.8 so failed clients are reported (previously they were silently dropped — reporting was gated on Metadata.update succeeding).
  • Expanded the value-allowlist with non-secret auth selectors (sasl.mechanism, ssl.protocol, ssl.enabled.protocols, ssl.endpoint.identification.algorithm, ssl.truststore.type, ssl.keystore.type, ssl.cipher.suites, sasl.kerberos.service.name, sasl.login.callback.handler.class). Credentials remain masked.

Why

End goal is a DSM UI flow where a user can diff a failing client's config against working clients on the same cluster — typoed bootstrap.servers, missing sasl.mechanism, SSL truststore drift, etc. Without capturing failed clients at all, the comparison is impossible today.

Downstream

  • Migration: DataDog/k8s-resources#157530 (adds connection_status column + 8th positional proc param with DEFAULT 'connected')
  • Edge + writer + dsm-api: dd-go / dd-source PRs (linked separately)

Test plan

  • Unit tests updated (DataStreamsWritingTest, DefaultDataStreamsMonitoringTest, KafkaConfigHelperTest)
  • End-to-end: local docker compose (kafka + dd-agent) with the built agent jar shows Reporting kafka_producer config with status=connected for a working producer/consumer and status=failed for a producer pointed at a closed port; agent forwards both to /v0.1/pipeline_stats.

Known gap (follow-up)

A client whose bootstrap.servers fails at constructor DNS-resolve time (e.g. bogus.invalid:9092) reports nothing — the producer never builds, so no Metadata ever exists. Adding @Advice.OnMethodExit(onThrowable=...) on the producer constructor would close that gap.

tag: ai generated

🤖 Generated with Claude Code

@piochelepiotr piochelepiotr added type: enhancement Enhancements and improvements inst: kafka Kafka instrumentation tag: ai generated Largely based on code generated by an AI or LLM labels May 21, 2026
@pr-commenter
Copy link
Copy Markdown

pr-commenter Bot commented May 21, 2026

Kafka / producer-benchmark

Parameters

Baseline Candidate
baseline_or_candidate baseline candidate
git_branch master feat/kafka-config-connection-status
git_commit_date 1779794434 1779810745
git_commit_sha 970f5ee 9cd3b0f
See matching parameters
Baseline Candidate
ci_job_date 1779811815 1779811815
ci_job_id 1713477506 1713477506
ci_pipeline_id 115207487 115207487
cpu_model Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
jdkVersion 11.0.25 11.0.25
jmhVersion 1.36 1.36
jvm /usr/lib/jvm/java-11-openjdk-amd64/bin/java /usr/lib/jvm/java-11-openjdk-amd64/bin/java
jvmArgs -Dfile.encoding=UTF-8 -Djava.io.tmpdir=/go/src/github.com/DataDog/apm-reliability/dd-trace-java/platform/src/producer-benchmark/build/tmp/jmh -Duser.country=US -Duser.language=en -Duser.variant -Dfile.encoding=UTF-8 -Djava.io.tmpdir=/go/src/github.com/DataDog/apm-reliability/dd-trace-java/platform/src/producer-benchmark/build/tmp/jmh -Duser.country=US -Duser.language=en -Duser.variant
vmName OpenJDK 64-Bit Server VM OpenJDK 64-Bit Server VM
vmVersion 11.0.25+9-post-Ubuntu-1ubuntu122.04 11.0.25+9-post-Ubuntu-1ubuntu122.04

Summary

Found 0 performance improvements and 0 performance regressions! Performance is the same for 3 metrics, 0 unstable metrics.

See unchanged results
scenario Δ mean throughput
scenario:not-instrumented/KafkaProduceBenchmark.benchProduce same
scenario:only-tracing-dsm-disabled-benchmarks/KafkaProduceBenchmark.benchProduce same
scenario:only-tracing-dsm-enabled-benchmarks/KafkaProduceBenchmark.benchProduce same

Previously, the DSM payload only carried Kafka client configs once
`Metadata.update` fired with a valid cluster ID — so clients that never
authenticated or never reached a broker were silently dropped, and we
couldn't compare their configs against working clients.

Now every config is also reported with a `connectionStatus` field
("connected" / "failed") on the per-bucket `Configs` entry, including
on `Metadata.failedUpdate`. Also expands the value-allowlist with
non-secret auth selectors (`sasl.mechanism`, `ssl.protocol`,
`ssl.endpoint.identification.algorithm`, etc.) so the comparison flow
can surface mechanism typos without leaking credentials.

tag: ai generated

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@piochelepiotr piochelepiotr force-pushed the feat/kafka-config-connection-status branch from 1c8096d to 9cd3b0f Compare May 26, 2026 15:52
@piochelepiotr piochelepiotr marked this pull request as ready for review May 26, 2026 15:52
@piochelepiotr piochelepiotr requested review from a team as code owners May 26, 2026 15:52
@piochelepiotr piochelepiotr requested review from jordan-wong and removed request for a team May 26, 2026 15:52
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9cd3b0fc5f

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +125 to +127
public static void reportPendingConfigAsFailed(MetadataState state) {
// clusterId may be unknown on auth/connect failure — emit with whatever we have (often "")
reportPendingConfig(state, state.clusterId, PendingConfig.STATUS_FAILED);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Keep pending Kafka config after transient failedUpdate

reportPendingConfigAsFailed routes through reportPendingConfig, which consumes the pending config via takePendingConfig(). In Kafka, failedUpdate can occur on transient/retriable metadata failures before a later successful update; with this change, the first transient failure permanently records the client as failed and prevents the later connected report from ever being emitted for that client instance. This makes connection status inaccurate for recovering clients and can mislead DSM comparisons.

Useful? React with 👍 / 👎.

@datadog-prod-us1-4
Copy link
Copy Markdown

datadog-prod-us1-4 Bot commented May 26, 2026

Pipelines

Fix all issues with BitsAI

⚠️ Warnings

🚦 5 Pipeline jobs failed

DataDog/apm-reliability/dd-trace-java | java-startup-parallel-check-slo-breaches   View in Datadog   GitLab

🛟 This job is unlikely to succeed on retry. Please review your pipeline configuration. Failed to retrieve GitHub token. No such file or directory: artifacts/fail-on-breach.github-token.

DataDog/apm-reliability/dd-trace-java | java-startup-parallel-generate-slos   View in Datadog   GitLab

🛟 This job is unlikely to succeed on retry. Please review your pipeline configuration. No files to upload. No matching files found in 'artifacts/slos-*.yaml'.

DataDog/apm-reliability/dd-trace-java | java-startup-parallel-upload-to-bp-api   View in Datadog   GitLab

🛟 This job is unlikely to succeed on retry. Please review your pipeline configuration. File access error: Cannot access '/go/src/github.com/DataDog/apm-reliability/dd-trace-java/artifacts/candidate-*.converted.json': No such file or directory

View all 5 failed jobs.

Useful? React with 👍 / 👎

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: 054b71e | Docs | Datadog PR Page | Give us feedback!

@pr-commenter
Copy link
Copy Markdown

pr-commenter Bot commented May 26, 2026

Kafka / consumer-benchmark

Parameters

Baseline Candidate
baseline_or_candidate baseline candidate
git_branch master feat/kafka-config-connection-status
git_commit_date 1779794434 1779810745
git_commit_sha 970f5ee 9cd3b0f
See matching parameters
Baseline Candidate
ci_job_date 1779811846 1779811846
ci_job_id 1713477509 1713477509
ci_pipeline_id 115207487 115207487
cpu_model Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
jdkVersion 11.0.25 11.0.25
jmhVersion 1.36 1.36
jvm /usr/lib/jvm/java-11-openjdk-amd64/bin/java /usr/lib/jvm/java-11-openjdk-amd64/bin/java
jvmArgs -Dfile.encoding=UTF-8 -Djava.io.tmpdir=/go/src/github.com/DataDog/apm-reliability/dd-trace-java/platform/src/consumer-benchmark/build/tmp/jmh -Duser.country=US -Duser.language=en -Duser.variant -Dfile.encoding=UTF-8 -Djava.io.tmpdir=/go/src/github.com/DataDog/apm-reliability/dd-trace-java/platform/src/consumer-benchmark/build/tmp/jmh -Duser.country=US -Duser.language=en -Duser.variant
vmName OpenJDK 64-Bit Server VM OpenJDK 64-Bit Server VM
vmVersion 11.0.25+9-post-Ubuntu-1ubuntu122.04 11.0.25+9-post-Ubuntu-1ubuntu122.04

Summary

Found 0 performance improvements and 0 performance regressions! Performance is the same for 3 metrics, 0 unstable metrics.

See unchanged results
scenario Δ mean throughput
scenario:not-instrumented/KafkaConsumerBenchmark.benchConsume same
scenario:only-tracing-dsm-disabled-benchmarks/KafkaConsumerBenchmark.benchConsume same
scenario:only-tracing-dsm-enabled-benchmarks/KafkaConsumerBenchmark.benchConsume same

reportPendingConfigAsFailed used to call takePendingConfig(), so a
transient failedUpdate would permanently record the client as failed
and prevent a later successful update from emitting "connected".

Switch to a peek + one-shot flag: emit at most one "failed" report per
pending config without consuming it, so recovery still flows through
the normal update path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@piochelepiotr piochelepiotr force-pushed the feat/kafka-config-connection-status branch from 8627f7a to 054b71e Compare May 26, 2026 20:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

inst: kafka Kafka instrumentation tag: ai generated Largely based on code generated by an AI or LLM type: enhancement Enhancements and improvements

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant