Report Kafka client connection status in DSM payloads#11442
Report Kafka client connection status in DSM payloads#11442piochelepiotr wants to merge 2 commits into
Conversation
Kafka / producer-benchmarkParameters
See matching parameters
SummaryFound 0 performance improvements and 0 performance regressions! Performance is the same for 3 metrics, 0 unstable metrics. See unchanged results
|
Previously, the DSM payload only carried Kafka client configs once
`Metadata.update` fired with a valid cluster ID — so clients that never
authenticated or never reached a broker were silently dropped, and we
couldn't compare their configs against working clients.
Now every config is also reported with a `connectionStatus` field
("connected" / "failed") on the per-bucket `Configs` entry, including
on `Metadata.failedUpdate`. Also expands the value-allowlist with
non-secret auth selectors (`sasl.mechanism`, `ssl.protocol`,
`ssl.endpoint.identification.algorithm`, etc.) so the comparison flow
can surface mechanism typos without leaking credentials.
tag: ai generated
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1c8096d to
9cd3b0f
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 9cd3b0fc5f
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| public static void reportPendingConfigAsFailed(MetadataState state) { | ||
| // clusterId may be unknown on auth/connect failure — emit with whatever we have (often "") | ||
| reportPendingConfig(state, state.clusterId, PendingConfig.STATUS_FAILED); |
There was a problem hiding this comment.
Keep pending Kafka config after transient failedUpdate
reportPendingConfigAsFailed routes through reportPendingConfig, which consumes the pending config via takePendingConfig(). In Kafka, failedUpdate can occur on transient/retriable metadata failures before a later successful update; with this change, the first transient failure permanently records the client as failed and prevents the later connected report from ever being emitted for that client instance. This makes connection status inaccurate for recovering clients and can mislead DSM comparisons.
Useful? React with 👍 / 👎.
|
Kafka / consumer-benchmarkParameters
See matching parameters
SummaryFound 0 performance improvements and 0 performance regressions! Performance is the same for 3 metrics, 0 unstable metrics. See unchanged results
|
reportPendingConfigAsFailed used to call takePendingConfig(), so a transient failedUpdate would permanently record the client as failed and prevent a later successful update from emitting "connected". Switch to a peek + one-shot flag: emit at most one "failed" report per pending config without consuming it, so recovery still flows through the normal update path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
8627f7a to
054b71e
Compare
Summary
connectionStatusfield (connected/failed) inside the DSM msgpack payload (Configs[*].ConnectionStatus, peer toType,KafkaClusterId,ConsumerGroup,Config).org.apache.kafka.clients.Metadata.failedUpdatein bothkafka-clients-0.11andkafka-clients-3.8so failed clients are reported (previously they were silently dropped — reporting was gated onMetadata.updatesucceeding).sasl.mechanism,ssl.protocol,ssl.enabled.protocols,ssl.endpoint.identification.algorithm,ssl.truststore.type,ssl.keystore.type,ssl.cipher.suites,sasl.kerberos.service.name,sasl.login.callback.handler.class). Credentials remain masked.Why
End goal is a DSM UI flow where a user can diff a failing client's config against working clients on the same cluster — typoed
bootstrap.servers, missingsasl.mechanism, SSL truststore drift, etc. Without capturing failed clients at all, the comparison is impossible today.Downstream
connection_statuscolumn + 8th positional proc param withDEFAULT 'connected')Test plan
DataStreamsWritingTest,DefaultDataStreamsMonitoringTest,KafkaConfigHelperTest)Reporting kafka_producer config with status=connectedfor a working producer/consumer andstatus=failedfor a producer pointed at a closed port; agent forwards both to/v0.1/pipeline_stats.Known gap (follow-up)
A client whose
bootstrap.serversfails at constructor DNS-resolve time (e.g.bogus.invalid:9092) reports nothing — the producer never builds, so noMetadataever exists. Adding@Advice.OnMethodExit(onThrowable=...)on the producer constructor would close that gap.tag: ai generated
🤖 Generated with Claude Code