
Add exception handler on HTTP/2 parent channel to suppress WARN logs #48890

Open
jeet1995 wants to merge 7 commits into Azure:main from jeet1995:AzCosmos_Http2ParentChannelExceptionHandler

Conversation

Member

@jeet1995 jeet1995 commented Apr 21, 2026

Problem

Customers see noisy Netty WARN logs in HTTP/2 scenarios:

```
An exceptionCaught() event was fired, and it reached at the tail of the pipeline.
io.netty.channel.unix.Errors$NativeIoException: recvAddress(..) failed with error(-104): Connection reset by peer
```

Root Cause

In HTTP/2, reactor-netty multiplexes streams on a shared parent TCP connection. The parent and child channels have different pipeline structures:

HTTP/1.1 pipeline (single channel — no leak to TailContext):

```
SslHandler → HttpClientCodec → ChannelOperationsHandler → [TAIL]
                                         ↑
                            Catches exceptions, bridges to
                            Reactor subscriber. Exception
                            never reaches TailContext.
```

HTTP/2 parent channel pipeline (BEFORE fix — leak to TailContext):

```
SslHandler → Http2FrameCodec → Http2MultiplexHandler → [TAIL]
                                                          ↑
                                            No handler catches it.
                                            TailContext logs WARN.
```

HTTP/2 parent channel pipeline (AFTER fix):

```
SslHandler → Http2FrameCodec → Http2MultiplexHandler → Http2ParentChannelExceptionHandler → [TAIL]
                                                                   ↑
                                                       Consumes ALL exceptions.
                                                       Log level based on connection state.
```

HTTP/2 child stream channel pipeline (unchanged):

```
H2ToHttp11Codec → IdleStateHandler → ChannelOperationsHandler → [TAIL]
                                              ↑
                               Same as HTTP/1.1 — catches exceptions,
                               bridges to Reactor subscriber.
```

Design: Connection-State-Based Log Level

The handler consumes all exceptions on the parent channel (no exception type filtering). The log level is determined by connection state:

  • DEBUG — when activeStreams == 0 OR !channelActive.
  • WARN — when active streams exist on a live channel.
| Active streams | Channel active | Log level | Rationale |
|---|---|---|---|
| 0 | true/false | DEBUG | Idle connection — no in-flight requests affected |
| >0 | false | DEBUG | Channel already dead — streams will fail via subscriber |
| >0 | true | WARN | Live requests may be affected |

The active stream count is retrieved via Http2FrameCodec.connection().numActiveStreams() on the same parent channel pipeline. If the codec is unavailable, the count falls back to -1, which takes the safe WARN path.

Why no exception type filtering?

By the time any exception reaches our handler, all upstream handlers (Http2FrameCodec, Http2MultiplexHandler) have already handled the protocol actions (GOAWAY, stream reset, child channel error delivery). The exception reaching TailContext is an echo of already-handled work, regardless of type. Connection state (active streams + channel activity) is the only dimension that determines whether the exception has diagnostic value.

Why OR (not AND) for the DEBUG condition?

Either condition alone is sufficient:

  • activeStreams == 0 — no in-flight requests affected, regardless of channel state
  • !channelActive — channel is already dead, any active streams will fail through their Reactor subscribers independently
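The decision table above can be sketched as a small pure function. This is a hedged illustration, not the PR's actual code; `chooseLogLevel` is an invented name:

```java
public class LogLevelDecision {
    // DEBUG when no streams are in flight OR the channel is already inactive;
    // WARN otherwise. activeStreams == -1 (Http2FrameCodec unavailable)
    // deliberately falls through to WARN on a live channel.
    static String chooseLogLevel(int activeStreams, boolean channelActive) {
        if (activeStreams == 0 || !channelActive) {
            return "DEBUG";
        }
        return "WARN";
    }
}
```

The sketch mirrors the table row by row: the only WARN outcome is a nonzero (or unknown) stream count on a live channel.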

Testing

5 EmbeddedChannel unit tests with a production-matching pipeline (Http2FrameCodec → Http2MultiplexHandler → handler):

| Test | What it proves |
|---|---|
| withoutHandler_exceptionReachesTail | BEFORE: exception reaches TailContext → WARN |
| withHandler_zeroActiveStreams_consumedAtDebug | 0 active streams → consumed at DEBUG |
| withHandler_exceptionDoesNotCloseChannel | Handler does NOT close channel |
| withHandler_runtimeException_zeroActiveStreams_consumed | RuntimeException also consumed (no type filtering) |
| withHandler_npe_zeroActiveStreams_consumed | NPE also consumed (no type filtering) |

Note: The !channelActive branch cannot be unit-tested with EmbeddedChannel because disconnect() tears down the pipeline before fireExceptionCaught can reach handlers. In production, exceptionCaught() fires while the channel is transitioning to inactive.

Impact

  • Handler only overrides exceptionCaught() — Netty @Skip optimization bypasses it for all hot-path events
  • Handler does NOT close the channel
  • Exceptions with active streams on a live channel still log at WARN
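Putting these behaviors together, a minimal handler skeleton consistent with the description above might look like this. This is a hedged sketch with assumed names (it is not the PR's actual source); the log calls are left as comments, and the Netty APIs used (`ChannelInboundHandlerAdapter`, `Http2FrameCodec.connection().numActiveStreams()`) are real:

```java
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;
import io.netty.handler.codec.http2.Http2FrameCodec;

final class ParentChannelExceptionHandlerSketch extends ChannelInboundHandlerAdapter {

    // Only exceptionCaught is overridden, so Netty bypasses this handler
    // for all other inbound events on the hot path.
    @Override
    public void exceptionCaught(ChannelHandlerContext ctx, Throwable cause) {
        int activeStreams = getActiveStreamCount(ctx);
        if (activeStreams == 0 || !ctx.channel().isActive()) {
            // debug-level log: idle or already-dead connection
        } else {
            // warn-level log: in-flight requests may be affected
        }
        // Deliberately no ctx.fireExceptionCaught(cause), so the exception
        // never reaches TailContext, and no ctx.close(), so the connection
        // lifecycle is untouched.
    }

    private static int getActiveStreamCount(ChannelHandlerContext ctx) {
        Http2FrameCodec codec = ctx.pipeline().get(Http2FrameCodec.class);
        return codec == null ? -1 : codec.connection().numActiveStreams();
    }
}
```

Because the handler neither forwards nor rethrows, an EmbeddedChannel with this handler installed passes `checkException()` after `fireExceptionCaught(...)`, which is exactly the consumption property the PR's tests assert.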

Contributor

Copilot AI left a comment


Pull request overview

Adds a Netty channel handler to suppress noisy “exceptionCaught reached tail of pipeline” WARN logs on HTTP/2 parent (TCP) connections in Cosmos’ Reactor Netty transport, while preserving WARN-level signal when exceptions may impact in-flight HTTP/2 streams.

Changes:

  • Install an HTTP/2 parent-channel exceptionCaught handler from ReactorNettyClient when HTTP/2 is enabled.
  • Add Http2ParentChannelExceptionHandler that consumes parent-channel exceptions and logs at DEBUG vs WARN based on active stream count and channel activity.
  • Add EmbeddedChannel-based unit tests covering exception consumption behavior, and update changelog entry.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

| File | Description |
|---|---|
| sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/http/ReactorNettyClient.java | Adds logic to install the new handler onto the HTTP/2 parent channel pipeline. |
| sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/http/Http2ParentChannelExceptionHandler.java | New handler that consumes parent-channel exceptions and logs based on connection state. |
| sdk/cosmos/azure-cosmos/CHANGELOG.md | Documents the fix in the unreleased section. |
| sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/implementation/http/Http2ParentChannelExceptionHandlerTest.java | New unit tests validating the handler's exception consumption behavior. |
| sdk/cosmos/azure-cosmos-tests/pom.xml | Enables surefire tests and includes trailing whitespace changes. |

Comment thread on sdk/cosmos/azure-cosmos/CHANGELOG.md (Outdated)
@jeet1995
Member Author

/azp run java - cosmos - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

jeet1995 and others added 5 commits April 21, 2026 17:03
In HTTP/2, reactor-netty multiplexes streams on a shared parent TCP connection.
The parent channel pipeline has no ChannelOperationsHandler (unlike HTTP/1.1),
so TCP-level exceptions like Connection reset by peer (ECONNRESET) propagate to
Netty's TailContext, which logs them as WARN.

This adds Http2ParentChannelExceptionHandler to the parent channel via
doOnConnected (accessing channel.parent()). The handler consumes exceptions
at DEBUG level WITHOUT closing the channel or altering connection lifecycle,
matching HTTP/1.1 logging behavior.

Changes:
- Handler logs cause.toString() (not getMessage()) for null-safe diagnostics
- Defensive try-catch for duplicate handler name on concurrent stream creation
- Before/after verified with EmbeddedChannel unit tests
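The doOnConnected wiring this commit describes could be sketched roughly as follows. This is a hedged illustration, not the PR's code: `withParentHandler` and the handler name string `"h2ParentExceptionHandler"` are assumptions, while `HttpClient.doOnConnected`, `Connection.channel()`, and `Channel.parent()` are real reactor-netty/Netty APIs. The anonymous adapter stands in for the PR's Http2ParentChannelExceptionHandler:

```java
import io.netty.channel.Channel;
import io.netty.channel.ChannelInboundHandlerAdapter;
import reactor.netty.http.client.HttpClient;

public class ParentHandlerInstallSketch {
    // Sketch: attach a parent-channel exception handler from doOnConnected.
    // channel.parent() is non-null for HTTP/2 stream channels, so HTTP/1.1
    // connections are left untouched. The null pipeline lookup makes the
    // install idempotent across concurrent stream creation.
    static HttpClient withParentHandler(HttpClient base) {
        return base.doOnConnected(connection -> {
            Channel parent = connection.channel().parent();
            if (parent != null
                    && parent.pipeline().get("h2ParentExceptionHandler") == null) {
                parent.pipeline().addLast(
                    "h2ParentExceptionHandler",
                    new ChannelInboundHandlerAdapter() { /* placeholder */ });
            }
        });
    }
}
```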

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…toString(), update changelog

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@jeet1995 jeet1995 force-pushed the AzCosmos_Http2ParentChannelExceptionHandler branch from d68fa5c to 2a3b5b2 on April 21, 2026 at 21:05
@jeet1995
Member Author

/azp run java - cosmos - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@xinlian12
Member

@sdkReviewAgent

```java
    }
} else {
    // Active streams on a live channel — exception may affect in-flight requests.
    logger.warn(
```
Member

Do we need to add tests to check the activeStreams > 0 path?

Member

Yes, the WARN path (activeStreams > 0 on a live channel) is the most operationally important branch and currently has zero test coverage: all 5 tests exercise only the activeStreams == 0 (DEBUG) path.

A pragmatic way to cover the WARN path without needing real H2 stream negotiation: create an EmbeddedChannel with only the handler (no Http2FrameCodec). When the codec is absent, getActiveStreamCount() returns -1, and since -1 != 0 and EmbeddedChannel.isActive() == true, the condition at line 49 evaluates to false, taking the WARN branch.

```java
@Test(groups = "unit")
public void withHandler_codecAbsent_fallsBackToWarnPath() {
    // Without Http2FrameCodec, getActiveStreamCount() returns -1.
    // -1 != 0 AND channelActive == true takes the WARN path (safe default).
    EmbeddedChannel channel = new EmbeddedChannel(
        new Http2ParentChannelExceptionHandler());

    channel.pipeline().fireExceptionCaught(
        new IOException("Connection reset by peer"));

    // Exception is still consumed (does not reach TailContext)
    channel.checkException();
    assertThat(channel.isOpen()).isTrue();

    channel.finishAndReleaseAll();
}
```

This single test covers two gaps simultaneously:

  1. The -1 fallback behavior of getActiveStreamCount() when the codec is unavailable
  2. The WARN branch execution (the else at line 57)

It doesn't verify the actual log level (that would require a log capture mechanism), but it proves the handler still consumes exceptions on the WARN path rather than letting them reach TailContext.

⚠️ AI-generated review may be incorrect. Agree? Resolve the conversation. Disagree? Reply with your reasoning.

Member Author

@jeet1995 jeet1995 Apr 21, 2026

Good catch @mbhaskar. Added tests for > 0 and -1 (when Http2FrameCodec isn't installed) activeStreams path.

Address Bhaskar's review: add two tests covering the else branch where
activeStreams > 0 on an active channel, exercising the WARN log path.

- withHandler_activeStreams_consumedAtWarn: creates an active H2 stream
  via codec.connection().local().createStream(), fires an exception, and
  verifies it is consumed (does not reach TailContext).
- withHandler_activeStreams_channelNotClosed: same setup, verifies the
  handler does not close the channel even with active streams.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@xinlian12
Member

Review complete (32:05)

No new comments — existing review coverage is sufficient.

Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage

When Http2FrameCodec is absent from the pipeline, getActiveStreamCount()
returns -1. Since -1 != 0 and channelActive == true, the handler takes
the safe WARN path. This test covers that fallback behavior.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@jeet1995
Member Author

/azp run java - cosmos - tests

@azure-pipelines

Azure Pipelines will not run the associated pipelines, because the pull request was updated after the run command was issued. Review the pull request again and issue a new run command.


```java
@Override
public void exceptionCaught(ChannelHandlerContext ctx, Throwable cause) {
    int activeStreams = getActiveStreamCount(ctx);
```
Member

I am wondering: when an exception happens, should we proactively call ctx.close() to force-close any broken connections? From reading, it seems not all exceptions cause the connection to close, so I wonder whether there are edge cases we need to take care of here.

And another question: should we swallow all exceptions? What about Error cases: should we still swallow them, or rethrow?
