Add exception handler on HTTP/2 parent channel to suppress WARN logs #48687

Closed
jeet1995 wants to merge 1 commit into main from AzCosmos_Http2ParentChannelExceptionHandler

Conversation

@jeet1995 (Member) commented Apr 3, 2026

Problem

Customers see noisy Netty WARN logs in HTTP/2 scenarios:

An exceptionCaught() event was fired, and it reached at the tail of the pipeline.
io.netty.channel.unix.Errors$NativeIoException: recvAddress(..) failed with error(-104): Connection reset by peer

These are cosmetic — zero operational impact — but trigger monitoring alerts.

Root Cause

In HTTP/2, reactor-netty multiplexes streams on a shared parent TCP connection. The parent and child channels have different pipeline structures:

HTTP/1.1 pipeline (single channel — no leak to TailContext):

SslHandler → HttpClientCodec → ChannelOperationsHandler → [TAIL]
                                         ↑
                            Catches exceptions, bridges to
                            Reactor subscriber. Exception
                            never reaches TailContext.

HTTP/2 parent channel pipeline (BEFORE fix — leak to TailContext):

SslHandler → Http2FrameCodec → Http2MultiplexHandler → [TAIL]
                                                          ↑
                                            No handler catches it.
                                            TailContext logs WARN.

HTTP/2 parent channel pipeline (AFTER fix):

SslHandler → Http2FrameCodec → Http2MultiplexHandler → Http2ParentChannelExceptionHandler → [TAIL]
                                                                   ↑
                                                       Consumes ALL exceptions.
                                                       Log level based on connection state.

HTTP/2 child stream channel pipeline (unchanged):

H2ToHttp11Codec → IdleStateHandler → ChannelOperationsHandler → [TAIL]
                                              ↑
                               Same as HTTP/1.1 — catches exceptions,
                               bridges to Reactor subscriber.
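
The difference between the pipelines above can be illustrated with a toy model. This is a minimal sketch in plain Java, not Netty code: `Stage`, `handledBy`, and the `consumesExceptions` flag are hypothetical names introduced only for illustration. The model captures a single idea from the diagrams, namely that an exception travels down the pipeline until a handler consumes it or it reaches the tail.

```java
import java.util.Arrays;
import java.util.List;

/** Toy model of exception propagation through a handler pipeline (not Netty code). */
final class TailPropagationDemo {

    /** Hypothetical stand-in for a pipeline handler. */
    static final class Stage {
        final String name;
        final boolean consumesExceptions;

        Stage(String name, boolean consumesExceptions) {
            this.name = name;
            this.consumesExceptions = consumesExceptions;
        }
    }

    /** Returns the name of the first stage that consumes the exception, or "TAIL" if none does. */
    static String handledBy(List<Stage> pipeline) {
        for (Stage stage : pipeline) {
            if (stage.consumesExceptions) {
                return stage.name;
            }
        }
        return "TAIL"; // in the PR description, this is where the WARN log comes from
    }

    /** Parent pipeline before the fix: every stage is modeled as forwarding the exception. */
    static List<Stage> http2ParentBeforeFix() {
        return Arrays.asList(
            new Stage("SslHandler", false),
            new Stage("Http2FrameCodec", false),
            new Stage("Http2MultiplexHandler", false));
    }

    /** Parent pipeline after the fix: the new handler sits last and consumes everything. */
    static List<Stage> http2ParentAfterFix() {
        return Arrays.asList(
            new Stage("SslHandler", false),
            new Stage("Http2FrameCodec", false),
            new Stage("Http2MultiplexHandler", false),
            new Stage("Http2ParentChannelExceptionHandler", true));
    }

    public static void main(String[] args) {
        System.out.println("before fix: " + handledBy(http2ParentBeforeFix()));
        System.out.println("after fix:  " + handledBy(http2ParentAfterFix()));
    }
}
```

The model ignores everything else a real pipeline does (event ordering, outbound handlers, event loops); it is only meant to show why appending one consuming handler is enough to keep the exception away from the tail.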

Design: Connection-State-Based Log Level

The handler consumes all exceptions on the parent channel (no exception type filtering). The log level is determined by connection state:

  • DEBUG — when activeStreams == 0 OR !channelActive. No in-flight requests are affected (e.g., TCP RST from LB idle timeout, post-close cleanup).
  • WARN — when active streams exist on a live channel. The exception may affect in-flight requests.

| Active streams | Channel active | Log level | Rationale |
| --- | --- | --- | --- |
| 0 | true/false | DEBUG | Idle connection: exception is noise |
| >0 | false | DEBUG | Channel already dead: streams will fail via subscriber |
| >0 | true | WARN | Live requests may be affected: worth alerting |

The active stream count is read via Http2FrameCodec.connection().numActiveStreams() on the same parent channel pipeline. If the codec is unavailable, the count falls back to -1, which takes the safe WARN path as long as the channel is still active.

Why no exception type filtering?

By the time any exception reaches our handler, all upstream handlers (Http2FrameCodec, Http2MultiplexHandler) have already handled the protocol actions (GOAWAY, stream reset, child channel error delivery). The exception reaching TailContext is an echo of already-handled work, regardless of type. Connection state (active streams + channel activity) is the only dimension that determines whether the exception has diagnostic value.

Why OR (not AND) for the DEBUG condition?

Either condition alone is sufficient to determine the exception is noise:

  • activeStreams == 0 — no in-flight requests affected, regardless of channel state
  • !channelActive — channel is already dead, any active streams will fail through their Reactor subscribers independently
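
The decision rule described in this section can be sketched as a small pure function. This is an illustration under stated assumptions rather than the PR's actual code: the class name `ParentChannelLogLevelPolicy`, the `Level` enum, and the method `levelFor` are hypothetical, and the real handler would obtain its inputs from the Netty channel and codec.

```java
/** Sketch of the connection-state-based log-level decision (hypothetical names). */
final class ParentChannelLogLevelPolicy {

    enum Level { DEBUG, WARN }

    /**
     * @param activeStreams number of active HTTP/2 streams, or -1 if the count is unavailable
     * @param channelActive whether the parent channel is still active
     */
    static Level levelFor(int activeStreams, boolean channelActive) {
        // Either condition alone marks the exception as noise, hence OR rather than AND.
        if (activeStreams == 0 || !channelActive) {
            return Level.DEBUG;
        }
        // Reached for activeStreams > 0 on a live channel, and for the -1 fallback on a live channel.
        return Level.WARN;
    }

    public static void main(String[] args) {
        System.out.println(levelFor(0, true));   // idle connection
        System.out.println(levelFor(3, false));  // channel already dead
        System.out.println(levelFor(3, true));   // live requests may be affected
        System.out.println(levelFor(-1, true));  // stream count unavailable
    }
}
```

Keeping the rule free of Netty types makes each row of the table above checkable without a channel, which is relevant given that the !channelActive branch is hard to reach with EmbeddedChannel.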

Testing

Five EmbeddedChannel unit tests with a production-matching pipeline (Http2FrameCodec → Http2MultiplexHandler → handler):

| Test | What it proves |
| --- | --- |
| withoutHandler_exceptionReachesTail | BEFORE: exception reaches TailContext → WARN |
| withHandler_zeroActiveStreams_consumedAtDebug | 0 active streams → consumed at DEBUG |
| withHandler_exceptionDoesNotCloseChannel | Handler does NOT close channel |
| withHandler_runtimeException_zeroActiveStreams_consumed | RuntimeException also consumed (no type filtering) |
| withHandler_npe_zeroActiveStreams_consumed | NPE also consumed (no type filtering) |

Note: The !channelActive branch cannot be unit-tested with EmbeddedChannel because disconnect() tears down the pipeline before fireExceptionCaught can reach handlers. In production, exceptionCaught() fires while the channel is transitioning to inactive.

Impact

  • Zero perf impact: the handler overrides only exceptionCaught(), so Netty's @Skip optimization bypasses it for all hot-path events
  • Zero lifecycle change — handler does NOT close the channel
  • Telemetry preserved — exceptions with active streams on a live channel still log at WARN
  • HTTP/1.1 parity — matches ChannelOperationsHandler behavior for connection-level exceptions

@jeet1995 jeet1995 requested review from a team and kirankumarkolli as code owners April 3, 2026 20:36
Copilot AI review requested due to automatic review settings April 3, 2026 20:36
@github-actions github-actions Bot added the Cosmos label Apr 3, 2026
Copilot AI (Contributor) left a comment

Pull request overview

This PR addresses noisy Netty WARN logs that occur on HTTP/2 parent (TCP) connections when idle connections are reset, by installing a parent-channel exception handler to consume those exceptions at DEBUG level and close the channel (matching HTTP/1.1 behavior).

Changes:

  • Install an HTTP/2 parent-channel exception handler during doOnConnected by accessing connection.channel().parent().
  • Add Http2ParentChannelExceptionHandler to log parent-channel exceptions at DEBUG and close the channel.
  • Document the fix in azure-cosmos CHANGELOG.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/http/ReactorNettyClient.java | Adds logic to install a parent (TCP) channel exception handler for HTTP/2 to suppress TailContext WARN logs. |
| sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/http/Http2ParentChannelExceptionHandler.java | Introduces a Netty inbound handler that consumes/logs exceptions at DEBUG and closes the parent channel. |
| sdk/cosmos/azure-cosmos/CHANGELOG.md | Adds a “Bugs Fixed” entry describing the WARN suppression for HTTP/2 parent-channel exceptions. |

Comment thread sdk/cosmos/azure-cosmos/CHANGELOG.md Outdated

#### Bugs Fixed
Fixing an NPE caused due to boxed Boolean conversion. - See [PR 48656](https://github.com/Azure/azure-sdk-for-java/pull/48656/)
* Fixed Netty WARN log "An exceptionCaught() event was fired, and it reached at the tail of the pipeline" appearing on HTTP/2 connections when the server resets idle TCP connections. Added an exception handler on the HTTP/2 parent channel to consume connection-level exceptions at DEBUG level, matching HTTP/1.1 behavior. - See [PR TBD](https://github.com/Azure/azure-sdk-for-java/pull/TBD)
Copilot AI Apr 3, 2026

CHANGELOG entry still contains a placeholder PR reference ("PR TBD" and a /pull/TBD link). This will be a broken link in released notes; please replace with the actual PR number (or remove the link if not available).

Suggested change
* Fixed Netty WARN log "An exceptionCaught() event was fired, and it reached at the tail of the pipeline" appearing on HTTP/2 connections when the server resets idle TCP connections. Added an exception handler on the HTTP/2 parent channel to consume connection-level exceptions at DEBUG level, matching HTTP/1.1 behavior. - See [PR TBD](https://github.com/Azure/azure-sdk-for-java/pull/TBD)
* Fixed Netty WARN log "An exceptionCaught() event was fired, and it reached at the tail of the pipeline" appearing on HTTP/2 connections when the server resets idle TCP connections. Added an exception handler on the HTTP/2 parent channel to consume connection-level exceptions at DEBUG level, matching HTTP/1.1 behavior.

Comment thread sdk/cosmos/azure-cosmos/CHANGELOG.md Outdated
@@ -8,6 +8,7 @@

#### Bugs Fixed
Fixing an NPE caused due to boxed Boolean conversion. - See [PR 48656](https://github.com/Azure/azure-sdk-for-java/pull/48656/)
Copilot AI Apr 3, 2026

In the Unreleased 'Bugs Fixed' section, the first entry is not formatted as a bullet while the newly added entry is. Please make the formatting consistent (typically all entries in these sections are bulleted).

Suggested change
Fixing an NPE caused due to boxed Boolean conversion. - See [PR 48656](https://github.com/Azure/azure-sdk-for-java/pull/48656/)
* Fixing an NPE caused due to boxed Boolean conversion. - See [PR 48656](https://github.com/Azure/azure-sdk-for-java/pull/48656/)

Comment on lines +174 to +180
Channel parent = connection.channel().parent();
if (parent != null
    && parent.pipeline().get(Http2ParentChannelExceptionHandler.HANDLER_NAME) == null) {

    parent.pipeline().addLast(
        Http2ParentChannelExceptionHandler.HANDLER_NAME,
        new Http2ParentChannelExceptionHandler());
Copilot AI Apr 3, 2026

The handler installation on the shared HTTP/2 parent channel uses a non-atomic check-then-add (pipeline().get(name) == null then addLast). If multiple stream channels attempt this concurrently, addLast can throw due to a duplicate handler name. Consider making installation robust (e.g., run the add on the parent channel's event loop and/or catch the duplicate-name exception and ignore it) so new stream creation can’t fail because of a race.
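
One way to make the installation tolerant of this race is sketched below, assuming the pipeline rejects a duplicate handler name with an IllegalArgumentException (which is understood to be Netty's behavior for `addLast`). The sketch uses a plain-Java stand-in rather than Netty itself: `NamedPipeline`, `installOnce`, `raceInstall`, and the handler name `"h2ParentExceptionHandler"` are all hypothetical and exist only to show the check, add, and catch pattern.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

/** Sketch of a race-tolerant "install once" pattern (plain-Java stand-in, not Netty). */
final class InstallOnceDemo {

    /** Hypothetical stand-in for a pipeline that rejects duplicate handler names. */
    static final class NamedPipeline {
        private final Map<String, Object> handlers = new LinkedHashMap<>();

        synchronized void addLast(String name, Object handler) {
            if (handlers.containsKey(name)) {
                throw new IllegalArgumentException("Duplicate handler name: " + name);
            }
            handlers.put(name, handler);
        }

        synchronized Object get(String name) {
            return handlers.get(name);
        }
    }

    /** Returns true only for the caller that actually installed the handler. */
    static boolean installOnce(NamedPipeline pipeline, String name, Object handler) {
        if (pipeline.get(name) != null) {
            return false; // fast path: already installed
        }
        try {
            pipeline.addLast(name, handler);
            return true;
        } catch (IllegalArgumentException duplicate) {
            // Another caller won the race between get() and addLast(); nothing to do.
            return false;
        }
    }

    /** Starts the given number of threads that all try to install at once; returns the install count. */
    static int raceInstall(int threads) {
        NamedPipeline pipeline = new NamedPipeline();
        AtomicInteger installs = new AtomicInteger();
        CountDownLatch start = new CountDownLatch(1);
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                try {
                    start.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                if (installOnce(pipeline, "h2ParentExceptionHandler", new Object())) {
                    installs.incrementAndGet();
                }
            });
            workers[i].start();
        }
        start.countDown();
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return installs.get();
    }

    public static void main(String[] args) {
        System.out.println("handlers installed: " + raceInstall(8));
    }
}
```

The other option named in the comment, scheduling the add on the parent channel's event loop, avoids the exception entirely by serializing the check and the add; the catch-based variant is shown here only because it needs no event loop to demonstrate.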

Comment on lines +168 to +181
// Install exception handler on the HTTP/2 parent (TCP) channel.
// In H2, doOnConnected fires for stream (child) channels — channel.parent()
// is the TCP connection. The parent pipeline has no ChannelOperationsHandler
// (unlike H1.1), so TCP-level exceptions (RST, broken pipe) propagate to
// Netty's TailContext and get logged as WARN. This handler matches H1.1
// behavior by consuming exceptions at DEBUG level.
Channel parent = connection.channel().parent();
if (parent != null
    && parent.pipeline().get(Http2ParentChannelExceptionHandler.HANDLER_NAME) == null) {

    parent.pipeline().addLast(
        Http2ParentChannelExceptionHandler.HANDLER_NAME,
        new Http2ParentChannelExceptionHandler());
}
Copilot AI Apr 3, 2026

This change introduces new behavior (consuming parent-channel exceptions and closing the parent connection) without accompanying test coverage. There are existing Netty/transport tests in azure-cosmos-tests (e.g., ones that use EmbeddedChannel); please add a unit/integration test that asserts the handler is installed on the H2 parent pipeline and that an exception on the parent is consumed (no TailContext WARN) and results in the parent channel closing.

Copilot generated this review using guidance from repository custom instructions.
public void exceptionCaught(ChannelHandlerContext ctx, Throwable cause) {
    if (logger.isDebugEnabled()) {
        logger.debug("Exception on HTTP/2 parent connection [id:{}]: {}",
            ctx.channel().id().asShortText(), cause.getMessage(), cause);
Copilot AI Apr 3, 2026

exceptionCaught currently logs only cause.getMessage() in the formatted message. For many Netty exceptions the message can be null/empty, reducing diagnostics. Consider logging the exception class (e.g., cause.toString()) in the formatted portion so the DEBUG log remains useful even when getMessage() is null.

Suggested change
ctx.channel().id().asShortText(), cause.getMessage(), cause);
ctx.channel().id().asShortText(), cause.toString(), cause);
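
The difference between the two calls can be seen with standard JDK exceptions. The following is a small illustration, not part of the PR; the class `CauseFormattingDemo` and its helper methods are hypothetical names.

```java
import java.io.IOException;
import java.nio.channels.ClosedChannelException;

/** Shows why toString() is the safer choice when an exception may carry no message. */
final class CauseFormattingDemo {

    /** What a "{}" placeholder would render when given cause.getMessage(). */
    static String byMessage(Throwable cause) {
        return String.valueOf(cause.getMessage());
    }

    /** What a "{}" placeholder would render when given cause.toString(). */
    static String byToString(Throwable cause) {
        return cause.toString();
    }

    public static void main(String[] args) {
        Throwable noMessage = new ClosedChannelException(); // constructed without a message
        Throwable withMessage = new IOException("Connection reset by peer");

        System.out.println(byMessage(noMessage));    // null
        System.out.println(byToString(noMessage));   // java.nio.channels.ClosedChannelException
        System.out.println(byMessage(withMessage));  // Connection reset by peer
        System.out.println(byToString(withMessage)); // java.io.IOException: Connection reset by peer
    }
}
```

With toString(), the exception class name is always present in the formatted portion of the log line, with the message appended when one exists.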

@jeet1995 jeet1995 marked this pull request as draft April 3, 2026 20:42
@jeet1995 jeet1995 force-pushed the AzCosmos_Http2ParentChannelExceptionHandler branch 2 times, most recently from 0ec7d77 to e5f9537 Compare April 3, 2026 21:31
@jeet1995 jeet1995 force-pushed the AzCosmos_Http2ParentChannelExceptionHandler branch 3 times, most recently from 0da600f to acea1fb Compare April 20, 2026 23:26
In HTTP/2, reactor-netty multiplexes streams on a shared parent TCP connection.
The parent channel pipeline has no ChannelOperationsHandler (unlike HTTP/1.1),
so TCP-level exceptions like Connection reset by peer (ECONNRESET) propagate to
Netty's TailContext, which logs them as WARN.

This adds Http2ParentChannelExceptionHandler to the parent channel via
doOnConnected (accessing channel.parent()). The handler consumes exceptions
at DEBUG level WITHOUT closing the channel or altering connection lifecycle,
matching HTTP/1.1 logging behavior.

Changes:
- Handler logs cause.toString() (not getMessage()) for null-safe diagnostics
- Defensive try-catch for duplicate handler name on concurrent stream creation
- Before/after verified with EmbeddedChannel unit tests

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@jeet1995 jeet1995 force-pushed the AzCosmos_Http2ParentChannelExceptionHandler branch from acea1fb to 3e2b4f4 Compare April 20, 2026 23:34
@jeet1995 jeet1995 marked this pull request as ready for review April 21, 2026 00:55
@jeet1995 (Member, Author)

Superseded by #48890 (re-raised from fork for push access).

@jeet1995 jeet1995 closed this Apr 21, 2026