Add exception handler on HTTP/2 parent channel to suppress WARN logs#48687
Add exception handler on HTTP/2 parent channel to suppress WARN logs#48687
Conversation
There was a problem hiding this comment.
Pull request overview
This PR addresses noisy Netty WARN logs that occur on HTTP/2 parent (TCP) connections when idle connections are reset, by installing a parent-channel exception handler to consume those exceptions at DEBUG level and close the channel (matching HTTP/1.1 behavior).
Changes:
- Install an HTTP/2 parent-channel exception handler during
doOnConnectedby accessingconnection.channel().parent(). - Add
Http2ParentChannelExceptionHandlerto log parent-channel exceptions at DEBUG and close the channel. - Document the fix in
azure-cosmosCHANGELOG.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/http/ReactorNettyClient.java | Adds logic to install a parent (TCP) channel exception handler for HTTP/2 to suppress TailContext WARN logs. |
| sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/http/Http2ParentChannelExceptionHandler.java | Introduces a Netty inbound handler that consumes/logs exceptions at DEBUG and closes the parent channel. |
| sdk/cosmos/azure-cosmos/CHANGELOG.md | Adds a “Bugs Fixed” entry describing the WARN suppression for HTTP/2 parent-channel exceptions. |
|
|
||
| #### Bugs Fixed | ||
| Fixing an NPE caused due to boxed Boolean conversion. - See [PR 48656](https://github.com/Azure/azure-sdk-for-java/pull/48656/) | ||
| * Fixed Netty WARN log "An exceptionCaught() event was fired, and it reached at the tail of the pipeline" appearing on HTTP/2 connections when the server resets idle TCP connections. Added an exception handler on the HTTP/2 parent channel to consume connection-level exceptions at DEBUG level, matching HTTP/1.1 behavior. - See [PR TBD](https://github.com/Azure/azure-sdk-for-java/pull/TBD) |
There was a problem hiding this comment.
CHANGELOG entry still contains a placeholder PR reference ("PR TBD" and a /pull/TBD link). This will be a broken link in released notes; please replace with the actual PR number (or remove the link if not available).
| * Fixed Netty WARN log "An exceptionCaught() event was fired, and it reached at the tail of the pipeline" appearing on HTTP/2 connections when the server resets idle TCP connections. Added an exception handler on the HTTP/2 parent channel to consume connection-level exceptions at DEBUG level, matching HTTP/1.1 behavior. - See [PR TBD](https://github.com/Azure/azure-sdk-for-java/pull/TBD) | |
| * Fixed Netty WARN log "An exceptionCaught() event was fired, and it reached at the tail of the pipeline" appearing on HTTP/2 connections when the server resets idle TCP connections. Added an exception handler on the HTTP/2 parent channel to consume connection-level exceptions at DEBUG level, matching HTTP/1.1 behavior. |
| @@ -8,6 +8,7 @@ | |||
|
|
|||
| #### Bugs Fixed | |||
| Fixing an NPE caused due to boxed Boolean conversion. - See [PR 48656](https://github.com/Azure/azure-sdk-for-java/pull/48656/) | |||
There was a problem hiding this comment.
In the Unreleased 'Bugs Fixed' section, the first entry is not formatted as a bullet while the newly added entry is. Please make the formatting consistent (typically all entries in these sections are bulleted).
| Fixing an NPE caused due to boxed Boolean conversion. - See [PR 48656](https://github.com/Azure/azure-sdk-for-java/pull/48656/) | |
| * Fixing an NPE caused due to boxed Boolean conversion. - See [PR 48656](https://github.com/Azure/azure-sdk-for-java/pull/48656/) |
| Channel parent = connection.channel().parent(); | ||
| if (parent != null | ||
| && parent.pipeline().get(Http2ParentChannelExceptionHandler.HANDLER_NAME) == null) { | ||
|
|
||
| parent.pipeline().addLast( | ||
| Http2ParentChannelExceptionHandler.HANDLER_NAME, | ||
| new Http2ParentChannelExceptionHandler()); |
There was a problem hiding this comment.
The handler installation on the shared HTTP/2 parent channel uses a non-atomic check-then-add (pipeline().get(name) == null then addLast). If multiple stream channels attempt this concurrently, addLast can throw due to a duplicate handler name. Consider making installation robust (e.g., run the add on the parent channel's event loop and/or catch the duplicate-name exception and ignore it) so new stream creation can’t fail because of a race.
| // Install exception handler on the HTTP/2 parent (TCP) channel. | ||
| // In H2, doOnConnected fires for stream (child) channels — channel.parent() | ||
| // is the TCP connection. The parent pipeline has no ChannelOperationsHandler | ||
| // (unlike H1.1), so TCP-level exceptions (RST, broken pipe) propagate to | ||
| // Netty's TailContext and get logged as WARN. This handler matches H1.1 | ||
| // behavior by consuming exceptions at DEBUG level. | ||
| Channel parent = connection.channel().parent(); | ||
| if (parent != null | ||
| && parent.pipeline().get(Http2ParentChannelExceptionHandler.HANDLER_NAME) == null) { | ||
|
|
||
| parent.pipeline().addLast( | ||
| Http2ParentChannelExceptionHandler.HANDLER_NAME, | ||
| new Http2ParentChannelExceptionHandler()); | ||
| } |
There was a problem hiding this comment.
This change introduces new behavior (consuming parent-channel exceptions and closing the parent connection) without accompanying test coverage. There are existing Netty/transport tests in azure-cosmos-tests (e.g., ones that use EmbeddedChannel); please add a unit/integration test that asserts the handler is installed on the H2 parent pipeline and that an exception on the parent is consumed (no TailContext WARN) and results in the parent channel closing.
| public void exceptionCaught(ChannelHandlerContext ctx, Throwable cause) { | ||
| if (logger.isDebugEnabled()) { | ||
| logger.debug("Exception on HTTP/2 parent connection [id:{}]: {}", | ||
| ctx.channel().id().asShortText(), cause.getMessage(), cause); |
There was a problem hiding this comment.
exceptionCaught currently logs only cause.getMessage() in the formatted message. For many Netty exceptions the message can be null/empty, reducing diagnostics. Consider logging the exception class (e.g., cause.toString()) in the formatted portion so the DEBUG log remains useful even when getMessage() is null.
| ctx.channel().id().asShortText(), cause.getMessage(), cause); | |
| ctx.channel().id().asShortText(), cause.toString(), cause); |
0ec7d77 to
e5f9537
Compare
0da600f to
acea1fb
Compare
In HTTP/2, reactor-netty multiplexes streams on a shared parent TCP connection. The parent channel pipeline has no ChannelOperationsHandler (unlike HTTP/1.1), so TCP-level exceptions like Connection reset by peer (ECONNRESET) propagate to Netty's TailContext, which logs them as WARN. This adds Http2ParentChannelExceptionHandler to the parent channel via doOnConnected (accessing channel.parent()). The handler consumes exceptions at DEBUG level WITHOUT closing the channel or altering connection lifecycle, matching HTTP/1.1 logging behavior. Changes: - Handler logs cause.toString() (not getMessage()) for null-safe diagnostics - Defensive try-catch for duplicate handler name on concurrent stream creation - Before/after verified with EmbeddedChannel unit tests Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
acea1fb to
3e2b4f4
Compare
|
Superseded by #48890 (re-raised from fork for push access). |
Problem
Customers see noisy Netty WARN logs in HTTP/2 scenarios:
These are cosmetic — zero operational impact — but trigger monitoring alerts.
Root Cause
In HTTP/2, reactor-netty multiplexes streams on a shared parent TCP connection. The parent and child channels have different pipeline structures:
HTTP/1.1 pipeline (single channel — no leak to TailContext):
HTTP/2 parent channel pipeline (BEFORE fix — leak to TailContext):
HTTP/2 parent channel pipeline (AFTER fix):
HTTP/2 child stream channel pipeline (unchanged):
Design: Connection-State-Based Log Level
The handler consumes all exceptions on the parent channel (no exception type filtering). The log level is determined by connection state:
activeStreams == 0OR!channelActive. No in-flight requests are affected (e.g., TCP RST from LB idle timeout, post-close cleanup).Active stream count is retrieved via
Http2FrameCodec.connection().numActiveStreams()on the same parent channel pipeline. Falls back to -1 if the codec is unavailable, which takes the safe WARN path.Why no exception type filtering?
By the time any exception reaches our handler, all upstream handlers (
Http2FrameCodec,Http2MultiplexHandler) have already handled the protocol actions (GOAWAY, stream reset, child channel error delivery). The exception reaching TailContext is an echo of already-handled work, regardless of type. Connection state (active streams + channel activity) is the only dimension that determines whether the exception has diagnostic value.Why OR (not AND) for the DEBUG condition?
Either condition alone is sufficient to determine the exception is noise:
activeStreams == 0— no in-flight requests affected, regardless of channel state!channelActive— channel is already dead, any active streams will fail through their Reactor subscribers independentlyTesting
5 EmbeddedChannel unit tests with production-matching pipeline (
Http2FrameCodec→Http2MultiplexHandler→ handler):withoutHandler_exceptionReachesTailwithHandler_zeroActiveStreams_consumedAtDebugwithHandler_exceptionDoesNotCloseChannelwithHandler_runtimeException_zeroActiveStreams_consumedwithHandler_npe_zeroActiveStreams_consumedNote: The
!channelActivebranch cannot be unit-tested with EmbeddedChannel becausedisconnect()tears down the pipeline beforefireExceptionCaughtcan reach handlers. In production,exceptionCaught()fires while the channel is transitioning to inactive.Impact
exceptionCaught(), Netty@Skipoptimization bypasses it for all hot-path eventsChannelOperationsHandlerbehavior for connection-level exceptions