fix: #175 BidiStream SIGTERM causes CPU hot loop and pod stuck in Terminating #176
brightsparc wants to merge 4 commits into restatedev:main
All contributors have signed the CLA ✍️ ✅
I have read the CLA Document and I hereby sign the CLA
Thanks a lot for creating this fix @brightsparc. cc @igalshilman for review.
Cool, looking forward to getting this landed to reduce the hot loop when draining pods.
# {'type': 'http.disconnect'}
await self.receive.block_until_http_input_closed()
try:
    await asyncio.wait_for(self.receive.block_until_http_input_closed(), timeout=30.0)
30 seconds is indeed a reasonable amount of time, but I don't think we should do it implicitly like that.
In the invocation protocol, we must wait for the Restate runtime to explicitly close its side of the input.
This signals to us that the runtime has received all the previously written data (and that nothing is queued in between, in middlewares and proxies).
Only then can we tear down this attempt.
I think what you are looking for is explicit handling of SIGTERM, perhaps by setting an event, and making sure that the implementation of block_until_http_input_closed() respects it.
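One way to read the suggestion above, as a minimal sketch only: the class `InputGate`, its two events, and `install_sigterm_handler` are hypothetical names invented for illustration and are not part of the SDK. The idea is that the wait ends either when the runtime closes its side of the input or when an explicit shutdown event is set by a SIGTERM handler, instead of on an implicit timeout.

```python
import asyncio
import signal


class InputGate:
    """Hypothetical stand-in for the receive side of one invocation attempt."""

    def __init__(self) -> None:
        # Set when the runtime closes its side of the input (assumed hook).
        self.input_closed = asyncio.Event()
        # Set explicitly when the process is asked to shut down.
        self.shutdown = asyncio.Event()

    async def block_until_http_input_closed(self) -> str:
        """Return once the input is closed or shutdown was requested."""
        tasks = [
            asyncio.create_task(self.input_closed.wait()),
            asyncio.create_task(self.shutdown.wait()),
        ]
        _, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
        for task in pending:
            task.cancel()
        # Report the protocol-level close if it happened, otherwise the shutdown.
        return "input_closed" if self.input_closed.is_set() else "shutdown"


def install_sigterm_handler(gate: InputGate, loop: asyncio.AbstractEventLoop) -> None:
    """Set the shutdown event on SIGTERM (Unix event loops only)."""
    loop.add_signal_handler(signal.SIGTERM, gate.shutdown.set)
```

What the caller does when the result is "shutdown" (tear down immediately, or keep waiting for a bounded grace period) is left open here; that is a protocol decision for the maintainers.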
assert isinstance(body, bytes)
# Skip empty body frames to avoid hot loop (see #175)
body = chunk.get("body", None)
if body is not None and len(body) > 0:
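The guard in this hunk can be illustrated in isolation. The following is a rough sketch, not the SDK code: `feed_chunk` is a hypothetical helper, and the `notify_input` callable passed to it merely stands in for whatever consumes the bytes.

```python
from typing import Any, Callable, Dict


def feed_chunk(chunk: Dict[str, Any], notify_input: Callable[[bytes], None]) -> bool:
    """Forward a non-empty body to notify_input; return True if it was forwarded."""
    body = chunk.get("body", None)
    # Empty or missing bodies carry no data, so forwarding them only
    # wakes the consumer without letting it make progress.
    if body is not None and len(body) > 0:
        assert isinstance(body, bytes)
        notify_input(body)
        return True
    return False
```

For example, calling `feed_chunk({"type": "http.request", "body": b""}, sink.append)` leaves `sink` unchanged, while a chunk with `b"abc"` appends it.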
fixes: #175
When a pod receives SIGTERM during an active BidiStream invocation, two bugs caused the worker to spin at ~82% CPU and never exit:
- ReceiveChannel.__call__() blocked forever on an empty queue after the disconnect event was consumed
- create_poll_or_cancel_coroutine() fed empty body frames (b'') to the VM, creating a tight loop with no useful await points

Changes

- server_types.py: Return synthetic http.disconnect when queue is drained and channel is disconnected
- server_context.py: Skip notify_input() for empty body frames; add 30s timeout to block_until_http_input_closed() in leave()
- tests/disconnect_hotloop.py: Regression tests for both fixes
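The synthetic-disconnect behaviour described for server_types.py can be sketched as follows. This is an illustrative model under assumptions, not the actual ReceiveChannel: the class name `SketchReceiveChannel`, its `push` method, and its internal fields are all hypothetical.

```python
import asyncio
from typing import Any, Dict


class SketchReceiveChannel:
    """Hypothetical queue-backed ASGI receive callable."""

    def __init__(self) -> None:
        self._queue: "asyncio.Queue[Dict[str, Any]]" = asyncio.Queue()
        self._disconnected = False

    def push(self, event: Dict[str, Any]) -> None:
        """Enqueue an incoming ASGI event."""
        self._queue.put_nowait(event)

    async def __call__(self) -> Dict[str, Any]:
        # Once the disconnect has been seen and the queue is drained, no more
        # events will arrive; answer with a synthetic disconnect instead of
        # blocking forever on queue.get().
        if self._disconnected and self._queue.empty():
            return {"type": "http.disconnect"}
        event = await self._queue.get()
        if event.get("type") == "http.disconnect":
            self._disconnected = True
        return event
```

With this shape, a caller that keeps polling after the disconnect was consumed gets a disconnect event back each time rather than hanging.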