fix: #175 BidiStream SIGTERM causes CPU hot loop and pod stuck in Terminating #176

Open
brightsparc wants to merge 4 commits into restatedev:main from introspection-org:julian/fix-disconnect-hotloop

@brightsparc

fixes: #175

When a pod receives SIGTERM during an active BidiStream invocation, two bugs caused the worker to spin at ~82% CPU and never exit:

  1. ReceiveChannel.__call__() blocked forever on an empty queue after the disconnect event was consumed
  2. create_poll_or_cancel_coroutine() fed empty body frames (b'') to the VM, creating a tight loop with no useful await points
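
The first bug can be illustrated with a minimal sketch. This is not the SDK's actual `ReceiveChannel`; `SketchReceiveChannel` and its `feed()` method are hypothetical names used only to show why a call made after the disconnect message has been consumed would block forever, and how answering with a synthetic `http.disconnect` avoids that.

```python
import asyncio


class SketchReceiveChannel:
    """Hypothetical stand-in for an ASGI receive wrapper (not the SDK class)."""

    def __init__(self) -> None:
        self._queue: asyncio.Queue = asyncio.Queue()
        self._disconnected = asyncio.Event()

    def feed(self, message: dict) -> None:
        """Assumed entry point: a reader loop pushes every ASGI message here."""
        self._queue.put_nowait(message)
        if message.get("type") == "http.disconnect":
            self._disconnected.set()

    async def __call__(self) -> dict:
        # Once the peer is gone and nothing is buffered, awaiting the queue
        # would never return, so reply with a synthetic disconnect instead.
        if self._disconnected.is_set() and self._queue.empty():
            return {"type": "http.disconnect"}
        return await self._queue.get()


async def demo() -> list:
    channel = SketchReceiveChannel()
    channel.feed({"type": "http.request", "body": b"x", "more_body": True})
    channel.feed({"type": "http.disconnect"})
    # The third call happens after the real disconnect was already consumed.
    return [(await channel())["type"] for _ in range(3)]
```

Without the early return in `__call__`, the third call in `demo()` would wait on an empty queue indefinitely.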

Changes

  • server_types.py: Return synthetic http.disconnect when queue is drained and channel is disconnected
  • server_context.py: Skip notify_input() for empty body frames; add 30s timeout to block_until_http_input_closed() in leave()
  • tests/disconnect_hotloop.py: Regression tests for both fixes
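
The empty-frame guard in `server_context.py` can be sketched roughly as follows. The real loop in `create_poll_or_cancel_coroutine()` is not reproduced here; `FakeVM`, `DisconnectedError`, `pump_input()`, and the `notify_input_closed()` method are assumptions made for illustration, and only the `notify_input()` name and the body check come from this PR.

```python
import asyncio


class DisconnectedError(Exception):
    """Hypothetical exception for a peer disconnect."""


class FakeVM:
    """Hypothetical stand-in that records what would be handed to the VM."""

    def __init__(self) -> None:
        self.inputs = []
        self.closed = False

    def notify_input(self, data: bytes) -> None:
        self.inputs.append(data)

    def notify_input_closed(self) -> None:  # assumed name, for illustration
        self.closed = True


async def pump_input(receive, vm: FakeVM) -> None:
    """Forward ASGI body frames to the VM, skipping empty ones."""
    while True:
        chunk = await receive()
        if chunk.get("type") == "http.disconnect":
            raise DisconnectedError()
        body = chunk.get("body", None)
        # Empty frames carry no data for the VM; skipping them avoids
        # waking it up repeatedly for nothing.
        if body is not None and len(body) > 0:
            vm.notify_input(body)
        if not chunk.get("more_body", False):
            vm.notify_input_closed()
            return


async def demo() -> FakeVM:
    frames = iter([
        {"type": "http.request", "body": b"", "more_body": True},
        {"type": "http.request", "body": b"abc", "more_body": True},
        {"type": "http.request", "body": b"", "more_body": False},
    ])

    async def receive() -> dict:
        return next(frames)

    vm = FakeVM()
    await pump_input(receive, vm)
    return vm
```

In this sketch only the non-empty frame reaches `notify_input()`, while the final frame still closes the input.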

@github-actions

github-actions bot commented Feb 23, 2026

All contributors have signed the CLA ✍️ ✅
Posted by the CLA Assistant Lite bot.

@brightsparc
Author

I have read the CLA Document and I hereby sign the CLA

@tillrohrmann
Contributor

Thanks a lot for creating this fix @brightsparc. cc @igalshilman for review.

@brightsparc
Author

> Thanks a lot for creating this fix @brightsparc. cc @igalshilman for review.

Cool, looking forward to getting this landed to reduce the hot loop when draining pods.

  # {'type': 'http.disconnect'}
- await self.receive.block_until_http_input_closed()
+ try:
+     await asyncio.wait_for(self.receive.block_until_http_input_closed(), timeout=30.0)
Contributor

igalshilman commented Feb 25, 2026

30 seconds is a reasonable amount of time, but I don't think we should do this implicitly.
In the invocation protocol, we must wait for the Restate runtime to explicitly close its side of the input.
That signals to us that the runtime has received all previously written data (and that nothing is still queued in intermediate middlewares or proxies).
Only then can we tear down this attempt.

I think what you are looking for is explicit handling of SIGTERM, perhaps by setting an event and making sure that the implementation of block_until_http_input_closed() respects it.
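
One possible reading of this suggestion, as a rough sketch only: wait for either the runtime closing its input or an explicit shutdown event, whichever comes first. The names `install_sigterm_event()` and `wait_input_closed_or_shutdown()` are hypothetical and not part of the SDK, and how such a handler would coexist with the ASGI server's own signal handling is not addressed here.

```python
import asyncio
import signal


def install_sigterm_event(loop: asyncio.AbstractEventLoop,
                          shutdown: asyncio.Event) -> None:
    """Set `shutdown` on SIGTERM (Unix only; unsupported on Windows loops)."""
    loop.add_signal_handler(signal.SIGTERM, shutdown.set)


async def wait_input_closed_or_shutdown(input_closed: asyncio.Event,
                                        shutdown: asyncio.Event) -> str:
    """Return which condition ended the wait: 'input_closed' or 'shutdown'."""
    waiters = {
        asyncio.create_task(input_closed.wait()): "input_closed",
        asyncio.create_task(shutdown.wait()): "shutdown",
    }
    done, pending = await asyncio.wait(
        list(waiters), return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()
    reasons = {waiters[task] for task in done}
    # If both fired together, report the protocol-level close.
    return "input_closed" if "input_closed" in reasons else "shutdown"


async def demo() -> str:
    input_closed, shutdown = asyncio.Event(), asyncio.Event()
    # Simulate SIGTERM arriving shortly after the wait starts, instead of
    # sending a real signal to the process.
    asyncio.get_running_loop().call_later(0.01, shutdown.set)
    return await wait_input_closed_or_shutdown(input_closed, shutdown)
```

Compared with a fixed timeout, the wait here ends only on an explicit signal, so the normal path still waits for the runtime to close its side of the input.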

assert isinstance(body, bytes)
# Skip empty body frames to avoid hot loop (see #175)
body = chunk.get("body", None)
if body is not None and len(body) > 0:
Contributor

That's a good one!

Development

Successfully merging this pull request may close these issues.

bug: BidiStream SIGTERM causes CPU hot loop and pod stuck in Terminating

3 participants