Skip to content

Async topic writer wedges under reconnect: 'cannot schedule new futures after shutdown' in _encode_data_inplace (compressing codec) #852

Description

@vgvoleg

Summary

Under connection loss + reconnect (e.g. a node killed during chaos testing), the async topic writer using a compressing codec (GZIP) can get permanently wedged: write_with_ack never returns because the encode ThreadPoolExecutor was shut down while the writer is still live and re-encoding, so submitting the encode job raises RuntimeError('cannot schedule new futures after shutdown'). The error surfaces only as an unretrieved background future, and the awaiting write_with_ack hangs indefinitely (no timeout of its own).

Traceback (observed)

grpc/aio/_call.py ... raise asyncio.CancelledError()   # stream cancelled by the node kill
asyncio.exceptions.CancelledError

ERROR  Future exception was never retrieved
future: <Future finished exception=RuntimeError('cannot schedule new futures after shutdown')>
Traceback (most recent call last):
  File ".../ydb/_topic_writer/topic_writer_asyncio.py", line 128, in write_with_ack
    results = [f.result() for f in futures]
  File ".../ydb/_topic_writer/topic_writer_asyncio.py", line 578, in _encode_loop
    await self._encode_data_inplace(batch_codec, messages)
  File ".../ydb/_topic_writer/topic_writer_asyncio.py", line 599, in _encode_data_inplace
    encoded_data_futures = eventloop.run_in_executor(self._encode_executor, encoder_function, ...)
  File ".../concurrent/futures/thread.py", line 167, in submit
    raise RuntimeError('cannot schedule new futures after shutdown')
RuntimeError: cannot schedule new futures after shutdown

Conditions

  • ydb.aio topic writer (topic_client.writer(...)), compressing codec (TopicCodec.GZIP).
  • Connection drop + reconnect (chaos: a CancelledError propagates from the gRPC stream).
  • After that, _encode_data_inplace submits to self._encode_executor, which has already been shut down → RuntimeError.

Impact

write_with_ack never completes → the writer silently hangs. In an application that awaits write_with_ack without its own timeout, the whole writer path stalls. Same class of async-topic-writer reconnect-lifecycle issue as the earlier WriterAsyncIOStream.create() thread leak (PR #845).

Root cause (likely)

self._encode_executor is shut down while the writer is still alive and re-encoding after a reconnect (executor lifecycle not tied correctly to the writer's live/reconnect state).

Workaround

Use TopicCodec.RAW_encode_data_inplace returns early for RAW and never touches the executor.

How it was found

YDB Python SLO chaos testing (PR #851): the async topic workload silently stalled at ~half the run; the baseline container's delivery metrics went N/A mid-run. The SLO harness was hardened (timeouts + writer/reader recreate + RAW) to not hang, but the underlying SDK writer bug remains.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Fields

    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions