Summary
Under connection loss + reconnect (e.g. a node killed during chaos testing), the async topic writer using a compressing codec (GZIP) can get permanently wedged: write_with_ack never returns because the encode ThreadPoolExecutor was shut down while the writer is still live and re-encoding, so submitting the encode job raises RuntimeError('cannot schedule new futures after shutdown'). The error surfaces only as an unretrieved background future, and the awaiting write_with_ack hangs indefinitely (no timeout of its own).
Traceback (observed)
grpc/aio/_call.py ... raise asyncio.CancelledError() # stream cancelled by the node kill
asyncio.exceptions.CancelledError
ERROR Future exception was never retrieved
future: <Future finished exception=RuntimeError('cannot schedule new futures after shutdown')>
Traceback (most recent call last):
File ".../ydb/_topic_writer/topic_writer_asyncio.py", line 128, in write_with_ack
results = [f.result() for f in futures]
File ".../ydb/_topic_writer/topic_writer_asyncio.py", line 578, in _encode_loop
await self._encode_data_inplace(batch_codec, messages)
File ".../ydb/_topic_writer/topic_writer_asyncio.py", line 599, in _encode_data_inplace
encoded_data_futures = eventloop.run_in_executor(self._encode_executor, encoder_function, ...)
File ".../concurrent/futures/thread.py", line 167, in submit
raise RuntimeError('cannot schedule new futures after shutdown')
RuntimeError: cannot schedule new futures after shutdown
Conditions
ydb.aio topic writer (topic_client.writer(...)), compressing codec (TopicCodec.GZIP).
- Connection drop + reconnect (chaos: a
CancelledError propagates from the gRPC stream).
- After that,
_encode_data_inplace submits to self._encode_executor, which has already been shut down → RuntimeError.
Impact
write_with_ack never completes → the writer silently hangs. In an application that awaits write_with_ack without its own timeout, the whole writer path stalls. Same class of async-topic-writer reconnect-lifecycle issue as the earlier WriterAsyncIOStream.create() thread leak (PR #845).
Root cause (likely)
self._encode_executor is shut down while the writer is still alive and re-encoding after a reconnect (executor lifecycle not tied correctly to the writer's live/reconnect state).
Workaround
Use TopicCodec.RAW — _encode_data_inplace returns early for RAW and never touches the executor.
How it was found
YDB Python SLO chaos testing (PR #851): the async topic workload silently stalled at ~half the run; the baseline container's delivery metrics went N/A mid-run. The SLO harness was hardened (timeouts + writer/reader recreate + RAW) to not hang, but the underlying SDK writer bug remains.
Summary
Under connection loss + reconnect (e.g. a node killed during chaos testing), the async topic writer using a compressing codec (GZIP) can get permanently wedged:
write_with_acknever returns because the encodeThreadPoolExecutorwas shut down while the writer is still live and re-encoding, so submitting the encode job raisesRuntimeError('cannot schedule new futures after shutdown'). The error surfaces only as an unretrieved background future, and the awaitingwrite_with_ackhangs indefinitely (no timeout of its own).Traceback (observed)
Conditions
ydb.aiotopic writer (topic_client.writer(...)), compressing codec (TopicCodec.GZIP).CancelledErrorpropagates from the gRPC stream)._encode_data_inplacesubmits toself._encode_executor, which has already been shut down →RuntimeError.Impact
write_with_acknever completes → the writer silently hangs. In an application that awaitswrite_with_ackwithout its own timeout, the whole writer path stalls. Same class of async-topic-writer reconnect-lifecycle issue as the earlierWriterAsyncIOStream.create()thread leak (PR #845).Root cause (likely)
self._encode_executoris shut down while the writer is still alive and re-encoding after a reconnect (executor lifecycle not tied correctly to the writer's live/reconnect state).Workaround
Use
TopicCodec.RAW—_encode_data_inplacereturns early for RAW and never touches the executor.How it was found
YDB Python SLO chaos testing (PR #851): the async topic workload silently stalled at ~half the run; the baseline container's delivery metrics went N/A mid-run. The SLO harness was hardened (timeouts + writer/reader recreate + RAW) to not hang, but the underlying SDK writer bug remains.