feat(voice): surface unrecoverable error message with a spoken fallback#6152
feat(voice): surface unrecoverable error message with a spoken fallback#6152chenghao-mou wants to merge 3 commits into
Conversation
β¦en fallback An unrecoverable error (e.g. depleted LiveKit Inference credits) used to leave the agent silent: it would join, publish its track, and never speak. AgentSession now surfaces a terminal error on the first occurrence instead of absorbing it under max_unrecoverable_errors, and speaks a configurable `unrecoverable_error_message` (or a generic default) before closing. Inference 429 quota bodies are parsed into a typed InferenceQuotaExceededError (in livekit.agents.inference) carrying the gateway quota_type/category/hint, with retryable/terminal derived from the category. The general `terminal` flag lives on APIError; the inference-specific quota schema stays in the inference plugin. Closes #6009 Co-authored-by: Shawn Feldman <shawn.feldman@livekit.io> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
4125ee7 to
4be6763
Compare
β¦cision The session's _on_error was registered as an `error` listener in __init__, so it ran before any user-registered handler and made the close/tolerate decision before a handler could set `ev.error.recoverable = True`. Drop the self-listener and have AgentActivity emit the event (user handlers run) and then call _on_error, restoring the original ordering. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
4be6763 to
5b029dd
Compare
shawnfeldman
left a comment
There was a problem hiding this comment.
PR Review Summary
Reviewed with the standard agent panel (code-reviewer, silent-failure-hunter, pr-test-analyzer, type-design-analyzer, comment-analyzer, code-simplifier). All findings below were verified against the full post-change files.
Verdict: solid, no blockers. The PR does exactly what it claims β a terminal error now closes on the first occurrence and speaks a fallback before teardown, with the error/close events firing on every path (verified: TTS-failure, no-audio-output, and realtime paths all still emit close). The try/except/finally in _close_on_error is best-effort-before-teardown done right (CancelledError is not swallowed since it's BaseException; the primary error is already logged at error level before the close task is created). Test coverage is strong (33 hermetic tests across both the exception layer and the session layer). No Critical issues.
Important (worth addressing, non-blocking)
-
terminalis invisible in error reprs.APIError.__repr__(livekit-agents/livekit/agents/_exceptions.py:57) andAPIStatusError.__str__/__repr__(:95,:107) printretryablebut omit the newterminalfield β which now decides whether a session dies on turn 1. When someone debugs "why did my session close immediately,"repr(err)won't showterminal=True. Fix: addterminal={self.terminal!r}to both reprs and the__str__. -
Test gap: the 16s playout timeout is never exercised (
agent_session.py:1615).wait_for_playout()isawait asyncio.shield(self._done_fut)(voice/speech_handle.py:182), so theasyncio.wait_for(..., timeout=_ERROR_MESSAGE_PLAYOUT_TIMEOUT)cancels only the outer wait; teardown'sinterrupt(force=True)+drain()in_aclose_implis what actually reclaims a stuck playout. This is load-bearing safety (a wedged playout must not strand teardown) and is untested. Suggest a test where the audio sink never signals playback-finished, asserting close still completes within a bound. Worth a one-line comment at:1615noting the timeout is a soft bound, not a hard cap on playout. -
Test gap: no "second unrecoverable error arrives while
_closing_taskis in flight." Theif self._closing_task or error.recoverable: returnguard (agent_session.py:1550) is the only thing preventing a double-close / doublesay()during the speak-then-close window (realistic for a depleted gateway firing back-to-back errors). Suggest firing_on_errortwice before awaiting and asserting exactly one close event.
Suggestions (nits / polish)
-
AudioSource'sstrarm is interpreted differently by its two consumers.AgentSessionresolves astras file-path-if-os.path.isfile-else-TTS-text (agent_session.py:1606), butBackgroundAudioPlayer._normalize_builtin_audio(voice/background_audio.py:207) treatsstras a file path only (no TTS). Same type alias (voice/audio_source.py:30), opposite outcomes for"goodbye". code-simplifier confirmed a shared resolver isn't worth it (the contracts genuinely differ), so the cheap fix is a one-line note on theAudioSourcealias documenting the divergence. -
terminal β not retryableis enforced only by docstring.APIError.__init__(_exceptions.py:39) accepts both flags independently, soAPIStatusError("x", status_code=429, terminal=True)yieldsretryable=True, terminal=True(429 is transient).InferenceQuotaExceededErrorderives both from one boolean and can't self-contradict, but the base explicit-override path can. Consider derivingself.retryable = retryable and not terminal, or asserting in__init__. -
Empty-string error message is a silent no-op.
unrecoverable_error_message=""is "given" (notNOT_GIVEN/None), so_resolve_error_message(agent_session.py:1581) returns"", which isn't a file βsay("", audio=NOT_GIVEN)speaks nothing and suppresses the default. Unlikely, but a one-line doc note (or treating""likeNone) would avoid the surprise. -
inference/tts.py:739(the_run/streaming-response path) has the same bodylesscreate_api_error_from_httpas:496but lacks the explanatory "aiohttp discards the body / typing is future work" comment that:496carries. Add a back-reference so a future maintainer knows the omission is deliberate. -
Minor doc redundancy: the
unrecoverable_error_messagedocstring (agent_session.py:~481) says "str... or anAudioSource," butAudioSourcealready includesstr. The split is pedagogically useful (str has two sub-behaviors) β optionally clarify with "(stris itself anAudioSource; called out because its behavior depends on whether it's a file path)." -
Optional refactor (code-simplifier): the 429 re-parse in
inference/llm.py:466reassignsbody = e.bodyon theexceptpath (the value it already held). Extracting a tiny_response_body(e)helper removes that ambiguity. Behavior-identical; take it or leave it.
Not introduced by this PR (context, no action)
- STT/realtime errors are not counted against
max_unrecoverable_errorsβ there's no_stt_error_counts, and_on_erroronly branches on"llm_error"/"tts_error"(agent_session.py:1556). A non-terminal unrecoverable STT/realtime error closes on first occurrence. This matches the base code; it's orthogonal to this PR. One downstream effect worth being aware of: STT/TTS quota 429s stay untyped (the handshake body is discarded by aiohttp, per the comments atinference/stt.py:884/tts.py:496), so the new "out of credits" spoken-fallback UX is effectively LLM/TTS-only today.
Strengths
_str_or_nonecoercion of untrusted gateway JSON (inference/_exceptions.py:13) is exemplary β it coerces non-strvalues toNonebefore thecategory in _TERMINAL_QUOTA_CATEGORIESfrozenset check, so an unhashablecategory: [...]can't crash membership, and malformed fields can't leak. Pinned bytest_quota_error_non_str_body_fields_are_dropped.- The gateway
hintis never spoken β terminal errors always fall back to the genericDEFAULT_UNRECOVERABLE_ERROR_MESSAGE. Good privacy hygiene; tested over both a real hint andNone. - Terminal-vs-transient split is tested at both layers (exception construction and session behavior), and
await_countassertions prove credit-exhaustion hits the endpoint exactly once while rate-limits retry. - Clean type extraction: moving
BuiltinAudioClip/AudioSourceintovoice/audio_source.pybreaks the import cycle with no duplicateatexit/ExitStackregistration, and the removedbackground_audio.pyimports are genuinely unused. InferenceQuotaExceededErrorderives terminal/retryable from a single boolean (can't self-contradict via the body path), with an appropriateOptional-returningfrom_responsefactory that keepscreate_api_error_from_httpprovider-agnostic.- Comments are accurate β the comment-analyzer found zero factually-incorrect comments; the
ev.error.errorattribute path, the terminal-category list, and the emit-before-decide ordering all cross-check against the code. - Backward compatible:
terminaldefaults toFalse(existing errors unaffected) and the new params are additive keyword-only.
Reviewed by Claude Code.
| async def _close_on_error( | ||
| self, error: llm.LLMError | stt.STTError | tts.TTSError | llm.RealtimeModelError | ||
| ) -> None: | ||
| # best-effort: speak a fallback message before teardown so the agent isn't silent. | ||
| # A failing or unavailable source just falls through to close β the error/close | ||
| # events fire either way. | ||
| try: | ||
| message = self._resolve_error_message(error.error) | ||
| if message is None or self._activity is None or self.output.audio is None: | ||
| return | ||
|
|
||
| # pre-recorded audio is played as-is, so it survives a dead/exhausted TTS; a | ||
| # non-file str falls back to TTS synthesis. | ||
| text = "" | ||
| audio: NotGivenOr[AsyncIterable[rtc.AudioFrame]] = NOT_GIVEN | ||
| if isinstance(message, BuiltinAudioClip): | ||
| audio = audio_frames_from_file(message.path()) | ||
| elif isinstance(message, str): | ||
| if os.path.isfile(message): | ||
| audio = audio_frames_from_file(message) | ||
| else: | ||
| text = message | ||
| else: | ||
| audio = message | ||
|
|
||
| handle = self.say(text, audio=audio, allow_interruptions=False, add_to_chat_ctx=False) | ||
| await asyncio.wait_for( | ||
| handle.wait_for_playout(), timeout=_ERROR_MESSAGE_PLAYOUT_TIMEOUT | ||
| ) | ||
| except Exception: | ||
| logger.warning("failed to play the unrecoverable-error message", exc_info=True) | ||
| finally: | ||
| await self._aclose_impl(error=error, reason=CloseReason.ERROR) |
There was a problem hiding this comment.
π© Extended window where _closing is False during error-message playout
In the old code, _on_error directly created _aclose_impl as a task, which set self._closing = True almost immediately (first thing after acquiring the lock at agent_session.py:991). In the new code, _close_on_error first speaks the error message (up to 16s timeout at agent_session.py:1615-1616) before calling _aclose_impl in the finally block. During this window, self._closing remains False. Code that gates on self._closing (e.g. _update_activity at line 1480: if self._closing and new_activity == "start") won't know a close is pending. The _closing_task field IS set, which guards re-entrant _on_error calls, but doesn't guard activity updates. In practice this is unlikely to cause issues because (a) activity updates require async scheduling and the error occurs in a failing component, (b) _aclose_impl still runs in the finally and cleans everything up, and (c) the _started flag prevents double-close. But it's worth being aware of the extended window.
Was this helpful? React with π or π to provide feedback.
An unrecoverable error (e.g. out of LiveKit Inference credits) used to leave the agent silently connected but mute. Now
AgentSession:max_unrecoverable_errorsidentical failures.unrecoverable_error_messagebefore closing, so the agent isn't silent β astr(synthesized via TTS, or played as-is if it's a file path), anAudioSource/BuiltinAudioClip(played directly, so it survives a dead TTS), orNoneto disable. Defaults to a generic message on terminal errors.Inference
429quota errors are now a typedInferenceQuotaExceededError(livekit.agents.inference) you can catch from theerrorevent, carrying the gateway'squota_type/category/hint. Quota details are never spoken to end users.Supersedes #6012 (consolidated; original author credited as co-author). Closes #6009.
π€ Generated with Claude Code