
feat(realtime): add_to_chat_ctx on generate_reply()#5605

Draft
cphoward wants to merge 7 commits into livekit:main from cphoward:feat/realtime-ephemeral-generate-reply

Conversation

@cphoward

Adds an add_to_chat_ctx: bool = True parameter to RealtimeSession.generate_reply and AgentSession.generate_reply, plus a new RealtimeCapabilities.ephemeral_response capability flag that plugins use to declare whether they honor the parameter. When the OpenAI plugin (on the public endpoint) sees add_to_chat_ctx=False, it sets conversation: "none" on the outbound response.create, force-overrides tools=[] / tool_choice="none" (with a warning if the caller passed any), and suppresses LiveKit-internal openai_client_event_queued / openai_server_event_received emits and LK_OPENAI_DEBUG log output for that response's lifecycle. AgentActivity suppresses the assistant turn's writes to agent._chat_ctx at all three _upsert_item sites and at _on_remote_item_added. The OpenAI plugin also intercepts at the source in _handle_conversion_item_added so ephemeral items never enter _remote_chat_ctx (the shadow state behind the public chat_ctx property and reconnection-replay state).
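A minimal sketch of the new API surface described above. The `add_to_chat_ctx` parameter and the `RealtimeCapabilities.ephemeral_response` field come from this PR; the class bodies, the return type, and the fallback wiring are illustrative only:

```python
import warnings
from dataclasses import dataclass


@dataclass
class RealtimeCapabilities:
    # Plugins declare whether they honor add_to_chat_ctx=False.
    ephemeral_response: bool = False


class RealtimeSessionSketch:
    """Illustrative stand-in for the real RealtimeSession (not the actual class)."""

    def __init__(self, capabilities: RealtimeCapabilities) -> None:
        self.capabilities = capabilities

    def generate_reply(self, *, instructions: str, add_to_chat_ctx: bool = True) -> dict:
        # A plugin that does not declare the capability falls back to the
        # legacy add-to-context path; the dispatcher emits a UserWarning.
        effective = add_to_chat_ctx or not self.capabilities.ephemeral_response
        if not add_to_chat_ctx and effective:
            warnings.warn("plugin cannot honor add_to_chat_ctx=False; "
                          "falling back to the add-to-context path", UserWarning)
        return {"instructions": instructions, "add_to_chat_ctx": effective}
```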

Replaces #5569 (closed). The previous attempt rendered content via response.create(input=[assistant_message]), which the GA model gpt-realtime ignored — the model produced an unrelated generic response instead of rendering the input text. Switching to response.create(instructions=X, conversation: "none") renders X audibly AND keeps it out of substrate state.
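The wire-level difference can be sketched as follows. The `conversation: "none"` field and the forced-off tools follow the PR text; the helper function, its name, and the exact event nesting are assumptions for illustration:

```python
import json


def build_response_create(instructions: str, ephemeral: bool) -> str:
    """Hypothetical helper contrasting the default and isolated response.create payloads."""
    event: dict = {"type": "response.create", "response": {"instructions": instructions}}
    if ephemeral:
        # conversation: "none" keeps the response out of the substrate's
        # persistent conversation state; tools are forced off for the turn.
        event["response"]["conversation"] = "none"
        event["response"]["tools"] = []
        event["response"]["tool_choice"] = "none"
    return json.dumps(event)
```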

This enables use cases where the agent must speak content to the user without that content entering its own reasoning context on subsequent turns — for example, confirmation flows where the agent relays sensitive information from a trusted external source without the model gaining persistent access to that information.

The OpenAI plugin enforces a single-isolated-call serialization contract: generate_reply(add_to_chat_ctx=False) raises RuntimeError (with diagnostic context: client_event_id, response_id, elapsed-since-issue, docstring §Concurrency reference) if another isolated call is already in flight on the same session. Default add_to_chat_ctx=True calls retain their existing concurrency semantics. The contract is documented in the API docstring §Concurrency.
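The serialization contract above can be reduced to a small guard. The diagnostic fields (`client_event_id`, `response_id`, elapsed time, the §Concurrency reference) follow the PR text; the class and its method names are hypothetical:

```python
import time


class IsolatedCallGuard:
    """Illustrative sketch of the single-isolated-call contract."""

    def __init__(self) -> None:
        self._in_flight: dict | None = None

    def begin(self, client_event_id: str) -> None:
        if self._in_flight is not None:
            elapsed = time.monotonic() - self._in_flight["issued_at"]
            raise RuntimeError(
                "generate_reply(add_to_chat_ctx=False) already in flight "
                f"(client_event_id={self._in_flight['client_event_id']}, "
                f"response_id={self._in_flight.get('response_id')}, "
                f"elapsed={elapsed:.1f}s); see docstring \u00a7Concurrency"
            )
        self._in_flight = {"client_event_id": client_event_id,
                           "issued_at": time.monotonic()}

    def done(self) -> None:
        # Cleared on response.done so the next isolated call can proceed.
        self._in_flight = None
```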

Independently of that contract, the PR re-includes the orphan filter at _handle_response_created (verbatim port from the previous closed PR commit 92418578, including the isinstance(metadata, dict) server-VAD bypass), restructured to run BEFORE _current_generation is assigned. Nine bare-assert handlers in the OpenAI plugin convert from assert self._current_generation is not None to if self._current_generation is None: return so the substrate's parallel out-of-band path (still reachable via orphans, server-VAD overlap, reconnection-mid-response, or timeout races) cannot crash the session.
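The bare-assert conversion is mechanical; a before/after sketch (handler and field names are illustrative, the pattern is from the PR text):

```python
class HandlerSketch:
    """Illustrative before/after of the bare-assert conversion."""

    def __init__(self) -> None:
        self._current_generation: dict | None = None

    def handle_text_delta_old(self, delta: str) -> None:
        # Old shape: an out-of-band event with no active generation crashed
        # the session with an AssertionError.
        assert self._current_generation is not None
        self._current_generation["text"] += delta

    def handle_text_delta(self, delta: str) -> None:
        # New shape: out-of-band events (orphans, server-VAD overlap,
        # reconnect-mid-response, timeout races) are ignored, not fatal.
        if self._current_generation is None:
            return
        self._current_generation["text"] += delta
```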

interrupt() is wired with the active server-assigned response_id so cancel actually stops in-flight isolated responses (cancel-without-id is silently no-op for out-of-band responses on the OpenAI substrate). The default cancel-without-id is preserved as the race-window fallback for the small window between response.create send and response.created arrival when the server-assigned id is not yet known.

Default add_to_chat_ctx=True and default ephemeral_response=False preserve all existing behavior; reverting this PR is a no-op for any current caller.

Behavior matrix

| `add_to_chat_ctx` | Plugin declares `ephemeral_response` | Behavior |
| --- | --- | --- |
| `True` (default) | * | Existing behavior unchanged. |
| `False` | `True` (OpenAI public endpoint) | `conversation: "none"` on the wire; tools forced off; local context, remote context shadow, OTel span, and public events all suppressed. |
| `False` | `False` (Phonic, Google, Ultravox, AWS, Azure-OpenAI) | `UserWarning` from the dispatcher; falls back to `True` (legacy add-to-context path). |
| `False` | * (non-realtime LLM) | `AgentSession.generate_reply` raises `NotImplementedError`. |
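The behavior matrix condenses to a small dispatch function. This is an illustrative flattening (the function and its boolean parameters are hypothetical; the real gate lives in AgentActivity):

```python
import warnings


def dispatch_generate_reply(*, add_to_chat_ctx: bool, is_realtime: bool,
                            ephemeral_response: bool) -> bool:
    """Returns the effective add_to_chat_ctx value the plugin actually receives."""
    if add_to_chat_ctx:
        return True  # default path: behavior unchanged
    if not is_realtime:
        raise NotImplementedError("add_to_chat_ctx=False requires a realtime LLM")
    if not ephemeral_response:
        warnings.warn("plugin does not support ephemeral responses; "
                      "falling back to add_to_chat_ctx=True", UserWarning)
        return True
    return False  # isolated path: conversation "none" on the wire
```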

Known limitations

  • Single isolated call per session: generate_reply(add_to_chat_ctx=False) is serialized — concurrent issuance raises RuntimeError. This is the documented API contract (see docstring §Concurrency), not a temporary limitation. A follow-up PR could lift it via a _current_generation dict refactor keyed by response.id, but that work is not required for this PR's correctness.
  • interrupt() race window: between response.create send and response.created arrival the server-assigned response.id is not yet known. If interrupt() fires inside this window the cancel falls back to the existing no-id behavior (no-op for out-of-band on the substrate). The audio output's clear_buffer() still fires locally so the user stops hearing already-buffered audio.
  • Azure OpenAI Realtime endpoint: RealtimeCapabilities.ephemeral_response is set to False for Azure-backed sessions because conversation: "none" semantics are not verified there. Azure-backed generate_reply(add_to_chat_ctx=False) calls go through the dispatcher's UserWarning fallback to the legacy add-to-context path. A follow-up issue can be filed when Azure parity is verified.

Empirical foundation

Substrate behavior verified with content-asserted three-arm contrast against gpt-realtime (ISOLATED with nonce content / BASELINE no isolation / SAFETY-CONTROL with PII-shaped content). Standalone reproduction probes are available on request — happy to share the scripts that exercise the substrate primitive directly and through the RealtimeSession wrapper. Audio recordings of model output are preserved separately for verification.

Test coverage

31 new tests in tests/test_realtime/test_generate_reply_isolation.py (30 active, 1 skipped when Azure env vars not set), mix of mock-based (always run) and live-substrate (require OPENAI_API_KEY):

  • Capability flag and abstract signature: 6 tests (defaults, signature shape, OpenAI plugin declaration on public + Azure endpoints).
  • Dispatcher capability gate and pipeline-LLM guard: covered via AgentSession.generate_reply signature/error tests.
  • Substrate isolation: conversation: "none" on serialized JSON bytes, calibration that default does NOT set the field, tool override at the plugin layer.
  • Single-isolated-call serialization contract: rejection in pre-creation and post-creation arms, diagnostic context (client_event_id, response_id, elapsed-since-issue, §Concurrency reference), default-during-isolated proceeds, succeeds-after-completes, ephemeral state cleanup on response.done.
  • Orphan filter: filtered orphan with defensive cancel, server-VAD bypass on metadata=None.
  • Handler guards: 9 reachable handlers early-return on _current_generation is None.
  • Shadow-state leak fix: ephemeral items skipped from _remote_chat_ctx and remote_item_added.
  • Local context gate: _on_remote_item_added skips ephemeral, calibration that non-ephemeral pass through, plugin-without-attribute behaves normally.
  • interrupt(): sends cancel with response_id for in-flight isolated responses, race-window fallback issues default no-id cancel without raising, no-active-generation is noop.
  • Reconnect cleanup: regression test that _reconnect() drains all ephemeral tracking state so subsequent isolated calls are not blocked by stale entries.
  • Live smoke test: end-to-end audibility + behavioral isolation against gpt-realtime via the LiveKit RealtimeSession wrapper, plus end-to-end check that the local chat_ctx is not polluted.

…lity

Adds:
- add_to_chat_ctx: bool = True keyword-only parameter to abstract
  RealtimeSession.generate_reply with a Concurrency docstring section.
- AgentSession.generate_reply: same parameter; raises NotImplementedError
  when combined with a non-realtime LLM.
- RealtimeCapabilities.ephemeral_response: bool = False field.
- AgentActivity dispatcher capability gate: emits DeprecationWarning and
  falls back to add_to_chat_ctx=True for plugins that do not declare the
  capability; emits a separate DeprecationWarning when the caller combines
  add_to_chat_ctx=False with non-empty tools/tool_choice.
- OpenAI plugin sets ephemeral_response=True for non-Azure endpoints
  (Azure path stays on the legacy fallback until conversation:"none"
  semantics are verified there).
- OpenAI plugin generate_reply accepts the new kwarg; the substrate-level
  behavior lands in a follow-up commit.

Default add_to_chat_ctx=True preserves all existing behavior; default
ephemeral_response=False preserves all existing plugin behavior.
…ency hardening for ephemeral responses

When generate_reply is called with add_to_chat_ctx=False, the OpenAI plugin
now sets conversation: "none" on the outbound response.create event so the
substrate does not enter the response into its persistent conversation state,
force-overrides params.tools=[] / params.tool_choice="none" (with a
logger.warning if the caller passed any), and suppresses
openai_client_event_queued / openai_server_event_received emits and
LK_OPENAI_DEBUG log output for the response lifecycle.

Adds a single-isolated-call serialization contract: a second
generate_reply(add_to_chat_ctx=False) issued while the first is in flight
raises RuntimeError with diagnostic context (in-flight client_event_id,
response_id, elapsed-since-issue, docstring section reference). Default
add_to_chat_ctx=True calls retain their existing concurrency semantics
(the substrate enforces serialization of default-conversation responses).
The contract is the API behavior, not a temporary limitation.

Closes a shadow-state leak path: items belonging to an in-flight ephemeral
response now skip _remote_chat_ctx.insert() and the remote_item_added emit
in _handle_conversion_item_added, so the rendered text cannot leak via
session.chat_ctx or the reconnection-replay state.
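The leak fix above is a skip at insertion time. A sketch under stated assumptions (the `_ephemeral_remote_item_ids` set follows the PR text; the surrounding class structure is illustrative):

```python
class RemoteChatCtxSketch:
    """Illustrative stand-in for the remote-context shadow state."""

    def __init__(self) -> None:
        self.items: list[dict] = []
        self._ephemeral_remote_item_ids: set[str] = set()

    def handle_item_added(self, item: dict) -> bool:
        # Items belonging to an in-flight ephemeral response skip both the
        # insert and the remote_item_added emit, so rendered text cannot
        # leak via the public chat_ctx or reconnection-replay state.
        if item["id"] in self._ephemeral_remote_item_ids:
            return False
        self.items.append(item)
        return True  # caller emits remote_item_added only on True
```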

Concurrency hardening (re-included from the prior closed PR commit 9241857):
orphan filter at _handle_response_created (verbatim port, including the
isinstance(metadata, dict) server-VAD bypass), restructured to run BEFORE
_current_generation is assigned. Nine bare-assert handler sites converted
to early-return on _current_generation is None (output_item_added,
content_part_added, text_delta, text_done, audio_transcript_delta,
audio_delta, audio_done, output_item_done, _handle_function_call) so the
substrate parallel out-of-band path (reachable via orphans, server-VAD
overlap, reconnection-mid-response, or timeout races) cannot crash the
session.

Tests cover: serialization-contract rejection (pre-creation + post-creation
arms), conversation-none on the wire (asserted on serialized JSON bytes),
tool override at the plugin layer, default-during-isolated proceeds,
ephemeral state cleanup on response.done, orphan filter (with metadata-None
bypass), all 9 handler guards on None generation, and a live-substrate
smoke test against gpt-realtime that verifies audibility and behavioral
isolation.
…meral responses

Threads the post-capability-gate `effective_add_to_chat_ctx` value through
AgentActivity._realtime_reply_task ->
AgentActivity._realtime_generation_task ->
AgentActivity._realtime_generation_task_impl, then gates all three
_chat_ctx._upsert_item sites in the realtime-generate-reply path:

- function-call upsert (defense-in-depth: tools are forced off at the
  plugin layer for isolated turns, so the callback should not fire, but
  the gate covers any future plugin that does not honor the override);
- assistant-message upsert plus dependent emits (_conversation_item_added,
  speech_handle._item_added) and the OTel ATTR_RESPONSE_TEXT span attribute
  that tracing backends would otherwise log;
- function-call-output upsert plus the corresponding _tool_items_added
  emit (also defense-in-depth).
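The gate applied at each of the three sites has the same shape: when the turn is isolated, the write and its dependent emits are skipped together. A minimal sketch (the function and callback names are illustrative):

```python
from typing import Callable, Optional


def gated_upsert(chat_ctx: list, item: dict, *,
                 effective_add_to_chat_ctx: bool,
                 on_item_added: Optional[Callable[[dict], None]] = None) -> None:
    """Sketch of the per-site gate: for an isolated turn the upsert and its
    dependent emits (conversation-item-added, speech-handle item-added, the
    OTel response-text attribute) are all suppressed as one unit."""
    if not effective_add_to_chat_ctx:
        return
    chat_ctx.append(item)
    if on_item_added is not None:
        on_item_added(item)
```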

Adds a defensive gate at AgentActivity._on_remote_item_added that early-
returns when the inbound item id is in
self._rt_session._ephemeral_remote_item_ids. Uses a duck-typed
getattr(..., set()) lookup so plugins that have not opted into ephemeral
support continue to behave normally. The OpenAI plugin already drops these
items at the source in _handle_conversion_item_added (Phase 2); this gate
is a defensive second layer for future plugin implementations.
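The duck-typed lookup described above can be sketched in a few lines (the helper name is hypothetical; the attribute name and the `getattr(..., set())` pattern are from the commit text):

```python
def should_skip_remote_item(rt_session: object, item_id: str) -> bool:
    """Sketch of the duck-typed gate: plugins that never opted into ephemeral
    support lack the attribute, the lookup defaults to an empty set, and
    every item passes through exactly as before."""
    ephemeral_ids = getattr(rt_session, "_ephemeral_remote_item_ids", set())
    return item_id in ephemeral_ids
```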

Tests cover: ephemeral item skipped at remote-item-added, calibration that
non-ephemeral items still pass through, behavior preserved for plugins
without the ephemeral attribute, and a live-substrate end-to-end check
that an isolated generate_reply against gpt-realtime does not pollute
session.chat_ctx with the rendered text.
…rks for isolated responses

For each in-flight isolated response, interrupt() now sends
ResponseCancelEvent with the server-assigned response_id and drains the
ephemeral tracking dicts (_active_ephemeral_response_ids,
_ephemeral_event_ids, _ephemeral_started_at). Cancel-without-id is
silently no-op for out-of-band responses on the OpenAI substrate, so
isolated turns could not be stopped before this change.

The default cancel-without-id is preserved as the race-window fallback:
between the response.create send and the response.created arrival the
server-assigned response_id is not yet known, so interrupt() falls back to
the no-id cancel. That cancel is a no-op for out-of-band responses on the
substrate, but issuing it is harmless.

The cleanup-on-response.done that the serialization contract relies on
already landed in the previous fused commit; this commit adds the test
confirming that all four ephemeral tracking structures drain when a
response completes naturally.

Tests cover: cancel carries response_id when an isolated response is
in-flight (verified on serialized event), race-window fallback issues
the default no-id cancel without raising, no-active-generation interrupt
is a noop.
…rm contrast and capability-gate coverage

Reconnect cleanup (the meaningful behavioural fix):
A websocket reconnect during an in-flight isolated generate_reply previously
left stale entries in _active_ephemeral_response_ids, _ephemeral_event_ids,
and _ephemeral_started_at.  The serialization-contract check would then
treat the next legitimate isolated call as already-in-flight and raise
RuntimeError for the lifetime of the session.  _reconnect() now drains all
four ephemeral tracking structures alongside _response_created_futures.
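The reconnect fix amounts to draining every tracking structure on `_reconnect()`. A sketch (the four attribute names follow the commit text; the class wrapper and method name are illustrative):

```python
class EphemeralTrackingSketch:
    """Illustrative container for the ephemeral tracking state."""

    def __init__(self) -> None:
        self._active_ephemeral_response_ids: set[str] = set()
        self._ephemeral_event_ids: set[str] = set()
        self._ephemeral_started_at: dict[str, float] = {}
        self._ephemeral_remote_item_ids: set[str] = set()

    def on_reconnect(self) -> None:
        # Without this drain, a reconnect mid-isolated-call left stale
        # entries and every later isolated call raised the serialization
        # RuntimeError for the lifetime of the session.
        self._active_ephemeral_response_ids.clear()
        self._ephemeral_event_ids.clear()
        self._ephemeral_started_at.clear()
        self._ephemeral_remote_item_ids.clear()
```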

_handle_response_done now also discards the per-response remote-item ids
registered during the response lifecycle so the gate set does not grow
unbounded across many ephemeral calls in a long-lived session.

Test coverage additions:
- test_substrate_isolation_isolated_arm_live: hard end-to-end isolation
  assertion against gpt-realtime through the LiveKit RealtimeSession wrapper.
- test_substrate_isolation_calibration_arms_live: BASELINE + SAFETY-CONTROL
  arms with combined-arms calibration check (at least one must recall) so
  the ISOLATED arm's pass cannot be misattributed to model unwillingness or
  a safety filter.  Per-arm flake from the live model is tolerated; full
  calibration failure is not.
- test_reconnect_drains_ephemeral_tracking_state: regression test for the
  reconnect bug above.
- test_capability_gate_warns_and_falls_back_for_unsupporting_plugin:
  exercises the dispatcher gate against a legacy 3-kwarg plugin signature
  (would raise TypeError without the gate).
- test_isolated_response_does_not_emit_public_events_for_response: real
  emit-guard test (predicate + emit branch), replacing the prior predicate-
  only check.
- Strengthened test_concurrent_isolated_generate_reply_rejects_second_pre_creation
  to assert the first future stays pending and no second response.create
  reaches the wire.

Cleaned up planning-stage comments in test docstrings for clarity.  No
production-code changes related to the cleanup.
…ult-during-isolated test scope

Two refinements:

- The dispatcher capability gate previously emitted DeprecationWarning when
  a plugin without RealtimeCapabilities.ephemeral_response received an
  add_to_chat_ctx=False call.  DeprecationWarning is filtered out by
  default and the situation is not actually a deprecation (the API is not
  going away) — it is user misuse: the kwarg cannot be honored by the
  plugin.  Switch to UserWarning so callers see the warning loud and
  clear, and so they understand the call did not isolate.  The same
  category change applies to the secondary "tools-with-isolated" warning.
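The visibility difference motivating the switch can be demonstrated directly. This snippet re-creates the relevant piece of CPython's default filter set by hand (the defaults ignore DeprecationWarning raised outside `__main__`, while UserWarning is shown):

```python
import warnings


def count_visible(category: type) -> int:
    """Count warnings that survive a simulated interpreter-default filter set."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.resetwarnings()
        # Hand-built approximation of the interpreter defaults:
        # DeprecationWarning is ignored; UserWarning is shown.
        warnings.filterwarnings("ignore", category=DeprecationWarning)
        warnings.filterwarnings("always", category=UserWarning)
        warnings.warn("add_to_chat_ctx=False not honored by this plugin", category)
        return len(caught)
```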

- test_default_generate_reply_during_isolated_does_not_raise had a
  misleading name and an under-specified assertion.  Default-during-
  isolated does not raise, but under the single-slot _current_generation
  the second response.created clobbers the first, detaching its stream
  from the slot-resident handlers.  Test docstring now records this as a
  documented limitation; the assertion is unchanged because correctness
  of the overlap is not in scope until _current_generation is refactored
  to a dict keyed by response.id (separate work).
