
feat(realtime): add_to_chat_ctx on generate_reply()#5605

Draft
cphoward wants to merge 7 commits into livekit:main from cphoward:feat/realtime-ephemeral-generate-reply

Conversation

@cphoward

Adds an add_to_chat_ctx: bool = True parameter to RealtimeSession.generate_reply and AgentSession.generate_reply, plus a new RealtimeCapabilities.ephemeral_response capability flag that plugins use to declare whether they honor the parameter. When the OpenAI plugin (on the public endpoint) sees add_to_chat_ctx=False, it sets conversation: "none" on the outbound response.create, force-overrides tools=[] / tool_choice="none" (with a warning if the caller passed any), and suppresses LiveKit-internal openai_client_event_queued / openai_server_event_received emits and LK_OPENAI_DEBUG log output for that response's lifecycle. AgentActivity suppresses the assistant turn's writes to agent._chat_ctx at all three _upsert_item sites and at _on_remote_item_added. The OpenAI plugin also intercepts at the source in _handle_conversion_item_added so ephemeral items never enter _remote_chat_ctx (the shadow state behind the public chat_ctx property and reconnection-replay state).
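A minimal sketch of the new API surface described above. The `add_to_chat_ctx` parameter and the `RealtimeCapabilities.ephemeral_response` field come from this PR; the class bodies, the return type, and the fallback wiring are illustrative only:

```python
import warnings
from dataclasses import dataclass


@dataclass
class RealtimeCapabilities:
    # Plugins declare whether they honor add_to_chat_ctx=False.
    ephemeral_response: bool = False


class RealtimeSessionSketch:
    """Illustrative stand-in for the real RealtimeSession (not the actual class)."""

    def __init__(self, capabilities: RealtimeCapabilities) -> None:
        self.capabilities = capabilities

    def generate_reply(self, *, instructions: str, add_to_chat_ctx: bool = True) -> dict:
        # A plugin that does not declare the capability falls back to the
        # legacy add-to-context path; the dispatcher emits a UserWarning.
        effective = add_to_chat_ctx or not self.capabilities.ephemeral_response
        if not add_to_chat_ctx and effective:
            warnings.warn("plugin cannot honor add_to_chat_ctx=False; "
                          "falling back to the add-to-context path", UserWarning)
        return {"instructions": instructions, "add_to_chat_ctx": effective}
```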

Replaces #5569 (closed). The previous attempt rendered content via response.create(input=[assistant_message]), which the GA model gpt-realtime ignored — the model produced an unrelated generic response instead of rendering the input text. Switching to response.create(instructions=X, conversation: "none") renders X audibly AND keeps it out of substrate state.
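The wire-level difference can be sketched as follows. The `conversation: "none"` field and the forced-off tools follow the PR text; the helper function, its name, and the exact event nesting are assumptions for illustration:

```python
import json


def build_response_create(instructions: str, ephemeral: bool) -> str:
    """Hypothetical helper contrasting the default and isolated response.create payloads."""
    event: dict = {"type": "response.create", "response": {"instructions": instructions}}
    if ephemeral:
        # conversation: "none" keeps the response out of the substrate's
        # persistent conversation state; tools are forced off for the turn.
        event["response"]["conversation"] = "none"
        event["response"]["tools"] = []
        event["response"]["tool_choice"] = "none"
    return json.dumps(event)
```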

This enables use cases where the agent must speak content to the user without that content entering its own reasoning context on subsequent turns — for example, confirmation flows where the agent relays sensitive information from a trusted external source without the model gaining persistent access to that information.

The OpenAI plugin enforces a single-isolated-call serialization contract: generate_reply(add_to_chat_ctx=False) raises RuntimeError (with diagnostic context: client_event_id, response_id, elapsed-since-issue, docstring §Concurrency reference) if another isolated call is already in flight on the same session. Default add_to_chat_ctx=True calls retain their existing concurrency semantics. The contract is documented in the API docstring §Concurrency.
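The serialization contract above can be reduced to a small guard. The diagnostic fields (`client_event_id`, `response_id`, elapsed time, the §Concurrency reference) follow the PR text; the class and its method names are hypothetical:

```python
import time


class IsolatedCallGuard:
    """Illustrative sketch of the single-isolated-call contract."""

    def __init__(self) -> None:
        self._in_flight: dict | None = None

    def begin(self, client_event_id: str) -> None:
        if self._in_flight is not None:
            elapsed = time.monotonic() - self._in_flight["issued_at"]
            raise RuntimeError(
                "generate_reply(add_to_chat_ctx=False) already in flight "
                f"(client_event_id={self._in_flight['client_event_id']}, "
                f"response_id={self._in_flight.get('response_id')}, "
                f"elapsed={elapsed:.1f}s); see docstring \u00a7Concurrency"
            )
        self._in_flight = {"client_event_id": client_event_id,
                           "issued_at": time.monotonic()}

    def done(self) -> None:
        # Cleared on response.done so the next isolated call can proceed.
        self._in_flight = None
```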

Independently of that contract, the PR re-includes the orphan filter at _handle_response_created (verbatim port from the previous closed PR commit 92418578, including the isinstance(metadata, dict) server-VAD bypass), restructured to run BEFORE _current_generation is assigned. Nine bare-assert handlers in the OpenAI plugin convert from assert self._current_generation is not None to if self._current_generation is None: return so the substrate's parallel out-of-band path (still reachable via orphans, server-VAD overlap, reconnection-mid-response, or timeout races) cannot crash the session.
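The bare-assert conversion is mechanical; a before/after sketch (handler and field names are illustrative, the pattern is from the PR text):

```python
class HandlerSketch:
    """Illustrative before/after of the bare-assert conversion."""

    def __init__(self) -> None:
        self._current_generation: dict | None = None

    def handle_text_delta_old(self, delta: str) -> None:
        # Old shape: an out-of-band event with no active generation crashed
        # the session with an AssertionError.
        assert self._current_generation is not None
        self._current_generation["text"] += delta

    def handle_text_delta(self, delta: str) -> None:
        # New shape: out-of-band events (orphans, server-VAD overlap,
        # reconnect-mid-response, timeout races) are ignored, not fatal.
        if self._current_generation is None:
            return
        self._current_generation["text"] += delta
```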

interrupt() is wired with the active server-assigned response_id so cancel actually stops in-flight isolated responses (cancel-without-id is silently no-op for out-of-band responses on the OpenAI substrate). The default cancel-without-id is preserved as the race-window fallback for the small window between response.create send and response.created arrival when the server-assigned id is not yet known.

Default add_to_chat_ctx=True and default ephemeral_response=False preserve all existing behavior; reverting this PR is a no-op for any current caller.

Behavior matrix

| `add_to_chat_ctx` | Plugin declares `ephemeral_response` | Behavior |
| --- | --- | --- |
| `True` (default) | * | Existing behavior unchanged. |
| `False` | `True` (OpenAI public endpoint) | `conversation: "none"` on the wire; tools forced off; local context, remote context shadow, OTel span, and public events all suppressed. |
| `False` | `False` (Phonic, Google, Ultravox, AWS, Azure-OpenAI) | `UserWarning` from the dispatcher; falls back to `True` (legacy add-to-context path). |
| `False` | * (non-realtime LLM) | `AgentSession.generate_reply` raises `NotImplementedError`. |
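The behavior matrix condenses to a small dispatch function. This is an illustrative flattening (the function and its boolean parameters are hypothetical; the real gate lives in AgentActivity):

```python
import warnings


def dispatch_generate_reply(*, add_to_chat_ctx: bool, is_realtime: bool,
                            ephemeral_response: bool) -> bool:
    """Returns the effective add_to_chat_ctx value the plugin actually receives."""
    if add_to_chat_ctx:
        return True  # default path: behavior unchanged
    if not is_realtime:
        raise NotImplementedError("add_to_chat_ctx=False requires a realtime LLM")
    if not ephemeral_response:
        warnings.warn("plugin does not support ephemeral responses; "
                      "falling back to add_to_chat_ctx=True", UserWarning)
        return True
    return False  # isolated path: conversation "none" on the wire
```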

Known limitations

  • Single isolated call per session: generate_reply(add_to_chat_ctx=False) is serialized — concurrent issuance raises RuntimeError. This is the documented API contract (see docstring §Concurrency), not a temporary limitation. A follow-up PR could lift it via a _current_generation dict refactor keyed by response.id, but that work is not required for this PR's correctness.
  • interrupt() race window: between response.create send and response.created arrival the server-assigned response.id is not yet known. If interrupt() fires inside this window the cancel falls back to the existing no-id behavior (no-op for out-of-band on the substrate). The audio output's clear_buffer() still fires locally so the user stops hearing already-buffered audio.
  • Azure OpenAI Realtime endpoint: RealtimeCapabilities.ephemeral_response is set to False for Azure-backed sessions because conversation: "none" semantics are not verified there. Azure-backed generate_reply(add_to_chat_ctx=False) calls go through the dispatcher's UserWarning fallback to the legacy add-to-context path. A follow-up issue can be filed when Azure parity is verified.

Empirical foundation

Substrate behavior verified with content-asserted three-arm contrast against gpt-realtime (ISOLATED with nonce content / BASELINE no isolation / SAFETY-CONTROL with PII-shaped content). Standalone reproduction probes are available on request — happy to share the scripts that exercise the substrate primitive directly and through the RealtimeSession wrapper. Audio recordings of model output are preserved separately for verification.

Test coverage

31 new tests in tests/test_realtime/test_generate_reply_isolation.py (30 active, 1 skipped when Azure env vars not set), mix of mock-based (always run) and live-substrate (require OPENAI_API_KEY):

  • Capability flag and abstract signature: 6 tests (defaults, signature shape, OpenAI plugin declaration on public + Azure endpoints).
  • Dispatcher capability gate and pipeline-LLM guard: covered via AgentSession.generate_reply signature/error tests.
  • Substrate isolation: conversation: "none" on serialized JSON bytes, calibration that default does NOT set the field, tool override at the plugin layer.
  • Single-isolated-call serialization contract: rejection in pre-creation and post-creation arms, diagnostic context (client_event_id, response_id, elapsed-since-issue, §Concurrency reference), default-during-isolated proceeds, succeeds-after-completes, ephemeral state cleanup on response.done.
  • Orphan filter: filtered orphan with defensive cancel, server-VAD bypass on metadata=None.
  • Handler guards: 9 reachable handlers early-return on _current_generation is None.
  • Shadow-state leak fix: ephemeral items skipped from _remote_chat_ctx and remote_item_added.
  • Local context gate: _on_remote_item_added skips ephemeral, calibration that non-ephemeral pass through, plugin-without-attribute behaves normally.
  • interrupt(): sends cancel with response_id for in-flight isolated responses, race-window fallback issues default no-id cancel without raising, no-active-generation is noop.
  • Reconnect cleanup: regression test that _reconnect() drains all ephemeral tracking state so subsequent isolated calls are not blocked by stale entries.
  • Live smoke test: end-to-end audibility + behavioral isolation against gpt-realtime via the LiveKit RealtimeSession wrapper, plus end-to-end check that the local chat_ctx is not polluted.

…lity

Adds:
- add_to_chat_ctx: bool = True keyword-only parameter to abstract
  RealtimeSession.generate_reply with a Concurrency docstring section.
- AgentSession.generate_reply: same parameter; raises NotImplementedError
  when combined with a non-realtime LLM.
- RealtimeCapabilities.ephemeral_response: bool = False field.
- AgentActivity dispatcher capability gate: emits DeprecationWarning and
  falls back to add_to_chat_ctx=True for plugins that do not declare the
  capability; emits a separate DeprecationWarning when the caller combines
  add_to_chat_ctx=False with non-empty tools/tool_choice.
- OpenAI plugin sets ephemeral_response=True for non-Azure endpoints
  (Azure path stays on the legacy fallback until conversation:"none"
  semantics are verified there).
- OpenAI plugin generate_reply accepts the new kwarg; the substrate-level
  behavior lands in a follow-up commit.

Default add_to_chat_ctx=True preserves all existing behavior; default
ephemeral_response=False preserves all existing plugin behavior.
…ency hardening for ephemeral responses

When generate_reply is called with add_to_chat_ctx=False, the OpenAI plugin
now sets conversation: "none" on the outbound response.create event so the
substrate does not enter the response into its persistent conversation state,
force-overrides params.tools=[] / params.tool_choice="none" (with a
logger.warning if the caller passed any), and suppresses
openai_client_event_queued / openai_server_event_received emits and
LK_OPENAI_DEBUG log output for the response lifecycle.

Adds a single-isolated-call serialization contract: a second
generate_reply(add_to_chat_ctx=False) issued while the first is in flight
raises RuntimeError with diagnostic context (in-flight client_event_id,
response_id, elapsed-since-issue, docstring section reference). Default
add_to_chat_ctx=True calls retain their existing concurrency semantics
(the substrate enforces serialization of default-conversation responses).
The contract is the API behavior, not a temporary limitation.

Closes a shadow-state leak path: items belonging to an in-flight ephemeral
response now skip _remote_chat_ctx.insert() and the remote_item_added emit
in _handle_conversion_item_added, so the rendered text cannot leak via
session.chat_ctx or the reconnection-replay state.
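The leak fix above is a skip at insertion time. A sketch under stated assumptions (the `_ephemeral_remote_item_ids` set follows the PR text; the surrounding class structure is illustrative):

```python
class RemoteChatCtxSketch:
    """Illustrative stand-in for the remote-context shadow state."""

    def __init__(self) -> None:
        self.items: list[dict] = []
        self._ephemeral_remote_item_ids: set[str] = set()

    def handle_item_added(self, item: dict) -> bool:
        # Items belonging to an in-flight ephemeral response skip both the
        # insert and the remote_item_added emit, so rendered text cannot
        # leak via the public chat_ctx or reconnection-replay state.
        if item["id"] in self._ephemeral_remote_item_ids:
            return False
        self.items.append(item)
        return True  # caller emits remote_item_added only on True
```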

Concurrency hardening (re-included from the prior closed PR commit 9241857):
orphan filter at _handle_response_created (verbatim port, including the
isinstance(metadata, dict) server-VAD bypass), restructured to run BEFORE
_current_generation is assigned. Nine bare-assert handler sites converted
to early-return on _current_generation is None (output_item_added,
content_part_added, text_delta, text_done, audio_transcript_delta,
audio_delta, audio_done, output_item_done, _handle_function_call) so the
substrate parallel out-of-band path (reachable via orphans, server-VAD
overlap, reconnection-mid-response, or timeout races) cannot crash the
session.

Tests cover: serialization-contract rejection (pre-creation + post-creation
arms), conversation-none on the wire (asserted on serialized JSON bytes),
tool override at the plugin layer, default-during-isolated proceeds,
ephemeral state cleanup on response.done, orphan filter (with metadata-None
bypass), all 9 handler guards on None generation, and a live-substrate
smoke test against gpt-realtime that verifies audibility and behavioral
isolation.
…meral responses

Threads the post-capability-gate `effective_add_to_chat_ctx` value through
AgentActivity._realtime_reply_task ->
AgentActivity._realtime_generation_task ->
AgentActivity._realtime_generation_task_impl, then gates all three
_chat_ctx._upsert_item sites in the realtime-generate-reply path:

- function-call upsert (defense-in-depth: tools are forced off at the
  plugin layer for isolated turns, so the callback should not fire, but
  the gate covers any future plugin that does not honor the override);
- assistant-message upsert plus dependent emits (_conversation_item_added,
  speech_handle._item_added) and the OTel ATTR_RESPONSE_TEXT span attribute
  that tracing backends would otherwise log;
- function-call-output upsert plus the corresponding _tool_items_added
  emit (also defense-in-depth).
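The gate applied at each of the three sites has the same shape: when the turn is isolated, the write and its dependent emits are skipped together. A minimal sketch (the function and callback names are illustrative):

```python
from typing import Callable, Optional


def gated_upsert(chat_ctx: list, item: dict, *,
                 effective_add_to_chat_ctx: bool,
                 on_item_added: Optional[Callable[[dict], None]] = None) -> None:
    """Sketch of the per-site gate: for an isolated turn the upsert and its
    dependent emits (conversation-item-added, speech-handle item-added, the
    OTel response-text attribute) are all suppressed as one unit."""
    if not effective_add_to_chat_ctx:
        return
    chat_ctx.append(item)
    if on_item_added is not None:
        on_item_added(item)
```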

Adds a defensive gate at AgentActivity._on_remote_item_added that early-
returns when the inbound item id is in
self._rt_session._ephemeral_remote_item_ids. Uses a duck-typed
getattr(..., set()) lookup so plugins that have not opted into ephemeral
support continue to behave normally. The OpenAI plugin already drops these
items at the source in _handle_conversion_item_added (Phase 2); this gate
is a defensive second layer for future plugin implementations.
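The duck-typed lookup described above can be sketched in a few lines (the helper name is hypothetical; the attribute name and the `getattr(..., set())` pattern are from the commit text):

```python
def should_skip_remote_item(rt_session: object, item_id: str) -> bool:
    """Sketch of the duck-typed gate: plugins that never opted into ephemeral
    support lack the attribute, the lookup defaults to an empty set, and
    every item passes through exactly as before."""
    ephemeral_ids = getattr(rt_session, "_ephemeral_remote_item_ids", set())
    return item_id in ephemeral_ids
```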

Tests cover: ephemeral item skipped at remote-item-added, calibration that
non-ephemeral items still pass through, behavior preserved for plugins
without the ephemeral attribute, and a live-substrate end-to-end check
that an isolated generate_reply against gpt-realtime does not pollute
session.chat_ctx with the rendered text.
…rks for isolated responses

For each in-flight isolated response, interrupt() now sends
ResponseCancelEvent with the server-assigned response_id and drains the
ephemeral tracking dicts (_active_ephemeral_response_ids,
_ephemeral_event_ids, _ephemeral_started_at). Cancel-without-id is
silently no-op for out-of-band responses on the OpenAI substrate, so
isolated turns could not be stopped before this change.

The default cancel-without-id is preserved as the race-window fallback:
between the response.create send and the response.created arrival the
server-assigned response_id is not yet known, so interrupt() falls back to
the no-id cancel. That cancel is a no-op for out-of-band responses on the
substrate, but issuing it is harmless.

The cleanup-on-response.done that the serialization contract relies on
already landed in the previous fused commit; this commit adds the test
confirming that all four ephemeral tracking structures drain when a
response completes naturally.

Tests cover: cancel carries response_id when an isolated response is
in-flight (verified on serialized event), race-window fallback issues
the default no-id cancel without raising, no-active-generation interrupt
is a noop.
…rm contrast and capability-gate coverage

Reconnect cleanup (the meaningful behavioural fix):
A websocket reconnect during an in-flight isolated generate_reply previously
left stale entries in _active_ephemeral_response_ids, _ephemeral_event_ids,
and _ephemeral_started_at.  The serialization-contract check would then
treat the next legitimate isolated call as already-in-flight and raise
RuntimeError for the lifetime of the session.  _reconnect() now drains all
four ephemeral tracking structures alongside _response_created_futures.
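The reconnect fix amounts to draining every tracking structure on `_reconnect()`. A sketch (the four attribute names follow the commit text; the class wrapper and method name are illustrative):

```python
class EphemeralTrackingSketch:
    """Illustrative container for the ephemeral tracking state."""

    def __init__(self) -> None:
        self._active_ephemeral_response_ids: set[str] = set()
        self._ephemeral_event_ids: set[str] = set()
        self._ephemeral_started_at: dict[str, float] = {}
        self._ephemeral_remote_item_ids: set[str] = set()

    def on_reconnect(self) -> None:
        # Without this drain, a reconnect mid-isolated-call left stale
        # entries and every later isolated call raised the serialization
        # RuntimeError for the lifetime of the session.
        self._active_ephemeral_response_ids.clear()
        self._ephemeral_event_ids.clear()
        self._ephemeral_started_at.clear()
        self._ephemeral_remote_item_ids.clear()
```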

_handle_response_done now also discards the per-response remote-item ids
registered during the response lifecycle so the gate set does not grow
unbounded across many ephemeral calls in a long-lived session.

Test coverage additions:
- test_substrate_isolation_isolated_arm_live: hard end-to-end isolation
  assertion against gpt-realtime through the LiveKit RealtimeSession wrapper.
- test_substrate_isolation_calibration_arms_live: BASELINE + SAFETY-CONTROL
  arms with combined-arms calibration check (at least one must recall) so
  the ISOLATED arm's pass cannot be misattributed to model unwillingness or
  a safety filter.  Per-arm flake from the live model is tolerated; full
  calibration failure is not.
- test_reconnect_drains_ephemeral_tracking_state: regression test for the
  reconnect bug above.
- test_capability_gate_warns_and_falls_back_for_unsupporting_plugin:
  exercises the dispatcher gate against a legacy 3-kwarg plugin signature
  (would raise TypeError without the gate).
- test_isolated_response_does_not_emit_public_events_for_response: real
  emit-guard test (predicate + emit branch), replacing the prior predicate-
  only check.
- Strengthened test_concurrent_isolated_generate_reply_rejects_second_pre_creation
  to assert the first future stays pending and no second response.create
  reaches the wire.

Cleaned up planning-stage comments in test docstrings for clarity.  No
production-code changes related to the cleanup.
…ult-during-isolated test scope

Two refinements:

- The dispatcher capability gate previously emitted DeprecationWarning when
  a plugin without RealtimeCapabilities.ephemeral_response received an
  add_to_chat_ctx=False call.  DeprecationWarning is filtered out by
  default and the situation is not actually a deprecation (the API is not
  going away) — it is user misuse: the kwarg cannot be honored by the
  plugin.  Switch to UserWarning so callers see the warning loud and
  clear, and so they understand the call did not isolate.  The same
  category change applies to the secondary "tools-with-isolated" warning.
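The visibility difference motivating the switch can be demonstrated directly. This snippet re-creates the relevant piece of CPython's default filter set by hand (the defaults ignore DeprecationWarning raised outside `__main__`, while UserWarning is shown):

```python
import warnings


def count_visible(category: type) -> int:
    """Count warnings that survive a simulated interpreter-default filter set."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.resetwarnings()
        # Hand-built approximation of the interpreter defaults:
        # DeprecationWarning is ignored; UserWarning is shown.
        warnings.filterwarnings("ignore", category=DeprecationWarning)
        warnings.filterwarnings("always", category=UserWarning)
        warnings.warn("add_to_chat_ctx=False not honored by this plugin", category)
        return len(caught)
```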

- test_default_generate_reply_during_isolated_does_not_raise had a
  misleading name and an under-specified assertion.  Default-during-
  isolated does not raise, but under the single-slot _current_generation
  the second response.created clobbers the first, detaching its stream
  from the slot-resident handlers.  Test docstring now records this as a
  documented limitation; the assertion is unchanged because correctness
  of the overlap is not in scope until _current_generation is refactored
  to a dict keyed by response.id (separate work).
