From 77b94c06a9c684b89805089b998d5fbffcf80fc5 Mon Sep 17 00:00:00 2001
From: Claude
Date: Fri, 5 Jun 2026 18:25:03 +0000
Subject: [PATCH 1/2] feat: stable ActionTrace export contract (#94) + property-based invariant tests (#99)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

#94 — Action trace export:
- Add export_action_trace / export_action_traces: a stable, versioned,
  JSON-serialisable shape for ActionTrace records so downstream tools (e.g.
  LessonWeaver-style lesson extraction) can consume the audit trail without
  depending on internals. Derived only from already-redaction-safe trace
  fields (redacted args, post-firewall result_summary), so it cannot widen
  the I-01 boundary. Optional human-correction metadata attaches at export
  time; denied requests never produce a trace (policy gates before invoke).
- Record the invoked capability's sensitivity on ActionTrace.
- New docs/trace_export.md (incl. how it differs from the OTel export) and
  examples/trace_export_demo.py (one succeeded + one failed action), wired
  into make ci.

#99 — Property-based tests (tests/test_policy_properties.py, Hypothesis):
- Stable reason code on every allow/deny; max_rows never exceeds the policy
  cap; handle expansion never exceeds the original grant (indirect-use
  scenario); tokens never verify outside scope and tampered/expired tokens
  are rejected; policy traces never leak raw scope values; trace export is
  always JSON-serialisable. Adds hypothesis as a dev dependency.

Validated: ruff check, ruff format --check, mypy (41 files), pytest (581
passed, 1 skipped; test_mcp_driver skipped — mcp not installable locally),
and the example list all green.
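
Consumer-side sketch (illustrative only, not part of this patch): a downstream
tool might read the exported envelope roughly as follows. The helper name
`summarise_envelope` and the sample values are hypothetical. Only the `schema`,
`version`, `traces`, `action_id`, `status`, and `correction` keys are taken from
the contract described above. The sketch uses plain dicts so it runs without
weaver_kernel installed.

```python
from __future__ import annotations

import json
from typing import Any

# Values documented by the export contract in this patch.
EXPECTED_SCHEMA = "weaver_kernel.action_trace_export"
SUPPORTED_VERSIONS = {"1"}


def summarise_envelope(envelope: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical consumer helper: count traces by status.

    Unknown keys are ignored, since the contract allows new optional
    fields to appear without a version bump.
    """
    if envelope.get("schema") != EXPECTED_SCHEMA:
        raise ValueError(f"unexpected schema: {envelope.get('schema')!r}")
    if envelope.get("version") not in SUPPORTED_VERSIONS:
        raise ValueError(f"unsupported version: {envelope.get('version')!r}")

    counts = {"succeeded": 0, "failed": 0}
    corrected: list[str] = []
    for trace in envelope.get("traces", []):
        # Branch on the stable `status` field, not on the error text.
        counts[trace["status"]] = counts.get(trace["status"], 0) + 1
        if trace.get("correction") is not None:
            corrected.append(trace["action_id"])
    return {"counts": counts, "corrected": corrected}


# Hand-written sample in the documented shape (all values are made up).
sample = json.loads("""
{
  "schema": "weaver_kernel.action_trace_export",
  "version": "1",
  "traces": [
    {"action_id": "act-1", "status": "succeeded", "error": null,
     "sensitivity": "PII", "correction": null, "future_field": 1},
    {"action_id": "act-2", "status": "failed", "error": "backend down",
     "sensitivity": "NONE",
     "correction": {"corrected_by": "on-call", "note": "retried later"}}
  ]
}
""")

summary = summarise_envelope(sample)
print(summary)
```

The schema and version are checked first so that a consumer fails loudly on a
breaking change instead of silently misreading fields.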
--- .github/workflows/ci.yml | 1 + AGENTS.md | 1 + CHANGELOG.md | 21 ++ Makefile | 1 + docs/architecture.md | 4 +- docs/trace_export.md | 142 ++++++++++ examples/trace_export_demo.py | 133 ++++++++++ pyproject.toml | 1 + src/weaver_kernel/__init__.py | 14 +- src/weaver_kernel/kernel/__init__.py | 1 + src/weaver_kernel/kernel/_invoke.py | 12 + src/weaver_kernel/kernel/_stream.py | 2 +- src/weaver_kernel/models.py | 8 + src/weaver_kernel/trace.py | 105 +++++++- tests/test_policy_properties.py | 370 +++++++++++++++++++++++++++ tests/test_trace.py | 130 +++++++++- 16 files changed, 939 insertions(+), 7 deletions(-) create mode 100644 docs/trace_export.md create mode 100644 examples/trace_export_demo.py create mode 100644 tests/test_policy_properties.py diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml index cdeb98b..7f726cc 100644 --- a/.github/workflows/ci.yml +++ b/.github/workflows/ci.yml @@ -47,6 +47,7 @@ jobs: python examples/http_driver_demo.py python examples/tutorial.py python examples/readme_quickstart.py + python examples/trace_export_demo.py conformance_stub: name: "Weaver Spec Conformance Stub (v0.1.0)" diff --git a/AGENTS.md b/AGENTS.md index 3948e15..de88aed 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -138,6 +138,7 @@ See [docs/agent-context/review-checklist.md](docs/agent-context/review-checklist | Driver integration patterns | [docs/integrations.md](docs/integrations.md) | | Capability design conventions | [docs/capabilities.md](docs/capabilities.md) | | Context firewall details | [docs/context_firewall.md](docs/context_firewall.md) | +| Action trace export contract | [docs/trace_export.md](docs/trace_export.md) | ## Update policy diff --git a/CHANGELOG.md b/CHANGELOG.md index a2862ce..e8fe8ff 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -20,6 +20,27 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 a settings rename to `weaver-kernel` is the optional final step. 
### Added +- **Action trace export contract (#94).** New `export_action_trace` / + `export_action_traces` produce a stable, versioned, JSON-serialisable shape + for `ActionTrace` records so downstream tools (e.g. LessonWeaver-style lesson + extraction) can consume the audit trail without depending on internals. The + export is derived only from already-redaction-safe trace fields — `args` + (memory payloads stripped) and `result_summary` (post-firewall counts/flags) + — so it cannot widen the I-01 boundary. `ActionTrace` now carries the invoked + capability's `sensitivity`, and downstream human-correction metadata can be + attached at export time. New [`docs/trace_export.md`](docs/trace_export.md) + (including how it differs from the OpenTelemetry export) and + [`examples/trace_export_demo.py`](examples/trace_export_demo.py), wired into + `make ci`. +- **Property-based invariant tests (#99).** New `tests/test_policy_properties.py` + uses Hypothesis to assert authorization invariants across generated + principals, capabilities, scopes, constraints, handles, and tokens: every + decision carries a stable reason code, `max_rows` never exceeds the policy + cap, handle expansion never exceeds the original grant (indirect-use + scenario), tokens never verify outside their scope and tampered/expired + tokens are always rejected, policy traces never leak raw scope values, and + the trace export is always JSON-serialisable. Adds `hypothesis` as a dev + dependency. 
- README repositioned to lead with the unique **capability-token + tamper-evident audit** value, with explicit boundary framing for the policy engine (vs `AgentFence`, #111) and the context firewall (vs `contextweaver`, #110) so a diff --git a/Makefile b/Makefile index 4550d82..0f9d441 100644 --- a/Makefile +++ b/Makefile @@ -25,5 +25,6 @@ example: python examples/repository_safety_check.py python examples/chainweaver_flow.py python examples/evaluation_artifact_policy.py + python examples/trace_export_demo.py ci: fmt-check lint type test example diff --git a/docs/architecture.md b/docs/architecture.md index 9e2fc9c..d4fe3fb 100644 --- a/docs/architecture.md +++ b/docs/architecture.md @@ -154,7 +154,9 @@ Transforms `RawResult → Frame`. Never exposes raw output to the LLM. Stores full results by opaque handle ID with TTL. `expand()` supports pagination, field selection, and basic equality filtering. ### TraceStore -Records every `ActionTrace`. `explain(action_id)` returns the full audit record. On a successful invocation the trace also carries a `result_summary` — a redaction-safe dict of counts/flags (`fact_count`, `row_count`, `warning_count`, `has_handle`) derived from the firewalled `Frame`, never from raw driver data — so an invocation's outcome is auditable directly (e.g. a repository safety check passed iff `result_summary["row_count"] == 0`). Failed runs have `result_summary == None`. +Records every `ActionTrace`. `explain(action_id)` returns the full audit record. On a successful invocation the trace also carries a `result_summary` — a redaction-safe dict of counts/flags (`fact_count`, `row_count`, `warning_count`, `has_handle`) derived from the firewalled `Frame`, never from raw driver data — so an invocation's outcome is auditable directly (e.g. a repository safety check passed iff `result_summary["row_count"] == 0`). Failed runs have `result_summary == None`. 
Each trace also records the invoked capability's `sensitivity` (`NONE`/`PII`/`PCI`/`SECRETS`/`MEMORY`). + +`export_action_trace` / `export_action_traces` serialise traces into a stable, versioned, JSON-serialisable shape for downstream analysis tools (distinct from the OpenTelemetry observability export). See [trace_export.md](trace_export.md). ### Adapters (`weaver_kernel.adapters`) Vendor-specific tool-format adapters that translate between `Capability` objects diff --git a/docs/trace_export.md b/docs/trace_export.md new file mode 100644 index 0000000..3b80d1e --- /dev/null +++ b/docs/trace_export.md @@ -0,0 +1,142 @@ +# Action Trace Export + +agent-kernel records an [`ActionTrace`](architecture.md) for every invocation. +The **trace export contract** turns those records into a stable, +JSON-serialisable shape that an external tool can consume — for example a +[LessonWeaver](https://github.com/dgenio/weaver-spec)-style lesson-extraction +layer that learns from past actions, policies, denials, corrections, and +outcomes. + +```python +from weaver_kernel import export_action_traces + +envelope = export_action_traces(kernel._traces.list_all()) +``` + +Runnable companion: [`examples/trace_export_demo.py`](../examples/trace_export_demo.py). + +## How this differs from OpenTelemetry export + +agent-kernel also ships an OpenTelemetry integration +([`weaver_kernel.otel`](architecture.md), `pip install weaver-kernel[otel]`). 
+The two serve different consumers and do **not** compete: + +| | OpenTelemetry (`instrument_kernel`) | Trace export (`export_action_traces`) | +|---|---|---| +| Consumer | Live observability backends (traces/metrics) | Offline analysis / learning tools | +| Shape | OTel spans + metrics, vendor-defined | Stable JSON envelope defined here | +| Timing | Emitted during execution | Pulled after the fact from the `TraceStore` | +| Stability | Tracks OTel semantic conventions | Versioned by `TRACE_EXPORT_VERSION` | + +Use OTel for dashboards and alerting; use the export contract when another +program needs a durable, replayable record of what the agent did. + +## Privacy + +The export is derived **only** from fields the `ActionTrace` already holds, +all of which are redaction-safe by construction: + +- `args` has memory payloads stripped at record time (keys like `payload`, + `content`, `value`, `memory`, `text`, `body` for `memory.*` capabilities + become `"[REDACTED]"`). +- `result_summary` carries counts and flags taken from the **post-firewall** + `Frame` — never raw driver data. + +The contract adds no field the trace did not already carry, so exporting can +never widen the I-01 firewall boundary or leak sensitive payloads. A *denied* +request never produces an `ActionTrace` — policy gates before invocation +(I-02) — so the export only ever describes authorised invocations; denials are +surfaced separately via `PolicyDenied` / `Kernel.explain_denial`. + +## Envelope shape + +`export_action_traces(...)` returns a versioned envelope: + +```json +{ + "schema": "weaver_kernel.action_trace_export", + "version": "1", + "traces": [ /* one object per ActionTrace */ ] +} +``` + +Each trace object: + +| Field | Type | Notes | +|-------|------|-------| +| `action_id` | string | Unique id; matches `Kernel.explain(action_id)`. | +| `capability_id` | string | The capability (tool) that was invoked. | +| `principal_id` | string | Who invoked it. 
| +| `token_id` | string | The capability token used. | +| `invoked_at` | string | ISO 8601 timestamp. | +| `response_mode` | string | `summary` / `table` / `handle_only` / `raw`. | +| `driver_id` | string | Driver that served the call (`""` on failure). | +| `handle_id` | string \| null | Handle for the full dataset, if one was minted. | +| `sensitivity` | string | `NONE` / `PII` / `PCI` / `SECRETS` / `MEMORY`. | +| `status` | string | `succeeded` or `failed` (derived from `error`). | +| `error` | string \| null | Failure reason; `null` on success. | +| `args` | object | Redacted invocation arguments. | +| `result_summary` | object \| null | Post-firewall counts/flags; `null` on failure. | +| `correction` | object \| null | Optional human-correction metadata (see below). | + +### Human corrections + +agent-kernel does not record human corrections itself. A downstream tool can +attach them at export time by passing a mapping of `action_id` → metadata: + +```python +envelope = export_action_traces( + traces, + corrections={"act-123": {"corrected_by": "reviewer", "note": "wrong customer"}}, +) +``` + +## Example output + +```json +{ + "schema": "weaver_kernel.action_trace_export", + "version": "1", + "traces": [ + { + "action_id": "0a1b...", + "capability_id": "billing.list_invoices", + "principal_id": "agent-007", + "token_id": "f3c2...", + "invoked_at": "2026-06-05T12:00:00+00:00", + "response_mode": "summary", + "driver_id": "billing", + "handle_id": "9d7e...", + "sensitivity": "PII", + "status": "succeeded", + "error": null, + "args": {"operation": "list_invoices", "status": "paid"}, + "result_summary": {"fact_count": 4, "row_count": 0, "warning_count": 1, "has_handle": true}, + "correction": null + }, + { + "action_id": "5e6f...", + "capability_id": "billing.flaky_report", + "principal_id": "agent-007", + "token_id": "11aa...", + "invoked_at": "2026-06-05T12:00:01+00:00", + "response_mode": "summary", + "driver_id": "", + "handle_id": null, + "sensitivity": 
 "NONE", + "status": "failed", + "error": "Handler for operation='flaky_report' raised: reporting backend is unavailable", + "args": {"operation": "flaky_report"}, + "result_summary": null, + "correction": {"corrected_by": "on-call", "note": "known outage; retried later"} + } + ] +} +``` + +## Stability + +`TRACE_EXPORT_VERSION` is bumped only on a **breaking** change to the field +shape. New optional fields may be added without a bump, so consumers should +ignore unknown keys. Branch on `status`, `sensitivity`, and whether `error` is +`null` rather than on the text of `error`, which is human-readable and may evolve. diff --git a/examples/trace_export_demo.py b/examples/trace_export_demo.py new file mode 100644 index 0000000..68ff3c5 --- /dev/null +++ b/examples/trace_export_demo.py @@ -0,0 +1,133 @@ +"""trace_export_demo.py — export action traces for downstream analysis (#94). + +The written contract lives in ``docs/trace_export.md``. This script is the +runnable companion. It shows how to turn the kernel's audit trail into a +stable, redaction-safe JSON shape that an external tool (for example a +LessonWeaver-style lesson-extraction layer) can consume without depending on +agent-kernel internals. + +The demo records two invocations so the export covers both outcomes the +contract distinguishes: + + 1. ``billing.list_invoices`` — a normal READ that **succeeds**. + 2. ``billing.flaky_report`` — a READ whose driver **fails**, producing a + ``status: "failed"`` trace (a *denied* request never reaches invoke, so + it never produces a trace; denials surface via ``explain_denial``). + +It then prints the versioned export envelope, attaching optional human +correction metadata to one trace. Everything is offline and deterministic. 
+ +Run with: ``python examples/trace_export_demo.py`` +""" + +from __future__ import annotations + +import asyncio +import json + +from weaver_kernel import ( + Capability, + CapabilityRegistry, + DriverError, + HMACTokenProvider, + Kernel, + Principal, + SafetyClass, + SensitivityTag, + StaticRouter, + export_action_traces, + make_billing_driver, +) +from weaver_kernel.drivers.base import ExecutionContext +from weaver_kernel.models import CapabilityRequest, ImplementationRef + +_SECRET = "example-secret-do-not-use-in-prod" + + +def _build_kernel() -> Kernel: + capabilities = [ + Capability( + capability_id="billing.list_invoices", + name="List Invoices", + description="List invoices for a customer", + safety_class=SafetyClass.READ, + sensitivity=SensitivityTag.PII, + allowed_fields=["id", "amount", "currency", "status", "date"], + impl=ImplementationRef(driver_id="billing", operation="list_invoices"), + ), + Capability( + capability_id="billing.flaky_report", + name="Flaky Report", + description="A report whose backing service is currently failing", + safety_class=SafetyClass.READ, + impl=ImplementationRef(driver_id="billing", operation="flaky_report"), + ), + ] + registry = CapabilityRegistry() + registry.register_many(capabilities) + + driver = make_billing_driver() + + def flaky_report(ctx: ExecutionContext) -> object: + raise DriverError("reporting backend is unavailable") + + driver.register_handler("flaky_report", flaky_report) + + router = StaticRouter( + routes={ + "billing.list_invoices": ["billing"], + "billing.flaky_report": ["billing"], + } + ) + kernel = Kernel( + registry=registry, + token_provider=HMACTokenProvider(secret=_SECRET), + router=router, + ) + kernel.register_driver(driver) + return kernel + + +async def main() -> None: + kernel = _build_kernel() + principal = Principal( + principal_id="agent-007", + roles=["reader"], + attributes={"tenant": "acme"}, + ) + + # 1. A successful READ — produces a status="succeeded" trace. 
+ list_req = CapabilityRequest(capability_id="billing.list_invoices", goal="list invoices") + list_token = kernel.get_token(list_req, principal, justification="") + ok_frame = await kernel.invoke( + list_token, + principal=principal, + args={"operation": "list_invoices", "status": "paid"}, + ) + print(f"succeeded: action_id={ok_frame.action_id} facts={len(ok_frame.facts)}") + + # 2. A failing READ — produces a status="failed" trace. + flaky_req = CapabilityRequest(capability_id="billing.flaky_report", goal="run report") + flaky_token = kernel.get_token(flaky_req, principal, justification="") + failed_action_id = "" + try: + await kernel.invoke(flaky_token, principal=principal, args={"operation": "flaky_report"}) + except DriverError as exc: + print(f"failed: {exc}") + # The failure was still recorded; grab the most recent trace's id. + failed_action_id = kernel._traces.list_all()[-1].action_id + + # Export everything. Attach an optional human correction to the failed run. + corrections = ( + {failed_action_id: {"corrected_by": "on-call", "note": "known outage; retried later"}} + if failed_action_id + else None + ) + envelope = export_action_traces(kernel._traces.list_all(), corrections=corrections) + + print("\nExported trace envelope:") + print(json.dumps(envelope, indent=2)) + + +if __name__ == "__main__": + asyncio.run(main()) diff --git a/pyproject.toml b/pyproject.toml index 5b0cbaa..404d7bf 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -51,6 +51,7 @@ dev = [ "pytest>=8.0", "pytest-cov>=5.0", "pytest-asyncio>=0.23", + "hypothesis>=6.100", "ruff>=0.4", "mypy>=1.10", "httpx>=0.27", diff --git a/src/weaver_kernel/__init__.py b/src/weaver_kernel/__init__.py index e324f3f..9f358f2 100644 --- a/src/weaver_kernel/__init__.py +++ b/src/weaver_kernel/__init__.py @@ -26,6 +26,7 @@ Handles & traces:: from weaver_kernel import HandleStore, TraceStore + from weaver_kernel import export_action_trace, export_action_traces LLM tool-format adapters:: @@ -140,7 +141,13 @@ 
from .registry import CapabilityRegistry from .router import StaticRouter from .tokens import CapabilityToken, HMACTokenProvider -from .trace import TraceStore +from .trace import ( + TRACE_EXPORT_SCHEMA, + TRACE_EXPORT_VERSION, + TraceStore, + export_action_trace, + export_action_traces, +) # Single source of truth: read the version from the installed distribution # metadata (the PyPI dist name is ``weaver-kernel``, distinct from the import @@ -250,6 +257,11 @@ # stores "HandleStore", "TraceStore", + # trace export (issue #94) + "TRACE_EXPORT_SCHEMA", + "TRACE_EXPORT_VERSION", + "export_action_trace", + "export_action_traces", # adapters "AnthropicMiddleware", "OpenAIMiddleware", diff --git a/src/weaver_kernel/kernel/__init__.py b/src/weaver_kernel/kernel/__init__.py index 07bf004..0fb7353 100644 --- a/src/weaver_kernel/kernel/__init__.py +++ b/src/weaver_kernel/kernel/__init__.py @@ -234,6 +234,7 @@ async def invoke( args=args, response_mode=response_mode, plan=plan, + capability=capability, ) async def invoke_stream( diff --git a/src/weaver_kernel/kernel/_invoke.py b/src/weaver_kernel/kernel/_invoke.py index b8d22ca..dd546e8 100644 --- a/src/weaver_kernel/kernel/_invoke.py +++ b/src/weaver_kernel/kernel/_invoke.py @@ -22,10 +22,12 @@ from typing import TYPE_CHECKING, Any from ..drivers.base import Driver, ExecutionContext +from ..enums import SensitivityTag from ..errors import DriverError from ..firewall.budget_manager import BudgetManager from ..models import ( ActionTrace, + Capability, Frame, Handle, Principal, @@ -150,6 +152,7 @@ def record_failure_trace( response_mode: ResponseMode, error_message: str, trace_store: TraceStore, + sensitivity: SensitivityTag = SensitivityTag.NONE, ) -> None: """Persist an :class:`ActionTrace` for a run where no driver succeeded.""" trace_store.record( @@ -162,6 +165,7 @@ def record_failure_trace( args=_redact_args_for_trace(capability_id, args), response_mode=response_mode, driver_id="", + sensitivity=sensitivity, 
error=error_message, ) ) @@ -179,6 +183,7 @@ def record_success_trace( handle_id: str | None, result_summary: dict[str, Any] | None, trace_store: TraceStore, + sensitivity: SensitivityTag = SensitivityTag.NONE, ) -> None: """Persist an :class:`ActionTrace` for a successful invocation.""" trace_store.record( @@ -191,6 +196,7 @@ def record_success_trace( args=_redact_args_for_trace(capability_id, args), response_mode=response_mode, driver_id=driver_id, + sensitivity=sensitivity, handle_id=handle_id, result_summary=result_summary, ) @@ -205,6 +211,7 @@ async def perform_invoke( args: dict[str, Any], response_mode: ResponseMode, plan: RoutePlan, + capability: Capability, ) -> Frame: """Run the non-dry-run invocation pipeline end-to-end. @@ -221,6 +228,9 @@ async def perform_invoke( args: Driver arguments. response_mode: The caller-requested response mode. plan: The router-resolved :class:`RoutePlan` for *token*. + capability: The resolved :class:`Capability`; its + :attr:`~weaver_kernel.models.Capability.sensitivity` is copied onto + the recorded :class:`ActionTrace`. """ action_id = str(uuid.uuid4()) effective_mode = resolve_effective_mode( @@ -272,6 +282,7 @@ async def perform_invoke( response_mode=response_mode, error_message=err_msg, trace_store=kernel._traces, + sensitivity=capability.sensitivity, ) raise DriverError( f"All drivers failed for capability '{token.capability_id}'. 
Last error: {err_msg}" @@ -316,6 +327,7 @@ async def perform_invoke( handle_id=handle.handle_id if handle else None, result_summary=_frame_result_summary(frame), trace_store=kernel._traces, + sensitivity=capability.sensitivity, ) logger.info( "invoke_success", diff --git a/src/weaver_kernel/kernel/_stream.py b/src/weaver_kernel/kernel/_stream.py index 9691b00..94c3242 100644 --- a/src/weaver_kernel/kernel/_stream.py +++ b/src/weaver_kernel/kernel/_stream.py @@ -53,7 +53,6 @@ async def invoke_stream_impl( response_mode: ResponseMode, ) -> AsyncIterator[Frame]: """Stream Frames for one capability invocation.""" - del capability # currently unused; kept in signature for future hooks. action_id = str(uuid.uuid4()) initial_mode = resolve_effective_mode( response_mode=response_mode, @@ -150,6 +149,7 @@ async def invoke_stream_impl( args=_redact_args_for_trace(token.capability_id, args), response_mode=(last_frame.response_mode if last_frame else initial_mode), driver_id=fallback_driver_id, + sensitivity=capability.sensitivity, handle_id=handle.handle_id if handle else None, result_summary=(_frame_result_summary(last_frame) if last_frame else None), error=None if yielded_any else "stream produced no chunks", diff --git a/src/weaver_kernel/models.py b/src/weaver_kernel/models.py index d511eed..3a4dcca 100644 --- a/src/weaver_kernel/models.py +++ b/src/weaver_kernel/models.py @@ -413,6 +413,14 @@ class ActionTrace: args: dict[str, Any] response_mode: ResponseMode driver_id: str + sensitivity: SensitivityTag = SensitivityTag.NONE + """Sensitivity tag of the invoked capability, copied at record time. + + Lets the audit trail (and the :mod:`~weaver_kernel.trace` export contract) + flag which invocations touched PII/PCI/SECRETS/MEMORY data without a + second registry lookup. Defaults to :attr:`SensitivityTag.NONE` for traces + constructed directly (e.g. in tests) or for non-sensitive capabilities. 
+ """ handle_id: str | None = None error: str | None = None result_summary: dict[str, Any] | None = None diff --git a/src/weaver_kernel/trace.py b/src/weaver_kernel/trace.py index cd94d96..1650c40 100644 --- a/src/weaver_kernel/trace.py +++ b/src/weaver_kernel/trace.py @@ -1,10 +1,113 @@ -"""TraceStore: in-memory audit trail for kernel invocations.""" +"""TraceStore: in-memory audit trail for kernel invocations. + +This module also defines the **stable export contract** for +:class:`~weaver_kernel.models.ActionTrace` records +(:func:`export_action_trace` / :func:`export_action_traces`) so downstream +analysis tools — for example a LessonWeaver-style lesson-extraction layer — +can consume traces without depending on agent-kernel internals. + +The export is intentionally distinct from the OpenTelemetry observability +export in :mod:`weaver_kernel.otel`: OTel emits live spans and metrics for +monitoring, whereas this contract produces a stable, JSON-serialisable audit +record per invocation for offline analysis. See ``docs/trace_export.md``. + +Privacy: the export is derived **only** from already-redaction-safe +:class:`ActionTrace` fields. ``args`` has memory payloads stripped at record +time and ``result_summary`` carries counts/flags only — never raw driver data +— so exporting cannot widen the I-01 firewall boundary or leak sensitive +payloads. The contract adds no field that the trace did not already hold. +""" from __future__ import annotations +from collections.abc import Iterable +from typing import Any + from .errors import AgentKernelError from .models import ActionTrace +# ── Export contract ───────────────────────────────────────────────────────── + +TRACE_EXPORT_SCHEMA = "weaver_kernel.action_trace_export" +"""Stable schema identifier embedded in every exported envelope.""" + +TRACE_EXPORT_VERSION = "1" +"""Schema version of the export envelope. 
Bumped only on a breaking change to +the field shape; new optional fields may be added without a bump.""" + + +def export_action_trace( + trace: ActionTrace, + *, + correction: dict[str, Any] | None = None, +) -> dict[str, Any]: + """Serialise a single :class:`ActionTrace` to the stable export shape. + + The returned dict is JSON-serialisable as long as ``trace.args`` and + ``trace.result_summary`` hold JSON-compatible values (they do for traces + the kernel records). No raw driver output is added: every field is copied + verbatim from the already-redaction-safe trace. + + Args: + trace: The recorded action trace to export. + correction: Optional human-correction metadata to attach (e.g. + ``{"corrected_by": "reviewer", "note": "..."}``). agent-kernel does + not record corrections itself; a downstream tool supplies them at + export time. ``None`` when no correction is available. + + Returns: + A dict with the stable export fields. ``status`` is ``"failed"`` when + the invocation recorded an ``error`` and ``"succeeded"`` otherwise. + (A *denied* request never produces an :class:`ActionTrace` — policy + gates before invocation, per I-02 — so the export only ever describes + authorised invocations; denials surface via + :class:`~weaver_kernel.PolicyDenied` / ``explain_denial``.) 
+ """ + return { + "action_id": trace.action_id, + "capability_id": trace.capability_id, + "principal_id": trace.principal_id, + "token_id": trace.token_id, + "invoked_at": trace.invoked_at.isoformat(), + "response_mode": trace.response_mode, + "driver_id": trace.driver_id, + "handle_id": trace.handle_id, + "sensitivity": trace.sensitivity.value, + "status": "failed" if trace.error is not None else "succeeded", + "error": trace.error, + "args": trace.args, + "result_summary": trace.result_summary, + "correction": correction, + } + + +def export_action_traces( + traces: Iterable[ActionTrace], + *, + corrections: dict[str, dict[str, Any]] | None = None, +) -> dict[str, Any]: + """Export an iterable of traces as a versioned, JSON-serialisable envelope. + + Args: + traces: The action traces to export (e.g. ``TraceStore.list_all()``). + corrections: Optional mapping of ``action_id`` → human-correction + metadata, applied per trace. Entries with no matching trace are + ignored; traces with no entry get ``correction=None``. + + Returns: + A dict ``{"schema", "version", "traces": [...]}`` where each entry is + the result of :func:`export_action_trace`. + """ + corrections = corrections or {} + return { + "schema": TRACE_EXPORT_SCHEMA, + "version": TRACE_EXPORT_VERSION, + "traces": [ + export_action_trace(trace, correction=corrections.get(trace.action_id)) + for trace in traces + ], + } + class TraceStore: """Stores :class:`ActionTrace` records indexed by ``action_id``. diff --git a/tests/test_policy_properties.py b/tests/test_policy_properties.py new file mode 100644 index 0000000..3a26a57 --- /dev/null +++ b/tests/test_policy_properties.py @@ -0,0 +1,370 @@ +"""Property-based invariant tests for the authorization surface (issue #99). 
+ +These tests use Hypothesis to generate varied principals, capabilities, +requests, scopes, constraints, handles, and tokens, then assert that the +core security/audit invariants always hold — the failure modes that +example-based unit tests miss (token-scope confusion, handle expansion +outside the original grant, policy traces leaking raw argument values, etc.). + +Invariants under test (see ``docs/agent-context/invariants.md`` and AGENTS.md): + +* **I-02 — every decision is stable and auditable** + - :func:`test_decision_always_carries_a_stable_reason_code` — an allow + returns ``allowed=True`` with a stable :class:`AllowReason`; a deny + *raises* :class:`PolicyDenied` with a stable :class:`DenialReason` + (so a denied capability never silently yields a grant/token/frame). +* **Constraint integrity** + - :func:`test_max_rows_never_exceeds_policy_cap` + - :func:`test_handle_expand_never_exceeds_grant` — the indirect / + handle-expansion scenario: an expand query never returns more rows or + wider fields than the original grant authorised. +* **I-06 — tokens bind principal + capability + expiry** + - :func:`test_token_never_verifies_outside_its_scope` + - :func:`test_tampered_token_is_always_rejected` +* **Redaction safety (feeds the #94 export contract)** + - :func:`test_policy_trace_never_leaks_raw_scope_values` + - :func:`test_trace_export_is_always_json_serialisable` + +Reproducing failures: on failure Hypothesis prints a minimal falsifying +example plus a ``@reproduce_failure(...)`` decorator, and persists the case in +its example database so the next run replays it first. The rate limiter is +disabled in :func:`_engine` so repeated generated examples do not spuriously +deny on the sliding window. 
+"""
+
+from __future__ import annotations
+
+import datetime
+import json
+import string
+from dataclasses import asdict
+
+import pytest
+from hypothesis import given, settings
+from hypothesis import strategies as st
+
+from weaver_kernel import (
+    ActionTrace,
+    AllowReason,
+    Capability,
+    CapabilityRequest,
+    DefaultPolicyEngine,
+    DenialReason,
+    HandleConstraintViolation,
+    HandleStore,
+    HMACTokenProvider,
+    PolicyDenied,
+    Principal,
+    SafetyClass,
+    SensitivityTag,
+    TokenExpired,
+    TokenInvalid,
+    TokenScopeError,
+    export_action_traces,
+)
+
+# ── Shared strategies & helpers ─────────────────────────────────────────────
+
+_ROLE_POOL = [
+    "reader",
+    "writer",
+    "admin",
+    "service",
+    "pii_reader",
+    "secrets_reader",
+    "memory_writer",
+    "memory_reader_sensitive",
+]
+_ID_ALPHABET = string.ascii_letters + string.digits + "-_"
+_DENIAL_CODES = {code.value for code in DenialReason}
+_ALLOW_CODES = {code.value for code in AllowReason}
+
+# Effectively unlimited rate limits: Hypothesis runs many examples against a
+# fresh engine, and the sliding-window limiter would otherwise deny later
+# examples for reasons unrelated to the property under test.
+_NO_RATE_LIMIT = {sc: (1_000_000, 3600.0) for sc in SafetyClass}
+
+_ids = st.text(alphabet=_ID_ALPHABET, min_size=1, max_size=16)
+_roles = st.lists(st.sampled_from(_ROLE_POOL), unique=True, max_size=5)
+_attributes = st.dictionaries(
+    st.sampled_from(["tenant", "region", "team"]),
+    st.text(alphabet=string.ascii_letters, min_size=1, max_size=8),
+    max_size=3,
+)
+_justifications = st.text(max_size=40)
+_ROW_FIELDS = ["id", "email", "amount", "status"]
+
+
+def _engine() -> DefaultPolicyEngine:
+    """A fresh engine with rate limiting disabled (see module docstring)."""
+    return DefaultPolicyEngine(rate_limits=_NO_RATE_LIMIT)
+
+
+def _read_capability() -> Capability:
+    """A READ capability with no sensitivity — always reaches the allow path."""
+    return Capability(
+        capability_id="cap.read",
+        name="read",
+        description="generated read capability",
+        safety_class=SafetyClass.READ,
+        sensitivity=SensitivityTag.NONE,
+    )
+
+
+@st.composite
+def _principals(draw: st.DrawFn) -> Principal:
+    return Principal(
+        principal_id=draw(_ids),
+        roles=draw(_roles),
+        attributes=draw(_attributes),
+    )
+
+
+@st.composite
+def _capabilities(draw: st.DrawFn) -> Capability:
+    cap_id = draw(_ids)
+    return Capability(
+        capability_id=cap_id,
+        name=cap_id,
+        description="generated capability",
+        safety_class=draw(st.sampled_from(list(SafetyClass))),
+        sensitivity=draw(st.sampled_from(list(SensitivityTag))),
+    )
+
+
+@st.composite
+def _rows(draw: st.DrawFn) -> list[dict[str, object]]:
+    count = draw(st.integers(min_value=0, max_value=12))
+    return [
+        {
+            "id": f"R-{i}",
+            "email": draw(st.text(alphabet=string.ascii_lowercase, min_size=1, max_size=6)),
+            "amount": draw(st.integers(min_value=0, max_value=1000)),
+            "status": draw(st.sampled_from(["paid", "unpaid", "overdue"])),
+        }
+        for i in range(count)
+    ]
+
+
+@st.composite
+def _action_traces(draw: st.DrawFn) -> ActionTrace:
+    has_error = draw(st.booleans())
+    return ActionTrace(
+        action_id=draw(_ids),
+        capability_id=draw(_ids),
+        principal_id=draw(_ids),
+        token_id=draw(_ids),
+        invoked_at=datetime.datetime.now(tz=datetime.timezone.utc),
+        args=draw(st.dictionaries(st.text(min_size=1, max_size=8), st.integers(), max_size=4)),
+        response_mode=draw(st.sampled_from(["summary", "table", "handle_only", "raw"])),
+        driver_id=draw(_ids),
+        sensitivity=draw(st.sampled_from(list(SensitivityTag))),
+        error=draw(st.text(max_size=20)) if has_error else None,
+        result_summary=(
+            None if has_error else {"row_count": draw(st.integers(min_value=0, max_value=100))}
+        ),
+    )
+
+
+# ── I-02: every decision is stable and auditable ────────────────────────────
+
+
+@settings(deadline=None, max_examples=200)
+@given(principal=_principals(), capability=_capabilities(), justification=_justifications)
+def test_decision_always_carries_a_stable_reason_code(
+    principal: Principal, capability: Capability, justification: str
+) -> None:
+    engine = _engine()
+    request = CapabilityRequest(capability_id=capability.capability_id, goal="generated goal")
+    try:
+        decision = engine.evaluate(request, capability, principal, justification=justification)
+    except PolicyDenied as exc:
+        # A denial raises before any token is issued, so a denied capability
+        # can never produce a usable grant or frame. The code must be stable.
+        assert str(exc.reason_code) in _DENIAL_CODES
+        return
+    assert decision.allowed is True
+    assert decision.reason_code is not None
+    assert str(decision.reason_code) in _ALLOW_CODES
+
+
+# ── Constraint integrity ────────────────────────────────────────────────────
+
+
+@settings(deadline=None, max_examples=200)
+@given(
+    principal=_principals(),
+    requested_max_rows=st.one_of(st.none(), st.integers(min_value=-10, max_value=10_000)),
+)
+def test_max_rows_never_exceeds_policy_cap(
+    principal: Principal, requested_max_rows: int | None
+) -> None:
+    engine = _engine()
+    capability = _read_capability()
+    constraints = {} if requested_max_rows is None else {"max_rows": requested_max_rows}
+    request = CapabilityRequest(
+        capability_id=capability.capability_id, goal="g", constraints=constraints
+    )
+    decision = engine.evaluate(request, capability, principal, justification="")
+    cap_limit = 500 if "service" in principal.roles else 50
+    capped = decision.constraints["max_rows"]
+    assert 0 <= capped <= cap_limit
+    if requested_max_rows is not None and requested_max_rows >= 0:
+        assert capped <= requested_max_rows
+
+
+@settings(deadline=None, max_examples=200)
+@given(
+    rows=_rows(),
+    granted_max_rows=st.one_of(st.none(), st.integers(min_value=0, max_value=20)),
+    granted_fields=st.lists(st.sampled_from(_ROW_FIELDS), unique=True, max_size=4),
+    query_limit=st.one_of(st.none(), st.integers(min_value=-5, max_value=30)),
+    query_offset=st.integers(min_value=0, max_value=10),
+    query_fields=st.lists(st.sampled_from(_ROW_FIELDS), unique=True, max_size=4),
+)
+def test_handle_expand_never_exceeds_grant(
+    rows: list[dict[str, object]],
+    granted_max_rows: int | None,
+    granted_fields: list[str],
+    query_limit: int | None,
+    query_offset: int,
+    query_fields: list[str],
+) -> None:
+    store = HandleStore()
+    constraints: dict[str, object] = {}
+    if granted_max_rows is not None:
+        constraints["max_rows"] = granted_max_rows
+    if granted_fields:
+        constraints["allowed_fields"] = granted_fields
+    handle = store.store("cap.read", rows, principal_id="p1", constraints=constraints)
+
+    query: dict[str, object] = {"offset": query_offset}
+    if query_limit is not None:
+        query["limit"] = query_limit
+    if query_fields:
+        query["fields"] = query_fields
+
+    try:
+        frame = store.expand(handle, query=query, principal_id="p1")
+    except HandleConstraintViolation:
+        return  # rejecting the over-broad request is the safe outcome
+
+    preview = frame.table_preview
+    if granted_max_rows is not None:
+        assert len(preview) <= granted_max_rows
+    if granted_fields:
+        for row in preview:
+            assert set(row).issubset(set(granted_fields))
+
+
+# ── I-06: tokens bind principal + capability + expiry ───────────────────────
+
+
+@settings(deadline=None, max_examples=200)
+@given(
+    capability_id=_ids,
+    principal_id=_ids,
+    other_principal_id=_ids,
+    other_capability_id=_ids,
+    ttl=st.one_of(
+        st.integers(min_value=-86_400, max_value=-1),
+        st.integers(min_value=120, max_value=86_400),
+    ),
+)
+def test_token_never_verifies_outside_its_scope(
+    capability_id: str,
+    principal_id: str,
+    other_principal_id: str,
+    other_capability_id: str,
+    ttl: int,
+) -> None:
+    provider = HMACTokenProvider(secret="prop-test-secret")
+    token = provider.issue(capability_id, principal_id, ttl_seconds=ttl)
+
+    if ttl <= 0:
+        with pytest.raises(TokenExpired):
+            provider.verify(
+                token,
+                expected_principal_id=principal_id,
+                expected_capability_id=capability_id,
+            )
+        return
+
+    # In-scope verification of a live token succeeds.
+    provider.verify(
+        token, expected_principal_id=principal_id, expected_capability_id=capability_id
+    )
+    if other_principal_id != principal_id:
+        with pytest.raises(TokenScopeError):
+            provider.verify(
+                token,
+                expected_principal_id=other_principal_id,
+                expected_capability_id=capability_id,
+            )
+    if other_capability_id != capability_id:
+        with pytest.raises(TokenScopeError):
+            provider.verify(
+                token,
+                expected_principal_id=principal_id,
+                expected_capability_id=other_capability_id,
+            )
+
+
+@settings(deadline=None, max_examples=100)
+@given(
+    capability_id=_ids,
+    principal_id=_ids,
+    flip_index=st.integers(min_value=0, max_value=63),
+)
+def test_tampered_token_is_always_rejected(
+    capability_id: str, principal_id: str, flip_index: int
+) -> None:
+    provider = HMACTokenProvider(secret="prop-test-secret")
+    token = provider.issue(capability_id, principal_id, ttl_seconds=3600)
+    sig = token.signature
+    idx = flip_index % len(sig)
+    replacement = "0" if sig[idx] != "0" else "1"
+    token.signature = sig[:idx] + replacement + sig[idx + 1 :]
+    with pytest.raises(TokenInvalid):
+        provider.verify(
+            token, expected_principal_id=principal_id, expected_capability_id=capability_id
+        )
+
+
+# ── Redaction safety (feeds the #94 export contract) ────────────────────────
+
+
+@settings(deadline=None, max_examples=200)
+@given(
+    principal=_principals(),
+    scope_value=st.text(alphabet=string.ascii_letters + string.digits, min_size=4, max_size=16),
+)
+def test_policy_trace_never_leaks_raw_scope_values(principal: Principal, scope_value: str) -> None:
+    engine = _engine()
+    capability = _read_capability()
+    sentinel = f"SCOPEVAL{scope_value}SCOPEVAL"
+    request = CapabilityRequest(
+        capability_id=capability.capability_id,
+        goal="g",
+        scope={"customer_id": sentinel},
+    )
+    decision = engine.evaluate(request, capability, principal, justification="")
+    trace = decision.trace
+    assert trace is not None
+    # The scope *key* is recorded for audit, but its raw *value* must not be.
+    assert "customer_id" in trace.scope_keys
+    serialised = json.dumps(asdict(trace))
+    assert sentinel not in serialised
+
+
+@settings(deadline=None, max_examples=200)
+@given(traces=st.lists(_action_traces(), max_size=6))
+def test_trace_export_is_always_json_serialisable(traces: list[ActionTrace]) -> None:
+    envelope = export_action_traces(traces)
+    assert envelope["version"] == "1"
+    assert len(envelope["traces"]) == len(traces)
+    blob = json.dumps(envelope)  # must not raise
+    assert isinstance(blob, str)
+    for exported, trace in zip(envelope["traces"], traces, strict=True):
+        assert exported["status"] == ("failed" if trace.error is not None else "succeeded")
+        assert exported["sensitivity"] == trace.sensitivity.value
diff --git a/tests/test_trace.py b/tests/test_trace.py
index 2c8a469..803f2ac 100644
--- a/tests/test_trace.py
+++ b/tests/test_trace.py
@@ -1,14 +1,28 @@
-"""Tests for TraceStore."""
+"""Tests for TraceStore and the ActionTrace export contract (issue #94)."""
 
 from __future__ import annotations
 
 import datetime
+import json
 
 import pytest
 
-from weaver_kernel import TraceStore
+from weaver_kernel import (
+    Capability,
+    CapabilityRegistry,
+    HMACTokenProvider,
+    InMemoryDriver,
+    Kernel,
+    Principal,
+    SafetyClass,
+    SensitivityTag,
+    StaticRouter,
+    TraceStore,
+    export_action_trace,
+    export_action_traces,
+)
 from weaver_kernel.errors import AgentKernelError
-from weaver_kernel.models import ActionTrace
+from weaver_kernel.models import ActionTrace, CapabilityRequest
 
 
 def _trace(action_id: str = "act-1") -> ActionTrace:
@@ -63,3 +77,113 @@ def test_result_summary_defaults_none() -> None:
     # (e.g. failure traces, or callers constructing ActionTrace directly) keep
     # it unset rather than fabricating a summary.
     assert _trace("act-default").result_summary is None
+
+
+def test_sensitivity_defaults_none() -> None:
+    assert _trace("act-default").sensitivity is SensitivityTag.NONE
+
+
+# ── Export contract (issue #94) ─────────────────────────────────────────────
+
+
+def test_export_action_trace_success_shape() -> None:
+    trace = ActionTrace(
+        action_id="act-ok",
+        capability_id="billing.list_invoices",
+        principal_id="u1",
+        token_id="tok-1",
+        invoked_at=datetime.datetime(2026, 1, 2, 3, 4, 5, tzinfo=datetime.timezone.utc),
+        args={"operation": "billing.list_invoices"},
+        response_mode="summary",
+        driver_id="billing",
+        sensitivity=SensitivityTag.PII,
+        handle_id="h-1",
+        result_summary={"row_count": 3, "fact_count": 1, "warning_count": 0, "has_handle": True},
+    )
+    exported = export_action_trace(trace)
+    assert exported["action_id"] == "act-ok"
+    assert exported["capability_id"] == "billing.list_invoices"
+    assert exported["invoked_at"] == "2026-01-02T03:04:05+00:00"
+    assert exported["sensitivity"] == "PII"
+    assert exported["status"] == "succeeded"
+    assert exported["error"] is None
+    assert exported["result_summary"]["row_count"] == 3
+    assert exported["correction"] is None
+
+
+def test_export_action_trace_failure_status() -> None:
+    trace = ActionTrace(
+        action_id="act-fail",
+        capability_id="cap.x",
+        principal_id="u1",
+        token_id="tok-1",
+        invoked_at=datetime.datetime.now(tz=datetime.timezone.utc),
+        args={},
+        response_mode="summary",
+        driver_id="",
+        error="All drivers failed",
+    )
+    exported = export_action_trace(trace)
+    assert exported["status"] == "failed"
+    assert exported["error"] == "All drivers failed"
+    assert exported["result_summary"] is None
+
+
+def test_export_action_trace_attaches_correction() -> None:
+    correction = {"corrected_by": "reviewer", "note": "wrong customer"}
+    exported = export_action_trace(_trace("act-corr"), correction=correction)
+    assert exported["correction"] == correction
+
+
+def test_export_envelope_version_and_corrections() -> None:
+    traces = [_trace("act-0"), _trace("act-1")]
+    envelope = export_action_traces(traces, corrections={"act-1": {"note": "flagged"}})
+    assert envelope["schema"] == "weaver_kernel.action_trace_export"
+    assert envelope["version"] == "1"
+    assert [t["action_id"] for t in envelope["traces"]] == ["act-0", "act-1"]
+    assert envelope["traces"][0]["correction"] is None
+    assert envelope["traces"][1]["correction"] == {"note": "flagged"}
+    # A correction for an unknown action_id is simply ignored.
+    json.dumps(envelope)  # must be JSON-serialisable
+
+
+@pytest.mark.asyncio
+async def test_export_redacts_memory_payload_end_to_end() -> None:
+    """A memory payload redacted at record time stays redacted in the export."""
+    cap = Capability(
+        capability_id="memory.read_notes",
+        name="read notes",
+        description="read durable notes",
+        safety_class=SafetyClass.READ,
+        sensitivity=SensitivityTag.MEMORY,
+    )
+    registry = CapabilityRegistry()
+    registry.register(cap)
+    driver = InMemoryDriver(driver_id="mem")
+    driver.register_handler("memory.read_notes", lambda ctx: [{"note": "n1"}])
+    kernel = Kernel(
+        registry=registry,
+        token_provider=HMACTokenProvider(secret="test-secret-do-not-use-in-prod"),
+        router=StaticRouter(routes={"memory.read_notes": ["mem"]}),
+    )
+    kernel.register_driver(driver)
+
+    principal = Principal(principal_id="u1", roles=["reader"])
+    req = CapabilityRequest(capability_id="memory.read_notes", goal="read notes")
+    token = kernel.get_token(req, principal, justification="")
+    secret = "topsecret-PAYLOAD-123"
+    frame = await kernel.invoke(
+        token,
+        principal=principal,
+        args={"operation": "memory.read_notes", "payload": secret},
+    )
+
+    trace = kernel.explain(frame.action_id)
+    assert trace.sensitivity is SensitivityTag.MEMORY
+    assert trace.args["payload"] == "[REDACTED]"
+
+    envelope = export_action_traces(kernel._traces.list_all())
+    exported = envelope["traces"][0]
+    assert exported["sensitivity"] == "MEMORY"
+    assert exported["status"] == "succeeded"
+    assert secret not in json.dumps(envelope)

From b2be0e0ce6fe9ec0efb43fc83403f7fb730839b0 Mon Sep 17 00:00:00 2001
From: Claude
Date: Fri, 5 Jun 2026 18:31:08 +0000
Subject: [PATCH 2/2] fix: address review feedback on PR #115

- docs/trace_export.md: drop the stale `reason` field reference in the
  Stability section (the export shape only has `error`).
- models.py: declare ActionTrace.sensitivity last so adding it does not shift
  the positional __init__ order of pre-existing public fields.
---
 docs/trace_export.md        |  5 +++--
 src/weaver_kernel/models.py | 20 ++++++++++++--------
 2 files changed, 15 insertions(+), 10 deletions(-)

diff --git a/docs/trace_export.md b/docs/trace_export.md
index 3b80d1e..6dc8a47 100644
--- a/docs/trace_export.md
+++ b/docs/trace_export.md
@@ -138,5 +138,6 @@ envelope = export_action_traces(
 
 `TRACE_EXPORT_VERSION` is bumped only on a **breaking** change to the field
 shape. New optional fields may be added without a bump, so consumers should
-ignore unknown keys. Assert on `status`, `sensitivity`, and `reason`/`error`
-rather than on human-readable strings, which may evolve.
+ignore unknown keys. Assert on `status`, `sensitivity`, and the presence of
+`error` rather than on human-readable strings (the `error` text itself may
+evolve).
diff --git a/src/weaver_kernel/models.py b/src/weaver_kernel/models.py
index 3a4dcca..453ba7d 100644
--- a/src/weaver_kernel/models.py
+++ b/src/weaver_kernel/models.py
@@ -413,14 +413,6 @@ class ActionTrace:
     args: dict[str, Any]
     response_mode: ResponseMode
     driver_id: str
-    sensitivity: SensitivityTag = SensitivityTag.NONE
-    """Sensitivity tag of the invoked capability, copied at record time.
-
-    Lets the audit trail (and the :mod:`~weaver_kernel.trace` export contract)
-    flag which invocations touched PII/PCI/SECRETS/MEMORY data without a
-    second registry lookup. Defaults to :attr:`SensitivityTag.NONE` for traces
-    constructed directly (e.g. in tests) or for non-sensitive capabilities.
-    """
     handle_id: str | None = None
     error: str | None = None
     result_summary: dict[str, Any] | None = None
@@ -435,6 +427,18 @@ class ActionTrace:
     == 0``.
     """
 
+    sensitivity: SensitivityTag = SensitivityTag.NONE
+    """Sensitivity tag of the invoked capability, copied at record time.
+
+    Lets the audit trail (and the :mod:`~weaver_kernel.trace` export contract)
+    flag which invocations touched PII/PCI/SECRETS/MEMORY data without a second
+    registry lookup. Defaults to :attr:`SensitivityTag.NONE` for traces
+    constructed directly (e.g. in tests) or for non-sensitive capabilities.
+
+    Declared last so adding it does not shift the positional ``__init__`` order
+    of the pre-existing fields (``ActionTrace`` is part of the public API).
+    """
+
 
 # ── Policy explanation ────────────────────────────────────────────────────────