dgenio · dgenio · Jun 3, 2026 · May 30, 2026 · Jun 3, 2026
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -8,6 +8,24 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ## [Unreleased]
 
 ### Added
+- Two more ecosystem integration cookbooks under `docs/integrations/`, each with
+  a runnable, offline companion wired into `make ci`:
+  - **ChainWeaver compiled flows as capabilities** (#95): a `ChainWeaverDriver`
+    wraps a compiled flow behind the `Driver` protocol so the flow runs through
+    the normal policy/audit pipeline and produces a kernel-visible `ActionTrace`.
+    A flow-step failure is translated into a `DriverError` that preserves the
+    flow id and failing step. ChainWeaver stays an optional dependency. New
+    [`docs/integrations/chainweaver.md`](docs/integrations/chainweaver.md) and
+    [`examples/chainweaver_flow.py`](examples/chainweaver_flow.py).
+  - **Policy guardrails for statistical evaluation artifacts** (#96): a generic,
+    producer-agnostic `assess_artifact()` layer lets an agent summarize an
+    evaluation artifact while gating deployment/rollout recommendations on its
+    support diagnostics (multi-signal: `support_health`, `decision_stable`,
+    `warnings`, `recommendation.intent`). Denied actions are downgraded to a
+    manual-review recommendation whose reason is recorded in `ActionTrace.args`.
+    No statistical estimation is added and no producer dependency is taken. New
+    [`docs/integrations/evaluation_artifacts.md`](docs/integrations/evaluation_artifacts.md)
+    and [`examples/evaluation_artifact_policy.py`](examples/evaluation_artifact_policy.py).
 - `ActionTrace.result_summary` (#93): successful invocations now record a
   redaction-safe summary of the firewalled `Frame` (`fact_count`, `row_count`,
   `warning_count`, `has_handle` — counts/flags only, never raw driver data), so

diff --git a/Makefile b/Makefile
@@ -23,5 +23,7 @@ example:
 	python examples/readme_quickstart.py
 	python examples/contextweaver_policy_flow.py
 	python examples/repository_safety_check.py
+	python examples/chainweaver_flow.py
+	python examples/evaluation_artifact_policy.py
 
 ci: fmt-check lint type test example
diff --git a/README.md b/README.md
@@ -189,6 +189,8 @@ See [docs/agent-context/invariants.md](docs/agent-context/invariants.md) for the
 - [Integrations (MCP, HTTPDriver)](docs/integrations.md)
   - [contextweaver: policy before action](docs/integrations/contextweaver.md)
   - [Repository safety checks as a capability](docs/integrations/repository_safety_check.md)
+  - [ChainWeaver compiled flows as capabilities](docs/integrations/chainweaver.md)
+  - [Policy guardrails for evaluation artifacts](docs/integrations/evaluation_artifacts.md)
 - [Designing capabilities](docs/capabilities.md)
 - [Context Firewall](docs/context_firewall.md)
 

diff --git a/docs/integrations.md b/docs/integrations.md
@@ -377,3 +377,14 @@ projects and external checkers. Each has a runnable, offline companion under
   — gate a high-impact action behind a deterministic check that shells out to a
   local command (e.g. VibeGuard), with the result recorded in the audit trace.
   Companion: [`examples/repository_safety_check.py`](../examples/repository_safety_check.py).
+- [ChainWeaver compiled flows as policy-controlled capabilities](integrations/chainweaver.md)
+  — wrap a ChainWeaver compiled flow behind the `Driver` protocol so it runs
+  through the normal policy/audit pipeline; step failures surface as a
+  `DriverError` that preserves the flow id and failing step. ChainWeaver stays
+  an optional dependency.
+  Companion: [`examples/chainweaver_flow.py`](../examples/chainweaver_flow.py).
+- [Policy guardrails for statistical evaluation artifacts](integrations/evaluation_artifacts.md)
+  — let an agent summarize an evaluation artifact while gating deployment/rollout
+  recommendations on its support diagnostics; the downgrade reason is recorded in
+  the audit trace. Producer-agnostic; no statistical estimation in agent-kernel.
+  Companion: [`examples/evaluation_artifact_policy.py`](../examples/evaluation_artifact_policy.py).
diff --git a/docs/integrations/chainweaver.md b/docs/integrations/chainweaver.md
@@ -0,0 +1,104 @@
+# ChainWeaver compiled flows as policy-controlled capabilities
+
+[ChainWeaver](https://github.com/dgenio/ChainWeaver) is the Weaver-ecosystem
+orchestration layer: it *compiles* a multi-step flow that an agent can run.
+agent-kernel owns authorization, execution, and audit. Wrapping a compiled flow
+as a capability means the flow runs through the normal pipeline
+(policy → token → invoke → firewall → trace) instead of as an out-of-band side
+channel — so a flow invocation is policy-checked and auditable like any other
+capability (weaver-spec **I-02**).
+
+This page describes the pattern. The runnable companion is
+[`examples/chainweaver_flow.py`](../../examples/chainweaver_flow.py), which is
+deterministic, offline, and depends on no ChainWeaver package.
+
+> ChainWeaver is **not** a required dependency. The adapter relies only on a
+> compiled flow exposing a `run(inputs)` method, a `flow_id` attribute, and named
+> `steps`. The example ships tiny `CompiledFlow` / `FlowExecutionError` stand-ins
+> so it runs in CI; in production you pass a real compiled flow to `ChainWeaverDriver`.
+
+## The pattern
+
+```
+agent invokes flows.summarize_release
+        │
+        ▼
+ChainWeaverDriver.execute()  →  compiled_flow.run(inputs)
+        │                              │
+        │                              ├─ all steps ok → RawResult → Frame + ActionTrace
+        │                              └─ step raises  → FlowExecutionError
+        ▼                                                     │
+   DriverError (flow id + failing step preserved) ◄───────────┘
+```
+
+| Component | Role |
+|---|---|
+| `CompiledFlow` | A ChainWeaver compiled flow: ordered, named steps run over a shared context. |
+| `ChainWeaverDriver` | Implements the `Driver` protocol; maps a capability operation to a flow and runs it. |
+| `flows.summarize_release` | A `READ` capability whose implementation is the compiled flow. |
+
+## The adapter
+
+`ChainWeaverDriver` implements the `Driver` protocol and runs the flow bound to
+the capability's operation:
+
+```python
+class ChainWeaverDriver:
+    def __init__(self, flows: dict[str, CompiledFlow], *, driver_id: str = "chainweaver") -> None:
+        self._flows = dict(flows)
+        self._driver_id = driver_id
+
+    @property
+    def driver_id(self) -> str:
+        return self._driver_id
+
+    async def execute(self, ctx: ExecutionContext) -> RawResult:
+        operation = str(ctx.args.get("operation", ctx.capability_id))
+        flow = self._flows.get(operation)
+        if flow is None:
+            raise DriverError(f"... no flow for operation='{operation}'.")
+        inputs = {k: v for k, v in ctx.args.items() if k != "operation"}
+        try:
+            output = flow.run(inputs)
+        except FlowExecutionError as exc:
+            raise DriverError(
+                f"ChainWeaver flow '{exc.flow_id}' failed at step '{exc.step}': {exc.cause}"
+            ) from exc
+        return RawResult(capability_id=ctx.capability_id, data=output,
+                         metadata={"flow_id": flow.flow_id, "steps": [s.name for s in flow.steps]})
+```
+
+## Errors preserve ChainWeaver context
+
+A real ChainWeaver flow raises its own exception when a step fails. The adapter
+**translates** that native error into a kernel `DriverError`, carrying the flow
+id and the failing step into the message rather than leaking a raw traceback.
+`Kernel.invoke()` then wraps it as
+`All drivers failed for capability '...'. Last error: ChainWeaver flow '<flow_id>'
+failed at step '<step>': <cause>`, so the orchestration context survives for the
+caller — and a failed run still records an `ActionTrace` (with `error` set), so
+I-02 holds even on failure.
+
+## Audit trail
+
+A successful invocation is recorded like any capability:
+
+```python
+action_id = await run_flow(kernel, principal, {"release": "v1.4.0", "changes": [...]})
+trace = kernel.explain(action_id)
+# trace.driver_id == "chainweaver"
+# trace.result_summary == {"fact_count": ..., "row_count": ..., "warning_count": ..., "has_handle": ...}
+```
+
+## Non-goals
+
+- agent-kernel does not compile flows — that is ChainWeaver's job.
+- ChainWeaver is never a required dependency.
+- Wrapping a flow does not bypass policy: the flow capability is granted and
+  invoked through the same pipeline as every other capability.
+
+## Related
+
+- `examples/chainweaver_flow.py` — runnable, offline.
+- [ChainWeaver](https://github.com/dgenio/ChainWeaver)
+- [weaver-spec](https://github.com/dgenio/weaver-spec)
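
The `CompiledFlow` / `FlowExecutionError` stand-ins that `docs/integrations/chainweaver.md` mentions are not shown in this diff. A minimal sketch of what they might look like, together with the error translation the adapter performs, is below. Everything here is an assumption made for illustration: the `Step` type, the `run_with_translation` helper, the local `DriverError` class (a plain exception standing in for agent-kernel's), and the two step functions are hypothetical and are not taken from `examples/chainweaver_flow.py`.

```python
from dataclasses import dataclass
from typing import Any, Callable


class DriverError(Exception):
    """Stand-in for agent-kernel's DriverError (assumed to be a plain exception)."""


class FlowExecutionError(Exception):
    """Hypothetical stand-in for the flow's native step-failure error."""

    def __init__(self, flow_id: str, step: str, cause: BaseException) -> None:
        super().__init__(f"{flow_id}/{step}: {cause}")
        self.flow_id = flow_id
        self.step = step
        self.cause = cause


@dataclass(frozen=True)
class Step:
    """A named step: takes the shared context, returns keys to merge into it."""

    name: str
    fn: Callable[[dict[str, Any]], dict[str, Any]]


@dataclass(frozen=True)
class CompiledFlow:
    """Ordered, named steps run over a shared context (illustrative only)."""

    flow_id: str
    steps: tuple[Step, ...]

    def run(self, inputs: dict[str, Any]) -> dict[str, Any]:
        context = dict(inputs)  # copy so the caller's dict is not mutated
        for step in self.steps:
            try:
                context.update(step.fn(context))
            except Exception as exc:
                # Record which flow and which step failed.
                raise FlowExecutionError(self.flow_id, step.name, exc) from exc
        return context


def run_with_translation(flow: CompiledFlow, inputs: dict[str, Any]) -> dict[str, Any]:
    """Run a flow, translating its native error into a DriverError."""
    try:
        return flow.run(inputs)
    except FlowExecutionError as exc:
        raise DriverError(
            f"ChainWeaver flow '{exc.flow_id}' failed at step '{exc.step}': {exc.cause}"
        ) from exc


def _require_changes(ctx: dict[str, Any]) -> dict[str, Any]:
    if not ctx["changes"]:
        raise ValueError("no changes to summarize")
    return {}


def _count_changes(ctx: dict[str, Any]) -> dict[str, Any]:
    return {"change_count": len(ctx["changes"])}


FLOW = CompiledFlow(
    flow_id="summarize_release",
    steps=(Step("validate", _require_changes), Step("count", _count_changes)),
)
```

Chaining with `from exc` keeps the native error reachable through `__cause__` while the message itself carries only the flow id, the failing step, and the cause, which matches the message format shown in the adapter snippet.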
diff --git a/docs/integrations/evaluation_artifacts.md b/docs/integrations/evaluation_artifacts.md
@@ -0,0 +1,105 @@
+# Policy guardrails for statistical evaluation artifacts
+
+Agents increasingly consume *evaluation artifacts* — structured reports such as
+an offline policy-evaluation result (e.g. from
+[`skdr-eval`](https://github.com/dgenio/skdr-eval)). These artifacts are easy to
+misuse: an agent that reads a favorable headline estimate and recommends
+deployment, while ignoring the support diagnostics, uncertainty, and warnings,
+turns a *caveated* result into an *unconditional* action.
+
+agent-kernel already separates **reading** from **acting** through safety
+classes. This page adds a small, generic policy layer on top: an agent may
+always *summarize* an artifact (with caveats), but recommending **deployment**
+or **automatic rollout** is gated on the artifact's diagnostics.
+
+The runnable companion is
+[`examples/evaluation_artifact_policy.py`](../../examples/evaluation_artifact_policy.py),
+which is deterministic, offline, and uses fixture artifacts.
+
+> agent-kernel does **not** implement offline policy evaluation or any
+> statistical estimation, and takes no dependency on a specific producer. The
+> policy reads documented fields off a plain dict artifact, so it works for any
+> producer — not just `skdr-eval`.
+
+## Summarizing evidence vs. acting on evidence
+
+This is the distinction the guardrail enforces:
+
+| Action | Capability | Safety class | Gated? |
+|---|---|---|---|
+| Summarize the artifact and its caveats | `eval.summarize_artifact` | `READ` | No — always allowed. |
+| Recommend deployment / rollout | `eval.recommend_deployment` | `WRITE` | Yes — only when diagnostics are healthy. |
+| Recommend manual review / better logs | `eval.recommend_manual_review` | `WRITE` | The downgrade target when deployment is denied. |
+
+Summarizing a high-risk result is fine — it informs the human. Recommending
+deployment *as if the result were reliable* is what the policy blocks.
+
+## The generic assessment
+
+`assess_artifact(artifact)` is producer-agnostic. It inspects documented fields
+and returns stable decision codes:
+
+```python
+decision = assess_artifact(artifact)
+# decision.allowed_actions  → e.g. ("allow_summary", "allow_manual_review_recommendation", ...)
+# decision.denied_actions   → e.g. ("deny_deployment_recommendation", "deny_automatic_rollout")
+# decision.reasons          → e.g. ("support_health=high_risk", "decision is not stable")
+# decision.allows_deployment → bool gate the host branches on
+```
+
+Fields inspected (all optional; missing fields default to the safest reading):
+
+| Field | Meaning |
+|---|---|
+| `support_health` | `"ok"` / `"caution"` / `"high_risk"`. |
+| `decision_stable` | Whether the comparison is robust to reasonable perturbation. |
+| `warnings` | Producer warnings (e.g. low ESS, poor overlap). |
+| `recommendation.intent` | The artifact's own steer (`"deploy"`, `"hold"`, …). |
+| `uncertainty` / `limitations` | Surfaced as caveats in the summary. |
+
+Deployment is permitted **only** when all four signals agree: `support_health`
+is `ok`, the decision is stable, there are no warnings, and the artifact does
+not itself recommend holding. This is deliberately *not* a single-metric gate:
+a good point estimate with poor support is still blocked.
+
+| `support_health` | Deployment | Outcome |
+|---|---|---|
+| `ok` (stable, no warnings) | allowed | `allow_summary` + deployment recommendation |
+| `caution` | denied | downgraded to manual-review recommendation |
+| `high_risk` | denied | downgraded + `require_human_review` |
+
+## Audit trail records why
+
+When deployment is denied, the host does not grant `eval.recommend_deployment`;
+instead it invokes `eval.recommend_manual_review` with the reasons in the call
+args. Because the kernel only redacts args for `memory.`-prefixed capability ids,
+these `eval.*` capabilities keep their args in `ActionTrace.args`, so the audit
+trace records *why* the action was downgraded:
+
+```python
+capability_id, action_id = await act_on_artifact(kernel, principal, artifact)
+trace = kernel.explain(action_id)
+# capability_id == "eval.recommend_manual_review"
+# trace.args["reason"] == "support_health=high_risk; decision is not stable; 2 warning(s): ..."
+# trace.args["downgraded_from"] == "recommend_deployment"
+```
+
+## Non-goals
+
+- No OPE / statistical estimation in agent-kernel.
+- No hard dependency on `skdr-eval` or any producer.
+- No decision based on a single numeric metric.
+
+## Aligning with `weaver-spec`
+
+If `weaver-spec` publishes a formal `EvaluationArtifact` contract, the field
+names read by `assess_artifact` should be aligned to it; the decision codes here
+(`allow_summary`, `allow_manual_review_recommendation`, `require_human_review`,
+`deny_deployment_recommendation`, `deny_automatic_rollout`) are intended to be a
+stable, producer-neutral vocabulary in the meantime.
+
+## Related
+
+- `examples/evaluation_artifact_policy.py` — runnable, offline.
+- [skdr-eval](https://github.com/dgenio/skdr-eval)
+- [weaver-spec](https://github.com/dgenio/weaver-spec)
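
The multi-signal gate and the downgrade path described in `docs/integrations/evaluation_artifacts.md` can be sketched as follows. This is not the `assess_artifact()` from `examples/evaluation_artifact_policy.py`, which this diff does not show; `Decision`, `assess_artifact_sketch`, and `plan_action` are hypothetical names. Several choices are assumptions: a missing `support_health` is read as `high_risk`, a missing `warnings` list or `recommendation.intent` blocks deployment, and `require_human_review` is placed in `allowed_actions`.

```python
from dataclasses import dataclass
from typing import Any, Mapping


@dataclass(frozen=True)
class Decision:
    allowed_actions: tuple[str, ...]
    denied_actions: tuple[str, ...]
    reasons: tuple[str, ...]

    @property
    def allows_deployment(self) -> bool:
        return "deny_deployment_recommendation" not in self.denied_actions


def assess_artifact_sketch(artifact: Mapping[str, Any]) -> Decision:
    """Illustrative multi-signal gate over a plain-dict artifact."""
    # Assumption: absent fields take the most conservative reading.
    health = artifact.get("support_health", "high_risk")
    stable = artifact.get("decision_stable", False)
    warnings = artifact.get("warnings")
    intent = (artifact.get("recommendation") or {}).get("intent", "hold")

    reasons: list[str] = []
    if health != "ok":
        reasons.append(f"support_health={health}")
    if stable is not True:
        reasons.append("decision is not stable")
    if warnings is None:
        reasons.append("warnings field missing")
    elif warnings:
        reasons.append(f"{len(warnings)} warning(s): " + ", ".join(map(str, warnings)))
    if intent == "hold":
        reasons.append("artifact recommends hold")

    # Summaries and manual-review recommendations are always permitted.
    allowed = ["allow_summary", "allow_manual_review_recommendation"]
    denied: list[str] = []
    if reasons:  # any single failing signal blocks deployment
        denied += ["deny_deployment_recommendation", "deny_automatic_rollout"]
        if health == "high_risk":
            allowed.append("require_human_review")  # placement is a guess
    return Decision(tuple(allowed), tuple(denied), tuple(reasons))


def plan_action(artifact: Mapping[str, Any]) -> tuple[str, dict[str, str]]:
    """Pick the capability to invoke and the args that would land in the trace."""
    decision = assess_artifact_sketch(artifact)
    if decision.allows_deployment:
        return "eval.recommend_deployment", {}
    return "eval.recommend_manual_review", {
        "reason": "; ".join(decision.reasons),
        "downgraded_from": "recommend_deployment",
    }
```

For a `high_risk` artifact with an unstable decision and two warnings, `plan_action` returns `eval.recommend_manual_review` with a `reason` string that lists each failing signal, which is the kind of value the doc shows under `trace.args["reason"]`.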