Merged
18 changes: 18 additions & 0 deletions CHANGELOG.md
@@ -8,6 +8,24 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
## [Unreleased]

### Added
- Two more ecosystem integration cookbooks under `docs/integrations/`, each with
  a runnable, offline companion wired into `make ci`:
  - **ChainWeaver compiled flows as capabilities** (#95): a `ChainWeaverDriver`
    wraps a compiled flow behind the `Driver` protocol so the flow runs through
    the normal policy/audit pipeline and produces a kernel-visible `ActionTrace`.
    A flow-step failure is translated into a `DriverError` that preserves the
    flow id and failing step. ChainWeaver stays an optional dependency. New
    [`docs/integrations/chainweaver.md`](docs/integrations/chainweaver.md) and
    [`examples/chainweaver_flow.py`](examples/chainweaver_flow.py).
  - **Policy guardrails for statistical evaluation artifacts** (#96): a generic,
    producer-agnostic `assess_artifact()` layer lets an agent summarize an
    evaluation artifact while gating deployment/rollout recommendations on its
    support diagnostics (multi-signal: `support_health`, `decision_stable`,
    `warnings`, `recommendation.intent`). Denied actions are downgraded to a
    manual-review recommendation whose reason is recorded in `ActionTrace.args`.
    No statistical estimation is added and no producer dependency is taken. New
    [`docs/integrations/evaluation_artifacts.md`](docs/integrations/evaluation_artifacts.md)
    and [`examples/evaluation_artifact_policy.py`](examples/evaluation_artifact_policy.py).
- `ActionTrace.result_summary` (#93): successful invocations now record a
redaction-safe summary of the firewalled `Frame` (`fact_count`, `row_count`,
`warning_count`, `has_handle` — counts/flags only, never raw driver data), so
2 changes: 2 additions & 0 deletions Makefile
@@ -23,5 +23,7 @@ example:
	python examples/readme_quickstart.py
	python examples/contextweaver_policy_flow.py
	python examples/repository_safety_check.py
	python examples/chainweaver_flow.py
	python examples/evaluation_artifact_policy.py

ci: fmt-check lint type test example
2 changes: 2 additions & 0 deletions README.md
@@ -189,6 +189,8 @@ See [docs/agent-context/invariants.md](docs/agent-context/invariants.md) for the
- [Integrations (MCP, HTTPDriver)](docs/integrations.md)
- [contextweaver: policy before action](docs/integrations/contextweaver.md)
- [Repository safety checks as a capability](docs/integrations/repository_safety_check.md)
- [ChainWeaver compiled flows as capabilities](docs/integrations/chainweaver.md)
- [Policy guardrails for evaluation artifacts](docs/integrations/evaluation_artifacts.md)
- [Designing capabilities](docs/capabilities.md)
- [Context Firewall](docs/context_firewall.md)

11 changes: 11 additions & 0 deletions docs/integrations.md
@@ -377,3 +377,14 @@ projects and external checkers. Each has a runnable, offline companion under
— gate a high-impact action behind a deterministic check that shells out to a
local command (e.g. VibeGuard), with the result recorded in the audit trace.
Companion: [`examples/repository_safety_check.py`](../examples/repository_safety_check.py).
- [ChainWeaver compiled flows as policy-controlled capabilities](integrations/chainweaver.md)
— wrap a ChainWeaver compiled flow behind the `Driver` protocol so it runs
through the normal policy/audit pipeline; step failures surface as a
`DriverError` that preserves the flow id and failing step. ChainWeaver stays
an optional dependency.
Companion: [`examples/chainweaver_flow.py`](../examples/chainweaver_flow.py).
- [Policy guardrails for statistical evaluation artifacts](integrations/evaluation_artifacts.md)
— let an agent summarize an evaluation artifact while gating deployment/rollout
recommendations on its support diagnostics; the downgrade reason is recorded in
the audit trace. Producer-agnostic; no statistical estimation in agent-kernel.
Companion: [`examples/evaluation_artifact_policy.py`](../examples/evaluation_artifact_policy.py).
104 changes: 104 additions & 0 deletions docs/integrations/chainweaver.md
@@ -0,0 +1,104 @@
# ChainWeaver compiled flows as policy-controlled capabilities

[ChainWeaver](https://github.com/dgenio/ChainWeaver) is the Weaver-ecosystem
orchestration layer: it *compiles* a multi-step flow that an agent can run.
agent-kernel owns authorization, execution, and audit. Wrapping a compiled flow
as a capability means the flow runs through the normal pipeline
(policy → token → invoke → firewall → trace) instead of as an out-of-band side
channel — so a flow invocation is policy-checked and auditable like any other
capability (weaver-spec **I-02**).

This page describes the pattern. The runnable companion is
[`examples/chainweaver_flow.py`](../../examples/chainweaver_flow.py), which is
deterministic, offline, and depends on no ChainWeaver package.

> ChainWeaver is **not** a required dependency. The adapter relies on a
> compiled flow exposing a `run(inputs)` method and a `flow_id` attribute (the
> adapter on this page also reads `steps` to fill in result metadata), not on
> any ChainWeaver import. The example ships tiny `CompiledFlow` /
> `FlowExecutionError` stand-ins so it runs in CI; in production you pass a
> real compiled flow to `ChainWeaverDriver`.

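The structural contract in the note above can be written down as a `typing.Protocol`, so type hints for the adapter need no ChainWeaver import. This is a minimal sketch: `FlowLike` and `EchoFlow` are hypothetical names used for illustration and are not part of agent-kernel or ChainWeaver. It covers only `flow_id` and `run(inputs)`.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class FlowLike(Protocol):
    """Hypothetical structural type: anything with flow_id and run(inputs)."""

    flow_id: str

    def run(self, inputs: dict[str, Any]) -> dict[str, Any]: ...


class EchoFlow:
    """Hypothetical stand-in flow; it imports nothing from ChainWeaver."""

    flow_id = "echo"

    def run(self, inputs: dict[str, Any]) -> dict[str, Any]:
        return {"echoed": dict(inputs)}


flow = EchoFlow()
# A runtime_checkable isinstance() check only verifies that the members
# exist; it does not validate their signatures or types.
print(isinstance(flow, FlowLike))
print(flow.run({"release": "v1.4.0"}))
```

Because the check is structural, a real compiled flow and a test stand-in can both satisfy the same annotation without sharing a base class.
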
## The pattern

```
agent invokes flows.summarize_release
ChainWeaverDriver.execute() → compiled_flow.run(inputs)
│                             │
│                             ├─ all steps ok → RawResult → Frame + ActionTrace
│                             └─ step raises → FlowExecutionError
▼                                                          │
DriverError (flow id + failing step preserved) ◄───────────┘
```

| Component | Role |
|---|---|
| `CompiledFlow` | A ChainWeaver compiled flow: ordered, named steps run over a shared context. |
| `ChainWeaverDriver` | Implements the `Driver` protocol; maps a capability operation to a flow and runs it. |
| `flows.summarize_release` | A `READ` capability whose implementation is the compiled flow. |

## The adapter

`ChainWeaverDriver` implements the `Driver` protocol and runs the flow bound to
the capability's operation:

```python
class ChainWeaverDriver:
    """Driver that runs the compiled flow bound to a capability's operation."""

    def __init__(self, flows: dict[str, CompiledFlow], *, driver_id: str = "chainweaver") -> None:
        self._flows = dict(flows)  # defensive copy: operation name -> compiled flow
        self._driver_id = driver_id

    @property
    def driver_id(self) -> str:
        return self._driver_id

    async def execute(self, ctx: ExecutionContext) -> RawResult:
        # Fall back to the capability id when no explicit operation is passed.
        operation = str(ctx.args.get("operation", ctx.capability_id))
        flow = self._flows.get(operation)
        if flow is None:
            raise DriverError(f"... no flow for operation='{operation}'.")
        # Everything except the routing key is forwarded as flow input.
        inputs = {k: v for k, v in ctx.args.items() if k != "operation"}
        try:
            output = flow.run(inputs)
        except FlowExecutionError as exc:
            # Translate the native error, keeping the flow id and failing step.
            raise DriverError(
                f"ChainWeaver flow '{exc.flow_id}' failed at step '{exc.step}': {exc.cause}"
            ) from exc
        return RawResult(
            capability_id=ctx.capability_id,
            data=output,
            metadata={"flow_id": flow.flow_id, "steps": [s.name for s in flow.steps]},
        )
```

## Errors preserve ChainWeaver context

A real ChainWeaver flow raises its own exception when a step fails. The adapter
**translates** that native error into a kernel `DriverError`, carrying the flow
id and the failing step into the message rather than leaking a raw traceback.
`Kernel.invoke()` then wraps it as
`All drivers failed for capability '...'. Last error: ChainWeaver flow '<flow_id>'
failed at step '<step>': <cause>`, so the orchestration context survives for the
caller. A failed run still records an `ActionTrace` (with `error` set), so
I-02 holds even on failure.

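The translation step can be exercised in isolation. The sketch below uses hypothetical stand-ins: `FlowExecutionError`, `DriverError`, `failing_flow`, and `run_translated` are defined locally for illustration, and the real classes in ChainWeaver and agent-kernel may differ. It shows that `raise ... from exc` keeps the native error reachable as `__cause__` while the message carries the flow id and the failing step.

```python
class FlowExecutionError(Exception):
    """Hypothetical stand-in for ChainWeaver's native step failure."""

    def __init__(self, flow_id: str, step: str, cause: Exception) -> None:
        super().__init__(f"{flow_id}/{step}: {cause}")
        self.flow_id, self.step, self.cause = flow_id, step, cause


class DriverError(Exception):
    """Hypothetical stand-in for the kernel's DriverError."""


def failing_flow(inputs: dict) -> dict:
    # Simulate a flow whose "render" step raises.
    raise FlowExecutionError("summarize_release", "render", ValueError("no changes"))


def run_translated(inputs: dict) -> dict:
    try:
        return failing_flow(inputs)
    except FlowExecutionError as exc:
        # Carry flow id and step into the message; chain the original error.
        raise DriverError(
            f"ChainWeaver flow '{exc.flow_id}' failed at step '{exc.step}': {exc.cause}"
        ) from exc


try:
    run_translated({"release": "v1.4.0"})
except DriverError as err:
    print(err)                  # message with flow id and step
    print(type(err.__cause__))  # the native error is still attached
```

Chaining with `from exc` means a debugger or log handler can still reach the original exception, even though callers only need to handle `DriverError`.
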
## Audit trail

A successful invocation is recorded like any capability:

```python
action_id = await run_flow(kernel, principal, {"release": "v1.4.0", "changes": [...]})
trace = kernel.explain(action_id)
# trace.driver_id == "chainweaver"
# trace.result_summary == {"fact_count": ..., "row_count": ..., "warning_count": ..., "has_handle": ...}
```

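To illustrate why a counts-and-flags summary is safe to store in a trace, here is a hypothetical sketch. `FrameLike` and `summarize` are illustrative names, and the `facts` / `rows` / `warnings` / `handle` fields are assumptions about what a firewalled frame might hold. Only the four summary keys come from this page.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class FrameLike:
    """Hypothetical frame shape; the field names are assumptions."""

    facts: list[str] = field(default_factory=list)
    rows: list[dict[str, Any]] = field(default_factory=list)
    warnings: list[str] = field(default_factory=list)
    handle: Optional[str] = None


def summarize(frame: FrameLike) -> dict[str, Any]:
    # Only counts and one boolean flag leave this function; no raw values do.
    return {
        "fact_count": len(frame.facts),
        "row_count": len(frame.rows),
        "warning_count": len(frame.warnings),
        "has_handle": frame.handle is not None,
    }


frame = FrameLike(facts=["v1.4.0 ships 3 changes"], rows=[{"id": 1}, {"id": 2}])
summary = summarize(frame)
print(summary)
```
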
## Non-goals

- agent-kernel does not compile flows — that is ChainWeaver's job.
- ChainWeaver is never a required dependency.
- Wrapping a flow does not bypass policy: the flow capability is granted and
invoked through the same pipeline as every other capability.

## Related

- `examples/chainweaver_flow.py` — runnable, offline.
- [ChainWeaver](https://github.com/dgenio/ChainWeaver)
- [weaver-spec](https://github.com/dgenio/weaver-spec)
105 changes: 105 additions & 0 deletions docs/integrations/evaluation_artifacts.md
@@ -0,0 +1,105 @@
# Policy guardrails for statistical evaluation artifacts

Agents increasingly consume *evaluation artifacts* — structured reports such as
an offline policy-evaluation result (e.g. from
[`skdr-eval`](https://github.com/dgenio/skdr-eval)). These artifacts are easy to
misuse: an agent that reads a favorable headline estimate and recommends
deployment, while ignoring the support diagnostics, uncertainty, and warnings,
turns a *caveated* result into an *unconditional* action.

agent-kernel already separates **reading** from **acting** through safety
classes. This page adds a small, generic policy layer on top: an agent may
always *summarize* an artifact (with caveats), but recommending **deployment**
or **automatic rollout** is gated on the artifact's diagnostics.

The runnable companion is
[`examples/evaluation_artifact_policy.py`](../../examples/evaluation_artifact_policy.py),
which is deterministic, offline, and uses fixture artifacts.

> agent-kernel does **not** implement offline policy evaluation or any
> statistical estimation, and takes no dependency on a specific producer. The
> policy reads documented fields off a plain dict artifact, so it works for any
> producer — not just `skdr-eval`.

## Summarizing evidence vs. acting on evidence

This is the distinction the guardrail enforces:

| Action | Capability | Safety class | Gated? |
|---|---|---|---|
| Summarize the artifact and its caveats | `eval.summarize_artifact` | `READ` | No — always allowed. |
| Recommend deployment / rollout | `eval.recommend_deployment` | `WRITE` | Yes — only when diagnostics are healthy. |
| Recommend manual review / better logs | `eval.recommend_manual_review` | `WRITE` | The downgrade target when deployment is denied. |

Summarizing a high-risk result is fine — it informs the human. Recommending
deployment *as if the result were reliable* is what the policy blocks.

## The generic assessment

`assess_artifact(artifact)` is producer-agnostic. It inspects documented fields
and returns stable decision codes:

```python
decision = assess_artifact(artifact)
# decision.allowed_actions → e.g. ("allow_summary", "allow_manual_review_recommendation", ...)
# decision.denied_actions → e.g. ("deny_deployment_recommendation", "deny_automatic_rollout")
# decision.reasons → e.g. ("support_health=high_risk", "decision is not stable")
# decision.allows_deployment → bool gate the host branches on
```

Fields inspected (all optional; missing fields default to the safest reading):

| Field | Meaning |
|---|---|
| `support_health` | `"ok"` / `"caution"` / `"high_risk"`. |
| `decision_stable` | Whether the comparison is robust to reasonable perturbation. |
| `warnings` | Producer warnings (e.g. low ESS, poor overlap). |
| `recommendation.intent` | The artifact's own steer (`"deploy"`, `"hold"`, …). |
| `uncertainty` / `limitations` | Surfaced as caveats in the summary. |

Deployment is permitted **only** when several signals agree: `support_health`
is `ok`, the decision is stable, there are no warnings, and the artifact does
not itself recommend holding. This is deliberately *not* a single-metric gate —
a good point estimate with poor support is still blocked.

| `support_health` | Deployment | Outcome |
|---|---|---|
| `ok` (stable, no warnings) | allowed | `allow_summary` + deployment recommendation |
| `caution` | denied | downgraded to manual-review recommendation |
| `high_risk` | denied | downgraded + `require_human_review` |

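The multi-signal rule above can be sketched as a pure function over a plain dict. This is a hypothetical reimplementation for illustration, not the code in `examples/evaluation_artifact_policy.py`. The `Decision` dataclass, its `required_actions` field, the exact reason strings, and the default chosen for each missing field are all assumptions.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Decision:
    """Hypothetical result type; `required_actions` is an assumed field."""

    allowed_actions: tuple[str, ...]
    denied_actions: tuple[str, ...]
    reasons: tuple[str, ...]
    required_actions: tuple[str, ...] = ()

    @property
    def allows_deployment(self) -> bool:
        return "deny_deployment_recommendation" not in self.denied_actions


def assess_artifact(artifact: dict[str, Any]) -> Decision:
    # Assumed defaults: an absent diagnostic is read unfavorably.
    health = artifact.get("support_health", "high_risk")
    stable = artifact.get("decision_stable", False)
    warnings = list(artifact.get("warnings") or [])  # assumed: absent means none
    intent = (artifact.get("recommendation") or {}).get("intent", "hold")

    reasons: list[str] = []
    if health != "ok":
        reasons.append(f"support_health={health}")
    if stable is not True:
        reasons.append("decision is not stable")
    if warnings:
        reasons.append(f"{len(warnings)} warning(s): " + ", ".join(map(str, warnings)))
    if intent == "hold":
        reasons.append("artifact recommends hold")

    # Any single failing signal is enough to deny deployment and rollout.
    denied = ("deny_deployment_recommendation", "deny_automatic_rollout") if reasons else ()
    required = ("require_human_review",) if health == "high_risk" else ()
    return Decision(
        allowed_actions=("allow_summary", "allow_manual_review_recommendation"),
        denied_actions=denied,
        reasons=tuple(reasons),
        required_actions=required,
    )


healthy = {"support_health": "ok", "decision_stable": True, "warnings": [],
           "recommendation": {"intent": "deploy"}}
risky = {"support_health": "high_risk", "decision_stable": False,
         "warnings": ["low ESS", "poor overlap"],
         "recommendation": {"intent": "deploy"}}
print(assess_artifact(healthy).allows_deployment)
print(assess_artifact(risky).reasons)
```

Note that `risky` carries a favorable `intent` of `"deploy"` and is still denied, which is the point of not gating on a single signal.
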
## Audit trail records why

When deployment is denied, the host does not grant `eval.recommend_deployment`;
instead it invokes `eval.recommend_manual_review` with the reasons in the call
args. Because the kernel only redacts args for `memory.`-prefixed capability ids,
these `eval.*` capabilities keep their args in `ActionTrace.args`, so the audit
trace records *why* the action was downgraded:

```python
capability_id, action_id = await act_on_artifact(kernel, principal, artifact)
trace = kernel.explain(action_id)
# capability_id == "eval.recommend_manual_review"
# trace.args["reason"] == "support_health=high_risk; decision is not stable; 2 warning(s): ..."
# trace.args["downgraded_from"] == "recommend_deployment"
```

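The host-side branch can be sketched without a kernel. `route_recommendation` is a hypothetical helper; it only shows how the capability id and the `reason` / `downgraded_from` args seen in the trace above could be derived from a decision. The real `act_on_artifact` in the companion example may be structured differently.

```python
from typing import Any


def route_recommendation(
    allows_deployment: bool, reasons: tuple[str, ...]
) -> tuple[str, dict[str, Any]]:
    """Hypothetical helper: pick the capability to invoke and its args."""
    if allows_deployment:
        return "eval.recommend_deployment", {}
    # Put the denial reasons into the call args so they land in the trace.
    return "eval.recommend_manual_review", {
        "reason": "; ".join(reasons),
        "downgraded_from": "recommend_deployment",
    }


capability_id, args = route_recommendation(
    False, ("support_health=high_risk", "decision is not stable")
)
print(capability_id)
print(args["reason"])
```

Keeping the reasons in the invocation args, rather than in a side log, is what lets a later `kernel.explain(action_id)` answer why the downgrade happened.
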
## Non-goals

- No OPE / statistical estimation in agent-kernel.
- No hard dependency on `skdr-eval` or any producer.
- No decision based on a single numeric metric.

## Aligning with `weaver-spec`

If `weaver-spec` publishes a formal `EvaluationArtifact` contract, the field
names read by `assess_artifact` should be aligned to it; the decision codes here
(`allow_summary`, `allow_manual_review_recommendation`, `require_human_review`,
`deny_deployment_recommendation`, `deny_automatic_rollout`) are intended to be a
stable, producer-neutral vocabulary in the meantime.

## Related

- `examples/evaluation_artifact_policy.py` — runnable, offline.
- [skdr-eval](https://github.com/dgenio/skdr-eval)
- [weaver-spec](https://github.com/dgenio/weaver-spec)