[DX] Policy testing framework: assert allow/deny/ask decisions as fixtures #138

@dgenio

Summary

Ship a small policy-testing toolkit so teams can unit-test their own policies the way they test code: declare scenario fixtures (principal, capability, request context) and expected decisions (allow/deny/ask + reason code), run them with pytest or a one-line runner, and get a readable diff when a policy change alters any decision.

Why this matters

A policy nobody can test is a policy nobody will trust in production. OPA's opa test made policy testing a category norm; bringing the same workflow to agent tool-call policy turns the DSL from "config file" into "engineered artifact", and gives teams a regression net for exactly the change AGENTS.md warns about (rule-ordering silently bypassing sensitivity checks). It also generates the shared fixture format the AgentFence policy-contract work (#111, #120) will need.

Proposed scope

  • weaver_kernel.policy_testing module: a PolicyScenario dataclass (principal, capability id, safety class, sensitivity, intent/scope, justification) with an expect field (decision and reason code), plus run_scenarios(engine, scenarios) -> ScenarioReport.
  • YAML scenario-file format mirroring the DSL's vocabulary, loadable with the existing [policy] extra.
  • Pytest integration: a documented pattern (parametrized fixture helper) so scenarios run as individual test cases with clear ids.
  • Decision-diff output: given two engines (or one engine and two policy files), report every scenario whose decision changed — the building block for safe policy rollouts.
  • Docs page (docs/policy_testing.md) + scenarios for the cookbook recipes if the cookbook lands.
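
A minimal sketch of how the scenario dataclass and runner listed above might fit together. Only PolicyScenario, ScenarioReport, and run_scenarios are named in this issue; Expect, Decision, ToyEngine, the field values, and the reason codes are hypothetical stand-ins. The real toolkit would build CapabilityRequest/Principal objects and call the actual engine rather than passing the scenario straight through.

```python
from dataclasses import dataclass, field
from typing import Iterable, List


@dataclass(frozen=True)
class Expect:
    decision: str      # "allow", "deny", or "ask"
    reason_code: str   # stable code, never a message string


@dataclass(frozen=True)
class PolicyScenario:
    name: str
    principal: str
    capability_id: str
    safety_class: str
    sensitivity: str
    expect: Expect
    intent: str = ""
    justification: str = ""


@dataclass(frozen=True)
class Decision:
    # Hypothetical shape of an engine result.
    decision: str
    reason_code: str


@dataclass
class ScenarioReport:
    passed: List[str] = field(default_factory=list)
    failed: List[str] = field(default_factory=list)

    @property
    def ok(self) -> bool:
        return not self.failed


class ToyEngine:
    """Stand-in for a real policy engine (assumption, not the repo's API)."""

    def evaluate(self, s: PolicyScenario) -> Decision:
        if s.safety_class == "destructive":
            return Decision("deny", "DESTRUCTIVE_BLOCKED")
        if s.sensitivity == "high" and not s.justification:
            return Decision("ask", "JUSTIFICATION_REQUIRED")
        return Decision("allow", "DEFAULT_ALLOW")


def run_scenarios(engine, scenarios: Iterable[PolicyScenario]) -> ScenarioReport:
    """Evaluate each scenario and record a readable line for every mismatch."""
    report = ScenarioReport()
    for s in scenarios:
        got = engine.evaluate(s)
        want = s.expect
        if (got.decision, got.reason_code) == (want.decision, want.reason_code):
            report.passed.append(s.name)
        else:
            report.failed.append(
                f"{s.name}: expected {want.decision}/{want.reason_code}, "
                f"got {got.decision}/{got.reason_code}"
            )
    return report


scenarios = [
    PolicyScenario("read-low", "alice", "files.read", "read", "low",
                   Expect("allow", "DEFAULT_ALLOW")),
    PolicyScenario("read-high-unjustified", "alice", "files.read", "read", "high",
                   Expect("ask", "JUSTIFICATION_REQUIRED")),
    PolicyScenario("delete-anything", "alice", "files.delete", "destructive", "low",
                   Expect("deny", "DESTRUCTIVE_BLOCKED")),
]
report = run_scenarios(ToyEngine(), scenarios)
```

Keeping the expectation as a (decision, reason code) pair means a mismatch message can name both what changed and why, which is what the decision-diff output would build on.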

Implementation notes

  • Evaluation path: DefaultPolicyEngine.evaluate() / DeclarativePolicyEngine — scenarios should construct real CapabilityRequest/Principal objects, not mocks, so tests exercise the true code path.
  • Assert stable codes from policy_reasons.py, never message strings (repo convention).
  • dry_run() (kernel/_dry_run.py) covers full-kernel scenarios; the toolkit targets engine-level testing without needing drivers — document when to use which.
  • New module must stay ≤300 lines; split loader/report if needed.
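
One possible shape for the pytest pattern, combined with the reason-code convention from the notes above. This is a sketch under assumptions: Scenario, toy_evaluate, and the reason codes are hypothetical, and a real test would construct CapabilityRequest/Principal objects and call the engine instead of the stand-in function. The only pytest API used is pytest.mark.parametrize with a callable for ids.

```python
from dataclasses import dataclass

import pytest


@dataclass(frozen=True)
class Scenario:
    name: str
    sensitivity: str
    justification: str
    expect_decision: str
    expect_code: str


def toy_evaluate(s: Scenario):
    """Stand-in for engine.evaluate(); returns (decision, reason_code, message)."""
    if s.sensitivity == "high" and not s.justification:
        return ("ask", "JUSTIFICATION_REQUIRED", "Please explain why you need this.")
    return ("allow", "DEFAULT_ALLOW", "Allowed by the default rule.")


SCENARIOS = [
    Scenario("low-sensitivity-read", "low", "", "allow", "DEFAULT_ALLOW"),
    Scenario("high-sensitivity-no-justification", "high", "",
             "ask", "JUSTIFICATION_REQUIRED"),
    Scenario("high-sensitivity-justified", "high", "approved maintenance window",
             "allow", "DEFAULT_ALLOW"),
]


def scenario_id(s: Scenario) -> str:
    # pytest calls this once per parameter to build the test id.
    return s.name


@pytest.mark.parametrize("scenario", SCENARIOS, ids=scenario_id)
def test_policy_decision(scenario):
    decision, code, _message = toy_evaluate(scenario)
    # Compare the stable reason code; the human-readable message is ignored
    # so that rewording it never breaks a test.
    assert (decision, code) == (scenario.expect_decision, scenario.expect_code)
```

With ids derived from the scenario name, each case shows up as its own test (for example test_policy_decision[high-sensitivity-no-justification]), so a failing scenario is identifiable from the test id alone.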

Acceptance criteria

  • Scenario dataclass + YAML format + runner shipped and exported.
  • Pytest pattern documented and used in this repo's own tests for the default policy.
  • Decision-diff between two policy files works and is covered by tests.
  • Docs page with a worked example; CHANGELOG updated.
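
To make the decision-diff criterion concrete, here is a rough sketch of the comparison and a regression it could catch. Everything in it is illustrative: diff_decisions, DecisionChange, first_match, the rule tuples, and the reason codes are hypothetical, and the two "policies" are plain first-match rule lists rather than real policy files.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Decision = Tuple[str, str]  # (decision, reason_code)


@dataclass(frozen=True)
class DecisionChange:
    scenario: str
    before: Decision
    after: Decision


def diff_decisions(
    scenarios: Dict[str, dict],
    old_eval: Callable[[dict], Decision],
    new_eval: Callable[[dict], Decision],
) -> List[DecisionChange]:
    """Return one entry for every scenario whose decision or code changed."""
    changes = []
    for name, ctx in scenarios.items():
        before, after = old_eval(ctx), new_eval(ctx)
        if before != after:
            changes.append(DecisionChange(name, before, after))
    return changes


def first_match(rules):
    """Build a toy evaluator: the first rule whose predicate matches wins."""
    def evaluate(ctx: dict) -> Decision:
        for predicate, decision in rules:
            if predicate(ctx):
                return decision
        return ("deny", "NO_RULE_MATCHED")
    return evaluate


sensitive = (lambda c: c["sensitivity"] == "high",
             ("ask", "SENSITIVE_REQUIRES_APPROVAL"))
readers = (lambda c: c["principal"] == "reader",
           ("allow", "READER_ALLOWED"))

old_policy = first_match([sensitive, readers])
new_policy = first_match([readers, sensitive])  # same rules, reordered

scenarios = {
    "reader-low": {"principal": "reader", "sensitivity": "low"},
    "reader-high": {"principal": "reader", "sensitivity": "high"},
    "guest-low": {"principal": "guest", "sensitivity": "low"},
}

changes = diff_decisions(scenarios, old_policy, new_policy)
for c in changes:
    print(f"{c.scenario}: {c.before} -> {c.after}")
```

Only reader-high changes here: reordering the two rules lets the broad reader rule match before the sensitivity check, flipping that scenario from ask to allow. That is the rule-ordering bypass mentioned under "Why this matters", surfaced as a single diff line.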

Out of scope

References

  • OPA opa test and Cedar policy-test patterns as neutral ecosystem precedents.
  • In-repo: policy_reasons.py, policy_dsl.py, kernel/_dry_run.py, AGENTS.md "Adding a policy rule".

Priority: P1 · Effort: M · Impact: High

Labels: enhancement
