
feat(middleware): Model routing, PII filtering, Cloud model proxies #9802

Open

richiejp wants to merge 38 commits into mudler:master from richiejp:feat/routing-stats-backend

Conversation

@richiejp
Collaborator

@richiejp richiejp commented May 13, 2026

Allows requests to be analyzed, then routed, filtered, and transformed.

Chat requests can be classified and labelled as requiring particular capabilities, then routed to a model which satisfies all of them. Naturally, requests that require fewer capabilities can be handled by smaller specialized models. In addition, the classifier selects more capabilities the less certain it is, routing difficult requests to larger general-purpose models.

Classification is already fast, but once requests have been classified, their embeddings can be used to avoid classifying similar requests again. This works by labelling the embeddings of past requests and then running a cosine-similarity search on the embeddings of new requests.
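The classify-once-then-cache idea can be sketched in Go (a minimal illustration; the `labeled` type, `lookup` helper, and the 0.95 threshold are invented for the example, not the PR's actual code):

```go
package main

import (
	"fmt"
	"math"
)

// labeled is the embedding of a previously classified request plus the
// capability labels the classifier assigned to it.
type labeled struct {
	vec    []float64
	labels []string
}

func cosine(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// lookup returns the labels of the most similar past request when the
// similarity clears the threshold; otherwise ok=false and the caller
// falls back to running the classifier.
func lookup(cache []labeled, q []float64, threshold float64) (labels []string, ok bool) {
	best := -1.0
	for _, c := range cache {
		if s := cosine(c.vec, q); s > best {
			best, labels = s, c.labels
		}
	}
	return labels, best >= threshold
}

func main() {
	cache := []labeled{{vec: []float64{1, 0, 0}, labels: []string{"code"}}}
	if l, ok := lookup(cache, []float64{0.9, 0.1, 0}, 0.95); ok {
		fmt.Println("cache hit:", l)
	}
}
```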


Private information can be detected; when it is found in a request, the request can be modified to redact it, routed differently, or blocked.


Cloud models and a MITM proxy can be configured and take part in filtering and routing.
This allows sending easy requests to smaller local models and hard ones to cloud models.
The MITM proxy allows you to use Claude Code or Codex subscriptions (OAuth) with the PII
filter and potentially even with routing (although this is limited by the cloud providers' ToS).


Routing classifies requests using a model such as ArchRouter. We score each request
against the capabilities it may require and pick a model which has all of the
capabilities scoring towards the top of the distribution.
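A minimal sketch of that selection step, assuming each model advertises a capability set and that we prefer the smallest model satisfying every required capability (the `model` type and `size` field are illustrative, not the PR's actual structs):

```go
package main

import "fmt"

type model struct {
	name string
	caps map[string]bool
	size int // rough parameter-count proxy; smaller is cheaper
}

// pick returns the smallest model that satisfies every required capability.
func pick(models []model, required []string) (string, bool) {
	bestName, bestSize, found := "", 0, false
	for _, m := range models {
		ok := true
		for _, c := range required {
			if !m.caps[c] {
				ok = false
				break
			}
		}
		if ok && (!found || m.size < bestSize) {
			bestName, bestSize, found = m.name, m.size, true
		}
	}
	return bestName, found
}

func main() {
	models := []model{
		{"small-coder", map[string]bool{"code": true}, 3},
		{"big-general", map[string]bool{"code": true, "reasoning": true, "math": true}, 70},
	}
	fmt.Println(pick(models, []string{"code"}))              // small-coder true
	fmt.Println(pick(models, []string{"code", "reasoning"})) // big-general true
}
```

Uncertain requests pick up extra required capabilities, which naturally pushes them towards the larger general-purpose candidates.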


The ability to score multiple choices is an interesting feature in its own right:
it lets you very quickly check the probability with which an LLM would produce a
particular answer.

  • feat(routing): add billing recorder and stats backend foundation
  • feat(routing): expose usage stats in REST, UI, and MCP
  • feat(routing): add regex PII filter with REST and MCP surfaces
  • feat(routing): record usage end-to-end in no-auth mode
  • feat(routing): per-model PII gating + middleware admin page
  • feat(routing): rule-based intelligent router (subsystem 2 MVP)
  • feat(routing): streaming PII filter with buffered-emit invariant
  • feat(routing): PII pattern editor in model config UI
  • feat(routing): streaming PII filter on Anthropic /v1/messages and /v1/completions
  • feat(routing): cloud passthrough proxy (subsystem 4 MVP)
  • docs(routing): cloud passthrough proxy feature page
  • feat(routing): MITM proxy for subscription-auth Claude Code / Codex
  • feat(mitm): negotiate HTTP/2 with h1.1 fallback
  • refactor(cloudproxy): extract shared SSE wire helpers, trim dead state and comments
  • feat(import-model): add cloud-proxy templates to YAML editor
  • Revert "feat(import-model): add cloud-proxy templates to YAML editor"
  • feat(model-editor): add cloud-proxy templates to Add Model picker
  • feat(mitm): runtime control of listener and intercept allowlist
  • feat(middleware-ui): MITM proxy admin tab
  • refactor(mitm): simplify-pass cleanup
  • feat(mitm): emit proxy_connect + proxy_traffic audit events
  • test(mitm): cover tunneled-host event + Events tab kind filter
  • fix(mitm): restore listener from runtime_settings.json on restart
  • fix(routing): address code-review findings across pii/mitm/router
  • feat(middleware): per-pattern PII toggle, model-config-owned MITM hosts
  • refactor(store/local): extract in-process vector store library
  • feat(routing): KNN + LLM classifiers and per-model admission control
  • refactor(store): keep the vector store out of the main process
  • feat(backend): TokenClassify RPC + transformers NER pipeline
  • fix(openai): add missing auth import to chat.go
  • feat(pii): NER tier in the redactor
  • feat(middleware-ui): router template + Create routing model link
  • fix(model-editor): code-editor crash on structured template values
  • feat(model-editor): structured router-candidates editor + proxy chat usecase
  • fix(router-candidates): one textarea per exemplar, multi-line-safe
  • feat(router): KNN consumes a benchmarker-produced routing dataset
  • docs(router): recommend nomic-embed-text-v1.5 over Longformer
  • feat(routing): Score gRPC primitive, score classifier, L2 embedding cache

@richiejp richiejp force-pushed the feat/routing-stats-backend branch 4 times, most recently from aff5af4 to 8389d96 on May 13, 2026 at 14:54
richiejp added 24 commits May 14, 2026 14:45
Introduces core/services/routing/{contract,billing} as the foundation
for the routing module. The billing recorder is wired through the
existing UsageMiddleware and runs unconditionally — a no-auth single-
user box now records token usage under a synthetic "local" user, where
previously the middleware short-circuited on a nil auth DB and zero
stats were captured.

- StatsBackend interface with three impls (gorm, in-memory ring,
  disabled) selected at startup; Recorder fans out to backend + Prom
  counters from a single increment site so DB and metrics cannot
  diverge.
- UsageRecord schema extended with RequestedModel/ServedModel,
  Pre/PostFilterPromptTokens, pricing version, cost, and correlation/
  router/PII foreign keys (all nullable; AutoMigrate handles existing
  deployments).
- Synthetic LocalUser persisted to ${DataPath}/.local_user_id so usage
  history aggregates across restarts in single-user mode.
- contract.Invariant emits localai_invariant_violation_total and panics
  under -tags=routing_strict for nightly E2E surfacing.
- --disable-stats opt-out for ephemeral CI runs.
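The single-increment-site idea can be sketched like this (types are invented for illustration; the real Recorder fans out to a StatsBackend implementation and Prometheus counters):

```go
package main

import "fmt"

// statsBackend is the persistence interface; three implementations
// (gorm, in-memory ring, disabled) are selected at startup in the PR.
type statsBackend interface{ Record(user string, tokens int) }

type memBackend struct{ rows []int }

func (m *memBackend) Record(_ string, tokens int) { m.rows = append(m.rows, tokens) }

type recorder struct {
	backend   statsBackend
	promTotal int // stand-in for a Prometheus counter
}

// Record is the single increment site: the backend write and the metric
// bump always happen together, so DB rows and metrics cannot diverge.
func (r *recorder) Record(user string, tokens int) {
	r.backend.Record(user, tokens)
	r.promTotal += tokens
}

func main() {
	r := &recorder{backend: &memBackend{}}
	r.Record("local", 42)
	fmt.Println(r.promTotal) // 42
}
```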

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Richard Palethorpe <io@richiejp.com>
Wires the billing recorder from the previous commit into user-facing
surfaces. Before this, the Recorder collected data but no endpoint
queried it without auth, the UI hid the Usage page in single-user
mode, and there was no MCP tool to read stats. After:

- New REST endpoints GET /api/usage and /api/usage/all that go through
  application.StatsRecorder() and fall back to the synthetic local
  user when auth is off. Old /api/auth/usage stays as the auth-only
  alias. Both new endpoints carry swagger annotations under the
  "usage" tag.
- Sidebar drops authOnly:true on the Usage entry; Usage.jsx picks the
  endpoint based on authEnabled and skips the empty-state-bail when
  auth is off.
- /api/instructions registry gains a "usage-and-billing" entry so
  agents discover the surface; the existing reachability test bumps to
  13 instructions and asserts the new name is present.
- New MCP tool get_usage_stats with read-only semantics, registered
  under the existing localaitools server. coverage_test.go
  ::TestToolHTTPRouteMappingComplete documents the route pairing;
  expectedFullCatalog and expectedReadOnlyCatalog include the tool.
  Both inproc and httpapi clients implement GetUsageStats; the inproc
  client picks up the StatsRecorder + FallbackUser at construction in
  application.go.
- Playwright e2e spec usage-dashboard.spec.js asserts (a) the Usage
  link is visible without auth, (b) the page renders /api/usage data
  without bailing, and (c) auth-on still routes to /api/auth/usage.

Verified end-to-end against tests/e2e-ui/ui-test-server: /api/auth/status
reports authEnabled:false, /api/usage returns the local user with a
stable UUID, /api/usage/all admits the local user as admin.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Richard Palethorpe <io@richiejp.com>
Subsystem 3 of the routing module. The regex tier is the cheap,
deterministic layer; the encoder NER tier (TokenClassify gRPC) is
follow-up work.

Pattern set: email, phone, SSN, credit card with Luhn verification,
IPv4 (with octet bounds-check), and common API key prefixes (sk-,
pk-, xoxb-, ghp_, github_pat_). Each pattern has one of three
actions:

  - mask: replace the matched span with [REDACTED:<id>] before the
    request reaches the backend. Default for everything except
    api_key_prefix.
  - block: short-circuit the request with HTTP 400 and a pii_blocked
    error type. The matched value is never echoed back to the client.
    Default for api_key_prefix — leaked credentials are higher harm
    than other PII.
  - route_local: leave the text intact but flag the echo context so a
    future content router refuses cloud-proxy candidates. Useful for
    deployments that trust local models with sensitive data but not
    external providers.
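The mask and block actions can be illustrated with a stripped-down redactor (the regexes and types here are simplified stand-ins for the PR's pattern set; route_local is omitted since it only flags context):

```go
package main

import (
	"fmt"
	"regexp"
)

type action int

const (
	mask action = iota
	block
)

type pattern struct {
	id  string
	re  *regexp.Regexp
	act action
}

// Illustrative subset of the pattern catalogue.
var patterns = []pattern{
	{"email", regexp.MustCompile(`[\w.+-]+@[\w-]+\.[\w.]+`), mask},
	{"api_key_prefix", regexp.MustCompile(`sk-[A-Za-z0-9]+`), block},
}

// apply masks matched spans in place; a block pattern short-circuits,
// and the matched value is never echoed back.
func apply(text string) (out string, blocked bool) {
	for _, p := range patterns {
		if p.act == block && p.re.MatchString(text) {
			return "", true
		}
		text = p.re.ReplaceAllString(text, "[REDACTED:"+p.id+"]")
	}
	return text, false
}

func main() {
	fmt.Println(apply("contact alice@example.com about it"))
	fmt.Println(apply("my key is sk-abc123"))
}
```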

Wiring:

  - core/services/routing/pii: types, regex compile, redactor, in-
    memory event ring buffer, YAML config loader, request middleware.
  - core/services/routing/piiadapter: per-API-shape adapter (OpenAI
    today; Anthropic when needed) so the schema package never imports
    pii.
  - core/http/routes/openai.go: wires pii.RequestMiddleware as the
    innermost middleware in the chat slice — runs after the request
    is parsed, mutates the request body in place when masking, returns
    400 when blocking.
  - core/http/routes/pii.go: GET /api/pii/patterns, GET /api/pii/events,
    POST /api/pii/test (admin-or-local-user; events filterable by
    correlation_id, user_id, pattern_id).
  - pkg/mcp/localaitools: list_pii_patterns, get_pii_events,
    test_pii_redaction tools with full route map coverage in
    coverage_test.go.
  - core/http/endpoints/localai/api_instructions.go: pii-filtering
    instructions entry; reachability test bumps to 14.
  - --pii-config / --disable-pii flags; pii.yaml format overrides
    per-id action with unknown-id rejection at startup.

PIIEvent records never carry the matched value — only the byte
offset, length, and an 8-char sha256 prefix so admins can dedupe
recurring leaks during audit. The contract.Invariant
"pii.event_per_span" asserts every redacted span produces an event
record.
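The hash-prefix scheme is a one-liner over the matched value (the `hashPrefix` name is illustrative):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// hashPrefix returns the first 8 hex chars of sha256(value) so audit
// events can deduplicate recurring leaks without ever storing the value.
func hashPrefix(value string) string {
	sum := sha256.Sum256([]byte(value))
	return hex.EncodeToString(sum[:])[:8]
}

func main() {
	fmt.Println(hashPrefix("alice@example.com"))
}
```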

Verified end-to-end against ui-test-server: GET /api/pii/patterns
returns the 6 defaults with correct actions; POST /api/pii/test with
"contact alice@example.com" returns
'redacted="contact [REDACTED:email] about it"' and a span with
hash_prefix=ff8d9819; same with "sk-..." returns blocked=true.

Streaming response filter (the buffered-emit invariant) is in the
plan as a separate slice and not in this commit.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Richard Palethorpe <io@richiejp.com>
Streaming chat completions weren't producing UsageRecords because the
middleware only parsed token counts from the response body — and OpenAI
clients rarely set stream_options.include_usage, while Anthropic uses a
different shape entirely. Handlers now stamp the canonical token counts
on the echo context via middleware.StampUsage; UsageMiddleware reads the
stamp first and only falls back to body-parse for proxy/foreign
endpoints. The body-parse fallback gains an Anthropic shape so
passthrough proxies for /v1/messages still work.

Billing's Prometheus counters were never reaching /metrics because the
monitoring service that calls otel.SetMeterProvider was created later
than billing.NewRecorder, leaving the counters bound to the no-op global
provider. The metrics service now initialises in application.start()
before any counter is registered, exposes its meter via Application
.MetricsService(), and hands it directly to billing via SetMeter() so
the order-of-operations dependency is explicit rather than racy.

The synthetic local user is now wired unconditionally when stats are
enabled (not just when authDB is nil), so internal/system callers under
auth-on still attribute correctly. The /app/users React route is
guarded by a new RequireAuthEnabled component that redirects to /app
when auth is off, defending against direct URL access of an admin-only
page that has nothing to manage in single-user mode.

A new localai_usage_unrecorded_total{endpoint,reason} counter ticks
whenever a request finishes without producing a record, so silent
billing misses are observable rather than invisible.

Verified end-to-end: chat (streaming + non-streaming), embeddings, and
Anthropic messages (streaming + non-streaming) each produce one
UsageRecord and one Prom counter increment in no-auth mode.

Assisted-by: claude-code:claude-opus-4-7 [Read] [Edit] [Bash]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
Move PII filtering from a global opt-out to a per-model opt-in: local
models bypass redaction by default, while backends matching `proxy-*`
default to on (forward-compatible with the cloud-passthrough subsystem).
A new ModelConfig.PII block lets a model opt in (`enabled: true`) and
upgrade or downgrade individual pattern actions without touching global
config. The middleware reads the resolved config from the echo context
and short-circuits when disabled, so a chat to a local model pays no
regex-scan cost.

The Anthropic /v1/messages route gains the same redaction path via a
new piiadapter.Anthropic() that walks AnthropicRequest.Messages —
identical shape to the OpenAI adapter, so a future passthrough proxy
gets PII for free.

A new admin page at /app/middleware (System section, admin-only)
surfaces the live state. Three tabs: Filtering shows the pattern
catalogue with action editors plus every model's resolved enabled state
and overrides; Routing is a placeholder until subsystem 2 lands; Events
renders recent PIIEvents (correlation id, pattern id, action, hash
prefix — the redacted content is never stored or displayed). The page
reads /api/middleware/status (a single-round-trip aggregator) and
mutates pattern actions via PUT /api/pii/patterns/:id (transient,
restored from --pii-config on restart). MCP exposes the same surface as
get_middleware_status and set_pii_pattern_action so an agent can
introspect or tune the filter without code access. The drift detector
in pkg/mcp/localaitools/coverage_test.go still passes — both new tools
ship with their HTTP route mappings.

Behaviour change for existing deployments: local models no longer
receive global PII redaction without an explicit `pii: { enabled: true }`
in their YAML. Documented in the new middleware-admin instructions
registry entry.

End-to-end verified against tests/e2e-ui/ui-test-server (which gains a
--pii-yaml flag for injecting per-model PII config into the auto-
generated mock-model.yaml): default-off produces no events; explicit
opt-in produces a mask event; per-model action override produces an
HTTP 400 pii_blocked response.

Assisted-by: claude-code:claude-opus-4-7 [Read] [Edit] [Bash]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
Add the routing subsystem's content-router tier: a Router config block
on ModelConfig turns a model into a smart-router that classifies each
request and rewrites input.Model to one of its candidates. The standard
model-resolution path then runs ACL, disabled-state, and per-model PII
against the chosen target — the router only does *model* selection,
not node selection (SmartRouter still owns the latter in distributed
mode).

The classifier interface lives in core/services/routing/router with one
shipped implementation: a feature classifier that picks a candidate by
prompt length and code-fence presence. The router.Probe shape is
schema-agnostic; per-API-shape extractors (OpenAIProbe, AnthropicProbe)
in core/http/middleware translate parsed requests into probes without
dragging the schema package into the router. The interface deliberately
doesn't depend on core/config — callers translate RouterCandidate
slices into FeatureCandidate slices at construction time.

The new RouteModel middleware runs after SetModelAndConfig + body
parse but before the PII filter. When the resolved config has a
Router block, the middleware invokes the classifier, looks up the
matched label in the candidate table, reloads the target model's
config, asserts depth-1 (the candidate must NOT itself be a router —
chained routers turn dispatch into a graph), and swaps MODEL_CONFIG +
input.Model in place. RequestedModel/ServedModel get stamped on the
context so the usage log records the routing. Classifier failures and
unknown labels fall through to Router.Fallback; fallback-empty errors
return 503 rather than silently bypassing.
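The dispatch logic above, reduced to its essentials (types and field names are illustrative, not the PR's actual config structs):

```go
package main

import (
	"errors"
	"fmt"
)

type modelConfig struct {
	name   string
	router *routerBlock // non-nil means this model is itself a router
}

type routerBlock struct {
	candidates map[string]string // classifier label → candidate model name
	fallback   string
}

// route resolves a classifier label to a candidate config, enforcing
// depth-1: a candidate must not itself be a router.
func route(cfgs map[string]*modelConfig, r *routerBlock, label string) (*modelConfig, error) {
	name, ok := r.candidates[label]
	if !ok {
		name = r.fallback // classifier failure / unknown label
	}
	if name == "" {
		return nil, errors.New("no candidate and empty fallback") // surfaces as 503
	}
	target, ok := cfgs[name]
	if !ok {
		return nil, fmt.Errorf("unknown model %q", name)
	}
	if target.router != nil {
		return nil, fmt.Errorf("candidate %q is itself a router (depth-1 violated)", name)
	}
	return target, nil
}

func main() {
	cfgs := map[string]*modelConfig{
		"small-model": {name: "small-model"},
		"large-model": {name: "large-model"},
	}
	r := &routerBlock{candidates: map[string]string{"short": "small-model"}, fallback: "large-model"}
	m, _ := route(cfgs, r, "short")
	fmt.Println(m.name) // small-model
	m, _ = route(cfgs, r, "unknown-label")
	fmt.Println(m.name) // large-model
}
```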

The decision log is a ring-buffer in core/services/routing/router that
mirrors the PII event log: in-memory by default, capped at 5k records,
filterable by correlation_id / user_id / router_model. New REST
endpoints surface it: GET /api/router/decisions (admin-only) and an
updated GET /api/router/status that lists configured router models +
their classifier configs. The /api/middleware/status aggregator pulls
the same data so the React Middleware page renders the Routing tab
with active routers and recent decisions side-by-side.

MCP gains a get_router_decisions tool. The coverage drift detector
catches the new tool — its HTTP route is documented in the same map.

The new instructions registry entry "intelligent-routing" explains the
Router block, the depth-1 rule, and points at the decisions endpoint.
Total instructions count → 16.

End-to-end verified: configured mock-model as a smart-router with a
small (max_prompt_length=30) and a large candidate; a 5-char prompt
routes to small-model and a 100-char prompt routes to large-model;
both decisions appear in /api/router/decisions and /api/middleware/
status reflects the active config.

Assisted-by: claude-code:claude-opus-4-7 [Read] [Edit] [Bash]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
Closes the output-side gap in the PII subsystem: until now, redaction
only ran on incoming chat requests. A model could generate "your key
is sk-..." and stream it straight to the client. The new StreamFilter
intercepts the OpenAI chat completion stream's content deltas, applies
the same regex tier the request-side middleware uses, and masks
matches that span chunk boundaries.

The buffered-emit invariant: for any active pattern with bounded
max-length L, the filter holds back the trailing L-1 characters of
the cumulative input. New text disambiguates the boundary; the stream
close (Drain) flushes whatever is left. This is what guarantees the
mask survives an arbitrarily-split chunk sequence — alice@example.com
arriving as "alice@" + "example.com" still becomes [REDACTED:email].

Action handling differs from the request side: earlier chunks are
already on the wire by the time later chunks scan, so a "block" can't
actually reject. The filter remaps block to mask for redaction while
recording PIIEvent rows with action=block so audits surface the
original intent ("the model would have leaked X here, suppressed in
flight"). route_local on output is a no-op (the routing decision was
made at request time).

A property test feeds the redactor every corpus input across 10
random chunkings and asserts (a) no secret value ever appears in the
emitted output and (b) the streamed output equals what a single-shot
redaction would produce on the unsplit text.

Wiring: the OpenAI chat endpoint constructs a per-stream filter when
the resolved ModelConfig has PIIIsEnabled — the same gate the
request-side middleware reads, so a model with PII off pays no
streaming cost either. ChatEndpoint signature gains *pii.Redactor and
pii.EventStore parameters; the legacy /v1/mcp/chat/completions wires
nil values (kept for backward compatibility, request-side filter on
the main route still applies).

The mock-backend gains a MOCK_LEAK_EMAIL prompt sentinel that emits a
response containing alice@example.com — used by the end-to-end test:
streaming chat against a mock-model with pii.enabled=true produces a
data chunk containing [REDACTED:email] and an /api/pii/events row
with direction=out and action=mask.

Anthropic /v1/messages and the bare /v1/completions path are NOT yet
wired; their streaming surfaces will get the same filter in a follow-
up. The StreamFilter type is schema-agnostic so wiring is a small
patch per route.

Assisted-by: claude-code:claude-opus-4-7 [Read] [Edit] [Bash]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
The per-model pii.patterns field was being rendered as a generic
JSON-editor textarea, leaving users to discover the schema by trial and
error. Replace it with a dedicated component that fetches the live
pattern catalog from /api/pii/patterns and presents pattern + action as
two select dropdowns per row, with a separate "add" picker that hides
patterns already overridden.

The pattern catalog is loaded at render time, so new built-in patterns
(when added to DefaultPatterns) surface in the UI automatically without
schema duplication. Unknown IDs already in the YAML still render so
hand-edited configs aren't lost on first load.

Also gives pii.enabled a proper label and description in the config
metadata registry so the toggle isn't an opaque "Enabled" entry under
"Other".

Assisted-by: claude-code:claude-opus-4-7
Signed-off-by: Richard Palethorpe <io@richiejp.com>
…/completions

Closes the streaming-coverage gap flagged in 8d421453. The StreamFilter
type is wire-format-agnostic, so wiring it into the remaining streaming
surfaces is a per-route patch:

- Anthropic /v1/messages: text_delta is the only content surface that
  carries model output; wrap each emit (token-callback path, ChatDeltas
  path, autoparse fallback) so a pattern split across SSE chunks still
  gets masked. Drain the buffered tail before any content_block_stop on
  the text block (normal close, tool-call transitions, autoparse), so
  trailing residue isn't silently truncated when the model pivots into
  a tool_use block. Block→mask remap and per-model action overrides
  follow the same gating as the OpenAI chat path.

- /v1/completions: response-side only — the endpoint has no chat
  message structure for request-side scanning, but a model trained on
  PII can still emit it. Filter Choices[0].Text per chunk and drain the
  residue into one final text-bearing chunk just before the stop
  chunk + [DONE].

Same per-model gate as elsewhere: PII off for non-proxy backends by
default, on for proxy-* / explicit pii.enabled = true. Filter is nil
when disabled — flow is untouched.

Subsystem 3 (PII) is now feature-complete for the MVP scope across
both directions on chat/completions/messages. Encoder NER tier
(TokenClassify gRPC) remains as a follow-up.

Assisted-by: claude-code:claude-opus-4-7
Signed-off-by: Richard Palethorpe <io@richiejp.com>
Adds wire-format-faithful HTTP+SSE forwarding for models whose Backend
starts with `proxy-` and whose `proxy.upstream_url` is set. The chat
and messages handlers fork to the proxy before any local templating
or gRPC dispatch, so the upstream sees the request body the client
sent (with only the top-level `model` field optionally rewritten).

The streaming PII filter rides on top: per-token text is extracted
from each SSE chunk, pushed through pii.StreamFilter, and spliced
back into the original envelope so the upstream's event names and
metadata pass through untouched. PII residue flushes before the
provider's terminal marker ([DONE] / message_stop) so clients that
stop reading on the marker don't lose the tail.

Auth is provider-aware (OpenAI Bearer, Anthropic x-api-key +
anthropic-version header). API keys read from env vars named in
config so secrets stay out of YAML and the admin UI.

No request-shape translation in the MVP — a client posting
OpenAI-shaped requests to a proxy-anthropic model gets a confused
upstream. Cross-shape forwarding is deliberately deferred; tool-call
argument round-tripping and reasoning-content passthrough deserve
their own review.

Assisted-by: claude-code:claude-opus-4-7
Signed-off-by: Richard Palethorpe <io@richiejp.com>
Adds a copy-paste-ready model config template for both proxy-openai and
proxy-anthropic, covering API key handling via env vars, model name
rewriting, request timeout, and the per-model PII gate. Includes a
section on combining proxy models with the intelligent router so a
single LocalAI instance can mix local and cloud candidates behind one
classifier.

Documents the MVP limitations explicitly (no request-shape translation,
no output-side PII for buffered responses, no retry) so users don't
hit them as surprises.

Assisted-by: claude-code:claude-opus-4-7
Signed-off-by: Richard Palethorpe <io@richiejp.com>
Adds an HTTPS forward proxy that selectively MITMs traffic for
allowlisted LLM API hosts so LocalAI can apply per-request PII
redaction to clients authenticating via OAuth / subscription rather
than via API keys held by LocalAI. Hosts outside the allowlist get a
plain CONNECT tunnel — OAuth flows, telemetry, and unrelated HTTPS
keep working without depending on the CA being trusted.

Components:
- mitm.CA: ECDSA-P256 CA, generated once and persisted (key 0600)
- mitm leaf cache: per-SNI leaf certs minted on demand, cached in-mem
- mitm.Server: CONNECT-aware HTTP server, hijacks the conn, mints
  leaf, terminates TLS, parses HTTP/1.1 requests, dispatches
- mitm PII handler: re-uses the existing piiadapter for request
  redaction and pii.StreamFilter for SSE response redaction; runs
  only on /v1/messages and /v1/chat/completions paths (others pass
  through verbatim, preserving Anthropic-OAuth and OpenAI-Codex
  auth flows untouched)
- Application wiring: --mitm-listen / --mitm-ca-dir /
  --mitm-intercept-hosts CLI flags. Off by default. CA cert exposed
  unauthenticated at GET /api/middleware/proxy-ca.crt for client
  trust-store install.

Primary use case: redact PII from Claude Code sessions running
against a Claude Pro/Max subscription, where LocalAI doesn't hold
(and can't use) an API key. Codex CLI works the same way.

HTTP/1.1 only; HTTP/2 deferred (most CLIs negotiate down without
issue).

Assisted-by: claude-code:claude-opus-4-7
Signed-off-by: Richard Palethorpe <io@richiejp.com>
Previously the MITM proxy terminated TLS as HTTP/1.1 only. Modern
LLM-API clients (Claude Code, Codex CLI) and the Anthropic / OpenAI
APIs themselves all speak HTTP/2 — h2 multiplexing is what makes
streaming responses cheap. Forcing h1.1 in the middle of the path
worked but cost a measurable per-request overhead and would have
broken any future client that drops h1 support.

Changes:
- proxy.go: TLS NextProtos = ["h2", "http/1.1"]; after handshake
  branch on NegotiatedProtocol. h2 path uses http2.Server.ServeConn
  with the InterceptHandler wrapped as an http.Handler. h1.1 path
  retains the manual request-loop with connResponseWriter as a
  fallback for legacy clients.
- handler.go: outbound http.Transport explicitly configured with
  http2.ConfigureTransport so the upstream leg also negotiates h2.
- go.mod: promote golang.org/x/net to a direct dependency (was
  indirect via websocket).
- New tests: TestProxy_NegotiatesHTTP2 verifies resp.Proto ==
  "HTTP/2.0", TestProxy_HTTP2Streaming covers SSE-over-h2 with per-
  frame flush, TestProxy_HTTP1Fallback locks the legacy path.

The InterceptHandler signature is unchanged — h2 streams map 1:1 to
http.Request, just like h1, so handlers don't have to know which
protocol is on the wire.
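ALPN negotiation with an h1.1 fallback comes down to NextProtos ordering; a self-contained demonstration over an in-memory pipe (the certificate helper and `negotiate` function are invented for the example):

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/tls"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"net"
	"time"
)

// selfSigned builds a throwaway server cert so the handshake below runs
// entirely in memory.
func selfSigned() tls.Certificate {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		panic(err)
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(1),
		Subject:      pkix.Name{CommonName: "test"},
		NotBefore:    time.Now(),
		NotAfter:     time.Now().Add(time.Hour),
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		panic(err)
	}
	return tls.Certificate{Certificate: [][]byte{der}, PrivateKey: key}
}

// negotiate performs a TLS handshake over an in-memory pipe and returns
// the ALPN protocol the server selected. Listing "h2" first gives h2 to
// capable clients while keeping the http/1.1 fallback for legacy ones.
func negotiate(clientProtos []string) string {
	cp, sp := net.Pipe()
	srv := tls.Server(sp, &tls.Config{
		Certificates: []tls.Certificate{selfSigned()},
		NextProtos:   []string{"h2", "http/1.1"},
	})
	cli := tls.Client(cp, &tls.Config{
		InsecureSkipVerify: true,
		NextProtos:         clientProtos,
	})
	done := make(chan error, 1)
	go func() { done <- srv.Handshake() }()
	if err := cli.Handshake(); err != nil {
		panic(err)
	}
	<-done
	// the real server branches here: "h2" → http2.Server.ServeConn,
	// anything else → the manual HTTP/1.1 request loop
	return srv.ConnectionState().NegotiatedProtocol
}

func main() {
	fmt.Println(negotiate([]string{"h2", "http/1.1"})) // h2
	fmt.Println(negotiate([]string{"http/1.1"}))       // http/1.1
}
```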

Assisted-by: claude-code:claude-opus-4-7
Signed-off-by: Richard Palethorpe <io@richiejp.com>
…e and comments

- New core/services/cloudproxy/ssewire package owns the SSE scanner
  and the per-provider rewrite/terminal/residual helpers; cloudproxy
  and mitm both import it. Removes ~150 lines of literal duplication
  between mitm/sse.go and cloudproxy/{sse,proxy}.go.
- handler.go: replace dispatchPIIIntercept (8 positional params) with
  a piiDispatcher struct built once at NewPIIHandler time. Hoists the
  pattern→action map out of the per-request hot path, fixes a PII
  event-ID collision when one request triggered multiple spans of
  the same pattern (now uses an atomic seq), and stops silently
  dropping store.Record errors.
- proxy.go: cache streaming(body) result instead of re-parsing JSON.
- ca.go: drop the redundant certDER field; use cert.Raw, the byte-
  identical buffer x509.ParseCertificate already populates.
- Trim package docs and over-narrating per-declaration comments to
  match the project style guide (only WHY when non-obvious).

No behaviour change. All existing tests pass.

Assisted-by: claude-code:claude-opus-4-7
Signed-off-by: Richard Palethorpe <io@richiejp.com>
Adds two starter YAMLs to the Import Model page's Power → YAML view:
"OpenAI proxy" and "Anthropic proxy". Clicking either fills the
editor with a working proxy-* skeleton — backend, upstream URL,
api_key_env (so the secret stays out of YAML), upstream_model,
request_timeout_seconds, and a sensible per-model PII gate.
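An illustrative config along those lines (field names are taken from the descriptions in this PR; the exact schema and nesting may differ, and the model names are placeholders):

```yaml
# Hypothetical proxy-model skeleton — adjust names to your deployment.
name: my-cloud-model
backend: proxy-anthropic
proxy:
  upstream_url: https://api.anthropic.com
  api_key_env: ANTHROPIC_API_KEY   # secret stays out of YAML
  upstream_model: upstream-model-name
  request_timeout_seconds: 120
pii:
  enabled: true                    # proxy-* backends default to on
```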

Templates appear next to the Copy button so they're discoverable
without leaving the editor. The user fills in their own model
name, upstream URL, and env-var name and submits.

Assisted-by: claude-code:claude-opus-4-7
Signed-off-by: Richard Palethorpe <io@richiejp.com>
This reverts commit f11c533ceb9b7c164023ca27e21259d29196bd95.

Signed-off-by: Richard Palethorpe <io@richiejp.com>
Adds two template cards to the Add Model page (/app/model-editor in
create mode): "OpenAI Proxy" and "Anthropic Proxy". Picking either
pre-fills the form with backend, upstream URL, api_key_env,
upstream_model placeholder, request timeout, and pii.enabled — the
user fills in the model name, the env-var name, and the upstream
model and saves.

This is the right home for the proxy starter; the Import Model page
is reserved for fetching artefacts from HF / Ollama / OCI and the
proxy doesn't fit that pattern.

Assisted-by: claude-code:claude-opus-4-7
Signed-off-by: Richard Palethorpe <io@richiejp.com>
Adds MITMListen and MITMInterceptHosts to RuntimeSettings so an
admin can flip the cloudproxy MITM listener on/off and edit the
intercept allowlist via /api/settings (already admin-gated; locked
down by --disable-runtime-settings when the operator wants no
runtime mutation at all).

The CA dir stays startup-only — the persisted CA is the trust
anchor for every already-installed client, and rotating it from a
REST endpoint would orphan them. Editing the listen address or
allowlist reuses the same CA via Application.RestartMITM, which
stops the old listener (if any), reads the current config, and
starts a new one.

Also adds a "mitm" section to GET /api/middleware/status so the
admin page can render running state, configured vs bound listen
address, allowlist, and the CA download URL.

Assisted-by: claude-code:claude-opus-4-7
Signed-off-by: Richard Palethorpe <io@richiejp.com>
Adds a "MITM Proxy" tab to /app/middleware. Shows running state +
bound listen address; renders Apply/Discard form for the listen
address and intercept-host allowlist (which writes through to
/api/settings, already admin-gated and lockable by
--disable-runtime-settings); offers a one-click CA cert download
plus a brief client-setup recipe (NODE_EXTRA_CA_CERTS + HTTPS_PROXY)
so an admin can stand up Claude Code / Codex without leaving the
page.

Backend bits already shipped in 76e3b5fe — this turns the data into
a working control surface.

Assisted-by: claude-code:claude-opus-4-7
Signed-off-by: Richard Palethorpe <io@richiejp.com>
- ProxyTab: gate the server→local sync useEffect on !dirty so
  Refresh / post-save refetch can't clobber mid-typed input. The
  intercept_hosts array reference changes per fetchAll(), so the
  previous deps[] silently re-fired every poll.
- Switch ProxyTab.save to settingsApi.save — same path Settings.jsx
  uses. Drops the raw fetch + handcrafted JSON.
- Move mitmMutex from a package-level var onto Application, matching
  p2pMutex / watchdogMutex. Add stopMITMLocked for symmetry with
  startMITMLocked; RestartMITM now reads as
  stopLocked → bail-on-empty → startLocked.
- Add BackendProxyOpenAI / BackendProxyAnthropic constants in
  cloudproxy and use them in providerName. Test-data sites stay as
  literals so a typo'd constant rename still fails the tests.
- Trim a buildMITMStatus comment that just narrated the field names.

No behaviour change.

Assisted-by: claude-code:claude-opus-4-7
Signed-off-by: Richard Palethorpe <io@richiejp.com>
Repurpose the PII event store as a shared middleware audit log: add an
EventKind discriminator (pii | proxy_connect | proxy_traffic) and
proxy-specific fields (Host, Intercepted, BytesSent, BytesReceived,
StatusCode, DurationMS) to the existing PIIEvent record. Keep request
contents out of the store — bodies live in API/backend traces only.

The MITM Server records a proxy_connect row for every CONNECT (with
Host + Intercepted=true|false) so admins can see which hostnames a
client tried to reach and whether the proxy terminated TLS or
tunneled through. The PIIHandler wraps its ResponseWriter to count
bytes downstream and records a proxy_traffic row at request end with
sent/received byte counts, status code, and duration.

The /api/pii/events endpoint accepts a kind= filter. The Middleware
admin page Events tab gains a Kind column, a kind filter row, and
per-kind detail formatting (host + intercept decision for connects;
HTTP status, byte counts, and duration for traffic). The MCP
get_pii_events tool stays scoped to kind=pii so the LLM-facing audit
isn't polluted by proxy rows with empty PII fields.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Richard Palethorpe <io@richiejp.com>
Add a Go test for the tunneled CONNECT path: a non-allowlisted host
must record a proxy_connect with Intercepted=false and zero
proxy_traffic events (since tunneled bytes never reach the dispatcher).

Extend the Playwright spec for the Middleware page Events tab. The
mock event feed now includes a pii row, two proxy_connect rows
(intercept and tunnel decisions), and one proxy_traffic row.
New test cases:
  - proxy_connect rows show "intercepted" / "tunneled" labels
  - proxy_traffic row shows HTTP status, byte counts, and duration
  - the kind filter buttons narrow the table to a single kind
  - the Kind column header and per-kind badges render

Note: Playwright runs failed in the local sandbox (the bundled
chrome-headless-shell can't load libglib on this NixOS host); the
specs are authored against the rendered DOM and will run in CI.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Richard Palethorpe <io@richiejp.com>
loadRuntimeSettingsFromFile applied every other persisted runtime
setting (branding, watchdog, P2P, agent pool, ...) back into options
on startup but skipped the MITM fields. So when an admin configured
the listener via /api/settings, runtime_settings.json on the mounted
volume held mitm_listen + mitm_intercept_hosts, but on restart options
came up empty and the start-MITM gate at startup never fired.

Two changes:

  - loadRuntimeSettingsFromFile now copies MITMListen and
    MITMInterceptHosts from the file when no CLI flag set them. Like
    branding, the file is the only source — there are no env vars for
    these — so an explicit --mitm-listen still wins, but a /api/settings
    save round-trips correctly.

  - The startMITMProxy call moves to after loadRuntimeSettingsFromFile.
    Previously it ran before the file load, so even with the loader
    fix in place options.MITMListen would be empty when the gate
    fired. The watchdog and other restartable subsystems already
    initialize after the load — MITM now matches.

Tests pin the contract:
  - core/config: WritePersistedSettings + ReadPersistedSettings round-trip
    preserves both MITM fields.
  - core/application: loadRuntimeSettingsFromFile populates MITMListen
    and MITMInterceptHosts from a fixture file, and an explicit CLI
    flag wins over the file value.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Richard Palethorpe <io@richiejp.com>
Self-review pass on the routing-stats slice. Each finding paired with
test coverage; one refactor (atomic.Pointer for MITM accessors)
matches the existing agentPoolService precedent in the same struct.

Logic fixes:

- pii/stream.go: snap emitBoundary to a rune start so the held tail
  never contains a split UTF-8 codepoint. Multibyte corpus added to
  the buffered-emit invariant test.
- pii/redactor.go: SetAction publishes a fresh patterns slice
  (slices.Clone) instead of mutating r.patterns[i].Action in place —
  Go strings are not atomic two-word values, so concurrent Redact
  callers iterating an older snapshot would race on the field even
  under RWMutex. Race-stress test added.
- pii/openai adapter: new bit-24 sentinel + 24-bit block field
  (idxWholeStringFlag/idxBlockMask) replaces the 0xFFFF sentinel
  that collided with a real block index of 65535.
- mitm/proxy.go: fail closed if SetDeadline errors before the TLS
  handshake — proceeding into the protocol switch on an
  unhandshaken conn is worse than dropping the connection.
- mitm/response.go: Connection: close compared with EqualFold so
  any casing triggers the post-response disconnect (RFC 9110 §7.6.1).
- application: MITMServer/MITMCA accessors now atomic.Pointer-backed
  (matches agentPoolService); readers no longer race RestartMITM
  on pointer swap. mitmMutex retained only to serialize Stop+Start.
- router/feature.go: prompt length predicates use rune count, not
  byte count — operators reason in characters, not UTF-8 bytes.
  Cached once per Classify call rather than recomputed per candidate.
- mcp/localaitools/inproc: GetUsageStats(All=true, UserID=…) honours
  the UserID filter, matching the REST endpoint's ?user_id param —
  same MCP call now returns the same data over either transport.
- react-ui middleware spec: bytes_received mock changed from 1280 to
  1228 so formatBytes returns the asserted 1.2KB string.

Test coverage added:

- pii: race-detector test for SetAction, multibyte UTF-8 corpus.
- ssewire: direct unit tests for Scanner edge cases (CRLF, leading
  blanks, mid-event EOF) and IsTerminalMarker per provider.
- mitm: Stop idempotency, restart cycle with allowlist swap.
- middleware/route_model: classifier-success, fallback,
  depth-1-invariant, no-fallback-503, unsupported-classifier paths
  + OpenAIProbe/AnthropicProbe extractors.
- anthropic/messages: drainStreamPIIToText covers nil-filter no-op,
  empty-drain no-op, residual emit shape, idempotence, and
  end-of-stream redaction.
- application: symmetric MITMInterceptHosts CLI-wins loader test.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Richard Palethorpe <io@richiejp.com>
richiejp added 14 commits May 14, 2026 14:57
A set of UX changes that together move per-host MITM control out of the
global runtime_settings.json and into model YAML, where PII overrides
already lived. The MITM model template + the Add Model picker entry
mirror how the Talk page surfaces pipeline models.

A. Per-pattern PII enable + persist

- pii.Pattern gains a Disabled flag; Redactor.RedactWithOverrides
  skips disabled patterns. SetDisabled mutates via slices.Clone for
  the same race-free publish SetAction uses.
- PUT /api/pii/patterns/:id accepts {action?, disabled?} (one or
  both). New POST /api/pii/patterns/persist snapshots the live
  redactor's deltas vs --pii-config defaults into a new
  pii_pattern_overrides map in runtime_settings.json; the boot
  loader applies it after redactor construction.
- React: per-row Enabled checkbox + a "Save to disk" button on the
  Filtering tab. PUT toasts note the change is transient until
  persist is clicked.
- MCP: PIIPatternActionUpdate.Disabled is optional; new
  persist_pii_patterns tool. Coverage map + full-catalog test
  updated.
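
The race-free publish pattern shared by SetDisabled and SetAction can be sketched as follows (a simplified standalone version; the real redactor holds more per-pattern state):

```go
package main

import (
	"fmt"
	"slices"
	"sync"
)

// Pattern sketches a PII pattern with a runtime-togglable Disabled flag.
type Pattern struct {
	ID       string
	Disabled bool
}

// Redactor publishes a fresh slice on every mutation so concurrent
// readers iterating an older snapshot never observe a torn write.
type Redactor struct {
	mu       sync.RWMutex
	patterns []Pattern
}

func (r *Redactor) SetDisabled(id string, disabled bool) {
	r.mu.Lock()
	defer r.mu.Unlock()
	next := slices.Clone(r.patterns) // copy-on-write: mutate the clone
	for i := range next {
		if next[i].ID == id {
			next[i].Disabled = disabled
		}
	}
	r.patterns = next // publish the new snapshot
}

func (r *Redactor) snapshot() []Pattern {
	r.mu.RLock()
	defer r.mu.RUnlock()
	return r.patterns
}

func main() {
	r := &Redactor{patterns: []Pattern{{ID: "email"}, {ID: "ssn"}}}
	old := r.snapshot()
	r.SetDisabled("email", true)
	// The old snapshot is untouched; the new one sees the change.
	fmt.Println(old[0].Disabled, r.snapshot()[0].Disabled) // false true
}
```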

B. Model-config link buttons

- Per-model row in the Filtering tab gets an Edit button linking to
  /app/model-editor/<name>. Mirrors the same pattern used elsewhere
  for navigating to a config from a status surface.

D2. Model configs own MITM hosts

- New mitm: { hosts: [...] } block on ModelConfig. Loader gains
  MITMHostOwners() returning {Owners, Conflicts}; ANY duplicate host
  across model configs is a critical error that disables the MITM
  listener until resolved (strict 1-to-1 invariant the dispatcher
  relies on).
- startMITMLocked validates ownership before binding; conflicts are
  published on Application.mitmHostConflicts and surfaced via
  /api/middleware/status with a clear error message and links to
  the colliding configs in the React banner.
- Allowlist is now exactly the set of hosts claimed by model configs
  — the global MITMInterceptHosts list and MITMHostsWithPIIDisabled
  list are removed from RuntimeSettings, ApplicationConfig, the CLI
  flag, and runtime_settings.json. Per-host PII gate inherits from
  each owner config's pii.enabled.
- New "MITM Intercept" template in modelTemplates.js (default name
  mitm-anthropic, default host api.anthropic.com, pii.enabled: true,
  empty pii.patterns: [] for an immediately-visible override editor).
  Registered in core/config/meta/registry.go as a string-list field
  so the model editor renders it.
- /api/middleware/status MITM payload gains models: a list of every
  config that owns at least one MITM host (name, hosts, pii_enabled,
  backend), plus host_owners, host_conflicts. The MITM Proxy tab
  renders this as a top-level "MITM Models" table with an Add MITM
  model button.

Test: ModelConfigLoader.MITMHostOwners cross-config conflict
detection, host-normalisation, and intra-config duplicate handling.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Richard Palethorpe <io@richiejp.com>
Pull the local-store gRPC backend's KV+KNN logic into a reusable
pkg/store/local library so other in-process callers (notably the
routing module's KNN classifier) share one implementation. The
backend/go/local-store binary becomes a thin pb<->[]float32/[]byte
translation wrapper. Shared WrapKeys/WrapValues/UnwrapKeys helpers
move to pkg/store/proto.go.

Regression test suite covers normalization invariants, sort/merge
correctness, delete, KNN top-k ordering, and the 0xFFFF block-index
boundary that previously aliased.

Assisted-by: claude-code:claude-opus-4-7 [Read] [Edit] [Write] [Bash]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
Lands two routing subsystems behind the existing router config:

KNN classifier — embeds candidate exemplars on first Classify
(atomic.Pointer[local.Store] + loadMu for safe lazy load), picks
the candidate whose nearest exemplar is closest to the probe.
Threshold via router.min_score. Reuses the extracted local-store
library so the same KNN search runs in both the gRPC backend and
in-process router.

LLM classifier — asks a small instruct model to pick a label
from natural-language descriptions. Longest-first label match,
RWMutex-guarded prompt memo cache (size from
router.classifier_cache_size, default 1024), TrimSpace+ToLower
cache key.

EmbedderFactory / LLMCallerFactory adapter pattern on Application
keeps the router package free of HTTP/backend imports. Per-router
sync.Map cache in the middleware avoids re-embedding exemplars on
every request.

Admission control (subsystem 5) — per-model semaphore limiter
(sync.Map[modelName]chan struct{}) gates concurrent in-flight
requests by ModelConfig.Limits.MaxConcurrent. On rejection:
HTTP 503 + Retry-After + audit row via new pii.KindAdmission
event kind + JSON body { error.type: admission_rejected }.
Cap is fixed at first Acquire per model — admin restarts to
resize, matching the rest of the model config lifecycle.
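
The per-model gate above amounts to a buffered channel used as a counting semaphore; the real code keeps one per model in a sync.Map, but a single limiter shows the mechanism:

```go
package main

import "fmt"

// limiter gates concurrent in-flight requests for one model with a
// buffered channel as a counting semaphore (cap = MaxConcurrent).
type limiter chan struct{}

func newLimiter(maxConcurrent int) limiter {
	return make(limiter, maxConcurrent)
}

// tryAcquire admits the request if a slot is free; on rejection the
// middleware answers 503 + Retry-After and records an admission event.
func (l limiter) tryAcquire() bool {
	select {
	case l <- struct{}{}:
		return true
	default:
		return false
	}
}

func (l limiter) release() { <-l }

func main() {
	l := newLimiter(1)
	fmt.Println(l.tryAcquire()) // true: slot taken
	fmt.Println(l.tryAcquire()) // false: would exceed max_concurrent
	l.release()
	fmt.Println(l.tryAcquire()) // true: slot free again
}
```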

Middleware runs after RouteModel so a router fanout that lands
on a saturated downstream is rejected even when the router-model
itself has slack. /api/middleware/status gains an admission
section listing each gated model's max_concurrent / in_flight /
retry_after_seconds. The Events tab in the Middleware page knows
about admission rows.

Single-source-of-truth constants (ClassifierFeature / KNN / LLM,
LabelFallback) and an errDecision helper de-duplicate the
classifier surface.

Assisted-by: claude-code:claude-opus-4-7 [Read] [Edit] [Write] [Bash]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
The earlier extraction to pkg/store/local was the wrong shape: it pulled
the in-memory KV+KNN into the main process so the KNN router
classifier could call it directly. That undermines the point of
having a vector-store backend — admins should be able to swap in
qdrant, pinecone, or any other pluggable store backend without
changes to the routing code.

Reverts pkg/store/local and inlines the implementation back into
backend/go/local-store as package main. KNN now consumes a
router.VectorStore interface (Set / Find), with the production
adapter at core/application/embedder.go wrapping pkg/store's gRPC
client (SetCols / Find) over a backend resolved from
core/backend.StoreBackend — exactly how face/voice recognition
consume the same surface.

RouterConfig gains a store_model field naming the chosen backend
(empty = default local-store). Each router model uses its own
namespace ("router-knn-<routerModelName>") so two routers sharing
a backend can't see each other's exemplars; ModelLoader's
per-(backend, namespace) process isolation does the rest.

The router package gains no core/backend dependency — the
VectorStore interface lives alongside Embedder and LLMCaller and
is wired from the application layer the same way.

Algorithm coverage (sort/merge, normalised fast path, KNN top-K,
dimension enforcement) stays where it belongs — in
backend/go/local-store/store_test.go — exercised through the
gRPC service surface that downstream consumers actually use.

Assisted-by: claude-code:claude-opus-4-7 [Read] [Edit] [Write] [Bash]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
Adds a TokenClassify gRPC method for token-classification (NER) models
and implements it in the Python transformers backend. The PII redactor
will consume this in a follow-up to add an ML-based detection tier on
top of the regex tier.

Proto surface:
- TokenClassifyRequest { text, threshold }
- TokenClassifyEntity { entity_group, start, end, score, text }
- TokenClassifyResponse { repeated entities }

Byte offsets are into the original UTF-8 text so the consumer can slice
without re-tokenising. entity_group follows HuggingFace's aggregated-tag
convention (PER, LOC, ORG, ... or PII-specific labels depending on the
model).

Go wiring: Client / embedBackend / ConnectionEvictingClient gain
TokenClassify; Backend interface includes it. Generated stubs are
gitignored and regenerated at build time via `make protogen-go`.

Python backend: a new `Type=TokenClassification` model-load branch
loads via `transformers.pipeline("token-classification",
aggregation_strategy="simple")`. The aggregated-strategy pipeline gives
us span-merged entities with byte offsets out of the box. TokenClassify
RPC runs the pipeline, filters by threshold, and returns the entities.

Assisted-by: claude-code:claude-opus-4-7 [Read] [Edit] [Write] [Bash]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
The streaming PII filter wiring referenced auth.GetUser to attribute
events to the request's user, but the import line was dropped during
rebase. The result was a build failure: "undefined: auth" at
chat.go:709 and chat.go:1453.

Assisted-by: claude-code:claude-opus-4-7 [Edit] [Bash]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
Adds an optional encoder-based detection tier on top of the existing
regex tier. NER catches the long tail (unformatted names, locations,
mixed-language PII) that regex can't express, while regex keeps the
cheap path for formatted hits (emails, SSNs, credit cards).

The redactor exposes a new RedactWithNER(ctx, text, overrides, NERConfig)
that runs both tiers and merges hits through the same
overlap-resolution as before — when an entity span overlaps a regex
hit, the stronger action wins (block > route_local > mask). NER pattern
IDs are namespaced "ner:<entity_group>" so audit rows and event-tab
filters distinguish them from regex hits, and admins can disable a
single entity type with the same Disabled-pattern machinery.

NERConfig is per-request: each call site supplies the loaded
detector + per-group action map + minimum confidence, so the same
Redactor instance can serve different models with different NER
preferences without per-model redactor instances.

Fail-open semantics: a detector error returns the regex-only Result
alongside the error. Caller decides whether to surface the failure
(fail-closed: refuse the request) or log and proceed
(fail-open: ship regex-tier protection only). The regex tier itself
never errors.

Regex hit-collection / overlap-merge / output emission are now
factored into collectRegexHits + mergeAndEmit so the regex-only
RedactWithOverrides and the new RedactWithNER share one
implementation.

Out of scope (follow-up commits):
- core/application adapter from gRPC TokenClassify to NERDetector
- per-model PIIConfig.NER block + middleware wiring
- React middleware page surface for NER entity types
- gallery model entry for a recommended NER model

Assisted-by: claude-code:claude-opus-4-7 [Read] [Edit] [Write] [Bash]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
The Routing tab now has an explicit affordance for creating a new
routing model — matches the pattern already used on the MITM Proxy
tab. Empty state shows a primary "Create routing model" button;
populated state adds an "Add routing model" button next to the
Active routers header.

Both link to /app/model-editor?template=router. A new template in
modelTemplates.js seeds the editor with the feature classifier and
two empty candidate rows (one for 'code' with requires_code, one for
'chat') so admins fill in candidate models + a fallback and save.

The model editor wouldn't render the router fields until they were
registered, so registry.go gains entries for:
- router.classifier (select: feature / knn / llm)
- router.fallback (model-select chat)
- router.embedding_model (model-select models — for KNN)
- router.store_model (model-select models — for KNN's vector store)
- router.min_score (number)
- router.classifier_model (model-select chat — for LLM)
- router.classifier_cache_size (number)
- router.candidates (code-editor — array of {label, model, rules})

All under the "other" section alongside mitm.hosts, ordered after
the MITM entry.

Assisted-by: claude-code:claude-opus-4-7 [Read] [Edit] [Bash]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
The router template seeds router.candidates with an array of
{label, model, rules} objects. CodeMirror's EditorState.create({ doc })
requires a string — passing the array crashed inside CM's Text class
with "(intermediate value).split is not a function", surfaced as an
"Unexpected Application Error" overlay the moment the template
loaded.

Adds a StructuredCodeEditor wrapper that:
- YAML-stringifies the structured value for display so CodeMirror
  always sees a string,
- parses the user's text back to a structured value on every edit
  (using YAML.parse) so the editor form state holds the canonical
  shape, ready for unflattenConfig + YAML.stringify on save,
- holds the last-published structured value steady while the YAML
  buffer is mid-edit and temporarily invalid (the CM YAML linter
  surfaces the syntax error inline).

ConfigFieldRenderer routes code-editor fields through the wrapper
when the form value is non-string; plain text blobs (Go templates
etc.) still use the original CodeEditor with no behaviour change.

Playwright regression test pins:
- The Routing tab's "Create routing model" button navigates to
  /app/model-editor?template=router.
- Loading that URL doesn't render the "Unexpected Application Error"
  overlay, and the Router Candidates / Classifier fields are visible.

A page.on('pageerror') hook surfaces any uncaught render error so a
future regression fails with a useful message rather than silently
passing.

Assisted-by: claude-code:claude-opus-4-7 [Read] [Edit] [Write] [Bash]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

Two follow-ups for the routing template:

1. router.candidates moves from raw YAML to a dedicated structured
   editor. Each candidate is a card with:
   - label and model (model picker — no more typing model names from
     memory),
   - optional description (LLM classifier hint),
   - collapsible Rules section: max/min prompt length, requires_code
     toggle, and an Examples textarea for KNN exemplars (one per
     line).
   Empty rule values are stripped from the output so the saved YAML
   doesn't carry zero-valued junk. New "router-candidates" component
   in the field registry routes to RouterCandidatesEditor; everything
   else (regex tier, KNN factory, classifier dispatch) was already
   wired against the same structured shape, so the YAML this editor
   produces round-trips cleanly.

2. Proxy templates (proxy-openai, proxy-anthropic) ship with
   known_usecases: ['chat']. Without it the proxy model wasn't
   surfacing in router fallback / candidate pickers (or any chat
   capability selector) because pickers filter by FLAG_CHAT and
   backends with no explicit usecase list don't pass.

Updated regression test to assert the structured editor's "Add
candidate" button is present — if the field gets reverted to raw
YAML, the test fails loudly instead of silently passing on the
"didn't crash" check alone.

Assisted-by: claude-code:claude-opus-4-7 [Read] [Edit] [Write] [Bash]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
The KNN exemplars field shipped as a single textarea with "one per
line" semantics. That broke for any prompt that itself contained a
newline — a realistic case for the multi-line prompts admins want to
paste in verbatim from real traffic — and gave them a tiny 3-row
box for what's often the most consequential field on the form.

New ExamplesEditor renders one resizable textarea per exemplar with
add / remove buttons. Each exemplar can hold arbitrary text
including line breaks; the array on the wire stays a plain []string
that the KNN classifier already consumes unchanged.

Assisted-by: claude-code:claude-opus-4-7 [Read] [Edit] [Bash]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
Wires the consuming side of the LLMRouter-style data pipeline: the
KNN classifier can now load exemplars from a JSONL file alongside
(or instead of) hand-written candidate.examples. The benchmarker
itself isn't shipped yet — this lets the consumer be ready when it
lands, and lets admins drop in third-party datasets (LLMRouter's
own outputs etc.) directly.

JSONL shape (one row per query):

  {"_meta": {"embedding_model": "longformer-base-4096",
             "embedding_dim": 768,
             "judge": "claude-opus",
             "judge_method": "pairwise_winrate"}}
  {"query": "fix the bug in this function",
   "best_model": "qwen-coder",
   "scores": {"qwen-coder": 0.92, "qwen-chat": 0.45},
   "embedding": [0.12, ...]}
  {"query": "hello", "best_model": "qwen-chat"}

The _meta header is optional. embedding/scores per row are optional.
Blank lines and "#" comments are skipped.

Loader (pkg pii/services/routing/router/routing_data.go):
- LoadRoutingDataset(path) parses JSONL, validates {query, best_model}
  on each row, returns RoutingDataset{Meta, Rows}.
- 8MB per-line buffer so rows carrying long queries and full
  embedding arrays fit.
- FilterByCandidates(modelNames) drops rows whose best_model isn't
  configured — admins can share one benchmark across deployments
  with different lineups.
- EmbeddingsMatch(name, dim) reports whether stored embeddings can
  be used verbatim (saving 10-100x cold-start cost on large
  datasets).

KNN integration:
- KNNCandidate gains a Model field; the loader maps row.best_model →
  candidate.Label.
- NewKNNClassifier signature gains a trailing KNNOptions{Dataset,
  EmbeddingModelName}; existing call sites pass KNNOptions{}.
- Seeding has two passes — hand-written Examples first, then dataset
  rows. Empty-Examples candidates are no longer a constructor panic;
  with no dataset and no examples, the seed step fails on first
  Classify and the middleware falls back (the right failure mode).
- Pre-computed embeddings are honoured iff the dataset's
  _meta.embedding_model matches the configured embedder; otherwise
  re-embed (different embedders → different vector spaces).
- Rows referencing models the router doesn't know about are silently
  dropped.

Config:
- RouterConfig.ExemplarsFile (yaml: exemplars_file) names the JSONL.
  Relative paths resolve against models dir.
- Field registered in core/config/meta/registry.go so the model
  editor renders it as a path input next to the candidates editor.

Tests cover: meta header parsing, optional header, blank/comment
lines, missing-field validation, malformed JSON, missing file,
candidate filter, embeddings-match check; KNN seeds from dataset,
drops unknown models, uses precomputed embeddings when aligned,
re-embeds when mismatched, combines hand-written + dataset
exemplars.

Out of scope: the benchmarking CLI that produces these files.
Discussed as a separate slice — for general use the recommended
shape is pairwise-LLM-judge over a sampled traffic subset with
LocalAI's PII redactor in front of the judge call.

Assisted-by: claude-code:claude-opus-4-7 [Read] [Edit] [Write] [Bash]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
llama.cpp doesn't support Longformer's sliding-window + global
attention pattern — confirmed by grepping convert_hf_to_gguf.py for
LongformerModel (not present; supported encoder archs are Bert,
DistilBert, Roberta, XLMRoberta, NomicBert, JinaBert, ModernBert,
NeoBERT, EuroBert).

For routing the dataset schema is encoder-agnostic; we just need
SOME long-context sentence encoder. nomic-embed-text-v1.5
(NomicBert arch, 8192 native context, GGUF available, already
in gallery/index.yaml) fits the bill and runs on the existing
llama-cpp embedding path.

Updates the model-editor description for router.embedding_model
to surface nomic-embed-text-v1.5 as the default suggestion, with
modernbert-embed-base / jina-embeddings-v3 as alternatives.

Also corrects an inaccurate comment in routing_data.go that
conflated Longformer's context length (4096 tokens) with
embedding dimensionality (768) when justifying the 8MB scanner
buffer.

Assisted-by: claude-code:claude-opus-4-7 [Edit] [Bash]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

Replace the prior feature/knn/llm router classifiers with a single
score-based classifier that asks an Arch-Router-style model to rank
every policy label as a continuation of the routing prompt and reads
off the softmax distribution. Multi-label routing falls out of this
naturally: the middleware activates every label whose probability
crosses a softmax threshold and picks the first candidate whose
labels are a superset of the active set.

Wiring summary:

  - backend.proto adds Score(ScoreRequest) → ScoreResponse. The
    llama-cpp C++ backend implements Score on top of force-decoded
    candidates against a freshly-cleared KV cache (prompt-KV sharing
    optimisation is on the perf TODO list); vLLM uses prompt_logprobs.
    Other backends return UNIMPLEMENTED.
  - core/services/routing/router/score.go is the classifier. It builds
    the ChatML routing prompt once at construction, scores every
    policy label as a continuation, and applies an activation
    threshold (default 0.15; 0.40 is a better empirical default on
    Arch-Router-1.5B per the eval in features/middleware.md).
  - RouterConfig grows Policies, ActivationThreshold, and an optional
    EmbeddingCache nested struct. RouterCandidate collapses to
    {Model, Labels[]} — labels are the matching contract, descriptions
    live on the policy.
  - The dead feature/knn/llm/routing_data files are removed.

L2 embedding cache:

  - core/services/routing/router/embedding_cache.go wraps a Classifier
    decorator that embeds each probe, KNN-searches the per-router
    local-store collection, returns a cached decision if the cosine
    similarity passes a threshold (default 0.80, lowered from 0.92
    after the eval against nomic-embed-text-v1.5 paraphrases). Low-
    confidence decisions are deliberately not cached so they can't
    poison future paraphrases.
  - Stats include hits, misses, near_misses, low_confidence, and a
    10-bin similarity histogram so admins can see where the cosine
    distribution sits relative to the configured threshold.
  - Registry tracks built classifiers by fingerprint of the
    RouterConfig YAML, so config edits invalidate the cache wrapper
    automatically while the on-disk vectors persist.

UI:

  - The model-editor schema is rewritten: dead KNN/LLM fields gone,
    policies/activation_threshold/embedding_cache.* added with proper
    descriptions, sliders, and component bindings.
  - RouterCandidatesEditor is rewritten for {model, labels[]} with
    multi-select label chips populated from router.policies via a new
    FormContext.
  - RouterPoliciesEditor is the structured editor for the label
    vocabulary, with duplicate-label detection via a memoised set.
  - The Routing tab on /app/middleware renders the embedding-cache
    histogram inline with a threshold marker.

Verification:

  - Unit tests cover the score classifier (multi-label activation,
    fallback, depth-1) and the embedding cache (hit, near-miss,
    low-confidence skip, embedder/store error fallthrough, histogram
    population).
  - Refreshed e2e specs (router-template.spec.js, middleware-page.spec.js)
    pass under make test-ui-e2e-docker: 133/135 passing with the two
    failures unrelated to this slice.
  - End-to-end eval against the LocalAGI stack with a 30-prompt corpus
    + 3 paraphrases each produced 35% steady-state hit rate at 0.80
    threshold (53% of caching-eligible decisions), 15ms p50 cache-hit
    latency vs 246ms classifier round-trip — a ~16× speedup on hits.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Richard Palethorpe <io@richiejp.com>
@richiejp force-pushed the feat/routing-stats-backend branch from 8389d96 to 99f79f4 on May 14, 2026 at 14:13