
feat(middleware): Model routing, PII filtering, Cloud model proxies #9802

Open

richiejp wants to merge 38 commits into mudler:master from richiejp:feat/routing-stats-backend

Conversation

@richiejp
Collaborator

@richiejp richiejp commented May 13, 2026

Allows requests to be analyzed, then routed, filtered, and transformed.

Chat requests can be classified and labelled as requiring particular capabilities, then routed to a model which satisfies all of them. Naturally, requests that require fewer capabilities can be handled by smaller specialized models. In addition, the classifier selects more capabilities the less certain it is, routing difficult requests to larger general-purpose models.

Classification is already fast, but once requests have been classified, their embeddings can be used to avoid classifying similar requests again. This works by labelling the embeddings of past requests and then running a cosine-similarity search on the embeddings of new requests.
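The classify-once-then-cache idea can be sketched in Go (a minimal illustration; the `labeled` type, `lookup` helper, and the 0.95 threshold are invented for the example, not the PR's actual code):

```go
package main

import (
	"fmt"
	"math"
)

// labeled is the embedding of a previously classified request plus the
// capability labels the classifier assigned to it.
type labeled struct {
	vec    []float64
	labels []string
}

func cosine(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// lookup returns the labels of the most similar past request when the
// similarity clears the threshold; otherwise ok=false and the caller
// falls back to running the classifier.
func lookup(cache []labeled, q []float64, threshold float64) (labels []string, ok bool) {
	best := -1.0
	for _, c := range cache {
		if s := cosine(c.vec, q); s > best {
			best, labels = s, c.labels
		}
	}
	return labels, best >= threshold
}

func main() {
	cache := []labeled{{vec: []float64{1, 0, 0}, labels: []string{"code"}}}
	if l, ok := lookup(cache, []float64{0.9, 0.1, 0}, 0.95); ok {
		fmt.Println("cache hit:", l)
	}
}
```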


Private information can be detected; when it is found in a request, the request can be modified to redact it, routed differently, or blocked.


Cloud models and a MITM proxy can be configured and take part in filtering and routing.
This allows sending easy requests to smaller local models and hard ones to cloud models.
The MITM proxy allows you to use Claude Code or Codex subscriptions (OAuth) with the PII
filter and potentially even with routing (although this is limited by the cloud providers' ToS).


Routing classifies requests using a model such as ArchRouter. We score each request
against the capabilities it may require and pick a model which has all of the
capabilities scoring towards the top of the distribution.
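A minimal sketch of that selection step, assuming each model advertises a capability set and that we prefer the smallest model satisfying every required capability (the `model` type and `size` field are illustrative, not the PR's actual structs):

```go
package main

import "fmt"

type model struct {
	name string
	caps map[string]bool
	size int // rough parameter-count proxy; smaller is cheaper
}

// pick returns the smallest model that satisfies every required capability.
func pick(models []model, required []string) (string, bool) {
	bestName, bestSize, found := "", 0, false
	for _, m := range models {
		ok := true
		for _, c := range required {
			if !m.caps[c] {
				ok = false
				break
			}
		}
		if ok && (!found || m.size < bestSize) {
			bestName, bestSize, found = m.name, m.size, true
		}
	}
	return bestName, found
}

func main() {
	models := []model{
		{"small-coder", map[string]bool{"code": true}, 3},
		{"big-general", map[string]bool{"code": true, "reasoning": true, "math": true}, 70},
	}
	fmt.Println(pick(models, []string{"code"}))              // small-coder true
	fmt.Println(pick(models, []string{"code", "reasoning"})) // big-general true
}
```

Uncertain requests pick up extra required capabilities, which naturally pushes them towards the larger general-purpose candidates.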


The ability to score multiple choices is an interesting feature in its own right:
it lets you very quickly check the probability with which an LLM would produce a
particular answer.

  • feat(routing): add billing recorder and stats backend foundation
  • feat(routing): expose usage stats in REST, UI, and MCP
  • feat(routing): add regex PII filter with REST and MCP surfaces
  • feat(routing): record usage end-to-end in no-auth mode
  • feat(routing): per-model PII gating + middleware admin page
  • feat(routing): rule-based intelligent router (subsystem 2 MVP)
  • feat(routing): streaming PII filter with buffered-emit invariant
  • feat(routing): PII pattern editor in model config UI
  • feat(routing): streaming PII filter on Anthropic /v1/messages and /v1/completions
  • feat(routing): cloud passthrough proxy (subsystem 4 MVP)
  • docs(routing): cloud passthrough proxy feature page
  • feat(routing): MITM proxy for subscription-auth Claude Code / Codex
  • feat(mitm): negotiate HTTP/2 with h1.1 fallback
  • refactor(cloudproxy): extract shared SSE wire helpers, trim dead state and comments
  • feat(import-model): add cloud-proxy templates to YAML editor
  • Revert "feat(import-model): add cloud-proxy templates to YAML editor"
  • feat(model-editor): add cloud-proxy templates to Add Model picker
  • feat(mitm): runtime control of listener and intercept allowlist
  • feat(middleware-ui): MITM proxy admin tab
  • refactor(mitm): simplify-pass cleanup
  • feat(mitm): emit proxy_connect + proxy_traffic audit events
  • test(mitm): cover tunneled-host event + Events tab kind filter
  • fix(mitm): restore listener from runtime_settings.json on restart
  • fix(routing): address code-review findings across pii/mitm/router
  • feat(middleware): per-pattern PII toggle, model-config-owned MITM hosts
  • refactor(store/local): extract in-process vector store library
  • feat(routing): KNN + LLM classifiers and per-model admission control
  • refactor(store): keep the vector store out of the main process
  • feat(backend): TokenClassify RPC + transformers NER pipeline
  • fix(openai): add missing auth import to chat.go
  • feat(pii): NER tier in the redactor
  • feat(middleware-ui): router template + Create routing model link
  • fix(model-editor): code-editor crash on structured template values
  • feat(model-editor): structured router-candidates editor + proxy chat usecase
  • fix(router-candidates): one textarea per exemplar, multi-line-safe
  • feat(router): KNN consumes a benchmarker-produced routing dataset
  • docs(router): recommend nomic-embed-text-v1.5 over Longformer
  • feat(routing): Score gRPC primitive, score classifier, L2 embedding cache

@richiejp richiejp force-pushed the feat/routing-stats-backend branch 4 times, most recently from aff5af4 to 8389d96 on May 13, 2026 at 14:54
richiejp added 24 commits May 14, 2026 14:45
Introduces core/services/routing/{contract,billing} as the foundation
for the routing module. The billing recorder is wired through the
existing UsageMiddleware and runs unconditionally — a no-auth single-
user box now records token usage under a synthetic "local" user, where
previously the middleware short-circuited on a nil auth DB and zero
stats were captured.

- StatsBackend interface with three impls (gorm, in-memory ring,
  disabled) selected at startup; Recorder fans out to backend + Prom
  counters from a single increment site so DB and metrics cannot
  diverge.
- UsageRecord schema extended with RequestedModel/ServedModel,
  Pre/PostFilterPromptTokens, pricing version, cost, and correlation/
  router/PII foreign keys (all nullable; AutoMigrate handles existing
  deployments).
- Synthetic LocalUser persisted to ${DataPath}/.local_user_id so usage
  history aggregates across restarts in single-user mode.
- contract.Invariant emits localai_invariant_violation_total and panics
  under -tags=routing_strict for nightly E2E surfacing.
- --disable-stats opt-out for ephemeral CI runs.
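The single-increment-site idea can be sketched like this (types are invented for illustration; the real Recorder fans out to a StatsBackend implementation and Prometheus counters):

```go
package main

import "fmt"

// statsBackend is the persistence interface; three implementations
// (gorm, in-memory ring, disabled) are selected at startup in the PR.
type statsBackend interface{ Record(user string, tokens int) }

type memBackend struct{ rows []int }

func (m *memBackend) Record(_ string, tokens int) { m.rows = append(m.rows, tokens) }

type recorder struct {
	backend   statsBackend
	promTotal int // stand-in for a Prometheus counter
}

// Record is the single increment site: the backend write and the metric
// bump always happen together, so DB rows and metrics cannot diverge.
func (r *recorder) Record(user string, tokens int) {
	r.backend.Record(user, tokens)
	r.promTotal += tokens
}

func main() {
	r := &recorder{backend: &memBackend{}}
	r.Record("local", 42)
	fmt.Println(r.promTotal) // 42
}
```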

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Richard Palethorpe <io@richiejp.com>
Wires the billing recorder from the previous commit into user-facing
surfaces. Before this, the Recorder collected data but no endpoint
queried it without auth, the UI hid the Usage page in single-user
mode, and there was no MCP tool to read stats. After:

- New REST endpoints GET /api/usage and /api/usage/all that go through
  application.StatsRecorder() and fall back to the synthetic local
  user when auth is off. Old /api/auth/usage stays as the auth-only
  alias. Both new endpoints carry swagger annotations under the
  "usage" tag.
- Sidebar drops authOnly:true on the Usage entry; Usage.jsx picks the
  endpoint based on authEnabled and skips the empty-state-bail when
  auth is off.
- /api/instructions registry gains a "usage-and-billing" entry so
  agents discover the surface; the existing reachability test bumps to
  13 instructions and asserts the new name is present.
- New MCP tool get_usage_stats with read-only semantics, registered
  under the existing localaitools server. coverage_test.go
  ::TestToolHTTPRouteMappingComplete documents the route pairing;
  expectedFullCatalog and expectedReadOnlyCatalog include the tool.
  Both inproc and httpapi clients implement GetUsageStats; the inproc
  client picks up the StatsRecorder + FallbackUser at construction in
  application.go.
- Playwright e2e spec usage-dashboard.spec.js asserts (a) the Usage
  link is visible without auth, (b) the page renders /api/usage data
  without bailing, and (c) auth-on still routes to /api/auth/usage.

Verified end-to-end against tests/e2e-ui/ui-test-server: /api/auth/status
reports authEnabled:false, /api/usage returns the local user with a
stable UUID, /api/usage/all admits the local user as admin.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Richard Palethorpe <io@richiejp.com>
Subsystem 3 of the routing module. The regex tier is the cheap,
deterministic layer; the encoder NER tier (TokenClassify gRPC) is
follow-up work.

Pattern set: email, phone, SSN, credit card with Luhn verification,
IPv4 (with octet bounds-check), and common API key prefixes (sk-,
pk-, xoxb-, ghp_, github_pat_). Each pattern has one of three
actions:

  - mask: replace the matched span with [REDACTED:<id>] before the
    request reaches the backend. Default for everything except
    api_key_prefix.
  - block: short-circuit the request with HTTP 400 and a pii_blocked
    error type. The matched value is never echoed back to the client.
    Default for api_key_prefix — leaked credentials are higher harm
    than other PII.
  - route_local: leave the text intact but flag the echo context so a
    future content router refuses cloud-proxy candidates. Useful for
    deployments that trust local models with sensitive data but not
    external providers.
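The mask and block actions can be illustrated with a stripped-down redactor (the regexes and types here are simplified stand-ins for the PR's pattern set; route_local is omitted since it only flags context):

```go
package main

import (
	"fmt"
	"regexp"
)

type action int

const (
	mask action = iota
	block
)

type pattern struct {
	id  string
	re  *regexp.Regexp
	act action
}

// Illustrative subset of the pattern catalogue.
var patterns = []pattern{
	{"email", regexp.MustCompile(`[\w.+-]+@[\w-]+\.[\w.]+`), mask},
	{"api_key_prefix", regexp.MustCompile(`sk-[A-Za-z0-9]+`), block},
}

// apply masks matched spans in place; a block pattern short-circuits,
// and the matched value is never echoed back.
func apply(text string) (out string, blocked bool) {
	for _, p := range patterns {
		if p.act == block && p.re.MatchString(text) {
			return "", true
		}
		text = p.re.ReplaceAllString(text, "[REDACTED:"+p.id+"]")
	}
	return text, false
}

func main() {
	fmt.Println(apply("contact alice@example.com about it"))
	fmt.Println(apply("my key is sk-abc123"))
}
```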

Wiring:

  - core/services/routing/pii: types, regex compile, redactor, in-
    memory event ring buffer, YAML config loader, request middleware.
  - core/services/routing/piiadapter: per-API-shape adapter (OpenAI
    today; Anthropic when needed) so the schema package never imports
    pii.
  - core/http/routes/openai.go: wires pii.RequestMiddleware as the
    innermost middleware in the chat slice — runs after the request
    is parsed, mutates the request body in place when masking, returns
    400 when blocking.
  - core/http/routes/pii.go: GET /api/pii/patterns, GET /api/pii/events,
    POST /api/pii/test (admin-or-local-user; events filterable by
    correlation_id, user_id, pattern_id).
  - pkg/mcp/localaitools: list_pii_patterns, get_pii_events,
    test_pii_redaction tools with full route map coverage in
    coverage_test.go.
  - core/http/endpoints/localai/api_instructions.go: pii-filtering
    instructions entry; reachability test bumps to 14.
  - --pii-config / --disable-pii flags; pii.yaml format overrides
    per-id action with unknown-id rejection at startup.

PIIEvent records never carry the matched value — only the byte
offset, length, and an 8-char sha256 prefix so admins can dedupe
recurring leaks during audit. The contract.Invariant
"pii.event_per_span" asserts every redacted span produces an event
record.
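The hash-prefix scheme is a one-liner over the matched value (the `hashPrefix` name is illustrative):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// hashPrefix returns the first 8 hex chars of sha256(value) so audit
// events can deduplicate recurring leaks without ever storing the value.
func hashPrefix(value string) string {
	sum := sha256.Sum256([]byte(value))
	return hex.EncodeToString(sum[:])[:8]
}

func main() {
	fmt.Println(hashPrefix("alice@example.com"))
}
```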

Verified end-to-end against ui-test-server: GET /api/pii/patterns
returns the 6 defaults with correct actions; POST /api/pii/test with
"contact alice@example.com" returns
'redacted="contact [REDACTED:email] about it"' and a span with
hash_prefix=ff8d9819; same with "sk-..." returns blocked=true.

Streaming response filter (the buffered-emit invariant) is in the
plan as a separate slice and not in this commit.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Richard Palethorpe <io@richiejp.com>
Streaming chat completions weren't producing UsageRecords because the
middleware only parsed token counts from the response body — and OpenAI
clients rarely set stream_options.include_usage, while Anthropic uses a
different shape entirely. Handlers now stamp the canonical token counts
on the echo context via middleware.StampUsage; UsageMiddleware reads the
stamp first and only falls back to body-parse for proxy/foreign
endpoints. The body-parse fallback gains an Anthropic shape so
passthrough proxies for /v1/messages still work.

Billing's Prometheus counters were never reaching /metrics because the
monitoring service that calls otel.SetMeterProvider was created later
than billing.NewRecorder, leaving the counters bound to the no-op global
provider. The metrics service now initialises in application.start()
before any counter is registered, exposes its meter via Application
.MetricsService(), and hands it directly to billing via SetMeter() so
the order-of-operations dependency is explicit rather than racy.

The synthetic local user is now wired unconditionally when stats are
enabled (not just when authDB is nil), so internal/system callers under
auth-on still attribute correctly. The /app/users React route is
guarded by a new RequireAuthEnabled component that redirects to /app
when auth is off, defending against direct URL access of an admin-only
page that has nothing to manage in single-user mode.

A new localai_usage_unrecorded_total{endpoint,reason} counter ticks
whenever a request finishes without producing a record, so silent
billing misses are observable rather than invisible.

Verified end-to-end: chat (streaming + non-streaming), embeddings, and
Anthropic messages (streaming + non-streaming) each produce one
UsageRecord and one Prom counter increment in no-auth mode.

Assisted-by: claude-code:claude-opus-4-7 [Read] [Edit] [Bash]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
Move PII filtering from a global opt-out to a per-model opt-in: local
models bypass redaction by default, while backends matching `proxy-*`
default to on (forward-compatible with the cloud-passthrough subsystem).
A new ModelConfig.PII block lets a model opt in (`enabled: true`) and
upgrade or downgrade individual pattern actions without touching global
config. The middleware reads the resolved config from the echo context
and short-circuits when disabled, so a chat to a local model pays no
regex-scan cost.

The Anthropic /v1/messages route gains the same redaction path via a
new piiadapter.Anthropic() that walks AnthropicRequest.Messages —
identical shape to the OpenAI adapter, so a future passthrough proxy
gets PII for free.

A new admin page at /app/middleware (System section, admin-only)
surfaces the live state. Three tabs: Filtering shows the pattern
catalogue with action editors plus every model's resolved enabled state
and overrides; Routing is a placeholder until subsystem 2 lands; Events
renders recent PIIEvents (correlation id, pattern id, action, hash
prefix — the redacted content is never stored or displayed). The page
reads /api/middleware/status (a single-round-trip aggregator) and
mutates pattern actions via PUT /api/pii/patterns/:id (transient,
restored from --pii-config on restart). MCP exposes the same surface as
get_middleware_status and set_pii_pattern_action so an agent can
introspect or tune the filter without code access. The drift detector
in pkg/mcp/localaitools/coverage_test.go still passes — both new tools
ship with their HTTP route mappings.

Behaviour change for existing deployments: local models no longer
receive global PII redaction without an explicit `pii: { enabled: true }`
in their YAML. Documented in the new middleware-admin instructions
registry entry.

End-to-end verified against tests/e2e-ui/ui-test-server (which gains a
--pii-yaml flag for injecting per-model PII config into the auto-
generated mock-model.yaml): default-off produces no events; explicit
opt-in produces a mask event; per-model action override produces an
HTTP 400 pii_blocked response.

Assisted-by: claude-code:claude-opus-4-7 [Read] [Edit] [Bash]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
Add the routing subsystem's content-router tier: a Router config block
on ModelConfig turns a model into a smart-router that classifies each
request and rewrites input.Model to one of its candidates. The standard
model-resolution path then runs ACL, disabled-state, and per-model PII
against the chosen target — the router only does *model* selection,
not node selection (SmartRouter still owns the latter in distributed
mode).

The classifier interface lives in core/services/routing/router with one
shipped implementation: a feature classifier that picks a candidate by
prompt length and code-fence presence. The router.Probe shape is
schema-agnostic; per-API-shape extractors (OpenAIProbe, AnthropicProbe)
in core/http/middleware translate parsed requests into probes without
dragging the schema package into the router. The interface deliberately
doesn't depend on core/config — callers translate RouterCandidate
slices into FeatureCandidate slices at construction time.

The new RouteModel middleware runs after SetModelAndConfig + body
parse but before the PII filter. When the resolved config has a
Router block, the middleware invokes the classifier, looks up the
matched label in the candidate table, reloads the target model's
config, asserts depth-1 (the candidate must NOT itself be a router —
chained routers turn dispatch into a graph), and swaps MODEL_CONFIG +
input.Model in place. RequestedModel/ServedModel get stamped on the
context so the usage log records the routing. Classifier failures and
unknown labels fall through to Router.Fallback; fallback-empty errors
return 503 rather than silently bypassing.
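The dispatch logic above, reduced to its essentials (types and field names are illustrative, not the PR's actual config structs):

```go
package main

import (
	"errors"
	"fmt"
)

type modelConfig struct {
	name   string
	router *routerBlock // non-nil means this model is itself a router
}

type routerBlock struct {
	candidates map[string]string // classifier label → candidate model name
	fallback   string
}

// route resolves a classifier label to a candidate config, enforcing
// depth-1: a candidate must not itself be a router.
func route(cfgs map[string]*modelConfig, r *routerBlock, label string) (*modelConfig, error) {
	name, ok := r.candidates[label]
	if !ok {
		name = r.fallback // classifier failure / unknown label
	}
	if name == "" {
		return nil, errors.New("no candidate and empty fallback") // surfaces as 503
	}
	target, ok := cfgs[name]
	if !ok {
		return nil, fmt.Errorf("unknown model %q", name)
	}
	if target.router != nil {
		return nil, fmt.Errorf("candidate %q is itself a router (depth-1 violated)", name)
	}
	return target, nil
}

func main() {
	cfgs := map[string]*modelConfig{
		"small-model": {name: "small-model"},
		"large-model": {name: "large-model"},
	}
	r := &routerBlock{candidates: map[string]string{"short": "small-model"}, fallback: "large-model"}
	m, _ := route(cfgs, r, "short")
	fmt.Println(m.name) // small-model
	m, _ = route(cfgs, r, "unknown-label")
	fmt.Println(m.name) // large-model
}
```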

The decision log is a ring-buffer in core/services/routing/router that
mirrors the PII event log: in-memory by default, capped at 5k records,
filterable by correlation_id / user_id / router_model. New REST
endpoints surface it: GET /api/router/decisions (admin-only) and an
updated GET /api/router/status that lists configured router models +
their classifier configs. The /api/middleware/status aggregator pulls
the same data so the React Middleware page renders the Routing tab
with active routers and recent decisions side-by-side.

MCP gains a get_router_decisions tool. The coverage drift detector
catches the new tool — its HTTP route is documented in the same map.

The new instructions registry entry "intelligent-routing" explains the
Router block, the depth-1 rule, and points at the decisions endpoint.
Total instructions count → 16.

End-to-end verified: configured mock-model as a smart-router with a
small (max_prompt_length=30) and a large candidate; a 5-char prompt
routes to small-model and a 100-char prompt routes to large-model;
both decisions appear in /api/router/decisions and /api/middleware/
status reflects the active config.

Assisted-by: claude-code:claude-opus-4-7 [Read] [Edit] [Bash]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
Closes the output-side gap in the PII subsystem: until now, redaction
only ran on incoming chat requests. A model could generate "your key
is sk-..." and stream it straight to the client. The new StreamFilter
intercepts the OpenAI chat completion stream's content deltas, applies
the same regex tier the request-side middleware uses, and masks
matches that span chunk boundaries.

The buffered-emit invariant: for any active pattern with bounded
max-length L, the filter holds back the trailing L-1 characters of
the cumulative input. New text disambiguates the boundary; the stream
close (Drain) flushes whatever is left. This is what guarantees the
mask survives an arbitrarily-split chunk sequence — alice@example.com
arriving as "alice@" + "example.com" still becomes [REDACTED:email].

Action handling differs from the request side: earlier chunks are
already on the wire by the time later chunks scan, so a "block" can't
actually reject. The filter remaps block to mask for redaction while
recording PIIEvent rows with action=block so audits surface the
original intent ("the model would have leaked X here, suppressed in
flight"). route_local on output is a no-op (the routing decision was
made at request time).

A property test feeds the redactor every corpus input across 10
random chunkings and asserts (a) no secret value ever appears in the
emitted output and (b) the streamed output equals what a single-shot
redaction would produce on the unsplit text.

Wiring: the OpenAI chat endpoint constructs a per-stream filter when
the resolved ModelConfig has PIIIsEnabled — the same gate the
request-side middleware reads, so a model with PII off pays no
streaming cost either. ChatEndpoint signature gains *pii.Redactor and
pii.EventStore parameters; the legacy /v1/mcp/chat/completions wires
nil values (kept for backward compatibility, request-side filter on
the main route still applies).

The mock-backend gains a MOCK_LEAK_EMAIL prompt sentinel that emits a
response containing alice@example.com — used by the end-to-end test:
streaming chat against a mock-model with pii.enabled=true produces a
data chunk containing [REDACTED:email] and an /api/pii/events row
with direction=out and action=mask.

Anthropic /v1/messages and the bare /v1/completions path are NOT yet
wired; their streaming surfaces will get the same filter in a follow-
up. The StreamFilter type is schema-agnostic so wiring is a small
patch per route.

Assisted-by: claude-code:claude-opus-4-7 [Read] [Edit] [Bash]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
The per-model pii.patterns field was being rendered as a generic
JSON-editor textarea, leaving users to discover the schema by trial and
error. Replace it with a dedicated component that fetches the live
pattern catalog from /api/pii/patterns and presents pattern + action as
two select dropdowns per row, with a separate "add" picker that hides
patterns already overridden.

The pattern catalog is loaded at render time, so new built-in patterns
(when added to DefaultPatterns) surface in the UI automatically without
schema duplication. Unknown IDs already in the YAML still render so
hand-edited configs aren't lost on first load.

Also gives pii.enabled a proper label and description in the config
metadata registry so the toggle isn't an opaque "Enabled" entry under
"Other".

Assisted-by: claude-code:claude-opus-4-7
Signed-off-by: Richard Palethorpe <io@richiejp.com>
…/completions

Closes the streaming-coverage gap flagged in 8d421453. The StreamFilter
type is wire-format-agnostic, so wiring it into the remaining streaming
surfaces is a per-route patch:

- Anthropic /v1/messages: text_delta is the only content surface that
  carries model output; wrap each emit (token-callback path, ChatDeltas
  path, autoparse fallback) so a pattern split across SSE chunks still
  gets masked. Drain the buffered tail before any content_block_stop on
  the text block (normal close, tool-call transitions, autoparse), so
  trailing residue isn't silently truncated when the model pivots into
  a tool_use block. Block→mask remap and per-model action overrides
  follow the same gating as the OpenAI chat path.

- /v1/completions: response-side only — the endpoint has no chat
  message structure for request-side scanning, but a model trained on
  PII can still emit it. Filter Choices[0].Text per chunk and drain the
  residue into one final text-bearing chunk just before the stop
  chunk + [DONE].

Same per-model gate as elsewhere: PII off for non-proxy backends by
default, on for proxy-* / explicit pii.enabled = true. Filter is nil
when disabled — flow is untouched.

Subsystem 3 (PII) is now feature-complete for the MVP scope across
both directions on chat/completions/messages. Encoder NER tier
(TokenClassify gRPC) remains as a follow-up.

Assisted-by: claude-code:claude-opus-4-7
Signed-off-by: Richard Palethorpe <io@richiejp.com>
Adds wire-format-faithful HTTP+SSE forwarding for models whose Backend
starts with `proxy-` and whose `proxy.upstream_url` is set. The chat
and messages handlers fork to the proxy before any local templating
or gRPC dispatch, so the upstream sees the request body the client
sent (with only the top-level `model` field optionally rewritten).

The streaming PII filter rides on top: per-token text is extracted
from each SSE chunk, pushed through pii.StreamFilter, and spliced
back into the original envelope so the upstream's event names and
metadata pass through untouched. PII residue flushes before the
provider's terminal marker ([DONE] / message_stop) so clients that
stop reading on the marker don't lose the tail.

Auth is provider-aware (OpenAI Bearer, Anthropic x-api-key +
anthropic-version header). API keys read from env vars named in
config so secrets stay out of YAML and the admin UI.

No request-shape translation in the MVP — a client posting
OpenAI-shaped requests to a proxy-anthropic model gets a confused
upstream. Cross-shape forwarding is deliberately deferred; tool-call
argument round-tripping and reasoning-content passthrough deserve
their own review.

Assisted-by: claude-code:claude-opus-4-7
Signed-off-by: Richard Palethorpe <io@richiejp.com>
Adds a copy-paste-ready model config template for both proxy-openai and
proxy-anthropic, covering API key handling via env vars, model name
rewriting, request timeout, and the per-model PII gate. Includes a
section on combining proxy models with the intelligent router so a
single LocalAI instance can mix local and cloud candidates behind one
classifier.

Documents the MVP limitations explicitly (no request-shape translation,
no output-side PII for buffered responses, no retry) so users don't
hit them as surprises.

Assisted-by: claude-code:claude-opus-4-7
Signed-off-by: Richard Palethorpe <io@richiejp.com>
Adds an HTTPS forward proxy that selectively MITMs traffic for
allowlisted LLM API hosts so LocalAI can apply per-request PII
redaction to clients authenticating via OAuth / subscription rather
than via API keys held by LocalAI. Hosts outside the allowlist get a
plain CONNECT tunnel — OAuth flows, telemetry, and unrelated HTTPS
keep working without depending on the CA being trusted.

Components:
- mitm.CA: ECDSA-P256 CA, generated once and persisted (key 0600)
- mitm leaf cache: per-SNI leaf certs minted on demand, cached in-mem
- mitm.Server: CONNECT-aware HTTP server, hijacks the conn, mints
  leaf, terminates TLS, parses HTTP/1.1 requests, dispatches
- mitm PII handler: re-uses the existing piiadapter for request
  redaction and pii.StreamFilter for SSE response redaction; runs
  only on /v1/messages and /v1/chat/completions paths (others pass
  through verbatim, preserving Anthropic-OAuth and OpenAI-Codex
  auth flows untouched)
- Application wiring: --mitm-listen / --mitm-ca-dir /
  --mitm-intercept-hosts CLI flags. Off by default. CA cert exposed
  unauthenticated at GET /api/middleware/proxy-ca.crt for client
  trust-store install.

Primary use case: redact PII from Claude Code sessions running
against a Claude Pro/Max subscription, where LocalAI doesn't hold
(and can't use) an API key. Codex CLI works the same way.

HTTP/1.1 only; HTTP/2 deferred (most CLIs negotiate down without
issue).

Assisted-by: claude-code:claude-opus-4-7
Signed-off-by: Richard Palethorpe <io@richiejp.com>
Previously the MITM proxy terminated TLS as HTTP/1.1 only. Modern
LLM-API clients (Claude Code, Codex CLI) and the Anthropic / OpenAI
APIs themselves all speak HTTP/2 — h2 multiplexing is what makes
streaming responses cheap. Forcing h1.1 in the middle of the path
worked but cost a measurable per-request overhead and would have
broken any future client that drops h1 support.

Changes:
- proxy.go: TLS NextProtos = ["h2", "http/1.1"]; after handshake
  branch on NegotiatedProtocol. h2 path uses http2.Server.ServeConn
  with the InterceptHandler wrapped as an http.Handler. h1.1 path
  retains the manual request-loop with connResponseWriter as a
  fallback for legacy clients.
- handler.go: outbound http.Transport explicitly configured with
  http2.ConfigureTransport so the upstream leg also negotiates h2.
- go.mod: promote golang.org/x/net to a direct dependency (was
  indirect via websocket).
- New tests: TestProxy_NegotiatesHTTP2 verifies resp.Proto ==
  "HTTP/2.0", TestProxy_HTTP2Streaming covers SSE-over-h2 with per-
  frame flush, TestProxy_HTTP1Fallback locks the legacy path.

The InterceptHandler signature is unchanged — h2 streams map 1:1 to
http.Request, just like h1, so handlers don't have to know which
protocol is on the wire.
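ALPN negotiation with an h1.1 fallback comes down to NextProtos ordering; a self-contained demonstration over an in-memory pipe (the certificate helper and `negotiate` function are invented for the example):

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/tls"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"net"
	"time"
)

// selfSigned builds a throwaway server cert so the handshake below runs
// entirely in memory.
func selfSigned() tls.Certificate {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		panic(err)
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(1),
		Subject:      pkix.Name{CommonName: "test"},
		NotBefore:    time.Now(),
		NotAfter:     time.Now().Add(time.Hour),
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		panic(err)
	}
	return tls.Certificate{Certificate: [][]byte{der}, PrivateKey: key}
}

// negotiate performs a TLS handshake over an in-memory pipe and returns
// the ALPN protocol the server selected. Listing "h2" first gives h2 to
// capable clients while keeping the http/1.1 fallback for legacy ones.
func negotiate(clientProtos []string) string {
	cp, sp := net.Pipe()
	srv := tls.Server(sp, &tls.Config{
		Certificates: []tls.Certificate{selfSigned()},
		NextProtos:   []string{"h2", "http/1.1"},
	})
	cli := tls.Client(cp, &tls.Config{
		InsecureSkipVerify: true,
		NextProtos:         clientProtos,
	})
	done := make(chan error, 1)
	go func() { done <- srv.Handshake() }()
	if err := cli.Handshake(); err != nil {
		panic(err)
	}
	<-done
	// the real server branches here: "h2" → http2.Server.ServeConn,
	// anything else → the manual HTTP/1.1 request loop
	return srv.ConnectionState().NegotiatedProtocol
}

func main() {
	fmt.Println(negotiate([]string{"h2", "http/1.1"})) // h2
	fmt.Println(negotiate([]string{"http/1.1"}))       // http/1.1
}
```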

Assisted-by: claude-code:claude-opus-4-7
Signed-off-by: Richard Palethorpe <io@richiejp.com>
…e and comments

- New core/services/cloudproxy/ssewire package owns the SSE scanner
  and the per-provider rewrite/terminal/residual helpers; cloudproxy
  and mitm both import it. Removes ~150 lines of literal duplication
  between mitm/sse.go and cloudproxy/{sse,proxy}.go.
- handler.go: replace dispatchPIIIntercept (8 positional params) with
  a piiDispatcher struct built once at NewPIIHandler time. Hoists the
  pattern→action map out of the per-request hot path, fixes a PII
  event-ID collision when one request triggered multiple spans of
  the same pattern (now uses an atomic seq), and stops silently
  dropping store.Record errors.
- proxy.go: cache streaming(body) result instead of re-parsing JSON.
- ca.go: drop the redundant certDER field; use cert.Raw, the byte-
  identical buffer x509.ParseCertificate already populates.
- Trim package docs and over-narrating per-declaration comments to
  match the project style guide (only WHY when non-obvious).

No behaviour change. All existing tests pass.

Assisted-by: claude-code:claude-opus-4-7
Signed-off-by: Richard Palethorpe <io@richiejp.com>
Adds two starter YAMLs to the Import Model page's Power → YAML view:
"OpenAI proxy" and "Anthropic proxy". Clicking either fills the
editor with a working proxy-* skeleton — backend, upstream URL,
api_key_env (so the secret stays out of YAML), upstream_model,
request_timeout_seconds, and a sensible per-model PII gate.
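An illustrative config along those lines (field names are taken from the descriptions in this PR; the exact schema and nesting may differ, and the model names are placeholders):

```yaml
# Hypothetical proxy-model skeleton — adjust names to your deployment.
name: my-cloud-model
backend: proxy-anthropic
proxy:
  upstream_url: https://api.anthropic.com
  api_key_env: ANTHROPIC_API_KEY   # secret stays out of YAML
  upstream_model: upstream-model-name
  request_timeout_seconds: 120
pii:
  enabled: true                    # proxy-* backends default to on
```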

Templates appear next to the Copy button so they're discoverable
without leaving the editor. The user fills in their own model
name, upstream URL, and env-var name and submits.

Assisted-by: claude-code:claude-opus-4-7
Signed-off-by: Richard Palethorpe <io@richiejp.com>
This reverts commit f11c533ceb9b7c164023ca27e21259d29196bd95.

Signed-off-by: Richard Palethorpe <io@richiejp.com>
Adds two template cards to the Add Model page (/app/model-editor in
create mode): "OpenAI Proxy" and "Anthropic Proxy". Picking either
pre-fills the form with backend, upstream URL, api_key_env,
upstream_model placeholder, request timeout, and pii.enabled — the
user fills in the model name, the env-var name, and the upstream
model and saves.

This is the right home for the proxy starter; the Import Model page
is reserved for fetching artefacts from HF / Ollama / OCI and the
proxy doesn't fit that pattern.

Assisted-by: claude-code:claude-opus-4-7
Signed-off-by: Richard Palethorpe <io@richiejp.com>
Adds MITMListen and MITMInterceptHosts to RuntimeSettings so an
admin can flip the cloudproxy MITM listener on/off and edit the
intercept allowlist via /api/settings (already admin-gated; locked
down by --disable-runtime-settings when the operator wants no
runtime mutation at all).

The CA dir stays startup-only — the persisted CA is the trust
anchor for every already-installed client, and rotating it from a
REST endpoint would orphan them. Editing the listen address or
allowlist reuses the same CA via Application.RestartMITM, which
stops the old listener (if any), reads the current config, and
starts a new one.

Also adds a "mitm" section to GET /api/middleware/status so the
admin page can render running state, configured vs bound listen
address, allowlist, and the CA download URL.

Assisted-by: claude-code:claude-opus-4-7
Signed-off-by: Richard Palethorpe <io@richiejp.com>
Adds a "MITM Proxy" tab to /app/middleware. Shows running state +
bound listen address; renders Apply/Discard form for the listen
address and intercept-host allowlist (which writes through to
/api/settings, already admin-gated and lockable by
--disable-runtime-settings); offers a one-click CA cert download
plus a brief client-setup recipe (NODE_EXTRA_CA_CERTS + HTTPS_PROXY)
so an admin can stand up Claude Code / Codex without leaving the
page.

Backend bits already shipped in 76e3b5fe — this turns the data into
a working control surface.

Assisted-by: claude-code:claude-opus-4-7
Signed-off-by: Richard Palethorpe <io@richiejp.com>
- ProxyTab: gate the server→local sync useEffect on !dirty so
  Refresh / post-save refetch can't clobber mid-typed input. The
  intercept_hosts array reference changes per fetchAll(), so the
  previous deps[] silently re-fired every poll.
- Switch ProxyTab.save to settingsApi.save — same path Settings.jsx
  uses. Drops the raw fetch + handcrafted JSON.
- Move mitmMutex from a package-level var onto Application, matching
  p2pMutex / watchdogMutex. Add stopMITMLocked for symmetry with
  startMITMLocked; RestartMITM now reads as
  stopLocked → bail-on-empty → startLocked.
- Add BackendProxyOpenAI / BackendProxyAnthropic constants in
  cloudproxy and use them in providerName. Test-data sites stay as
  literals so a typo'd constant rename still fails the tests.
- Trim a buildMITMStatus comment that just narrated the field names.

No behaviour change.

Assisted-by: claude-code:claude-opus-4-7
Signed-off-by: Richard Palethorpe <io@richiejp.com>
Repurpose the PII event store as a shared middleware audit log: add an
EventKind discriminator (pii | proxy_connect | proxy_traffic) and
proxy-specific fields (Host, Intercepted, BytesSent, BytesReceived,
StatusCode, DurationMS) to the existing PIIEvent record. Keep request
contents out of the store — bodies live in API/backend traces only.

The MITM Server records a proxy_connect row for every CONNECT (with
Host + Intercepted=true|false) so admins can see which hostnames a
client tried to reach and whether the proxy terminated TLS or
tunneled through. The PIIHandler wraps its ResponseWriter to count
bytes downstream and records a proxy_traffic row at request end with
sent/received byte counts, status code, and duration.

The /api/pii/events endpoint accepts a kind= filter. The Middleware
admin page Events tab gains a Kind column, a kind filter row, and
per-kind detail formatting (host + intercept decision for connects;
HTTP status, byte counts, and duration for traffic). The MCP
get_pii_events tool stays scoped to kind=pii so the LLM-facing audit
isn't polluted by proxy rows with empty PII fields.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Richard Palethorpe <io@richiejp.com>
Add a Go test for the tunneled CONNECT path: a non-allowlisted host
must record a proxy_connect with Intercepted=false and zero
proxy_traffic events (since tunneled bytes never reach the dispatcher).

Extend the Playwright spec for the Middleware page Events tab. The
mock event feed now includes a pii row, two proxy_connect rows
(intercept and tunnel decisions), and one proxy_traffic row.
New test cases:
  - proxy_connect rows show "intercepted" / "tunneled" labels
  - proxy_traffic row shows HTTP status, byte counts, and duration
  - the kind filter buttons narrow the table to a single kind
  - the Kind column header and per-kind badges render

Note: Playwright runs failed in the local sandbox (the bundled
chrome-headless-shell can't load libglib on this NixOS host); the
specs are authored against the rendered DOM and will run in CI.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Richard Palethorpe <io@richiejp.com>
loadRuntimeSettingsFromFile applied every other persisted runtime
setting (branding, watchdog, P2P, agent pool, ...) back into options
on startup but skipped the MITM fields. So when an admin configured
the listener via /api/settings, runtime_settings.json on the mounted
volume held mitm_listen + mitm_intercept_hosts, but on restart options
came up empty and the start-MITM gate at startup never fired.

Two changes:

  - loadRuntimeSettingsFromFile now copies MITMListen and
    MITMInterceptHosts from the file when no CLI flag set them. Like
    branding, the file is the only source — there are no env vars for
    these — so an explicit --mitm-listen still wins, but a /api/settings
    save round-trips correctly.

  - The startMITMProxy call moves to after loadRuntimeSettingsFromFile.
    Previously it ran before the file load, so even with the loader
    fix in place options.MITMListen would be empty when the gate
    fired. The watchdog and other restartable subsystems already
    initialize after the load — MITM now matches.

Tests pin the contract:
  - core/config: WritePersistedSettings + ReadPersistedSettings round-trip
    preserves both MITM fields.
  - core/application: loadRuntimeSettingsFromFile populates MITMListen
    and MITMInterceptHosts from a fixture file, and an explicit CLI
    flag wins over the file value.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Richard Palethorpe <io@richiejp.com>
Self-review pass on the routing-stats slice. Each finding paired with
test coverage; one refactor (atomic.Pointer for MITM accessors)
matches the existing agentPoolService precedent in the same struct.

Logic fixes:

- pii/stream.go: snap emitBoundary to a rune start so the held tail
  never contains a split UTF-8 codepoint. Multibyte corpus added to
  the buffered-emit invariant test.
- pii/redactor.go: SetAction publishes a fresh patterns slice
  (slices.Clone) instead of mutating r.patterns[i].Action in place —
  Go strings are not atomic two-word values, so concurrent Redact
  callers iterating an older snapshot would race on the field even
  under RWMutex. Race-stress test added.
- pii/openai adapter: new bit-24 sentinel + 24-bit block field
  (idxWholeStringFlag/idxBlockMask) replaces the 0xFFFF sentinel
  that collided with a real block index of 65535.
- mitm/proxy.go: fail closed if SetDeadline errors before the TLS
  handshake — proceeding into the protocol switch on an
  unhandshaken conn is worse than dropping the connection.
- mitm/response.go: Connection: close compared with EqualFold so
  any casing triggers the post-response disconnect (RFC 9110 §7.6.1).
- application: MITMServer/MITMCA accessors now atomic.Pointer-backed
  (matches agentPoolService); readers no longer race RestartMITM
  on pointer swap. mitmMutex retained only to serialize Stop+Start.
- router/feature.go: prompt length predicates use rune count, not
  byte count — operators reason in characters, not UTF-8 bytes.
  Cached once per Classify call rather than recomputed per candidate.
- mcp/localaitools/inproc: GetUsageStats(All=true, UserID=…) honours
  the UserID filter, matching the REST endpoint's ?user_id param —
  same MCP call now returns the same data over either transport.
- react-ui middleware spec: bytes_received mock changed from 1280 to
  1228 so formatBytes returns the asserted 1.2KB string.

Test coverage added:

- pii: race-detector test for SetAction, multibyte UTF-8 corpus.
- ssewire: direct unit tests for Scanner edge cases (CRLF, leading
  blanks, mid-event EOF) and IsTerminalMarker per provider.
- mitm: Stop idempotency, restart cycle with allowlist swap.
- middleware/route_model: classifier-success, fallback,
  depth-1-invariant, no-fallback-503, unsupported-classifier paths
  + OpenAIProbe/AnthropicProbe extractors.
- anthropic/messages: drainStreamPIIToText covers nil-filter no-op,
  empty-drain no-op, residual emit shape, idempotence, and
  end-of-stream redaction.
- application: symmetric MITMInterceptHosts CLI-wins loader test.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Richard Palethorpe <io@richiejp.com>
richiejp added 14 commits May 14, 2026 14:57
A set of UX changes that together move per-host MITM control out of the
global runtime_settings.json and into model YAML, where PII overrides
already lived. The MITM model template + the Add Model picker entry
mirror how the Talk page surfaces pipeline models.

A. Per-pattern PII enable + persist

- pii.Pattern gains a Disabled flag; Redactor.RedactWithOverrides
  skips disabled patterns. SetDisabled mutates via slices.Clone for
  the same race-free publish SetAction uses.
- PUT /api/pii/patterns/:id accepts {action?, disabled?} (one or
  both). New POST /api/pii/patterns/persist snapshots the live
  redactor's deltas vs --pii-config defaults into a new
  pii_pattern_overrides map in runtime_settings.json; the boot
  loader applies it after redactor construction.
- React: per-row Enabled checkbox + a "Save to disk" button on the
  Filtering tab. PUT toasts note the change is transient until
  persist is clicked.
- MCP: PIIPatternActionUpdate.Disabled is optional; new
  persist_pii_patterns tool. Coverage map + full-catalog test
  updated.
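
The race-free publish pattern shared by SetDisabled and SetAction can be sketched as follows (a simplified standalone version; the real redactor holds more per-pattern state):

```go
package main

import (
	"fmt"
	"slices"
	"sync"
)

// Pattern sketches a PII pattern with a runtime-togglable Disabled flag.
type Pattern struct {
	ID       string
	Disabled bool
}

// Redactor publishes a fresh slice on every mutation so concurrent
// readers iterating an older snapshot never observe a torn write.
type Redactor struct {
	mu       sync.RWMutex
	patterns []Pattern
}

func (r *Redactor) SetDisabled(id string, disabled bool) {
	r.mu.Lock()
	defer r.mu.Unlock()
	next := slices.Clone(r.patterns) // copy-on-write: mutate the clone
	for i := range next {
		if next[i].ID == id {
			next[i].Disabled = disabled
		}
	}
	r.patterns = next // publish the new snapshot
}

func (r *Redactor) snapshot() []Pattern {
	r.mu.RLock()
	defer r.mu.RUnlock()
	return r.patterns
}

func main() {
	r := &Redactor{patterns: []Pattern{{ID: "email"}, {ID: "ssn"}}}
	old := r.snapshot()
	r.SetDisabled("email", true)
	// The old snapshot is untouched; the new one sees the change.
	fmt.Println(old[0].Disabled, r.snapshot()[0].Disabled) // false true
}
```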

B. Model-config link buttons

- Per-model row in the Filtering tab gets an Edit button linking to
  /app/model-editor/<name>. Mirrors the same pattern used elsewhere
  for navigating to a config from a status surface.

D2. Model configs own MITM hosts

- New mitm: { hosts: [...] } block on ModelConfig. Loader gains
  MITMHostOwners() returning {Owners, Conflicts}; ANY duplicate host
  across model configs is a critical error that disables the MITM
  listener until resolved (strict 1-to-1 invariant the dispatcher
  relies on).
- startMITMLocked validates ownership before binding; conflicts are
  published on Application.mitmHostConflicts and surfaced via
  /api/middleware/status with a clear error message and links to
  the colliding configs in the React banner.
- Allowlist is now exactly the set of hosts claimed by model configs
  — the global MITMInterceptHosts list and MITMHostsWithPIIDisabled
  list are removed from RuntimeSettings, ApplicationConfig, the CLI
  flag, and runtime_settings.json. Per-host PII gate inherits from
  each owner config's pii.enabled.
- New "MITM Intercept" template in modelTemplates.js (default name
  mitm-anthropic, default host api.anthropic.com, pii.enabled: true,
  empty pii.patterns: [] for an immediately-visible override editor).
  Registered in core/config/meta/registry.go as a string-list field
  so the model editor renders it.
- /api/middleware/status MITM payload gains models: a list of every
  config that owns at least one MITM host (name, hosts, pii_enabled,
  backend), plus host_owners, host_conflicts. The MITM Proxy tab
  renders this as a top-level "MITM Models" table with an Add MITM
  model button.

Test: ModelConfigLoader.MITMHostOwners cross-config conflict
detection, host-normalisation, and intra-config duplicate handling.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Richard Palethorpe <io@richiejp.com>
Pull the local-store gRPC backend's KV+KNN logic into a reusable
pkg/store/local library so other in-process callers (notably the
routing module's KNN classifier) share one implementation. The
backend/go/local-store binary becomes a thin pb<->[]float32/[]byte
translation wrapper. Shared WrapKeys/WrapValues/UnwrapKeys helpers
move to pkg/store/proto.go.

Regression test suite covers normalization invariants, sort/merge
correctness, delete, KNN top-k ordering, and the 0xFFFF block-index
boundary that previously aliased.

Assisted-by: claude-code:claude-opus-4-7 [Read] [Edit] [Write] [Bash]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
Lands two routing subsystems behind the existing router config:

KNN classifier — embeds candidate exemplars on first Classify
(atomic.Pointer[local.Store] + loadMu for safe lazy load), picks
the candidate whose nearest exemplar is closest to the probe.
Threshold via router.min_score. Reuses the extracted local-store
library so the same KNN search runs in both the gRPC backend and
in-process router.

LLM classifier — asks a small instruct model to pick a label
from natural-language descriptions. Longest-first label match,
RWMutex-guarded prompt memo cache (size from
router.classifier_cache_size, default 1024), TrimSpace+ToLower
cache key.

EmbedderFactory / LLMCallerFactory adapter pattern on Application
keeps the router package free of HTTP/backend imports. Per-router
sync.Map cache in the middleware avoids re-embedding exemplars on
every request.

Admission control (subsystem 5) — per-model semaphore limiter
(sync.Map[modelName]chan struct{}) gates concurrent in-flight
requests by ModelConfig.Limits.MaxConcurrent. On rejection:
HTTP 503 + Retry-After + audit row via new pii.KindAdmission
event kind + JSON body { error.type: admission_rejected }.
Cap is fixed at first Acquire per model — admin restarts to
resize, matching the rest of the model config lifecycle.
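
The per-model gate above amounts to a buffered channel used as a counting semaphore; the real code keeps one per model in a sync.Map, but a single limiter shows the mechanism:

```go
package main

import "fmt"

// limiter gates concurrent in-flight requests for one model with a
// buffered channel as a counting semaphore (cap = MaxConcurrent).
type limiter chan struct{}

func newLimiter(maxConcurrent int) limiter {
	return make(limiter, maxConcurrent)
}

// tryAcquire admits the request if a slot is free; on rejection the
// middleware answers 503 + Retry-After and records an admission event.
func (l limiter) tryAcquire() bool {
	select {
	case l <- struct{}{}:
		return true
	default:
		return false
	}
}

func (l limiter) release() { <-l }

func main() {
	l := newLimiter(1)
	fmt.Println(l.tryAcquire()) // true: slot taken
	fmt.Println(l.tryAcquire()) // false: would exceed max_concurrent
	l.release()
	fmt.Println(l.tryAcquire()) // true: slot free again
}
```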

Middleware runs after RouteModel so a router fanout that lands
on a saturated downstream is rejected even when the router-model
itself has slack. /api/middleware/status gains an admission
section listing each gated model's max_concurrent / in_flight /
retry_after_seconds. The Events tab in the Middleware page knows
about admission rows.

Single-source-of-truth constants (ClassifierFeature / KNN / LLM,
LabelFallback) and an errDecision helper de-duplicate the
classifier surface.

Assisted-by: claude-code:claude-opus-4-7 [Read] [Edit] [Write] [Bash]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
The earlier extraction to pkg/store/local was the wrong shape: it pulled
the in-memory KV+KNN into the main process so the KNN router
classifier could call it directly. That undermines the point of
having a vector-store backend — admins should be able to swap in
qdrant, pinecone, or any other pluggable store backend without
changes to the routing code.

Reverts pkg/store/local and inlines the implementation back into
backend/go/local-store as package main. KNN now consumes a
router.VectorStore interface (Set / Find), with the production
adapter at core/application/embedder.go wrapping pkg/store's gRPC
client (SetCols / Find) over a backend resolved from
core/backend.StoreBackend — exactly how face/voice recognition
consume the same surface.

RouterConfig gains a store_model field naming the chosen backend
(empty = default local-store). Each router model uses its own
namespace ("router-knn-<routerModelName>") so two routers sharing
a backend can't see each other's exemplars; ModelLoader's
per-(backend, namespace) process isolation does the rest.

The router package gains no core/backend dependency — the
VectorStore interface lives alongside Embedder and LLMCaller and
is wired from the application layer the same way.

Algorithm coverage (sort/merge, normalised fast path, KNN top-K,
dimension enforcement) stays where it belongs — in
backend/go/local-store/store_test.go — exercised through the
gRPC service surface that downstream consumers actually use.

Assisted-by: claude-code:claude-opus-4-7 [Read] [Edit] [Write] [Bash]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
Adds a TokenClassify gRPC method for token-classification (NER) models
and implements it in the Python transformers backend. The PII redactor
will consume this in a follow-up to add an ML-based detection tier on
top of the regex tier.

Proto surface:
- TokenClassifyRequest { text, threshold }
- TokenClassifyEntity { entity_group, start, end, score, text }
- TokenClassifyResponse { repeated entities }

Byte offsets are into the original UTF-8 text so the consumer can slice
without re-tokenising. entity_group follows HuggingFace's aggregated-tag
convention (PER, LOC, ORG, ... or PII-specific labels depending on the
model).

Go wiring: Client / embedBackend / ConnectionEvictingClient gain
TokenClassify; Backend interface includes it. Generated stubs are
gitignored and regenerated at build time via `make protogen-go`.

Python backend: a new `Type=TokenClassification` model-load branch
loads via `transformers.pipeline("token-classification",
aggregation_strategy="simple")`. The aggregated-strategy pipeline gives
us span-merged entities with byte offsets out of the box. TokenClassify
RPC runs the pipeline, filters by threshold, and returns the entities.

Assisted-by: claude-code:claude-opus-4-7 [Read] [Edit] [Write] [Bash]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
The streaming PII filter wiring referenced auth.GetUser to attribute
events to the request's user, but the import line was dropped during
rebase. The result was a build failure: "undefined: auth" at
chat.go:709 and chat.go:1453.

Assisted-by: claude-code:claude-opus-4-7 [Edit] [Bash]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
Adds an optional encoder-based detection tier on top of the existing
regex tier. NER catches the long tail (unformatted names, locations,
mixed-language PII) that regex can't express, while regex keeps the
cheap path for formatted hits (emails, SSNs, credit cards).

The redactor exposes a new RedactWithNER(ctx, text, overrides, NERConfig)
that runs both tiers and merges hits through the same
overlap-resolution as before — when an entity span overlaps a regex
hit, the stronger action wins (block > route_local > mask). NER pattern
IDs are namespaced "ner:<entity_group>" so audit rows and event-tab
filters distinguish them from regex hits, and admins can disable a
single entity type with the same Disabled-pattern machinery.

NERConfig is per-request: each call site supplies the loaded
detector + per-group action map + minimum confidence, so the same
Redactor instance can serve different models with different NER
preferences without per-model redactor instances.

Fail-open semantics: a detector error returns the regex-only Result
alongside the error. Caller decides whether to surface the failure
(fail-closed: refuse the request) or log and proceed
(fail-open: ship regex-tier protection only). The regex tier itself
never errors.

Regex hit-collection / overlap-merge / output emission are now
factored into collectRegexHits + mergeAndEmit so the regex-only
RedactWithOverrides and the new RedactWithNER share one
implementation.

Out of scope (follow-up commits):
- core/application adapter from gRPC TokenClassify to NERDetector
- per-model PIIConfig.NER block + middleware wiring
- React middleware page surface for NER entity types
- gallery model entry for a recommended NER model

Assisted-by: claude-code:claude-opus-4-7 [Read] [Edit] [Write] [Bash]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
The Routing tab now has an explicit affordance for creating a new
routing model — matches the pattern already used on the MITM Proxy
tab. Empty state shows a primary "Create routing model" button;
populated state adds an "Add routing model" button next to the
Active routers header.

Both link to /app/model-editor?template=router. A new template in
modelTemplates.js seeds the editor with the feature classifier and
two empty candidate rows (one for 'code' with requires_code, one for
'chat') so admins fill in candidate models + a fallback and save.

The model editor wouldn't render the router fields until they were
registered, so registry.go gains entries for:
- router.classifier (select: feature / knn / llm)
- router.fallback (model-select chat)
- router.embedding_model (model-select models — for KNN)
- router.store_model (model-select models — for KNN's vector store)
- router.min_score (number)
- router.classifier_model (model-select chat — for LLM)
- router.classifier_cache_size (number)
- router.candidates (code-editor — array of {label, model, rules})

All under the "other" section alongside mitm.hosts, ordered after
the MITM entry.

Assisted-by: claude-code:claude-opus-4-7 [Read] [Edit] [Bash]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
The router template seeds router.candidates with an array of
{label, model, rules} objects. CodeMirror's EditorState.create({ doc })
requires a string — passing the array crashed inside CM's Text class
with "(intermediate value).split is not a function", surfaced as an
"Unexpected Application Error" overlay the moment the template
loaded.

Adds a StructuredCodeEditor wrapper that:
- YAML-stringifies the structured value for display so CodeMirror
  always sees a string,
- parses the user's text back to a structured value on every edit
  (using YAML.parse) so the editor form state holds the canonical
  shape, ready for unflattenConfig + YAML.stringify on save,
- holds the last-published structured value steady while the YAML
  buffer is mid-edit and temporarily invalid (the CM YAML linter
  surfaces the syntax error inline).

ConfigFieldRenderer routes code-editor fields through the wrapper
when the form value is non-string; plain text blobs (Go templates
etc.) still use the original CodeEditor with no behaviour change.

Playwright regression test pins:
- The Routing tab's "Create routing model" button navigates to
  /app/model-editor?template=router.
- Loading that URL doesn't render the "Unexpected Application Error"
  overlay, and the Router Candidates / Classifier fields are visible.

A page.on('pageerror') hook surfaces any uncaught render error so a
future regression fails with a useful message rather than silently
passing.

Assisted-by: claude-code:claude-opus-4-7 [Read] [Edit] [Write] [Bash]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

Two follow-ups for the routing template:

1. router.candidates moves from raw YAML to a dedicated structured
   editor. Each candidate is a card with:
   - label and model (model picker — no more typing model names from
     memory),
   - optional description (LLM classifier hint),
   - collapsible Rules section: max/min prompt length, requires_code
     toggle, and an Examples textarea for KNN exemplars (one per
     line).
   Empty rule values are stripped from the output so the saved YAML
   doesn't carry zero-valued junk. New "router-candidates" component
   in the field registry routes to RouterCandidatesEditor; everything
   else (regex tier, KNN factory, classifier dispatch) was already
   wired against the same structured shape, so the YAML this editor
   produces round-trips cleanly.

2. Proxy templates (proxy-openai, proxy-anthropic) ship with
   known_usecases: ['chat']. Without it the proxy model wasn't
   surfacing in router fallback / candidate pickers (or any chat
   capability selector) because pickers filter by FLAG_CHAT and
   backends with no explicit usecase list don't pass.

Updated regression test to assert the structured editor's "Add
candidate" button is present — if the field gets reverted to raw
YAML, the test fails loudly instead of silently passing on the
"didn't crash" check alone.

Assisted-by: claude-code:claude-opus-4-7 [Read] [Edit] [Write] [Bash]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
The KNN exemplars field shipped as a single textarea with "one per
line" semantics. That broke for any prompt that itself contained a
newline — a realistic case for the multi-line prompts admins want to
paste in verbatim from real traffic — and gave them a tiny 3-row
box for what's often the most consequential field on the form.

New ExamplesEditor renders one resizable textarea per exemplar with
add / remove buttons. Each exemplar can hold arbitrary text
including line breaks; the array on the wire stays a plain []string
that the KNN classifier already consumes unchanged.

Assisted-by: claude-code:claude-opus-4-7 [Read] [Edit] [Bash]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
Wires the consuming side of the LLMRouter-style data pipeline: the
KNN classifier can now load exemplars from a JSONL file alongside
(or instead of) hand-written candidate.examples. The benchmarker
itself isn't shipped yet — this lets the consumer be ready when it
lands, and lets admins drop in third-party datasets (LLMRouter's
own outputs etc.) directly.

JSONL shape (one row per query):

  {"_meta": {"embedding_model": "longformer-base-4096",
             "embedding_dim": 768,
             "judge": "claude-opus",
             "judge_method": "pairwise_winrate"}}
  {"query": "fix the bug in this function",
   "best_model": "qwen-coder",
   "scores": {"qwen-coder": 0.92, "qwen-chat": 0.45},
   "embedding": [0.12, ...]}
  {"query": "hello", "best_model": "qwen-chat"}

The _meta header is optional. embedding/scores per row are optional.
Blank lines and "#" comments are skipped.

Loader (pkg pii/services/routing/router/routing_data.go):
- LoadRoutingDataset(path) parses JSONL, validates {query, best_model}
  on each row, returns RoutingDataset{Meta, Rows}.
- 8MB per-line buffer so rows carrying long queries and full
  embedding arrays fit.
- FilterByCandidates(modelNames) drops rows whose best_model isn't
  configured — admins can share one benchmark across deployments
  with different lineups.
- EmbeddingsMatch(name, dim) reports whether stored embeddings can
  be used verbatim (saving 10-100x cold-start cost on large
  datasets).

KNN integration:
- KNNCandidate gains a Model field; the loader maps row.best_model →
  candidate.Label.
- NewKNNClassifier signature gains a trailing KNNOptions{Dataset,
  EmbeddingModelName}; existing call sites pass KNNOptions{}.
- Seeding has two passes — hand-written Examples first, then dataset
  rows. Empty-Examples candidates are no longer a constructor panic;
  with no dataset and no examples, the seed step fails on first
  Classify and the middleware falls back (the right failure mode).
- Pre-computed embeddings are honoured iff the dataset's
  _meta.embedding_model matches the configured embedder; otherwise
  re-embed (different embedders → different vector spaces).
- Rows referencing models the router doesn't know about are silently
  dropped.

Config:
- RouterConfig.ExemplarsFile (yaml: exemplars_file) names the JSONL.
  Relative paths resolve against models dir.
- Field registered in core/config/meta/registry.go so the model
  editor renders it as a path input next to the candidates editor.

Tests cover: meta header parsing, optional header, blank/comment
lines, missing-field validation, malformed JSON, missing file,
candidate filter, embeddings-match check; KNN seeds from dataset,
drops unknown models, uses precomputed embeddings when aligned,
re-embeds when mismatched, combines hand-written + dataset
exemplars.

Out of scope: the benchmarking CLI that produces these files.
Discussed as a separate slice — for general use the recommended
shape is pairwise-LLM-judge over a sampled traffic subset with
LocalAI's PII redactor in front of the judge call.

Assisted-by: claude-code:claude-opus-4-7 [Read] [Edit] [Write] [Bash]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
llama.cpp doesn't support Longformer's sliding-window + global
attention pattern — confirmed by grepping convert_hf_to_gguf.py for
LongformerModel (not present; supported encoder archs are Bert,
DistilBert, Roberta, XLMRoberta, NomicBert, JinaBert, ModernBert,
NeoBERT, EuroBert).

For routing the dataset schema is encoder-agnostic; we just need
SOME long-context sentence encoder. nomic-embed-text-v1.5
(NomicBert arch, 8192 native context, GGUF available, already
in gallery/index.yaml) fits the bill and runs on the existing
llama-cpp embedding path.

Updates the model-editor description for router.embedding_model
to surface nomic-embed-text-v1.5 as the default suggestion, with
modernbert-embed-base / jina-embeddings-v3 as alternatives.

Also corrects an inaccurate comment in routing_data.go that
conflated Longformer's context length (4096 tokens) with
embedding dimensionality (768) when justifying the 8MB scanner
buffer.

Assisted-by: claude-code:claude-opus-4-7 [Edit] [Bash]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

Replace the prior feature/knn/llm router classifiers with a single
score-based classifier that asks an Arch-Router-style model to rank
every policy label as a continuation of the routing prompt and reads
off the softmax distribution. Multi-label routing falls out of this
naturally: the middleware activates every label whose probability
crosses a softmax threshold and picks the first candidate whose
labels are a superset of the active set.

Wiring summary:

  - backend.proto adds Score(ScoreRequest) → ScoreResponse. The
    llama-cpp C++ backend implements Score on top of force-decoded
    candidates against a freshly-cleared KV cache (prompt-KV sharing
    optimisation is on the perf TODO list); vLLM uses prompt_logprobs.
    Other backends return UNIMPLEMENTED.
  - core/services/routing/router/score.go is the classifier. It builds
    the ChatML routing prompt once at construction, scores every
    policy label as a continuation, and applies an activation
    threshold (default 0.15; 0.40 is a better empirical default on
    Arch-Router-1.5B per the eval in features/middleware.md).
  - RouterConfig grows Policies, ActivationThreshold, and an optional
    EmbeddingCache nested struct. RouterCandidate collapses to
    {Model, Labels[]} — labels are the matching contract, descriptions
    live on the policy.
  - The dead feature/knn/llm/routing_data files are removed.

L2 embedding cache:

  - core/services/routing/router/embedding_cache.go wraps a Classifier
    decorator that embeds each probe, KNN-searches the per-router
    local-store collection, returns a cached decision if the cosine
    similarity passes a threshold (default 0.80, lowered from 0.92
    after the eval against nomic-embed-text-v1.5 paraphrases). Low-
    confidence decisions are deliberately not cached so they can't
    poison future paraphrases.
  - Stats include hits, misses, near_misses, low_confidence, and a
    10-bin similarity histogram so admins can see where the cosine
    distribution sits relative to the configured threshold.
  - Registry tracks built classifiers by fingerprint of the
    RouterConfig YAML, so config edits invalidate the cache wrapper
    automatically while the on-disk vectors persist.

UI:

  - The model-editor schema is rewritten: dead KNN/LLM fields gone,
    policies/activation_threshold/embedding_cache.* added with proper
    descriptions, sliders, and component bindings.
  - RouterCandidatesEditor is rewritten for {model, labels[]} with
    multi-select label chips populated from router.policies via a new
    FormContext.
  - RouterPoliciesEditor is the structured editor for the label
    vocabulary, with duplicate-label detection via a memoised set.
  - The Routing tab on /app/middleware renders the embedding-cache
    histogram inline with a threshold marker.

Verification:

  - Unit tests cover the score classifier (multi-label activation,
    fallback, depth-1) and the embedding cache (hit, near-miss,
    low-confidence skip, embedder/store error fallthrough, histogram
    population).
  - Refreshed e2e specs (router-template.spec.js, middleware-page.spec.js)
    pass under make test-ui-e2e-docker: 133/135 passing with the two
    failures unrelated to this slice.
  - End-to-end eval against the LocalAGI stack with a 30-prompt corpus
    + 3 paraphrases each produced 35% steady-state hit rate at 0.80
    threshold (53% of caching-eligible decisions), 15ms p50 cache-hit
    latency vs 246ms classifier round-trip — a ~16× speedup on hits.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Richard Palethorpe <io@richiejp.com>
@richiejp force-pushed the feat/routing-stats-backend branch from 8389d96 to 99f79f4 on May 14, 2026 at 14:13