Skip to content

Test/hub failover auth fast fail#98

Merged
josealekhine merged 5 commits into
mainfrom
test/hub-failover-auth-fast-fail
May 23, 2026
Merged

Test/hub failover auth fast fail#98
josealekhine merged 5 commits into
mainfrom
test/hub-failover-auth-fast-fail

Conversation

@josealekhine
Copy link
Copy Markdown
Member

No description provided.

The behavior was already implemented in 8bcb620 (the
original master-follower replication feature): failover.go
checks authErr(callErr) on every peer attempt and returns
immediately when the gRPC status code is Unauthenticated
or PermissionDenied. The reasoning is sound — the same
bearer token would fail on every peer in the list, so
cycling is pointless and just wastes time.

But the test suite never asserted the invariant. The
three existing failover tests cover:

- TestFailoverClient_FirstPeerWorks: happy path
- TestFailoverClient_SkipsBadPeer: connection-error skip
- TestFailoverClient_AllBad: all-unreachable error

None of them exercise the "auth fails → walk stops"
contract. A future refactor could silently delete the
authErr(callErr) branch and all three would still pass.

New TestFailoverClient_FailsFastOnAuthError:
- Real server as the first peer; bogus bearer token will
  trigger Unauthenticated.
- Unrouted 127.0.0.1:1 as the second peer — would surface
  Unavailable if the walk continued past auth.
- Asserts returned gRPC code is Unauthenticated or
  PermissionDenied; if it's Unavailable, the walk cycled
  past auth into the unrouted peer, which is the exact
  regression to catch.

Closes line 375 of TASKS.md (open since 2026-04-08).

Spec: specs/hub-security-audit.md
Signed-off-by: Jose Alekhinne <jose@ctx.ist>
Capture the triage finding from 22cffc2 (hub failover
auth-fast-fail test). The task at TASKS.md line 375 had
been open since 2026-04-08; on triage, the code already
existed in failover.go:61-63 and err_check.go:22-30, but
no test asserted the invariant. The task closed by
writing the regression test, not by writing the feature.

Application beyond this one task: behavior-named TASKS.md
items older than a few weeks should get an
implementation-status sweep before being scoped as new
work. The pattern fits hub/connect/replication-adjacent
tasks particularly well — several still-open items
(lines 380 file locking, 385 fanout entry loss) may
warrant the same triage.

Spec: specs/hub-security-audit.md
Signed-off-by: Jose Alekhinne <jose@ctx.ist>
Triaged the file-locking task: a "lock" exists in
state.go:42-50 but it's a check-then-write TOCTOU race,
not real locking. Filed
#93 with the
analysis + two proposed fixes (atomic O_CREATE|O_EXCL or
syscall.Flock) + the three regression tests the fix
should add.

Task stays open in TASKS.md (the bug is real and
unfixed); the cross-reference points readers at the
filed issue so the analysis isn't duplicated.

Spec: specs/hub-security-audit.md
Signed-off-by: Jose Alekhinne <jose@ctx.ist>
The triage of TASKS.md line 393 (use crypto/subtle.ConstantTimeCompare
+ replace O(n) scan with map) showed the implementation
was already correct: store.go:174-189 uses an O(1) map
lookup on s.tokenIdx plus a defensive CTC against the
stored token. The task description ("currently uses ==")
was stale.

But the test surface didn't pin the contract. The
existing TestStoreRegisterAndValidate covers valid vs.
"invalid" (a completely different string), which a
naive prefix matcher or non-constant comparison would
also pass. A future "simplification" PR could collapse
the careful chain to a single == or strings.HasPrefix
and the test suite would stay green.

New TestStoreValidateToken_RejectsNearMissTokens
exercises 9 near-miss cases that probe the boundary the
CTC defends:

- empty token
- last byte changed (would pass a HasPrefix(stored, X))
- first byte changed
- middle byte changed
- prefix only (would pass HasPrefix(X, stored))
- extra suffix appended (would pass HasPrefix(X, stored))
- case-shifted (would pass a strings.EqualFold)
- whitespace-padded
- all-X same length (random-token false-positive sanity)

All must return nil. If a future refactor weakens the
comparison, these start failing immediately.

The CTC itself is technically redundant once the map
lookup hits — Go map keys match exact-byte by definition
— but the CTC is the explicit signal of intent. The
test pins the *behavior* the CTC enables, not the
nanosecond-level timing (which can't be tested reliably
in Go anyway).

Spec: specs/hub-security-audit.md
Signed-off-by: Jose Alekhinne <jose@ctx.ist>
… record gateway-coupling decision

Three coordinated context updates:

1. **New task: ctx-architecture-next** (TASKS.md, architecture
   pipeline section). Fourth step in the pipeline:
   map → enrich → hunt → prescribe. Consumes the three
   prior artifacts (ARCHITECTURE.md, enriched verifications,
   DANGER-ZONES.md) and produces NEXT-ACTIONS.md — a ranked,
   sequenced fix plan that maps each danger zone to a
   concrete next move (refactor, test, doc, escalate,
   accept) with effort estimates. Distinct from the
   previously-skipped ctx-architecture-extend: extend was
   about "where features grow" (overlapped with
   DETAILED_DESIGN and enrich's registration sites); next
   is pure synthesis, no overlap. Pipeline grows from 3 to
   4 because each step now has a distinct output document.
   The synthesis layer is gateway-free by design: reads
   markdown, writes markdown, no MCP dependency at the
   skill's own layer.

2. **Skip: pluggable graph tool interface** (TASKS.md
   line 1350-ish, `#added:2026-03-25-120000`). Marked `[-]`
   with rationale: pluggable abstraction implies multiple
   candidate graph tools, which implies ctx vouches for
   the interface contract across implementations — exactly
   the ownership coupling the new decision rules out.
   Spec sketch in ideas/spec-mcp-warm-up-ceremony.md stays
   available as the starting point if the trade-off ever
   shifts.

3. **New decision: MCP gateway not worth the coupling
   cost** (DECISIONS.md). Builds on the existing 2026-03-12
   "Recommend companion RAGs as peer MCP servers not bridge
   through ctx" with a stronger reason: ownership coupling.
   A gateway would make ctx vouch for the gatewayed tool's
   install/uninstall/upgrade/error surface — coupling we
   don't want for tools we don't ship. Pluggable abstraction
   is the same trap; both are skipped as a direct
   consequence. Doctor-style "is the peer available?"
   checks remain valid; that's not proxying.

Spec: ideas/spec-companion-intelligence.md
Signed-off-by: Jose Alekhinne <jose@ctx.ist>
@josealekhine josealekhine self-assigned this May 23, 2026
@josealekhine josealekhine merged commit 79c5a01 into main May 23, 2026
16 checks passed
@josealekhine josealekhine deleted the test/hub-failover-auth-fast-fail branch May 23, 2026 15:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant