Skip to content

Feat/hot path optimization with external rpc validation#508

Open
oten91 wants to merge 59 commits intofeat/hot-path-optimizationfrom
feat/hot-path-optimization-with-external-rpc-validation
Open

Feat/hot path optimization with external rpc validation#508
oten91 wants to merge 59 commits intofeat/hot-path-optimizationfrom
feat/hot-path-optimization-with-external-rpc-validation

Conversation

@oten91
Copy link
Contributor

@oten91 oten91 commented Feb 17, 2026

updating to go 1.25
adding optional external rpc validation

When all suppliers in a session are behind the real chain tip, internal
consensus considers everything "in sync" because it only compares
endpoints against each other. Add optional per-service external RPC
endpoints that periodically fetch the real block height as a floor for
perceivedBlockNumber, correctly filtering stale suppliers.

Supports multiple sources per service for redundancy (max wins). Handles
EVM hex, Cosmos decimal, and Solana numeric response formats. Failure-safe:
unreachable sources log a warning and are skipped. External heights are
written to Redis for cross-replica benefit.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@oten91 oten91 requested a review from jorgecuesta February 17, 2026 22:49
@oten91 oten91 force-pushed the feat/hot-path-optimization-with-external-rpc-validation branch 2 times, most recently from 113320b to ff9606e Compare February 17, 2026 23:06
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@oten91 oten91 force-pushed the feat/hot-path-optimization-with-external-rpc-validation branch from ff9606e to e932bd2 Compare February 17, 2026 23:19
Copy link
Contributor

@jorgecuesta jorgecuesta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just a minor question about it assuming all the endpoints are JSON rpc compatible and a possible optimization.

@oten91 oten91 force-pushed the feat/hot-path-optimization-with-external-rpc-validation branch from d1bf40c to 1cb8631 Compare February 18, 2026 13:50
… mode to external block fetcher

- Replace docker/cli exclude with direct require of v29.2.1 which uses
  moby/moby packages with proper Go modules, fixing the range-over-function
  compilation error in E2E tests.
- Pin docker/docker to v28.5.1 and dockertest to v3.11.0 to fix broken pipe
  error when streaming Docker build context via the API in E2E tests.
- Add type field ("jsonrpc" default or "rest") to ExternalBlockSource config
  so Cosmos REST endpoints (GET /status) work alongside JSON-RPC POST.
- Expand .dockerignore to exclude non-essential directories (docusaurus, docs,
  research, etc.) reducing build context size.
- Add integration tests against real public RPC endpoints (ETH, Solana,
  Osmosis JSON-RPC, Osmosis REST) to validate all chain type parsing.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@oten91 oten91 force-pushed the feat/hot-path-optimization-with-external-rpc-validation branch from 1cb8631 to 25be89b Compare February 18, 2026 14:10
oten91 and others added 20 commits February 18, 2026 15:20
Prevents cold-start filtering where the external block source reports the
real chain tip before suppliers have reported their heights, causing all
endpoints to appear "behind" and get filtered out.

Default grace period: 60s. Configurable per-service via `grace_period`
in `external_block_sources` config.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds ConsumeExternalBlockHeight to Cosmos and Solana QoS instances so
external_block_sources works for all chain types, not just EVM.

Also fixes EVM: skip Redis writes during grace period to prevent other
replicas from picking up the external height before suppliers report.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Docker image build + container startup can take 5+ minutes in CI,
which was consuming the entire 5-minute context timeout before tests
even started. This caused all Vegeta test methods to immediately
report "canceled" with 0 requests.

Move context creation to after getGatewayURLForTestMode() returns,
so the 5-minute timeout only governs actual test execution.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The CI workflow already builds the Docker image in Phase 1 and passes
it as an artifact to Phase 2. But force_rebuild_image was still true
in the default config, causing a redundant ~5 minute rebuild from
scratch (with NoCache) before every E2E test run.

Set force_rebuild_image=false in the CI config script so the test
reuses the pre-built image from Phase 1.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
All 5 matrix jobs (3 HTTP + 2 WebSocket) were downloading the Docker
image artifact simultaneously, triggering GitHub's secondary rate
limit (403 Forbidden). Stagger downloads at 0/5/10/15/20s intervals.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…o QoS instances

QoS instances are created at startup before external health check rules are
loaded, so syncAllowance starts at 0 (disabled). This means stale endpoints
were never filtered even when external rules specified a sync_allowance.

Fix by using atomic.Uint64 for syncAllowance in both EVM and Cosmos QoS
configs, and propagating the value from the health check executor when
external rules are loaded/refreshed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…eneric QoS

Services using generic/NoOp QoS (near, sui, tron) previously had no block
height tracking, meaning stale endpoints were never filtered even when
external block sources were configured. This converts NoOpQoS from a
stateless value type to a stateful pointer type with per-endpoint block
height tracking, sync-allowance-based filtering, and external block height
consumption — matching the pattern used by Cosmos and Solana QoS.

Default behavior is unchanged: when syncAllowance=0 (default), endpoint
selection remains purely random.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
In multi-replica deployments, each replica updates perceivedBlockHeight
locally only. This means Replica A may see block 1000 while Replica B
stays at 950, causing B to select stale endpoints. EVM already had full
Redis sync; this extends the same pattern to the remaining QoS types.

Adds SetReputationService, StartBackgroundSync (periodic Redis reads),
and async Redis writes in UpdateFromExtractedData and
ConsumeExternalBlockHeight for Solana, Cosmos, and NoOp QoS. Uses simple
max semantics (if Redis > local, update) rather than EVM's consensus
mechanism. No changes to cmd/main.go needed — existing duck-typed
interface checks auto-detect the new methods.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…onfigured

NewSimpleQoSInstance (used for all EVM services) sets getEVMChainID()
to "". However, isChainIDValid() still ran unconditionally, rejecting
every observed endpoint — either because it had no chain ID observation
yet (errNoChainIDObs) or because the observed chain ID (e.g. "0x38")
didn't match "". This caused a flood of warn logs and forced fallback
to random endpoint selection.

Skip chain ID validation and synthetic eth_chainId checks entirely when
no expected chain ID is configured, consistent with the constructor's
intent to delegate validation to active health checks.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…a stale supplier filtering

Health checks only run on the leader pod, so non-leader replicas had empty
per-endpoint block heights and could not filter stale suppliers. This adds
Redis HSET/HGETALL sync for per-endpoint block heights across all 4 QoS types
(EVM, Solana, Cosmos, NoOp), enabling all replicas to filter stale endpoints.

Write side: async HSET from UpdateFromExtractedData after each health check.
Read side: HGETALL on the existing 5s background sync ticker, updating only
existing local endpoint entries where Redis has a higher block height.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Tests cover both write side (UpdateFromExtractedData writes per-endpoint
block heights to Redis) and read side (StartBackgroundSync reads from Redis
and updates local endpoint store) for all 4 QoS types: EVM, Solana, Cosmos,
and NoOp. Verifies that:
- Higher Redis values update local endpoints
- Lower Redis values do NOT downgrade local endpoints
- Non-existent local endpoints are NOT created from Redis data
- Periodic sync picks up new Redis data after startup

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Non-leader replicas had empty endpoint stores at startup because health
checks only run on the leader pod. The syncEndpointBlocksFromRedis
closure would fetch per-endpoint block heights from Redis but silently
skip all entries since none existed locally. This meant stale suppliers
were never filtered on non-leader pods.

Now, EVM/Cosmos/NoOp QoS types create new endpoint entries from Redis
data when the endpoint doesn't exist in the local store, enabling
block height filtering from the first request. Solana is excluded
because its validateBasic() requires health+epoch data.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…tractBlockHeight

Extends extractBlockHeight to handle non-EVM response formats for external
block sources on generic chains (near, sui, tron):
- Decimal strings without 0x prefix parsed as base-10 (Sui)
- Numeric latest_block_height in sync_info objects (NEAR)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…e period

Add mdbx_panic to the heuristic error indicator patterns so that Erigon
nodes with corrupted/full MDBX databases trigger retries on a different
supplier instead of returning the error directly to users. Also reduce
the external block grace period from 60s to 30s to shorten the cold
start window where stale suppliers can leak through.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… comet_bft unsupported

CometBFT methods (status, block, validators, etc.) sent via JSON-RPC POST
were rejected with "unsupported RPC type" when suppliers hadn't staked
comet_bft endpoints. Since CometBFT nodes handle both REST and JSON-RPC
interfaces on the same endpoint, these requests can safely route to json_rpc
endpoints instead. This fallback is a no-op when comet_bft is properly staked.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ng UNKNOWN_RPC

NoOp QoS was discarding the RPC type detected by the gateway and always
setting UNKNOWN_RPC on service payloads. This caused all generic services
(tron, sui, near) to report unknown_rpc in metrics/observations. Now the
detected type (json_rpc, rest, etc.) flows through to the payload.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Endpoint block height entries accumulate indefinitely as sessions rotate,
causing non-leader replicas to recreate stale entries from Redis on every
sync cycle. This adds a lastSeen timestamp to each endpoint entry, updated
during filtering when endpoints appear in active sessions. A periodic sweep
(every 5s tick) removes entries older than 2h TTL from both local stores
and Redis via HDEL.

Changes across all QoS types (EVM, Cosmos, Solana, NoOp):
- Add lastSeen field to endpoint structs
- Add touchEndpoints() to update lastSeen during filtering (separate WLock)
- Add sweepStaleEndpoints() to remove entries past TTL
- Call sweep in StartBackgroundSync tick loop
- Add RemoveEndpointBlockHeights to ReputationService/Storage interfaces
- Implement HDEL-based cleanup in Redis storage, delete in memory storage
- Update all 6 test mocks with the new interface method

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ove them

Entries created by syncEndpointBlocksFromRedis had lastSeen=zero, which
the sweep intentionally skips. This meant Redis-synced entries were never
cleaned up. Set lastSeen=time.Now() on new entries so they get swept after
the 2h TTL if they never appear in an active session.

Also fix NoOp UpdateFromExtractedData to preserve existing lastSeen instead
of overwriting the entire endpointState struct.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Broken/lazy suppliers return empty responses like "result":[],
"result":{}, or bare {}. This adds two layers of defense:

Layer 1 — Heuristic detection (retry on different supplier):
- REST bare {} → rest_empty_object (confidence 0.80, retry)
- JSON-RPC "result":{} → confidence 0.95, MajorErrorSignal (never valid
  for any JSON-RPC method)
- JSON-RPC "result":[] → confidence 0.75, MinorErrorSignal (valid for
  some methods like eth_getLogs, retry is benign)

Layer 2 — Block height filter (prevent future requests):
- Add ErrInvalidBlockHeightResult sentinel error for extractors to
  signal "response had a result but it was unparseable"
- EVM extractor: returns sentinel when eth_blockNumber gets non-string
  result ([], {}, number, boolean) or unparseable hex
- Cosmos extractor: returns sentinel when CometBFT JSON-RPC response
  has empty container result ({} or [])
- UpdateFromExtractedData (EVM/Cosmos/Solana): stores block height 0
  when InvalidBlockHeight flag is set, causing the endpoint to be
  filtered (0 < perceived - sync_allowance)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…infrastructure

When a relay fails, retries now exclude ALL endpoints behind the same
domain (host), not just the exact endpoint address. This prevents wasting
retry budget on different supplier endpoints that share the same broken
backend (e.g., multiple suppliers behind rel.spacebelt.xyz all returning
empty responses).

Also stops retrying entirely when all available endpoints are from tried
domains, instead of resetting and cycling through the same broken
infrastructure with backoff.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
oten91 and others added 30 commits February 24, 2026 12:53
…alid responses

Solana's GetServicePayloads() was not setting JSONRPCMethod on the
protocol payload, so the heuristic analyzer received an empty method
and could not perform method-aware empty array detection. This caused
"result":[] from broken suppliers to be classified as jsonrpc_success
instead of jsonrpc_invalid_empty_array for scalar methods like getSlot.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…reset

A single success was resetting CriticalStrikes to 0, allowing deceptive
suppliers to avoid the cooldown system entirely by interleaving fake
successes with critical errors. Now strikes decay by 1 per success,
requiring sustained good behavior to fully recover.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
WebSocket handler was missing RPC type validation, allowing connections
for services that don't have "websocket" in their configured rpc_types.
This let suppliers pump unlimited success signals via WebSocket for
services like Solana (json_rpc only), bypassing reputation penalties.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
RPCTypeValidator was defined on the Gateway struct and checked in both
HTTP and WebSocket handlers, but never instantiated — always nil.
This meant RPC type validation was silently skipped for all requests,
allowing WebSocket connections to services like Solana that only
support json_rpc.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
At 50+ critical strikes, `1 << exponent` overflows int64, producing
zero or negative cooldown durations. This caused suppliers with 100+
strikes to have instantly-expiring cooldowns instead of the intended
1-hour max. Cap exponent at 7 (5min * 2^7 = 640min > 60min max).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…to all protocols

Some suppliers return a static {"jsonrpc":"2.0","id":1,"result":[]} for
ALL requests regardless of protocol type, including REST and CometBFT
paths. This evades current detection because:
- REST analyzeREST() only looks for error indicators
- CometBFT "result":[] check requires jsonrpcMethod which is empty for GET
- Cosmos health checks lack expected_response_contains validation

Changes:
- Add REST protocol mismatch detection (JSON-RPC response to REST request)
- Add CometBFT empty array detection (no CometBFT method returns arrays)
- Register both as deceptive patterns for CriticalErrorSignal + strikes
- Set JSONRPCMethod on Cosmos JSON-RPC payloads (aligns with EVM/Solana)
- Add expected_response_contains to all 15 Cosmos chain health checks

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Sei is a dual chain (Cosmos + EVM) but most suppliers only run EVM
nodes. The CometBFT health checks (health, status, syncing) were
penalizing EVM-only suppliers that correctly reject unknown methods
like "status" with "method does not exist". Switch to eth_blockNumber
and eth_chainId which all sei suppliers support.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Spacebelt returns {"result":[]} to eth_blockNumber which passes basic
HTTP 200 check. Adding expected_response_contains: '"0x' catches this
since real EVM responses always contain hex strings.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Some suppliers return HTTP 200 with
{"error":{"code":-32000,"message":"Failed to call fallback API"}} when
they can't serve archival requests. This was leaking through to clients
as a 200 while the same class of error ("missing trie node") from other
suppliers was correctly caught and retried.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Suppliers returning fast canned responses like {"result":[]} pass
transport-level checks (no error, has bytes), win the hedge race, and
cause the correct response from another supplier to be discarded. The
heuristic later catches the bad response but the good one is gone.

Now the hedge layer runs the same heuristic analysis before declaring
a winner. If the first response fails heuristic validation, it waits
for the second response instead of immediately returning the bad one.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… major

The syncing REST health check had reputation_signal: critical_error,
causing CriticalErrorSignal (strikes + exponential cooldown) on any
health check failure. This was unfairly penalizing legitimate suppliers
whose REST endpoint occasionally failed.

Changed to major_error for all 14 Cosmos chains. The status JSON-RPC
check retains critical_error since its expected_response_contains
targets canned responses specifically.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…height

The session cache key previously included the current block height, which
changes every block (~60s). Since sessions span ~50 blocks, this caused a
cache miss on every new block for the same session, making the cache
ineffective and overloading full nodes with redundant GetSession gRPC calls.

The fix computes sessionStartHeight deterministically from the current
height and NumBlocksPerSession (already cached via GetSharedParams), so
the cache key remains stable for the entire session window.

Falls back to current height if SharedParams is unavailable, preserving
previous behavior rather than failing requests.

Closes #509
The original formula `h - (h % N)` is 0-based, but Shannon sessions
start at block 1: [1,60], [61,120], etc. This caused the last block
of each session to get a different cache key, resulting in 2 fetches
per session instead of 1.

Use Shannon's formula `h - ((h-1) % N)` to align cache keys with
actual session boundaries. Updated tests to use NumBlocksPerSession=60
matching production config.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…kout

Archival requests (eth_call at old blocks, etc.) hitting non-archival
suppliers return "historical state is not available" or similar pruning
errors. The heuristic correctly retries these on different suppliers,
but the circuit breaker was also marking the entire domain as broken.

This caused a death spiral: archival request volume continuously fed
the circuit breaker, escalating hit counts into the thousands and
locking ALL non-qspider domains out for 30 minutes at a time. On
mainnet, arb-one had hit counts of 151,549 on a single domain.

Changes:
- Add shouldCircuitBreak() to gate circuit breaking on heuristic type
- Archival/pruning errors (historical state, state pruned, missing trie
  node, etc.) skip circuit breaking but still retry on other suppliers
- Add IsArchivalRelatedError() to heuristic package for pattern matching
- Store reason in Redis (format: "expiry:hitCount:reason") for debugging
  without needing log access. Backward compatible with old formats.
- Add MatchedPattern field to AnalysisResult for callers to inspect
- Add diagnostic logging: circuit break events (error level) and skipped
  circuit breaks (warn level) with reason, domain, and service context
- Add logger parameter to DomainCircuitBreaker for structured logging

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The hedge race failure branch was not setting lastCircuitBreakReason or
lastResponseSnippet, causing all circuit breaker logs and Redis entries
to show empty reason/response fields. This made it impossible to debug
why domains were being locked out via the hedge path.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
CometBFT requests sent as JSON-RPC POST (e.g., {"jsonrpc":"2.0","method":"status"})
were being routed to the REST validator, which re-detected the RPC type from the URL
path as REST. This caused rest_protocol_mismatch false positives at 0.95 confidence,
triggering circuit breaker domain lockouts across Cosmos chains.

Route these requests to the JSONRPC validator instead, which correctly preserves
RPCType=COMET_BFT and avoids the heuristic false positive.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…onses

Tron's HTTP API returns {} for query endpoints when the entity doesn't
exist (e.g., /wallet/getaccount for unactivated accounts). The heuristic
was flagging these as rest_empty_object errors at 0.80 confidence,
triggering circuit breaker domain lockouts and causing ~88% 5xx on tron.

Add path-based whitelist (restEmptyObjectValidPathPrefixes) for /wallet/
and /walletsolidity/ prefixes, matching the existing emptyArrayValidMethods
pattern for JSON-RPC. Pass request path through to REST analyzer via the
existing jsonrpcMethod parameter (empty for REST, now falls back to
payload.Path).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… JSON-RPC

processOfElimination auto-assigned JSON_RPC to any HTTP request when it
was the only HTTP-based RPC type configured. GET requests to json_rpc-only
services (e.g., GET /status) were forwarded as JSON-RPC, causing backend
400 rejections and circuit breaker lockouts.

Now validates that requests must be POST to be typed as JSON_RPC. Non-POST
requests fall through to inspectPayload, which returns a proper error.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add POST /admin/circuit-breaker/clear/{serviceId} endpoint that clears
both in-memory and Redis circuit breaker state. Redis DEL alone is
insufficient because refreshFromRedis merges local entries back —
this endpoint clears local cache first, then Redis.

Also updates pnf_path_rules.yaml sync_allowance values to match
external configuration (eth: 2, poly/arb-one/base: 20).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…are registered

InitExternalConfig runs before SetQoSInstances, so the initial
SetSyncAllowance calls in refreshExternalConfig find an empty
qosInstances map and silently fail. This leaves sync_allowance=0
(disabled) for all services until the first periodic refresh (5min),
creating a window where block height filtering is completely bypassed.

Fix: SetQoSInstances now re-applies sync_allowance from any
already-loaded external configs when QoS instances are registered.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
HTTP 4xx responses indicate the client sent a bad request, not that
the domain is broken. Don't punish domains for correctly rejecting
malformed requests (e.g., NEAR backends returning PARSE_ERROR for
non-JSON-RPC payloads). Only 5xx, transport failures, and heuristic-
detected supplier errors should trigger circuit breaks.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Log HTTP method, URL path, content-type, user-agent, and first 512
bytes of request body at error level when RPC type detection fails.
Helps diagnose sources of bad traffic (e.g., malformed NEAR POSTs).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The Shannon protocol layer's heuristic check only used payload.JSONRPCMethod,
which is empty for REST requests. This meant path-aware validation (e.g., the
Tron /wallet/ whitelist for valid {} responses) never received the request path,
causing false positive rest_empty_object detections and circuit breaker lockouts.

Add the same payload.Path fallback that gateway-level checks already have.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… breaker reason

- Promote protocol-layer heuristic detection log from Debug to Error for
  production visibility (needed to identify which JSON-RPC method triggers
  false positive empty array detection on sei)
- Include jsonrpc_method and request_path in the heuristic error log
- Append (method=...) to the error string propagated to circuit breaker
- Increase Redis circuit breaker reason truncation from 200 to 500 chars
  so method names aren't cut off in diagnostic queries

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ethod

Two false positive sources found via improved diagnostics:

1. Cosmos REST /cosmos/* paths return {} for non-existent entities (e.g.,
   /cosmos/slashing/v1beta1/signing_infos/{addr} for validators with no
   slashing record). Add /cosmos/ to restEmptyObjectValidPathPrefixes.

2. Sei exposes EVM methods with sei_ prefix (sei_getLogs, sei_getFilterLogs,
   etc.) that behave identically to eth_* equivalents. Add to
   emptyArrayValidMethods whitelist so "result":[] is not flagged.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…kout via hedge path

When the protocol layer detects an archival error (e.g., "historical state
not available", "missing trie node") and returns it as an error, the hedge_failed
code path loses the structured heuristicResult. shouldCircuitBreak then sees
heuristicResult=nil and treats it as a transport failure → circuit breaks the
domain → locks out ALL traffic including non-archival requests.

Fix: add error string parameter to shouldCircuitBreak. When heuristicResult is
nil but an error is available, fall back to substring matching against archival
patterns. This prevents domains from being locked out for serving pruned-state
errors while remaining healthy for current-block requests.

Affected chains: arb-one (118K hits easy2stake), base (45K hits nodefleet),
bsc, and other EVM chains with archival traffic.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… lockout

Lite fullnodes (e.g., Tron) return plain text like "this API is closed
because this node is a lite fullnode" for unsupported endpoints. Previously
this was classified as generic non_json_response, triggering domain-level
circuit breaking even though the domain works fine for other requests.

Now the structural analyzer detects capability limitation patterns in
non-JSON responses and returns a specific reason (non_json_capability_limitation)
that shouldCircuitBreak recognizes as non-circuit-breakable — same treatment
as archival errors. The request still retries on a different supplier.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When shouldCircuitBreak returns false in the batch processing path,
the else branch accessed checkResult.HeuristicResult.Reason without
a nil check. Moved heuristicReason extraction before the if/else
so it's available in both branches with proper nil safety.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
CometBFT endpoints always return JSON-RPC formatted responses (e.g.,
{"jsonrpc":"2.0","id":-1,"result":{}}) even for GET requests. This was
causing two types of false positives:

1. analyzeREST flagged CometBFT GET responses as rest_protocol_mismatch
   because it saw "jsonrpc" in a REST response — but CometBFT paths like
   /health, /status, /block always return JSON-RPC format by design.

2. analyzeJSONRPC flagged {"result":{}} as jsonrpc_empty_object_result
   when rpcType arrived as JSON_RPC instead of COMET_BFT (due to
   rpc_type_fallbacks routing CometBFT through JSON_RPC endpoints).

Fix: Add method-aware and path-aware CometBFT detection so the heuristic
recognizes valid CometBFT responses regardless of how they were routed.
These checks become redundant safety nets once suppliers can stake
comet_bft endpoints directly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants