fix: Make OpenTelemetry startup idempotent#2684

Open
mike-inkeep wants to merge 2 commits into main from fix/hot_reload_metricreader_error

Conversation

@mike-inkeep

Vite hot reload frequently re-runs the OpenTelemetry (otel) SDK startup, which is not idempotent. Once that happens, every subsequent request fails with a 500 error until pnpm dev is restarted.
The error shows up as: 10:04:53 AM [vite] Internal server error: MetricReader can not be bound to a MeterProvider again.

This PR makes that operation idempotent by using a global var to keep track of whether otel was started up already.
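
For readers unfamiliar with the pattern, here is a minimal sketch of why the flag has to live on globalThis rather than in module scope. It assumes hot reload re-evaluates the module body, which is modelled here by calling a hypothetical `evaluateModule` function twice; the key name is also made up for illustration and is not the one used in this PR.

```typescript
// Hypothetical key, for illustration only. Symbol.for() returns the same
// symbol for the same string, even across separate module evaluations.
const STARTED_KEY = Symbol.for('example.otelStarted');

type FlagStore = { [key: symbol]: boolean | undefined };

// Stand-in for a module body that gets re-evaluated on hot reload.
// Any `let started = false` declared inside would be recreated on each
// evaluation, so the flag is stored on globalThis instead.
function evaluateModule(startSdk: () => void): void {
  const store = globalThis as unknown as FlagStore;
  if (store[STARTED_KEY]) return; // already started by an earlier evaluation
  startSdk();
  store[STARTED_KEY] = true;
}

let startCalls = 0;
evaluateModule(() => { startCalls += 1; }); // first load
evaluateModule(() => { startCalls += 1; }); // simulated hot reload
console.log(startCalls); // prints 1
```

The second evaluation returns early because the flag set by the first one is still visible on globalThis.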

@vercel

vercel bot commented Mar 13, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| agents-api | Ready | Preview, Comment | Mar 16, 2026 1:32pm |
| agents-manage-ui | Ready | Preview, Comment | Mar 16, 2026 1:32pm |

1 Skipped Deployment

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| agents-docs | Skipped | | Mar 16, 2026 1:32pm |

@vercel vercel bot temporarily deployed to Preview – agents-docs March 13, 2026 14:11 Inactive
@changeset-bot

changeset-bot bot commented Mar 13, 2026

⚠️ No Changeset found

Latest commit: d5f8ac1

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

@pullfrog pullfrog bot left a comment

The idempotency approach is sound for the Vite HMR use case, but there's a breaking change that needs addressing before merge.

  });

- export const defaultSDK = new NodeSDK({
+ const defaultSDK = new NodeSDK({

Removing the export from defaultSDK is a breaking change. create-agents-template/apps/agents-api/src/instrumentation.ts:19 does:

const { defaultSDK } = await import('@inkeep/agents-api/instrumentation');
defaultSDK.start();

@inkeep/agents-api/instrumentation is a declared subpath export in agents-api/package.json. The template (and any downstream consumer using that subpath) will break at runtime.

Either:

  1. Update the template in this PR to use startOpenTelemetrySDK(), or
  2. Continue exporting defaultSDK (possibly deprecated) alongside the new function.
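
A rough sketch of option 2, for discussion only. `StubSDK` is a stand-in for `NodeSDK` so the snippet runs on its own, the key name is hypothetical, and the `@deprecated` wording is just a suggestion rather than anything this PR contains.

```typescript
// Stand-in for NodeSDK from @opentelemetry/sdk-node (assumption, so the
// sketch is self-contained). It only counts how often start() is called.
class StubSDK {
  startCount = 0;
  start(): void {
    this.startCount += 1;
  }
}

// Hypothetical key for this sketch only.
const STARTED_KEY = Symbol.for('example.deprecatedExport.otelStarted');
type FlagStore = { [key: symbol]: boolean | undefined };

/**
 * @deprecated Prefer startOpenTelemetrySDK(). Calling start() on this
 * object directly bypasses the idempotency guard.
 */
export const defaultSDK = new StubSDK();

// New entry point: guarded, so repeated calls start the SDK only once.
export function startOpenTelemetrySDK(): void {
  const store = globalThis as unknown as FlagStore;
  if (store[STARTED_KEY]) return;
  defaultSDK.start();
  store[STARTED_KEY] = true;
}

startOpenTelemetrySDK();
startOpenTelemetrySDK();
console.log(defaultSDK.startCount); // prints 1
```

Existing consumers that destructure `defaultSDK` keep working, while editors that understand JSDoc can flag the deprecated import.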

Comment on lines +93 to +110
const OTEL_STARTED_KEY = Symbol.for('@inkeep/agents-api.otelStarted');

export function startOpenTelemetrySDK(): void {
  if ((globalThis as unknown as { [key: symbol]: boolean })[OTEL_STARTED_KEY]) {
    return;
  }
  try {
    defaultSDK.start();
    (globalThis as unknown as { [key: symbol]: boolean })[OTEL_STARTED_KEY] = true;
  } catch (error) {
    const msg = error instanceof Error ? error.message : String(error);
    if (msg.includes('MetricReader') && msg.includes('can not be bound')) {
      (globalThis as unknown as { [key: symbol]: boolean })[OTEL_STARTED_KEY] = true;
      return;
    }
    throw error;
  }
}

The new NodeSDK(...) constructor at line 85 still runs at module scope on every HMR re-evaluation — allocating fresh span processors, context managers, and instrumentations each time. The guard only prevents start() from firing twice, but the stale SDK objects from prior evaluations are leaked.

Consider moving the new NodeSDK(...) construction inside startOpenTelemetrySDK() behind the same guard, or caching the instance on globalThis so repeated module evaluations reuse it.
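
One possible shape for the second option, as a sketch rather than a prescription. `StubSDK` stands in for `NodeSDK` and counts constructions so the reuse is observable; the key name is hypothetical. The helper name `getOrCreateSDK` matches the one used in the follow-up commit described later in this thread, but its body here is an assumption.

```typescript
// Stand-in for NodeSDK (assumption); counts how many instances are built.
class StubSDK {
  static constructed = 0;
  constructor() {
    StubSDK.constructed += 1;
  }
  start(): void {}
}

// Hypothetical cache key for this sketch.
const SDK_KEY = Symbol.for('example.otelSdkInstance');
type SdkStore = { [key: symbol]: StubSDK | undefined };

// Returns the cached instance if one exists, otherwise constructs and
// caches a new one. Repeated module evaluations therefore share one SDK.
function getOrCreateSDK(): StubSDK {
  const store = globalThis as unknown as SdkStore;
  const cached = store[SDK_KEY];
  if (cached) return cached;
  const sdk = new StubSDK();
  store[SDK_KEY] = sdk;
  return sdk;
}

const first = getOrCreateSDK();
const second = getOrCreateSDK(); // simulated re-evaluation
console.log(first === second, StubSDK.constructed); // prints "true 1"
```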

Comment on lines +103 to +106
    const msg = error instanceof Error ? error.message : String(error);
    if (msg.includes('MetricReader') && msg.includes('can not be bound')) {
      (globalThis as unknown as { [key: symbol]: boolean })[OTEL_STARTED_KEY] = true;
      return;

When this catch fires, OTEL recovered via the fallback path rather than the primary guard. A console.warn (or logger call) here would help operators distinguish between clean idempotency and error-recovery idempotency during debugging.
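
To make the distinction concrete, here is a small sketch in which the guard reports which path it took and warns on the recovery path. The `startOnce` helper, the `StartOutcome` type, the explicit `state` object, and the warning text are all hypothetical; the real function keeps its flag on globalThis and returns void.

```typescript
// Hypothetical outcome type so callers (and tests) can tell the paths apart.
type StartOutcome = 'started' | 'already-started' | 'recovered';

function startOnce(
  start: () => void,
  state: { started: boolean },
  warn: (message: string) => void = console.warn,
): StartOutcome {
  if (state.started) return 'already-started'; // clean idempotency
  try {
    start();
    state.started = true;
    return 'started';
  } catch (error) {
    const msg = error instanceof Error ? error.message : String(error);
    if (msg.includes('MetricReader') && msg.includes('can not be bound')) {
      // Error-recovery idempotency: worth a visible log line.
      warn(`OpenTelemetry start() skipped after binding error: ${msg}`);
      state.started = true;
      return 'recovered';
    }
    throw error;
  }
}

// Usage: simulate the error quoted in the PR description.
const warnings: string[] = [];
const outcome = startOnce(
  () => { throw new Error('MetricReader can not be bound to a MeterProvider again.'); },
  { started: false },
  (message) => warnings.push(message),
);
console.log(outcome, warnings.length); // prints "recovered 1"
```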

@claude claude bot left a comment

PR Review Summary

(4) Total Issues | Risk: Medium

🟠⚠️ Major (2) 🟠⚠️

🟠 1) create-agents-template/apps/agents-api/src/instrumentation.ts:19-20 Breaking change: defaultSDK export removed but template still imports it

Issue: This PR removes the export keyword from defaultSDK (making it private), but create-agents-template still imports this export directly:

// create-agents-template/apps/agents-api/src/instrumentation.ts:19-20
const { defaultSDK } = await import('@inkeep/agents-api/instrumentation');
defaultSDK.start();

Why: When users upgrade @inkeep/agents-api or scaffold a new project using the template, the code will fail at runtime with an import error. This is a breaking change that affects the template's compatibility with the package.

Fix: Update the template to use the new idempotent function:

// create-agents-template/apps/agents-api/src/instrumentation.ts
const { startOpenTelemetrySDK } = await import('@inkeep/agents-api/instrumentation');
startOpenTelemetrySDK();


🟠 2) agents-api/src/instrumentation.ts:95-110 Missing test coverage for idempotency logic

Issue: The new startOpenTelemetrySDK() function has zero test coverage. It has three distinct code paths: (1) early return when already started, (2) successful SDK start with flag set, and (3) error handling for MetricReader binding errors — none of which are tested.

Why: Without tests, future refactoring could break the HMR safety mechanism, causing the original 500 error to resurface. The codebase already has a test pattern for this exact scenario in socket-mode.test.ts:

// packages/agents-work-apps/src/__tests__/slack/socket-mode.test.ts:121
it('should skip if client already exists (HMR safety)', async () => {
  (globalThis as Record<string, unknown>).__inkeep_slack_socket_mode_client__ = {};
  // ...
  expect(mockSocketModeStart).not.toHaveBeenCalled();
});

Fix: Add a test file at agents-api/src/__tests__/instrumentation.test.ts with cases for:

  1. should skip SDK start if already started (HMR safety) — verify early return
  2. should set global flag after successful start — verify flag is set
  3. should catch MetricReader binding error and set flag — verify error suppression
  4. should rethrow non-MetricReader errors — verify other errors propagate
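
The four cases could look roughly like the framework-free sketch below. It is only an illustration: `makeStarter` is a hypothetical factory that mirrors the guard's logic with an injectable `start` callback and flag store, and `check` is a tiny stand-in for a test runner. In the repo these would presumably be Vitest `it` blocks in the style of the socket-mode test above, exercising the real `startOpenTelemetrySDK()` with a mocked SDK.

```typescript
// Hypothetical factory mirroring the guard, with injectable dependencies
// so every case starts from isolated state.
function makeStarter(start: () => void, store: { started?: boolean }) {
  return function startOpenTelemetrySDK(): void {
    if (store.started) return;
    try {
      start();
      store.started = true;
    } catch (error) {
      const msg = error instanceof Error ? error.message : String(error);
      if (msg.includes('MetricReader') && msg.includes('can not be bound')) {
        store.started = true;
        return;
      }
      throw error;
    }
  };
}

// Minimal stand-in for a test runner assertion.
function check(name: string, condition: boolean): void {
  if (!condition) throw new Error(`failed: ${name}`);
  console.log(`ok: ${name}`);
}

{ // 1. early return when already started
  let calls = 0;
  makeStarter(() => { calls += 1; }, { started: true })();
  check('skips SDK start if already started', calls === 0);
}
{ // 2. flag set after a successful start
  const store: { started?: boolean } = {};
  makeStarter(() => {}, store)();
  check('sets flag after successful start', store.started === true);
}
{ // 3. MetricReader binding error is suppressed and the flag is set
  const store: { started?: boolean } = {};
  const run = makeStarter(() => {
    throw new Error('MetricReader can not be bound to a MeterProvider again.');
  }, store);
  run();
  check('suppresses MetricReader binding error', store.started === true);
}
{ // 4. unrelated errors propagate and leave the flag unset
  const store: { started?: boolean } = {};
  const run = makeStarter(() => { throw new Error('exporter misconfigured'); }, store);
  let rethrown = false;
  try { run(); } catch { rethrown = true; }
  check('rethrows non-MetricReader errors', rethrown && store.started !== true);
}
```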

🟡 Minor (1) 🟡

Inline Comments:

  • 🟡 Minor: instrumentation.ts:104-107 Silent error suppression without logging

💭 Consider (2) 💭

💭 1) agents-api/src/instrumentation.ts:104 Brittle string matching for error detection

Issue: The error check msg.includes('MetricReader') && msg.includes('can not be bound') relies on specific error message text from @opentelemetry/sdk-node. If OpenTelemetry changes the error message wording in a future version, this check would fail.

Why: A dependency update could silently break the hot-reload fix. Since this is dev-only, it might not be caught until it disrupts developer workflow.

Fix: Add a version comment documenting the expected error format:

// Known error from @opentelemetry/sdk-node when SDK.start() is called multiple times.
// Verified against @opentelemetry/sdk-node@0.212.0. Update if error message changes.
if (msg.includes('MetricReader') && msg.includes('can not be bound')) {
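
A complementary option, sketched under assumptions: keep the message fragments in one named helper so a wording change upstream means a one-line update and the check can be unit-tested directly. The helper and constant names are hypothetical; the fragments are the ones already used in this PR.

```typescript
// Fragments of the error message quoted in this PR's description.
// If the upstream wording changes, this is the single place to update.
const ALREADY_BOUND_FRAGMENTS = ['MetricReader', 'can not be bound'] as const;

// Returns true when the thrown value looks like the "already bound" error.
function isMetricReaderAlreadyBound(error: unknown): boolean {
  const msg = error instanceof Error ? error.message : String(error);
  return ALREADY_BOUND_FRAGMENTS.every((fragment) => msg.includes(fragment));
}

const quoted = new Error('MetricReader can not be bound to a MeterProvider again.');
console.log(isMetricReaderAlreadyBound(quoted)); // prints true
console.log(isMetricReaderAlreadyBound(new Error('exporter misconfigured'))); // prints false
```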

💭 2) agents-api/src/instrumentation.ts:102-108 Catch block sets flag even when SDK may be in partial state

Issue: If the MetricReader error is thrown, it means defaultSDK.start() partially succeeded before failing. Setting the flag to true prevents future retries, but the SDK may be in an inconsistent state.

Why: In the HMR context this is fine because the previous SDK instance's tracers are still working. But it's worth noting that subsequent calls to shutdown or flush on this new instance would operate on a broken SDK.

Fix: This is acceptable as-is for the HMR use case. Consider adding a code comment documenting this edge case for future maintainers.


🚫 REQUEST CHANGES

Summary: The approach using Symbol.for() + globalThis for idempotency is correct and follows established patterns in this codebase (see Slack Socket Mode). However, two issues need addressing before merge: (1) the template breaking change — create-agents-template imports the now-private defaultSDK export, and (2) the fix lacks test coverage for a behavior-critical change. Both are straightforward to fix.

Discarded (1)

| Location | Issue | Reason Discarded |
| --- | --- | --- |
| agents-docs/content/guides/observability/langfuse-usage.mdx:50 | Documentation shows old OTEL pattern | Pre-existing issue, docs show user-maintained code example, LOW confidence |

Reviewers (5)

| Reviewer | Returned | Main Findings | Consider | While You're Here | Inline Comments | Pending Recs | Discarded |
| --- | --- | --- | --- | --- | --- | --- | --- |
| pr-review-sre | 3 | 1 | 0 | 0 | 1 | 0 | 1 |
| pr-review-tests | 2 | 1 | 1 | 0 | 0 | 0 | 0 |
| pr-review-errors | 2 | 0 | 1 | 0 | 1 | 0 | 0 |
| pr-review-precision | 2 | 1 | 1 | 0 | 0 | 0 | 0 |
| pr-review-standards | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Total | 9 | 3 | 3 | 0 | 2 | 0 | 1 |

Note: Findings were deduplicated across reviewers — template breaking change was flagged by both pr-review-sre and pr-review-precision; silent error suppression was flagged by both pr-review-sre and pr-review-errors.

Comment on lines +104 to +107
    if (msg.includes('MetricReader') && msg.includes('can not be bound')) {
      (globalThis as unknown as { [key: symbol]: boolean })[OTEL_STARTED_KEY] = true;
      return;
    }

🟡 Minor: Silent error suppression without logging

Issue: When the MetricReader binding error is caught and suppressed, there's no logging to indicate this happened.

Why: This follows a different pattern than createSafeBatchProcessor() (line 37) which logs warnings for initialization failures. Without logging, debugging OTEL issues becomes harder — there's no trace that startup was skipped due to a binding conflict.

Fix: Add a debug/info log to maintain observability:

Suggested change
-    if (msg.includes('MetricReader') && msg.includes('can not be bound')) {
-      (globalThis as unknown as { [key: symbol]: boolean })[OTEL_STARTED_KEY] = true;
-      return;
-    }
+    if (msg.includes('MetricReader') && msg.includes('can not be bound')) {
+      logger.debug({}, 'OpenTelemetry SDK already started (MetricReader already bound), skipping re-initialization');
+      (globalThis as unknown as { [key: symbol]: boolean })[OTEL_STARTED_KEY] = true;
+      return;
+    }

@github-actions github-actions bot deleted a comment from claude bot Mar 13, 2026
@itoqa

itoqa bot commented Mar 13, 2026

Ito Test Report ✅

19 test cases ran. 19 passed.

The verification run completed with stable API and UI behavior across baseline checks, stress scenarios, auth/session handling, navigation edge cases, and post-stress resilience validation. 🔍 Code review of the PR-scoped startup-idempotency change and observed runtime outcomes did not surface a code-supported defect under the exercised paths.

✅ Passed (19)
| Test Case | Summary | Timestamp | Screenshot |
| --- | --- | --- | --- |
| ROUTE-1 | Six consecutive /health probes returned 204 with no 5xx responses. | 0:00 | ROUTE-1_0-00.png |
| ROUTE-2 | The /ready endpoint returned 200 with expected healthy schema fields and no server error. | 0:00 | ROUTE-2_0-00.png |
| ROUTE-3 | Root navigation bootstrapped into tenant projects successfully with no observed auth/bootstrap 500 responses. | 0:58 | ROUTE-3_0-58.png |
| ROUTE-4 | Tenant route navigation remained functional across settings and projects views with no server error page during transitions. | 1:59 | ROUTE-4_1-59.png |
| LOGIC-1 | Completed 20 hard reload cycles on authenticated tenant route and app stayed usable without persistent server-error state. | 16:05 | LOGIC-1_16-05.png |
| LOGIC-2 | Ran 10-tab concurrent load and refresh sequence; all tabs remained on valid tenant projects page without 5xx behavior. | 16:05 | LOGIC-2_16-05.png |
| LOGIC-3 | No 5xx occurred across repeated auth-session probes and a valid dev-session bootstrap returned 200 with session cookie issuance. | 18:03 | LOGIC-3_18-03.png |
| EDGE-1 | Direct deep-link to tenant projects from a cleared state resolved successfully with controlled auth bootstrap and no 500 response. | 2:13 | EDGE-1_2-13.png |
| EDGE-2 | Completed three back/back/forward/forward cycles across tenant routes; no stale 500/internal-error state appeared during traversal. | 20:15 | EDGE-2_20-15.png |
| EDGE-3 | Mobile viewport bootstrap completed into tenant projects with visible primary UI and no observed startup-related 500 responses. | 2:34 | EDGE-3_2-34.png |
| EDGE-4 | Repeated refresh interruptions during route activity recovered to stable tenant page each time with no persistent server-error loop. | 16:05 | EDGE-4_16-05.png |
| EDGE-5 | Completed 30 open/close churn iterations and periodic health probes; API health remained 204 throughout and UI stayed functional. | 16:05 | EDGE-5_16-05.png |
| ADV-1 | POST /api/auth/dev-session without Authorization returned controlled 401 response. | 0:00 | ADV-1_0-00.png |
| ADV-2 | Ten rapid requests with incorrect bearer token consistently returned 401 without any 5xx. | 0:00 | ADV-2_0-00.png |
| ADV-3 | Malformed and oversized Authorization header variants were rejected with controlled non-5xx statuses. | 0:00 | ADV-3_0-00.png |
| ADV-4 | After deleting the auth cookie, protected-route fan-out completed with controlled route handling and no raw 500 error pages. | 18:05 | ADV-4_18-05.png |
| ADV-5 | Encoded script/traversal path variants were handled in a controlled way (error/not-found style behavior) without 500 responses or raw script payload reflection. | 20:31 | ADV-5_20-31.png |
| ADV-6 | High-concurrency mixed probes completed with only expected 200/204/401 statuses and no 5xx responses. | 18:08 | ADV-6_18-08.png |
| RESILIENCE-1 | After a 2-minute post-stress interval, tenant UI routes remained functional, /health returned 204, /ready returned 200 with valid contract, and no delayed 500 regression was observed. | 22:29 | RESILIENCE-1_22-29.png |

@mike-inkeep

@claude add export back to defaultSDK

move new NodeSDK() into startOpenTelemetrySDK behind the same guard, or cache the instance on globalThis so repeated module evaluations reuse it

@inkeep-internal-ci

inkeep-internal-ci bot commented Mar 16, 2026

Claude finished @mike-inkeep's task in 1m 43s.


Changes implemented

  • Read current instrumentation.ts and template file
  • Add export back to defaultSDK (fixes breaking change for create-agents-template)
  • Move new NodeSDK() behind a globalThis guard via getOrCreateSDK() — repeated HMR module evaluations now reuse the cached instance instead of leaking fresh SDK objects
  • Commit and push

What changed in agents-api/src/instrumentation.ts:

  • Added OTEL_SDK_KEY symbol on globalThis to cache the NodeSDK instance
  • getOrCreateSDK() checks the cache first, only constructs a new NodeSDK if none exists
  • export const defaultSDK = getOrCreateSDK() — export is restored, template import is unbroken
  • startOpenTelemetrySDK() uses a shared OtelGlobal type alias (cleaner casts)
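
For reference, a shared alias of this kind might look something like the sketch below. The actual `OtelGlobal` definition is not shown in this thread, so everything here is an assumption: the key strings are invented, `StubSDK` stands in for `NodeSDK`, and the two accessor functions exist only to show that the casts collapse to a single typed view of globalThis.

```typescript
// Stand-in for NodeSDK (assumption, keeps the sketch self-contained).
class StubSDK {
  start(): void {}
}

// Hypothetical keys; const declarations initialised with Symbol.for()
// get unique symbol types, so they can be used as typed property keys.
const OTEL_STARTED_KEY = Symbol.for('example.otelStarted.typed');
const OTEL_SDK_KEY = Symbol.for('example.otelSdk.typed');

// One alias describing both slots, instead of repeating inline casts.
type OtelGlobal = typeof globalThis & {
  [OTEL_STARTED_KEY]?: boolean;
  [OTEL_SDK_KEY]?: StubSDK;
};

const otelGlobal = globalThis as OtelGlobal;

function markStarted(): void {
  otelGlobal[OTEL_STARTED_KEY] = true;
}

function isStarted(): boolean {
  return otelGlobal[OTEL_STARTED_KEY] === true;
}

otelGlobal[OTEL_SDK_KEY] = new StubSDK();
markStarted();
console.log(isStarted()); // prints true
```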

Restores the export on defaultSDK to avoid breaking the
create-agents-template subpath import. Moves the new NodeSDK()
construction behind a globalThis guard (getOrCreateSDK) so
repeated Vite HMR module evaluations reuse the same instance instead
of leaking fresh SDK objects.

Co-authored-by: mike-inkeep <mike-inkeep@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@itoqa

itoqa bot commented Mar 16, 2026

Ito Test Report ✅

19 test cases ran. 19 passed.

✅ The local dev stability run stayed healthy across API route checks, auth bootstrap flows, multi-tab/mobile stress, and adversarial auth requests. Code-change logs only show non-product test-run configuration tweaks (dev server host/overlay indicators), and no test output provided a code-supported defect signal, so all executed cases are reported as passed.

✅ Passed (19)
| Test Case | Summary | Timestamp | Screenshot |
| --- | --- | --- | --- |
| ROUTE-1 | Twenty sequential /health checks returned HTTP 204 with no HTTP 500 responses. | 0:00 | ROUTE-1_0-00.png |
| ROUTE-2 | Readiness probes stayed healthy after route churn and returned HTTP 200 without HTTP 500. | 0:00 | ROUTE-2_0-00.png |
| ROUTE-3 | Clean-context navigation to /default/projects rendered successfully with protected UI available. | 1:08 | ROUTE-3_1-08.png |
| ROUTE-4 | OpenAPI remained parseable across repeated checks and docs route stayed reachable with no HTTP 500 status responses. | 0:00 | ROUTE-4_0-00.png |
| ROUTE-5 | Ten rapid refreshes completed on /default/projects with stable page availability and no persistent failure pattern. | 1:21 | ROUTE-5_1-21.png |
| LOGIC-1 | Repeated first-hit style root requests with periodic readiness/openapi probes stayed stable and did not show HTTP 500 responses. | 0:00 | LOGIC-1_0-00.png |
| LOGIC-2 | Parallel bursts across core routes stayed responsive and did not produce HTTP 500 status responses. | 0:00 | LOGIC-2_0-00.png |
| LOGIC-3 | Forced recovery probe completed with five healthy checks and successful return to protected UI; no poisoned state persisted. | 7:25 | LOGIC-3_7-25.png |
| EDGE-1 | Thirty reloads with periodic readiness probes remained stable; all sampled health/readiness checks stayed valid. | 1:38 | EDGE-1_1-38.png |
| EDGE-2 | Five-tab concurrent navigation and reload sequence completed with responsive API endpoints and no persistent crash state. | 2:31 | EDGE-2_2-31.png |
| EDGE-3 | Five clean-cookie deep-link interruption cycles converged to rendered stats page without persistent 500 behavior. | 2:47 | EDGE-3_2-47.png |
| EDGE-4 | Three history replay cycles across mixed routes completed and follow-up health/readiness checks remained healthy. | 3:05 | EDGE-4_3-05.png |
| ADV-1 | All 10 unauthorized dev-session POST requests returned 401, and follow-up health/readiness checks stayed healthy (204/200) with no 500. | 9:24 | ADV-1_9-24.png |
| ADV-2 | All malformed Authorization variants returned safe 401 responses, and service remained responsive afterward (health 204, ready 200) with no 500. | 9:26 | ADV-2_9-26.png |
| ADV-3 | Executed 50 mixed requests plus protected-route reloads; service stayed responsive with no persistent 500 cascade. | 7:23 | ADV-3_7-23.png |
| ADV-4 | Fresh-cookie protected-route probing completed without backend crash and service health stayed available. | 3:21 | ADV-4_3-21.png |
| RAPID-1 | Thirty rapid route toggles plus periodic reloads completed with recoverable UI state and healthy backend probe. | 3:42 | RAPID-1_3-42.png |
| MOBILE-1 | At 390x844, repeated refresh and navigation toggles remained operable with no persistent startup-failure cascade. | 4:05 | MOBILE-1_4-05.png |
| JOURNEY-1 | Repeated mid-journey refresh interruptions converged to a stable rendered state with backend health intact. | 4:15 | JOURNEY-1_4-15.png |