fix: Make OpenTelemetry startup idempotent by mike-inkeep · Pull Request #2684 · inkeep/agents

mike-inkeep · 2026-03-13T14:11:04Z

Vite hot reload frequently re-runs the OpenTelemetry SDK startup. That startup is not idempotent, so the second run fails and every subsequent request returns a 500 error until pnpm dev is restarted.
The error shows up as: 10:04:53 AM [vite] Internal server error: MetricReader can not be bound to a MeterProvider again.

This PR makes the operation idempotent by using a global variable to track whether OpenTelemetry has already been started.
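To illustrate why the flag has to live on globalThis rather than in a module-level variable, here is a minimal sketch. `evaluateModule` is a hypothetical stand-in for Vite re-evaluating the instrumentation module (it is not a Vite API), and the symbol key is illustrative. `Symbol.for` returns the same symbol for the same key on every evaluation, which is what lets the flag be found again.

```typescript
// Hypothetical stand-in for one evaluation of a module under HMR.
// Each call gets fresh local state, as a re-evaluated module would.
const DEMO_KEY = Symbol.for('demo.otelStarted'); // illustrative key, not the PR's
type FlagStore = { [key: symbol]: boolean | undefined };

function evaluateModule(): { moduleFlagWasSet: boolean; globalFlagWasSet: boolean } {
  let moduleStarted = false; // module-scoped flag: recreated on every evaluation
  const store = globalThis as unknown as FlagStore;

  // Record what each flag looked like when this evaluation began.
  const seen = {
    moduleFlagWasSet: moduleStarted,
    globalFlagWasSet: store[DEMO_KEY] === true,
  };

  moduleStarted = true; // lost when the module is evaluated again
  store[DEMO_KEY] = true; // survives, because globalThis outlives the module
  return seen;
}

const first = evaluateModule();
const second = evaluateModule();
```

On the second evaluation only the globalThis flag is still set, so only that flag can prevent a second start().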

vercel · 2026-03-13T14:11:09Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
agents-api	Ready	Preview, Comment	Mar 16, 2026 1:32pm
agents-manage-ui	Ready	Preview, Comment	Mar 16, 2026 1:32pm

1 Skipped Deployment

Project	Deployment	Actions	Updated (UTC)
agents-docs	Skipped		Mar 16, 2026 1:32pm

changeset-bot · 2026-03-13T14:11:11Z

⚠️ No Changeset found

Latest commit: d5f8ac1

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types


pullfrog

The idempotency approach is sound for the Vite HMR use case, but there's a breaking change that needs addressing before merge.


pullfrog · 2026-03-13T14:13:09Z

agents-api/src/instrumentation.ts

 });

-export const defaultSDK = new NodeSDK({
+const defaultSDK = new NodeSDK({


Removing the export from defaultSDK is a breaking change. create-agents-template/apps/agents-api/src/instrumentation.ts:19 does:

const { defaultSDK } = await import('@inkeep/agents-api/instrumentation');
defaultSDK.start();

@inkeep/agents-api/instrumentation is a declared subpath export in agents-api/package.json. The template (and any downstream consumer using that subpath) will break at runtime.

Either:

Update the template in this PR to use startOpenTelemetrySDK(), or

Continue exporting defaultSDK (possibly deprecated) alongside the new function.
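The second option could look roughly like the sketch below. `StartableSdk` is a stand-in for NodeSDK so the snippet runs on its own, the symbol key is illustrative, and the `@deprecated` wording is an assumption rather than text from the PR.

```typescript
// Stand-in for NodeSDK so the sketch is self-contained; only start() matters here.
class StartableSdk {
  startCalls = 0;
  start(): void {
    this.startCalls += 1;
  }
}

const STARTED_KEY = Symbol.for('example.otelStarted'); // illustrative key
type StartedFlag = { [key: symbol]: boolean | undefined };

/**
 * @deprecated Kept so existing `defaultSDK.start()` callers keep working.
 * Prefer startOpenTelemetrySDK(), which is safe to call more than once.
 */
const defaultSDK = new StartableSdk();

function startOpenTelemetrySDK(): void {
  const flags = globalThis as unknown as StartedFlag;
  if (flags[STARTED_KEY]) return; // already started in this process
  defaultSDK.start();
  flags[STARTED_KEY] = true;
}

// In the real module both names would be exported from the
// '@inkeep/agents-api/instrumentation' subpath.
startOpenTelemetrySDK();
startOpenTelemetrySDK();
```

Keeping the old name exported means downstream code that imports it still resolves, while new callers get the guarded entry point.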

pullfrog · 2026-03-13T14:13:09Z

agents-api/src/instrumentation.ts

+const OTEL_STARTED_KEY = Symbol.for('@inkeep/agents-api.otelStarted');
+
+export function startOpenTelemetrySDK(): void {
+  if ((globalThis as unknown as { [key: symbol]: boolean })[OTEL_STARTED_KEY]) {
+    return;
+  }
+  try {
+    defaultSDK.start();
+    (globalThis as unknown as { [key: symbol]: boolean })[OTEL_STARTED_KEY] = true;
+  } catch (error) {
+    const msg = error instanceof Error ? error.message : String(error);
+    if (msg.includes('MetricReader') && msg.includes('can not be bound')) {
+      (globalThis as unknown as { [key: symbol]: boolean })[OTEL_STARTED_KEY] = true;
+      return;
+    }
+    throw error;
+  }
+}


The new NodeSDK(...) constructor at line 85 still runs at module scope on every HMR re-evaluation — allocating fresh span processors, context managers, and instrumentations each time. The guard only prevents start() from firing twice, but the stale SDK objects from prior evaluations are leaked.

Consider moving the new NodeSDK(...) construction inside startOpenTelemetrySDK() behind the same guard, or caching the instance on globalThis so repeated module evaluations reuse it.

pullfrog · 2026-03-13T14:13:09Z

agents-api/src/instrumentation.ts

+    const msg = error instanceof Error ? error.message : String(error);
+    if (msg.includes('MetricReader') && msg.includes('can not be bound')) {
+      (globalThis as unknown as { [key: symbol]: boolean })[OTEL_STARTED_KEY] = true;
+      return;


When this catch fires, OTEL recovered via the fallback path rather than the primary guard. A console.warn (or logger call) here would help operators distinguish between clean idempotency and error-recovery idempotency during debugging.
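One way to make the two paths distinguishable is sketched below with console.warn and a stand-in SDK. The return value, the `StartOutcome` name, and the symbol key are illustrative additions, not part of the PR.

```typescript
type StartOutcome = 'started' | 'already-started' | 'recovered';
type Startable = { start(): void };

const OUTCOME_KEY = Symbol.for('example.otelStarted.outcome'); // illustrative key
type Flags = { [key: symbol]: boolean | undefined };

function startOnce(sdk: Startable): StartOutcome {
  const flags = globalThis as unknown as Flags;
  if (flags[OUTCOME_KEY]) return 'already-started'; // primary guard: clean idempotency

  try {
    sdk.start();
    flags[OUTCOME_KEY] = true;
    return 'started';
  } catch (error) {
    const msg = error instanceof Error ? error.message : String(error);
    if (msg.includes('MetricReader') && msg.includes('can not be bound')) {
      // Fallback path: surface it so it is not mistaken for the clean case.
      console.warn('OpenTelemetry start() threw a MetricReader rebind error; treating the SDK as started.');
      flags[OUTCOME_KEY] = true;
      return 'recovered';
    }
    throw error;
  }
}

// Usage: the first call hits the fallback path, the second hits the guard.
const outcomes: StartOutcome[] = [];
outcomes.push(
  startOnce({
    start() {
      throw new Error('MetricReader can not be bound to a MeterProvider again.');
    },
  }),
);
outcomes.push(startOnce({ start() {} }));
```

With a warning on the fallback path, a log line appears only when start() actually threw, so the quiet case and the recovered case no longer look the same.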

claude

PR Review Summary

(4) Total Issues | Risk: Medium

🟠⚠️ Major (2) 🟠⚠️

🟠 1) create-agents-template/apps/agents-api/src/instrumentation.ts:19-20 Breaking change: defaultSDK export removed but template still imports it

Issue: This PR removes the export keyword from defaultSDK (making it private), but create-agents-template still imports this export directly:

// create-agents-template/apps/agents-api/src/instrumentation.ts:19-20
const { defaultSDK } = await import('@inkeep/agents-api/instrumentation');
defaultSDK.start();

Why: When users upgrade @inkeep/agents-api or scaffold a new project using the template, the code will fail at runtime with an import error. This is a breaking change that affects the template's compatibility with the package.

Fix: Update the template to use the new idempotent function:

// create-agents-template/apps/agents-api/src/instrumentation.ts
const { startOpenTelemetrySDK } = await import('@inkeep/agents-api/instrumentation');
startOpenTelemetrySDK();

🟠 2) agents-api/src/instrumentation.ts:95-110 Missing test coverage for idempotency logic

Issue: The new startOpenTelemetrySDK() function has zero test coverage. It has three distinct code paths: (1) early return when already started, (2) successful SDK start with flag set, and (3) error handling for MetricReader binding errors — none of which are tested.

Why: Without tests, future refactoring could break the HMR safety mechanism, causing the original 500 error to resurface. The codebase already has a test pattern for this exact scenario in socket-mode.test.ts:

// packages/agents-work-apps/src/__tests__/slack/socket-mode.test.ts:121
it('should skip if client already exists (HMR safety)', async () => {
  (globalThis as Record<string, unknown>).__inkeep_slack_socket_mode_client__ = {};
  // ...
  expect(mockSocketModeStart).not.toHaveBeenCalled();
});

Fix: Add a test file at agents-api/src/__tests__/instrumentation.test.ts with cases for:

should skip SDK start if already started (HMR safety) — verify early return
should set global flag after successful start — verify flag is set
should catch MetricReader binding error and set flag — verify error suppression
should rethrow non-MetricReader errors — verify other errors propagate
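The four cases can be sketched without a test framework as shown below. This is a sketch under assumptions: `startGuarded` is a local copy of the guard that takes its SDK and flag store as parameters for testability, and `fakeSdk` is a hypothetical helper. A real test file would presumably use the repository's `it`/`expect` style and reset the globalThis flag between cases.

```typescript
type Sdk = { start(): void };
type Store = { started?: boolean };

// Local copy of the guard logic, parameterized so each case is isolated.
function startGuarded(sdk: Sdk, store: Store): void {
  if (store.started) return;
  try {
    sdk.start();
    store.started = true;
  } catch (error) {
    const msg = error instanceof Error ? error.message : String(error);
    if (msg.includes('MetricReader') && msg.includes('can not be bound')) {
      store.started = true;
      return;
    }
    throw error;
  }
}

// Hypothetical fake: counts start() calls and optionally throws.
function fakeSdk(failWith?: string): Sdk & { calls: number } {
  return {
    calls: 0,
    start() {
      this.calls += 1;
      if (failWith) throw new Error(failWith);
    },
  };
}

const results = { skipsWhenStarted: false, setsFlag: false, recovers: false, rethrows: false };

// 1. Skips start() when the flag is already set.
{
  const sdk = fakeSdk();
  startGuarded(sdk, { started: true });
  results.skipsWhenStarted = sdk.calls === 0;
}
// 2. Sets the flag after a successful start.
{
  const sdk = fakeSdk();
  const store: Store = {};
  startGuarded(sdk, store);
  results.setsFlag = sdk.calls === 1 && store.started === true;
}
// 3. Swallows the MetricReader binding error and sets the flag.
{
  const store: Store = {};
  startGuarded(fakeSdk('MetricReader can not be bound to a MeterProvider again.'), store);
  results.recovers = store.started === true;
}
// 4. Rethrows unrelated errors and leaves the flag unset.
{
  const store: Store = {};
  let rethrown = false;
  try {
    startGuarded(fakeSdk('exporter misconfigured'), store);
  } catch {
    rethrown = true;
  }
  results.rethrows = rethrown && store.started === undefined;
}
```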

Refs:

Slack Socket Mode HMR test pattern

🟡 Minor (1) 🟡

Inline Comments:

🟡 Minor: instrumentation.ts:104-107 Silent error suppression without logging

💭 Consider (2) 💭

💭 1) agents-api/src/instrumentation.ts:104 Brittle string matching for error detection

Issue: The error check msg.includes('MetricReader') && msg.includes('can not be bound') relies on specific error message text from @opentelemetry/sdk-node. If OpenTelemetry changes the error message wording in a future version, this check would fail.

Why: A dependency update could silently break the hot-reload fix. Since this is dev-only, it might not be caught until it disrupts developer workflow.

Fix: Add a version comment documenting the expected error format:

// Known error from @opentelemetry/sdk-node when SDK.start() is called multiple times.
// Verified against @opentelemetry/sdk-node@0.212.0. Update if error message changes.
if (msg.includes('MetricReader') && msg.includes('can not be bound')) {

Refs:

OpenTelemetry SDK Node source
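A complementary sketch: isolating the match in one small predicate keeps the version note and the string fragments in a single place, so a wording change upstream needs only one edit. The helper name `isMetricReaderRebindError` is hypothetical.

```typescript
// Known error from @opentelemetry/sdk-node when start() runs more than once.
// Matching on message text is fragile, so the fragments live in one place.
const REBIND_FRAGMENTS = ['MetricReader', 'can not be bound'] as const;

function isMetricReaderRebindError(error: unknown): boolean {
  const msg = error instanceof Error ? error.message : String(error);
  return REBIND_FRAGMENTS.every((fragment) => msg.includes(fragment));
}

// Usage with the message quoted in the PR description and an unrelated error.
const matchesKnownError = isMetricReaderRebindError(
  new Error('MetricReader can not be bound to a MeterProvider again.'),
);
const matchesOtherError = isMetricReaderRebindError(new Error('exporter misconfigured'));
```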

💭 2) agents-api/src/instrumentation.ts:102-108 Catch block sets flag even when SDK may be in partial state

Issue: If the MetricReader error is thrown, it means defaultSDK.start() partially succeeded before failing. Setting the flag to true prevents future retries, but the SDK may be in an inconsistent state.

Why: In the HMR context this is fine because the previous SDK instance's tracers are still working. But it's worth noting that subsequent calls to shutdown or flush on this new instance would operate on a broken SDK.

Fix: This is acceptable as-is for the HMR use case. Consider adding a code comment documenting this edge case for future maintainers.

🚫 REQUEST CHANGES

Summary: The approach using Symbol.for() + globalThis for idempotency is correct and follows established patterns in this codebase (see Slack Socket Mode). However, two issues need addressing before merge: (1) the template breaking change — create-agents-template imports the now-private defaultSDK export, and (2) the fix lacks test coverage for a behavior-critical change. Both are straightforward to fix.

Discarded (1)

Location	Issue	Reason Discarded
agents-docs/content/guides/observability/langfuse-usage.mdx:50	Documentation shows old OTEL pattern	Pre-existing issue, docs show user-maintained code example, LOW confidence

Reviewers (5)

Reviewer	Returned	Main Findings	Consider	Inline Comments	Discarded
pr-review-sre	3	1	0	1	1
pr-review-tests	2	1	1	0	0
pr-review-errors	2	0	1	1	0
pr-review-precision	2	1	1	0	0
pr-review-standards	0	0	0	0	0
Total	9	3	3	2	1

Note: Findings were deduplicated across reviewers — template breaking change was flagged by both pr-review-sre and pr-review-precision; silent error suppression was flagged by both pr-review-sre and pr-review-errors.

claude · 2026-03-13T14:18:28Z

agents-api/src/instrumentation.ts

+    if (msg.includes('MetricReader') && msg.includes('can not be bound')) {
+      (globalThis as unknown as { [key: symbol]: boolean })[OTEL_STARTED_KEY] = true;
+      return;
+    }


🟡 Minor: Silent error suppression without logging

Issue: When the MetricReader binding error is caught and suppressed, there's no logging to indicate this happened.

Why: This follows a different pattern than createSafeBatchProcessor() (line 37) which logs warnings for initialization failures. Without logging, debugging OTEL issues becomes harder — there's no trace that startup was skipped due to a binding conflict.

Fix: Add a debug/info log to maintain observability:

Suggested change

-    if (msg.includes('MetricReader') && msg.includes('can not be bound')) {
-      (globalThis as unknown as { [key: symbol]: boolean })[OTEL_STARTED_KEY] = true;
-      return;
-    }
+    if (msg.includes('MetricReader') && msg.includes('can not be bound')) {
+      logger.debug({}, 'OpenTelemetry SDK already started (MetricReader already bound), skipping re-initialization');
+      (globalThis as unknown as { [key: symbol]: boolean })[OTEL_STARTED_KEY] = true;
+      return;
+    }

Refs:

createSafeBatchProcessor logging pattern — existing fallback logging pattern

Slack Socket Mode HMR logging — logs "HMR reload detected"

itoqa · 2026-03-13T15:04:39Z

Ito Test Report ✅

19 test cases ran. 19 passed.

The verification run completed with stable API and UI behavior across baseline checks, stress scenarios, auth/session handling, navigation edge cases, and post-stress resilience validation. 🔍 Code review of the PR-scoped startup-idempotency change and observed runtime outcomes did not surface a code-supported defect under the exercised paths.

✅ Passed (19)

Test Case	Summary	Timestamp
ROUTE-1	Six consecutive /health probes returned 204 with no 5xx responses.	0:00
ROUTE-2	The /ready endpoint returned 200 with expected healthy schema fields and no server error.	0:00
ROUTE-3	Root navigation bootstrapped into tenant projects successfully with no observed auth/bootstrap 500 responses.	0:58
ROUTE-4	Tenant route navigation remained functional across settings and projects views with no server error page during transitions.	1:59
LOGIC-1	Completed 20 hard reload cycles on authenticated tenant route and app stayed usable without persistent server-error state.	16:05
LOGIC-2	Ran 10-tab concurrent load and refresh sequence; all tabs remained on valid tenant projects page without 5xx behavior.	16:05
LOGIC-3	No 5xx occurred across repeated auth-session probes and a valid dev-session bootstrap returned 200 with session cookie issuance.	18:03
EDGE-1	Direct deep-link to tenant projects from a cleared state resolved successfully with controlled auth bootstrap and no 500 response.	2:13
EDGE-2	Completed three back/back/forward/forward cycles across tenant routes; no stale 500/internal-error state appeared during traversal.	20:15
EDGE-3	Mobile viewport bootstrap completed into tenant projects with visible primary UI and no observed startup-related 500 responses.	2:34
EDGE-4	Repeated refresh interruptions during route activity recovered to stable tenant page each time with no persistent server-error loop.	16:05
EDGE-5	Completed 30 open/close churn iterations and periodic health probes; API health remained 204 throughout and UI stayed functional.	16:05
ADV-1	POST /api/auth/dev-session without Authorization returned controlled 401 response.	0:00
ADV-2	Ten rapid requests with incorrect bearer token consistently returned 401 without any 5xx.	0:00
ADV-3	Malformed and oversized Authorization header variants were rejected with controlled non-5xx statuses.	0:00
ADV-4	After deleting the auth cookie, protected-route fan-out completed with controlled route handling and no raw 500 error pages.	18:05
ADV-5	Encoded script/traversal path variants were handled in a controlled way (error/not-found style behavior) without 500 responses or raw script payload reflection.	20:31
ADV-6	High-concurrency mixed probes completed with only expected 200/204/401 statuses and no 5xx responses.	18:08
RESILIENCE-1	After a 2-minute post-stress interval, tenant UI routes remained functional, /health returned 204, /ready returned 200 with valid contract, and no delayed 500 regression was observed.	22:29


mike-inkeep · 2026-03-16T13:27:42Z

@claude add export back to defaultSDK

move new NodeSDK() into startOpenTelemetrySDK behind the same guard, or cache the instance on globalThis so repeated module evaluations reuse it

inkeep-internal-ci · 2026-03-16T13:28:06Z

Claude finished @mike-inkeep's task in 1m 43s

Changes implemented

Read current instrumentation.ts and template file
Add export back to defaultSDK (fixes breaking change for create-agents-template)
Move new NodeSDK() behind a globalThis guard via getOrCreateSDK() — repeated HMR module evaluations now reuse the cached instance instead of leaking fresh SDK objects
Commit and push

What changed in agents-api/src/instrumentation.ts:

Added OTEL_SDK_KEY symbol on globalThis to cache the NodeSDK instance
getOrCreateSDK() checks the cache first, only constructs a new NodeSDK if none exists
export const defaultSDK = getOrCreateSDK() — export is restored, template import is unbroken
startOpenTelemetrySDK() uses a shared OtelGlobal type alias (cleaner casts)
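The thread does not show the resulting code, so the following is only a sketch of how the caching described above might look. `FakeNodeSDK` stands in for NodeSDK, the symbol key string is an assumption (only the `OTEL_SDK_KEY` name appears above), and the shape of `OtelGlobal` is guessed.

```typescript
// Stand-in for NodeSDK; the real constructor takes a configuration object.
class FakeNodeSDK {
  static constructed = 0;
  constructor() {
    FakeNodeSDK.constructed += 1;
  }
  start(): void {}
}

// The key string is an assumption; the thread only names the OTEL_SDK_KEY symbol.
const OTEL_SDK_KEY = Symbol.for('example.otelSdk');
type OtelGlobal = { [key: symbol]: FakeNodeSDK | undefined };

function getOrCreateSDK(): FakeNodeSDK {
  const g = globalThis as unknown as OtelGlobal;
  const cached = g[OTEL_SDK_KEY];
  if (cached) return cached; // reuse the instance from an earlier evaluation
  const sdk = new FakeNodeSDK();
  g[OTEL_SDK_KEY] = sdk;
  return sdk;
}

// Simulate the module body running twice under HMR.
const firstEvaluation = getOrCreateSDK();
const secondEvaluation = getOrCreateSDK();
```

Because both evaluations receive the same object, the constructor runs once and no extra SDK instances accumulate across reloads.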

Restores the export on defaultSDK to avoid breaking the create-agents-template subpath import. Moves the new NodeSDK() construction behind a globalThis guard (getOrCreateSDK) so repeated Vite HMR module evaluations reuse the same instance instead of leaking fresh SDK objects.

Co-authored-by: mike-inkeep <mike-inkeep@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

itoqa · 2026-03-16T14:11:21Z

Ito Test Report ✅

19 test cases ran. 19 passed.

✅ The local dev stability run stayed healthy across API route checks, auth bootstrap flows, multi-tab/mobile stress, and adversarial auth requests. Code-change logs only show non-product test-run configuration tweaks (dev server host/overlay indicators), and no test output provided a code-supported defect signal, so all executed cases are reported as passed.

✅ Passed (19)

Test Case	Summary	Timestamp
ROUTE-1	Twenty sequential /health checks returned HTTP 204 with no HTTP 500 responses.	0:00
ROUTE-2	Readiness probes stayed healthy after route churn and returned HTTP 200 without HTTP 500.	0:00
ROUTE-3	Clean-context navigation to /default/projects rendered successfully with protected UI available.	1:08
ROUTE-4	OpenAPI remained parseable across repeated checks and docs route stayed reachable with no HTTP 500 status responses.	0:00
ROUTE-5	Ten rapid refreshes completed on /default/projects with stable page availability and no persistent failure pattern.	1:21
LOGIC-1	Repeated first-hit style root requests with periodic readiness/openapi probes stayed stable and did not show HTTP 500 responses.	0:00
LOGIC-2	Parallel bursts across core routes stayed responsive and did not produce HTTP 500 status responses.	0:00
LOGIC-3	Forced recovery probe completed with five healthy checks and successful return to protected UI; no poisoned state persisted.	7:25
EDGE-1	Thirty reloads with periodic readiness probes remained stable; all sampled health/readiness checks stayed valid.	1:38
EDGE-2	Five-tab concurrent navigation and reload sequence completed with responsive API endpoints and no persistent crash state.	2:31
EDGE-3	Five clean-cookie deep-link interruption cycles converged to rendered stats page without persistent 500 behavior.	2:47
EDGE-4	Three history replay cycles across mixed routes completed and follow-up health/readiness checks remained healthy.	3:05
ADV-1	All 10 unauthorized dev-session POST requests returned 401, and follow-up health/readiness checks stayed healthy (204/200) with no 500.	9:24
ADV-2	All malformed Authorization variants returned safe 401 responses, and service remained responsive afterward (health 204, ready 200) with no 500.	9:26
ADV-3	Executed 50 mixed requests plus protected-route reloads; service stayed responsive with no persistent 500 cascade.	7:23
ADV-4	Fresh-cookie protected-route probing completed without backend crash and service health stayed available.	3:21
RAPID-1	Thirty rapid route toggles plus periodic reloads completed with recoverable UI state and healthy backend probe.	3:42
MOBILE-1	At 390x844, repeated refresh and navigation toggles remained operable with no persistent startup-failure cascade.	4:05
JOURNEY-1	Repeated mid-journey refresh interruptions converged to a stable rendered state with backend health intact.	4:15


fix: Make OpenTelemetry startup idempotent

f5d50c4

vercel bot temporarily deployed to Preview – agents-docs March 13, 2026 14:11 Inactive

mike-inkeep requested a review from shagun-singh-inkeep March 13, 2026 14:11

pullfrog bot reviewed Mar 13, 2026

vercel bot deployed to Preview – agents-api March 13, 2026 14:13

vercel bot deployed to Preview – agents-manage-ui March 13, 2026 14:14

claude bot reviewed Mar 13, 2026

github-actions bot deleted a comment from claude bot Mar 13, 2026

vercel bot temporarily deployed to Preview – agents-docs March 16, 2026 13:29 Inactive

vercel bot deployed to Preview – agents-manage-ui March 16, 2026 13:31

vercel bot deployed to Preview – agents-api March 16, 2026 13:32

fix: Make OpenTelemetry startup idempotent #2684
mike-inkeep wants to merge 2 commits into main from fix/hot_reload_metricreader_error
