fix: Make OpenTelemetry startup idempotent#2684
Conversation
|
The latest updates on your projects. Learn more about Vercel for GitHub.
1 Skipped Deployment
|
|
There was a problem hiding this comment.
The idempotency approach is sound for the Vite HMR use case, but there's a breaking change that needs addressing before merge.
agents-api/src/instrumentation.ts
Outdated
| }); | ||
|
|
||
| export const defaultSDK = new NodeSDK({ | ||
| const defaultSDK = new NodeSDK({ |
There was a problem hiding this comment.
Removing the export from defaultSDK is a breaking change. create-agents-template/apps/agents-api/src/instrumentation.ts:19 does:
const { defaultSDK } = await import('@inkeep/agents-api/instrumentation');
defaultSDK.start();@inkeep/agents-api/instrumentation is a declared subpath export in agents-api/package.json. The template (and any downstream consumer using that subpath) will break at runtime.
Either:
- Update the template in this PR to use
startOpenTelemetrySDK(), or - Continue exporting
defaultSDK(possibly deprecated) alongside the new function.
| const OTEL_STARTED_KEY = Symbol.for('@inkeep/agents-api.otelStarted'); | ||
|
|
||
| export function startOpenTelemetrySDK(): void { | ||
| if ((globalThis as unknown as { [key: symbol]: boolean })[OTEL_STARTED_KEY]) { | ||
| return; | ||
| } | ||
| try { | ||
| defaultSDK.start(); | ||
| (globalThis as unknown as { [key: symbol]: boolean })[OTEL_STARTED_KEY] = true; | ||
| } catch (error) { | ||
| const msg = error instanceof Error ? error.message : String(error); | ||
| if (msg.includes('MetricReader') && msg.includes('can not be bound')) { | ||
| (globalThis as unknown as { [key: symbol]: boolean })[OTEL_STARTED_KEY] = true; | ||
| return; | ||
| } | ||
| throw error; | ||
| } | ||
| } |
There was a problem hiding this comment.
The new NodeSDK(...) constructor at line 85 still runs at module scope on every HMR re-evaluation — allocating fresh span processors, context managers, and instrumentations each time. The guard only prevents start() from firing twice, but the stale SDK objects from prior evaluations are leaked.
Consider moving the new NodeSDK(...) construction inside startOpenTelemetrySDK() behind the same guard, or caching the instance on globalThis so repeated module evaluations reuse it.
| const msg = error instanceof Error ? error.message : String(error); | ||
| if (msg.includes('MetricReader') && msg.includes('can not be bound')) { | ||
| (globalThis as unknown as { [key: symbol]: boolean })[OTEL_STARTED_KEY] = true; | ||
| return; |
There was a problem hiding this comment.
When this catch fires, OTEL recovered via the fallback path rather than the primary guard. A console.warn (or logger call) here would help operators distinguish between clean idempotency and error-recovery idempotency during debugging.
There was a problem hiding this comment.
PR Review Summary
(4) Total Issues | Risk: Medium
🟠⚠️ Major (2) 🟠⚠️
🟠 1) create-agents-template/apps/agents-api/src/instrumentation.ts:19-20 Breaking change: defaultSDK export removed but template still imports it
Issue: This PR removes the export keyword from defaultSDK (making it private), but create-agents-template still imports this export directly:
// create-agents-template/apps/agents-api/src/instrumentation.ts:19-20
const { defaultSDK } = await import('@inkeep/agents-api/instrumentation');
defaultSDK.start();Why: When users upgrade @inkeep/agents-api or scaffold a new project using the template, the code will fail at runtime with an import error. This is a breaking change that affects the template's compatibility with the package.
Fix: Update the template to use the new idempotent function:
// create-agents-template/apps/agents-api/src/instrumentation.ts
const { startOpenTelemetrySDK } = await import('@inkeep/agents-api/instrumentation');
startOpenTelemetrySDK();Refs:
🟠 2) agents-api/src/instrumentation.ts:95-110 Missing test coverage for idempotency logic
Issue: The new startOpenTelemetrySDK() function has zero test coverage. It has three distinct code paths: (1) early return when already started, (2) successful SDK start with flag set, and (3) error handling for MetricReader binding errors — none of which are tested.
Why: Without tests, future refactoring could break the HMR safety mechanism, causing the original 500 error to resurface. The codebase already has a test pattern for this exact scenario in socket-mode.test.ts:
// packages/agents-work-apps/src/__tests__/slack/socket-mode.test.ts:121
it('should skip if client already exists (HMR safety)', async () => {
(globalThis as Record<string, unknown>).__inkeep_slack_socket_mode_client__ = {};
// ...
expect(mockSocketModeStart).not.toHaveBeenCalled();
});Fix: Add a test file at agents-api/src/__tests__/instrumentation.test.ts with cases for:
should skip SDK start if already started (HMR safety)— verify early returnshould set global flag after successful start— verify flag is setshould catch MetricReader binding error and set flag— verify error suppressionshould rethrow non-MetricReader errors— verify other errors propagate
Refs:
🟡 Minor (1) 🟡
Inline Comments:
- 🟡 Minor:
instrumentation.ts:104-107Silent error suppression without logging
💭 Consider (2) 💭
💭 1) agents-api/src/instrumentation.ts:104 Brittle string matching for error detection
Issue: The error check msg.includes('MetricReader') && msg.includes('can not be bound') relies on specific error message text from @opentelemetry/sdk-node. If OpenTelemetry changes the error message wording in a future version, this check would fail.
Why: A dependency update could silently break the hot-reload fix. Since this is dev-only, it might not be caught until it disrupts developer workflow.
Fix: Add a version comment documenting the expected error format:
// Known error from @opentelemetry/sdk-node when SDK.start() is called multiple times.
// Verified against @opentelemetry/sdk-node@0.212.0. Update if error message changes.
if (msg.includes('MetricReader') && msg.includes('can not be bound')) {Refs:
💭 2) agents-api/src/instrumentation.ts:102-108 Catch block sets flag even when SDK may be in partial state
Issue: If the MetricReader error is thrown, it means defaultSDK.start() partially succeeded before failing. Setting the flag to true prevents future retries, but the SDK may be in an inconsistent state.
Why: In the HMR context this is fine because the previous SDK instance's tracers are still working. But it's worth noting that subsequent calls to shutdown or flush on this new instance would operate on a broken SDK.
Fix: This is acceptable as-is for the HMR use case. Consider adding a code comment documenting this edge case for future maintainers.
🚫 REQUEST CHANGES
Summary: The approach using Symbol.for() + globalThis for idempotency is correct and follows established patterns in this codebase (see Slack Socket Mode). However, two issues need addressing before merge: (1) the template breaking change — create-agents-template imports the now-private defaultSDK export, and (2) the fix lacks test coverage for a behavior-critical change. Both are straightforward to fix.
Discarded (1)
| Location | Issue | Reason Discarded |
|---|---|---|
agents-docs/content/guides/observability/langfuse-usage.mdx:50 |
Documentation shows old OTEL pattern | Pre-existing issue, docs show user-maintained code example, LOW confidence |
Reviewers (5)
| Reviewer | Returned | Main Findings | Consider | While You're Here | Inline Comments | Pending Recs | Discarded |
|---|---|---|---|---|---|---|---|
pr-review-sre |
3 | 1 | 0 | 0 | 1 | 0 | 1 |
pr-review-tests |
2 | 1 | 1 | 0 | 0 | 0 | 0 |
pr-review-errors |
2 | 0 | 1 | 0 | 1 | 0 | 0 |
pr-review-precision |
2 | 1 | 1 | 0 | 0 | 0 | 0 |
pr-review-standards |
0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Total | 9 | 3 | 3 | 0 | 2 | 0 | 1 |
Note: Findings were deduplicated across reviewers — template breaking change was flagged by both pr-review-sre and pr-review-precision; silent error suppression was flagged by both pr-review-sre and pr-review-errors.
| if (msg.includes('MetricReader') && msg.includes('can not be bound')) { | ||
| (globalThis as unknown as { [key: symbol]: boolean })[OTEL_STARTED_KEY] = true; | ||
| return; | ||
| } |
There was a problem hiding this comment.
🟡 Minor: Silent error suppression without logging
Issue: When the MetricReader binding error is caught and suppressed, there's no logging to indicate this happened.
Why: This follows a different pattern than createSafeBatchProcessor() (line 37) which logs warnings for initialization failures. Without logging, debugging OTEL issues becomes harder — there's no trace that startup was skipped due to a binding conflict.
Fix: Add a debug/info log to maintain observability:
| if (msg.includes('MetricReader') && msg.includes('can not be bound')) { | |
| (globalThis as unknown as { [key: symbol]: boolean })[OTEL_STARTED_KEY] = true; | |
| return; | |
| } | |
| if (msg.includes('MetricReader') && msg.includes('can not be bound')) { | |
| logger.debug({}, 'OpenTelemetry SDK already started (MetricReader already bound), skipping re-initialization'); | |
| (globalThis as unknown as { [key: symbol]: boolean })[OTEL_STARTED_KEY] = true; | |
| return; | |
| } |
Refs:
- createSafeBatchProcessor logging pattern — existing fallback logging pattern
- Slack Socket Mode HMR logging — logs "HMR reload detected"
|
@claude add export back to defaultSDK move new NodeSDK() into startOpenTelemetrySDK behind the same guard, or cache the instance on globalThis so repeated module evaluations reuse it |
|
Claude finished @mike-inkeep's task in 1m 43s —— View job Changes implemented
What changed in
|
Restores the export on defaultSDK to avoid breaking the create-agents-template subpath import. Moves the new NodeSDK() construction behind a globalThis guard (getOrCreateSDK) so repeated Vite HMR module evaluations reuse the same instance instead of leaking fresh SDK objects. Co-authored-by: mike-inkeep <mike-inkeep@users.noreply.github.com> Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Ito Test Report ✅19 test cases ran. 19 passed. ✅ The local dev stability run stayed healthy across API route checks, auth bootstrap flows, multi-tab/mobile stress, and adversarial auth requests. Code-change logs only show non-product test-run configuration tweaks (dev server host/overlay indicators), and no test output provided a code-supported defect signal, so all executed cases are reported as passed. ✅ Passed (19)
📋 View Recording |







































Vite hot reload frequently tries to re-run the otel SDK startup, which is not idempotent and leads to a 500 error on all subsequent requests. This requires restarting
pnpm dev.The error shows up as:
10:04:53 AM [vite] Internal server error: MetricReader can not be bound to a MeterProvider again.This PR makes that operation idempotent by using a global var to keep track of whether otel was started up already.