
Workflow fix #2685

Open
shagun-singh-inkeep wants to merge 6 commits into main from workflow-fix-reset

Conversation

@shagun-singh-inkeep
Collaborator

No description provided.

@vercel

vercel bot commented Mar 13, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project | Deployment | Actions | Updated (UTC)
agents-api | Ready | Preview, Comment | Mar 13, 2026 9:39pm
agents-docs | Error | | Mar 13, 2026 9:39pm
agents-manage-ui | Ready | Preview, Comment | Mar 13, 2026 9:39pm


@changeset-bot

changeset-bot bot commented Mar 13, 2026

🦋 Changeset detected

Latest commit: e1d47f2

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 10 packages
Name Type
@inkeep/agents-core Patch
@inkeep/agents-api Patch
@inkeep/agents-manage-ui Patch
@inkeep/agents-cli Patch
@inkeep/agents-sdk Patch
@inkeep/agents-work-apps Patch
@inkeep/ai-sdk-provider Patch
@inkeep/create-agents Patch
@inkeep/agents-email Patch
@inkeep/agents-mcp Patch

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

@changeset-bot

changeset-bot bot commented Mar 13, 2026

⚠️ No Changeset found

Latest commit: 8a88bd2

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

@pullfrog
Contributor

pullfrog bot commented Mar 13, 2026

TL;DR — Replaces the per-trigger daisy-chaining workflow architecture with a single centralized scheduler workflow that polls a new trigger_schedules runtime table every 60 seconds and dispatches one-shot workflows for each due trigger. This eliminates the complex adoption/supersession logic, removes the scheduled_workflows manage-DB dependency, and adds a post-deploy CI hook to restart the scheduler on the latest Vercel deployment.

Key changes

  • schedulerWorkflow + SchedulerService — New singleton long-lived workflow that ticks every 60 s, checks if it's still the active scheduler via scheduler_state, and dispatches due triggers.
  • triggerDispatcher — New dispatch layer: finds due rows in trigger_schedules, claims them with optimistic locking, starts one-shot scheduledTriggerRunnerWorkflow instances, and rolls back on failure.
  • trigger_schedules + scheduler_state tables — Two new runtime-DB tables (migration 0023) with a partial index for efficient dispatch queries.
  • ScheduledTriggerService rewrite — Lifecycle hooks (onTriggerCreated/Updated/Deleted) now upsert/delete rows in trigger_schedules instead of managing per-trigger workflow runs via DoltgreSQL.
  • scheduledTriggerRunnerWorkflow simplification — Converted from a daisy-chaining loop (sleep → execute → chain next) to a one-shot workflow (create invocation → execute with retries → done).
  • /api/deploy/restart-scheduler route — New deploy hook endpoint authenticated via INKEEP_AGENTS_RUN_API_BYPASS_SECRET, called by CI after Vercel promotion.
  • vercel-production.yml — New restart-scheduler job that curls the deploy hook after promotion.
  • computeNextRunAt extraction — Cron/one-time next-run logic extracted into a shared pure function used by both the service layer and the dispatcher.
  • Reconciliation handler simplification — scheduled_triggers check handler now returns empty results since per-trigger workflow tracking is removed.

Summary | 23 files | 2 commits | base: main ← workflow-fix-reset


Centralized schedulerWorkflow replaces per-trigger daisy-chaining

Before: Each enabled trigger spawned its own long-lived workflow that slept until the next execution, ran the agent, then daisy-chained a fresh workflow for the next iteration — with complex adoption and supersession logic in checkTriggerEnabledStep.
After: A single schedulerWorkflow runs in a loop (60 s ticks), querying trigger_schedules for due rows. Each trigger execution is a stateless one-shot workflow dispatched by triggerDispatcher.

The scheduler registers itself in scheduler_state (singleton row) and self-terminates if a newer run supersedes it. The dispatcher uses optimistic claim-based locking (claimedAt column) to prevent double-dispatch in multi-instance scenarios, with rollback on workflow start failure.
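The claim semantics can be sketched as a compare-and-set in memory (a minimal sketch with illustrative names; the actual dispatcher does this with a conditional UPDATE on trigger_schedules):

```typescript
// Minimal in-memory sketch of optimistic claim-based locking, assuming a
// claimedAt field that must be null before a dispatcher may claim a row.
// Names (TriggerSchedule, tryClaim, release) are illustrative, not the PR's API.
type TriggerSchedule = {
  scheduledTriggerId: string;
  nextRunAt: string;
  claimedAt: string | null;
};

// Claim succeeds only if the row is currently unclaimed (compare-and-set).
function tryClaim(row: TriggerSchedule, now: string): boolean {
  if (row.claimedAt !== null) return false; // another instance got here first
  row.claimedAt = now;
  return true;
}

function release(row: TriggerSchedule): void {
  row.claimedAt = null;
}

// Two dispatchers racing for the same row: only one wins.
const row: TriggerSchedule = {
  scheduledTriggerId: 't1',
  nextRunAt: '2026-03-13T00:00:00Z',
  claimedAt: null,
};
const a = tryClaim(row, '2026-03-13T00:00:01Z');
const b = tryClaim(row, '2026-03-13T00:00:02Z');
// a === true, b === false
```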

How does supersession work on deploy?

On Vercel, a new deployment triggers the restart-scheduler CI job, which POSTs to /api/deploy/restart-scheduler. This calls startSchedulerWorkflow(), which inserts a new currentRunId into scheduler_state. The old scheduler detects it's no longer current on its next tick via checkSchedulerCurrentStep and exits gracefully. On postgres/local worlds, the scheduler starts on server boot via recoverOrphanedWorkflows flow in index.ts.
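A minimal sketch of the supersession check, assuming a singleton state object standing in for the scheduler_state row (names are illustrative):

```typescript
// Sketch of scheduler supersession via a singleton state row. Illustrative
// only; the real code uses scheduler_state and checkSchedulerCurrentStep.
const schedulerState = { currentRunId: 'run-old' };

// Each tick, a scheduler checks whether it is still the registered run.
function isCurrentScheduler(myRunId: string): boolean {
  return schedulerState.currentRunId === myRunId;
}

// The deploy hook supersedes the old scheduler by registering a new run id.
function restartScheduler(newRunId: string): void {
  schedulerState.currentRunId = newRunId;
}

restartScheduler('run-new');
// The old scheduler sees it is superseded and exits on its next tick:
// isCurrentScheduler('run-old') === false
// isCurrentScheduler('run-new') === true
```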

schedulerWorkflow.ts · SchedulerService.ts · schedulerSteps.ts


triggerDispatcher — claim-based dispatch with rollback

Before: Trigger execution was tightly coupled to the per-trigger workflow lifecycle (sleep → wake → execute → chain).
After: dispatchDueTriggers() queries trigger_schedules for enabled rows with next_run_at <= now and claimed_at IS NULL, claims each row, advances next_run_at (or disables for one-time triggers), starts a one-shot workflow, and releases the claim. On workflow start failure, the schedule is rolled back.
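The advance-or-disable step can be sketched with a fixed interval standing in for real cron evaluation (the actual computeNextRunAt parses cron expressions; this is a simplified stand-in):

```typescript
// Simplified sketch of the advance-or-disable decision made at dispatch time.
// A fixed interval replaces real cron parsing to keep the example self-contained.
type Schedule = { nextRunAt: number; enabled: boolean; oneTime: boolean };

function advanceAfterDispatch(s: Schedule, intervalMs: number): Schedule {
  if (s.oneTime) {
    // One-time triggers are disabled rather than rescheduled.
    return { ...s, enabled: false };
  }
  return { ...s, nextRunAt: s.nextRunAt + intervalMs };
}

const cron = advanceAfterDispatch({ nextRunAt: 1000, enabled: true, oneTime: false }, 60_000);
const once = advanceAfterDispatch({ nextRunAt: 1000, enabled: true, oneTime: true }, 60_000);
// cron.nextRunAt === 61000; once.enabled === false
```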

triggerDispatcher.ts · computeNextRunAt.ts


trigger_schedules + scheduler_state runtime tables

Before: Workflow state lived in the manage DB's scheduled_workflows table (DoltgreSQL, branch-scoped).
After: Two new runtime-DB tables: trigger_schedules (composite PK on tenant_id + scheduled_trigger_id, partial index on next_run_at for dispatch) and scheduler_state (singleton row tracking the active scheduler run).

Table | Purpose | Key columns
trigger_schedules | Materialized view of enabled triggers for polling | next_run_at, claimed_at, enabled
scheduler_state | Tracks the active scheduler workflow | current_run_id, deployment_id
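A hypothetical sketch of what the DDL might look like, inferred from the columns listed above (the authoritative schema is in migration 0023; the types and constraints here are assumptions):

```sql
-- Hypothetical shape of the two runtime tables; the actual DDL lives in
-- 0023_amazing_romulus.sql and may differ in types and constraints.
CREATE TABLE trigger_schedules (
  tenant_id            text NOT NULL,
  scheduled_trigger_id text NOT NULL,
  next_run_at          timestamptz,
  claimed_at           timestamptz,
  enabled              boolean NOT NULL DEFAULT true,
  PRIMARY KEY (tenant_id, scheduled_trigger_id)
);

-- Partial index so the dispatch query only scans unclaimed, enabled, due rows.
CREATE INDEX trigger_schedules_due_idx
  ON trigger_schedules (next_run_at)
  WHERE enabled AND claimed_at IS NULL;

CREATE TABLE scheduler_state (
  id             integer PRIMARY KEY DEFAULT 1 CHECK (id = 1), -- singleton row
  current_run_id text,
  deployment_id  text
);
```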

runtime-schema.ts · 0023_amazing_romulus.sql · triggerSchedules.ts · schedulerState.ts


ScheduledTriggerService rewrite — sync to runtime table

Before: onTriggerCreated/Updated/Deleted resolved DoltgreSQL refs, managed scheduled_workflows records, and started/stopped per-trigger workflow runs.
After: These hooks call upsertTriggerSchedule / updateTriggerScheduleEnabled / deleteTriggerSchedule on the runtime DB. The centralized scheduler picks up changes on its next tick.
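The hook-to-table sync can be modeled in memory (illustrative sketch; the real hooks write to the runtime DB via upsertTriggerSchedule and friends):

```typescript
// In-memory model of the lifecycle-hook sync. Illustrative only; the real
// hooks call upsertTriggerSchedule / updateTriggerScheduleEnabled /
// deleteTriggerSchedule against the runtime DB.
const triggerSchedules = new Map<string, { nextRunAt: string | null; enabled: boolean }>();

function onTriggerCreated(id: string, nextRunAt: string): void {
  triggerSchedules.set(id, { nextRunAt, enabled: true });
}

function onTriggerUpdated(id: string, enabled: boolean): void {
  const row = triggerSchedules.get(id);
  if (row) row.enabled = enabled; // the scheduler picks this up on its next tick
}

function onTriggerDeleted(id: string): void {
  triggerSchedules.delete(id);
}

onTriggerCreated('t1', '2026-03-14T09:00:00Z');
onTriggerUpdated('t1', false);
// triggerSchedules.get('t1')!.enabled === false
onTriggerDeleted('t1');
// triggerSchedules.has('t1') === false
```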

ScheduledTriggerService.ts · scheduled-triggers.ts


scheduledTriggerRunnerWorkflow — one-shot execution

Before: The runner workflow contained sleep logic, daisy-chaining (startNextIterationStep), pre/post-sleep enabled checks, pending invocation lookup, and parent adoption.
After: A stateless one-shot workflow: verify trigger enabled → create invocation (idempotent) → execute with retries → mark completed/failed.

scheduledTriggerRunner.ts · scheduledTriggerSteps.ts


Deploy hook: /api/deploy/restart-scheduler

Before: No mechanism to move the scheduler onto a new Vercel deployment.
After: A POST /api/deploy/restart-scheduler endpoint (authenticated via INKEEP_AGENTS_RUN_API_BYPASS_SECRET) starts a new scheduler workflow. The vercel-production.yml CI pipeline calls it after promotion.

restartScheduler.ts · vercel-production.yml · index.ts



@pullfrog pullfrog bot left a comment


Solid architectural improvement — moving from per-trigger daisy-chaining to a centralized scheduler with a trigger_schedules table is much simpler to reason about and operate. The claim/advance/rollback pattern is well thought out.

A few issues to address before merge, ordered by severity:

  1. Security bug: The restart endpoint is accessible when INKEEP_AGENTS_RUN_API_BYPASS_SECRET is unset (undefined !== undefined is false).
  2. Correctness: lte vs eq in claim safety, missing cronTimezone in schedule-change detection, gutted reconciliation with no replacement.
  3. Robustness: No error handling in the restart handler, no error handling around the dispatch tick in the scheduler loop, claim-then-release flow has a theoretical re-dispatch window for fast cron intervals.

Issues on lines outside the diff (cannot be commented inline):

  • agents-api/src/data-reconciliation/handlers/scheduled-triggers.ts:14-16scheduleChanged checks cronExpression and runAt but not cronTimezone. Changing the timezone of a cron expression changes the effective schedule (e.g. 0 9 * * * in UTC vs EST), but this code path won't cancel pending invocations or recompute nextRunAt.

  • agents-api/src/data-reconciliation/handlers/scheduled-triggers.ts:26-34 — The check function now returns all-empty arrays, effectively disabling data reconciliation for scheduled triggers. In the new architecture, a useful check would verify that every enabled trigger in the manage DB has a corresponding trigger_schedules row with a non-null nextRunAt, and that no orphaned rows exist. Consider adding a TODO or a basic cross-table consistency check.


const authHeader = c.req.header('Authorization');
const token = authHeader?.replace('Bearer ', '');

if (token !== env.INKEEP_AGENTS_RUN_API_BYPASS_SECRET) {

Bug: undefined secret matches undefined token. INKEEP_AGENTS_RUN_API_BYPASS_SECRET is optional in env.ts. If unset, both env.INKEEP_AGENTS_RUN_API_BYPASS_SECRET and token are undefined — so undefined !== undefined is false, granting access to anyone. Add an early guard:

if (!env.INKEEP_AGENTS_RUN_API_BYPASS_SECRET) {
  return c.json({ error: 'Endpoint not configured' }, 503);
}

return c.json({ error: 'Unauthorized' }, 401);
}

const result = await startSchedulerWorkflow();

If startSchedulerWorkflow() throws (e.g. DB connection failure), Hono's onError handler returns a generic 500, and the CI curl -sf will retry with no actionable info. Wrap in try-catch and return a structured error body so CI logs are useful for debugging.

}),
async (c) => {
const authHeader = c.req.header('Authorization');
const token = authHeader?.replace('Bearer ', '');

Nit: replace('Bearer ', '') doesn't validate the prefix; a header like Basic xyz passes through unchanged and is compared as-is. Use authHeader?.startsWith('Bearer ') ? authHeader.slice(7) : undefined for stricter parsing.

expectedClaimedAt: string | null;
}): Promise<boolean> => {
const claimCondition = params.expectedClaimedAt
? lte(triggerSchedules.claimedAt, params.expectedClaimedAt)

lte should be eq for claim safety. lte(claimedAt, expectedClaimedAt) succeeds if claimedAt is any value ≤ the expected timestamp — e.g., a stale claim from a previous cycle would still match. This defeats the optimistic-concurrency purpose. Use eq so only the exact expected state can be claimed.

nextRunAt: string | null;
enabled?: boolean;
}): Promise<void> => {
const set: Record<string, unknown> = {

Nit: Record<string, unknown> bypasses Drizzle's column-name validation — typos in property names would be silently ignored. Consider building the object inline or typing as Partial<typeof triggerSchedules.$inferInsert>.

return 'skipped';
}

await releaseTriggerScheduleClaim(runDbClient)({

Releasing the claim after advancing nextRunAt creates a window where the row is eligible again. For most cron intervals this is fine because nextRunAt is in the future. But for fast intervals (every minute), the next nextRunAt could already be due by the time releaseTriggerScheduleClaim runs. Consider combining advance + release into a single atomic update, or omitting the release entirely (the advance already moves the schedule forward).

tenantId,
scheduledTriggerId,
nextRunAt: schedule.nextRunAt,
enabled: isOneTime ? true : schedule.enabled,

For one-time triggers, rollback restores enabled: true. If the workflow engine stays down, the next tick (60s later) will re-dispatch indefinitely. Consider adding a failedAttempts counter or max-retry cap to prevent infinite re-dispatch of one-time triggers.

return { status: 'superseded', runId: myRunId };
}

await dispatchDueTriggersStep();

If dispatchDueTriggersStep throws (e.g. transient DB error), the workflow step will fail and the framework's retry mechanism kicks in. Depending on backoff behavior, this could repeatedly spam the DB. Consider wrapping the dispatch in a try-catch to log and continue to the next tick for transient errors.

"${{ needs.deploy-agents-api.outputs.url }}/api/deploy/restart-scheduler" \
-H "Authorization: Bearer ${{ secrets.INKEEP_AGENTS_RUN_API_BYPASS_SECRET }}" \
--retry 3 \
--retry-delay 5

Consider adding --retry-all-errors and --max-time 30. --retry 3 only retries errors curl treats as transient (timeouts and certain HTTP responses such as 429 and most 5xx). If the new deployment hasn't finished booting, the connection may be refused outright, and curl won't retry that without --retry-all-errors.

Comment on lines +16 to +21
const run = await start(schedulerWorkflow, []);

await upsertSchedulerState(runDbClient)({
currentRunId: run.runId,
deploymentId: getDeploymentId(),
});

Race: supersession window allows dual dispatch. start(schedulerWorkflow) on line 16 launches the new scheduler, which calls registerSchedulerStep (writing runId to scheduler_state). But the old scheduler may have already woken and passed checkSchedulerCurrentStep before the upsert lands. Consider calling upsertSchedulerState before start(schedulerWorkflow) with a sentinel value, so the old scheduler's next check fails immediately.


@claude claude bot left a comment


PR Review Summary

(9) Total Issues | Risk: High

🔴❗ Critical (2) ❗🔴

🔴 1) trigger_schedules Missing data migration for existing scheduled triggers

Issue: The new trigger_schedules runtime table will be empty after deployment. Existing enabled triggers in the manage DB (scheduled_triggers table) will not be backfilled. The scheduler workflow reads from trigger_schedules to dispatch triggers, but only newly created/updated triggers will be synced via onTriggerCreated/onTriggerUpdated.

Why: After deploying this migration, all existing enabled scheduled triggers will stop executing. The old per-trigger daisy-chaining workflows have been removed, and the new centralized scheduler will find zero rows in trigger_schedules. Production cron jobs and one-time triggers will silently fail to run until manually touched via the API or UI. This is a data consistency gap that will cause production outages.

Fix: Add a one-time backfill step. Options:

  1. Add a migration script (similar to scripts/sync-spicedb.sh) that queries all enabled triggers from manage DB and inserts corresponding rows into trigger_schedules
  2. Add backfill logic to startSchedulerWorkflow() in agents-api/src/index.ts that runs on server startup before the scheduler starts
  3. Use the data reconciliation framework to sync existing triggers on first scheduler tick

Example:

// In SchedulerService.ts or index.ts startup
async function backfillTriggerSchedules() {
  const enabledTriggers = await listEnabledScheduledTriggers(manageDb)({ scopes: { /* all tenants */ } });
  for (const trigger of enabledTriggers) {
    await syncTriggerToScheduleTable(trigger);
  }
}


Inline Comments:

  • 🔴 Critical: restartScheduler.ts:32 Timing-attack vulnerable secret comparison + bypass when secret unset

🟠⚠️ Major (4) 🟠⚠️

🟠 1) trigger_schedules No claim timeout mechanism - stuck triggers unrecoverable

Issue: The claimedAt field has no expiration mechanism. If a dispatcher crashes after claimTriggerSchedule but before releaseTriggerScheduleClaim, the trigger remains claimed indefinitely. The partial index excludes claimed rows from findDueTriggerSchedules, so the trigger will never fire again.

Why: This creates a permanent silent failure mode. The only recovery would be manual database intervention to clear claimedAt. There's no alerting, no self-healing, and no visibility into stuck claims.

Fix: Modify the claim condition to treat stale claims as reclaimable:

// In claimTriggerSchedule - also claim if claimedAt is older than threshold
const claimCondition = or(
  isNull(triggerSchedules.claimedAt),
  lt(triggerSchedules.claimedAt, sql`now() - interval '5 minutes'`)
);

Or add a periodic cleanup job that releases claims older than a threshold.


🟠 2) scheduled-triggers.ts Data reconciliation check gutted - no observability into scheduler health

Issue: The check() function now returns empty arrays for all audit categories. This removes the ability to detect orphaned, missing, or stuck triggers through the data reconciliation system.

Why: Operations teams lose visibility into scheduled trigger health. If triggers fail to sync to trigger_schedules or the scheduler workflow dies, there is no automated detection. This contradicts the existing data reconciliation pattern used throughout the codebase.

Fix: Implement a new check function that validates the new architecture:

check: async (ctx): Promise<ScheduledTriggerAuditResult> => {
  const [enabledTriggers, schedules] = await Promise.all([
    listEnabledScheduledTriggers(ctx.manageDb)({ scopes: ctx.scopes }),
    listTriggerSchedulesByProject(ctx.runDb)({ scopes: ctx.scopes }),
  ]);
  
  const scheduleMap = new Map(schedules.map(s => [s.scheduledTriggerId, s]));
  const missingWorkflows = enabledTriggers
    .filter(t => !scheduleMap.has(t.id))
    .map(t => ({ triggerId: t.id, triggerName: t.name }));
  
  return { missingWorkflows, orphanedWorkflows: [], staleWorkflows: [], deadWorkflows: [], verificationFailures: [] };
}


🟠 3) system Missing changeset for schema/behavior changes

Issue: This PR adds a database migration, new data-access exports, and a new API endpoint. Per AGENTS.md changeset guidance, schema changes requiring migration warrant a minor bump.

Fix: Create a changeset:

pnpm bump minor --pkg agents-core --pkg agents-api "Add scheduler workflow with centralized trigger dispatch and deploy restart endpoint"

🟠 4) system Critical paths have no test coverage

Issue: The following new code has no test coverage:

  • dispatchSingleTrigger — claim/advance/rollback concurrency control
  • computeNextRunAt — cron parsing and timezone handling
  • claimTriggerSchedule — optimistic locking primitive
  • checkSchedulerCurrentStep — scheduler supersession logic
  • /api/deploy/restart-scheduler — auth validation

Why: The dispatcher's claim/advance/rollback sequence is the core scheduling mechanism. A bug here could cause duplicate dispatches, permanently stuck triggers, or silent missed executions. These are the highest-risk code paths with zero coverage.


Inline Comments:

  • 🟠 Major: triggerDispatcher.ts:108 No error handling for claim release
  • 🟠 Major: restartScheduler.ts:36 No error handling for scheduler restart
  • 🟠 Major: computeNextRunAt.ts:21 Unhandled cron parsing exception

🟡 Minor (3) 🟡

🟡 1) SchedulerService.ts:16 Race condition between workflow start and state registration

Issue: The scheduler state is updated (line 18-21) after start() succeeds (line 16). If the process crashes between these operations, the new workflow runs without being registered in scheduler_state. However, registerSchedulerStep inside the workflow also calls upsertSchedulerState, providing a fallback.

Why: The double-registration is fine but creates potential inconsistency during the race window. Low severity because the workflow self-registers.


🟡 2) index.ts:142 Scheduler startup failure doesn't prevent server from serving traffic

Issue: If startSchedulerWorkflow() fails during startup, the error is caught and logged but the server continues running without a scheduler. This is a silent failure state.

Fix: Consider exposing scheduler status via a health endpoint, or emit metrics/alerts when the scheduler fails to start.


🟡 3) triggerSchedules.ts:87-101 No limit on findDueTriggerSchedules query

Issue: The query returns all due triggers without a LIMIT clause. If many triggers become due simultaneously (e.g., after an outage), this could cause memory pressure.

Fix: Add a configurable LIMIT and process in batches across ticks.
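A sketch of the suggested bounded batch, assuming due rows are taken oldest-first and leftovers roll over to the next tick (hypothetical helper, not code from the PR):

```typescript
// Hypothetical bounded-batch helper. The PR's findDueTriggerSchedules
// currently returns all due rows; this shows the suggested LIMIT semantics.
function takeDueBatch<T extends { nextRunAt: number }>(
  rows: T[],
  now: number,
  limit: number,
): T[] {
  return rows
    .filter((r) => r.nextRunAt <= now)
    .sort((a, b) => a.nextRunAt - b.nextRunAt) // oldest first
    .slice(0, limit); // leftover due rows are picked up on the next tick
}

const due = takeDueBatch(
  [{ nextRunAt: 1 }, { nextRunAt: 3 }, { nextRunAt: 2 }, { nextRunAt: 99 }],
  10,
  2,
);
// due === [{ nextRunAt: 1 }, { nextRunAt: 2 }]
```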


Inline Comments:

  • 🟡 Minor: vercel-production.yml:170 Missing timeout-minutes
  • 🟡 Minor: vercel-production.yml:178 curl --retry won't retry refused connections

🚫 REQUEST CHANGES

Summary: This PR introduces a well-designed centralized scheduler architecture, but has a critical data migration gap that will break all existing scheduled triggers on deployment. The trigger_schedules table will be empty, and existing triggers won't fire until manually touched. Additionally, there are several error handling gaps that could leave triggers permanently stuck, and the data reconciliation check has been gutted without replacement. Please address the migration backfill and error handling issues before merging.

Discarded (8)
Location | Issue | Reason Discarded
schedulerSteps.ts:8 | Duplicate scheduler state registration | Intentional fallback - workflow self-registers as recovery mechanism
triggerDispatcher.ts:89 | Non-null assertion on nextRunAt | Query guarantees non-null via WHERE clause; fragile but correct
triggerDispatcher.ts:45-49 | Error logging without trigger context | Minor logging improvement, not blocking
schedulerWorkflow.ts:30 | Missing error handling in infinite loop | Workflow framework handles step retries
vercel-production.yml:9 | Missing secret documentation | Documentation-only, not blocking
runtime-schema.ts:960 | PK doesn't include all scope columns | Intentional simplification, scheduledTriggerId unique per tenant
restartScheduler.ts:20 | Handler-level auth vs middleware | Intentional pattern for deploy hooks
ScheduledTriggerService.ts:45-52 | No error handling in onTriggerCreated | Error propagates to caller appropriately
Reviewers (8)
Reviewer | Returned | Main Findings | Consider | While You're Here | Inline Comments | Pending Recs | Discarded
pr-review-architecture | 5 | 2 | 0 | 0 | 0 | 0 | 3
pr-review-errors | 8 | 1 | 0 | 0 | 3 | 0 | 4
pr-review-sre | 8 | 2 | 0 | 0 | 1 | 0 | 5
pr-review-tests | 6 | 1 | 0 | 0 | 0 | 0 | 5
pr-review-breaking-changes | 4 | 1 | 0 | 0 | 0 | 0 | 3
pr-review-security-iam | 3 | 0 | 0 | 0 | 1 | 0 | 2
pr-review-devops | 6 | 1 | 0 | 0 | 2 | 0 | 3
pr-review-standards | 3 | 1 | 0 | 0 | 0 | 0 | 2
Total | 43 | 9 | 0 | 0 | 7 | 0 | 27


if (token !== env.INKEEP_AGENTS_RUN_API_BYPASS_SECRET) {
return c.json({ error: 'Unauthorized' }, 401);
}

🔴 CRITICAL: Timing-attack vulnerable secret comparison + bypass when secret unset

Issue: This comparison is not constant-time and allows auth bypass when INKEEP_AGENTS_RUN_API_BYPASS_SECRET is not configured. When the env var is unset (undefined), and no Authorization header is sent (token is also undefined), then undefined !== undefined is false, so auth passes.

Why: An attacker could probe the endpoint to infer the secret via timing differences. More critically, if deployed without the secret configured, the endpoint becomes fully unauthenticated, allowing anyone to restart the scheduler.

Fix:

Suggested change:

if (!env.INKEEP_AGENTS_RUN_API_BYPASS_SECRET) {
  return c.json({ error: 'Endpoint not available' }, 503);
}
if (!token || !constantTimeEqual(token, env.INKEEP_AGENTS_RUN_API_BYPASS_SECRET)) {
  return c.json({ error: 'Unauthorized' }, 401);
}

You'll need to add a constantTimeEqual helper using crypto.timingSafeEqual() - see existing patterns in packages/agents-core/src/utils/apiKeys.ts:97.
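An assumed shape for such a helper, built on Node's crypto.timingSafeEqual (the project's existing pattern in agents-core may differ):

```typescript
import { timingSafeEqual } from 'node:crypto';

// Assumed sketch of a constantTimeEqual helper; the project's own version
// in packages/agents-core/src/utils/apiKeys.ts may differ.
function constantTimeEqual(a: string, b: string): boolean {
  const bufA = Buffer.from(a);
  const bufB = Buffer.from(b);
  // timingSafeEqual throws on length mismatch, so guard first. The length
  // check leaks only the length, not the secret's contents.
  if (bufA.length !== bufB.length) return false;
  return timingSafeEqual(bufA, bufB);
}

// constantTimeEqual('secret', 'secret') === true
// constantTimeEqual('secret', 'Secret') === false
```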

await releaseTriggerScheduleClaim(runDbClient)({
tenantId,
scheduledTriggerId,
});

🟠 MAJOR: No error handling for claim release - triggers can get stuck permanently

Issue: If releaseTriggerScheduleClaim fails after the workflow has been successfully started (line 93), the trigger remains claimed indefinitely. There's no claim timeout mechanism in the schema, so this trigger would never fire again.

Why: A transient database error at this point would leave claimedAt set with no recovery path. The only symptom would be silently missing scheduled executions, which are very hard to debug.

Fix:

Suggested change:

try {
  await releaseTriggerScheduleClaim(runDbClient)({
    tenantId,
    scheduledTriggerId,
  });
} catch (err) {
  logger.error(
    { scheduledTriggerId, err },
    'Failed to release trigger claim after successful dispatch - trigger may remain claimed'
  );
  // Don't throw - workflow is already running
}

Also consider adding a claim timeout mechanism in findDueTriggerSchedules to reclaim triggers where claimed_at is older than a threshold (e.g., 5 minutes).


const result = await startSchedulerWorkflow();

logger.info(result, 'Scheduler workflow restarted via deploy hook');

🟠 MAJOR: No error handling - scheduler restart failures are silent

Issue: If startSchedulerWorkflow() throws, the error propagates as an unhandled exception. The CI deploy step will fail, but the error details won't be surfaced in the response body.

Why: A failed scheduler restart during deployment leaves the system without an active scheduler workflow. Scheduled triggers would silently stop firing with no clear indication of why.

Fix:

Suggested change:

try {
  const result = await startSchedulerWorkflow();
  logger.info(result, 'Scheduler workflow restarted via deploy hook');
  return c.json(result);
} catch (err) {
  logger.error({ error: err }, 'Failed to restart scheduler workflow via deploy hook');
  return c.json(
    { error: 'Failed to start scheduler workflow', details: err instanceof Error ? err.message : String(err) },
    500
  );
}

currentDate: baseDate,
tz: cronTimezone || 'UTC',
});
return interval.next().toISOString();

🟠 MAJOR: Unhandled cron parsing exception can leave triggers stuck

Issue: CronExpressionParser.parse() throws if the cron expression is invalid. This function is called from syncTriggerToScheduleTable and dispatchSingleTrigger. An invalid expression that bypasses validation would cause unhandled exceptions.

Why: In dispatchSingleTrigger, this throws after claiming the trigger but before advancing it, leaving the trigger stuck in claimed state.

Fix:

Suggested change:

if (cronExpression) {
  try {
    const baseDate = lastScheduledFor ? new Date(lastScheduledFor) : new Date();
    const interval = CronExpressionParser.parse(cronExpression, {
      currentDate: baseDate,
      tz: cronTimezone || 'UTC',
    });
    return interval.next().toISOString();
  } catch (err) {
    throw new Error(
      `Invalid cron expression '${cronExpression}': ${err instanceof Error ? err.message : String(err)}`
    );
  }
}

restart-scheduler:
name: Restart scheduler workflow
needs: [promote, deploy-agents-api]
runs-on: ubuntu-latest

🟡 Minor: Missing timeout-minutes on restart-scheduler job

Issue: This job has no timeout-minutes setting. If the endpoint hangs, the job could run until GitHub's 6-hour default.

Fix:

Suggested change:

runs-on: ubuntu-latest
timeout-minutes: 5
steps:

Other jobs in this repo use timeout-minutes: 15-30. Since the curl has retries (max ~15s), 5 minutes provides ample margin.

"${{ needs.deploy-agents-api.outputs.url }}/api/deploy/restart-scheduler" \
-H "Authorization: Bearer ${{ secrets.INKEEP_AGENTS_RUN_API_BYPASS_SECRET }}" \
--retry 3 \
--retry-delay 5

🟡 Minor: curl --retry won't retry refused connections by default

Issue: --retry 3 only retries errors curl classifies as transient (timeouts and HTTP 408/429 and most 5xx responses). A refused connection, which is likely while the new deployment is still booting, is not retried.

Fix: Add --retry-all-errors so curl also retries refused connections and other hard failures:

Suggested change:

--retry 3 \
--retry-delay 5 \
--retry-all-errors

This is curl 7.71+, available on ubuntu-latest runners.

@github-actions github-actions bot deleted a comment from claude bot Mar 13, 2026
@itoqa

itoqa bot commented Mar 13, 2026

Ito Test Report ❌

19 test cases ran. 12 passed, 7 failed.

This run confirms multiple production-code defects across deploy-hook authentication and scheduled-trigger execution paths. ✅ Core CRUD, authorization boundaries, mobile usability, and injection protections behaved as expected in included passing cases, while several scheduling and rapid-action flows still show real correctness gaps under code-first review.

✅ Passed (12)
Test Case | Summary | Timestamp | Screenshot
ROUTE-4 | Created a recurring scheduled trigger, edited it, and confirmed updated values persisted in the scheduled triggers list. | 3:55 | ROUTE-4_3-55.png
ROUTE-5 | Deleting the scheduled trigger removed it from the list, and refreshing the stale invocations tab produced a safe 404 page. | 8:07 | ROUTE-5_8-07.png
ROUTE-6 | Invocation list stayed coherent while switching status views after trigger execution, with no duplicated historical entries. | 15:43 | ROUTE-6_15-43.png
ROUTE-7 | Workflow process endpoint returned 200 with processed=true and timestamp. | 0:00 | ROUTE-7_0-00.png
EDGE-3 | After disabling the recurring trigger and observing/refreshing, no new scheduler dispatches were observed. | 15:43 | EDGE-3_15-43.png
EDGE-5 | Submit + immediate refresh kept a single created trigger row with no duplicates. | 20:46 | EDGE-5_20-46.png
JOURNEY-1 | Deep-link edit and invocations navigation remained coherent through back/forward transitions and hard refresh. | 8:07 | JOURNEY-1_8-07.png
MOBILE-1 | iPhone 13 viewport retained accessible trigger list actions, edit controls, and save flow. | 8:08 | MOBILE-1_8-08.png
ADV-1 | Malformed Authorization variants consistently returned 401 Unauthorized. | 0:00 | ADV-1_0-00.png
ADV-2 | Cross-project and cross-tenant tampering attempts returned 403 without foreign metadata disclosure. | 26:57 | ADV-2_26-57.png
ADV-3 | Non-admin runAsUser patch attempt returned 403 and did not change trigger runAsUserId. | 26:57 | ADV-3_26-57.png
ADV-4 | Script-like payload/template content did not execute in UI and surfaces remained stable. | 21:19 | ADV-4_21-19.png
❌ Failed (7)
| Test Case | Summary | Timestamp | Screenshot |
| --- | --- | --- | --- |
| ROUTE-1 | Both bearer-token restart attempts returned 401 Unauthorized instead of 200. | 0:00 | ROUTE-1_0-00.png |
| ROUTE-2 | Request without Authorization header returned 200 with runId payload instead of 401. | 0:00 | ROUTE-2_0-00.png |
| EDGE-1 | Invalid cron value was accepted and rendered as 'Hourly at :61', showing incomplete validation. | 18:15 | EDGE-1_18-15.png |
| EDGE-2 | One-time trigger created with past runAt appeared as a new pending invocation. | 15:43 | EDGE-2_15-43.png |
| EDGE-4 | Equivalent 1:00 AM schedules produced inconsistent nextRunAt behavior across timezone paths. | 20:18 | EDGE-4_20-18.png |
| RAPID-1 | Rapid rerun/cancel interactions produced multiple invocation rows for one trigger. | 15:43 | RAPID-1_15-43.png |
| ADV-5 | Burst traffic still returned 401 for bearer-token requests, with no successful authorized restart observed. | 0:00 | ADV-5_0-00.png |
Restart scheduler endpoint accepts valid bearer token – Failed
  • Where: Deploy hook API endpoint /api/deploy/restart-scheduler in agents-api.

  • Steps to reproduce: Send POST /api/deploy/restart-scheduler with Authorization: Bearer <valid-token> while bypass secret is unset in runtime env.

  • What failed: Valid bearer-token calls are rejected with 401 instead of restarting the scheduler workflow.

  • Code analysis: The handler strips the bearer prefix and compares token equality directly against an optional env var; when the secret is missing, any provided token fails the equality check.

  • Relevant code:

    agents-api/src/routes/restartScheduler.ts (lines 27-31)

    const authHeader = c.req.header('Authorization');
    const token = authHeader?.replace('Bearer ', '');
    
    if (token !== env.INKEEP_AGENTS_RUN_API_BYPASS_SECRET) {
      return c.json({ error: 'Unauthorized' }, 401);
    }

    agents-api/src/env.ts (lines 79-82)

    INKEEP_AGENTS_RUN_API_BYPASS_SECRET: z
      .string()
      .optional()
      .describe('Run API bypass secret for local development and testing (skips auth)'),
  • Why this is likely a bug: Optional secret configuration combined with direct equality creates a broken auth path where valid bearer tokens fail when the secret is unset.

  • Introduced by this PR: Yes – this PR modified the relevant code.

Restart scheduler endpoint rejects missing token – Failed
  • Where: Deploy hook API endpoint /api/deploy/restart-scheduler in agents-api.

  • Steps to reproduce: Send POST /api/deploy/restart-scheduler without an Authorization header when bypass secret is unset.

  • What failed: Missing-token request succeeds with 200 and returns scheduler run IDs instead of 401.

  • Code analysis: Both token and env.INKEEP_AGENTS_RUN_API_BYPASS_SECRET can be undefined; the equality check then passes and incorrectly authorizes an unauthenticated restart.

  • Relevant code:

    agents-api/src/routes/restartScheduler.ts (lines 27-34)

    const authHeader = c.req.header('Authorization');
    const token = authHeader?.replace('Bearer ', '');
    
    if (token !== env.INKEEP_AGENTS_RUN_API_BYPASS_SECRET) {
      return c.json({ error: 'Unauthorized' }, 401);
    }
    
    const result = await startSchedulerWorkflow();
  • Why this is likely a bug: An unauthenticated request should never be authorized; current logic allows auth bypass when the secret is undefined.

  • Introduced by this PR: Yes – this PR modified the relevant code.
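Both restart-scheduler failures above trace to the same comparison: when the optional secret is unset, `token !== undefined` rejects valid tokens, and `undefined !== undefined` authorizes requests with no token at all. A fail-closed sketch (function and variable names here are illustrative assumptions, not the repository's actual identifiers) looks like:

```typescript
import { timingSafeEqual } from 'node:crypto';

// Sketch of a fail-closed bearer check: an unset secret or missing token
// must never authorize, and matching is done in constant time.
function isAuthorized(token: string | undefined, secret: string | undefined): boolean {
  if (!secret || !token) return false; // fail closed
  const a = Buffer.from(token);
  const b = Buffer.from(secret);
  // timingSafeEqual throws on unequal lengths, so compare lengths first.
  if (a.length !== b.length) return false;
  return timingSafeEqual(a, b);
}
```

In the handler, an unset secret could instead return 503 to signal the endpoint is not configured, rather than silently failing or passing requests through.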

Invalid cron expression does not break list or edit surfaces – Failed
  • Where: Scheduled trigger creation/update schema and list-time next-run computation.

  • Steps to reproduce: Create or update a trigger with an out-of-range cron minute (for example 61 * * * *) and load scheduled trigger list.

  • What failed: Invalid cron is accepted at write time and later fails at parse time, producing inconsistent/blank next-run behavior.

  • Code analysis: Cron validation uses a regex that checks token structure but not numeric ranges; later parsing errors are only logged and swallowed.

  • Relevant code:

    packages/agents-core/src/validation/schemas.ts (lines 949-954)

    export const CronExpressionSchema = z
      .string()
      .regex(
        /^(\*(?:\/\d+)?|[\d,-]+(?:\/\d+)?)\s+(\*(?:\/\d+)?|[\d,-]+(?:\/\d+)?)\s+(\*(?:\/\d+)?|[\d,-]+(?:\/\d+)?)\s+(\*(?:\/\d+)?|[\d,-]+(?:\/\d+)?)\s+(\*(?:\/\d+)?|[\d,\-A-Za-z]+(?:\/\d+)?)$/,
        'Invalid cron expression. Expected 5 fields: minute hour day month weekday'
      )

    agents-api/src/domains/manage/routes/scheduledTriggers.ts (lines 156-166)

    try {
      const interval = CronExpressionParser.parse(trigger.cronExpression, {
        currentDate: baseDate,
        tz: trigger.cronTimezone || 'UTC',
      });
      const nextDate = interval.next();
      runInfo.nextRunAt = nextDate.toISOString();
    } catch (error) {
      logger.warn(
        { triggerId: trigger.id, cronExpression: trigger.cronExpression, error },
        'Failed to calculate nextRunAt from cron expression'
      );
    }
  • Why this is likely a bug: Production input validation should reject invalid schedules up front instead of persisting values that break runtime scheduling logic.

  • Introduced by this PR: No – pre-existing bug (code not changed in this PR).

  • Timestamp: 18:15
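To illustrate the validation gap, a range-aware check for just the minute field (a standalone sketch, not the project's schema) rejects `61` where the current structure-only regex accepts it:

```typescript
// Sketch: validate the numeric range of a cron minute field (0-59),
// covering "*", steps ("*/5"), lists ("15,45"), and ranges ("0-59").
function minuteFieldInRange(field: string): boolean {
  const [base, step] = field.split('/');
  if (step !== undefined && !/^\d+$/.test(step)) return false;
  if (base === '*') return true;
  for (const part of base.split(',')) {
    const [lo, hi] = part.split('-').map(Number);
    if (Number.isNaN(lo) || lo < 0 || lo > 59) return false;
    if (hi !== undefined && (Number.isNaN(hi) || hi > 59 || hi < lo)) return false;
  }
  return true;
}
```

In practice a simpler option is to run the same cron parser used at read time (CronExpressionParser) inside a schema `refine` at write time, so write-time and runtime acceptance can never diverge.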

One-time trigger with past runAt is handled safely – Failed
  • Where: Trigger schedule computation and dispatcher due-trigger logic.

  • Steps to reproduce: Create a one-time trigger with runAt already in the past and allow scheduler dispatch tick.

  • What failed: A past one-time schedule is treated as immediately due and creates a pending invocation instead of being rejected or ignored.

  • Code analysis: computeNextRunAt returns runAt as-is (even if past), and dispatcher queries all due schedules with asOf: now, so stale one-time schedules dispatch instantly.

  • Relevant code:

    agents-api/src/domains/run/services/computeNextRunAt.ts (lines 11-13)

    if (runAt && !cronExpression) {
      return runAt;
    }

    agents-api/src/domains/run/services/triggerDispatcher.ts (lines 27-29)

    const dueTriggers = await findDueTriggerSchedules(runDbClient)({
      asOf: now.toISOString(),
    });
  • Why this is likely a bug: One-time schedules in the past should not generate new executions, but current production logic makes them dispatchable immediately.

  • Introduced by this PR: Yes – this PR modified the relevant code.

  • Timestamp: 15:43
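A guarded version of the one-time branch (a sketch; the real function signature is assumed from the excerpt above) would yield no next run for past runAt values instead of dispatching them immediately:

```typescript
// Sketch: past one-time schedules produce no next run.
function computeNextRunAt(
  runAt: string | null,
  cronExpression: string | null,
  now: Date = new Date()
): string | null {
  if (runAt && !cronExpression) {
    return new Date(runAt) > now ? runAt : null; // past one-time => not dispatchable
  }
  return null; // cron branch elided in this sketch
}
```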

DST/timezone boundary computes expected next run – Failed
  • Where: Scheduled trigger timezone input handling and next-run calculation path.

  • Steps to reproduce: Save cron triggers with different timezone values, including non-IANA values, then inspect list next-run data.

  • What failed: Timezone paths diverge to missing nextRunAt instead of deterministic scheduling behavior.

  • Code analysis: Timezone input is accepted as an arbitrary string (no IANA validation), and parse failures in list-time cron calculation are swallowed, leaving nextRunAt unset.

  • Relevant code:

    packages/agents-core/src/validation/schemas.ts (lines 973-977)

    z
      .string()
      .max(64)
      .default('UTC')
      .describe('IANA timezone for cron expression (e.g., America/New_York, Europe/London)'),

    agents-api/src/domains/manage/routes/scheduledTriggers.ts (lines 156-166)

    try {
      const interval = CronExpressionParser.parse(trigger.cronExpression, {
        currentDate: baseDate,
        tz: trigger.cronTimezone || 'UTC',
      });
      const nextDate = interval.next();
      runInfo.nextRunAt = nextDate.toISOString();
    } catch (error) {
      logger.warn(
        { triggerId: trigger.id, cronExpression: trigger.cronExpression, error },
        'Failed to calculate nextRunAt from cron expression'
      );
    }
  • Why this is likely a bug: Allowing invalid timezone values to persist causes production scheduling state to degrade into missing next-run values instead of controlled validation errors.

  • Introduced by this PR: No – pre-existing bug (code not changed in this PR).

  • Timestamp: 20:18
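One common write-time guard (a sketch using the Intl API; not necessarily the fix the project would choose) is to reject non-IANA timezone strings before they persist:

```typescript
// Sketch: Intl.DateTimeFormat throws a RangeError for unknown IANA zones,
// which makes it a zero-dependency validity check.
function isValidIanaTimezone(tz: string): boolean {
  try {
    new Intl.DateTimeFormat('en-US', { timeZone: tz });
    return true;
  } catch {
    return false;
  }
}
```

Plugged into the existing Zod schema as a `refine`, this would turn the current silent nextRunAt degradation into a controlled validation error at write time.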

Rapid clicks on run-now/cancel do not produce duplicate transitions – Failed
  • Where: Manual run endpoint for scheduled triggers (POST /{id}/run).

  • Steps to reproduce: Trigger multiple rapid run-now actions for the same scheduled trigger.

  • What failed: Multiple invocation rows are created in close succession instead of collapsing to a single transition.

  • Code analysis: Each request always generates a fresh invocation ID and a time-based idempotency key, so rapid repeated requests are treated as distinct runs.

  • Relevant code:

    agents-api/src/domains/manage/routes/scheduledTriggers.ts (lines 1352-1364)

    const invocationId = generateId();
    
    await createScheduledTriggerInvocation(runDbClient)({
      id: invocationId,
      tenantId,
      projectId,
      agentId,
      scheduledTriggerId,
      status: 'pending',
      scheduledFor: new Date().toISOString(),
      idempotencyKey: `manual-run-${scheduledTriggerId}-${Date.now()}`,
      attemptNumber: 1,
    });
  • Why this is likely a bug: The endpoint lacks deduplication/locking semantics for burst clicks, so production behavior violates expected idempotent transition handling.

  • Introduced by this PR: No – pre-existing bug (code not changed in this PR).

  • Timestamp: 15:43
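A deterministic, time-bucketed key (the window size here is an illustrative assumption) would let the existing idempotencyKey column collapse burst clicks into a single invocation, assuming uniqueness is enforced on that column:

```typescript
// Sketch: requests within the same window produce the same idempotency key,
// so duplicate inserts are rejected instead of creating extra invocation rows.
function manualRunIdempotencyKey(triggerId: string, nowMs: number, windowMs = 5000): string {
  const bucket = Math.floor(nowMs / windowMs);
  return `manual-run-${triggerId}-${bucket}`;
}
```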

Restart endpoint withstands burst requests without degraded auth behavior – Failed
  • Where: Deploy hook API endpoint /api/deploy/restart-scheduler under burst request load.

  • Steps to reproduce: Send concurrent restart requests with bearer tokens while bypass secret is unset.

  • What failed: All bearer requests return 401 and no authorized restart succeeds under burst traffic.

  • Code analysis: Auth logic is deterministic equality against an optional env var; with missing secret, bearer token requests all fail, so burst mode cannot produce valid authorized outcomes.

  • Relevant code:

    agents-api/src/routes/restartScheduler.ts (lines 27-31)

    const authHeader = c.req.header('Authorization');
    const token = authHeader?.replace('Bearer ', '');
    
    if (token !== env.INKEEP_AGENTS_RUN_API_BYPASS_SECRET) {
      return c.json({ error: 'Unauthorized' }, 401);
    }
  • Why this is likely a bug: The same auth defect that blocks valid token requests also causes full failure under burst traffic, so authorized behavior degrades entirely.

  • Introduced by this PR: Yes – this PR modified the relevant code.

📋 View Recording

Screen Recording

Contributor

@claude claude bot left a comment


PR Review Summary

(2) Total Issues | Risk: High

This is a delta review scoped to the 2 commits (11 files) since the last automated review. The delta addresses several prior issues but leaves the most critical one unaddressed.

🔴❗ Critical (1) ❗🔴

🔴 1) system Missing data migration for existing scheduled triggers — STILL UNADDRESSED

Issue: This was flagged in the prior review as Critical and remains unaddressed in this delta. The new trigger_schedules runtime table will be empty after deployment. Existing enabled triggers in the manage DB will not be backfilled.

Why: After deploying this migration, all existing enabled scheduled triggers will stop executing. The old per-trigger daisy-chaining workflows have been removed, and the new centralized scheduler will find zero rows in trigger_schedules. Production cron jobs and one-time triggers will silently fail to run until manually touched via the API or UI.

Fix: Add a one-time backfill step. Options:

  1. Add a migration script that queries all enabled triggers from manage DB and inserts corresponding rows into trigger_schedules
  2. Add backfill logic to startSchedulerWorkflow() in agents-api/src/index.ts that runs on server startup before the scheduler starts
  3. Use the data reconciliation framework to sync existing triggers on first scheduler tick
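Option 2 could be sketched roughly as follows. All function and field names here are assumptions for illustration; the real data-access layer will differ:

```typescript
// Hypothetical one-time backfill: copy enabled triggers from the manage DB
// into the runtime trigger_schedules table before the scheduler starts.
async function backfillTriggerSchedules(
  listEnabledTriggers: () => Promise<Array<{ id: string; nextRunAt: string | null }>>,
  upsertSchedule: (row: { triggerId: string; nextRunAt: string | null }) => Promise<void>
): Promise<number> {
  const triggers = await listEnabledTriggers();
  for (const t of triggers) {
    // Upserting keeps the backfill idempotent if startup runs more than once.
    await upsertSchedule({ triggerId: t.id, nextRunAt: t.nextRunAt });
  }
  return triggers.length;
}
```

Running this from startSchedulerWorkflow() on startup, before the first scheduler tick, would ensure existing cron and one-time triggers keep firing after deployment.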


Inline Comments:

  • 🟠 Major: scheduled-triggers.ts:18-20 scheduleChanged missing cronTimezone comparison
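The flagged comparison, extended to include cronTimezone, might look like the following. Field names are taken from this report; the surrounding type shape is an assumption:

```typescript
type ScheduleFields = {
  cronExpression: string | null;
  runAt: string | null;
  cronTimezone: string | null;
};

// Sketch: include cronTimezone so a timezone-only edit still reschedules the trigger.
function scheduleChanged(a: ScheduleFields, b: ScheduleFields): boolean {
  return (
    a.cronExpression !== b.cronExpression ||
    a.runAt !== b.runAt ||
    a.cronTimezone !== b.cronTimezone
  );
}
```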

🟡 Minor (0) 🟡

No new minor issues in delta.

💭 Consider (1) 💭

💭 1) .github/workflows/vercel-production.yml:167 Consider continue-on-error for non-blocking scheduler restart

Issue: The restart-scheduler job will fail the entire workflow if the scheduler restart endpoint returns an error after all retries. The scheduler has self-recovery mechanisms (supersession checks, re-registration on restart).

Why: If the CI call fails, the scheduler will recover when the API boots or on the next cron tick. However, if scheduler restart is considered critical to deploy success, the current strict behavior is appropriate.

Fix: Add continue-on-error: true at job level if scheduler restart failure should not fail the deploy workflow.


✅ Delta Fixes Verified

The following issues from the prior review have been properly addressed in this delta:

| Issue | Status |
| --- | --- |
| Security: Timing-attack vulnerable secret comparison | ✅ Fixed — constantTimeEqual() using crypto.timingSafeEqual() |
| Security: Auth bypass when secret unset | ✅ Fixed — Returns 503 when INKEEP_AGENTS_RUN_API_BYPASS_SECRET is unset |
| Security: No error handling in restart handler | ✅ Fixed — Try-catch with structured 500 response |
| CI: Missing timeout-minutes | ✅ Fixed — Added timeout-minutes: 2 |
| CI: curl --retry won't retry HTTP 5xx | ✅ Fixed — Added --retry-all-errors |
| Correctness: Data reconciliation check gutted | ✅ Fixed — Restored with cross-table comparison + tests |
| Correctness: No claim timeout mechanism | ✅ Fixed — Added 5-minute stale claim recovery |
| Correctness: lte vs eq in claim safety | ✅ Fixed — Simplified to atomic stale-claim-aware WHERE clause |

🕐 Pending Recommendations (1)


🚫 REQUEST CHANGES

Summary: The delta addresses most feedback from the prior review — security fixes are complete, CI improvements are in place, and the data reconciliation check has been properly restored with tests. However, the critical data migration gap (existing triggers won't be backfilled to trigger_schedules) remains unaddressed. This will cause all existing scheduled triggers to stop firing after deployment. Please add a backfill mechanism before merging.

Discarded (3)
| Location | Issue | Reason Discarded |
| --- | --- | --- |
| registry.test.ts:313 | Missing test for disabled orphaned schedules | Valid but low priority — edge case that affects reconciliation accuracy but not runtime execution |
| registry.test.ts:293 | Missing test edge case documentation | Valid but INFO level — documents intended behavior but doesn't catch bugs |
| vercel-production.yml | Partial index won't cover stale claims query | Acceptable tradeoff — stale claims are rare, unclaimed triggers (common case) use index |
Reviewers (5)
| Reviewer | Returned | Main Findings | Consider | While You're Here | Inline Comments | Pending Recs | Discarded |
| --- | --- | --- | --- | --- | --- | --- | --- |
| pr-review-standards | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| pr-review-security-iam | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| pr-review-sre | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| pr-review-tests | 3 | 0 | 0 | 0 | 0 | 0 | 2 |
| pr-review-devops | 4 | 0 | 1 | 0 | 0 | 0 | 1 |
| Total | 8 | 0 | 1 | 0 | 1 | 0 | 3 |

Note: Security and standards reviewers returned 0 issues because all prior security issues were verified as fixed.

@github-actions github-actions bot deleted a comment from claude bot Mar 13, 2026
@itoqa

itoqa bot commented Mar 13, 2026

Ito Test Report ❌

13 test cases ran. 11 passed, 2 failed.

✅ The scheduled-trigger regression pass found stable behavior for deep-link loading, validation, mobile usability, invocation creation, and auth hardening checks. 🔍 Code-first verification of the two failures indicates likely product defects in next-run handling for past one-time schedules and in rapid-toggle state consistency under concurrent updates.

✅ Passed (11)
| Test Case | Summary | Timestamp | Screenshot |
| --- | --- | --- | --- |
| ROUTE-4 | Endpoint returned 503 with error Endpoint not available as expected for unavailable run bypass secret profile. | 0:00 | ROUTE-4_0-00.png |
| ROUTE-5 | GET /api/workflow/process returned 200 with valid JSON {processed:true} and completed in 50 seconds, within expected long-running window. | 1:30 | ROUTE-5_1-30.png |
| LOGIC-4 | Deleted trigger row was removed from Scheduled tab and invocations view showed no invocation entries. | 9:58 | LOGIC-4_9-58.png |
| LOGIC-5 | Run Now created an invocation record and it progressed to terminal Failed state with trace links. | 9:58 | LOGIC-5_9-58.png |
| LOGIC-6 | Deep-link edit route loaded correctly and persisted populated form data across refresh without unauthorized or missing-data errors. | 12:33 | LOGIC-6_12-33.png |
| EDGE-3 | Submitting invalid custom cron produced validation error and prevented update persistence. | 9:58 | EDGE-3_9-58.png |
| EDGE-4 | At 390x844 viewport, trigger controls remained accessible and an edit/update roundtrip completed successfully. | 15:26 | EDGE-4_15-26.png |
| EDGE-5 | Three back/forward cycles preserved route context and page state without stale data or overlay corruption. | 15:48 | EDGE-5_15-48.png |
| ADV-3 | Direct unauthenticated manage endpoint request returned 401 Unauthorized and did not expose trigger data. | 19:07 | ADV-3_19-07.png |
| ADV-4 | Authenticated tampered-scope request returned 403 and UI tampering rendered 404 project-not-found state with no foreign data exposure. | 19:04 | ADV-4_19-04.png |
| ADV-5 | Script-tag and onerror payload strings were persisted as plain text in list/edit views, and no script execution dialog was triggered. | 19:00 | ADV-5_19-00.png |
❌ Failed (2)
| Test Case | Summary | Timestamp | Screenshot |
| --- | --- | --- | --- |
| EDGE-1 | Created a one-time trigger in the past, but the list still displayed a concrete Next Run timestamp instead of an em dash. | 14:11 | EDGE-1_14-11.png |
| EDGE-2 | After rapid toggling, the intended final disabled state did not persist after refresh and reverted to enabled. | 14:43 | EDGE-2_14-43.png |
One-time trigger scheduled in the past shows no upcoming next run – Failed
  • Where: Scheduled triggers list Next Run column for one-time trigger rows.

  • Steps to reproduce: Create a one-time trigger with runAt in the past, open/refresh the scheduled triggers list, and inspect Next Run.

  • What failed: UI showed a concrete upcoming value instead of no upcoming run (an em dash) for a past one-time schedule.

  • Code analysis: I traced next-run derivation through runtime run-info aggregation and schedule computation. The run-info batch function promotes any pending invocation to nextRunAt without checking whether it is already in the past, and one-time schedule computation returns runAt directly with no past-time guard.

  • Relevant code:

    packages/agents-core/src/data-access/runtime/scheduledTriggerInvocations.ts (lines 479–486)

    for (const inv of allInvocations) {
      const triggerInfo = result.get(inv.scheduledTriggerId);
      if (!triggerInfo) continue;
      if (inv.status === 'pending' && !triggerInfo.nextRunAt) {
        triggerInfo.nextRunAt = inv.scheduledFor;
      }
      if ((inv.status === 'completed' || inv.status === 'failed') && !triggerInfo.lastRunAt) {

    agents-api/src/domains/run/services/computeNextRunAt.ts (lines 11–21)

    if (runAt && !cronExpression) {
      return runAt;
    }
    
    if (cronExpression) {
      const baseDate = lastScheduledFor ? new Date(lastScheduledFor) : new Date();
      const interval = CronExpressionParser.parse(cronExpression, {
        currentDate: baseDate,
        tz: cronTimezone || 'UTC',
      });
  • Why this is likely a bug: The code path can surface stale/past schedule timestamps as nextRunAt, which conflicts with expected one-time past-trigger semantics (no upcoming run).

  • Introduced by this PR: No – pre-existing bug (code not changed in this PR).

  • Timestamp: 14:11
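A minimal guard in the promotion loop (a sketch; it assumes the excerpt's field semantics) would skip past-dated pending invocations rather than surfacing them as nextRunAt:

```typescript
// Sketch: a pending invocation only counts as the next run while its
// scheduledFor timestamp is still in the future.
function nextRunFromPending(scheduledFor: string, now: Date = new Date()): string | null {
  return new Date(scheduledFor) > now ? scheduledFor : null;
}
```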

Rapid enable/disable interaction keeps final trigger state consistent – Failed
  • Where: Scheduled trigger enable switch on the triggers table.

  • Steps to reproduce: Rapidly click the enabled switch multiple times, stop on disabled as intended, refresh, then verify persisted status.

  • What failed: Final persisted state can differ from the user’s final intent after rapid interactions.

  • Code analysis: The UI issues independent toggle requests without serializing/canceling in-flight updates, while the API applies each PATCH as a blind write with no version/ordering guard. This allows out-of-order request completion to overwrite the intended final value.

  • Relevant code:

    agents-manage-ui/src/components/project-triggers/project-scheduled-triggers-table.tsx (lines 95–106)

    const toggleEnabled = async (triggerId: string, agentId: string, currentEnabled: boolean) => {
      const newEnabled = !currentEnabled;
      setLoadingTriggers((prev) => new Set(prev).add(triggerId));
    
      try {
        const result = await updateScheduledTriggerEnabledAction(
          tenantId,
          projectId,
          agentId,

    agents-manage-ui/src/components/project-triggers/project-scheduled-triggers-table.tsx (lines 317–322)

    <Switch
      checked={trigger.enabled}
      onCheckedChange={() =>
        toggleEnabled(trigger.id, trigger.agentId, trigger.enabled)
      }
      disabled={isLoading || !canManage}
    />

    agents-api/src/domains/manage/routes/scheduledTriggers.ts (lines 568–582)

    const updatedTrigger = await updateScheduledTrigger(db)({
      scopes: { tenantId, projectId, agentId },
      scheduledTriggerId: id,
      data: {
        ...body,
        maxRetries: resolveRetryValue(body.maxRetries, existing.maxRetries, 3),
        retryDelaySeconds: resolveRetryValue(
          body.retryDelaySeconds,
          existing.retryDelaySeconds,
          60
        ),
  • Why this is likely a bug: Without request-order protection, rapid toggles can commit stale writes and violate final-state consistency expected by users.

  • Introduced by this PR: No – pre-existing bug (code not changed in this PR).

  • Timestamp: 14:43
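On the client side, one standard mitigation (a sketch, not the project's actual components) is to tag each toggle request with a sequence number and ignore responses that are no longer the latest, so an out-of-order completion cannot overwrite the user's final choice:

```typescript
// Sketch: a per-trigger guard that tracks the most recent request; stale
// responses fail the isCurrent check and should be discarded by the caller.
function makeToggleGuard() {
  let latest = 0;
  return {
    next(): number {
      return ++latest;
    },
    isCurrent(seq: number): boolean {
      return seq === latest;
    },
  };
}
```

A server-side complement would be an optimistic-concurrency guard (e.g., a version column checked in the UPDATE's WHERE clause) so blind PATCH writes cannot commit stale state.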

📋 View Recording

Screen Recording

Contributor

@claude claude bot left a comment


PR Review Summary

(0) Total Issues | Risk: Low (delta scope)

This is a delta review scoped to changes since the last automated review (7af85b5d481f).

Delta Changes

The only change in this delta is the addition of a new specification document:

  • specs/2026-03-13-manage-table-cron-dispatcher/SPEC.md (+256 lines)

This document evaluates an alternative architectural approach (using the manage DoltgreSQL table vs runtime Postgres table for scheduling) and recommends against it. The analysis is thorough and well-reasoned, covering:

  • Branch iteration cost scaling (O(n) vs O(1))
  • DoltgreSQL versioning overhead for high-frequency transactional writes
  • Connection pool pressure
  • Broken claim/release locking semantics in Dolt's commit model

The spec correctly concludes that the runtime-table approach (already implemented in this PR) is the better design choice.

✅ Delta Review: Clean

No new issues in the delta. The spec document is well-written architectural documentation that supports the implementation decisions already made.


🕐 Pending Recommendations (1)


💡 APPROVE WITH SUGGESTIONS

Summary: The delta (spec document) is clean and adds valuable architectural context. However, the critical data migration gap from prior reviews remains unaddressed — existing scheduled triggers will not be synced to the new trigger_schedules table on deployment, causing all existing cron/one-time triggers to stop firing. Please add a backfill mechanism before merging.

Reviewers (0)

No code reviewers dispatched — delta contains only documentation changes.
