
fix: harden DBProvider against runaway retries and swallowed failures #26

Merged
elpete merged 34 commits into main from fix/db-provider-max-attempts-bypass
Apr 20, 2026

elpete commented Apr 18, 2026

Summary

  • Runaway retry guard in the pickup loop: processLockedRecord now checks attempts >= maxAttempts before incrementing or dispatching. Jobs stuck in the runaway state (reserved but never executed, attempts already at the cap) are immediately marked failed instead of being dispatched again.
  • Hardened marshalJob exception handler: if releaseJob throws inside .onException, the handler now falls back to markJobFailed (terminal) instead of letting the future swallow the secondary exception and leave the row reserved. If afterJobFailed itself throws, forceFailJob is called as a last resort to guarantee the row exits the retry loop.
  • forceFailJob on DBProvider: a minimal write that sets failedDate and clears reservedBy/reservedDate unconditionally, bypassing all bookkeeping. Used only when the normal failure paths are broken.
  • Integration test suite: new DBProviderMaxAttemptsSpec covers forceFailJob behaviour, the runaway-job pre-flight guard, normal-path pass-through, the releaseJob-throws fallback, and the afterJobFailed-throws escalation.
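
To make the ordering in the first bullet concrete, here is a minimal Python sketch of a pre-flight guard. It is a language-neutral model, not the module's CFML: `JobRow`, `process_locked_record`, and the `dispatch` callback are hypothetical names, and the real `processLockedRecord` works against the database rather than an in-memory object.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class JobRow:
    """Hypothetical stand-in for a reserved row in the jobs table."""
    id: int
    attempts: int
    max_attempts: int
    failed: bool = False
    reserved_by: Optional[str] = "worker-1"


def process_locked_record(row: JobRow, dispatch: Callable[[JobRow], None]) -> bool:
    """Return True if the job was dispatched, False if it was failed at pickup."""
    # Pre-flight guard: compare against the cap BEFORE incrementing or dispatching.
    if row.attempts >= row.max_attempts:
        row.failed = True       # terminal state
        row.reserved_by = None  # drop the reservation so the row is not picked up again
        return False
    row.attempts += 1
    dispatch(row)
    return True


dispatched = []
runaway = JobRow(id=1, attempts=3, max_attempts=3)  # already at the cap
fresh = JobRow(id=2, attempts=1, max_attempts=3)    # still has attempts left
runaway_result = process_locked_record(runaway, dispatched.append)
fresh_result = process_locked_record(fresh, dispatched.append)
```

In this model the runaway row is failed without its attempts counter moving, while the fresh row is incremented and dispatched as usual.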

elpete and others added 30 commits February 12, 2026 10:26
Lucee can return the `stackTrace` property on an exception as a Java array
of StackTraceElement objects rather than a simple string. Guard against
this by checking isSimpleValue() before comparing to "" and by serializing
complex values to JSON before inserting.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
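
The stackTrace guard described in the commit above can be modelled roughly as follows. This is a Python sketch under assumptions: `to_insertable` is a hypothetical helper, `isinstance(value, str)` stands in for CFML's isSimpleValue(), and a list of dicts stands in for the Java StackTraceElement array.

```python
import json


def to_insertable(value):
    """Return a string that is safe to bind as a text column."""
    if isinstance(value, str):  # stand-in for isSimpleValue() in this sketch
        return value
    # Complex values (arrays, structs) are serialized to JSON before inserting.
    return json.dumps(value, default=str)


plain = to_insertable("at Foo.bar(Foo.cfc:12)")
frames = to_insertable([{"className": "Foo", "lineNumber": 12}])
```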
The timeout watcher compared reservedDate against pool.getTimeout(), which always
applied the pool's fixed 60s window. Jobs with longer per-job timeouts (e.g. 300s)
were re-grabbed, and had their attempts incremented, while still running.
availableDate is already set to now + jobTimeout at reservation time, so
comparing it against now correctly respects each job's actual timeout.

Fixes both fetchPotentiallyOpenRecords and tryToLockRecords.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
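
The difference between the two timeout checks in the reservedDate/availableDate commit above can be illustrated with a small Python model. The function names and the dict-based row are hypothetical; only the 60s pool window, the 300s job timeout, and the rule that availableDate is set to now + jobTimeout come from the commit message.

```python
from datetime import datetime, timedelta

POOL_TIMEOUT = timedelta(seconds=60)  # the pool's fixed window from the commit message


def reserve(now, job_timeout):
    """Model a reservation: availableDate is set to now + jobTimeout."""
    return {"reserved_date": now, "available_date": now + job_timeout}


def expired_by_pool_window(row, now):
    # Old check: every job is judged against the pool's fixed window.
    return row["reserved_date"] + POOL_TIMEOUT <= now


def expired_by_available_date(row, now):
    # New check: the row's own deadline decides whether it can be re-grabbed.
    return row["available_date"] <= now


start = datetime(2026, 1, 1, 12, 0, 0)
row = reserve(start, job_timeout=timedelta(seconds=300))
at_120s = start + timedelta(seconds=120)
at_301s = start + timedelta(seconds=301)

old_at_120s = expired_by_pool_window(row, at_120s)     # re-grabbed while still running
new_at_120s = expired_by_available_date(row, at_120s)  # left alone
new_at_301s = expired_by_available_date(row, at_301s)  # genuinely timed out
```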
…timeout

Three specs covering the core bug fix in DBProvider:
- A reserved job within its own timeout is not re-grabbed
- A reserved job whose timeout has expired is re-grabbed
- A job past the pool timeout but within the job timeout is not re-grabbed

Uses TestBox makePublic() to expose fetchPotentiallyOpenRecords for
direct testing.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ockRecords guard

ColdBoxAsyncProvider: The thenCompose closure referenced
arguments.currentAttempt which does not exist in the closure scope,
so setCurrentAttempt() never executed for retried jobs. Changed to
check the captured attempts variable instead.

DBProvider.tryToLockRecords: Added whereNotNull(reservedDate) guard
to the availableDate OR branch, consistent with the same fix already
applied to fetchPotentiallyOpenRecords. Ensures only genuinely
timed-out reserved jobs are re-locked.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
If dispatchThenJobIfNeeded or dispatchCatchJobIfNeeded threw an
exception (e.g. missing WireBox mapping, connection error), the
finally job would never be dispatched. Wrap both in try/catch so
dispatchFinallyJobIfNeeded always runs, matching the semantic
contract of "finally".

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
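
The "finally always runs" contract from the commit above can be sketched as follows. This is a hypothetical Python model: `finish_batch` and its callback parameters are illustrative names, not the Batch.cfc API, and each callback stands in for the corresponding dispatch...IfNeeded method.

```python
def finish_batch(dispatch_then, dispatch_catch, dispatch_finally, log):
    """Run the then/catch dispatchers defensively, then always dispatch finally."""
    for name, step in (("then", dispatch_then), ("catch", dispatch_catch)):
        try:
            step()
        except Exception as exc:
            # A broken lifecycle dispatch is logged, not propagated.
            log(f"{name} dispatch failed: {exc}")
    dispatch_finally()


def broken_then():
    raise RuntimeError("missing mapping")


events = []
finish_batch(
    broken_then,
    lambda: events.append("catch"),
    lambda: events.append("finally"),
    events.append,
)
```

Even though the then dispatcher throws, the catch and finally dispatchers still run in order.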
The exception was passed to the job onFailure handler under the
misspelled key "excpetion" instead of "exception", so any job
defining onFailure( exception ) would receive an undefined argument.
Aligns SyncProvider with the AbstractQueueProvider behaviour.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When the marshalJob future's exception handler fails to execute
properly, jobs remain reserved with reservedDate set. Once the
job timeout passes, fetchPotentiallyOpenRecords picks the row up
again with no maxAttempts check, causing unbounded retries
(observed: 29 attempts on a job configured for 3).

Add a defensive guard at pickup time: before incrementing attempts
and dispatching, compare the DB attempts column against
maxAttempts. If attempts has already reached maxAttempts, mark the
job as failed and skip it.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The marshalJob future was fire-and-forget. Any exception inside
.onException — releaseJob serialization, batch recording, an
interceptor, the job's onFailure hook — was silently absorbed by
the unobserved future, leaving the row reserved and triggering
unbounded timeout-based re-pickups.

Wrap each side-effect in its own try/catch so one failure cannot
prevent the rest from running, and add an outer catch that calls
a new forceFailJob hook as a last-resort terminal kill.

forceFailJob is a no-op on AbstractQueueProvider and overridden
on DBProvider with a minimal UPDATE that sets failedDate and
clears the reservation. It deliberately skips bookkeeping
(onFailure hook, interceptors, batch recording) — it only runs
when the proper failure path itself throws, and the alternative
is the row being retried until manually cleaned up.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
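
The layered handling described in the marshalJob commit above can be modelled in a few lines of Python. This is a simplified sketch, not the AbstractQueueProvider code: `handle_job_exception`, the dict-based row, and the callback parameters are assumptions for illustration; only the shape (per-side-effect try/catch, an outer catch, and a forceFailJob that sets failedDate and clears the reservation) comes from the commit message.

```python
from datetime import datetime, timezone


def force_fail_job(row):
    """Minimal terminal write: set failedDate, clear the reservation, nothing else."""
    row["failed_date"] = datetime.now(timezone.utc)
    row["reserved_by"] = None
    row["reserved_date"] = None


def handle_job_exception(row, side_effects, mark_job_failed, log):
    try:
        for name, effect in side_effects:
            try:
                effect()  # e.g. onFailure hook, interceptor, batch recording
            except Exception as exc:
                # One failing side-effect must not stop the others.
                log(f"{name} failed: {exc}")
        mark_job_failed(row)  # normal terminal path
    except Exception:
        force_fail_job(row)   # last resort when the failure path itself throws


def boom(message):
    raise RuntimeError(message)


log_lines = []
ran = []
row = {"failed_date": None, "reserved_by": "worker-1", "reserved_date": "2026-01-01"}
handle_job_exception(
    row,
    side_effects=[
        ("onFailure", lambda: boom("hook broke")),
        ("batch", lambda: ran.append("batch")),
    ],
    mark_job_failed=lambda r: boom("bookkeeping broke"),
    log=log_lines.append,
)
```

Here the first side-effect and the normal failure path both throw, yet the second side-effect still runs and the row ends up failed and unreserved.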
…empts integration tests

The inline loop body is pulled into a private processLockedRecord method so integration
tests can make it public and exercise the maxAttempts pre-flight guard in isolation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
If log.error threw in the async thread (e.g. when serialising a MockBox
exception struct), markJobFailed was silently skipped, leaving the row
reserved and the timeout watcher free to re-pick it up indefinitely.
Wrapping the log call in its own try/catch ensures the fallback path to
markJobFailed is always reached regardless of logging failures.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
WireBox provider methods (newQuery) are not reliably invocable through
MockBox pass-through in async threads, so the DB write in markJobAsFailedById
silently failed in CI regardless of the log-guarding fix. Switching to a
mock-call assertion — the same technique used by the forceFailJob-fallback
test — verifies the correct code path without depending on the broken DB
write path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
MockBox cannot reliably stub private methods reached through the async
closure's captured variables scope, and its pass-through for WireBox
provider methods (newQuery) is broken in async threads. Replace the
prepareMock approach with a real FailingReleaseDBProvider subclass that
overrides releaseJob to throw — no mocking needed, so newQuery stays a
proper WireBox provider method and the DB write goes through correctly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
elpete and others added 2 commits April 20, 2026 02:03
Adds tests/resources/app/models/**/*.cfc to the format, format:check,
and format:watch scripts so fixture files like FailingReleaseDBProvider
are checked and formatted consistently with the rest of the codebase.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Copilot AI left a comment

Pull request overview

This PR hardens queue execution around DB-backed jobs to prevent runaway retries and ensure terminal failure state is recorded even when secondary failure-handling paths throw, and adds integration coverage for these failure modes.

Changes:

  • Adds max-attempts preflight guarding and a last-resort forceFailJob() path to guarantee DB rows exit retry loops.
  • Fixes DBProvider timeout “re-pickup” logic to respect job-specific timeouts (via availableDate) and adds integration specs.
  • Fixes SyncProvider to pass the exception into onFailure under the correctly spelled argument key.

Reviewed changes

Copilot reviewed 22 out of 26 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| tests/specs/integration/Providers/SyncProviderOnFailureSpec.cfc | New integration spec validating onFailure receives exception (not misspelled key). |
| tests/specs/integration/Providers/DBProviderTimeoutWatcherSpec.cfc | New integration specs for DB timeout watcher behavior with job-specific timeouts. |
| tests/specs/integration/Providers/DBProviderMaxAttemptsSpec.cfc | New integration specs for runaway retry guard + failure-path hardening (including forceFailJob). |
| tests/specs/integration/DBBatchRepositoryCountsSpec.cfc | New integration specs validating batch pending/failed/successful job counting. |
| tests/specs/integration/BatchFinallySpec.cfc | New integration specs ensuring batch finally job dispatches on success and failure. |
| tests/resources/app/models/Providers/FailingReleaseDBProvider.cfc | Test fixture provider that forces releaseJob() to throw. |
| tests/resources/app/models/Jobs/SendWelcomeEmailJob.cfc | Formatting-only adjustments for a test job fixture. |
| tests/resources/app/models/Jobs/RequestScopeBeforeAndAfterJob.cfc | New test job fixture to validate request-scope before/after execution. |
| tests/resources/app/models/Jobs/ReleaseTestJob.cfc | Formatting-only adjustments for a test job fixture. |
| tests/resources/app/models/Jobs/OnFailureCapturingJob.cfc | New test job fixture that captures onFailure argument behavior. |
| tests/resources/app/models/Jobs/BeforeAndAfterJob.cfc | Formatting-only adjustments for a test job fixture. |
| tests/resources/app/models/Jobs/AlwaysErrorJob.cfc | New test job fixture that always throws. |
| tests/Application.cfc | Adds javaSettings.loadPaths for test JVM class loading from /lib. |
| resources/database/migrations/2000_01_01_000008_add_successfulJobs_to_cbq_batches_table.cfc | Adds successfulJobs column to batches table. |
| resources/database/migrations/2000_01_01_000009_make_cbq_batches_name_nullable.cfc | Makes cbq_batches.name nullable. |
| models/Providers/SyncProvider.cfc | Fixes onFailure argument key to exception. |
| models/Providers/DBProvider.cfc | Adds processLockedRecord() runaway guard; adds forceFailJob(); changes timeout watcher logic to use availableDate. |
| models/Providers/ColdBoxAsyncProvider.cfc | Adjusts async chaining for delayed dispatch and attempt propagation. |
| models/Providers/AbstractQueueProvider.cfc | Hardens .onException handling against side-effect failures; introduces forceFailJob() hook and centralized markJobFailed(). |
| models/Jobs/DBBatchRepository.cfc | Adds successfulJobs initialization; fixes pending/failed bookkeeping updates and batch materialization. |
| models/Jobs/Batch.cfc | Adds logging and guards around dispatch of then/catch lifecycle jobs; tracks successfulJobs. |
| interceptors/LogFailedJobsInterceptor.cfc | Makes failed-job logging inserts more defensive for nullable/structured exception fields. |
| box.json | Bumps module version to 6.0.0-beta.3 and expands formatter script targets. |
| .github/workflows/release.yml | Moves CI MySQL to 8.0 and configures mysql_native_password for the test user. |
| .github/workflows/pr.yml | Moves CI MySQL to 8.0 and configures mysql_native_password for the test user. |
| .github/workflows/cron.yml | Moves CI MySQL to 8.0 and configures mysql_native_password for the test user. |

Comment thread models/Providers/AbstractQueueProvider.cfc
Comment thread interceptors/LogFailedJobsInterceptor.cfc
Comment thread tests/specs/integration/Providers/SyncProviderOnFailureSpec.cfc
Comment thread tests/resources/app/models/Jobs/OnFailureCapturingJob.cfc Outdated
elpete and others added 2 commits April 19, 2026 21:41
- Wrap afterJobHook in try/catch so a throwing hook logs via
  logSideEffectFailure and does not flip a successful job to failed
- Rename onFailureExceptionIsExpcetion → onFailureExceptionHasExcpetionKey
  in OnFailureCapturingJob and SyncProviderOnFailureSpec to separate the
  intentional 'excpetion' typo-under-test from the variable naming

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…rceptor

memento, properties, exceptionExtendedInfo, exceptionStackTrace, and
exception all map to LONGTEXT in the cbq_failed_jobs schema. Binding them
as CF_SQL_VARCHAR risks truncation or driver errors on large payloads
(e.g. deep stack traces or complex job mementos).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@elpete elpete merged commit 45d1a05 into main Apr 20, 2026
13 checks passed
elpete added a commit that referenced this pull request Apr 20, 2026