fix: harden DBProvider against runaway retries and swallowed failures #26
Merged
Lucee can return the `stackTrace` property on an exception as a Java array of StackTraceElement objects rather than a simple string. Guard against this by checking isSimpleValue() before comparing to "" and by serializing complex values to JSON before inserting. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
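The guard described above can be illustrated with a small, language-neutral sketch in Python. This is not the interceptor's CFML: `normalize_stack_trace` is a hypothetical helper, with `isinstance(value, str)` standing in for `isSimpleValue()` and `json.dumps` standing in for JSON serialization.

```python
import json


def normalize_stack_trace(value):
    """Return a string that is safe to insert, whatever shape the trace has."""
    if value is None:
        return ""
    if isinstance(value, str):
        # Simple value: use it unchanged.
        return value
    # Complex value (e.g. a list of frames): serialize it to JSON first.
    return json.dumps(value, default=str)


simple = normalize_stack_trace("at Foo.bar(Foo.cfc:12)")
complex_trace = normalize_stack_trace(["Foo.bar", "Baz.qux"])
```

Only the non-string branch pays the serialization cost, so plain string traces are stored exactly as received.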
reservedDate compared against pool.getTimeout() always used the pool's fixed 60s window, causing jobs with longer per-job timeouts (e.g. 300s) to be re-grabbed and have attempts incremented while still running. availableDate is already set to now + jobTimeout at reservation time, so comparing it against now correctly respects each job's actual timeout. Fixes both fetchPotentiallyOpenRecords and tryToLockRecords. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
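The difference between the two comparisons can be sketched in Python. This is an illustration only, not the provider's query code: the function names are hypothetical, and the 60s and 300s values are the ones quoted in the commit message.

```python
from datetime import datetime, timedelta

POOL_TIMEOUT = timedelta(seconds=60)  # the pool's fixed window


def is_stale_old(reserved_date, now):
    """Pre-fix check: reservation age measured against the pool timeout."""
    return reserved_date + POOL_TIMEOUT <= now


def is_stale_new(reserved_date, available_date, now):
    """Post-fix check: a reserved row is stale only once its own deadline passes."""
    return reserved_date is not None and available_date <= now


reserved_at = datetime(2026, 1, 1, 12, 0, 0)
job_timeout = timedelta(seconds=300)
available_at = reserved_at + job_timeout   # set at reservation time
now = reserved_at + timedelta(seconds=90)  # past the pool timeout, within the job timeout

old_result = is_stale_old(reserved_at, now)
new_result = is_stale_new(reserved_at, available_at, now)
```

At 90 seconds the old check treats the still-running job as abandoned, while the new check leaves it alone until the 300-second deadline has passed.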
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…timeout

Three specs covering the core bug fix in DBProvider:
- A reserved job within its own timeout is not re-grabbed
- A reserved job whose timeout has expired is re-grabbed
- A job past the pool timeout but within the job timeout is not re-grabbed

Uses TestBox makePublic() to expose fetchPotentiallyOpenRecords for direct testing.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ockRecords guard

ColdBoxAsyncProvider: The thenCompose closure referenced arguments.currentAttempt, which does not exist in the closure scope, so setCurrentAttempt() never executed for retried jobs. Changed to check the captured attempts variable instead.

DBProvider.tryToLockRecords: Added whereNotNull(reservedDate) guard to the availableDate OR branch, consistent with the same fix already applied to fetchPotentiallyOpenRecords. Ensures only genuinely timed-out reserved jobs are re-locked.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
If dispatchThenJobIfNeeded or dispatchCatchJobIfNeeded threw an exception (e.g. missing WireBox mapping, connection error), the finally job would never be dispatched. Wrap both in try/catch so dispatchFinallyJobIfNeeded always runs, matching the semantic contract of "finally". Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
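A minimal Python sketch of the pattern, assuming hypothetical stand-ins for the three dispatch methods (this is not the `Batch.cfc` code): each lifecycle dispatch is guarded on its own, so the finally dispatch is reached even when an earlier one throws.

```python
calls = []
errors = []


def dispatch_then_job():
    # Simulates a failure such as a missing mapping.
    raise RuntimeError("missing mapping")


def dispatch_catch_job():
    calls.append("catch")


def dispatch_finally_job():
    calls.append("finally")


def guarded(step):
    """Run one lifecycle dispatch; record the error instead of propagating it."""
    try:
        step()
    except Exception as exc:
        errors.append(str(exc))


def complete_batch():
    guarded(dispatch_then_job)
    guarded(dispatch_catch_job)
    dispatch_finally_job()  # always reached


complete_batch()
```

Recording the error instead of re-raising it keeps the failure visible without letting it block the finally step.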
The exception was passed to the job onFailure handler under the misspelled key "excpetion" instead of "exception", so any job defining onFailure( exception ) would receive an undefined argument. Aligns SyncProvider with the AbstractQueueProvider behaviour. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When the marshalJob future's exception handler fails to execute properly, jobs remain reserved with reservedDate set. Once the job timeout passes, fetchPotentiallyOpenRecords picks the row up again with no maxAttempts check, causing unbounded retries (observed: 29 attempts on a job configured for 3). Add a defensive guard at pickup time: before incrementing attempts and dispatching, compare the DB attempts column against maxAttempts. If exceeded, mark the job as failed and skip. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
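The pickup-time guard can be sketched as follows. This is a Python illustration under stated assumptions, not the DBProvider implementation: records are plain dicts, `dispatch` and `mark_failed` are hypothetical callbacks, and the `>=` comparison follows the PR summary.

```python
def process_locked_record(record, max_attempts, dispatch, mark_failed):
    """Fail a record that has used up its attempts; otherwise count and dispatch."""
    if record["attempts"] >= max_attempts:
        # Runaway row: terminate it instead of retrying again.
        mark_failed(record)
        return False
    record["attempts"] += 1
    dispatch(record)
    return True


dispatched, failed = [], []
runaway = {"id": 7, "attempts": 3}
fresh = {"id": 8, "attempts": 0}

process_locked_record(runaway, 3, dispatched.append, failed.append)
process_locked_record(fresh, 3, dispatched.append, failed.append)
```

Because the check runs before the increment, a row already at the cap is never dispatched again, whatever state the earlier failure handling left it in.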
The marshalJob future was fire-and-forget. Any exception inside .onException — releaseJob serialization, batch recording, an interceptor, the job's onFailure hook — was silently absorbed by the unobserved future, leaving the row reserved and triggering unbounded timeout-based re-pickups. Wrap each side-effect in its own try/catch so one failure cannot prevent the rest from running, and add an outer catch that calls a new forceFailJob hook as a last-resort terminal kill. forceFailJob is a no-op on AbstractQueueProvider and overridden on DBProvider with a minimal UPDATE that sets failedDate and clears the reservation. It deliberately skips bookkeeping (onFailure hook, interceptors, batch recording) — it only runs when the proper failure path itself throws, and the alternative is the row being retried until manually cleaned up. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
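The layered handling can be sketched in Python. All names here are hypothetical stand-ins, not the AbstractQueueProvider code: each side effect runs in its own try/except, and an outer handler falls back to a minimal force-fail when the normal terminal path throws.

```python
def handle_job_failure(job, side_effects, mark_failed, force_fail, log):
    """Run every failure side effect in isolation, then record the terminal state."""
    try:
        for name, effect in side_effects:
            try:
                effect(job)
            except Exception as exc:
                # One failing side effect must not stop the others.
                log.append(f"{name} failed: {exc}")
        mark_failed(job)
    except Exception as exc:
        # Last resort: the proper failure path itself threw.
        log.append(f"failure path failed: {exc}")
        force_fail(job)


def broken_release(job):
    raise RuntimeError("cannot serialize job")


def record_batch(job):
    job["batch_recorded"] = True


def broken_mark_failed(job):
    raise RuntimeError("bookkeeping error")


def force_fail(job):
    # Minimal write: mark failed and clear the reservation, nothing else.
    job["failed"] = True
    job["reserved"] = False


job = {"id": 1, "reserved": True, "failed": False}
log = []
handle_job_failure(
    job,
    [("releaseJob", broken_release), ("batch", record_batch)],
    broken_mark_failed,
    force_fail,
    log,
)
```

In this run the release step fails, batch recording still happens, and the job ends up failed and unreserved even though the normal terminal path also threw.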
…empts integration tests

The inline loop body is pulled into a private processLockedRecord method so integration tests can make it public and exercise the maxAttempts pre-flight guard in isolation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
If log.error threw in the async thread (e.g. when serialising a MockBox exception struct), markJobFailed was silently skipped, leaving the row reserved and the timeout watcher free to re-pick it up indefinitely. Wrapping the log call in its own try/catch ensures the fallback path to markJobFailed is always reached regardless of logging failures. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
WireBox provider methods (newQuery) are not reliably invocable through MockBox pass-through in async threads, so the DB write in markJobAsFailedById silently failed in CI regardless of the log-guarding fix. Switching to a mock-call assertion — the same technique used by the forceFailJob-fallback test — verifies the correct code path without depending on the broken DB write path. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
MockBox cannot reliably stub private methods reached through the async closure's captured variables scope, and its pass-through for WireBox provider methods (newQuery) is broken in async threads. Replace the prepareMock approach with a real FailingReleaseDBProvider subclass that overrides releaseJob to throw — no mocking needed, so newQuery stays a proper WireBox provider method and the DB write goes through correctly. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds tests/resources/app/models/**/*.cfc to the format, format:check, and format:watch scripts so fixture files like FailingReleaseDBProvider are checked and formatted consistently with the rest of the codebase. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Pull request overview
This PR hardens queue execution around DB-backed jobs to prevent runaway retries and ensure terminal failure state is recorded even when secondary failure-handling paths throw, and adds integration coverage for these failure modes.
Changes:
- Adds max-attempts preflight guarding and a last-resort `forceFailJob()` path to guarantee DB rows exit retry loops.
- Fixes DBProvider timeout “re-pickup” logic to respect job-specific timeouts (via `availableDate`) and adds integration specs.
- Fixes `SyncProvider` to pass the exception into `onFailure` under the correctly spelled argument key.
Reviewed changes
Copilot reviewed 22 out of 26 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| tests/specs/integration/Providers/SyncProviderOnFailureSpec.cfc | New integration spec validating onFailure receives exception (not misspelled key). |
| tests/specs/integration/Providers/DBProviderTimeoutWatcherSpec.cfc | New integration specs for DB timeout watcher behavior with job-specific timeouts. |
| tests/specs/integration/Providers/DBProviderMaxAttemptsSpec.cfc | New integration specs for runaway retry guard + failure-path hardening (including forceFailJob). |
| tests/specs/integration/DBBatchRepositoryCountsSpec.cfc | New integration specs validating batch pending/failed/successful job counting. |
| tests/specs/integration/BatchFinallySpec.cfc | New integration specs ensuring batch finally job dispatches on success and failure. |
| tests/resources/app/models/Providers/FailingReleaseDBProvider.cfc | Test fixture provider that forces releaseJob() to throw. |
| tests/resources/app/models/Jobs/SendWelcomeEmailJob.cfc | Formatting-only adjustments for a test job fixture. |
| tests/resources/app/models/Jobs/RequestScopeBeforeAndAfterJob.cfc | New test job fixture to validate request-scope before/after execution. |
| tests/resources/app/models/Jobs/ReleaseTestJob.cfc | Formatting-only adjustments for a test job fixture. |
| tests/resources/app/models/Jobs/OnFailureCapturingJob.cfc | New test job fixture that captures onFailure argument behavior. |
| tests/resources/app/models/Jobs/BeforeAndAfterJob.cfc | Formatting-only adjustments for a test job fixture. |
| tests/resources/app/models/Jobs/AlwaysErrorJob.cfc | New test job fixture that always throws. |
| tests/Application.cfc | Adds javaSettings.loadPaths for test JVM class loading from /lib. |
| resources/database/migrations/2000_01_01_000008_add_successfulJobs_to_cbq_batches_table.cfc | Adds successfulJobs column to batches table. |
| resources/database/migrations/2000_01_01_000009_make_cbq_batches_name_nullable.cfc | Makes cbq_batches.name nullable. |
| models/Providers/SyncProvider.cfc | Fixes onFailure argument key to exception. |
| models/Providers/DBProvider.cfc | Adds processLockedRecord() runaway guard; adds forceFailJob(); changes timeout watcher logic to use availableDate. |
| models/Providers/ColdBoxAsyncProvider.cfc | Adjusts async chaining for delayed dispatch and attempt propagation. |
| models/Providers/AbstractQueueProvider.cfc | Hardens .onException handling against side-effect failures; introduces forceFailJob() hook and centralized markJobFailed(). |
| models/Jobs/DBBatchRepository.cfc | Adds successfulJobs initialization; fixes pending/failed bookkeeping updates and batch materialization. |
| models/Jobs/Batch.cfc | Adds logging and guards around dispatch of then/catch lifecycle jobs; tracks successfulJobs. |
| interceptors/LogFailedJobsInterceptor.cfc | Makes failed-job logging inserts more defensive for nullable/structured exception fields. |
| box.json | Bumps module version to 6.0.0-beta.3 and expands formatter script targets. |
| .github/workflows/release.yml | Moves CI MySQL to 8.0 and configures mysql_native_password for the test user. |
| .github/workflows/pr.yml | Moves CI MySQL to 8.0 and configures mysql_native_password for the test user. |
| .github/workflows/cron.yml | Moves CI MySQL to 8.0 and configures mysql_native_password for the test user. |
- Wrap afterJobHook in try/catch so a throwing hook logs via logSideEffectFailure and does not flip a successful job to failed
- Rename onFailureExceptionIsExpcetion → onFailureExceptionHasExcpetionKey in OnFailureCapturingJob and SyncProviderOnFailureSpec to separate the intentional 'excpetion' typo-under-test from the variable naming

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…rceptor

memento, properties, exceptionExtendedInfo, exceptionStackTrace, and exception all map to LONGTEXT in the cbq_failed_jobs schema. Binding them as CF_SQL_VARCHAR risks truncation or driver errors on large payloads (e.g. deep stack traces or complex job mementos).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
elpete added a commit that referenced this pull request on Apr 20, 2026
Summary
- `processLockedRecord` now checks `attempts >= maxAttempts` before incrementing or dispatching. Jobs stuck in the runaway state (reserved but never executed, attempts already at the cap) are immediately marked failed instead of being dispatched again.
- `marshalJob` exception handler: if `releaseJob` throws inside `.onException`, the handler now falls back to `markJobFailed` (terminal) instead of letting the future swallow the secondary exception and leaving the row reserved. If `afterJobFailed` itself throws, `forceFailJob` is called as a last resort to guarantee the row exits the retry loop.
- `forceFailJob` on DBProvider: minimal, bulletproof write that sets `failedDate` and clears `reservedBy`/`reservedDate` unconditionally, bypassing all bookkeeping. Used only when normal failure paths are broken.
- `DBProviderMaxAttemptsSpec` covers: `forceFailJob` behaviour, runaway-job pre-flight guard, normal-path pass-through, `releaseJob`-throws fallback, and `afterJobFailed`-throws escalation.