Zero-silent-failure error handling, page-level logging, and bug fixes (PR #46 + #45)#47
Closed
Nishit24113 wants to merge 2 commits into
Closed
Zero-silent-failure error handling, page-level logging, and bug fixes (PR #46 + #45)#47Nishit24113 wants to merge 2 commits into
Nishit24113 wants to merge 2 commits into
Conversation
…ixes Bug fixes (from PR #46 and #45): - Fix env var passing in RunAltTextGenerationTask: read s3_bucket/s3_key directly from the Map iterator input instead of indexing into the ECS ContainerOverrides array, which GuardDuty sidecar injection can reorder - Fix Bedrock aspect-ratio rejection for form PDFs: skip images exceeding the 20:1 limit and assign "Decorative element" alt text instead of crashing Error handling (no failure can go unreported): - New failure-handler Lambda wired to a Step Functions Catch (States.ALL). On any failure it aggregates per-station detail and writes result/FAILED_<name>.json where the frontend already polls, carrying the reason category and failing chunk/page range - Instrument all 5 stations to write temp/<name>/_errors/<station>.json plus a structured CloudWatch line (station, reason, chunk, page range) - Splitter writes the marker directly (it runs before the state machine) - Fix two silent-success paths: the title Lambda returned 500 dicts and the Java merger returned an error string, both treated as success by Step Functions; they now report and raise so the Catch fires - Add docs/ERROR_HANDLING.md describing the frontend FAILED_ marker contract
Collaborator
Author
|
Closing for now. Will validate the error-handling + bug-fix changes locally against multiple test PDFs and deploy to the pdf-dev account for testing first, then re-open/recreate the PR once verified. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
This PR makes pipeline failures visible to the user instead of failing silently, adds page-level failure logging to CloudWatch, and rolls in two existing bug-fix PRs so they ship together.
Previously, when any processing step failed, the workflow ended FAILED with nothing written to the
result/folder. The frontend pollsresult/to detect completion, so it would poll indefinitely and the user saw the job "spin" for hours with no explanation.Included bug fixes
RunAltTextGenerationTask(app.py): read$.s3_bucket/$.s3_keydirectly from the Map iterator input instead of indexing into the ECSContainerOverridesarray, which GuardDuty Runtime Monitoring can reorder by injecting a sidecar (caused ~50% intermittent failures).alt_text_generator.js): images exceeding Bedrock's 20:1 aspect-ratio limit (thin lines/borders in forms) are skipped and assigned "Decorative element" alt text instead of crashing the whole job.Error handling — two-layer, zero silent failures
temp/<name>/_errors/<station>.jsonon failure (station, reason category, chunk index, page range) and emits a structured CloudWatch line:File: <name>, Status: FAILED | station=adobe | reason=ADOBE_API | chunk=8 | pages=1401-1600Catch(States.ALL) → new failure-handler Lambda is the exhaustive safety net. It fires on every failure — in-code and infrastructure (container can't start, OOM, timeout, IAM) — aggregates the detail files, and writesresult/FAILED_<name>.jsonwhere the frontend already polls. Invoked only on failure ⇒ ~zero cost.The splitter writes the marker directly because it runs before the state machine exists (the Catch can't cover it).
Silent-success bugs fixed
The title Lambda (returned
500dicts) and the Java merger (returned an error string) were treated as success by Step Functions, letting the workflow continue past a real failure. Both now report detail and raise so the Catch fires.Stations instrumented
pdf-splitter(Python),adobe-autotag(Python),alt-text-generator(JS),pdf-merger(Java),title-generator(Python).Frontend contract
The UI (separate repo
PDF_accessability_UI) should also check forresult/FAILED_<name>.jsonwhile polling. Schema + reason categories documented indocs/ERROR_HANDLING.md.app.pyloads the prebuiltPDFMergerLambda-1.0-SNAPSHOT.jar; nothing indeploy.sh/buildspec runsmvn package, so the Java change requires a rebuild to take effect.FAILED_marker check (separate repo).PAGES_PER_CHUNK(default 200) must stay in sync across splitter/stations/handler.Validation
py_compile✓ · JSnode --check✓ · CDK construct tree synthesizes ✓Test plan
cdk synthin devresult/FAILED_<name>.jsonwithreason=ADOBE_API+ page rangereason=BEDROCK_APIresult/COMPLIANT_<name>.pdf