Zero-silent-failure error handling, page-level logging, and bug fixes (PR #46 + #45) #47

Closed. Nishit24113 wants to merge 2 commits into main from fix/error-handling-and-logging

Nishit24113 (Collaborator) commented Jun 8, 2026

Summary

This PR makes pipeline failures visible to the user instead of failing silently, adds page-level failure logging to CloudWatch, and rolls in two existing bug-fix PRs so they ship together.

Previously, when any processing step failed, the workflow ended FAILED with nothing written to the result/ folder. The frontend polls result/ to detect completion, so it would poll indefinitely and the user saw the job "spin" for hours with no explanation.

Included bug fixes

  • PR #46, "Fix env var passing in RunAltTextGenerationTask" (app.py): read $.s3_bucket/$.s3_key directly from the Map iterator input instead of indexing into the ECS ContainerOverrides array, which GuardDuty Runtime Monitoring can reorder by injecting a sidecar (this caused ~50% intermittent failures).
  • PR #45, "Fix: Bedrock aspect ratio rejection for form PDFs" (alt_text_generator.js): images exceeding Bedrock's 20:1 aspect-ratio limit (thin lines/borders in forms) are skipped and assigned "Decorative element" alt text instead of crashing the whole job.
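
A minimal sketch of the aspect-ratio skip described for PR #45. The actual fix lives in alt_text_generator.js; this Python version only illustrates the check. The function names and the handling of zero-sized images are assumptions, not taken from the PR.

```python
MAX_ASPECT_RATIO = 20.0                     # the 20:1 Bedrock limit cited above
DECORATIVE_ALT_TEXT = "Decorative element"  # fallback alt text named in the PR


def exceeds_aspect_limit(width: int, height: int,
                         limit: float = MAX_ASPECT_RATIO) -> bool:
    """Return True when the longer side is more than `limit` times the shorter."""
    if width <= 0 or height <= 0:
        # Assumption: treat degenerate images as unsendable as well.
        return True
    return max(width, height) / min(width, height) > limit


def alt_text_for(width: int, height: int, generate) -> str:
    """Skip the model call for over-limit images instead of letting it fail.

    `generate` is a hypothetical zero-argument callable that would invoke
    the model for images within the limit.
    """
    if exceeds_aspect_limit(width, height):
        return DECORATIVE_ALT_TEXT
    return generate()
```

A 2000x10 form border (200:1) would get the decorative fallback, while an image at exactly 20:1 is still sent to the model.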

Error handling — two-layer, zero silent failures

  1. Each station writes a detail file temp/<name>/_errors/<station>.json on failure (station, reason category, chunk index, page range) and emits a structured CloudWatch line:
    File: <name>, Status: FAILED | station=adobe | reason=ADOBE_API | chunk=8 | pages=1401-1600
  2. Step Functions Catch (States.ALL) → new failure-handler Lambda is the exhaustive safety net. It fires on every failure — in-code and infrastructure (container can't start, OOM, timeout, IAM) — aggregates the detail files, and writes result/FAILED_<name>.json where the frontend already polls. Invoked only on failure ⇒ ~zero cost.
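
The station-side half of this (layer 1) could look roughly like the sketch below. The log-line format and the detail-file path follow the description above; the JSON field names are assumptions, since the real schema is in docs/ERROR_HANDLING.md, and the helper names are hypothetical.

```python
import json


def format_failure_line(name: str, station: str, reason: str,
                        chunk: int, pages: str) -> str:
    """Build the structured CloudWatch line in the format shown above."""
    return (f"File: {name}, Status: FAILED | station={station} | "
            f"reason={reason} | chunk={chunk} | pages={pages}")


def error_detail_key(name: str, station: str) -> str:
    """Object key of the per-station detail file."""
    return f"temp/{name}/_errors/{station}.json"


def build_error_detail(station: str, reason: str,
                       chunk: int, pages: str) -> str:
    """Serialize the detail record (field names are assumed, not confirmed)."""
    return json.dumps({
        "station": station,
        "reason": reason,
        "chunk": chunk,
        "pages": pages,
    })


# A station would print the line (so it reaches CloudWatch) and upload the
# detail body to error_detail_key(...) before raising.
print(format_failure_line("report", "adobe", "ADOBE_API", 8, "1401-1600"))
```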

The splitter writes the marker directly because it runs before the state machine exists (the Catch can't cover it).
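
The aggregation step of the failure-handler Lambda (layer 2) might be sketched as follows. To stay self-contained it takes a plain dict of object key to JSON body in place of S3; the marker's field names and the `cause` parameter are hypothetical and not confirmed by this PR.

```python
import json
from typing import Optional


def build_failed_marker(name: str, objects: dict,
                        cause: Optional[str] = None):
    """Collect per-station detail files and build the FAILED marker.

    `objects` maps object keys to JSON text (an in-memory stand-in for S3).
    Returns (marker_key, marker_dict).
    """
    prefix = f"temp/{name}/_errors/"
    details = [
        json.loads(body)
        for key, body in sorted(objects.items())
        if key.startswith(prefix) and key.endswith(".json")
    ]
    marker = {
        "file": name,
        "status": "FAILED",
        # Empty for infrastructure failures where no station got to report.
        "errors": details,
        # Whatever error information the Catch handed to the Lambda.
        "cause": cause,
    }
    return f"result/FAILED_{name}.json", marker
```

Because the marker is written even when `errors` is empty, failures such as a container that never starts would still surface to the frontend.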

Silent-success bugs fixed

The title Lambda returned 500-status dicts and the Java merger returned an error string. Step Functions treated both return values as success, letting the workflow continue past a real failure. Both now report detail and raise so the Catch fires.
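
To illustrate the difference: a Lambda that returns an error-shaped value still completes its invocation normally, so a Step Functions task sees success, while an uncaught exception fails the task and lets a Catch fire. The sketch below is not the PR's code; `StationError` and the `generate` callable are hypothetical.

```python
class StationError(Exception):
    """Hypothetical exception raised so the state machine's Catch fires."""


def handler_silent(event, generate):
    """Old pattern: swallows the error and returns a 500-style dict."""
    try:
        return {"statusCode": 200, "title": generate(event)}
    except Exception as exc:
        # The invocation itself succeeds, so the workflow carries on.
        return {"statusCode": 500, "body": str(exc)}


def handler(event, generate):
    """New pattern: report detail, then raise so the task actually fails."""
    try:
        title = generate(event)
    except Exception as exc:
        # A real station would write its detail file and log line here.
        raise StationError(f"title generation failed: {exc}") from exc
    return {"statusCode": 200, "title": title}
```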

Stations instrumented

pdf-splitter (Python), adobe-autotag (Python), alt-text-generator (JS), pdf-merger (Java), title-generator (Python).

Frontend contract

The UI (separate repo PDF_accessability_UI) should also check for result/FAILED_<name>.json while polling. Schema + reason categories documented in docs/ERROR_HANDLING.md.
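
The polling decision on the frontend side could be as small as the sketch below. The UI lives in a separate repo and is not necessarily Python; this only shows the order of checks. The status strings and function name are assumptions.

```python
def job_status(name: str, result_keys) -> str:
    """Classify a job from the object keys currently present under result/."""
    keys = set(result_keys)
    if f"result/FAILED_{name}.json" in keys:
        return "FAILED"      # stop polling and show the reason to the user
    if f"result/COMPLIANT_{name}.pdf" in keys:
        return "COMPLETE"    # stop polling and offer the download
    return "PENDING"         # keep polling
```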

⚠️ Deploy notes

  • Rebuild the merger JAR: app.py loads the prebuilt PDFMergerLambda-1.0-SNAPSHOT.jar; nothing in deploy.sh/buildspec runs mvn package, so the Java change requires a rebuild to take effect.
  • Frontend needs the one-line FAILED_ marker check (separate repo).
  • PAGES_PER_CHUNK (default 200) must stay in sync across splitter/stations/handler.
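
The sync requirement matters because the reported page range is derived from the chunk index. A sketch of that arithmetic, assuming 1-based chunk numbering (inferred from the chunk=8, pages=1401-1600 example earlier in this description) and a hypothetical `page_range` helper:

```python
PAGES_PER_CHUNK = 200  # default named in the deploy notes


def page_range(chunk: int, pages_per_chunk: int = PAGES_PER_CHUNK,
               total_pages=None):
    """Return (first_page, last_page) for a 1-based chunk index."""
    if chunk < 1:
        raise ValueError("chunk index is assumed to start at 1")
    first = (chunk - 1) * pages_per_chunk + 1
    last = chunk * pages_per_chunk
    if total_pages is not None:
        last = min(last, total_pages)  # the final chunk may be short
    return first, last
```

If the splitter used a different chunk size than the component computing the range, the pages reported in the marker would point at the wrong part of the document.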

Validation

  • Python py_compile ✓ · JS node --check ✓ · CDK construct tree synthesizes ✓
  • Java reviewed manually (no local JDK/Maven); JSON escaping hardened.
  • Not yet deployed — dev-account deploy/test to follow.

Test plan

  • cdk synth in dev
  • Force an Adobe failure → result/FAILED_<name>.json with reason=ADOBE_API + page range
  • Force a Bedrock failure → marker with reason=BEDROCK_API
  • Upload a form PDF with thin lines → completes (aspect-ratio skip)
  • Confirm successful runs still produce result/COMPLIANT_<name>.pdf

…ixes

Bug fixes (from PR #46 and #45):
- Fix env var passing in RunAltTextGenerationTask: read s3_bucket/s3_key
  directly from the Map iterator input instead of indexing into the ECS
  ContainerOverrides array, which GuardDuty sidecar injection can reorder
- Fix Bedrock aspect-ratio rejection for form PDFs: skip images exceeding
  the 20:1 limit and assign "Decorative element" alt text instead of crashing

Error handling (no failure can go unreported):
- New failure-handler Lambda wired to a Step Functions Catch (States.ALL).
  On any failure it aggregates per-station detail and writes
  result/FAILED_<name>.json where the frontend already polls, carrying the
  reason category and failing chunk/page range
- Instrument all 5 stations to write temp/<name>/_errors/<station>.json plus
  a structured CloudWatch line (station, reason, chunk, page range)
- Splitter writes the marker directly (it runs before the state machine)
- Fix two silent-success paths: the title Lambda returned 500 dicts and the
  Java merger returned an error string, both treated as success by Step
  Functions; they now report and raise so the Catch fires
- Add docs/ERROR_HANDLING.md describing the frontend FAILED_ marker contract

Nishit24113 (Collaborator, Author) commented

Closing for now. Will validate the error-handling + bug-fix changes locally against multiple test PDFs and deploy to the pdf-dev account for testing first, then re-open/recreate the PR once verified.

Nishit24113 closed this Jun 8, 2026