Report root cause when a stack update fails#3
Conversation
On a failed deploy, scan the stack events for this update and surface the earliest failed resource as the error. Under GitHub Actions, write a summary table of all failed resources plus a console deep link to the job summary.
doistbot
left a comment
There was a problem hiding this comment.
This PR improves CloudFormation deploy failure reporting by scanning stack events for the update's ClientRequestToken, identifying the earliest failed resource as the root cause, and writing a summary table with console links to the GitHub Actions job summary.
Few things worth tightening:
- Stack-level failures get dropped: The
continuestatement that skips stack events also hides pre-flight failures (missing IAM capabilities, export overlaps, etc.) where no resource fails, resulting in a generic error instead of the real cause. Remove thecontinueso stack-level_FAILEDevents are still collected — since stack events are newer than resource failures,sortedFailureswill still place the true root cause first. - Edge cases in the root-cause message: A
*_FAILEDevent with noResourceStatusReasonproduces a malformed message (… STATUS: — see stack events). Treat an emptyroot.reasonas missing and fall back to the generic message + console link. Separately, the console URL hardcodesconsole.aws.amazon.com, which is wrong for GovCloud and China partitions — derive the host from the stack ARN/partition instead. - Nondeterministic root-cause selection: Events collected into a map then sorted only by timestamp can shuffle equal-timestamp failures, making the reported root cause vary run to run. Use
EventIdfor dedup instead of the composite key, and add a deterministic tiebreaker or preserve scan order sofailures[0]is stable. - Polling and testing gaps: On busy stacks, pagination now continues through the full one-hour cutoff even after rollback — break early once the matching
UPDATE_IN_PROGRESSevent for this token is found. Also, the two highest-value functions (reportFailureandwriteStepSummary) are untested despite being pure functions with an existing test harness; adding coverage for both the populated and empty cases would guard against regressions in the core logic this PR introduces.
I also included a few optional follow-up notes in the details below.
Optional follow-up note (1)
- [P3] main.go:146: Since the stack's terminal event is always newer than the resource failures that caused it, it will be encountered first in this newest-to-oldest stream. You can avoid redundantly populating the
failuresmap on every tick during a rollback by checkingif terminal != "" && isFailure(...). This safely defers all map operations and string parsing until the final tick when the map is actually used.
| types.ResourceStatusRollbackFailed: | ||
| terminal = evt.ResourceStatus | ||
| } | ||
| continue |
There was a problem hiding this comment.
🟠 P1 The continue statement skips all stack-level events from being evaluated by isFailure. While this hides the redundant "The following resource(s) failed" stack event during typical deployments, it also drops stack-level pre-flight failures (e.g., missing IAM capabilities, export overlaps, or unresolvable parameters) where no underlying resources fail. This results in the action returning a generic UPDATE_ROLLBACK_COMPLETE error and dropping the actual root cause, which is a regression from the previous behavior.
Remove continue so that stack-level _FAILED events are still collected. Because stack-level events occur after resource failures, sortedFailures will still correctly place the true resource root cause first when both exist.
| log.Printf("%s%s (%s) %s: %s", githubErrPrefix, e.logicalID, e.resType, e.status, oneLine(e.reason)) | ||
| } | ||
| root := failures[0] | ||
| return fmt.Errorf("%s (%s) %s: %s — see stack events: %s", root.logicalID, root.resType, root.status, oneLine(root.reason), consoleURL) |
There was a problem hiding this comment.
🟡 P2 ResourceStatusReason is optional in CloudFormation events, so a *_FAILED event can legitimately arrive with no reason. In that case this now returns Resource (Type) STATUS: — see stack events, even though the PR description says to fall back to a generic message when no resource-level reason is available. Please treat an empty root.reason as missing here and either omit the : %s fragment or fall back to the terminal-status message + console link.
| } | ||
|
|
||
| func eventsConsoleURL(region, stackID string) string { | ||
| return fmt.Sprintf("https://%[1]s.console.aws.amazon.com/cloudformation/home?region=%[1]s#/stacks/events?stackId=%s", |
There was a problem hiding this comment.
🟡 P2 This hardcodes the commercial AWS Console domain, so the deep link is wrong for GovCloud and China regions even though the action itself can run there. Please derive the console host from the stack ARN/partition (or otherwise map the region to the correct console domain) instead of always using *.console.aws.amazon.com.
| reason: unptr(evt.ResourceStatusReason), | ||
| timestamp: unptr(evt.Timestamp), | ||
| } | ||
| failures[fe.logicalID+"\x00"+fe.timestamp.String()] = fe |
There was a problem hiding this comment.
🟡 P2 Replace this synthetic composite key with the event's unique ID (unptr(evt.EventId)). CloudFormation guarantees a stable EventId for every stack event, which provides a safer, allocation-free way to deduplicate events without relying on stringified timestamps.
|
|
||
| func sortedFailures(m map[string]failedEvent) []failedEvent { | ||
| out := slices.Collect(maps.Values(m)) | ||
| slices.SortFunc(out, func(a, b failedEvent) int { return a.timestamp.Compare(b.timestamp) }) |
There was a problem hiding this comment.
🟡 P2 This only sorts by timestamp, but the failures were first collected into a map, so any two events with the same timestamp end up in arbitrary order. Since reportFailure() treats failures[0] as the root cause, the reported root failure can change run to run for equal-timestamp events. Keep scan order in a slice (using a separate set only for dedupe if needed), or add a deterministic tiebreaker before selecting the first failure.
| return cmp.Or(likelyRootCause, fmt.Errorf("%v, see AWS CloudFormation Console for more details", evt.ResourceStatus)) | ||
| case types.ResourceStatusUpdateComplete: | ||
| return nil | ||
| case types.ResourceStatusUpdateRollbackComplete, |
There was a problem hiding this comment.
🟡 P2 This change stops returning immediately on rollback and now keeps paginating DescribeStackEvents until the one-hour cutoff. On stacks with a lot of recent history, that means extra API pages for unrelated older operations right on the failure path. Since events are newest-first, consider breaking once you've reached the current update's start event for this ClientRequestToken (for example the matching stack UPDATE_IN_PROGRESS event) instead of scanning the whole cutoff window.
| // table plus per-resource error annotations — and returns the most likely root | ||
| // cause as an error. The earliest failure is the root cause; later ones usually | ||
| // cascade from it. | ||
| func reportFailure(region, stackID string, terminal types.ResourceStatus, failures []failedEvent) error { |
There was a problem hiding this comment.
🟡 P2 reportFailure is the core of this PR — it picks failures[0] as the root cause, formats the error message, logs failures[1:] as annotations, and falls back to a generic message when there are no resource-level failures. None of this is tested. A regression here (wrong element selected, broken fallback, malformed message) would directly reintroduce the wrong-root-cause problem this PR fixes. It's a pure function in the same package — with GITHUB_STEP_SUMMARY unset, a test can pass a []failedEvent and assert the returned error string for both the populated and empty cases.
| region, url.QueryEscape(stackID)) | ||
| } | ||
|
|
||
| func writeStepSummary(path, consoleURL string, terminal types.ResourceStatus, failures []failedEvent) error { |
There was a problem hiding this comment.
🟡 P2 writeStepSummary produces the primary new user-facing output — the Markdown table written to ~30 repos' job summaries. The table header, separator row, per-row timestamp formatting, and console link assembly are all untested logic that mdCell/eventsConsoleURL tests alone don't cover. It's testable with a temp file: call writeStepSummary with the temp path, read it back, and assert the rendered Markdown. A regression would produce garbled summaries across all consuming repos.
Addresses Doist/platform-backlog#734 — failed deploys currently surface only a single, often-wrong error line, leaving you to dig through the AWS Console. This action is the standard deploy step across ~30 Doist repos, so the improvement applies broadly, not just to Todoist.
What changed
ClientRequestTokenand collect all genuinely-failed resources (any*_FAILEDstatus, skipping cancellation cascades). Report the earliest failure as the root cause — the previous heuristic captured the newest failed event, which is usually a downstream symptom.::error::annotations.Notes
cloudformation:DescribeStackEventswas already required.