Report root cause when a stack update fails by deorus · Pull Request #3 · Doist/update-cloudformation-stack

deorus · 2026-06-20T10:50:44Z

Addresses Doist/platform-backlog#734 — failed deploys currently surface only a single, often-wrong error line, leaving you to dig through the AWS Console. This action is the standard deploy step across ~30 Doist repos, so the improvement applies broadly, not just to Todoist.

What changed

On failure, scan the stack events for this update's ClientRequestToken and collect all genuinely-failed resources (any *_FAILED status, skipping cancellation cascades). Report the earliest failure as the root cause — the previous heuristic captured the newest failed event, which is usually a downstream symptom.
Under GitHub Actions, write a summary table to the job summary: Time (UTC) / Resource / Type / Status / Reason, plus a deep link to the stack's events tab in the AWS Console. Remaining failures are emitted as ::error:: annotations.
Falls back to a generic message + console link when no resource-level reason is available.
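The selection rule described above can be sketched as follows. This is a minimal illustration, not the PR's code: the `event` struct is a hypothetical stand-in for the SDK's stack event type, and matching cancellation cascades on the word "cancelled" in the reason is an assumption about how such events are worded.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"time"
)

// event is a hypothetical, trimmed-down stand-in for a CloudFormation stack event.
type event struct {
	Token     string // ClientRequestToken of the operation that produced the event
	LogicalID string
	Status    string
	Reason    string
	Timestamp time.Time
}

// isFailure reports whether e is a genuine failure: a *_FAILED status whose
// reason does not look like a cancellation cascade (assumed wording).
func isFailure(e event) bool {
	if !strings.HasSuffix(e.Status, "_FAILED") {
		return false
	}
	return !strings.Contains(strings.ToLower(e.Reason), "cancelled")
}

// rootCause returns the earliest genuine failure that belongs to token.
func rootCause(events []event, token string) (event, bool) {
	var failed []event
	for _, e := range events {
		if e.Token == token && isFailure(e) {
			failed = append(failed, e)
		}
	}
	if len(failed) == 0 {
		return event{}, false
	}
	sort.SliceStable(failed, func(i, j int) bool {
		return failed[i].Timestamp.Before(failed[j].Timestamp)
	})
	return failed[0], true
}

// sampleEvents returns made-up events, newest first.
func sampleEvents() []event {
	t0 := time.Date(2026, 6, 20, 10, 0, 0, 0, time.UTC)
	return []event{
		{"tok-1", "Service", "UPDATE_FAILED", "Resource update cancelled", t0.Add(3 * time.Second)},
		{"tok-1", "Api", "UPDATE_FAILED", "Dependency failed", t0.Add(2 * time.Second)},
		{"tok-1", "Queue", "CREATE_FAILED", "Quota exceeded", t0.Add(1 * time.Second)},
		{"tok-0", "Bucket", "CREATE_FAILED", "Older deploy", t0.Add(-time.Hour)},
	}
}

func main() {
	if root, ok := rootCause(sampleEvents(), "tok-1"); ok {
		fmt.Printf("%s %s: %s\n", root.LogicalID, root.Status, root.Reason)
	}
}
```

In this sample the newest failed event (`Service`) is a cancellation and the next one (`Api`) is a downstream symptom, so picking the newest would report the wrong resource.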

Notes

No new IAM permission — cloudformation:DescribeStackEvents was already required.
Tests cover failure detection, root-cause ordering, Markdown-cell escaping, and the console URL.

On a failed deploy, scan the stack events for this update and surface the earliest failed resource as the error. Under GitHub Actions, write a summary table of all failed resources plus a console deep link to the job summary.

doistbot

This PR improves CloudFormation deploy failure reporting by scanning stack events for the update's ClientRequestToken, identifying the earliest failed resource as the root cause, and writing a summary table with console links to the GitHub Actions job summary.

A few things worth tightening:

Stack-level failures get dropped: The continue statement that skips stack events also hides pre-flight failures (missing IAM capabilities, export overlaps, etc.) where no resource fails, resulting in a generic error instead of the real cause. Remove the continue so stack-level _FAILED events are still collected — since stack events are newer than resource failures, sortedFailures will still place the true root cause first.
Edge cases in the root-cause message: A *_FAILED event with no ResourceStatusReason produces a malformed message (… STATUS: — see stack events). Treat an empty root.reason as missing and fall back to the generic message + console link. Separately, the console URL hardcodes console.aws.amazon.com, which is wrong for GovCloud and China partitions — derive the host from the stack ARN/partition instead.
Nondeterministic root-cause selection: Events collected into a map then sorted only by timestamp can shuffle equal-timestamp failures, making the reported root cause vary run to run. Use EventId for dedup instead of the composite key, and add a deterministic tiebreaker or preserve scan order so failures[0] is stable.
Polling and testing gaps: On busy stacks, pagination now continues through the full one-hour cutoff even after rollback — break early once the matching UPDATE_IN_PROGRESS event for this token is found. Also, the two highest-value functions (reportFailure and writeStepSummary) are untested despite being pure functions with an existing test harness; adding coverage for both the populated and empty cases would guard against regressions in the core logic this PR introduces.

I also included a few optional follow-up notes in the details below.

Optional follow-up note (1)

[P3] main.go:146: Since the stack's terminal event is always newer than the resource failures that caused it, it will be encountered first in this newest-to-oldest stream. You can avoid redundantly populating the failures map on every tick during a rollback by checking if terminal != "" && isFailure(...). This safely defers all map operations and string parsing until the final tick when the map is actually used.


doistbot · 2026-06-22T06:48:50Z

+						types.ResourceStatusRollbackFailed:
+						terminal = evt.ResourceStatus
 					}
+					continue


🟠 P1 The continue statement prevents every stack-level event from reaching isFailure. While this hides the redundant "The following resource(s) failed" stack event during typical deployments, it also drops stack-level pre-flight failures (e.g., missing IAM capabilities, export overlaps, or unresolvable parameters) where no underlying resource fails. The action then returns a generic UPDATE_ROLLBACK_COMPLETE error without the actual root cause, which is a regression from the previous behavior.

Remove continue so that stack-level _FAILED events are still collected. Because stack-level events occur after resource failures, sortedFailures will still correctly place the true resource root cause first when both exist.

doistbot · 2026-06-22T06:48:50Z

+		log.Printf("%s%s (%s) %s: %s", githubErrPrefix, e.logicalID, e.resType, e.status, oneLine(e.reason))
+	}
+	root := failures[0]
+	return fmt.Errorf("%s (%s) %s: %s — see stack events: %s", root.logicalID, root.resType, root.status, oneLine(root.reason), consoleURL)


🟡 P2 ResourceStatusReason is optional in CloudFormation events, so a *_FAILED event can legitimately arrive with no reason. In that case this now returns Resource (Type) STATUS: — see stack events, even though the PR description says to fall back to a generic message when no resource-level reason is available. Please treat an empty root.reason as missing here and either omit the : %s fragment or fall back to the terminal-status message + console link.
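One way to handle the empty-reason case is sketched below. The `failedEvent` struct is redeclared locally with only the fields needed here, `rootCauseError` is a hypothetical helper name, and the message wording is illustrative rather than the PR's exact format.

```go
package main

import (
	"fmt"
	"strings"
)

// failedEvent mirrors the field names used in the diff; redeclared here so the
// sketch is self-contained.
type failedEvent struct {
	logicalID, resType, status, reason string
}

// rootCauseError (hypothetical name) omits the ": reason" fragment when the
// event carries no usable reason, instead of emitting a dangling colon.
func rootCauseError(root failedEvent, consoleURL string) error {
	reason := strings.TrimSpace(root.reason)
	if reason == "" {
		return fmt.Errorf("%s (%s) %s, see stack events: %s",
			root.logicalID, root.resType, root.status, consoleURL)
	}
	return fmt.Errorf("%s (%s) %s: %s, see stack events: %s",
		root.logicalID, root.resType, root.status, reason, consoleURL)
}

func main() {
	const link = "https://example.invalid/events" // placeholder URL
	withReason := failedEvent{"Queue", "AWS::SQS::Queue", "CREATE_FAILED", "Quota exceeded"}
	noReason := failedEvent{"Queue", "AWS::SQS::Queue", "CREATE_FAILED", "  "}
	fmt.Println(rootCauseError(withReason, link))
	fmt.Println(rootCauseError(noReason, link))
}
```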

doistbot · 2026-06-22T06:48:50Z

+}
+
+func eventsConsoleURL(region, stackID string) string {
+	return fmt.Sprintf("https://%[1]s.console.aws.amazon.com/cloudformation/home?region=%[1]s#/stacks/events?stackId=%s",


🟡 P2 This hardcodes the commercial AWS Console domain, so the deep link is wrong for GovCloud and China regions even though the action itself can run there. Please derive the console host from the stack ARN/partition (or otherwise map the region to the correct console domain) instead of always using *.console.aws.amazon.com.
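A possible shape for the partition-aware link is sketched below. It assumes `stackID` is a full stack ARN (`arn:<partition>:cloudformation:...`) and that the GovCloud and China consoles live under `console.amazonaws-us-gov.com` and `console.amazonaws.cn` with the same regional-subdomain and path layout as the commercial console; `consoleDomain` is a hypothetical helper, and both assumptions should be checked against AWS documentation before relying on them.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// consoleDomain (hypothetical helper) maps the ARN partition to a console
// domain. Unknown or non-ARN input falls back to the commercial domain.
func consoleDomain(stackID string) string {
	parts := strings.SplitN(stackID, ":", 3)
	if len(parts) == 3 && parts[0] == "arn" {
		switch parts[1] {
		case "aws-us-gov":
			return "console.amazonaws-us-gov.com" // assumed GovCloud console domain
		case "aws-cn":
			return "console.amazonaws.cn" // assumed China console domain
		}
	}
	return "console.aws.amazon.com"
}

// eventsConsoleURL builds the deep link to the stack's events tab.
func eventsConsoleURL(region, stackID string) string {
	return fmt.Sprintf("https://%[1]s.%[2]s/cloudformation/home?region=%[1]s#/stacks/events?stackId=%[3]s",
		region, consoleDomain(stackID), url.QueryEscape(stackID))
}

func main() {
	fmt.Println(eventsConsoleURL("eu-west-1",
		"arn:aws:cloudformation:eu-west-1:123456789012:stack/app/abc"))
	fmt.Println(eventsConsoleURL("us-gov-west-1",
		"arn:aws-us-gov:cloudformation:us-gov-west-1:123456789012:stack/app/abc"))
}
```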

doistbot · 2026-06-22T06:48:50Z

+						reason:    unptr(evt.ResourceStatusReason),
+						timestamp: unptr(evt.Timestamp),
+					}
+					failures[fe.logicalID+"\x00"+fe.timestamp.String()] = fe


🟡 P2 Replace this synthetic composite key with the event's unique ID (unptr(evt.EventId)). CloudFormation guarantees a stable EventId for every stack event, which provides a safer, allocation-free way to deduplicate events without relying on stringified timestamps.
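Deduplicating on the event ID could look roughly like this. The `stackEvent` struct and the `collector` type are hypothetical stand-ins, not names from the PR; the sketch also keeps failures in a slice so first-seen order is preserved.

```go
package main

import "fmt"

// stackEvent is a hypothetical stand-in for the SDK's stack event type.
type stackEvent struct {
	EventID   string
	LogicalID string
	Status    string
}

// collector accumulates failed events across polling ticks without duplicates.
type collector struct {
	seen     map[string]struct{} // EventIDs already recorded
	failures []stackEvent        // kept in first-seen order
}

func newCollector() *collector {
	return &collector{seen: make(map[string]struct{})}
}

// add records e unless an event with the same EventID was already recorded.
func (c *collector) add(e stackEvent) {
	if _, dup := c.seen[e.EventID]; dup {
		return
	}
	c.seen[e.EventID] = struct{}{}
	c.failures = append(c.failures, e)
}

func main() {
	c := newCollector()
	page := []stackEvent{
		{"evt-2", "Service", "UPDATE_FAILED"},
		{"evt-1", "Queue", "CREATE_FAILED"},
	}
	// The same events are typically re-read on every polling tick.
	for tick := 0; tick < 3; tick++ {
		for _, e := range page {
			c.add(e)
		}
	}
	fmt.Println(len(c.failures), "unique failures")
}
```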

doistbot · 2026-06-22T06:48:50Z

+
+func sortedFailures(m map[string]failedEvent) []failedEvent {
+	out := slices.Collect(maps.Values(m))
+	slices.SortFunc(out, func(a, b failedEvent) int { return a.timestamp.Compare(b.timestamp) })


🟡 P2 This only sorts by timestamp, but the failures were first collected into a map, so any two events with the same timestamp end up in arbitrary order. Since reportFailure() treats failures[0] as the root cause, the reported root failure can change run to run for equal-timestamp events. Keep scan order in a slice (using a separate set only for dedupe if needed), or add a deterministic tiebreaker before selecting the first failure.
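A deterministic tiebreaker might be added along these lines. The sketch takes a slice instead of a map, and the `eventID` field is a hypothetical addition to `failedEvent`; breaking ties on logical ID and then event ID is one arbitrary but stable choice, not necessarily the ordering the author would pick.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// failedEvent is redeclared locally; eventID is a hypothetical extra field.
type failedEvent struct {
	eventID   string
	logicalID string
	timestamp time.Time
}

// sortedFailures orders by timestamp, then logicalID, then eventID, so that
// events with equal timestamps always come out in the same order.
func sortedFailures(in []failedEvent) []failedEvent {
	out := append([]failedEvent(nil), in...) // copy; leave the input untouched
	sort.Slice(out, func(i, j int) bool {
		a, b := out[i], out[j]
		if !a.timestamp.Equal(b.timestamp) {
			return a.timestamp.Before(b.timestamp)
		}
		if a.logicalID != b.logicalID {
			return a.logicalID < b.logicalID
		}
		return a.eventID < b.eventID
	})
	return out
}

// sampleFailures returns made-up events, two of which share a timestamp.
func sampleFailures() []failedEvent {
	t0 := time.Date(2026, 6, 20, 10, 0, 0, 0, time.UTC)
	return []failedEvent{
		{"evt-3", "Zeta", t0},
		{"evt-2", "Alpha", t0},
		{"evt-1", "Early", t0.Add(-time.Second)},
	}
}

func main() {
	for _, f := range sortedFailures(sampleFailures()) {
		fmt.Println(f.timestamp.Format(time.RFC3339), f.logicalID)
	}
}
```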

doistbot · 2026-06-22T06:48:50Z

-						return cmp.Or(likelyRootCause, fmt.Errorf("%v, see AWS CloudFormation Console for more details", evt.ResourceStatus))
 					case types.ResourceStatusUpdateComplete:
 						return nil
+					case types.ResourceStatusUpdateRollbackComplete,


🟡 P2 This change stops returning immediately on rollback and now keeps paginating DescribeStackEvents until the one-hour cutoff. On stacks with a lot of recent history, that means extra API pages for unrelated older operations right on the failure path. Since events are newest-first, consider breaking once you've reached the current update's start event for this ClientRequestToken (for example the matching stack UPDATE_IN_PROGRESS event) instead of scanning the whole cutoff window.
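The early exit suggested here can be sketched with pages modeled as plain slices. The `stackEvent` struct, `isUpdateStart`, and `scan` are hypothetical names, and the sketch assumes the update's first event is a stack-level `UPDATE_IN_PROGRESS` carrying the same token.

```go
package main

import "fmt"

// stackEvent is a hypothetical stand-in for the SDK's stack event type.
type stackEvent struct {
	Token        string
	ResourceType string
	Status       string
}

// isUpdateStart reports whether e is the stack-level event that opened the
// update identified by token (assumed to be UPDATE_IN_PROGRESS on the stack).
func isUpdateStart(e stackEvent, token string) bool {
	return e.Token == token &&
		e.ResourceType == "AWS::CloudFormation::Stack" &&
		e.Status == "UPDATE_IN_PROGRESS"
}

// scan walks pages newest-first, collects events for token, and stops as soon
// as the update's start event is seen. It returns the matches and the number
// of pages it had to read.
func scan(pages [][]stackEvent, token string) ([]stackEvent, int) {
	var matched []stackEvent
	pagesRead := 0
	for _, page := range pages {
		pagesRead++
		for _, e := range page {
			if e.Token == token {
				matched = append(matched, e)
			}
			if isUpdateStart(e, token) {
				return matched, pagesRead // older pages belong to earlier operations
			}
		}
	}
	return matched, pagesRead
}

// samplePages returns three made-up pages, newest first.
func samplePages() [][]stackEvent {
	return [][]stackEvent{
		{
			{"tok-1", "AWS::CloudFormation::Stack", "UPDATE_ROLLBACK_COMPLETE"},
			{"tok-1", "AWS::SQS::Queue", "CREATE_FAILED"},
		},
		{
			{"tok-1", "AWS::CloudFormation::Stack", "UPDATE_IN_PROGRESS"},
			{"tok-0", "AWS::CloudFormation::Stack", "UPDATE_COMPLETE"},
		},
		{
			{"tok-0", "AWS::S3::Bucket", "UPDATE_COMPLETE"},
		},
	}
}

func main() {
	matched, pagesRead := scan(samplePages(), "tok-1")
	fmt.Printf("matched %d events in %d pages\n", len(matched), pagesRead)
}
```

Note that `UPDATE_ROLLBACK_IN_PROGRESS` is a different status and does not trigger the exit, since the comparison is an exact match.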

doistbot · 2026-06-22T06:48:50Z

+// table plus per-resource error annotations — and returns the most likely root
+// cause as an error. The earliest failure is the root cause; later ones usually
+// cascade from it.
+func reportFailure(region, stackID string, terminal types.ResourceStatus, failures []failedEvent) error {


🟡 P2 reportFailure is the core of this PR — it picks failures[0] as the root cause, formats the error message, logs failures[1:] as annotations, and falls back to a generic message when there are no resource-level failures. None of this is tested. A regression here (wrong element selected, broken fallback, malformed message) would directly reintroduce the wrong-root-cause problem this PR fixes. It's a pure function in the same package — with GITHUB_STEP_SUMMARY unset, a test can pass a []failedEvent and assert the returned error string for both the populated and empty cases.
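The populated and empty cases could be covered with a table-driven check like the one below. Here `reportFailure` is a simplified stand-in with string parameters and illustrative message wording, not the function from the diff; in the real package the loop would sit inside a `func TestReportFailure(t *testing.T)` and call the actual implementation.

```go
package main

import (
	"errors"
	"fmt"
)

// failedEvent is redeclared locally with only the fields the sketch needs.
type failedEvent struct {
	logicalID, resType, status, reason string
}

// reportFailure is a simplified stand-in: the first failure is the root cause,
// and an empty list falls back to the terminal status plus the console link.
func reportFailure(terminal, consoleURL string, failures []failedEvent) error {
	if len(failures) == 0 {
		return fmt.Errorf("%s, see stack events: %s", terminal, consoleURL)
	}
	root := failures[0]
	return fmt.Errorf("%s (%s) %s: %s, see stack events: %s",
		root.logicalID, root.resType, root.status, root.reason, consoleURL)
}

// checkReportFailure runs the table and returns every mismatch joined together.
func checkReportFailure() error {
	const link = "https://example.invalid/events" // placeholder URL
	cases := []struct {
		name     string
		failures []failedEvent
		want     string
	}{
		{
			name: "populated: first element is the root cause",
			failures: []failedEvent{
				{"Queue", "AWS::SQS::Queue", "CREATE_FAILED", "Quota exceeded"},
				{"Api", "AWS::Lambda::Function", "UPDATE_FAILED", "Dependency failed"},
			},
			want: "Queue (AWS::SQS::Queue) CREATE_FAILED: Quota exceeded, see stack events: " + link,
		},
		{
			name: "empty: generic fallback",
			want: "UPDATE_ROLLBACK_COMPLETE, see stack events: " + link,
		},
	}
	var errs []error
	for _, tc := range cases {
		got := reportFailure("UPDATE_ROLLBACK_COMPLETE", link, tc.failures).Error()
		if got != tc.want {
			errs = append(errs, fmt.Errorf("%s: got %q, want %q", tc.name, got, tc.want))
		}
	}
	return errors.Join(errs...)
}

func main() {
	if err := checkReportFailure(); err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("all cases pass")
}
```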

doistbot · 2026-06-22T06:48:50Z

+		region, url.QueryEscape(stackID))
+}
+
+func writeStepSummary(path, consoleURL string, terminal types.ResourceStatus, failures []failedEvent) error {


🟡 P2 writeStepSummary produces the primary new user-facing output — the Markdown table written to ~30 repos' job summaries. The table header, separator row, per-row timestamp formatting, and console link assembly are all untested logic that mdCell/eventsConsoleURL tests alone don't cover. It's testable with a temp file: call writeStepSummary with the temp path, read it back, and assert the rendered Markdown. A regression would produce garbled summaries across all consuming repos.
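The temp-file round trip described here might look like the following. `writeStepSummary` and `mdCell` are simplified stand-ins (the `terminal` parameter is dropped, and the table layout, timestamp format, and escaping rules are assumptions based on the PR description), and `renderSample` is a hypothetical helper that plays the role of the test body.

```go
package main

import (
	"fmt"
	"os"
	"strings"
	"time"
)

type failedEvent struct {
	logicalID, resType, status, reason string
	timestamp                          time.Time
}

// mdCell escapes pipes and collapses whitespace so a value fits one table cell.
func mdCell(s string) string {
	s = strings.ReplaceAll(s, "|", `\|`)
	return strings.Join(strings.Fields(s), " ")
}

// writeStepSummary appends a Markdown table and a console link to path.
func writeStepSummary(path, consoleURL string, failures []failedEvent) error {
	var b strings.Builder
	b.WriteString("| Time (UTC) | Resource | Type | Status | Reason |\n")
	b.WriteString("| --- | --- | --- | --- | --- |\n")
	for _, fe := range failures {
		fmt.Fprintf(&b, "| %s | %s | %s | %s | %s |\n",
			fe.timestamp.UTC().Format("2006-01-02 15:04:05"),
			mdCell(fe.logicalID), mdCell(fe.resType), mdCell(fe.status), mdCell(fe.reason))
	}
	fmt.Fprintf(&b, "\n[Stack events in the AWS Console](%s)\n", consoleURL)

	f, err := os.OpenFile(path, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
	if err != nil {
		return err
	}
	if _, err := f.WriteString(b.String()); err != nil {
		f.Close()
		return err
	}
	return f.Close()
}

// renderSample writes one made-up failure to a temp file and reads it back.
func renderSample() (string, error) {
	tmp, err := os.CreateTemp("", "summary-*.md")
	if err != nil {
		return "", err
	}
	path := tmp.Name()
	tmp.Close()
	defer os.Remove(path)

	failures := []failedEvent{{
		logicalID: "Queue", resType: "AWS::SQS::Queue", status: "CREATE_FAILED",
		reason:    "bad | value\nsecond line",
		timestamp: time.Date(2026, 6, 20, 10, 0, 1, 0, time.UTC),
	}}
	if err := writeStepSummary(path, "https://example.invalid/events", failures); err != nil {
		return "", err
	}
	data, err := os.ReadFile(path)
	return string(data), err
}

func main() {
	out, err := renderSample()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Print(out)
}
```

Asserting on the whole rendered string, rather than on individual cells, catches regressions in the header, the separator row, and the link line at once.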

Report root cause when a stack update fails

413a10e


deorus requested a review from artyom June 22, 2026 06:42

deorus marked this pull request as ready for review June 22, 2026 06:42

doistbot reviewed Jun 22, 2026

deorus wants to merge 1 commit into main from better-failure-reporting
