Skip to content

WIP - WAL#5149

Draft
denik wants to merge 33 commits intomainfrom
denik/wal-review1
Draft

WIP - WAL#5149
denik wants to merge 33 commits intomainfrom
denik/wal-review1

Conversation

@denik
Copy link
Copy Markdown
Contributor

@denik denik commented Apr 30, 2026

Changes

Why

Tests

@denik denik temporarily deployed to test-trigger-is April 30, 2026 14:29 — with GitHub Actions Inactive
@denik denik temporarily deployed to test-trigger-is April 30, 2026 14:29 — with GitHub Actions Inactive
@denik denik temporarily deployed to test-trigger-is May 1, 2026 11:49 — with GitHub Actions Inactive
@denik denik temporarily deployed to test-trigger-is May 1, 2026 11:49 — with GitHub Actions Inactive
@denik denik temporarily deployed to test-trigger-is May 1, 2026 14:04 — with GitHub Actions Inactive
@denik denik temporarily deployed to test-trigger-is May 1, 2026 14:04 — with GitHub Actions Inactive
Varun Deep Saini and others added 24 commits May 1, 2026 16:10
Signed-off-by: Varun Deep Saini <varun.23bcs10048@ms.sst.scaler.com>
Signed-off-by: Varun Deep Saini <varun.23bcs10048@ms.sst.scaler.com>
Signed-off-by: Varun Deep Saini <varun.23bcs10048@ms.sst.scaler.com>
Signed-off-by: Varun Deep Saini <varun.23bcs10048@ms.sst.scaler.com>
Signed-off-by: Varun Deep Saini <varun.23bcs10048@ms.sst.scaler.com>
Signed-off-by: Varun Deep Saini <varun.23bcs10048@ms.sst.scaler.com>
Signed-off-by: Varun Deep Saini <deepsainivarun@gmail.com>
Signed-off-by: Varun Deep Saini <deepsainivarun@gmail.com>
Signed-off-by: Varun Deep Saini <varun.23bcs10048@ms.sst.scaler.com>
Signed-off-by: Varun Deep Saini <varun.23bcs10048@ms.sst.scaler.com>
fix Open() calls; replace Finalize() with Close(); close state file in plan
Move state open/close management to process.go so the lifecycle is
transparent. process.go opens state for read (with WAL recovery) after
PullResourcesState and defers close. Deploy/destroy upgrade to write
mode via the new UpgradeToWrite() method which initializes the WAL
without re-reading state JSON.

Internal functions (CalculatePlan, ExportState, InitForApply,
ValidatePlanAgainstState) no longer manage their own open/close —
they expect state to already be open. Self-managed callers (bind,
migrate, yaml_sync, diff) handle their own state lifecycle.

Plan command uses ProcessBundleRetWithPlan to compute the plan while
state is still open for read inside processBundleRetInternal.

Co-authored-by: Isaac
Co-authored-by: Denis Bilenko <denis.bilenko@databricks.com>
Co-authored-by: Denis Bilenko <denis.bilenko@databricks.com>
If only the WAL header was written (no resource changes), replayWAL now
discards the WAL without saving the state file. This avoids the spurious
"Updating deployment state..." message on no-op deploys in the direct
engine.

Co-authored-by: Denis Bilenko <denis.bilenko@databricks.com>
The splits were introduced because the direct engine printed
"Updating deployment state..." on deploys with no resource changes,
while terraform did not. The preceding commit fixes the root cause
(WAL without entries no longer writes the state file), so both
engines now produce identical output for these tests.

Co-authored-by: Denis Bilenko <denis.bilenko@databricks.com>
Co-authored-by: Denis Bilenko <denis.bilenko@databricks.com>
denik added 9 commits May 1, 2026 16:10
Populate stateIDs from State on Reload so it always mirrors the
effective view: initialized from disk, updated by SaveState/DeleteState.
GetResourceID now consults stateIDs unconditionally instead of checking
walFile and falling back to State.

Co-authored-by: Denis Bilenko <denis.bilenko@databricks.com>
…roach

State opened for read in ProcessBundleRet stays open after return.
Deploy and Destroy call UpgradeToWrite + Close internally, so no
defensive defer is needed. plan.go reverts to the two-step pattern
from main: ProcessBundleRet then phases.RunPlan. ProcessBundleRetWithPlan
and opts.ComputePlan are removed.

Co-authored-by: Denis Bilenko <denis.bilenko@databricks.com>
…ternal

Close(ctx) -> Finalize(ctx) to match main's naming.
plan was a named return in processBundleRetInternal only to support
the now-removed ProcessBundleRetWithPlan; demote it to a local variable.

Co-authored-by: Denis Bilenko <denis.bilenko@databricks.com>
- Collapse processBundleRetInternal back into ProcessBundleRet (named returns)
- ProcessBundle calls ProcessBundleRet like on main
- Restore needDirectState guard so state is only opened when needed
- Move var plan back after the state block

Co-authored-by: Denis Bilenko <denis.bilenko@databricks.com>
- migrate.go: use len(state) instead of len(StateDB.Data.State) since Finalize()
  resets Data after saving; fixes "Migrated 0 resources" regression
- dashboard.go, diff.go: remove unnecessary defer StateDB.Finalize for read-only
  opens - no WAL file is open so no cleanup is needed
- bind.go, state_test.go, upload_state_for_yaml_sync.go, migrate.go: fix
  errcheck lint issues on Finalize calls that cannot return error in read mode

Co-authored-by: Denis Bilenko <denis.bilenko@databricks.com>
…ation

Key fixes to bundle/direct/dstate/state.go:

- Change walSuffix from ".WAL" to ".wal" for case-sensitive filesystem
  compatibility (Linux). All acceptance tests create .wal files.

- Remove CLIVersion and StateVersion checks from validateWALHeader.
  These fields are informational metadata and should not gate WAL recovery.
  Pre-created test WAL files without these fields now validate correctly.

- Fix validateWALHeader error messages:
  - Lineage mismatch: "WAL lineage (%s) does not match state lineage (%s)"
  - Stale serial (< expected): return errStaleWAL sentinel
  - Future serial (> expected): "WAL serial (%d) is ahead of expected (%d),
    state may be corrupted"

- Add errStaleWAL sentinel so replayWAL can silently delete stale WALs
  instead of returning an error that fails the deploy.

- Fix mergeWalIntoState for partial recovery:
  - Skip corrupted entries with log.Warnf instead of failing immediately
  - Save corrupted lines to PATH.wal.corrupted for debugging
  - Update db.stateIDs alongside db.Data.State when applying WAL entries

- Add MkdirAll before WAL file creation in Open and UpgradeToWrite.
  Fixes bind failing on first use when the state directory doesn't exist yet.

- Wrap replayWAL errors as "WAL recovery failed: %w".
- Wrap Open's replayWAL call as "reading state from %s: %w".

Update acceptance test expected outputs accordingly:
- WAL tests: new error messages, partial recovery behavior, stale WAL handling
- Migrate tests: migration count now correct (was 0 due to earlier fix)
- Bind tests: now succeed when state directory didn't exist before
- State/deploy failure tests: "Updating deployment state..." removed when
  no state was written (no-change or failed deploys)

Co-authored-by: Isaac
Co-authored-by: Denis Bilenko <denis.bilenko@databricks.com>
Co-authored-by: Denis Bilenko <denis.bilenko@databricks.com>
@denik denik force-pushed the denik/wal-review1 branch from 4e30122 to 864a2b9 Compare May 1, 2026 14:12
@denik denik temporarily deployed to test-trigger-is May 1, 2026 14:13 — with GitHub Actions Inactive
@denik denik temporarily deployed to test-trigger-is May 1, 2026 14:13 — with GitHub Actions Inactive
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants