Skip to content

Non-breaking plan fails when removed column is part of existing clustering key (Snowflake) #5813

@nickmuoh

Description

@nickmuoh

Summary

When a FULL model removes both a column and its clustered_by clause in the same edit, SQLMesh categorizes the change as non-breaking and attempts an in-place ALTER TABLE ... DROP COLUMN against the existing physical table. On Snowflake, this fails because the column it is trying to drop is still part of the clustering key on the existing physical table.

The plan aborts mid-apply with a partial state: some sibling models in the same plan have already had their physical tables created/cloned, while everything downstream of the failing model is skipped.

Environment

  • sqlmesh version: 0.231.1
  • Engine: Snowflake
  • Model kind: FULL
  • Plan flags: --forward-only --skip-backfill --allow-destructive-model <other_model>

Scenario

A FULL model — call it ANALYTICS.SESSION_FACTS — was previously deployed with a clustering key that references a materialized column:

MODEL (
  name ANALYTICS.SESSION_FACTS,
  kind FULL,
  clustered_by (TENANT_ID, PARTITION_MONTH),
  ...
);

SELECT
  tenant_id,
  user_id,
  session_start_ts,
  date_trunc('month', session_start_ts) AS partition_month,
  ...
FROM ...

The deployed physical table has PARTITION_MONTH as a stored column that participates in the clustering key.

The model is then edited to:

  1. Remove clustered_by entirely.
  2. Stop projecting PARTITION_MONTH (it is no longer in the SELECT).
MODEL (
  name ANALYTICS.SESSION_FACTS,
  kind FULL,
  -- clustered_by removed
  ...
);

SELECT
  tenant_id,
  user_id,
  session_start_ts,
  -- partition_month no longer projected
  ...
FROM ...

Root Cause: CLONE inherits the clustering key

SQLMesh's non-breaking code path for FULL models uses Snowflake's CREATE TABLE ... CLONE to create the new physical snapshot from the existing one:

CREATE TABLE IF NOT EXISTS ANALYTICS.sqlmesh__ANALYTICS_SILVER.SESSION_FACTS__<new_fingerprint>__dev
CLONE ANALYTICS.sqlmesh__ANALYTICS_SILVER.SESSION_FACTS__<old_fingerprint>

CLONE in Snowflake copies the full table metadata, including clustering keys. SQLMesh then attempts to ALTER TABLE ... DROP COLUMN PARTITION_MONTH on the cloned snapshot — which fails because the inherited clustering key still references that column.

On failure, SQLMesh rolls back by dropping the cloned snapshot entirely. This means:

  • Manually ALTERing the cloned target is not possible — by the time the error surfaces, the rollback has already dropped the new snapshot.
  • Re-running the plan repeats the loop: clone from source → ALTER fail → drop. Every subsequent attempt fails identically.

Expected Behavior

SQLMesh should either:

  1. Sequence the destructive ALTERs correctly — after cloning, drop the inherited clustering key first (ALTER TABLE ... DROP CLUSTERING KEY), then drop the now-orphaned column. Snowflake permits this when the order is correct.
  2. Reclassify the change as breaking when it detects that the column to be dropped participates in the existing clustering key, so a new physical table version is created via CTAS (not CLONE) — avoiding the inherited key entirely.
  3. Surface a more actionable error that identifies the source snapshot whose clustering key must be manually dropped before the plan can succeed.

Currently, SQLMesh attempts the column drop in isolation and surfaces Snowflake's raw error without any guidance on how to recover.

Actual Behavior

The plan fails partway through the physical layer update with:

Failed models

  "ANALYTICS"."SESSION_FACTS"

    ProgrammingError:
      003203 (42601): <query_id>: Cannot drop column 'PARTITION_MONTH' which belongs to a clustering key

Error: Plan application failed.

By the time the error appears, ~17% of the physical layer update had already completed (sibling models cloned/created). The remaining models — including all downstream consumers of SESSION_FACTS — are reported as skipped.

Reproduction Steps

  1. Deploy a FULL model that projects a derived column (e.g. DATE_TRUNC('MONTH', some_ts) AS partition_month) and references that materialized column directly in clustered_by. Apply the plan; let the physical table be created with the clustering key on that column.
  2. Edit the model to (a) remove clustered_by and (b) stop projecting the materialized column.
  3. Run sqlmesh plan <env> --forward-only --skip-backfill.
  4. Observe SQLMesh categorizing the change as non-breaking, cloning the existing physical snapshot, and then attempting ALTER TABLE ... DROP COLUMN PARTITION_MONTH on the clone, which Snowflake rejects with error 003203 (42601).
  5. Observe SQLMesh rolling back by dropping the newly cloned snapshot.
  6. Re-run the plan — observe the loop repeats identically.

Workarounds

Only one workaround is effective:

Manually drop the clustering key on the source snapshot (the one being cloned from — typically the snapshot pointed to by the production environment's view) before re-running the plan:

ALTER TABLE ANALYTICS.sqlmesh__ANALYTICS_SILVER.SESSION_FACTS__<prod_fingerprint>
  DROP CLUSTERING KEY;

Once the source snapshot no longer carries the clustering key, the next CLONE produces a snapshot without it, and the subsequent ALTER TABLE ... DROP COLUMN succeeds.

⚠️ This directly mutates a production physical snapshot. Snowflake's SELECT * view semantics are unaffected (the view still resolves correctly), but Snowflake's automatic clustering service stops operating on that table.

Attempted but ineffective workarounds:

  • --allow-destructive-model ANALYTICS.SESSION_FACTS — does not bypass the CLONE code path for FULL models in this scenario. The plan still clones the source snapshot, inheriting the clustering key, and the ALTER still fails.
  • Manually ALTER TABLE ... DROP CLUSTERING KEY on the cloned target — not possible because SQLMesh's rollback drops the cloned snapshot before the user can intervene.

Why This Matters

The change appears trivial from the model author's perspective — drop a clustering key, drop a derived column. The plan summary categorizes it as non-breaking, suggesting it should apply cleanly. The root issue is that SQLMesh's CLONE-based non-breaking code path faithfully copies the source's clustering constraints, but then issues ALTERs that conflict with those inherited constraints without accounting for the required sequencing.

The partial-apply outcome compounds the problem: the plan is left in an inconsistent state where some new physical tables exist but downstream consumers are stuck pointing at old snapshots, and the only recovery path requires directly mutating a production physical table.

Metadata

Metadata

Assignees

No one assigned

    Labels

    BugSomething isn't workingEngine: SnowflakeIssues related to Snowflake

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions