[SPARK-57068][SQL] Make SaveMode.Overwrite create the table when missing for SupportsCatalogOptions sources #56111

Open: LuciferYang wants to merge 1 commit into apache:master from LuciferYang:fix-overwrite-nonexistent-catalog-options

LuciferYang commented May 26, 2026

What changes were proposed in this pull request?

DataFrameWriter.saveCommand calls catalog.loadTable(ident) for SaveMode.Append | SaveMode.Overwrite without a try/catch when the V2 source implements SupportsCatalogOptions, so writing to a brand-new identifier throws:

org.apache.spark.sql.catalyst.analysis.NoSuchTableException: ...
  at DataFrameWriter.saveCommand(DataFrameWriter.scala:179)

This PR catches NoSuchTableException in the Overwrite case and falls back to CreateTableAsSelect. Append keeps throwing: it documents "append to existing data", and silently creating would hide bugs. A new internal conf spark.sql.legacy.dataFrameWriter.overwriteOnMissingTableThrows flips back to the old behavior. The CreateTableAsSelect plan that's now built in two places is extracted into a private helper.

Why are the changes needed?

df.write.format(provider).mode("overwrite").save("/some/new/path") should work on a brand-new path the same way it does for parquet/json/orc. Today it throws NoSuchTableException for any V2 source implementing SupportsCatalogOptions (Iceberg, Lance, custom connectors). The asymmetry has been around since SPARK-29219 introduced SupportsCatalogOptions in Spark 3.0 — the V1 branch never goes through loadTable, so this only shows up on the V2 path. The full behavior matrix after this PR:

| Mode × target           | V1               | V2 before | V2 after  |
|-------------------------|------------------|-----------|-----------|
| Overwrite, missing      | creates          | throws    | creates   |
| Overwrite, existing     | truncate + write | overwrite | unchanged |
| Append, missing         | creates          | throws    | throws    |
| Append, existing        | append           | append    | unchanged |
| ErrorIfExists, missing  | creates          | creates   | unchanged |
| ErrorIfExists, existing | throws           | throws    | unchanged |
| Ignore, missing         | creates          | creates   | unchanged |
| Ignore, existing        | no-op            | no-op     | unchanged |

Only the first row changes. Append-on-missing intentionally stays a strict failure; aligning it with V1 would silently create a table the user expected to already exist.

Does this PR introduce any user-facing change?

Yes. mode("overwrite").save(<missing identifier>) now creates the table, and the migration guide is updated to reflect this. Setting spark.sql.legacy.dataFrameWriter.overwriteOnMissingTableThrows=true restores the old behavior. mode("append") on a missing table still throws.

How was this patch tested?

New tests in SupportsCatalogOptionsSuite. Four reuse the existing testCreateAndRead helper, so Overwrite-on-missing now has the same coverage shape (session/testcat catalog, with and without partitioning) as the existing ErrorIfExists and Ignore tests. Three more pin specific edges: Append-on-missing still throws, the legacy conf restores the throw, and withSchemaEvolution() + mode("overwrite") on a missing table raises UNSUPPORTED_SCHEMA_EVOLUTION.CREATE_TABLE.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code

.doc("When set to true, SaveMode.Overwrite against a missing table on a " +
"SupportsCatalogOptions source throws NoSuchTableException instead of " +
"creating the table. Restores the pre-SPARK-57068 behavior.")
.version("4.3.0")
Review comment from the PR author on the `.version("4.3.0")` line: 4.3.0 or 4.2.0?

…ing for SupportsCatalogOptions sources

### What changes were proposed in this pull request?

In `DataFrameWriter.saveCommand`, the `SaveMode.Append | SaveMode.Overwrite`
branch calls `catalog.loadTable(ident)` without catching `NoSuchTableException`
when the V2 source implements `SupportsCatalogOptions`. The exception
propagates straight to the user, even though `SaveMode.ErrorIfExists` and
`SaveMode.Ignore` on the same call succeed by routing to `CreateTableAsSelect`.

This change catches `NoSuchTableException` for `SaveMode.Overwrite` only and
routes to `CreateTableAsSelect(ignoreIfExists = false)`, mirroring the
`createMode` arm immediately below. `SaveMode.Append` on a non-existent
identifier intentionally continues to throw, because Append explicitly expects
an existing table and silently creating would mask user mistakes.

A new internal SQL conf `spark.sql.legacy.dataFrameWriter.overwriteOnMissingTableThrows`
restores the pre-fix behavior for users who depend on it.

The `CreateTableAsSelect` construction shared between the new fall-back path
and the existing `createMode` arm is extracted into a private helper
`createTableAsSelectForCatalogOptions` to keep both sites in sync.
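
The control flow described above can be sketched with a toy model. This is not the code in `DataFrameWriter.saveCommand`: `ToyCatalog`, `NoSuchTable`, `AppendExisting`, `OverwriteExisting`, and `SaveCommandSketch` are hypothetical stand-ins, the `legacyThrows` parameter plays the role of the legacy conf, and `ignoreIfExists = true` for `Ignore` is an assumption about the existing `createMode` arm.

```scala
// Toy model only: every type here is a stand-in, not a Spark class.
sealed trait Mode
case object Append extends Mode
case object Overwrite extends Mode
case object ErrorIfExists extends Mode
case object Ignore extends Mode

sealed trait Plan
case object AppendExisting extends Plan
case object OverwriteExisting extends Plan
final case class CreateTableAsSelect(ignoreIfExists: Boolean) extends Plan

final class NoSuchTable(name: String) extends Exception(s"Table not found: $name")

// Hypothetical catalog: just a set of known table names.
final class ToyCatalog(tables: Set[String]) {
  def loadTable(name: String): String =
    if (tables.contains(name)) name else throw new NoSuchTable(name)
}

object SaveCommandSketch {
  // One place that builds the create plan, echoing the extracted helper.
  private def createPlan(ignoreIfExists: Boolean): Plan =
    CreateTableAsSelect(ignoreIfExists)

  def plan(mode: Mode, catalog: ToyCatalog, name: String, legacyThrows: Boolean): Plan =
    mode match {
      case Append =>
        // Append still propagates NoSuchTable for a missing table.
        catalog.loadTable(name)
        AppendExisting
      case Overwrite =>
        try {
          catalog.loadTable(name)
          OverwriteExisting
        } catch {
          // Fall back to create unless the legacy switch is on;
          // with the switch on, the guard fails and the exception propagates.
          case _: NoSuchTable if !legacyThrows => createPlan(ignoreIfExists = false)
        }
      case ErrorIfExists => createPlan(ignoreIfExists = false)
      case Ignore        => createPlan(ignoreIfExists = true) // assumption
    }
}
```

In this model, `SaveCommandSketch.plan(Overwrite, new ToyCatalog(Set.empty), "t", legacyThrows = false)` yields a create plan, while the same call with `Append` or with `legacyThrows = true` throws `NoSuchTable`.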

### Why are the changes needed?

The most idiomatic write call for any V2 connector,

    df.write.format(provider).mode("overwrite").save(newPath)

fails with `NoSuchTableException` when `newPath` does not yet exist, whereas
the equivalent V1 call (e.g. `format("parquet")`) succeeds by creating the
table. V2 sources that implement `SupportsCatalogOptions` (Iceberg, Lance, and
custom connectors) all hit this asymmetry. The fix aligns V2
`SaveMode.Overwrite` semantics with V1: overwrite-on-missing creates the
table, overwrite-on-existing truncates and writes.

Behavior matrix after this change:

| Mode × Target          | V1            | V2 before    | V2 after   |
|------------------------|---------------|--------------|------------|
| Overwrite, missing     | creates       | **throws**   | creates    |
| Overwrite, existing    | truncate+write| overwrite    | unchanged  |
| Append, missing        | creates       | throws       | throws*    |
| Append, existing       | append        | append       | unchanged  |
| ErrorIfExists, missing | creates       | creates      | unchanged  |
| ErrorIfExists, existing| throws        | throws       | unchanged  |
| Ignore, missing        | creates       | creates      | unchanged  |
| Ignore, existing       | no-op         | no-op        | unchanged  |

\* Intentional V1 divergence — see PR description.

There is an inherent race window between the failed `loadTable` call and
`CreateTableAsSelect`: if a concurrent writer creates the table in between,
the write fails with `TableAlreadyExistsException` instead of overwriting.
This is acceptable; V1's filesystem-atomic path doesn't expose it because V1
never consults a catalog. Users who hit the race can retry the write.
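
A caller-side retry for this race might look like the sketch below. It uses a toy in-memory catalog rather than Spark: `RaceCatalog`, `TableAlreadyExists`, `overwriteOnce`, and `retryOnExists` are hypothetical names, and the `betweenCheckAndCreate` hook exists only to simulate a concurrent writer landing in the window.

```scala
import scala.collection.mutable

final class TableAlreadyExists(name: String)
  extends Exception(s"Table already exists: $name")

// Hypothetical in-memory catalog; not Spark's TableCatalog.
final class RaceCatalog {
  private val tables = mutable.Map.empty[String, Seq[Int]]
  def exists(name: String): Boolean = tables.contains(name)
  def create(name: String, rows: Seq[Int]): Unit = {
    if (tables.contains(name)) throw new TableAlreadyExists(name)
    tables.update(name, rows)
  }
  def overwrite(name: String, rows: Seq[Int]): Unit = tables.update(name, rows)
  def read(name: String): Seq[Int] = tables(name)
}

object RetrySketch {
  // One attempt: check for the table, then overwrite or create.
  // The hook runs inside the race window, after the check and before create.
  def overwriteOnce(c: RaceCatalog, name: String, rows: Seq[Int])(
      betweenCheckAndCreate: () => Unit): Unit =
    if (c.exists(name)) c.overwrite(name, rows)
    else {
      betweenCheckAndCreate()
      c.create(name, rows)
    }

  // Re-run the body when it loses the race, up to attemptsLeft attempts in total.
  @annotation.tailrec
  def retryOnExists[A](attemptsLeft: Int)(body: () => A): A = {
    val result =
      try Some(body())
      catch { case _: TableAlreadyExists if attemptsLeft > 1 => None }
    result match {
      case Some(a) => a
      case None    => retryOnExists(attemptsLeft - 1)(body)
    }
  }
}
```

On the second attempt the table already exists, so the write takes the overwrite path and succeeds. The retry is bounded so that a persistent conflict still surfaces as an error.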

### Does this PR introduce _any_ user-facing change?

Yes. `df.write.format(<V2 SupportsCatalogOptions source>).mode("overwrite")
.save(<new identifier>)` now creates the table instead of throwing
`NoSuchTableException`. No behavior change for paths that already exist. The
migration guide has been updated. The legacy flag
`spark.sql.legacy.dataFrameWriter.overwriteOnMissingTableThrows` restores the
prior behavior.
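
For illustration, opting back into the old behavior could look like the fragment below. It assumes a Spark build that contains this change and that the flag is registered as a runtime SQL conf, which the description implies but does not state outright; the session setup is only there to make the fragment self-contained.

```scala
import org.apache.spark.sql.SparkSession

// Assumes an existing or default-configurable session.
val spark = SparkSession.builder().getOrCreate()

// Conf name taken from this PR; only meaningful on builds that include it.
spark.conf.set(
  "spark.sql.legacy.dataFrameWriter.overwriteOnMissingTableThrows", "true")
```

With the flag set, `mode("overwrite")` against a missing identifier is expected to throw `NoSuchTableException` again, as it did before this change.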

### How was this patch tested?

New tests in `SupportsCatalogOptionsSuite`:
- `save works with Overwrite - no table, no partitioning, session catalog`
- `save works with Overwrite - no table, with partitioning, session catalog`
- `save works with Overwrite - no table, no partitioning, testcat catalog`
- `save works with Overwrite - no table, with partitioning, testcat catalog`

These reuse the existing `testCreateAndRead` helper, which verifies catalog
state (table identity, partitioning, columns) in addition to data.

Plus three behavior-pinning tests:
- `Append mode still fails when table is missing - testcat catalog` (pins
  the intentional Append divergence)
- `legacy flag restores throw on Overwrite-missing` (verifies the new conf)
- `Overwrite + withSchemaEvolution on missing table is rejected` (verifies
  the schema-evolution gate fires with the expected error class)

Existing tests continue to pass.

### Was this patch authored or co-authored using generative AI tooling?

No.
LuciferYang force-pushed the fix-overwrite-nonexistent-catalog-options branch from 4a2a359 to 0b3a0ef on May 26, 2026 11:50