[SPARK-57068][SQL] Make SaveMode.Overwrite create the table when missing for SupportsCatalogOptions sources#56111
Open
LuciferYang wants to merge 1 commit into
LuciferYang
commented
May 26, 2026
.doc("When set to true, SaveMode.Overwrite against a missing table on a " +
  "SupportsCatalogOptions source throws NoSuchTableException instead of " +
  "creating the table. Restores the pre-SPARK-57068 behavior.")
.version("4.3.0")
Contributor
Author
4.3.0 or 4.2.0?
…ing for SupportsCatalogOptions sources
### What changes were proposed in this pull request?
In `DataFrameWriter.saveCommand`, the `SaveMode.Append | SaveMode.Overwrite`
branch calls `catalog.loadTable(ident)` without catching `NoSuchTableException`
when the V2 source implements `SupportsCatalogOptions`. The exception
propagates straight to the user, even though `SaveMode.ErrorIfExists` and
`SaveMode.Ignore` on the same call succeed by routing to `CreateTableAsSelect`.
This change catches `NoSuchTableException` for `SaveMode.Overwrite` only and
routes to `CreateTableAsSelect(ignoreIfExists = false)`, mirroring the
`createMode` arm immediately below. `SaveMode.Append` on a non-existent
identifier intentionally continues to throw, because Append explicitly expects
an existing table and silently creating would mask user mistakes.
A new internal SQL conf `spark.sql.legacy.dataFrameWriter.overwriteOnMissingTableThrows`
restores the pre-fix behavior for users who depend on it.
The `CreateTableAsSelect` construction shared between the new fall-back path
and the existing `createMode` arm is extracted into a private helper
`createTableAsSelectForCatalogOptions` to keep both sites in sync.
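The routing described above can be modeled without Spark. The following is a minimal sketch, not the actual `DataFrameWriter` code: `ToyCatalog`, `Mode`, `Plan`, `OverwriteData`, `planWrite`, and the `legacyThrow` parameter are hypothetical stand-ins introduced for illustration, and the `ignoreIfExists` values for the `ErrorIfExists` and `Ignore` arms are an assumption about the existing `createMode` arm.

```scala
object OverwriteFallbackSketch {
  // Simplified stand-ins for illustration; these are NOT Spark's real classes.
  sealed trait Mode
  case object Append extends Mode
  case object Overwrite extends Mode
  case object ErrorIfExists extends Mode
  case object Ignore extends Mode

  sealed trait Plan
  case object AppendData extends Plan
  case object OverwriteData extends Plan // hypothetical name for the overwrite plan
  final case class CreateTableAsSelect(ignoreIfExists: Boolean) extends Plan

  final class NoSuchTableException(ident: String)
    extends RuntimeException(s"Table not found: $ident")

  // Toy catalog: loadTable throws when the identifier is unknown.
  final class ToyCatalog(existing: Set[String]) {
    def loadTable(ident: String): String =
      if (existing.contains(ident)) ident else throw new NoSuchTableException(ident)
  }

  // legacyThrow models the new legacy conf being set to true.
  def planWrite(catalog: ToyCatalog, ident: String, mode: Mode, legacyThrow: Boolean): Plan =
    mode match {
      case Append =>
        // Append expects an existing table, so a missing table still throws.
        catalog.loadTable(ident)
        AppendData
      case Overwrite =>
        try {
          catalog.loadTable(ident)
          OverwriteData
        } catch {
          // Fall back to creating the table unless the legacy flag is on;
          // with the flag on, the guard fails and the exception propagates.
          case _: NoSuchTableException if !legacyThrow =>
            CreateTableAsSelect(ignoreIfExists = false)
        }
      // Assumed shape of the existing createMode arm.
      case ErrorIfExists => CreateTableAsSelect(ignoreIfExists = false)
      case Ignore => CreateTableAsSelect(ignoreIfExists = true)
    }
}
```

In this model, only the `Overwrite` arm consults the flag, which matches the description that `Append` on a missing identifier keeps throwing regardless of the conf.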
### Why are the changes needed?
The most idiomatic write call for any V2 connector,
`df.write.format(provider).mode("overwrite").save(newPath)`
fails with `NoSuchTableException` when `newPath` does not yet exist, whereas
the equivalent V1 call (e.g. `format("parquet")`) succeeds by creating the
table. V2 sources that implement `SupportsCatalogOptions` (Iceberg, Lance, and
custom connectors) all hit this asymmetry. The fix aligns V2
`SaveMode.Overwrite` semantics with V1: overwrite-on-missing creates the
table, overwrite-on-existing truncates and writes.
Behavior matrix after this change:
| Mode × Target | V1 | V2 before | V2 after |
|------------------------|---------------|--------------|------------|
| Overwrite, missing | creates | **throws** | creates |
| Overwrite, existing | truncate+write| overwrite | unchanged |
| Append, missing | creates | throws | throws* |
| Append, existing | append | append | unchanged |
| ErrorIfExists, missing | creates | creates | unchanged |
| ErrorIfExists, existing| throws | throws | unchanged |
| Ignore, missing | creates | creates | unchanged |
| Ignore, existing | no-op | no-op | unchanged |
\* Intentional V1 divergence — see PR description.
There is an inherent race window between `loadTable` (which throws) and
`CreateTableAsSelect`: if a concurrent writer creates the table in between,
the write fails with `TableAlreadyExistsException` rather than overwriting.
This is acceptable; V1's filesystem-atomic path doesn't expose it because V1
never consults a catalog. Users can retry the write.
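One way a caller might handle the race is a bounded retry around the write. This is a sketch under stated assumptions, not part of the PR: `withRetry` is a hypothetical helper, and the `TableAlreadyExistsException` class here is a local stand-in for the exception Spark would raise.

```scala
import scala.util.{Failure, Success, Try}

object RetryOnRaceSketch {
  // Local stand-in for Spark's TableAlreadyExistsException (illustration only).
  final class TableAlreadyExistsException(msg: String) extends RuntimeException(msg)

  // Runs `write`, retrying only when it fails because another writer
  // created the table first. Any other failure is rethrown immediately.
  @scala.annotation.tailrec
  def withRetry[A](attemptsLeft: Int)(write: () => A): A =
    Try(write()) match {
      case Success(value) => value
      case Failure(_: TableAlreadyExistsException) if attemptsLeft > 1 =>
        withRetry(attemptsLeft - 1)(write)
      case Failure(other) => throw other
    }
}
```

On the second attempt the table already exists, so an overwrite would be expected to take the `loadTable` path and replace the data written by the concurrent writer; whether that outcome is desirable depends on the application.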
### Does this PR introduce _any_ user-facing change?
Yes. `df.write.format(<V2 SupportsCatalogOptions source>).mode("overwrite")
.save(<new identifier>)` now creates the table instead of throwing
`NoSuchTableException`. No behavior change for paths that already exist. The
migration guide has been updated. The legacy flag
`spark.sql.legacy.dataFrameWriter.overwriteOnMissingTableThrows` restores the
prior behavior.
### How was this patch tested?
New tests in `SupportsCatalogOptionsSuite`:
- `save works with Overwrite - no table, no partitioning, session catalog`
- `save works with Overwrite - no table, with partitioning, session catalog`
- `save works with Overwrite - no table, no partitioning, testcat catalog`
- `save works with Overwrite - no table, with partitioning, testcat catalog`
These reuse the existing `testCreateAndRead` helper, which verifies catalog
state (table identity, partitioning, columns) in addition to data.
Plus three behavior-pinning tests:
- `Append mode still fails when table is missing - testcat catalog` (pins
the intentional Append divergence)
- `legacy flag restores throw on Overwrite-missing` (verifies the new conf)
- `Overwrite + withSchemaEvolution on missing table is rejected` (verifies
the schema-evolution gate fires with the expected error class)
Existing tests continue to pass.
### Was this patch authored or co-authored using generative AI tooling?
No.
4a2a359 to
0b3a0ef
### What changes were proposed in this pull request?
`DataFrameWriter.saveCommand` calls `catalog.loadTable(ident)` for `SaveMode.Append | SaveMode.Overwrite` without a try/catch when the V2 source implements `SupportsCatalogOptions`, so writing to a brand-new identifier throws `NoSuchTableException`.
This PR catches `NoSuchTableException` in the `Overwrite` case and falls back to `CreateTableAsSelect`. `Append` keeps throwing: it documents "append to existing data", and silently creating would hide bugs. A new internal conf `spark.sql.legacy.dataFrameWriter.overwriteOnMissingTableThrows` flips back to the old behavior. The `CreateTableAsSelect` plan that's now built in two places is extracted into a private helper.
### Why are the changes needed?
`df.write.format(provider).mode("overwrite").save("/some/new/path")` should work on a brand-new path the same way it does for parquet/json/orc. Today it throws `NoSuchTableException` for any V2 source implementing `SupportsCatalogOptions` (Iceberg, Lance, custom connectors). The asymmetry has been around since SPARK-29219 introduced `SupportsCatalogOptions` in Spark 3.0; the V1 branch never goes through `loadTable`, so this only shows up on the V2 path.
In the behavior matrix after this PR, only the first row (Overwrite on a missing table) changes. Append-on-missing intentionally stays a strict failure; aligning it with V1 would silently create a table the user expected to already exist.
### Does this PR introduce _any_ user-facing change?
Yes. `mode("overwrite").save(<missing identifier>)` now creates the table, and the migration guide is updated to reflect this. Setting `spark.sql.legacy.dataFrameWriter.overwriteOnMissingTableThrows=true` restores the old behavior. `mode("append")` on a missing table still throws.
### How was this patch tested?
New tests in `SupportsCatalogOptionsSuite`. Four reuse the existing `testCreateAndRead` helper, so Overwrite-on-missing now has the same coverage shape (session/testcat catalog, with and without partitioning) as the existing `ErrorIfExists` and `Ignore` tests. Three more pin specific edges: Append-on-missing still throws, the legacy conf restores the throw, and `withSchemaEvolution() + mode("overwrite")` on a missing table raises `UNSUPPORTED_SCHEMA_EVOLUTION.CREATE_TABLE`.
### Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Code