Skip to content

[SPARK-57083][SQL] Preserve geography SRID across encoders, Parquet readers, and Python conversion#56127

Closed
uros-db wants to merge 2 commits into
apache:masterfrom
uros-db:geography-srids
Closed

[SPARK-57083][SQL] Preserve geography SRID across encoders, Parquet readers, and Python conversion#56127
uros-db wants to merge 2 commits into
apache:masterfrom
uros-db:geography-srids

Conversation

@uros-db
Copy link
Copy Markdown
Contributor

@uros-db uros-db commented May 26, 2026

What changes were proposed in this pull request?

Now that GeographyType supports per-row SRIDs, this PR fixes several code paths that were silently dropping the SRID when materializing a GeographyVal from WKB and manually substituting the default value (4326).

Why are the changes needed?

Without these fixes, geographies read from Arrow, Parquet, Python, or constructed via CatalystTypeConverters.convertToCatalyst had their SRID rewritten to the default 4326, regardless of the column's declared SRID or the value's actual SRID. This produced silent data corruption for any geography stored with a non-default geographic SRID.

Does this PR introduce any user-facing change?

Yes — bug fix. Reading a Parquet column declared as GeographyType(<non-default SRID>), materializing geographies from Arrow/Python, or converting Geography values via CatalystTypeConverters.convertToCatalyst now yields values whose SRID matches the column's declared SRID. Previously the SRID was always 4326.

How was this patch tested?

Updated relevant test suites and added appropriate unit test cases.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Cursor (Claude Opus 4.7)

@uros-db uros-db changed the title [SQL] Preserve geography SRID across encoders, Parquet readers, and Python conversion [SPARK-57083][SQL] Preserve geography SRID across encoders, Parquet readers, and Python conversion May 26, 2026
Copy link
Copy Markdown
Contributor Author

@uros-db uros-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cloud-fan Please review. This should go into 4.2.0, in order to avoid data corruption for geographies stored with a non-default geographic SRID.

Copy link
Copy Markdown
Contributor

@cloud-fan cloud-fan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary

Prior state and problem. Spark already supports per-row geographic SRIDs in GeographyType (e.g. GeographyType(4267), GeographyType(ANY)), and the geometry side of the type system already threads SRIDs end-to-end via STUtils.stGeomFromWKB(wkb, srid). On the geography side, however, five read paths were materialising GeographyVals via the single-arg STUtils.stGeogFromWKB(wkb), which internally delegates to Geography.fromWkb(wkb) and unconditionally substitutes DEFAULT_SRID = 4326. The column's declared SRID (and any per-row SRID from Arrow/Python) was discarded, so reads of a GeographyType(<non-4326>) Parquet column, Arrow-backed geographies, Python-side conversions, and CatalystTypeConverters.convertToCatalyst(Geography) all produced values whose stored SRID was silently 4326 — visible to users via ST_SRID(...) and as data corruption for any geography with a non-default geographic SRID.

Design approach. Mirror the existing geometry pattern exactly: introduce a new STUtils.stGeogFromWKB(byte[] wkb, int srid) overload that validates the SRID via GeographyType.isSridSupported (throwing ST_INVALID_SRID_VALUE otherwise) and constructs the Geography via Geography.fromWkb(wkb, srid). Then thread the SRID through each of the five callsites:

  • ArrowColumnVector.getGeography — uses the per-row SRID from child(0) (already validated by gt.assertSridAllowedForType).
  • CatalystTypeConverters.convertToCatalyst(Geography) — uses g.getSrid.
  • EvaluatePython Geography case — uses the per-row SRID from the Python repr (already validated by assertSridAllowedForType).
  • WKBToGeographyConverter.convert (vectorized Parquet reader) — uses the column-type SRID already being passed in but previously dropped.
  • ParquetRowConverter.ParquetGeographyConverter — now takes the column-type SRID as a constructor arg, mirroring ParquetGeometryConverter.

Key design decisions.

  • Validate or silently coerce? The new overload throws ST_INVALID_SRID_VALUE for non-geographic SRIDs (e.g. 0, 3857) rather than coercing to 4326. This is the right call — silent coercion was the source of the bug — and is consistent with the geometry sibling. The new convertToCatalyst with Geography with invalid SRID test locks this in.
  • Where does the SRID come from? Column-type SRID for Parquet (row + vectorized), per-row SRID for Arrow and Python. Matches what the geometry path already does.

Implementation sketch. Each fixed callsite is a literal parallel of the existing geometry callsite — no new structural decisions, no layer changes, no new branches. The only renaming is a local geometrygeography in ParquetGeographyConverter.addBinary, which was misnamed.

Tests. Unit-level coverage for STUtils.stGeogFromWKB(wkb, srid) (valid + invalid SRID lists); CatalystTypeConverters coverage for non-default and invalid SRIDs; end-to-end ParquetGeoSuite round-trip covering both the row-based reader (ParquetGeographyConverter) and the vectorized reader (WKBToGeographyConverter), plus a dedicated dictionary-encoding case that exercises the setDictionary path and the vectorized dictionary path. Two read paths (ArrowColumnVector.getGeography and EvaluatePython Geography case) are fixed but not directly covered by new tests — see inline comment.

LGTM modulo the small test-coverage observation below.

@uros-db
Copy link
Copy Markdown
Contributor Author

uros-db commented May 27, 2026

@cloud-fan Comments addressed and all checks green. Could you please merge into master/4.x/4.2?

@cloud-fan
Copy link
Copy Markdown
Contributor

thanks, merging to master/4.x/4.2!

@cloud-fan cloud-fan closed this in 4a61083 May 27, 2026
cloud-fan pushed a commit that referenced this pull request May 27, 2026
…eaders, and Python conversion

### What changes were proposed in this pull request?
Now that `GeographyType` supports per-row SRIDs, this PR fixes several code paths that were silently dropping the SRID when materializing a `GeographyVal` from WKB and manually substituting the default value (`4326`).

### Why are the changes needed?
Without these fixes, geographies read from Arrow, Parquet, Python, or constructed via `CatalystTypeConverters.convertToCatalyst` had their SRID rewritten to the default 4326, regardless of the column's declared SRID or the value's actual SRID. This produced silent data corruption for any geography stored with a non-default geographic SRID.

### Does this PR introduce _any_ user-facing change?
Yes — bug fix. Reading a Parquet column declared as `GeographyType(<non-default SRID>)`, materializing geographies from Arrow/Python, or converting Geography values via `CatalystTypeConverters.convertToCatalyst` now yields values whose SRID matches the column's declared SRID. Previously the SRID was always 4326.

### How was this patch tested?
Updated relevant test suites and added appropriate unit test cases.

### Was this patch authored or co-authored using generative AI tooling?
Generated-by: Cursor (Claude Opus 4.7)

Closes #56127 from uros-db/geography-srids.

Authored-by: Uros Bojanic <uros.bojanic@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 4a61083)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
cloud-fan pushed a commit that referenced this pull request May 27, 2026
…eaders, and Python conversion

### What changes were proposed in this pull request?
Now that `GeographyType` supports per-row SRIDs, this PR fixes several code paths that were silently dropping the SRID when materializing a `GeographyVal` from WKB and manually substituting the default value (`4326`).

### Why are the changes needed?
Without these fixes, geographies read from Arrow, Parquet, Python, or constructed via `CatalystTypeConverters.convertToCatalyst` had their SRID rewritten to the default 4326, regardless of the column's declared SRID or the value's actual SRID. This produced silent data corruption for any geography stored with a non-default geographic SRID.

### Does this PR introduce _any_ user-facing change?
Yes — bug fix. Reading a Parquet column declared as `GeographyType(<non-default SRID>)`, materializing geographies from Arrow/Python, or converting Geography values via `CatalystTypeConverters.convertToCatalyst` now yields values whose SRID matches the column's declared SRID. Previously the SRID was always 4326.

### How was this patch tested?
Updated relevant test suites and added appropriate unit test cases.

### Was this patch authored or co-authored using generative AI tooling?
Generated-by: Cursor (Claude Opus 4.7)

Closes #56127 from uros-db/geography-srids.

Authored-by: Uros Bojanic <uros.bojanic@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 4a61083)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants