[SPARK-57083][SQL] Preserve geography SRID across encoders, Parquet readers, and Python conversion by uros-db · Pull Request #56127 · apache/spark

uros-db · 2026-05-26T18:36:47Z

What changes were proposed in this pull request?

Now that GeographyType supports per-row SRIDs, this PR fixes several code paths that were silently dropping the SRID when materializing a GeographyVal from WKB and manually substituting the default value (4326).

Why are the changes needed?

Without these fixes, geographies read from Arrow, Parquet, Python, or constructed via CatalystTypeConverters.convertToCatalyst had their SRID rewritten to the default 4326, regardless of the column's declared SRID or the value's actual SRID. This produced silent data corruption for any geography stored with a non-default geographic SRID.

Does this PR introduce any user-facing change?

Yes — bug fix. Reading a Parquet column declared as GeographyType(<non-default SRID>), materializing geographies from Arrow/Python, or converting Geography values via CatalystTypeConverters.convertToCatalyst now yields values whose SRID matches the column's declared SRID. Previously the SRID was always 4326.

How was this patch tested?

Updated relevant test suites and added appropriate unit test cases.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Cursor (Claude Opus 4.7)

uros-db

@cloud-fan Please review. This should go into 4.2.0, in order to avoid data corruption for geographies stored with a non-default geographic SRID.

cloud-fan

Summary

Prior state and problem. Spark already supports per-row geographic SRIDs in GeographyType (e.g. GeographyType(4267), GeographyType(ANY)), and the geometry side of the type system already threads SRIDs end-to-end via STUtils.stGeomFromWKB(wkb, srid). On the geography side, however, five read paths were materialising GeographyVals via the single-arg STUtils.stGeogFromWKB(wkb), which internally delegates to Geography.fromWkb(wkb) and unconditionally substitutes DEFAULT_SRID = 4326. The column's declared SRID (and any per-row SRID from Arrow/Python) was discarded, so reads of a GeographyType(<non-4326>) Parquet column, Arrow-backed geographies, Python-side conversions, and CatalystTypeConverters.convertToCatalyst(Geography) all produced values whose stored SRID was silently 4326 — visible to users via ST_SRID(...) and as data corruption for any geography with a non-default geographic SRID.

Design approach. Mirror the existing geometry pattern exactly: introduce a new STUtils.stGeogFromWKB(byte[] wkb, int srid) overload that validates the SRID via GeographyType.isSridSupported (throwing ST_INVALID_SRID_VALUE otherwise) and constructs the Geography via Geography.fromWkb(wkb, srid). Then thread the SRID through each of the five callsites:

ArrowColumnVector.getGeography — uses the per-row SRID from child(0) (already validated by gt.assertSridAllowedForType).
CatalystTypeConverters.convertToCatalyst(Geography) — uses g.getSrid.
EvaluatePython Geography case — uses the per-row SRID from the Python repr (already validated by assertSridAllowedForType).
WKBToGeographyConverter.convert (vectorized Parquet reader) — uses the column-type SRID already being passed in but previously dropped.
ParquetRowConverter.ParquetGeographyConverter — now takes the column-type SRID as a constructor arg, mirroring ParquetGeometryConverter.

Key design decisions.

Validate or silently coerce? The new overload throws ST_INVALID_SRID_VALUE for non-geographic SRIDs (e.g. 0, 3857) rather than coercing to 4326. This is the right call — silent coercion was the source of the bug — and is consistent with the geometry sibling. The new convertToCatalyst with Geography with invalid SRID test locks this in.
Where does the SRID come from? Column-type SRID for Parquet (row + vectorized), per-row SRID for Arrow and Python. Matches what the geometry path already does.

Implementation sketch. Each fixed callsite is a literal parallel of the existing geometry callsite — no new structural decisions, no layer changes, no new branches. The only renaming is a local geometry → geography in ParquetGeographyConverter.addBinary, which was misnamed.

Tests. Unit-level coverage for STUtils.stGeogFromWKB(wkb, srid) (valid + invalid SRID lists); CatalystTypeConverters coverage for non-default and invalid SRIDs; end-to-end ParquetGeoSuite round-trip covering both the row-based reader (ParquetGeographyConverter) and the vectorized reader (WKBToGeographyConverter), plus a dedicated dictionary-encoding case that exercises the setDictionary path and the vectorized dictionary path. Two read paths (ArrowColumnVector.getGeography and EvaluatePython Geography case) are fixed but not directly covered by new tests — see inline comment.

LGTM modulo the small test-coverage observation below.

uros-db · 2026-05-27T08:57:22Z

@cloud-fan Comments addressed and all checks green. Could you please merge into master/4.x/4.2?

cloud-fan · 2026-05-27T10:44:33Z

thanks, merging to master/4.x/4.2!

…eaders, and Python conversion ### What changes were proposed in this pull request? Now that `GeographyType` supports per-row SRIDs, this PR fixes several code paths that were silently dropping the SRID when materializing a `GeographyVal` from WKB and manually substituting the default value (`4326`). ### Why are the changes needed? Without these fixes, geographies read from Arrow, Parquet, Python, or constructed via `CatalystTypeConverters.convertToCatalyst` had their SRID rewritten to the default 4326, regardless of the column's declared SRID or the value's actual SRID. This produced silent data corruption for any geography stored with a non-default geographic SRID. ### Does this PR introduce _any_ user-facing change? Yes — bug fix. Reading a Parquet column declared as `GeographyType(<non-default SRID>)`, materializing geographies from Arrow/Python, or converting Geography values via `CatalystTypeConverters.convertToCatalyst` now yields values whose SRID matches the column's declared SRID. Previously the SRID was always 4326. ### How was this patch tested? Updated relevant test suites and added appropriate unit test cases. ### Was this patch authored or co-authored using generative AI tooling? Generated-by: Cursor (Claude Opus 4.7) Closes #56127 from uros-db/geography-srids. Authored-by: Uros Bojanic <uros.bojanic@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 4a61083) Signed-off-by: Wenchen Fan <wenchen@databricks.com>

Initial commit

85394a6

uros-db changed the title ~~[SQL] Preserve geography SRID across encoders, Parquet readers, and Python conversion~~ [SPARK-57083][SQL] Preserve geography SRID across encoders, Parquet readers, and Python conversion May 26, 2026

uros-db commented May 26, 2026

View reviewed changes

cloud-fan approved these changes May 27, 2026

View reviewed changes

Comment thread sql/catalyst/src/main/java/org/apache/spark/sql/vectorized/ArrowColumnVector.java

Address comment

ca72a67

cloud-fan closed this in 4a61083 May 27, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SPARK-57083][SQL] Preserve geography SRID across encoders, Parquet readers, and Python conversion#56127

[SPARK-57083][SQL] Preserve geography SRID across encoders, Parquet readers, and Python conversion#56127
uros-db wants to merge 2 commits into
apache:masterfrom
uros-db:geography-srids

uros-db commented May 26, 2026

Uh oh!

uros-db left a comment

Uh oh!

cloud-fan left a comment

Uh oh!

Uh oh!

uros-db commented May 27, 2026

Uh oh!

cloud-fan commented May 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

uros-db commented May 26, 2026

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

Uh oh!

uros-db left a comment

Choose a reason for hiding this comment

Uh oh!

cloud-fan left a comment

Choose a reason for hiding this comment

Summary

Uh oh!

Uh oh!

uros-db commented May 27, 2026

Uh oh!

cloud-fan commented May 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants