[SPARK-57083][SQL] Preserve geography SRID across encoders, Parquet readers, and Python conversion#56127
[SPARK-57083][SQL] Preserve geography SRID across encoders, Parquet readers, and Python conversion#56127uros-db wants to merge 2 commits into
Conversation
uros-db
left a comment
There was a problem hiding this comment.
@cloud-fan Please review. This should go into 4.2.0, in order to avoid data corruption for geographies stored with a non-default geographic SRID.
cloud-fan
left a comment
There was a problem hiding this comment.
Summary
Prior state and problem. Spark already supports per-row geographic SRIDs in GeographyType (e.g. GeographyType(4267), GeographyType(ANY)), and the geometry side of the type system already threads SRIDs end-to-end via STUtils.stGeomFromWKB(wkb, srid). On the geography side, however, five read paths were materialising GeographyVals via the single-arg STUtils.stGeogFromWKB(wkb), which internally delegates to Geography.fromWkb(wkb) and unconditionally substitutes DEFAULT_SRID = 4326. The column's declared SRID (and any per-row SRID from Arrow/Python) was discarded, so reads of a GeographyType(<non-4326>) Parquet column, Arrow-backed geographies, Python-side conversions, and CatalystTypeConverters.convertToCatalyst(Geography) all produced values whose stored SRID was silently 4326 — visible to users via ST_SRID(...) and as data corruption for any geography with a non-default geographic SRID.
Design approach. Mirror the existing geometry pattern exactly: introduce a new STUtils.stGeogFromWKB(byte[] wkb, int srid) overload that validates the SRID via GeographyType.isSridSupported (throwing ST_INVALID_SRID_VALUE otherwise) and constructs the Geography via Geography.fromWkb(wkb, srid). Then thread the SRID through each of the five callsites:
ArrowColumnVector.getGeography— uses the per-row SRID fromchild(0)(already validated bygt.assertSridAllowedForType).CatalystTypeConverters.convertToCatalyst(Geography)— usesg.getSrid.EvaluatePythonGeography case — uses the per-row SRID from the Python repr (already validated byassertSridAllowedForType).WKBToGeographyConverter.convert(vectorized Parquet reader) — uses the column-type SRID already being passed in but previously dropped.ParquetRowConverter.ParquetGeographyConverter— now takes the column-type SRID as a constructor arg, mirroringParquetGeometryConverter.
Key design decisions.
- Validate or silently coerce? The new overload throws
ST_INVALID_SRID_VALUEfor non-geographic SRIDs (e.g. 0, 3857) rather than coercing to 4326. This is the right call — silent coercion was the source of the bug — and is consistent with the geometry sibling. The newconvertToCatalyst with Geography with invalid SRIDtest locks this in. - Where does the SRID come from? Column-type SRID for Parquet (row + vectorized), per-row SRID for Arrow and Python. Matches what the geometry path already does.
Implementation sketch. Each fixed callsite is a literal parallel of the existing geometry callsite — no new structural decisions, no layer changes, no new branches. The only renaming is a local geometry → geography in ParquetGeographyConverter.addBinary, which was misnamed.
Tests. Unit-level coverage for STUtils.stGeogFromWKB(wkb, srid) (valid + invalid SRID lists); CatalystTypeConverters coverage for non-default and invalid SRIDs; end-to-end ParquetGeoSuite round-trip covering both the row-based reader (ParquetGeographyConverter) and the vectorized reader (WKBToGeographyConverter), plus a dedicated dictionary-encoding case that exercises the setDictionary path and the vectorized dictionary path. Two read paths (ArrowColumnVector.getGeography and EvaluatePython Geography case) are fixed but not directly covered by new tests — see inline comment.
LGTM modulo the small test-coverage observation below.
|
@cloud-fan Comments addressed and all checks green. Could you please merge into master/4.x/4.2? |
|
thanks, merging to master/4.x/4.2! |
…eaders, and Python conversion ### What changes were proposed in this pull request? Now that `GeographyType` supports per-row SRIDs, this PR fixes several code paths that were silently dropping the SRID when materializing a `GeographyVal` from WKB and manually substituting the default value (`4326`). ### Why are the changes needed? Without these fixes, geographies read from Arrow, Parquet, Python, or constructed via `CatalystTypeConverters.convertToCatalyst` had their SRID rewritten to the default 4326, regardless of the column's declared SRID or the value's actual SRID. This produced silent data corruption for any geography stored with a non-default geographic SRID. ### Does this PR introduce _any_ user-facing change? Yes — bug fix. Reading a Parquet column declared as `GeographyType(<non-default SRID>)`, materializing geographies from Arrow/Python, or converting Geography values via `CatalystTypeConverters.convertToCatalyst` now yields values whose SRID matches the column's declared SRID. Previously the SRID was always 4326. ### How was this patch tested? Updated relevant test suites and added appropriate unit test cases. ### Was this patch authored or co-authored using generative AI tooling? Generated-by: Cursor (Claude Opus 4.7) Closes #56127 from uros-db/geography-srids. Authored-by: Uros Bojanic <uros.bojanic@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 4a61083) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…eaders, and Python conversion ### What changes were proposed in this pull request? Now that `GeographyType` supports per-row SRIDs, this PR fixes several code paths that were silently dropping the SRID when materializing a `GeographyVal` from WKB and manually substituting the default value (`4326`). ### Why are the changes needed? Without these fixes, geographies read from Arrow, Parquet, Python, or constructed via `CatalystTypeConverters.convertToCatalyst` had their SRID rewritten to the default 4326, regardless of the column's declared SRID or the value's actual SRID. This produced silent data corruption for any geography stored with a non-default geographic SRID. ### Does this PR introduce _any_ user-facing change? Yes — bug fix. Reading a Parquet column declared as `GeographyType(<non-default SRID>)`, materializing geographies from Arrow/Python, or converting Geography values via `CatalystTypeConverters.convertToCatalyst` now yields values whose SRID matches the column's declared SRID. Previously the SRID was always 4326. ### How was this patch tested? Updated relevant test suites and added appropriate unit test cases. ### Was this patch authored or co-authored using generative AI tooling? Generated-by: Cursor (Claude Opus 4.7) Closes #56127 from uros-db/geography-srids. Authored-by: Uros Bojanic <uros.bojanic@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 4a61083) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
What changes were proposed in this pull request?
Now that
GeographyTypesupports per-row SRIDs, this PR fixes several code paths that were silently dropping the SRID when materializing aGeographyValfrom WKB and manually substituting the default value (4326).Why are the changes needed?
Without these fixes, geographies read from Arrow, Parquet, Python, or constructed via
CatalystTypeConverters.convertToCatalysthad their SRID rewritten to the default 4326, regardless of the column's declared SRID or the value's actual SRID. This produced silent data corruption for any geography stored with a non-default geographic SRID.Does this PR introduce any user-facing change?
Yes — bug fix. Reading a Parquet column declared as
GeographyType(<non-default SRID>), materializing geographies from Arrow/Python, or converting Geography values viaCatalystTypeConverters.convertToCatalystnow yields values whose SRID matches the column's declared SRID. Previously the SRID was always 4326.How was this patch tested?
Updated relevant test suites and added appropriate unit test cases.
Was this patch authored or co-authored using generative AI tooling?
Generated-by: Cursor (Claude Opus 4.7)