[GH-2809] Support distance joins for raster predicates #2980
Pull request overview
This PR adds RS_DWithin support for raster distance predicates and routes raster distance joins through optimized broadcast and partitioned join paths using WGS84 envelope expansion and spheroidal refinement.
Changes:
- Adds `RS_DWithin` raster/geometry and raster/raster predicate implementation, Spark expression, catalog registration, and docs.
- Updates join detection/planning/execution to support raster distance joins via `BroadcastIndexJoinExec` and `DistanceJoinExec`.
- Adds broadcast and non-broadcast join tests for raster distance predicates.
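
The envelope expansion followed by refinement summarised above is an instance of the general filter-and-refine join pattern. The following is a minimal sketch of that pattern only, under loud assumptions: it uses toy planar coordinates and a nested loop in place of a spatial index, and every name in it (`FilterRefineSketch`, `Box`, `Item`, `distanceJoin`) is hypothetical rather than part of Sedona.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiPredicate;

// Hypothetical illustration of filter-and-refine; not Sedona code.
final class FilterRefineSketch {
    /** Axis-aligned box in toy planar coordinates (assumption: no CRS handling). */
    static final class Box {
        final double minX, maxX, minY, maxY;
        Box(double minX, double maxX, double minY, double maxY) {
            this.minX = minX; this.maxX = maxX; this.minY = minY; this.maxY = maxY;
        }
        boolean intersects(Box o) {
            return minX <= o.maxX && o.minX <= maxX && minY <= o.maxY && o.minY <= maxY;
        }
        Box expandBy(double d) {
            return new Box(minX - d, maxX + d, minY - d, maxY + d);
        }
    }

    /** A point feature with an identifier. */
    static final class Item {
        final String id;
        final double x, y;
        Item(String id, double x, double y) { this.id = id; this.x = x; this.y = y; }
        Box box() { return new Box(x, x, y, y); }
    }

    /** Returns "leftId-rightId" for every pair whose exact distance is at most d. */
    static List<String> distanceJoin(List<Item> left, List<Item> right, double d) {
        // Exact (refine) predicate: Euclidean distance in the toy plane.
        BiPredicate<Item, Item> exact = (l, r) -> Math.hypot(l.x - r.x, l.y - r.y) <= d;
        List<String> out = new ArrayList<>();
        for (Item l : left) {
            Box expanded = l.box().expandBy(d); // conservative coarse bound
            for (Item r : right) {
                if (!expanded.intersects(r.box())) {
                    continue; // phase 1: cheap rejection on rectangles
                }
                if (exact.test(l, r)) { // phase 2: exact check on survivors only
                    out.add(l.id + "-" + r.id);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Item> left = List.of(new Item("a", 0, 0));
        List<Item> right = List.of(
                new Item("b", 3, 4),      // exact distance 5
                new Item("c", 6, 0),      // outside the expanded box
                new Item("e", 4.9, 4.9)); // inside the box, but farther than 5
        System.out.println(distanceJoin(left, right, 5.0));
    }
}
```

The coarse bound only has to be conservative: it may admit pairs such as `a`/`e` that the exact predicate later rejects, but it must never drop a pair the exact predicate would accept.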
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `common/src/main/java/org/apache/sedona/common/raster/RasterPredicates.java` | Adds WGS84 spheroidal `rsDWithin` raster predicate helpers. |
| `spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/expressions/raster/RasterPredicates.scala` | Adds Catalyst `RS_DWithin` expression. |
| `spark/common/src/main/scala/org/apache/sedona/sql/UDF/Catalog.scala` | Registers `RS_DWithin`. |
| `spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/strategy/join/JoinQueryDetector.scala` | Detects raster distance join predicates. |
| `spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/strategy/join/OptimizableJoinCondition.scala` | Marks `RS_DWithin` as optimizable. |
| `spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/strategy/join/TraitJoinQueryBase.scala` | Adds expanded WGS84 envelope RDD helper. |
| `spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/strategy/join/SpatialIndexExec.scala` | Builds raster distance indexes with expanded WGS84 envelopes. |
| `spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/strategy/join/BroadcastIndexJoinExec.scala` | Handles raster stream-side distance shapes for broadcast joins. |
| `spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/strategy/join/DistanceJoinExec.scala` | Routes raster distance joins through WGS84 envelope RDDs. |
| `spark/common/src/test/scala/org/apache/sedona/sql/RasterJoinSuite.scala` | Adds partitioned raster distance join tests. |
| `spark/common/src/test/scala/org/apache/sedona/sql/BroadcastIndexJoinSuite.scala` | Adds broadcast raster distance join tests. |
| `docs/api/sql/Raster-Predicates/RS_DWithin.md` | Documents new SQL predicate. |
| `docs/api/sql/Raster-Functions.md` | Adds predicate table entry. |
| `docs/api/sql/Optimizer.md` | Documents raster distance join planning. |
 * Distance predicate for rasters: `RS_DWithin(left, right, distance)`. `left` and `right` can
 * each be a raster or a geometry (at least one must be a raster). Returns true when the shapes
 * are within `distance` of each other, with both sides projected to a common CRS prior to the JTS
 * distance check (mirroring [[RS_Intersects]]). This expression is recognised by
Resolved in b488799. Rewrote the Scaladoc to describe the actual semantics: WGS84 reprojection, spheroidal centroid distance in meters via Spheroid.distance, and the Haversine envelope expansion used by the join planner. No longer claims parity with RS_Intersects/JTS.
public static boolean rsDWithin(GridCoverage2D raster, Geometry geometry, double distance) {
  Pair<Geometry, Geometry> geometries = toWGS84Pair(raster, geometry);
  return Spheroid.distance(geometries.getLeft(), geometries.getRight()) <= distance;
Resolved in b488799. Added four direct unit tests in common/src/test/java/org/apache/sedona/common/raster/RasterPredicatesTest.java with hard-coded expected booleans:
- `testDWithinWGS84RasterPointMeterSemantics`: coincident centroid + a 10°-east point (~1,112 km geodesic); a 10.0 threshold (the pre-fix degree value) must reject the pair, catching the unit regression directly.
- `testDWithinSwappedOperands`: confirms the (raster, geom) overload is symmetric at and around the threshold.
- `testDWithinProjectedRasterReprojects`: UTM 32610 raster vs WGS84 and EPSG:3857 points, asserts the same truth value across CRSes so the WGS84 reprojection cannot silently regress.
- `testDWithinRasterRaster`: same-CRS and cross-CRS (UTM + WGS84) raster-raster pairs with symmetric assertions and threshold brackets.
The join tests in RasterJoinSuite still build their expected set from the same predicate (they assert the planner produces a result consistent with the predicate), but the predicate itself is now anchored by these standalone cases.
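
To make the unit regression concrete, here is a rough, self-contained sketch of why a threshold of 10.0 must reject a point 10° east of the origin once distances are in meters. It is an assumption-laden illustration: it uses the Haversine formula on a sphere with an assumed mean Earth radius, not `Spheroid.distance`, and the class and method names are hypothetical.

```java
// Hypothetical illustration of meter-vs-degree threshold semantics; not Sedona code.
final class MeterSemanticsSketch {
    // Assumed mean Earth radius in meters; a spheroidal model gives slightly different values.
    static final double EARTH_RADIUS_M = 6_371_008.8;

    /** Great-circle (Haversine) distance in meters between two lon/lat points in degrees. */
    static double haversineMeters(double lon1, double lat1, double lon2, double lat2) {
        double phi1 = Math.toRadians(lat1);
        double phi2 = Math.toRadians(lat2);
        double dPhi = phi2 - phi1;
        double dLambda = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dPhi / 2) * Math.sin(dPhi / 2)
                + Math.cos(phi1) * Math.cos(phi2) * Math.sin(dLambda / 2) * Math.sin(dLambda / 2);
        // Clamp guards against rounding pushing sqrt(a) marginally above 1.
        return 2 * EARTH_RADIUS_M * Math.asin(Math.min(1.0, Math.sqrt(a)));
    }

    /** Distance predicate with the threshold interpreted in meters. */
    static boolean withinMeters(double lon1, double lat1, double lon2, double lat2, double meters) {
        return haversineMeters(lon1, lat1, lon2, lat2) <= meters;
    }

    public static void main(String[] args) {
        // A point 10 degrees east of the origin, along the equator.
        System.out.printf("distance = %.0f m%n", haversineMeters(0, 0, 10, 0));
        // Threshold 10.0 read as meters: the pair is far outside it.
        System.out.println(withinMeters(0, 0, 10, 0, 10.0));
        // A threshold above the roughly 1,112 km separation accepts the pair.
        System.out.println(withinMeters(0, 0, 10, 0, 1_200_000.0));
    }
}
```

Under a degree interpretation the same 10.0 threshold would have accepted the pair, which is the regression the first test above is meant to catch.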
One small docs issue:
Fixed in b488799. Changed the link target from
Add `RS_DWithin(raster|geom, raster|geom, distance)` so distance joins can use raster operands, and route the join planner through the existing spatial-index machinery.

- `RS_DWithin` expression in `RasterPredicates.scala`, backed by new `RasterPredicates.rsDWithin` overloads (raster-geom, raster-raster) that reuse `convertCRSIfNeeded` and JTS `isWithinDistance`.
- `JoinQueryDetector` and `OptimizableJoinCondition` recognise `RS_DWithin` as a distance-join predicate; the relationship label collapses to `RS_DWithin` for all raster + distance cases.
- `BroadcastIndexJoinExec.createStreamShapes` and the new `TraitJoinQueryBase.toExpandedWGS84EnvelopeRDD` handle the raster stream and build sides for broadcast-index joins; `SpatialIndexExec` and `DistanceJoinExec` route to the same helper so non-broadcast distance joins work too.
- Drop the placeholder `UnsupportedOperationException` guards for distance + raster combinations; geography + raster + distance remains guarded since the geography refiner does not handle raster shapes.

Tests

- `BroadcastIndexJoinSuite`: `RS_DWithin` covers stream-raster / broadcast-raster / swapped-operand forms.
- `RasterJoinSuite`: new `RS_DWithin distance join` describe block covers `DistanceJoinExec` with both partition-side configs, swapped operands, and raster-raster.

Docs

- New `docs/api/sql/Raster-Predicates/RS_DWithin.md` page.
- `Raster-Functions.md` predicate table row.
- `Optimizer.md` raster-distance-join subsection.
Did you read the Contributor Guide?
Is this PR related to a ticket?
`[GH-XXX] my subject`. Closes #2809 (Support distance joins for raster predicates).

What changes were proposed in this PR?
Adds an `RS_DWithin(left, right, distance)` predicate so distance joins can use raster operands, and wires it into the existing geometry-based spatial-join infrastructure.

Predicate (per-row evaluation)

- `RS_DWithin` SQL function with three overloads: `(raster, geom, distance)`, `(geom, raster, distance)`, `(raster, raster, distance)`.
- `RasterPredicates.rsDWithin` (Java, in `common`), which:
  - `WKBGeography.fromJTS`, forcing CCW shells (S2's expected orientation; the raster convex hull is emitted CW by `GeometryFunctions.convexHull`).
  - `org.apache.sedona.common.geography.Functions.dWithin`, which uses S2's `ClosestEdgeQuery` to compute the true minimum geodesic distance between the two shapes, not centroid-to-centroid.
- `distance` is therefore always meters, and overlapping or touching footprints yield distance 0 (so `RS_DWithin(a, b, 0)` matches `RS_Intersects(a, b)`).

Join planning (two-phase)
The planner reuses the existing geometry-based spatial-join machinery for the coarse phase, then applies the Geography refinement only to survivors:
- `JoinQueryDetector.getRasterDistanceJoinDetection` + `OptimizableJoinCondition`: recognises `RS_DWithin(a, b, d)` and produces a `JoinQueryDetection` with `SpatialPredicate.INTERSECTS`, `isGeography = false`, `distance = Some(d)`, and the full `RS_DWithin` expression carried as `extraCondition`.
- `SpatialIndexExec` + `TraitJoinQueryBase.toExpandedWGS84EnvelopeRDD`: `expandRasterFilterEnvelope`, which picks the bound from the projected envelope shape: mid-latitude / single-hemisphere footprints get a tight Haversine-meter bound (the same envelope expansion `ST_DistanceSphere` uses for `isGeography = true`); polar / antimeridian-crossing / globe-spanning footprints get a `(-180, 180, -90, 90)` world envelope.
- `BroadcastIndexJoinExec.createStreamShapes` (raster branch) / `DistanceJoinExec.toSpatialRddPair`: `expandRasterFilterEnvelope` rule on the partitioned/streamed side. `DistanceJoinExec` routes both sides through the helper (using a literal-zero radius on the side that didn't receive the user-supplied distance) so the bound choice applies symmetrically.
- `JoinQuery.spatialJoin` / `BroadcastIndexJoinExec.innerJoin`: `envelope.intersects(envelope)` on the expanded rectangles. Returns coarse candidate pairs.
- `boundCondition` evaluated per row via `Predicate.create(extraCondition, output)`: `RS_DWithin` → `RasterPredicates.rsDWithin` → S2 `ClosestEdgeQuery`. This is the meter-correct, true-minimum-distance step.

`BroadcastIndexJoinExec` is chosen when one side is small enough to broadcast, otherwise `DistanceJoinExec`. The placeholder `UnsupportedOperationException` for distance + raster is removed; geography + raster + distance remains guarded because the geography refiner doesn't accept raster shapes.

Filter-bound selection
Two bounds are used, switched per row based on the projected envelope shape; both keep the join result identical to evaluating `RS_DWithin` row-by-row.

- `distance`. The R-tree partitioner can prune aggressively.
- `expandRasterFilterEnvelope` substitutes `(-180, 180, -90, 90)` so those rows pair with every counterpart at the index stage and the per-row S2 predicate produces the answer. Trigger conditions: `maxY >= 90`, `minY <= -90`, or `width > 180°`.

How was this patch tested?
- `common/.../RasterPredicatesTest.java`: four new standalone Java unit tests with hard-coded expected booleans, covering: WGS84 raster + meter-unit semantics (including a 10.0-threshold case that catches the pre-fix degree-unit regression directly), swapped-operand symmetry, projected-CRS reprojection (UTM 32610 raster + WGS84 and EPSG:3857 points), and cross-CRS raster-raster.
- `spark/common/.../BroadcastIndexJoinSuite.scala`: `Passed RS_DWithin` exercises stream-raster, broadcast-raster, and swapped-operand forms (with a 1 m threshold against a global raster, since the buildings sit inside it and the predicate now returns minimum-distance 0).
- `spark/common/.../RasterJoinSuite.scala`: new `RS_DWithin distance join` describe block reusing the suite's shared 7-raster / 16-geometry set, which includes the polar `EPSG:3996` and antimeridian-crossing `EPSG:32601` rasters. The join's output must match what `RasterPredicates.rsDWithin` computes for each pair via S2, validating both bound choices and the symmetric `DistanceJoinExec` routing. Covers both partition-side configs, swapped operands, and raster-raster.
- `-Dspark=3.4 -Pscala2.12`.

Did this PR include necessary documentation updates?
- `v1.9.1` in the `Since` field.
- `docs/api/sql/Raster-Predicates/RS_DWithin.md` with the meters / WGS84 / S2-minimum-distance semantics, all three signatures, a SQL example in meters, and a note on how the predicate plugs into the distance-join planner.
- `Raster-Functions.md`: predicate table row reflecting the minimum-geodesic-distance semantics.
- `Optimizer.md`: new "Raster distance join" subsection covering the two-phase plan (per-row bound choice between tight Haversine and global envelope, plus S2 `ClosestEdgeQuery` for the refinement) with broadcast and non-broadcast SQL examples in meters.
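
The per-row bound choice described under "Filter-bound selection" in the PR description can be sketched roughly as follows. This is not the PR's `expandRasterFilterEnvelope`: only the three trigger conditions (`maxY >= 90`, `minY <= -90`, width over 180°) and the world envelope come from the description. The spherical padding formula, the assumed mean Earth radius, the extra fallbacks after padding, and all names (`FilterBoundSketch`, `Envelope`, `expand`) are assumptions made for illustration.

```java
// Hypothetical illustration of tight-vs-world bound selection; not Sedona code.
final class FilterBoundSketch {
    // Assumed mean Earth radius in meters.
    static final double EARTH_RADIUS_M = 6_371_008.8;

    /** Axis-aligned lon/lat rectangle in degrees. */
    static final class Envelope {
        final double minX, maxX, minY, maxY;
        Envelope(double minX, double maxX, double minY, double maxY) {
            this.minX = minX; this.maxX = maxX; this.minY = minY; this.maxY = maxY;
        }
    }

    static final Envelope WORLD = new Envelope(-180, 180, -90, 90);

    /** Pads a WGS84 envelope by a distance in meters, or returns WORLD for degenerate shapes. */
    static Envelope expand(Envelope e, double meters) {
        if (meters < 0) {
            throw new IllegalArgumentException("distance must be non-negative");
        }
        // Trigger conditions quoted in the PR description.
        if (e.maxY >= 90 || e.minY <= -90 || (e.maxX - e.minX) > 180) {
            return WORLD;
        }
        double r = meters / EARTH_RADIUS_M; // angular radius in radians
        double dLat = Math.toDegrees(r);
        double minY = e.minY - dLat;
        double maxY = e.maxY + dLat;
        // Assumption: if the padded band reaches a pole, fall back to the world envelope.
        if (maxY >= 90 || minY <= -90) {
            return WORLD;
        }
        // Longitude padding on a sphere, evaluated at the envelope's highest absolute latitude.
        double maxAbsLat = Math.toRadians(Math.max(Math.abs(e.minY), Math.abs(e.maxY)));
        double dLon = Math.toDegrees(Math.asin(Math.sin(r) / Math.cos(maxAbsLat)));
        double minX = e.minX - dLon;
        double maxX = e.maxX + dLon;
        // Assumption: padding that spills past the antimeridian also falls back to WORLD.
        if (minX < -180 || maxX > 180) {
            return WORLD;
        }
        return new Envelope(minX, maxX, minY, maxY);
    }

    public static void main(String[] args) {
        Envelope tight = expand(new Envelope(10, 11, 45, 46), 10_000);
        System.out.printf("tight: [%.4f, %.4f] x [%.4f, %.4f]%n",
                tight.minX, tight.maxX, tight.minY, tight.maxY);
        System.out.println("polar falls back to world: "
                + (expand(new Envelope(0, 1, 89, 90), 1_000) == WORLD));
    }
}
```

Falling back to the world envelope trades pruning for safety: those rows become candidates against every counterpart, and correctness then rests entirely on the exact per-row predicate.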