
[GH-2809] Support distance joins for raster predicates #2980

Merged
jiayuasu merged 1 commit into apache:master from jiayuasu:feature/raster-distance-join
May 30, 2026

Conversation

@jiayuasu jiayuasu commented May 20, 2026

Did you read the Contributor Guide?

Is this PR related to a ticket?

What changes were proposed in this PR?

Adds an RS_DWithin(left, right, distance) predicate so distance joins can use raster operands, and wires it into the existing geometry-based spatial-join infrastructure.

Predicate (per-row evaluation)

  • New RS_DWithin SQL function with three overloads: (raster, geom, distance), (geom, raster, distance), (raster, raster, distance).
  • Backed by RasterPredicates.rsDWithin (Java, in common), which:
    • Unconditionally reprojects both inputs to WGS84 (no CRS-native fast path).
    • Wraps each WGS84 convex hull as WKBGeography.fromJTS, forcing CCW shells (S2's expected orientation; the raster convex hull is emitted CW by GeometryFunctions.convexHull).
    • Delegates to org.apache.sedona.common.geography.Functions.dWithin, which uses S2's ClosestEdgeQuery to compute the true minimum geodesic distance between the two shapes — not centroid-to-centroid.
  • distance is therefore always in meters, and overlapping or touching footprints yield distance 0 (so RS_DWithin(a, b, 0) matches RS_Intersects(a, b)).
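The CCW-forcing step in the list above can be illustrated with a small, library-free sketch. It is not the Sedona implementation: `RingOrientation`, `signedArea2`, and `forceCcw` are hypothetical names, and the planar shoelace test on lon/lat coordinates is a simplifying assumption that is only reasonable for small footprints away from the poles and the antimeridian.

```java
import java.util.Arrays;

/** Hypothetical helper for illustration only; not part of Sedona, JTS, or S2. */
final class RingOrientation {

    /**
     * Twice the signed area of a closed ring (first point repeated at the end),
     * computed with the shoelace formula. Positive means counter-clockwise.
     */
    static double signedArea2(double[][] ring) {
        double sum = 0.0;
        for (int i = 0; i < ring.length - 1; i++) {
            sum += ring[i][0] * ring[i + 1][1] - ring[i + 1][0] * ring[i][1];
        }
        return sum;
    }

    /** Returns the ring unchanged if it is already CCW, otherwise a reversed copy. */
    static double[][] forceCcw(double[][] ring) {
        if (signedArea2(ring) >= 0) {
            return ring;
        }
        double[][] reversed = new double[ring.length][];
        for (int i = 0; i < ring.length; i++) {
            reversed[i] = ring[ring.length - 1 - i];
        }
        return reversed;
    }

    public static void main(String[] args) {
        // A clockwise unit square, standing in for a CW convex hull.
        double[][] cw = {{0, 0}, {0, 1}, {1, 1}, {1, 0}, {0, 0}};
        System.out.println(Arrays.deepToString(forceCcw(cw)));
    }
}
```

Reversing the vertex order flips the sign of the area without changing the shape, which is why it is enough to satisfy an orientation-sensitive consumer.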

Join planning (two-phase)

The planner reuses the existing geometry-based spatial-join machinery for the coarse phase, then applies the Geography refinement only to survivors:

| Phase | Where | What it does |
| --- | --- | --- |
| 1. Detect | JoinQueryDetector.getRasterDistanceJoinDetection + OptimizableJoinCondition | Recognises RS_DWithin(a, b, d) and produces a JoinQueryDetection with SpatialPredicate.INTERSECTS, isGeography = false, distance = Some(d), and the full RS_DWithin expression carried as extraCondition. |
| 2. Index build | SpatialIndexExec + TraitJoinQueryBase.toExpandedWGS84EnvelopeRDD | Projects each row to a WGS84 envelope and runs it through expandRasterFilterEnvelope, which picks the bound from the projected envelope shape: mid-latitude / single-hemisphere footprints get a tight Haversine-meter bound (the same envelope expansion ST_DistanceSphere uses for isGeography = true); polar / antimeridian-crossing / globe-spanning footprints get a (-180, 180, -90, 90) world envelope. |
| 3. Stream side | BroadcastIndexJoinExec.createStreamShapes (raster branch) / DistanceJoinExec.toSpatialRddPair | Same expandRasterFilterEnvelope rule on the partitioned/streamed side. DistanceJoinExec routes both sides through the helper (using a literal-zero radius on the side that didn't receive the user-supplied distance) so the bound choice applies symmetrically. |
| 4. R-tree filter | JoinQuery.spatialJoin / BroadcastIndexJoinExec.innerJoin | Plain JTS envelope.intersects(envelope) on the expanded rectangles. Returns coarse candidate pairs. |
| 5. Refinement | boundCondition evaluated per row via Predicate.create(extraCondition, output) | Calls back into RS_DWithin → RasterPredicates.rsDWithin → S2 ClosestEdgeQuery. This is the meter-correct, true-minimum-distance step. |

BroadcastIndexJoinExec is chosen when one side is small enough to broadcast, otherwise DistanceJoinExec. The placeholder UnsupportedOperationException for distance + raster is removed; geography + raster + distance remains guarded because the geography refiner doesn't accept raster shapes.
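The filter-then-refine shape of this plan can be sketched without any Sedona or Spark types. Everything below is hypothetical (`TwoPhaseJoinSketch`, `boxesIntersect`, `join`), and a nested loop stands in for the R-tree and the partitioner; the point is only that the expensive exact predicate runs solely on pairs whose expanded envelopes intersect.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.BiPredicate;

/** Hypothetical illustration of a coarse-filter + exact-refinement join. */
final class TwoPhaseJoinSketch {

    /** Boxes are {minX, maxX, minY, maxY}; closed-interval intersection test. */
    static boolean boxesIntersect(double[] a, double[] b) {
        return a[0] <= b[1] && b[0] <= a[1] && a[2] <= b[3] && b[2] <= a[3];
    }

    /**
     * Returns index pairs {i, j} that pass both phases. The exact predicate
     * (the stand-in for the per-row distance check) is only invoked for
     * pairs that survive the envelope filter.
     */
    static List<int[]> join(List<double[]> left, List<double[]> right,
                            BiPredicate<Integer, Integer> exact) {
        List<int[]> out = new ArrayList<>();
        for (int i = 0; i < left.size(); i++) {
            for (int j = 0; j < right.size(); j++) {
                if (boxesIntersect(left.get(i), right.get(j)) && exact.test(i, j)) {
                    out.add(new int[] {i, j});
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<double[]> left = Collections.singletonList(new double[] {0, 1, 0, 1});
        List<double[]> right = Arrays.asList(
                new double[] {0.5, 2, 0.5, 2},   // overlaps the left box
                new double[] {5, 6, 5, 6});      // pruned by the coarse filter
        int[] exactCalls = {0};
        List<int[]> pairs = join(left, right, (i, j) -> {
            exactCalls[0]++;
            return true; // placeholder for the exact distance predicate
        });
        System.out.println(pairs.size() + " pair(s), exact predicate calls: " + exactCalls[0]);
    }
}
```

For the join result to equal row-by-row evaluation, the coarse filter must never reject a pair the exact predicate would accept, which is why the envelope expansion has to err on the large side.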

Filter-bound selection

Two bounds are used, switched per row based on the projected envelope shape; both keep the join result identical to evaluating RS_DWithin row-by-row.

  • Tight Haversine bound (mid-latitude / single-hemisphere): a conservative polar-radius meter-to-degree conversion expands the envelope by ≥ the true geodesic equivalent of distance. The R-tree partitioner can prune aggressively.
  • Global envelope (polar projections like EPSG:3996 / EPSG:3413; antimeridian-crossing UTM zones like EPSG:32601; or any input whose projected envelope already exceeds 180° in longitude or grazes a pole): expandRasterFilterEnvelope substitutes (-180, 180, -90, 90) so those rows pair with every counterpart at the index stage and the per-row S2 predicate produces the answer. Trigger conditions: maxY >= 90, minY <= -90, or width > 180°.
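A minimal sketch of the bound selection described above, written against plain arrays rather than JTS envelopes. It is not the code of expandRasterFilterEnvelope: the class and method names are hypothetical, the longitude widening formula is an assumption chosen to over-cover rather than the PR's exact Haversine expansion, and the extra fallback when the expanded latitude reaches a pole is also an assumption. Only the three trigger conditions and the use of a polar radius come from the description.

```java
/** Hypothetical sketch of a per-row filter-envelope choice; not Sedona code. */
final class FilterEnvelopeSketch {

    /** WGS84 semi-minor (polar) axis in meters; the smaller radius gives more degrees per meter. */
    static final double POLAR_RADIUS_M = 6_356_752.314245;

    /** World envelope as {minX, maxX, minY, maxY}. */
    static final double[] WORLD = {-180.0, 180.0, -90.0, 90.0};

    /** env is {minX, maxX, minY, maxY} in WGS84 degrees. */
    static double[] expand(double[] env, double distanceMeters) {
        double minX = env[0], maxX = env[1], minY = env[2], maxY = env[3];

        // Trigger conditions stated in the PR description.
        if (maxY >= 90.0 || minY <= -90.0 || (maxX - minX) > 180.0) {
            return WORLD.clone();
        }

        // Meters to degrees of latitude, using the polar radius.
        double latDelta = Math.toDegrees(distanceMeters / POLAR_RADIUS_M);

        // Assumption: widen longitude by the latitude delta scaled at the
        // most poleward latitude the expanded box can reach.
        double extremeLat = Math.max(Math.abs(minY), Math.abs(maxY)) + latDelta;
        if (extremeLat >= 90.0) {
            return WORLD.clone(); // assumption: give up on a tight bound near the pole
        }
        double lonDelta = latDelta / Math.cos(Math.toRadians(extremeLat));

        return new double[] {minX - lonDelta, maxX + lonDelta, minY - latDelta, maxY + latDelta};
    }

    public static void main(String[] args) {
        double[] tight = expand(new double[] {10, 11, 40, 41}, 1000.0);
        double[] polar = expand(new double[] {-10, 10, 89, 90}, 1000.0);
        System.out.println(java.util.Arrays.toString(tight));
        System.out.println(java.util.Arrays.toString(polar));
    }
}
```

The world-envelope branch trades pruning for safety: those rows become candidates against everything, and correctness rests entirely on the per-row refinement.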

How was this patch tested?

  • common/.../RasterPredicatesTest.java — four new standalone Java unit tests with hard-coded expected booleans, covering: WGS84 raster + meter-unit semantics (including a 10.0-threshold case that catches the pre-fix degree-unit regression directly), swapped-operand symmetry, projected-CRS reprojection (UTM 32610 raster + WGS84 and EPSG:3857 points), and cross-CRS raster-raster.
  • spark/common/.../BroadcastIndexJoinSuite.scala: Passed RS_DWithin exercises stream-raster, broadcast-raster, and swapped-operand forms (with a 1 m threshold against a global raster, since the buildings sit inside it and the predicate now returns minimum-distance 0).
  • spark/common/.../RasterJoinSuite.scala — new RS_DWithin distance join describe block reusing the suite's shared 7-raster / 16-geometry set, which includes the polar EPSG:3996 and antimeridian-crossing EPSG:32601 rasters. The join's output must match what RasterPredicates.rsDWithin computes for each pair via S2, validating both bound choices and the symmetric DistanceJoinExec routing. Covers both partition-side configs, swapped operands, and raster-raster.
  • All 122 tests across the two join suites and the 22 raster predicate unit tests pass locally under -Dspark=3.4 -Pscala2.12.

Did this PR include necessary documentation updates?

  • Yes, I am adding a new API. I am using the current SNAPSHOT version number v1.9.1 in the Since field.
  • Yes, I have updated the documentation:
    • New docs/api/sql/Raster-Predicates/RS_DWithin.md with the meters / WGS84 / S2-minimum-distance semantics, all three signatures, a SQL example in meters, and a note on how the predicate plugs into the distance-join planner.
    • Raster-Functions.md: predicate table row reflecting the minimum-geodesic-distance semantics.
    • Optimizer.md: new "Raster distance join" subsection covering the two-phase plan (per-row bound choice between tight Haversine and global envelope, plus S2 ClosestEdgeQuery for the refinement) with broadcast and non-broadcast SQL examples in meters.

@jiayuasu jiayuasu force-pushed the feature/raster-distance-join branch from 53602f9 to 1e23c81 Compare May 22, 2026 06:25
@jiayuasu jiayuasu marked this pull request as draft May 22, 2026 07:05
@jiayuasu jiayuasu force-pushed the feature/raster-distance-join branch from 1e23c81 to 3a95f0b Compare May 29, 2026 07:02
@jiayuasu jiayuasu requested a review from Copilot May 29, 2026 07:05
Copilot AI left a comment

Pull request overview

This PR adds RS_DWithin support for raster distance predicates and routes raster distance joins through optimized broadcast and partitioned join paths using WGS84 envelope expansion and spheroidal refinement.

Changes:

  • Adds RS_DWithin raster/geometry and raster/raster predicate implementation, Spark expression, catalog registration, and docs.
  • Updates join detection/planning/execution to support raster distance joins via BroadcastIndexJoinExec and DistanceJoinExec.
  • Adds broadcast and non-broadcast join tests for raster distance predicates.

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| common/src/main/java/org/apache/sedona/common/raster/RasterPredicates.java | Adds WGS84 spheroidal rsDWithin raster predicate helpers. |
| spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/expressions/raster/RasterPredicates.scala | Adds Catalyst RS_DWithin expression. |
| spark/common/src/main/scala/org/apache/sedona/sql/UDF/Catalog.scala | Registers RS_DWithin. |
| spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/strategy/join/JoinQueryDetector.scala | Detects raster distance join predicates. |
| spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/strategy/join/OptimizableJoinCondition.scala | Marks RS_DWithin as optimizable. |
| spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/strategy/join/TraitJoinQueryBase.scala | Adds expanded WGS84 envelope RDD helper. |
| spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/strategy/join/SpatialIndexExec.scala | Builds raster distance indexes with expanded WGS84 envelopes. |
| spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/strategy/join/BroadcastIndexJoinExec.scala | Handles raster stream-side distance shapes for broadcast joins. |
| spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/strategy/join/DistanceJoinExec.scala | Routes raster distance joins through WGS84 envelope RDDs. |
| spark/common/src/test/scala/org/apache/sedona/sql/RasterJoinSuite.scala | Adds partitioned raster distance join tests. |
| spark/common/src/test/scala/org/apache/sedona/sql/BroadcastIndexJoinSuite.scala | Adds broadcast raster distance join tests. |
| docs/api/sql/Raster-Predicates/RS_DWithin.md | Documents new SQL predicate. |
| docs/api/sql/Raster-Functions.md | Adds predicate table entry. |
| docs/api/sql/Optimizer.md | Documents raster distance join planning. |

Comment on lines +160 to +163
* Distance predicate for rasters: `RS_DWithin(left, right, distance)`. `left` and `right` can
* each be a raster or a geometry (at least one must be a raster). Returns true when the shapes
* are within `distance` of each other, with both sides projected to a common CRS prior to the JTS
* distance check (mirroring [[RS_Intersects]]). This expression is recognised by

Member Author

Resolved in b488799. Rewrote the Scaladoc to describe the actual semantics: WGS84 reprojection, spheroidal centroid distance in meters via Spheroid.distance, and the Haversine envelope expansion used by the join planner. No longer claims parity with RS_Intersects/JTS.

Comment on lines +94 to +96
public static boolean rsDWithin(GridCoverage2D raster, Geometry geometry, double distance) {
  Pair<Geometry, Geometry> geometries = toWGS84Pair(raster, geometry);
  return Spheroid.distance(geometries.getLeft(), geometries.getRight()) <= distance;

Member Author

Resolved in b488799. Added four direct unit tests in common/src/test/java/org/apache/sedona/common/raster/RasterPredicatesTest.java with hard-coded expected booleans:

  • testDWithinWGS84RasterPointMeterSemantics — coincident centroid + a 10°-east point (~1 112 km geodesic); a 10.0 threshold (the pre-fix degree value) must reject the pair, catching the unit regression directly.
  • testDWithinSwappedOperands — confirms the (raster, geom) overload is symmetric at and around the threshold.
  • testDWithinProjectedRasterReprojects — UTM 32610 raster vs WGS84 and EPSG:3857 points, asserts the same truth value across CRSes so the WGS84 reprojection cannot silently regress.
  • testDWithinRasterRaster — same-CRS and cross-CRS (UTM + WGS84) raster-raster pairs with symmetric assertions and threshold brackets.

The join tests in RasterJoinSuite still build their expected set from the same predicate (they assert the planner produces a result consistent with the predicate), but the predicate itself is now anchored by these standalone cases.
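The unit regression that the first test guards against can be checked with a few lines of arithmetic. The sketch below uses a plain Haversine formula on a mean-radius sphere, which is not what the predicate itself uses, and it assumes the raster centroid and the point both sit on the equator, since the comment does not state the latitude. `HaversineSketch` and `meters` are hypothetical names.

```java
/** Hypothetical Haversine helper for a back-of-the-envelope check; not Sedona code. */
final class HaversineSketch {

    /** Mean Earth radius in meters (spherical approximation). */
    static final double MEAN_RADIUS_M = 6_371_008.8;

    /** Great-circle distance in meters between two lon/lat points given in degrees. */
    static double meters(double lon1, double lat1, double lon2, double lat2) {
        double phi1 = Math.toRadians(lat1);
        double phi2 = Math.toRadians(lat2);
        double dPhi = phi2 - phi1;
        double dLambda = Math.toRadians(lon2 - lon1);
        double a = Math.pow(Math.sin(dPhi / 2), 2)
                + Math.cos(phi1) * Math.cos(phi2) * Math.pow(Math.sin(dLambda / 2), 2);
        return 2 * MEAN_RADIUS_M * Math.asin(Math.sqrt(a));
    }

    public static void main(String[] args) {
        // Assumed positions: centroid at (0, 0), point 10 degrees east on the equator.
        double d = meters(0.0, 0.0, 10.0, 0.0);
        System.out.println("distance in meters: " + d);
        // With a threshold of 10.0 interpreted as meters, the pair is far out of range.
        System.out.println("within 10.0 m: " + (d <= 10.0));
    }
}
```

Under these assumptions the separation comes out at roughly 1 112 km, so a threshold of 10.0 only accepts the pair if it is silently treated as degrees, which is the regression the test is meant to catch.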


jiayuasu commented May 29, 2026

One small docs issue: docs/api/sql/Raster-Predicates/RS_DWithin.md links the geometry-only case to ../Predicate/#st_dwithin, but that path looks broken. It should probably point to the existing geometry predicate page, e.g. ../Predicates/ST_DWithin.md.

@jiayuasu jiayuasu force-pushed the feature/raster-distance-join branch from 3a95f0b to b488799 Compare May 29, 2026 18:12
jiayuasu (Member Author)

Fixed in b488799. Changed the link target from ../Predicate/#st_dwithin to ../Predicates/ST_DWithin.md, which is the actual location of the geometry-only predicate doc.

@jiayuasu jiayuasu force-pushed the feature/raster-distance-join branch 6 times, most recently from 20b64fd to 20fbb82 Compare May 30, 2026 04:34
Add `RS_DWithin(raster|geom, raster|geom, distance)` so distance joins
can use raster operands, and route the join planner through the existing
spatial-index machinery.

- `RS_DWithin` expression in `RasterPredicates.scala`, backed by new
  `RasterPredicates.rsDWithin` overloads (raster-geom, raster-raster)
  that reuse `convertCRSIfNeeded` and JTS `isWithinDistance`.
- `JoinQueryDetector` and `OptimizableJoinCondition` recognise
  `RS_DWithin` as a distance-join predicate; the relationship label
  collapses to `RS_DWithin` for all raster + distance cases.
- `BroadcastIndexJoinExec.createStreamShapes` and the new
  `TraitJoinQueryBase.toExpandedWGS84EnvelopeRDD` handle the raster
  stream and build sides for broadcast-index joins; `SpatialIndexExec`
  and `DistanceJoinExec` route to the same helper so non-broadcast
  distance joins work too.
- Drop the placeholder `UnsupportedOperationException` guards for
  distance + raster combinations; geography + raster + distance remains
  guarded since the geography refiner does not handle raster shapes.

Tests
- `BroadcastIndexJoinSuite`: `RS_DWithin` covers stream-raster /
  broadcast-raster / swapped-operand forms.
- `RasterJoinSuite`: new `RS_DWithin distance join` describe block
  covers `DistanceJoinExec` with both partition-side configs, swapped
  operands, and raster-raster.

Docs
- New `docs/api/sql/Raster-Predicates/RS_DWithin.md` page.
- `Raster-Functions.md` predicate table row.
- `Optimizer.md` raster-distance-join subsection.
@jiayuasu jiayuasu force-pushed the feature/raster-distance-join branch from 20fbb82 to b823cce Compare May 30, 2026 04:57
@jiayuasu jiayuasu marked this pull request as ready for review May 30, 2026 06:25
@jiayuasu jiayuasu added this to the sedona-1.9.1 milestone May 30, 2026
@jiayuasu jiayuasu merged commit bd909f7 into apache:master May 30, 2026
44 checks passed

Development

Successfully merging this pull request may close these issues.

Support distance joins for raster predicates
