refactor(rust/sedona-spatial-join): Support wraparound rectangles in EvaluatedGeometryArray#799
Conversation
Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: Copilot <copilot@github.com>
| /// The method can be called multiple times to insert data in batches before finalizing. The values | ||
| /// in rects are ordered (xmin, ymin, xmax, ymax). | ||
| pub fn push_build(&mut self, rects: &[(f32, f32, f32, f32)]) -> Result<()> { | ||
| // Re-interpreting rects as a flat f32 array (xmin, ymin, xmax, ymax) |
There was a problem hiding this comment.
cc @pwrliang to make sure I didn't muck this up...the evaluated geometry array is no longer using Rect so I updated this to just use f32s. I don't think the GPU tests actually run here but this does build.
There was a problem hiding this comment.
The changes look good to me. I also pulled the branch and ran the CI on it https://github.com/wherobots/sedona-db-gpu-tester/actions/runs/25140655727/job/73694467245
| let rect = Rect::new(coord!(x: min_x, y: min_y), coord!(x: max_x, y: max_y)); | ||
| rect_vec.push(Some(rect)); | ||
| if let Some((min_x, min_y, max_x, max_y)) = maybe_bounds { | ||
| rect_vec.push(Bounds2D::new((min_x, max_x), (min_y, max_y))); |
There was a problem hiding this comment.
This is the main point of the PR...eliminate the expansion of potentially very small feature bounds into the entire width of the earth.
There was a problem hiding this comment.
These tests were copied by via LLM from the sedona-spatial-join crate, taking the random geometry parameters from the Python tests. I needed this to debug a deduplication issue but it is good to have regardless.
| for (idx, rect_opt) in rects.iter().enumerate() { | ||
| if let Some(rect) = rect_opt { | ||
| native_rects.push(*rect); | ||
| } else { | ||
| for (idx, rect) in rects.iter().enumerate() { | ||
| if rect.is_empty() { | ||
| native_rects.push(empty_rect); | ||
| } else { | ||
| let (x, y) = rect.clone().into_inner(); | ||
| native_rects.push((x.0, y.0, x.1, y.1)); |
There was a problem hiding this comment.
cc @pwrliang...did I reinterpret this correctly?
| for (idx, rect) in rects.iter().enumerate() { | ||
| let (left, right) = rect.split(&self.wraparound.unwrap_or(Interval::empty())); | ||
| if !left.is_empty() { | ||
| let (x, y) = left.into_inner(); | ||
| let data_idx = rtree_builder.add(x.0, y.0, x.1, y.1); | ||
| batch_pos_vec[data_idx as usize] = (batch_idx as i32, idx as i32); | ||
| } | ||
|
|
||
| if !right.is_empty() { | ||
| let (x, y) = right.into_inner(); | ||
| let data_idx = rtree_builder.add(x.0, y.0, x.1, y.1); | ||
| batch_pos_vec[data_idx as usize] = (batch_idx as i32, idx as i32); | ||
| } | ||
| } |
There was a problem hiding this comment.
This is the main change in the spatial index: we can now insert two rectangles into the index instead of one.
| // using several boxes. | ||
| candidates.sort_unstable(); | ||
| candidates.dedup(); | ||
| // using several boxes (e.g., for antimeridian-crossing geometries). | ||
| // First dedup by data_idx (fast), then dedup by position (handles wraparound case). | ||
| if self.inner.wraparound.is_some() { | ||
| candidates.sort_unstable(); | ||
| candidates.dedup(); | ||
|
|
||
| // Dedup by position: when a geometry spans the antimeridian, it may be indexed | ||
| // as two separate boxes with different data_idx values that map to the same position. | ||
| let mut seen_positions: std::collections::HashSet<(i32, i32)> = | ||
| std::collections::HashSet::new(); | ||
| candidates.retain(|data_idx| { | ||
| let pos = self.inner.data_id_to_batch_pos[*data_idx as usize]; | ||
| seen_positions.insert(pos) | ||
| }); | ||
| } |
There was a problem hiding this comment.
...and when we query it we have to deduplicate the results. The deduplication happened already before this PR but the DWithin tests exposed that it didn't handle the multiple source rectangles properly.
We may want to solve this performance issue before merging this PR (a HashSet is almost certainly slowing us down here).
| /// A float32 bounding box with wraparound support | ||
| /// | ||
| /// This struct is conceptually similar to the Rect<f32> but explicitly supports | ||
| /// wraparound x intervals to ensure raw xmin and xmax values are not misused. | ||
| #[derive(Debug, Clone, PartialEq)] | ||
| pub struct Bounds2D { | ||
| x: (f32, f32), | ||
| y: (f32, f32), | ||
| } |
There was a problem hiding this comment.
This is the main utility supporting the EvaluatedGeometryArray. I played with making the IntervalTrait generic but it was a bit of a pain and I was keen to keep this scoped.
There was a problem hiding this comment.
Pull request overview
This PR refactors Sedona’s Rust spatial-join stack to represent per-geometry bounds with wraparound (antimeridian) support via a new Bounds2D type, and threads those bounds through spatial indexing and statistics so geography joins don’t produce misleading GeoStatistics.
Changes:
- Introduces
Bounds2D(wraparound-aware f32 bounds) and replacesOption<Rect<f32>>rectangles throughoutEvaluatedGeometryArrayand related spill/partition/index code. - Updates statistics collection to ingest per-row bounding boxes directly (removing bbox from
GeometrySummary, renaminganalyze_geometry→analyze_wkb). - Adds geography integration tests focused on antimeridian-crossing geometries; updates GPU plumbing to use the new bounds representation.
Reviewed changes
Copilot reviewed 25 out of 26 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| rust/sedona-testing/src/datagen.rs | Updates tests to use analyze_wkb after analyze refactor. |
| rust/sedona-testing/src/benchmark_util.rs | Updates benchmark tests to use analyze_wkb. |
| rust/sedona-spatial-join/src/utils/bounds.rs | Adds new wraparound-aware Bounds2D type + unit tests. |
| rust/sedona-spatial-join/src/utils.rs | Exposes the new bounds module. |
| rust/sedona-spatial-join/src/stream.rs | Uses per-row bounds when updating probe-side GeoStatistics. |
| rust/sedona-spatial-join/src/partitioning/util.rs | Removes geo_rect_to_bbox helper (Rect no longer the primary representation). |
| rust/sedona-spatial-join/src/partitioning/stream_repartitioner.rs | Switches repartitioning assignment and per-slot stats to use Bounds2D/BoundingBox. |
| rust/sedona-spatial-join/src/operand_evaluator.rs | Replaces Rect<f32> with Bounds2D in evaluated arrays; distance expansion now uses interval expansion. |
| rust/sedona-spatial-join/src/index/partitioned_index_provider.rs | Updates stats construction in tests to use bounds-aware accumulator updates. |
| rust/sedona-spatial-join/src/index/default_spatial_index_builder.rs | Adds wraparound insertion support by splitting wraparound bounds into multiple rectangles. |
| rust/sedona-spatial-join/src/index/default_spatial_index.rs | Adds wraparound-aware probing by splitting probe rectangles; dedups candidates across split inserts. |
| rust/sedona-spatial-join/src/index/build_side_collector.rs | Builds build-side stats and bbox samples from evaluated bounds. |
| rust/sedona-spatial-join/src/evaluated_batch/spill.rs | Serializes/deserializes Bounds2D instead of optional Rect. |
| rust/sedona-spatial-join-gpu/src/join_provider.rs | Aligns GPU evaluated-array factory with Bounds2D bounds computation. |
| rust/sedona-spatial-join-gpu/src/index/gpu_spatial_index_builder.rs | Switches GPU index build input rectangles to raw 4-float tuples. |
| rust/sedona-spatial-join-gpu/src/index/gpu_spatial_index.rs | Switches GPU probe rectangles to raw 4-float tuples. |
| rust/sedona-spatial-join-gpu/Cargo.toml | Updates deps to rely on sedona-geometry rather than geo-index/geo-types in this crate. |
| rust/sedona-spatial-join-geography/tests/spatial_join_integration.rs | Adds integration tests for antimeridian-crossing geography joins. |
| rust/sedona-spatial-join-geography/src/spatial_index_builder.rs | Configures spatial index builder with wraparound bounds (-180..180). |
| rust/sedona-spatial-join-geography/src/join_provider.rs | Produces wraparound-aware bounds in evaluated arrays via Bounds2D. |
| rust/sedona-spatial-join-geography/Cargo.toml | Adds dev-deps needed for new integration tests. |
| rust/sedona-geometry/src/interval.rs | Makes WraparoundInterval::split() public for wraparound decomposition. |
| rust/sedona-geometry/src/analyze.rs | Renames/changes analysis API (analyze_wkb) and removes bbox from GeometrySummary. |
| rust/sedona-functions/src/st_analyze_agg.rs | Refactors Analyze accumulator to accept explicit bbox (including wraparound) when updating stats. |
| c/sedona-libgpuspatial/src/lib.rs | Changes GPU rect input type to tuples and updates tests accordingly. |
| Cargo.lock | Reflects dependency graph updates from the refactor and added tests. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| for rect in batch.geom_array.rects() { | ||
| let partition = if rect.is_empty() { | ||
| // Round-robin empty geometries through regular partitions to avoid | ||
| // overloading a single slot when the build side is mostly empty. | ||
| let p = SpatialPartition::Regular(cnt); | ||
| cnt = (cnt + 1) % num_regular_partitions; | ||
| p | ||
| } else { | ||
| partitioner.partition_no_multi(&BoundingBox::xy(rect.x(), rect.y()))? | ||
| }; | ||
| assignments.push(partition); | ||
| } | ||
| } | ||
| PartitionedSide::ProbeSide => { | ||
| for rect_opt in batch.geom_array.rects() { | ||
| let partition = match rect_opt { | ||
| Some(rect) => partitioner.partition(&geo_rect_to_bbox(rect))?, | ||
| None => SpatialPartition::None, | ||
| for rect in batch.geom_array.rects() { | ||
| let partition = if rect.is_empty() { | ||
| SpatialPartition::None | ||
| } else { | ||
| partitioner.partition(&BoundingBox::xy(rect.x(), rect.y()))? | ||
| }; |
There was a problem hiding this comment.
assign_rows() now passes BoundingBox::xy(rect.x(), rect.y()) directly into the partitioner. For wraparound x-intervals this will currently error in the partitioning utilities (e.g., bbox_to_f32_rect returns an Execution error when bbox.x().is_wraparound()). This means repartitioning/out-of-core paths can fail for geography antimeridian-crossing geometries. Consider making assign_rows() wraparound-aware (e.g., split bounds using the configured absolute wraparound interval and union the partition results / force Multi), or ensure geography joins avoid partitioners that require non-wraparound rectangles until wraparound partitioning is implemented.
There was a problem hiding this comment.
This is valid...if a partition ends up with wraparound bounds, the out of core join will currently error. I opened #800 to track.
Co-authored-by: Copilot <copilot@github.com>
|
Can we also add tests for regions around the poles, hemisphere-spanning polygons, or full-globe geographies? |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 25 out of 26 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: Copilot <copilot@github.com>
I added a few of these (and it should be easy to add more as we come across more difficult scenarios!) |
This PR adds support for wraparound bounds in the
EvaluatedGeometryArray. It introduces a new struct,Bounds2D, which are just f32 bounds with wraparound support. The main consequence of this was removing the bounding box from theGeometrySummaryand reusing the bounds from the evaluated array when constructing theGeoStatistics. This solves the issue of geographies having possibly bogusGeoStatistics, although does introduce the possibility that we trigger some not-yet implemented behaviour in the out-of core join (that may require further eliminating Rect usage in the kdb and rtree partitioners).Closes #782.