[GH-2880] Omit bbox in GeoParquet metadata for empty files #2903

Merged
jiayuasu merged 1 commit into apache:master from james-willis:fix/empty-geoparquet-bbox
May 5, 2026

@james-willis james-willis commented May 4, 2026

Summary

Fixes #2880.

When a Spark partition has zero rows, Sedona's GeoParquet writer was emitting bbox: [0, 0, 0, 0] in the per-column geo metadata. Per the GeoParquet 1.1 spec, bbox is the bounding box of the geometries in the file and is optional ("if specified, MUST be encoded with an array..."). For a file with no geometries we should omit it, not fabricate a phantom extent at Null Island.

The fabricated [0, 0, 0, 0] is harmful for downstream consumers:

  • Corrupts dataset extent reporting. Tools that aggregate per-file bboxes (or GDAL's GetExtent() on a single file) report (0, 0) as the layer extent.
  • Spec non-compliance. Asserts an extent that does not exist in the file.
  • Subtly poisons bbox-based file pruning in any reader that uses the column-metadata bbox for spatial pushdown — files appear to overlap any AOI containing the origin.
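
To make the extent-corruption bullet concrete, here is a minimal sketch of dataset-level extent aggregation. It is not code from Sedona or from any of the readers discussed; `unionExtent` and the `Seq(minX, minY, maxX, maxY)` layout are assumptions made for illustration only.

```scala
object ExtentUnionSketch {
  // Hypothetical helper: unions per-file bboxes laid out as Seq(minX, minY, maxX, maxY).
  // Files without a bbox (None) contribute nothing to the result.
  def unionExtent(perFile: Seq[Option[Seq[Double]]]): Option[Seq[Double]] =
    perFile.flatten.reduceOption { (a, b) =>
      Seq(
        math.min(a(0), b(0)),
        math.min(a(1), b(1)),
        math.max(a(2), b(2)),
        math.max(a(3), b(3)))
    }

  val realFile: Option[Seq[Double]] = Some(Seq(10.0, 20.0, 11.0, 21.0))

  // Old writer: the empty file claims an extent at the origin and drags the union to (0, 0).
  val withPhantom: Option[Seq[Double]] =
    unionExtent(Seq(realFile, Some(Seq(0.0, 0.0, 0.0, 0.0))))

  // Fixed writer: the empty file has no bbox, so the union covers only the real data.
  val withOmitted: Option[Seq[Double]] =
    unionExtent(Seq(realFile, None))
}
```

With the phantom bbox the union stretches to `Seq(0.0, 0.0, 11.0, 21.0)`; with the bbox omitted it stays at the real file's extent.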

Change

Make GeometryFieldMetaData.bbox an Option[Seq[Double]] and emit None (which json4s omits from the JSON) when no geometries were observed during write. The bbox accumulator at GeometryColumnBoundingBox was already correctly initialized to (+Inf, +Inf, -Inf, -Inf); only the fallback at GeoParquetWriteSupport.scala:248-254 needed fixing.
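
As a rough illustration of the idea (not the actual Sedona code), the sketch below shows an accumulator that starts at (+Inf, +Inf, -Inf, -Inf) and yields `None` when it was never updated. The names `BboxAccumulator` and `columnMetadataJson` are hypothetical, and the JSON is hand-rolled to keep the example dependency-free; the real writer relies on json4s to drop the `None` field.

```scala
object BboxAccumulatorSketch {
  // Hypothetical accumulator; it starts "inverted" so the first geometry always wins.
  final class BboxAccumulator {
    private var minX = Double.PositiveInfinity
    private var minY = Double.PositiveInfinity
    private var maxX = Double.NegativeInfinity
    private var maxY = Double.NegativeInfinity

    // Expand the running box with one geometry's envelope.
    def update(x0: Double, y0: Double, x1: Double, y1: Double): Unit = {
      minX = math.min(minX, x0)
      minY = math.min(minY, y0)
      maxX = math.max(maxX, x1)
      maxY = math.max(maxY, y1)
    }

    // None while the box is still inverted (nothing observed); never a made-up [0, 0, 0, 0].
    def result: Option[Seq[Double]] =
      if (minX <= maxX && minY <= maxY) Some(Seq(minX, minY, maxX, maxY)) else None
  }

  // Hypothetical rendering of the column metadata: bbox is written only when present.
  def columnMetadataJson(encoding: String, bbox: Option[Seq[Double]]): String = {
    val encodingField = "\"encoding\":\"" + encoding + "\""
    val bboxField = bbox.map(b => "\"bbox\":" + b.mkString("[", ",", "]"))
    (Seq(encodingField) ++ bboxField.toSeq).mkString("{", ",", "}")
  }
}
```

An accumulator that never sees a geometry produces `None`, so the rendered metadata carries no `bbox` key at all.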

The type change ripples through all consumers of the case class — the spark-{3.4,3.5,4.0,4.1} GeoParquetMetadataPartitionReaderFactory files, GeoParquetSpatialFilter.LeafFilter, and StacBatch are updated to handle the Option. Sedona's own LeafFilter already had a "no bbox → don't prune" branch (the original code path for files with empty bbox Seq); this PR extends that branch to handle the None case.
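
The "no bbox, don't prune" behavior can be sketched roughly as follows. This is a simplified stand-in, not the real `GeoParquetSpatialFilter.LeafFilter`: `shouldScan` is a hypothetical name, and both the bbox and the query window are assumed to be laid out as `Seq(minX, minY, maxX, maxY)`.

```scala
object LeafFilterSketch {
  // Returns true when the file must be scanned, false when it can be pruned.
  def shouldScan(bbox: Option[Seq[Double]], query: Seq[Double]): Boolean =
    bbox match {
      case Some(Seq(minX, minY, maxX, maxY)) =>
        // Extent is known: scan only if it intersects the query window.
        minX <= query(2) && maxX >= query(0) && minY <= query(3) && maxY >= query(1)
      case _ =>
        // None, or a legacy empty Seq: no extent info, so do not prune.
        true
    }
}
```

Treating "unknown extent" as "must scan" keeps the filter conservative: a missing bbox can cost a scan but can never drop rows.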

Why doesn't the reader prune zero-row files outright?

After this PR, a new empty file has bbox = None in the geo metadata, and LeafFilter.evaluate falls through to "no extent info, don't prune" — so empty files are always included in the scan (correctly returning 0 rows). We considered going further:

  1. Fingerprint-prune legacy [0, 0, 0, 0] files. Would let Sedona skip pre-existing buggy-writer files without scanning them. Rejected: a wrong bbox on an empty file never produces wrong query results, only sometimes wastes a scan. The downstream-reader correctness concerns (GDAL extent, stac-geoparquet aggregation) are addressed by the writer fix; Sedona's own pruning is correct in either direction because the file has zero rows.

  2. Row-count-prune any zero-row file. The reader path uses ParquetFooterReader.readFooter(..., SKIP_ROW_GROUPS). Under that filter, ParquetMetadata.getBlocks() returns empty regardless of the actual row group count, and the file-level num_rows from the Parquet thrift is not exposed via parquet-mr's Java FileMetaData API. Getting an honest row count requires switching to NO_FILTER, which adds row-group/column-stat thrift parse on every GeoParquet read. Rejected for the same reason: empty files are cheap to scan and the saved overhead doesn't justify the added cost on every read.

In both cases the marginal saved scan time wasn't worth the added complexity. If a workload's empty-file count is high enough to matter, a follow-up PR can switch to NO_FILTER only when a spatial filter is being pushed down (paying the parse cost only on queries that benefit from row-count pruning).
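
For reference, the follow-up idea could look something like the sketch below. It is purely illustrative: `FooterFilter`, `SkipRowGroups`, and `NoFilter` are local stand-ins for parquet-mr's metadata filters rather than the real types, and `chooseFooterFilter` and `canPruneAsEmpty` are hypothetical names.

```scala
object FooterFilterSketch {
  // Stand-ins for parquet-mr's SKIP_ROW_GROUPS and NO_FILTER (not the real types).
  sealed trait FooterFilter
  case object SkipRowGroups extends FooterFilter
  case object NoFilter extends FooterFilter

  // Pay for the full footer parse only when a spatial filter could use the row count.
  def chooseFooterFilter(hasSpatialFilter: Boolean): FooterFilter =
    if (hasSpatialFilter) NoFilter else SkipRowGroups

  // rowCount is None when the footer was read without row groups,
  // in which case nothing can be concluded and the file is kept.
  def canPruneAsEmpty(rowCount: Option[Long]): Boolean =
    rowCount.contains(0L)
}
```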

Impact on other GeoParquet readers

I checked how other major readers handle a missing or null bbox versus [0, 0, 0, 0]. In every case, omitting the field either improves behavior or changes nothing:

| Reader | Uses column-metadata bbox for pruning? | Tolerates missing bbox? | Effect of [0,0,0,0] today |
| --- | --- | --- | --- |
| DuckDB Spatial (ST_Read / read_parquet) | No; pruning uses Parquet column stats on a separate bbox struct column (discussion #484) | Yes | No effect on query results |
| GeoPandas (read_parquet in geopandas/io/arrow.py) | No; bbox= filter routes through covering.bbox struct column or a point encoding | Yes; GeoPandas itself already omits bbox for all-NA columns (if np.isfinite(bbox).all(): ...) | No effect on query results |
| GDAL/OGR Parquet driver | No; pruning uses bbox struct column / covering.bbox (driver docs) | Yes | Corrupts GetExtent() reporting (returns origin) |
| pyarrow | N/A; surfaces the JSON, doesn't interpret it | Yes | None |
| stac-geoparquet aggregator | Not yet (TODO in _to_parquet.py) | N/A | Would corrupt collection bbox unions once implemented |
| Sedona | Yes; GeoParquetSpatialFilter.LeafFilter | Yes (after this PR) | Wasted scans for AOIs near origin (no wrong results, since the file is empty) |

GeoPandas' existing behavior is precedent — it already omits bbox for empty/all-NA inputs.

Test plan

  • Updated geoparquetIOTests "GeoParquet save should work with empty dataframes": now asserts the bbox field is omitted from the JSON (was previously asserting Seq(0.0, 0.0, 0.0, 0.0)).
  • CI to run the rest of geoparquet* and Stac* suites (the type change touches StacBatch.scala and the v2 metadata partition reader factory across all four spark-{3.4,3.5,4.0,4.1} variants).

Spec reference

GeoParquet 1.1 (column metadata for the geometry column):

bbox: Bounding Box of the geometries in the file, formatted according to RFC 7946, section 5.

The bbox, if specified, MUST be encoded with an array...

https://geoparquet.org/releases/v1.1.0/

When a Spark partition has zero rows the GeoParquet writer was emitting
`bbox: [0, 0, 0, 0]` in the per-column geo metadata. Per the GeoParquet
1.1 spec, `bbox` is the bounding box of the geometries in the file and
is optional ("if specified, MUST be encoded..."), so for a file with no
geometries we should omit it rather than fabricate an extent.

The fabricated `[0, 0, 0, 0]` is especially harmful: it places a phantom
"data at Null Island" claim in the metadata, breaking bbox-based file
pruning in downstream readers (Sedona's own GeoParquetSpatialFilter,
DuckDB Spatial, GDAL's OGR_GEOPARQUET driver, GeoPandas) and corrupting
dataset-level extent aggregation.

This change makes `GeometryFieldMetaData.bbox` an `Option[Seq[Double]]`
and writes `None` (which json4s omits from JSON) when no geometries
were observed. All consumers of the case class are updated.
@james-willis james-willis requested a review from jiayuasu as a code owner May 4, 2026 17:53
@james-willis james-willis marked this pull request as draft May 4, 2026 18:01
@james-willis james-willis force-pushed the fix/empty-geoparquet-bbox branch 2 times, most recently from 72b178a to 3b308d8 Compare May 4, 2026 18:29
@james-willis james-willis marked this pull request as ready for review May 4, 2026 18:59
@jiayuasu jiayuasu added this to the sedona-1.9.1 milestone May 5, 2026
@jiayuasu jiayuasu added the bug label May 5, 2026
@jiayuasu jiayuasu changed the title [SEDONA-2880] Omit bbox in GeoParquet metadata for empty files [GH-2880] Omit bbox in GeoParquet metadata for empty files May 5, 2026
@jiayuasu jiayuasu merged commit e9a2d46 into apache:master May 5, 2026
42 checks passed
Development

Successfully merging this pull request may close these issues.

GeoParquet writer should not produce [0, 0, 0, 0] bbox in file metadata for empty parquet
