feat: N-dimensional raster type extension (Phase 1)#749
Conversation
2b969d1 to
66b3c2e
Compare
paleolimbot
left a comment
There was a problem hiding this comment.
I'll take a closer look at the implementation tomorrow, but wanted to get a preliminary review of the schema out!
| Field::new(column::METADATA, Self::metadata_type(), false), | ||
| Field::new(column::CRS, Self::crs_type(), true), // Optional: may be inferred from data | ||
| Field::new(column::CRS, Self::crs_type(), true), | ||
| Field::new(column::TRANSFORM, Self::transform_type(), false), | ||
| Field::new(column::X_DIM, DataType::Utf8View, false), | ||
| Field::new(column::Y_DIM, DataType::Utf8View, false), | ||
| Field::new(column::BANDS, Self::bands_type(), true), | ||
| ]) | ||
| } | ||
|
|
||
| /// Raster metadata schema | ||
| pub fn metadata_type() -> DataType { | ||
| DataType::Struct(Fields::from(vec![ | ||
| // Raster dimensions | ||
| Field::new(column::WIDTH, DataType::UInt64, false), | ||
| Field::new(column::HEIGHT, DataType::UInt64, false), | ||
| // Geospatial transformation parameters | ||
| Field::new(column::UPPERLEFT_X, DataType::Float64, false), | ||
| Field::new(column::UPPERLEFT_Y, DataType::Float64, false), | ||
| Field::new(column::SCALE_X, DataType::Float64, false), | ||
| Field::new(column::SCALE_Y, DataType::Float64, false), | ||
| Field::new(column::SKEW_X, DataType::Float64, false), | ||
| Field::new(column::SKEW_Y, DataType::Float64, false), | ||
| ])) | ||
| /// Affine transform schema — 6-element GDAL GeoTransform: | ||
| /// `[origin_x, scale_x, skew_x, origin_y, skew_y, scale_y]` | ||
| pub fn transform_type() -> DataType { | ||
| DataType::List(FieldRef::new(Field::new("item", DataType::Float64, false))) | ||
| } |
There was a problem hiding this comment.
Just highlighting for myself + others that this is the main change.
The main conceptual change here seems to be that the height and width in 2D space are now members of the bands. Even though it might involve some duplication, I wonder if a top-level shape (whose ordering is always x, y, and possibly z in some future) would still be appropriate. The bands would then all be parsed according to the x/y dim name and validated against the top-level declaration.
X_DIM + Y_DIM should probably also be a list to allow for future Z. Lists are kind of a pain in Rust but interacting with these in Python is fairly easy to do and this PR is probably the only one that needs to consider that detail.
There was a problem hiding this comment.
in zarr, each variable (here band) can have any permutation of dimensions. so band 1 might be xyzt, 2 might be yzt, 3 might be xy
There was a problem hiding this comment.
X_DIM + Y_DIM should probably also be a list to allow for future Z. Lists are kind of a pain in Rust but interacting with these in Python is fairly easy to do and this PR is probably the only one that needs to consider that detail.
Sorry I'm not clear what you want me to do. Should I decompose this into 2 instances of some GeoTransformDimension type?
I adopted this norm to match GDAL. I my assumption was that X and Y are somewhat privileged in the system. I hadn't given much thought to affine transformations outside of the xy plane.
There was a problem hiding this comment.
Sorry I'm not clear what you want me to do.
- Change
Field::new(column::X_DIM, DataType::Utf8View, false), Field::new(column::Y_DIM, DataType::Utf8View, false)toField::new(column::DIMS, DataType::new_list(DataType::Utf8View), false). This be of length 2 for now and you don't have to change your Rust wrapper. - Your Arrow representation of the transform is already Z-ready (it would have 12 values instead of 6 in the future where Z is supported).
- Add
Field::new(column::SHAPE, DataType::new_list(DataType::Int64), false), which is the number of values in each ofcolumn::DIMS, in the same order. I may be reading it wrong but I think you currently have to peek into the first band for this. I think it is clearer to keep the spatial source of truth at the raster level and then validate every band against that (the transform and the height/width are very closely related to each other and I think having them at the same level makes sense). Also having a zero-band raster is surprisingly useful (e.g., gdal uses this as a way to specify a target grid to use for the result of an operation like warp).
paleolimbot
left a comment
There was a problem hiding this comment.
I did another round...there are a few places where tests or comments were removed that were still useful.
I do think it's worth iterating a tiny bit on the schema, and importantly, I think we should wait until @Kontinuation has finished the GDAL read portion of the previous set of raster refactoring because (1) this was a lot of work and Kristin's time is valuable and (2) these are important use cases to consider when making the new type work!
0af0b01 to
4302af4
Compare
…hape
Replace the scalar x_dim/y_dim Utf8View fields with a top-level
spatial_dims (List<Utf8View>) and add spatial_shape (List<Int64>).
The raster struct becomes the single source of truth for the spatial
grid; every band must contain each name in spatial_dims with a size
matching the corresponding entry in spatial_shape.
- sedona-schema: column constants SPATIAL_DIMS / SPATIAL_SHAPE,
raster_indices 0..4, new spatial_dims_type() / spatial_shape_type()
accessors.
- sedona-raster: RasterRef gains spatial_dims() + spatial_shape();
x_dim() / y_dim() / width() / height() become default methods
reading from the top-level fields, so width/height work even on
zero-band rasters ("target grid" GDAL warp pattern).
- RasterStructArray: decode the two new ListArrays; per-row slicing
mirrors the existing band dim_names/shape accessors.
- RasterBuilder: new start_raster(transform, spatial_dims[],
spatial_shape[], crs) signature; finish_raster strictly validates
every band against the declared spatial grid and returns
InvalidArgumentError on mismatch.
- sedona-testing: assert_raster_equal compares the new fields.
Addresses Dewey's review comment on PR apache#749 re: making the raster the
authority on the spatial grid.
4302af4 to
4033c82
Compare
…hape
Replace the scalar x_dim/y_dim Utf8View fields with a top-level
spatial_dims (List<Utf8View>) and add spatial_shape (List<Int64>).
The raster struct becomes the single source of truth for the spatial
grid; every band must contain each name in spatial_dims with a size
matching the corresponding entry in spatial_shape.
- sedona-schema: column constants SPATIAL_DIMS / SPATIAL_SHAPE,
raster_indices 0..4, new spatial_dims_type() / spatial_shape_type()
accessors.
- sedona-raster: RasterRef gains spatial_dims() + spatial_shape();
x_dim() / y_dim() / width() / height() become default methods
reading from the top-level fields, so width/height work even on
zero-band rasters ("target grid" GDAL warp pattern).
- RasterStructArray: decode the two new ListArrays; per-row slicing
mirrors the existing band dim_names/shape accessors.
- RasterBuilder: new start_raster(transform, spatial_dims[],
spatial_shape[], crs) signature; finish_raster strictly validates
every band against the declared spatial grid and returns
InvalidArgumentError on mismatch.
- sedona-testing: assert_raster_equal compares the new fields.
Addresses Dewey's review comment on PR apache#749 re: making the raster the
authority on the spatial grid.
58b56e1 to
4ddfdbf
Compare
999a577 to
dda82a7
Compare
| builder.append_value(dt.pixel_type_name()); | ||
| match raster.band_data_type((band_index - 1) as usize) { | ||
| Some(dt) => builder.append_value(dt.pixel_type_name()), | ||
| None => builder.append_null(), |
There was a problem hiding this comment.
not totally sure if we should handle this branch or throw. should generally be unreachable - we don't expect the arrow to contain datatype values that don't exist.
94be286 to
965e7d9
Compare
| // contiguous_data() is Cow::Borrowed for is_2d identity views; the | ||
| // borrow points at the StructArray's backing buffer, which outlives | ||
| // the dataset (held by the caller). For Cow::Owned the pointer would | ||
| // dangle the moment the Cow drops, so we reject that case loudly. | ||
| let band_data = band | ||
| .contiguous_data() | ||
| .map_err(|e| arrow_datafusion_err!(e))?; | ||
| let bytes: &[u8] = match &band_data { | ||
| Cow::Borrowed(b) => b, | ||
| Cow::Owned(_) => { | ||
| return exec_err!( | ||
| "Internal: contiguous_data must be borrowed for is_2d bands; got owned" | ||
| ); | ||
| } | ||
| }; | ||
| let data_ptr = bytes.as_ptr(); |
There was a problem hiding this comment.
Note to self. This whole section and the contiguous data thing will need some thought from others and me.
3484f09 to
a33fdd1
Compare
Replaces apache#787's 2D-only band schema with the canonical N-D schema: spatial_dims/spatial_shape at the raster level; bands carry dim_names, source_shape, nullable view, outdb_uri, outdb_format, plus the non-nullable data buffer. Removes nodata_value, storage_type, outdb_url, and outdb_band_id - every one is encodable in the new schema: - storage_type ↔ outdb_uri.is_null() (null = InDb, set = OutDbRef). - outdb_url ↔ outdb_uri (no rename, same string). - outdb_band_id ↔ encoded inside outdb_uri (#band=N or GDAL native subdataset URI), parsed only inside the GDAL format driver. - nodata_value ↔ typed nodata: Binary (a null row means "no nodata"). Top-level adds spatial_dims: List<Utf8View> and spatial_shape: List<Int64>; nullable view is List<Struct<source_axis, start, step, steps: Int64>> where a null row encodes the canonical identity view. Note: intermediate commits in this PR are not expected to build; only the PR tip is CI-green. The trait, reader/builder, RS_* migration, and GDAL loader port land in subsequent commits.
RasterRef and BandRef accessors over the canonical N-D schema: spatial_dims/spatial_shape, transform, crs, num_bands, band(i), and band-level dim_names, source_shape, shape (visible, derived from view), view, data_type, nodata, outdb_uri, outdb_format, nd_buffer, contiguous_data returning Cow<[u8]>. validate_view enforces all view rules including i64-overflow on start + (steps-1)*step. NdBuffer exposes raw buffer + shape + byte strides + offset for zero-copy access (numpy / Arrow C Data Interface boundary); VIEW → byte strides happens inside nd_buffer(). Adds BandRef::is_2d() default method as the gate GDAL-backed paths use to refuse N-D input cleanly: true iff dim_names == ["y","x"] over the identity view.
… reader/builder + RS_* migration View-aware Arrow reader (RasterStructArray, BandRefImpl) with corruption- surgery (negative steps, bad source_axis, length mismatch) that round-trips an ArrowError. Builder exposes start_raster / start_band for full N-D plus start_raster_2d / start_band_2d for legacy 2D, with identity-view default written as a null view row. finish_raster validates each band's visible shape against the raster's spatial_shape along the spatial dims. All 33 RS_* functions migrated mechanically; outputs on 2D inputs are byte-identical to apache#787. RS_BandPath keeps its existing inline fragment-stripping (format-agnostic display, untouched by the GDAL parser). Test helpers in sedona-testing rewritten on the N-D builder API.
Reads outdb_uri + parse_outdb_source instead of apache#787's storage_type / outdb_url / outdb_band_id triplet. Each GDAL-backed SQL function gates on BandRef::is_2d() at entry and returns an Execution error on N-D input. VSI normalization, the dataset cache, and RasterIO bodies are byte-for-byte unchanged from apache#787 - only the schema-read sites move. In-db reads use BandRef::contiguous_data() and require Cow::Borrowed so MEM datasets can point at the StructArray's backing buffer without copying; for is_2d identity views this always holds. Tests rebuilt to use RasterBuilder directly. Adds an N-D rejection test for raster_ref_to_gdal_mem and the VRT path, plus an end-to-end
a33fdd1 to
7a1c7e7
Compare
`raster_ref_to_gdal_mem` previously returned a `Result<Dataset>` and
guarded against `BandRef::contiguous_data()` returning `Cow::Owned`
with a runtime tripwire ("Internal: contiguous_data must be borrowed
for is_2d bands; got owned"). The check was correct — handing GDAL a
pointer into a `Vec<u8>` that drops at the end of the iteration would
dangle — but it ties an internal invariant ("`is_2d` ⇒ Borrowed") to
incidental properties of today's reader. Any future copy path in the
reader (compression, BinaryView block-boundary stitching, alignment
fix-up, sliced/broadcast/transposed views from apache#813 / apache#750) would
detonate the tripwire on perfectly valid 2-D rasters.
Change: return `Result<(Dataset, Vec<Vec<u8>>)>`. On `Cow::Borrowed`
the GDAL band still points directly at the StructArray buffer
(zero-copy). On `Cow::Owned` we move the `Vec<u8>` out of the Cow
without copying — the reader's existing materialization is the only
allocation — and stash it in the returned vector. The caller (the
provider in `gdal_dataset_provider.rs`) parks it in a new
`RasterDataset::_owned_band_bytes` field that lives as long as the
MEM dataset that holds the pointers.
`raster_ref_to_gdal_empty` discards the always-empty vector.
`BandRef::nd_buffer()` and `BandRef::contiguous_data()` previously returned the empty Arrow `data` buffer when `is_indb() == false`, giving consumers silent garbage instead of a signal that the band needs a backend-specific OutDb resolver. Replace the silent path with `ArrowError::NotYetImplemented`, documenting the integration point for the future resolver work. No existing call site in `sedona-raster-functions` reads bytes from OutDb bands today (RS_BandPath only reads the URI), so this guard turns a latent gap into a visible one without breaking behavior.
Summary
Upgrades SedonaDB's raster type from 2D tiles to N-dimensional chunks with named dimensions, enabling support for climate model outputs, hyperspectral imagery, Zarr/NetCDF datacubes, and other multi-dimensional geospatial datasets.
This is a schema-and-API migration. All 33 existing
RS_*functions are migrated mechanically and produce identical outputs on 2D inputs; no new SQL functions are added in this PR. GDAL-backed functions added in #787 are ported to read from the new schema and gate onBandRef::is_2d()— N-D bands return a clearExecutionerror pending MDArray/Zarr support.This PR is rebased onto post-#787
mainand is the destructive replacement of #787's band schema. Because the schema swap and the GDAL loader port have to land together to keepmaincompiling, the PR is structured as four review-friendly commits but only the tip is required to be CI-green — intermediate commits intentionally don't build and the PR is not bisectable across commits 1–3.This PR depends on #812 (GDAL outdb-URI parser); merge order: #812 → #749.
Commits
nodata_value,storage_type,outdb_url,outdb_band_id; addsdim_names,source_shape, nullableview,outdb_uri,outdb_format. Top-level gainsspatial_dims/spatial_shape.is_2d—RasterRefandBandRefaccessors for the N-D schema;BandRef::is_2d()default method returns true iffdim_names == ["y","x"]with the identity view (used by GDAL-backed paths to refuse N-D input cleanly).viewrow, all 33RS_*functions ported.RS_BandPathkeeps its existing fragment-stripping (format-agnostic display) and is unchanged.sedona-raster-gdalto read fromoutdb_uri+ the feat(raster-gdal): add parse_outdb_source helper for the GDAL format driver #812 parser instead ofstorage_type/outdb_url/outdb_band_id. VSI normalization, dataset cache, and RasterIO bodies are byte-for-byte unchanged — only the schema-read sites move. Each GDAL-backed SQL function gates onband.is_2d()at entry.Equivalence mapping (vs #787)
storage_type(InDb / OutDbRef enum)data.is_empty()— InDb bands carry their bytes indata; OutDb bands have an emptydatabuffer and resolve throughoutdb_uri+outdb_format.outdb_urloutdb_uri(no rename, same string)outdb_band_idoutdb_uri(<uri>#band=Nor GDAL native subdataset URI). Parsed only inside the GDAL format driver via #812'sparse_outdb_source. Format-agnostic surfaces (incl.RS_BandPath) treatoutdb_urias opaque.nodata_valuenodata: Binary(typed bytes; a null row means "no nodata")Top-level adds
spatial_dims: List<Utf8>andspatial_shape: List<UInt64>(e.g.["y","x"]and[height, width]).Schema (canonical)
The
viewfield is defined in the schema so it is part of the canonical N-D layout and round-trips through Arrow IPC, but in this PR every writer emits a null view row (canonical identity) and the reader treats a non-null view row as a hard read error. Encoding identity as a null row keeps the common "no slice applied" case free in Arrow space; consumers see the same shape and bytes regardless of which path produced the band. The composition path that decodes non-null view rows lands in #813.Traits
RasterRef—spatial_dims,spatial_shape,transform,crs,num_bands,band(i),width/height/x_dim/y_dim(derived).BandRef—name,dim_names,source_shape,shape(visible, derived),view,data_type,nodata,outdb_uri,outdb_format,nd_buffer,contiguous_datareturningCow<[u8]>,is_2d(default).NdBuffer— zero-copy buffer view used at the numpy / Arrow C Data Interface boundary.Builder
start_raster/start_bandfor full N-D;start_raster_2d/start_band_2dfor legacy 2D.start_bandwrites a null view (canonical identity) — no caller change.finish_bandvalidates each band's visible shape against the raster'sspatial_shapealong the spatial dims.Backward compatibility
Legacy 2D rasters are represented as bands with
dim_names = ["y","x"],source_shape = [height, width], and an identity view. All existingRS_*outputs are byte-identical on 2D inputs; every test that passed against #787's GDAL functions passes against the ported loader.Crates modified
sedona-schema— N-D raster schema definition.sedona-raster— traits, Arrow array reader, builder, affine transform, display.sedona-testing— raster test helpers.sedona-raster-functions— allRS_*function implementations migrated;RS_BandPathunchanged.sedona-raster-gdal— loader readsoutdb_uri+ feat(raster-gdal): add parse_outdb_source helper for the GDAL format driver #812 parser; SQL functions gate onis_2d().Out of scope (follow-ups)
append_as_indb_raster/dataset_to_indb_raster(deleted here, see note above) against the N-D schema.validate_view,start_band_with_view, view-decoding in the reader, view-awarecontiguous_data, and the explicit-view test cluster — landing in feat(raster): view machinery for non-identity band views #813 between this PR and feat: add N-D raster dimension query and manipulation functions #750.is_2d()guard for selected functions).Test plan
sedona-schema: tests pass; verifies theviewfield is nullable and the view-struct field order matches the indices module.sedona-raster: tests pass — identity round-trip viastart_band(null view row); on-disk schema invariant that identity bands serialise as null view rows; reader rejects non-null view rows with a clear error;BandRef::is_2dtruth-table tests; builder visible-shape validation againstspatial_shape.sedona-raster-functions: tests pass — all existing function tests produce identical outputs after migration.sedona-raster-gdal: 46/46 pass — every feat(sedona-raster-gdal) add GDAL foundation library for supporting GDAL-based RS functions #787 SQL-function test passes against the ported loader; new tests cover N-D rejection atraster_ref_to_gdal_memand the VRT path; one end-to-end test confirmspath#band=2selects band 2 of a two-band GeoTIFF.cargo clippy --all-targets -- -D warningsclean on the affected crates.cargo fmt --all --checkclean.