Skip to content

feat: N-dimensional raster type extension (Phase 1)#749

Open
james-willis wants to merge 6 commits into
apache:mainfrom
james-willis:jw/nd-raster-type
Open

feat: N-dimensional raster type extension (Phase 1)#749
james-willis wants to merge 6 commits into
apache:mainfrom
james-willis:jw/nd-raster-type

Conversation

@james-willis
Copy link
Copy Markdown
Contributor

@james-willis james-willis commented Apr 1, 2026

Summary

Upgrades SedonaDB's raster type from 2D tiles to N-dimensional chunks with named dimensions, enabling support for climate model outputs, hyperspectral imagery, Zarr/NetCDF datacubes, and other multi-dimensional geospatial datasets.

This is a schema-and-API migration. All 33 existing RS_* functions are migrated mechanically and produce identical outputs on 2D inputs; no new SQL functions are added in this PR. GDAL-backed functions added in #787 are ported to read from the new schema and gate on BandRef::is_2d() — N-D bands return a clear Execution error pending MDArray/Zarr support.

This PR is rebased onto post-#787 main and is the destructive replacement of #787's band schema. Because the schema swap and the GDAL loader port have to land together to keep main compiling, the PR is structured as four review-friendly commits but only the tip is required to be CI-green — intermediate commits intentionally don't build and the PR is not bisectable across commits 1–3.

This PR depends on #812 (GDAL outdb-URI parser); merge order: #812#749.

Note on utils.rs — the GDAL indb-loader helpers added in #811 (append_as_indb_raster, dataset_to_indb_raster) are deleted in this PR because they target the legacy BandMetadata / StorageType / band_data_writer API that the N-D port removes. They are reintroduced against the canonical N-D schema in the follow-up PR #818, which is already drafted and stacked on top of this branch.

Commits

  1. Schema swap — replace feat(sedona-raster-gdal) add GDAL foundation library for supporting GDAL-based RS functions #787's band schema with the canonical N-D schema. Removes nodata_value, storage_type, outdb_url, outdb_band_id; adds dim_names, source_shape, nullable view, outdb_uri, outdb_format. Top-level gains spatial_dims / spatial_shape.
  2. Trait surface + is_2dRasterRef and BandRef accessors for the N-D schema; BandRef::is_2d() default method returns true iff dim_names == ["y","x"] with the identity view (used by GDAL-backed paths to refuse N-D input cleanly).
  3. Reader / builder / RS_* migration — identity-view reader and builder, identity-view default via null view row, all 33 RS_* functions ported. RS_BandPath keeps its existing fragment-stripping (format-agnostic display) and is unchanged.
  4. Port GDAL loader to canonical schema — rewrite sedona-raster-gdal to read from outdb_uri + the feat(raster-gdal): add parse_outdb_source helper for the GDAL format driver #812 parser instead of storage_type / outdb_url / outdb_band_id. VSI normalization, dataset cache, and RasterIO bodies are byte-for-byte unchanged — only the schema-read sites move. Each GDAL-backed SQL function gates on band.is_2d() at entry.

Equivalence mapping (vs #787)

#787 field New schema
storage_type (InDb / OutDbRef enum) data.is_empty() — InDb bands carry their bytes in data; OutDb bands have an empty data buffer and resolve through outdb_uri + outdb_format.
outdb_url outdb_uri (no rename, same string)
outdb_band_id encoded inside outdb_uri (<uri>#band=N or GDAL native subdataset URI). Parsed only inside the GDAL format driver via #812's parse_outdb_source. Format-agnostic surfaces (incl. RS_BandPath) treat outdb_uri as opaque.
nodata_value nodata: Binary (typed bytes; a null row means "no nodata")

Top-level adds spatial_dims: List<Utf8> and spatial_shape: List<UInt64> (e.g. ["y","x"] and [height, width]).

Schema (canonical)

raster: Struct<
  crs:           Utf8View (nullable),
  transform:     List<Float64>,            // 6-element GDAL GeoTransform
  spatial_dims:  List<Utf8View>,           // ["x","y"] today
  spatial_shape: List<Int64>,              // [width, height] today
  bands: List<Struct<
    name:         Utf8 (nullable),
    dim_names:    List<Utf8>,              // 2D bands: ["y","x"]
    source_shape: List<UInt64>,            // C-order extent of source
    data_type:    UInt32 (BandDataType discriminant),
    nodata:       Binary (nullable),
    view:         List<Struct<source_axis,start,step,steps: Int64>>
                  (nullable; null = canonical identity view),
    outdb_uri:    Utf8 (nullable),         // null = InDb; set = OutDbRef
    outdb_format: Utf8View (nullable),     // GDAL driver hint
    data:         BinaryView (non-nullable; empty for outdb-only)
  >>
>

The view field is defined in the schema so it is part of the canonical N-D layout and round-trips through Arrow IPC, but in this PR every writer emits a null view row (canonical identity) and the reader treats a non-null view row as a hard read error. Encoding identity as a null row keeps the common "no slice applied" case free in Arrow space; consumers see the same shape and bytes regardless of which path produced the band. The composition path that decodes non-null view rows lands in #813.

Traits

  • RasterRefspatial_dims, spatial_shape, transform, crs, num_bands, band(i), width/height/x_dim/y_dim (derived).
  • BandRefname, dim_names, source_shape, shape (visible, derived), view, data_type, nodata, outdb_uri, outdb_format, nd_buffer, contiguous_data returning Cow<[u8]>, is_2d (default).
  • NdBuffer — zero-copy buffer view used at the numpy / Arrow C Data Interface boundary.

Builder

  • start_raster / start_band for full N-D; start_raster_2d / start_band_2d for legacy 2D.
  • start_band writes a null view (canonical identity) — no caller change.
  • finish_band validates each band's visible shape against the raster's spatial_shape along the spatial dims.

Backward compatibility

Legacy 2D rasters are represented as bands with dim_names = ["y","x"], source_shape = [height, width], and an identity view. All existing RS_* outputs are byte-identical on 2D inputs; every test that passed against #787's GDAL functions passes against the ported loader.

Crates modified

  • sedona-schema — N-D raster schema definition.
  • sedona-raster — traits, Arrow array reader, builder, affine transform, display.
  • sedona-testing — raster test helpers.
  • sedona-raster-functions — all RS_* function implementations migrated; RS_BandPath unchanged.
  • sedona-raster-gdal — loader reads outdb_uri + feat(raster-gdal): add parse_outdb_source helper for the GDAL format driver #812 parser; SQL functions gate on is_2d().

Out of scope (follow-ups)

Test plan

  • sedona-schema: tests pass; verifies the view field is nullable and the view-struct field order matches the indices module.
  • sedona-raster: tests pass — identity round-trip via start_band (null view row); on-disk schema invariant that identity bands serialise as null view rows; reader rejects non-null view rows with a clear error; BandRef::is_2d truth-table tests; builder visible-shape validation against spatial_shape.
  • sedona-raster-functions: tests pass — all existing function tests produce identical outputs after migration.
  • sedona-raster-gdal: 46/46 pass — every feat(sedona-raster-gdal) add GDAL foundation library for supporting GDAL-based RS functions #787 SQL-function test passes against the ported loader; new tests cover N-D rejection at raster_ref_to_gdal_mem and the VRT path; one end-to-end test confirms path#band=2 selects band 2 of a two-band GeoTIFF.
  • cargo clippy --all-targets -- -D warnings clean on the affected crates.
  • cargo fmt --all --check clean.

@james-willis james-willis marked this pull request as ready for review April 3, 2026 18:06
Copy link
Copy Markdown
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll take a closer look at the implementation tomorrow, but wanted to get a preliminary review of the schema out!

Comment on lines -27 to 46
Field::new(column::METADATA, Self::metadata_type(), false),
Field::new(column::CRS, Self::crs_type(), true), // Optional: may be inferred from data
Field::new(column::CRS, Self::crs_type(), true),
Field::new(column::TRANSFORM, Self::transform_type(), false),
Field::new(column::X_DIM, DataType::Utf8View, false),
Field::new(column::Y_DIM, DataType::Utf8View, false),
Field::new(column::BANDS, Self::bands_type(), true),
])
}

/// Raster metadata schema
pub fn metadata_type() -> DataType {
DataType::Struct(Fields::from(vec![
// Raster dimensions
Field::new(column::WIDTH, DataType::UInt64, false),
Field::new(column::HEIGHT, DataType::UInt64, false),
// Geospatial transformation parameters
Field::new(column::UPPERLEFT_X, DataType::Float64, false),
Field::new(column::UPPERLEFT_Y, DataType::Float64, false),
Field::new(column::SCALE_X, DataType::Float64, false),
Field::new(column::SCALE_Y, DataType::Float64, false),
Field::new(column::SKEW_X, DataType::Float64, false),
Field::new(column::SKEW_Y, DataType::Float64, false),
]))
/// Affine transform schema — 6-element GDAL GeoTransform:
/// `[origin_x, scale_x, skew_x, origin_y, skew_y, scale_y]`
pub fn transform_type() -> DataType {
DataType::List(FieldRef::new(Field::new("item", DataType::Float64, false)))
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just highlighting for myself + others that this is the main change.

The main conceptual change here seems to be that the height and width in 2D space are now members of the bands. Even though it might involve some duplication, I wonder if a top-level shape (whose ordering is always x, y, and possibly z in some future) would still be appropriate. The bands would then all be parsed according to the x/y dim name and validated against the top-level declaration.

X_DIM + Y_DIM should probably also be a list to allow for future Z. Lists are kind of a pain in Rust but interacting with these in Python is fairly easy to do and this PR is probably the only one that needs to consider that detail.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in zarr, each variable (here band) can have any permutation of dimensions. so band 1 might be xyzt, 2 might be yzt, 3 might be xy

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

X_DIM + Y_DIM should probably also be a list to allow for future Z. Lists are kind of a pain in Rust but interacting with these in Python is fairly easy to do and this PR is probably the only one that needs to consider that detail.

Sorry I'm not clear what you want me to do. Should I decompose this into 2 instances of some GeoTransformDimension type?

I adopted this norm to match GDAL. I my assumption was that X and Y are somewhat privileged in the system. I hadn't given much thought to affine transformations outside of the xy plane.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I'm not clear what you want me to do.

  • Change Field::new(column::X_DIM, DataType::Utf8View, false), Field::new(column::Y_DIM, DataType::Utf8View, false) to Field::new(column::DIMS, DataType::new_list(DataType::Utf8View), false). This be of length 2 for now and you don't have to change your Rust wrapper.
  • Your Arrow representation of the transform is already Z-ready (it would have 12 values instead of 6 in the future where Z is supported).
  • Add Field::new(column::SHAPE, DataType::new_list(DataType::Int64), false), which is the number of values in each of column::DIMS, in the same order. I may be reading it wrong but I think you currently have to peek into the first band for this. I think it is clearer to keep the spatial source of truth at the raster level and then validate every band against that (the transform and the height/width are very closely related to each other and I think having them at the same level makes sense). Also having a zero-band raster is surprisingly useful (e.g., gdal uses this as a way to specify a target grid to use for the result of an operation like warp).

Comment thread rust/sedona-schema/src/raster.rs Outdated
Copy link
Copy Markdown
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did another round...there are a few places where tests or comments were removed that were still useful.

I do think it's worth iterating a tiny bit on the schema, and importantly, I think we should wait until @Kontinuation has finished the GDAL read portion of the previous set of raster refactoring because (1) this was a lot of work and Kristin's time is valuable and (2) these are important use cases to consider when making the new type work!

Comment thread rust/sedona-raster/src/affine_transformation.rs Outdated
Comment thread rust/sedona-raster/src/affine_transformation.rs
Comment thread rust/sedona-raster/src/display.rs
Comment thread rust/sedona-raster/src/outdb_uri.rs Outdated
Comment thread rust/sedona-testing/src/rasters.rs
Comment thread rust/sedona-testing/src/rasters.rs
@james-willis james-willis marked this pull request as draft April 20, 2026 18:17
@james-willis james-willis force-pushed the jw/nd-raster-type branch 2 times, most recently from 0af0b01 to 4302af4 Compare April 20, 2026 18:41
james-willis added a commit to james-willis/sedona-db that referenced this pull request Apr 22, 2026
…hape

Replace the scalar x_dim/y_dim Utf8View fields with a top-level
spatial_dims (List<Utf8View>) and add spatial_shape (List<Int64>).
The raster struct becomes the single source of truth for the spatial
grid; every band must contain each name in spatial_dims with a size
matching the corresponding entry in spatial_shape.

- sedona-schema: column constants SPATIAL_DIMS / SPATIAL_SHAPE,
  raster_indices 0..4, new spatial_dims_type() / spatial_shape_type()
  accessors.
- sedona-raster: RasterRef gains spatial_dims() + spatial_shape();
  x_dim() / y_dim() / width() / height() become default methods
  reading from the top-level fields, so width/height work even on
  zero-band rasters ("target grid" GDAL warp pattern).
- RasterStructArray: decode the two new ListArrays; per-row slicing
  mirrors the existing band dim_names/shape accessors.
- RasterBuilder: new start_raster(transform, spatial_dims[],
  spatial_shape[], crs) signature; finish_raster strictly validates
  every band against the declared spatial grid and returns
  InvalidArgumentError on mismatch.
- sedona-testing: assert_raster_equal compares the new fields.

Addresses Dewey's review comment on PR apache#749 re: making the raster the
authority on the spatial grid.
james-willis added a commit to james-willis/sedona-db that referenced this pull request May 4, 2026
…hape

Replace the scalar x_dim/y_dim Utf8View fields with a top-level
spatial_dims (List<Utf8View>) and add spatial_shape (List<Int64>).
The raster struct becomes the single source of truth for the spatial
grid; every band must contain each name in spatial_dims with a size
matching the corresponding entry in spatial_shape.

- sedona-schema: column constants SPATIAL_DIMS / SPATIAL_SHAPE,
  raster_indices 0..4, new spatial_dims_type() / spatial_shape_type()
  accessors.
- sedona-raster: RasterRef gains spatial_dims() + spatial_shape();
  x_dim() / y_dim() / width() / height() become default methods
  reading from the top-level fields, so width/height work even on
  zero-band rasters ("target grid" GDAL warp pattern).
- RasterStructArray: decode the two new ListArrays; per-row slicing
  mirrors the existing band dim_names/shape accessors.
- RasterBuilder: new start_raster(transform, spatial_dims[],
  spatial_shape[], crs) signature; finish_raster strictly validates
  every band against the declared spatial grid and returns
  InvalidArgumentError on mismatch.
- sedona-testing: assert_raster_equal compares the new fields.

Addresses Dewey's review comment on PR apache#749 re: making the raster the
authority on the spatial grid.
@james-willis james-willis force-pushed the jw/nd-raster-type branch 3 times, most recently from 58b56e1 to 4ddfdbf Compare May 5, 2026 00:18
@james-willis james-willis force-pushed the jw/nd-raster-type branch 2 times, most recently from 999a577 to dda82a7 Compare May 5, 2026 18:14
builder.append_value(dt.pixel_type_name());
match raster.band_data_type((band_index - 1) as usize) {
Some(dt) => builder.append_value(dt.pixel_type_name()),
None => builder.append_null(),
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not totally sure if we should handle this branch or throw. should generally be unreachable - we don't expect the arrow to contain datatype values that don't exist.

@james-willis james-willis force-pushed the jw/nd-raster-type branch 6 times, most recently from 94be286 to 965e7d9 Compare May 5, 2026 22:22
Comment on lines +221 to +236
// contiguous_data() is Cow::Borrowed for is_2d identity views; the
// borrow points at the StructArray's backing buffer, which outlives
// the dataset (held by the caller). For Cow::Owned the pointer would
// dangle the moment the Cow drops, so we reject that case loudly.
let band_data = band
.contiguous_data()
.map_err(|e| arrow_datafusion_err!(e))?;
let bytes: &[u8] = match &band_data {
Cow::Borrowed(b) => b,
Cow::Owned(_) => {
return exec_err!(
"Internal: contiguous_data must be borrowed for is_2d bands; got owned"
);
}
};
let data_ptr = bytes.as_ptr();
Copy link
Copy Markdown
Contributor Author

@james-willis james-willis May 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note to self. This whole section and the contiguous data thing will need some thought from others and me.

@james-willis james-willis force-pushed the jw/nd-raster-type branch 3 times, most recently from 3484f09 to a33fdd1 Compare May 5, 2026 23:00
Replaces apache#787's 2D-only band schema with the canonical N-D schema:
spatial_dims/spatial_shape at the raster level; bands carry dim_names,
source_shape, nullable view, outdb_uri, outdb_format, plus the
non-nullable data buffer. Removes nodata_value, storage_type,
outdb_url, and outdb_band_id - every one is encodable in the new
schema:

- storage_type ↔ outdb_uri.is_null() (null = InDb, set = OutDbRef).
- outdb_url ↔ outdb_uri (no rename, same string).
- outdb_band_id ↔ encoded inside outdb_uri (#band=N or GDAL native
  subdataset URI), parsed only inside the GDAL format driver.
- nodata_value ↔ typed nodata: Binary (a null row means "no nodata").

Top-level adds spatial_dims: List<Utf8View> and spatial_shape:
List<Int64>; nullable view is List<Struct<source_axis, start, step,
steps: Int64>> where a null row encodes the canonical identity view.

Note: intermediate commits in this PR are not expected to build; only
the PR tip is CI-green. The trait, reader/builder, RS_* migration,
and GDAL loader port land in subsequent commits.
RasterRef and BandRef accessors over the canonical N-D schema:
spatial_dims/spatial_shape, transform, crs, num_bands, band(i), and
band-level dim_names, source_shape, shape (visible, derived from view),
view, data_type, nodata, outdb_uri, outdb_format, nd_buffer,
contiguous_data returning Cow<[u8]>.

validate_view enforces all view rules including i64-overflow on
start + (steps-1)*step. NdBuffer exposes raw buffer + shape + byte
strides + offset for zero-copy access (numpy / Arrow C Data Interface
boundary); VIEW → byte strides happens inside nd_buffer().

Adds BandRef::is_2d() default method as the gate GDAL-backed paths
use to refuse N-D input cleanly: true iff dim_names == ["y","x"]
over the identity view.
… reader/builder + RS_* migration

View-aware Arrow reader (RasterStructArray, BandRefImpl) with corruption-
surgery (negative steps, bad source_axis, length mismatch) that
round-trips an ArrowError. Builder exposes start_raster / start_band
for full N-D plus start_raster_2d / start_band_2d for legacy 2D, with
identity-view default written as a null view row. finish_raster
validates each band's visible shape against the raster's spatial_shape
along the spatial dims.

All 33 RS_* functions migrated mechanically; outputs on 2D inputs are
byte-identical to apache#787. RS_BandPath keeps its existing inline
fragment-stripping (format-agnostic display, untouched by the GDAL
parser). Test helpers in sedona-testing rewritten on the N-D builder
API.
Reads outdb_uri + parse_outdb_source instead of apache#787's storage_type /
outdb_url / outdb_band_id triplet. Each GDAL-backed SQL function gates
on BandRef::is_2d() at entry and returns an Execution error on N-D
input. VSI normalization, the dataset cache, and RasterIO bodies are
byte-for-byte unchanged from apache#787 - only the schema-read sites move.

In-db reads use BandRef::contiguous_data() and require Cow::Borrowed
so MEM datasets can point at the StructArray's backing buffer without
copying; for is_2d identity views this always holds.

Tests rebuilt to use RasterBuilder directly. Adds an N-D rejection
test for raster_ref_to_gdal_mem and the VRT path, plus an end-to-end
`raster_ref_to_gdal_mem` previously returned a `Result<Dataset>` and
guarded against `BandRef::contiguous_data()` returning `Cow::Owned`
with a runtime tripwire ("Internal: contiguous_data must be borrowed
for is_2d bands; got owned"). The check was correct — handing GDAL a
pointer into a `Vec<u8>` that drops at the end of the iteration would
dangle — but it ties an internal invariant ("`is_2d` ⇒ Borrowed") to
incidental properties of today's reader. Any future copy path in the
reader (compression, BinaryView block-boundary stitching, alignment
fix-up, sliced/broadcast/transposed views from apache#813 / apache#750) would
detonate the tripwire on perfectly valid 2-D rasters.

Change: return `Result<(Dataset, Vec<Vec<u8>>)>`. On `Cow::Borrowed`
the GDAL band still points directly at the StructArray buffer
(zero-copy). On `Cow::Owned` we move the `Vec<u8>` out of the Cow
without copying — the reader's existing materialization is the only
allocation — and stash it in the returned vector. The caller (the
provider in `gdal_dataset_provider.rs`) parks it in a new
`RasterDataset::_owned_band_bytes` field that lives as long as the
MEM dataset that holds the pointers.

`raster_ref_to_gdal_empty` discards the always-empty vector.
`BandRef::nd_buffer()` and `BandRef::contiguous_data()` previously
returned the empty Arrow `data` buffer when `is_indb() == false`,
giving consumers silent garbage instead of a signal that the band
needs a backend-specific OutDb resolver. Replace the silent path
with `ArrowError::NotYetImplemented`, documenting the integration
point for the future resolver work.

No existing call site in `sedona-raster-functions` reads bytes from
OutDb bands today (RS_BandPath only reads the URI), so this guard
turns a latent gap into a visible one without breaking behavior.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants