pluggable registry for input/export arrow kernels #7824
Signed-off-by: Andrew Duffy <andrew@a10y.dev>
Merging this PR will degrade performance by 12.38%

| | Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|---|
| ❌ | Simulation | bench_compare_sliced_dict_primitive[(1000, 10000)] | 79.7 µs | 91 µs | -12.38% |
| ❌ | Simulation | bench_compare_sliced_dict_primitive[(2000, 10000)] | 85.2 µs | 95.2 µs | -10.54% |
| ❌ | Simulation | bench_compare_sliced_dict_primitive[(2500, 10000)] | 88 µs | 97.9 µs | -10.19% |
| ❌ | Simulation | bench_compare_sliced_dict_primitive[(3333, 10000)] | 92.5 µs | 102.9 µs | -10.15% |
| ❌ | Simulation | bench_compare_sliced_dict_varbinview[(1000, 10000)] | 111.8 µs | 124.6 µs | -10.28% |
| ❌ | Simulation | canonicalize_compare[(1000, 4, 4)] | 122.1 µs | 135.7 µs | -10% |

Comparing aduffy/arrow-vtable (71d6452) with develop (f3d5f09)
Benchmarks: PolarSignals Profiling
Vortex (geomean): 0.981x ➖
datafusion / vortex-file-compressed (0.981x ➖, 0↑ 0↓)
File Sizes: PolarSignals Profiling
No file size changes detected.
Benchmarks: FineWeb NVMe
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (1.027x ➖, 0↑ 1↓)
datafusion / vortex-compact (1.015x ➖, 0↑ 0↓)
datafusion / parquet (0.994x ➖, 0↑ 0↓)
duckdb / vortex-file-compressed (1.022x ➖, 0↑ 0↓)
duckdb / vortex-compact (1.000x ➖, 0↑ 0↓)
duckdb / parquet (1.004x ➖, 0↑ 1↓)
File Sizes: FineWeb NVMe
No file size changes detected.
Benchmarks: TPC-H SF=1 on NVME
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (0.990x ➖, 0↑ 0↓)
datafusion / vortex-compact (0.996x ➖, 0↑ 0↓)
datafusion / parquet (1.000x ➖, 1↑ 1↓)
datafusion / arrow (1.022x ➖, 0↑ 1↓)
duckdb / vortex-file-compressed (1.005x ➖, 0↑ 0↓)
duckdb / vortex-compact (0.988x ➖, 0↑ 0↓)
duckdb / parquet (0.993x ➖, 1↑ 0↓)
duckdb / duckdb (0.991x ➖, 0↑ 1↓)
File Sizes: TPC-H SF=1 on NVME
No file size changes detected.
Benchmarks: TPC-DS SF=1 on NVME
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (1.108x ❌, 0↑ 55↓)
datafusion / vortex-compact (1.094x ➖, 1↑ 42↓)
datafusion / parquet (1.099x ➖, 1↑ 50↓)
duckdb / vortex-file-compressed (1.096x ➖, 1↑ 47↓)
duckdb / vortex-compact (1.084x ➖, 0↑ 39↓)
duckdb / parquet (1.055x ➖, 0↑ 14↓)
duckdb / duckdb (1.106x ❌, 0↑ 56↓)
File Sizes: TPC-DS SF=1 on NVME
No file size changes detected.
Benchmarks: FineWeb S3
Verdict: No clear signal (environment too noisy)
datafusion / vortex-file-compressed (0.992x ➖, 0↑ 0↓)
datafusion / vortex-compact (0.960x ➖, 0↑ 0↓)
datafusion / parquet (0.922x ➖, 0↑ 0↓)
duckdb / vortex-file-compressed (0.990x ➖, 0↑ 0↓)
duckdb / vortex-compact (1.007x ➖, 0↑ 0↓)
duckdb / parquet (0.989x ➖, 0↑ 0↓)
Benchmarks: Statistical and Population Genetics
Verdict: No clear signal (low confidence)
duckdb / vortex-file-compressed (1.002x ➖, 1↑ 0↓)
duckdb / vortex-compact (1.020x ➖, 0↑ 0↓)
duckdb / parquet (1.033x ➖, 0↑ 0↓)
File Sizes: Statistical and Population Genetics
No file size changes detected.
Benchmarks: TPC-H SF=10 on NVME
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (1.006x ➖, 0↑ 0↓)
datafusion / vortex-compact (1.014x ➖, 0↑ 0↓)
datafusion / parquet (0.988x ➖, 0↑ 0↓)
datafusion / arrow (1.126x ❌, 0↑ 14↓)
duckdb / vortex-file-compressed (1.056x ➖, 0↑ 3↓)
duckdb / vortex-compact (1.035x ➖, 0↑ 1↓)
duckdb / parquet (1.013x ➖, 0↑ 0↓)
duckdb / duckdb (1.068x ➖, 0↑ 5↓)
File Sizes: TPC-H SF=10 on NVME
No file size changes detected.
Benchmarks: Clickbench on NVME
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (1.073x ➖, 0↑ 4↓)
datafusion / parquet (1.055x ➖, 0↑ 3↓)
duckdb / vortex-file-compressed (1.089x ➖, 0↑ 11↓)
duckdb / parquet (1.022x ➖, 0↑ 3↓)
duckdb / duckdb (1.026x ➖, 0↑ 2↓)
File Sizes: Clickbench on NVME
File Size Changes (1 file changed, -0.0% overall, 0↑ 1↓)
Benchmarks: TPC-H SF=1 on S3
Verdict: No clear signal (environment too noisy)
datafusion / vortex-file-compressed (0.954x ➖, 0↑ 0↓)
datafusion / vortex-compact (0.975x ➖, 1↑ 0↓)
datafusion / parquet (0.994x ➖, 1↑ 0↓)
duckdb / vortex-file-compressed (1.009x ➖, 0↑ 0↓)
duckdb / vortex-compact (1.029x ➖, 0↑ 0↓)
duckdb / parquet (1.010x ➖, 0↑ 0↓)
Benchmarks: TPC-H SF=10 on S3
Verdict: No clear signal (environment too noisy)
datafusion / vortex-file-compressed (0.993x ➖, 0↑ 0↓)
datafusion / vortex-compact (0.996x ➖, 1↑ 2↓)
datafusion / parquet (0.973x ➖, 0↑ 0↓)
duckdb / vortex-file-compressed (1.000x ➖, 0↑ 0↓)
duckdb / vortex-compact (1.020x ➖, 0↑ 1↓)
duckdb / parquet (1.038x ➖, 0↑ 0↓)
Summary
Adds a pluggable `ArrowSession` registry on `VortexSession` for round-tripping Vortex extension types in and out of Arrow extension types. Unblocks Arrow round-trip for `arrow.uuid` today, with `arrow.parquet.variant`, GeoArrow, and tensor types as the next consumers.

Part of #7686.
API changes
The session exposes two trait-driven plugin slots:

- `ArrowExportVTable`: dispatched by target Arrow extension name (`ARROW:extension:name`). Implementations turn a Vortex `ArrayRef` into an Arrow `ArrayRef` shaped to the requested `Field`. Also provides `to_arrow_field` for schema inference when only a Vortex `DType` is in hand.
- `ArrowImportVTable`: dispatched by source Arrow extension name carried on the incoming `Field`. Implementations turn an Arrow `ArrayRef` back into a Vortex `ArrayRef`, including any storage re-encoding (e.g. `FixedSizeBinary[16]` → `FixedSizeList<u8; 16>` for UUID).

Both traits return `Unsupported(input)` to defer to the next plugin or to the canonical fallback, so multiple plugins can register against the same key and probe in order.

New session entry points (
`vortex-array/src/arrow/session.rs`):

- `ArrowSession::to_arrow_field` / `to_arrow_schema`: Vortex `DType` → Arrow `Field` / `Schema`, recursing into containers so nested extension fields go through the registered plugin.
- `ArrowSession::from_arrow_field` / `from_arrow_schema`: inverse direction, plugin-aware.
- `ArrowSession::from_arrow_record_batch` / `execute_record_batch`: `RecordBatch` round-trip.
- `ArrowSessionExt` extension trait so any `SessionExt` can call `session.arrow().…`.

The default session pre-registers the builtin UUID plugin (`vortex-array/src/extension/uuid/arrow.rs`).

What's not in the plugin layer
`Date`, `Time`, and `Timestamp` are Vortex builtin extensions that map directly to native Arrow temporal types, so they continue to go through the canonical executor (`vortex-array/src/arrow/executor/temporal.rs`) rather than the plugin registry. The plugin layer is reserved for Arrow extension types that the canonical path can't express.

DataFusion wiring

`vortex-datafusion` now goes through the session for schema/array conversion:

- `convert/schema.rs::calculate_physical_schema` uses `ArrowSession::to_arrow_field` so extension metadata survives projection.
- `persistent/format.rs` and `persistent/opener.rs` route schema inference through the session.
- `persistent/sink.rs` uses `from_arrow_record_batch`, passing the original schema separately from `RecordBatch::schema()` to preserve `ARROW:extension:name` metadata that DataFusion strips at runtime.

Tests
Two new end-to-end tests in `vortex-datafusion/src/persistent/tests.rs`:

- `arrow_uuid_extension_roundtrip`: write Arrow UUID column to a Vortex file via the session, `SELECT *` it back, assert the field still carries the `Uuid` extension type and the values match.
- `arrow_uuid_extension_roundtrip_nested_struct`: same flow with the UUID nested in a top-level `Struct`, exercising recursive session-aware schema inference.
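
Illustrative sketch

The probe-in-order dispatch described under API changes (a plugin returns `Unsupported(input)` to hand the input on to the next plugin registered for the same extension name, or finally to the canonical fallback) might look roughly like the following. This is a standalone illustration and not the code in this PR: `Outcome`, `Registry`, `uuid_plugin`, and `canonical_fallback` are hypothetical names, and plain byte vectors and strings stand in for Arrow and Vortex arrays.

```rust
use std::collections::HashMap;

/// Result of asking one plugin to convert a value (hypothetical, simplified).
/// `Unsupported` hands the untouched input back so the next candidate can try.
#[derive(Debug, PartialEq)]
enum Outcome<I, O> {
    Done(O),
    Unsupported(I),
}

/// Stand-in for a plugin vtable: bytes in, converted value out.
type Plugin = Box<dyn Fn(Vec<u8>) -> Outcome<Vec<u8>, String>>;

/// Hypothetical registry keyed by extension name (e.g. "arrow.uuid").
#[derive(Default)]
struct Registry {
    plugins: HashMap<String, Vec<Plugin>>,
}

impl Registry {
    /// Several plugins may share a key; they are probed in registration order.
    fn register(&mut self, extension_name: &str, plugin: Plugin) {
        self.plugins
            .entry(extension_name.to_string())
            .or_default()
            .push(plugin);
    }

    /// Try each plugin for the key; fall back to the canonical path if none accepts.
    fn convert(&self, extension_name: &str, mut input: Vec<u8>) -> String {
        if let Some(candidates) = self.plugins.get(extension_name) {
            for plugin in candidates {
                match plugin(input) {
                    Outcome::Done(output) => return output,
                    // Ownership of the input comes back, so nothing is cloned.
                    Outcome::Unsupported(returned) => input = returned,
                }
            }
        }
        canonical_fallback(&input)
    }
}

/// Hypothetical plugin: accepts only 16-byte values and renders them as hex.
fn uuid_plugin(input: Vec<u8>) -> Outcome<Vec<u8>, String> {
    if input.len() != 16 {
        return Outcome::Unsupported(input);
    }
    Outcome::Done(input.iter().map(|b| format!("{b:02x}")).collect())
}

/// Hypothetical stand-in for the canonical (non-plugin) conversion path.
fn canonical_fallback(input: &[u8]) -> String {
    format!("canonical:{} bytes", input.len())
}
```

In this sketch, registering `uuid_plugin` under `"arrow.uuid"` and calling `convert` with a 16-byte value goes through the plugin, while a 3-byte value or an unregistered extension name falls through to `canonical_fallback`. Returning the input by value inside `Unsupported` is what lets the chain continue without copying.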