feat: add __arrow_c_stream__ function #11338

Draft
jules-ch wants to merge 9 commits into pydata:main from jules-ch:arrow-pycapsule-datarray

Conversation

jules-ch commented May 13, 2026

Description

Add the Arrow PyCapsule method __arrow_c_stream__ so that a DataArray can be converted to polars quickly.

The conversion is mostly zero-copy; only the coordinate grid needs to be computed.
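
To illustrate what "only the coordinate grid needs to be computed" means, here is a minimal numpy-only sketch of the long-format layout. It is not the code in this PR: `to_long_columns` and its arguments are hypothetical names chosen for illustration. The data column is a flat view of the values, while each coordinate has to be expanded to one entry per cell.

```python
import numpy as np


def to_long_columns(values, dims, coords, name="value"):
    """Flatten an N-D array into long-format columns (hypothetical helper).

    values: N-D numpy array
    dims:   dimension names, in axis order
    coords: dict mapping each dimension name to its 1-D coordinate array
    """
    values = np.asarray(values)
    # Expand every 1-D coordinate to the full grid; this is the part that
    # allocates new memory.
    grids = np.meshgrid(*(coords[d] for d in dims), indexing="ij")
    columns = {d: g.ravel() for d, g in zip(dims, grids)}
    # The data column is the values flattened in C order; for a C-contiguous
    # input, ravel() returns a view rather than a copy.
    columns[name] = values.ravel()
    return columns


data = np.arange(6, dtype=np.float64).reshape(2, 3)
cols = to_long_columns(
    data,
    dims=("lat", "lon"),
    coords={"lat": np.array([10.0, 20.0]), "lon": np.array([1.0, 2.0, 3.0])},
    name="air",
)
```

Here `cols["air"]` shares memory with `data`, while `cols["lat"]` and `cols["lon"]` are newly allocated arrays with one entry per grid cell.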

I wanted to implement __arrow_c_array__ to return a fixed_shape_tensor, but polars somehow prioritizes it over the __arrow_c_stream__ method.

So for convenience I am leaving this as it is for now.

Feel free to close this PR and discuss this further in a dedicated issue if you want.

We can go one step further and save memory with pa.DictionaryArray: pyarrow supports dictionary (index) encoding out of the box, so we only need to build the indices with numpy beforehand.
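
A rough sketch of how those indices could be built with numpy, assuming the values are flattened in C order. `grid_indices` is a hypothetical helper name, and the int32 index width is an arbitrary choice for illustration.

```python
import numpy as np


def grid_indices(shape):
    """Return one int32 index array per axis for a C-order flattened grid.

    Hypothetical helper: entry k of the i-th array is the position along
    axis i of the k-th element of the flattened data.
    """
    return [idx.ravel().astype(np.int32) for idx in np.indices(shape)]


lat = np.array([75.0, 72.5], dtype=np.float32)
lon = np.array([200.0, 202.5, 205.0], dtype=np.float32)

i_lat, i_lon = grid_indices((lat.size, lon.size))

# Indexing the small coordinate arrays with the indices reproduces the
# fully expanded coordinate columns.
lat_column = lat[i_lat]
lon_column = lon[i_lon]
```

Each index array and its coordinate array could then be passed to `pa.DictionaryArray.from_arrays(indices, dictionary)` instead of materializing the expanded column. How much memory this saves depends on the coordinate dtype and on the index width chosen.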

This PR enables:

>>> import polars as pl
>>> import pyarrow as pa
>>> import xarray as xr
>>> ds = xr.tutorial.load_dataset("air_temperature")
>>> df = pl.from_arrow(ds.air)
>>> df
shape: (3_869_000, 4)
┌─────────────────────┬──────┬───────┬────────┐
│ time                ┆ lat  ┆ lon   ┆ air    │
│ ---                 ┆ ---  ┆ ---   ┆ ---    │
│ datetime[ns]        ┆ f32  ┆ f32   ┆ f64    │
╞═════════════════════╪══════╪═══════╪════════╡
│ 2013-01-01 00:00:00 ┆ 75.0 ┆ 200.0 ┆ 241.2  │
│ 2013-01-01 00:00:00 ┆ 75.0 ┆ 202.5 ┆ 242.5  │
│ 2013-01-01 00:00:00 ┆ 75.0 ┆ 205.0 ┆ 243.5  │
│ 2013-01-01 00:00:00 ┆ 75.0 ┆ 207.5 ┆ 244.0  │
│ 2013-01-01 00:00:00 ┆ 75.0 ┆ 210.0 ┆ 244.1  │
│ …                   ┆ …    ┆ …     ┆ …      │
│ 2014-12-31 18:00:00 ┆ 15.0 ┆ 320.0 ┆ 297.39 │
│ 2014-12-31 18:00:00 ┆ 15.0 ┆ 322.5 ┆ 297.19 │
│ 2014-12-31 18:00:00 ┆ 15.0 ┆ 325.0 ┆ 296.49 │
│ 2014-12-31 18:00:00 ┆ 15.0 ┆ 327.5 ┆ 296.19 │
│ 2014-12-31 18:00:00 ┆ 15.0 ┆ 330.0 ┆ 295.69 │
└─────────────────────┴──────┴───────┴────────┘

Checklist

AI Disclosure

  • This PR contains AI-generated content.
    • I have tested any AI-generated content in my PR.
    • I take responsibility for any AI-generated content in my PR. Tools: {e.g., Claude, Codex, GitHub Copilot, ChatGPT, etc.}

jules-ch added 2 commits May 13, 2026 20:21
Add pyarrow capsule method to quickly convert datarray to polars

The function is mostly zero copy, only the coordinates grid need to be
computed
@jules-ch jules-ch force-pushed the arrow-pycapsule-datarray branch from 3c5df58 to ef226fb Compare May 13, 2026 18:28
@jules-ch jules-ch force-pushed the arrow-pycapsule-datarray branch from 3cd7cdc to d6ea2fe Compare May 13, 2026 18:30
@jules-ch jules-ch force-pushed the arrow-pycapsule-datarray branch from bc5c796 to e5e19d6 Compare May 13, 2026 18:32
@jules-ch jules-ch marked this pull request as draft May 14, 2026 13:08
Comment thread xarray/core/dataarray.py
Comment on lines +491 to +492
if not values.flags.c_contiguous:
values = np.ascontiguousarray(values)
Author

I think we can just use values.ravel() further down to ensure a contiguous array.
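
A quick numpy check of that assumption, as a standalone sketch independent of the PR code: `ravel()` returns a view when the input is already C-contiguous and a C-ordered copy otherwise, so the result is contiguous in both cases.

```python
import numpy as np

a = np.arange(6, dtype=np.float32).reshape(2, 3)
t = a.T  # transposed view of `a`, not C-contiguous

flat_a = a.ravel()  # input already contiguous: a view, no copy
flat_t = t.ravel()  # input not contiguous: a C-ordered copy

print(t.flags.c_contiguous)          # is the transposed view contiguous?
print(flat_t.flags.c_contiguous)     # is the raveled result contiguous?
print(np.shares_memory(flat_a, a))   # view of the original buffer?
print(np.shares_memory(flat_t, a))   # or a fresh copy?
```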

Author

jules-ch commented May 14, 2026

Thought

For xr.Dataset, I think we can loop over the DataArrays, convert them to pyarrow tables, and join them together into one big pyarrow Table.

It will be sparse if the DataArrays do not share the same coords, but that's another PR altogether.

Edit:

Well that worked better than I thought it would:

pl.DataFrame(pa.table(ds))
shape: (2_082_966, 7)
┌───────┬───────┬──────────┬───────────┬───────────────┬──────┬──────┐
│ month ┆ level ┆ latitude ┆ longitude ┆ z             ┆ u    ┆ v    │
│ ---   ┆ ---   ┆ ---      ┆ ---       ┆ ---           ┆ ---  ┆ ---  │
│ i32   ┆ i32   ┆ f32      ┆ f32       ┆ f64           ┆ f64  ┆ f64  │
╞═══════╪═══════╪══════════╪═══════════╪═══════════════╪══════╪══════╡
│ 1     ┆ 200   ┆ 90.0     ┆ -180.0    ┆ 106837.512109 ┆ null ┆ null │
│ 1     ┆ 200   ┆ 90.0     ┆ -179.25   ┆ 106839.237136 ┆ null ┆ null │
│ 1     ┆ 200   ┆ 90.0     ┆ -178.5    ┆ 106837.512109 ┆ null ┆ null │
│ 1     ┆ 200   ┆ 90.0     ┆ -177.75   ┆ 106839.237136 ┆ null ┆ null │
│ 1     ┆ 200   ┆ 90.0     ┆ -177.0    ┆ 106837.512109 ┆ null ┆ null │
│ …     ┆ …     ┆ …        ┆ …         ┆ …             ┆ …    ┆ …    │
│ null  ┆ 200   ┆ null     ┆ null      ┆ null          ┆ null ┆ null │
│ null  ┆ 500   ┆ null     ┆ null      ┆ null          ┆ null ┆ null │
│ null  ┆ 850   ┆ null     ┆ null      ┆ null          ┆ null ┆ null │
│ 1     ┆ null  ┆ null     ┆ null      ┆ null          ┆ null ┆ null │
│ 7     ┆ null  ┆ null     ┆ null      ┆ null          ┆ null ┆ null │
└───────┴───────┴──────────┴───────────┴───────────────┴──────┴──────┘

I will make another PR if this one gets merged.

Development

Successfully merging this pull request may close these issues.

Add a .to_polars_df() method (very similar to .to_dataframe(), which implicitly uses pandas)

1 participant