Skip to content

Add option flatten_top_level_arrays to TapeDecoder#9496

Open
Rafferty97 wants to merge 4 commits intoapache:mainfrom
Rafferty97:json-array
Open

Add option flatten_top_level_arrays to TapeDecoder#9496
Rafferty97 wants to merge 4 commits intoapache:mainfrom
Rafferty97:json-array

Conversation

@Rafferty97
Copy link
Contributor

Which issue does this PR close?

N/A

Rationale for this change

There are JSON files in the wild that are structured as a single array at the root level, i.e. [{...}, {....}, {...}]. At present, this would be read by the tape decoder as one single array-valued record, but since each nested object is what constitutes a "row", this is not ideal.

So, this PR extends the TapeDecoder to support these kinds of JSON files via an opt-in configuration option.

A PR was recently merged into Datafusion to support this exact usecase, but it currently employs a streaming converter to transform "top-level array" JSON sources into ND-JSON. If this PR is merged, this would facilitate the refactoring of this feature to use the TapeDecoder directly, which should noticably improve performance. See apache/datafusion#19920 for context.

What changes are included in this PR?

This PR modestly refactors the TapeDecoder to facilitate this use case, by adding an option called flatten_top_level_arrays. When enabled, any top-level arrays are "flattened" such that their elements each become an individual row in the output batch, rather than the entire array becoming a single row as would otherwise happen.

Are these changes tested?

Yes, these changes pass all existing unit tests, and I've added a new unit test for this feature specifically.

Are there any user-facing changes?

The primary change is the addition of a new_with_options method on TapeDecoder, that allows the user to specify a value for the new configuration option. I figured a config struct was more future-proof, and we may want to mark it as #[non_exhaustive] but that's debatable.

@github-actions github-actions bot added the arrow Changes to the arrow crate label Mar 2, 2026
Copy link
Contributor

@scovich scovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The row handling logic seems really redundant between normal and flattened-array modes... trying to figure out if there's a better way to express it?

Also -- what would be the intended semantics of flattened-array parsing a json file containing multiple arrays?

[{"a": 1}, {"a": 2}]
[{"a": 3}, {"a": 4}]

(I think the current code would produce four rows of output, but is that correct if the file is "supposed" to contain only one large array?)

mod timestamp_array;

/// A builder for [`Reader`] and [`Decoder`]
#[derive(Clone)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does it need to be Clone now where previously it was not?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I clone the ReaderBuilder in do_read_config, which was the most straightforward way of allowing tests to pass in additional config. Otherwise I woud've has to add an impl Fn() -> ReaderBuilder which seemed more convoluted than just having ReaderBuilder implement Clone.

@Rafferty97
Copy link
Contributor Author

The row handling logic seems really redundant between normal and flattened-array modes... trying to figure out if there's a better way to express it?

Also -- what would be the intended semantics of flattened-array parsing a json file containing multiple arrays?

[{"a": 1}, {"a": 2}]
[{"a": 3}, {"a": 4}]

(I think the current code would produce four rows of output, but is that correct if the file is "supposed" to contain only one large array?)

There's a little bit of code duplication but I think it's quite manageable and warranted.

The intended semantics for that example would be to produce 4 rows, whereas the current code would either error out because the top-level items aren't objects, or return two arrays if it's in "is_field" mode via the ReaderBuilder::new_as_field method.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

arrow Changes to the arrow crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants