Add option flatten_top_level_arrays to TapeDecoder#9496
Add option flatten_top_level_arrays to TapeDecoder#9496Rafferty97 wants to merge 4 commits intoapache:mainfrom
flatten_top_level_arrays to TapeDecoder#9496Conversation
scovich
left a comment
There was a problem hiding this comment.
The row handling logic seems really redundant between normal and flattened-array modes... trying to figure out if there's a better way to express it?
Also -- what would be the intended semantics of flattened-array parsing a json file containing multiple arrays?
[{"a": 1}, {"a": 2}]
[{"a": 3}, {"a": 4}](I think the current code would produce four rows of output, but is that correct if the file is "supposed" to contain only one large array?)
| mod timestamp_array; | ||
|
|
||
| /// A builder for [`Reader`] and [`Decoder`] | ||
| #[derive(Clone)] |
There was a problem hiding this comment.
Why does it need to be Clone now where previously it was not?
There was a problem hiding this comment.
I clone the ReaderBuilder in do_read_config, which was the most straightforward way of allowing tests to pass in additional config. Otherwise I woud've has to add an impl Fn() -> ReaderBuilder which seemed more convoluted than just having ReaderBuilder implement Clone.
There's a little bit of code duplication but I think it's quite manageable and warranted. The intended semantics for that example would be to produce 4 rows, whereas the current code would either error out because the top-level items aren't objects, or return two arrays if it's in "is_field" mode via the ReaderBuilder::new_as_field method. |
Which issue does this PR close?
N/A
Rationale for this change
There are JSON files in the wild that are structured as a single array at the root level, i.e.
[{...}, {....}, {...}]. At present, this would be read by the tape decoder as one single array-valued record, but since each nested object is what constitutes a "row", this is not ideal.So, this PR extends the
TapeDecoderto support these kinds of JSON files via an opt-in configuration option.A PR was recently merged into Datafusion to support this exact usecase, but it currently employs a streaming converter to transform "top-level array" JSON sources into ND-JSON. If this PR is merged, this would facilitate the refactoring of this feature to use the
TapeDecoderdirectly, which should noticably improve performance. See apache/datafusion#19920 for context.What changes are included in this PR?
This PR modestly refactors the
TapeDecoderto facilitate this use case, by adding an option calledflatten_top_level_arrays. When enabled, any top-level arrays are "flattened" such that their elements each become an individual row in the output batch, rather than the entire array becoming a single row as would otherwise happen.Are these changes tested?
Yes, these changes pass all existing unit tests, and I've added a new unit test for this feature specifically.
Are there any user-facing changes?
The primary change is the addition of a
new_with_optionsmethod onTapeDecoder, that allows the user to specify a value for the new configuration option. I figured a config struct was more future-proof, and we may want to mark it as#[non_exhaustive]but that's debatable.