feat: support nested STRUCT and ARRAY data display in anywidget mode #2359
f583833 to 60785f3
bigframes/display/_flatten.py
Outdated
def flatten_nested_data(
    dataframe: pd.DataFrame,
) -> tuple[pd.DataFrame, dict[str, list[int]], list[str], set[str]]:
Tuple is hard to understand. Can we use a frozen dataclass, instead?
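A minimal sketch of what this suggestion could look like. The name `FlattenResult` and the field `cleared_on_continuation` appear later in this review; `row_groups` and `nested_columns` are hypothetical names invented for illustration, and `dataframe` is typed as `Any` only to keep the sketch dependency-free (the real code would presumably use `pd.DataFrame`).

```python
import dataclasses
from typing import Any


@dataclasses.dataclass(frozen=True)
class FlattenResult:
    """Named replacement for the 4-tuple return value (illustrative only)."""

    dataframe: Any  # the flattened frame; pd.DataFrame in the real code
    row_groups: dict[str, list[int]]  # hypothetical field name
    cleared_on_continuation: list[str]
    nested_columns: set[str]  # hypothetical field name


result = FlattenResult(
    dataframe=None,
    row_groups={"0": [0, 1]},
    cleared_on_continuation=["id"],
    nested_columns={"tags"},
)

# Callers read fields by name instead of by tuple position.
print(result.cleared_on_continuation)
```

With `frozen=True`, assigning to a field after construction raises `dataclasses.FrozenInstanceError`, so the result stays as immutable as the tuple it replaces.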
2bb97d3 to 3944249
bigframes/display/_flatten.py
Outdated
)

new_cols_to_add[new_col_name] = pd.Series(
    new_list_array.to_pylist(),
to_pylist() can be quite expensive to call. If we already have a pyarrow array, I don't think it's necessary to convert it.
Done. I've removed the .to_pylist() calls and now pass the Arrow arrays directly to pandas for better performance.
bigframes/display/_flatten.py
Outdated
new_cols_to_add[new_col_name] = pd.Series(
    new_list_array.to_pylist(),
    dtype=pd.ArrowDtype(pa.list_(field.type)),
I'm confused. Why are we creating a list type here? Could you explain in comments what the purpose is? I thought we were flattening based on the function name.
Good point. I've added a comment to clarify that the function is transforming an array<struct<...>> into separate array columns.
bigframes/display/_flatten.py
Outdated
for orig_idx in dataframe.index:
    non_array_data = non_array_df.loc[orig_idx].to_dict()
    array_values = {}
    max_len_in_row = 0
    non_na_array_found = False

    for col_name in array_columns:
        val = dataframe.loc[orig_idx, col_name]
This is looping through each value in Python, which is going to be very slow. Please use native code such as https://arrow.apache.org/docs/python/generated/pyarrow.compute.list_flatten.html to avoid such loops.
Thanks for the suggestion. I've refactored the array explosion logic to use a much faster vectorized approach with pandas.explode and merge, which removes the Python loops entirely.
bigframes/display/_flatten.py
Outdated
        continue

    # Create one row per array element, up to max_len_in_row
    for array_idx in range(max_len_in_row):
This is looping through each element of each array in Python, which is going to be even slower.
I have completely refactored _explode_array_columns to use a vectorized approach with pandas.explode and merge. This eliminated all Python loops, including the slow inner loop you pointed out, significantly improving performance.
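The vectorized approach described in these replies might look roughly like the sketch below. It is not the PR's `_explode_array_columns`: the column names and data are made up, and it shows only the `explode` step, not the `merge`.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "id": [10, 20],
        "tags": [["x", "y"], ["z"]],
        "scores": [[1, 2], [3]],
    }
)

# Explode both array columns together in one call; pandas requires the
# lists in each row to have matching lengths across the exploded columns.
exploded = df.explode(["tags", "scores"])

print(exploded)
```

The scalar column `id` is replicated onto every exploded row, and the original index values are repeated rather than renumbered.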
bigframes/display/_flatten.py
Outdated
        return "struct"
    if pa.types.is_list(pa_type):
        return (
            "array_of_struct"
            if pa.types.is_struct(pa_type.value_type)
            else "array"
        )
    return "clear"
These magic strings worry me. Could you create an enum for category, instead?
Done. I've replaced the strings with a private _ColumnCategory Enum.
bigframes/display/_flatten.py
Outdated
continuation_rows: A set of row indices that are continuation rows.
cleared_on_continuation: A list of column names that should be cleared on continuation rows.
It's not 100% clear to me what is meant by "continuation". I assume that it means rows post-flattening that correspond to the second element of an array and beyond? Please expand these docstrings further.
You are right. I've updated the docstrings in FlattenResult to explicitly clarify that "continuation rows" refer to the 2nd element onwards of an exploded array, and "cleared" columns are those (typically scalars) that are replicated but shouldn't be visually repeated.
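As an illustration of these two terms, the sketch below marks continuation rows after an explode and blanks the replicated scalar column on them. It uses a boolean mask derived from the duplicated index, which is an assumption made for brevity; the PR stores continuation rows as a set of row indices.

```python
import pandas as pd

df = pd.DataFrame({"id": [10, 20], "tags": [["x", "y", "z"], ["w"]]})
exploded = df.explode("tags")

# A row is a continuation row when its original index value already appeared,
# i.e. it holds the 2nd to Nth element of an exploded array.
is_continuation = exploded.index.duplicated(keep="first")

# Columns that were replicated by the explosion and should be shown only once.
cleared_on_continuation = ["id"]

# Blank those columns on continuation rows (display copy only).
display_df = exploded.astype(object)
for column in cleared_on_continuation:
    display_df[column] = display_df[column].where(~is_continuation, "")

print(display_df)
```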
bigframes/display/_flatten.py
Outdated
"""The result of flattening a DataFrame.
Attributes:
    dataframe: The flattened DataFrame.
Please add some comments about what happens to the original index columns. Based on the description of the other fields, I assume that a unique index is created post-flatten?
I've updated the docstrings and the implementation. The original index (including named Index and MultiIndex) is preserved and duplicated across the exploded rows. This serves as the visual grouping key for the table display.
bigframes/display/_flatten.py
Outdated
@dataclasses.dataclass(frozen=True)
class ColumnClassification:
Please put a leading _ in front of class names that aren't intended to be used outside of this module.
continuation_rows: set[int] | None,
clear_on_continuation: list[str],
Same here, add some more explanation to the docstrings. To keep it shorter, you could reference bigframes/display/_flatten.py so that folks can look there for the complete explanation.
Done. I updated the docstrings to reference bigframes.display._flatten.FlattenResult for the detailed definitions.
Neat feature!
Please create a test_flatten.py file with a few tests that check some of the flattening logic directly without the HTML rendering part. Specifically, let's focus on what happens to index/multiindex columns, as that's my main worry / question.
Done. I created tests/unit/display/test_flatten.py. I moved the logic-specific tests there and added dedicated test cases (test_flatten_preserves_original_index, test_flatten_preserves_multiindex) to verify that indices are correctly preserved and duplicated during the flattening process.
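A test along these lines might look like the sketch below. Only the test name comes from this thread; `flatten_stand_in` is a hypothetical placeholder built on `DataFrame.explode`, used so the example runs without the PR's module, and the data is made up.

```python
import pandas as pd


def flatten_stand_in(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for the PR's flatten function."""
    return df.explode("tags")


def test_flatten_preserves_multiindex() -> None:
    index = pd.MultiIndex.from_tuples([("a", 1), ("b", 2)], names=["k1", "k2"])
    df = pd.DataFrame({"tags": [["x", "y"], ["z"]]}, index=index)

    result = flatten_stand_in(df)

    # Index level names survive, and each original key is repeated once
    # per exploded array element.
    assert list(result.index.names) == ["k1", "k2"]
    assert list(result.index) == [("a", 1), ("a", 1), ("b", 2)]
    assert result["tags"].tolist() == ["x", "y", "z"]


test_flatten_preserves_multiindex()
```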
8eb7211 to ca19957
classification = _classify_columns(result_df)

# Process ARRAY-of-STRUCT columns into multiple ARRAY columns (one per struct field).
Why do we need special logic for array of struct? Why can't we achieve this by just applying the array logic and then the struct logic? Also, might we want to keep recursively unpacking until there are no more arrays/structs left?
You are correct that we could achieve this by applying array logic (explode) first and then struct logic, but that would require a second pass (loop) because the explosion would produce new STRUCT columns that need flattening.
The current approach (Transpose Array -> Flatten Structs -> Explode Arrays) allows us to:
- Keep the pipeline linear: we resolve the nesting structure in a single pass without needing recursion or re-classification loops.
- Optimize performance: we flatten the struct fields column-wise before expanding the row count via explosion.
For recursion, I agree that a recursive visitor is the correct long-term solution for arbitrary nesting depths (e.g., ARRAY<STRUCT>). For this PR, I aimed to support the common BQ ARRAY pattern within the current architecture, but we should definitely refactor to full recursion if we need to support deeper/arbitrary nesting.
continuation_rows: A set of row indices in the flattened DataFrame that are
    "continuation rows". These are additional rows created to display the
    2nd to Nth elements of an array. The first row (index i-1) contains
    the 1st element, while these rows contain subsequent elements.
cleared_on_continuation: A list of column names that should be "cleared"
    (displayed as empty) on continuation rows. Typically, these are
    scalar columns (non-array) that were replicated during the explosion
    process but should only be visually displayed once per original row group.
Might need to individually mark continuation rows rather than take the intersection of a row set and column set
Thanks for the suggestion. Currently, we enforce synchronous explosion (all arrays align), so the "continuation" status effectively applies to the whole row. When we support independent array explosions, we will definitely need to track continuation status individually, as you suggest.
c405008 to d2710c2
Implements flattening and expansion for complex data types in the interactive display for anywidget mode.
Key Features:
verified at:
Fixes #<438181139> 🦕