[codex] Add Rerun format support #218
Code Review
This pull request introduces support for reading Rerun RRD files by adding the RerunReader and exposing the read_rerun pipeline function. The review feedback highlights several issues in the implementation: a drop_null() call on pyarrow.Table that removes rows with a null in any column rather than only rows with a null segment ID, potential TypeError exceptions when handling null values in list arrays and encoded images, a PEP 8 naming convention violation for the _local_rrd class, and a potential frame count mismatch in robotics mode if camera frames are missing.

    def _collect_table(df: Any) -> pa.Table:
        table = df.to_arrow_table()
        if _RERUN_SEGMENT_ID in table.column_names and table.num_rows > 0:
            table = table.drop_null()
        return table

pyarrow.Table.drop_null() removes every row that contains a null in any column, not only rows whose segment ID is null, so calling it here can discard valid rows when _RERUN_SEGMENT_ID is present in the table's columns. To drop only rows with a null segment ID, use pyarrow.compute.is_valid to filter on that column instead.
Suggested change:

    def _collect_table(df: Any) -> pa.Table:
        table = df.to_arrow_table()
        if _RERUN_SEGMENT_ID in table.column_names and table.num_rows > 0:
            table = table.filter(pa.compute.is_valid(table.column(_RERUN_SEGMENT_ID)))
        return table


    def _singleton_list_array(array: pa.Array) -> np.ndarray:
        out = np.full(len(array), np.nan, dtype=np.float64)
        for index, scalar in enumerate(array):
            value = scalar.as_py()
            if value:
                out[index] = float(value[0])
        return out

If the list contains a null/None element (e.g., [None]), if value: will evaluate to True because the list is non-empty. However, accessing value[0] will return None, and attempting to cast it to float will raise a TypeError: float() argument must be a string or a real number, not 'NoneType'. To prevent this, ensure that value[0] is not None before casting.
Suggested change:

    def _singleton_list_array(array: pa.Array) -> np.ndarray:
        out = np.full(len(array), np.nan, dtype=np.float64)
        for index, scalar in enumerate(array):
            value = scalar.as_py()
            if value and value[0] is not None:
                out[index] = float(value[0])
        return out

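The edge case can be reproduced without PyArrow by working on plain Python values of the kind `as_py()` returns for list scalars. This is a simplified sketch under that assumption. The helper name `first_or_nan` is hypothetical, and it returns a list of floats rather than a NumPy array.

```python
import math
from typing import Optional, Sequence


def first_or_nan(rows: Sequence[Optional[Sequence[Optional[float]]]]) -> list[float]:
    """Return the first element of each row, or NaN when it is missing."""
    out = [math.nan] * len(rows)
    for index, value in enumerate(rows):
        # `value` may be None (null list), [] (empty list), or [None] (null element).
        # A bare `if value:` would let [None] through and float(None) would raise.
        if value and value[0] is not None:
            out[index] = float(value[0])
    return out


result = first_or_nan([[1.5], [None], [], None, [0]])
```

Note that `[0]` is still converted, because the guard tests the list for emptiness and the element for `None`, not the element for truthiness.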

    def _iter_encoded_images(values: pa.Array) -> Iterator[np.ndarray]:
        from io import BytesIO

        from PIL import Image

        for scalar in values:
            value = scalar.as_py()
            if not value:
                continue
            data = cast(bytes | bytearray | list[int], value[0])
            with Image.open(BytesIO(bytes(data))) as image:
                yield np.asarray(image.convert("RGB"), dtype=np.uint8)

Similar to _singleton_list_array, if the list contains a null/None element (e.g., [None]), if not value: will evaluate to False. Then value[0] will be None, and passing it to bytes(data) will raise a TypeError: cannot convert 'NoneType' object to bytes. Ensure that value[0] is not None before attempting to decode the image.
Suggested change:

    def _iter_encoded_images(values: pa.Array) -> Iterator[np.ndarray]:
        from io import BytesIO
        from PIL import Image
        for scalar in values:
            value = scalar.as_py()
            if not value or value[0] is None:
                continue
            data = cast(bytes | bytearray | list[int], value[0])
            with Image.open(BytesIO(bytes(data))) as image:
                yield np.asarray(image.convert("RGB"), dtype=np.uint8)


    class _local_rrd:
        def __init__(self, source: DataFile) -> None:

According to PEP-8 naming conventions, class names should normally use the CapWords (CamelCase) convention. Consider renaming _local_rrd to _LocalRRD to improve code style and maintainability. Note that you will also need to update its usage on line 126.
Suggested change:

    class _LocalRRD:
        def __init__(self, source: DataFile) -> None:

References
- PEP 8: Class names should normally use the CapWords convention.

    for name, column in _camera_columns(schema, table, self.camera_prefix).items():
        values = table.column(column).combine_chunks()
        row[name] = VideoFrameSequence(
            lambda values=values: _iter_encoded_images(values),
            fps=self.fps or 30.0,
            frame_count=len(values),
        )

In _robotics_row, frame_count is set to len(values) (the total number of rows in the timeline table). However, _iter_encoded_images skips null/empty frames using continue. If there are any missing/null frames in the camera column, the iterator will yield fewer frames than frame_count. This mismatch can cause downstream validation errors (e.g., in read_lerobot or other writers) and lead to out-of-sync alignment between video frames and other timeline data (like actions/states). Consider forward-filling missing frames or handling nulls explicitly to ensure the yielded frame count matches len(values).
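One way to keep the yielded count equal to the declared frame count is to forward-fill gaps. The generator below is a sketch of that idea only, not the reader's implementation. The name `forward_fill`, the use of `None` to mark a missing frame, and the choice to back-fill leading gaps with the first valid frame are all assumptions.

```python
from typing import Iterable, Iterator, Optional, TypeVar

T = TypeVar("T")


def forward_fill(frames: Iterable[Optional[T]]) -> Iterator[T]:
    """Yield one frame per input slot, repeating the last valid frame for gaps.

    Leading gaps are back-filled with the first valid frame. If no frame is
    valid at all, nothing is yielded and the caller must handle that case.
    """
    last: Optional[T] = None
    pending = 0  # gaps seen before the first valid frame
    for frame in frames:
        if frame is None:
            if last is None:
                pending += 1  # no earlier frame to repeat yet
            else:
                yield last
            continue
        for _ in range(pending):  # back-fill leading gaps
            yield frame
        pending = 0
        last = frame
        yield frame


slots = [None, "a", None, None, "b"]
filled = list(forward_fill(slots))
```

With this approach the output length matches the number of input slots whenever at least one frame is valid, at the cost of repeating image data for missing slots.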
This reverts commit e0df9ec.
Summary of the Rerun optimization / benchmark campaign: This branch went through two related tracks: Rerun support hardening and performance work. The final commit on the branch is the stable cleanup recipe (
Current state
Summary
Adds an initial production Rerun RRD reader for Refiner.
- macrodata-refiner[rerun] dependencies using rerun-sdk[datafusion].
- read_rerun(...) with atomic RRD file sharding, lossless recording mode, and configurable robotics mode.
- to_robot_rows(...).

This is intentionally an early draft PR so cloud jobs can install this exact git ref while the writer and optimization work continue.
Validation
- uv run --with '.[rerun]' pytest tests/readers/test_rerun_reader.py
- uv run ty check src/refiner/pipeline/sources/readers/rerun.py src/refiner/pipeline/pipeline.py tests/readers/test_rerun_reader.py