[codex] Add Rerun format support by guipenedo · Pull Request #218 · macrodata-labs/refiner

Guilherme Penedo (guipenedo) · 2026-06-14T13:01:33Z

Summary

Adds an initial production Rerun RRD reader for Refiner.

Adds optional macrodata-refiner[rerun] dependencies using rerun-sdk[datafusion].
Adds read_rerun(...) with atomic RRD file sharding, lossless recording mode, and configurable robotics mode.
Adds focused synthetic RRD tests covering recording reads and conversion through to_robot_rows(...).

This is intentionally an early draft PR so cloud jobs can install this exact git ref while the writer and optimization work continue.

Validation

uv run --with '.[rerun]' pytest tests/readers/test_rerun_reader.py
uv run ty check src/refiner/pipeline/sources/readers/rerun.py src/refiner/pipeline/pipeline.py tests/readers/test_rerun_reader.py
commit hooks: ruff, ruff format, ty

gemini-code-assist

Code Review

This pull request introduces support for reading Rerun RRD files by adding the RerunReader and exposing the read_rerun pipeline function. The review feedback highlights several issues in the implementation: a table.drop_null() call that drops rows with a null in any column rather than only rows with a null segment ID, potential TypeError exceptions when handling null values in list arrays and encoded images, a PEP-8 naming convention violation for the _local_rrd class, and a potential frame count mismatch in robotics mode if camera frames are missing.

gemini-code-assist · 2026-06-14T13:03:14Z

+def _collect_table(df: Any) -> pa.Table:
+    table = df.to_arrow_table()
+    if _RERUN_SEGMENT_ID in table.column_names and table.num_rows > 0:
+        table = table.drop_null()
+    return table


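pyarrow.Table.drop_null() removes every row that contains a null in any column, not only rows where the segment ID is null. Rows with a valid segment ID but a missing value in another column would therefore be dropped silently whenever _RERUN_SEGMENT_ID is present in the table's columns. If the intent is to drop only rows with a null segment ID, build a mask with pyarrow.compute.is_valid on that column and pass it to table.filter.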

Suggested change

 def _collect_table(df: Any) -> pa.Table:
     table = df.to_arrow_table()
     if _RERUN_SEGMENT_ID in table.column_names and table.num_rows > 0:
-        table = table.drop_null()
+        table = table.filter(pa.compute.is_valid(table.column(_RERUN_SEGMENT_ID)))
     return table

gemini-code-assist · 2026-06-14T13:03:15Z

+def _singleton_list_array(array: pa.Array) -> np.ndarray:
+    out = np.full(len(array), np.nan, dtype=np.float64)
+    for index, scalar in enumerate(array):
+        value = scalar.as_py()
+        if value:
+            out[index] = float(value[0])
+    return out
+


If the list contains a null/None element (e.g., [None]), if value: will evaluate to True because the list is non-empty. However, accessing value[0] will return None, and attempting to cast it to float will raise a TypeError: float() argument must be a string or a real number, not 'NoneType'. To prevent this, ensure that value[0] is not None before casting.

Suggested change

 def _singleton_list_array(array: pa.Array) -> np.ndarray:
     out = np.full(len(array), np.nan, dtype=np.float64)
     for index, scalar in enumerate(array):
         value = scalar.as_py()
-        if value:
+        if value and value[0] is not None:
             out[index] = float(value[0])
     return out
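
The cases this guard has to cover can be spelled out with plain Python values. The sketch below assumes the rows have already been converted with `as_py()`; the helper name `first_or_nan` is hypothetical and it returns a list rather than a NumPy array so that it stays dependency-free.

```python
import math
from typing import Optional, Sequence


def first_or_nan(rows: Sequence[Optional[Sequence[Optional[float]]]]) -> list[float]:
    """Return the first element of each row as a float, or NaN when it is missing."""
    out = [math.nan] * len(rows)
    for index, value in enumerate(rows):
        # Skips None rows, empty lists, and lists whose first element is None.
        if value and value[0] is not None:
            out[index] = float(value[0])
    return out


# One row per case: a value, a null row, an empty list, a null element, a zero.
result = first_or_nan([[1.5], None, [], [None], [0.0]])
print(result)
```

Note that `[0.0]` is kept: the truthiness check applies to the list, not to its first element, so a legitimate zero is not mistaken for a missing value.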

gemini-code-assist · 2026-06-14T13:03:15Z

+def _iter_encoded_images(values: pa.Array) -> Iterator[np.ndarray]:
+    from io import BytesIO
+
+    from PIL import Image
+
+    for scalar in values:
+        value = scalar.as_py()
+        if not value:
+            continue
+        data = cast(bytes | bytearray | list[int], value[0])
+        with Image.open(BytesIO(bytes(data))) as image:
+            yield np.asarray(image.convert("RGB"), dtype=np.uint8)
+


Similar to _singleton_list_array, if the list contains a null/None element (e.g., [None]), if not value: will evaluate to False. Then value[0] will be None, and passing it to bytes(data) will raise a TypeError: cannot convert 'NoneType' object to bytes. Ensure that value[0] is not None before attempting to decode the image.

Suggested change

     for scalar in values:
         value = scalar.as_py()
-        if not value:
+        if not value or value[0] is None:
             continue
         data = cast(bytes | bytearray | list[int], value[0])

gemini-code-assist · 2026-06-14T13:03:15Z

+class _local_rrd:
+    def __init__(self, source: DataFile) -> None:


According to PEP-8 naming conventions, class names should normally use the CapWords (CamelCase) convention. Consider renaming _local_rrd to _LocalRRD to improve code style and maintainability. Note that you will also need to update its usage on line 126.

Suggested change

-class _local_rrd:
+class _LocalRRD:
     def __init__(self, source: DataFile) -> None:

References

PEP-8: Class names should normally use the CapWords convention.

gemini-code-assist · 2026-06-14T13:03:15Z

+        for name, column in _camera_columns(schema, table, self.camera_prefix).items():
+            values = table.column(column).combine_chunks()
+            row[name] = VideoFrameSequence(
+                lambda values=values: _iter_encoded_images(values),
+                fps=self.fps or 30.0,
+                frame_count=len(values),
+            )


In _robotics_row, frame_count is set to len(values) (the total number of rows in the timeline table). However, _iter_encoded_images skips null/empty frames using continue. If there are any missing/null frames in the camera column, the iterator will yield fewer frames than frame_count. This mismatch can cause downstream validation errors (e.g., in read_lerobot or other writers) and lead to out-of-sync alignment between video frames and other timeline data (like actions/states). Consider forward-filling missing frames or handling nulls explicitly to ensure the yielded frame count matches len(values).
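
One way to keep the yielded frame count equal to the number of timeline rows is to forward-fill gaps. The sketch below is an assumption about how that could look, not the reader's implementation: `forward_fill_frames` is a hypothetical helper, strings stand in for decoded image arrays, and the `placeholder` used before the first valid frame (for example a black frame) is a design choice the caller would have to make.

```python
from typing import Iterable, Iterator, Optional, TypeVar

Frame = TypeVar("Frame")


def forward_fill_frames(
    frames: Iterable[Optional[Frame]], placeholder: Frame
) -> Iterator[Frame]:
    """Yield exactly one frame per input slot, repeating the last valid frame."""
    last = placeholder
    for frame in frames:
        if frame is not None:
            last = frame
        # Always yield, so the output length matches the number of slots.
        yield last


# None marks a timeline row with no camera frame.
slots = [None, "f0", None, None, "f1"]
filled = list(forward_fill_frames(slots, placeholder="blank"))
print(filled)
```

With this shape, `frame_count=len(values)` stays accurate because the iterator never skips a slot. The alternative is to count the valid frames up front and report that number instead, at the cost of losing row-for-row alignment with actions and states.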

Guilherme Penedo (guipenedo) · 2026-06-15T22:22:50Z

Summary of the Rerun optimization / benchmark campaign:

This branch went through two related tracks: Rerun support hardening and performance work. The final commit on the branch is the stable cleanup recipe (26b0a2b5), but the path to get there included a lot of benchmark-driven churn. Here is the full breakdown of what worked and what did not.

Support / correctness hardening

Added the Rerun reader and writer, then hardened them for real recordings, explicit robotics selections, and edge cases around metadata, output columns, and schema drift.
The useful support changes that stayed were the ones that removed accidental extra work without narrowing behavior: avoiding unused static reads, skipping redundant schema work, reusing schema component maps, filling scalar matrices in place, and hardening output column collision handling.
One encoded-image offset cache was tried and then reverted because it did not justify the complexity.

Benchmarking / measurement infrastructure

Added the cloud benchmark harness, comparison helper, AWS secret refresh flow, and AWS profile handling so the measurements were repeatable and did not leak credentials.
These changes made the signal trustworthy, but they were measurement-only work. They did not themselves change runtime.

Raw RRD copy / staging path

The main working area was the raw-copy path and staging pipeline.
Things that helped and stayed: skipping unnecessary table work for raw-copy benchmarks, direct-copying raw source chunks, avoiding double metadata scans, hardlinking staged copies on local filesystems, parallelizing staged source opens, parallelizing metadata scans, and tuning the writer loop / buffering / local-to-remote upload path.
The strongest concrete wins here were the direct-copy and local upload changes: local synthetic runs consistently showed the direct-copy path orders of magnitude faster than the fallback in the small benchmark, and the DataFile.copy path using put_file improved the cloud rrd-copy results. Buffering tweaks helped too, but the gains were smaller and sometimes noisy.
Things that did not stick: native remote RRD staging, metadata-fusion experiments, a single-store lookup fast path, and some reader fanout changes. Those were benchmarked, found noisy or regressive, and reverted.
The cloud rrd-copy result moved from the earlier baseline around 83.61s down into the low 80s on the better raw-copy / upload path, but the intermediate tuning was noisy enough that several of the more aggressive experiments were rolled back.
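
As an illustration of the hardlinking idea mentioned above, a staging helper can try a hardlink first and fall back to a regular copy. This is a sketch under assumptions: `stage_local_copy` and the file names are hypothetical and do not reflect the actual staging code in this branch.

```python
import os
import shutil
import tempfile
from pathlib import Path


def stage_local_copy(source: Path, staged: Path) -> str:
    """Stage `source` at `staged`, hardlinking when possible and copying otherwise."""
    staged.parent.mkdir(parents=True, exist_ok=True)
    try:
        # A hardlink shares the file's data blocks, so no bytes are copied.
        os.link(source, staged)
        return "hardlink"
    except OSError:
        # Cross-device links, unsupported filesystems, or an existing target.
        shutil.copy2(source, staged)
        return "copy"


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "recording.rrd"
    src.write_bytes(b"synthetic payload")
    dst = Path(tmp) / "staging" / "recording.rrd"
    method = stage_local_copy(src, dst)
    staged_bytes = dst.read_bytes()
    print(method, len(staged_bytes))
```

A hardlinked staged file shares its contents with the source, so this only suits staged copies that are treated as read-only; in-place writes to either path would be visible through both.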

Cleanup matcher / reducer path

The cleanup path became the other real win.
Helpful changes that stayed: caching FinalizedShardWorker.worker_token, tightening the default-root cleanup parsing, precomputing cleanup keys, and using fixed-position / precomputed-string matching instead of regex in the hot path.
Local matcher benchmarks showed a clear win: the regex path was around 0.87s on the synthetic workload, while the precomputed-key path got down to about 0.23s.
The sink-level cleanup path also improved materially once the listing prefix and worker token were cached, but a dedicated sink benchmark harness was later removed from the final recipe because it was just measurement scaffolding, not product code.
An explicit cleanup_key field was tried and then removed again. It did not prove better than deriving the key from the cached worker_token, and it complicated typing and downstream code for no clear gain.
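
The kind of change described above, replacing a regex in the hot path with fixed-position matching against a cached, precomputed prefix, can be sketched as follows. Everything here is assumed for illustration: the file layout `<worker_token>__<5-digit shard>.rrd`, the `Worker` class, and its token format are hypothetical and are not taken from FinalizedShardWorker or the real cleanup code.

```python
import re
from functools import cached_property

# Hypothetical layout: "<worker_token>__<5-digit shard>.rrd".
_PATTERN = re.compile(r"^(?P<token>[0-9a-f]+)__(?P<shard>\d{5})\.rrd$")
_SUFFIX = ".rrd"
_SHARD_DIGITS = 5


class Worker:
    """Stand-in for a finalized shard worker; the fields are illustrative."""

    def __init__(self, job_id: str, index: int) -> None:
        self.job_id = job_id
        self.index = index

    @cached_property
    def token(self) -> str:
        # Computed once per worker instead of on every cleanup comparison.
        return f"{self.job_id}{self.index:04x}"

    @cached_property
    def cleanup_prefix(self) -> str:
        # Derived from the cached token, so no separate key field is needed.
        return f"{self.token}__"


def matches_regex(name: str, token: str) -> bool:
    match = _PATTERN.match(name)
    return match is not None and match.group("token") == token


def matches_fixed(name: str, prefix: str) -> bool:
    """Fixed-position check for a known token, using only string operations."""
    shard_start = len(prefix)
    shard_end = shard_start + _SHARD_DIGITS
    return (
        len(name) == shard_end + len(_SUFFIX)
        and name.startswith(prefix)
        and name.endswith(_SUFFIX)
        and name[shard_start:shard_end].isdigit()
    )


worker = Worker("ab12", 3)
names = [
    "ab120003__00001.rrd",  # this worker's shard
    "ab120003__0001.rrd",   # wrong shard width
    "ab120004__00001.rrd",  # another worker
    "ab120003__00002.tmp",  # wrong suffix
    "ab120003__00010.rrd",  # this worker's shard
]
kept_regex = [n for n in names if matches_regex(n, worker.token)]
kept_fixed = [n for n in names if matches_fixed(n, worker.cleanup_prefix)]
print(kept_fixed)
```

Both matchers agree on these sample names. The fixed-position version avoids running the regex engine and extracting groups for every listed object, which is where a saving of the kind reported above would plausibly come from.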

What was reverted or did not win

Reverted: native remote RRD staging, metadata fusion, fast-path single-store lookup, reader fanout oversubscription, and a few cleanup micro-optimizations that did not beat the best existing result.
Not kept: the sink-level benchmark harness and the cleanup_key field experiment.
Kept: the smaller but reliable improvements that reduced repeated work without narrowing support.

Current state

The branch is clean and pushed.
The final commit keeps the best stable cleanup recipe and the working Rerun support hardening.
The broader lesson from the benchmark history is that the real wins came from removing repeated work in the hot path, not from widening concurrency or adding more special cases.

Add Rerun reader

b7f2337

gemini-code-assist Bot reviewed Jun 14, 2026

Guilherme Penedo (guipenedo) added 14 commits June 14, 2026 15:04

Add Rerun writer

917aee3

Harden Rerun reader and writer

14c1d01

Support explicit Rerun robotics selections

276bd9f

Document Rerun reader and writer

0c580c9

Clean up Rerun docs whitespace

d6faf32

Avoid unused Rerun static reads

087e2d3

Polish Rerun reader and writer

aa80558

Add Rerun cloud benchmark harness

012167a

Harden Rerun cloud benchmark harness

4e7b2f6

Skip unused Rerun metadata reads

2331a1f

Avoid redundant Rerun schema work

98f5f7b

Cover Rerun robotics recording payloads

85c4ccd

Add Rerun benchmark comparison helper

0e6ec55

Add Rerun benchmark AWS secret refresher

617dbf2

github-advanced-security AI found potential problems Jun 14, 2026

Comment thread benchmark/rerun/refresh_aws_secrets.py Fixed

Avoid logging Rerun benchmark secret payloads

7241054

github-advanced-security AI found potential problems Jun 14, 2026

Comment thread benchmark/rerun/refresh_aws_secrets.py Fixed

Guilherme Penedo (guipenedo) added 11 commits June 14, 2026 23:29

Silence Rerun benchmark secret refresh output

666727a

Use benchmark AWS profile by default

234006f

Use supported AWS credential export format

1bdad4e

Reuse Rerun schema component maps

337cafa

Fill Rerun scalar matrices in place

6b70f59

Cache Rerun encoded image offsets

e0df9ec

Revert "Cache Rerun encoded image offsets"

0f0cd22

This reverts commit e0df9ec.

Track Rerun metadata and output metrics

b84ff52

Harden Rerun output column collisions

31b96b7

Skip Rerun tables for raw copy benchmarks

3de8e90

Optimize Rerun raw copy path

7c6856a

Guilherme Penedo (guipenedo) added 29 commits June 15, 2026 12:28

Speed up batched Rerun writer loop

af65e54

Refine Rerun benchmark timing and writer loop

7323294

Reduce local Rerun writer path overhead

8ec50e2

Speed up default local Rerun writes

9995f3b

Speed up local-to-remote DataFile copies

0d6d90e

Tune local-to-remote upload buffering

905221d

Parallelize staged Rerun source opens

24406de

Parallelize Rerun metadata scans

0917053

Tweak Rerun batch staging concurrency

d322b0c

Revert Rerun batch staging oversubscription

63305f7

Fuse Rerun metadata staging passes

7df6d30

Revert "Fuse Rerun metadata staging passes"

004a69c

This reverts commit 7df6d30.

Trim Rerun metadata scan wrappers

cde8671

Fast-path single-store Rerun lookup

85ea278

Revert "Fast-path single-store Rerun lookup"

e282400

This reverts commit 85ea278.

Direct-copy raw Rerun source chunks

44920e2

Prefer native remote RRD staging

9a8cc38

Revert native remote RRD staging

abbf4a7

Increase RRD staging buffer size

e15f2b7

Raise RRD reader fanout

45a13fb

Back off RRD metadata scan fanout

6c71e6e

Revert RRD reader fanout

f9586ea

Optimize default RRD cleanup listing

170c51e

Speed up default RRD cleanup matching

673fdc1

Tighten default RRD cleanup parsing

095c66c

Add cleanup matcher benchmark

f049027

Speed up cleanup key lookup

7fe8bee

Precompute cleanup matcher keys

83ff8f2

Restore cleanup best recipe

26b0a2b

Guilherme Penedo (guipenedo) wants to merge 65 commits into main from codex/rerun-format-support

2 participants
