⚡️ Speed up function `first_non_null_non_empty_value` by 23% (#119)
**Open** · codeflash-ai[bot] wants to merge 1 commit into `main` from `codeflash/optimize-first_non_null_non_empty_value-mlcidxl0`
The optimized code achieves a **22% runtime improvement** by restructuring the conditional logic to minimize per-iteration overhead when processing iterables with many `None` values or empty dict/list instances.

## Key Optimizations

**1. Hoisted Type Tuple Outside Loop**

The `(dict, list)` tuple is now created once and stored in `empty_types` before the loop, rather than being reconstructed on every iteration. This eliminates repeated tuple allocation overhead.

**2. Split Conditional with Early Continue**

The original code used a single compound condition: `if value is not None and not (isinstance(value, (dict, list)) and len(value) == 0)`. The optimized version splits this into two separate checks with early `continue` statements:

- First check: `if value is None: continue`
- Second check: `if isinstance(value, empty_types) and not value: continue`

**3. Replaced len() with Truthiness Test**

For dict/list emptiness detection, the code now uses `not value` instead of `len(value) == 0`. Python's truthiness evaluation for collections is faster than calling `len()` and comparing the result.
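The actual diff is not included in this excerpt, so the before/after below is a sketch reconstructed from the description; the function names, signature, and return convention are assumptions, not the real patch.

```python
# Sketch reconstructed from the PR description; the actual patch is not
# shown in this excerpt, so names and the exact signature are assumptions.

def first_non_null_non_empty_value_before(iterable):
    # Original shape: one compound condition, with (dict, list) written
    # inline and len() used for the emptiness test.
    for value in iterable:
        if value is not None and not (isinstance(value, (dict, list)) and len(value) == 0):
            return value
    return None

def first_non_null_non_empty_value_after(iterable):
    # Optimized shape: type tuple hoisted out of the loop, a cheap None
    # check first, then a truthiness test instead of len(value) == 0.
    empty_types = (dict, list)
    for value in iterable:
        if value is None:
            continue
        if isinstance(value, empty_types) and not value:
            continue
        return value
    return None
```

Note that in both versions only `dict` and `list` are checked for emptiness, so falsy values such as `0`, `False`, or `""` are still returned, which is why the behavior is identical before and after the split.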
## Performance Characteristics

The optimization excels when processing data with:

- **Many None values**: the early `continue` for `None` avoids unnecessary `isinstance()` checks (see `test_large_list_with_many_empty_containers`: 44.7% faster)
- **Many empty dicts/lists**: the split logic and truthiness test accelerate filtering (see `test_large_list_with_mixed_empty_containers`: 63.5% faster)
- **High filtering ratio**: when most elements need to be skipped, the streamlined logic compounds savings across iterations

## Impact on Hot Path Usage

Based on `function_references`, this function is called in **`arrow_writer.py`**'s type inference code paths:

- `_infer_custom_type_and_encode()`: called to find the first non-null sample for type detection (PIL images, PDFs)
- `__arrow_array__()`: used when converting numpy arrays and lists to PyArrow format

These are **critical data loading paths** where datasets may contain many `None` placeholders or empty collections before valid data. The 22% speedup directly accelerates dataset serialization and type inference operations, which are frequently executed during data processing pipelines.

The optimization maintains identical behavior while delivering consistent performance gains across test cases, with particularly strong improvements (19-63% faster) for workloads involving large iterables with high proportions of filtered values.
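The skip-heavy workload described above can be reproduced with a small `timeit` comparison. This is illustrative only: the data shape is invented to mimic the "high filtering ratio" case, and absolute timings will vary by machine and Python version.

```python
import timeit

# A filter-heavy input: mostly None and empty containers before the first
# meaningful value, mirroring the "high filtering ratio" case above.
data = [None] * 500 + [{}] * 250 + [[]] * 249 + [{"key": "value"}]

def before(iterable):
    # Original compound-condition shape.
    for value in iterable:
        if value is not None and not (isinstance(value, (dict, list)) and len(value) == 0):
            return value
    return None

def after(iterable):
    # Optimized shape with hoisted tuple and early continues.
    empty_types = (dict, list)
    for value in iterable:
        if value is None:
            continue
        if isinstance(value, empty_types) and not value:
            continue
        return value
    return None

# Both versions must agree on the result before timing them.
assert before(data) == after(data) == {"key": "value"}

t_before = timeit.timeit(lambda: before(data), number=10_000)
t_after = timeit.timeit(lambda: after(data), number=10_000)
print(f"before: {t_before:.3f}s  after: {t_after:.3f}s")
```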
📄 **23% (0.23x) speedup** for `first_non_null_non_empty_value` in `src/datasets/utils/py_utils.py`

⏱️ **Runtime:** 456 microseconds → 370 microseconds (best of 109 runs)
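The `arrow_writer.py` call sites mentioned above are not reproduced in this excerpt; the sketch below only illustrates the general pattern of inferring a column's type from its first meaningful sample. The helper `infer_column_type` is hypothetical, not the real `datasets` API.

```python
# Illustrative only: the real _infer_custom_type_and_encode() in
# datasets/arrow_writer.py is not reproduced here. This sketch shows the
# general pattern of using a first-non-null helper for type inference.

def first_non_null_non_empty_value(iterable):
    # Same logic as the optimized helper described above (signature assumed).
    empty_types = (dict, list)
    for value in iterable:
        if value is None:
            continue
        if isinstance(value, empty_types) and not value:
            continue
        return value
    return None

def infer_column_type(column):
    # Hypothetical helper: pick the first meaningful sample, then branch on
    # its Python type to decide how the column should be encoded.
    sample = first_non_null_non_empty_value(column)
    if sample is None:
        return "null"
    return type(sample).__name__

# A column with leading None placeholders and empty containers, as in the
# data loading paths described above.
print(infer_column_type([None, {}, [], {"path": "img.png"}]))  # → dict
```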
✅ Correctness verification report:
To edit these changes, run `git checkout codeflash/optimize-first_non_null_non_empty_value-mlcidxl0` and push.