⚡️ Speed up function first_non_null_non_empty_value by 23% #119

Open
codeflash-ai[bot] wants to merge 1 commit into main from codeflash/optimize-first_non_null_non_empty_value-mlcidxl0

@codeflash-ai codeflash-ai bot commented Feb 7, 2026

📄 23% (0.23x) speedup for first_non_null_non_empty_value in src/datasets/utils/py_utils.py

⏱️ Runtime: 456 microseconds → 370 microseconds (best of 109 runs)
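
The headline percentage appears to be the time saved relative to the optimized runtime. That convention is an assumption inferred from the reported numbers, not something the report states. A minimal check under that assumption (the helper name `speedup` is hypothetical):

```python
def speedup(original_us: float, optimized_us: float) -> float:
    """Relative speedup, ASSUMING the (original - optimized) / optimized convention."""
    return (original_us - optimized_us) / optimized_us


# Headline runtimes from the report above, in microseconds.
headline = speedup(456, 370)
print(f"{headline:.1%}")  # 23.2%
```

Under this convention the headline runtimes give roughly 23%, which matches the PR title.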

📝 Explanation and details

The optimized code achieves a 23% runtime improvement by restructuring the conditional logic to reduce per-iteration overhead, mainly when processing iterables that contain many empty dict/list instances, including ones interleaved with None values.

Key Optimizations

1. Hoisted Type Tuple Outside Loop
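The (dict, list) tuple is now created once and stored in empty_types before the loop, rather than being rebuilt on every iteration that reaches the isinstance() check. In CPython, dict and list are names rather than constants, so each rebuild costs two name lookups plus a tuple allocation.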

2. Split Conditional with Early Continue
The original code used a complex compound condition: if value is not None and not (isinstance(value, (dict, list)) and len(value) == 0). The optimized version splits this into two separate checks with early continue statements:

  • First check: if value is None: continue
  • Second check: if isinstance(value, empty_types) and not value: continue

3. Replaced len() with Truthiness Test
For dict/list emptiness detection, the code now uses not value instead of len(value) == 0. For built-in dict and list objects, the truthiness test avoids looking up and calling len() and then comparing the result against 0.
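
The three changes above can be sketched side by side. This is a reconstruction from the description, not the actual diff: the function names with `_original` and `_optimized` suffixes are hypothetical, and the `(-1, None)` "not found" result is taken from the comments in the generated tests.

```python
from typing import Any, Iterable, Optional, Tuple


def first_non_null_non_empty_value_original(iterable: Iterable[Any]) -> Tuple[int, Optional[Any]]:
    """Sketch of the pre-optimization logic, using the compound condition quoted above."""
    for i, value in enumerate(iterable):
        if value is not None and not (isinstance(value, (dict, list)) and len(value) == 0):
            return i, value
    return -1, None  # assumed "not found" result, per the generated tests


def first_non_null_non_empty_value_optimized(iterable: Iterable[Any]) -> Tuple[int, Optional[Any]]:
    """Sketch of the optimized logic: hoisted tuple, early continues, truthiness test."""
    empty_types = (dict, list)  # built once, before the loop
    for i, value in enumerate(iterable):
        if value is None:
            continue  # skip nulls
        if isinstance(value, empty_types) and not value:
            continue  # skip empty dicts and lists
        return i, value
    return -1, None


data = [None, {}, [], 0, 1]
print(first_non_null_non_empty_value_original(data))   # (3, 0)
print(first_non_null_non_empty_value_optimized(data))  # (3, 0)
```

Note that `0` is returned because only None and empty dict/list values are skipped; other falsy values count as valid.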

Performance Characteristics

The optimization excels when processing data with:

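  • None values interleaved with empty containers: Splitting the condition shortens the per-element path (see test_large_list_with_many_empty_containers: 44.7% faster). None-only inputs gain little (see test_large_list_all_none: 2.74% faster), because the original condition already short-circuited on None before reaching isinstance()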
  • Many empty dicts/lists: The split logic and truthiness test accelerate filtering (see test_large_list_with_mixed_empty_containers: 63.5% faster)
  • High filtering ratio: When most elements need to be skipped, the streamlined logic compounds savings across iterations
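
A rough way to reproduce these characteristics locally is a small `timeit` harness over the three input shapes used by the generated tests. The sketch below re-declares both versions as reconstructed from the description (the names `original`, `optimized`, `WORKLOADS`, and `benchmark` are hypothetical), and absolute timings will vary by machine and Python version.

```python
import timeit


def original(iterable):
    # Reconstructed from the compound condition quoted in the description.
    for i, value in enumerate(iterable):
        if value is not None and not (isinstance(value, (dict, list)) and len(value) == 0):
            return i, value
    return -1, None


def optimized(iterable):
    # Reconstructed from the optimized checks quoted in the description.
    empty_types = (dict, list)
    for i, value in enumerate(iterable):
        if value is None:
            continue
        if isinstance(value, empty_types) and not value:
            continue
        return i, value
    return -1, None


# Input shapes copied from the generated regression tests.
WORKLOADS = {
    "all_none": [None] * 1000,
    "none_and_empty": [None, {}, [], None, {}, None, {}] * 100 + [100],
    "empty_only": [{}, []] * 400 + [{"data": "value"}],
}


def benchmark(number=200, repeat=5):
    """Return {workload: (best_original_seconds, best_optimized_seconds)}."""
    results = {}
    for name, data in WORKLOADS.items():
        before = min(timeit.repeat(lambda: original(data), number=number, repeat=repeat))
        after = min(timeit.repeat(lambda: optimized(data), number=number, repeat=repeat))
        results[name] = (before, after)
    return results


if __name__ == "__main__":
    for name, (before, after) in benchmark().items():
        print(f"{name}: {before / after:.2f}x")
```

Taking the minimum over repeats follows the usual `timeit` guidance, since higher values mostly reflect interference from other processes.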

Impact on Hot Path Usage

Based on function_references, this function is called in arrow_writer.py's type inference code paths:

  • _infer_custom_type_and_encode(): Called to find first non-null sample for type detection (PIL images, PDFs)
  • __arrow_array__(): Used when converting numpy arrays and lists to PyArrow format

These are data loading paths where datasets may contain many None placeholders or empty collections before valid data, so the speedup (23% across the generated benchmarks) applies to dataset serialization and type inference operations that run during data processing pipelines. The end-to-end gain depends on how many elements are skipped before the first valid value.

The optimization produced identical results on all 55 generated regression tests and is faster in most of them, with the strongest improvements (19-63% faster) for large iterables with a high proportion of filtered values. A few microbenchmarks at the ~1-2μs scale are 2-7% slower (for example test_complex_numbers and test_single_none).
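
One way to spot-check behavioral equivalence beyond the generated tests is a differential test over random inputs. The sketch below re-declares both versions as reconstructed from the description (all names are hypothetical). It also shows one edge case where the two conditions are not equivalent: a list or dict subclass that overrides `__bool__`, because `not value` consults `__bool__` while `len(value) == 0` consults `__len__`. Whether such subclasses ever reach this function in practice is not something this PR establishes.

```python
import random


def original(iterable):
    for i, value in enumerate(iterable):
        if value is not None and not (isinstance(value, (dict, list)) and len(value) == 0):
            return i, value
    return -1, None


def optimized(iterable):
    empty_types = (dict, list)
    for i, value in enumerate(iterable):
        if value is None:
            continue
        if isinstance(value, empty_types) and not value:
            continue
        return i, value
    return -1, None


def agree_on_random_inputs(trials=1000, seed=0):
    """Compare both versions on random lists drawn from a pool of built-in values."""
    pool = [None, {}, [], 0, False, "", (), {"a": 1}, [1], "x"]
    rng = random.Random(seed)
    for _ in range(trials):
        data = [rng.choice(pool) for _ in range(rng.randint(0, 20))]
        if original(data) != optimized(data):
            return False
    return True


class TruthyEmptyList(list):
    """Hypothetical list subclass whose truthiness ignores its length."""

    def __bool__(self):
        return True


print(agree_on_random_inputs())  # True

edge = [TruthyEmptyList(), "next"]
print(original(edge))   # (1, 'next'): len() == 0, so the empty subclass instance is skipped
print(optimized(edge))  # (0, []): __bool__ returns True, so it is treated as non-empty
```

For plain dict and list values, and for subclasses such as OrderedDict that do not override `__bool__`, the two conditions agree.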

Correctness verification report:

| Test | Status |
|---|---|
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | ✅ 55 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |

🌀 Generated Regression Tests

from collections import OrderedDict  # real collection classes used in tests
from collections import deque

# imports
import pytest  # used for our unit tests
from src.datasets.utils.py_utils import first_non_null_non_empty_value

def test_basic_with_integers_and_empty_collections():
    # Basic case: skip None, empty dict and empty list; return the first valid numeric value (0).
    data = [None, {}, [], 0, 1]
    idx, val = first_non_null_non_empty_value(data) # 2.32μs -> 1.97μs (18.0% faster)

def test_empty_iterable_returns_minus_one():
    # Edge case: empty iterable should result in (-1, None)
    data = []
    idx, val = first_non_null_non_empty_value(data) # 963ns -> 1.01μs (4.84% slower)

def test_all_none_or_empty_collections_returns_minus_one():
    # Edge case: all values are None or empty dict/list -> no valid entry
    data = [None, {}, [], OrderedDict(), []]
    idx, val = first_non_null_non_empty_value(data) # 2.44μs -> 2.01μs (21.3% faster)

def test_empty_string_is_considered_valid():
    # Strings are NOT checked with dict/list empty logic, so empty string '' is returned as valid
    data = ['', None]
    idx, val = first_non_null_non_empty_value(data) # 1.50μs -> 1.47μs (2.18% faster)

def test_boolean_false_and_zero_are_valid_values():
    # False is not None and not an empty dict/list, so it should be accepted.
    data = [None, {}, False, 0]
    idx, val = first_non_null_non_empty_value(data) # 2.51μs -> 2.30μs (9.28% faster)

def test_empty_tuple_and_deque_are_not_filtered_like_list_or_dict():
    # Empty tuple has len==0 but is not a list/dict, so it should be returned.
    # deque is not a list/dict type, so an empty deque is also considered valid.
    data = [None, (), deque(), []]
    idx, val = first_non_null_non_empty_value(data) # 1.91μs -> 1.85μs (3.40% faster)
    # With the empty deque placed ahead of the empty tuple, the deque (index 1) is returned instead
    data2 = [None, deque(), (), []]
    idx2, val2 = first_non_null_non_empty_value(data2) # 845ns -> 826ns (2.30% faster)

def test_ordereddict_empty_is_treated_as_dict_and_filtered():
    # OrderedDict is a subclass of dict; empty OrderedDict should be filtered like {}
    data = [None, OrderedDict(), 'ok']
    idx, val = first_non_null_non_empty_value(data) # 2.12μs -> 1.83μs (15.7% faster)

def test_large_scale_many_empty_then_value_near_end():
    # Large scale test under the 1000-element constraint: create many empty entries then a final value.
    empties = [None if i % 2 == 0 else [] for i in range(500)]  # 500 elements of None or []
    data = empties + ['found']  # 'found' should be at index 500
    idx, val = first_non_null_non_empty_value(data) # 44.4μs -> 31.7μs (39.8% faster)

def test_mixed_collection_types_like_set_and_bytes_are_accepted():
    # Empty set() and empty bytes b'' are not dict/list, so they should be accepted as valid.
    data = [None, set(), {}, b'', 'next']
    # set() comes before {}, so the first acceptable is the empty set at index 1
    idx, val = first_non_null_non_empty_value(data) # 1.70μs -> 1.65μs (2.54% faster)
    # With {} placed before set(), the empty dict is skipped and the empty set at index 2 is returned
    data2 = [None, {}, set(), b'']
    idx2, val2 = first_non_null_non_empty_value(data2) # 1.05μs -> 871ns (21.0% faster)

def test_returns_object_identity_for_custom_instances():
    # Non-collection arbitrary object instances should be accepted and preserved by identity.
    obj = object()
    data = [None, {}, obj]
    idx, val = first_non_null_non_empty_value(data) # 2.08μs -> 1.80μs (15.3% faster)

def test_first_non_null_non_empty_value_with_nested_collections():
    # A list that contains a nested list is itself non-empty, so [nested] at index 2 is returned.
    nested = [1]
    data = [None, {}, [nested], []]
    idx, val = first_non_null_non_empty_value(data) # 2.15μs -> 1.74μs (24.0% faster)

def test_all_filtered_types_but_tuple_or_nonstandard_sequence():
    # Demonstrate that tuples are treated differently; empty tuple will be accepted though empty list is filtered.
    data = [None, [], (), {}]
    idx, val = first_non_null_non_empty_value(data) # 2.27μs -> 2.03μs (11.7% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
import pytest
from src.datasets.utils.py_utils import first_non_null_non_empty_value

def test_first_non_null_value_in_list_of_integers():
    """Test that the function correctly identifies the first non-null integer in a list."""
    codeflash_output = first_non_null_non_empty_value([None, None, 5, 10]); result = codeflash_output # 1.77μs -> 1.72μs (2.90% faster)

def test_first_non_null_value_is_first_element():
    """Test that the function returns the first element when it is non-null."""
    codeflash_output = first_non_null_non_empty_value([42, None, None]); result = codeflash_output # 1.57μs -> 1.50μs (5.15% faster)

def test_single_non_null_value():
    """Test with a single non-null value in the iterable."""
    codeflash_output = first_non_null_non_empty_value([100]); result = codeflash_output # 1.58μs -> 1.49μs (6.53% faster)

def test_string_values():
    """Test that the function works with string values."""
    codeflash_output = first_non_null_non_empty_value([None, "hello", "world"]); result = codeflash_output # 1.68μs -> 1.64μs (2.38% faster)

def test_mixed_types():
    """Test with mixed types (None, integers, strings)."""
    codeflash_output = first_non_null_non_empty_value([None, 42, "test"]); result = codeflash_output # 1.71μs -> 1.61μs (5.83% faster)

def test_boolean_values():
    """Test that boolean values are correctly identified as non-null."""
    codeflash_output = first_non_null_non_empty_value([None, False, True]); result = codeflash_output # 2.11μs -> 1.99μs (6.45% faster)

def test_zero_value():
    """Test that zero is correctly identified as a non-null value."""
    codeflash_output = first_non_null_non_empty_value([None, 0, 5]); result = codeflash_output # 1.65μs -> 1.58μs (4.37% faster)

def test_empty_string():
    """Test that empty strings are treated as non-null values (not empty container)."""
    codeflash_output = first_non_null_non_empty_value([None, "", "hello"]); result = codeflash_output # 1.70μs -> 1.60μs (5.99% faster)

def test_non_empty_dict():
    """Test with a non-empty dictionary."""
    codeflash_output = first_non_null_non_empty_value([None, {"key": "value"}, None]); result = codeflash_output # 1.83μs -> 1.62μs (13.4% faster)

def test_non_empty_list():
    """Test with a non-empty list."""
    codeflash_output = first_non_null_non_empty_value([None, [1, 2, 3], None]); result = codeflash_output # 1.77μs -> 1.63μs (8.28% faster)

def test_all_none_values():
    """Test when all values are None. Should return -1 and None."""
    codeflash_output = first_non_null_non_empty_value([None, None, None]); result = codeflash_output # 1.32μs -> 1.31μs (0.532% faster)

def test_empty_iterable():
    """Test with an empty iterable. Should return -1 and None."""
    codeflash_output = first_non_null_non_empty_value([]); result = codeflash_output # 1.01μs -> 963ns (4.88% faster)

def test_empty_dict_is_skipped():
    """Test that empty dictionaries are skipped and the function continues searching."""
    codeflash_output = first_non_null_non_empty_value([None, {}, {"a": 1}]); result = codeflash_output # 2.01μs -> 1.72μs (17.0% faster)

def test_empty_list_is_skipped():
    """Test that empty lists are skipped and the function continues searching."""
    codeflash_output = first_non_null_non_empty_value([None, [], [1]]); result = codeflash_output # 2.01μs -> 1.80μs (11.4% faster)

def test_only_empty_containers_and_none():
    """Test when only empty containers and None values are present. Should return -1."""
    codeflash_output = first_non_null_non_empty_value([None, {}, [], None]); result = codeflash_output # 2.12μs -> 1.80μs (17.3% faster)

def test_tuple_input():
    """Test that the function works with tuples."""
    codeflash_output = first_non_null_non_empty_value((None, None, 99)); result = codeflash_output # 1.72μs -> 1.60μs (7.48% faster)

def test_nested_empty_containers():
    """Test with nested empty containers."""
    codeflash_output = first_non_null_non_empty_value([None, [], None, [[]]]); result = codeflash_output # 2.10μs -> 1.79μs (17.5% faster)

def test_first_element_empty_dict():
    """Test when first element is an empty dictionary."""
    codeflash_output = first_non_null_non_empty_value([{}, "value"]); result = codeflash_output # 1.98μs -> 1.75μs (13.2% faster)

def test_first_element_empty_list():
    """Test when first element is an empty list."""
    codeflash_output = first_non_null_non_empty_value([[], 123]); result = codeflash_output # 2.02μs -> 1.81μs (11.4% faster)

def test_negative_numbers():
    """Test with negative numbers."""
    codeflash_output = first_non_null_non_empty_value([None, -42, 0]); result = codeflash_output # 1.63μs -> 1.61μs (0.805% faster)

def test_float_values():
    """Test with floating-point numbers."""
    codeflash_output = first_non_null_non_empty_value([None, 3.14, 2.71]); result = codeflash_output # 1.68μs -> 1.64μs (2.44% faster)

def test_zero_float():
    """Test that 0.0 is treated as a valid non-null value."""
    codeflash_output = first_non_null_non_empty_value([None, 0.0, 1.5]); result = codeflash_output # 1.63μs -> 1.61μs (1.49% faster)

def test_single_none():
    """Test with a single None value."""
    codeflash_output = first_non_null_non_empty_value([None]); result = codeflash_output # 1.12μs -> 1.16μs (2.77% slower)

def test_single_empty_dict():
    """Test with a single empty dictionary."""
    codeflash_output = first_non_null_non_empty_value([{}]); result = codeflash_output # 1.67μs -> 1.39μs (19.9% faster)

def test_single_empty_list():
    """Test with a single empty list."""
    codeflash_output = first_non_null_non_empty_value([[]]); result = codeflash_output # 1.71μs -> 1.42μs (20.0% faster)

def test_none_then_zero():
    """Test that None followed by zero correctly returns zero."""
    codeflash_output = first_non_null_non_empty_value([None, 0]); result = codeflash_output # 1.63μs -> 1.58μs (2.91% faster)

def test_none_then_false():
    """Test that None followed by False correctly returns False."""
    codeflash_output = first_non_null_non_empty_value([None, False]); result = codeflash_output # 1.97μs -> 2.01μs (2.29% slower)

def test_complex_numbers():
    """Test with complex numbers."""
    codeflash_output = first_non_null_non_empty_value([None, 3+4j]); result = codeflash_output # 1.95μs -> 2.09μs (6.78% slower)

def test_custom_object():
    """Test with custom objects."""
    class CustomObj:
        def __init__(self, value):
            self.value = value
    
    obj = CustomObj(42)
    codeflash_output = first_non_null_non_empty_value([None, obj]); result = codeflash_output # 1.90μs -> 1.83μs (3.71% faster)

def test_dict_with_nested_empty_list():
    """Test with dictionary containing nested empty list."""
    codeflash_output = first_non_null_non_empty_value([None, {"key": []}]); result = codeflash_output # 1.80μs -> 1.57μs (14.7% faster)

def test_list_with_none_values():
    """Test with a list containing None values (list itself is non-empty)."""
    codeflash_output = first_non_null_non_empty_value([None, [None, None]]); result = codeflash_output # 1.80μs -> 1.61μs (12.2% faster)

def test_large_list_with_first_value_at_end():
    """Test with a large list where the first non-null value is at the end."""
    large_list = [None] * 999 + [42]
    codeflash_output = first_non_null_non_empty_value(large_list); result = codeflash_output # 37.6μs -> 36.9μs (1.73% faster)

def test_large_list_with_many_empty_containers():
    """Test with a large list containing many empty containers and None values."""
    large_list = [None, {}, [], None, {}, None, {}] * 100 + [100]
    codeflash_output = first_non_null_non_empty_value(large_list); result = codeflash_output # 63.9μs -> 44.2μs (44.7% faster)

def test_large_list_all_none():
    """Test with a large list of all None values."""
    large_list = [None] * 1000
    codeflash_output = first_non_null_non_empty_value(large_list); result = codeflash_output # 37.4μs -> 36.4μs (2.74% faster)

def test_large_list_with_early_value():
    """Test that the function returns immediately upon finding the first non-null value."""
    large_list = [42] + [None] * 999
    codeflash_output = first_non_null_non_empty_value(large_list); result = codeflash_output # 1.57μs -> 1.50μs (5.08% faster)

def test_large_list_with_mixed_empty_containers():
    """Test with a large list containing alternating empty dicts and lists."""
    large_list = [{}, []] * 400 + [{"data": "value"}]
    codeflash_output = first_non_null_non_empty_value(large_list); result = codeflash_output # 107μs -> 65.6μs (63.5% faster)

def test_large_list_with_strings():
    """Test with a large list of strings, some empty and some non-empty."""
    large_list = [""] * 500 + ["found"] + [""] * 500
    codeflash_output = first_non_null_non_empty_value(large_list); result = codeflash_output # 1.56μs -> 1.43μs (9.40% faster)

def test_large_dict_in_list():
    """Test with a large non-empty dictionary."""
    large_dict = {f"key_{i}": i for i in range(500)}
    codeflash_output = first_non_null_non_empty_value([None, large_dict]); result = codeflash_output # 1.89μs -> 1.71μs (10.4% faster)

def test_large_list_containing_list():
    """Test with a large list containing another large list."""
    inner_list = list(range(100))
    codeflash_output = first_non_null_non_empty_value([None, inner_list]); result = codeflash_output # 1.68μs -> 1.65μs (1.63% faster)

def test_many_mixed_values_before_target():
    """Test with many different types of values before finding the target."""
    mixed_list = [None, {}, [], None, 0, False, ""] + [None] * 100 + [True] + [None] * 100
    codeflash_output = first_non_null_non_empty_value(mixed_list); result = codeflash_output # 2.37μs -> 1.99μs (19.1% faster)

def test_large_nested_structure():
    """Test with a large nested structure."""
    nested = {"level1": {"level2": {"level3": list(range(100))}}}
    codeflash_output = first_non_null_non_empty_value([None, nested]); result = codeflash_output # 1.78μs -> 1.57μs (13.3% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes, run git checkout codeflash/optimize-first_non_null_non_empty_value-mlcidxl0 and push.

@codeflash-ai codeflash-ai bot requested a review from aseembits93 February 7, 2026 16:08
@codeflash-ai codeflash-ai bot added the "⚡️ codeflash" (Optimization PR opened by Codeflash AI) and "🎯 Quality: High" (Optimization Quality according to Codeflash) labels Feb 7, 2026