
⚡️ Speed up function get_exported_dataset_infos by 13% #129

Open

codeflash-ai[bot] wants to merge 1 commit into main from codeflash/optimize-get_exported_dataset_infos-mlcurs0l

Conversation

codeflash-ai bot commented on Feb 7, 2026

📄 13% (0.13x) speedup for get_exported_dataset_infos in src/datasets/utils/_dataset_viewer.py

⏱️ Runtime: 7.59 milliseconds → 6.74 milliseconds (best of 51 runs)

📝 Explanation and details

This optimization achieves a 12% runtime improvement by introducing intelligent caching of authentication headers through Python's lru_cache decorator.

Key Optimization

The core change wraps the expensive huggingface_hub.utils.build_hf_headers() call in a cached helper function _cached_build_hf_headers(). The line profiler data shows this cuts the time spent in get_authentication_headers_for_url from 1.36 ms to 0.38 ms (a 72% reduction).
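
For reference, a minimal sketch of the caching pattern described here (the helper name comes from this description, but the exact signature, cache size, and placement in the real diff may differ):

```python
from functools import lru_cache
from typing import Optional, Union

from huggingface_hub.utils import build_hf_headers


@lru_cache(maxsize=32)
def _cached_build_hf_headers(token: Optional[Union[bool, str]]) -> dict:
    # build_hf_headers performs string formatting, version lookups, and dict
    # construction on every call; memoizing on the (hashable) token skips that
    # work for repeated calls. Since lru_cache returns the same dict object on
    # a hit, callers must treat the result as read-only or copy it before
    # mutating, otherwise per-request changes would leak into the cache.
    return build_hf_headers(token=token)
```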

Why This Works

  1. Expensive Header Construction: The original code calls build_hf_headers() on every invocation, which involves string operations, version lookups, and dictionary construction. This is inherently costly.

  2. High Reuse Pattern: The function references show get_exported_dataset_infos is called multiple times from load.py's get_module() methods - often with the same token value. With caching, subsequent calls with identical tokens return the pre-built dictionary instantly.

  3. Safe Caching: The token parameter is hashable (None/bool/str), making it safe to use as an LRU cache key. The cache size of 32 is appropriate for typical usage patterns where a small set of tokens is reused. A short demonstration of the cache behavior follows this list.
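
A quick demonstration of the cache behavior, using the sketch above (the token string is a made-up placeholder, not a real credential):

```python
# First call with a given token is a cache miss and builds the headers dict;
# later calls with an equal token return the cached object immediately.
h1 = _cached_build_hf_headers("hf_placeholder_token")
h2 = _cached_build_hf_headers("hf_placeholder_token")
assert h1 is h2  # cache hit: same object, no rebuild

# None is hashable too, so unauthenticated calls (token=None) benefit as well.
_cached_build_hf_headers(None)
print(_cached_build_hf_headers.cache_info())
# e.g. CacheInfo(hits=1, misses=2, maxsize=32, currsize=2)
```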

Test Results Analysis

The annotated tests demonstrate consistent speedups, particularly in scenarios with:

  • Repeated successful retrievals: 12-16% faster when the same token is reused across calls
  • Empty/null tokens: Up to 47.9% faster when token=None, as this common case benefits most from caching
  • Multiple sequential requests: The optimization compounds when making multiple calls in succession (as happens in the hot path)

Impact on Hot Paths

Based on the function references, get_exported_dataset_infos is called during dataset loading in HubDatasetModuleFactoryWithScript.get_module() and HubDatasetModuleFactoryWithParquetExport.get_module(). This is a critical initialization path where:

  • Multiple HTTP requests are made per dataset load
  • The same authentication token is reused across these requests
  • The 12% overall speedup translates to faster dataset initialization, directly improving user experience

The optimization preserves all behavior while eliminating redundant work through intelligent memoization.

Correctness verification report:

| Test | Status |
| --- | --- |
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | 38 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |
🌀 Generated Regression Tests
from unittest.mock import (  # used to patch external dependencies and create mock responses
    Mock, patch)

# imports
import pytest  # used for our unit tests
from src.datasets.utils._dataset_viewer import get_exported_dataset_infos

def test_returns_dataset_info_when_ready_and_revision_matches():
    # Basic happy-path: session.get returns a response whose headers contain the same X-Revision
    # as the provided commit_hash and whose JSON indicates the info is ready.
    dataset = "some/author-dataset"
    commit_hash = "rev-1234"
    expected_info = {"config": {"features": ["a", "b"]}}

    # Create a fake response object with the attributes the function uses.
    response = Mock()
    response.headers = {"X-Revision": commit_hash}  # header matches commit_hash
    # raise_for_status should be callable and not raise
    response.raise_for_status = Mock()
    # json() returns the dict structure the function expects when ready
    response.json = Mock(return_value={"partial": False, "pending": False, "failed": False, "dataset_info": expected_info})

    # Patch the get_session() used in the module to return an object whose get() returns our response.
    with patch("src.datasets.utils._dataset_viewer.get_session", autospec=True) as mock_get_session:
        mock_session = Mock()
        mock_session.get = Mock(return_value=response)
        mock_get_session.return_value = mock_session

        # Also patch authentication header builder to avoid any external dependency.
        with patch("src.datasets.utils._dataset_viewer.get_authentication_headers_for_url", return_value={}):
            codeflash_output = get_exported_dataset_infos(dataset, commit_hash, token="token-xyz"); result = codeflash_output
            assert result == expected_info

def test_accepts_commit_hash_none_and_returns_info():
    # When commit_hash is None the code accepts whatever X-Revision header is present.
    dataset = "owner/another-dataset"
    commit_hash = None
    expected_info = {"config-v2": {"num_rows": 42}}

    response = Mock()
    response.headers = {"X-Revision": "some-other-rev"}  # different from commit_hash but commit_hash is None -> allowed
    response.raise_for_status = Mock()
    response.json = Mock(return_value={"partial": False, "pending": False, "failed": False, "dataset_info": expected_info})

    with patch("src.datasets.utils._dataset_viewer.get_session", autospec=True) as mock_get_session:
        mock_session = Mock()
        mock_session.get = Mock(return_value=response)
        mock_get_session.return_value = mock_session

        with patch("src.datasets.utils._dataset_viewer.get_authentication_headers_for_url", return_value={}):
            codeflash_output = get_exported_dataset_infos(dataset, commit_hash, token=None); result = codeflash_output
            assert result == expected_info

def test_raises_when_no_x_revision_header():
    # If the HTTP response lacks the X-Revision header, the function should not return anything
    # and ultimately raise the expected "No exported dataset infos available." error.
    dataset = "no/header-dataset"
    commit_hash = "any-rev"

    response = Mock()
    response.headers = {}  # No X-Revision present
    response.raise_for_status = Mock()
    # json might be present but should not be consumed because X-Revision missing
    response.json = Mock(return_value={"dataset_info": {"irrelevant": True}})

    with patch("src.datasets.utils._dataset_viewer.get_session", autospec=True) as mock_get_session:
        mock_session = Mock()
        mock_session.get = Mock(return_value=response)
        mock_get_session.return_value = mock_session

        with patch("src.datasets.utils._dataset_viewer.get_authentication_headers_for_url", return_value={}):
            # The function raises a DatasetViewerError (a subclass of Exception). We assert the message text.
            with pytest.raises(Exception) as excinfo:
                get_exported_dataset_infos(dataset, commit_hash, token="t")
            assert "No exported dataset infos available" in str(excinfo.value)

def test_raises_on_revision_mismatch():
    # If X-Revision exists but does not match the provided commit_hash, the function should not return info.
    dataset = "mismatch/dataset"
    commit_hash = "expected-rev"
    response = Mock()
    response.headers = {"X-Revision": "different-rev"}  # mismatch
    response.raise_for_status = Mock()
    response.json = Mock(return_value={"partial": False, "pending": False, "failed": False, "dataset_info": {"x": 1}})

    with patch("src.datasets.utils._dataset_viewer.get_session", autospec=True) as mock_get_session:
        mock_session = Mock()
        mock_session.get = Mock(return_value=response)
        mock_get_session.return_value = mock_session

        with patch("src.datasets.utils._dataset_viewer.get_authentication_headers_for_url", return_value={}):
            with pytest.raises(Exception) as excinfo:
                get_exported_dataset_infos(dataset, commit_hash, token=False)

def test_raises_when_info_incomplete_flags_present():
    # If the JSON indicates the dataset info is partial or pending or failed, the function should not return it.
    dataset = "incomplete/dataset"
    commit_hash = "rev-99"
    response = Mock()
    response.headers = {"X-Revision": commit_hash}
    response.raise_for_status = Mock()
    # Try a few scenarios where flags prevent returning dataset_info
    bad_payloads = [
        {"partial": True, "pending": False, "failed": False, "dataset_info": {"a": 1}},  # partial => not ready
        {"partial": False, "pending": True, "failed": False, "dataset_info": {"a": 1}},  # pending => not ready
        {"partial": False, "pending": False, "failed": True, "dataset_info": {"a": 1}},  # failed => not ready
        {"partial": False, "pending": False, "failed": False},  # missing dataset_info => not ready
    ]

    for payload in bad_payloads:
        response.json = Mock(return_value=payload)
        with patch("src.datasets.utils._dataset_viewer.get_session", autospec=True) as mock_get_session:
            mock_session = Mock()
            mock_session.get = Mock(return_value=response)
            mock_get_session.return_value = mock_session

            with patch("src.datasets.utils._dataset_viewer.get_authentication_headers_for_url", return_value={}):
                with pytest.raises(Exception) as excinfo:
                    get_exported_dataset_infos(dataset, commit_hash, token="tk")

def test_handles_raise_for_status_exception_gracefully():
    # If response.raise_for_status() raises an exception (e.g. HTTPError), get_exported_dataset_infos should
    # catch it and ultimately raise the DatasetViewerError with the expected message.
    dataset = "http/error-dataset"
    commit_hash = "any"

    response = Mock()
    response.headers = {"X-Revision": commit_hash}
    # Simulate raise_for_status raising a requests-like error
    response.raise_for_status = Mock(side_effect=RuntimeError("HTTP failure"))
    # json should not be called, but define it anyway
    response.json = Mock(return_value={"partial": False, "pending": False, "failed": False, "dataset_info": {"ok": True}})

    with patch("src.datasets.utils._dataset_viewer.get_session", autospec=True) as mock_get_session:
        mock_session = Mock()
        mock_session.get = Mock(return_value=response)
        mock_get_session.return_value = mock_session

        with patch("src.datasets.utils._dataset_viewer.get_authentication_headers_for_url", return_value={}):
            with pytest.raises(Exception) as excinfo:
                get_exported_dataset_infos(dataset, commit_hash, token=True)

def test_handles_get_session_get_raising_exception():
    # If get_session().get itself raises (e.g., network error), the function should handle it and raise the DatasetViewerError.
    dataset = "network/failure"
    commit_hash = "rev"

    with patch("src.datasets.utils._dataset_viewer.get_session", autospec=True) as mock_get_session:
        mock_session = Mock()
        # Simulate a network error when calling session.get
        mock_session.get = Mock(side_effect=ConnectionError("conn fail"))
        mock_get_session.return_value = mock_session

        with patch("src.datasets.utils._dataset_viewer.get_authentication_headers_for_url", return_value={}):
            with pytest.raises(Exception) as excinfo:
                get_exported_dataset_infos(dataset, commit_hash, token="tkn")

def test_large_dataset_info_payload_handled_correctly():
    # Ensure the function can handle a large dataset_info mapping (within limits specified).
    # We keep the number of items below 1000 as requested. Here we use 500 items.
    dataset = "big/large-dataset"
    commit_hash = "big-rev"
    large_info = {f"config_{i}": {"value": i} for i in range(500)}  # 500 entries

    response = Mock()
    response.headers = {"X-Revision": commit_hash}
    response.raise_for_status = Mock()
    response.json = Mock(return_value={"partial": False, "pending": False, "failed": False, "dataset_info": large_info})

    with patch("src.datasets.utils._dataset_viewer.get_session", autospec=True) as mock_get_session:
        mock_session = Mock()
        mock_session.get = Mock(return_value=response)
        mock_get_session.return_value = mock_session

        with patch("src.datasets.utils._dataset_viewer.get_authentication_headers_for_url", return_value={}):
            codeflash_output = get_exported_dataset_infos(dataset, commit_hash, token="token"); result = codeflash_output
            assert result == large_info

def test_very_long_dataset_name_is_supported():
    # The function constructs URLs using the dataset string. Ensure an unusually long dataset identifier
    # (close to 1000 characters but under the stated limit) is accepted and processed.
    long_dataset = "owner/" + ("a" * 900)
    commit_hash = "long-rev"
    payload = {"partial": False, "pending": False, "failed": False, "dataset_info": {"ok": True}}

    response = Mock()
    response.headers = {"X-Revision": commit_hash}
    response.raise_for_status = Mock()
    response.json = Mock(return_value=payload)

    with patch("src.datasets.utils._dataset_viewer.get_session", autospec=True) as mock_get_session:
        mock_session = Mock()
        mock_session.get = Mock(return_value=response)
        mock_get_session.return_value = mock_session

        with patch("src.datasets.utils._dataset_viewer.get_authentication_headers_for_url", return_value={}):
            codeflash_output = get_exported_dataset_infos(long_dataset, commit_hash, token=None); result = codeflash_output
            assert result == payload["dataset_info"]
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
from unittest.mock import Mock, patch

import pytest
from src.datasets.utils._dataset_viewer import get_exported_dataset_infos

def test_successful_retrieval_with_matching_commit_hash():
    """Test successful dataset info retrieval when commit hash matches."""
    expected_dataset_info = {
        "config1": {
            "features": ["feature1", "feature2"],
            "num_rows": 1000
        }
    }
    
    mock_response = Mock()
    mock_response.headers = {"X-Revision": "abc123def456"}
    mock_response.json.return_value = {
        "partial": False,
        "pending": False,
        "failed": False,
        "dataset_info": expected_dataset_info
    }
    
    with patch("src.datasets.utils._dataset_viewer.get_session") as mock_session:
        mock_session.return_value.get.return_value = mock_response
        
        codeflash_output = get_exported_dataset_infos(
            dataset="test-dataset",
            commit_hash="abc123def456",
            token="test_token"
        ); result = codeflash_output # 261μs -> 225μs (16.1% faster)
        assert result == expected_dataset_info
        mock_response.raise_for_status.assert_called_once()

def test_successful_retrieval_with_none_commit_hash():
    """Test successful dataset info retrieval when commit_hash is None."""
    expected_dataset_info = {
        "default": {
            "features": ["col1", "col2"],
            "num_rows": 500
        }
    }
    
    mock_response = Mock()
    mock_response.headers = {"X-Revision": "xyz789abc"}
    mock_response.json.return_value = {
        "partial": False,
        "pending": False,
        "failed": False,
        "dataset_info": expected_dataset_info
    }
    
    with patch("src.datasets.utils._dataset_viewer.get_session") as mock_session:
        mock_session.return_value.get.return_value = mock_response
        
        codeflash_output = get_exported_dataset_infos(
            dataset="another-dataset",
            commit_hash=None,
            token="test_token"
        ); result = codeflash_output # 256μs -> 221μs (15.9% faster)
        assert result == expected_dataset_info

def test_successful_retrieval_with_empty_token():
    """Test successful dataset info retrieval with empty token."""
    expected_dataset_info = {"config": {"features": []}}
    
    mock_response = Mock()
    mock_response.headers = {"X-Revision": "hash123"}
    mock_response.json.return_value = {
        "partial": False,
        "pending": False,
        "failed": False,
        "dataset_info": expected_dataset_info
    }
    
    with patch("src.datasets.utils._dataset_viewer.get_session") as mock_session:
        mock_session.return_value.get.return_value = mock_response
        
        codeflash_output = get_exported_dataset_infos(
            dataset="public-dataset",
            commit_hash="hash123",
            token=None
        ); result = codeflash_output # 326μs -> 220μs (47.9% faster)
        assert result == expected_dataset_info

def test_multiple_configs_in_dataset_info():
    """Test retrieval with multiple configurations in dataset_info."""
    expected_dataset_info = {
        "config1": {"features": ["a", "b"], "num_rows": 100},
        "config2": {"features": ["x", "y", "z"], "num_rows": 200},
        "config3": {"features": ["p", "q"], "num_rows": 50}
    }
    
    mock_response = Mock()
    mock_response.headers = {"X-Revision": "hash_multi"}
    mock_response.json.return_value = {
        "partial": False,
        "pending": False,
        "failed": False,
        "dataset_info": expected_dataset_info
    }
    
    with patch("src.datasets.utils._dataset_viewer.get_session") as mock_session:
        mock_session.return_value.get.return_value = mock_response
        
        codeflash_output = get_exported_dataset_infos(
            dataset="multi-config-dataset",
            commit_hash="hash_multi",
            token="token"
        ); result = codeflash_output # 254μs -> 223μs (14.0% faster)
        assert result == expected_dataset_info

def test_mismatched_commit_hash_raises_error():
    """Test that mismatched commit hash raises DatasetViewerError."""
    mock_response = Mock()
    mock_response.headers = {"X-Revision": "outdated_hash"}
    mock_response.json.return_value = {
        "partial": False,
        "pending": False,
        "failed": False,
        "dataset_info": {"config": {}}
    }
    
    with patch("src.datasets.utils._dataset_viewer.get_session") as mock_session:
        mock_session.return_value.get.return_value = mock_response
        
        with pytest.raises(Exception) as exc_info:
            get_exported_dataset_infos(
                dataset="test-dataset",
                commit_hash="expected_hash",
                token="token"
            )

def test_partial_dataset_info_raises_error():
    """Test that partial dataset info raises DatasetViewerError."""
    mock_response = Mock()
    mock_response.headers = {"X-Revision": "hash123"}
    mock_response.json.return_value = {
        "partial": True,  # Partial is True
        "pending": False,
        "failed": False,
        "dataset_info": {"config": {}}
    }
    
    with patch("src.datasets.utils._dataset_viewer.get_session") as mock_session:
        mock_session.return_value.get.return_value = mock_response
        
        with pytest.raises(Exception) as exc_info:
            get_exported_dataset_infos(
                dataset="partial-dataset",
                commit_hash="hash123",
                token="token"
            )

def test_pending_dataset_info_raises_error():
    """Test that pending dataset info raises DatasetViewerError."""
    mock_response = Mock()
    mock_response.headers = {"X-Revision": "hash456"}
    mock_response.json.return_value = {
        "partial": False,
        "pending": True,  # Pending is True
        "failed": False,
        "dataset_info": {"config": {}}
    }
    
    with patch("src.datasets.utils._dataset_viewer.get_session") as mock_session:
        mock_session.return_value.get.return_value = mock_response
        
        with pytest.raises(Exception) as exc_info:
            get_exported_dataset_infos(
                dataset="pending-dataset",
                commit_hash="hash456",
                token="token"
            )

def test_failed_dataset_info_raises_error():
    """Test that failed dataset info raises DatasetViewerError."""
    mock_response = Mock()
    mock_response.headers = {"X-Revision": "hash789"}
    mock_response.json.return_value = {
        "partial": False,
        "pending": False,
        "failed": True,  # Failed is True
        "dataset_info": {"config": {}}
    }
    
    with patch("src.datasets.utils._dataset_viewer.get_session") as mock_session:
        mock_session.return_value.get.return_value = mock_response
        
        with pytest.raises(Exception) as exc_info:
            get_exported_dataset_infos(
                dataset="failed-dataset",
                commit_hash="hash789",
                token="token"
            )

def test_missing_dataset_info_key_raises_error():
    """Test that missing dataset_info key raises DatasetViewerError."""
    mock_response = Mock()
    mock_response.headers = {"X-Revision": "hash999"}
    mock_response.json.return_value = {
        "partial": False,
        "pending": False,
        "failed": False
        # Missing "dataset_info" key
    }
    
    with patch("src.datasets.utils._dataset_viewer.get_session") as mock_session:
        mock_session.return_value.get.return_value = mock_response
        
        with pytest.raises(Exception) as exc_info:
            get_exported_dataset_infos(
                dataset="no-info-dataset",
                commit_hash="hash999",
                token="token"
            )

def test_missing_x_revision_header_raises_error():
    """Test that missing X-Revision header raises DatasetViewerError."""
    mock_response = Mock()
    mock_response.headers = {}  # No X-Revision header
    
    with patch("src.datasets.utils._dataset_viewer.get_session") as mock_session:
        mock_session.return_value.get.return_value = mock_response
        
        with pytest.raises(Exception) as exc_info:
            get_exported_dataset_infos(
                dataset="no-revision-dataset",
                commit_hash="hash123",
                token="token"
            )

def test_http_error_raises_dataset_viewer_error():
    """Test that HTTP error from API raises DatasetViewerError."""
    mock_response = Mock()
    mock_response.raise_for_status.side_effect = Exception("404 Not Found")
    
    with patch("src.datasets.utils._dataset_viewer.get_session") as mock_session:
        mock_session.return_value.get.return_value = mock_response
        
        with pytest.raises(Exception) as exc_info:
            get_exported_dataset_infos(
                dataset="not-found-dataset",
                commit_hash="hash",
                token="token"
            )

def test_timeout_error_raises_dataset_viewer_error():
    """Test that timeout error raises DatasetViewerError."""
    with patch("src.datasets.utils._dataset_viewer.get_session") as mock_session:
        mock_session.return_value.get.side_effect = TimeoutError("Request timed out")
        
        with pytest.raises(Exception) as exc_info:
            get_exported_dataset_infos(
                dataset="timeout-dataset",
                commit_hash="hash",
                token="token"
            )

def test_json_decode_error_raises_dataset_viewer_error():
    """Test that JSON decode error raises DatasetViewerError."""
    mock_response = Mock()
    mock_response.headers = {"X-Revision": "hash123"}
    mock_response.json.side_effect = ValueError("Invalid JSON")
    
    with patch("src.datasets.utils._dataset_viewer.get_session") as mock_session:
        mock_session.return_value.get.return_value = mock_response
        
        with pytest.raises(Exception) as exc_info:
            get_exported_dataset_infos(
                dataset="bad-json-dataset",
                commit_hash="hash123",
                token="token"
            )

def test_empty_dataset_info_dict():
    """Test that empty dataset_info dictionary is returned successfully."""
    expected_dataset_info = {}
    
    mock_response = Mock()
    mock_response.headers = {"X-Revision": "hash_empty"}
    mock_response.json.return_value = {
        "partial": False,
        "pending": False,
        "failed": False,
        "dataset_info": expected_dataset_info
    }
    
    with patch("src.datasets.utils._dataset_viewer.get_session") as mock_session:
        mock_session.return_value.get.return_value = mock_response
        
        codeflash_output = get_exported_dataset_infos(
            dataset="empty-dataset",
            commit_hash="hash_empty",
            token="token"
        ); result = codeflash_output # 251μs -> 223μs (12.6% faster)
        assert result == expected_dataset_info

def test_dataset_with_special_characters_in_name():
    """Test dataset name with special characters."""
    expected_dataset_info = {"config": {}}
    
    mock_response = Mock()
    mock_response.headers = {"X-Revision": "hash_special"}
    mock_response.json.return_value = {
        "partial": False,
        "pending": False,
        "failed": False,
        "dataset_info": expected_dataset_info
    }
    
    with patch("src.datasets.utils._dataset_viewer.get_session") as mock_session:
        mock_session.return_value.get.return_value = mock_response
        
        codeflash_output = get_exported_dataset_infos(
            dataset="dataset-with-special_chars.v2",
            commit_hash="hash_special",
            token="token"
        ); result = codeflash_output # 252μs -> 221μs (14.0% faster)
        assert result == expected_dataset_info

def test_very_long_dataset_name():
    """Test dataset name with very long string."""
    long_dataset_name = "a" * 256
    expected_dataset_info = {"config": {}}
    
    mock_response = Mock()
    mock_response.headers = {"X-Revision": "hash_long"}
    mock_response.json.return_value = {
        "partial": False,
        "pending": False,
        "failed": False,
        "dataset_info": expected_dataset_info
    }
    
    with patch("src.datasets.utils._dataset_viewer.get_session") as mock_session:
        mock_session.return_value.get.return_value = mock_response
        
        codeflash_output = get_exported_dataset_infos(
            dataset=long_dataset_name,
            commit_hash="hash_long",
            token="token"
        ); result = codeflash_output # 250μs -> 220μs (13.3% faster)
        assert result == expected_dataset_info

def test_pending_explicitly_false():
    """Test when pending is explicitly False."""
    expected_dataset_info = {"config": {"features": ["f1"]}}
    
    mock_response = Mock()
    mock_response.headers = {"X-Revision": "hash_pending_false"}
    mock_response.json.return_value = {
        "partial": False,
        "pending": False,  # Explicitly False
        "failed": False,
        "dataset_info": expected_dataset_info
    }
    
    with patch("src.datasets.utils._dataset_viewer.get_session") as mock_session:
        mock_session.return_value.get.return_value = mock_response
        
        codeflash_output = get_exported_dataset_infos(
            dataset="pending-false-dataset",
            commit_hash="hash_pending_false",
            token="token"
        ); result = codeflash_output # 249μs -> 221μs (12.7% faster)
        assert result == expected_dataset_info

def test_failed_explicitly_false():
    """Test when failed is explicitly False."""
    expected_dataset_info = {"config": {"features": ["f1"]}}
    
    mock_response = Mock()
    mock_response.headers = {"X-Revision": "hash_failed_false"}
    mock_response.json.return_value = {
        "partial": False,
        "pending": False,
        "failed": False,  # Explicitly False
        "dataset_info": expected_dataset_info
    }
    
    with patch("src.datasets.utils._dataset_viewer.get_session") as mock_session:
        mock_session.return_value.get.return_value = mock_response
        
        codeflash_output = get_exported_dataset_infos(
            dataset="failed-false-dataset",
            commit_hash="hash_failed_false",
            token="token"
        ); result = codeflash_output # 251μs -> 219μs (14.7% faster)
        assert result == expected_dataset_info

def test_large_number_of_configs():
    """Test dataset info with large number of configurations."""
    # Create dataset info with 100 configurations
    large_dataset_info = {
        f"config_{i}": {
            "features": [f"feature_{j}" for j in range(10)],
            "num_rows": 1000 + i
        }
        for i in range(100)
    }
    
    mock_response = Mock()
    mock_response.headers = {"X-Revision": "hash_large_configs"}
    mock_response.json.return_value = {
        "partial": False,
        "pending": False,
        "failed": False,
        "dataset_info": large_dataset_info
    }
    
    with patch("src.datasets.utils._dataset_viewer.get_session") as mock_session:
        mock_session.return_value.get.return_value = mock_response
        
        codeflash_output = get_exported_dataset_infos(
            dataset="large-configs-dataset",
            commit_hash="hash_large_configs",
            token="token"
        ); result = codeflash_output # 252μs -> 221μs (13.9% faster)
        assert result == large_dataset_info

def test_large_number_of_features_per_config():
    """Test dataset info with large number of features per config."""
    # Create dataset info with 50 configs, each with 50 features
    large_dataset_info = {
        f"config_{i}": {
            "features": [f"feature_{j}" for j in range(50)],
            "num_rows": 10000
        }
        for i in range(50)
    }
    
    mock_response = Mock()
    mock_response.headers = {"X-Revision": "hash_large_features"}
    mock_response.json.return_value = {
        "partial": False,
        "pending": False,
        "failed": False,
        "dataset_info": large_dataset_info
    }
    
    with patch("src.datasets.utils._dataset_viewer.get_session") as mock_session:
        mock_session.return_value.get.return_value = mock_response
        
        codeflash_output = get_exported_dataset_infos(
            dataset="large-features-dataset",
            commit_hash="hash_large_features",
            token="token"
        ); result = codeflash_output # 255μs -> 223μs (13.9% faster)
        assert result == large_dataset_info
        for config_name in result:
            assert len(result[config_name]["features"]) == 50

def test_deeply_nested_dataset_info():
    """Test dataset info with deeply nested structures."""
    # Create dataset info with nested dictionaries
    large_dataset_info = {
        f"config_{i}": {
            "features": [f"feature_{j}" for j in range(20)],
            "metadata": {
                "nested_level_1": {
                    "nested_level_2": {
                        "nested_level_3": {
                            "value": f"data_{i}"
                        }
                    }
                },
                "statistics": {
                    "min": 0,
                    "max": 1000,
                    "mean": 500
                }
            }
        }
        for i in range(20)
    }
    
    mock_response = Mock()
    mock_response.headers = {"X-Revision": "hash_nested"}
    mock_response.json.return_value = {
        "partial": False,
        "pending": False,
        "failed": False,
        "dataset_info": large_dataset_info
    }
    
    with patch("src.datasets.utils._dataset_viewer.get_session") as mock_session:
        mock_session.return_value.get.return_value = mock_response
        
        codeflash_output = get_exported_dataset_infos(
            dataset="nested-dataset",
            commit_hash="hash_nested",
            token="token"
        ); result = codeflash_output # 249μs -> 220μs (12.9% faster)
        assert result == large_dataset_info

def test_large_config_name_strings():
    """Test dataset info with very large config names."""
    # Create dataset info with large config names
    large_dataset_info = {
        "config_" + "a" * 100 + f"_{i}": {
            "features": [f"feature_{j}" for j in range(10)]
        }
        for i in range(30)
    }
    
    mock_response = Mock()
    mock_response.headers = {"X-Revision": "hash_long_names"}
    mock_response.json.return_value = {
        "partial": False,
        "pending": False,
        "failed": False,
        "dataset_info": large_dataset_info
    }
    
    with patch("src.datasets.utils._dataset_viewer.get_session") as mock_session:
        mock_session.return_value.get.return_value = mock_response
        
        codeflash_output = get_exported_dataset_infos(
            dataset="long-names-dataset",
            commit_hash="hash_long_names",
            token="token"
        ); result = codeflash_output # 248μs -> 220μs (12.4% faster)
        assert result == large_dataset_info

def test_dataset_info_with_large_numeric_values():
    """Test dataset info with large numeric values."""
    large_dataset_info = {
        "config_0": {
            "num_rows": 999999999,
            "size_gb": 500.5,
            "features": ["feature_1", "feature_2"]
        },
        "config_1": {
            "num_rows": 1234567890,
            "size_gb": 1000.25,
            "features": ["feature_a", "feature_b", "feature_c"]
        }
    }
    
    mock_response = Mock()
    mock_response.headers = {"X-Revision": "hash_large_nums"}
    mock_response.json.return_value = {
        "partial": False,
        "pending": False,
        "failed": False,
        "dataset_info": large_dataset_info
    }
    
    with patch("src.datasets.utils._dataset_viewer.get_session") as mock_session:
        mock_session.return_value.get.return_value = mock_response
        
        codeflash_output = get_exported_dataset_infos(
            dataset="large-nums-dataset",
            commit_hash="hash_large_nums",
            token="token"
        ); result = codeflash_output # 249μs -> 223μs (12.0% faster)
        assert result == large_dataset_info

def test_multiple_sequential_requests():
    """Test multiple sequential calls to the function."""
    datasets = [
        ("dataset1", "hash1", {"config": {"features": ["f1"]}}),
        ("dataset2", "hash2", {"config_a": {"features": ["f2", "f3"]}}),
        ("dataset3", "hash3", {"config_x": {}, "config_y": {}}),
    ]
    
    for dataset_name, commit_hash, expected_info in datasets:
        mock_response = Mock()
        mock_response.headers = {"X-Revision": commit_hash}
        mock_response.json.return_value = {
            "partial": False,
            "pending": False,
            "failed": False,
            "dataset_info": expected_info
        }
        
        with patch("src.datasets.utils._dataset_viewer.get_session") as mock_session:
            mock_session.return_value.get.return_value = mock_response
            
            codeflash_output = get_exported_dataset_infos(
                dataset=dataset_name,
                commit_hash=commit_hash,
                token="token"
            ); result = codeflash_output
            assert result == expected_info
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes, run git checkout codeflash/optimize-get_exported_dataset_infos-mlcurs0l and push.

codeflash-ai bot requested a review from aseembits93 on Feb 7, 2026 at 21:55

codeflash-ai bot added the "⚡️ codeflash" and "🎯 Quality: High" labels on Feb 7, 2026