Preserve float columns when JSON loader uses field= #8209
Open
LeSingh1 wants to merge 1 commit into
Closes huggingface#6937.

When the JSON loader is invoked with `field=...` (which routes to `pd.read_json` so that column insertion order is preserved, see huggingface#6914), columns whose values are all integer-valued floats such as [0.0, 1.0, 2.0] get coerced to int64. This is the underlying behavior of `pd.read_json(..., dtype_backend="pyarrow")` and is tracked upstream at pandas-dev/pandas#58866; the dataset-viewer statistics endpoint was failing as a direct consequence (see linked CI log on the issue).

Replace the `pd.read_json` -> `pa.Table.from_pandas` path with a small helper, `_arrow_table_from_field`, that constructs the Arrow table directly from the already-parsed Python object:

- list of dicts: keys are collected in insertion order (CPython 3.7+ dict semantics) so column order is preserved; `pa.Table.from_pydict` then performs PyArrow's own type inference, which keeps integer-valued floats as float64.
- list of scalars: wrap in a single-column table named after the configured feature or fall back to "text" (mirrors the prior `df.columns.tolist() == [0]` rename).
- dict of lists (column-major payload): `pa.Table.from_pydict`.
- empty list with features supplied: emit empty columns matching the configured feature names so downstream `_cast_table` aligns.

Pandas is no longer involved on this path. The other code path that parses raw JSON Lines via `paj.read_json` is unchanged and was not affected by this bug.

Adds four regression tests in tests/packaged_modules/test_json.py:

- field= path: integer-valued floats stay float64 (the exact issue repro)
- list-of-dicts: mixed int / float / int-valued-float columns preserve their inferred types
- dict-of-lists field payload: float column preserved
- column insertion order preserved (sanity check vs the original huggingface#6914 regression huggingface#6913 covered)

All 33 tests in tests/packaged_modules/test_json.py pass (29 pre-existing plus the 4 new).
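The shape handling described in the commit message can be sketched in plain Python. This is only an illustration, not the PR's actual `_arrow_table_from_field`: the function name `columns_from_field`, its `feature_names` parameter, and the choice to fill missing keys with `None` are all assumptions made for the example. It stops at a column-major dict, which is the form one would then hand to `pa.Table.from_pydict`; PyArrow is left out so the snippet runs with the standard library alone.

```python
import json
from typing import Any, Dict, List, Optional


def columns_from_field(
    payload: Any, feature_names: Optional[List[str]] = None
) -> Dict[str, List[Any]]:
    """Hypothetical helper: normalize a parsed JSON field to column-major form."""
    # dict of lists: already column-major, copy it through unchanged.
    if isinstance(payload, dict):
        return dict(payload)
    if not isinstance(payload, list):
        raise TypeError(f"unsupported field payload: {type(payload).__name__}")

    # Empty list: emit empty columns named after the configured features, if any.
    if not payload:
        return {name: [] for name in (feature_names or [])}

    # list of dicts: collect keys in first-seen order, then fill row by row.
    if all(isinstance(row, dict) for row in payload):
        columns: Dict[str, List[Any]] = {}
        for row in payload:
            for key in row:
                columns.setdefault(key, [])
        for row in payload:
            for key, values in columns.items():
                # Assumption: a key missing from a row becomes None.
                values.append(row.get(key))
        return columns

    # list of scalars: single column named after the feature, else "text".
    name = feature_names[0] if feature_names else "text"
    return {name: list(payload)}


# Usage: values parsed by json.loads keep their Python types, so 0.0 stays a float.
parsed = json.loads('{"data": [{"b": 1, "a": 0.0}, {"a": 1.0, "c": "x"}]}')
cols = columns_from_field(parsed["data"])
```

Because the values never pass through a DataFrame, whatever type inference runs afterwards sees Python floats rather than already-coerced integers.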
d1da782 to
64aa3ae
Compare
Closes #6937.

When `load_dataset("json", data_files=..., field="data", ...)` is used, columns whose values are all integer-valued floats (`[0.0, 1.0, 2.0]`) get silently coerced to `int64`.

The underlying cause is `pd.read_json(..., dtype_backend="pyarrow")` (tracked upstream at pandas-dev/pandas#58866). The path that hits it is the one introduced in #6914 to preserve column insertion order. The dataset-viewer statistics regression in the issue is a direct consequence.

This PR replaces the `pandas.read_json` -> `pa.Table.from_pandas` path in the `field=` branch with a small `_arrow_table_from_field` helper that builds the Arrow table directly from the already-parsed Python object. PyArrow's own type inference preserves `float64`, and CPython dict iteration order gives us the #6914 column-insertion-order invariant for free (no pandas roundtrip needed).

The helper handles:

- list of dicts: keys are collected in insertion order, so column order is preserved
- list of scalars: wrapped in a single-column table named after the configured feature, falling back to "text" (mirrors the prior `df.columns == [0]` rename)
- dict of lists (column-major payload): `pa.Table.from_pydict`
- empty list with features supplied: empty columns matching the configured feature names, so `_cast_table` aligns

The other JSON loading path (raw JSON Lines via `paj.read_json`) does not have this bug and is unchanged.

There is an older dormant PR for this issue (#7635 from June 2025, no reviews). I went a different direction because that PR scans the resulting DataFrame and re-casts float-looking int columns back to float after the fact. By that point pandas has already converted the values to Python ints, so the `isinstance(x, float)` check there does not actually detect them, and the scan adds an O(rows) Python pass to a hot path. Sidestepping pandas entirely is simpler and faster.

Tests in `tests/packaged_modules/test_json.py`:

- `test_json_field_path_preserves_float_columns`: the exact #6937 repro, asserts `col` stays float64
- `test_json_field_path_preserves_float_columns_alongside_ints`: list-of-dicts payload with mixed int / float / int-valued-float columns, each type preserved
- `test_json_field_path_preserves_float_columns_with_dict_of_lists`: dict-of-lists field payload
- `test_json_field_path_preserves_column_order_with_list_of_dicts`: column insertion order check (sanity vs the #6914 invariant)

`pytest tests/packaged_modules/test_json.py` shows 33 passed, 0 failed (29 pre-existing + 4 new).
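The argument against a post-hoc re-cast can be illustrated without pandas. The snippet below is a rough sketch under stated assumptions: `int(v)` stands in for the coercion that happens inside the loader, and `looks_float` is a hypothetical scan in the spirit of an `isinstance(x, float)` check, not the code from #7635.

```python
import json

raw = '{"data": [{"col": 0.0}, {"col": 1.0}, {"col": 2.0}]}'
values = [row["col"] for row in json.loads(raw)["data"]]

# json.loads keeps the decimal point: every parsed value is a Python float.
parsed_are_floats = all(isinstance(v, float) for v in values)

# Stand-in for the int64 coercion (assumption: no pandas call is made here).
coerced = [int(v) for v in values]


def looks_float(column):
    """Hypothetical post-hoc scan: does any value still look like a float?"""
    return any(isinstance(x, float) for x in column)


detected_before = looks_float(values)   # scan run on the parsed object
detected_after = looks_float(coerced)   # scan run after coercion
```

Once the values are integers, nothing in the column records that they were written as `0.0`, `1.0`, `2.0` in the source file, so the only reliable place to preserve the type is before the coercion happens.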