Preserve float columns when JSON loader uses `field=` by LeSingh1 · Pull Request #8209 · huggingface/datasets

LeSingh1 · 2026-05-18T21:51:34Z

Closes #6937.

When load_dataset("json", data_files=..., field="data", ...) is used, columns whose values are all integer-valued floats ([0.0, 1.0, 2.0]) get silently coerced to int64. Repro:

import tempfile, json
from datasets import load_dataset

with tempfile.NamedTemporaryFile(mode="w", suffix=".json", delete=False) as f:
    json.dump({"data": [{"col": v} for v in [0.0, 1.0, 2.0]]}, f)

ds = load_dataset("json", data_files=f.name, field="data", split="train")
print(ds.features)  # before: {'col': Value('int64')}
                    # after:  {'col': Value('float64')}

The underlying cause is pd.read_json(..., dtype_backend="pyarrow") (tracked upstream at pandas-dev/pandas#58866). The path that hits it is the one introduced in #6914 to preserve column insertion order. The dataset-viewer statistics regression in the issue is a direct consequence.

This PR replaces the pandas.read_json -> pa.Table.from_pandas path in the field= branch with a small _arrow_table_from_field helper that builds the Arrow table directly from the already-parsed Python object. PyArrow's own type inference (via pa.Table.from_pydict) keeps integer-valued floats as float64, and dict insertion order (CPython 3.7+ dict semantics) gives us the #6914 column-insertion-order invariant for free, with no pandas roundtrip needed.

The helper handles:

- list of dicts (the common case): collect keys in insertion order, build a column-major dict
- list of scalars: wrap in a single-column table named after the configured feature, falling back to "text" (matches the prior df.columns == [0] rename)
- dict of lists (column-major payload): pass through to pa.Table.from_pydict
- empty list with features supplied: emit empty columns matching the configured feature names so downstream _cast_table aligns
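The four cases above can be sketched in plain Python. This is not the PR's `_arrow_table_from_field`; it is a hypothetical stand-in, `columns_from_field`, that stops at the column-major dict one would then hand to `pa.Table.from_pydict`. The function name, the `feature_names` parameter, and the choice to fill missing keys with `None` are assumptions made for illustration.

```python
from __future__ import annotations

import json
from typing import Any


def columns_from_field(obj: Any, feature_names: list[str] | None = None) -> dict[str, list]:
    """Hypothetical sketch: turn a parsed `field=` payload into a column-major dict."""
    # dict of lists: already column-major, so pass it through unchanged.
    if isinstance(obj, dict):
        return dict(obj)
    if not isinstance(obj, list):
        raise TypeError(f"unsupported field payload: {type(obj).__name__}")
    # Empty list: emit empty columns for the configured feature names, if any.
    if not obj:
        return {name: [] for name in (feature_names or [])}
    # list of dicts: collect keys in first-seen (insertion) order.
    if all(isinstance(row, dict) for row in obj):
        keys: dict[str, None] = {}
        for row in obj:
            keys.update(dict.fromkeys(row))
        # Assumption: rows that lack a key contribute None for that column.
        return {key: [row.get(key) for row in obj] for key in keys}
    # list of scalars: one column named after the first feature, else "text".
    name = feature_names[0] if feature_names else "text"
    return {name: list(obj)}


payload = json.loads('{"data": [{"col": 0.0, "n": 1}, {"col": 1.0, "n": 2}]}')
columns = columns_from_field(payload["data"])
scalar_columns = columns_from_field(["a", "b"])
empty_columns = columns_from_field([], feature_names=["x", "y"])
passthrough_columns = columns_from_field({"a": [1.0, 2.0]})
```

Because the values in `columns["col"]` are still Python floats at this point, the type information is intact when the dict reaches Arrow's inference.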

The other JSON loading path (raw JSON Lines via paj.read_json) does not have this bug and is unchanged.

There is an older dormant PR for this issue (#7635 from June 2025, no reviews). I went a different direction because that PR scans the resulting DataFrame and re-casts float-looking int columns back to float after the fact. By that point pandas has already converted the values to Python ints, so the isinstance(x, float) check there does not actually detect them, and the scan adds an O(rows) Python pass to a hot path. Sidestepping pandas entirely is simpler and faster.
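The point about the isinstance check can be illustrated with the standard library alone. In the sketch below, `int(v)` is only a stand-in for the coercion that happens upstream (it does not reproduce pandas behavior); it shows why a float check run after coercion has nothing left to detect, while the same check on the freshly parsed object still works.

```python
import json

parsed = json.loads('{"data": [{"col": 0.0}, {"col": 1.0}, {"col": 2.0}]}')
values = [row["col"] for row in parsed["data"]]

# The stdlib JSON parser keeps 0.0, 1.0, 2.0 as Python floats.
floats_detected_before = all(isinstance(v, float) for v in values)

# Stand-in for the upstream coercion: once the values are ints,
# nothing in the data says they were ever written as floats.
coerced = [int(v) for v in values]
floats_detected_after = any(isinstance(v, float) for v in coerced)
```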

Tests in tests/packaged_modules/test_json.py:

- test_json_field_path_preserves_float_columns: the exact #6937 repro; asserts col stays float64
- test_json_field_path_preserves_float_columns_alongside_ints: list-of-dicts payload with mixed int / float / int-valued-float columns; each type is preserved
- test_json_field_path_preserves_float_columns_with_dict_of_lists: dict-of-lists field payload
- test_json_field_path_preserves_column_order_with_list_of_dicts: column insertion order check (sanity check against the #6914 invariant)

pytest tests/packaged_modules/test_json.py shows 33 passed, 0 failed (29 pre-existing + 4 new).

Closes huggingface#6937.

When the JSON loader is invoked with `field=...` (which routes to `pd.read_json` so that column insertion order is preserved, see huggingface#6914), columns whose values are all integer-valued floats such as [0.0, 1.0, 2.0] get coerced to int64. This is the underlying behavior of `pd.read_json(..., dtype_backend="pyarrow")` and is tracked upstream at pandas-dev/pandas#58866; the dataset-viewer statistics endpoint was failing as a direct consequence (see linked CI log on the issue).

Replace the `pd.read_json` -> `pa.Table.from_pandas` path with a small helper, `_arrow_table_from_field`, that constructs the Arrow table directly from the already-parsed Python object:

- list of dicts: keys are collected in insertion order (CPython 3.7+ dict semantics) so column order is preserved; `pa.Table.from_pydict` then performs PyArrow's own type inference, which keeps integer-valued floats as float64.
- list of scalars: wrap in a single-column table named after the configured feature or fall back to "text" (mirrors the prior `df.columns.tolist() == [0]` rename).
- dict of lists (column-major payload): `pa.Table.from_pydict`.
- empty list with features supplied: emit empty columns matching the configured feature names so downstream `_cast_table` aligns.

Pandas is no longer involved on this path. The other code path that parses raw JSON Lines via `paj.read_json` is unchanged and was not affected by this bug.

Adds four regression tests in tests/packaged_modules/test_json.py:

- field= path: integer-valued floats stay float64 (the exact issue repro)
- list-of-dicts: mixed int / float / int-valued-float columns preserve their inferred types
- dict-of-lists field payload: float column preserved
- column insertion order preserved (sanity check vs the original huggingface#6914 regression huggingface#6913 covered)

All 33 tests in tests/packaged_modules/test_json.py pass (29 pre-existing plus the 4 new).

LeSingh1 force-pushed the fix-json-loader-float-coercion branch from d1da782 to 64aa3ae · May 18, 2026 23:10

LeSingh1 wants to merge 1 commit into huggingface:main from LeSingh1:fix-json-loader-float-coercion
