Skip to content

Panic in DataFusion 54.0.0 when ordering Parquet scan by computed projection alias #23219

Description

@ProjectYht

Describe the bug

Title

Panic in DataFusion 54.0.0 when ordering Parquet scan by computed projection alias with statistics collection enabled

Description

I hit a panic in DataFusion 54.0.0 when querying a Parquet dataset from S3 and ordering by a computed projection alias.

The query is valid SQL and runs successfully in DuckDB on the same Parquet file. In DataFusion CLI, it panics during
statistics handling.

Environment

  • DataFusion version: 54.0.0
  • Interface: DataFusion CLI and Rust API
  • Source: Parquet files on S3
  • Dataset layout: Hive-style path segment, e.g.

s3:////partition_col=some_value/

The query does not reference partition_col.

The physical Parquet schema contains an id column.

Minimal SQL

CREATE EXTERNAL TABLE profile
STORED AS PARQUET
LOCATION 's3://<redacted-bucket>/<redacted-prefix>/partition_col=some_value/';

SELECT
  (((CAST(id AS BIGINT) % 1024) + 1024) % 1024) AS computed_bucket
FROM profile
ORDER BY computed_bucket, CAST(id AS BIGINT)
LIMIT 10;

Observed panic

thread 'main' panicked at .../datafusion-datasource-54.0.0/src/statistics.rs:100:48:
index out of bounds: the len is 0 but the index is 0

The relevant source location appears to be:

// datafusion-datasource-54.0.0/src/statistics.rs
if i < s.column_statistics.len() {
    ...
} else {
    let partition_value = &pv[i - s.column_statistics.len()];
    ...
}

It looks like this path assumes that when a sort/statistics column index exceeds column_statistics.len(), the column must be
a partition column and therefore indexes into partition_values. In this case partition_values is empty, causing the panic.

Additional observation

The physical id column does have Parquet min/max statistics. I checked with DuckDB metadata:

SELECT
  count(*) AS row_groups,
  count(*) FILTER (WHERE stats_min IS NULL OR stats_max IS NULL) AS missing_minmax,
  count(*) FILTER (WHERE stats_min IS NOT NULL AND stats_max IS NOT NULL) AS has_minmax
FROM parquet_metadata('s3://<redacted-bucket>/<redacted-prefix>/partition_col=some_value/*.parquet')
WHERE path_in_schema = 'id';

Result:

row_groups = 75
missing_minmax = 0
has_minmax = 75

So this does not appear to be caused by missing min/max statistics for id.

DuckDB comparison

The same query shape works in DuckDB:

SELECT
  (((CAST(id AS BIGINT) % 1024) + 1024) % 1024) AS computed_bucket
FROM 's3://<redacted-bucket>/<redacted-prefix>/partition_col=some_value/<redacted-file>.parquet'
ORDER BY computed_bucket, CAST(id AS BIGINT)
LIMIT 10;

Expected behavior

DataFusion should not panic.

If statistics cannot be derived for the computed expression computed_bucket, I would expect DataFusion to fall back to
unknown/absent column statistics and continue executing the query.

To Reproduce

No response

Expected behavior

No response

Additional context

No response

Metadata

Metadata

Assignees

Labels

bugSomething isn't working

Type

No fields configured for Bug.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions