Describe the bug
When a VALUES clause is joined against a Parquet table with `Dictionary(Int32, Utf8)` columns and `pushdown_filters = true`, the query fails with:

```
Parquet error: External: Compute error: Error evaluating filter predicate:
ArrowError(InvalidArgumentError("Can't compare arrays of different types"), Some(""))
```

The dynamic filter pushdown from the `HashJoinExec` creates an `InListExpr` that is pushed into the Parquet scan. `ArrayStaticFilter::contains()` unwraps the needle (the Dictionary array from the Parquet batch) to its plain `Utf8` values via `downcast_dictionary_array!`, but does not unwrap `in_array` (the stored InList values, which are also Dictionary from the type-coerced join build side). This results in a `make_comparator(Utf8, Dictionary)` call, which `arrow_ord::ord::build_compare` does not support for mixed Dictionary/non-Dictionary types.
To Reproduce
Using the DataFusion CLI (`datafusion-cli`):
```sql
-- Enable row-level filter pushdown (required to trigger the bug)
SET datafusion.execution.parquet.pushdown_filters = true;
SET datafusion.execution.parquet.reorder_filters = true;

-- Create a Parquet file with Dictionary-encoded string columns
COPY (
  SELECT
    arrow_cast(chr(65 + (row_num % 26)), 'Dictionary(Int32, Utf8)') as tag1,
    row_num * 1.0 as value
  FROM (SELECT unnest(range(0, 10000)) as row_num)
) TO '/tmp/dict_filter_bug.parquet';

-- This query fails
SELECT t.tag1, t.value
FROM '/tmp/dict_filter_bug.parquet' t
JOIN (VALUES ('A'), ('B')) AS v(c1)
ON t.tag1 = v.c1;
```

Expected behavior
The query should return matching rows (`tag1 = 'A'` or `'B'`).
Actual behavior
```
Parquet error: External: Compute error: Error evaluating filter predicate:
ArrowError(InvalidArgumentError("Can't compare arrays of different types"), Some(""))
```
Root cause
The error path through the code:
1. The VALUES clause produces `Utf8` strings. Type coercion casts them to `Dictionary(Int32, Utf8)` to match the Parquet column type for the join key.
2. The `HashJoinExec` builds the dynamic filter InList from the cast build-side arrays, so `in_array` is `Dictionary(Int32, Utf8)`. `InListExpr::try_new_from_array()` is called from `create_membership_predicate()` in the dynamic filter pushdown code; `instantiate_static_filter()` receives the Dictionary array and falls through to `ArrayStaticFilter::try_new()`, which stores `in_array` as-is.
3. At runtime, `ArrayStaticFilter::contains(v)` is called with a Dictionary array from the Parquet batch. `downcast_dictionary_array!` matches and recursively calls `self.contains(v.values())` with the unwrapped `Utf8` values.
4. In the recursive call, `v` is now `Utf8` but `self.in_array` is still `Dictionary(Int32, Utf8)`. `make_comparator(Utf8, Dictionary)` fails because `arrow_ord::ord::build_compare` has no arm for mixed Dictionary/non-Dictionary types.
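The asymmetric unwrap described above can be illustrated with a minimal, self-contained sketch. The `Array` enum and methods here are simplified stand-ins for the arrow-rs/DataFusion types (strings only, no nulls), not the real API:

```rust
// Simplified model: a plain string array, or a dictionary array
// (distinct values plus integer keys indexing into them).
#[derive(Debug, Clone, PartialEq)]
enum Array {
    Utf8(Vec<String>),
    Dict { values: Vec<String>, keys: Vec<i32> },
}

struct ArrayStaticFilter {
    // Stored as-is at construction -- this mirrors the buggy behavior.
    in_array: Array,
}

impl ArrayStaticFilter {
    fn contains(&self, needle: &Array) -> Result<Vec<bool>, String> {
        match needle {
            // Like downcast_dictionary_array!: the needle is unwrapped to its
            // plain values and we recurse -- but in_array is never unwrapped.
            Array::Dict { values, .. } => self.contains(&Array::Utf8(values.clone())),
            Array::Utf8(vals) => match &self.in_array {
                Array::Utf8(list) => Ok(vals.iter().map(|v| list.contains(v)).collect()),
                // Mirrors make_comparator(Utf8, Dictionary) having no arm
                // for mixed Dictionary/non-Dictionary comparisons.
                Array::Dict { .. } => {
                    Err("Can't compare arrays of different types".to_string())
                }
            },
        }
    }
}

fn main() {
    // Build side was cast to Dictionary(Int32, Utf8) by type coercion.
    let filter = ArrayStaticFilter {
        in_array: Array::Dict { values: vec!["A".into(), "B".into()], keys: vec![0, 1] },
    };
    // The Parquet batch column is also Dictionary-encoded.
    let batch = Array::Dict { values: vec!["A".into(), "C".into()], keys: vec![0, 1] };
    // The needle gets unwrapped to Utf8 while in_array stays Dict -> error.
    assert!(filter.contains(&batch).is_err());
}
```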
Possible fix
The most straightforward fix would be to normalize `in_array` in `ArrayStaticFilter::try_new()` — if the input is a Dictionary, unwrap it to its values array before storing. This is a one-time cost at filter construction, and it ensures that after `contains()` unwraps the needle Dictionary, both sides are the same plain type and the existing arrow comparison kernels work correctly.
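In the same simplified model (the `Array` enum and methods below are illustrative stand-ins, not the real arrow-rs/DataFusion API), the proposed normalization could look like this:

```rust
#[derive(Debug, Clone, PartialEq)]
enum Array {
    Utf8(Vec<String>),
    Dict { values: Vec<String>, keys: Vec<i32> },
}

struct ArrayStaticFilter {
    in_array: Array,
}

impl ArrayStaticFilter {
    // The fix: unwrap a Dictionary in_array to its plain values once, at
    // construction, so in_array is always a non-Dictionary array.
    fn try_new(in_array: Array) -> Self {
        let in_array = match in_array {
            Array::Dict { values, .. } => Array::Utf8(values),
            other => other,
        };
        ArrayStaticFilter { in_array }
    }

    fn contains(&self, needle: &Array) -> Vec<bool> {
        match needle {
            // Unwrap the needle as before ...
            Array::Dict { values, .. } => self.contains(&Array::Utf8(values.clone())),
            Array::Utf8(vals) => {
                // ... and now both sides are plain Utf8, so the comparison
                // always succeeds.
                let Array::Utf8(list) = &self.in_array else { unreachable!() };
                vals.iter().map(|v| list.contains(v)).collect()
            }
        }
    }
}

fn main() {
    let filter = ArrayStaticFilter::try_new(Array::Dict {
        values: vec!["A".into(), "B".into()],
        keys: vec![0, 1],
    });
    let batch = Array::Dict { values: vec!["A".into(), "C".into()], keys: vec![0, 1] };
    // 'A' is in the list, 'C' is not.
    assert_eq!(filter.contains(&batch), vec![true, false]);
}
```

The normalization costs one values-array copy per filter construction, but keeps the per-batch `contains()` path on the fast same-type comparison kernels.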
Additional context
- This only manifests when `pushdown_filters = true` (defaults to `false`), because the row-level filter evaluation inside the Parquet reader is what triggers `ArrayStaticFilter::contains()` with Dictionary arrays. With `pushdown_filters = false`, the InList filter is only used for statistics/bloom filter pruning (which works correctly), and the actual row filtering happens in the HashJoin.
- The bug applies to single-column equi-joins where the VALUES side is cast to Dictionary by type coercion and the dynamic filter InList path is taken (small build side).