Skip to content

Dynamic filter pushdown fails with "Can't compare arrays of different types" when joining VALUES against Dictionary-encoded Parquet columns #20937

@erratic-pattern

Description

@erratic-pattern

Describe the bug

When a VALUES clause is joined against a Parquet table with Dictionary(Int32, Utf8) columns and pushdown_filters = true, the query fails with:

Parquet error: External: Compute error: Error evaluating filter predicate:
ArrowError(InvalidArgumentError("Can't compare arrays of different types"), Some(""))

The dynamic filter pushdown from the HashJoinExec creates an InListExpr that is pushed into the Parquet scan. ArrayStaticFilter::contains() unwraps the needle (the Dictionary array from the Parquet batch) to its plain Utf8 values via downcast_dictionary_array!, but does not unwrap in_array (the stored InList values, which are also Dictionary from the type-coerced join build side). This results in a make_comparator(Utf8, Dictionary) call, which arrow_ord::ord::build_compare does not support for mixed Dictionary/non-Dictionary types.

To Reproduce

Using the DataFusion CLI (datafusion-cli):

-- Enable row-level filter pushdown (required to trigger the bug)
SET datafusion.execution.parquet.pushdown_filters = true;
SET datafusion.execution.parquet.reorder_filters = true;

-- Create a Parquet file with Dictionary-encoded string columns
COPY (
  SELECT
    arrow_cast(chr(65 + (row_num % 26)), 'Dictionary(Int32, Utf8)') as tag1,
    row_num * 1.0 as value
  FROM (SELECT unnest(range(0, 10000)) as row_num)
) TO '/tmp/dict_filter_bug.parquet';

-- This query fails
SELECT t.tag1, t.value
FROM '/tmp/dict_filter_bug.parquet' t
JOIN (VALUES ('A'), ('B')) AS v(c1)
ON t.tag1 = v.c1;

Expected behavior

The query should return matching rows (tag1 = 'A' or 'B').

Actual behavior

Parquet error: External: Compute error: Error evaluating filter predicate:
ArrowError(InvalidArgumentError("Can't compare arrays of different types"), Some(""))

Root cause

The error path through the code:

  1. The VALUES clause produces Utf8 strings. Type coercion casts them to Dictionary(Int32, Utf8) to match the Parquet column type for the join key.
  2. The HashJoinExec builds the dynamic filter InList from the cast build-side arrays → in_array is Dictionary(Int32, Utf8).
  3. instantiate_static_filter() receives the Dictionary array and falls through to ArrayStaticFilter::try_new(), which stores in_array as-is.
  4. InListExpr::try_new_from_array() is called from create_membership_predicate() in the dynamic filter pushdown code.
  5. At runtime, ArrayStaticFilter::contains(v) is called with a Dictionary array from the Parquet batch.
  6. downcast_dictionary_array! matches and recursively calls self.contains(v.values()) with the unwrapped Utf8 values.
  7. In the recursive call, v is now Utf8 but self.in_array is still Dictionary(Int32, Utf8).
  8. make_comparator(Utf8, Dictionary) fails because arrow_ord::ord::build_compare has no arm for mixed Dictionary/non-Dictionary types.

Possible fix

The most straightforward fix would be to normalize in_array in ArrayStaticFilter::try_new() — if the input is a Dictionary, unwrap it to its values array before storing. This is a one-time cost at filter construction, and ensures that after contains() unwraps the needle Dictionary, both sides are the same plain type and the existing arrow comparison kernels work correctly.

Additional context

  • This only manifests when pushdown_filters = true (defaults to false), because the row-level filter evaluation inside the Parquet reader is what triggers ArrayStaticFilter::contains() with Dictionary arrays. With pushdown_filters = false, the InList filter is only used for statistics/bloom filter pruning (which works correctly), and the actual row filtering happens in the HashJoin.
  • The bug applies to single-column equi-joins where the VALUES side is cast to Dictionary by type coercion and the dynamic filter InList path is taken (small build side).

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't workingregressionSomething that used to work no longer does

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions