Skip to content

String scalar functions should preserve Dictionary encoding for Dictionary-typed inputs #20935

@erratic-pattern

Description

@erratic-pattern

Is your feature request related to a problem or challenge?

When string scalar functions (e.g. regexp_replace, concat, lower, upper, replace, substr, etc.) receive a Dictionary(K, Utf8) input, the current implementation unwraps the Dictionary to plain Utf8:

  1. The type coercion layer in datafusion/expr/src/type_coercion/functions.rs unwraps Dictionary(K, V) to V before the function sees it
  2. Each function's return_type also maps Dictionary(_, Utf8)Utf8
  3. The output is a plain Utf8 array

This causes two problems:

Arrow IPC/Flight message size inflation: Dictionary-encoded columns use a fixed size per row based on the key type (e.g. 4 bytes for Int32) with unique values stored once. Plain Utf8 stores the full string for every row. For columns with repeated string values, this can significantly inflate individual Arrow IPC messages, potentially exceeding gRPC message size limits (4MB default in tonic).

Performance: String operations are applied to every row instead of just the unique dictionary values.

Describe the solution you'd like

Make string scalar functions handle Dictionary-typed inputs natively, preserving the encoding:

  1. return_type: Return Dictionary(K, V') when input is Dictionary(K, V), where V' is the appropriate result value type.
  2. invoke/invoke_with_args: When the input is a DictionaryArray, apply the string operation only to the dictionary's unique values array, then construct a new DictionaryArray with the transformed values and the original keys (index) array.

Ideally this would be a reusable pattern (perhaps a helper or wrapper) that individual string functions can opt into, rather than duplicating the logic in every function.

Describe alternatives you've considered

  • Workaround: Users can wrap results in arrow_cast(some_string_fn(...), 'Dictionary(Int32, Utf8)') to re-dictionary-encode the output. However, this is non-obvious and still applies the operation to every row rather than just the unique values.
  • Coerce Dictionary to Utf8View instead of Utf8: The type coercion layer could coerce Dictionary(K, Utf8) to Utf8View rather than Utf8. Since string functions already handle Utf8View inputs and return Utf8View outputs, this would work without per-function changes. Utf8View is more compact than Utf8 for repeated values in IPC serialization, though not as compact as Dictionary encoding.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions