[SPARK-57055][SQL][DOCS] Document non-binary collation gap in DataFrameStatFunctions.bloomFilter by yaooqinn · Pull Request #56114 · apache/spark

yaooqinn · 2026-05-26T08:09:17Z

What changes were proposed in this pull request?

Add @note Scaladoc to all four DataFrameStatFunctions.bloomFilter
overloads and a Migration Guide entry stating that bloom filters built
over string columns use raw UTF-8 byte equality and are collation-blind
for non-binary collations (UTF8_LCASE, ICU). For collation-consistent
membership, the docs recommend using a UTF8_BINARY-collated column, or
normalizing values manually (e.g. lower(col) for ASCII data under
UTF8_LCASE).

Why are the changes needed?

Since Spark 4.0, columns may carry non-binary collations. The bloom
filter path does not collation-normalize values, so two strings that
compare equal under the column's collation can still produce different
mightContain results. Users get these silently inconsistent results
without any documented warning. See SPARK-57055.
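
To make the gap concrete, here is a minimal sketch in plain Scala (no Spark required). It only illustrates why byte-level hashing and case-insensitive equality disagree. The `lcaseEqual` helper is a hypothetical, ASCII-only stand-in for UTF8_LCASE equality and is not Spark's collation implementation.

```scala
import java.nio.charset.StandardCharsets.UTF_8
import java.util.Locale

object CollationGapSketch {
  // Raw UTF-8 bytes of a string: what a byte-oriented hash gets to see.
  def utf8(s: String): Seq[Byte] = s.getBytes(UTF_8).toSeq

  // Hypothetical stand-in for UTF8_LCASE equality (ASCII-only approximation).
  def lcaseEqual(a: String, b: String): Boolean =
    a.toLowerCase(Locale.ROOT) == b.toLowerCase(Locale.ROOT)

  def main(args: Array[String]): Unit = {
    val stored = "Alice"
    val probe = "alice"

    // Equal under the case-insensitive comparison ...
    println(lcaseEqual(stored, probe)) // true

    // ... but the byte sequences differ, so a byte hash treats them as distinct.
    println(utf8(stored) == utf8(probe)) // false

    // Normalizing both sides first makes the bytes agree again.
    println(utf8(stored.toLowerCase(Locale.ROOT)) == utf8(probe.toLowerCase(Locale.ROOT))) // true
  }
}
```

Any structure keyed on raw bytes inherits this mismatch unless both the stored values and the probe are normalized the same way.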

Does this PR introduce any user-facing change?

Documentation only.

How was this patch tested?

Scaladoc renders cleanly via build/sbt "sql-api/doc" (49s, no new
warnings).

Behavior reproduced on master via spark-shell:

// Sample 1: UTF8_BINARY (default) — collation-consistent
val df1 = Seq("Alice", "Bob", "Carol").toDF("name")
val bf1 = df1.stat.bloomFilter("name", 100L, 0.01)
bf1.mightContain("Alice")  // true
bf1.mightContain("alice")  // false

// Sample 2: UTF8_LCASE raw — silent gap documented by this PR
val df2 = spark.sql(
  "SELECT 'Alice' COLLATE UTF8_LCASE AS name UNION ALL " +
  "SELECT 'Bob' COLLATE UTF8_LCASE UNION ALL " +
  "SELECT 'Carol' COLLATE UTF8_LCASE")
val bf2 = df2.stat.bloomFilter("name", 100L, 0.01)
bf2.mightContain("Alice")  // true
bf2.mightContain("alice")  // false  <-- silent gap

// Sample 3: UTF8_LCASE + lower() ASCII workaround recommended by this PR
import org.apache.spark.sql.functions.lower
val bf3 = df2.stat.bloomFilter(lower(df2("name")), 100L, 0.01)
bf3.mightContain("alice")  // true   (the probe value must be lowercased too)
bf3.mightContain("Alice")  // false  (un-normalized probe misses)
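
Because the workaround in Sample 3 only holds if the build side and the probe side are normalized identically, one option is to keep both steps behind a single helper. The following is a sketch under stated assumptions, not part of this PR or of Spark: `LcaseBloomFilter` and its methods are hypothetical names, and it assumes ASCII data, where `String.toLowerCase(Locale.ROOT)` and Spark's `lower` agree.

```scala
import java.util.Locale

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lower}
import org.apache.spark.util.sketch.BloomFilter

// Hypothetical helper (not a Spark API): keeps build-side and probe-side
// normalization in one place so they cannot drift apart.
object LcaseBloomFilter {
  // Probe-side normalization; assumed to match lower() for ASCII input only.
  def normalize(value: String): String = value.toLowerCase(Locale.ROOT)

  // Build the filter over lower(column) instead of the raw column.
  def build(df: DataFrame, column: String, expectedNumItems: Long, fpp: Double): BloomFilter =
    df.stat.bloomFilter(lower(col(column)), expectedNumItems, fpp)

  // Lowercase the probe before testing membership.
  def mightContain(filter: BloomFilter, probe: String): Boolean =
    filter.mightContain(normalize(probe))
}
```

In spark-shell this could be used as `val bf = LcaseBloomFilter.build(df2, "name", 100L, 0.01)` followed by `LcaseBloomFilter.mightContain(bf, "ALICE")`. For non-ASCII data or ICU collations this lowercase normalization is not guaranteed to match the collation's equality.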

No new unit tests — this is a documentation-only change with no
behavior to assert.

Scope note

PySpark / R / Connect docstring synchronization is intentionally
deferred to a follow-up PR to keep this change docs-only and minimal.


yaooqinn force-pushed the spark-bloomfilter-collation-docs branch from d65d451 to 8d22bf3 on May 26, 2026 11:02


yaooqinn wants to merge 1 commit into apache:master from yaooqinn:spark-bloomfilter-collation-docs
