[SPARK-57055][SQL][DOCS] Document non-binary collation gap in DataFrameStatFunctions.bloomFilter#56114

Open
yaooqinn wants to merge 1 commit into apache:master from yaooqinn:spark-bloomfilter-collation-docs

@yaooqinn
Member

What changes were proposed in this pull request?

Add an @note to the Scaladoc of all four DataFrameStatFunctions.bloomFilter
overloads, plus a Migration Guide entry, stating that bloom filters built
over string columns compare raw UTF-8 bytes and are therefore collation-blind
under non-binary collations (UTF8_LCASE, ICU). For collation-consistent
membership tests, the docs recommend either using a UTF8_BINARY-collated
column or normalizing values manually on both the build and probe sides
(e.g. lower(col) for ASCII data under UTF8_LCASE).

Why are the changes needed?

Since Spark 4.0, columns may carry non-binary collations. The bloom
filter path does not apply collation normalization to values, so
mightContain can silently disagree with the column's collation-aware
equality, and nothing in the current docs warns about it.
See SPARK-57055.
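
To make the byte-equality point concrete, here is a minimal plain-Scala sketch that needs no Spark. It assumes, as the description states, that the filter sees only raw UTF-8 bytes. The names `ByteEqualityDemo`, `utf8`, and `normalize` are hypothetical helpers for illustration, and `normalize` only approximates UTF8_LCASE for ASCII data.

```scala
import java.nio.charset.StandardCharsets.UTF_8
import java.util.Locale

object ByteEqualityDemo {
  // The raw UTF-8 bytes of a string, i.e. what a byte-keyed filter would hash.
  def utf8(s: String): Seq[Byte] = s.getBytes(UTF_8).toSeq

  // Hypothetical normalizer: approximates UTF8_LCASE for ASCII-only data.
  def normalize(s: String): String = s.toLowerCase(Locale.ROOT)

  def main(args: Array[String]): Unit = {
    // Equal under a case-insensitive collation, but different bytes.
    println(utf8("Alice") == utf8("alice")) // false
    // After normalizing both sides, the bytes agree.
    println(utf8(normalize("Alice")) == utf8(normalize("alice"))) // true
  }
}
```

Two values that a case-insensitive collation treats as equal still hash to different bit positions unless both are normalized first.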

Does this PR introduce any user-facing change?

Documentation only.

How was this patch tested?

Scaladoc renders cleanly via build/sbt "sql-api/doc" (49s, no new
warnings).

Behavior reproduced on master via spark-shell:

// Sample 1: UTF8_BINARY (default) — collation-consistent
val df1 = Seq("Alice", "Bob", "Carol").toDF("name")
val bf1 = df1.stat.bloomFilter("name", 100L, 0.01)
bf1.mightContain("Alice")  // true
bf1.mightContain("alice")  // false

// Sample 2: UTF8_LCASE raw — silent gap documented by this PR
val df2 = spark.sql(
  "SELECT 'Alice' COLLATE UTF8_LCASE AS name UNION ALL " +
  "SELECT 'Bob' COLLATE UTF8_LCASE UNION ALL " +
  "SELECT 'Carol' COLLATE UTF8_LCASE")
val bf2 = df2.stat.bloomFilter("name", 100L, 0.01)
bf2.mightContain("Alice")  // true
bf2.mightContain("alice")  // false  <-- silent gap

// Sample 3: UTF8_LCASE + lower() ASCII work-around recommended by this PR
import org.apache.spark.sql.functions.lower
val bf3 = df2.stat.bloomFilter(lower(df2("name")), 100L, 0.01)
bf3.mightContain("alice")  // true   (user must wrap probe side too)
bf3.mightContain("Alice")  // false
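
Sample 3 only gives consistent answers if the probe value is normalized the same way as the build side. One way to avoid forgetting that is to wrap the filter so that both paths share one normalizer. The sketch below is an assumption-laden illustration: `ToyBloomFilter` is a toy stand-in written for this example, not Spark's `BloomFilter`, and `NormalizedBloomFilter` and `NormalizedProbeDemo` are hypothetical names.

```scala
import java.nio.charset.StandardCharsets.UTF_8
import java.util.{BitSet, Locale}
import scala.util.hashing.MurmurHash3

// Toy bloom filter keyed on raw UTF-8 bytes (illustration only, not Spark's implementation).
final class ToyBloomFilter(numBits: Int, numHashes: Int) {
  private val bits = new BitSet(numBits)

  // One bit position per hash seed, derived from the string's UTF-8 bytes.
  private def positions(s: String): Seq[Int] = {
    val bytes = s.getBytes(UTF_8)
    (0 until numHashes).map { seed =>
      java.lang.Math.floorMod(MurmurHash3.bytesHash(bytes, seed), numBits)
    }
  }

  def put(s: String): Unit = positions(s).foreach(i => bits.set(i))
  def mightContain(s: String): Boolean = positions(s).forall(i => bits.get(i))
}

// Hypothetical wrapper: applies the same normalizer on the build and probe sides.
final class NormalizedBloomFilter(underlying: ToyBloomFilter, normalize: String => String) {
  def put(s: String): Unit = underlying.put(normalize(s))
  def mightContain(s: String): Boolean = underlying.mightContain(normalize(s))
}

object NormalizedProbeDemo {
  // Lowercasing approximates UTF8_LCASE for ASCII-only data.
  def build(): NormalizedBloomFilter = {
    val bf = new NormalizedBloomFilter(new ToyBloomFilter(1024, 3), _.toLowerCase(Locale.ROOT))
    Seq("Alice", "Bob", "Carol").foreach(bf.put)
    bf
  }

  def main(args: Array[String]): Unit = {
    val bf = build()
    println(bf.mightContain("alice")) // true
    println(bf.mightContain("ALICE")) // true
  }
}
```

With Spark itself, the equivalent would be to build with `lower(col)` as in Sample 3 and lowercase each probe string before calling `mightContain`. This covers only ASCII data under UTF8_LCASE, not ICU collations.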

No new unit tests — this is a documentation-only change with no
behavior to assert.

Scope note

PySpark / R / Connect docstring synchronization is intentionally
deferred to a follow-up PR to keep this change docs-only and minimal.

@yaooqinn yaooqinn force-pushed the spark-bloomfilter-collation-docs branch from d65d451 to 8d22bf3 Compare May 26, 2026 11:02