
Support reading files with a boolean SBBF written by Parquet Java #23245

Draft
Miha-Cancula-Flarion wants to merge 1 commit into apache:main from Miha-Cancula-Flarion:bloom-filter-boolean-parquet-java
Miha-Cancula-Flarion commented Jun 29, 2026

Which issue does this PR close?

  • Closes #.

Rationale for this change

When Parquet Java (https://github.com/apache/parquet-java/) writes a bloom filter for a boolean column, it never inserts the column's values into the filter, so the filter ends up empty. At https://github.com/apache/parquet-java/blob/52c0a5e8c5ff7680cc299ce5aad60acef3a9054d/parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnValueCollector.java#L75 , every `write()` overload calls `bloomFilter.insertHash`, except `write(boolean)`.

Because the filter is empty, the DataFusion reader incorrectly concludes that such a file contains none of the boolean values being looked up, and skips it while reading.

What changes are included in this PR?

This change makes it so that we always assume that a boolean column has values, essentially ignoring the filter.
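
To illustrate the intended behavior, here is a minimal, self-contained sketch. It does not use the real DataFusion or parquet types: `ToyFilter`, `Literal`, and `may_contain` are hypothetical names made up for this example, and the toy filter stores exact hashes rather than implementing a real split block bloom filter.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Hypothetical stand-in for an SBBF. It stores exact hashes, so unlike a
/// real bloom filter it never reports a false positive.
#[derive(Default)]
struct ToyFilter {
    hashes: HashSet<u64>,
}

/// Hash a value with the standard library's default hasher.
fn hash_of<T: Hash>(value: &T) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}

impl ToyFilter {
    /// Record a value in the filter (the step Parquet Java skips for booleans).
    fn insert<T: Hash>(&mut self, value: &T) {
        self.hashes.insert(hash_of(value));
    }

    /// Return true if the value was inserted earlier.
    fn check<T: Hash>(&self, value: &T) -> bool {
        self.hashes.contains(&hash_of(value))
    }
}

/// Hypothetical, reduced literal type; the real code matches on `ScalarValue`.
enum Literal {
    Boolean(bool),
    Int64(i64),
}

/// Return false only when the filter proves the literal is absent.
fn may_contain(filter: &ToyFilter, literal: &Literal) -> bool {
    match literal {
        // The filter may be empty even though the column has values,
        // so never prune on a boolean literal.
        Literal::Boolean(_) => true,
        Literal::Int64(v) => filter.check(v),
    }
}
```

With an empty `ToyFilter`, `may_contain` still returns `true` for `Literal::Boolean(true)` and `Literal::Boolean(false)`, so a file is never skipped because of a boolean predicate, while other types keep consulting the filter.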

Are these changes tested?

Not yet.

Are there any user-facing changes?

This may affect performance in cases where the SBBF was written correctly, and thus legitimately excludes some data files. With this change, those files will still be scanned.

There are no changes to the API.

github-actions bot added the datasource (Changes to the datasource crate) label Jun 29, 2026
// See https://github.com/apache/parquet-java/blob/52c0a5e8c5ff7680cc299ce5aad60acef3a9054d/parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnValueCollector.java#L75
// In order to correctly read such files with SBBF, we have to skip the check and return `true`
// because values may be present in the data file even if they are not in the filter.
ScalarValue::Boolean(Some(_)) => true,

Contributor

Please make this configurable.
While I think this is worthwhile (especially for Spark/Hadoop Parquet use cases), we definitely want to support boolean values here, probably even by default.
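
One possible shape for such a switch, as a rough sketch only: `BooleanFilterPolicy` and `boolean_may_match` are hypothetical names, not existing DataFusion options, and defaulting to `Trust` is an assumption based on the "probably even by default" suggestion above.

```rust
/// Hypothetical policy for boolean bloom filter lookups.
#[derive(Clone, Copy, Debug, Default, PartialEq, Eq)]
enum BooleanFilterPolicy {
    /// Use the filter's answer for boolean values.
    #[default]
    Trust,
    /// Treat boolean values as possibly present, whatever the filter says.
    Ignore,
}

/// Decide whether a file may contain the boolean literal.
/// `filter_hit` is the answer the bloom filter gave for that literal.
fn boolean_may_match(policy: BooleanFilterPolicy, filter_hit: bool) -> bool {
    match policy {
        BooleanFilterPolicy::Trust => filter_hit,
        BooleanFilterPolicy::Ignore => true,
    }
}
```

Under this sketch, users reading files written by Parquet Java would select `Ignore`, while files with correctly populated filters keep their pruning under `Trust`.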
