fix: translate SQL wildcards in SIMILAR TO patterns (#22263)#23188

Open
oc7o wants to merge 5 commits into apache:main from oc7o:bugfix/similar-to-wildcard
Conversation

@oc7o

@oc7o oc7o commented Jun 25, 2026

SIMILAR TO previously passed the pattern straight to Arrow's regex engine, so SQL wildcards were never translated and matches were unanchored:

SELECT 'abc' SIMILAR TO 'a%';  -- returned false
SELECT 'x'   SIMILAR TO '_';   -- returned false

Translate % to (?s:.*) and _ to (?s:.) (dot-all so they match newlines), then wrap the pattern in ^(?:...)$ so the regex matches the entire string. ., ^, $, and \ are escaped as SQL literals. The POSIX metacharacters that SIMILAR TO defines (| * + ? ( ) { } [ ]) pass through to the regex unchanged.

The translation only fires for literal Utf8, LargeUtf8, and Utf8View patterns. Non-literal patterns return a not_impl_err!. IMHO, silently wrong results are worse than an honest error, and this mirrors how DataFusion already handles the unsupported ESCAPE clause. NULL patterns pass through unchanged.

Which issue does this PR close?

Rationale for this change

SIMILAR TO is a SQL standard operator with well-defined wildcard semantics (% = any sequence, _ = single character, full-string match). DataFusion's current behavior silently produces wrong results for the most basic patterns, which is a correctness bug for anyone porting queries from Postgres or other SQL engines.

What changes are included in this PR?

  • New sql_similar_to_regex helper in datafusion/physical-expr/src/expressions/binary.rs that translates %/_ and anchors the pattern with ^(?:...)$.
  • similar_to() now translates the pattern for literal Utf8 / LargeUtf8 / Utf8View values, passes NULL through unchanged, and returns not_impl_err! for non-literal patterns.
  • sql_similar_to_regex tracks bracket state so ^ inside [...] is treated as bracket negation, not as a literal.
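
For readers who want to see the shape of the translation, here is a minimal sketch of the rules listed above. It is not the code from this PR: the function name `sql_similar_to_regex_sketch` is hypothetical, it uses only the standard library, and its handling of characters inside `[...]` (everything passes through except `\`, which is escaped) is an assumption rather than the PR's confirmed behavior.

```rust
/// Hypothetical sketch: translate a SQL SIMILAR TO pattern into an anchored regex.
/// Assumption: inside a bracket expression every character is copied as-is,
/// except a backslash, which is escaped so it stays a literal.
fn sql_similar_to_regex_sketch(pattern: &str) -> String {
    let mut out = String::with_capacity(pattern.len() + 8);
    out.push_str("^(?:");
    let mut in_bracket = false;
    for ch in pattern.chars() {
        if in_bracket {
            match ch {
                // End of the bracket expression.
                ']' => {
                    in_bracket = false;
                    out.push(']');
                }
                // Keep a backslash literal inside the class.
                '\\' => out.push_str("\\\\"),
                // Includes '^', which keeps its negation meaning after '['.
                c => out.push(c),
            }
            continue;
        }
        match ch {
            // SQL wildcards, dot-all so they also match newlines.
            '%' => out.push_str("(?s:.*)"),
            '_' => out.push_str("(?s:.)"),
            // Regex-only metacharacters are SQL literals: escape them.
            '.' | '^' | '$' | '\\' => {
                out.push('\\');
                out.push(ch);
            }
            '[' => {
                in_bracket = true;
                out.push('[');
            }
            // POSIX metacharacters (| * + ? ( ) { }) and plain characters pass through.
            c => out.push(c),
        }
    }
    out.push_str(")$");
    out
}
```

Under these assumptions, `'a%'` becomes `^(?:a(?s:.*))$` and `'p[12].'` becomes `^(?:p[12]\.)$`, which is why the old `strings.slt` cases that relied on `.` matching any character no longer hold.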

Are these changes tested?

Yes:

  • test_similar_to_sql_literal_metachars confirms that ., ^, $, and \ are treated as SQL literals, not as regex operators.
  • test_similar_to_posix_metachars confirms that |, *, +, ?, (, ), {, }, [, ], [^...], and [a-z] behave as SIMILAR TO metacharacters.
  • test_similar_to_wildcards_match_newlines confirms that % and _ match newlines.
  • test_similar_to covers basic %/_ semantics, full-string anchoring, and case sensitivity.
  • test_similar_to_null_pattern and test_similar_to_non_literal_pattern_errors continue to cover the NULL pattern and non-literal-pattern error paths.
  • End-to-end coverage in datafusion/sqllogictest/test_files/strings.slt was updated: the pre-existing SIMILAR TO 'p[12].' / NOT SIMILAR TO 'p[12].' cases (which only worked because of the bug) were removed, and new cases cover literal metachars, POSIX metachars, and newline matching.

Are there any user-facing changes?

Yes:
SIMILAR TO now produces correct results for queries that previously returned wrong answers. Queries that relied on the buggy behavior (e.g., treating ., ^, or $ as regex metacharacters) will now follow standard SQL SIMILAR TO semantics. Queries using valid SIMILAR TO POSIX metacharacters (| * + ? ( ) { } [ ]) work as expected, and the % and _ wildcards now work as well.

`SIMILAR TO` previously passed the pattern straight to Arrow's regex
engine, so SQL wildcards were never translated and matches were
unanchored:

    SELECT 'abc' SIMILAR TO 'a%';  -- returned false
    SELECT 'x'   SIMILAR TO '_';   -- returned false

Translate `%` to `.*` and `_` to `.`, then wrap the pattern in
`^(?:...)$` so the regex matches the entire string. Other regex
metacharacters (`|`, `(`, `)`, `*`, `+`, `?`) pass through unchanged,
matching `SIMILAR TO`'s superset-of-regex semantics.

The translation only fires for literal `Utf8`, `LargeUtf8`, and
`Utf8View` patterns. Non-literal patterns return a `not_impl_err!` —
silently wrong results are worse than an honest error, and this mirrors
how DataFusion already handles the unsupported `ESCAPE` clause. NULL
patterns pass through unchanged.

Existing tests in `binary.rs` were relying on the bug by passing raw
regex strings as `SIMILAR TO` patterns; they have been rewritten to use
SQL wildcard syntax, and new cases cover `%`, `_`, full-string
anchoring, and regex-metacharacter passthrough. End-to-end coverage
added in `strings.slt`.
@github-actions github-actions Bot added physical-expr Changes to the physical-expr crates sqllogictest SQL Logic Tests (.slt) labels Jun 25, 2026
@oc7o

oc7o commented Jun 25, 2026

Author

@huaxingao @viirya @wesm Could one of you trigger CI for me please? Thanks!

@viirya

viirya commented Jun 25, 2026

Member

@oc7o Triggered. CI is running now.

@kosiew kosiew left a comment

Contributor

@oc7o
Thanks for working on this. The direction looks good, but I think there are still a couple of correctness issues in the SIMILAR TO translation that should be addressed before this can be merged. I also have one small test coverage suggestion.

match ch {
'%' => result.push_str(".*"),
'_' => result.push('.'),
c => result.push(c),

Contributor

Thanks for tackling this. I think there is still one correctness issue here.

SIMILAR TO currently copies every non-% and non-_ character directly into the Arrow regex. That means regex metacharacters like ., ^, and $ are still treated as regex operators even though they are literals in SQL SIMILAR TO patterns.

For example, the SQL pattern a. should only match the literal string a., but the current translation produces ^(?:a.)$, so SELECT 'ab' SIMILAR TO 'a.' incorrectly returns true.

Could we translate the SQL pattern grammar explicitly instead? SQL literals should be escaped for the regex, and only the metacharacters that SIMILAR TO actually defines should be emitted as regex syntax.
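
To make the suggestion concrete, here is a rough sketch that contrasts the pass-through translation quoted above with an explicit classification of each pattern character. The names `naive_translate`, `Token`, `classify`, `emit`, and `translate` are hypothetical and only illustrate the idea. Bracket expressions are deliberately not handled here, so `^` inside `[...]` would be escaped by this sketch.

```rust
/// Pass-through translation, mirroring the quoted hunk: only % and _ are rewritten.
fn naive_translate(pattern: &str) -> String {
    let mut result = String::from("^(?:");
    for ch in pattern.chars() {
        match ch {
            '%' => result.push_str(".*"),
            '_' => result.push('.'),
            c => result.push(c),
        }
    }
    result.push_str(")$");
    result
}

/// Hypothetical classification of one SIMILAR TO pattern character.
#[derive(Debug, PartialEq)]
enum Token {
    AnySequence,   // %
    AnyChar,       // _
    Meta(char),    // metacharacters that SIMILAR TO itself defines
    Literal(char), // everything else, including . ^ $ \
}

fn classify(ch: char) -> Token {
    match ch {
        '%' => Token::AnySequence,
        '_' => Token::AnyChar,
        '|' | '*' | '+' | '?' | '(' | ')' | '{' | '}' | '[' | ']' => Token::Meta(ch),
        c => Token::Literal(c),
    }
}

fn emit(token: &Token, out: &mut String) {
    match token {
        Token::AnySequence => out.push_str(".*"),
        Token::AnyChar => out.push('.'),
        Token::Meta(c) => out.push(*c),
        Token::Literal(c) => {
            // Escape characters that are regex operators but SQL literals.
            if matches!(*c, '.' | '^' | '$' | '\\') {
                out.push('\\');
            }
            out.push(*c);
        }
    }
}

/// Grammar-driven translation: classify each character, then emit regex syntax.
fn translate(pattern: &str) -> String {
    let mut out = String::from("^(?:");
    for ch in pattern.chars() {
        emit(&classify(ch), &mut out);
    }
    out.push_str(")$");
    out
}
```

With this split, `naive_translate("a.")` yields `^(?:a.)$`, which also matches `ab`, while `translate("a.")` yields `^(?:a\.)$`, which matches only the literal string `a.`.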

@oc7o oc7o Jun 30, 2026

Author

True. I thought it would be reasonable to also pass regex syntax through, but we should stick with vanilla SQL syntax, since that is what a user would expect. The translator now escapes all non-wildcard characters.

Author

Actually, I looked it up, and Postgres at least passes some regex metacharacters through, but not all: https://www.postgresql.org/docs/current/functions-matching.html#FUNCTIONS-SIMILARTOREGEXP

I'll adapt the translator to match.

@oc7o oc7o Jun 30, 2026

Author

| * + ? ( ) { } [ ] are POSIX metacharacters that SIMILAR TO also defines. I've reworked the translator to escape only . ^ $ \ (the regex-only metachars) and pass the POSIX ones through to the regex. Updated tests cover both directions.
I think it is now compliant with standard SQL. What do you say, @kosiew?

Comment thread datafusion/physical-expr/src/expressions/binary.rs Outdated
Comment thread datafusion/sqllogictest/test_files/strings.slt
@oc7o

oc7o commented Jun 30, 2026

Author

@kosiew I gotta thank you for the effort you put into this. You really had an eye for detail, and I learned a lot! 😊

I think that, as it stands now, we're a bit closer to the standard SQL behavior. Once this PR is ready 🤞, escaping could be a good follow-up topic for me, since we currently still treat \ as a literal. With escape support, 'a%' SIMILAR TO 'a\%' would return true; currently it returns false.
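
As a rough illustration of what that follow-up could look like (not part of this PR), the sketch below treats an escape character as "take the next character literally". The names `translate_with_escape` and `push_literal` are hypothetical, bracket expressions are ignored, and returning an error for a trailing escape character is an assumption about the desired behavior.

```rust
/// Push `c` as a regex literal, escaping it if it is a regex metacharacter.
fn push_literal(c: char, out: &mut String) {
    if matches!(
        c,
        '.' | '^' | '$' | '\\' | '|' | '*' | '+' | '?' | '(' | ')' | '{' | '}' | '[' | ']'
    ) {
        out.push('\\');
    }
    out.push(c);
}

/// Hypothetical escape-aware translation: the character after `escape`
/// is always emitted as a literal, so an escaped % or _ loses its wildcard meaning.
fn translate_with_escape(pattern: &str, escape: char) -> Result<String, String> {
    let mut out = String::from("^(?:");
    let mut chars = pattern.chars();
    while let Some(ch) = chars.next() {
        if ch == escape {
            match chars.next() {
                Some(next) => push_literal(next, &mut out),
                // Assumption: a dangling escape character is reported as an error.
                None => return Err("pattern ends with escape character".to_string()),
            }
            continue;
        }
        match ch {
            '%' => out.push_str("(?s:.*)"),
            '_' => out.push_str("(?s:.)"),
            // Regex-only metacharacters stay SQL literals.
            '.' | '^' | '$' | '\\' => push_literal(ch, &mut out),
            // POSIX metacharacters and plain characters pass through.
            c => out.push(c),
        }
    }
    out.push_str(")$");
    Ok(out)
}
```

With `\` as the escape character, the pattern `a\%` would translate to `^(?:a%)$`, which matches the string `a%` and nothing else.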

I'm curious to hear any responses!

Labels

physical-expr Changes to the physical-expr crates sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

PostgreSQL compatibility: SIMILAR TO should treat % as a wildcard

3 participants