fix: translate SQL wildcards in SIMILAR TO patterns (#22263)#23188
fix: translate SQL wildcards in SIMILAR TO patterns (#22263)#23188oc7o wants to merge 5 commits into
Conversation
`SIMILAR TO` previously passed the pattern straight to Arrow's regex
engine, so SQL wildcards were never translated and matches were
unanchored:
SELECT 'abc' SIMILAR TO 'a%'; -- returned false
SELECT 'x' SIMILAR TO '_'; -- returned false
Translate `%` to `.*` and `_` to `.`, then wrap the pattern in
`^(?:...)$` so the regex matches the entire string. Other regex
metacharacters (`|`, `(`, `)`, `*`, `+`, `?`) pass through unchanged,
matching `SIMILAR TO`'s superset-of-regex semantics.
The translation only fires for literal `Utf8`, `LargeUtf8`, and
`Utf8View` patterns. Non-literal patterns return a `not_impl_err!` —
silently wrong results are worse than an honest error, and this mirrors
how DataFusion already handles the unsupported `ESCAPE` clause. NULL
patterns pass through unchanged.
Existing tests in `binary.rs` were relying on the bug by passing raw
regex strings as `SIMILAR TO` patterns; they have been rewritten to use
SQL wildcard syntax, and new cases cover `%`, `_`, full-string
anchoring, and regex-metacharacter passthrough. End-to-end coverage
added in `strings.slt`.
|
@huaxingao @viirya @wesm Could one of you trigger CI for me please? Thanks! |
|
@oc7o Triggered. CI is running now. |
There was a problem hiding this comment.
@oc7o
Thanks for working on this. The direction looks good, but I think there are still a couple of correctness issues in the SIMILAR TO translation that should be addressed before this can be merged. I also have one small test coverage suggestion.
| match ch { | ||
| '%' => result.push_str(".*"), | ||
| '_' => result.push('.'), | ||
| c => result.push(c), |
There was a problem hiding this comment.
Thanks for tackling this. I think there is still one correctness issue here.
SIMILAR TO currently copies every non-% and non-_ character directly into the Arrow regex. That means regex metacharacters like ., ^, and $ are still treated as regex operators even though they are literals in SQL SIMILAR TO patterns.
For example, the SQL pattern a. should only match the literal string a., but the current translation produces ^(?:a.)$, so SELECT 'ab' SIMILAR TO 'a.' incorrectly returns true.
Could we translate the SQL pattern grammar explicitly instead? SQL literals should be escaped for the regex, and only the metacharacters that SIMILAR TO actually defines should be emitted as regex syntax.
There was a problem hiding this comment.
True, I thought that it would be reasonable to also pass regex but we should just stick with vanilla SQL syntax. This is the way a user would expect it to be. The translator now escapes all non-wildcard characters.
There was a problem hiding this comment.
Actually I looked up and at least Postgres is passing some regex but not all: https://www.postgresql.org/docs/current/functions-matching.html#FUNCTIONS-SIMILARTOREGEXP
I'll adapt it to this
There was a problem hiding this comment.
| * + ? ( ) { } [ ] are POSIX metacharacters that SIMILAR TO also defines. I've reworked the translator to escape only . ^ $ \ (the regex-only metachars) and pass the POSIX ones through to the regex. Updated tests cover both directions.
I think now it is really compliant to standard SQL or what do you say @kosiew ?
|
@kosiew I gotta thank you for the effort you put into this. It was really with the eye for detail andI really learned a lot! 😊 I think with how it now is we're a bit closer to the standard SQL implementation. Maybe (when this PR is ready 🤞) escaping could be a good follow up topic for me. Since we currently still treat I'm curios for any responses! |
SIMILAR TOpreviously passed the pattern straight to Arrow's regex engine, so SQL wildcards were never translated and matches were unanchored:Translate
%to(?s:.*)and_to(?s:.)(dot-all so they match newlines), then wrap the pattern in^(?:...)$so the regex matches the entire string..,^,$, and\are escaped as SQL literals. The POSIX metacharacters that SIMILAR TO defines (| * + ? ( ) { } [ ]) pass through to the regex unchanged.The translation only fires for literal
Utf8,LargeUtf8, andUtf8Viewpatterns. Non-literal patterns return anot_impl_err!. Imho are silently wrong results worse than an honest error, and this mirrors how DataFusion already handles the unsupportedESCAPEclause. NULL patterns pass through unchanged.Which issue does this PR close?
SIMILAR TOshould treat%as a wildcard #22263.Rationale for this change
SIMILAR TOis a SQL standard operator with well-defined wildcard semantics (%= any sequence,_= single character, full-string match). DataFusion's current behavior silently produces wrong results for the most basic patterns, which is a correctness bug for anyone porting queries from Postgres or other SQL engines.What changes are included in this PR?
sql_similar_to_regexhelper indatafusion/physical-expr/src/expressions/binary.rsthat translates%/_and anchors the pattern with^(?:...)$.similar_to()now translates the pattern for literalUtf8/LargeUtf8/Utf8Viewvalues, passesNULLthrough unchanged, and returnsnot_impl_err!for non-literal patterns.^inside[...]is treated as bracket negation, not as a literal.Are these changes tested?
Yes:
.,^,$, and\are treated as SQL literals, not as regex operators.|, *, +, ?, (, ), {, }, [, ], [^...], and[a-z]behave asSIMILAR TOmetacharacters.%and_match newlines.%/_semantics, full-string anchoring, and case sensitivity.Are there any user-facing changes?
Yes:
SIMILAR TOnow produces correct results for queries that were previously returning wrong answers. Queries that relied on the buggy behavior (e.g., treating ., ^, or $ as regex metacharacters) will now follow standard SQL SIMILAR TO semantics. Queries using valid SIMILAR TO POSIX metacharacters (| * + ? ( ) { } [ ]) now work as expected. Also % and _ wildcards work now.