51 commits
9cd1566
docs: add implement-comet-expression Claude skill
andygrove Apr 30, 2026
953cb86
docs: reference PR template and add skill-acknowledgement note
andygrove Apr 30, 2026
422d2b3
docs: check datafusion-spark crate before writing native code
andygrove Apr 30, 2026
88f2331
Merge branch 'add-implement-expression-skill'
andygrove Apr 30, 2026
eb8aa14
feat: add CometUDF trait for JVM-side scalar UDFs
andygrove May 1, 2026
60a2ecd
feat: add RegExpLikeUDF using java.util.regex.Pattern
andygrove May 1, 2026
633b75e
feat: add CometUdfBridge JNI entry point for native UDF dispatch
andygrove May 1, 2026
1c64070
feat: add JvmScalarUdf proto message for JVM UDF dispatch
andygrove May 1, 2026
8f78436
feat: register CometUdfBridge in JVMClasses for native UDF dispatch
andygrove May 1, 2026
cf233d5
feat: add JvmScalarUdfExpr PhysicalExpr that dispatches to JVM via JNI
andygrove May 1, 2026
d8ab411
feat: wire JvmScalarUdf proto into native planner
andygrove May 1, 2026
4970c9c
feat: add spark.comet.exec.regexp.useJVM config
andygrove May 1, 2026
54ddd50
feat: route RLike through JVM UDF when spark.comet.exec.regexp.useJVM…
andygrove May 1, 2026
0a942ad
test: add end-to-end suite for JVM-backed RLike
andygrove May 1, 2026
fbfc158
fix: use project-wide CometArrowAllocator in RegExpLikeUDF
andygrove May 1, 2026
909ab91
docs: correct CometUdfBridge thread cache lifetime comment
andygrove May 1, 2026
862ed2e
docs: document from_ffi consumption invariant in JvmScalarUdfExpr
andygrove May 1, 2026
a943de5
style: apply make format
andygrove May 1, 2026
e1b9b2a
docs: mark spark.comet.exec.regexp.useJVM experimental and generalize…
andygrove May 1, 2026
76418c6
test: add CometRegExpBenchmark covering all rlike modes
andygrove May 1, 2026
8ac45be
ci: register new RLike JVM-bridge test suites in PR workflows
andygrove May 1, 2026
a1f8ecf
build: exclude docs/superpowers from rat and git
andygrove May 1, 2026
23a9e52
remove skill
andygrove May 1, 2026
1c66f44
refactor: rename regexp.useJVM boolean to regexp.engine enum (rust|java)
andygrove May 1, 2026
56327ed
fix: ensure UDF bridge inputs/result close on every path and resolve …
andygrove May 1, 2026
fee5ab2
fix: validate regex pattern at convert time so invalid or null patter…
andygrove May 1, 2026
7d0f25c
fix: tolerate missing CometUdfBridge class at JVMClasses init
andygrove May 1, 2026
2a43867
refactor: introduce REGEXP_ENGINE_RUST/REGEXP_ENGINE_JAVA constants
andygrove May 1, 2026
760cd94
perf: send scalar UDF arguments as length-1 vectors
andygrove May 1, 2026
85029c5
test: cover empty and all-null subject vectors in RegExpLikeUDF unit …
andygrove May 1, 2026
a16f336
feat: propagate result nullability through JvmScalarUdf proto
andygrove May 1, 2026
5937650
fix: validate UDF result row count matches longest input
andygrove May 1, 2026
1dd81fb
fix: qualify CometRLike incompat reasons by engine config
andygrove May 1, 2026
42462c3
fix: bound UDF and pattern caches with LRU eviction
andygrove May 1, 2026
8073cf3
test: stop using per-test RootAllocator in RegExpLikeUDFSuite
andygrove May 1, 2026
ce01339
test: remove RegExpLikeUDFSuite due to shading boundary
andygrove May 1, 2026
eb544d6
Merge remote-tracking branch 'apache/main' into prototype-jvm-scalar-udf
andygrove May 6, 2026
4683199
feat: add all Spark regexp expressions via JVM UDF framework
andygrove May 6, 2026
6cac094
docs: update regexp compatibility guide for java vs rust engine
andygrove May 6, 2026
1ad838b
Merge remote-tracking branch 'apache/main' into java-regexp
andygrove May 8, 2026
250b469
fix: use ConcurrentHashMap for pattern cache in regexp UDFs
andygrove May 8, 2026
941d9c7
refactor: use computeIfAbsent for pattern cache lookup
andygrove May 8, 2026
336ec6e
Merge remote-tracking branch 'apache/main' into worktree-pr-4239-rege…
andygrove May 12, 2026
ea939ce
fix: default regexp engine back to rust, mark java engine experimental
andygrove May 12, 2026
5e18c62
style: prettier format regex compatibility docs
andygrove May 12, 2026
8b92370
style: drop unused idx binding in RegExpInStrUDF to fix scalafix lint
andygrove May 12, 2026
ca6628b
style: drop unused idx bindings in regexp serde to fix scalafix lint
andygrove May 12, 2026
c4e88fb
test: set regexp engine to java in SQL tests that need it
andygrove May 13, 2026
b55adb0
Merge remote-tracking branch 'apache/main' into java-regexp
andygrove May 13, 2026
0fa237f
fix: update regexp UDFs to new CometUDF.evaluate(inputs, numRows) sig…
andygrove May 14, 2026
f6b4096
Merge branch 'main' of github.com:apache/datafusion-comet into worktr…
andygrove May 19, 2026
1 change: 1 addition & 0 deletions .github/workflows/pr_build_linux.yml
@@ -380,6 +380,7 @@ jobs:
org.apache.comet.expressions.conditional.CometIfSuite
org.apache.comet.expressions.conditional.CometCoalesceSuite
org.apache.comet.expressions.conditional.CometCaseWhenSuite
org.apache.comet.CometRegExpJvmSuite
- name: "sql"
value: |
org.apache.spark.sql.CometToPrettyStringSuite
1 change: 1 addition & 0 deletions .github/workflows/pr_build_macos.yml
@@ -232,6 +232,7 @@ jobs:
org.apache.comet.expressions.conditional.CometIfSuite
org.apache.comet.expressions.conditional.CometCoalesceSuite
org.apache.comet.expressions.conditional.CometCaseWhenSuite
org.apache.comet.CometRegExpJvmSuite
- name: "sql"
value: |
org.apache.spark.sql.CometToPrettyStringSuite
1 change: 1 addition & 0 deletions .gitignore
@@ -27,3 +27,4 @@ output
docs/comet-*/
docs/build/
docs/temp/
docs/superpowers/
99 changes: 96 additions & 3 deletions docs/source/user-guide/latest/compatibility/regex.md
@@ -19,6 +19,99 @@ under the License.

# Regular Expressions

Comet uses the Rust regexp crate for evaluating regular expressions, and this has different behavior from Java's
regular expression engine. Comet will fall back to Spark for patterns that are known to produce different results, but
this can be overridden by setting `spark.comet.expression.regexp.allowIncompatible=true`.
Comet provides two regexp engines for evaluating regular expressions: a **Rust engine** that uses the Rust
[`regex`] crate natively and an experimental **Java engine** that calls back into the JVM. The engine is
selected with:

```
spark.comet.exec.regexp.engine=rust # default
spark.comet.exec.regexp.engine=java # experimental
```

## Choosing an engine

| | Rust engine | Java engine (experimental) |
| -------------------- | --------------------------------------- | ------------------------------------------------------------------------------------------------------------------- |
| **Compatibility** | Pattern-dependent differences | 100% compatible with Spark |
| **Feature coverage** | `rlike`, `regexp_replace`, `split` only | All regexp expressions (`rlike`, `regexp_extract`, `regexp_extract_all`, `regexp_instr`, `regexp_replace`, `split`) |
| **Performance** | Fully native, no JNI overhead | One JNI round-trip per batch (Arrow vectors stay columnar) |
| **Pattern support** | Linear-time subset only | All Java regex features (backreferences, lookaround, etc.) |

The **Rust engine** (default) is faster but only supports a subset of patterns. When it encounters a pattern
it cannot handle, it falls back to Spark automatically. To opt in to native evaluation for patterns Comet
considers potentially incompatible, set:

```
spark.comet.expression.regexp.allowIncompatible=true
```

The **Java engine** is an experimental option for correctness-sensitive workloads. It evaluates expressions
by passing Arrow vectors to a JVM-side UDF that uses `java.util.regex`, producing identical results to Spark
for all patterns. Because it is experimental, the behavior, configuration, and supported expressions may
change in future releases.
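
As a rough illustration of what a JVM-side regexp UDF does per batch, here is a minimal sketch. It is not Comet's implementation: the class and method names are hypothetical, and plain Java arrays stand in for Arrow vectors. It compiles the pattern once through a cache, then evaluates every row of the batch in a single call, using `find` for RLIKE-style substring matching.

```java
import java.util.Arrays;
import java.util.concurrent.ConcurrentHashMap;
import java.util.regex.Pattern;

// Hypothetical sketch: arrays stand in for Arrow vectors.
final class RlikeBatchSketch {
    // Compiled patterns are cached so each distinct regex is compiled once.
    private static final ConcurrentHashMap<String, Pattern> CACHE = new ConcurrentHashMap<>();

    // Evaluates `subject RLIKE regex` for a whole batch in one call.
    // A null subject yields a null result, following SQL null semantics.
    static Boolean[] rlike(String[] subjects, String regex) {
        Pattern pattern = CACHE.computeIfAbsent(regex, Pattern::compile);
        Boolean[] out = new Boolean[subjects.length];
        for (int i = 0; i < subjects.length; i++) {
            String s = subjects[i];
            out[i] = (s == null) ? null : pattern.matcher(s).find();
        }
        return out;
    }

    public static void main(String[] args) {
        Boolean[] result = rlike(new String[] {"abc", null, "xbc"}, "^a");
        System.out.println(Arrays.toString(result)); // [true, null, false]
    }
}
```

Doing the work once per batch rather than once per row is what keeps the JNI cost to a single round trip per batch.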

## Why the engines differ

Java's `java.util.regex` is a backtracking engine in the Perl/PCRE family. It supports the full range of
features that style of engine provides, including some whose worst-case running time grows exponentially with
the input.

Rust's [`regex`] crate is a finite-automaton engine in the [RE2] family. It deliberately omits features that
cannot be implemented with a guarantee of linear-time matching. In exchange, every pattern it does accept runs
in time linear in the size of the input. This is the same trade-off RE2, Go's `regexp`, and several other
engines make.

The practical consequence is that Java accepts a strictly larger set of patterns than the Rust engine, and
several constructs that look the same in source have different semantics on the two sides.

## Features supported by Java but not by the Rust engine

Patterns that use any of the following will not compile in Comet's Rust engine and must run on Spark (or use
the Java engine):

- **Backreferences** such as `\1`, `\2`, or `\k<name>`. The Rust engine has no backtracking and cannot match
a previously captured group.
- **Lookaround**, including lookahead (`(?=...)`, `(?!...)`) and lookbehind (`(?<=...)`, `(?<!...)`).
- **Atomic groups** (`(?>...)`).
- **Possessive quantifiers** (`*+`, `++`, `?+`, `{n,m}+`). Rust supports greedy and lazy quantifiers but not
possessive.
- **Embedded code, conditionals, and recursion** such as `(?(cond)yes|no)` or `(?R)`. Rust accepts none of
these.
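
The following standalone sketch exercises three of these constructs with `java.util.regex` directly. The class and method names are illustrative only. Each pattern compiles and runs on the JVM but relies on backtracking-engine behavior that a linear-time engine does not provide.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class JavaOnlyRegexFeatures {
    // Backreference: \1 must repeat whatever group 1 captured.
    static boolean hasDoubledChar(String s) {
        return Pattern.compile("(\\w)\\1").matcher(s).find();
    }

    // Lookahead: match "foo" only when "bar" follows, without consuming "bar".
    static String fooBeforeBar(String s) {
        Matcher m = Pattern.compile("foo(?=bar)").matcher(s);
        return m.find() ? m.group() : null;
    }

    // Possessive quantifier: a*+ takes every 'a' and never gives one back,
    // so the trailing 'a' in the pattern can never match.
    static boolean possessiveMatches(String s) {
        return Pattern.matches("a*+a", s);
    }

    // Same pattern with a greedy quantifier, which does backtrack.
    static boolean greedyMatches(String s) {
        return Pattern.matches("a*a", s);
    }

    public static void main(String[] args) {
        System.out.println(hasDoubledChar("hello"));  // true ("ll")
        System.out.println(hasDoubledChar("world"));  // false
        System.out.println(fooBeforeBar("foobar"));   // foo
        System.out.println(fooBeforeBar("foobaz"));   // null
        System.out.println(possessiveMatches("aaa")); // false
        System.out.println(greedyMatches("aaa"));     // true
    }
}
```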

## Features that exist on both sides but behave differently

Even where both engines accept a construct, the matching behavior is not always the same.

- **Unicode-aware character classes.** In the Rust engine, `\d`, `\w`, and `\s` are Unicode-aware by
default, so `\d` matches every digit codepoint defined by Unicode rather than only `0`-`9`. In Java, these
classes match ASCII only by default and require the `UNICODE_CHARACTER_CLASS` flag (or `(?U)` inline) to
switch to Unicode semantics. The same pattern can therefore match a different set of characters on each side.
- **Line terminators.** In multiline mode, Java treats `\r`, `\n`, `\r\n`, and a few additional Unicode line
separators as line boundaries by default. The Rust engine treats only `\n` as a line boundary unless CRLF
mode is enabled. `^`, `$`, and `.` (with `(?s)` off) all depend on this definition.
- **Case-insensitive matching.** Both engines support `(?i)`, but Java's default is ASCII case folding while
the Rust engine uses full Unicode simple case folding when Unicode mode is on. Patterns that match characters
outside ASCII can produce different results.
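- **POSIX character classes.** The Rust engine supports `[[:alpha:]]` style POSIX classes inside bracket
expressions. Java does not: it parses `[[:alpha:]]` as a nested character class containing the literal
characters `:`, `a`, `l`, `p`, and `h`, and spells the POSIX class as `\p{Alpha}` instead, which the Rust
engine does not treat as an ASCII POSIX class. Unicode property escapes (`\p{L}`, `\p{Greek}`, etc.) are
supported by both engines but cover slightly different sets of properties.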
- **Octal and Unicode escapes.** Java accepts `\0nnn` for octal and `\uXXXX` for a BMP codepoint. The Rust
engine does not accept octal escapes by default. It accepts `\uXXXX` only with exactly four hex digits and
uses `\x{...}` or `\u{...}` for arbitrary codepoints.
- **Empty matches in `split`.** Spark's `StringSplit`, which is built on Java's regex, includes leading empty
strings produced by zero-width matches at the start of the input. The Rust engine's `split` follows different
rules, so split results can differ in edge cases involving empty matches even when the pattern itself is
identical on both sides.
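
The Java-side defaults described in this list can be checked with `java.util.regex` alone. The sketch below (class and method names are illustrative) shows the ASCII-only defaults for `\d` and `(?i)`, and that a lone `\r` counts as a line terminator in multiline mode unless `UNIX_LINES` (`(?d)`) is set. It demonstrates only the Java half of each comparison.

```java
import java.util.regex.Pattern;

final class JavaRegexDefaults {
    // U+0663 ARABIC-INDIC DIGIT THREE: a Unicode decimal digit outside ASCII.
    static final String ARABIC_THREE = "\u0663";

    // \d is ASCII-only by default; (?U) enables UNICODE_CHARACTER_CLASS.
    static boolean digitDefault() { return Pattern.matches("\\d", ARABIC_THREE); }
    static boolean digitUnicode() { return Pattern.matches("(?U)\\d", ARABIC_THREE); }

    // (?i) folds ASCII only; adding (?u) (UNICODE_CASE) folds e-acute (U+00E9)
    // to its uppercase form (U+00C9).
    static boolean caseDefault() { return Pattern.matches("(?i)\u00e9", "\u00c9"); }
    static boolean caseUnicode() { return Pattern.matches("(?iu)\u00e9", "\u00c9"); }

    // In multiline mode a lone \r ends a line, so ^b$ finds the middle line.
    static boolean crEndsLine() {
        return Pattern.compile("(?m)^b$").matcher("a\rb\rc").find();
    }

    // With UNIX_LINES (?d), only \n is a line terminator.
    static boolean crEndsLineUnixLines() {
        return Pattern.compile("(?md)^b$").matcher("a\rb\rc").find();
    }

    public static void main(String[] args) {
        System.out.println(digitDefault());        // false
        System.out.println(digitUnicode());        // true
        System.out.println(caseDefault());         // false
        System.out.println(caseUnicode());         // true
        System.out.println(crEndsLine());          // true
        System.out.println(crEndsLineUnixLines()); // false
    }
}
```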

## When the Rust engine is safe

For most ASCII-only, non-anchored patterns that use only literal characters, simple character classes, and
ordinary quantifiers, the two engines produce the same results. If you are confident your patterns fit this
shape and want to avoid the JNI overhead of the Java engine, running the default Rust engine with
`spark.comet.expression.regexp.allowIncompatible=true` is generally safe.

For anything that uses backreferences, lookaround, or relies on Java's specific Unicode or line-handling
defaults, use the experimental Java engine.
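
One way to act on this guidance is to screen patterns before opting in. The sketch below is a hypothetical, deliberately conservative pre-check and is not the logic Comet uses to decide on fallback. It accepts only ASCII patterns with no backslash escapes, no `(?...)` groups, no `^`, `$`, or `.`, and no possessive quantifiers, so it rejects many patterns that would in fact behave identically on both engines.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Comet.
final class PortablePatternCheck {
    private static final Pattern RISKY = Pattern.compile(
          "\\\\"          // any backslash escape (\d, \1, \b, \uXXXX, ...)
        + "|\\(\\?"       // lookaround, atomic groups, inline flags
        + "|[\\^$.]"      // anchors and dot depend on line-terminator rules
        + "|[*+?}]\\+");  // possessive quantifiers such as *+ or {n,m}+

    static boolean looksPortable(String regex) {
        boolean ascii = regex.chars().allMatch(c -> c < 128);
        return ascii && !RISKY.matcher(regex).find();
    }

    public static void main(String[] args) {
        System.out.println(looksPortable("ab+c"));       // true
        System.out.println(looksPortable("[a-z]{2,3}")); // true
        System.out.println(looksPortable("(a)\\1"));     // false
        System.out.println(looksPortable("foo(?=bar)")); // false
        System.out.println(looksPortable("a*+b"));       // false
    }
}
```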

[`java.util.regex`]: https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html
[`regex`]: https://docs.rs/regex/latest/regex/
[RE2]: https://github.com/google/re2/wiki/Syntax
6 changes: 5 additions & 1 deletion native/jni-bridge/src/lib.rs
@@ -231,7 +231,8 @@ pub struct JVMClasses<'a> {
/// acquire & release native memory.
pub comet_task_memory_manager: CometTaskMemoryManager<'a>,
/// The CometUdfBridge class used to dispatch JVM scalar UDFs.
/// `None` if the class is not on the classpath.
/// `None` if the class is not on the classpath; the JVM-UDF dispatch path
/// reports a clear error rather than crashing executor init.
pub comet_udf_bridge: Option<CometUdfBridge<'a>>,
}

@@ -304,6 +305,9 @@ impl JVMClasses<'_> {
comet_shuffle_block_iterator: CometShuffleBlockIterator::new(env).unwrap(),
comet_task_memory_manager: CometTaskMemoryManager::new(env).unwrap(),
comet_udf_bridge: {
// Optional: if the bridge class is absent (e.g. comet shading
// dropped org.apache.comet.udf.*), record None and clear the
// pending JVM exception so other JNI calls keep working.
let bridge = CometUdfBridge::new(env).ok();
if env.exception_check() {
env.exception_clear();
17 changes: 15 additions & 2 deletions native/spark-expr/src/jvm_udf/mod.rs
@@ -171,7 +171,8 @@ impl PhysicalExpr for JvmScalarUdfExpr {
let bridge = JVMClasses::get().comet_udf_bridge.as_ref().ok_or_else(|| {
CometError::from(ExecutionError::GeneralError(
"JVM UDF bridge unavailable: org.apache.comet.udf.CometUdfBridge \
class was not found on the JVM classpath."
class was not found on the JVM classpath. Set \
spark.comet.exec.regexp.engine=rust to disable this path."
.to_string(),
))
})?;
@@ -237,7 +238,19 @@ impl PhysicalExpr for JvmScalarUdfExpr {
// exactly once when the Box drops at end of scope.
let result_data = unsafe { from_ffi(*out_array, &out_schema) }
.map_err(|e| CometError::Arrow { source: e })?;
Ok(ColumnarValue::Array(make_array(result_data)))
let result_array = make_array(result_data);

// The JVM may produce arrays with different field names (e.g. Arrow Java's
// ListVector uses "$data$" for child fields) than what DataFusion expects
// (e.g. "item"). Cast to the declared return_type to normalize schema.
let result_array = if result_array.data_type() != &self.return_type {
arrow::compute::cast(&result_array, &self.return_type)
.map_err(|e| CometError::Arrow { source: e })?
} else {
result_array
};

Ok(ColumnarValue::Array(result_array))
}

fn children(&self) -> Vec<&Arc<dyn PhysicalExpr>> {
1 change: 1 addition & 0 deletions pom.xml
@@ -1153,6 +1153,7 @@ under the License.
<exclude>native/proto/src/generated/**</exclude>
<exclude>benchmarks/tpc/queries/**</exclude>
<exclude>.claude/**</exclude>
<exclude>docs/superpowers/**</exclude>
</excludes>
</configuration>
</plugin>
18 changes: 18 additions & 0 deletions spark/src/main/scala/org/apache/comet/CometConf.scala
@@ -381,6 +381,24 @@ object CometConf extends ShimCometConf {
.booleanConf
.createWithDefault(false)

val REGEXP_ENGINE_RUST = "rust"
val REGEXP_ENGINE_JAVA = "java"

val COMET_REGEXP_ENGINE: ConfigEntry[String] =
conf("spark.comet.exec.regexp.engine")
.category(CATEGORY_EXEC)
.doc(
"Selects the engine used to evaluate supported regular-expression " +
s"expressions. `$REGEXP_ENGINE_RUST` uses the native DataFusion regexp engine. " +
s"`$REGEXP_ENGINE_JAVA` is experimental and routes through a JVM-side UDF " +
"(java.util.regex.Pattern) for Spark-compatible semantics, at the cost of JNI " +
"roundtrips per batch. Expressions routed when set to java: rlike, regexp_extract, " +
"regexp_extract_all, regexp_replace, regexp_instr, and split.")
.stringConf
.transform(_.toLowerCase(Locale.ROOT))
.checkValues(Set(REGEXP_ENGINE_RUST, REGEXP_ENGINE_JAVA))
.createWithDefault(REGEXP_ENGINE_RUST)

val COMET_EXEC_SHUFFLE_WITH_HASH_PARTITIONING_ENABLED: ConfigEntry[Boolean] =
conf("spark.comet.native.shuffle.partitioning.hash.enabled")
.category(CATEGORY_SHUFFLE)
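
The `transform` and `checkValues` steps in the `COMET_REGEXP_ENGINE` entry above amount to "lowercase with a fixed locale, then reject anything outside the allowed set". A minimal Java sketch of that behavior follows. The class and method names are hypothetical, and the exception type is an assumption rather than what Spark's config builder throws.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for the config entry's normalize-and-validate steps.
final class RegexpEngineSetting {
    static final String RUST = "rust";
    static final String JAVA = "java";
    private static final Set<String> ALLOWED = Set.of(RUST, JAVA);

    static String normalize(String raw) {
        // Locale.ROOT keeps lowercasing independent of the JVM's default locale.
        String value = raw.toLowerCase(Locale.ROOT);
        if (!ALLOWED.contains(value)) {
            throw new IllegalArgumentException(
                "spark.comet.exec.regexp.engine must be one of " + ALLOWED + ", got: " + raw);
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(normalize("RUST")); // rust
        System.out.println(normalize("Java")); // java
    }
}
```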
@@ -182,6 +182,9 @@ object QueryPlanSerde extends Logging with CometExprShim with CometTypeShim {
classOf[Like] -> CometLike,
classOf[Lower] -> CometLower,
classOf[OctetLength] -> CometScalarFunction("octet_length"),
classOf[RegExpExtract] -> CometRegExpExtract,
classOf[RegExpExtractAll] -> CometRegExpExtractAll,
classOf[RegExpInStr] -> CometRegExpInStr,
classOf[RegExpReplace] -> CometRegExpReplace,
classOf[Reverse] -> CometReverse,
classOf[RLike] -> CometRLike,