Skip to content

[AURON #2279] Implement native Levenshtein with threshold support for Spark 4.0+#2280

Open
lyne7-sc wants to merge 7 commits into
apache:masterfrom
lyne7-sc:fix/spark_levenshtein
Open

[AURON #2279] Implement native Levenshtein with threshold support for Spark 4.0+#2280
lyne7-sc wants to merge 7 commits into
apache:masterfrom
lyne7-sc:fix/spark_levenshtein

Conversation

@lyne7-sc
Copy link
Copy Markdown
Contributor

Which issue does this PR close?

Closes #2279

Rationale for this change

Spark 4+ introduced an optional threshold parameter for Levenshtein. The previous native implementation delegated to DataFusion's built-in levenshtein() which only supports 2 arguments.

What changes are included in this PR?

  • Added a custom spark_levenshtein native function that supports both 2-arg and 3-arg (with threshold) forms
  • Changed NativeConverters.scala to route Levenshtein through buildExtScalarFunction("Spark_Levenshtein") instead of the old buildScalarFunction path
  • Removed .exclude("string Levenshtein distance") from Spark 4.0 and 4.1 test settings

Are there any user-facing changes?

Yes. The native levenshtein() function now supports the optional third threshold parameter.

How was this patch tested?

  • Rust unit tests
  • AuronStringFunctionsSuite and AuronFunctionSuite including "test function Levenshtein"

@lyne7-sc lyne7-sc changed the title [Aruon #2279] Support Levenshtein func with threshold parameter for Spark 4+ [AURON #2279] Implement native Levenshtein with threshold support for Spark 4.0+ May 19, 2026
@cxzl25 cxzl25 requested a review from Copilot May 19, 2026 12:40
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Ok(ColumnarValue::Array(Arc::new(splitted_builder.finish())))
}

pub fn spark_levenshtein(args: &[ColumnarValue]) -> Result<ColumnarValue> {
Copy link
Copy Markdown
Contributor

@ShreyeshArangath ShreyeshArangath May 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct me if I'm wrong but, in spark_levenshtein, a null threshold is coerced to Some(0), which then returns -1 for any non-zero distance and 0 for equal strings:

  Some(ColumnarValue::Scalar(scalar)) if scalar.is_null() => Some(0),
  ...
  Some(array) if array.data_type() == &DataType::Null => Some(0),
  Some(_) => thresholds.map(|array| if array.is_valid(i) { array.value(i) } else { 0 }),

I think that doesn't match Spark. In Spark, Levenshtein.eval only null-checks left/right..a null threshold at runtime causes an NPE during v.asInstanceOf[Int]. The defensible options are (a) propagate null when threshold is null, or (b) mirror Spark and error

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support Levenshtein func with threshold parameter for Spark 4+

3 participants