Skip to content

feat(python/sedonadb): add Expr foundation#807

Merged
jiayuasu merged 5 commits into
apache:mainfrom
jiayuasu:feature/expr-foundation
May 8, 2026
Merged

feat(python/sedonadb): add Expr foundation#807
jiayuasu merged 5 commits into
apache:mainfrom
jiayuasu:feature/expr-foundation

Conversation

@jiayuasu
Copy link
Copy Markdown
Member

@jiayuasu jiayuasu commented May 2, 2026

Adds the foundation of a Python expression layer that wraps DataFusion's logical Expr through PyO3, mirroring the pattern used in the R bindings (r/sedonadb/src/rust/src/expression.rs).

This is the first of four small stacked PRs that together implement Phase P1 of #791. This one ships only the foundation:

  • sedonadb.expr.Expr — column-expression wrapper with alias, cast, is_null, is_not_null, isin, negate.
  • sedonadb.expr.col(name, qualifier=None) — reference a column by name (with an optional table qualifier for joins).

Operator overloading and DataFrame integration land in follow-up PRs. Expr objects are pure syntax: they are not bound to a DataFrame at construction time, and column-validity errors surface only when an Expr is consumed by a DataFrame method.

The pre-existing sedonadb.expr.literal.lit (returning Literal, the lazy parameterized-query helper) is unchanged by this PR and is intentionally kept off sedonadb.expr's package-level surface.

Test plan

  • Unit tests in tests/expr/test_expression.py covering construction, qualified columns, alias, cast, null checks, isin, negate, chaining, and the __init__ type guard.
  • No regressions in the existing tests/test_dataframe.py.
  • CI green.

Introduce a Python expression layer that wraps DataFusion's logical
Expr through PyO3, mirroring the pattern used in the R bindings
(r/sedonadb/src/rust/src/expression.rs).

This commit ships the foundation only:

- `sedonadb.expr.Expr` — column-expression wrapper with `alias`,
  `cast`, `is_null`, `is_not_null`, `isin`, and `negate`.
- `sedonadb.expr.col(name)` — reference a column by name.
- `sedonadb.expr.lit(value)` — wrap a Python value as a literal,
  reusing the existing Literal Arrow-array coercion path.

Operator overloading and DataFrame integration land in follow-up
commits. Expr objects are pure syntax: they are not bound to a
DataFrame at construction time, and column-validity errors surface
only when an Expr is consumed by a DataFrame method.
@github-actions github-actions Bot requested a review from prantogg May 2, 2026 06:09
@jiayuasu jiayuasu requested a review from Copilot May 2, 2026 06:10
@jiayuasu jiayuasu marked this pull request as draft May 2, 2026 06:11
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds the initial Python expression layer (sedonadb.expr) that wraps DataFusion logical Expr via PyO3, establishing the base building blocks needed for upcoming pandas-like APIs (per #791 / Phase P1).

Changes:

  • Introduces a Rust PyO3 InternalExpr wrapper and exports constructors expr_col / expr_lit from sedonadb._lib.
  • Adds Python sedonadb.expr.Expr plus col() / lit() helpers and basic expression methods (alias, cast, null checks, isin, negate).
  • Adds a new unit test suite covering expression construction and method chaining.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
python/sedonadb/tests/expr/test_expression.py New unit tests for the Python Expr surface.
python/sedonadb/src/lib.rs Registers the new Rust expression module/functions/classes in the PyO3 module.
python/sedonadb/src/expr.rs Implements InternalExpr and core expression operations + literal/column constructors.
python/sedonadb/python/sedonadb/expr/expression.py Implements the public Python Expr wrapper API and helpers.
python/sedonadb/python/sedonadb/expr/init.py Re-exports Expr, col, lit, and Literal.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread python/sedonadb/python/sedonadb/expr/expression.py Outdated
Comment thread python/sedonadb/src/expr.rs
Comment thread python/sedonadb/tests/expr/test_expression.py
- expr.rs: add module-level docs and step-by-step comments on every
  factory and method, explaining what each constructs, why we clone,
  why we reject Arrow extension types in cast(), and the intended
  shape of the layer relative to the R prior art.
- expr.rs: route expr_lit() through import_arrow_scalar() instead of
  duplicating the length check + ScalarValue conversion + metadata
  handling inline. Future fixes to scalar coercion only need to live
  in one place.
- expression.py: fix is_null() docstring. The previous wording
  promised both NULL and NaN matching, but the underlying DataFusion
  Expr::IsNull only matches SQL NULL. The pandas-style NaN-aware
  helper will live on the future Series type.
- test_expression.py: lean on Expr.variant_name() for structural
  assertions and only check user-supplied identifiers (column names,
  literal values) inside repr() output. Avoids coupling the suite to
  DataFusion's Display formatting, which can change between versions
  without a semantic change.
Copy link
Copy Markdown
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very cool! I know you're still working here so feel free to ignore anything that was still in the works.

Comment on lines +111 to +114
/// 2. Reject Arrow extension types up front. SedonaDB's spatial types
/// are extension types over WKB; users who reach for `cast` on
/// those almost certainly want a different operation, so we surface
/// a clear error rather than silently dropping the extension.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 (I added support for these in DataFusion 53 that we can transform using an optimizer rule to a scalar function call). We can support these before DataFusion 53 here if we need to by inserting the scalar function call directly (I may give this a go in the near future since it's maybe useful for geometry/geography).

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the heads-up. Leaving extension-type cast as a clear error in this PR. Happy to revisit with the DF53 optimizer rule once that's available — geometry/geography is the obvious payoff.

Comment thread python/sedonadb/src/expr.rs Outdated
Comment on lines +172 to +177
/// Qualified columns (e.g. `t.x`) are not exposed yet; the Python
/// `col()` helper takes only a single name. When we add joins and
/// multi-table references we can grow this to accept an optional
/// table qualifier, matching the R side's `column(name, qualifier)`.
#[pyfunction]
pub fn expr_col(name: &str) -> PyExpr {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's easy you can probably just add a qualifier argument here while you're at it

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 0d1a4abexpr_col(name, qualifier=None) on the Rust side and col(name, qualifier=None) on the Python side. Test in test_col_with_qualifier.

e = col("x")
assert isinstance(e, Expr)
assert e._impl.variant_name() == "Column"
assert "x" in repr(e)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Testing the exact repr output is probably easy to do and is a slightly better test

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reverted in 0d1a4ab — exact repr substrings are back (x AS y, CAST(x AS Int32), etc.), with variant_name() checks kept alongside as structural anchors. Added a module-level comment explaining the policy so the next person doesn't re-loosen them.

Comment on lines +99 to +105
def lit(value: Any) -> Expr:
"""Wrap a Python value as a literal expression.

Accepts the same value types as `sedonadb.expr.literal.lit`, including
Python scalars, pyarrow arrays/scalars, and Shapely geometries. Returns an
`Expr` suitable for composition with column expressions.
"""
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW there is already a lit() function that returns Literal. Literal is intentionally lazy so (the python wrappers around) different functions can interpret then differently if they want to (e.g., RS_Intersects() doesn't actually need to convert a whole rasterio object to an Arrow scalar, which is expensive, to compute a correct result). Purely theoretical, not currently used, and can change, but that's why it's like that 🙂 .

Probably you can just fold this logic into _to_expr() and leave the existing lit() (which powers parameterized queries).

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, that makes sense — thanks for the context. Folded into 0d1a4ab: dropped the public lit() -> Expr and moved the Python-value-to-Expr coercion into the private _to_expr() helper. The existing sedonadb.expr.literal.lit returning Literal is untouched and stays re-exported from the package.

Comment on lines +40 to +43
def __init__(self, impl):
# impl is the underlying _lib.InternalExpr handle. Users normally
# do not construct Expr directly; use col() / lit() instead.
self._impl = impl
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An isinstance() check here would be good so that this errors if used incorrectly

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added in 0d1a4abExpr.__init__ now isinstance-checks its argument against _lib.InternalExpr and raises TypeError with a message pointing at col(). Covered by test_expr_init_rejects_wrong_type.

Comment on lines +48 to +50
def alias(self, name: str) -> "Expr":
"""Return a copy of the expression with a new output name."""
return Expr(self._impl.alias(name))
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I forget if we have CI checks for this, but we parameter docs and examples for most functions in the Python APIs

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Beefed up in 0d1a4ab. Public Expr methods and col() now have Args: and Examples: docstring blocks matching the style of dataframe.py:head/limit/count.

A. expr_col now accepts an optional qualifier (e.g. col("x", "t") for
   t.x), mirroring SedonaDBExprFactory::column in the R bindings.
B. Tests pin the exact rendered Display form again (e.g. "x AS y",
   "CAST(x AS Int32)") with a module-level comment noting the policy
   so future contributors don't re-loosen them. variant_name() checks
   stay alongside as structural anchors.
C. Drop the public lit() that returned an Expr; Python-side coercion
   is now folded into the private _to_expr() helper. The pre-existing
   lit() in sedonadb.expr.literal (returning Literal) is intentionally
   lazy and powers parameterized queries; leave it untouched and keep
   it re-exported.
D. Expr.__init__ now isinstance-checks its argument and raises
   TypeError on misuse, with a message pointing at col().
E. Public Expr methods and col() grow Args / Examples docstring blocks
   matching the style of the existing dataframe.py methods.
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Comment on lines +18 to +21
from sedonadb.expr.expression import Expr, col
from sedonadb.expr.literal import Literal, lit

__all__ = ["Expr", "Literal", "col", "lit"]
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in f633481lit is removed from sedonadb.expr.__init__.py's re-exports. It's a Literal constructor (the lazy parameterized-query helper), not an Expr factory, so surfacing it on the same package as col() was misleading. Users continue to from sedonadb.expr.literal import lit, matching the pre-PR layout. PR description updated to drop the obsolete bullet.

Comment on lines +36 to +37
Construct an `Expr` with `col(name)`. Plain Python values composed with an
`Expr` via operators are coerced to literal expressions automatically.
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — fixed in f633481. The Expr docstring now describes only what this PR ships (coercion happens inside methods like isin, not via operators) and calls out that operator overloading arrives in a follow-up.

- expr/__init__.py: drop `lit` from `__all__` / re-exports. It returns
  a `Literal` (the parameterized-query helper), not an `Expr`, so
  surfacing it on the same package as `col()` was confusing. Users
  who need it continue to `from sedonadb.expr.literal import lit`,
  matching the pre-PR layout.
- expression.py: rewrite the `Expr` class docstring to describe what
  this PR actually ships (coercion happens inside methods like
  `isin`, not via operators) and call out that operator overloading
  arrives in a follow-up.
@jiayuasu jiayuasu marked this pull request as ready for review May 6, 2026 06:16
Comment on lines +93 to +95
assert "IN" in rep
assert "Int64(1)" in rep
assert "Int64(3)" in rep
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few more of these that should just be checking the repr. (LLMs love to write tests that will almost definitely pass, particularly if they are testing their own implementation 🙂 )

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tightened to exact repr() == ... in f89149b.

Comment on lines +102 to +104
rep = repr(e)
assert "a" in rep
assert "Int64(2)" in rep
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one too

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tightened to exact repr() == ... in f89149b.

Comment on lines +109 to +110
assert e._impl.variant_name() == "Negative"
assert "(- x)" in repr(e)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also this one

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tightened to exact repr() == ... in f89149b.

Comment on lines +115 to +117
assert e._impl.variant_name() == "Alias"
assert "missing" in repr(e)
assert "IS NULL" in repr(e)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and this one

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tightened to exact repr() == ... in f89149b.

# Constructing an Expr referring to a non-existent column does not error.
# Errors surface only at DataFrame consumption.
e = col("nonexistent_column_xyz")
assert "nonexistent_column_xyz" in repr(e)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and this one

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tightened to exact repr() == ... in f89149b.

Switch the five flagged tests from substring matches to exact
repr() equality so any change in DataFusion's Display output
fails the suite deliberately rather than slipping through.

- test_isin_python_scalars
- test_isin_with_expr_values
- test_negate
- test_chain_alias_after_predicate
- test_expr_is_not_bound_to_dataframe
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants