Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #291 +/- ##
=========================================
Coverage 100.00% 100.00%
=========================================
Files 54 54
Lines 3121 3125 +4
=========================================
+ Hits 3121 3125 +4 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Pull request overview
This PR adds early validation to schema definition to prevent ambiguous schemas where multiple Columns resolve to the same effective column name (via alias), raising ImplementationError during class creation and adding regression tests.
Changes:
- Raise
ImplementationErrorwhen two columns in the same schema resolve to the same name (aliasor attribute name). - Raise
ImplementationErrorwhen a child schema introduces a column whose resolved name clashes with an inherited one. - Add tests covering both same-schema and inherited duplicate-alias scenarios.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
dataframely/_base_schema.py |
Adds duplicate-name checks during metadata collection/merging to raise ImplementationError early. |
tests/columns/test_alias.py |
Adds tests asserting duplicate aliases are rejected for same-schema and inherited definitions. |
| if duplicated_column_names := self.columns.keys() & other.columns.keys(): | ||
| raise ImplementationError( | ||
| f"Columns {duplicated_column_names} are duplicated." | ||
| ) |
There was a problem hiding this comment.
Metadata.update() now raises on any overlapping column keys between self and other. Because _get_metadata_recursively() traverses __bases__ recursively (not the MRO), diamond / multiple-inheritance patterns can visit the same ancestor more than once (e.g. A->B, A->C, D(B, C)) and would now raise even though the duplicated columns come from the same original definition. Consider de-duplicating bases during metadata collection (e.g. walk a linearized MRO once) or, at minimum, allow overlaps where the column objects are identical and only raise when the overlapping key maps to different Column instances.
| if duplicated_column_names := self.columns.keys() & other.columns.keys(): | |
| raise ImplementationError( | |
| f"Columns {duplicated_column_names} are duplicated." | |
| ) | |
| """Merge another Metadata instance into this one. | |
| Overlapping keys are allowed if and only if they refer to the *same* | |
| underlying object. This accommodates multiple-inheritance / diamond | |
| patterns where the same base schema is visited more than once. | |
| """ | |
| # Detect conflicting column definitions: same name, different Column instance | |
| duplicated_column_names = self.columns.keys() & other.columns.keys() | |
| conflicting_columns = { | |
| name | |
| for name in duplicated_column_names | |
| if self.columns[name] is not other.columns[name] | |
| } | |
| if conflicting_columns: | |
| raise ImplementationError( | |
| f"Columns {conflicting_columns} are duplicated with conflicting definitions." | |
| ) | |
| # Detect conflicting rule factories: same name, different object | |
| duplicated_rule_names = self.rules.keys() & other.rules.keys() | |
| conflicting_rules = { | |
| name | |
| for name in duplicated_rule_names | |
| if self.rules[name] is not other.rules[name] | |
| } | |
| if conflicting_rules: | |
| raise ImplementationError( | |
| f"Rules {conflicting_rules} are duplicated with conflicting definitions." | |
| ) |
| raise ImplementationError( | ||
| f"Columns {duplicated_column_names} are duplicated." |
There was a problem hiding this comment.
The exception message uses the raw set representation of duplicated_column_names (derived from key-set intersection). For multiple duplicates this will be non-deterministically ordered, which can make debugging and tests brittle. Consider formatting a stable, sorted list of duplicated names (and optionally quoting each name) instead of interpolating the set directly.
| raise ImplementationError( | |
| f"Columns {duplicated_column_names} are duplicated." | |
| formatted_names = ", ".join( | |
| sorted(repr(name) for name in duplicated_column_names) | |
| ) | |
| raise ImplementationError( | |
| f"Columns {formatted_names} are duplicated." |
Raises an
ImplementationErrorwhen a schema contains columns with duplicate aliases.Motivation
I was using LLM to generate dataframely schemas, it gives me duplicated alias like in:
It could be nice from dataframely it icould raise in this case.
Changes