-
Notifications
You must be signed in to change notification settings - Fork 1k
fix(optimizer)!: query schema directly when type annotation fails for processing UNNEST source #6451
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
can you fix the type annotation instead? |
|
Yeah, I second that suggestion in favor of avoiding making this more complicated than necessary. Keen to understand more if that is tricky to do; let me know and we can hop in a call and chat about this. |
georgesittas
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@fivetran-BradfordPaskewitz keep in mind that you'll need to rebase and apply these changes to resolver.py due to 625654a.
3b0a522 to
fea0204
Compare
fea0204 to
41cfa9e
Compare
georgesittas
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@fivetran-BradfordPaskewitz another quick round of comments from me, but this should be good to merge soon.
@tobymao wanna take a look as well? Another pair of eyes is a good idea for this one.
sqlglot/optimizer/resolver.py
Outdated
| if col_type and col_type.is_type(exp.DataType.Type.ARRAY): | ||
| element_types = col_type.expressions | ||
| if element_types: | ||
| unnest.type = element_types[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should copy here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
sqlglot/optimizer/resolver.py
Outdated
| if element_types: | ||
| unnest.type = element_types[0] | ||
| else: | ||
| unnest.type = col_type |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
sqlglot/optimizer/resolver.py
Outdated
| return None | ||
| table_name = table_identifier.name | ||
|
|
||
| source = scope.sources[table_name] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this lookup safe? Do we always know that table_name appears in sources? Should we be conservative?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed this to be defensive
sqlglot/optimizer/resolver.py
Outdated
| return self._get_column_type_from_scope(source, column.name) | ||
|
|
||
| def _get_column_type_from_scope( | ||
| self, source: t.Union[Scope, exp.Table], col_name: str |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should pass a Column instance here, not col_name, because columns will already be normalized & quoted and we won't risk weird normalization corner cases when looking them up in the schema– which also normalizes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good call, updated to use column instead of name
sqlglot/optimizer/resolver.py
Outdated
| for source_name in source.sources: | ||
| nested_source = source.sources[source_name] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| for source_name in source.sources: | |
| nested_source = source.sources[source_name] | |
| for source_name, nested_source in source.sources.items(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
… processing UNNEST source
41cfa9e to
7428a39
Compare
Issue: https://fivetran.atlassian.net/browse/RD-1066808
Problem: In BigQuery, when you UNNEST an ARRAY<STRUCT<...>>, the struct fields can be referenced as unqualified columns (e.g., WHERE type = 'x'). This worked in simple queries but failed in correlated subqueries because sqlglot's type annotation couldn't resolve columns from outer scopes, so qualify_columns didn't know what struct fields to expose.
Solution: Add a fallback mechanism in qualify_columns.py that queries the schema directly when type annotation fails. When processing an UNNEST source, if the expression isn't typed, traverse parent scopes to find the column's type definition in the schema (handling both table sources and CTEs), then extract and expose the struct field names as available columns in that scope.