-
Notifications
You must be signed in to change notification settings - Fork 4.8k
HIVE-26877: Parquet CTAS with JOIN on decimals with different precision/scale fail. #6229
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
|
| if (!ObjectInspectorUtils.compareTypes(writableObjectInspector, objInspector)) { | ||
| parquetRow.inspector = (StructObjectInspector) writableObjectInspector; | ||
| } else { | ||
| parquetRow.inspector = (StructObjectInspector) objInspector; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this mean that we can always just use writableObjectInspector? And if so, then do we even need objInspector?
In the else block, can we use writableObjectInspector if its type is same as objInspector?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah actually my initial commit(commit 1) has the same solution but it has some impacts(53 test cases are failed) when analyzed those failures found that if the data is coming from a TEXT format table then the ObjectInspector coming from Operator is having of type Lazy*ObjectInspector(for some types like Int, String) where as the ObjectInspector created during ParquetHiveSerDe object is of type Writable*ObjectInspector.
For example consider string data type, the ObjectInspector coming from Operator is of type LazyStringObjectInspector which maintains the corresponding primitive java object as LazyString where as WritableStringObjectInspector maintains the primitive java object as Text which results in ClassCastException while getting the actual data.
So we can not always use the ObjectInspector created during initialization phase.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know much about this tbh (I will have to look into this in detail), but do you think it's a better solution to somehow pass the correct objInspector to this method?
Because what it looks like is we are calling the serialize method and asking it to use a particular inspector. And then, with this patch, we may not even use that inspector.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, it would be better for the performance to change the initialize method so that it sets the objInspector to the one that's actually used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The if statement here is bit strange cause it says that whenever there is type disagreement I will use the original inspector and when types are equal I will trust what is given to me.
The fact that all tests pass implies that most of the time (if not always) in existing tests the types are equal so we are essentially hitting the else branch.
Type inequality seems to be an outlier and maybe ctas_dec_pre_scale_issue.q is the only test that covers it. Does the proposed solution work if you add more column types in the table/queries?
CREATE TABLE table_a (cint int, cstring string, cdec decimal(12,7));
INSERT INTO table_a(100, 'Bob', 12345.6789101);
...
CREATE TABLE target AS
SELECT ta.cint, ta.cstring, ta.cdec
FROM table_a ta
...
zabetak
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem here seems more like a bug in the plan/type system rather than the SerDe level so I get the feeling that the PR should not be touching the SerDe classes but rather the planner.
| if (!ObjectInspectorUtils.compareTypes(writableObjectInspector, objInspector)) { | ||
| parquetRow.inspector = (StructObjectInspector) writableObjectInspector; | ||
| } else { | ||
| parquetRow.inspector = (StructObjectInspector) objInspector; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The if statement here is bit strange cause it says that whenever there is type disagreement I will use the original inspector and when types are equal I will trust what is given to me.
The fact that all tests pass implies that most of the time (if not always) in existing tests the types are equal so we are essentially hitting the else branch.
Type inequality seems to be an outlier and maybe ctas_dec_pre_scale_issue.q is the only test that covers it. Does the proposed solution work if you add more column types in the table/queries?
CREATE TABLE table_a (cint int, cstring string, cdec decimal(12,7));
INSERT INTO table_a(100, 'Bob', 12345.6789101);
...
CREATE TABLE target AS
SELECT ta.cint, ta.cstring, ta.cdec
FROM table_a ta
...| Select Operator | ||
| expressions: _col0 (type: decimal(12,7)) | ||
| outputColumnNames: col1 | ||
| Statistics: Num rows: 1 Data size: 112 Basic stats: COMPLETE Column stats: COMPLETE |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This Select Operator is interesting. Although it is not involved when writing to the table it seems to have the correct data type for the column that results from the join so if we had something similar before calling the File Output Operator we wouldn't hit the problem in the first place.



HIVE-26877: Parquet CTAS with JOIN on decimals with different precision/scale fail.
What changes were proposed in this pull request?
While serializing/writing the data using
ParquetHiveSerDethen verifying whether the type of theObjectInspectorobject received(this is coming from Operator based on the data) is same as the type of theObjectInspectorcreated duringParquetHiveSerDeinitialisation(this is coming from Table schema), if not same then using theObjectInspectoravailable during initialisation phase as it matches the schema of the table to avoid issues like HIVE-26877.Here we can not use the
ObjectInspectorcreated duringParquetHiveSerDeobject because if the data is coming from a TEXT format table then theObjectInspectorcoming from will be having of typeLazy*ObjectInspectorwhere as theObjectInspectorcreated duringParquetHiveSerDeobject is of typeWritable*ObjectInspector.For example consider string data type, the ObjectInspector coming from Operator is of type
LazyStringObjectInspectorwhich maintains the corresponding primitive java object asLazyStringwhere asWritableStringObjectInspectormaintains the primitive java object as 'Text' which results inClassCastException.Why are the changes needed?
Here the type coming from Operator is decimal(17,7)(join condition derived type) but the table schema has decimal(12,7) in the q file test and for parquet files the decimal values are written as binaries with fixed length, for writing decimal(12,7) data requires 6 bytes but decimal(17,7) requires 8 bytes due to the difference between the size it is throwing exception. So to write the data properly in the destination parquet table the changes are required.
Does this PR introduce any user-facing change?
No
How was this patch tested?
mvn -Dtest=TestMiniLlapLocalCliDriver -Dqfile=ctas_dec_pre_scale_issue.q -pl itests/qtest -Pitests