Skip to content

Conversation

@mdayakar
Copy link
Contributor

@mdayakar mdayakar commented Dec 9, 2025

HIVE-26877: Parquet CTAS with JOIN on decimals with different precision/scale fail.

What changes were proposed in this pull request?

While serializing/writing the data using ParquetHiveSerDe then verifying whether the type of the ObjectInspector object received(this is coming from Operator based on the data) is same as the type of the ObjectInspector created during ParquetHiveSerDe initialisation(this is coming from Table schema), if not same then using the ObjectInspector available during initialisation phase as it matches the schema of the table to avoid issues like HIVE-26877.

Here we can not use the ObjectInspector created during ParquetHiveSerDe object because if the data is coming from a TEXT format table then the ObjectInspector coming from will be having of type Lazy*ObjectInspector where as the ObjectInspector created during ParquetHiveSerDe object is of type Writable*ObjectInspector.

For example consider string data type, the ObjectInspector coming from Operator is of type LazyStringObjectInspector which maintains the corresponding primitive java object as LazyString where as WritableStringObjectInspector maintains the primitive java object as 'Text' which results in ClassCastException.

Why are the changes needed?

Here the type coming from Operator is decimal(17,7)(join condition derived type) but the table schema has decimal(12,7) in the q file test and for parquet files the decimal values are written as binaries with fixed length, for writing decimal(12,7) data requires 6 bytes but decimal(17,7) requires 8 bytes due to the difference between the size it is throwing exception. So to write the data properly in the destination parquet table the changes are required.

Does this PR introduce any user-facing change?

No

How was this patch tested?

mvn -Dtest=TestMiniLlapLocalCliDriver -Dqfile=ctas_dec_pre_scale_issue.q -pl itests/qtest -Pitests

@sonarqubecloud
Copy link

sonarqubecloud bot commented Dec 9, 2025

Comment on lines +153 to +157
if (!ObjectInspectorUtils.compareTypes(writableObjectInspector, objInspector)) {
parquetRow.inspector = (StructObjectInspector) writableObjectInspector;
} else {
parquetRow.inspector = (StructObjectInspector) objInspector;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean that we can always just use writableObjectInspector? And if so, then do we even need objInspector?
In the else block, can we use writableObjectInspector if its type is same as objInspector?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah actually my initial commit(commit 1) has the same solution but it has some impacts(53 test cases are failed) when analyzed those failures found that if the data is coming from a TEXT format table then the ObjectInspector coming from Operator is having of type Lazy*ObjectInspector(for some types like Int, String) where as the ObjectInspector created during ParquetHiveSerDe object is of type Writable*ObjectInspector.

For example consider string data type, the ObjectInspector coming from Operator is of type LazyStringObjectInspector which maintains the corresponding primitive java object as LazyString where as WritableStringObjectInspector maintains the primitive java object as Text which results in ClassCastException while getting the actual data.

So we can not always use the ObjectInspector created during initialization phase.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know much about this tbh (I will have to look into this in detail), but do you think it's a better solution to somehow pass the correct objInspector to this method?
Because what it looks like is we are calling the serialize method and asking it to use a particular inspector. And then, with this patch, we may not even use that inspector.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, it would be better for the performance to change the initialize method so that it sets the objInspector to the one that's actually used.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The if statement here is bit strange cause it says that whenever there is type disagreement I will use the original inspector and when types are equal I will trust what is given to me.

The fact that all tests pass implies that most of the time (if not always) in existing tests the types are equal so we are essentially hitting the else branch.

Type inequality seems to be an outlier and maybe ctas_dec_pre_scale_issue.q is the only test that covers it. Does the proposed solution work if you add more column types in the table/queries?

CREATE TABLE table_a (cint int, cstring string, cdec decimal(12,7));
INSERT INTO table_a(100, 'Bob', 12345.6789101);
...
CREATE TABLE target AS
SELECT ta.cint, ta.cstring, ta.cdec
FROM table_a ta
...

Copy link
Member

@zabetak zabetak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem here seems more like a bug in the plan/type system rather than the SerDe level so I get the feeling that the PR should not be touching the SerDe classes but rather the planner.

Comment on lines +153 to +157
if (!ObjectInspectorUtils.compareTypes(writableObjectInspector, objInspector)) {
parquetRow.inspector = (StructObjectInspector) writableObjectInspector;
} else {
parquetRow.inspector = (StructObjectInspector) objInspector;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The if statement here is bit strange cause it says that whenever there is type disagreement I will use the original inspector and when types are equal I will trust what is given to me.

The fact that all tests pass implies that most of the time (if not always) in existing tests the types are equal so we are essentially hitting the else branch.

Type inequality seems to be an outlier and maybe ctas_dec_pre_scale_issue.q is the only test that covers it. Does the proposed solution work if you add more column types in the table/queries?

CREATE TABLE table_a (cint int, cstring string, cdec decimal(12,7));
INSERT INTO table_a(100, 'Bob', 12345.6789101);
...
CREATE TABLE target AS
SELECT ta.cint, ta.cstring, ta.cdec
FROM table_a ta
...

Comment on lines +128 to +131
Select Operator
expressions: _col0 (type: decimal(12,7))
outputColumnNames: col1
Statistics: Num rows: 1 Data size: 112 Basic stats: COMPLETE Column stats: COMPLETE
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This Select Operator is interesting. Although it is not involved when writing to the table it seems to have the correct data type for the column that results from the join so if we had something similar before calling the File Output Operator we wouldn't hit the problem in the first place.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants