HIVE-26877: Parquet CTAS with JOIN on decimals with different precision/scale fail. #6229

mdayakar · 2025-12-09T07:03:39Z

HIVE-26877: Parquet CTAS with JOIN on decimals with different precision/scale fail.

What changes were proposed in this pull request?

When ParquetHiveSerDe serializes a row, it now checks whether the type of the ObjectInspector it receives (which comes from the Operator and reflects the data) matches the type of the ObjectInspector created during ParquetHiveSerDe initialisation (which comes from the table schema). If the types differ, it uses the ObjectInspector from the initialisation phase, because that one matches the table schema. This avoids issues like HIVE-26877.

We cannot always use the ObjectInspector created during ParquetHiveSerDe initialisation: if the data comes from a TEXT format table, the ObjectInspector coming from the Operator is of type Lazy*ObjectInspector, whereas the ObjectInspector created during ParquetHiveSerDe initialisation is of type Writable*ObjectInspector.

For example, for the string data type, the ObjectInspector coming from the Operator is a LazyStringObjectInspector, which holds the corresponding primitive Java object as LazyString, whereas WritableStringObjectInspector holds it as 'Text'. Reading a LazyString through the Writable inspector results in a ClassCastException.
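The mismatch described above can be reproduced with a toy model. This is only a sketch: `LazyString`, `Text`, `StringInspector` and the two inspector classes below are hypothetical stand-ins written for this example, not the real Hive or Hadoop classes. It shows that two inspectors can report the same type name while expecting different Java objects.

```java
// Toy stand-ins only; none of these classes are the real Hive/Hadoop ones.
final class InspectorMismatchDemo {
    // Returns true if reading `data` through `inspector` throws a ClassCastException.
    static boolean failsToRead(StringInspector inspector, Object data) {
        try {
            inspector.read(data);
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // A row value as it might arrive from a TEXT-format source (assumption).
        Object row = new LazyString("Bob");
        System.out.println("lazy inspector fails:     " + failsToRead(new LazyStringInspector(), row));
        System.out.println("writable inspector fails: " + failsToRead(new WritableStringInspector(), row));
    }
}

final class LazyString {            // plays the role of Hive's LazyString
    final String value;
    LazyString(String value) { this.value = value; }
}

final class Text {                  // plays the role of Hadoop's Text
    final String value;
    Text(String value) { this.value = value; }
}

interface StringInspector {
    String typeName();
    String read(Object data);
}

final class LazyStringInspector implements StringInspector {
    public String typeName() { return "string"; }
    // Expects the backing object to be a LazyString.
    public String read(Object data) { return ((LazyString) data).value; }
}

final class WritableStringInspector implements StringInspector {
    public String typeName() { return "string"; }
    // Expects the backing object to be a Text, so a LazyString cannot be cast.
    public String read(Object data) { return ((Text) data).value; }
}
```

Both toy inspectors report the type name "string", so a comparison of types alone would treat them as equal, yet only the lazy one can read the lazy object.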

Why are the changes needed?

In the q file test, the type coming from the Operator is decimal(17,7) (the type derived for the join condition), but the table schema has decimal(12,7). Parquet writes decimal values as fixed-length binaries: decimal(12,7) requires 6 bytes but decimal(17,7) requires 8 bytes, and this size difference causes the exception. The changes are required to write the data correctly to the destination Parquet table.
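The 6-byte and 8-byte figures can be checked with a little arithmetic. The helper below is an illustrative sketch, not Hive or Parquet code; `DecimalWidth` and `minBytesForPrecision` are names made up for this example. It assumes the unscaled value is stored as a signed two's-complement integer and finds the smallest byte count that can hold every value of a given precision.

```java
import java.math.BigInteger;

// Illustrative helper; the class and method names are hypothetical.
final class DecimalWidth {
    // Smallest n such that a signed two's-complement n-byte integer can hold
    // the largest unscaled value with `precision` decimal digits.
    static int minBytesForPrecision(int precision) {
        BigInteger maxUnscaled = BigInteger.TEN.pow(precision).subtract(BigInteger.ONE);
        int n = 1;
        // A signed n-byte integer has 8n - 1 bits available for the magnitude.
        while (maxUnscaled.bitLength() > 8 * n - 1) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println("decimal(12,7): " + minBytesForPrecision(12) + " bytes");
        System.out.println("decimal(17,7): " + minBytesForPrecision(17) + " bytes");
    }
}
```

The scale does not affect the width, only the precision does, which is why two decimals with the same scale of 7 still end up with different fixed lengths.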

Does this PR introduce any user-facing change?

No

How was this patch tested?

mvn -Dtest=TestMiniLlapLocalCliDriver -Dqfile=ctas_dec_pre_scale_issue.q -pl itests/qtest -Pitests

sonarqubecloud · 2025-12-09T15:18:54Z

Quality Gate passed

Issues
2 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

soumyakanti3578 · 2025-12-09T22:41:10Z

ql/src/java/org/apache/hadoop/hive/ql/io/parquet/serde/ParquetHiveSerDe.java

+    if (!ObjectInspectorUtils.compareTypes(writableObjectInspector, objInspector)) {
+      parquetRow.inspector = (StructObjectInspector) writableObjectInspector;
+    } else {
+      parquetRow.inspector = (StructObjectInspector) objInspector;
+    }


Does this mean that we can always just use writableObjectInspector? And if so, then do we even need objInspector?
In the else block, can we use writableObjectInspector if its type is same as objInspector?

mdayakar · Dec 10, 2025

Yes, my initial commit (commit 1) had the same solution, but 53 test cases failed with it. Analysing those failures showed that if the data comes from a TEXT format table, the ObjectInspector coming from the Operator is of type Lazy*ObjectInspector (for some types such as Int and String), whereas the ObjectInspector created during ParquetHiveSerDe initialisation is of type Writable*ObjectInspector.

For example, for the string data type, the ObjectInspector coming from the Operator is a LazyStringObjectInspector, which holds the corresponding primitive Java object as LazyString, whereas WritableStringObjectInspector holds it as Text. This results in a ClassCastException when getting the actual data.

So we cannot always use the ObjectInspector created during the initialization phase.

soumyakanti3578 · Dec 10, 2025

I don't know much about this, to be honest (I will have to look into it in detail), but do you think it would be a better solution to somehow pass the correct objInspector to this method?
As it stands, we call the serialize method and ask it to use a particular inspector, and then, with this patch, we may not even use that inspector.

thomasrebele · Dec 11, 2025

Indeed, it would be better for performance to change the initialize method so that it sets objInspector to the one that is actually used.

zabetak

The problem here looks more like a bug in the plan/type system than at the SerDe level, so I get the feeling that the PR should not be touching the SerDe classes but rather the planner.

zabetak · 2025-12-11T11:32:00Z

ql/src/java/org/apache/hadoop/hive/ql/io/parquet/serde/ParquetHiveSerDe.java

+    if (!ObjectInspectorUtils.compareTypes(writableObjectInspector, objInspector)) {
+      parquetRow.inspector = (StructObjectInspector) writableObjectInspector;
+    } else {
+      parquetRow.inspector = (StructObjectInspector) objInspector;
+    }


The if statement here is a bit strange, because it says: whenever the types disagree I will use the original inspector, and when the types are equal I will trust what is given to me.

The fact that all tests pass implies that most of the time (if not always) the types are equal in existing tests, so we are essentially hitting the else branch.

Type inequality seems to be an outlier, and ctas_dec_pre_scale_issue.q may be the only test that covers it. Does the proposed solution work if you add more column types to the table/queries?

CREATE TABLE table_a (cint int, cstring string, cdec decimal(12,7));
INSERT INTO table_a VALUES (100, 'Bob', 12345.6789101);
...
CREATE TABLE target AS SELECT ta.cint, ta.cstring, ta.cdec FROM table_a ta ...

zabetak · 2025-12-11T11:35:30Z

ql/src/test/results/clientpositive/llap/ctas_dec_pre_scale_issue.q.out

+                Select Operator
+                  expressions: _col0 (type: decimal(12,7))
+                  outputColumnNames: col1
+                  Statistics: Num rows: 1 Data size: 112 Basic stats: COMPLETE Column stats: COMPLETE


This Select Operator is interesting. Although it is not involved in writing to the table, it seems to have the correct data type for the column that results from the join. If we had something similar before calling the File Output Operator, we wouldn't hit the problem in the first place.

HIVE-26877: Parquet CTAS with JOIN on decimals with different precision/scale fail

3540f47

asf-ci-hive added tests pending tests unstable and removed tests pending labels Dec 9, 2025

Fixed test failures

ff0d517

asf-ci-hive added tests pending and removed tests unstable labels Dec 9, 2025

asf-ci-hive added tests passed and removed tests pending labels Dec 9, 2025

soumyakanti3578 reviewed Dec 9, 2025

zabetak reviewed Dec 11, 2025
