chore: Remove config option for native_iceberg_compat#4019
Conversation
|
Thanks @andygrove 👀 |
mbutrovich
left a comment
There was a problem hiding this comment.
Thanks @andygrove, let's get that CI matrix and tech debt down.
comphead
left a comment
There was a problem hiding this comment.
it is LGTM, I was thinking if we need it for some debug/benchmark usecases but nothing comes to my head. So let it go
|
Is there still a functionality gap between |
So #3720 and #3442. We should be able resolve #3442 now that #4011 is merged. |
#3443 is fixed in #4038, so that just leaves #3720 which is just that Comet is more lenient with type widening and/or throws different exception than Spark. Not a correctness issue. |
|
@parthchandra ok if we merge this one? |
|
My general recommendation would be that we enable ignored tests before dropping Here's Claude's summary of ignored tests - 1.
|
| Test Name | Diffs |
|---|---|
join key with multiple references on the filtering plan |
4.0.1 |
SPARK-43402: FileSourceScanExec supports push down data filter with scalar subquery |
4.0.1 |
alter temporary view should follow current storeAnalyzedPlanForView config |
4.0.1 |
AdaptiveQueryExecSuite (#3442)
| Test Name | Diffs |
|---|---|
static scan metrics |
3.4.3, 3.5.8, 4.0.1 |
FileBasedDataSourceSuite (#3321)
| Test Name | Diffs |
|---|---|
Enabling/disabling ignoreMissingFiles using parquet (conditionally tagged only when format == "parquet") |
4.0.1 |
Enabling/disabling ignoreCorruptFiles |
4.0.1 |
ParquetFilterSuite (3.4.3 only)
| Test Name | Diffs |
|---|---|
filter pushdown - StringPredicate (tagged IgnoreCometNativeDataFusion in 3.4.3; IgnoreCometNativeScan in 3.5.8/4.0.1) |
3.4.3 |
ParquetSchemaSuite (#3720)
| Test Name | Diffs |
|---|---|
SPARK-35640: read binary as timestamp should throw schema incompatible error |
3.4.3, 3.5.8, 4.0.1 |
SPARK-35640: int as long should throw schema incompatible error |
3.4.3, 3.5.8 |
SPARK-47447: read TimestampLTZ as TimestampNTZ |
4.0.1 |
SPARK-36182: can't read TimestampLTZ as TimestampNTZ |
3.4.3, 3.5.8 |
SPARK-34212 Parquet should read decimals correctly |
3.4.3, 3.5.8, 4.0.1 |
row group skipping doesn't overflow when reading into larger type |
3.4.3, 3.5.8, 4.0.1 |
ParquetSchemaEvolutionSuite (#3720)
| Test Name | Diffs |
|---|---|
schema mismatch failure error message for parquet vectorized reader |
3.4.3, 3.5.8, 4.0.1 |
SPARK-45604: schema mismatch failure error on timestamp_ntz to array<timestamp_ntz> |
3.4.3, 3.5.8, 4.0.1 |
ParquetTypeWideningSuite (#3321)
| Test Name | Diffs |
|---|---|
parquet widening conversion DateType -> TimestampNTZType (conditionally tagged) |
4.0.1 |
unsupported parquet conversion $fromType -> $toType (multiple type combos) |
4.0.1 |
unsupported parquet timestamp conversion $fromType ($outputTimestampType) -> $toType |
4.0.1 |
parquet decimal precision change Decimal($fromPrecision, 2) -> Decimal($toPrecision, 2) |
4.0.1 |
parquet decimal precision and scale change Decimal($fromPrecision, $fromScale) -> Decimal($toPrecision, $toScale) |
4.0.1 |
2. assume() — runtime skip
ParquetRowIndexSuite (#3886) — 4.0.1 only
| Test Name | Condition |
|---|---|
invalid row index column type - ${conf.desc} |
Skipped when COMET_NATIVE_SCAN_IMPL is SCAN_NATIVE_DATAFUSION or SCAN_AUTO. Comet throws RuntimeException instead of SparkException. |
CometExpressionSuite — Comet's own test suite
| Test Name | Condition |
|---|---|
get_struct_field - select primitive fields |
Skipped when scanImpl == SCAN_AUTO && Spark 4.0+ |
get_struct_field - select subset of struct |
Skipped when scanImpl == SCAN_AUTO && Spark 4.0+ |
get_struct_field - read entire struct |
Skipped when scanImpl == SCAN_AUTO && Spark 4.0+ |
Summary by Tracking Issue
| Issue | Count | Description |
|---|---|---|
| #3321 | ~12 | Schema evolution, corrupt/missing files, AQE, type widening |
| #3720 | ~8 | Schema mismatch errors, decimal reads, row group skipping |
| #3442 | 1 | Static scan metrics with DPP |
| #3886 | 1 | Row index column type error type mismatch |
| (no issue) | 5 | Filter pushdown / accumulator tests (IgnoreCometNativeScan) |
| (no issue) | 3 | get_struct_field tests (auto + Spark 4.0+ only) |
|
More Claude analysis on schema mismatch. Claude recommends we explicitly check that the following tests fail with a different message instead of actually succeeding (because the results will be wrong) -
|
|
One final word from Claude -
|
Alright, I have moved this to draft for now. I'll take another more detailed look at #3720 soon. Thanks @parthchandra. |
|
Here is an update on the remaining compatibility differences. Here are tests that are still ignored: % grep -B 1 "IgnoreCometNativeDataFusion(\"" dev/diffs/4.1.1.diff
+ test("SPARK-35640: read binary as timestamp should throw schema incompatible error",
+ IgnoreCometNativeDataFusion("https://github.com/apache/datafusion-comet/issues/3720")) {
--
+ test("SPARK-47447: read TimestampLTZ as TimestampNTZ",
+ IgnoreCometNativeDataFusion("https://github.com/apache/datafusion-comet/issues/3720")) {
--
+ test("SPARK-34212 Parquet should read decimals correctly",
+ IgnoreCometNativeDataFusion("https://github.com/apache/datafusion-comet/issues/3720")) {
--
+ test("row group skipping doesn't overflow when reading into larger type",
+ IgnoreCometNativeDataFusion("https://github.com/apache/datafusion-comet/issues/3720")) {
--
+ test("schema mismatch failure error message for parquet vectorized reader",
+ IgnoreCometNativeDataFusion("https://github.com/apache/datafusion-comet/issues/3720")) {
--
+ test("SPARK-45604: schema mismatch failure error on timestamp_ntz to array<timestamp_ntz>",
+ IgnoreCometNativeDataFusion("https://github.com/apache/datafusion-comet/issues/3720")) {
--
+ test(s"unsupported parquet conversion $fromType -> $toType",
+ IgnoreCometNativeDataFusion("https://github.com/apache/datafusion-comet/issues/3720")) {
--
+ test(s"unsupported parquet conversion $fromType -> $toType",
+ IgnoreCometNativeDataFusion("https://github.com/apache/datafusion-comet/issues/3720")) {
--
+ test(s"unsupported parquet timestamp conversion $fromType ($outputTimestampType) -> $toType",
+ IgnoreCometNativeDataFusion("https://github.com/apache/datafusion-comet/issues/3720")) {
--
+ s"parquet decimal precision change Decimal($fromPrecision, 2) -> Decimal($toPrecision, 2)",
+ IgnoreCometNativeDataFusion("https://github.com/apache/datafusion-comet/issues/3720")) {
--
+ s"Decimal($toPrecision, $toScale)",
+ IgnoreCometNativeDataFusion("https://github.com/apache/datafusion-comet/issues/3720")I have two pending PRs related to this, which enable some of the tests and provide better documentation around differences. |
COMET_NATIVE_SCAN_IMPL config now only allows auto or native_datafusion. CometScanRule always uses native_datafusion. Tests no longer parameterize on scan impl and iceberg_compat-specific tests are removed. Golden files are regenerated in a follow-up commit.
Plan-stability suites no longer parameterize on scan impl, so each query has a single golden directory. native_datafusion dirs are renamed to the unsuffixed name, and native_iceberg_compat dirs are removed.
3549895 to
0b664bd
Compare
# Conflicts: # .github/workflows/pr_build_linux.yml # .github/workflows/spark_sql_test.yml
|
Thanks @andygrove It looks #3720 has multiple cases beyond schema mismatch error? For example, I see It may make sense to break down #3720 into smaller categories. |
I think that one is fixed in the other pr that I linked to earlier |
Yes, fixed in #4229 |
Sure, we can do that, but I don't think it changes the fact that we want to remove Once #4229 is merged, I think we are down to 7 tests ignored, and these are edge/corner cases which are already documented in the compatibility guide. These could eventually be fixed with upstream changes to arrow-rs and datafusion if we really think these are worth the investment. |
I burned some tokens working on #4229 overnight and maybe have all the tests unignored for 4.1.1 ... I'll keep my eye on CI today ... would be great to get reviews on that one so we can unblock this work |
|
There is now only one correctness issue left with |
Which issue does this PR close?
Partially address #4020.
Rationale for this change
Removing support for
native_iceberg_compatreduces the maintenance burden of the project.This PR makes it impossible for users to select
native_iceberg_compatand stops running tests for that scan impl.Subsequent PRs will remove the implementation code.
What changes are included in this PR?
COMET_NATIVE_SCAN_IMPLconfig now only allowsautoornative_datafusionCometScanRuleno longer uses the value of config optionCOMET_NATIVE_SCAN_IMPLand just usesnative_datafusionscannative_iceberg_compatnative_iceberg_compathave been removedHow are these changes tested?