@andygrove andygrove commented Jan 18, 2026

Summary

  • Enables support for complex types (arrays, maps, structs) in Comet's native Parquet writer
  • Removes the blocking check that previously prevented complex types
  • Adds comprehensive test coverage for complex types

Changes

  • Remove complex type blocking check in CometDataWritingCommand.scala
  • Add 12 new tests for complex types in CometParquetWriterSuite.scala:
    • Basic complex types (array, struct, map)
    • Nested complex types (array of structs, struct containing array, map with struct values, deeply nested)
    • Nullable complex types with nulls at various nesting levels
    • Complex types containing decimal and temporal types
    • Empty arrays and maps
    • Fuzz testing with randomly generated complex type schemas
  • Update documentation to reflect complex type support

Test plan

  • Tests verify round-trip compatibility (write with Comet, read with Spark/Comet)
  • Fuzz testing with randomly generated schemas
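
For readers unfamiliar with schema fuzzing, the idea behind "randomly generated complex type schemas" can be sketched roughly as below. This is an illustration only and not the generator used in CometParquetWriterSuite.scala: the `FieldType` model, `randomType`, and `depth` are hypothetical names, and an actual test would build Spark `DataType` values instead.

```scala
import scala.util.Random

object RandomSchemaSketch {
  // Hypothetical, simplified type model (assumption: real tests use Spark's DataType classes).
  sealed trait FieldType
  case object IntT extends FieldType
  case object StringT extends FieldType
  case object DecimalT extends FieldType
  case object TimestampT extends FieldType
  final case class ArrayT(element: FieldType) extends FieldType
  final case class MapT(key: FieldType, value: FieldType) extends FieldType
  final case class StructT(fields: List[(String, FieldType)]) extends FieldType

  private val primitives: Vector[FieldType] = Vector(IntT, StringT, DecimalT, TimestampT)

  private def randomPrimitive(rng: Random): FieldType =
    primitives(rng.nextInt(primitives.length))

  /** Generates a random type nested at most `maxDepth` levels deep. */
  def randomType(rng: Random, maxDepth: Int): FieldType =
    if (maxDepth <= 0) randomPrimitive(rng)
    else rng.nextInt(4) match {
      case 0 => randomPrimitive(rng)
      case 1 => ArrayT(randomType(rng, maxDepth - 1))
      // Map keys are kept primitive to keep the sketch simple.
      case 2 => MapT(randomPrimitive(rng), randomType(rng, maxDepth - 1))
      case _ =>
        // Structs get between one and three fields named f0, f1, ...
        val n = 1 + rng.nextInt(3)
        StructT(List.tabulate(n)(i => s"f$i" -> randomType(rng, maxDepth - 1)))
    }

  /** Nesting depth: 0 for primitives, 1 + deepest child otherwise. */
  def depth(t: FieldType): Int = t match {
    case ArrayT(e)   => 1 + depth(e)
    case MapT(k, v)  => 1 + math.max(depth(k), depth(v))
    case StructT(fs) => 1 + fs.map(f => depth(f._2)).max
    case _           => 0
  }
}
```

Seeding the `Random` makes a failing schema reproducible, which is the usual reason to thread the generator through explicitly. A round-trip test would then write data for the generated schema and compare what is read back.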

🤖 Generated with Claude Code

Enables support for complex types (arrays, maps, structs) in Comet's native
Parquet writer by removing the blocking check that previously prevented them.

Changes:
- Remove complex type blocking check in CometDataWritingCommand.scala
- Add comprehensive test coverage for complex types including:
  - Basic complex types (array, struct, map)
  - Nested complex types (array of structs, struct containing array, etc.)
  - Nullable complex types with nulls at various nesting levels
  - Complex types containing decimal and temporal types
  - Empty arrays and maps
  - Fuzz testing with randomly generated complex type schemas
- Update documentation to reflect complex type support

Co-Authored-By: Claude Opus 4.5 <[email protected]>

codecov-commenter commented Jan 18, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 59.63%. Comparing base (f09f8af) to head (af1d474).
⚠️ Report is 855 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #3214      +/-   ##
============================================
+ Coverage     56.12%   59.63%   +3.50%     
- Complexity      976     1416     +440     
============================================
  Files           119      170      +51     
  Lines         11743    15700    +3957     
  Branches       2251     2595     +344     
============================================
+ Hits           6591     9362    +2771     
- Misses         4012     5021    +1009     
- Partials       1140     1317     +177     


@andygrove andygrove marked this pull request as draft January 18, 2026 20:17
Enable spark.comet.scan.allowIncompatible in complex type tests so that
native_iceberg_compat scan is used (which supports complex types) instead
of falling back to native_comet (which doesn't support complex types).

Co-Authored-By: Claude Opus 4.5 <[email protected]>
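
As a rough illustration of the setting this commit describes, the configuration could be kept as a plain map and applied to a session or test. The key comes from the commit message above; the value `"true"` and the `ComplexTypeTestConf` name are assumptions made for this sketch.

```scala
object ComplexTypeTestConf {
  // Key taken from the commit message; the value "true" is assumed.
  val settings: Map[String, String] = Map(
    "spark.comet.scan.allowIncompatible" -> "true"
  )
}
```

Each entry could be passed to `SparkSession.builder().config(key, value)` when building a session for the complex type tests.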

andygrove commented Jan 18, 2026

With these changes, we can run the PySpark repartition benchmark fully native, and it shows an almost 2x speedup compared to Spark (and also ~2x compared to Comet when writes are disabled and Comet does the columnar-to-row transition).

[Image: plan]

andygrove and others added 3 commits January 18, 2026 14:18
The CI sets COMET_PARQUET_SCAN_IMPL=native_comet for some test profiles,
which overrides the default auto mode. Since native_comet doesn't support
complex types, the scan falls back to Spark's reader which produces
OnHeapColumnVector instead of CometVector, causing the native writer to fail.

This fix explicitly sets COMET_NATIVE_SCAN_IMPL to "auto" in the test
configuration, allowing native_iceberg_compat to be used for complex types.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
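
The failure mode described in this commit can be summarized with a small model. It is a simplified sketch of the behavior stated in the commit message, not Comet's actual selection code: `Reader`, `readerForComplexTypes`, and `nativeWriterCanConsume` are hypothetical names, and the model only covers schemas that contain complex types.

```scala
object ScanFallbackSketch {
  sealed trait Reader
  case object NativeIcebergCompat extends Reader // produces CometVector
  case object SparkReader extends Reader         // produces OnHeapColumnVector

  /** Reader that ends up handling a schema with complex types, per the commit message. */
  def readerForComplexTypes(scanImpl: String): Reader = scanImpl match {
    // native_comet does not support complex types, so the scan falls back to Spark's reader.
    case "native_comet" => SparkReader
    // auto allows native_iceberg_compat to be used for complex types.
    case "auto" | "native_iceberg_compat" => NativeIcebergCompat
    case other => throw new IllegalArgumentException(s"unknown scan impl: $other")
  }

  /** The native writer fails when it is fed OnHeapColumnVector instead of CometVector. */
  def nativeWriterCanConsume(reader: Reader): Boolean = reader match {
    case SparkReader         => false
    case NativeIcebergCompat => true
  }
}
```

Under this model, `native_comet` leads to a reader the native writer cannot consume, while `auto` does not, which is why the test configuration pins the scan implementation explicitly instead of inheriting the CI environment variable.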
@andygrove andygrove marked this pull request as ready for review January 18, 2026 22:42
@andygrove andygrove requested a review from comphead January 18, 2026 22:47
@andygrove

With these changes, we can run the PySpark repartition benchmark fully native, and it shows an almost 2x speedup compared to Spark (and also ~2x compared to Comet when writes are disabled and Comet does the columnar-to-row transition).

@comphead ☝️
