2PC and staging output #3068
Conversation
| outputPath: String, | ||
| committer: Option[FileCommitProtocol] = None, | ||
| jobTrackerID: String = Utils.createTempDir().getName) | ||
| case class CometNativeWriteExec(nativeOp: Operator, child: SparkPlan, outputPath: String) |
There was a problem hiding this comment.
basic execution that delegates to native writer
|
@Shekharrajak Thank you for your work. The file commit protocol has already been implemented in #2828, and work_dir is the staging dir. Is my understanding correct? cc @comphead @andygrove |
I think current original implementation duplicated what InsertIntoHadoopFsRelationCommand already does. In this PR code changes, we are not managing FileCommitProtocol ourself but delegated to Spark. |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #3068 +/- ##
============================================
+ Coverage 56.12% 59.72% +3.59%
- Complexity 976 1470 +494
============================================
Files 119 175 +56
Lines 11743 16156 +4413
Branches 2251 2681 +430
============================================
+ Hits 6591 9649 +3058
- Misses 4012 5153 +1141
- Partials 1140 1354 +214 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
f645ba3 to
4ad285e
Compare
|
Checks are looking good now. Please review. |
|
Thanks @Shekharrajak I'll check it this week |
|
Well what Im' thinking to assess correctly the PR we need to add unit tests to see that |
e760aff to
9532302
Compare
9532302 to
d343539
Compare
|
@comphead please have a look and trigger the workflow. |
Resolved conflicts: - native/proto/src/proto/operator.proto: Merged object_store_options (field 8) and staging_file_path (field 9) into ParquetWriter message - spark/src/main/scala/org/apache/spark/sql/comet/CometNativeWriteExec.scala: Accepted upstream FileCommitProtocol integration
dd2f914 to
a6e5c34
Compare
|
Checks are green. Looks fine. |
|
Thanks @Shekharrajak I'll try to run this PR with Apache Spark writer tests |
@comphead did you get a change to do this? |
will take a look this week, I was preparing a CI writer pipeline https://github.com/apache/datafusion-comet/compare/fdf1a1b9030451fa7f6e509e8411b68e232f8d01..aaaa6a6b9422a15f7baa56551b3dcba8b931a6af and rewrote it accidentally |
|
Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days. |
Which issue does this PR close?
Closes #3015.
Rationale for this change
The native Parquet writer needed a fix to use
output_pathas the base directory for file writes whenwork_diris not set. Without this fix, files were being written to root (/) instead of the intended output directory.What changes are included in this PR?
staging_file_pathfield toParquetWritermessage for future 2PC supportoutput_pathas fallback whenwork_diris emptyCometNativeWriteExecto write directly to output pathCometParquetWriter2PCSuitewith basic write functionality testsHow are these changes tested?
Added
CometParquetWriter2PCSuitewith 5 tests: