2PC and staging output #3068

Open
Shekharrajak wants to merge 10 commits into apache:main from Shekharrajak:feature/issue-3015

@Shekharrajak
Contributor

Which issue does this PR close?

Closes #3015.

Rationale for this change

The native Parquet writer needed a fix to use output_path as the base directory for file writes when work_dir is not set. Without this fix, files were being written to root (/) instead of the intended output directory.
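The fallback described above can be sketched roughly as follows. This is a minimal illustration, not the code in parquet_writer.rs: the function names `resolve_base_dir` and `part_file_path` are hypothetical, and treating a whitespace-only `work_dir` as unset is an assumption.

```rust
use std::path::PathBuf;

/// Hypothetical helper (not from the PR): pick the directory that part files
/// are written under. An empty `work_dir` is treated as "not set".
fn resolve_base_dir(work_dir: &str, output_path: &str) -> PathBuf {
    if work_dir.trim().is_empty() {
        // Fall back to the job's output path instead of an empty base.
        PathBuf::from(output_path)
    } else {
        PathBuf::from(work_dir)
    }
}

/// Hypothetical helper: full path of one part file under the resolved base.
fn part_file_path(work_dir: &str, output_path: &str, file_name: &str) -> PathBuf {
    resolve_base_dir(work_dir, output_path).join(file_name)
}
```

With an empty `work_dir`, `part_file_path("", "/data/out", "part-00000.parquet")` resolves under `/data/out` rather than under an empty base directory.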

What changes are included in this PR?

  1. Protobuf: Added staging_file_path field to ParquetWriter message for future 2PC support
  2. Native Rust: Fixed parquet_writer.rs to use output_path as fallback when work_dir is empty
  3. Scala/JVM: Simplified CometNativeWriteExec to write directly to output path
  4. Tests: Added CometParquetWriter2PCSuite with basic write functionality tests

How are these changes tested?

Added CometParquetWriter2PCSuite with 5 tests, including:

  • Basic successful write creates files in output directory
  • Multiple concurrent tasks write without file conflicts
  • Various data types write correctly
  • Overwrite mode replaces existing files
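One way concurrent tasks can avoid file conflicts is for each task to derive its own file name. The sketch below assumes a naming scheme keyed by partition id and job id; both the scheme and the names `part_file_name` and `names_are_unique` are hypothetical and are not taken from the PR or from Spark.

```rust
use std::collections::HashSet;

/// Hypothetical naming scheme: one part file per task, keyed by the
/// partition id and a job id shared by all tasks of the same write.
fn part_file_name(job_id: &str, partition_id: u32) -> String {
    format!("part-{:05}-{}.parquet", partition_id, job_id)
}

/// Returns true when every partition id in the slice maps to a distinct
/// file name, which is the property a conflict test would check.
fn names_are_unique(job_id: &str, partitions: &[u32]) -> bool {
    let mut seen = HashSet::new();
    partitions
        .iter()
        .all(|p| seen.insert(part_file_name(job_id, *p)))
}
```

Under this assumed scheme, two tasks collide only if they share both the job id and the partition id.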

@Shekharrajak changed the title from "Feature/issue 3015" to "2PC and staging output" on Jan 11, 2026
outputPath: String,
committer: Option[FileCommitProtocol] = None,
jobTrackerID: String = Utils.createTempDir().getName)
case class CometNativeWriteExec(nativeOp: Operator, child: SparkPlan, outputPath: String)
Contributor Author

Basic execution that delegates to the native writer.

@wForget
Member

wForget commented Jan 12, 2026

@Shekharrajak Thank you for your work. The file commit protocol has already been implemented in #2828, and work_dir is the staging dir. Is my understanding correct? cc @comphead @andygrove

@Shekharrajak
Contributor Author

Shekharrajak commented Jan 12, 2026

@Shekharrajak Thank you for your work. The file commit protocol has already been implemented in #2828, and work_dir is the staging dir. Is my understanding correct? cc @comphead @andygrove

I think the original implementation duplicated what InsertIntoHadoopFsRelationCommand already does. With the changes in this PR, we no longer manage FileCommitProtocol ourselves; that is delegated to Spark.

@codecov-commenter

codecov-commenter commented Jan 12, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 59.72%. Comparing base (f09f8af) to head (a6e5c34).
⚠️ Report is 927 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #3068      +/-   ##
============================================
+ Coverage     56.12%   59.72%   +3.59%     
- Complexity      976     1470     +494     
============================================
  Files           119      175      +56     
  Lines         11743    16156    +4413     
  Branches       2251     2681     +430     
============================================
+ Hits           6591     9649    +3058     
- Misses         4012     5153    +1141     
- Partials       1140     1354     +214     

@Shekharrajak
Contributor Author

Checks are looking good now. Please review.

@comphead
Contributor

Thanks @Shekharrajak I'll check it this week

@comphead
Contributor

Well, what I'm thinking is that, to assess the PR correctly, we need to add unit tests that check that the _temporary folder is created, like CommitFailureTestRelationSuite in Spark or other Spark writer suites. I'll track this in #3209.
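As a rough illustration of the staging behavior such a test would look for, here is a minimal two-phase sketch: a task first writes under a `_temporary` directory and the file only reaches the output directory on commit. `StagedTask` and its methods are hypothetical stand-ins, not Spark's FileCommitProtocol or Comet's writer, and the `_temporary/<attempt_id>/` layout is an assumption made for the example.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical stand-in for a task-level committer (illustration only).
struct StagedTask {
    staging_file: PathBuf,
    final_file: PathBuf,
}

impl StagedTask {
    /// Phase 1: write the task's bytes under `<output>/_temporary/<attempt_id>/`.
    fn write(
        output_dir: &Path,
        attempt_id: &str,
        file_name: &str,
        bytes: &[u8],
    ) -> io::Result<Self> {
        let staging_dir = output_dir.join("_temporary").join(attempt_id);
        fs::create_dir_all(&staging_dir)?;
        let staging_file = staging_dir.join(file_name);
        fs::write(&staging_file, bytes)?;
        Ok(StagedTask {
            staging_file,
            final_file: output_dir.join(file_name),
        })
    }

    /// Phase 2 on success: move the staged file to its final location.
    fn commit(self) -> io::Result<PathBuf> {
        fs::rename(&self.staging_file, &self.final_file)?;
        Ok(self.final_file)
    }

    /// Phase 2 on failure: discard the staged file so nothing reaches the output.
    fn abort(self) -> io::Result<()> {
        fs::remove_file(&self.staging_file)
    }
}
```

A test in this spirit would assert that the file exists under `_temporary` after `write`, that it is absent from the output directory until `commit`, and that `abort` leaves no file in the output directory.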

@Shekharrajak
Contributor Author

Well, what I'm thinking is that, to assess the PR correctly, we need to add unit tests that check that the _temporary folder is created, like CommitFailureTestRelationSuite in Spark or other Spark writer suites. I'll track this in #3209.

Thanks for the review. I added the relevant tests in commit d343539.

@Shekharrajak
Contributor Author

@comphead please have a look and trigger the workflow.

Resolved conflicts:
- native/proto/src/proto/operator.proto: Merged object_store_options (field 8) and staging_file_path (field 9) into ParquetWriter message
- spark/src/main/scala/org/apache/spark/sql/comet/CometNativeWriteExec.scala: Accepted upstream FileCommitProtocol integration
@Shekharrajak
Contributor Author

Checks are green. Looks fine.

@comphead
Contributor

Thanks @Shekharrajak I'll try to run this PR with Apache Spark writer tests

@andygrove
Member

Thanks @Shekharrajak I'll try to run this PR with Apache Spark writer tests

@comphead did you get a chance to do this?

@comphead
Contributor

comphead commented Mar 16, 2026

@comphead did you get a chance to do this?

I will take a look this week. I was preparing a CI writer pipeline (https://github.com/apache/datafusion-comet/compare/fdf1a1b9030451fa7f6e509e8411b68e232f8d01..aaaa6a6b9422a15f7baa56551b3dcba8b931a6af) and accidentally rewrote it.

@github-actions

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

@github-actions (bot) added the Stale label on May 16, 2026
@github-actions (bot) removed the Stale label on May 18, 2026
Development

Successfully merging this pull request may close these issues.

Comet writer should support 2PC and staging output

5 participants