[Spark connector]Added spark 4.1 support #48861

Merged
xinlian12 merged 52 commits into Azure:main from xinlian12:feat/issue-48849-spark-41-support
Apr 23, 2026

Member

@xinlian12 xinlian12 commented Apr 18, 2026

Closes #48849

Summary

Add Spark 4.1 support to the Azure Cosmos DB Spark connector.

Addresses the SPARK-52787 package reorganization, in which HDFSMetadataLog moved from org.apache.spark.sql.execution.streaming to org.apache.spark.sql.execution.streaming.checkpointing. Also adds storeOffsets() to ItemsBatchWriter for compatibility with Databricks Runtime (DBR) 18.1+.

Architecture

New shared base module: azure-cosmos-spark_4

  • Parent module for all Spark 4.x connectors (like azure-cosmos-spark_3 for 3.x)
  • Contains 12 main + 6 test Scala files identical across Spark 4.0 and 4.1

New leaf module: azure-cosmos-spark_4-1_2-13

  • Inherits from azure-cosmos-spark_4
  • 3 override files with updated HDFSMetadataLog import for Spark 4.1
  • 1 override file (ItemsBatchWriter) with storeOffsets() for DBR 18.1+
  • Uses maven-resources-plugin to copy shared sources from spark_3 excluding the 4 overridden files
  • Sets maven.compiler.source/target=17 (Spark 4.1 APIs reference java.lang.Record)
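
The source-copy step in the list above could look roughly like the following POM fragment. This is only a sketch: the execution id, the phase, the relative path to the spark_3 sources, the output directory, and the exclude pattern are all assumptions made for illustration, not values taken from this PR.

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-resources-plugin</artifactId>
  <executions>
    <execution>
      <!-- Hypothetical execution id -->
      <id>copy-shared-spark3-sources</id>
      <phase>generate-sources</phase>
      <goals>
        <goal>copy-resources</goal>
      </goals>
      <configuration>
        <!-- Hypothetical output location; it would still need to be registered as a source root -->
        <outputDirectory>${project.build.directory}/shared-sources</outputDirectory>
        <resources>
          <resource>
            <!-- Hypothetical path to the shared spark_3 sources -->
            <directory>${basedir}/../azure-cosmos-spark_3/src/main/scala</directory>
            <excludes>
              <!-- One exclude per overridden file; the glob shown here is an assumption -->
              <exclude>**/ItemsBatchWriter.scala</exclude>
            </excludes>
          </resource>
        </resources>
      </configuration>
    </execution>
  </executions>
</plugin>
```

Excluding the overridden files at copy time is what keeps the leaf module from compiling two definitions of the same class.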

Refactored: azure-cosmos-spark_4-0_2-13

  • Changed parent from azure-cosmos-spark_3 to azure-cosmos-spark_4
  • Removed 18 duplicate Scala files now in shared base

Scala compiler parameterization

Changed spark_3 scala-maven-plugin from hardcoded <source>1.8</source> to ${maven.compiler.source} so each module controls its own bytecode target via Maven properties.
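
A minimal sketch of that parameterization follows, assuming the target is handled the same way as the source and omitting the plugin version and the surrounding build elements. The three fragments would live in different POMs, as the comments indicate.

```xml
<!-- Parent (spark_3): default values inherited by child modules -->
<properties>
  <maven.compiler.source>8</maven.compiler.source>
  <maven.compiler.target>8</maven.compiler.target>
</properties>

<!-- Parent (spark_3): the Scala plugin reads the property instead of a hardcoded 1.8 -->
<plugin>
  <groupId>net.alchim31.maven</groupId>
  <artifactId>scala-maven-plugin</artifactId>
  <configuration>
    <source>${maven.compiler.source}</source>
    <!-- Assumption: target is parameterized the same way as source -->
    <target>${maven.compiler.target}</target>
  </configuration>
</plugin>

<!-- Child (spark_4-1): overrides only the properties, not the plugin configuration -->
<properties>
  <maven.compiler.source>17</maven.compiler.source>
  <maven.compiler.target>17</maven.compiler.target>
</properties>
```

Because Maven resolves property references against the effective POM of each module, a child that redefines the property changes the Scala compiler level without repeating the plugin block.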

JDK compatibility matrix

| JDK version | Spark 3.x modules | Spark 4.0 module | Spark 4.1 module |
|---|---|---|---|
| 8, 11 | ✅ Compile + Test | ⏭️ Skip entirely | ⏭️ Skip entirely |
| 17+ | ✅ Compile + Test | ✅ Compile + Test | ✅ Compile + Test |

Both Spark 4.x modules skip entirely on JDK <17 via cosmos.spark.skip=true.
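
One way such a skip could be wired up is a JDK-activated Maven profile, sketched below. The profile id is hypothetical, and how cosmos.spark.skip is consumed by the individual plugins' skip flags is not shown in this PR description, so treat this as an illustration rather than the actual POM contents.

```xml
<profiles>
  <profile>
    <!-- Hypothetical profile id -->
    <id>skip-spark4-below-jdk17</id>
    <activation>
      <!-- Maven version range: active on any JDK below 17 -->
      <jdk>(,17)</jdk>
    </activation>
    <properties>
      <!-- Property name taken from the PR description -->
      <cosmos.spark.skip>true</cosmos.spark.skip>
    </properties>
  </profile>
</profiles>
```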

CI/Build Updates

  • eng/versioning/version_client.txt — added version entries
  • eng/versioning/external_dependencies.txt — added Spark 4.1.0 dependency
  • sdk/cosmos/ci.yml — added trigger paths, release params, artifact entries
  • sdk/cosmos/spark.yml — added Databricks live test stage for Spark 4.1
  • eng/pipelines/templates/stages/cosmos-emulator-matrix.json — added Spark 4.1 emulator test entries
  • eng/pipelines/aggregate-reports.yml — added exclusion/inclusion entries
  • eng/.docsettings.yml — added README exclusion
  • eng/scripts/generate_from_source_pom.py — added azure-cosmos-spark_4 as valid parent

Generated by coding-agent-harness

xinlian12 and others added 11 commits April 14, 2026 13:34
…a-jackson.version

The recent Jackson dependency update (8a671dd) bumped Jackson from 2.18.4
to 2.18.6 in all Cosmos Spark child modules but missed updating the
scala-jackson.version property in the parent POM. This caused the
maven-enforcer-plugin BannedDependencies rule to reject
jackson-module-scala_2.12 and _2.13 at version 2.18.6.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Create azure-cosmos-spark_4 shared base module (POM parent) containing
  12 main + 6 test Scala files common to all Spark 4.x versions
- Create azure-cosmos-spark_4-1_2-13 module with 3 override files using
  updated HDFSMetadataLog import (org.apache.spark.sql.execution.streaming.checkpointing)
  for SPARK-52787 package reorganization
- Refactor azure-cosmos-spark_4-0_2-13 to use spark_4 as parent, removing
  duplicated Scala files now in the shared base
- Add scala-hdfs source directories in spark_3 for HDFSMetadataLog files
  used by Spark 4.0 (old import path)
- Update CI/build configs: ci.yml triggers/artifacts, emulator matrix,
  aggregate-reports, version_client.txt, external_dependencies.txt,
  generate_from_source_pom.py valid_parents, .docsettings.yml

Implements Azure#48849

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…rix, compiler settings

F1: Move HDFS-dependent files exclusively to scala-hdfs directory, add
scala-hdfs source paths to all Spark 3.x leaf module POMs to prevent
duplicate class compilation errors.

F2: Add azure-cosmos-spark_4 and azure-cosmos-spark_4-1_2-13 to
sdk/cosmos/pom.xml modules list.

F3: Add azure-cosmos-spark_4 to Spark 4.0 ProjectListOverride in PR
CI matrix.

F4: Add Spark 4.1 entry to PR CI matrix.

F5: Move maven.compiler.source/target=17 to shared azure-cosmos-spark_4
parent POM so both 4.0 and 4.1 children inherit it consistently.

F6: Remove incomplete PR link from CHANGELOG (PR number unknown at
commit time).

F7: Change cosmos-spark-version from '4.0' to '4' in shared base POM.

F8: Add explanatory comment for dead build-helper profile in base POM.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…LOG, README, remove redundant deps

- Backport scalastyle annotations to spark_3 scala-hdfs originals so copies
  differ only by the HDFSMetadataLog import line (F1)
- Add shared bug fix entries to spark_4-1 CHANGELOG from spark_4-0 (F2)
- Match spark_4-0 README version table column structure with TBD values (F3)
- Remove redundant jackson-databind/jackson-module-scala from spark_4-1 POM,
  inherited from parent spark_4 (F4)
- Add XML comments documenting intentional scala-hdfs omission in
  spark_4-1 build-helper config (F5)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…urce refs, add sync comments

- F1: Remove redundant Jackson dependencies from spark_4-0_2-13 (inherited from parent spark_4)
- F2: Remove non-existent azure-cosmos-spark_4/src/main/resources references from all 3 POMs
- F3: Add sync-reminder comments to the 3 override files in spark_4-1_2-13

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@xinlian12 xinlian12 requested review from a team and kirankumarkolli as code owners April 18, 2026 00:21
Copilot AI review requested due to automatic review settings April 18, 2026 00:21
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@xinlian12 xinlian12 changed the title [Spark connector][NO REVIEW]Added spark 4.1 support [Spark connector]Added spark 4.1 support Apr 21, 2026
@xinlian12
Member Author

@sdkReviewAgent

@xinlian12
Member Author

/azp run java - cosmos - spark

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Comment thread sdk/cosmos/azure-cosmos-spark_4/CONTRIBUTING.md
Comment thread sdk/cosmos/azure-cosmos-spark_4-1_2-13/pom.xml Outdated
Comment thread sdk/cosmos/azure-cosmos-spark_4/pom.xml
@xinlian12
Member Author

Review complete (35:59)

Posted 3 inline comment(s).

Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage

…ecode

- spark_4 (parent): removed scala-maven-plugin override and
  scala.compiler.release — each child decides its own target
- spark_4-0: inherits source/target=1.8 from spark_3 (Spark 4.0
  doesn't reference java.lang.Record). JDK<17 skips tests only.
- spark_4-1: sets maven.compiler.source/target=17 (Spark 4.1 APIs
  reference java.lang.Record). JDK<17 skips entire build via
  cosmos.spark.skip=true.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@xinlian12 xinlian12 force-pushed the feat/issue-48849-spark-41-support branch from 2fb8a56 to d0c3e1d Compare April 21, 2026 21:22
@xinlian12
Member Author

/azp run java - cosmos - spark

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@xinlian12
Member Author

/azp run java - cosmos - spark

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

xinlian12 and others added 3 commits April 21, 2026 14:51
… properties

Changed spark_3's scala-maven-plugin from hardcoded <source>1.8</source>
to ${maven.compiler.source}. All 3.x and 4.0 modules inherit '8' from
spark_3's properties (no behavior change). spark_4-1 overrides to '17'
via maven.compiler.source/target=17 in its own properties, which flows
through to the Scala compiler — resolving java.lang.Record (JDK 16+).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…rk 4.x modules

The Spark 4.1 CI job (Spark41Scala213IntegrationTests) fails with:
  ItemsTable.scala:41: Class java.lang.Record not found

Root cause: The scala-maven-plugin inherited target=1.8 from the
spark_3 parent, but Spark 4.1 APIs reference java.lang.Record (JDK 14+).

Fix: Override scala-maven-plugin source/target to 17 in the spark_4
parent's build-scala profile. Both Spark 4.0 and 4.1 require JDK 17+.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
No longer needed since spark_3 now uses ${maven.compiler.source} instead
of hardcoded 1.8. Each child module's property flows through naturally.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@xinlian12
Member Author

/azp run java - cosmos - spark

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Comment thread sdk/cosmos/azure-cosmos-spark_4-0_2-13/pom.xml
Member

@tvaron3 tvaron3 left a comment

LGTM

Spark 4.x requires Java 17+. Skip the entire build (not just tests) on
older JDKs via cosmos.spark.skip=true, consistent with spark_4-1.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@xinlian12
Member Author

/azp run java - cosmos - spark

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Member

@FabianMeiswinkel FabianMeiswinkel left a comment

LGTM

Comment thread sdk/cosmos/azure-cosmos-spark_4/pom.xml
Comment thread sdk/cosmos/azure-cosmos-spark_4/pom.xml
@xinlian12 xinlian12 merged commit d3eead7 into Azure:main Apr 23, 2026
68 checks passed
Development

Successfully merging this pull request may close these issues.

[FEATURE REQ][Spark Connector]Add spark 4.1 support

6 participants