[Spark connector] Added Spark 4.1 support #48861
Merged
xinlian12 merged 52 commits into Azure:main, Apr 23, 2026
Conversation
…a-jackson.version
The recent Jackson dependency update (8a671dd) bumped Jackson from 2.18.4 to 2.18.6 in all Cosmos Spark child modules but missed updating the scala-jackson.version property in the parent POM. This caused the maven-enforcer-plugin BannedDependencies rule to reject jackson-module-scala_2.12 and _2.13 at version 2.18.6.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
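A sketch of the parent-POM property fix described above; the property name comes from the commit message, while the surrounding layout is illustrative:

```xml
<!-- Parent POM <properties>: keep in lockstep with the Jackson version
     used by jackson-module-scala_2.12/_2.13 in the child modules,
     otherwise the maven-enforcer-plugin BannedDependencies rule
     rejects those artifacts at the newer version. -->
<properties>
  <scala-jackson.version>2.18.6</scala-jackson.version>
</properties>
```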
…to upstream-main
…to upstream-main
…to upstream-main
…to upstream-main
…to upstream-main
- Create azure-cosmos-spark_4 shared base module (POM parent) containing 12 main + 6 test Scala files common to all Spark 4.x versions
- Create azure-cosmos-spark_4-1_2-13 module with 3 override files using the updated HDFSMetadataLog import (org.apache.spark.sql.execution.streaming.checkpointing) for the SPARK-52787 package reorganization
- Refactor azure-cosmos-spark_4-0_2-13 to use spark_4 as parent, removing duplicated Scala files now in the shared base
- Add scala-hdfs source directories in spark_3 for HDFSMetadataLog files used by Spark 4.0 (old import path)
- Update CI/build configs: ci.yml triggers/artifacts, emulator matrix, aggregate-reports, version_client.txt, external_dependencies.txt, generate_from_source_pom.py valid_parents, .docsettings.yml

Implements Azure#48849
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
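The 4.1 override files mentioned above differ from their Spark 4.0 counterparts essentially by one import line; a hedged sketch (only the two import paths come from this PR, the comments are illustrative):

```scala
// Spark 3.x / 4.0 copies use the pre-SPARK-52787 package:
// import org.apache.spark.sql.execution.streaming.HDFSMetadataLog

// Spark 4.1 override copies use the relocated package:
import org.apache.spark.sql.execution.streaming.checkpointing.HDFSMetadataLog
```

Keeping the copies identical except for this line makes the sync-reminder comments (added later in this PR) easy to verify by diff.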
…rix, compiler settings
- F1: Move HDFS-dependent files exclusively to the scala-hdfs directory; add scala-hdfs source paths to all Spark 3.x leaf module POMs to prevent duplicate-class compilation errors.
- F2: Add azure-cosmos-spark_4 and azure-cosmos-spark_4-1_2-13 to the sdk/cosmos/pom.xml modules list.
- F3: Add azure-cosmos-spark_4 to the Spark 4.0 ProjectListOverride in the PR CI matrix.
- F4: Add a Spark 4.1 entry to the PR CI matrix.
- F5: Move maven.compiler.source/target=17 to the shared azure-cosmos-spark_4 parent POM so both 4.0 and 4.1 children inherit it consistently.
- F6: Remove the incomplete PR link from the CHANGELOG (PR number unknown at commit time).
- F7: Change cosmos-spark-version from '4.0' to '4' in the shared base POM.
- F8: Add an explanatory comment for the dead build-helper profile in the base POM.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…LOG, README, remove redundant deps
- Backport scalastyle annotations to the spark_3 scala-hdfs originals so the copies differ only by the HDFSMetadataLog import line (F1)
- Add shared bug-fix entries to the spark_4-1 CHANGELOG from spark_4-0 (F2)
- Match the spark_4-0 README version table column structure, with TBD values (F3)
- Remove redundant jackson-databind/jackson-module-scala from the spark_4-1 POM, inherited from the parent spark_4 (F4)
- Add XML comments documenting the intentional scala-hdfs omission in the spark_4-1 build-helper config (F5)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…urce refs, add sync comments
- F1: Remove redundant Jackson dependencies from spark_4-0_2-13 (inherited from the parent spark_4)
- F2: Remove non-existent azure-cosmos-spark_4/src/main/resources references from all 3 POMs
- F3: Add sync-reminder comments to the 3 override files in spark_4-1_2-13
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@sdkReviewAgent

/azp run java - cosmos - spark
Azure Pipelines successfully started running 1 pipeline(s).
xinlian12 commented Apr 21, 2026
✅ Review complete (35:59). Posted 3 inline comment(s). Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage
…ecode
- spark_4 (parent): removed the scala-maven-plugin override and scala.compiler.release — each child decides its own target
- spark_4-0: inherits source/target=1.8 from spark_3 (Spark 4.0 doesn't reference java.lang.Record). JDK <17 skips tests only.
- spark_4-1: sets maven.compiler.source/target=17 (Spark 4.1 APIs reference java.lang.Record). JDK <17 skips the entire build via cosmos.spark.skip=true.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2fb8a56 to d0c3e1d
/azp run java - cosmos - spark
Azure Pipelines successfully started running 1 pipeline(s).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
/azp run java - cosmos - spark
Azure Pipelines successfully started running 1 pipeline(s).
… properties
Changed spark_3's scala-maven-plugin from hardcoded <source>1.8</source>
to ${maven.compiler.source}. All 3.x and 4.0 modules inherit '8' from
spark_3's properties (no behavior change). spark_4-1 overrides to '17'
via maven.compiler.source/target=17 in its own properties, which flows
through to the Scala compiler — resolving java.lang.Record (JDK 16+).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
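The change described above can be sketched as the following plugin configuration in the spark_3 parent POM; the groupId/artifactId are the usual scala-maven-plugin coordinates, and the exact surrounding layout is illustrative:

```xml
<!-- spark_3 parent: bytecode target driven by standard Maven
     properties instead of a hardcoded 1.8, so each child module
     can override it (spark_4-1 sets both properties to 17). -->
<plugin>
  <groupId>net.alchim31.maven</groupId>
  <artifactId>scala-maven-plugin</artifactId>
  <configuration>
    <source>${maven.compiler.source}</source>
    <target>${maven.compiler.target}</target>
  </configuration>
</plugin>
```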
…rk 4.x modules
The Spark 4.1 CI job (Spark41Scala213IntegrationTests) fails with:
ItemsTable.scala:41: Class java.lang.Record not found
Root cause: the scala-maven-plugin inherited target=1.8 from the spark_3 parent, but Spark 4.1 APIs reference java.lang.Record (JDK 14+).
Fix: override scala-maven-plugin source/target to 17 in the spark_4 parent's build-scala profile. Both Spark 4.0 and 4.1 require JDK 17+.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
No longer needed since spark_3 now uses ${maven.compiler.source} instead
of hardcoded 1.8. Each child module's property flows through naturally.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
/azp run java - cosmos - spark
Azure Pipelines successfully started running 1 pipeline(s).
tvaron3 reviewed Apr 21, 2026
Spark 4.x requires Java 17+. Skip the entire build (not just tests) on older JDKs via cosmos.spark.skip=true, consistent with spark_4-1. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
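A minimal sketch of how such a JDK-gated skip could be wired in a module POM, assuming only the cosmos.spark.skip property from the commit; the profile id and activation range are illustrative:

```xml
<!-- Activates on any JDK below 17 and flips the module-level skip
     switch so the entire build is bypassed, not just the tests. -->
<profile>
  <id>cosmos-spark-requires-jdk17</id>
  <activation>
    <jdk>(,17)</jdk>
  </activation>
  <properties>
    <cosmos.spark.skip>true</cosmos.spark.skip>
  </properties>
</profile>
```

Using a property rather than skipping individual plugins keeps the gate in one place: any plugin execution bound to that property is disabled together.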
/azp run java - cosmos - spark
Azure Pipelines successfully started running 1 pipeline(s).
tvaron3 approved these changes Apr 22, 2026
alzimmermsft approved these changes Apr 23, 2026
weshaggard approved these changes Apr 23, 2026
Closes #48849
Summary
Add Spark 4.1 support to the Azure Cosmos DB Spark connector.
Addresses the SPARK-52787 package reorganization, where HDFSMetadataLog moved from o.a.s.sql.execution.streaming to o.a.s.sql.execution.streaming.checkpointing. Also adds storeOffsets() to ItemsBatchWriter for Databricks Runtime 18.1+ compatibility.

Architecture
New shared base module:
- azure-cosmos-spark_4 (the Spark 4.x counterpart of azure-cosmos-spark_3 for 3.x)

New leaf module:
- azure-cosmos-spark_4-1_2-13, with parent azure-cosmos-spark_4
- Uses the updated HDFSMetadataLog import for Spark 4.1
- Overrides the batch writer (ItemsBatchWriter) with storeOffsets() for DBR 18.1+
- Uses maven-resources-plugin to copy shared sources from spark_3, excluding the 4 overridden files
- Sets maven.compiler.source/target=17 (Spark 4.1 APIs reference java.lang.Record)

Refactored:
- azure-cosmos-spark_4-0_2-13: parent changed from azure-cosmos-spark_3 to azure-cosmos-spark_4

Scala compiler parameterization
Changed spark_3's scala-maven-plugin from hardcoded <source>1.8</source> to ${maven.compiler.source} so each module controls its own bytecode target via Maven properties.

JDK compatibility matrix
Both Spark 4.x modules skip entirely on JDK <17 via cosmos.spark.skip=true.

CI/Build Updates
- eng/versioning/version_client.txt — added version entries
- eng/versioning/external_dependencies.txt — added Spark 4.1.0 dependency
- sdk/cosmos/ci.yml — added trigger paths, release params, artifact entries
- sdk/cosmos/spark.yml — added Databricks live test stage for Spark 4.1
- eng/pipelines/templates/stages/cosmos-emulator-matrix.json — added Spark 4.1 emulator test entries
- eng/pipelines/aggregate-reports.yml — added exclusion/inclusion entries
- eng/.docsettings.yml — added README exclusion
- eng/scripts/generate_from_source_pom.py — added azure-cosmos-spark_4 as valid parent

Generated by coding-agent-harness