
Conversation

Contributor

@SajidAlamQB SajidAlamQB commented Sep 11, 2025

Description

This PR introduces SparkDatasetV2, a cleaner alternative to SparkDataset that addresses long-standing issues outlined in #135.

Problems with Current SparkDataset

  • Custom filepath parsing (split_filepath vs. get_protocol_and_path) causes inconsistencies (see the snippet after this list)
  • Multiple code pathways (fsspec for metadata, Spark for I/O) work for S3/DBFS but break for other filesystems
  • Forces installation of ~300MB of dependencies even on Databricks, causing conflicts with databricks-connect
  • Circular dependencies between the Databricks datasets and SparkDataset utilities
  • Incomplete test coverage and filepath-handling bugs
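
For context on the first bullet, a quick illustration of the shared kedro helper that the custom parsing diverges from:

```python
from kedro.io.core import get_protocol_and_path

# The shared kedro helper splits a filepath into (protocol, path) in one place:
protocol, path = get_protocol_and_path("s3://bucket/data/file.parquet")
assert protocol == "s3"
assert path == "bucket/data/file.parquet"
```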

Development notes

Dependency Improvements:

  • spark-core with zero dependencies - no forced PySpark on Databricks!
  • Environment-specific installations: spark-local, spark-databricks, spark-emr
  • Optional filesystem support: spark-s3, spark-gcs, spark-azure
  • HDFS via PyArrow instead of the deprecated hdfs package (see the sketch after this list)
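
As a reference for the last bullet, a hypothetical sketch of HDFS access via PyArrow (names are illustrative, and it needs a configured Hadoop client on the machine; this is not the exact code in this PR):

```python
from pyarrow import fs

# host="default" resolves the namenode from the local Hadoop configuration
# (fs.defaultFS); a Hadoop client must be installed and on the CLASSPATH.
hdfs = fs.HadoopFileSystem(host="default")

# List entries under /data without recursing into subdirectories.
infos = hdfs.get_file_info(fs.FileSelector("/data", recursive=False))
for info in infos:
    print(info.path, info.type)
```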

Code Improvements:

  • Lazy imports via TYPE_CHECKING (see the first sketch after this list)
  • Consistent fsspec usage (kept the dbutils optimisation for DBFS performance; I can't test it since Databricks Free doesn't allow use of DBFS)
  • Proper protocol translation (s3:// → s3a://)
  • Unity Catalog volumes support (/Volumes/...)
  • Spark Connect and Databricks Connect compatibility
  • Automatic Pandas → Spark DataFrame conversion, like SnowparkTableDataset (see the second sketch below)
  • Session creation inside the dataset, so no SparkHooks required
  • Spark Session Handling (see the third sketch below):
    get_spark_with_remote_support() automatically detects the environment:
  • Databricks Connect (when DATABRICKS_HOST and DATABRICKS_TOKEN are set and databricks-connect is installed)
  • Spark Connect (when SPARK_REMOTE is set)
  • Classic local SparkSession (fallback)
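
First, a minimal sketch of the TYPE_CHECKING lazy-import pattern (illustrative only; the function and path are hypothetical, not the exact code in this PR):

```python
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Seen only by type checkers: PySpark does not need to be importable
    # at runtime until a Spark-touching function is actually called.
    from pyspark.sql import DataFrame


def load_parquet(path: str) -> DataFrame:
    from pyspark.sql import SparkSession  # deferred runtime import

    return SparkSession.builder.getOrCreate().read.parquet(path)
```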
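
Second, a sketch of the automatic Pandas → Spark conversion (the helper name is hypothetical):

```python
import pandas as pd
from pyspark.sql import DataFrame, SparkSession


def ensure_spark_df(data: "pd.DataFrame | DataFrame", spark: SparkSession) -> DataFrame:
    """Pass Spark DataFrames through unchanged; convert pandas DataFrames."""
    if isinstance(data, pd.DataFrame):
        return spark.createDataFrame(data)
    return data
```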
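
Third, a sketch of the detection order described above (the real get_spark_with_remote_support() in this PR may differ in its details):

```python
import os


def get_spark_with_remote_support():
    """Return a SparkSession appropriate for the current environment."""
    # 1. Databricks Connect: remote credentials plus the databricks-connect package.
    if os.environ.get("DATABRICKS_HOST") and os.environ.get("DATABRICKS_TOKEN"):
        try:
            from databricks.connect import DatabricksSession

            return DatabricksSession.builder.getOrCreate()
        except ImportError:
            pass  # databricks-connect not installed; fall through

    from pyspark.sql import SparkSession

    # 2. Spark Connect: SPARK_REMOTE points at a Spark Connect server.
    if os.environ.get("SPARK_REMOTE"):
        return SparkSession.builder.remote(os.environ["SPARK_REMOTE"]).getOrCreate()

    # 3. Fallback: classic local SparkSession.
    return SparkSession.builder.getOrCreate()
```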

Current Status

  • Local development with spark-local
  • Unity Catalog volumes on Databricks
  • S3/GCS/Azure storage via optional dependencies
  • Spark Connect (Spark 4.0+)

Known Limitations:

  • The DBFS public root (/FileStore) is deprecated by Databricks, so it's not easy to test whether the legacy behaviour still works.

Breaking Changes

  • Users of kedro-datasets[spark] must choose specific bundles
  • HDFS users need to use spark-hdfs with PyArrow

What this means for users:

  • Databricks users: no more PySpark conflicts with databricks-connect
  • Reduced installation size: ~310MB → 0MB for cloud environments
  • Clearer installation paths based on environment
  • HDFS support is deprecated (still available via spark-hdfs)

Checklist

  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Updated jsonschema/kedro-catalog-X.XX.json if necessary
  • Added a description of this change in the relevant RELEASE.md file
  • Added tests to cover my changes
  • Received approvals from at least half of the TSC (required for adding a new, non-experimental dataset)

@SajidAlamQB SajidAlamQB changed the title SparkDataset Rewrite chore(datasets): SparkDataset Rewrite Sep 22, 2025
@SajidAlamQB SajidAlamQB changed the title chore(datasets): SparkDataset Rewrite feat(datasets): SparkDataset Rewrite Sep 22, 2025
Member

@deepyaman deepyaman left a comment

Digging in a bit, this feels less like a rewrite and more like a refactoring. Here are my initial thoughts:

  • I've added a comment re my concern about removing the _dbfs_glob logic. This needs to be validated carefully (perhaps Databricks has improved the performance of regular glob?) so we don't reintroduce a performance issue. I remember debugging this on a client project because, IIRC (it's been years), performance degrades to the point of unusability with a large number of versions.
  • Will this provide the best experience with spark-connect and databricks-connect? (FWIW databricks-connect is a bit annoying to look into since it's not open source.) Spark 3.4 introduced Spark Connect, and Spark 4 includes major refactors to really make it part of the core (e.g. pyspark.sql.classic is moved to the same level as pyspark.sql.connect, and they inherit from the same base DataFrame and all—wasn't the case before). IMO Spark Connect looks like the future of Spark, and a SparkDataset refresh should work seamlessly with it. Spark Connect (and Databricks Connect) are also potentially great for users who struggle with the deployment experience (e.g. need to get code onto Databricks from local). That said, the classic experience is still likely a very common way for teams who are working more from within Databricks to operate.
  • I like the fact that HDFS is supported through PyArrow now. If there's still concern that people may need the old, separate HDFS client (not sure there is? hdfs hasn't had a release in two years and doesn't support Python 3.13, for example), maybe that could be handled through some sort of fallback logic?

@SajidAlamQB
Contributor Author

Thanks @deepyaman, you're right about the DBFS glob issue, that's a good catch; we'll add that back in. Regarding refactor vs. rewrite, we chose V2 for safety, but I'm open to discussing whether we should refactor the original instead if you think that's better.

@deepyaman
Member

Regarding refactor vs. rewrite, we chose V2 for safety, but I'm open to discussing whether we should refactor the original instead if you think that's better.

Yeah, of course. I think we can get the V2 "ready", and then see if it's sufficiently different that it needs to be breaking/a separate dataset.

@SajidAlamQB
Contributor Author

@noklam would also appreciate your thoughts on this.

@noklam noklam self-requested a review September 30, 2025 15:34
Contributor

@noklam noklam left a comment

Sorry, I don't have time to review this in detail, but I don't want to block it. A few quick questions off the top of my head:

  • When should I use the Databricks-specific datasets versus spark.SparkDataset on Databricks? I recall there are already some things that are only possible with the Databricks ones. If we are rewriting this, I think we should take a look at that.
  • DBFS is a bit annoying: Databricks has already deprecated it, and new clusters default to UC Volumes, but a lot of people are still using DBFS on older clusters.
  • Is there a goal or are there additional things this rewrite improves? Or is it more like refactoring?

@SajidAlamQB
Contributor Author

  • When should I use the Databricks-specific datasets versus spark.SparkDataset on Databricks? I recall there are already some things that are only possible with the Databricks ones. If we are rewriting this, I think we should take a look at that.
  • DBFS is a bit annoying: Databricks has already deprecated it, and new clusters default to UC Volumes, but a lot of people are still using DBFS on older clusters.
  • Is there a goal or are there additional things this rewrite improves? Or is it more like refactoring?

Hey @noklam thanks,

I think the Databricks datasets are more for TABLE operations while the SparkDataset is for FILE operations.
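
For illustration, a hypothetical contrast of the two (dataset names are from the current kedro-datasets package; this is not code from this PR):

```python
# Table-oriented: Databricks-specific dataset working with Unity Catalog tables.
from kedro_datasets.databricks import ManagedTableDataset

# File-oriented: SparkDataset reading/writing files on a filesystem or object store.
from kedro_datasets.spark import SparkDataset

table_ds = ManagedTableDataset(table="my_table", catalog="main", database="default")
file_ds = SparkDataset(
    filepath="/Volumes/main/default/raw/data.parquet", file_format="parquet"
)
```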

The new V2 handles both DBFS and UC Volumes properly: it still supports /dbfs/, dbfs:/, and /Volumes/ paths, and we apply the DBFS optimisations only when needed.

I think this goes a bit beyond a refactor: we're solving some long-standing issues. Databricks users can now actually use it, we've added Spark Connect support for Spark 4.0, and thanks to the pyproject.toml changes users can now choose their dependencies instead of installing everything. It makes the dataset more usable.

@SajidAlamQB
Contributor Author

The spark tests are not running on Windows because of the issues with the old implementation. Is it possible to run this new one on Windows?

SparkDatasetV2 should work on Windows; I will enable it and test it out.

I spent a good amount of time trying to get the SparkDatasetV2 tests running on Windows CI. Unfortunately, PySpark + Windows + GitHub Actions runners have fundamental incompatibilities, so I will go back to skipping the Spark tests on Windows.

@DimedS
Member

DimedS commented Dec 1, 2025

Remote connections to Databricks should be established using Databricks Connect.
I’ve updated the implementation here: f11f1c1, and it now works correctly in remote mode.

Member

@merelcht merelcht left a comment

LGTM! 👍

Don't forget to add it to the release notes too, plus a small note on when to use this vs. the legacy SparkDataset.

Member

@DimedS DimedS left a comment

Great work, @SajidAlamQB !

@SajidAlamQB SajidAlamQB merged commit 0fa5f1c into main Dec 2, 2025
18 checks passed
@SajidAlamQB SajidAlamQB deleted the dev/sparkdataset-rewrite branch December 2, 2025 10:47