
Conversation

Contributor

@SajidAlamQB SajidAlamQB commented Sep 11, 2025

Description

This PR introduces SparkDatasetV2, a cleaner alternative to SparkDataset that addresses long-standing issues outlined in #135.

Problems with Current SparkDataset

  • Custom filepath parsing (split_filepath vs. get_protocol_and_path) causes inconsistencies (see the snippet after this list)
  • Multiple code pathways (fsspec for metadata, Spark for I/O) work for S3/DBFS but break for other filesystems
  • Forces installation of ~300MB of dependencies even on Databricks, causing conflicts with databricks-connect
  • Circular dependencies between the Databricks datasets and SparkDataset utilities
  • Incomplete test coverage and filepath-handling bugs
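
For context on the first bullet, a quick illustration of the shared kedro helper that the custom parsing diverges from:

```python
from kedro.io.core import get_protocol_and_path

# The shared kedro helper splits a filepath into (protocol, path) in one place:
protocol, path = get_protocol_and_path("s3://bucket/data/file.parquet")
assert protocol == "s3"
assert path == "bucket/data/file.parquet"
```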

Development notes

Dependency Improvements:

  • spark-core with zero dependencies - no forced PySpark on Databricks!
  • Environment-specific installations: spark-local, spark-databricks, spark-emr
  • Optional filesystem support: spark-s3, spark-gcs, spark-azure
  • HDFS via PyArrow instead of the deprecated hdfs package (see the sketch after this list)
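
As a reference for the last bullet, a hypothetical sketch of HDFS access via PyArrow (names are illustrative, and it needs a configured Hadoop client on the machine; this is not the exact code in this PR):

```python
from pyarrow import fs

# host="default" resolves the namenode from the local Hadoop configuration
# (fs.defaultFS); a Hadoop client must be installed and on the CLASSPATH.
hdfs = fs.HadoopFileSystem(host="default")

# List entries under /data without recursing into subdirectories.
infos = hdfs.get_file_info(fs.FileSelector("/data", recursive=False))
for info in infos:
    print(info.path, info.type)
```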

Code Improvements:

  • Lazy imports via TYPE_CHECKING (see the first sketch after this list)
  • Consistent fsspec usage (kept the dbutils optimisation for DBFS performance; I can't test it since Databricks Free doesn't allow use of DBFS)
  • Proper protocol translation (s3:// → s3a://)
  • Unity Catalog volumes support (/Volumes/...)
  • Spark Connect and Databricks Connect compatibility
  • Automatic Pandas → Spark DataFrame conversion, like SnowparkTableDataset (see the second sketch below)
  • Session creation inside the dataset, so no SparkHooks required
  • Spark Session Handling (see the third sketch below):
    get_spark_with_remote_support() automatically detects the environment:
  • Databricks Connect (when DATABRICKS_HOST and DATABRICKS_TOKEN are set and databricks-connect is installed)
  • Spark Connect (when SPARK_REMOTE is set)
  • Classic local SparkSession (fallback)
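
First, a minimal sketch of the TYPE_CHECKING lazy-import pattern (illustrative only; the function and path are hypothetical, not the exact code in this PR):

```python
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Seen only by type checkers: PySpark does not need to be importable
    # at runtime until a Spark-touching function is actually called.
    from pyspark.sql import DataFrame


def load_parquet(path: str) -> DataFrame:
    from pyspark.sql import SparkSession  # deferred runtime import

    return SparkSession.builder.getOrCreate().read.parquet(path)
```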
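
Second, a sketch of the automatic Pandas → Spark conversion (the helper name is hypothetical):

```python
import pandas as pd
from pyspark.sql import DataFrame, SparkSession


def ensure_spark_df(data: "pd.DataFrame | DataFrame", spark: SparkSession) -> DataFrame:
    """Pass Spark DataFrames through unchanged; convert pandas DataFrames."""
    if isinstance(data, pd.DataFrame):
        return spark.createDataFrame(data)
    return data
```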
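
Third, a sketch of the detection order described above (the real get_spark_with_remote_support() in this PR may differ in its details):

```python
import os


def get_spark_with_remote_support():
    """Return a SparkSession appropriate for the current environment."""
    # 1. Databricks Connect: remote credentials plus the databricks-connect package.
    if os.environ.get("DATABRICKS_HOST") and os.environ.get("DATABRICKS_TOKEN"):
        try:
            from databricks.connect import DatabricksSession

            return DatabricksSession.builder.getOrCreate()
        except ImportError:
            pass  # databricks-connect not installed; fall through

    from pyspark.sql import SparkSession

    # 2. Spark Connect: SPARK_REMOTE points at a Spark Connect server.
    if os.environ.get("SPARK_REMOTE"):
        return SparkSession.builder.remote(os.environ["SPARK_REMOTE"]).getOrCreate()

    # 3. Fallback: classic local SparkSession.
    return SparkSession.builder.getOrCreate()
```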

Current Status

  • Local development with spark-local
  • Unity Catalog volumes on Databricks
  • S3/GCS/Azure storage via optional dependencies
  • Spark Connect (Spark 4.0+)

Known Limitations:

  • The DBFS public root (/FileStore) is deprecated by Databricks, so it's not easy to test whether the legacy behaviour still works.

Breaking Changes

  • Users of kedro-datasets[spark] must choose specific bundles
  • HDFS users need to use spark-hdfs with PyArrow

What this means for users:

  • Databricks users: no more PySpark conflicts with databricks-connect
  • Reduced installation size: ~310MB → 0MB for cloud environments
  • Clearer installation paths based on environment
  • HDFS support is deprecated (still available via spark-hdfs)

Checklist

  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Updated jsonschema/kedro-catalog-X.XX.json if necessary
  • Added a description of this change in the relevant RELEASE.md file
  • Added tests to cover my changes
  • Received approvals from at least half of the TSC (required for adding a new, non-experimental dataset)

@SajidAlamQB SajidAlamQB changed the title SparkDataset Rewrite chore(datasets): SparkDataset Rewrite Sep 22, 2025
@SajidAlamQB SajidAlamQB changed the title chore(datasets): SparkDataset Rewrite feat(datasets): SparkDataset Rewrite Sep 22, 2025
Member

@deepyaman deepyaman left a comment

Digging in a bit, this feels less like a rewrite and more like a refactoring. Here are my initial thoughts:

  • I've added a comment re my concern about removing the _dbfs_glob logic. This needs to be validated carefully (perhaps Databricks has improved the performance of regular glob?) so we don't reintroduce a performance issue. I remember debugging this on a client project because, IIRC (it's been years), performance degrades to the point of unusability with a large number of versions.
  • Will this provide the best experience with spark-connect and databricks-connect? (FWIW databricks-connect is a bit annoying to look into since it's not open source.) Spark 3.4 introduced Spark Connect, and Spark 4 includes major refactors to really make it part of the core (e.g. pyspark.sql.classic is moved to the same level as pyspark.sql.connect, and they inherit from the same base DataFrame and all—wasn't the case before). IMO Spark Connect looks like the future of Spark, and a SparkDataset refresh should work seamlessly with it. Spark Connect (and Databricks Connect) are also potentially great for users who struggle with the deployment experience (e.g. need to get code onto Databricks from local). That said, the classic experience is still likely a very common way for teams who are working more from within Databricks to operate.
  • I like the fact that HDFS is supported through PyArrow now. If there's still concern that people may need the old, separate HDFS client (not sure there is? hdfs hasn't had a release in two years and doesn't support Python 3.13, for example), maybe that could be handled through some sort of fallback logic?

@SajidAlamQB
Contributor Author

Thanks @deepyaman, you're right about the DBFS glob issue, that's a good catch; we'll add that back in. Regarding refactor vs. rewrite, we chose V2 for safety, but I'm open to discussing whether we should refactor the original instead if you think that's better.

@deepyaman
Member

Regarding refactor vs. rewrite, we chose V2 for safety, but I'm open to discussing whether we should refactor the original instead if you think that's better.

Yeah, of course. I think we can get the V2 "ready", and then see if it's sufficiently different that it needs to be breaking/a separate dataset.

@SajidAlamQB
Contributor Author

@noklam would also appreciate your thoughts on this.

@noklam noklam self-requested a review September 30, 2025 15:34
Contributor

@noklam noklam left a comment

Sorry, I don't have time to review this in detail, but I don't want to block it. A few quick questions off the top of my head:

  • When should I use the Databricks-specific datasets versus spark.SparkDataset on Databricks? I recall there are already some things that are only possible with the Databricks ones. If we are rewriting this, I think we should take a look at that.
  • DBFS is a bit annoying: Databricks has already deprecated it, and new clusters default to UC Volumes, but a lot of people are still using DBFS on older clusters.
  • Is there a goal or are there additional things this rewrite improves? Or is it more like refactoring?

@SajidAlamQB
Contributor Author

  • When should I use the Databricks-specific datasets versus spark.SparkDataset on Databricks? I recall there are already some things that are only possible with the Databricks ones. If we are rewriting this, I think we should take a look at that.
  • DBFS is a bit annoying: Databricks has already deprecated it, and new clusters default to UC Volumes, but a lot of people are still using DBFS on older clusters.
  • Is there a goal or are there additional things this rewrite improves? Or is it more like refactoring?

Hey @noklam thanks,

I think the Databricks datasets are more for TABLE operations while the SparkDataset is for FILE operations.
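
For illustration, a hypothetical contrast of the two (dataset names are from the current kedro-datasets package; this is not code from this PR):

```python
# Table-oriented: Databricks-specific dataset working with Unity Catalog tables.
from kedro_datasets.databricks import ManagedTableDataset

# File-oriented: SparkDataset reading/writing files on a filesystem or object store.
from kedro_datasets.spark import SparkDataset

table_ds = ManagedTableDataset(table="my_table", catalog="main", database="default")
file_ds = SparkDataset(
    filepath="/Volumes/main/default/raw/data.parquet", file_format="parquet"
)
```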

The new V2 handles both DBFS and UC Volumes properly: it still supports /dbfs/, dbfs:/, and /Volumes/ paths, and we apply the DBFS optimisations only when needed.

I think this goes a bit beyond a refactor: we're solving some long-standing issues. Databricks users can now actually use it, we've added Spark Connect support for Spark 4.0, and thanks to the pyproject.toml changes users can now choose their dependencies instead of installing everything. It makes the dataset more usable.

@SajidAlamQB
Contributor Author

The spark tests are not running on Windows because of the issues with the old implementation. Is it possible to run this new one on Windows?

SparkDatasetV2 should work on Windows; I will enable it and test it out.

I spent a good amount of time trying to get the SparkDatasetV2 tests running on Windows CI. Unfortunately, PySpark + Windows + GitHub Actions runners have fundamental incompatibilities, so I will go back to skipping the Spark tests on Windows.

@DimedS
Member

DimedS commented Dec 1, 2025

Remote connections to Databricks should be established using Databricks Connect.
I’ve updated the implementation here: f11f1c1, and it now works correctly in remote mode.

Member

@merelcht merelcht left a comment

LGTM! 👍

Don't forget to add it to the release notes too, plus a small note on when to use this vs. the legacy SparkDataset.

Member

@DimedS DimedS left a comment

Great work, @SajidAlamQB !

@SajidAlamQB SajidAlamQB merged commit 0fa5f1c into main Dec 2, 2025
18 checks passed
@SajidAlamQB SajidAlamQB deleted the dev/sparkdataset-rewrite branch December 2, 2025 10:47