
Conversation


@DimedS DimedS commented Dec 2, 2025

Summary

This PR addresses kedro-org/kedro#5210 and updates the spaceflights-pyspark starter to use the new SparkDatasetV2 introduced in kedro-org/kedro-plugins#1185.

The updated starter will become the default for all PySpark and Databricks examples.
As a result, the databricks-iris starter will be archived.


Key Changes

1. Migration to SparkDatasetV2

  • Replaces the old SparkDataset with the new SparkDatasetV2 (see the catalog sketch below).
  • Enables:
    • Remote Spark execution (Databricks Connect / Spark Connect)
    • Automatic Pandas → Spark conversion on save
    • Cleaner session management (no hooks required)
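
A hypothetical catalog entry using the new dataset might look like this (a sketch only: the spark.SparkDatasetV2 type string and the option names are assumptions carried over from the existing SparkDataset API, not confirmed against the final release):

model_input_table:
  type: spark.SparkDatasetV2
  filepath: data/03_primary/model_input_table.parquet
  file_format: parquet
  save_args:
    mode: overwrite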

2. Removal of Spark Hook

  • Spark session is now created directly inside SparkDatasetV2.
  • The previous hook implementation is no longer needed (roughly sketched below for reference).
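
For context, the removed hook followed the standard SparkHooks pattern from the Kedro docs, roughly (a sketch, not the exact file being deleted):

from kedro.framework.hooks import hook_impl
from pyspark import SparkConf
from pyspark.sql import SparkSession


class SparkHooks:
    @hook_impl
    def after_context_created(self, context) -> None:
        # Read the spark.yml config and build a SparkSession at startup.
        parameters = context.config_loader["spark"]
        spark_conf = SparkConf().setAll(parameters.items())
        spark_session_conf = (
            SparkSession.builder.appName(context.project_path.name)
            .enableHiveSupport()
            .config(conf=spark_conf)
        )
        _spark_session = spark_session_conf.getOrCreate()
        _spark_session.sparkContext.setLogLevel("WARN")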

3. Removal of Reporting Pipeline

  • Removed to keep the starter focused and beginner-friendly.
  • The previous reporting pipeline used SQL inside a node, which is not considered good Kedro practice.

4. Improved data_processing Pipeline

  • Updated to support local development with a remote Spark backend (e.g., Databricks Serverless via Databricks Connect).
  • When Spark executes remotely but input files are stored locally:
    • Data is loaded with Pandas datasets.
    • Data is saved using SparkDatasetV2, which automatically converts Pandas → Spark.
  • This pattern enables users to develop on their laptop while persisting and transforming data on a remote Spark cluster.
  • To enable this workflow, the preprocessing step was migrated back to Pandas, matching the approach in the original spaceflights-pandas starter.
  • Subsequent transformations are performed in Spark, demonstrating:
    • how to combine Pandas and Spark in a single pipeline,
    • how the new Dataset V2 layer handles auto-conversion,
    • and how this makes it possible to run locally, with local files, against a remote Spark backend (see the sketch after this list).
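
A minimal sketch of the resulting node split (function names follow the spaceflights convention, but the exact signatures in the updated starter may differ):

import pandas as pd
from pyspark.sql import DataFrame


def preprocess_companies(companies: pd.DataFrame) -> pd.DataFrame:
    # Pure-pandas step: runs on the laptop even when Spark executes remotely.
    companies["iata_approved"] = companies["iata_approved"] == "t"
    return companies


def create_model_input_table(shuttles: DataFrame, companies: DataFrame) -> DataFrame:
    # Spark step: the pandas output above was saved through SparkDatasetV2,
    # which auto-converted it to a Spark DataFrame, so this join runs on the
    # (possibly remote) Spark backend.
    return shuttles.join(companies, "company_id", "left").dropna()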

How to test it:

Start in a fresh venv. You can create a new project from the starter branch using:

kedro new \
  --starter git+https://github.com/kedro-org/kedro-starters.git \
  --directory spaceflights-pyspark \
  --checkout 5210-modernise-spark-starter-to-support-3-execution-modes

Then install the latest version of kedro-datasets from main:

pip install git+https://github.com/kedro-org/kedro-plugins.git@main#subdirectory=kedro-datasets

  1. Run locally with Spark. The project should work out of the box with a local Spark installation; no changes are required.

  2. Run on Databricks (free or paid):
    • Update all dataset paths in catalog.yml to use Volumes, e.g. /Volumes/<catalog>/<schema>/<path> (see the sketch below).
    • Move your project into a Git repository.
    • Import the repository into Databricks Repos.
    • Open a Databricks Notebook inside the repo and execute the Kedro project from there.
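
For example, a Spark entry in the catalog might then look like this (<catalog> and <schema> are placeholders; the spark.SparkDatasetV2 type string is an assumption, as above):

model_input_table:
  type: spark.SparkDatasetV2
  filepath: /Volumes/<catalog>/<schema>/data/03_primary/model_input_table.parquet
  file_format: parquet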

  3. Run locally with remote Spark (paid Databricks only): You can also run the project locally while using a remote Databricks Spark cluster for all Spark transformations.
    • Install databricks-connect.
    • Export the required environment variables: DATABRICKS_HOST and DATABRICKS_TOKEN (commands sketched below).
    • Update all Spark dataset paths in the catalog to use Volumes (same as above).
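
The setup commands, roughly (host and token values deliberately elided):

pip install databricks-connect
export DATABRICKS_HOST=...
export DATABRICKS_TOKEN=...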

Documentation Updates (upcoming)

This PR will be followed by updates to the Spark and Databricks documentation.
The new starter will be explained in three usage scenarios:

  1. Local PySpark execution
  2. Databricks-native execution
  3. Local run with a remote Databricks Spark session
    (Databricks Connect / Spark Connect)

@DimedS DimedS linked an issue Dec 2, 2025 that may be closed by this pull request
@DimedS DimedS requested a review from SajidAlamQB December 2, 2025 11:02
@DimedS DimedS marked this pull request as ready for review December 4, 2025 10:17
@DimedS DimedS requested a review from SajidAlamQB December 4, 2025 10:33

 # Hooks are executed in a Last-In-First-Out (LIFO) order.
-HOOKS = (SparkHooks(),)
+HOOKS = ()
Contributor

We can remove the hooks.py file now, right?

Contributor

Also, spark.yml is unnecessary for SparkDatasetV2; that file is only useful when using SparkHooks. Maybe we should also remove it, or add a comment that it's optional/for advanced configuration.

Member Author

Fully agree, I thought I removed them; fixed it now.

@DimedS DimedS requested a review from SajidAlamQB December 4, 2025 14:16

merelcht commented Dec 4, 2025

I wouldn't remove the Reporting pipeline. This pipeline is meant to demonstrate Kedro Viz features, and the starters are all supposed to be the same, apart from bigger differences like using Spark or not.


companies:
type: pandas.CSVDataset
filepath: data/01_raw/companies.csv
Member

What about when the input data isn't coming from a local place? I'd imagine that if you have massive data you want all of it to be handled by Spark and not just the intermediate results.


DimedS commented Dec 4, 2025

I wouldn't remove the Reporting pipeline. This pipeline is meant to demonstrate Kedro Viz features, and the starters are all supposed to be the same, apart from bigger differences like using Spark or not.

OK, makes sense, I'll revise it then instead of removing it.

@ravi-kumar-pilla
Contributor

Hi @DimedS,

The code looks fine to me. Awesome work!!

I tested the below scenarios and they work as expected:

  1. Local PySpark execution
  2. Databricks-native execution

However, I still need to try Databricks Connect, which I will do tomorrow. But I don't think it will be a blocker, considering the local and native executions work fine.

I do have a [Nit] question: when doing a session.run in a native Databricks environment with paths to Unity Catalog volumes that are not present, new paths (i.e., intermediate, primary, etc.) are not created, whereas locally the folders are created automatically. Is this to avoid permission issues when creating Volume paths on Databricks, or is there another reason we don't create volumes that aren't present?

Thank you


merelcht commented Dec 5, 2025

Could we also address #265 by any chance?


DimedS commented Dec 5, 2025

Could we also address #265 by any chance?

I actually think we should archive the kedro-spark-viz starter. At the moment, there’s no Viz option in the kedro new command, and I’m not sure this starter still serves a purpose.

Checked, it was already archived.

Looks like we already removed the pyspark pin in the current starter, and setuptools is likely not needed in combination with pyspark 4.*. I removed it and will have a look at the tests.


DimedS commented Dec 5, 2025

I do have a [Nit] question: when doing a session.run in a native Databricks environment with paths to Unity Catalog volumes that are not present, new paths (i.e., intermediate, primary, etc.) are not created, whereas locally the folders are created automatically. Is this to avoid permission issues when creating Volume paths on Databricks, or is there another reason we don't create volumes that aren't present?

Thanks for the review, @ravi-kumar-pilla. Makes sense, I'll do a spike on that; we should likely address it at the dataset level.



Development

Successfully merging this pull request may close these issues.

  • Modernise Spark starter to support 3 execution modes
  • Remove setuptools from spaceflights-pyspark-viz starter on pyspark release
