
Conversation


@DimedS DimedS commented Dec 2, 2025

Summary

This PR addresses kedro-org/kedro#5210 and updates the spaceflights-pyspark starter to use the new SparkDatasetV2 introduced in kedro-org/kedro-plugins#1185.

The updated starter will become the default for all PySpark and Databricks examples.
As a result, the databricks-iris starter will be archived.


Key Changes

1. Migration to SparkDatasetV2

  • Replaces the old SparkDataset with the new SparkDatasetV2 (see the catalog sketch below).
  • Enables:
    • Remote Spark execution (Databricks Connect / Spark Connect)
    • Automatic Pandas → Spark conversion on save
    • Cleaner session management (no hooks required)
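
A hypothetical catalog entry using the new dataset might look like this (a sketch only: the spark.SparkDatasetV2 type string and the option names are assumptions carried over from the existing SparkDataset API, not confirmed against the final release):

model_input_table:
  type: spark.SparkDatasetV2
  filepath: data/03_primary/model_input_table.parquet
  file_format: parquet
  save_args:
    mode: overwrite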

2. Removal of Spark Hook

  • Spark session is now created directly inside SparkDatasetV2.
  • The previous hook implementation is no longer needed (roughly sketched below for reference).
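
For context, the removed hook followed the standard SparkHooks pattern from the Kedro docs, roughly (a sketch, not the exact file being deleted):

from kedro.framework.hooks import hook_impl
from pyspark import SparkConf
from pyspark.sql import SparkSession


class SparkHooks:
    @hook_impl
    def after_context_created(self, context) -> None:
        # Read the spark.yml config and build a SparkSession at startup.
        parameters = context.config_loader["spark"]
        spark_conf = SparkConf().setAll(parameters.items())
        spark_session_conf = (
            SparkSession.builder.appName(context.project_path.name)
            .enableHiveSupport()
            .config(conf=spark_conf)
        )
        _spark_session = spark_session_conf.getOrCreate()
        _spark_session.sparkContext.setLogLevel("WARN")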

3. Removal of Reporting Pipeline

  • Removed to keep the starter focused and beginner-friendly.
  • The previous reporting pipeline used SQL inside a node, which is not considered good Kedro practice.

4. Improved data_processing Pipeline

  • Updated to support local development with a remote Spark backend (e.g., Databricks Serverless via Databricks Connect).
  • When Spark executes remotely but input files are stored locally:
    • Data is loaded with Pandas datasets.
    • Data is saved using SparkDatasetV2, which automatically converts Pandas → Spark.
  • This pattern enables users to develop on their laptop while persisting and transforming data on a remote Spark cluster.
  • To enable this workflow, the preprocessing step was migrated back to Pandas, matching the approach in the original spaceflights-pandas starter.
  • Subsequent transformations are performed in Spark, demonstrating:
    • how to combine Pandas and Spark in a single pipeline,
    • how the new Dataset V2 layer handles auto-conversion,
    • and how this makes it possible to run locally, with local files, against a remote Spark backend (see the sketch after this list).
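
A minimal sketch of the resulting node split (function names follow the spaceflights convention, but the exact signatures in the updated starter may differ):

import pandas as pd
from pyspark.sql import DataFrame


def preprocess_companies(companies: pd.DataFrame) -> pd.DataFrame:
    # Pure-pandas step: runs on the laptop even when Spark executes remotely.
    companies["iata_approved"] = companies["iata_approved"] == "t"
    return companies


def create_model_input_table(shuttles: DataFrame, companies: DataFrame) -> DataFrame:
    # Spark step: the pandas output above was saved through SparkDatasetV2,
    # which auto-converted it to a Spark DataFrame, so this join runs on the
    # (possibly remote) Spark backend.
    return shuttles.join(companies, "company_id", "left").dropna()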

How to test it:

Start in a fresh venv. You can create a new project from the starter branch using:

kedro new \
  --starter git+https://github.com/kedro-org/kedro-starters.git \
  --directory spaceflights-pyspark \
  --checkout 5210-modernise-spark-starter-to-support-3-execution-modes

Then install the latest version of kedro-datasets from main:

pip install git+https://github.com/kedro-org/kedro-plugins.git@main#subdirectory=kedro-datasets

  1. Run locally with Spark. The project should work out of the box with a local Spark installation; no changes are required.

  2. Run on Databricks (free or paid):
    • Update all dataset paths in catalog.yml to use Volumes, e.g. /Volumes/<catalog>/<schema>/<path> (see the sketch below).
    • Move your project into a Git repository.
    • Import the repository into Databricks Repos.
    • Open a Databricks Notebook inside the repo and execute the Kedro project from there.
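
For example, a Spark entry in the catalog might then look like this (<catalog> and <schema> are placeholders; the spark.SparkDatasetV2 type string is an assumption, as above):

model_input_table:
  type: spark.SparkDatasetV2
  filepath: /Volumes/<catalog>/<schema>/data/03_primary/model_input_table.parquet
  file_format: parquet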

  3. Run locally with remote Spark (paid Databricks only): You can also run the project locally while using a remote Databricks Spark cluster for all Spark transformations.
    • Install databricks-connect.
    • Export the required environment variables: DATABRICKS_HOST and DATABRICKS_TOKEN (commands sketched below).
    • Update all Spark dataset paths in the catalog to use Volumes (same as above).
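
The setup commands, roughly (host and token values deliberately elided):

pip install databricks-connect
export DATABRICKS_HOST=...
export DATABRICKS_TOKEN=...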

Documentation Updates (upcoming)

This PR will be followed by updates to the Spark and Databricks documentation.
The new starter will be explained in three usage scenarios:

  1. Local PySpark execution
  2. Databricks-native execution
  3. Local run with a remote Databricks Spark session
    (Databricks Connect / Spark Connect)

@DimedS DimedS linked an issue Dec 2, 2025 that may be closed by this pull request
@DimedS DimedS requested a review from SajidAlamQB December 2, 2025 11:02
@DimedS DimedS marked this pull request as ready for review December 4, 2025 10:17
@DimedS DimedS requested a review from SajidAlamQB December 4, 2025 10:33

 # Hooks are executed in a Last-In-First-Out (LIFO) order.
-HOOKS = (SparkHooks(),)
+HOOKS = ()
Contributor

We can remove the hooks.py file now, right?

Contributor

Also, spark.yml is unnecessary for SparkDatasetV2; that file is only useful when using SparkHooks. Maybe we should also remove it, or add a comment that it's optional/for advanced configuration.

Member Author

Fully agree, I thought I removed them; fixed it now.

@DimedS DimedS requested a review from SajidAlamQB December 4, 2025 14:16

merelcht commented Dec 4, 2025

I wouldn't remove the Reporting pipeline. This pipeline is meant to demonstrate Kedro Viz features, and the starters are all supposed to be the same, apart from bigger differences like using Spark or not.


companies:
type: pandas.CSVDataset
filepath: data/01_raw/companies.csv
Member

What about when the input data isn't coming from a local place? I'd imagine that if you have massive data you want all of it to be handled by Spark and not just the intermediate results.


DimedS commented Dec 4, 2025

I wouldn't remove the Reporting pipeline. This pipeline is meant to demonstrate Kedro Viz features, and the starters are all supposed to be the same, apart from bigger differences like using Spark or not.

OK, makes sense, I'll revise it then instead of removing it.

@ravi-kumar-pilla
Contributor

Hi @DimedS,

The code looks fine to me. Awesome work!!

I tested the below scenarios and they work as expected:

  1. Local PySpark execution
  2. Databricks-native execution

However, I still need to try Databricks Connect, which I will do tomorrow. But I don't think it will be a blocker, considering the local and native executions work fine.

I do have a [Nit] question: when doing a session.run in a native Databricks environment with paths to Unity Catalog volumes that are not present, new paths (i.e., intermediate, primary, etc.) are not created, whereas locally the folders are created automatically. Is this to avoid permission issues when creating Volume paths on Databricks, or is there another reason we don't create volumes that aren't present?

Thank you


merelcht commented Dec 5, 2025

Could we also address #265 by any chance?


DimedS commented Dec 5, 2025

Could we also address #265 by any chance?

I actually think we should archive the kedro-spark-viz starter. At the moment, there’s no Viz option in the kedro new command, and I’m not sure this starter still serves a purpose.

Checked, it was already archived.

Looks like we already removed the pyspark pin in the current starter, and setuptools is likely not needed in combination with pyspark 4.*. I removed it and will have a look at the tests.


DimedS commented Dec 5, 2025

I do have a [Nit] question: when doing a session.run in a native Databricks environment with paths to Unity Catalog volumes that are not present, new paths (i.e., intermediate, primary, etc.) are not created, whereas locally the folders are created automatically. Is this to avoid permission issues when creating Volume paths on Databricks, or is there another reason we don't create volumes that aren't present?

Thanks for the review, @ravi-kumar-pilla. Makes sense, I'll do a spike on that; we should likely address it at the dataset level.



Development

Successfully merging this pull request may close these issues.

  • Modernise Spark starter to support 3 execution modes
  • Remove setuptools from spaceflights-pyspark-viz starter on pyspark release
