Modify spaceflights-pyspark starter #300

base: main
Conversation
Signed-off-by: Dmitry Sorokin <[email protected]>
```diff
 # Hooks are executed in a Last-In-First-Out (LIFO) order.
-HOOKS = (SparkHooks(),)
+HOOKS = ()
```
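The comment about hook ordering means the last entry registered in the `HOOKS` tuple is invoked first. A minimal pure-Python sketch of that LIFO dispatch (the class and function names are illustrative, not Kedro's actual hook manager):

```python
class LoggingHook:
    """Toy hook that records its name when its event method fires."""

    def __init__(self, name, log):
        self.name = name
        self.log = log

    def after_context_created(self):
        self.log.append(self.name)


def run_hooks(hooks, event):
    # LIFO dispatch: walk the registered tuple in reverse order
    for hook in reversed(hooks):
        getattr(hook, event)()


log = []
HOOKS = (LoggingHook("first_registered", log), LoggingHook("last_registered", log))
run_hooks(HOOKS, "after_context_created")
print(log)  # -> ['last_registered', 'first_registered']
```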
We can remove the hooks.py file now, right?
Also, `spark.yml` is unnecessary for `SparkDatasetV2`; that file is only useful when using `SparkHooks`. Maybe we should also remove it, or add a comment that it's optional / for advanced configuration.
Fully agree. I thought I had removed them; fixed now.
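For context, the `spark.yml` being discussed typically holds Spark runtime options that `SparkHooks` reads when building the `SparkSession`. A hedged sketch; the keys below are common examples, not taken from this PR:

```yaml
# conf/base/spark.yml (illustrative keys only)
spark.driver.memory: 2g
spark.executor.memory: 2g
spark.sql.shuffle.partitions: 8
```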
I wouldn't remove the Reporting pipeline. It is meant to demonstrate Kedro-Viz features, and the starters are all supposed to be the same apart from bigger differences, such as using Spark or not.
```yaml
companies:
  type: pandas.CSVDataset
  filepath: data/01_raw/companies.csv
```
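For comparison, keeping the raw load in Spark rather than pandas would look roughly like the sketch below; the `spark.SparkDatasetV2` type string and the S3 path are assumptions for illustration, not code from this PR:

```yaml
# Hypothetical: read the raw file with Spark instead of pandas
companies:
  type: spark.SparkDatasetV2
  filepath: s3://my-bucket/01_raw/companies.csv
  file_format: csv
  load_args:
    header: true
    inferSchema: true
```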
What about when the input data isn't coming from a local place? I'd imagine that if you have massive data you want all of it to be handled by Spark and not just the intermediate results.
Ok, makes sense; I will revise it instead of removing it.
Hi @DimedS, the code looks fine to me. Awesome work!! I tested the below scenarios and they work as expected:

However, I still need to try databricks-connect, which I will do tomorrow. I do not think it will be a blocker, considering the local and native executions work fine. I do have a [Nit] question - when doing a

Thank you
Could we also address #265 by any chance?
I actually think we should archive the

Checked, it was already archived. It looks like we already removed the pyspark pin in the current starter, and setuptools in combination with pyspark 4.* is likely not needed; I removed it, please have a look at the tests.
Thanks for the review, @ravi-kumar-pilla. Makes sense; I will do a spike on that. Likely we should address it at the dataset level.
Summary
This PR addresses kedro-org/kedro#5210 and updates the `spaceflights-pyspark` starter to use the new `SparkDatasetV2` introduced in kedro-org/kedro-plugins#1185. The updated starter will become the default for all PySpark and Databricks examples. As a result, the `databricks-iris` starter will be archived.

Key Changes

1. Migration to `SparkDatasetV2`: replaced `SparkDataset` with the new `SparkDatasetV2`.
2. Removal of Spark hooks: they are no longer needed with `SparkDatasetV2`.
3. Removal of the Reporting pipeline.
4. Improved `data_processing` pipeline: outputs are handled by `SparkDatasetV2`, which automatically converts pandas → Spark, aligning it with the `spaceflights-pandas` starter.

How to test it:
Start in a fresh venv and create a new Kedro project from the starter.

Install the latest version of kedro-datasets from main:

```shell
pip install "git+https://github.com/kedro-org/kedro-plugins.git@main#subdirectory=kedro-datasets"
```

Run locally with Spark: the project should work out of the box with a local Spark installation; no changes required.

Run on Databricks (free or paid): update all dataset paths in catalog.yml to use Volumes, e.g. `/Volumes/<catalog>/<schema>/<path>`. Move your project into a Git repository, import the repository into Databricks Repos, open a Databricks notebook inside the repo, and execute the Kedro project from there.

Run locally with remote Spark (paid Databricks only): you can run the project locally while using a remote Databricks Spark cluster for all Spark transformations. Install databricks-connect and export the required environment variables:

```shell
export DATABRICKS_HOST=...
export DATABRICKS_TOKEN=...
```

Update all Spark dataset paths in the catalog to use Volumes (same as above).
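As a quick local sanity check (no Spark needed) that pandas-style nodes behave as expected before the catalog's Spark dataset converts their outputs, here is a hypothetical sketch; the function name and columns mirror the spaceflights example data, not this PR's exact code:

```python
import pandas as pd


def preprocess_companies(companies: pd.DataFrame) -> pd.DataFrame:
    """Parse raw string columns into typed values (pure pandas; any
    pandas -> Spark conversion is left to the catalog dataset on save)."""
    out = companies.copy()
    # "90%" -> 0.9
    out["company_rating"] = (
        out["company_rating"].str.replace("%", "", regex=False).astype(float) / 100
    )
    # "t"/"f" flags -> booleans
    out["iata_approved"] = out["iata_approved"] == "t"
    return out


sample = pd.DataFrame(
    {"company_rating": ["90%", "50%"], "iata_approved": ["t", "f"]}
)
print(preprocess_companies(sample))
```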
This PR will be followed by updates to the Spark and Databricks documentation.
The new starter will be explained in three usage scenarios:
(Databricks Connect / Spark Connect)