[SPARK-52505][K8S] Allow to create executor kubernetes service #51203
Open
EnricoMi wants to merge 4 commits into apache:master
@dongjoon-hyun @cloud-fan this avoids the 2-minute timeouts (the default) that occur when executors connect to recently decommissioned or lost executor pods. Since connections are established in a single thread, each timeout also blocks connections to other executors that are still alive.
JIRA Issue Information: Sub-task SPARK-52505 (this comment was automatically generated by GitHub Actions)
What changes were proposed in this pull request?
This allows each executor to register its block manager with the driver via a Kubernetes service name rather than the pod IP. The driver and the other executors then connect to that executor's block manager via the service.
Why are the changes needed?
In Kubernetes, connecting to an evicted (decommissioned) executor times out after 2 minutes (the default). Executors connect to other executors synchronously (one at a time), so this timeout accumulates for each unreachable executor peer. An executor that reads from many decommissioned executors blocks for a multiple of the timeout until it fails with a fetch failure.
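To make the accumulation concrete, here is a minimal back-of-the-envelope sketch. It assumes the 2-minute default mentioned above and strictly sequential connection attempts; the function name is hypothetical, and the model ignores retries and any other settings that may change the real figure.

```python
def worst_case_blocking_seconds(dead_peers: int, connect_timeout_s: float = 120.0) -> float:
    """Rough upper bound on time spent waiting for unreachable peers.

    Assumes connections are attempted one at a time, so every
    unreachable peer costs one full connect timeout.
    """
    if dead_peers < 0:
        raise ValueError("dead_peers must be non-negative")
    return dead_peers * connect_timeout_s


# Five decommissioned peers at the assumed 120 s timeout: 600 s (10 minutes).
print(worst_case_blocking_seconds(5))
```

Under these assumptions, the wait grows linearly with the number of decommissioned peers, which is why failing fast on each connection attempt matters.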
This can be fixed by binding the block manager to a fixed port, defining a Kubernetes service for that block manager port, and having the executor register that Kubernetes service port with the driver. The driver and other executors then connect to the service name and fail instantly with a `connection refused` if the executor has been decommissioned and the service removed.
Setting `spark.kubernetes.executor.enableService=true` and defining `spark.blockManager.port` will perform this setup for each executor.
The Kubernetes service is owned by the driver, meaning it will survive the removal of the executor.
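As an illustration of the two settings named above, the following sketch builds the corresponding `--conf` arguments for `spark-submit`. The helper names and the port value are assumptions made for this example only; `spark.kubernetes.executor.enableService` is the option proposed in this pull request and is not available unless the change is merged.

```python
def executor_service_conf(block_manager_port: int) -> dict:
    """Settings described in this PR; the port is chosen by the user."""
    if not 1 <= block_manager_port <= 65535:
        raise ValueError("block_manager_port must be a valid TCP port")
    return {
        # Option proposed by this PR (assumed unavailable elsewhere).
        "spark.kubernetes.executor.enableService": "true",
        # Fixed port so the per-executor service has a stable target.
        "spark.blockManager.port": str(block_manager_port),
    }


def to_spark_submit_args(conf: dict) -> list:
    """Flatten a config mapping into `--conf key=value` arguments."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


# 7079 is an arbitrary example port, not a recommendation.
print(" ".join(to_spark_submit_args(executor_service_conf(7079))))
```

The printed flags can be appended to an existing `spark-submit` command line; any free port that suits the cluster's network policy should work equally well.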
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Unit tests.
Was this patch authored or co-authored using generative AI tooling?
No.