[SPARK-52505][K8S] Allow to create executor kubernetes service #51203

Open
EnricoMi wants to merge 4 commits into apache:master from G-Research:k8s-executor-service

@EnricoMi EnricoMi commented Jun 17, 2025

What changes were proposed in this pull request?

This allows executors to register their block managers with the driver via a Kubernetes service name rather than the pod IP. The driver and the other executors then connect to each executor's block manager via that service.

Why are the changes needed?

In Kubernetes, connecting to an evicted (decommissioned) executor times out after 2 minutes (the default). Executors connect to other executors synchronously (one at a time), so this timeout accumulates for each unreachable peer. An executor that reads from many decommissioned executors blocks for a multiple of the timeout until it fails with a fetch failure.
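The accumulation described above can be sketched with simple arithmetic. This is only an illustrative upper-bound model, not code from the patch: it assumes every connection attempt to a decommissioned peer waits for the full timeout and that attempts are strictly sequential. The function name and parameters are hypothetical.

```python
def worst_case_blocking_seconds(dead_peers: int, connect_timeout_s: float = 120.0) -> float:
    """Upper bound on time spent waiting when connections are tried one at a time.

    Assumption (not from the patch): each attempt to reach a decommissioned
    peer waits for the full connect timeout before the next attempt starts.
    The default of 120 seconds mirrors the 2-minute default mentioned above.
    """
    if dead_peers < 0:
        raise ValueError("dead_peers must be non-negative")
    if connect_timeout_s < 0:
        raise ValueError("connect_timeout_s must be non-negative")
    return dead_peers * connect_timeout_s


if __name__ == "__main__":
    for peers in (1, 5, 20):
        minutes = worst_case_blocking_seconds(peers) / 60
        print(f"{peers} decommissioned peers: up to {minutes:.0f} minutes blocked")
```

Under these assumptions, an executor reading from 5 decommissioned peers could block for up to 10 minutes before reporting a fetch failure.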

This can be fixed by binding the block manager to a fixed port, defining a Kubernetes service for that block manager port, and having the executor register that Kubernetes service and port with the driver. The driver and other executors then connect to the service name and fail instantly with a connection refused error if the executor has been decommissioned and its service removed.

Setting spark.kubernetes.executor.enableService=true and defining spark.blockManager.port will perform this setup for each executor.
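As a rough illustration of the two settings named above, the following sketch builds the corresponding `--conf` arguments for `spark-submit`. The helper names and the port value `7079` are hypothetical, and `spark.kubernetes.executor.enableService` is taken from this PR description, so it is only available in builds that include this change.

```python
from typing import Dict, List


def executor_service_conf(block_manager_port: int) -> Dict[str, str]:
    """Return the two settings this PR description says are required.

    The port is whatever fixed port the user chooses; it is not prescribed
    by the PR.
    """
    if not 1 <= block_manager_port <= 65535:
        raise ValueError("block_manager_port must be a valid TCP port")
    return {
        # Setting proposed by this PR (assumed unavailable without it).
        "spark.kubernetes.executor.enableService": "true",
        # Existing Spark setting that pins the block manager to a fixed port.
        "spark.blockManager.port": str(block_manager_port),
    }


def to_submit_args(conf: Dict[str, str]) -> List[str]:
    """Flatten a config mapping into repeated `--conf key=value` arguments."""
    args: List[str] = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    # 7079 is an arbitrary example port, not a value from the PR.
    print(" ".join(to_submit_args(executor_service_conf(7079))))
```

The resulting arguments would be appended to an otherwise ordinary `spark-submit` invocation targeting a Kubernetes master.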

The Kubernetes service is owned by the driver, meaning it will survive the removal of the executor.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Unit tests.

Was this patch authored or co-authored using generative AI tooling?

No.

EnricoMi commented Oct 7, 2025

@dongjoon-hyun @cloud-fan this avoids the 2-minute (default) timeouts that occur when executors connect to recently decommissioned or lost executor pods. Since connections are established in a single thread, each timeout also blocks connections to other executors that are still alive.

@EnricoMi EnricoMi force-pushed the k8s-executor-service branch from 6bfacf0 to 52b69d8 Compare November 24, 2025 08:24

github-actions bot commented Feb 6, 2026

JIRA Issue Information

=== Sub-task SPARK-52505 ===
Summary: Connection timeout on Kubernetes to decommissioned executor
Assignee: None
Status: Open
Affected: ["4.1.0"]


This comment was automatically generated by GitHub Actions

@EnricoMi EnricoMi force-pushed the k8s-executor-service branch from 977d448 to ed2d724 Compare February 17, 2026 09:57