fix: stop using remove node label #7
…rework chore: rework cluster-autoscaler Scaleway cloudprovider integration
…r-api-changes Adjust OWNERS so that only API changes need api review
* fix: ensure vpa_recommender_vpa_objects_count UpdateModeInPlaceOrRecreate is reset
* move GetUpdateModes() to helpers.go
* update copyright
Co-authored-by: Adrian Moisey <[email protected]>
…t-in-pod-auto-scaler fix: deprecated import of cacheddiscovery in vertical-pod-autoscaler
…deStartupTime Make maxNodeStartupTime configurable
fix: deprecated import of cacheddiscovery in balancer
Add support for Intel Habana Gaudi GPUs in the cluster autoscaler by:
- Define ResourceIntelGPU resource name (habana.ai/gaudi)
- Add Intel GPU to GPUVendorResourceNames list
- Refactor GPU detection logic to iterate through all GPU vendor resource names
instead of checking vendors individually
This enables the autoscaler to properly detect and handle Intel GPU nodes
alongside existing NVIDIA, AMD, and DirectX GPU support.
Extract the GPU allocatable detection loop into a new NodeHasGpuAllocatable helper function in utils/gpu/gpu.go. This eliminates code duplication across gpu_processor.go and makes the logic more maintainable. The new function returns both the GPU allocatable value and whether it exists, allowing callers to get both pieces of information in a single call.
Changes:
- Add NodeHasGpuAllocatable() helper in utils/gpu/gpu.go
- Update NodeHasGpu() to use the new helper
- Simplify FilterOutNodesWithUnreadyResources() in gpu_processor.go
- Simplify GetNodeGpuTarget() in gpu_processor.go
…support Add Intel GPU (Habana Gaudi) autoscaler support
…encytracker Node removal latency metrics added
f152187 to 8ce8145
RixhersAjazi
left a comment
LGTM, just a quick question for my understanding. Would suggest waiting on others to review as well to ensure bigger picture context isn't missing.
cluster-autoscaler/cloudprovider/coreweave/coreweave_nodegroup.go
Outdated
LanceEa
left a comment
a couple of open questions but the gist looks right.
cluster-autoscaler/cloudprovider/coreweave/coreweave_nodegroup.go
Outdated
Okay cool, I'm not going to merge this in, as it would pollute our fork's commit history. Instead, I am going to rebase this on top of upstream's master branch and open the PR in the upstream repo. I've already cut an image from this branch's latest commit, so it is good to roll out now. This will make this PR a little messy on our end, but it will look the same in the upstream PR. It's better than having to ask CBS to manually hard reset our fork. I will eventually close this PR once the upstream PR has been merged.
e761968 to 106ffbf
106ffbf to 6d42554
Here is the upstream PR
RixhersAjazi
left a comment
Approving commit 6d42554
What type of PR is this?
What this PR does / why we need it:
Two changes are present in this PR:
- Dropping the logic that adds the remove-node label
  - The remove-node label is now returning to its original state of being controlled by a single internal controller.
  - The remove-node label was causing conflicts with the internal controller responsible for reconciling node groups.
- Taking the absolute value of delta in DecreaseTargetSize
  - When DecreaseTargetSize is called, delta is passed as a negative value. This causes the autoscaler to increase a nodePool's target size because the method subtracts delta.
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:
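For reviewers who want to see the DecreaseTargetSize sign issue from the description in isolation, here is a minimal sketch. It is not the code from coreweave_nodegroup.go: the nodeGroup type, its targetSize field, the two method names, and the bounds check are all hypothetical, chosen only to illustrate why subtracting a negative delta grows the target size and how taking the absolute value avoids that.

```go
package main

import (
	"errors"
	"fmt"
)

// nodeGroup is a hypothetical stand-in for the provider's node group type.
type nodeGroup struct {
	targetSize int
}

// decreaseBuggy subtracts delta as received. With a negative delta the
// subtraction becomes an addition, so the target size increases.
func (ng *nodeGroup) decreaseBuggy(delta int) {
	ng.targetSize -= delta
}

// decreaseFixed takes the absolute value of delta before subtracting, so the
// target size shrinks regardless of the sign the caller passes.
// The bounds check is an assumption added for the sketch.
func (ng *nodeGroup) decreaseFixed(delta int) error {
	if delta < 0 {
		delta = -delta
	}
	if delta > ng.targetSize {
		return errors.New("delta exceeds current target size")
	}
	ng.targetSize -= delta
	return nil
}

func main() {
	buggy := &nodeGroup{targetSize: 5}
	buggy.decreaseBuggy(-2)
	fmt.Println("buggy:", buggy.targetSize) // 5 - (-2) = 7

	fixed := &nodeGroup{targetSize: 5}
	if err := fixed.decreaseFixed(-2); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("fixed:", fixed.targetSize) // 5 - 2 = 3
}
```

Normalizing the sign inside the method keeps the behaviour correct whether a caller passes -2 or 2, which is one reasonable way to read "taking the absolute value of delta"; the actual change may differ in detail.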