Skip to content

Conversation

@nickstern2002
Copy link
Member

@nickstern2002 nickstern2002 commented Nov 25, 2025

What type of PR is this?

What this PR does / why we need it:

Two changes present in this PR

Dropping logic that adds the remove-node label

  • The ToBeDeletedByClusterAutoscaler taint will be used as an indicator that a node should be removed from the node group. The remove-node label is now returning to its original state of being controlled by a single internal controller
  • The Autoscaler's usage of the remove-node label was causing conflicts with the internal controller responsible for reconciling node groups.

Taking the absolute value of delta in the DecreaseTargetSize

  • This is fixing a bug in the current CoreWeave Autoscaler implementation. Currently, when DecreaseTargetSize is called delta is passed as a negative value. This causes the autoscaler to increase a nodePool's target size because the method subtracts delta.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?

This improves the user experience when using the CoreWeave implementation of the Cluster Autoscaler. Specifically prevents scenarios such as too many nodes getting removed and node groups being mistakenly scaled up. 

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


lxuan94-pp and others added 15 commits November 3, 2025 11:01
…rework

chore: rework cluster-autoscaler Scaleway cloudprovider integration
…r-api-changes

Adjust OWNERS so that only API changes need api review
* fix: ensure vpa_recommender_vpa_objects_count UpdateModeInPlaceOrRecreate is reset

* move GetUpdateModes() to helpers.go

* update copyright

Co-authored-by: Adrian Moisey <[email protected]>

---------

Co-authored-by: Adrian Moisey <[email protected]>
…t-in-pod-auto-scaler

fix: deprecated import of cacheddiscovery in vertical-pod-autoscaler
…deStartupTime

Make maxNodeStartupTime configurable
fix: deprecated import of cacheddiscovery in balancer
Add support for Intel Habana Gaudi GPUs in the cluster autoscaler by:
- Define ResourceIntelGPU resource name (habana.ai/gaudi)
- Add Intel GPU to GPUVendorResourceNames list
- Refactor GPU detection logic to iterate through all GPU vendor resource names
    instead of checking vendors individually

This enables the autoscaler to properly detect and handle Intel GPU nodes
alongside existing NVIDIA, AMD, and DirectX GPU support.
Extract the GPU allocatable detection loop into a new NodeHasGpuAllocatable
helper function in utils/gpu/gpu.go. This eliminates code duplication across
gpu_processor.go and makes the logic more maintainable.

The new function returns both the GPU allocatable value and whether it exists,
allowing callers to get both pieces of information in a single call.

Changes:
- Add NodeHasGpuAllocatable() helper in utils/gpu/gpu.go
- Update NodeHasGpu() to use the new helper
- Simplify FilterOutNodesWithUnreadyResources() in gpu_processor.go
- Simplify GetNodeGpuTarget() in gpu_processor.go
@nickstern2002 nickstern2002 self-assigned this Nov 26, 2025
@nickstern2002 nickstern2002 force-pushed the ns/drop-remove-node-label branch from f152187 to 8ce8145 Compare December 2, 2025 21:35
@nickstern2002 nickstern2002 requested a review from a team December 2, 2025 21:58
RixhersAjazi
RixhersAjazi previously approved these changes Dec 2, 2025
Copy link

@RixhersAjazi RixhersAjazi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just a quick question for my understanding. Would suggest waiting on others to review as well to ensure bigger picture context isn't missing.

Copy link

@LanceEa LanceEa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a couple of open questions but the gist looks right.

LanceEa
LanceEa previously approved these changes Dec 3, 2025
@LanceEa LanceEa requested a review from RixhersAjazi December 3, 2025 14:34
@nickstern2002
Copy link
Member Author

nickstern2002 commented Dec 3, 2025

Okay cool, I'm not going to merge this in as it will pollute our fork's commit history. Instead I am going to get this rebased on top of the upstream's master branch and open this PR in the upstream repo. I've already cut an image from this branch's latest commit so it is good to rollout now

This will make this PR a little messy on our end but it will look the same in the upstream PR. Its better than having to ask CBS to manually hard reset our fork.

Will eventually close this PR once the Upstream has been merged in

@nickstern2002 nickstern2002 force-pushed the ns/drop-remove-node-label branch from 106ffbf to 6d42554 Compare December 3, 2025 17:03
@nickstern2002
Copy link
Member Author

Here is the upstream PR
kubernetes#8880

Copy link

@RixhersAjazi RixhersAjazi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving commit 6d42554

@nickstern2002 nickstern2002 merged commit ac281ca into master Dec 5, 2025
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.