Collaborator

@comtalyst comtalyst commented Oct 1, 2025

Fixes #

Description

New node provisioning architecture. More details TODO.

This provision mode is considered in preview at least until the AKS machine API is generally available.
Currently, users need to:

  • Register the preview for AKS machine API
  • Opt in to this provision mode by changing PROVISION_MODE when deploying Karpenter
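
As a rough illustration of the opt-in, the sketch below reads PROVISION_MODE and validates it. Only the variable name and the two mode values mentioned in this conversation (aksscriptless, aksmachineapi) come from the PR; the helper name, the fallback to aksscriptless when the variable is unset, and the rejection of any other value are assumptions made for the example, not Karpenter's actual option handling.

```go
package main

import (
	"fmt"
	"os"
)

// Mode values as they appear in this conversation.
const (
	modeAKSScriptless = "aksscriptless"
	modeAKSMachineAPI = "aksmachineapi"
)

// parseProvisionMode is a hypothetical helper. It only knows the two modes
// named in this thread; other modes may exist and are not covered here.
func parseProvisionMode(raw string) (string, error) {
	switch raw {
	case "":
		// Assumption: fall back to the existing mode when nothing is set.
		return modeAKSScriptless, nil
	case modeAKSScriptless, modeAKSMachineAPI:
		return raw, nil
	default:
		return "", fmt.Errorf("unsupported PROVISION_MODE %q", raw)
	}
}

func main() {
	mode, err := parseProvisionMode(os.Getenv("PROVISION_MODE"))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("provision mode:", mode)
}
```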

Meanwhile:

  • E2E tests are not being run automatically for this provision mode, and continue to run with the aksscriptless provision mode. This may change separately from this PR

Tests to follow, not necessarily for PR

  • DriftAction handling E2Es
  • NAP provision mode migration test
  • NAP perf/scale test on large clusters
  • FIPS/SIG test on utilization suite + update to cover self-hosted
  • byok suite

How was this change tested?

  • New acceptance tests added in all dimensions, in parity with VM instances
  • Current E2Es except:
    • Awaiting API-side fix: spot taints inconsistency
      • 4/12 tests in consolidation
      • 3/21 tests in drift
      • 2/17 tests in scheduling
    • Awaiting API-side fix: in-place update capability
      • 3/3 tests in inplaceupdate
    • Awaiting E2E changes from BREAKING: you cannot create subnets inside of the managed vnet  #1221
      • 1/22 tests in integration
      • 1/1 tests in subnet
    • Need testing on NAP
      • 3/8 in utilization
    • Highly flaky, deprioritized
      • 1/1 in byok
  • Integration test for NAP
  • Manual inspection on provisioned K8s nodes: correct labels
  • Nodepool names that are long or start with a number

Does this change impact docs?

  • Yes, PR includes docs updates
  • Yes, issue opened: #
  • No

Release Note

TODO

@comtalyst comtalyst changed the title feat: further parity with AKS from AKS machine API integration/provision mode feat: further parity with AKS from AKS machine API provision mode Oct 1, 2025
@comtalyst comtalyst changed the title feat: further parity with AKS from AKS machine API provision mode feat: further AKS parity with AKS machine API provision mode Oct 1, 2025
@comtalyst comtalyst changed the title feat: further AKS parity with AKS machine API provision mode feat: further AKS feature/quality/perf parity with AKS machine API provision mode Oct 1, 2025
@comtalyst comtalyst changed the title feat: further AKS feature/quality/perf parity with AKS machine API provision mode feat: further AKS feature/quality/perf parity, through machine API integration Oct 1, 2025
@comtalyst comtalyst force-pushed the comtalyst/aks-machine-api-integration-core branch from 99c437a to ab84951 Compare October 1, 2025 03:12
@@ -0,0 +1,279 @@
/*
Collaborator Author

Still need to re-check how out-of-pattern the naming of these files is.
The intention is to split the files while letting the suite prefix "group" the files together in IDE file explorer/list.

import (
"context"
"fmt"
"strings"
Collaborator Author

@charliedmcb do you mind reviewing this area?

"sigs.k8s.io/controller-runtime/pkg/reconcile"
karpv1 "sigs.k8s.io/karpenter/pkg/apis/v1"
"sigs.k8s.io/karpenter/pkg/operator/injection"
corenodeclaimutils "sigs.k8s.io/karpenter/pkg/utils/nodeclaim"
Collaborator Author

@matthchr do you mind reviewing this package?

}

return lo.ToPtr(ossku), lo.ToPtr(enabledArtifactStreaming), enableFIPs, nil
}
Collaborator Author

@charliedmcb do you mind checking this area?

Tags: tags,
},
}, nil
}
Collaborator Author

@charliedmcb @Bryce-Soghigian do you mind double-checking whether the features you are familiar with are wired correctly? I can confirm that the existing E2Es pass, though.

}
if o.ManageExistingAKSMachines {
if o.ProvisionMode == consts.ProvisionModeAKSMachineAPI || o.ManageExistingAKSMachines {
aksMachinesClient, err = armcontainerservice.NewMachinesClient(cfg.SubscriptionID, cred, opts)
Collaborator Author

@rakechill do you mind verifying whether the new options logic (relevant to SIG) makes sense?
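
To make the condition in the diff above easier to reason about, here is a minimal sketch of the same predicate in isolation: the machines client is needed either when the provision mode is the machine API mode or when existing AKS machines must be managed. The struct, the helper name, and the literal value "aksmachineapi" for consts.ProvisionModeAKSMachineAPI are assumptions for illustration.

```go
package main

import "fmt"

// Assumed value of consts.ProvisionModeAKSMachineAPI.
const provisionModeAKSMachineAPI = "aksmachineapi"

// options is a simplified stand-in for the real options struct.
type options struct {
	ProvisionMode             string
	ManageExistingAKSMachines bool
}

// needsAKSMachinesClient mirrors the updated condition from the diff:
// create the client when provisioning through the machine API, or when
// machines created earlier still have to be managed.
func needsAKSMachinesClient(o options) bool {
	return o.ProvisionMode == provisionModeAKSMachineAPI || o.ManageExistingAKSMachines
}

func main() {
	cases := []options{
		{ProvisionMode: "aksscriptless", ManageExistingAKSMachines: false},
		{ProvisionMode: "aksscriptless", ManageExistingAKSMachines: true},
		{ProvisionMode: provisionModeAKSMachineAPI, ManageExistingAKSMachines: false},
	}
	for _, o := range cases {
		fmt.Printf("%+v -> client needed: %v\n", o, needsAKSMachinesClient(o))
	}
}
```

One consequence of this shape is that switching back from the machine API mode with ManageExistingAKSMachines set still yields a client, so previously created machines remain reachable.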

@comtalyst comtalyst force-pushed the comtalyst/aks-machine-api-integration-core branch 3 times, most recently from 575faa3 to ec2a6aa Compare October 7, 2025 22:00
Contributor

@charliedmcb charliedmcb left a comment

Only reviewed drift.go

Assuming the supporting functions are implemented correctly, lgtm

for _, availableImage := range nodeImages {
// Note: not supporting drift across galleries yet, as AKS machine does not hold gallery info, as of now.
// Alternatively, could call GET VM, if not propose API changes.
availableImageVersion, err := utils.GetAKSMachineNodeImageVersionFromImageID(availableImage.ID) // WARNING: verify whether this function supports the desired gallery
Contributor

I'm not actually seeing this utils func existing?

Collaborator Author

It is in a different PR: #1180
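
Since the helper lives in another PR, the sketch below shows one plausible shape of the loop it is used in, with the ID-to-version conversion injected as a function so that no ID format is assumed. The type, the function names, the comparison rule (a machine counts as drifted when its version matches none of the available images), and the toy parser are all hypothetical, not the PR's implementation.

```go
package main

import (
	"fmt"
	"strings"
)

// nodeImage is a simplified stand-in for an available node image.
type nodeImage struct{ ID string }

// isImageDrifted reports whether currentVersion matches none of the
// available images. versionFromID stands in for the helper from the other
// PR; its real behavior is not shown in this conversation.
func isImageDrifted(currentVersion string, available []nodeImage, versionFromID func(string) (string, error)) (bool, error) {
	for _, img := range available {
		v, err := versionFromID(img.ID)
		if err != nil {
			return false, fmt.Errorf("parsing image ID %q: %w", img.ID, err)
		}
		if v == currentVersion {
			return false, nil // still on an available image
		}
	}
	return true, nil
}

// lastSegment is a toy parser used only for this example: it treats the text
// after the final "/" as the version.
func lastSegment(id string) (string, error) {
	i := strings.LastIndex(id, "/")
	if i < 0 || i == len(id)-1 {
		return "", fmt.Errorf("no version segment in %q", id)
	}
	return id[i+1:], nil
}

func main() {
	available := []nodeImage{{ID: "images/example-a/1.0"}, {ID: "images/example-b/2.0"}}
	drifted, err := isImageDrifted("1.0", available, lastSegment)
	fmt.Println(drifted, err)
}
```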

Contributor

@charliedmcb charliedmcb left a comment

Did only a quick pass over aksmachineinstancehelpers.go

Needs fixes. At a minimum, FIPS isn't handled correctly.

Didn't take a close look at anything I wasn't super familiar with.

if lo.FromPtr(nodeClass.Spec.FIPSMode) == v1beta1.FIPSModeFIPS {
ossku = armcontainerservice.OSSKUUbuntu
enabledArtifactStreaming = false
enableFIPs = lo.ToPtr(true)
Contributor

enableFIPs can be true for any image family. I don't think this handling makes any sense. It should be pulling enableFIPS from the nodeclass.

Collaborator Author

Are you sure..? Could you verify if the logic in provisionclientbootstrap.go is correct?

Contributor

It can definitely be true for AzureLinuxImageFamily

Contributor

Also, it is an aksnodeclass-level feature, so I don't think it makes sense to configure enableFIPs by extrapolating which images it does or doesn't work for, since support for 2204 and 2404 will come. IMO we should get this from the nodeclass: enable it if the mode is FIPS, otherwise don't.

Contributor

I mean, it checks what the field is set to, no?

enableFIPS := lo.FromPtr(p.FIPSMode) == v1beta1.FIPSModeFIPS

Collaborator Author

Ah, I see. So the part where my change is wrong is that enableFIPS is set only for UbuntuImageFamily, while it could just use the value from the node class.
As for the ossku selection based on nodeClass.Spec.FIPSMode, does that look correct?

Collaborator Author

(Fixed)
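
The outcome of this thread can be sketched as follows: the FIPS flag is derived only from the node class's FIPS mode and is independent of the image family. The types, the string values "FIPS" and "Disabled", and the image family names are simplified assumptions standing in for the v1beta1 definitions; this is not the code that was pushed as the fix.

```go
package main

import "fmt"

type fipsMode string

// Assumed values; the real constants live in the v1beta1 API package.
const (
	fipsModeFIPS     fipsMode = "FIPS"
	fipsModeDisabled fipsMode = "Disabled"
)

// nodeClassSpec is a simplified stand-in for the AKSNodeClass spec.
type nodeClassSpec struct {
	ImageFamily string
	FIPSMode    *fipsMode
}

// enableFIPS depends only on the node class's FIPS mode. A nil mode is
// treated as disabled, which matches comparing lo.FromPtr's zero value
// against the FIPS constant.
func enableFIPS(spec nodeClassSpec) bool {
	return spec.FIPSMode != nil && *spec.FIPSMode == fipsModeFIPS
}

func ptr[T any](v T) *T { return &v }

func main() {
	for _, family := range []string{"Ubuntu", "AzureLinux"} {
		on := nodeClassSpec{ImageFamily: family, FIPSMode: ptr(fipsModeFIPS)}
		off := nodeClassSpec{ImageFamily: family, FIPSMode: ptr(fipsModeDisabled)}
		fmt.Println(family, enableFIPS(on), enableFIPS(off))
	}
}
```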

// Note: as of the time of writing, AKS machine API does not support tags on NICs. This could be fixed server-side.
tags := ConfigureAKSMachineTags(options.FromContext(ctx), nodeClass, nodeClaim, creationTimestamp)

return &armcontainerservice.Machine{
Contributor

Will also have to add the image family, once we don't need to use the headers

Collaborator Author

Do you know if that plan is being implemented?
AFAIK the new way to pass in the image is being worked on from the API side, but I'm not sure if it will be in the form of a new field (which would require another SDK version bump).

Collaborator Author

Confirmed with the API team that this is upcoming.

Member

@matthchr matthchr left a comment

Almost done, leaving my comments. Have a little bit to go still.


AnnotationAKSNodeClassHash = apis.Group + "/aksnodeclass-hash"
AnnotationAKSNodeClassHashVersion = apis.Group + "/aksnodeclass-hash-version"
AnnotationAKSMachineResourceID = apis.Group + "/aks-machine-resource-id" // resource ID of the associated AKS machine
Member

At least at some points in the past we've duplicated some of this between v1alpha1 and v1beta1. Not sure if we should keep doing it, but it may be worth doing at least for now, until we drop v1alpha1 entirely?

Collaborator Author

I don't see a v1alpha1 to add this to anymore. I think we dropped it?
v1alpha2 is still there, but there is no label duplication.
Btw, what is the expected impact of this?

@tallaxes, thoughts?

// An "instance" is a remote object, created by the API based on the template.
// A "template" is a local struct, populated from Karpenter-provided parameters with the logic further below.
// A "template" shares the struct with an "instance" representation. But read-only fields may not be populated. Ideally, the types should have been separated to avoid making cross-module assumption of the existence of certain fields.
type AKSMachinePromise struct {
Member

Can we take all of the fields on this object that aren't the promise func + provider ref and put them onto their own type? Something like:

type AKSMachineDetails struct

or something?

Then in cloudprovider controller instead of having to pass all the bits around you can just pass the whole struct, so this:

	newNodeClaim, err := instance.BuildNodeClaimFromAKSMachineTemplate(
		ctx, aksMachinePromise.AKSMachineTemplate,
		aksMachinePromise.InstanceType,
		aksMachinePromise.CapacityType,
		lo.ToPtr(aksMachinePromise.Zone),
		aksMachinePromise.AKSMachineID,
		aksMachinePromise.VMResourceID,
		false,
		aksMachinePromise.AKSMachineNodeImageVersion)

becomes

	newNodeClaim, err := instance.BuildNodeClaimFromAKSMachineTemplate(
		ctx, 
		aksMachinePromise.Machine)

Collaborator Author

For BuildNodeClaimFromAKSMachineTemplate(), there is another variant/layer located right below it: BuildNodeClaimFromAKSMachine(). That one is more similar to your proposal of passing in just the Machine object.

The reason is that these functions can be called with both (A) a real Machine object retrieved from the API, and (B) a Karpenter-populated struct just after creation, which may have some fields unpopulated. Using the same BuildNodeClaimFromAKSMachine() for both obscures the fact that use case (B) may have different assumptions from (A).

The other parameters in BuildNodeClaimFromAKSMachineTemplate are either not naturally part of the Karpenter-populated struct (e.g., resource IDs) or are in a different format but easily retrieved in (B)'s code path without a conversion (e.g., InstanceType was given and then converted for use in the Machine API, so here we just take the pre-conversion value rather than converting it back from the machine template struct).

The trade-off is that it looks less minimal, as we can see, though I don't think that is as significant as the above. Thoughts?

type AKSMachineDetails struct

Another variant of the philosophy above is to have dedicated structs for the "template" and the server-side object. But I don't think that is worth adding another structure. Thoughts?
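
For reference while weighing the two positions, here is a minimal sketch of the grouping proposed in the review: everything on the promise that is not the wait function moves into one details struct, which is then passed whole. The field names follow the call site quoted above; all types are simplified stand-ins (for example, InstanceType is a plain string here), the provider reference is omitted, and buildNodeClaimFromDetails is a hypothetical name.

```go
package main

import "fmt"

// machineTemplate stands in for the locally populated Machine struct.
type machineTemplate struct{ Name string }

// aksMachineDetails groups what is known right after creation, without
// relying on read-only fields that only the API populates.
type aksMachineDetails struct {
	Template         machineTemplate
	InstanceType     string
	CapacityType     string
	Zone             string
	AKSMachineID     string
	VMResourceID     string
	NodeImageVersion string
}

// aksMachinePromise keeps only the details and the wait function.
type aksMachinePromise struct {
	Machine aksMachineDetails
	Wait    func() error
}

// nodeClaim is a simplified stand-in for the Karpenter NodeClaim.
type nodeClaim struct {
	Name, InstanceType, CapacityType, Zone, ProviderID, ImageVersion string
}

// buildNodeClaimFromDetails takes the whole struct instead of eight arguments.
func buildNodeClaimFromDetails(d aksMachineDetails) nodeClaim {
	return nodeClaim{
		Name:         d.Template.Name,
		InstanceType: d.InstanceType,
		CapacityType: d.CapacityType,
		Zone:         d.Zone,
		ProviderID:   d.VMResourceID,
		ImageVersion: d.NodeImageVersion,
	}
}

func main() {
	p := aksMachinePromise{
		Machine: aksMachineDetails{
			Template:     machineTemplate{Name: "example-machine"},
			InstanceType: "Standard_D2s_v3",
			CapacityType: "on-demand",
			Zone:         "1",
			VMResourceID: "example-vm-resource-id",
		},
		Wait: func() error { return nil },
	}
	fmt.Printf("%+v\n", buildNodeClaimFromDetails(p.Machine))
}
```

A separate details type keeps the distinction the author cares about (template data versus a server-side object) visible in the type system, at the cost of one more struct.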

}

// Convert the AKS machine to a NodeClaim
newNodeClaim, err := instance.BuildNodeClaimFromAKSMachineTemplate(
Member

return nil, corecloudprovider.NewNodeClaimNotFoundError(fmt.Errorf("failed to get AKS machine, AKS machines pool name is empty"))
}

// ASSUMPTION: AKS machines API accepts only AKS machine name.
Member

I've seen this assumption in a few places but not sure I follow what it means... it accepts the machine pool name and the machine, but is it really an assumption (does it deserve a comment?)

Feels like it's just a fact of the API?

Collaborator Author

If you recall, in one of the previous designs we almost made GET Machine callable with the VM name (aks-aksmanagedap-default-etc...) as well. This comment notes that this is no longer the case.
However, since going back is unlikely and that pattern is less usual, I will just remove these comments to prevent confusion.

I've seen this assumption in a few places

I see only one?

func (p *DefaultAKSMachineProvider) List(ctx context.Context) ([]*armcontainerservice.Machine, error) {
if p.aksMachinesPoolName == "" {
// Possible when this option field is not populated, which is not required when PROVISION_MODE is not aksmachineapi.
// So, we respond similarly to if AKS machines pool is not found.
Member

If the machines pool is not found, shouldn't List be an error, at least at the API level? Why not an error here, any reason?

Member

Ahh I see you generally treat machine pool not found as a success with no machines, so you're being consistent. Maybe I'll understand once I finish reading more...

Collaborator Author

Yes. The philosophy is that we want to be bothered as little as possible by the existence of the machines pool. If the machines pool cannot be found, then the machines cannot be found either, and we would rather report the latter.

}

// Determine creation timestamp for Karpenter's perspective
creationTimestamp := NewAKSMachineTimestamp()
Member

why this versus just using creationTimestamp from the machine API?

Collaborator Author

creationTimestamp := NewAKSMachineTimestamp()

// Build the AKS machine template
aksMachineTemplate, err := p.buildAKSMachineTemplate(ctx, instanceType, capacityType, zone, nodeClass, nodeClaim, creationTimestamp)
Member

(Note to self) stopped here for now

@comtalyst comtalyst force-pushed the comtalyst/aks-machine-api-integration-core branch from e2e58e3 to 5ebafe9 Compare December 5, 2025 03:51
@comtalyst comtalyst changed the base branch from comtalyst/xpmt-machine-api-all-refactors-combined to comtalyst/xpmt-machine-api-all-refactors-combined-v2 December 5, 2025 03:52
4 participants