-
Notifications
You must be signed in to change notification settings - Fork 100
feat: further AKS feature/quality/perf parity, through machine API integration #1197
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: comtalyst/xpmt-machine-api-all-refactors-combined-v2
Are you sure you want to change the base?
Conversation
99c437a to
ab84951
Compare
| @@ -0,0 +1,279 @@ | |||
| /* | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still need to re-check how out-of-pattern the naming of these files is.
The intention is to split the files while letting the suite prefix "group" the files together in IDE file explorer/list.
| import ( | ||
| "context" | ||
| "fmt" | ||
| "strings" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@charliedmcb do you mind reviewing this area?
| "sigs.k8s.io/controller-runtime/pkg/reconcile" | ||
| karpv1 "sigs.k8s.io/karpenter/pkg/apis/v1" | ||
| "sigs.k8s.io/karpenter/pkg/operator/injection" | ||
| corenodeclaimutils "sigs.k8s.io/karpenter/pkg/utils/nodeclaim" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@matthchr do you mind reviewing this package?
| } | ||
|
|
||
| return lo.ToPtr(ossku), lo.ToPtr(enabledArtifactStreaming), enableFIPs, nil | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@charliedmcb do you mind checking this area?
| Tags: tags, | ||
| }, | ||
| }, nil | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@charliedmcb @Bryce-Soghigian do you mind double-check if the your familiar features are wired correctly? I can ensure that existing E2Es pass though.
| } | ||
| if o.ManageExistingAKSMachines { | ||
| if o.ProvisionMode == consts.ProvisionModeAKSMachineAPI || o.ManageExistingAKSMachines { | ||
| aksMachinesClient, err = armcontainerservice.NewMachinesClient(cfg.SubscriptionID, cred, opts) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rakechill do you mind verify if new options logic (relevant to SIG) makes sense?
575faa3 to
ec2a6aa
Compare
charliedmcb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only reviewed drift.go
Assuming the supporting functions are implemented correctly, lgtm
| for _, availableImage := range nodeImages { | ||
| // Note: not supporting drift across galleries yet, as AKS machine does not hold gallery info, as of now. | ||
| // Alternatively, could call GET VM, if not propose API changes. | ||
| availableImageVersion, err := utils.GetAKSMachineNodeImageVersionFromImageID(availableImage.ID) // WARNING: verify whether this function support the desired gallery |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not actually seeing this utils func existing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is in a different PR: #1180
charliedmcb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did only a quick pass over aksmachineinstancehelpers.go
Need fixes. FIPs at a minimum isn't handled correctly.
Didn't take a close look at anything I wasn't super familiar with.
| if lo.FromPtr(nodeClass.Spec.FIPSMode) == v1beta1.FIPSModeFIPS { | ||
| ossku = armcontainerservice.OSSKUUbuntu | ||
| enabledArtifactStreaming = false | ||
| enableFIPs = lo.ToPtr(true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
enableFIPs can be true for any image family. I don't think this handling makes any sense. It should be pulling enableFIPS from the nodeclass
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you sure..? Could you verify if the logic in provisionclientbootstrap.go is correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It can definitely be true for AzureLinuxImageFamily
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, it is an aksnodeclass level feature, so I don't think configuring enableFIPs based on extrapolating what images it works for vs not makes sense, as support for 2204, and 2404 will come, meaning that imo should get this from the nodeclass, and check if the mode is FIPS for enabled, otherwise not enable it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean it checks what the field is set to no?
Line 114 in e4199a8
| enableFIPS := lo.FromPtr(p.FIPSMode) == v1beta1.FIPSModeFIPS |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I see, so the part that my change is in the wrong is enableFIPS is set only on UbuntuImageFamily, while it could just use the one in node class.
On the ossku selection based on nodeClass.Spec.FIPSMode, does that looks correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Fixed)
| // Note: as of the time of writing, AKS machine API does not support tags on NICs. This could be fixed server-side. | ||
| tags := ConfigureAKSMachineTags(options.FromContext(ctx), nodeClass, nodeClaim, creationTimestamp) | ||
|
|
||
| return &armcontainerservice.Machine{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will also have to add the image family, once we don't need to use the headers
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you know if that plan is being implemented?
AFAIK the new way to pass in image is being worked on from the API side, but not sure if it will be in a form of new field (which will require another SDK version bump).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Confirmed with the API team that this is upcoming.
b3ac8c7 to
5ec8cf8
Compare
3555e5e to
fb00f47
Compare
matthchr
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Almost done, leaving my comments. Have a little bit to go still.
|
|
||
| AnnotationAKSNodeClassHash = apis.Group + "/aksnodeclass-hash" | ||
| AnnotationAKSNodeClassHashVersion = apis.Group + "/aksnodeclass-hash-version" | ||
| AnnotationAKSMachineResourceID = apis.Group + "/aks-machine-resource-id" // resource ID of the associated AKS machine |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At least at some times in the past we've duplicated some of this between v1alpha1 and v1beta1. Not sure if we should keep doing it but may be worth doing at least for now til we drop v1alpha1 entirely?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see v1alpha1 to add anymore. I think we dropped it?
v1alpha2 is still there, but there is no labels duplication.
Btw, what is the expected impact from this?
@tallaxes, thoughts?
| // An "instance" is a remote object, created by the API based on the template. | ||
| // A "template" is a local struct, populated from Karpenter-provided parameters with the logic further below. | ||
| // A "template" shares the struct with an "instance" representation. But read-only fields may not be populated. Ideally, the types should have been separated to avoid making cross-module assumption of the existence of certain fields. | ||
| type AKSMachinePromise struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we take all of the fields on this object that aren't the promise func + provider ref and put them onto their own type? Something like:
type AKSMachineDetails struct
or something?
Then in cloudprovider controller instead of having to pass all the bits around you can just pass the whole struct, so this:
newNodeClaim, err := instance.BuildNodeClaimFromAKSMachineTemplate(
ctx, aksMachinePromise.AKSMachineTemplate,
aksMachinePromise.InstanceType,
aksMachinePromise.CapacityType,
lo.ToPtr(aksMachinePromise.Zone),
aksMachinePromise.AKSMachineID,
aksMachinePromise.VMResourceID,
false,
aksMachinePromise.AKSMachineNodeImageVersion)
becomes
newNodeClaim, err := instance.BuildNodeClaimFromAKSMachineTemplate(
ctx,
aksMachinePromise.Machine)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For BuildNodeClaimFromAKSMachineTemplate(), there is another variant/layer located right below it: BuildNodeClaimFromAKSMachine(). That one is more similar to your proposal of passing in just the Machine object.
The reason is that these functions can be called with both (A) real Machine object retrieved from the API, and (B) Karpenter-populated struct just after creation, which may have some fields unpopulated. Using the same BuildNodeClaimFromAKSMachine() for both obscures the fact use case of (B) may have different assumptions from (A).
Other parameters in BuildNodeClaimFromAKSMachineTemplate are either not naturally a part of Karpenter-populated struct (e.g., resource IDs) or is in a different format but easily retrieved by (B)'s codepath without having to do the conversion (e.g., InstanceType was given then converted to be used in Machine API, so here is just taking that pre-conversion value, in contrast to converting it back if we take it from machine template struct).
The trade-off is that it looks less minimal as we can see, though I don't think it is as significance as the above. Thoughts?
type AKSMachineDetails struct
Another variant from the philosophy above is to have a dedicated struct for "template" and server-side object. But don't think that is worth adding another structure though. Thoughts?
| } | ||
|
|
||
| // Convert the AKS machine to a NodeClaim | ||
| newNodeClaim, err := instance.BuildNodeClaimFromAKSMachineTemplate( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| return nil, corecloudprovider.NewNodeClaimNotFoundError(fmt.Errorf("failed to get AKS machine, AKS machines pool name is empty")) | ||
| } | ||
|
|
||
| // ASSUMPTION: AKS machines API accepts only AKS machine name. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've seen this assumption in a few places but not sure I follow what it means... it accepts the machine pool name and the machine, but is it really an assumption (does it deserve a comment?)
Feels like it's just a fact of the API?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you recall, in one of the previous designs, we almost make GET Machine callable with VM name (aks-aksmanagedap-default-etc...) as well. This comment is to note that that idea is not the case anymore.
Although, given less likelihood of going back and its less usual pattern, I will just remove these comments to prevent confusion.
I've seen this assumption in a few places
I see only one?
| func (p *DefaultAKSMachineProvider) List(ctx context.Context) ([]*armcontainerservice.Machine, error) { | ||
| if p.aksMachinesPoolName == "" { | ||
| // Possible when this option field is not populated, which is not required when PROVISION_MODE is not aksmachineapi. | ||
| // So, we respond similarly to if AKS machines pool is not found. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
List if machines pool is not found at least at the API level should be an error? Why not an error here, any reason?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh I see you generally treat machine pool not found as a success with no machines, so you're being consistent. Maybe I'll understand once I finish reading more...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. The philosophy is we want to be least bothered by the existence of machines pool. If machines pool cannot be found, we say machines cannot be found, and would rather declare the latter.
| } | ||
|
|
||
| // Determine creation timestamp for Karpenter's perspective | ||
| creationTimestamp := NewAKSMachineTimestamp() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why this versus just using creationTimestamp from the machine API?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See #1197 (comment)
| creationTimestamp := NewAKSMachineTimestamp() | ||
|
|
||
| // Build the AKS machine template | ||
| aksMachineTemplate, err := p.buildAKSMachineTemplate(ctx, instanceType, capacityType, zone, nodeClass, nodeClaim, creationTimestamp) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Note to self) stopped here for now
e2e58e3 to
5ebafe9
Compare
Fixes #
Description
New node provisioning architecture. More details TODO.
This provision mode is considered in-preview at least until AKS machine API is generally available.
Currently, user needs to:
PROVISION_MODEwhen deploying KarpenterMeanwhile:
aksscriptlessprovision mode. This may change separately of this PRTests to follow, not necessarily for PR
How was this change tested?
Does this change impact docs?
Release Note