Nodeup intermittently failing to connect to kops-controller when using Cilium CNI in IPAM mode #17790

Description

@recollir

/kind bug

1. What kops version are you running? The command kops version will display
this information.

1.29.X and 1.32.X (independent of patch version)

2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.

1.29.Y

3. What cloud provider are you using?

aws

4. What commands did you run? What is the simplest way to reproduce this issue?

Scaling up the cluster by adding a node (starting an EC2 instance) through any method (manually by increasing the desired instance count in the ASG, automatically through the cluster autoscaler, or by having kops roll the cluster).

5. What happened after the commands executed?

In the nodeup logs from the kops-configuration service we see messages like:

Starting kops-configuration.service - Run kOps bootstrap (nodeup)...
nodeup version 1.32.2 (git-v1.32.2)
W1203 16:06:48.236770    2493 main.go:133] got error running nodeup (will retry in 30s): failed to get node config from server: Post "https://kops-controller.internal.ernie.example.com:3988/bootstrap": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
W1203 16:07:33.274695    2493 main.go:133] got error running nodeup (will retry in 30s): failed to get node config from server: Post "https://kops-controller.internal.ernie.example.com:3988/bootstrap": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

Sometimes there are more of these retries, sometimes fewer, depending on how the kops-controller FQDN happens to resolve.
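A quick way to narrow this down is to dial each resolved address individually. The following is a minimal diagnostic sketch of our own (not part of nodeup; the FQDN and port are taken from the logs above). Dials to addresses where nothing is listening time out, which matches the Client.Timeout errors:

package main

import (
    "fmt"
    "net"
    "time"
)

func main() {
    const fqdn = "kops-controller.internal.ernie.example.com"

    // Resolve all A records for the kops-controller FQDN.
    ips, err := net.LookupIP(fqdn)
    if err != nil {
        panic(err)
    }

    for _, ip := range ips {
        addr := net.JoinHostPort(ip.String(), "3988")
        // Short per-address timeout; addresses without a listener
        // (e.g. secondary ENI IPs) will time out here.
        conn, err := net.DialTimeout("tcp", addr, 2*time.Second)
        if err != nil {
            fmt.Printf("%s: %v\n", addr, err)
            continue
        }
        conn.Close()
        fmt.Printf("%s: reachable\n", addr)
    }
}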

The kops-controller FQDN resolves to all IPs of the control plane nodes (the primary IP as well as the secondary IPs of the ENIs on each node). Example:

❯ dig kops-controller.internal.ernie.example.com

; <<>> DiG 9.10.6 <<>> kops-controller.internal.ernie.example.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 11808
;; flags: qr rd ra; QUERY: 1, ANSWER: 39, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;kops-controller.internal.ernie.example.com. IN A

;; ANSWER SECTION:
kops-controller.internal.ernie.example.com. 60 IN A 10.101.37.202
kops-controller.internal.ernie.example.com. 60 IN A 10.101.66.63
kops-controller.internal.ernie.example.com. 60 IN A 10.101.55.241
kops-controller.internal.ernie.example.com. 60 IN A 10.101.5.139
kops-controller.internal.ernie.example.com. 60 IN A 10.101.34.10
kops-controller.internal.ernie.example.com. 60 IN A 10.101.4.113
kops-controller.internal.ernie.example.com. 60 IN A 10.101.4.70
kops-controller.internal.ernie.example.com. 60 IN A 10.101.64.234
kops-controller.internal.ernie.example.com. 60 IN A 10.101.93.20
kops-controller.internal.ernie.example.com. 60 IN A 10.101.41.180
kops-controller.internal.ernie.example.com. 60 IN A 10.101.88.5
kops-controller.internal.ernie.example.com. 60 IN A 10.101.57.246
kops-controller.internal.ernie.example.com. 60 IN A 10.101.10.210
kops-controller.internal.ernie.example.com. 60 IN A 10.101.66.236
kops-controller.internal.ernie.example.com. 60 IN A 10.101.38.161
kops-controller.internal.ernie.example.com. 60 IN A 10.101.8.134
kops-controller.internal.ernie.example.com. 60 IN A 10.101.6.219
kops-controller.internal.ernie.example.com. 60 IN A 10.101.18.248
kops-controller.internal.ernie.example.com. 60 IN A 10.101.36.254
kops-controller.internal.ernie.example.com. 60 IN A 10.101.67.202
kops-controller.internal.ernie.example.com. 60 IN A 10.101.62.78
kops-controller.internal.ernie.example.com. 60 IN A 10.101.93.21
kops-controller.internal.ernie.example.com. 60 IN A 10.101.36.223
kops-controller.internal.ernie.example.com. 60 IN A 10.101.95.1
kops-controller.internal.ernie.example.com. 60 IN A 10.101.73.195
kops-controller.internal.ernie.example.com. 60 IN A 10.101.94.158
kops-controller.internal.ernie.example.com. 60 IN A 10.101.45.83
kops-controller.internal.ernie.example.com. 60 IN A 10.101.9.199
kops-controller.internal.ernie.example.com. 60 IN A 10.101.53.90
kops-controller.internal.ernie.example.com. 60 IN A 10.101.48.6
kops-controller.internal.ernie.example.com. 60 IN A 10.101.22.1
kops-controller.internal.ernie.example.com. 60 IN A 10.101.11.25
kops-controller.internal.ernie.example.com. 60 IN A 10.101.9.172
kops-controller.internal.ernie.example.com. 60 IN A 10.101.89.184
kops-controller.internal.ernie.example.com. 60 IN A 10.101.64.176
kops-controller.internal.ernie.example.com. 60 IN A 10.101.38.6
kops-controller.internal.ernie.example.com. 60 IN A 10.101.13.231
kops-controller.internal.ernie.example.com. 60 IN A 10.101.7.101
kops-controller.internal.ernie.example.com. 60 IN A 10.101.70.102

;; Query time: 89 msec
;; SERVER: 100.64.0.1#53(100.64.0.1)
;; WHEN: Thu Dec 04 08:46:28 CET 2025
;; MSG SIZE  rcvd: 709

This corresponds to the Addresses section in the output of kubectl describe node <control plane node>:

Addresses:
  InternalIP:   10.101.53.90
  InternalIP:   10.101.37.202
  InternalIP:   10.101.34.10
  InternalIP:   10.101.62.78
  InternalIP:   10.101.38.161
  InternalIP:   10.101.38.6
  InternalIP:   10.101.48.6
  InternalIP:   10.101.36.254
  InternalIP:   10.101.36.223
  InternalIP:   10.101.45.83
  InternalIP:   10.101.55.241
  InternalIP:   10.101.57.246
  InternalIP:   10.101.41.180
  InternalDNS:  i-0084cbb5018a4567.ec2.internal
  Hostname:     i-0084cbb5018a4567.ec2.internal

or to the output of kubectl get node <control plane node> -o yaml:

status:
  addresses:
  - address: 10.101.53.90
    type: InternalIP
  - address: 10.101.37.202
    type: InternalIP
  - address: 10.101.34.10
    type: InternalIP
  - address: 10.101.62.78
    type: InternalIP
  - address: 10.101.38.161
    type: InternalIP
  - address: 10.101.38.6
    type: InternalIP
  - address: 10.101.48.6
    type: InternalIP
  - address: 10.101.36.254
    type: InternalIP
  - address: 10.101.36.223
    type: InternalIP
  - address: 10.101.45.83
    type: InternalIP
  - address: 10.101.55.241
    type: InternalIP
  - address: 10.101.57.246
    type: InternalIP
  - address: 10.101.41.180
    type: InternalIP
  - address: i-0084cbb5018a4567.ec2.internal
    type: InternalDNS
  - address: i-0084cbb5018a4567.ec2.internal
    type: Hostname
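
The same list can also be read the way dns-controller reads it, from node.Status.Addresses. A minimal client-go sketch of our own (the kubeconfig path and node name are illustrative placeholders):

package main

import (
    "context"
    "fmt"

    v1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    // Load the local kubeconfig (placeholder; adapt as needed).
    config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
        panic(err)
    }
    clientset, err := kubernetes.NewForConfig(config)
    if err != nil {
        panic(err)
    }

    // Node name is illustrative, taken from the output above.
    node, err := clientset.CoreV1().Nodes().Get(context.TODO(),
        "i-0084cbb5018a4567.ec2.internal", metav1.GetOptions{})
    if err != nil {
        panic(err)
    }

    // Print every InternalIP the DNS controller will see for this node.
    for _, a := range node.Status.Addresses {
        if a.Type == v1.NodeInternalIP {
            fmt.Println(a.Address)
        }
    }
}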

6. What did you expect to happen?

Nodeup should be able to connect to kops-controller directly, without timeouts, by getting a usable IP from the FQDN resolution. The DNS records created for kops-controller should therefore contain only the primary IP of each instance (see below for the configuration of our CNI).

7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.

networking:
    cni: {}

We run with Cilium as the CNI and manage it ourselves. Cilium is configured in ENI IPAM mode to allocate VPC-native IP addresses to the pods, so instances get multiple ENIs and IPs associated with them (see above for an example of a control plane node). Relevant part of the Cilium configuration:

cilium:
  eni:
    enabled: true
    awsReleaseExcessIPs: true 
  ipam:
    mode: eni

8. Anything else we need to know?

In

// updateNodeRecords will apply the records for the specified node. It returns the key that was set.
func (c *NodeController) updateNodeRecords(node *v1.Node) string {
    var records []dns.Record

    // Alias targets

    // node/<name>/internal -> InternalIP
    for _, a := range node.Status.Addresses {
        if a.Type != v1.NodeInternalIP {
            continue
        }
        var recordType dns.RecordType = dns.RecordTypeA
        if utils.IsIPv6IP(a.Address) {
            recordType = dns.RecordTypeAAAA
        }
        if !c.haveType[recordType] {
            continue
        }
        records = append(records, dns.Record{
            RecordType:  recordType,
            FQDN:        "node/" + node.Name + "/internal",
            Value:       a.Address,
            AliasTarget: true,
        })
    }

    // node/<name>/external -> ExternalIP
    for _, a := range node.Status.Addresses {
        if a.Type != v1.NodeExternalIP && (a.Type != v1.NodeInternalIP || !utils.IsIPv6IP(a.Address)) {
            continue
        }
        var recordType dns.RecordType = dns.RecordTypeA
        if utils.IsIPv6IP(a.Address) {
            recordType = dns.RecordTypeAAAA
        }
        records = append(records, dns.Record{
            RecordType:  recordType,
            FQDN:        "node/" + node.Name + "/external",
            Value:       a.Address,
            AliasTarget: true,
        })
    }

    // node/role=<role>/external -> ExternalIP
    // node/role=<role>/internal -> InternalIP
    {
        role := kopsutil.GetNodeRole(node)
        // Default to node
        if role == "" {
            role = "node"
        }

        for _, a := range node.Status.Addresses {
            var roleType string
            switch a.Type {
            case v1.NodeInternalIP:
                roleType = dns.RoleTypeInternal
            case v1.NodeExternalIP:
                roleType = dns.RoleTypeExternal
            }
            var recordType dns.RecordType = dns.RecordTypeA
            if utils.IsIPv6IP(a.Address) {
                recordType = dns.RecordTypeAAAA
            }
            records = append(records, dns.Record{
                RecordType:  recordType,
                FQDN:        dns.AliasForNodesInRole(role, roleType),
                Value:       a.Address,
                AliasTarget: true,
            })
        }
    }
we see that the loop over all internal IP addresses makes no distinction between kinds of internal address (primary versus secondary ENI IPs), and we think this contributes to the additional DNS records we are seeing.
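
One possible direction, sketched below against the quoted function, would be to register only the first InternalIP per address family. This is a sketch only, not a tested patch: it assumes the primary ENI IP is ordered first in node.Status.Addresses, which would need to be verified for ENI-based IPAM setups.

    // node/<name>/internal -> InternalIP
    // Sketch: register only the first InternalIP per record type instead
    // of every address. ASSUMES the primary ENI IP is listed first in
    // node.Status.Addresses.
    seen := map[dns.RecordType]bool{}
    for _, a := range node.Status.Addresses {
        if a.Type != v1.NodeInternalIP {
            continue
        }
        recordType := dns.RecordTypeA
        if utils.IsIPv6IP(a.Address) {
            recordType = dns.RecordTypeAAAA
        }
        // Skip unsupported record types and any family we already handled.
        if !c.haveType[recordType] || seen[recordType] {
            continue
        }
        seen[recordType] = true
        records = append(records, dns.Record{
            RecordType:  recordType,
            FQDN:        "node/" + node.Name + "/internal",
            Value:       a.Address,
            AliasTarget: true,
        })
    }

The role-based loop (node/role=<role>/internal) would need the same treatment, and the same ordering question applies there.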
