sdk: update following Go version bump#896

Open
vigh-m wants to merge 5 commits into bottlerocket-os:develop from vigh-m:sdk-update

@vigh-m
Contributor

@vigh-m vigh-m commented Apr 8, 2026

Description of changes:

  • Depends on bottlerocket-sdk#345 ("go: add support for go 1.26, drop 1.24")
  • Updates host-ctr to use Go v1.26
  • Updates nvidia-container-toolkit to v1.19.0. This change brings in a new hook, DisableDeviceNodeModificationHook, when using CDI mode.
    From the PR in nvidia-container-toolkit:
    This change adds a hook to disable device node creation for FULL GPUs (i.e. non-MIG devices) or 
    modification in a container by updating the ModifyDeviceFiles driver parameter.
    
    (This does not include nvidia-caps devices that are required by MIG devices).
    
    The presence of "extra" device nodes in a container are largely cosmetic since the container should 
    not have the required cgroup access for the additional devices. This does not affect the device 
    nodes on the host.
    

Based on the above and the testing done, this new hook looks safe to adopt.

We're defaulting mofedEnabled to false since it's expected that the AWS EFA device plugin will handle mounting the uverbs devices.

  • Adds the rc build of kubernetes-1.36

Testing done:

  • Tested building both arches and launching AMIs
  • Conformance testing of AMIs
  • Tested Nvidia workloads in cdi-cri mode and volume-mounts mode, with and without MIG enabled.
  • Validated the new mofedEnabled: false setting by launching pods on EFA-enabled instances

For the pod that didn't request any EFA device (just a GPU):

❯ k logs gpu-mofed-test
--- /dev/infiniband ---
NONE
--- /dev/nvidia* ---
crw-rw-rw-. 1 root root 195, 254 Apr 17 18:05 /dev/nvidia-modeset
crw-rw-rw-. 1 root root 238,   0 Apr 17 18:05 /dev/nvidia-uvm
crw-rw-rw-. 1 root root 238,   1 Apr 17 18:05 /dev/nvidia-uvm-tools
crw-rw-rw-. 1 root root 195,   3 Apr 17 18:05 /dev/nvidia3
crw-rw-rw-. 1 root root 195, 255 Apr 17 18:05 /dev/nvidiactl
--- env ---
HOSTNAME=gpu-mofed-test
NVIDIA_CTK_LIBCUDA_DIR=/x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/nvidia/tesla
NVIDIA_GDRCOPY=enabled
NVIDIA_GDS=enabled
NVIDIA_VISIBLE_DEVICES=void

For the pod requesting one EFA device and one GPU:

❯ k logs gpu-mofed-test
--- /dev/infiniband ---
total 0
drwxr-xr-x. 2 root root       60 Apr 17 18:28 .
drwxr-xr-x. 6 root root      480 Apr 17 18:28 ..
crw-rw-rw-. 1 root root 231, 192 Apr 17 18:28 uverbs0
--- /dev/nvidia* ---
crw-rw-rw-. 1 root root 195, 254 Apr 17 18:28 /dev/nvidia-modeset
crw-rw-rw-. 1 root root 238,   0 Apr 17 18:28 /dev/nvidia-uvm
crw-rw-rw-. 1 root root 238,   1 Apr 17 18:28 /dev/nvidia-uvm-tools
crw-rw-rw-. 1 root root 195,   0 Apr 17 18:28 /dev/nvidia0
crw-rw-rw-. 1 root root 195, 255 Apr 17 18:28 /dev/nvidiactl
--- env ---
HOSTNAME=gpu-mofed-test
NVIDIA_CTK_LIBCUDA_DIR=/x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/nvidia/tesla
NVIDIA_GDRCOPY=enabled
NVIDIA_GDS=enabled
NVIDIA_VISIBLE_DEVICES=void
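
The comparison between the two pod logs above can be automated. The following is a minimal sketch, not part of this PR: it assumes the log keeps the `--- name ---` section headers shown above, and the helper names `parse_sections` and `uverbs_devices` are hypothetical. The sample strings are abbreviated copies of the two logs.

```python
def parse_sections(log_text):
    """Split log output into {section_name: [lines]} using '--- name ---' headers."""
    sections = {}
    current = None
    for line in log_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("--- ") and stripped.endswith(" ---"):
            # Start a new section named by the text between the dashes.
            current = stripped[4:-4]
            sections[current] = []
        elif current is not None and stripped:
            sections[current].append(stripped)
    return sections


def uverbs_devices(sections):
    """Return the uverbs device names listed under the /dev/infiniband section."""
    names = []
    for line in sections.get("/dev/infiniband", []):
        # The device name is the last whitespace-separated field of an ls -l line.
        name = line.split()[-1]
        if name.startswith("uverbs"):
            names.append(name)
    return names


# Abbreviated copies of the two logs above (assumed representative).
GPU_ONLY_LOG = """\
--- /dev/infiniband ---
NONE
--- /dev/nvidia* ---
crw-rw-rw-. 1 root root 195,   3 Apr 17 18:05 /dev/nvidia3
--- env ---
NVIDIA_VISIBLE_DEVICES=void
"""

GPU_AND_EFA_LOG = """\
--- /dev/infiniband ---
total 0
drwxr-xr-x. 2 root root       60 Apr 17 18:28 .
drwxr-xr-x. 6 root root      480 Apr 17 18:28 ..
crw-rw-rw-. 1 root root 231, 192 Apr 17 18:28 uverbs0
--- /dev/nvidia* ---
crw-rw-rw-. 1 root root 195,   0 Apr 17 18:28 /dev/nvidia0
--- env ---
NVIDIA_VISIBLE_DEVICES=void
"""

if __name__ == "__main__":
    print(uverbs_devices(parse_sections(GPU_ONLY_LOG)))     # []
    print(uverbs_devices(parse_sections(GPU_AND_EFA_LOG)))  # ['uverbs0']
```

With mofedEnabled: false, the expectation is that uverbs devices appear only in the pod that requested an EFA device, which is what the two results reflect.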

The rendered device plugin settings:

bash-5.2# cat /etc/nvidia-k8s-device-plugin/settings.yaml
version: v1
flags:
  migStrategy: "none"
  mofedEnabled: false <--- Rendered here
  failOnInitError: true
  nvidiaDriverRoot: "/"
  plugin:
    passDeviceSpecs: true
    deviceListStrategy: cdi-cri
    deviceIDStrategy: index
    containerDriverRoot: "/"
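
A check of the rendered value can also be scripted. The sketch below is illustrative only and assumes the two-space indentation shown in the file above; `read_flag` is a hypothetical helper, and it does simple line matching rather than real YAML parsing (a proper YAML library would be the better choice where one is available). The sample text copies the file without the "<--- Rendered here" annotation.

```python
def read_flag(settings_text, flag_name):
    """Return the raw string value of a key directly under `flags:`, or None."""
    in_flags = False
    for line in settings_text.splitlines():
        if not line.strip():
            continue
        indent = len(line) - len(line.lstrip())
        if indent == 0:
            # A top-level key either opens or closes the flags block.
            in_flags = line.strip() == "flags:"
            continue
        # Only direct children of flags (assumed two-space indent) are considered.
        if in_flags and indent == 2:
            key, sep, value = line.strip().partition(":")
            if sep and key == flag_name:
                return value.strip().strip('"')
    return None


# Copy of the rendered settings shown above, minus the inline annotation.
SETTINGS = """\
version: v1
flags:
  migStrategy: "none"
  mofedEnabled: false
  failOnInitError: true
  nvidiaDriverRoot: "/"
  plugin:
    passDeviceSpecs: true
    deviceListStrategy: cdi-cri
    deviceIDStrategy: index
    containerDriverRoot: "/"
"""

if __name__ == "__main__":
    print(read_flag(SETTINGS, "mofedEnabled"))  # false
```

Keys nested under plugin are deliberately not matched, so a lookup such as `read_flag(SETTINGS, "deviceListStrategy")` returns None in this sketch.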

/usr/bin/nvidia-device-plugin --version did not report a version number:

bash-5.2# /usr/bin/nvidia-device-plugin --version
NVIDIA Device Plugin version unknown

But /usr/bin/nvidia-device-plugin --help shows:

...
COMMANDS:
...
   --mofed-enabled ensure that containers that request NVIDIA GPU resources are started with MOFED support (default: true) [$MOFED_ENABLED]
...

This shows default: true, as expected of the new version of the device plugin.

  • Tested the k8s-1.36 rc by building an AMI and verifying that it launches

ToDo:

  • Bump the SDK version after release

Terms of contribution:

By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.

@vigh-m vigh-m changed the title Sdk update sdk: Update following Go version bump Apr 8, 2026
@vigh-m vigh-m changed the title sdk: Update following Go version bump sdk: update following Go version bump Apr 8, 2026
vigh-m added 5 commits April 17, 2026 13:31
Signed-off-by: Vighnesh Maheshwari <vighmah@amazon.com>
Signed-off-by: Vighnesh Maheshwari <vighmah@amazon.com>
Signed-off-by: Vighnesh Maheshwari <vighmah@amazon.com>
Build ecr-credential-provider-1.36 using the v1.35 sources,
since upstream has not released rc sources yet. This is set up to
switch to the 1.36 sources when they are available.

Signed-off-by: Vighnesh Maheshwari <vighmah@amazon.com>
Signed-off-by: Vighnesh Maheshwari <vighmah@amazon.com>
@vigh-m
Contributor Author

vigh-m commented Apr 17, 2026

From reading the Go 1.26 release notes, I think the below changes are worth noting:


New garbage collector

The Green Tea garbage collector, previously available as an experiment in Go 1.25, is now enabled by default after incorporating feedback.

This garbage collector’s design improves the performance of marking and scanning small objects through better locality and CPU scalability. Benchmark results vary, but we expect somewhere between a 10–40% reduction in garbage collection overhead in real-world programs that heavily use the garbage collector. Further improvements, on the order of 10% in garbage collection overhead, are expected when running on newer amd64-based CPU platforms (Intel Ice Lake or AMD Zen 4 and newer), as the garbage collector now leverages vector instructions for scanning small objects when possible.

The new garbage collector may be disabled by setting GOEXPERIMENT=nogreenteagc at build time. This opt-out setting is expected to be removed in Go 1.27. If you disable the new garbage collector for any reason related to its performance or behavior, please file an issue.

Compiler

The compiler can now allocate the backing store for slices on the stack in more situations, which improves performance. If this change is causing trouble, the bisect tool can be used to find the allocation causing trouble using the -compile=variablemake flag. All such new stack allocations can also be turned off using -gcflags=all=-d=variablemakehash=n. If you encounter issues with this optimization, please file an issue.
