
drop support for compute capability <= 7.0 for newer cuDNN versions#170

Merged
casparvl merged 3 commits into EESSI:main from bedroge:cudnn915_cc70 on Mar 10, 2026

Conversation


@bedroge bedroge commented Feb 27, 2026

This one is a little trickier than CUDA itself, as the list of supported compute capabilities in the docs (https://docs.nvidia.com/deeplearning/cudnn/backend/v9.19.0/reference/support-matrix.html) doesn't really match what running cuobjdump on the binaries shows. Also, there seem to be some gaps in the matrix, and I wonder whether that's really correct.

So for now I've chosen an easier approach by just checking if we're building with a newer cuDNN and compute capability <= 7.0, and in that case I do the same thing as what @casparvl implemented for CUDA. In order to check if cuDNN is used as dependency, I've generalized Caspar's get_cuda_version into a get_dependency_software_version function.
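The generalization described above could look roughly like the following. This is a minimal sketch, not the actual eb_hooks.py implementation: the easyconfig is assumed to be a dict-like object whose `dependencies`/`builddependencies` entries carry `name` and `version` keys, and all names here are illustrative.

```python
def get_dependency_software_version(ec, dep_name):
    """Return the version of dep_name if it appears among the (build)
    dependencies of easyconfig ec, or None if it does not.
    Sketch only; the real hook may access dependencies differently."""
    for dep in ec.get('dependencies', []) + ec.get('builddependencies', []):
        if dep['name'] == dep_name:
            return dep['version']
    return None

# Illustrative easyconfig-like structure
ec = {
    'name': 'SomeApp', 'version': '1.0',
    'dependencies': [
        {'name': 'CUDA', 'version': '12.9.1'},
        {'name': 'cuDNN', 'version': '9.15.0.57'},
    ],
}
cudnn_version = get_dependency_software_version(ec, 'cuDNN')
```

With a helper like this, the CUDA-specific check becomes one call with `'CUDA'` as argument and the cuDNN check another with `'cuDNN'`, instead of two near-duplicate functions.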

Tested this locally with EESSI-extend and the cuDNN from EESSI/software-layer#1410 on a V100 (CC 7.0) and RTX PRO 6000 (CC 12.0f), and got the expected result: on the RTX PRO 6000 I get a full cuDNN installation, while for the V100 I get the following output during the build:

WARNING: Requested a CUDA Compute Capability (['7.0']) that is not supported by the cuDNN version (9.15.0.57) used by this software. Switching to
'--module-only --force' and injecting an LmodError into the modulefile. You can override this behaviour by setting the
EESSI_OVERRIDE_CUDA_CC_CUDNN_CHECK environment variable.

and a module file that has:

if (not os.getenv("EESSI_IGNORE_CUDNN_9_15_0_57_CC_7_0")) then LmodError("EasyConfigs using cuDNN 9.15.0.57 or older are not supported for (all) requested Compute Capabilities: ['7.0'].\n") end
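The guard line above follows a fixed pattern, so generating it is mostly string formatting. The sketch below shows how such a line could be built for a single compute capability; the variable names and the single-CC assumption are illustrative, not taken from the actual hook.

```python
# Hypothetical sketch: build the Lua guard line injected into the modulefile.
cudnn_version = '9.15.0.57'
ccs = ['7.0']  # assuming one requested compute capability for simplicity

# Environment variable that lets users bypass the error, e.g.
# EESSI_IGNORE_CUDNN_9_15_0_57_CC_7_0
env_var = 'EESSI_IGNORE_CUDNN_%s_CC_%s' % (
    cudnn_version.replace('.', '_'), ccs[0].replace('.', '_'))

lua_guard = (
    'if (not os.getenv("%s")) then LmodError('
    '"EasyConfigs using cuDNN %s or older are not supported for (all) '
    'requested Compute Capabilities: %s.\\n") end' % (env_var, cudnn_version, ccs)
)
```

In EasyBuild, a line like this could be appended to the generated module via the `modluafooter` easyconfig parameter (or an equivalent hook mechanism).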


bedroge commented Feb 27, 2026

Ultimately we could make the same kind of lookup table as for CUDA. Initially I started working on it:

# The documentation at e.g. https://docs.nvidia.com/deeplearning/cudnn/backend/v9.19.0/reference/support-matrix.html and
# what cuobjdump shows on cuDNN libraries do not fully match. The support matrix below may be too inclusive,
# so if you find that a specific combination is not supported in practice, please remove it from the matrix.
CUDNN_SUPPORTED_CCS = {
    '8.8.0': [],
    '9.15.0': ['75', '80', '86', '89', '90', '100', '103', '120', '121'],
    '9.15.1': ['75', '80', '86', '89', '90', '100', '103', '120', '121'],
    '9.16.0': ['75', '80', '86', '89', '90', '100', '103', '120', '121'],
    '9.17.0': ['75', '80', '86', '89', '90', '100', '103', '120', '121'],
    '9.17.1': ['75', '80', '86', '89', '90', '100', '103', '120', '121'],
    '9.18.0': ['75', '80', '86', '89', '90', '100', '103', '120', '121'],
    '9.18.1': ['75', '80', '86', '89', '90', '100', '103', '120', '121'],
    '9.19.0': ['75', '80', '86', '89', '90', '100', '103', '120', '121'],
}

but it's a lot of work, and as mentioned, it's not really clear what is and isn't supported. We could also consider a simpler lookup table with just the min+max supported CCs per X.Y.Z version? But then again, https://docs.nvidia.com/deeplearning/cudnn/backend/v9.19.0/reference/support-matrix.html says that 12.1 is not supported, while the binaries do seem to indicate that it is, so it's all rather confusing and unclear...
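For what it's worth, consuming such a table is straightforward once it exists. A possible check, assuming a `CUDNN_SUPPORTED_CCS` dict keyed on `X.Y.Z` cuDNN versions as in the snippet above (the function name and the version-truncation logic are assumptions, not the actual hook code):

```python
# Trimmed-down copy of the lookup table from the snippet above (sketch).
CUDNN_SUPPORTED_CCS = {
    '9.15.0': ['75', '80', '86', '89', '90', '100', '103', '120', '121'],
}

def unsupported_ccs(cudnn_version, requested_ccs):
    """Return the requested compute capabilities that the table does not list
    for this cuDNN version. requested_ccs use the '7.0' notation, while the
    table uses the compact '70' form, so dots are stripped for the lookup."""
    # Reduce a full version like '9.15.0.57' to the table key '9.15.0'
    key = '.'.join(cudnn_version.split('.')[:3])
    supported = CUDNN_SUPPORTED_CCS.get(key, [])
    return [cc for cc in requested_ccs if cc.replace('.', '') not in supported]
```

An unknown cuDNN version yields an empty `supported` list here, i.e. everything is flagged as unsupported; one could just as well default to allowing everything for unknown versions, which is exactly the kind of policy choice this PR avoids for now.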

@casparvl

My 2 cents:

  1. Go for a lookup table. If you only specify a min and max version, the implicit assumption is that all intermediate versions are supported - which does not seem to be the case (i.e. 11.X almost certainly isn't, since that's not supported in CUDA 12 - see the CUDA lookup table)
  2. If you create a lookup table, and if the docs contradict what the binaries show, assume the binaries to be correct. If the binaries say there is no X.Y support, there is no X.Y code in the binary - so there can't be support. If the binary says there is X.Y code in the binary, that might not be a hard guarantee that the full cuDNN API is supported for that architecture - but the only way to find out is to assume the support is there, install it, and see how this works in practice. If we skip installations for targets that do turn out to be supported, we'd never find out otherwise.
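Following point 2, a table entry can be derived from the binaries themselves by listing the embedded `sm_XY` targets with cuobjdump. The parsing step is trivial; the sketch below runs on illustrative sample output (the real command would be something like `cuobjdump --list-elf libcudnn_something.so`, and the exact output format should be verified against your cuobjdump version).

```python
import re

def ccs_from_cuobjdump(output):
    """Extract the unique sm_XY compute capability targets that cuobjdump
    reports, sorted numerically. Parsing sketch; sample input is made up."""
    return sorted(set(re.findall(r'sm_(\d+)', output)), key=int)

# Illustrative sample of cuobjdump --list-elf style output (not real data)
sample = """\
ELF file    1: example.1.sm_75.cubin
ELF file    2: example.2.sm_80.cubin
ELF file    3: example.3.sm_90.cubin
"""
ccs = ccs_from_cuobjdump(sample)
```

Running this over each cuDNN release and diffing the result against the docs would make the "trust the binaries" policy mechanical rather than manual.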


bedroge commented Feb 27, 2026

I just feel like a lookup table is a lot of work to set up and to maintain, while (according to the docs) the supported CCs don't change that often. Also, wouldn't the sanity check still catch unsupported CCs, as it did for CC 7.0 in EESSI/software-layer#1410? So whenever we run into this, we can mark those as unsupported in the hooks (and if necessary, change the if statement to something else if there are going to be too many combinations)?


casparvl commented Mar 9, 2026

Hm, I don't think it's too bad to maintain - but admittedly it may be easier for CUDA than for cuDNN, since we can just query the list from nvcc. Looking at your PR again, it should correctly generate fake modules for cuDNN versions that are too new to support CC 7.0.

The fact that it doesn't do so for CC 11.0 may be a minor detail, since the CUDA sanity check will then indeed report that this is also invalid. The only downside of not including that case (and maybe also an upper limit) right away is that when sites install this with EESSI-extend and have 11.0 configured as their CC, they'll hit the CUDA sanity check - and may not fully understand why it fails (while the error message printed by the module is much more informative, as it is more specific).

Anyway, I'm also ok with leaving that out for now. If you can have a look at my (minor) review comment, I'll see if I can test the PR locally - and merge it if it works as expected.

@casparvl casparvl left a comment

Testing based on this feature branch:

[casparl@tcn471 software-layer-scripts]$ eb --hooks eb_hooks.py cuDNN-9.15.0.57-CUDA-12.9.1.eb --accept-eula-for=cuDNN
...
== Running pre-fetch hook...

WARNING: Requested a CUDA Compute Capability (['7.0']) that is not supported by the cuDNN version (9.15.0.57) used by this software. Switching to '--module-only --force' and injecting an LmodError into
the modulefile. You can override this behaviour by setting the EESSI_OVERRIDE_CUDA_CC_CUDNN_CHECK environment variable.

== Updated build option 'module-only' to 'True'
== Updated build option 'force' to 'True'
...
== Setting EESSI_IGNORE_CUDNN_9_15_0_57_CC_7_0 in initial environment
  >> generating module file @ /home/casparl/eessi/versions/2025.06/software/linux/x86_64/amd/zen2/modules/all/cuDNN/9.15.0.57-CUDA-12.9.1.lua
== Running post-module hook...
== Restored original build option 'module_only' to False
== Restored original build option 'force' to False
== Removing EESSI_IGNORE_CUDNN_9_15_0_57_CC_7_0 in initial environment
...
== Summary:
   * [SUCCESS] cuDNN/9.15.0.57-CUDA-12.9.1

That looks good.

So does this:

$ module load cuDNN/9.15.0.57-CUDA-12.9.1
Lmod has detected the following error:  EasyConfigs using cuDNN 9.15.0.57 or newer are not supported for (all) requested Compute Capabilities: ['7.0'].

While processing the following module(s):
    Module fullname              Module Filename
    ---------------              ---------------
    cuDNN/9.15.0.57-CUDA-12.9.1  /home/casparl/eessi/versions/2025.06/software/linux/x86_64/amd/zen2/modules/all/cuDNN/9.15.0.57-CUDA-12.9.1.lua

It also allows suppressing the module error, as intended:

$ EESSI_IGNORE_CUDNN_9_15_0_57_CC_7_0=1 module load cuDNN/9.15.0.57-CUDA-12.9.1
[casparl@tcn471 software-layer-scripts]$

And finally, running with

$  EESSI_OVERRIDE_CUDA_CC_CUDNN_CHECK=1 eb --hooks eb_hooks.py cuDNN-9.15.0.57-CUDA-12.9.1.eb --accept-eula-for=cuDNN --rebuild

We can indeed suppress the check and do a full install (I won't paste output here, we all know what a successful EB installation looks like).

LGTM!

@casparvl

bot: build repo:eessi.io-2025.06-software instance:eessi-bot-mc-aws for:arch=x86_64/amd/zen2


eessi-bot-aws bot commented Mar 10, 2026

New job on instance eessi-bot-mc-aws for repository eessi.io-2025.06-software
Building on: amd-zen2
Building for: x86_64/amd/zen2
Job dir: /project/def-users/SHARED/jobs/2026.03/pr_170/138001

date job status comment
Mar 10 09:47:10 UTC 2026 submitted job id 138001 awaits release by job manager
Mar 10 09:47:33 UTC 2026 released job awaits launch by Slurm scheduler
Mar 10 11:26:04 UTC 2026 running job 138001 is running
Mar 10 11:27:23 UTC 2026 finished
😁 SUCCESS
Details
✅ job output file slurm-138001.out
✅ no message matching FATAL:
✅ no message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.* created!
Artefacts
eessi-2025.06-software-linux-x86_64-amd-zen2-17731419420.tar.zst, size: 0 MiB (26445 bytes)
entries: 1
modules under 2025.06/software/linux/x86_64/amd/zen2/modules/all
no module files in tarball
software under 2025.06/software/linux/x86_64/amd/zen2/software
no software packages in tarball
reprod directories under 2025.06/software/linux/x86_64/amd/zen2/reprod
no reprod directories in tarball
other under 2025.06/software/linux/x86_64/amd/zen2
2025.06/init/easybuild/eb_hooks.py
Mar 10 11:27:23 UTC 2026 test result
😁 SUCCESS
ReFrame Summary
[ OK ] (1/4) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_allreduce %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node %device_type=cpu /e4bf9965 @BotBuildTests:x86-64-zen2+default
P: latency: 1.31 us (r:0, l:None, u:None)
[ OK ] (2/4) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_alltoall %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node %device_type=cpu /3da4890b @BotBuildTests:x86-64-zen2+default
P: latency: 2.04 us (r:0, l:None, u:None)
[ OK ] (3/4) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_latency %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node /3255009a @BotBuildTests:x86-64-zen2+default
P: latency: 0.17 us (r:0, l:None, u:None)
[ OK ] (4/4) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_bw %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node /59f4b331 @BotBuildTests:x86-64-zen2+default
P: bandwidth: 8003.59 MB/s (r:0, l:None, u:None)
[ PASSED ] Ran 4/4 test case(s) from 4 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
✅ job output file slurm-138001.out
✅ no message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case
Mar 10 12:55:12 UTC 2026 uploaded transfer of eessi-2025.06-software-linux-x86_64-amd-zen2-17731419420.tar.zst to S3 bucket succeeded

@casparvl

bot: build repo:eessi.io-2023.06-software instance:eessi-bot-mc-aws for:arch=x86_64/amd/zen2


eessi-bot-aws bot commented Mar 10, 2026

New job on instance eessi-bot-mc-aws for repository eessi.io-2023.06-software
Building on: amd-zen2
Building for: x86_64/amd/zen2
Job dir: /project/def-users/SHARED/jobs/2026.03/pr_170/138002

date job status comment
Mar 10 09:47:18 UTC 2026 submitted job id 138002 awaits release by job manager
Mar 10 09:47:31 UTC 2026 released job awaits launch by Slurm scheduler
Mar 10 11:26:02 UTC 2026 running job 138002 is running
Mar 10 11:29:49 UTC 2026 finished
😁 SUCCESS
Details
✅ job output file slurm-138002.out
✅ no message matching FATAL:
✅ no message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.* created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen2-17731419390.tar.zst, size: 0 MiB (26440 bytes)
entries: 1
modules under 2023.06/software/linux/x86_64/amd/zen2/modules/all
no module files in tarball
software under 2023.06/software/linux/x86_64/amd/zen2/software
no software packages in tarball
reprod directories under 2023.06/software/linux/x86_64/amd/zen2/reprod
no reprod directories in tarball
other under 2023.06/software/linux/x86_64/amd/zen2
2023.06/init/easybuild/eb_hooks.py
Mar 10 11:29:49 UTC 2026 test result
😁 SUCCESS
ReFrame Summary
[ OK ] ( 1/10) EESSI_LAMMPS_lj %device_type=cpu %module_name=LAMMPS/29Aug2024-foss-2023b-kokkos %scale=1_node /aeb2d9df @BotBuildTests:x86-64-zen2+default
P: perf: 265.926 timesteps/s (r:0, l:None, u:None)
[ OK ] ( 2/10) EESSI_LAMMPS_lj %device_type=cpu %module_name=LAMMPS/2Aug2023_update2-foss-2023a-kokkos %scale=1_node /04ff9ece @BotBuildTests:x86-64-zen2+default
P: perf: 450.596 timesteps/s (r:0, l:None, u:None)
[ OK ] ( 3/10) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_allreduce %module_name=OSU-Micro-Benchmarks/7.2-gompi-2023b %scale=1_node %device_type=cpu /775175bf @BotBuildTests:x86-64-zen2+default
P: latency: 2.81 us (r:0, l:None, u:None)
[ OK ] ( 4/10) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_allreduce %module_name=OSU-Micro-Benchmarks/7.1-1-gompi-2023a %scale=1_node %device_type=cpu /52707c40 @BotBuildTests:x86-64-zen2+default
P: latency: 2.94 us (r:0, l:None, u:None)
[ OK ] ( 5/10) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_alltoall %module_name=OSU-Micro-Benchmarks/7.2-gompi-2023b %scale=1_node %device_type=cpu /b1aacda9 @BotBuildTests:x86-64-zen2+default
P: latency: 6.03 us (r:0, l:None, u:None)
[ OK ] ( 6/10) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_alltoall %module_name=OSU-Micro-Benchmarks/7.1-1-gompi-2023a %scale=1_node %device_type=cpu /c6bad193 @BotBuildTests:x86-64-zen2+default
P: latency: 5.74 us (r:0, l:None, u:None)
[ OK ] ( 7/10) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_latency %module_name=OSU-Micro-Benchmarks/7.2-gompi-2023b %scale=1_node /15cad6c4 @BotBuildTests:x86-64-zen2+default
P: latency: 0.77 us (r:0, l:None, u:None)
[ OK ] ( 8/10) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_latency %module_name=OSU-Micro-Benchmarks/7.1-1-gompi-2023a %scale=1_node /6672deda @BotBuildTests:x86-64-zen2+default
P: latency: 0.73 us (r:0, l:None, u:None)
[ OK ] ( 9/10) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_bw %module_name=OSU-Micro-Benchmarks/7.2-gompi-2023b %scale=1_node /2a9a47b1 @BotBuildTests:x86-64-zen2+default
P: bandwidth: 6473.16 MB/s (r:0, l:None, u:None)
[ OK ] (10/10) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_bw %module_name=OSU-Micro-Benchmarks/7.1-1-gompi-2023a %scale=1_node /1b24ab8e @BotBuildTests:x86-64-zen2+default
P: bandwidth: 6463.41 MB/s (r:0, l:None, u:None)
[ PASSED ] Ran 10/10 test case(s) from 10 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
✅ job output file slurm-138002.out
✅ no message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case
Mar 10 12:55:21 UTC 2026 uploaded transfer of eessi-2023.06-software-linux-x86_64-amd-zen2-17731419390.tar.zst to S3 bucket succeeded

@casparvl casparvl added bot:deploy 2025.06-software.eessi.io 2025.06 version of software.eessi.io labels Mar 10, 2026
@casparvl casparvl merged commit 2eb7da3 into EESSI:main Mar 10, 2026
78 of 83 checks passed
@bedroge bedroge deleted the cudnn915_cc70 branch March 10, 2026 18:31