Add optional CUDA graph resource inspection for AI predecoder by wsttiger · Pull Request #505 · NVIDIA/cudaqx

wsttiger · 2026-04-16T20:58:27Z

Add ai_predecoder_service::print_graph_resources() that walks the captured cuGraph and reports per-kernel grid/block dims, register usage, shared memory, and launch totals, plus a node-type summary.

Collection is opt-in via a new collect_resources parameter on capture_graph() because it uses the CUDA driver API to introspect TRT kernels, which perturbs primary-context state and breaks DOCA-based GPU-RoCE on the FPGA bridge. Only the software benchmark exposes a --print-graph-resources flag; the FPGA bridge ignores it and prints a warning.
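As a rough illustration of the kind of report described above, here is a minimal sketch of a per-kernel record and a summary printer. The struct names, field names, and output format are assumptions made for illustration, not the PR's actual types. In the real code the values would presumably come from the driver-API calls named in this PR (cuGraphKernelNodeGetParams_v2, cuFuncGetName, cuFuncGetAttribute); here they are filled in by hand so the sketch runs without a GPU.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record for one kernel node; the PR's real struct may differ.
struct kernel_resource_info {
  std::string name;
  unsigned grid[3];
  unsigned block[3];
  int num_regs;             // registers per thread
  std::size_t shared_bytes; // shared memory per block, in bytes
};

// Hypothetical aggregate for a whole captured graph.
struct graph_resource_summary {
  std::vector<kernel_resource_info> kernels;
  std::map<std::string, int> node_type_counts; // e.g. "kernel", "memcpy"
};

std::uint64_t threads_per_block(const kernel_resource_info &k) {
  return std::uint64_t(k.block[0]) * k.block[1] * k.block[2];
}

std::uint64_t blocks_launched(const kernel_resource_info &k) {
  return std::uint64_t(k.grid[0]) * k.grid[1] * k.grid[2];
}

// Print one line per kernel, then launch totals, then the node-type summary.
void print_summary(const graph_resource_summary &s, std::ostream &os) {
  std::uint64_t total_blocks = 0, total_threads = 0;
  for (const auto &k : s.kernels) {
    const std::uint64_t blocks = blocks_launched(k);
    total_blocks += blocks;
    total_threads += blocks * threads_per_block(k);
    os << k.name << " grid=(" << k.grid[0] << "," << k.grid[1] << ","
       << k.grid[2] << ") block=(" << k.block[0] << "," << k.block[1] << ","
       << k.block[2] << ") regs=" << k.num_regs
       << " smem=" << k.shared_bytes << "B\n";
  }
  os << "launch totals: " << s.kernels.size() << " kernels, " << total_blocks
     << " blocks, " << total_threads << " threads\n";
  for (const auto &entry : s.node_type_counts)
    os << "  " << entry.first << ": " << entry.second << "\n";
}
```

Keeping the record a plain struct means the printing and totals logic can be exercised in a test without touching the CUDA driver at all.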

The CUDA graph resource inspection added in this PR introduced CUDA driver API calls (cuGraphKernelNodeGetParams_v2, cuFuncGetName, cuFuncGetAttribute) inside ai_predecoder_service.cu. The new unittests/realtime/ targets were updated to link CUDA::cuda_driver, but the older test_realtime_pipeline target in libs/qec/unittests/CMakeLists.txt (which also compiles ai_predecoder_service.cu directly) was missed, causing undefined reference errors in the standalone QEC CI builds (amd64 12.6, amd64 13.0, arm64 13.0).

Add CUDA::cuda_driver to test_realtime_pipeline's link libraries.

Signed-off-by: Scott Thornton <wsttiger@gmail.com>

Decouple CUDA graph resource introspection from ai_predecoder_service and expose it as free functions in a new graph_resources translation unit.

Targets that want per-kernel grid/block/register/shared-memory reporting now:

1. call capture_graph(stream, device_launch, save_graph=true), which retains a cudaGraphClone of the captured template, and
2. pass the cudaGraph_t returned by get_captured_graph() to the free functions collect_graph_resources() / print_graph_resources() in cudaq/qec/realtime/graph_resources.h.

Motivation: the driver-API calls (cuFuncGetAttribute, cuFuncGetName, cuGraphKernelNodeGetParams_v2) required to introspect TRT-internal kernels pulled libcuda.so.1 into every target that merely compiled ai_predecoder_service.cu. That broke QEC standalone CI builds whose containers do not ship a GPU driver: test_realtime_pipeline's gtest_discover_tests invocation failed at build time on "libcuda.so.1: cannot open shared object file".

After this change the driver API is confined to graph_resources.cu, which is only compiled into the benchmark target (test_realtime_predecoder_w_pymatching). test_realtime_pipeline and hololink_predecoder_bridge no longer reference any cu*-prefixed symbol and therefore no longer require CUDA::cuda_driver on their link lines. Verified with ldd: libcuda.so.1 is absent from test_realtime_pipeline.

Additional cleanup:

- ai_predecoder_service no longer owns graph_resource_info and has no <iosfwd>/<string>/<vector> includes it does not use.
- The FPGA bridge still warns when --print-graph-resources is passed since driver-API introspection would perturb the CUDA context used by DOCA/Hololink GPU-RoCE.
- Reverts the earlier CUDA::cuda_driver link and DISCOVERY_MODE PRE_TEST workarounds on test_realtime_pipeline.

Signed-off-by: Scott Thornton <wsttiger@gmail.com>
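The decoupling described in this commit can be sketched without CUDA at all. In the sketch below, fake_graph stands in for cudaGraph_t, predecoder_service_sketch stands in for ai_predecoder_service, and count_nodes_of_type stands in for the free functions in graph_resources.h; all three names are hypothetical. The sketch drops the stream and device_launch arguments and assumes that get_captured_graph() returns null when save_graph was not set, which the commit message does not state.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Stand-in for cudaGraph_t: just a list of node-type labels (hypothetical).
struct fake_graph {
  std::vector<std::string> node_types;
};

// Service side: captures a graph but knows nothing about resource inspection.
class predecoder_service_sketch {
public:
  // save_graph mirrors the parameter named in the commit message; the real
  // signature also takes a stream and a device_launch argument.
  void capture_graph(bool save_graph = false) {
    saved_.reset();
    fake_graph captured;
    captured.node_types = {"kernel", "kernel", "memcpy"};
    // ... the real code would instantiate an executable graph here ...
    if (save_graph)
      saved_.reset(new fake_graph(captured)); // stands in for cudaGraphClone
  }

  // Assumed to be null unless capture_graph was called with save_graph = true.
  const fake_graph *get_captured_graph() const { return saved_.get(); }

private:
  std::unique_ptr<fake_graph> saved_;
};

// Inspection side: a free function that only needs the graph handle, so it
// can live in a separate translation unit with its own link dependencies.
std::size_t count_nodes_of_type(const fake_graph *graph,
                                const std::string &type) {
  if (graph == nullptr)
    return 0;
  std::size_t n = 0;
  for (const auto &t : graph->node_types)
    if (t == type)
      ++n;
  return n;
}
```

With this split, only targets that call the inspection function pull in its dependencies, which is the property the commit relies on to keep libcuda.so.1 out of test_realtime_pipeline.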

bmhowe23 · 2026-04-20T21:37:03Z

Can this be a .cpp file since there isn't any real CUDA code in here?

wsttiger requested review from bmhowe23, cketcham2333 and kvmto April 16, 2026 20:58

wsttiger added 2 commits April 17, 2026 03:56

bmhowe23 reviewed Apr 20, 2026

wsttiger wants to merge 3 commits into NVIDIA:main from wsttiger:add_ai_predecoder_cuda_graph_resources_output
