Add optional CUDA graph resource inspection for AI predecoder #505
Open
wsttiger wants to merge 3 commits into NVIDIA:main from
Add ai_predecoder_service::print_graph_resources() that walks the captured CUDA graph and reports per-kernel grid/block dims, register usage, shared memory, and launch totals, plus a node-type summary.

Collection is opt-in via a new collect_resources parameter on capture_graph() because it uses the CUDA driver API to introspect TRT kernels, which perturbs primary-context state and breaks DOCA-based GPU-RoCE on the FPGA bridge. Only the software benchmark exposes a --print-graph-resources flag; the FPGA bridge ignores it and prints a warning.

Signed-off-by: Scott Thornton <wsttiger@gmail.com>
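As a rough illustration of the "launch totals" part of such a report, the sketch below aggregates per-kernel records in plain C++. The `kernel_record` and `launch_totals` types and the `summarize` function are hypothetical stand-ins: the PR's actual `graph_resource_info` layout is not shown in this conversation, and the definition of each total here is only one plausible choice. No CUDA calls are made; in the real code the fields would presumably be filled from the driver-API queries named in the commit messages.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for a CUDA dim3 (not the real CUDA type).
struct dim3_t {
  unsigned x, y, z;
};

// Hypothetical per-kernel record; the field names are assumptions.
struct kernel_record {
  std::string name;
  dim3_t grid;
  dim3_t block;
  int regs_per_thread;
  std::size_t shared_bytes_per_block;
};

// Hypothetical totals over all kernel nodes in one graph launch.
struct launch_totals {
  std::uint64_t blocks;
  std::uint64_t threads;
  std::uint64_t registers;
  std::uint64_t shared_bytes;
};

// Number of elements covered by a 3-D extent.
inline std::uint64_t volume(const dim3_t &d) {
  return static_cast<std::uint64_t>(d.x) * d.y * d.z;
}

// Sum blocks, threads, registers, and shared memory across all kernels.
inline launch_totals summarize(const std::vector<kernel_record> &kernels) {
  launch_totals t = {0, 0, 0, 0};
  for (const kernel_record &k : kernels) {
    const std::uint64_t blocks = volume(k.grid);
    const std::uint64_t threads = blocks * volume(k.block);
    t.blocks += blocks;
    t.threads += threads;
    t.registers += threads * static_cast<std::uint64_t>(k.regs_per_thread);
    t.shared_bytes += blocks * k.shared_bytes_per_block;
  }
  return t;
}
```

Keeping the aggregation separate from the driver-API queries means the arithmetic can be unit-tested on machines without a GPU driver.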
The CUDA graph resource inspection added in this PR introduced CUDA driver API calls (cuGraphKernelNodeGetParams_v2, cuFuncGetName, cuFuncGetAttribute) inside ai_predecoder_service.cu. The new unittests/realtime/ targets were updated to link CUDA::cuda_driver, but the older test_realtime_pipeline target in libs/qec/unittests/CMakeLists.txt (which also compiles ai_predecoder_service.cu directly) was missed, causing undefined reference errors in the standalone QEC CI builds (amd64 12.6, amd64 13.0, arm64 13.0).

Add CUDA::cuda_driver to test_realtime_pipeline's link libraries.

Signed-off-by: Scott Thornton <wsttiger@gmail.com>
Decouple CUDA graph resource introspection from ai_predecoder_service
and expose it as free functions in a new graph_resources translation
unit. Targets that want per-kernel grid/block/register/shared-memory
reporting now:
  1. call capture_graph(stream, device_launch, save_graph=true), which
     retains a cudaGraphClone of the captured template, and
  2. pass the cudaGraph_t returned by get_captured_graph() to the free
     functions collect_graph_resources() / print_graph_resources() in
     cudaq/qec/realtime/graph_resources.h.
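
The two steps above can be sketched without CUDA as an opt-in retention pattern. Everything in this block is a mock: `fake_graph` stands in for `cudaGraph_t`, `mock_predecoder_service` for `ai_predecoder_service`, and `summarize_node_types` for the free functions in graph_resources.h. The real `capture_graph` also takes `stream` and `device_launch`, which are omitted here, and returning a null handle when nothing was saved is an assumption rather than documented behaviour.

```cpp
#include <map>
#include <memory>
#include <string>
#include <vector>

// Mock of a captured graph: just a list of node-type labels.
struct fake_graph {
  std::vector<std::string> node_types;
};

// Mock service: keeps a copy of the captured graph only when asked to.
class mock_predecoder_service {
public:
  void capture_graph(bool save_graph = false) {
    fake_graph captured;
    captured.node_types = {"kernel", "kernel", "memcpy"};
    // The real service would instantiate an executable graph here.
    if (save_graph)
      saved_.reset(new fake_graph(captured)); // analogous to cudaGraphClone
    else
      saved_.reset();
  }

  // Assumption: null when capture_graph ran without save_graph.
  const fake_graph *get_captured_graph() const { return saved_.get(); }

private:
  std::unique_ptr<fake_graph> saved_;
};

// Free function, decoupled from the service: counts nodes per type.
inline std::map<std::string, int>
summarize_node_types(const fake_graph *graph) {
  std::map<std::string, int> counts;
  if (graph == nullptr)
    return counts; // nothing was retained, so nothing to report
  for (const std::string &type : graph->node_types)
    ++counts[type];
  return counts;
}
```

Because the inspection code only needs the graph handle, the service itself carries no dependency on whatever library the inspection uses, which is the decoupling this commit describes.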
Motivation: the driver-API calls (cuFuncGetAttribute, cuFuncGetName,
cuGraphKernelNodeGetParams_v2) required to introspect TRT-internal
kernels pulled libcuda.so.1 into every target that merely compiled
ai_predecoder_service.cu. That broke QEC standalone CI builds whose
containers do not ship a GPU driver: test_realtime_pipeline's
gtest_discover_tests invocation failed at build time on
"libcuda.so.1: cannot open shared object file".
After this change the driver API is confined to graph_resources.cu,
which is only compiled into the benchmark target
(test_realtime_predecoder_w_pymatching). test_realtime_pipeline and
hololink_predecoder_bridge no longer reference any cu*-prefixed
symbol and therefore no longer require CUDA::cuda_driver on their
link lines. Verified with ldd: libcuda.so.1 is absent from
test_realtime_pipeline.
Additional cleanup:
- ai_predecoder_service no longer owns graph_resource_info and no
  longer includes <iosfwd>/<string>/<vector>, which it does not use.
- The FPGA bridge still warns when --print-graph-resources is passed,
  since driver-API introspection would perturb the CUDA context used
  by DOCA/Hololink GPU-RoCE.
- Reverts the earlier CUDA::cuda_driver link and DISCOVERY_MODE
  PRE_TEST workarounds on test_realtime_pipeline.
Signed-off-by: Scott Thornton <wsttiger@gmail.com>
bmhowe23 (Collaborator) reviewed on Apr 20, 2026
Can this be a .cpp file since there isn't any real CUDA code in here?