README.md: 4 additions & 2 deletions
CUDA Python is the home for accessing NVIDIA’s CUDA platform from Python. It consists of multiple components:
* [cuda.core](https://nvidia.github.io/cuda-python/cuda-core/latest): Pythonic access to CUDA Runtime and other core functionality
* [cuda.bindings](https://nvidia.github.io/cuda-python/cuda-bindings/latest): Low-level Python bindings to CUDA C APIs
* [cuda.pathfinder](https://nvidia.github.io/cuda-python/cuda-pathfinder/latest): Utilities for locating CUDA components installed in the user's Python environment
* [cuda.coop](https://nvidia.github.io/cccl/unstable/python/coop.html): A Python module providing CCCL's reusable block-wide and warp-wide *device* primitives for use within Numba CUDA kernels
* [cuda.compute](https://nvidia.github.io/cccl/unstable/python/compute/index.html): A Python module for easy access to CCCL's highly efficient and customizable parallel algorithms, like `sort`, `scan`, `reduce`, `transform`, etc. that are callable on the *host*
* [numba.cuda](https://nvidia.github.io/numba-cuda/): A Python DSL that exposes CUDA **SIMT** programming model and compiles a restricted subset of Python code into CUDA kernels and device functions
* [cuda.tile](https://docs.nvidia.com/cuda/cutile-python/): A new Python DSL that exposes CUDA **Tile** programming model and allows users to write NumPy-like code in CUDA kernels
* [nvmath-python](https://docs.nvidia.com/cuda/nvmath-python/latest): Pythonic access to NVIDIA CPU & GPU Math Libraries, with [*host*](https://docs.nvidia.com/cuda/nvmath-python/latest/overview.html#host-apis), [*device*](https://docs.nvidia.com/cuda/nvmath-python/latest/overview.html#device-apis), and [*distributed*](https://docs.nvidia.com/cuda/nvmath-python/latest/distributed-apis/index.html) APIs. It also provides low-level Python bindings to host C APIs ([nvmath.bindings](https://docs.nvidia.com/cuda/nvmath-python/latest/bindings/index.html)).

The list of available interfaces is:
* NVRTC
* nvJitLink
* NVVM
* nvFatbin
* cuFile
* NVML
cuda_core/docs/nv-versions.json: 4 additions & 0 deletions
  {
    "version": "latest",
    "url": "https://nvidia.github.io/cuda-python/cuda-core/latest/"
  },
  {
    "version": "1.0.0",
    "url": "https://nvidia.github.io/cuda-python/cuda-core/1.0.0/"
  },
  {
    "version": "0.7.0",
    "url": "https://nvidia.github.io/cuda-python/cuda-core/0.7.0/"
  },
cuda_core/docs/source/api.rst: 4 additions & 45 deletions
``cuda.core`` API Reference
===========================

This is the main API reference for ``cuda.core``. As of version 1.0.0, all
APIs are considered stable and follow `Semantic Versioning <https://semver.org/>`_
with appropriate deprecation periods for breaking changes. See the
:doc:`support policy <support>` for details.
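
As an illustrative sketch (not part of the documented API), the practical
meaning of this guarantee is that downstream code only needs to gate on the
major version:

```python
def check_major(installed: str, required_major: int) -> bool:
    # Under Semantic Versioning, releases that share a major version keep a
    # backward-compatible API, so a major-version check is sufficient.
    return int(installed.split(".")[0]) == required_major

assert check_major("1.2.3", 1)      # any 1.x release keeps the 1.0 API surface
assert not check_major("2.0.0", 1)  # only a major bump may break compatibility
```

Breaking changes may only arrive with a future major release, after the
documented deprecation period.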


Devices and execution
---------------------
checkpoint.Process


Utility functions
-----------------

cuda_core/docs/source/api_nvml.rst: 44 additions & 0 deletions (new file)
.. SPDX-FileCopyrightText: Copyright (c) 2024-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
.. SPDX-License-Identifier: Apache-2.0

.. module:: cuda.core.system

CUDA system information and NVIDIA Management Library (NVML)
============================================================

.. note::
    ``cuda.core.system`` support requires ``cuda_bindings`` 12.9.6 or later (12.x series), or 13.2.0 or later (13.x series).

Basic functions
---------------

.. autosummary::
:toctree: generated/

get_driver_version
get_driver_version_full
get_driver_branch
get_num_devices
get_nvml_version
get_process_name
get_topology_common_ancestor
get_p2p_status

Events
------

.. autosummary::
:toctree: generated/

register_events

Types
-----

.. autosummary::
:toctree: generated/

:template: autosummary/cyclass.rst

Device
NvlinkInfo
cuda_core/docs/source/index.rst: 2 additions & 0 deletions
Welcome to the documentation for ``cuda.core``.
install
interoperability
api
api_nvml
environment_variables
contribute

.. toctree::
:maxdepth: 1

support
conduct
license

cuda_core/docs/source/install.rst: 1 addition & 1 deletion
Free-threading Build Support
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

As of cuda-core 1.0.0, **experimental** packages for the `free-threaded interpreter`_ are shipped.

1. Support for these builds is best effort, due to heavy use of `built-in
modules that are known to be thread-unsafe`_, such as ``ctypes``.
cuda_core/docs/source/release/1.0.0-notes.rst: 146 additions & 30 deletions
New features
------------
including string process state queries, lock/checkpoint/restore/unlock
operations, and GPU UUID remapping support for restore.
(`#1343 <https://github.com/NVIDIA/cuda-python/issues/1343>`__)
- Added green context support (CUDA 12.4+). New types :class:`Context`,
:class:`ContextOptions`, :class:`SMResource`, :class:`SMResourceOptions`,
:class:`WorkqueueResource`, and :class:`WorkqueueResourceOptions` enable GPU
SM and workqueue resource partitioning. Create green contexts via
:meth:`Device.create_context`, then use :meth:`Context.create_stream` and
:attr:`Context.resources` to work within the partitioned resources.
(`#1976 <https://github.com/NVIDIA/cuda-python/pull/1976>`__)
- Changes to the :mod:`cuda.core.system` module for NVIDIA Management Library (NVML)
access:

- :attr:`system.Device.mig` for querying and setting MIG mode, enumerating
MIG device instances, and navigating parent/child relationships.
(`#1916 <https://github.com/NVIDIA/cuda-python/pull/1916>`__)
- :attr:`system.Device.compute_running_processes` for querying running compute
processes on a device, returning :class:`~system.ProcessInfo` objects with
PID, GPU memory usage, and MIG instance IDs.
(`#1917 <https://github.com/NVIDIA/cuda-python/pull/1917>`__)
- :meth:`system.Device.get_nvlink` for querying NVLink version and state per
link, and :attr:`system.Device.utilization` returning current GPU and memory
utilization rates.
(`#1918 <https://github.com/NVIDIA/cuda-python/pull/1918>`__)

- Re-wrapped NVML enums as human-readable ``StrEnum`` subclasses instead of raw
integer re-exports from ``cuda.bindings.nvml``. These are available in
``cuda.core.system.typing``.
(`#2014 <https://github.com/NVIDIA/cuda-python/pull/2014>`__)
- Enums are now available in places where a small number of string values are
accepted or returned. You may continue to use the string values, or use
enumerations for better linting and type-checking.
(`#2016 <https://github.com/NVIDIA/cuda-python/issues/2016>`__)
The new enums are:

- :class:`cuda.core.typing.CompilerBackendType`
- :class:`cuda.core.typing.GraphConditionalType`
- :class:`cuda.core.typing.GraphMemoryType`
- :class:`cuda.core.typing.ManagedMemoryLocationType`
- :class:`cuda.core.typing.ObjectCodeFormatType`
- :class:`cuda.core.typing.PCHStatusType`
- :class:`cuda.core.typing.SourceCodeType`
- :class:`cuda.core.typing.VirtualMemoryAccessType`
- :class:`cuda.core.typing.VirtualMemoryAllocationType`
- :class:`cuda.core.typing.VirtualMemoryGranularityType`
- :class:`cuda.core.typing.VirtualMemoryHandleType`
- :class:`cuda.core.typing.VirtualMemoryLocationType`
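
The dual acceptance of strings and enums works because the enums are
string-backed. A self-contained sketch of the pattern follows;
``SourceCodeType`` is a real name from the list above, but the member names
and values here are illustrative:

```python
from enum import Enum

class SourceCodeType(str, Enum):
    # String-backed members compare equal to plain strings, so call sites
    # may keep passing "c++" etc. (member values here are illustrative).
    CPP = "c++"
    PTX = "ptx"

def normalize(code_type):
    # Accepts either a member or its string value; returns the member.
    return SourceCodeType(code_type)

assert normalize("c++") is SourceCodeType.CPP
assert normalize(SourceCodeType.PTX) == "ptx"
```

Linters and type checkers can then flag a misspelled member, which a plain
string literal would not catch.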


Breaking changes
----------------

- :class:`~utils.StridedMemoryView` now provides a fast path for ``torch.Tensor``
objects via PyTorch's AOT Inductor (AOTI) stable C ABI. When a ``torch.Tensor``
is passed to any ``from_*`` classmethod (``from_dlpack``,
``from_cuda_array_interface``, ``from_array_interface``, or
``from_any_interface``), tensor metadata is read directly from the underlying
C struct, bypassing the DLPack and CUDA Array Interface protocol overhead.
This yields ~7–20x faster ``StridedMemoryView`` construction for PyTorch
tensors (depending on whether stream ordering is required). Proper CUDA stream
ordering is established between PyTorch's current stream and the consumer
stream, matching the DLPack synchronization contract.
Requires PyTorch >= 2.3.

This is a *behavioral* breaking change: because the AOTI tensor bridge reads
raw metadata without re-enacting PyTorch's export guardrails, tensors that
PyTorch would reject at the DLPack boundary (notably ``requires_grad``,
conjugated, non-strided/sparse, and wrong-current-device CUDA tensors) are
  now accepted. This is intentional: ``StridedMemoryView`` is designed for
low-level interop where those checks are not needed.
(`#749 <https://github.com/NVIDIA/cuda-python/issues/749>`__)
- Renamed :class:`~graph.GraphDef` to :class:`~graph.GraphDefinition` for
consistency with the rest of the API, which spells words out (e.g.
``TensorMapDescriptor``, not ``TensorMapDesc``).
- :obj:`cuda.core.typing.DevicePointerT` -> :obj:`cuda.core.typing.DevicePointerType`
- :obj:`cuda.core.typing.IsStreamT` -> :obj:`cuda.core.typing.IsStreamType`

- Renamed and converted multiple :class:`~system.Device` properties and methods
for naming consistency
(`#1946 <https://github.com/NVIDIA/cuda-python/pull/1946>`__):

On :class:`~system.Device`:

- ``is_c2c_mode_enabled`` -> ``is_c2c_enabled``
- ``persistence_mode_enabled`` -> ``is_persistence_mode_enabled``
- ``clock(clock_type)`` -> ``get_clock(clock_type)``
- ``get_auto_boosted_clocks_enabled()`` -> ``is_auto_boosted_clocks_enabled``
(method -> property)
- ``get_current_clock_event_reasons()`` -> ``current_clock_event_reasons``
(method -> property)
- ``get_supported_clock_event_reasons()`` -> ``supported_clock_event_reasons``
(method -> property)
- ``display_mode`` -> ``is_display_connected``
- ``display_active`` -> ``is_display_active``
- ``fan(fan=0)`` -> ``get_fan(fan=0)``
- ``get_supported_pstates()`` -> ``supported_pstates``
(method -> property)

On ``PciInfo``:

- ``get_max_pcie_link_generation()`` -> ``link_generation`` (method -> property)
- ``get_gpu_max_pcie_link_generation()`` -> ``max_link_generation``
(method -> property)
- ``get_max_pcie_link_width()`` -> ``max_link_width`` (method -> property)
- ``get_current_pcie_link_generation()`` -> ``current_link_generation``
(method -> property)
- ``get_current_pcie_link_width()`` -> ``current_link_width``
(method -> property)
- ``get_pcie_throughput(counter)`` -> ``get_throughput(counter)``
- ``get_pcie_replay_counter()`` -> ``replay_counter`` (method -> property)

On ``Temperature``:

- ``sensor(sensor=...)`` -> ``get_sensor(sensor=...)``
- ``threshold(threshold_type)`` -> ``get_threshold(threshold_type)``
- ``thermal_settings(sensor_index)`` -> ``get_thermal_settings(sensor_index)``

On ``FanInfo``:

- ``set_default_fan_speed()`` -> ``set_default_speed()``

- Removed 18 helper/data-container classes from ``cuda.core.system.__all__``:
``BAR1MemoryInfo``, ``ClockInfo``, ``ClockOffsets``, ``CoolerInfo``,
``DeviceAttributes``, ``DeviceEvents``, ``EventData``, ``FanInfo``,
``FieldValue``, ``FieldValues``, ``GpuDynamicPstatesInfo``,
``GpuDynamicPstatesUtilization``, ``InforomInfo``, ``PciInfo``,
``RepairStatus``, ``Temperature``, ``ThermalSensor``, ``ThermalSettings``.
These classes are still returned by :class:`~system.Device` properties and
methods but should not be directly instantiated by users.
(`#1942 <https://github.com/NVIDIA/cuda-python/pull/1942>`__)
- :attr:`system.Device.uuid` now returns the full NVML UUID with prefix
(e.g. ``GPU-...``). Use :attr:`system.Device.uuid_without_prefix` for
the previous behavior.
(`#1916 <https://github.com/NVIDIA/cuda-python/pull/1916>`__)
- :func:`args_viewable_as_strided_memory` and :class:`StridedMemoryView` are no
  longer available at the top level of :mod:`cuda.core`. They are available
  publicly from the :mod:`cuda.core.utils` module.
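
For code bases migrating across the renames above, a thin forwarding shim can
bridge old attribute names during the transition. This is an illustrative aid
only (not shipped with ``cuda.core``), covering just a few of the renames:

```python
import warnings

# Old-name -> new-name map for a few of the system.Device renames above.
DEVICE_RENAMES = {
    "is_c2c_mode_enabled": "is_c2c_enabled",
    "persistence_mode_enabled": "is_persistence_mode_enabled",
    "display_mode": "is_display_connected",
    "display_active": "is_display_active",
}

class CompatShim:
    """Forward old system.Device attribute names to their new spellings."""

    def __init__(self, device):
        self._device = device

    def __getattr__(self, name):
        new = DEVICE_RENAMES.get(name, name)
        if new != name:
            warnings.warn(f"{name} was renamed to {new}", DeprecationWarning)
        return getattr(self._device, new)

class FakeDevice:
    # Stand-in so the sketch runs without an NVIDIA driver.
    is_c2c_enabled = True
    is_display_active = False

dev = CompatShim(FakeDevice())
assert dev.is_c2c_mode_enabled is True  # old name forwards to the new one
assert dev.is_display_active is False   # new names pass through unchanged
```

The method-to-property conversions (e.g. ``get_supported_pstates()`` ->
``supported_pstates``) cannot be bridged this transparently and need call-site
updates.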
Fixes and enhancements
-----------------------

- Fixed :attr:`Buffer.is_managed` returning ``False`` for pool-allocated managed
memory (:class:`ManagedMemoryResource`), which caused DLPack interop to
misclassify managed buffers as ``kDLCUDAHost``. The fix queries both the
driver pointer attribute and the memory resource.
(`#1924 <https://github.com/NVIDIA/cuda-python/pull/1924>`__)
- :attr:`system.Device.arch` now returns ``UNKNOWN`` instead of raising
``ValueError`` when NVML reports an architecture not yet in the enum.
(`#1937 <https://github.com/NVIDIA/cuda-python/pull/1937>`__)
- :meth:`system.Device.get_field_values` and
:meth:`system.Device.clear_field_values` with an empty list no longer raise
``InvalidArgumentError``.
(`#1982 <https://github.com/NVIDIA/cuda-python/pull/1982>`__)
- :class:`Linker` error and info log retrieval now properly checks return codes
from nvJitLink, raising exceptions on failure instead of silently ignoring
errors.
(`#1993 <https://github.com/NVIDIA/cuda-python/pull/1993>`__)
- Fixed a potential crash when NVML event set creation failed on Windows, due to
``__dealloc__`` freeing an uninitialized handle.
(`#1992 <https://github.com/NVIDIA/cuda-python/pull/1992>`__)
- CUDA Runtime error messages are now more reliable, especially on Windows
where the runtime DLL name table could disagree with the installed bindings.
(`#2003 <https://github.com/NVIDIA/cuda-python/pull/2003>`__)
- Linux release wheels are now stripped of debug symbols, significantly reducing
package size. Debug builds are now supported via
``--config-settings=debug=true``.
(`#1890 <https://github.com/NVIDIA/cuda-python/pull/1890>`__)
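
The return-code discipline behind the :class:`Linker` log fix can be sketched
as follows; the function and status names are illustrative, and the real
implementation calls nvJitLink through ``cuda.bindings``:

```python
class NvJitLinkError(RuntimeError):
    """Raised when an nvJitLink call reports a nonzero status."""

def checked(status: int, what: str) -> None:
    # Raise on failure instead of silently ignoring the status, which is
    # the behavior change described above (status values are illustrative).
    if status != 0:
        raise NvJitLinkError(f"{what} failed with nvJitLink status {status}")

checked(0, "nvJitLinkGetErrorLogSize")  # success: returns None
try:
    checked(4, "nvJitLinkGetErrorLog")
except NvJitLinkError as exc:
    print(exc)  # -> nvJitLinkGetErrorLog failed with nvJitLink status 4
```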