README.md: 4 additions & 2 deletions
CUDA Python is the home for accessing NVIDIA’s CUDA platform from Python. It consists of multiple components:
* [cuda.core](https://nvidia.github.io/cuda-python/cuda-core/latest): Pythonic access to CUDA Runtime and other core functionality
* [cuda.bindings](https://nvidia.github.io/cuda-python/cuda-bindings/latest): Low-level Python bindings to CUDA C APIs
* [cuda.pathfinder](https://nvidia.github.io/cuda-python/cuda-pathfinder/latest): Utilities for locating CUDA components installed in the user's Python environment
* [cuda.coop](https://nvidia.github.io/cccl/unstable/python/coop.html): A Python module providing CCCL's reusable block-wide and warp-wide *device* primitives for use within Numba CUDA kernels
* [cuda.compute](https://nvidia.github.io/cccl/unstable/python/compute/index.html): A Python module for easy access to CCCL's highly efficient and customizable parallel algorithms, like `sort`, `scan`, `reduce`, `transform`, etc. that are callable on the *host*
* [numba.cuda](https://nvidia.github.io/numba-cuda/): A Python DSL that exposes CUDA **SIMT** programming model and compiles a restricted subset of Python code into CUDA kernels and device functions
* [cuda.tile](https://docs.nvidia.com/cuda/cutile-python/): A new Python DSL that exposes CUDA **Tile** programming model and allows users to write NumPy-like code in CUDA kernels
* [nvmath-python](https://docs.nvidia.com/cuda/nvmath-python/latest): Pythonic access to NVIDIA CPU & GPU Math Libraries, with [*host*](https://docs.nvidia.com/cuda/nvmath-python/latest/overview.html#host-apis), [*device*](https://docs.nvidia.com/cuda/nvmath-python/latest/overview.html#device-apis), and [*distributed*](https://docs.nvidia.com/cuda/nvmath-python/latest/distributed-apis/index.html) APIs. It also provides low-level Python bindings to host C APIs ([nvmath.bindings](https://docs.nvidia.com/cuda/nvmath-python/latest/bindings/index.html)).

The list of available interfaces is:
* NVRTC
* nvJitLink
* NVVM
* nvFatbin
* cuFile
* NVML
cuda_core/docs/nv-versions.json: 4 additions & 0 deletions
  {
    "version": "latest",
    "url": "https://nvidia.github.io/cuda-python/cuda-core/latest/"
  },
  {
    "version": "1.0.0",
    "url": "https://nvidia.github.io/cuda-python/cuda-core/1.0.0/"
  },
  {
    "version": "0.7.0",
    "url": "https://nvidia.github.io/cuda-python/cuda-core/0.7.0/"
  },
cuda_core/docs/source/api.rst: 4 additions & 45 deletions
``cuda.core`` API Reference
===========================

This is the main API reference for ``cuda.core``. As of version 1.0.0, all
APIs are considered stable and follow `Semantic Versioning <https://semver.org/>`_
with appropriate deprecation periods for breaking changes. See the
:doc:`support policy <support>` for details.
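
As an illustrative sketch (not part of the documented API), the practical
meaning of this guarantee is that downstream code only needs to gate on the
major version:

```python
def check_major(installed: str, required_major: int) -> bool:
    # Under Semantic Versioning, releases that share a major version keep a
    # backward-compatible API, so a major-version check is sufficient.
    return int(installed.split(".")[0]) == required_major

assert check_major("1.2.3", 1)      # any 1.x release keeps the 1.0 API surface
assert not check_major("2.0.0", 1)  # only a major bump may break compatibility
```

Breaking changes may only arrive with a future major release, after the
documented deprecation period.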


Devices and execution
---------------------
checkpoint.Process


Utility functions
-----------------

cuda_core/docs/source/api_nvml.rst: 44 additions & 0 deletions (new file)
.. SPDX-FileCopyrightText: Copyright (c) 2024-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
.. SPDX-License-Identifier: Apache-2.0

.. module:: cuda.core.system

CUDA system information and NVIDIA Management Library (NVML)
============================================================

.. note::
    ``cuda.core.system`` support requires ``cuda_bindings`` 12.9.6 or later (12.x series), or 13.2.0 or later (13.x series).

Basic functions
---------------

.. autosummary::
:toctree: generated/

get_driver_version
get_driver_version_full
get_driver_branch
get_num_devices
get_nvml_version
get_process_name
get_topology_common_ancestor
get_p2p_status

Events
------

.. autosummary::
:toctree: generated/

register_events

Types
-----

.. autosummary::
:toctree: generated/

:template: autosummary/cyclass.rst

Device
NvlinkInfo
cuda_core/docs/source/index.rst: 2 additions & 0 deletions
Welcome to the documentation for ``cuda.core``.
install
interoperability
api
api_nvml
environment_variables
contribute

.. toctree::
:maxdepth: 1

support
conduct
license

cuda_core/docs/source/install.rst: 1 addition & 1 deletion
Free-threading Build Support
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

As of cuda-core 1.0.0, **experimental** packages for the `free-threaded interpreter`_ are shipped.

1. Support for these builds is best effort, due to heavy use of `built-in
modules that are known to be thread-unsafe`_, such as ``ctypes``.
cuda_core/docs/source/release/1.0.0-notes.rst: 146 additions & 30 deletions
New features
------------
including string process state queries, lock/checkpoint/restore/unlock
operations, and GPU UUID remapping support for restore.
(`#1343 <https://github.com/NVIDIA/cuda-python/issues/1343>`__)
- Added green context support (CUDA 12.4+). New types :class:`Context`,
:class:`ContextOptions`, :class:`SMResource`, :class:`SMResourceOptions`,
:class:`WorkqueueResource`, and :class:`WorkqueueResourceOptions` enable GPU
SM and workqueue resource partitioning. Create green contexts via
:meth:`Device.create_context`, then use :meth:`Context.create_stream` and
:attr:`Context.resources` to work within the partitioned resources.
(`#1976 <https://github.com/NVIDIA/cuda-python/pull/1976>`__)
- Changes to the :mod:`cuda.core.system` module for NVIDIA Management Library (NVML)
access:

- :attr:`system.Device.mig` for querying and setting MIG mode, enumerating
MIG device instances, and navigating parent/child relationships.
(`#1916 <https://github.com/NVIDIA/cuda-python/pull/1916>`__)
- :attr:`system.Device.compute_running_processes` for querying running compute
processes on a device, returning :class:`~system.ProcessInfo` objects with
PID, GPU memory usage, and MIG instance IDs.
(`#1917 <https://github.com/NVIDIA/cuda-python/pull/1917>`__)
- :meth:`system.Device.get_nvlink` for querying NVLink version and state per
link, and :attr:`system.Device.utilization` returning current GPU and memory
utilization rates.
(`#1918 <https://github.com/NVIDIA/cuda-python/pull/1918>`__)

- Re-wrapped NVML enums as human-readable ``StrEnum`` subclasses instead of raw
integer re-exports from ``cuda.bindings.nvml``. These are available in
``cuda.core.system.typing``.
(`#2014 <https://github.com/NVIDIA/cuda-python/pull/2014>`__)
- Enums are now available in places where a small number of string values are
accepted or returned. You may continue to use the string values, or use
enumerations for better linting and type-checking.
(`#2016 <https://github.com/NVIDIA/cuda-python/issues/2016>`__)
The new enums are:

- :class:`cuda.core.typing.CompilerBackendType`
- :class:`cuda.core.typing.GraphConditionalType`
- :class:`cuda.core.typing.GraphMemoryType`
- :class:`cuda.core.typing.ManagedMemoryLocationType`
- :class:`cuda.core.typing.ObjectCodeFormatType`
- :class:`cuda.core.typing.PCHStatusType`
- :class:`cuda.core.typing.SourceCodeType`
- :class:`cuda.core.typing.VirtualMemoryAccessType`
- :class:`cuda.core.typing.VirtualMemoryAllocationType`
- :class:`cuda.core.typing.VirtualMemoryGranularityType`
- :class:`cuda.core.typing.VirtualMemoryHandleType`
- :class:`cuda.core.typing.VirtualMemoryLocationType`
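
The dual acceptance of strings and enums works because the enums are
string-backed. A self-contained sketch of the pattern follows;
``SourceCodeType`` is a real name from the list above, but the member names
and values here are illustrative:

```python
from enum import Enum

class SourceCodeType(str, Enum):
    # String-backed members compare equal to plain strings, so call sites
    # may keep passing "c++" etc. (member values here are illustrative).
    CPP = "c++"
    PTX = "ptx"

def normalize(code_type):
    # Accepts either a member or its string value; returns the member.
    return SourceCodeType(code_type)

assert normalize("c++") is SourceCodeType.CPP
assert normalize(SourceCodeType.PTX) == "ptx"
```

Linters and type checkers can then flag a misspelled member, which a plain
string literal would not catch.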


Breaking changes
----------------

- :class:`~utils.StridedMemoryView` now provides a fast path for ``torch.Tensor``
objects via PyTorch's AOT Inductor (AOTI) stable C ABI. When a ``torch.Tensor``
is passed to any ``from_*`` classmethod (``from_dlpack``,
``from_cuda_array_interface``, ``from_array_interface``, or
``from_any_interface``), tensor metadata is read directly from the underlying
C struct, bypassing the DLPack and CUDA Array Interface protocol overhead.
This yields ~7–20x faster ``StridedMemoryView`` construction for PyTorch
tensors (depending on whether stream ordering is required). Proper CUDA stream
ordering is established between PyTorch's current stream and the consumer
stream, matching the DLPack synchronization contract.
Requires PyTorch >= 2.3.

This is a *behavioral* breaking change: because the AOTI tensor bridge reads
raw metadata without re-enacting PyTorch's export guardrails, tensors that
PyTorch would reject at the DLPack boundary (notably ``requires_grad``,
conjugated, non-strided/sparse, and wrong-current-device CUDA tensors) are
  now accepted. This is intentional: ``StridedMemoryView`` is designed for
low-level interop where those checks are not needed.
(`#749 <https://github.com/NVIDIA/cuda-python/issues/749>`__)
- Renamed :class:`~graph.GraphDef` to :class:`~graph.GraphDefinition` for
consistency with the rest of the API, which spells words out (e.g.
``TensorMapDescriptor``, not ``TensorMapDesc``).
- :obj:`cuda.core.typing.DevicePointerT` -> :obj:`cuda.core.typing.DevicePointerType`
- :obj:`cuda.core.typing.IsStreamT` -> :obj:`cuda.core.typing.IsStreamType`

- Renamed and converted multiple :class:`~system.Device` properties and methods
for naming consistency
(`#1946 <https://github.com/NVIDIA/cuda-python/pull/1946>`__):

On :class:`~system.Device`:

- ``is_c2c_mode_enabled`` -> ``is_c2c_enabled``
- ``persistence_mode_enabled`` -> ``is_persistence_mode_enabled``
- ``clock(clock_type)`` -> ``get_clock(clock_type)``
- ``get_auto_boosted_clocks_enabled()`` -> ``is_auto_boosted_clocks_enabled``
(method -> property)
- ``get_current_clock_event_reasons()`` -> ``current_clock_event_reasons``
(method -> property)
- ``get_supported_clock_event_reasons()`` -> ``supported_clock_event_reasons``
(method -> property)
- ``display_mode`` -> ``is_display_connected``
- ``display_active`` -> ``is_display_active``
- ``fan(fan=0)`` -> ``get_fan(fan=0)``
- ``get_supported_pstates()`` -> ``supported_pstates``
(method -> property)

On ``PciInfo``:

- ``get_max_pcie_link_generation()`` -> ``link_generation`` (method -> property)
- ``get_gpu_max_pcie_link_generation()`` -> ``max_link_generation``
(method -> property)
- ``get_max_pcie_link_width()`` -> ``max_link_width`` (method -> property)
- ``get_current_pcie_link_generation()`` -> ``current_link_generation``
(method -> property)
- ``get_current_pcie_link_width()`` -> ``current_link_width``
(method -> property)
- ``get_pcie_throughput(counter)`` -> ``get_throughput(counter)``
- ``get_pcie_replay_counter()`` -> ``replay_counter`` (method -> property)

On ``Temperature``:

- ``sensor(sensor=...)`` -> ``get_sensor(sensor=...)``
- ``threshold(threshold_type)`` -> ``get_threshold(threshold_type)``
- ``thermal_settings(sensor_index)`` -> ``get_thermal_settings(sensor_index)``

On ``FanInfo``:

- ``set_default_fan_speed()`` -> ``set_default_speed()``

- Removed 18 helper/data-container classes from ``cuda.core.system.__all__``:
``BAR1MemoryInfo``, ``ClockInfo``, ``ClockOffsets``, ``CoolerInfo``,
``DeviceAttributes``, ``DeviceEvents``, ``EventData``, ``FanInfo``,
``FieldValue``, ``FieldValues``, ``GpuDynamicPstatesInfo``,
``GpuDynamicPstatesUtilization``, ``InforomInfo``, ``PciInfo``,
``RepairStatus``, ``Temperature``, ``ThermalSensor``, ``ThermalSettings``.
These classes are still returned by :class:`~system.Device` properties and
methods but should not be directly instantiated by users.
(`#1942 <https://github.com/NVIDIA/cuda-python/pull/1942>`__)
- :attr:`system.Device.uuid` now returns the full NVML UUID with prefix
(e.g. ``GPU-...``). Use :attr:`system.Device.uuid_without_prefix` for
the previous behavior.
(`#1916 <https://github.com/NVIDIA/cuda-python/pull/1916>`__)
- :func:`args_viewable_as_strided_memory` and :class:`StridedMemoryView` are no
  longer available at the top level of :mod:`cuda.core`. They are available
  publicly from the :mod:`cuda.core.utils` module.
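
For code bases migrating across the renames above, a thin forwarding shim can
bridge old attribute names during the transition. This is an illustrative aid
only (not shipped with ``cuda.core``), covering just a few of the renames:

```python
import warnings

# Old-name -> new-name map for a few of the system.Device renames above.
DEVICE_RENAMES = {
    "is_c2c_mode_enabled": "is_c2c_enabled",
    "persistence_mode_enabled": "is_persistence_mode_enabled",
    "display_mode": "is_display_connected",
    "display_active": "is_display_active",
}

class CompatShim:
    """Forward old system.Device attribute names to their new spellings."""

    def __init__(self, device):
        self._device = device

    def __getattr__(self, name):
        new = DEVICE_RENAMES.get(name, name)
        if new != name:
            warnings.warn(f"{name} was renamed to {new}", DeprecationWarning)
        return getattr(self._device, new)

class FakeDevice:
    # Stand-in so the sketch runs without an NVIDIA driver.
    is_c2c_enabled = True
    is_display_active = False

dev = CompatShim(FakeDevice())
assert dev.is_c2c_mode_enabled is True  # old name forwards to the new one
assert dev.is_display_active is False   # new names pass through unchanged
```

The method-to-property conversions (e.g. ``get_supported_pstates()`` ->
``supported_pstates``) cannot be bridged this transparently and need call-site
updates.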
Fixes and enhancements
-----------------------

- Fixed :attr:`Buffer.is_managed` returning ``False`` for pool-allocated managed
memory (:class:`ManagedMemoryResource`), which caused DLPack interop to
misclassify managed buffers as ``kDLCUDAHost``. The fix queries both the
driver pointer attribute and the memory resource.
(`#1924 <https://github.com/NVIDIA/cuda-python/pull/1924>`__)
- :attr:`system.Device.arch` now returns ``UNKNOWN`` instead of raising
``ValueError`` when NVML reports an architecture not yet in the enum.
(`#1937 <https://github.com/NVIDIA/cuda-python/pull/1937>`__)
- :meth:`system.Device.get_field_values` and
:meth:`system.Device.clear_field_values` with an empty list no longer raise
``InvalidArgumentError``.
(`#1982 <https://github.com/NVIDIA/cuda-python/pull/1982>`__)
- :class:`Linker` error and info log retrieval now properly checks return codes
from nvJitLink, raising exceptions on failure instead of silently ignoring
errors.
(`#1993 <https://github.com/NVIDIA/cuda-python/pull/1993>`__)
- Fixed a potential crash when NVML event set creation failed on Windows, due to
``__dealloc__`` freeing an uninitialized handle.
(`#1992 <https://github.com/NVIDIA/cuda-python/pull/1992>`__)
- CUDA Runtime error messages are now more reliable, especially on Windows
where the runtime DLL name table could disagree with the installed bindings.
(`#2003 <https://github.com/NVIDIA/cuda-python/pull/2003>`__)
- Linux release wheels are now stripped of debug symbols, significantly reducing
package size. Debug builds are now supported via
``--config-settings=debug=true``.
(`#1890 <https://github.com/NVIDIA/cuda-python/pull/1890>`__)
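
The return-code discipline behind the :class:`Linker` log fix can be sketched
as follows; the function and status names are illustrative, and the real
implementation calls nvJitLink through ``cuda.bindings``:

```python
class NvJitLinkError(RuntimeError):
    """Raised when an nvJitLink call reports a nonzero status."""

def checked(status: int, what: str) -> None:
    # Raise on failure instead of silently ignoring the status, which is
    # the behavior change described above (status values are illustrative).
    if status != 0:
        raise NvJitLinkError(f"{what} failed with nvJitLink status {status}")

checked(0, "nvJitLinkGetErrorLogSize")  # success: returns None
try:
    checked(4, "nvJitLinkGetErrorLog")
except NvJitLinkError as exc:
    print(exc)  # -> nvJitLinkGetErrorLog failed with nvJitLink status 4
```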