Handle NVIDIA UMA memory fallback #466

Merged
Syllo merged 1 commit into Syllo:master from redyuan43:fix-dgx-spark-uma-memory
Apr 29, 2026

@redyuan43
Contributor

Summary

This PR improves NVIDIA UMA/iGPU memory reporting for platforms where NVML does not expose dedicated framebuffer memory. On these systems NVML can return NVML_ERROR_NOT_SUPPORTED for nvmlDeviceGetMemoryInfo*() because the GPU shares system memory instead of having a dedicated framebuffer.

The concrete hardware used to reproduce and validate this change is:

  • NVIDIA DGX Spark
  • GPU: NVIDIA GB10
  • Driver: 580.142
  • CUDA: 13.0
  • Architecture: unified memory / iGPU-style platform

Problem

On DGX Spark / GB10, nvidia-smi reports framebuffer memory as unsupported:

FB Memory Usage
    Total : N/A
    Used  : N/A
    Free  : N/A

Before this change, nvtop detected the unified-memory case but only populated total_memory from /proc/meminfo. The free and used values stayed unset, so the overall memory meter could not show how much UMA memory was available or in use.

Changes

  • Add a small NVIDIA backend fallback that reads /proc/meminfo when NVML reports unified memory / unsupported framebuffer memory.
  • Populate:
    • total_memory from MemTotal
    • free_memory from MemAvailable
    • used_memory as MemTotal - MemAvailable
    • mem_util_rate from the same values
  • Fix an accumulation bug in the generic process-memory fallback: used_memory was added to itself before each process's usage was added, which inflated the total.
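
The accumulation bug in the last bullet can be illustrated with a minimal sketch. This is not the code from the diff: struct proc_sample, sum_used_buggy and sum_used_fixed are hypothetical names chosen for illustration, and the buggy loop is only one plausible reading of "added to itself before adding each process usage".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a per-process record; not nvtop's real struct. */
struct proc_sample {
  uint64_t gpu_memory_usage; /* bytes attributed to this process */
};

/* Buggy pattern: the running total appears on both sides of "+=", so it is
 * doubled on every iteration before the process usage is added. */
uint64_t sum_used_buggy(const struct proc_sample *procs, size_t count) {
  assert(procs != NULL || count == 0);
  uint64_t used_memory = 0;
  for (size_t i = 0; i < count; ++i)
    used_memory += used_memory + procs[i].gpu_memory_usage;
  return used_memory;
}

/* Corrected pattern: each process contributes its usage exactly once. */
uint64_t sum_used_fixed(const struct proc_sample *procs, size_t count) {
  assert(procs != NULL || count == 0);
  uint64_t used_memory = 0;
  for (size_t i = 0; i < count; ++i)
    used_memory += procs[i].gpu_memory_usage;
  return used_memory;
}
```

With three processes using 100, 200 and 300 bytes, the corrected loop yields 600 while the buggy one yields 1100, and the gap grows quickly as more processes are added.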

This keeps the change scoped to the existing UMA path and does not alter normal discrete NVIDIA GPU memory reporting.
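
The /proc/meminfo fallback described in the Changes list could look roughly like the sketch below. It is an illustration under stated assumptions, not the merged implementation: struct uma_memory and both function names are hypothetical, the field names simply mirror the ones used in this description, and computing mem_util_rate by truncating integer division is an assumption about rounding. /proc/meminfo reports values in units labelled kB that are 1024 bytes each.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical result struct; field names mirror those in the PR text. */
struct uma_memory {
  uint64_t total_memory;  /* bytes, from MemTotal */
  uint64_t free_memory;   /* bytes, from MemAvailable */
  uint64_t used_memory;   /* bytes, MemTotal - MemAvailable */
  unsigned mem_util_rate; /* percent, truncated (assumption) */
};

/* Parse /proc/meminfo-style text line by line. Returns false if either
 * field is missing or the values are implausible. */
bool uma_memory_from_meminfo_text(const char *text, struct uma_memory *out) {
  assert(text != NULL && out != NULL);
  uint64_t total_kib = 0, avail_kib = 0, value = 0;
  bool have_total = false, have_avail = false;

  for (const char *line = text; *line != '\0';) {
    if (sscanf(line, "MemTotal: %" SCNu64 " kB", &value) == 1) {
      total_kib = value;
      have_total = true;
    } else if (sscanf(line, "MemAvailable: %" SCNu64 " kB", &value) == 1) {
      avail_kib = value;
      have_avail = true;
    }
    const char *next = strchr(line, '\n');
    if (next == NULL)
      break;
    line = next + 1;
  }
  if (!have_total || !have_avail || total_kib == 0 || avail_kib > total_kib)
    return false;

  out->total_memory = total_kib * 1024;
  out->free_memory = avail_kib * 1024;
  out->used_memory = out->total_memory - out->free_memory;
  out->mem_util_rate = (unsigned)(out->used_memory * 100 / out->total_memory);
  return true;
}

/* Read the start of the file; both fields sit near the top of /proc/meminfo. */
bool uma_memory_read(const char *path, struct uma_memory *out) {
  char buf[4096];
  FILE *f = fopen(path, "r");
  if (f == NULL)
    return false;
  size_t n = fread(buf, 1, sizeof buf - 1, f);
  fclose(f);
  buf[n] = '\0';
  return uma_memory_from_meminfo_text(buf, out);
}
```

A caller would invoke uma_memory_read("/proc/meminfo", &mem) only after the NVML query has reported the unsupported-framebuffer case, which keeps discrete GPUs on the NVML path. Splitting parsing from file access keeps the parser testable with fixed input.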

Validation

Built and tested on the DGX Spark / GB10 machine:

build/src/nvtop: ELF 64-bit LSB pie executable, ARM aarch64
nvtop version 3.3.2

nvtop --snapshot after the change:

"device_name": "NVIDIA GB10"
"mem_util": "10%"
"mem_total": "130595311616"
"mem_used": "13716254720"
"mem_free": "116879056896"
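
The snapshot values above are internally consistent: mem_used plus mem_free equals mem_total, and used/total is about 10.5%, which matches the reported 10% if the percentage is truncated. The sketch below encodes that check; struct mem_snapshot and snapshot_is_consistent are hypothetical names, and truncating division is an assumption about how the percentage is derived.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical container for the four memory fields of one snapshot. */
struct mem_snapshot {
  uint64_t total;        /* mem_total in bytes */
  uint64_t used;         /* mem_used in bytes */
  uint64_t free;         /* mem_free in bytes */
  unsigned util_percent; /* mem_util without the percent sign */
};

/* True when used + free == total and the utilization equals the truncated
 * ratio used * 100 / total. */
bool snapshot_is_consistent(const struct mem_snapshot *s) {
  assert(s != NULL);
  if (s->total == 0 || s->used > s->total)
    return false;
  return s->used + s->free == s->total &&
         s->used * 100 / s->total == s->util_percent;
}
```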

Interactive UI was also tested on the DGX Spark desktop session. The memory meter changed from N/A to a UMA system memory value, for example:

13.045Gi/121.626Gi
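
The total shown in the meter lines up with the snapshot's mem_total once bytes are converted to GiB (divided by 2^30). A minimal sketch of that conversion follows; bytes_to_gib is a hypothetical helper, not an nvtop function, and the UI's own formatting code may differ.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a byte count to GiB (1 GiB = 2^30 bytes). */
double bytes_to_gib(uint64_t bytes) {
  return (double)bytes / (1024.0 * 1024.0 * 1024.0);
}
```

For 130595311616 bytes this gives roughly 121.626 GiB, the total displayed in the meter.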

Local build validation was also run on an x86_64 NVIDIA system with discrete RTX 3060 GPUs to ensure the normal NVML framebuffer path still builds and reports discrete GPU memory through NVML.

Not addressed / expected remaining N/A fields

This PR intentionally does not attempt to synthesize metrics that NVML and nvidia-smi still do not expose on DGX Spark / GB10. On the tested machine these remain N/A in nvidia-smi as well:

Fan Speed              : N/A
Tx Throughput          : N/A
Rx Throughput          : N/A
Memory Clock           : N/A
Power Limit            : N/A

Those are left unchanged because they require driver/platform telemetry support or another documented data source. This PR only addresses the UMA memory reporting case where NVIDIA documentation recommends estimating memory resources from Linux system memory counters rather than relying on framebuffer memory.

@Syllo Syllo merged commit 4c56f89 into Syllo:master Apr 29, 2026
3 checks passed
@Syllo
Owner

Syllo commented Apr 29, 2026

Thanks a lot
