Skip to content

Weird semantic of hipMenset on Trento/7A53 #3869

@etiennemlb

Description

@etiennemlb

On Trento/7A53 + MI250X nodes (aka Frontier/Lumi etc. nodes), we can read GPU memory from the host.

The hipMemset semantic seems to ignore that the host can observe the GPU memory, leading to race conditions.

In CUDA, you can't really observe GPU memory from the host by directly reading/writing to it. You have to go through, say, cudaMemcpy or a kernel which will be enqueued onto a stream, ensuring that the next operation in a stream can observe the previous operation in the same stream.
In CUDA, cudaMemset is always async unless the destination ptr is pageable memory. In hip, I do not think the behavior is well documented, but I would assume the same as tools like hipify seems to naively translate to it from cuda. https://rocm.docs.amd.com/projects/HIP/en/docs-develop/reference/hip_runtime_api/modules/memory_management.html#_CPPv49hipMemsetPvi6size_t

You would have to synchronize with the stream/device if you want other stream/device to observe the changes.
But on Trento/7A53 + MI250X, you can observe GPU memory from the host without synchronizing with a stream or a device.

This leads to the following issue:

  • allocate a device buffer;
  • modify a device buffer using a function that should synchronize host and device if side device effects can be host observable;
  • launch host operation reading device memory;
  • observe a race condition.

The 3rd step could very well be an MPI operation as given in this AMD provided (!) example: https://github.com/amd/HPCTrainingExamples/blob/main/MPI-examples/collective.cpp
and note that using Cray MPICH which does make use of the Trento/7A53 + MI250X node specificity for small, latency sensitive transfers, will fail in the AMD example given above.

The code below reproduce the issue which start to appear around rocm/5.5.1 up to now (I haven't tested 7.0.0). This makes me think that HIP is bugged under this node configuration, it should block instead of completing asynchronously.

#include <hip/hip_runtime.h>

#include <chrono>
#include <cstdlib>
#include <iostream>
#include <vector>

#define HIP_CHECK(cmd)                                               \
    do {                                                             \
        hipError_t error = cmd;                                      \
        if(error != hipSuccess) {                                    \
            std::fprintf(stderr, "HIP/CUDA error %d at %s:%d: %s\n", \
                         error, __FILE__, __LINE__,                  \
                         hipGetErrorString(error));                  \
            std::exit(EXIT_FAILURE);                                 \
        }                                                            \
    } while(false)

template <int kSize,
          typename T>
__attribute__((noinline)) void
Repro(T* d_elements,
      T* elements_after_memset,
      T* elements_after_sleep,
      T  set_to) {
    static constexpr int kBytes = kSize * sizeof(T);
    HIP_CHECK(hipMemset(d_elements, set_to, kBytes));

    // // NOTE: needed to ensure visiblity on MI250X + AMD EPYC Trento/7A53
    // HIP_CHECK(hipDeviceSynchronize());
    // HIP_CHECK(hipStreamSynchronize(nullptr));

    for(int i = 0; i < kSize; i++) {
        // Read from host, authorized on mi250x + trento.
        elements_after_memset[i] = d_elements[i];
    }

    HIP_CHECK(hipStreamSynchronize(nullptr));

    for(int i = 0; i < kSize; i++) {
        elements_after_sleep[i] = d_elements[i];
    }
}

int main() {
    using T                     = int;
    static constexpr int kSize  = 16;
    static constexpr int kBytes = kSize * sizeof(int);

    static constexpr T set_to = -1;

    std::vector<T> elements_after_memset;
    std::vector<T> elements_after_sleep;

    elements_after_memset.resize(kSize);
    elements_after_sleep.resize(kSize);

    int* d_elements;
    HIP_CHECK(hipMalloc(&d_elements, kBytes));
    HIP_CHECK(hipDeviceSynchronize());

    Repro<kSize>(d_elements,
                 elements_after_memset.data(),
                 elements_after_sleep.data(),
                 set_to);

    std::cout << "Bad values after hipMemset:" << std::endl;
    for(int i = 0; i < kSize; i++) {
        if(elements_after_memset[i] != set_to) {
            std::printf("d_data[%d] = %d\n", i, elements_after_memset[i]);
        }
    }

    std::cout << "Bad values after second hipMemset:" << std::endl;
    for(int i = 0; i < kSize; i++) {
        if(elements_after_memset[i] != set_to) {
            std::printf("d_data[%d] = %d\n", i, elements_after_sleep[i]);
        }
    }

    HIP_CHECK(hipFree(d_elements));

    return 0;
}

Metadata

Metadata

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions