Skip to content

Conversation

@pvelesko
Copy link
Collaborator

@pvelesko pvelesko commented Oct 14, 2025

This PR refactors the queue synchronization system to replace the previous "last event tracking" approach with a more robust event dependency system. The changes eliminate race conditions and simplify the synchronization logic.

Key Changes

1. Queue Synchronization Refactor

Removed Components:

  • LastEvent_ member variable from Queue class
  • LastEventMtx mutex for protecting last event access
  • updateLastEvent() method for updating last event references
  • getLastEvent() and getLastEventNoLock() methods
  • getSyncQueuesLastEvents() method for collecting synchronization events

Added Components:

  • addDependenciesQueueSyncImpl() template method for creating event dependencies
  • Template-based approach allows different backends to implement their own event handle types
  • More flexible dependency management system

2. Device Destruction Ordering Fix

Before:

chipstar::Device::~Device() {
  LOCK(DeviceVarMtx);      // chipstar::Device::HostPtrToCompiledMod_
  LOCK(QueueAddRemoveMtx); // chipstar::Device::ChipQueues_
  // ... cleanup code ...
}

After:

chipstar::Device::~Device() {
  // First handle queues with QueueAddRemoveMtx
  {
    LOCK(QueueAddRemoveMtx); // chipstar::Device::ChipQueues_
    // ... queue cleanup ...
  }

  // Then handle device variables with DeviceVarMtx
  {
    LOCK(DeviceVarMtx); // chipstar::Device::HostPtrToCompiledMod_
    // ... device variable cleanup ...
  }
}

3. Event Management Improvements

Added Methods:

  • checkEvents() virtual method to base Context class (default no-op implementation)
  • checkEventsForAllContexts() method to Backend class for processing events across all contexts

Removed Calls:

  • updateLastEvent(nullptr) calls from queue removal paths
  • updateLastEvent(nullptr) calls from cleanup routines

4. Queue Query Interface Changes

Before:

bool query() {
  if (!LastEvent_)
    return true;

  if (LastEvent_->updateFinishStatus(false))
    if (LastEvent_->isFinished())
      return true;

  return false;
};

After:

virtual bool query() = 0;  // Pure virtual method

Technical Details

Event Dependency System

The new addDependenciesQueueSyncImpl() template method:

template<typename EventHandle>
std::pair<std::vector<EventHandle>, LockGuardVector>
addDependenciesQueueSyncImpl(
    chipstar::Backend* BackendPtr,
    std::shared_ptr<chipstar::Event> TargetEvent,
    std::function<EventHandle(chipstar::Queue*, std::shared_ptr<chipstar::Event>&)> CreateMarkerInQueue)

Key Features:

  • Creates marker events in other queues for synchronization
  • Adds dependencies to target events
  • Tracks events through backend event management
  • Handles both default stream and blocking queue synchronization

Synchronization Logic

Default Streams (Legacy/Per-Thread):

  • Create markers in all blocking queues
  • Add dependencies to ensure proper ordering

Blocking Queues:

  • Create markers in default streams (legacy and per-thread)
  • Synchronize with default stream execution

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pvelesko pvelesko marked this pull request as ready for review October 20, 2025 09:41
@colleeneb
Copy link
Contributor

Thanks! gamess worked fine with this PR.

The test Unit_hipStreamPerThread_DeviceReset_1 was hanging for me on aurora (with this branch and LLVM-19). I'm investigating, will see if it's reproducible and let you know.

@pvelesko pvelesko added in-progress Work still in progress and removed in-progress Work still in progress labels Oct 31, 2025
@colleeneb
Copy link
Contributor

colleeneb commented Nov 13, 2025

On a fresh re-pull and run, there are no hangs. I do see some fails that aren't present in a pull from develop:

99% tests passed, 4 tests failed out of 1074

Label Time Summary:
SAMPLE      =  22.88 sec*proc (46 tests)
internal    =  22.88 sec*proc (46 tests)

Total Test time (real) = 350.61 sec

The following tests did not run:
        565 - Unit_hipMallocManaged_HostDeviceConcurrent (Skipped)
        566 - Unit_hipMallocManaged_MultiChunkSingleDevice (Skipped)
        567 - Unit_hipMallocManaged_MultiChunkMultiDevice (Skipped)
        568 - Unit_hipMallocManaged_OverSubscription (Skipped)
        570 - Unit_hipMallocManaged_TwoPointers - int (Skipped)
        571 - Unit_hipMallocManaged_TwoPointers - float (Skipped)
        572 - Unit_hipMallocManaged_TwoPointers - double (Skipped)
        599 - Unit_hipMallocManaged_FlgParam (Skipped)
        600 - Unit_hipMallocManaged_AccessMultiStream (Skipped)
        637 - Unit_hipMallocManaged_Advanced (Skipped)
        732 - Unit_hipMemsetSync (Skipped)
        733 - Unit_hipMemsetDSync - int8_t (Skipped)
        734 - Unit_hipMemsetDSync - int16_t (Skipped)
        735 - Unit_hipMemsetDSync - uint32_t (Skipped)
        736 - Unit_hipMemset2DSync (Skipped)
        737 - Unit_hipMemset3DSync (Skipped)
        815 - Unit_hipDeviceTotalMem_NonSelectedDevice (Skipped)
        817 - Unit_hipGetDeviceCount_HideDevices (Skipped)
        826 - Unit_hipSetGetDevice_Positive_Threaded_Basic (Skipped)
        830 - Unit_hipDeviceGetP2PAttribute_Basic (Skipped)
        831 - Unit_hipDeviceGetP2PAttribute_Negative (Skipped)
        832 - Unit_hipDeviceGetDefaultMemPool_Positive_Basic (Skipped)
        834 - Unit_hipDeviceCanAccessPeer_positive (Skipped)
        835 - Unit_hipDeviceCanAccessPeer_negative (Skipped)
        836 - Unit_hipDeviceEnableDisablePeerAccess_positive (Skipped)
        837 - Unit_hipDeviceEnablePeerAccess_negative (Skipped)
        838 - Unit_hipDeviceDisablePeerAccess_negative (Skipped)
        842 - Unit_hipDeviceSetMemPool_Positive_Basic (Skipped)
        843 - Unit_hipDeviceGetMemPool_Positive_Default (Skipped)
        844 - Unit_hipDeviceGetMemPool_Positive_Basic (Skipped)
        845 - Unit_hipDeviceGetMemPool_Positive_Threaded (Skipped)
        989 - TestBufferDevAddr (Skipped)

The following tests FAILED:
        765 - Unit_hipStreamAddCallback_WithDefaultStream (Subprocess aborted)
        766 - Unit_hipStreamAddCallback_WithCreatedStream (Subprocess aborted)
        1016 - stream (Subprocess aborted)                       SAMPLE internal
        1020 - hipAddCallback (Subprocess aborted)               SAMPLE internal

All fail with:

1354: Error: Unmapped API or API Error Code encountered at /lus/flare/projects/Aurora_deployment/bertoni/chip-spv_source-20251112-19-release/chip-spv/src/backend/Level0/CHIPBackendLevel0.cc:615
1354: API call: zeCommandListAppendBarrier
1354: Error code: ZE_RESULT_ERROR_INVALID_ARGUMENT
6/6 Test #1354: stream .................................................Subprocess aborted***Exception:   0.08 sec

Did you see these at all?

@pvelesko pvelesko force-pushed the syncQueues branch 3 times, most recently from 8a942a2 to 6dad2aa Compare December 9, 2025 20:37
…er events

Replace LastEvent tracking with marker-based queue synchronization:
- Remove LastEvent_ member and updateLastEvent() from Queue class
- Add addDependenciesQueueSync() template that creates marker events in
  other queues for implicit synchronization between blocking/default streams
- Make query() pure virtual, backends implement using native sync APIs

Level Zero changes:
- Use ZE_COMMAND_QUEUE_FLAG_IN_ORDER for command queues
- Implement query() with zeCommandListHostSynchronize zero-timeout check
- Move event maintenance from EventMonitor to on-demand in getEventFromPool()
- Add createEventDedicated() for callback events (workaround for PVC driver
  issue with event reuse between immediate and regular command lists)
- Get dependencies before locking CommandListMtx to avoid deadlock
- Reduce event pool size from 1000 to 10

OpenCL changes:
- Track IsEmptyQueue_ to handle query() for never-used queues
- Implement addDependenciesQueueSync() with cl_event markers
- Rework callback mechanism to use marker enqueued from callback thread
  instead of user events (workaround for Intel driver bug)

Other:
- Add checkEvents() virtual method to Context for event processing
- Fix Device destructor to avoid holding multiple locks simultaneously
- Remove L0CollectEventsTimeout env var (no longer needed)
- TestBufferDevAddr now conditional on OpenCL backend
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants