[BUG]: Investigate the mixed-join performance regression caused by register spilling

### Is this a duplicate?

- [x] I confirmed there appear to be no duplicate issues for this bug (https://github.com/NVIDIA/cuCollections/issues)

### Type of Bug

Performance

### Describe the bug

cudf is a primary user of cuco data structures. Over the past three years, we have been developing new implementations of cuco hash tables https://github.com/NVIDIA/cuCollections/issues/110 to unify all variants under a single open-addressing–based design. In parallel, we have been migrating all cudf use cases from the legacy cuco hash tables to the new ones https://github.com/rapidsai/cudf/issues/12261.

In most cases, the migration has shown clear advantages: the new implementations typically consume fewer registers than the legacy versions and deliver superior runtime performance.

However, this is not always the case. In some scenarios where the new code performs worse, the regression can be traced to additional logic in the new design. For example:

* The new implementation supports `erase` for all hash tables. While this has negligible performance impact for sets and maps, it significantly slows down CG-based insertions for multisets and multimaps https://github.com/NVIDIA/cuCollections/pull/681.
* Additional checks ensure correct functionality even at 100% occupancy https://github.com/NVIDIA/cuCollections/pull/647.

Individually these additions may seem minor, but combined they can make the new implementation noticeably slower than the legacy code.

In the case of the mixed join migration https://github.com/rapidsai/cudf/pull/19660, our first attempt by simply replacing the old device-view usage with the new multiset ref caused up to a 20x slowdown. Profiling shows that this is due to register spilling. While the new code nominally uses fewer registers, the issue appears to be longer live ranges: the new design forces temporary data to remain live longer than in the legacy version, which in turn causes spilling.

From what I have tested so far, no single design choice seems solely responsible for this. Several components may contribute:

* **Device `rebind_*` logic**: each call rebinds a device ref with new hash functions, key equality, or hash-table operations (insert, count, retrieve, etc.). These functions return a device ref by copy; returning by `&&` might reduce live ranges and get rid of the issue.
* **Equal wrapper**: combines sentinel checks and key-equality checks in one class, using a strong type to represent 3-way comparison results, which consumes more memory than a simple `bool`.
* **Probing iterator**: wraps hash functions and operations to compute the initial slot and step size for probing.
* **Mixin logic**: uses `reinterpret_cast` at runtime to cast refs to the base OA class.
* **Use of thrust algos**: adding two elements via `thrust::reduce`.

I don’t believe any single factor is the sole root cause; however, collectively, they seem to extend register live ranges. To match the legacy performance, I had to remove all of the above abstractions and revert to naive free functions (non-OO design). This suggests either (a) our object-oriented design is causing data to remain live longer than necessary, or (b) some aspect of cuco internals is confusing the compiler, preventing it from analyzing live ranges effectively. Interestingly, in previous work, we replaced the OO finalizer with free functions https://github.com/NVIDIA/cuCollections/pull/701, which significantly improved performance. At the time, we didn’t consider register live ranges, but it could be related.

### How to Reproduce

So far, I haven’t found a way to reproduce this performance regression within the cuco benchmarks, mainly because the mixed join use case is fairly complex. At the moment, the only way to reproduce the issue is by comparing the new cuco implementation with the legacy one in cudf using the mixed join benchmarks.

### Expected behavior

[mixed-join-ref.zip](https://github.com/user-attachments/files/22499689/mixed-join-ref.zip)

[mixed-join-new.zip](https://github.com/user-attachments/files/22499682/mixed-join-new.zip)

NCU profiling between the old and the new mixed join for references


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[BUG]: Investigate the mixed-join performance regression caused by register spilling #761

Is this a duplicate?

Type of Bug

Describe the bug

How to Reproduce

Expected behavior

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

[BUG]: Investigate the mixed-join performance regression caused by register spilling #761

Description

Is this a duplicate?

Type of Bug

Describe the bug

How to Reproduce

Expected behavior

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions