Skip to content

[BUG]: Investigate the mixed-join performance regression caused by register spilling #761

@PointKernel

Description

@PointKernel

Is this a duplicate?

Type of Bug

Performance

Describe the bug

cudf is a primary user of cuco data structures. Over the past three years, we have been developing new implementations of cuco hash tables #110 to unify all variants under a single open-addressing–based design. In parallel, we have been migrating all cudf use cases from the legacy cuco hash tables to the new ones rapidsai/cudf#12261.

In most cases, the migration has shown clear advantages: the new implementations typically consume fewer registers than the legacy versions and deliver superior runtime performance.

However, this is not always the case. In some scenarios where the new code performs worse, the regression can be traced to additional logic in the new design. For example:

Individually these additions may seem minor, but combined they can make the new implementation noticeably slower than the legacy code.

In the case of the mixed join migration rapidsai/cudf#19660, our first attempt by simply replacing the old device-view usage with the new multiset ref caused up to a 20x slowdown. Profiling shows that this is due to register spilling. While the new code nominally uses fewer registers, the issue appears to be longer live ranges: the new design forces temporary data to remain live longer than in the legacy version, which in turn causes spilling.

From what I have tested so far, no single design choice seems solely responsible for this. Several components may contribute:

  • Device rebind_* logic: each call rebinds a device ref with new hash functions, key equality, or hash-table operations (insert, count, retrieve, etc.). These functions return a device ref by copy; returning by && might reduce live ranges and get rid of the issue.
  • Equal wrapper: combines sentinel checks and key-equality checks in one class, using a strong type to represent 3-way comparison results, which consumes more memory than a simple bool.
  • Probing iterator: wraps hash functions and operations to compute the initial slot and step size for probing.
  • Mixin logic: uses reinterpret_cast at runtime to cast refs to the base OA class.
  • Use of thrust algos: adding two elements via thrust::reduce.

I don’t believe any single factor is the sole root cause; however, collectively, they seem to extend register live ranges. To match the legacy performance, I had to remove all of the above abstractions and revert to naive free functions (non-OO design). This suggests either (a) our object-oriented design is causing data to remain live longer than necessary, or (b) some aspect of cuco internals is confusing the compiler, preventing it from analyzing live ranges effectively. Interestingly, in previous work, we replaced the OO finalizer with free functions #701, which significantly improved performance. At the time, we didn’t consider register live ranges, but it could be related.

How to Reproduce

So far, I haven’t found a way to reproduce this performance regression within the cuco benchmarks, mainly because the mixed join use case is fairly complex. At the moment, the only way to reproduce the issue is by comparing the new cuco implementation with the legacy one in cudf using the mixed join benchmarks.

Expected behavior

mixed-join-ref.zip

mixed-join-new.zip

NCU profiling between the old and the new mixed join for references

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions