SherrinZhou commented Dec 9, 2025

When perftest runs with RDMA CM enabled against a server under heavy load, the server may reject a CM connection request from the client (RDMA_CM_EVENT_REJECTED). The client then enters a retry loop in rdma_cm_client_connection.

However, the previous retry logic contained multiple flaws that caused segmentation faults, double frees, and heap corruption.

The error output looked like this:
RDMA CM event error:
Event: RDMA_CM_EVENT_REJECTED; error: 8.
ERRNO:Operation not supported.
Failed to handle RDMA CM event.
ERRNO: Operation not supported.
Failed to connect RDMA CM events.
ERRNO:Operation not supported.
Failed to resolve RDMA CM address.
ERRNO: Bad file descriptor.
Failed to destroy RDMA CM ID number 0.
ERRNO: Bad file descriptor.
Failed to destroy RDMA CM contexts.
ERRNO: Bad file descriptor.
free(): double free detected in tcache 2

The backtrace from the resulting core dump (an abort raised by the double-free check) looked like this:
(gdb) bt
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1 0x00007fcb90783db5 in __GI_abort () at abort.c:79
#2 0x00007fcb907dc4e7 in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7fcb908ebaae "%s\n") at ../sysdeps/posix/libc_fatal.c:181
#3 0x00007fcb907e35ec in malloc_printerr (str=str@entry=0x7fcb908ed6d8 "free(): double free detected in tcache 2") at malloc.c:5374
#4 0x00007fcb907e535d in _int_free (av=0x7fcb90b21bc0 <main_arena>, p=0xf459c0, have_lock=<optimized out>) at malloc.c:4213
#5 0x000000000040dd86 in create_rdma_cm_connection (ctx=0x7ffcf6f18410, user_param=0x7ffcf6f17fe0, comm=0x7ffcf6f17fc0, my_dest=0xf41590, rem_dest=0xf41a00) at src/perftest_communication.c:2949
#6 0x0000000000404b83 in main (argc=26, argv=0x7ffcf6f186f8) at src/send_bw.c:273

The following issues were identified and fixed:

  1. Double Free and Heap Corruption:
    The cleanup logic inside the retry loop destroyed resources (Event Channel,
    IDs) without clearing their pointers. Subsequent error handling paths tried
    to free them again, triggering "double free" or "Bad file descriptor".
    Additionally, connection_index was incremented unconditionally on every
    attempt, eventually overflowing the nodes array and corrupting heap metadata.

  2. Invalid Argument / Context Mismatch:
    The previous logic destroyed the Protection Domain (PD) and Event Channel
    on every retry but failed to properly re-initialize them or update the
    Context pointer. This caused ibv_create_qp to fail with ENOENT (No such
    file or directory) because it attempted to use a stale PD with a new Context.

  3. Client/Server State Desynchronization:
    Resetting the connection flow completely on the client side caused state
    desynchronization with the server (which tracks connection indices linearly),
    leading to further rejections.
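
The double free described in item 1 comes from destroying a resource without clearing its pointer, so a later error path releases it again. Below is a minimal sketch of the destroy-then-NULL idiom. It assumes a hypothetical `mock_node` whose `cm_id` is a plain heap allocation standing in for a real CM ID; none of these names are taken from perftest, and the real cleanup would call the corresponding rdmacm destroy routines instead of `free`.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a per-connection node; not perftest's struct. */
struct mock_node {
    int *cm_id; /* heap placeholder for a real CM ID handle */
};

/* Counts how many times a resource was actually released. */
static int destroy_calls = 0;

/* Allocate the placeholder resource; returns 0 on success, -1 on failure. */
static int make_node(struct mock_node *node)
{
    assert(node != NULL);
    node->cm_id = malloc(sizeof *node->cm_id);
    return node->cm_id ? 0 : -1;
}

/* Release the resource at most once: clear the pointer after freeing it,
 * so a second call from another error path becomes a harmless no-op. */
static void destroy_node(struct mock_node *node)
{
    assert(node != NULL);
    if (node->cm_id == NULL)
        return; /* already released */
    free(node->cm_id);
    node->cm_id = NULL;
    destroy_calls++;
}
```

With this shape, the retry loop and the final error handler can both call `destroy_node` on the same node without tripping the allocator's double-free check.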

This patch implements a robust "incremental retry" strategy:

  • Only failed QP/CM nodes are cleaned up and retried; established connections
    are preserved.
  • Global resources (PD, Event Channel, Context) are preserved across retries
    to ensure resource validity.
  • ctx_init is guarded to run only when the PD is not yet initialized.
  • Pointers are explicitly set to NULL after destruction to prevent double frees.
  • Memory leaks in hints->ai_src_addr allocation are fixed.
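
The incremental retry strategy listed above can be sketched as follows. This is an illustration under stated assumptions, not the patch itself: `mock_ctx`, `mock_conn`, `try_connect`, `NUM_NODES`, and `MAX_RETRIES` are hypothetical, the server's rejections are simulated with a counter, and the per-node retry limit is an assumption made here to keep the loop bounded. The points it tries to show are that the shared state is initialized only once, the index advances only after a successful connection (so it cannot run past the array), and already connected nodes are left alone.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NUM_NODES   3 /* hypothetical number of connections */
#define MAX_RETRIES 5 /* assumed per-node retry limit */

struct mock_conn {
    bool connected;
    int rejects_left; /* times the simulated server rejects this node */
};

struct mock_ctx {
    bool pd_ready; /* stands in for "PD already initialized" */
    int pd_inits;  /* how often the shared state was initialized */
    struct mock_conn nodes[NUM_NODES];
};

/* Guarded initialization: shared resources are set up only once. */
static void init_shared_once(struct mock_ctx *ctx)
{
    if (ctx->pd_ready)
        return;
    ctx->pd_ready = true;
    ctx->pd_inits++;
}

/* Simulated connection attempt; false models a rejected request. */
static bool try_connect(struct mock_conn *c)
{
    if (c->rejects_left > 0) {
        c->rejects_left--;
        return false;
    }
    c->connected = true;
    return true;
}

/* Connect every node, retrying only the node that failed. */
static int connect_all(struct mock_ctx *ctx)
{
    size_t index = 0;
    int retries = 0;

    assert(ctx != NULL);
    while (index < NUM_NODES) {
        init_shared_once(ctx);
        if (try_connect(&ctx->nodes[index])) {
            index++;     /* advance only on success */
            retries = 0; /* fresh retry budget for the next node */
            continue;
        }
        if (++retries > MAX_RETRIES)
            return -1;   /* give up; earlier nodes stay connected */
        /* Reset only the failed node's state before the next attempt. */
        ctx->nodes[index].connected = false;
    }
    return 0;
}
```

Because `index` only moves forward after a success and is checked against `NUM_NODES` on every iteration, repeated rejections cannot push it beyond the array, which is the overflow described in item 1.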

SherrinZhou force-pushed the fix/cm_retry_resource_leak branch 2 times, most recently from 4b90310 to 3f0bf12 on December 12, 2025 06:37
When running perftest with RDMA CM enabled (-R), if the server rejects a
connection request (RDMA_CM_EVENT_REJECTED), the client enters a retry
loop. The previous retry logic was flawed, leading to segmentation
faults, double frees, and heap corruption due to improper resource
management.

The following issues were identified and fixed:

1. Double Free and Heap Corruption:
   The cleanup logic inside the retry loop destroyed resources (Event
Channel, IDs) without clearing their pointers. Subsequent error handling
paths tried to free them again, triggering "double free" or "Bad file
descriptor". Additionally, `connection_index` was incremented
unconditionally on every attempt, eventually overflowing the `nodes`
array and corrupting heap metadata.

2. Invalid Argument / Context Mismatch:
   The previous logic destroyed the Protection Domain (PD) and Event
Channel on every retry but failed to properly re-initialize them or
update the Context pointer. This caused `ibv_create_qp` to fail with
ENOENT because it attempted to use a stale PD with a new
Context.

3. Client/Server State Desynchronization:
   Resetting the connection flow completely on the client side caused
state desynchronization with the server (which tracks connection indices
linearly), leading to further rejections.

Implement a robust "incremental retry" strategy:
- Only failed QP/CM nodes are cleaned up and retried; established
connections are preserved.
- Global resources (PD, Event Channel, Context) are preserved across
retries to ensure resource validity.
- `ctx_init` is guarded to run only when the PD is not yet initialized.
- Pointers are explicitly set to NULL after destruction to prevent
double frees.
- Memory leaks in `hints->ai_src_addr` allocation are fixed.

Signed-off-by: Ruizhe Zhou <[email protected]>
SherrinZhou force-pushed the fix/cm_retry_resource_leak branch from 3f0bf12 to 3cf9835 on December 12, 2025 06:42