perftest: Fix double frees during RDMA CM connection retry #368

SherrinZhou · 2025-12-09T03:31:06Z

When perftest runs with RDMA CM enabled and the server is under heavy load, the server may reject a CM connection request from the client. On such a rejection (RDMA_CM_EVENT_REJECTED), the client enters a retry loop in rdma_cm_client_connection.

However, the previous retry logic contained multiple flaws causing segmentation faults, double frees, and heap corruption.

The error output looked like this:
RDMA CM event error:
Event: RDMA_CM_EVENT_REJECTED; error: 8.
ERRNO:Operation not supported.
Failed to handle RDMA CM event.
ERRNO: Operation not supported.
Failed to connect RDMA CM events.
ERRNO:Operation not supported.
Failed to resolve RDMA CM address.
ERRNO: Bad file descriptor.
Failed to destroy RDMA CM ID number 0.
ERRNO: Bad file descriptor.
Failed to destroy RDMA CM contexts.
ERRNO: Bad file descriptor.
free(): double free detected in tcache 2

The backtrace from the resulting core dump (an abort raised by glibc's double-free check, signal 6) looked like this:
(gdb) bt
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1 0x00007fcb90783db5 in __GI_abort () at abort.c:79
#2 0x00007fcb907dc4e7 in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7fcb908ebaae "%s\n") at ../sysdeps/posix/libc_fatal.c:181
#3 0x00007fcb907e35ec in malloc_printerr (str=str@entry=0x7fcb908ed6d8 "free(): double free detected in tcache 2") at malloc.c:5374
#4 0x00007fcb907e535d in _int_free (av=0x7fcb90b21bc0 <main_arena>, p=0xf459c0, have_lock=<optimized out>) at malloc.c:4213
#5 0x000000000040dd86 in create_rdma_cm_connection (ctx=0x7ffcf6f18410, user_param=0x7ffcf6f17fe0, comm=0x7ffcf6f17fc0, my_dest=0xf41590, rem_dest=0xf41a00) at src/perftest_communication.c:2949
#6 0x0000000000404b83 in main (argc=26, argv=0x7ffcf6f186f8) at src/send_bw.c:273

The following issues were identified and fixed:

1. Double Free and Heap Corruption:
   The cleanup logic inside the retry loop destroyed resources (Event Channel,
   IDs) without clearing their pointers. Subsequent error handling paths tried
   to free them again, triggering "double free" or "Bad file descriptor".
   Additionally, connection_index was incremented unconditionally on every
   attempt, eventually overflowing the nodes array and corrupting heap metadata.
2. Invalid Argument / Context Mismatch:
   The previous logic destroyed the Protection Domain (PD) and Event Channel
   on every retry but failed to properly re-initialize them or update the
   Context pointer. This caused ibv_create_qp to fail with ENOENT (No such
   file or directory) because it attempted to use a stale PD with a new Context.
3. Client/Server State Desynchronization:
   Resetting the connection flow completely on the client side caused state
   desynchronization with the server (which tracks connection indices linearly),
   leading to further rejections.
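To make the first issue concrete, here is a minimal sketch of the "destroy once, then clear the pointer" pattern and of an index that only advances on success. It is not perftest code: `demo_node`, `demo_destroy_id`, `demo_next_index` and `demo_free_count` are hypothetical names, and a plain `malloc`/`free` pair stands in for creating and destroying a CM ID.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a per-connection CM node (not perftest's struct). */
struct demo_node {
    int *cma_id;   /* stands in for a CM ID handle */
};

/* Counts how many times memory was really released, for checking. */
static int demo_free_count = 0;

/* Release the node's ID once and clear the pointer, so that a later
 * cleanup pass sees NULL and skips it instead of freeing it again. */
static void demo_destroy_id(struct demo_node *node)
{
    assert(node != NULL);
    if (node->cma_id == NULL)
        return;                 /* already destroyed: nothing to do */
    free(node->cma_id);
    node->cma_id = NULL;
    demo_free_count++;
}

/* Advance the connection index only after a successful attempt,
 * and never past the end of an array of num_nodes entries. */
static int demo_next_index(int index, int succeeded, int num_nodes)
{
    if (succeeded && index < num_nodes)
        return index + 1;
    return index;               /* failed attempt: retry the same slot */
}
```

Calling `demo_destroy_id` twice on the same node releases the memory only once, which is the property the retry and error paths rely on when they both run cleanup.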

This patch implements a robust "incremental retry" strategy:

- Only failed QP/CM nodes are cleaned up and retried; established connections
  are preserved.
- Global resources (PD, Event Channel, Context) are preserved across retries
  to ensure resource validity.
- ctx_init is guarded to run only when the PD is not yet initialized.
- Pointers are explicitly set to NULL after destruction to prevent double frees.
- Memory leaks in hints->ai_src_addr allocation are fixed.
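The incremental retry idea can be sketched roughly as follows. This is a simulation under stated assumptions rather than the patch itself: `demo_conn`, `demo_try_connect`, `demo_connect_all` and the `DEMO_MAX_RETRIES` budget are hypothetical, and a `rejects_left` counter stands in for a server that rejects the first few requests.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_MAX_RETRIES 3      /* hypothetical retry budget per node */

enum demo_state { DEMO_IDLE, DEMO_CONNECTED };

/* Hypothetical per-connection record (not perftest's struct). */
struct demo_conn {
    enum demo_state state;
    int attempts;               /* connect attempts made on this node */
    int rejects_left;           /* simulated: server rejects this many times */
};

/* Simulated connect: fails while the server still "rejects". */
static bool demo_try_connect(struct demo_conn *c)
{
    c->attempts++;
    if (c->rejects_left > 0) {
        c->rejects_left--;
        return false;
    }
    c->state = DEMO_CONNECTED;
    return true;
}

/* Connect every node in order. A rejection retries only the failing node;
 * nodes that are already connected are never touched again.
 * Returns the number of connected nodes. */
static int demo_connect_all(struct demo_conn *nodes, int n)
{
    int connected = 0;

    assert(nodes != NULL && n >= 0);
    for (int i = 0; i < n; i++) {
        for (int retry = 0; retry <= DEMO_MAX_RETRIES; retry++) {
            if (nodes[i].state == DEMO_CONNECTED)
                break;
            if (demo_try_connect(&nodes[i]))
                break;
            /* Per-node cleanup would go here; shared resources
             * (PD, event channel, context) are left untouched. */
        }
        if (nodes[i].state == DEMO_CONNECTED)
            connected++;
    }
    return connected;
}
```

With three nodes where only the second is rejected twice, the first and third are attempted once each and the second three times, so the client's per-connection index stays in step with a server that counts connections linearly.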

When running perftest with RDMA CM enabled (-R), if the server rejects a
connection request (RDMA_CM_EVENT_REJECTED), the client enters a retry loop.
The previous retry logic was flawed, leading to segmentation faults, double
frees, and heap corruption due to improper resource management.

The following issues were identified and fixed:

1. Double Free and Heap Corruption:
   The cleanup logic inside the retry loop destroyed resources (Event Channel,
   IDs) without clearing their pointers. Subsequent error handling paths tried
   to free them again, triggering "double free" or "Bad file descriptor".
   Additionally, `connection_index` was incremented unconditionally on every
   attempt, eventually overflowing the `nodes` array and corrupting heap
   metadata.

2. Invalid Argument / Context Mismatch:
   The previous logic destroyed the Protection Domain (PD) and Event Channel
   on every retry but failed to properly re-initialize them or update the
   Context pointer. This caused `ibv_create_qp` to fail with ENOENT because it
   attempted to use a stale PD with a new Context.

3. Client/Server State Desynchronization:
   Resetting the connection flow completely on the client side caused state
   desynchronization with the server (which tracks connection indices
   linearly), leading to further rejections.

Implement a robust "incremental retry" strategy:

- Only failed QP/CM nodes are cleaned up and retried; established connections
  are preserved.
- Global resources (PD, Event Channel, Context) are preserved across retries
  to ensure resource validity.
- `ctx_init` is guarded to run only when the PD is not yet initialized.
- Pointers are explicitly set to NULL after destruction to prevent double
  frees.
- Memory leaks in `hints->ai_src_addr` allocation are fixed.

Signed-off-by: Ruizhe Zhou <[email protected]>

SherrinZhou force-pushed the fix/cm_retry_resource_leak branch 2 times, most recently from 4b90310 to 3f0bf12 · December 12, 2025 06:37

SherrinZhou force-pushed the fix/cm_retry_resource_leak branch from 3f0bf12 to 3cf9835 · December 12, 2025 06:42
