Fix m144 teardown crashes#967
Conversation
|
|
|
@dfed are you able to validate this hypothesis? Looks like |
|
Trying to isolate the changes between m137 vs m144: Crash 1 (ICE transport use-after-free): // m137 — allocator stays alive
network_thread()->BlockingCall([this] {
TeardownDataChannelTransport_n({});
transport_controller_.reset();
port_allocator_->DiscardCandidatePool(); // pool discarded, allocator lives
if (network_thread_safety_) {
network_thread_safety_->SetNotAlive();
}
});
// m144 — allocator destroyed
void PeerConnection::CloseOnNetworkThread() {
network_thread()->BlockingCall([this] {
TeardownDataChannelTransport_n(RTCError::OK());
port_allocator_->DiscardCandidatePool();
transport_controller_.reset();
port_allocator_.reset(); // NEW: allocator destroyed
network_thread_safety_->SetNotAlive();
});
}Crash 2 (AVAudioEngine deallocation assertion): // m137 DeleteChannel — no media_engine_ref_
context()->worker_thread()->BlockingCall([&]() {
auto channel_to_delete = std::move(channel_);
for (const auto& sender : senders_)
sender->internal()->SetMediaChannel(nullptr);
for (const auto& receiver : receivers_)
receiver->internal()->SetMediaChannel(nullptr);
channel_to_delete.reset();
});
// m144 DeleteChannel — releases media engine reference
context()->worker_thread()->BlockingCall([&]() {
RTC_DCHECK_RUN_ON(context()->worker_thread());
auto channel_to_delete = std::move(channel_);
for (const auto& sender : senders_)
sender->internal()->SetMediaChannel(nullptr);
for (const auto& receiver : receivers_)
receiver->internal()->SetMediaChannel(nullptr);
channel_to_delete.reset();
media_engine_ref_.reset(); // NEW: can trigger audio engine teardown
});Both are fixed in m146 by commit |
|
@pblazej not sure how I would validate this. Seems plausible to me. But I'm not manually calling any of these APIs. What specifically can I help validate? |
|
@dfed I mean - are you able to test/deploy the fix in production? |
|
got it @pblazej. the crash isn't frequent enough for me to reproduce locally, and shipping a release build from a PR branch isn't something we usually do. |
|
@dfed what's the repro rate from your metrics? Is it in the order of 0.1%-0.001%? |
|
@hiroshihorie I tried to re-create the decision chain behind removing tracks, I think that's just an unfortunate side effect of WebRTC internal changes (similarly to the previous little regressions). Wdyt about removing that tentatively? |
These two new issues combined affect 0.015% of sessions. They are our second and fourth highest crasher. |
removeTrack nulls sender tracks and changes transceiver directions, causing PeerConnection::Close() to skip ClearSend/DetachTrack in its StopTransceiverProcedure and hit edge cases in the worker-thread teardown (ICE use-after-free, AVAudioEngine deallocation assertion). Close() handles full cleanup on its own. The loop was originally commented out as "not required?" and was accidentally uncommented during a threading refactor (101e09d). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
82a786c to
2e0b7f4
Compare
| // Stop listening to delegate | ||
| _pc.delegate = nil | ||
| // Remove all senders (if any) | ||
| for sender in _pc.senders { |
There was a problem hiding this comment.
do we have tests that protects us from regressions ?
[SymbolicatedCrashes.zip](https://github.com/user-attachments/files/26703300/SymbolicatedCrashes.zip) Calling `removeTrack` for all senders before `close()` puts WebRTC into a partially-cleaned state that causes `Close()` to follow a different teardown path, crashing on the worker thread. `RemoveTrackOrError` (`pc/peer_connection.cc:975`) nulls the sender track and changes transceiver direction: ```cpp sender->SetTrack(nullptr); // nulls track transceiver->internal()->set_direction(...); // kSendRecv → kRecvOnly sdp_handler_->UpdateNegotiationNeeded(); ``` `SetTrack(nullptr)` triggers `ClearSend()` (`pc/rtp_sender.cc:494`) because `prev_can_send_track` was true: ```cpp track_ = nullptr; if (can_send_track()) { // false — track_ is null SetSend(); } else if (prev_can_send_track) { // true ClearSend(); // disables audio sending on worker thread } ``` ```cpp void AudioRtpSender::ClearSend() { bool success = worker_thread_->BlockingCall([&] { return voice_media_channel()->SetAudioSend(ssrc_, false, &options, nullptr); }); } ``` Later, `Close()` (`pc/peer_connection.cc:1876`) runs `StopTransceiverProcedure` → `sender->Stop()` (`pc/rtp_sender.cc:562`): ```cpp void RtpSenderBase::Stop() { if (stopped_) return; if (track_) { // NULL — removeTrack already set it DetachTrack(); // SKIPPED track_->UnregisterObserver(this); // SKIPPED } if (can_send_track()) { // FALSE — track_ is null ClearSend(); // SKIPPED RemoveTrackFromStats(); // SKIPPED } media_channel_ = nullptr; stopped_ = true; } ``` Without `removeTrack`, `Stop()` runs the full path: `DetachTrack` → `ClearSend` → `RemoveTrackFromStats`. With it, all three are skipped. The sender enters `DeleteChannel` in a partially-cleaned state. `DeleteChannel` (`pc/rtp_transceiver.cc:385`) then destroys the channel on the worker thread: ```cpp void RtpTransceiver::DeleteChannel() { context()->worker_thread()->BlockingCall([&]() { auto channel_to_delete = std::move(channel_); for (auto& sender : senders_) sender->internal()->SetMediaChannel(nullptr); for (auto& receiver : receivers_) receiver->internal()->SetMediaChannel(nullptr); channel_to_delete.reset(); // channel destroyed → [AVAudioEngine stop] media_engine_ref_.reset(); }); } ``` The channel destructor encountering the modified state leads to crash 1 (freed Port/socket resources accessed during ICE transport teardown) and crash 2 (inconsistent audio engine state → assertion in buffer deallocation during `[AVAudioEngine stop]`). Both `removeTrack` and `close()` go through `PeerConnectionProxy` (`pc/proxy.h`), which marshals them synchronously to the signaling thread — no threading race between them. The issue is the modified internal state, not concurrency. --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> (cherry picked from commit 00d3ca9)
SymbolicatedCrashes.zip
Calling
removeTrackfor all senders beforeclose()puts WebRTC into apartially-cleaned state that causes
Close()to follow a different teardownpath, crashing on the worker thread.
RemoveTrackOrError(pc/peer_connection.cc:975) nulls the sender track andchanges transceiver direction:
SetTrack(nullptr)triggersClearSend()(pc/rtp_sender.cc:494) becauseprev_can_send_trackwas true:Later,
Close()(pc/peer_connection.cc:1876) runsStopTransceiverProcedure→
sender->Stop()(pc/rtp_sender.cc:562):Without
removeTrack,Stop()runs the full path:DetachTrack→ClearSend→RemoveTrackFromStats. With it, all three are skipped. Thesender enters
DeleteChannelin a partially-cleaned state.DeleteChannel(pc/rtp_transceiver.cc:385) then destroys the channel on theworker thread:
The channel destructor encountering the modified state leads to crash 1 (freed
Port/socket resources accessed during ICE transport teardown) and crash 2
(inconsistent audio engine state → assertion in buffer deallocation during
[AVAudioEngine stop]).Both
removeTrackandclose()go throughPeerConnectionProxy(
pc/proxy.h), which marshals them synchronously to the signaling thread — nothreading race between them. The issue is the modified internal state, not
concurrency.