Conversation
Signed-off-by: Ronak Jain <ronakjainc@gmail.com>
|
CC @envoyproxy/api-shepherds: Your approval is needed for changes made to |
Signed-off-by: Ronak Jain <ronakjainc@gmail.com>
|
/retest |
| HostVectorSharedPtr all_hosts(new HostVector(first_host_set.hosts())); | ||
| all_hosts->emplace_back(host); | ||
| all_hosts->insert(all_hosts->end(), hosts.begin(), hosts.end()); | ||
| priority_set_.updateHosts(0, |
There was a problem hiding this comment.
Do you happen to have flame graph for the CPU on the main thread? I suspect this block is where the wasted cycles are, which batching would surely help.
| cleanup_timer_(dispatcher_.createTimer([this]() -> void { cleanup(); })), | ||
| host_map_(std::make_shared<HostMultiMap>()) { | ||
| host_map_(std::make_shared<HostMultiMap>()), | ||
| batch_update_timer_(dispatcher_.createTimer([this]() -> void { batchUpdate(); })) { |
There was a problem hiding this comment.
should the timer as well be initialized only if lb_config.has_batch_update_interval() is enabled?
| // If batch update is enabled, add the host to the pending list | ||
| // and enable the timer if it is not already enabled. | ||
| pending_add_hosts_list_.push_back(host); | ||
| if (!batch_update_timer_->enabled()) { |
There was a problem hiding this comment.
instead of doing this check here every time host is added, you could move the batch_update_timer_->enableTimer(batch_update_interval_) to the end of OriginalDstCluster::batchUpdate method. That would change the semantics slightly of how batch_update_interval works.
| ~OriginalDstCluster() override { | ||
| ASSERT_IS_MAIN_OR_TEST_THREAD(); | ||
| cleanup_timer_->disableTimer(); | ||
| batch_update_timer_->disableTimer(); |
There was a problem hiding this comment.
for proper timer cleanup you would also need to reset the timer pointer so its callback gets deallocated with the timer object
|
/wait |
Commit Message: Add batched host updates in original_dst cluster
Additional Description: Adds batching support for host updates in original_dst clusters to address performance bottlenecks on the main thread. With high fan-out clients with large numbers of active connections (e.g. 60k) and high unique host-port connection rates (e.g. 3k/s), the current per-connection addHost callbacks create significant pressure on the main thread. Each callback triggers a host map copy and priority set update, which is CPU heavy and impacts critical operations like metrics collection + health checks on the main thread.
The new batch_update_interval configuration option allows consolidating these updates into periodic batches, significantly reducing main thread overhead. This has been production-tested for 6+ months with positive results.
Reopens #38769
Risk Level: Low
Testing: Unit tests
Docs Changes: proto docs
Release Notes: None
Platform Specific Features: None