
original_dst: batch host updates#43782

Open
jronak wants to merge 2 commits into envoyproxy:main from jronak:jronak/original-dst-batch

Conversation

@jronak
Contributor

@jronak jronak commented Mar 5, 2026

Commit Message: Add batched host updates in original_dst cluster

Additional Description: Adds batching support for host updates in original_dst clusters to address a performance bottleneck on the main thread. For high fan-out clients with large numbers of active connections (e.g. 60k) and high rates of unique host-port connections (e.g. 3k/s), the current per-connection addHost callbacks put significant pressure on the main thread. Each callback triggers a host map copy and a priority set update, which is CPU-heavy and impacts critical main-thread operations such as metrics collection and health checks.

The new batch_update_interval configuration option allows consolidating these updates into periodic batches, significantly reducing main thread overhead. This has been production-tested for 6+ months with positive results.
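The batching pattern described above can be sketched as follows. This is a simplified illustration of the technique, not the PR's actual code: `FakeTimer`, `BatchingCluster`, and the member names are stand-ins for Envoy's `Event::Timer`, the original_dst cluster, and its richer host/priority-set types.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Minimal stand-in for a dispatcher timer; Envoy's Event::Timer is richer.
struct FakeTimer {
  std::function<void()> cb_;
  bool enabled_ = false;
  void enableTimer() { enabled_ = true; }
  bool enabled() const { return enabled_; }
  void fire() {
    enabled_ = false;
    cb_();
  }
};

class BatchingCluster {
public:
  BatchingCluster() {
    batch_update_timer_.cb_ = [this]() { batchUpdate(); };
  }

  // Per-connection path: queue the host and arm the timer once, instead of
  // copying the full host vector and updating the priority set on every add.
  void addHost(const std::string& host) {
    pending_add_hosts_.push_back(host);
    if (!batch_update_timer_.enabled()) {
      batch_update_timer_.enableTimer(); // fires after batch_update_interval
    }
  }

  // One host-vector copy and one update per interval, regardless of how many
  // hosts arrived in that window.
  void batchUpdate() {
    hosts_.insert(hosts_.end(), pending_add_hosts_.begin(),
                  pending_add_hosts_.end());
    pending_add_hosts_.clear();
    ++update_count_; // stands in for priority_set_.updateHosts(...)
  }

  FakeTimer batch_update_timer_;
  std::vector<std::string> pending_add_hosts_;
  std::vector<std::string> hosts_;
  int update_count_ = 0;
};
```

With this shape, 3k new hosts per second and a 1s interval cost one update per second instead of 3k, which is where the main-thread savings come from.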

Reopens #38769

Risk Level: Low
Testing: Unit tests
Docs Changes: proto docs
Release Notes: None
Platform Specific Features: None

Signed-off-by: Ronak Jain <ronakjainc@gmail.com>
@repokitteh-read-only

CC @envoyproxy/api-shepherds: Your approval is needed for changes made to (api/envoy/|docs/root/api-docs/).
envoyproxy/api-shepherds assignee is @markdroth
CC @envoyproxy/api-watchers: FYI only for changes made to (api/envoy/|docs/root/api-docs/).

🐱

Caused by: #43782 was opened by jronak.

see: more, trace.

Signed-off-by: Ronak Jain <ronakjainc@gmail.com>
@jronak
Contributor Author

jronak commented Mar 6, 2026

/retest

HostVectorSharedPtr all_hosts(new HostVector(first_host_set.hosts()));
all_hosts->emplace_back(host);
all_hosts->insert(all_hosts->end(), hosts.begin(), hosts.end());
priority_set_.updateHosts(0,
Contributor


Do you happen to have a flame graph for the CPU on the main thread? I suspect this block is where the wasted cycles are, and batching would surely help.

  cleanup_timer_(dispatcher_.createTimer([this]() -> void { cleanup(); })),
- host_map_(std::make_shared<HostMultiMap>()) {
+ host_map_(std::make_shared<HostMultiMap>()),
+ batch_update_timer_(dispatcher_.createTimer([this]() -> void { batchUpdate(); })) {
Member


Should the timer also be initialized only if lb_config.has_batch_update_interval() is set?

// If batch update is enabled, add the host to the pending list
// and enable the timer if it is not already enabled.
pending_add_hosts_list_.push_back(host);
if (!batch_update_timer_->enabled()) {
Member


Instead of doing this check every time a host is added, you could move the batch_update_timer_->enableTimer(batch_update_interval_) call to the end of the OriginalDstCluster::batchUpdate method. That would slightly change the semantics of how batch_update_interval works.
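The two arming strategies being discussed can be contrasted in a small sketch. The names here (`Timer`, `addHostLazyArm`, `batchUpdateFreeRunning`) are hypothetical, chosen only to illustrate the semantic difference; they are not the PR's actual members.

```cpp
#include <cassert>
#include <vector>

// Illustrative stand-in for Event::Timer.
struct Timer {
  bool enabled_ = false;
  void enableTimer() { enabled_ = true; }
  bool enabled() const { return enabled_; }
};

struct Cluster {
  Timer timer_;
  std::vector<int> pending_;

  // Strategy A (the PR as written): arm the timer lazily on the first
  // enqueue. An idle cluster has no periodic wakeups, at the cost of an
  // enabled() check on every addHost().
  void addHostLazyArm(int host) {
    pending_.push_back(host);
    if (!timer_.enabled()) {
      timer_.enableTimer();
    }
  }

  // Strategy B (the suggestion): addHost() is a bare push_back and
  // batchUpdate() unconditionally re-arms itself. This removes the per-add
  // branch, but the timer keeps firing even when the pending list is empty,
  // and a host added right after a flush waits up to a full interval.
  void addHostFreeRunning(int host) { pending_.push_back(host); }
  void batchUpdateFreeRunning() {
    pending_.clear();     // flush the batch
    timer_.enableTimer(); // re-arm for the next interval
  }
};
```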

~OriginalDstCluster() override {
ASSERT_IS_MAIN_OR_TEST_THREAD();
cleanup_timer_->disableTimer();
batch_update_timer_->disableTimer();
Member


For proper timer cleanup you would also need to reset the timer pointer so that its callback is deallocated along with the timer object.
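The point being made here can be demonstrated in isolation. This is a sketch, not Envoy's API: `Timer` below is a minimal stand-in for a dispatcher-owned timer, and `captured_refs_after` is a hypothetical helper showing that disabling a timer does not release what its callback closure captures.

```cpp
#include <cassert>
#include <functional>
#include <memory>

// Minimal stand-in for a dispatcher-owned timer; not Envoy's Event::Timer.
struct Timer {
  explicit Timer(std::function<void()> cb) : cb_(std::move(cb)) {}
  void disableTimer() { armed_ = false; }
  std::function<void()> cb_; // the closure lives as long as the Timer does
  bool armed_ = true;
};

// disableTimer() only stops future fires; whatever the callback captures
// (e.g. `this` in the real cluster) is released only when the Timer object
// itself is destroyed, e.g. by resetting the owning unique_ptr.
inline long captured_refs_after(bool reset_timer) {
  auto captured = std::make_shared<int>(42);
  auto timer = std::make_unique<Timer>([captured]() { (void)*captured; });
  timer->disableTimer(); // closure still holds `captured`
  if (reset_timer) {
    timer.reset(); // timer and its callback are deallocated together
  }
  return captured.use_count();
}
```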

@agrawroh
Member

/wait

