How are you running Apache Doris stably in production under multi-replica write load? #64743

ojalberts-itc · 2026-06-23T10:23:38Z

Hi all. I'm after operational guidance from people running Doris in production, in the context of
a stability problem we've hit and filed as apache/doris#64708. I'm not asking the community to
debug the bug here (the issue has the full evidence); I want to learn how you operate Doris so
this class of problem doesn't bite you.

Our setup. A lakehouse: we read Apache Iceberg through a Glue/S3 external catalog, aggregate,
and write the result into native Doris UNIQUE-KEY merge-on-write tables. We run coupled
(storage-compute together) mode — FE + BE only, no Meta Service / FoundationDB / Recycler /
S3 storage vault; native tablet data lives on BE-local EBS. Current cluster on AWS:

- Mode: coupled (storage-compute together).
- Topology: 3 FE (HA followers) + 4 BE.
- Instances: AWS r8i.2xlarge (8 vCPU / 64 GiB RAM) for every node, region af-south-1.
- BE storage: one dedicated 500 GB gp3 EBS volume per BE (3000 IOPS / 125 MiB/s baseline),
  XFS (noatime,nodiratime), mounted at the Doris storage_root_path.
- Replication: Doris default 3, across the 4 BEs.
- OS / JDK: Amazon Linux 2023 + Amazon Corretto 17 (also reproduced on Ubuntu 22.04 LTS).
- Doris versions tested: 4.0.6 GA and 4.1.2.
- Config: effectively stock be.conf/fe.conf. The only non-default overrides are
  mem_limit = 80%, storage_root_path, and priority_networks.
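For readers sizing a similar cluster, here is a back-of-envelope sketch of what the topology above implies. It uses only the figures listed (4 BEs, one 500 GB gp3 volume each at 125 MiB/s baseline, replication 3). The model is a deliberate simplification and an assumption of this sketch: every logical byte is written once per replica, and compression, compaction write amplification, and filesystem overhead are ignored. The function names are illustrative and are not part of Doris.

```python
# Back-of-envelope figures derived from the topology listed above.
# Simplifying assumption: each logical byte is stored and written once per
# replica; compression, compaction, and filesystem overhead are ignored.

BE_COUNT = 4            # backends in the cluster
VOLUME_GB = 500         # one gp3 volume per BE
BASELINE_MIBPS = 125    # gp3 baseline throughput per volume
REPLICAS = 3            # replication used for the native tables


def logical_capacity_gb(be_count: int, volume_gb: float, replicas: int) -> float:
    """Raw fleet capacity divided by the number of copies of each tablet."""
    return be_count * volume_gb / replicas


def logical_ingest_ceiling_mibps(be_count: int, per_volume_mibps: float, replicas: int) -> float:
    """Aggregate baseline disk throughput divided by the write fan-out."""
    return be_count * per_volume_mibps / replicas


if __name__ == "__main__":
    cap = logical_capacity_gb(BE_COUNT, VOLUME_GB, REPLICAS)
    rate = logical_ingest_ceiling_mibps(BE_COUNT, BASELINE_MIBPS, REPLICAS)
    print(f"logical capacity: {cap:.0f} GB")
    print(f"logical ingest ceiling: {rate:.0f} MiB/s")
```

With the numbers above this works out to roughly 667 GB of logical capacity and about 167 MiB/s of sustained logical ingest at the gp3 baseline. These are upper bounds under the stated assumptions, not measurements from our cluster.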

What we're hitting. Under sustained multi-replica (repl=3) write load, the BE write path
wedges: a BE-to-BE brpc load-stream socket on :8060 goes "Broken" and is never revived; writes
hang, then fail with "failed to write enough replicas N/3 ..." connection errors; and only a
simultaneous restart of the full BE fleet recovers the cluster. Throughout, every node still
reports Alive=true (the heartbeat runs on a separate threadpool from the write path). We've
reproduced it on 4.0.6 and 4.1.2, on Amazon Linux 2023 and Ubuntu 22.04, with the
environment-layer causes (network / firewall / SELinux-AppArmor / conntrack / Nitro-ENA) excluded
by direct test. Details, in-process captures, and the exact signature are in the issue:
apache/doris#64708.

What we already found searching here. #52461 notes that in storage-compute decoupled mode you
don't set a replica number at all (high availability comes from S3/HDFS rather than Doris
multi-copy), which would mean decoupled removes the BE-to-BE multi-replica load-stream path that
wedges for us. And the broken-:8060-socket behaviour isn't new: #15007 ("failed to send brpc
batch, error=Host is down", sockets stuck "Not connected" on :8060) and #43008 (brpc :8060
"Resource temporarily unavailable", reported to persist across a version upgrade) hit the same
socket layer on the query/exchange path. So our questions are really: how do you run coupled
mode safely, or is decoupled the answer?

1. Decoupled mode as the fix? Given #52461, "Replication Factor in Compute Storage Decoupled
   Mode" (replica count is moot in decoupled mode; HA comes from S3/HDFS), is storage-compute
   decoupled the recommended mode for write-heavy workloads, and does it actually avoid this
   multi-replica load-stream wedge in practice? What's the real operational cost of the extra
   moving parts (Meta Service + FoundationDB + Recycler)? Is it production-ready, and worth it
   to escape this?
2. Ingestion method: do you avoid large concurrent INSERT ... SELECT / CREATE TABLE AS SELECT
   at repl=3 and prefer Stream Load / Routine Load / Broker Load / S3 load instead? Did the
   ingestion method change your stability?
3. Replication (coupled mode): do you run repl=3 in production, or repl=1 + storage
   durability? Is anyone running experimental_enable_single_replica_insert=true long-term, and
   does it hold up? (In decoupled mode this is moot per #52461; this question is for
   coupled-mode operators.)
4. Version: which line do you consider production-stable today (2.1 LTS, 3.x, 4.0.x, 4.1.x)?
   Did moving versions resolve load-stream / brpc stability for you?
5. Operations: do your BEs stay up for months, or do you restart on a schedule? And how do you
   monitor write-path health, given SHOW BACKENDS ... Alive=true stays green even when the
   write path is dead?
6. Tuning: any brpc / load-stream / timeout settings (e.g. enable_brpc_connection_check,
   tablet_writer_open_rpc_timeout_sec, brpc thread/socket knobs) that materially improved
   stability?
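To make the Operations question concrete, here is a minimal sketch of the kind of write-path probe we have in mind: run one small canary write under a hard deadline and treat either a hang or an error as unhealthy, independent of what Alive=true says. Everything here is hypothetical scaffolding rather than a Doris API. `write_once` stands in for whatever performs a single tiny write (for example, an INSERT into a dedicated repl=3 canary table over the MySQL protocol), and `ProbeResult` and `probe_write_path` are names made up for this sketch.

```python
import threading
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ProbeResult:
    ok: bool
    elapsed_s: float
    detail: str


def probe_write_path(write_once: Callable[[], None], timeout_s: float = 10.0) -> ProbeResult:
    """Run one canary write; a hang past timeout_s or an exception means unhealthy."""
    error: list[Optional[BaseException]] = [None]

    def run() -> None:
        try:
            write_once()  # hypothetical: one small write to a canary table
        except BaseException as exc:  # record any failure for the caller
            error[0] = exc

    start = time.monotonic()
    # Daemon thread: a write that never returns must not block process exit.
    worker = threading.Thread(target=run, daemon=True)
    worker.start()
    worker.join(timeout_s)
    elapsed = time.monotonic() - start

    if worker.is_alive():
        return ProbeResult(False, elapsed, f"no commit within {timeout_s}s")
    if error[0] is not None:
        return ProbeResult(False, elapsed, f"write failed: {error[0]}")
    return ProbeResult(True, elapsed, "write committed")


if __name__ == "__main__":
    # Stand-in writers so the sketch runs without a cluster.
    def healthy() -> None:
        pass

    def wedged() -> None:
        time.sleep(0.5)  # simulates a write that hangs past the deadline

    print(probe_write_path(healthy, timeout_s=1.0))
    print(probe_write_path(wedged, timeout_s=0.05))
```

A probe like this only reports the symptom (writes hang or fail), not which BE or socket is at fault, and a real client should also set its own query timeout so abandoned canary writes do not pile up. We'd be interested to hear whether people alert on something like this or on BE-side metrics instead.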

Happy to share our full Terraform topology, config, and findings. Thanks in advance — we're trying
to land on a stable operating model for Doris and would value hearing how others actually run it.

Replies: 0 comments
