How are you running Apache Doris stably in production under multi-replica write load? #64743
Unanswered
ojalberts-itc
asked this question in
A - General / Q&A
Replies: 0 comments
Hi all — I'm after operational guidance from people running Doris in production, in the context of
a stability problem we've hit and filed as
apache/doris#64708. I'm not asking the
community to debug the bug here (the issue has the full evidence) — I want to learn how you operate
Doris so this class of problem doesn't bite you.
Our setup. A lakehouse: we read Apache Iceberg through a Glue/S3 external catalog, aggregate,
and write the result into native Doris UNIQUE-KEY merge-on-write tables. We run coupled
(storage-compute together) mode — FE + BE only, no Meta Service / FoundationDB / Recycler /
S3 storage vault; native tablet data lives on BE-local EBS. Current cluster on AWS:
- `r8i.2xlarge` (8 vCPU / 64 GiB RAM) for every node, region `af-south-1`.
- `gp3` EBS volume per BE (3000 IOPS / 125 MiB/s baseline), XFS (`noatime,nodiratime`), mounted at the Doris `storage_root_path`.
- `be.conf` / `fe.conf`: the only non-default overrides are `mem_limit = 80%`, `storage_root_path`, and `priority_networks`.

What we're hitting. Under sustained multi-replica (repl=3) write load, the BE write path
wedges: a BE-to-BE brpc load-stream socket on `:8060` goes "Broken" and is never revived; writes
hang then fail with `failed to write enough replicas N/3 ... connection errors`; and only a
simultaneous full-BE-fleet restart recovers, while every node still reports
`Alive=true` (the heartbeat runs on a separate threadpool from the write path). We've reproduced it on 4.0.6
and 4.1.2, on Amazon Linux 2023 and Ubuntu 22.04, with the environment-layer causes
(network / firewall / SELinux-AppArmor / conntrack / Nitro-ENA) excluded by direct test. Details,
in-process captures, and the exact signature are in the issue:
apache/doris#64708.
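For anyone who wants to check their own logs or client errors for the same symptom, here is a minimal sketch in Python of how we match it. It assumes the `N/3` part of the message appears as two plain integers separated by a slash; the function name and the sample strings in the usage note are hypothetical, and the full signature is in the issue rather than here.

```python
import re
from typing import Optional, Tuple

# Only this fragment comes from the error we see; the rest of the line varies.
REPLICA_ERROR = re.compile(r"failed to write enough replicas (\d+)/(\d+)")


def parse_replica_failure(message: str) -> Optional[Tuple[int, int]]:
    """Return (written, required) if the message carries the replica error.

    Returns None when the fragment is absent, so callers can separate
    this failure from unrelated load errors.
    """
    match = REPLICA_ERROR.search(message)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))
```

Calling `parse_replica_failure("load failed: failed to write enough replicas 1/3")` gives `(1, 3)`, which is enough to count occurrences per job or to alert on the first one.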
What we already found searching here. #52461
notes that in storage-compute decoupled mode you don't set a replica number at all — high
availability comes from S3/HDFS rather than Doris multi-copy — which would mean decoupled removes
the BE-to-BE multi-replica load-stream path that wedges for us. And the broken-`:8060`-socket
behaviour isn't new: #15007 (`failed to send brpc batch, error=Host is down`, sockets stuck
"Not connected" on `:8060`) and #43008 (brpc `:8060` `Resource temporarily unavailable`, reported to
persist across a version upgrade) hit the same socket layer on the query/exchange path. So our
questions are really: how do you run coupled mode safely, or is decoupled the answer?
- [...] S3/HDFS), is storage-compute decoupled the recommended mode for write-heavy workloads, and does it actually avoid this multi-replica load-stream wedge in practice? What's the real operational cost of the extra moving parts (Meta Service + FoundationDB + Recycler)? Is it production-ready, and worth it to escape this?
- [...] `INSERT ... SELECT` / `CREATE TABLE AS SELECT` at repl=3 and prefer Stream Load / Routine Load / Broker Load / S3 load instead? Did the ingestion method change your stability?
- [...] durability? Anyone running `experimental_enable_single_replica_insert=true` long-term: does it hold up? (In decoupled this is moot per "Replication Factor in Compute Storage Decoupled Mode", #52461; this is for coupled-mode operators.)
- [...] 4.1.x)? Did moving versions resolve load-stream / brpc stability for you?
- How do you monitor write-path health, given `SHOW BACKENDS ... Alive=true` stays green even when the write path is dead?
- [...] `enable_brpc_connection_check`, `tablet_writer_open_rpc_timeout_sec`, brpc thread/socket knobs) that materially improved stability?
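To make the monitoring question concrete: since the heartbeat stays green, the kind of check we have in mind is a canary write with a hard deadline. Below is a minimal sketch in Python, not something we claim is the right approach. `write_fn` is a hypothetical caller-supplied function that performs one small write (for example, an insert into a dedicated repl=3 canary table over the MySQL protocol) and raises on failure; `ProbeResult` and `probe_write_path` are our own names, not Doris APIs.

```python
import threading
import time
from typing import Callable, NamedTuple


class ProbeResult(NamedTuple):
    status: str     # "ok", "failed", or "hung"
    seconds: float  # wall-clock time spent waiting for the write
    detail: str     # error text; empty when status is "ok"


def probe_write_path(write_fn: Callable[[], None], timeout_s: float) -> ProbeResult:
    """Run one canary write and classify the outcome.

    A write that neither returns nor raises within timeout_s is reported
    as "hung", which is the case a heartbeat check cannot see.
    """
    outcome = {"error": "worker exited without reporting"}

    def run() -> None:
        try:
            write_fn()
            outcome["error"] = None
        except Exception as exc:  # any client or server error counts as failed
            outcome["error"] = str(exc)

    start = time.monotonic()
    # Daemon thread: a wedged write must not keep the monitor process alive.
    worker = threading.Thread(target=run, daemon=True)
    worker.start()
    worker.join(timeout_s)
    elapsed = time.monotonic() - start

    if worker.is_alive():
        return ProbeResult("hung", elapsed, f"no result within {timeout_s}s")
    if outcome["error"] is not None:
        return ProbeResult("failed", elapsed, outcome["error"])
    return ProbeResult("ok", elapsed, "")
```

Run on a schedule, a probe like this would alert on either "hung" or "failed". We would like to hear whether people do something similar, or rely on BE metrics instead.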
Happy to share our full Terraform topology, config, and findings. Thanks in advance — we're trying
to land on a stable operating model for Doris and would value hearing how others actually run it.