Skip to content

Segment Load/Drop Race Causes Silent Partial Query Results #18738

@jtuglu1

Description

@jtuglu1

Affected Version

All recent Druid versions.

Description

Using HTTP-based load/drop segment communication between data nodes and brokers:
T0: Coordinator moves segment S from historical A to historical B.
T1: Historical B loads segment S
T2: Coordinator gets loadSegment callback from B
T3: Coordinator tells Historical A to drop segment S
T4: Broker gets drop callback from Historical A, removes the server from the timeline (now 0 nodes serving)
T5: Query sees partial results/no servers for that segment
T6: Broker gets load callback from Historical B

This out-of-order delivery causes a momentary lapse in the Broker's timeline, yielding partial query results to the user without them knowing. In large deployments, this occurs quite frequently – nearly every hour.

Proposal: Brokers should rely on Coordinators for timeline add/drop callbacks, instead of talking to data nodes directly. While placing more responsibility on the coordinator, this ensures happens-before ordering is respected and there's a single, linearizable chain of events seen by each broker.

#18716 will fail these queries instead of letting them succeed, but the queries will still unnecessarily fail which is not ideal.

Metadata

Metadata

Assignees

No one assigned

    Type

    Projects

    No projects

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions