
fix: preserve push notification event order per task (closes #775) #776

Open

perhaps468 wants to merge 1 commit into a2aproject:main from perhaps468:fix/push-notification-order-preserving-775
Conversation

@perhaps468

Summary

Push notifications for streaming events (Task, TaskStatusUpdateEvent, TaskArtifactUpdateEvent) could arrive out of order on the client side because the previous implementation used CompletableFuture.runAsync(), which submits work to the shared ForkJoinPool.commonPool(), where events from concurrent tasks compete for the same threads and can be interleaved unpredictably.

This fix replaces the shared-pool approach with per-task bounded queues, each with a dedicated consumer thread, guaranteeing FIFO ordering per task and complete isolation between tasks.


Problem

In the original implementation (MainEventBusProcessor.sendPushNotification()):

CompletableFuture.runAsync(pushTask); // runs on ForkJoinPool.commonPool() — shared, unordered

ForkJoinPool.commonPool() has a default parallelism of Runtime.getRuntime().availableProcessors() - 1 (typically 7–15). When multiple concurrent tasks each produce multiple events, those events race for pool threads. The consumer can receive task A's Event 2 before Event 1, purely as a scheduling artifact.
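The race is easy to reproduce in isolation. The sketch below is a hypothetical standalone demo, not the A2A code: two "events" for the same task are submitted to a shared pool via runAsync(), and nothing serializes them, so delivery order depends on scheduling. Event 1 is artificially delayed to force the reordering deterministically; a small fixed pool stands in for the common pool.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical demo: a shared pool gives no per-task ordering guarantee.
public class SharedPoolReorderDemo {
    public static void main(String[] args) {
        List<Integer> delivered = new CopyOnWriteArrayList<>();
        ExecutorService sharedPool = Executors.newFixedThreadPool(2); // stand-in for commonPool()

        // Submitted first, but slow (simulates serialization / a network hop).
        CompletableFuture<Void> event1 = CompletableFuture.runAsync(() -> {
            sleep(200);
            delivered.add(1);
        }, sharedPool);

        // Submitted second, completes immediately on the other pool thread.
        CompletableFuture<Void> event2 =
                CompletableFuture.runAsync(() -> delivered.add(2), sharedPool);

        CompletableFuture.allOf(event1, event2).join();
        sharedPool.shutdown();
        System.out.println(delivered); // [2, 1] — submission order was 1, 2
    }

    private static void sleep(long ms) {
        try {
            Thread.sleep(ms);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

With a single consumer thread per task (the fix below), the same two submissions would always be delivered as 1 then 2.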


Options Considered

Analysis of the Original Approach

The original implementation using CompletableFuture.runAsync() on ForkJoinPool.commonPool() had clear advantages worth acknowledging:

Original pros:

  • Zero memory overhead — no per-task queues or threads
  • Zero thread management overhead — fully managed by the JVM
  • Trivial to implement — one line of code
  • Excellent throughput under low concurrency — no context-switching cost

Original cons:

  • Ordering not guaranteed across concurrent tasks — events from different tasks can interleave in unpredictable ways due to thread scheduling
  • Non-deterministic delivery — the same code path can produce different event sequences under load
  • No backpressure — events pile up in the shared pool's queue with no per-task isolation

Is this added complexity necessary? For a protocol like A2A where clients consume streams of events (text chunks arriving in order, artifacts building up incrementally), out-of-order delivery is a correctness bug, not just a performance issue. A client expecting working → partial-completed → completed could instead observe partial-completed → working → completed, breaking any state-machine logic in the client. This is particularly critical for:

  • Streaming responses (text parts must arrive in order for meaningful output)
  • Multi-artifact tasks (each artifact appends sequentially)
  • Client-side state machines that transition based on event type

Given the correctness implications, the per-task isolation is worth the added complexity.

Option: Single-threaded executor per task (chosen)

Each task gets its own LinkedBlockingQueue + dedicated daemon thread.

Pros:

  • Strict FIFO per task — no interleaving possible
  • Complete isolation between tasks — one slow task never blocks another
  • Simple, bounded memory (max 50 events per task, then oldest dropped)
  • Lazy cleanup — threads expire after 60 seconds of inactivity

Cons:

  • One daemon thread per active task (acceptable: daemon threads don't block JVM exit; 60s expiry limits growth)
  • Thread-per-task model is not ideal for extreme scale (thousands of concurrent tasks), but realistic workloads rarely reach that

Solution

Each task gets a TaskPushNotifier (inner class of MainEventBusProcessor) containing:

  • LinkedBlockingQueue<StreamingEventKind> with capacity 50
  • Dedicated daemon consumer thread running a consumeLoop()
  • ConcurrentHashMap<String, TaskPushNotifier> notifiers for per-task lookups
  • 60-second inactivity expiry: expired notifiers are lazily replaced on next access

producer (MainEventBusProcessor thread)
  │
  ▼ offer()
LinkedBlockingQueue(50) ──► consumeLoop()
  │ (drop oldest if full)     │ poll(10ms)
  ▼                           ▼
                          pushSender.sendNotification()

Overflow behavior: If the queue is full when a new event arrives, the oldest pending event is silently dropped to make room. This is a deliberate design choice — a queue that blocks when full would stall the producer, which is the main event-processing thread.
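The pieces above can be sketched as a self-contained class. The class name, queue capacity, and 10 ms poll interval follow the PR description; the generic sender hook and all other details are assumptions for illustration, not the actual MainEventBusProcessor code.

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Sketch of a per-task notifier: bounded FIFO queue + one daemon consumer.
class TaskPushNotifier<E> {
    private final BlockingQueue<E> queue = new LinkedBlockingQueue<>(50);
    private final Thread consumer;
    private volatile boolean shutdown;

    TaskPushNotifier(String taskId, Consumer<E> sender) {
        consumer = new Thread(() -> consumeLoop(sender), "PushNotifier-" + taskId);
        consumer.setDaemon(true); // daemon: never blocks JVM exit
        consumer.start();
    }

    void offer(E event) {
        if (shutdown) return;
        while (!queue.offer(event)) {
            queue.poll(); // queue full: drop the oldest pending event
        }
    }

    private void consumeLoop(Consumer<E> sender) {
        while (!shutdown) {
            try {
                E event = queue.poll(10, TimeUnit.MILLISECONDS);
                if (event != null) sender.accept(event); // strict FIFO per task
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }

    void shutdown() {
        shutdown = true;
        consumer.interrupt();
    }
}

public class NotifierFifoDemo {
    public static void main(String[] args) throws InterruptedException {
        List<Integer> received = new CopyOnWriteArrayList<>();
        TaskPushNotifier<Integer> notifier = new TaskPushNotifier<>("task-a", received::add);

        for (int i = 0; i < 10; i++) notifier.offer(i); // well under capacity
        Thread.sleep(500);                              // let the consumer drain
        notifier.shutdown();

        System.out.println(received); // FIFO: 0 through 9 in submission order
    }
}
```

Because each task has exactly one producer path and one consumer thread over a FIFO queue, per-task ordering holds by construction, regardless of how many other tasks are active.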


Test Results

New tests (MainEventBusProcessorPushNotificationOrderTest)

Test                           What it verifies
testSameTaskEventsOrdered      Events for the same task arrive in strict FIFO order
testDifferentTasksUnaffected   Events for different tasks are isolated and don't block each other
testQueueOverflowDropsOldest   Queue holds at most 50 events; oldest are dropped on overflow

Regression tests (MainEventBusProcessorExceptionTest)

All 3 existing exception handling tests continue to pass.


Files Changed

  • server-common/src/main/java/io/a2a/server/events/MainEventBusProcessor.java

    • Added ConcurrentHashMap<String, TaskPushNotifier> notifiers
    • Added TaskPushNotifier inner class (bounded queue + consumer thread + expiry)
    • Modified sendPushNotification() to route via per-task notifier
    • Modified @PreDestroy stop() to clean up all notifier threads
  • server-common/src/test/java/io/a2a/server/events/MainEventBusProcessorPushNotificationOrderTest.java (new)

    • 3 tests covering FIFO ordering, task isolation, and overflow behavior

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request replaces the global asynchronous push notification system with a per-task bounded queue and dedicated consumer threads to ensure FIFO ordering and isolation. The review feedback highlights a potential blocking issue in the main event loop during notifier cleanup, a race condition in the queue overflow logic when accessed by multiple threads, and the need for a more robust test case to properly verify the "drop oldest" behavior.

Comment on lines +372 to +379
    notifiers.compute(taskId, (id, existing) -> {
        if (existing != null && !existing.isExpired()) {
            existing.touch();
            return existing;
        }
        if (existing != null) existing.shutdown();
        return new TaskPushNotifier(taskId);
    }).offer(event);
Contributor


high

The shutdown() method on an expired TaskPushNotifier is called synchronously within the notifiers.compute() block. This block is executed by the single MainEventBusProcessor thread, which is a critical component for processing all events. The shutdown() method contains a consumerThread.join(2000), which can block for up to 2 seconds if the consumer thread is stuck (e.g., in a non-interruptible network call within pushSender). This could severely degrade or stall the main event processing loop.

Consider making the shutdown of expired notifiers asynchronous to avoid blocking the MainEventBusProcessor thread. You could use a dedicated executor or simply a new thread for this cleanup task.

        notifiers.compute(taskId, (id, existing) -> {
            if (existing != null && !existing.isExpired()) {
                existing.touch();
                return existing;
            }
            if (existing != null) {
                new Thread(existing::shutdown, "Notifier-Shutdown-" + existing.taskId).start();
            }
            return new TaskPushNotifier(taskId);
        }).offer(event);

Comment on lines +406 to +413
    void offer(StreamingEventKind event) {
        if (shutdown) return;
        touch();
        while (!queue.offer(event)) {
            // Queue is full - drop oldest to make room for newest
            queue.poll();
        }
    }
Contributor


medium

The offer method can be called by multiple producer threads concurrently for the same task. The current implementation of the while loop to handle a full queue is not atomic. This can lead to a race condition where more than one "oldest" event is dropped to make room for new events if multiple threads attempt to offer to a full queue at the same time. To ensure that only one event is dropped per new event added, you should synchronize this method.

Suggested change:

Before:

    void offer(StreamingEventKind event) {
        if (shutdown) return;
        touch();
        while (!queue.offer(event)) {
            // Queue is full - drop oldest to make room for newest
            queue.poll();
        }
    }

After:

    synchronized void offer(StreamingEventKind event) {
        if (shutdown) return;
        touch();
        while (!queue.offer(event)) {
            // Queue is full - drop oldest to make room for newest
            queue.poll();
        }
    }
References
  1. Using AtomicBoolean or synchronization is a general recommendation for managing shared mutable state across multiple threads to ensure atomicity.

Comment on lines +142 to +172
    @Test
    public void testQueueOverflowDropsOldest() throws Exception {
        // Test that the per-task queue has a bounded capacity.
        String taskId = "overflow-task";
        int capacity = 50;

        AtomicInteger sentCount = new AtomicInteger(0);

        PushNotificationSender countingSender = event -> {
            sentCount.incrementAndGet();
            pushSender.sendNotification(event);
        };

        MainEventBusProcessor processor = new MainEventBusProcessor(
                mainEventBus, mock(TaskStore.class), countingSender, queueManager);

        // Submit exactly the queue capacity worth of events
        for (int i = 0; i < capacity; i++) {
            Task event = createTask(taskId, i);
            processor.sendPushNotification(taskId, event);
        }

        // Wait for all to be consumed
        Thread.sleep(1000);

        // All capacity events should be sent (no overflow)
        assertEquals(capacity, sentCount.get(),
                "All " + capacity + " events should be sent when queue is exactly full");

        processor.stop();
    }
Contributor


medium

The test testQueueOverflowDropsOldest doesn't seem to test the overflow condition as its name implies. It submits exactly capacity items and asserts that all are processed. This confirms the queue can hold capacity items, but it doesn't verify the "drop oldest" behavior on overflow.

A more robust test would:

  1. Submit capacity + N events.
  2. Ensure the consumer has a chance to run but doesn't empty the queue completely (or is blocked).
  3. Verify that only the latest capacity events are in the queue (or were processed), and that the first N events were dropped.

For example, you could submit capacity + 1 events and check that the first event was dropped and the last one was processed.
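The suggested strategy can be demonstrated in isolation. The sketch below is hypothetical and does not use the PR's classes: with no consumer draining the queue, it applies the same drop-oldest loop to a bounded queue, submits capacity + 1 events, and verifies the first event was dropped while the latest `capacity` events survive in order. Capacity is shrunk to 5 to keep the example readable; the real queue holds 50.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch of the stricter overflow test: force an overflow,
// then inspect which events remain.
public class DropOldestTestSketch {
    public static void main(String[] args) {
        int capacity = 5;
        BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(capacity);

        // Submit capacity + 1 events with the drop-oldest policy under test.
        for (int event = 0; event <= capacity; event++) {
            while (!queue.offer(event)) {
                queue.poll(); // full: drop the oldest to make room
            }
        }

        List<Integer> remaining = new ArrayList<>();
        queue.drainTo(remaining);

        // Event 0 was dropped; events 1..capacity survive in FIFO order.
        System.out.println(remaining); // [1, 2, 3, 4, 5]
    }
}
```

In the real test this would mean blocking the consumer (or stubbing pushSender) before submitting, then asserting on which events were actually delivered rather than on a raw count.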

References
  1. While stricter coding practices can be relaxed for test code, the test logic must still accurately verify the intended behavior and edge cases.
