perf(daemon): per-root DashMap sharding, normalizers, and bounded connection pool by svarlamov · Pull Request #1060 · git-ai-project/git-ai

svarlamov · 2026-04-11T22:57:07Z

Summary

Two architectural improvements that make the trace2 ingest pipeline scale with concurrent workloads.

1. Shard `TraceIngressState` with DashMap

Replace the single Mutex<TraceIngressState> (15 HashMap/HashSet fields all keyed by root SID) with 15 independent dashmap::DashMap/DashSet fields on ActorDaemonCoordinator:

Per-root events no longer serialize on a global mutex — different roots run concurrently
The readonly fast path (the 40/sec Zed flood) is now lock-free: DashSet::contains replaces Mutex::lock + HashSet::contains
Fold mark_trace_root_activity (previously an extra Mutex cycle) into the existing lock acquisition for mutating events — saves one lock cycle per git commit
DashSet::insert returning bool provides atomic check-and-set for the close-fallback deduplication logic

2. Per-root normalizer instances

Replace AsyncMutex<TraceNormalizer> (one global instance) with DashMap<String, Arc<AsyncMutex<TraceNormalizer>>>:

Each root SID gets its own normalizer, scoped to its lifetime
Normalizer entries are cleaned up in clear_trace_root_tracking
Completed root SIDs tracked in a bounded DashSet (16,384 FIFO) to prevent orphaned normalizer creation from late events
Early-exit guard for empty root keys prevents orphaned normalizer entries
Architectural foundation for future parallel ingest workers per root

Correctness fixes

TOCTOU race in record_trace_connection_close: Use remove_if_mut to atomically decrement-and-remove within a single shard lock, with explicit v > 0 guard against double-close edge case
TOCTOU race on ingr_root_mutating / ingr_root_target_repo_only: Use DashMap entry API for atomic read-modify-write so concurrent threads cannot overwrite true with false
Normalizer leak guard: completed_root_sids DashSet prevents late events from creating orphaned normalizers after root tracking is cleared
Documented benign race: record_trace_connection_close can produce duplicate stale roots under concurrent close; downstream ingr_root_close_fallback_enqueued DashSet deduplicates

Test plan

cargo test --lib — all tests pass
cargo clippy --all-targets — clean
cargo fmt — applied
CI ubuntu checks — all green

🤖 Generated with Claude Code

devin-ai-integration

Devin Review found 1 potential issue.

View 5 additional findings in Devin Review.

devin-ai-integration · 2026-04-11T23:02:38Z

+        let root_key = payload_root_sid.clone().unwrap_or_default();
+        let normalizer_arc = self
+            .normalizers
+            .entry(root_key)
+            .or_insert_with(|| {
+                Arc::new(AsyncMutex::new(
+                    crate::daemon::trace_normalizer::TraceNormalizer::new(self.backend.clone()),
+                ))
+            })
+            .clone();


🔴 Per-root normalizer leak: late/duplicate terminal events create orphaned normalizers that are never cleaned up

When a root's terminal event is processed, clear_trace_root_tracking at src/daemon.rs:4423 removes the normalizer from the normalizers DashMap. If a subsequent event for that same root arrives (e.g., a connection-close fallback atexit generated by enqueue_stale_connection_close_fallbacks), apply_trace_payload_to_state at line 7008 calls self.normalizers.entry(root_key).or_insert_with(...), creating a brand-new normalizer with empty state. This fresh normalizer has no completed_roots entry for the root, so it doesn't early-return. For an atexit event, the normalizer creates a deferred_exits entry and returns Ok(None). Back in apply_trace_payload_to_state, the replace_pending_root_entry call finds no pending slot (already cleaned up), so clear_trace_root_tracking is never called again — the orphaned normalizer persists in the DashMap forever.

In the old single-normalizer design, completed_roots (retained up to 16,384 entries per COMPLETED_ROOT_RETENTION_LIMIT at src/daemon/trace_normalizer.rs:86) would catch these late events and discard them immediately. The per-root split loses this cross-lifecycle memory, causing an unbounded slow leak in long-running daemons.

Prompt for agents

The problem is that after clear_trace_root_tracking removes a normalizer for a completed root from the normalizers DashMap, any subsequent payload for that root (e.g. a connection-close fallback atexit) will lazily create a brand-new normalizer that has no knowledge the root already completed. This new normalizer is never cleaned up. In the old single-normalizer design, the completed_roots set (capped at 16,384 entries) served as a guard against processing late events for already-finished roots. Possible approaches: 1. Add a separate DashSet<String> (e.g. completed_root_sids) on ActorDaemonCoordinator that tracks recently completed roots (with a bounded size). In apply_trace_payload_to_state, check this set BEFORE creating/looking up a per-root normalizer. If the root is in the completed set, skip normalizer creation and return TracePayloadApplyOutcome::None. 2. Alternatively, in the terminal-event branch of apply_trace_payload_to_state (around line 7020-7036), after clear_trace_root_tracking is called, also check whether the normalizer for this root is still in the DashMap and remove it if no pending root slot exists. 3. A simpler approach: in the code at line 7005-7014, before or_insert_with, check whether this root has already been processed by consulting the ingress state (e.g. if no ingress maps contain this root AND no pending root slot exists, skip creating the normalizer).

Was this helpful? React with 👍 or 👎 to provide feedback.

devin-ai-integration

Devin Review found 1 new potential issue.

View 9 additional findings in Devin Review.

devin-ai-integration · 2026-04-11T23:17:46Z

+                drop(count);
+                self.ingr_root_open_connections.remove(root_sid);


🔴 TOCTOU race in record_trace_connection_close between drop(count) and remove()

The conversion from a single Mutex<TraceIngressState> to per-field DashMaps introduces a race condition in record_trace_connection_close. When count == 1, the code drops the RefMut (releasing the DashMap shard lock at line 4304) and then calls remove() (re-acquiring the lock at line 4305). In the window between these two operations, a concurrent call to trace_root_connection_opened (src/daemon.rs:4288-4293) on another connection-handler thread can find the existing entry (still at value 1), increment it to 2, and then remove() deletes the entry entirely — losing the new connection's count.

The old code was safe because both the decrement-or-remove and the open were serialized under one Mutex<TraceIngressState> lock. The consequence is that the newly opened connection becomes untracked: when it later closes, record_trace_connection_close won't find it in ingr_root_open_connections and will incorrectly report the root as stale, potentially triggering a premature close-fallback atexit event.

Prompt for agents

The bug is in record_trace_connection_close (src/daemon.rs around lines 4296-4310). When count <= 1, the code does drop(count) then self.ingr_root_open_connections.remove(root_sid), but between these two operations another thread can call trace_root_connection_opened which increments the counter, and then remove() deletes the entry losing the new connection count. To fix this atomically, replace the drop+remove pattern with DashMap's remove_if method (available in dashmap 6.x), which checks and removes in a single shard-locked operation: let was_removed = self.ingr_root_open_connections.remove_if(root_sid, |_, v| *v <= 1).is_some(); if !was_removed { continue; } Alternatively, you could keep the get_mut path for count > 1 (decrement) and use remove_if for the count == 1 case. The key requirement is that the check-then-remove must be atomic with respect to concurrent entry/or_insert calls from trace_root_connection_opened.

Was this helpful? React with 👍 or 👎 to provide feedback.

devin-ai-integration

Devin Review found 1 new potential issue.

View 14 additional findings in Devin Review.

devin-ai-integration · 2026-04-15T04:13:07Z

+            let removed = self
+                .ingr_root_open_connections
+                .remove_if_mut(root_sid, |_, v| {
+                    if *v > 0 {
+                        *v -= 1;
+                    }
+                    *v == 0
+                });
+            if removed.is_some() || !self.ingr_root_open_connections.contains_key(root_sid) {
+                stale_roots.push(root_sid.clone());
            }


🟡 TOCTOU race in record_trace_connection_close can produce duplicate stale roots

The remove_if_mut + contains_key two-step in record_trace_connection_close has a TOCTOU race when two connections for the same root close concurrently. If root has count=2 and two threads call simultaneously: Thread A's remove_if_mut decrements to 1 (returns None); Thread B's remove_if_mut decrements to 0, removes entry (returns Some). Thread A then calls contains_key which returns false (entry removed by B), so Thread A also marks the root as stale. Both threads return the same root in stale_roots, leading to duplicate fallback atexit events being enqueued downstream via enqueue_stale_connection_close_fallbacks. While the completed_root_sids guard and ingr_root_close_fallback_enqueued DashSet may mitigate the worst effects, the old code under a single Mutex had no such race — only one thread could decrement and determine staleness at a time.

Was this helpful? React with 👍 or 👎 to provide feedback.

…ers, bounded connection pool Three architectural improvements to the trace2 ingestion pipeline: 1. Replace Mutex<TraceIngressState> with 15 independent DashMap/DashSet fields - Each field (root_worktrees, root_families, root_argv, root_mutating, etc.) is now a dashmap::DashMap<String, V> or dashmap::DashSet<String> - Per-root events no longer serialize on a single global mutex; concurrent connections to different roots run without blocking each other - Readonly fast path is now fully lock-free: DashMap shard lookups replace global Mutex acquisition for the 40+/sec readonly events from IDEs like Zed - Fold `mark_trace_root_activity` into the existing lock acquisition in the mutating path, eliminating one full Mutex cycle per mutating event - Add `dashmap = "6"` dependency 2. Per-root normalizer instances (dashmap::DashMap<String, Arc<AsyncMutex<...>>>) - Replace single global AsyncMutex<TraceNormalizer> with a per-root map - Normalizer state is now scoped to each root SID; entries are cleaned up when clear_trace_root_tracking removes the root - Enables future concurrent ingest workers per root without further refactor 3. Bounded trace connection thread pool - Add a hard ceiling (MAX_CONCURRENT_TRACE_CONNECTIONS=512) on simultaneous connection handler threads - Excess connections are dropped with a log message rather than spawning unbounded OS threads (prevents memory exhaustion and scheduler overload) - Active count tracked with a shared AtomicUsize; thread spawn/exit are O(1) atomic operations All 1420 tests pass. Benchmarks: readonly_flood/zed_mixed_1000_events unchanged (p=0.99; concurrency benefit not captured by single-threaded microbenchmark). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

1. TOCTOU race in record_trace_connection_close: the old drop(count)+remove() pattern released the DashMap shard lock between check and delete, allowing trace_root_connection_opened on a concurrent thread to increment the counter and then have it silently deleted. Fix: use remove_if_mut to atomically decrement and remove within a single shard-lock acquisition. 2. Orphaned per-root normalizer leak: after clear_trace_root_tracking removes normalizers[root], a late connection-close fallback atexit would lazily create a fresh empty normalizer that never gets cleaned up, causing an unbounded slow memory leak in long-running daemons. Fix: maintain a bounded completed_root_sids DashSet (FIFO-evicted at 16 384 entries, mirroring the normalizer's own completed_roots limit) and early-exit in apply_trace_payload_to_state for already-completed roots. 3. Panic safety of active-connection counter: the plain fetch_sub after the handler call would be skipped if handle_trace_connection_actor panics, leaking the counter and eventually blocking new connections. Fix: use a small RAII ActiveGuard whose Drop impl performs the decrement, guaranteeing cleanup even during panic unwinding. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

1. remove_if_mut: guard against count=0 edge case — use explicit `if *v > 0 { *v -= 1 }` instead of saturating_sub to avoid silently removing entries that are already at zero. 2. ingr_root_mutating TOCTOU: replace get-then-insert with atomic entry API (or_insert + conditional upgrade) so concurrent connection-handler threads cannot overwrite true with false. 3. ingr_root_target_repo_only TOCTOU: same atomic entry fix, with the reflog cleanup only firing on an actual false→true upgrade. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Remove the MAX_CONCURRENT_TRACE_CONNECTIONS=512 ceiling and ActiveGuard RAII pattern. The simple unbounded thread::spawn from main is sufficient — the connection pool added complexity without meaningful benefit for current workloads, and the trace listener may move away from OS threads soon. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Early-exit when root_key is empty before normalizer creation, preventing an orphaned normalizer keyed on "" that would never be cleaned up. - Document the benign duplicate-stale-roots race in record_trace_connection_close (concurrent close can push the same root twice; downstream deduplicates via ingr_root_close_fallback_enqueued DashSet). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

svarlamov · 2026-05-05T02:44:28Z

Just fyi maybe some useful ideas, but this is old so I'm just gonna close it out for now. CC @heapwolf

devin-ai-integration Bot reviewed Apr 11, 2026

View reviewed changes

svarlamov force-pushed the perf/daemon-ingest-scalability branch from 273ed39 to 80bd7f1 Compare April 11, 2026 23:09

devin-ai-integration Bot reviewed Apr 11, 2026

View reviewed changes

svarlamov force-pushed the fix/readonly-command-daemon-flood branch from 52a3bc3 to 905bc5c Compare April 12, 2026 16:04

svarlamov force-pushed the perf/daemon-ingest-scalability branch 5 times, most recently from 98e95e4 to 3b51be1 Compare April 15, 2026 00:41

This comment was marked as resolved.

Sign in to view

svarlamov force-pushed the fix/readonly-command-daemon-flood branch from bbb0dff to 43d37b6 Compare April 15, 2026 00:46

svarlamov force-pushed the perf/daemon-ingest-scalability branch from 3b51be1 to c6a7743 Compare April 15, 2026 00:47

This comment was marked as resolved.

Sign in to view

Base automatically changed from fix/readonly-command-daemon-flood to main April 15, 2026 03:31

svarlamov force-pushed the perf/daemon-ingest-scalability branch from c6a7743 to 144859b Compare April 15, 2026 03:46

devin-ai-integration Bot reviewed Apr 15, 2026

View reviewed changes

svarlamov force-pushed the perf/daemon-ingest-scalability branch 2 times, most recently from ee6d2c6 to 8c8636e Compare April 16, 2026 00:47

svarlamov and others added 7 commits April 16, 2026 23:39

ci: trigger re-run for rebase worktree flake check

3b25f44

ci: trigger CI run

6958b24

svarlamov force-pushed the perf/daemon-ingest-scalability branch from 8c8636e to 6958b24 Compare April 16, 2026 23:39

svarlamov closed this May 5, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf(daemon): per-root DashMap sharding, normalizers, and bounded connection pool#1060

perf(daemon): per-root DashMap sharding, normalizers, and bounded connection pool#1060
svarlamov wants to merge 7 commits into
mainfrom
perf/daemon-ingest-scalability

svarlamov commented Apr 11, 2026 •

edited

Loading

Uh oh!

devin-ai-integration Bot left a comment

Uh oh!

devin-ai-integration Bot Apr 11, 2026

Uh oh!

devin-ai-integration Bot left a comment

Uh oh!

devin-ai-integration Bot Apr 11, 2026

Uh oh!

This comment was marked as resolved.

Uh oh!

This comment was marked as resolved.

Uh oh!

devin-ai-integration Bot left a comment

Uh oh!

devin-ai-integration Bot Apr 15, 2026 •

edited

Loading

Uh oh!

svarlamov commented May 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

		drop(count);
		self.ingr_root_open_connections.remove(root_sid);

Conversation

svarlamov commented Apr 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

1. Shard TraceIngressState with DashMap

2. Per-root normalizer instances

Correctness fixes

Test plan

Uh oh!

devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

Uh oh!

devin-ai-integration Bot Apr 11, 2026

Choose a reason for hiding this comment

Uh oh!

devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

Uh oh!

devin-ai-integration Bot Apr 11, 2026

Choose a reason for hiding this comment

Uh oh!

This comment was marked as resolved.

Uh oh!

This comment was marked as resolved.

Uh oh!

devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

Uh oh!

devin-ai-integration Bot Apr 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

svarlamov commented May 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

svarlamov commented Apr 11, 2026 •

edited

Loading

1. Shard `TraceIngressState` with DashMap

devin-ai-integration Bot Apr 15, 2026 •

edited

Loading