
Detect deleted socket files and shut down zombie daemon #1133

Merged
svarlamov merged 8 commits into main from investigate/wrapper-timeout-1079
Apr 17, 2026
Conversation

@svarlamov svarlamov commented Apr 16, 2026

Summary

  • Adds a periodic socket health check that connects to both sockets (not just stat) to verify the daemon is reachable
  • When either socket probe fails, spawns a detached git-ai bg restart --hard process to self-heal — the zombie daemon is reaped and a fresh one starts automatically
  • Replaces blocking thread::join() calls with a timed join approach (2s deadline) so the daemon process can exit even when listener threads are stuck in accept() on deleted sockets

Problem

When daemon socket files are removed from the filesystem while the daemon process is still running, the daemon becomes a zombie: alive but unreachable. The kernel-level socket file descriptors remain valid for existing connections, but new connect() calls fail with ENOENT. As a result, wrapper invocations fail in init_daemon_telemetry_handle() but still pass invocation_id to git, which sends the daemon into the 750ms wrapper-state timeout path for every command.
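The ENOENT behaviour described above can be reproduced with the standard library alone. The following is a minimal, Unix-only sketch and is not taken from the PR; `connect_error_after_unlink` is a hypothetical helper and the socket path is supplied by the caller.

```rust
use std::fs;
use std::io;
use std::os::unix::net::{UnixListener, UnixStream};
use std::path::Path;

/// Hypothetical helper: binds a listener at `path`, unlinks the path, then
/// tries to connect again. Returns the error kind of the failed connect,
/// or `None` if the second connect unexpectedly succeeded.
fn connect_error_after_unlink(path: &Path) -> io::Result<Option<io::ErrorKind>> {
    let _ = fs::remove_file(path); // clear any leftover from a previous run
    let listener = UnixListener::bind(path)?;

    // While the filesystem entry exists, a new client can connect.
    let _first = UnixStream::connect(path)?;

    // Unlinking the path does not close the listener's file descriptor...
    fs::remove_file(path)?;

    // ...but new clients can no longer find the socket by name.
    let second = UnixStream::connect(path).map(|_| ()).map_err(|e| e.kind());

    drop(listener);
    Ok(second.err())
}
```

Calling the helper with a fresh path under the temp directory should yield `Some(io::ErrorKind::NotFound)`, which is how Rust surfaces ENOENT.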

This was observed in #1079, where JeanFred saw persistent wrapper state timeout messages with pre=false, post=false, indicating that the wrapper could never connect to the daemon.

Root cause scenarios

  1. Daemon restart during self-update: the old daemon shuts down and its socket files are cleaned up, then the new daemon starts, but wrapper invocations during the gap pass invocation_id without a connection
  2. External socket deletion: cleanup scripts, manual rm, or filesystem events remove socket files while daemon is running
  3. Racing daemon instances: multiple daemon starts can interfere with each other's socket files

Implementation

  • daemon_socket_health_check_loop: Uses condvar.wait_timeout() with a configurable interval (default 30s, overridable via GIT_AI_DAEMON_SOCKET_HEALTH_CHECK_SECS for testing). Each cycle does a real connect() to both sockets via local_socket_connects_with_timeout (100ms timeout). This catches more failure modes than a simple stat(): deleted files, stale sockets from a previous instance, and hung listener threads.
  • spawn_self_restart: On health check failure, spawns a detached git-ai bg restart --hard child process that inherits the daemon env vars. The child sends SIGKILL, waits for the old process to be reaped, then calls ensure_daemon_running to start a fresh instance. Meanwhile, the current daemon proceeds with graceful shutdown.
  • Timed thread joins: After shutdown, polls each daemon thread with a 2-second total deadline. If listener threads are stuck in accept() because socket files were deleted (so the wake-up connection fails), the daemon process still exits cleanly.
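The health check loop from the first bullet can be sketched as follows. This is an illustration under stated assumptions, not the PR's code: `health_check_loop`, `LoopExit`, `Shutdown`, and `interval_from_env` are hypothetical names, the socket probe is passed in as a closure, and the sketch uses `Condvar::wait_timeout_while`, a standard-library variant of `wait_timeout` that rechecks the flag after spurious wakeups.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::time::Duration;

/// Why the loop stopped (hypothetical type).
#[derive(Debug, PartialEq)]
enum LoopExit {
    ShutdownRequested,
    ProbeFailed,
}

/// Shutdown flag plus the condvar used to wake the loop early (hypothetical alias).
type Shutdown = Arc<(Mutex<bool>, Condvar)>;

/// Parses an interval in seconds, falling back to the 30s default.
/// The caller would pass the value of GIT_AI_DAEMON_SOCKET_HEALTH_CHECK_SECS.
fn interval_from_env(value: Option<&str>) -> Duration {
    let secs = value.and_then(|v| v.trim().parse::<u64>().ok()).unwrap_or(30);
    Duration::from_secs(secs)
}

/// Runs `probe` once per `interval` until it fails or shutdown is requested.
fn health_check_loop(
    shutdown: &Shutdown,
    interval: Duration,
    mut probe: impl FnMut() -> bool,
) -> LoopExit {
    let (lock, cvar) = &**shutdown;
    loop {
        {
            let guard = lock.lock().unwrap();
            // Sleeps for `interval`, but returns early once the flag is set
            // and the condvar is notified.
            let (guard, _timed_out) = cvar
                .wait_timeout_while(guard, interval, |stop| !*stop)
                .unwrap();
            if *guard {
                return LoopExit::ShutdownRequested;
            }
        } // the lock is released here, before the probe runs
        if !probe() {
            return LoopExit::ProbeFailed;
        }
    }
}
```

Waiting on a condvar rather than sleeping lets a shutdown request interrupt the interval immediately, which matters when the default interval is 30 seconds.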

Test plan

  • daemon_shuts_down_when_socket_files_are_deleted — spawns an isolated daemon, verifies it's alive and responsive, deletes both socket files, asserts the daemon exits within 10 seconds
  • daemon_self_heals_after_socket_deletion — same setup, but then waits for a new daemon to come up on the same socket paths and verifies it responds to control requests with a different PID
  • Both tests use a 1-second health check interval and are gated with #[cfg(unix)]
  • Full compilation check passes
  • CI checks pass

Fixes #1079

🤖 Generated with Claude Code

When daemon socket files are removed from the filesystem (by cleanup
scripts, racing daemon instances, or manual deletion), the daemon process
becomes unreachable: alive but unable to accept new connections. New
wrapper invocations fail to connect and fall into the 750ms wrapper-state
timeout path.

Add a periodic socket health check that verifies both socket files still
exist on disk. When either is missing, the daemon initiates graceful
shutdown so the next wrapper invocation can spawn a fresh instance.

Also replace blocking thread::join() calls with a timed join approach
(2s deadline) so the daemon process can exit even when listener threads
are stuck in accept() on deleted sockets.

Fixes #1079

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@devin-ai-integration devin-ai-integration Bot left a comment


✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.


svarlamov and others added 7 commits April 17, 2026 00:01
Windows named pipes cannot be deleted with fs::remove_file (OS error 87:
"The parameter is incorrect"). The socket-file-deleted scenario is
Unix-specific since Unix domain sockets use filesystem entries that can
be unlinked independently of the open file descriptors.

The daemon-side health check (path.exists()) still runs on all platforms.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace fixed 500ms sleep with 10ms polling loop so fast-joining threads
(health, update) don't waste time. This makes the shutdown sequence
complete faster when only listener threads are stuck.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
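A timed join with a short polling interval can be sketched as below. This is an illustration rather than the code from this commit: `join_with_deadline` is a hypothetical name, and the sketch relies on `JoinHandle::is_finished` from the standard library. The 2s deadline and 10ms poll would be passed in by the caller.

```rust
use std::thread::{self, JoinHandle};
use std::time::{Duration, Instant};

/// Hypothetical helper: joins every thread that finishes before `deadline`
/// elapses, checking every `poll`. Returns how many threads were still
/// running; their handles are dropped, which detaches them.
fn join_with_deadline(
    mut handles: Vec<JoinHandle<()>>,
    deadline: Duration,
    poll: Duration,
) -> usize {
    let start = Instant::now();
    loop {
        // Join everything that has already finished; keep the rest.
        let (done, pending): (Vec<_>, Vec<_>) =
            handles.into_iter().partition(|h| h.is_finished());
        for handle in done {
            let _ = handle.join(); // returns immediately for a finished thread
        }
        handles = pending;

        if handles.is_empty() || start.elapsed() >= deadline {
            return handles.len();
        }
        thread::sleep(poll);
    }
}
```

With this shape, the call returns as soon as the last thread finishes, and a thread blocked in accept() costs at most the deadline.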
Instead of just checking that socket files exist on the filesystem,
actually connect to both sockets. This catches strictly more failure
modes: deleted files, stale sockets from a previous instance, and hung
listener threads that are no longer calling accept().

The probe uses the existing local_socket_connects_with_timeout with the
100ms timeout already used elsewhere. The listener handles probe
connections gracefully (immediate EOF -> handler returns Ok).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
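For illustration, a connect probe with a timeout might look like the sketch below. It is not the project's local_socket_connects_with_timeout, whose implementation is not shown here; `socket_connects_within` is a hypothetical, Unix-only stand-in that runs the connect on a helper thread because `std::os::unix::net::UnixStream` has no built-in connect timeout.

```rust
use std::os::unix::net::UnixStream;
use std::path::{Path, PathBuf};
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Hypothetical probe: returns true if a connection to `path` is
/// established within `timeout`.
fn socket_connects_within(path: &Path, timeout: Duration) -> bool {
    let (tx, rx) = mpsc::channel();
    let path: PathBuf = path.to_path_buf();

    thread::spawn(move || {
        // The stream is dropped right away, so the listener side sees an
        // immediate EOF on the probe connection.
        let connected = UnixStream::connect(&path).is_ok();
        let _ = tx.send(connected);
    });

    // A timeout and a failed connect are both reported as "unreachable".
    rx.recv_timeout(timeout).unwrap_or(false)
}
```

A caller would run this against both socket paths and treat a `false` from either one as a failed health check.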
When the health check detects unreachable sockets, it now spawns a
detached `git-ai bg restart --hard` process before shutting down. The
restart process reaps the zombie daemon (SIGKILL if needed), waits for
the lock to release, and starts a fresh daemon instance.

This means socket deletion is no longer a permanent failure — the daemon
automatically recovers without waiting for the next wrapper invocation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
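One way to spawn a detached child on Unix is sketched below. How the PR actually detaches the restart process is not shown in this message, so treat the details as assumptions: `spawn_detached` is a hypothetical helper that nulls the standard streams and places the child in its own process group. Environment variables are inherited by default with `std::process::Command`.

```rust
use std::io;
use std::os::unix::process::CommandExt;
use std::process::{Child, Command, Stdio};

/// Hypothetical helper: starts `program` with `args` in a new process group,
/// with stdin, stdout, and stderr disconnected from the parent.
fn spawn_detached(program: &str, args: &[&str]) -> io::Result<Child> {
    Command::new(program)
        .args(args)
        .stdin(Stdio::null())
        .stdout(Stdio::null())
        .stderr(Stdio::null())
        // A new process group keeps signals sent to the parent's group
        // from also reaching the child.
        .process_group(0)
        .spawn()
}
```

Under these assumptions, the self-restart would be started with `spawn_detached("git-ai", &["bg", "restart", "--hard"])` before the daemon begins its own shutdown.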
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The PID file path varies by platform (macOS uses /tmp when paths exceed
100 chars for Unix socket limits). The test already verifies the full
behavioral contract: original daemon exits, new daemon starts and
responds to control requests on the same socket paths.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
If the daemon's sockets fail the health check within the first 60
seconds of uptime, shut down without spawning a restart. This prevents
infinite restart loops when the underlying issue is systemic (e.g.
filesystem permissions, broken paths, another process continuously
deleting sockets faster than the daemon can start).

If sockets fail after 60+ seconds of healthy operation, the failure is
likely transient (manual cleanup, racing daemon) and self-restart is
appropriate.

The minimum uptime is overridable via GIT_AI_DAEMON_MIN_UPTIME_FOR_RESTART_SECS
for testing.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
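The uptime gate described in this commit reduces to a small decision function. The sketch below is illustrative only: `FailureAction`, `min_uptime_from_env`, and `action_on_probe_failure` are hypothetical names, and the 60-second default comes from the commit message.

```rust
use std::time::Duration;

/// What to do when the socket health check fails (hypothetical type).
#[derive(Debug, PartialEq)]
enum FailureAction {
    /// Failure too soon after startup: exit without restarting.
    ShutdownOnly,
    /// Failure after a healthy period: spawn the restart, then exit.
    RestartThenShutdown,
}

/// Parses the minimum uptime in seconds, falling back to 60s. The caller
/// would pass the value of GIT_AI_DAEMON_MIN_UPTIME_FOR_RESTART_SECS.
fn min_uptime_from_env(value: Option<&str>) -> Duration {
    let secs = value.and_then(|v| v.trim().parse::<u64>().ok()).unwrap_or(60);
    Duration::from_secs(secs)
}

/// Restart only if the daemon was healthy for at least `min_uptime`.
fn action_on_probe_failure(uptime: Duration, min_uptime: Duration) -> FailureAction {
    if uptime < min_uptime {
        FailureAction::ShutdownOnly
    } else {
        FailureAction::RestartThenShutdown
    }
}
```

Setting the override to 0 makes every failure eligible for a restart, which is convenient for tests that delete sockets shortly after startup.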
@svarlamov svarlamov merged commit 8ae8bba into main Apr 17, 2026
23 of 26 checks passed
@svarlamov svarlamov deleted the investigate/wrapper-timeout-1079 branch April 17, 2026 21:37


Development

Successfully merging this pull request may close these issues.

[Bug]: authorship is sometimes lost through rebases
