Detect deleted socket files and shut down zombie daemon #1133
Merged
When daemon socket files are removed from the filesystem (by cleanup scripts, racing daemon instances, or manual deletion), the daemon process becomes unreachable: alive but unable to accept new connections. New wrapper invocations fail to connect and fall into the 750ms wrapper-state timeout path.

Add a periodic socket health check that verifies both socket files still exist on disk. When either is missing, the daemon initiates graceful shutdown so the next wrapper invocation can spawn a fresh instance.

Also replace blocking thread::join() calls with a timed join approach (2s deadline) so the daemon process can exit even when listener threads are stuck in accept() on deleted sockets.

Fixes #1079

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
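A minimal sketch of such a periodic check, assuming a shared shutdown flag guarded by a mutex and a condvar (the PR description mentions `condvar.wait_timeout()`). The `Shutdown` type and the function names here are hypothetical and are not taken from the git-ai source; only the existence check described in this commit is modelled.

```rust
use std::path::PathBuf;
use std::sync::{Arc, Condvar, Mutex};
use std::time::Duration;

/// Hypothetical shutdown signal: a flag plus a condvar so sleepers wake early.
pub struct Shutdown {
    requested: Mutex<bool>,
    wake: Condvar,
}

impl Shutdown {
    pub fn new() -> Arc<Self> {
        Arc::new(Self { requested: Mutex::new(false), wake: Condvar::new() })
    }

    pub fn request(&self) {
        *self.requested.lock().unwrap() = true;
        self.wake.notify_all();
    }

    pub fn is_requested(&self) -> bool {
        *self.requested.lock().unwrap()
    }
}

/// Returns true only if every socket path still exists on disk.
pub fn sockets_present(paths: &[PathBuf]) -> bool {
    paths.iter().all(|p| p.exists())
}

/// Checks the socket paths roughly once per `interval`.
/// Returns false if shutdown was requested elsewhere, or true after
/// requesting shutdown itself because a socket file was missing.
pub fn health_check_loop(shutdown: &Shutdown, paths: &[PathBuf], interval: Duration) -> bool {
    let mut requested = shutdown.requested.lock().unwrap();
    loop {
        // Sleep for up to one interval; a shutdown request wakes us early.
        let (guard, _timed_out) = shutdown.wake.wait_timeout(requested, interval).unwrap();
        requested = guard;
        if *requested {
            return false;
        }
        if !sockets_present(paths) {
            *requested = true;
            shutdown.wake.notify_all();
            return true;
        }
    }
}
```

Waiting on the condvar rather than calling `thread::sleep` means a shutdown requested by another thread ends the loop without waiting out the full interval. A spurious wakeup only causes an early, harmless extra check.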
Windows named pipes cannot be deleted with fs::remove_file (OS error 87: "The parameter is incorrect"). The socket-file-deleted scenario is Unix-specific since Unix domain sockets use filesystem entries that can be unlinked independently of the open file descriptors.

The daemon-side health check (path.exists()) still runs on all platforms.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
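The Unix behaviour described here can be reproduced with the standard library alone. The following Unix-only sketch (the function name is illustrative) binds a listener, unlinks its socket file while the listener is still open, and reports how a new connection attempt fails.

```rust
use std::io;
use std::os::unix::net::{UnixListener, UnixStream};
use std::path::Path;

/// Binds a listener at `path`, unlinks the socket file, then tries to connect.
/// Returns the error kind of the failed connect, or None if it connected.
pub fn connect_after_unlink(path: &Path) -> io::Result<Option<io::ErrorKind>> {
    let listener = UnixListener::bind(path)?;
    // Removing the filesystem entry does not close the listener's descriptor.
    std::fs::remove_file(path)?;
    let outcome = match UnixStream::connect(path) {
        Ok(_) => None,
        Err(e) => Some(e.kind()),
    };
    // The listener stayed alive for the whole experiment.
    drop(listener);
    Ok(outcome)
}
```

The connect attempt fails with `ErrorKind::NotFound` (`ENOENT`) even though the listening descriptor is still open, which is the unreachable state this PR detects.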
Replace fixed 500ms sleep with 10ms polling loop so fast-joining threads (health, update) don't waste time. This makes the shutdown sequence complete faster when only listener threads are stuck.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
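One way to express a deadline-bounded join with polling is sketched below, using `JoinHandle::is_finished` from the standard library. The function name and signature are assumptions for illustration; the PR's actual helper is not shown on this page.

```rust
use std::thread::{self, JoinHandle};
use std::time::{Duration, Instant};

/// Joins every thread that finishes before `deadline` elapses, polling every `poll`.
/// Returns the handles still running at the deadline; dropping them detaches the threads.
pub fn join_with_deadline(
    mut handles: Vec<JoinHandle<()>>,
    deadline: Duration,
    poll: Duration,
) -> Vec<JoinHandle<()>> {
    let start = Instant::now();
    loop {
        // Split the handles into finished and still-running threads.
        let (done, pending): (Vec<_>, Vec<_>) =
            handles.into_iter().partition(|h| h.is_finished());
        for h in done {
            // The thread has already finished, so join returns immediately.
            // A panic inside the thread is ignored in this sketch.
            let _ = h.join();
        }
        handles = pending;
        if handles.is_empty() || start.elapsed() >= deadline {
            return handles;
        }
        thread::sleep(poll);
    }
}
```

With a 2s deadline and a 10ms poll, threads that exit quickly are reaped within roughly one poll interval, and only threads that are genuinely stuck cost the full deadline.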
Instead of just checking that socket files exist on the filesystem, actually connect to both sockets. This catches strictly more failure modes: deleted files, stale sockets from a previous instance, and hung listener threads that are no longer calling accept().

The probe uses the existing local_socket_connects_with_timeout with the 100ms timeout already used elsewhere. The listener handles probe connections gracefully (immediate EOF -> handler returns Ok).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
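The difference between an existence check and a connect probe shows up with a stale socket file. The Unix-only sketch below uses a plain `UnixStream::connect` as a stand-in for `local_socket_connects_with_timeout`, whose implementation is not shown here. The standard library offers no connect timeout for Unix sockets, so the 100ms bound is omitted, and both function names are hypothetical.

```rust
use std::os::unix::net::UnixStream;
use std::path::Path;

/// True if a listener is currently bound and listening at `path`.
/// The probe connection is dropped at once, so the listener sees an immediate EOF.
pub fn socket_reachable(path: &Path) -> bool {
    UnixStream::connect(path).is_ok()
}

/// True only if every socket in `paths` accepts a probe connection.
pub fn all_sockets_reachable(paths: &[&Path]) -> bool {
    paths.iter().all(|p| socket_reachable(p))
}
```

A socket file left behind by a listener that has gone away still satisfies `path.exists()`, but the probe fails because the connection is refused. A deleted file fails both checks.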
When the health check detects unreachable sockets, it now spawns a detached `git-ai bg restart --hard` process before shutting down. The restart process reaps the zombie daemon (SIGKILL if needed), waits for the lock to release, and starts a fresh daemon instance.

This means socket deletion is no longer a permanent failure: the daemon automatically recovers without waiting for the next wrapper invocation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
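As a rough illustration of the spawn step, the sketch below builds the `git-ai bg restart --hard` command with `std::process::Command` and starts it without waiting on it. The function names are hypothetical, the program name is a parameter so the sketch can be exercised without a real `git-ai` binary, and "detached" is approximated by nulled standard streams and an un-awaited child. The PR's real implementation may do more.

```rust
use std::process::{Child, Command, Stdio};

/// Builds the restart command without running it.
/// The child inherits the parent's environment by default.
pub fn restart_command(program: &str) -> Command {
    let mut cmd = Command::new(program);
    cmd.args(["bg", "restart", "--hard"])
        .stdin(Stdio::null())
        .stdout(Stdio::null())
        .stderr(Stdio::null());
    cmd
}

/// Spawns the restart process and returns its PID.
/// The `Child` handle is dropped without waiting, which leaves the process running.
pub fn spawn_detached_restart(program: &str) -> std::io::Result<u32> {
    let child: Child = restart_command(program).spawn()?;
    Ok(child.id())
}
```

Dropping a `Child` in Rust neither kills nor waits for the process, so the restart helper can outlive the daemon that launched it.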
The PID file path varies by platform (macOS uses /tmp when paths exceed 100 chars for Unix socket limits). The test already verifies the full behavioral contract: original daemon exits, new daemon starts and responds to control requests on the same socket paths.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
If the daemon's sockets fail the health check within the first 60 seconds of uptime, shut down without spawning a restart. This prevents infinite restart loops when the underlying issue is systemic (e.g. filesystem permissions, broken paths, another process continuously deleting sockets faster than the daemon can start).

If sockets fail after 60+ seconds of healthy operation, the failure is likely transient (manual cleanup, racing daemon) and self-restart is appropriate.

The minimum uptime is overridable via GIT_AI_DAEMON_MIN_UPTIME_FOR_RESTART_SECS for testing.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
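The uptime gate reduces to a small pure decision, sketched here. The enum and function names are hypothetical. Treating exactly 60 seconds as eligible for restart, and falling back to the default when the override is missing or unparsable, are assumptions of this sketch rather than confirmed behaviour.

```rust
use std::time::Duration;

/// Default minimum uptime before a self-restart is allowed (from the commit message).
pub const DEFAULT_MIN_UPTIME_SECS: u64 = 60;

#[derive(Debug, PartialEq, Eq)]
pub enum OnSocketFailure {
    /// Failure came too soon after start: exit without restarting.
    ShutdownOnly,
    /// Failure came after a healthy period: spawn a restart, then exit.
    RestartThenShutdown,
}

/// Parses an override value in seconds, using the default when absent or invalid.
pub fn min_uptime_from(raw: Option<&str>) -> Duration {
    let secs = raw
        .and_then(|v| v.trim().parse::<u64>().ok())
        .unwrap_or(DEFAULT_MIN_UPTIME_SECS);
    Duration::from_secs(secs)
}

/// Reads the override from the environment variable named in the commit message.
pub fn min_uptime_from_env() -> Duration {
    let raw = std::env::var("GIT_AI_DAEMON_MIN_UPTIME_FOR_RESTART_SECS").ok();
    min_uptime_from(raw.as_deref())
}

/// Decides what to do when the socket health check fails at `uptime`.
pub fn decide(uptime: Duration, min_uptime: Duration) -> OnSocketFailure {
    if uptime >= min_uptime {
        OnSocketFailure::RestartThenShutdown
    } else {
        OnSocketFailure::ShutdownOnly
    }
}
```

Keeping the decision separate from the environment lookup makes the restart-loop guard testable without spawning a daemon. A test can pass a zero minimum uptime to force the restart path immediately.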
Summary
- On health check failure, spawns a detached `git-ai bg restart --hard` process to self-heal: the zombie daemon is reaped and a fresh one starts automatically
- Replaces blocking `thread::join()` calls with a timed join approach (2s deadline) so the daemon process can exit even when listener threads are stuck in `accept()` on deleted sockets

Problem
When daemon socket files are removed from the filesystem while the daemon process is still running, the daemon becomes a zombie: alive but unreachable. The kernel-level socket file descriptors remain valid for existing connections, but new `connect()` calls fail with `ENOENT`. This causes wrapper invocations to fail `init_daemon_telemetry_handle()` but still pass `invocation_id` to git, leading the daemon to enter the 750ms wrapper-state timeout path for every command.

This was observed in #1079, where JeanFred saw persistent `wrapper state timeout` messages with `pre=false, post=false`, indicating the wrapper could never connect to the daemon.

Root cause scenarios
- […] `rm`, or filesystem events remove socket files while the daemon is running

Implementation
- `daemon_socket_health_check_loop`: Uses `condvar.wait_timeout()` with a configurable interval (default 30s, overridable via `GIT_AI_DAEMON_SOCKET_HEALTH_CHECK_SECS` for testing). Each cycle does a real `connect()` to both sockets via `local_socket_connects_with_timeout` (100ms timeout). This catches more failure modes than a simple `stat()`: deleted files, stale sockets from a previous instance, and hung listener threads.
- `spawn_self_restart`: On health check failure, spawns a detached `git-ai bg restart --hard` child process that inherits the daemon env vars. The child does SIGKILL + wait for reap, then `ensure_daemon_running` to start a fresh instance. The current daemon then proceeds with graceful shutdown.
- Timed join (2s deadline): even if listener threads are stuck in `accept()` because socket files were deleted (so the wake-up connection fails), the daemon process still exits cleanly.

Test plan
- `daemon_shuts_down_when_socket_files_are_deleted`: spawns an isolated daemon, verifies it's alive and responsive, deletes both socket files, asserts the daemon exits within 10 seconds
- `daemon_self_heals_after_socket_deletion`: same setup, but then waits for a new daemon to come up on the same socket paths and verifies it responds to control requests with a different PID
- Both tests are gated with `#[cfg(unix)]`

Fixes #1079
🤖 Generated with Claude Code