Detect deleted socket files and shut down zombie daemon by svarlamov · Pull Request #1133 · git-ai-project/git-ai

svarlamov · 2026-04-16T23:14:19Z

Summary

Adds a periodic socket health check that connects to both sockets (not just stat) to verify the daemon is reachable
When either socket probe fails, spawns a detached git-ai bg restart --hard process to self-heal — the zombie daemon is reaped and a fresh one starts automatically
Replaces blocking thread::join() calls with a timed join approach (2s deadline) so the daemon process can exit even when listener threads are stuck in accept() on deleted sockets

Problem

When daemon socket files are removed from the filesystem while the daemon process is still running, the daemon becomes a "zombie" in the sense used here: alive but unreachable (not a defunct entry in the process table). The kernel-level socket file descriptors remain valid for existing connections, but new connect() calls fail with ENOENT. Wrapper invocations then fail in init_daemon_telemetry_handle() but still pass invocation_id to git, so the daemon enters the 750ms wrapper-state timeout path for every command.
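
The failure mode can be reproduced outside the daemon. The following is a minimal standard-library sketch, not code from this PR: the function name and the temp-file path are hypothetical. It binds a listener, queues one connection, unlinks the path, and then shows that the queued connection can still be accepted while a fresh connect() fails.

```rust
use std::fs;
use std::io;
use std::os::unix::net::{UnixListener, UnixStream};
use std::path::Path;

/// Binds a listener at `path`, unlinks the path while the listener is still
/// open, and returns the error kind that a fresh connect() then reports.
/// (Hypothetical helper for illustration only.)
fn connect_error_after_unlink(path: &Path) -> io::Result<io::ErrorKind> {
    let _ = fs::remove_file(path); // clear leftovers from an earlier run
    let listener = UnixListener::bind(path)?;

    // A connection made while the file exists is queued on the listener.
    let _queued = UnixStream::connect(path)?;

    // Unlink the filesystem entry; the listener's descriptor stays open.
    fs::remove_file(path)?;

    // The already-queued connection is still served after the unlink.
    let (_server_side, _addr) = listener.accept()?;

    // A new connect() has no path to resolve and fails.
    match UnixStream::connect(path) {
        Ok(_) => Err(io::Error::new(
            io::ErrorKind::Other,
            "connect unexpectedly succeeded",
        )),
        Err(err) => Ok(err.kind()),
    }
}

fn main() -> io::Result<()> {
    let path = std::env::temp_dir().join(format!("zombie-demo-{}.sock", std::process::id()));
    let kind = connect_error_after_unlink(&path)?;
    println!("connect after unlink failed with {:?}", kind);
    Ok(())
}
```

ENOENT surfaces in Rust as io::ErrorKind::NotFound, which is what the final connect() reports here.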

This was observed in #1079, where JeanFred saw persistent wrapper-state timeout messages with pre=false, post=false, indicating that the wrapper could never connect to the daemon.

Root cause scenarios

Daemon restart during self-update: old daemon shuts down, socket files are cleaned up, new daemon starts but wrapper invocations during the gap pass invocation_id without a connection
External socket deletion: cleanup scripts, manual rm, or filesystem events remove socket files while daemon is running
Racing daemon instances: multiple daemon starts can interfere with each other's socket files

Implementation

daemon_socket_health_check_loop: Uses condvar.wait_timeout() with a configurable interval (default 30s, overridable via GIT_AI_DAEMON_SOCKET_HEALTH_CHECK_SECS for testing). Each cycle does a real connect() to both sockets via local_socket_connects_with_timeout (100ms timeout). This catches more failure modes than a simple stat(): deleted files, stale sockets from a previous instance, and hung listener threads.
spawn_self_restart: On health check failure, spawns a detached git-ai bg restart --hard child process that inherits the daemon env vars. The child sends SIGKILL and waits for the old daemon to be reaped, then runs ensure_daemon_running to start a fresh instance. The current daemon then proceeds with graceful shutdown.
Timed thread joins: After shutdown, polls each daemon thread with a 2-second total deadline. If listener threads are stuck in accept() because socket files were deleted (so the wake-up connection fails), the daemon process still exits cleanly.
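
The shape of the health check loop described above can be sketched as follows. This is an illustration under assumptions, not the PR's code: LoopExit, interval_from, and the probe closure are hypothetical names, and only the environment variable name and the 30s default come from the description. The condvar lets a shutdown request wake the loop immediately instead of waiting out the interval.

```rust
use std::sync::{Condvar, Mutex};
use std::time::Duration;

/// Why the loop ended (hypothetical type for this sketch).
#[derive(Debug, PartialEq, Eq)]
enum LoopExit {
    ShutdownRequested,
    ProbeFailed,
}

/// Parses the interval override, falling back to the 30s default.
fn interval_from(raw: Option<&str>) -> Duration {
    raw.and_then(|s| s.trim().parse::<u64>().ok())
        .map(Duration::from_secs)
        .unwrap_or(Duration::from_secs(30))
}

/// Sleeps on the condvar for `interval`, then runs `probe`. Returns when
/// shutdown is flagged or when a probe reports failure.
fn health_check_loop<F: FnMut() -> bool>(
    shutdown: &(Mutex<bool>, Condvar),
    interval: Duration,
    mut probe: F,
) -> LoopExit {
    let (lock, cvar) = shutdown;
    loop {
        let stop = lock.lock().unwrap();
        if *stop {
            return LoopExit::ShutdownRequested;
        }
        let (stop, wait) = cvar.wait_timeout(stop, interval).unwrap();
        if *stop {
            return LoopExit::ShutdownRequested;
        }
        let timed_out = wait.timed_out();
        drop(stop); // do not hold the lock while probing
        if timed_out && !probe() {
            return LoopExit::ProbeFailed;
        }
    }
}

fn main() {
    let configured = interval_from(
        std::env::var("GIT_AI_DAEMON_SOCKET_HEALTH_CHECK_SECS").ok().as_deref(),
    );
    println!("configured interval: {:?}", configured);

    // Demo with a short interval and a fake probe that fails on its third call.
    let shutdown = (Mutex::new(false), Condvar::new());
    let mut calls = 0;
    let exit = health_check_loop(&shutdown, Duration::from_millis(10), || {
        calls += 1;
        calls < 3
    });
    println!("loop exited with {:?} after {} probes", exit, calls);
}
```

In the daemon the probe would be the real connect() to both sockets; here it is a closure so the loop can be exercised without any sockets.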

Test plan

daemon_shuts_down_when_socket_files_are_deleted — spawns an isolated daemon, verifies it's alive and responsive, deletes both socket files, asserts the daemon exits within 10 seconds
daemon_self_heals_after_socket_deletion — same setup, but then waits for a new daemon to come up on the same socket paths and verifies it responds to control requests with a different PID
Both tests use a 1-second health check interval and are gated with #[cfg(unix)]
Full compilation check passes
CI checks pass

Fixes #1079

🤖 Generated with Claude Code

When daemon socket files are removed from the filesystem (by cleanup scripts, racing daemon instances, or manual deletion), the daemon process becomes unreachable: alive but unable to accept new connections. New wrapper invocations fail to connect and fall into the 750ms wrapper-state timeout path.

Add a periodic socket health check that verifies both socket files still exist on disk. When either is missing, the daemon initiates graceful shutdown so the next wrapper invocation can spawn a fresh instance.

Also replace blocking thread::join() calls with a timed join approach (2s deadline) so the daemon process can exit even when listener threads are stuck in accept() on deleted sockets.

Fixes #1079

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

devin-ai-integration

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 4 additional findings.

Windows named pipes cannot be deleted with fs::remove_file (OS error 87: "The parameter is incorrect"). The socket-file-deleted scenario is Unix-specific since Unix domain sockets use filesystem entries that can be unlinked independently of the open file descriptors. The daemon-side health check (path.exists()) still runs on all platforms.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Replace fixed 500ms sleep with 10ms polling loop so fast-joining threads (health, update) don't waste time. This makes the shutdown sequence complete faster when only listener threads are stuck.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
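
A timed join with a short polling interval might look like the sketch below. It assumes JoinHandle::is_finished (stable since Rust 1.61) as the polling primitive; the function name and return value are hypothetical, and the PR's actual implementation may differ.

```rust
use std::sync::mpsc;
use std::thread::{self, JoinHandle};
use std::time::{Duration, Instant};

/// Joins every thread that finishes before `total` elapses, checking every
/// `poll`. Returns how many threads were still running at the deadline;
/// their handles are dropped, which detaches them.
fn join_with_deadline(handles: Vec<JoinHandle<()>>, total: Duration, poll: Duration) -> usize {
    let deadline = Instant::now() + total;
    let mut pending = handles;
    loop {
        let (finished, running): (Vec<_>, Vec<_>) =
            pending.into_iter().partition(|h| h.is_finished());
        for handle in finished {
            let _ = handle.join(); // returns immediately: the thread is done
        }
        pending = running;
        if pending.is_empty() || Instant::now() >= deadline {
            return pending.len();
        }
        thread::sleep(poll);
    }
}

fn main() {
    // One thread exits at once; the other blocks until `tx` is dropped,
    // standing in for a listener stuck in accept().
    let (tx, rx) = mpsc::channel::<()>();
    let handles = vec![
        thread::spawn(|| {}),
        thread::spawn(move || {
            let _ = rx.recv();
        }),
    ];
    let stuck = join_with_deadline(handles, Duration::from_secs(2), Duration::from_millis(10));
    println!("{} thread(s) still running at the deadline", stuck);
    drop(tx);
}
```

When every thread finishes quickly the function returns on the first or second poll, so the full deadline is only paid when something is genuinely stuck.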

Instead of just checking that socket files exist on the filesystem, actually connect to both sockets. This catches strictly more failure modes: deleted files, stale sockets from a previous instance, and hung listener threads that are no longer calling accept().

The probe uses the existing local_socket_connects_with_timeout with the 100ms timeout already used elsewhere. The listener handles probe connections gracefully (immediate EOF -> handler returns Ok).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
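
To illustrate why a connect probe is stronger than an existence check, here is a standard-library sketch. It is not local_socket_connects_with_timeout: socket_accepts_connections is a hypothetical stand-in, and the timeout is approximated by connecting on a helper thread because std's UnixStream has no connect-with-timeout call.

```rust
use std::fs;
use std::io;
use std::os::unix::net::{UnixListener, UnixStream};
use std::path::Path;
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Returns true only if a connect() to `path` succeeds within `timeout`.
/// (Hypothetical stand-in for the probe used by the daemon.)
fn socket_accepts_connections(path: &Path, timeout: Duration) -> bool {
    let (tx, rx) = mpsc::channel();
    let owned = path.to_path_buf();
    thread::spawn(move || {
        // The stream is dropped right away, so the listener sees an immediate EOF.
        let _ = tx.send(UnixStream::connect(&owned).is_ok());
    });
    rx.recv_timeout(timeout).unwrap_or(false)
}

fn main() -> io::Result<()> {
    let path = std::env::temp_dir().join(format!("probe-demo-{}.sock", std::process::id()));
    let _ = fs::remove_file(&path);
    let timeout = Duration::from_millis(100);

    // Healthy: a listener is bound at the path.
    let listener = UnixListener::bind(&path)?;
    println!("live:    exists={} probe={}", path.exists(), socket_accepts_connections(&path, timeout));

    // Stale: std does not unlink the file when the listener is dropped,
    // so the path still exists but nothing is listening behind it.
    drop(listener);
    println!("stale:   exists={} probe={}", path.exists(), socket_accepts_connections(&path, timeout));

    // Deleted: both checks fail.
    fs::remove_file(&path)?;
    println!("deleted: exists={} probe={}", path.exists(), socket_accepts_connections(&path, timeout));
    Ok(())
}
```

The stale case is the one an existence check misses: the file is present, yet connect() is refused.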

When the health check detects unreachable sockets, it now spawns a detached `git-ai bg restart --hard` process before shutting down. The restart process reaps the zombie daemon (SIGKILL if needed), waits for the lock to release, and starts a fresh daemon instance.

This means socket deletion is no longer a permanent failure: the daemon recovers automatically without waiting for the next wrapper invocation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
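
One way to build such a detached child with the standard library is sketched below. How the PR actually detaches the process is not shown in this page, so the use of process_group(0) and null stdio here is an assumption, and detached_restart_command is a hypothetical helper. The sketch only builds the command; calling spawn() on it would start the child without waiting for it.

```rust
use std::ffi::OsStr;
use std::os::unix::process::CommandExt;
use std::process::{Command, Stdio};

/// Builds the restart command. The child inherits the parent's environment
/// by default; stdio is detached and the child gets its own process group
/// (an assumption about what "detached" means here).
fn detached_restart_command(program: &str) -> Command {
    let mut cmd = Command::new(program);
    cmd.args(["bg", "restart", "--hard"])
        .stdin(Stdio::null())
        .stdout(Stdio::null())
        .stderr(Stdio::null())
        .process_group(0); // new process group, separate from the daemon's
    cmd
}

fn main() {
    let cmd = detached_restart_command("git-ai");
    let args: Vec<&OsStr> = cmd.get_args().collect();
    // A real caller would use cmd.spawn() and deliberately not wait on the child.
    println!("would spawn: {:?} {:?}", cmd.get_program(), args);
}
```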

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The PID file path varies by platform (macOS uses /tmp when paths exceed 100 chars for Unix socket limits). The test already verifies the full behavioral contract: original daemon exits, new daemon starts and responds to control requests on the same socket paths.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

If the daemon's sockets fail the health check within the first 60 seconds of uptime, shut down without spawning a restart. This prevents infinite restart loops when the underlying issue is systemic (e.g. filesystem permissions, broken paths, another process continuously deleting sockets faster than the daemon can start).

If sockets fail after 60+ seconds of healthy operation, the failure is likely transient (manual cleanup, racing daemon) and self-restart is appropriate.

The minimum uptime is overridable via GIT_AI_DAEMON_MIN_UPTIME_FOR_RESTART_SECS for testing.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
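
The restart gate reduces to a small decision on uptime, sketched below. FailureAction, action_on_probe_failure, and min_uptime_from are hypothetical names; only the 60-second default and the environment variable name come from the commit message, and treating exactly 60 seconds as eligible for restart is an assumption.

```rust
use std::time::Duration;

/// What the daemon does after a failed socket probe (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
enum FailureAction {
    /// Failure came too soon after start: exit without respawning.
    ShutdownOnly,
    /// Failure came after a healthy period: spawn the restart, then exit.
    RestartThenShutdown,
}

/// Parses the minimum-uptime override, falling back to the 60s default.
fn min_uptime_from(raw: Option<&str>) -> Duration {
    raw.and_then(|s| s.trim().parse::<u64>().ok())
        .map(Duration::from_secs)
        .unwrap_or(Duration::from_secs(60))
}

/// Early failures suggest a systemic problem, so they do not trigger a
/// restart; this is what breaks a potential restart loop.
fn action_on_probe_failure(uptime: Duration, min_uptime: Duration) -> FailureAction {
    if uptime < min_uptime {
        FailureAction::ShutdownOnly
    } else {
        FailureAction::RestartThenShutdown
    }
}

fn main() {
    let min_uptime = min_uptime_from(
        std::env::var("GIT_AI_DAEMON_MIN_UPTIME_FOR_RESTART_SECS").ok().as_deref(),
    );
    for secs in [5, 59, 60, 3600] {
        let action = action_on_probe_failure(Duration::from_secs(secs), min_uptime);
        println!("uptime {:>4}s -> {:?}", secs, action);
    }
}
```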

devin-ai-integration Bot reviewed Apr 16, 2026


svarlamov and others added 7 commits April 17, 2026 00:01

Format test code

48354c1

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

svarlamov merged commit 8ae8bba into main Apr 17, 2026
23 of 26 checks passed

svarlamov deleted the investigate/wrapper-timeout-1079 branch April 17, 2026 21:37

svarlamov merged 8 commits into main from investigate/wrapper-timeout-1079
