Watchdog Runbook
Opt-in watchdog onboarding, detached rollout, rollback, and failure recovery
Watchdog supervision is explicit opt-in. It is never enabled by default.
This page is the final, advanced stage of the tx docs journey. Use it after you are already comfortable with Primitives and the Headful Experience.
Use watchdog when you need detached, long-running RALPH loops with automatic health checks and stale-run recovery.
When to Enable Watchdog
Enable watchdog only when all of the following are true:
- Your normal loop (tx ready, tx done, tx block) is stable.
- You need detached execution that survives terminal closes/logout.
- You need automatic stalled-run and orphan-task recovery.
- You are comfortable rolling back to manual operation if needed.
Onboarding and Rollout
1. Scaffold watchdog assets
tx init --watchdog --watchdog-runtime auto
Generated files:
scripts/ralph-watchdog.sh
scripts/ralph-hourly-supervisor.sh
scripts/watchdog-launcher.sh
.tx/watchdog.env
ops/watchdog/com.tx.ralph-watchdog.plist
ops/watchdog/tx-ralph-watchdog.service
Runtime modes:
| Mode | Behavior |
|---|---|
| auto | Enables only installed runtimes; may scaffold with WATCHDOG_ENABLED=0 if no runtime is found |
| codex | Fails unless codex is available in PATH |
| claude | Fails unless claude is available in PATH |
| both | Fails unless both codex and claude are available |
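The table can be read as a small decision procedure. The sketch below is illustrative only and is not the tx implementation: `have` and `resolve_runtimes` are hypothetical helper names, and the exit codes are assumptions chosen for the example.

```shell
# Illustrative only: mirrors the runtime-mode table, not the actual tx code.

# have NAME: succeeds when NAME resolves to a command in PATH.
have() { command -v "$1" >/dev/null 2>&1; }

# resolve_runtimes MODE: prints the runtimes that would be enabled, one per
# line, and returns non-zero where the table says the mode fails.
resolve_runtimes() (
  mode=$1
  case "$mode" in
    auto)
      # auto never fails; it may print nothing (the WATCHDOG_ENABLED=0 case).
      for rt in codex claude; do
        if have "$rt"; then echo "$rt"; fi
      done
      ;;
    codex|claude)
      have "$mode" || { echo "$mode not found in PATH" >&2; exit 1; }
      echo "$mode"
      ;;
    both)
      for rt in codex claude; do
        have "$rt" || { echo "$rt not found in PATH" >&2; exit 1; }
      done
      printf '%s\n' codex claude
      ;;
    *)
      echo "unknown mode: $mode" >&2
      exit 2
      ;;
  esac
)
```

Running `resolve_runtimes auto` on a machine with no runtime installed prints nothing and still succeeds, which corresponds to the case where scaffolding completes with WATCHDOG_ENABLED=0.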
2. Validate and smoke-test
command -v codex || true
command -v claude || true
cat .tx/watchdog.env
./scripts/watchdog-launcher.sh start
./scripts/watchdog-launcher.sh status
tail -n 50 .tx/ralph-watchdog.log
Detached Supervision
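Both service definitions in this section are templates whose __PROJECT_DIR__ placeholders are filled in with sed. A plain substitution can misbehave if the project path contains `&`, `|`, or a backslash, because those characters are special in a sed replacement. The following is a minimal sketch of a more defensive render step; `render_template` is a hypothetical helper name, and paths containing newlines are not handled.

```shell
# render_template SRC DEST PROJECT_DIR
# Replaces every __PROJECT_DIR__ in SRC and writes the result to DEST.
render_template() (
  src=$1 dest=$2 project_dir=$3
  # Escape characters that are special in a sed replacement: backslash,
  # ampersand, and the "|" delimiter used by the commands in this runbook.
  escaped=$(printf '%s' "$project_dir" | sed 's/[\\&|]/\\&/g') || exit 1
  sed "s|__PROJECT_DIR__|$escaped|g" "$src" > "$dest" || exit 1
  # Refuse to continue with an empty rendered file.
  if [ ! -s "$dest" ]; then
    echo "rendered file is empty: $dest" >&2
    exit 1
  fi
)
```

For example, `render_template ops/watchdog/tx-ralph-watchdog.service "$HOME/.config/systemd/user/tx-ralph-watchdog.service" "$(pwd)"` would stand in for the sed line in the systemd steps.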
macOS launchd (user agent)
The scaffolded plist contains __PROJECT_DIR__ placeholders. Render it before loading:
PROJECT_DIR="$(pwd)"
PLIST_SRC="ops/watchdog/com.tx.ralph-watchdog.plist"
PLIST_DEST="$HOME/Library/LaunchAgents/com.tx.ralph-watchdog.plist"
mkdir -p "$HOME/Library/LaunchAgents"
sed "s|__PROJECT_DIR__|$PROJECT_DIR|g" "$PLIST_SRC" > "$PLIST_DEST"
launchctl bootout "gui/$(id -u)" "$PLIST_DEST" 2>/dev/null || true
launchctl bootstrap "gui/$(id -u)" "$PLIST_DEST"
launchctl kickstart -k "gui/$(id -u)/com.tx.ralph-watchdog"
launchctl print "gui/$(id -u)/com.tx.ralph-watchdog" | head -n 20
Linux systemd (user unit)
The scaffolded unit also contains __PROJECT_DIR__ placeholders:
PROJECT_DIR="$(pwd)"
UNIT_SRC="ops/watchdog/tx-ralph-watchdog.service"
UNIT_DEST="$HOME/.config/systemd/user/tx-ralph-watchdog.service"
mkdir -p "$HOME/.config/systemd/user"
sed "s|__PROJECT_DIR__|$PROJECT_DIR|g" "$UNIT_SRC" > "$UNIT_DEST"
systemctl --user daemon-reload
systemctl --user enable --now tx-ralph-watchdog.service
systemctl --user status tx-ralph-watchdog.service --no-pager
journalctl --user -u tx-ralph-watchdog.service -n 100 --no-pager
If the service must keep running after logout, enable lingering once:
loginctl enable-linger "$USER"
Failure Recovery
Use this when runs/tasks look stuck after runtime crashes, machine sleep, or abrupt terminal shutdown.
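Step 4 of this procedure removes PID files unconditionally. Where a watchdog process might still be alive, a more cautious variant removes a PID file only when it no longer points at a running process. The sketch below assumes each PID file holds a single numeric PID, which this runbook does not confirm; `pid_file_is_stale` and `remove_stale_pid_files` are hypothetical helper names.

```shell
# pid_file_is_stale FILE: succeeds only when FILE exists and does not name
# a process that the current user can signal.
pid_file_is_stale() (
  file=$1
  [ -f "$file" ] || exit 1
  pid=$(tr -d '[:space:]' < "$file")
  case "$pid" in
    ''|*[!0-9]*) exit 0 ;;   # empty or non-numeric content counts as stale
  esac
  # kill -0 sends no signal; it only tests whether the PID can be signalled.
  # Caveat: it also fails for a live process owned by another user, and a
  # recycled PID can make a stale file look live.
  if kill -0 "$pid" 2>/dev/null; then exit 1; fi
  exit 0
)

# Remove only those PID files listed in this runbook that look stale.
remove_stale_pid_files() {
  for f in .tx/ralph-watchdog.pid .tx/ralph-hourly-supervisor.pid \
           .tx/ralph-codex-live.pid .tx/ralph-claude-live.pid; do
    if pid_file_is_stale "$f"; then
      echo "removing stale $f"
      rm -f "$f"
    fi
  done
}
```

Run `remove_stale_pid_files` from the project root; files that still belong to a live process are left in place.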
1. Inspect current state
tx trace list --hours 6
tx trace errors --hours 6
tx list --status active
2. Reap stalled runs
tx trace stalled --reap --transcript-idle-seconds 300 --heartbeat-lag-seconds 180
3. Reconcile orphaned runs/tasks without starting new loops
/bin/bash ./scripts/ralph-watchdog.sh --once --no-start
4. Clear stale PID locks if status still reports stale files
rm -f .tx/ralph-watchdog.pid \
.tx/ralph-hourly-supervisor.pid \
.tx/ralph-codex-live.pid \
.tx/ralph-claude-live.pid
5. Reset any remaining stuck active tasks
tx list --status active --json | jq -r '.[].id' | while IFS= read -r task_id; do
[ -n "$task_id" ] && tx reset "$task_id"
done
6. Verify recovery
tx ready --limit 10
tx trace errors --hours 1 --limit 20
Rollback
Use rollback when you need to disable watchdog supervision and return to manual loop operation.
1. Stop detached service manager
macOS:
launchctl bootout "gui/$(id -u)" "$HOME/Library/LaunchAgents/com.tx.ralph-watchdog.plist" 2>/dev/null || true
Linux:
systemctl --user disable --now tx-ralph-watchdog.service || true
systemctl --user daemon-reload
2. Stop launcher-managed watchdog process
./scripts/watchdog-launcher.sh stop || true
3. Disable watchdog in config
Set WATCHDOG_ENABLED=0 in .tx/watchdog.env.
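If you prefer to script this edit, the following sketch updates or appends a key in the env file. It assumes the file consists of plain KEY=value lines, which matches the settings named in this runbook but is not otherwise confirmed; `set_env_value` is a hypothetical helper name.

```shell
# set_env_value FILE KEY VALUE
# Rewrites the KEY=... line in FILE, or appends one if KEY is absent.
set_env_value() (
  file=$1 key=$2 value=$3
  [ -f "$file" ] || { echo "no such file: $file" >&2; exit 1; }
  tmp=$(mktemp "${file}.XXXXXX") || exit 1
  awk -v k="$key" -v v="$value" '
    index($0, k "=") == 1 { print k "=" v; found = 1; next }
    { print }
    END { if (!found) print k "=" v }
  ' "$file" > "$tmp" || { rm -f "$tmp"; exit 1; }
  # Copy back in place so the original file keeps its permissions.
  cat "$tmp" > "$file"
  rm -f "$tmp"
)
```

With that helper, this step becomes `set_env_value .tx/watchdog.env WATCHDOG_ENABLED 0`. Values are assumed to be simple tokens such as 0 or 1, since awk interprets backslash escapes in the value.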
4. Clean stale PID files
rm -f .tx/ralph-watchdog.pid \
.tx/ralph-hourly-supervisor.pid \
.tx/ralph-codex-live.pid \
.tx/ralph-claude-live.pid
5. Continue with manual loop
/bin/bash ./scripts/ralph.sh
Troubleshooting
Missing runtime CLI (codex or claude)
Symptoms:
- tx init --watchdog --watchdog-runtime codex|claude|both fails
- ./scripts/watchdog-launcher.sh start reports unavailable runtime(s)
Actions:
- Verify runtime binaries: command -v codex and command -v claude
- Install missing runtime CLIs or disable them in .tx/watchdog.env: WATCHDOG_CODEX_ENABLED=0 or WATCHDOG_CLAUDE_ENABLED=0
- Restart watchdog: ./scripts/watchdog-launcher.sh restart
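The first two actions can be combined into one check that prints the overrides needed for whichever runtime is missing. This is a sketch, not part of the scaffolded scripts; `have` and `missing_runtime_overrides` are hypothetical helper names.

```shell
# have NAME: succeeds when NAME resolves to a command in PATH.
have() { command -v "$1" >/dev/null 2>&1; }

# Prints one KEY=0 line per runtime CLI that is missing from PATH.
# Prints nothing when both runtimes are installed.
missing_runtime_overrides() {
  have codex  || echo "WATCHDOG_CODEX_ENABLED=0"
  have claude || echo "WATCHDOG_CLAUDE_ENABLED=0"
  return 0
}
```

Review the output before copying any line into .tx/watchdog.env; installing the missing CLI is the alternative fix.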
Noisy restart loops
Symptoms:
- Frequent watchdog restarts
- Repeated error-burst messages in .tx/ralph-watchdog.log
Actions:
- Inspect logs:
  - .tx/ralph-watchdog.log
  - .tx/ralph-watchdog.daemon.out
  - journalctl --user -u tx-ralph-watchdog.service (systemd)
- Increase stabilization thresholds in .tx/watchdog.env:
  - WATCHDOG_ERROR_BURST_THRESHOLD
  - WATCHDOG_RESTART_COOLDOWN_SECONDS
  - WATCHDOG_TRANSCRIPT_IDLE_SECONDS
  - WATCHDOG_RUN_STALE_SECONDS
- Temporarily pause supervision:
  - set WATCHDOG_ENABLED=0
  - run ./scripts/watchdog-launcher.sh stop
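Before raising any threshold, it helps to see what is currently configured. The sketch below prints the four stabilization keys from the env file, assuming plain KEY=value lines; `show_watchdog_thresholds` is a hypothetical helper name, and no default values are implied for keys that are absent.

```shell
# show_watchdog_thresholds [FILE]: prints the current value of each
# stabilization key, or "<unset>" when the key is absent from FILE.
show_watchdog_thresholds() (
  file=${1:-.tx/watchdog.env}
  for key in WATCHDOG_ERROR_BURST_THRESHOLD WATCHDOG_RESTART_COOLDOWN_SECONDS \
             WATCHDOG_TRANSCRIPT_IDLE_SECONDS WATCHDOG_RUN_STALE_SECONDS; do
    # Report the last assignment; if the file is sourced by a shell,
    # later lines would override earlier ones.
    value=$(sed -n "s/^${key}=//p" "$file" | tail -n 1)
    printf '%s=%s\n' "$key" "${value:-<unset>}"
  done
)
```

Record the printed values before editing so that a change that makes the loop noisier can be reverted.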
Stale PID warnings
Run:
./scripts/watchdog-launcher.sh status
If status shows a stale PID, remove the PID files (see the Rollback and Failure Recovery sections) and start again:
./scripts/watchdog-launcher.sh start
Exit Criteria
If detached supervision introduces noise, disable it and return to manual loops. Watchdog is optional infrastructure, not a required mode.
Rollback path:
- Follow the Rollback section.
- Set WATCHDOG_ENABLED=0.
- Run your normal manual loop (./scripts/ralph.sh) until you are ready to re-enable watchdog.