# ha-diag-agent Deployment Guide

## Section 1: Prerequisites

### HA long-lived access token

The agent authenticates to Home Assistant with a long-lived token issued by a dedicated service account. Do not use a personal admin token.

1. In HA: **Settings → People → Add Person**
   - Name: `diag_agent`
   - Do **not** add to any group (no admin rights needed)
2. Log in to HA as `diag_agent`
3. Go to **Profile → Long-Lived Access Tokens → Create token**
   - Name: `ha-diag-agent`
   - Copy the token; it is shown only once
4. Store the token in the node's `.env` file (see Section 2)

### Tailnet reachability check (chelsty-infra only)

`chelsty-infra` reaches Home Assistant on `chelsty-ha` over Tailscale. Verify before deploying:

```bash
curl -sf http://100.70.180.90:8123/api/ \
  -H "Authorization: Bearer " | python3 -m json.tool
# Expect: {"message": "API running."}
```

If the request times out, check that both nodes are on the Tailscale mesh (`tailscale status`) and that `chelsty-ha` is powered on.

---

## Section 2: Per-host config

Create `/opt/homelab/config/ha-diag-agent/.env` on **each target node**:

### piha

```bash
mkdir -p /opt/homelab/config/ha-diag-agent
cat > /opt/homelab/config/ha-diag-agent/.env << 'EOF'
HA_URL=http://localhost:8123
HA_TOKEN=
NODE_NAME=piha
LOCATION_TAG=ken
CHECK_INTERVAL=60
CHECK_INTERVAL_UNAVAILABLE=3600
UNAVAILABLE_THRESHOLD_HOURS=24
ALERT_COOLDOWN_HOURS=6
LOG_LEVEL=info
EOF
chmod 600 /opt/homelab/config/ha-diag-agent/.env
```

### chelsty-infra

```bash
mkdir -p /opt/homelab/config/ha-diag-agent
cat > /opt/homelab/config/ha-diag-agent/.env << 'EOF'
HA_URL=http://100.70.180.90:8123
HA_TOKEN=
NODE_NAME=chelsty-infra
LOCATION_TAG=chelsty
CHECK_INTERVAL=60
CHECK_INTERVAL_UNAVAILABLE=3600
UNAVAILABLE_THRESHOLD_HOURS=24
ALERT_COOLDOWN_HOURS=6
LOG_LEVEL=info
EOF
chmod 600 /opt/homelab/config/ha-diag-agent/.env
```

> If `chelsty-ha` gets a new Tailscale IP, update `HA_URL` in this file and
> restart the container.

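
Because the `.env` templates in Section 2 leave `HA_TOKEN` to be filled in by hand, a quick pre-deploy check can catch an empty value or a wrong file mode. The following is a minimal sketch, not part of ha-diag-agent: the function name `check_env_file` is hypothetical, and the key list is simply copied from the templates above.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not shipped with ha-diag-agent).
# Keys are taken from the Section 2 .env templates.
REQUIRED_KEYS="HA_URL HA_TOKEN NODE_NAME LOCATION_TAG CHECK_INTERVAL \
CHECK_INTERVAL_UNAVAILABLE UNAVAILABLE_THRESHOLD_HOURS ALERT_COOLDOWN_HOURS LOG_LEVEL"

# check_env_file FILE
# Prints one line per problem; returns 0 if clean, 1 on problems,
# 2 if the file cannot be read.
check_env_file() {
  local file="$1" key status=0
  if [ ! -r "$file" ]; then
    echo "cannot read $file"
    return 2
  fi
  for key in $REQUIRED_KEYS; do
    # A key counts as set only if "KEY=" is followed by at least one character.
    if ! grep -Eq "^${key}=.+" "$file"; then
      echo "missing or empty: $key"
      status=1
    fi
  done
  # The templates finish with chmod 600; flag any other mode.
  if [ -z "$(find "$file" -maxdepth 0 -perm 600)" ]; then
    echo "mode is not 600"
    status=1
  fi
  return "$status"
}
```

After sourcing the file, running `check_env_file /opt/homelab/config/ha-diag-agent/.env` on the target node prints nothing and returns 0 when every key has a value and the mode is 600.
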

---

## Section 3: Deploy procedure

### From SATURN (standard flow)

```bash
# 1. Commit and push changes from SATURN
git push

# 2. SSH to target node
ssh oskar@piha   # or chelsty-infra

# 3. Pull latest and deploy
cd ~/homelab-codex-ws
git pull
scripts/deploy/deploy.sh --service ha-diag-agent
```

### chelsty-infra (docker-compose v1)

`chelsty-infra` runs docker-compose v1 (1.29.2). The deploy script calls `docker-compose` (hyphenated), which is correct. If you need to run manually:

```bash
cd ~/homelab-codex-ws/services/ha-diag-agent
docker-compose up -d --build
```

---

## Section 4: Verification

```bash
# Container is up
docker ps | grep ha-diag-agent

# Last 50 log lines
docker logs ha-diag-agent --tail 50

# FastAPI health endpoint
curl http://localhost:8087/health
# Expect: {"status": "ok", "ws_connected": true, ...}

# Events are being written
ls /opt/homelab/events//
# Expect: ha_*.json files appearing within the first CHECK_INTERVAL seconds

# Supervisor is picking up events (check on VPS / control-plane)
tail -f /opt/homelab/logs/supervisor.log | grep ha_
```

---

## Section 5: First-48h observation (shadow mode)

The supervisor starts with `HA_DIAG_SHADOW_MODE=true` (default). During this window, `ha_websocket_dead` events are downgraded to `alert_only` actions tagged `[SHADOW MODE]` rather than triggering an automatic restart.

Watch for these signals in Telegram:

- `[SHADOW MODE] would have triggered container_restart for homeassistant`: confirms the detection path works end-to-end
- `ha_entity_unavailable_long` / `ha_integration_failed` / etc.: these are always `alert_only` regardless of shadow mode; verify descriptions look accurate and thresholds are reasonable

Things to evaluate:

| Question | Good sign |
|----------|-----------|
| Are shadow alerts firing at reasonable frequency? | ≤ 1 per 30 min per node |
| Are there false positives? | No alerts during known-good uptime |
| Are entity-unavailable alerts describing real entities? | Yes, names match HA UI |
| Are integration-failed alerts genuine? | Yes, not noise from startup |

Note any false positives or noisy thresholds before enabling production mode.

---

## Section 6: Enabling production mode

`HA_DIAG_SHADOW_MODE` is an environment variable read by the supervisor container. The VPS supervisor env vars live in the version-controlled override file at `hosts/vps/runtime/control-plane/docker-compose.override.yml` (not in a runtime `.env` file; the supervisor has no `env_file:` directive).

When the 48h observation period looks clean:

**1. Edit the override file on SATURN:**

```yaml
# hosts/vps/runtime/control-plane/docker-compose.override.yml
services:
  supervisor:
    environment:
      - NODE_ALIAS_MAP={"node-2":"chelsty"}
      - HA_DIAG_SHADOW_MODE=false   # add this line
```

**2. Commit and push from SATURN:**

```bash
git add hosts/vps/runtime/control-plane/docker-compose.override.yml
git commit -m "feat(control-plane): disable HA shadow mode — production ready"
git push
```

**3. Apply on VPS:**

```bash
ssh oskar@100.95.58.48
cd ~/homelab-codex-ws && git pull
docker compose \
  -f services/control-plane/docker-compose.yml \
  -f hosts/vps/runtime/control-plane/docker-compose.override.yml \
  up -d supervisor
```

**4. Confirm:**

```bash
docker logs control-plane-supervisor --tail 5
# Expect: shadow_mode=False — HA container_restart actions enabled
```

From this point, the next `ha_websocket_dead` event will generate a `container_restart` action in the approval queue. The 30-minute cooldown still applies after each restart.

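
The shadow-mode behaviour described in Sections 5 and 6 can be summarised as a small decision function. This is an illustrative sketch only: the supervisor's real implementation is not shown in this guide, `decide_action` is a hypothetical name, and treating any value other than the literal `false` as "shadow mode on" is an assumption about how the variable is parsed.

```bash
#!/usr/bin/env bash
# Illustrative only: not the supervisor's actual code.
# decide_action EVENT_TYPE
# Prints the action this guide describes for the given event type.
decide_action() {
  local event_type="$1"
  # Unset means shadow mode is on, matching the documented default.
  # Assumption: only the exact string "false" disables it.
  local shadow="${HA_DIAG_SHADOW_MODE:-true}"
  case "$event_type" in
    ha_websocket_dead)
      if [ "$shadow" = "false" ]; then
        echo "container_restart"          # goes to the approval queue
      else
        echo "alert_only [SHADOW MODE]"   # downgraded during observation
      fi
      ;;
    ha_*)
      # All other HA events are alert_only regardless of shadow mode.
      echo "alert_only"
      ;;
    *)
      echo "unknown event type: $event_type" >&2
      return 1
      ;;
  esac
}
```

With `HA_DIAG_SHADOW_MODE` unset, `decide_action ha_websocket_dead` prints `alert_only [SHADOW MODE]`; with it set to `false`, the same call prints `container_restart`. The sketch does not model the 30-minute cooldown.
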

---

## Section 7: Rollback

If production mode causes unexpected behaviour:

```bash
# Option A: re-enable shadow mode
# On SATURN: edit hosts/vps/runtime/control-plane/docker-compose.override.yml
# Set HA_DIAG_SHADOW_MODE=true (or remove the line; default is true)
# Commit, push, then on VPS:
ssh oskar@100.95.58.48
cd ~/homelab-codex-ws && git pull
docker compose \
  -f services/control-plane/docker-compose.yml \
  -f hosts/vps/runtime/control-plane/docker-compose.override.yml \
  up -d supervisor

# Option B: stop ha-diag-agent entirely on affected nodes
ssh oskar@
docker stop ha-diag-agent

# Events written before rollback remain in /opt/homelab/events//
# and are historical only; no automated action will be taken on them
# unless the supervisor re-processes them, which it won't (already in
# _ha_processed_event_ids).
```

Any `container_restart` actions still in `pending/` after rollback can be manually rejected via the Telegram bot or by deleting the action files from `/opt/homelab/actions/pending/` on the VPS.
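
Before deleting anything from `pending/`, it helps to list which files are actually restart actions. The sketch below is a read-only helper under a loud assumption: that each action file is plain text (for example JSON) containing the literal string `container_restart`. The on-disk format is not documented in this guide, and `list_pending_restarts` is a hypothetical name.

```bash
#!/usr/bin/env bash
# Hypothetical read-only helper; it never deletes anything.
# Assumption: action files are text and mention "container_restart" literally.
# list_pending_restarts [DIR]
# Prints the matching file paths, one per line, sorted.
list_pending_restarts() {
  local dir="${1:-/opt/homelab/actions/pending}"
  # A missing directory simply means there is nothing pending.
  [ -d "$dir" ] || return 0
  # -r: recurse, -l: print file names only, -F: fixed-string match.
  grep -rlF -- "container_restart" "$dir" 2>/dev/null | sort
  return 0
}
```

Run `list_pending_restarts` on the VPS, review the printed paths, and only then remove the files you want to reject.
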