# ha-diag-agent Per-host Home Assistant diagnostic agent. Polls HA REST API on a schedule, emits structured events to `/opt/homelab/events//`, and exposes an HTTP API for health checks and manual check triggers. Follows the same event-pipeline pattern as `node-agent`: filesystem-first, no direct supervisor integration, events processed by the VPS observer. ## Architecture ``` APScheduler (interval-based REST checks) ├─ HeartbeatCheck → pings /api/, emits ha_websocket_dead on failure ├─ UnavailableEntitiesCheck → entity unavailable > threshold ├─ SystemHealthCheck → /api/system_health per-integration status ├─ AutomationFailuresCheck → automation last-run error traces └─ UpdatesAvailableCheck → pending HA/integration updates WebSocketMonitor (persistent, long-running — Phase 4b) └─ Maintains a live WS subscription to state_changed events Any traffic = HA is alive. Watchdog fires ha_websocket_dead on silence > 5min or on disconnect. Emits ha_websocket_recovered when the connection is restored after a dead alert. FastAPI (port 8087) GET /health → liveness probe (includes ws_connected field) POST /trigger/ → run a named check on demand SQLite (/data/ha_diag.db) entity_baseline → last-known entity states check_history → per-check run log alerts_sent → dedup gate for alert events ``` The WebSocketMonitor is the only persistent-connection component; all other checks are APScheduler intervals (stateless REST polls). ## Event Types | Type | Severity | Trigger | |------|----------|---------| | `ha_websocket_dead` | error | WS disconnect, silence > 5min, or /api/ unreachable | | `ha_websocket_recovered` | info | WS reconnected after a dead alert (clears incident) | | `ha_integration_failed` | error | Integration in error state | | `ha_entity_unavailable_long` | warning | Entity unavailable > threshold | | `ha_automation_failing` | warning | Automation last run errored | | `ha_update_available` | info | HA or integration update pending | | `ha_recorder_lag` | warning | Recorder write lag > threshold | | `ha_system_health_degraded` | warning | System health check failed | Event routing in supervisor (Phase 5) maps these to `notify` actions. `ha_websocket_recovered` should be routed to clear any active `ha_websocket_dead` incident. ## First-time deployment See **[DEPLOY.md](DEPLOY.md)** for the full procedure: HA token creation, per-host `.env` config, deploy commands, verification steps, 48h shadow-mode observation, and rollback. **Shadow mode** (`HA_DIAG_SHADOW_MODE`, default `true` on the control-plane): `ha_websocket_dead` events are downgraded to `alert_only` with a `[SHADOW MODE]` note instead of queuing an automatic `container_restart`. Set to `false` in `/opt/homelab/config/control-plane/.env` on the VPS when ready for live actions. ## Deployment model The agent is deployed **per-host** but targets a potentially remote HA instance: | Node | Agent runs on | HA lives on | HA URL | |------|--------------|-------------|--------| | piha | piha | piha (localhost) | `http://localhost:8123` | | chelsty-infra | chelsty-infra | chelsty-ha (HAOS VM, separate machine) | `http://100.70.180.90:8123` | **chelsty-infra note:** Home Assistant runs on `chelsty-ha`, a dedicated Home Assistant OS VM. `chelsty-infra` is the hypervisor but does not run HA itself. The agent on `chelsty-infra` reaches HA over the Tailscale network (`100.70.180.90:8123`). If `chelsty-ha` gets a new Tailscale IP, update `HA_URL` in `/opt/homelab/config/ha-diag-agent/.env` on `chelsty-infra`. ## Deployment ```bash # 1. Create config on target node ssh oskar@ mkdir -p /opt/homelab/config/ha-diag-agent /var/lib/ha-diag-agent cat > /opt/homelab/config/ha-diag-agent/.env << 'EOF' HA_URL=http://homeassistant.local:8123 # or http://100.70.180.90:8123 for chelsty-infra HA_TOKEN= NODE_NAME=piha # or chelsty-infra LOCATION_TAG=ken # or chelsty CHECK_INTERVAL=60 EOF # 2. Deploy scripts/deploy/deploy.sh --service ha-diag-agent # 3. Verify docker ps --filter name=ha-diag-agent curl http://localhost:8087/health ``` ### chelsty-infra note `chelsty-infra` runs docker-compose v1 (1.29.2). Use `docker-compose` (hyphenated): ```bash docker-compose -f docker-compose.yml up -d --build ``` ### HA long-lived token In HA UI: Profile → Long-Lived Access Tokens → Create token. ## Running Tests ```bash cd services/ha-diag-agent pip install -e ".[dev]" pytest tests/ -v ``` ## Optional YAML config Place `/opt/homelab/config/ha-diag-agent/ha-diag-agent.yaml` on the node. Values there are defaults; env vars take priority. ```yaml ha_url: http://homeassistant.local:8123 location_tag: ken check_interval: 60 ```