2026-05-29 12:26:34 +02:00
|
|
|
# ha-diag-agent
|
|
|
|
|
|
|
|
|
|
Per-host Home Assistant diagnostic agent. Polls HA REST API on a schedule,
|
|
|
|
|
emits structured events to `/opt/homelab/events/<node>/`, and exposes an
|
|
|
|
|
HTTP API for health checks and manual check triggers.
|
|
|
|
|
|
|
|
|
|
Follows the same event-pipeline pattern as `node-agent`: filesystem-first,
|
|
|
|
|
no direct supervisor integration, events processed by the VPS observer.
|
|
|
|
|
|
|
|
|
|
## Architecture
|
|
|
|
|
|
|
|
|
|
```
|
2026-05-29 15:00:18 +02:00
|
|
|
APScheduler (interval-based REST checks)
|
|
|
|
|
├─ HeartbeatCheck → pings /api/, emits ha_websocket_dead on failure
|
|
|
|
|
├─ UnavailableEntitiesCheck → entity unavailable > threshold
|
|
|
|
|
├─ SystemHealthCheck → /api/system_health per-integration status
|
|
|
|
|
├─ AutomationFailuresCheck → automation last-run error traces
|
|
|
|
|
└─ UpdatesAvailableCheck → pending HA/integration updates
|
|
|
|
|
|
|
|
|
|
WebSocketMonitor (persistent, long-running — Phase 4b)
|
|
|
|
|
└─ Maintains a live WS subscription to state_changed events
|
|
|
|
|
Any traffic = HA is alive. Watchdog fires ha_websocket_dead on
|
|
|
|
|
silence > 5min or on disconnect. Emits ha_websocket_recovered
|
|
|
|
|
when the connection is restored after a dead alert.
|
2026-05-29 12:26:34 +02:00
|
|
|
|
|
|
|
|
FastAPI (port 8087)
|
2026-05-29 15:00:18 +02:00
|
|
|
GET /health → liveness probe (includes ws_connected field)
|
2026-05-29 12:26:34 +02:00
|
|
|
POST /trigger/<check> → run a named check on demand
|
|
|
|
|
|
|
|
|
|
SQLite (/data/ha_diag.db)
|
|
|
|
|
entity_baseline → last-known entity states
|
|
|
|
|
check_history → per-check run log
|
|
|
|
|
alerts_sent → dedup gate for alert events
|
|
|
|
|
```
|
|
|
|
|
|
2026-05-29 15:00:18 +02:00
|
|
|
The WebSocketMonitor is the only persistent-connection component; all other
|
|
|
|
|
checks are APScheduler intervals (stateless REST polls).
|
|
|
|
|
|
2026-05-29 12:26:34 +02:00
|
|
|
## Event Types
|
|
|
|
|
|
|
|
|
|
| Type | Severity | Trigger |
|
|
|
|
|
|------|----------|---------|
|
2026-05-29 15:00:18 +02:00
|
|
|
| `ha_websocket_dead` | error | WS disconnect, silence > 5min, or /api/ unreachable |
|
|
|
|
|
| `ha_websocket_recovered` | info | WS reconnected after a dead alert (clears incident) |
|
2026-05-29 12:26:34 +02:00
|
|
|
| `ha_integration_failed` | error | Integration in error state |
|
|
|
|
|
| `ha_entity_unavailable_long` | warning | Entity unavailable > threshold |
|
|
|
|
|
| `ha_automation_failing` | warning | Automation last run errored |
|
|
|
|
|
| `ha_update_available` | info | HA or integration update pending |
|
|
|
|
|
| `ha_recorder_lag` | warning | Recorder write lag > threshold |
|
|
|
|
|
| `ha_system_health_degraded` | warning | System health check failed |
|
|
|
|
|
|
|
|
|
|
Event routing in supervisor (Phase 5) maps these to `notify` actions.
|
2026-05-29 15:00:18 +02:00
|
|
|
`ha_websocket_recovered` should be routed to clear any active `ha_websocket_dead` incident.
|
2026-05-29 12:26:34 +02:00
|
|
|
|
2026-05-29 17:04:39 +02:00
|
|
|
## First-time deployment
|
|
|
|
|
|
|
|
|
|
See **[DEPLOY.md](DEPLOY.md)** for the full procedure: HA token creation,
|
|
|
|
|
per-host `.env` config, deploy commands, verification steps, 48h shadow-mode
|
|
|
|
|
observation, and rollback.
|
|
|
|
|
|
|
|
|
|
**Shadow mode** (`HA_DIAG_SHADOW_MODE`, default `true` on the control-plane):
|
|
|
|
|
`ha_websocket_dead` events are downgraded to `alert_only` with a `[SHADOW MODE]`
|
|
|
|
|
note instead of queuing an automatic `container_restart`. Set to `false` in
|
|
|
|
|
`/opt/homelab/config/control-plane/.env` on the VPS when ready for live actions.
|
|
|
|
|
|
2026-05-29 12:56:13 +02:00
|
|
|
## Deployment model
|
|
|
|
|
|
|
|
|
|
The agent is deployed **per-host** but targets a potentially remote HA instance:
|
|
|
|
|
|
|
|
|
|
| Node | Agent runs on | HA lives on | HA URL |
|
|
|
|
|
|------|--------------|-------------|--------|
|
|
|
|
|
| piha | piha | piha (localhost) | `http://localhost:8123` |
|
|
|
|
|
| chelsty-infra | chelsty-infra | chelsty-ha (HAOS VM, separate machine) | `http://100.70.180.90:8123` |
|
|
|
|
|
|
|
|
|
|
**chelsty-infra note:** Home Assistant runs on `chelsty-ha`, a dedicated Home Assistant
|
|
|
|
|
OS VM. `chelsty-infra` is the hypervisor but does not run HA itself. The agent on
|
|
|
|
|
`chelsty-infra` reaches HA over the Tailscale network (`100.70.180.90:8123`). If `chelsty-ha`
|
|
|
|
|
gets a new Tailscale IP, update `HA_URL` in `/opt/homelab/config/ha-diag-agent/.env` on
|
|
|
|
|
`chelsty-infra`.
|
|
|
|
|
|
2026-05-29 12:26:34 +02:00
|
|
|
## Deployment
|
|
|
|
|
|
|
|
|
|
```bash
|
|
|
|
|
# 1. Create config on target node
|
|
|
|
|
ssh oskar@<node-ip>
|
|
|
|
|
mkdir -p /opt/homelab/config/ha-diag-agent /var/lib/ha-diag-agent
|
|
|
|
|
cat > /opt/homelab/config/ha-diag-agent/.env << 'EOF'
|
2026-05-29 12:56:13 +02:00
|
|
|
HA_URL=http://homeassistant.local:8123 # or http://100.70.180.90:8123 for chelsty-infra
|
2026-05-29 12:26:34 +02:00
|
|
|
HA_TOKEN=<long-lived-token>
|
2026-05-29 12:56:13 +02:00
|
|
|
NODE_NAME=piha # or chelsty-infra
|
|
|
|
|
LOCATION_TAG=ken # or chelsty
|
2026-05-29 12:26:34 +02:00
|
|
|
CHECK_INTERVAL=60
|
|
|
|
|
EOF
|
|
|
|
|
|
|
|
|
|
# 2. Deploy
|
|
|
|
|
scripts/deploy/deploy.sh --service ha-diag-agent
|
|
|
|
|
|
|
|
|
|
# 3. Verify
|
|
|
|
|
docker ps --filter name=ha-diag-agent
|
|
|
|
|
curl http://localhost:8087/health
|
|
|
|
|
```
|
|
|
|
|
|
|
|
|
|
### chelsty-infra note
|
|
|
|
|
|
|
|
|
|
`chelsty-infra` runs docker-compose v1 (1.29.2). Use `docker-compose` (hyphenated):
|
|
|
|
|
```bash
|
|
|
|
|
docker-compose -f docker-compose.yml up -d --build
|
|
|
|
|
```
|
|
|
|
|
|
|
|
|
|
### HA long-lived token
|
|
|
|
|
|
|
|
|
|
In HA UI: Profile → Long-Lived Access Tokens → Create token.
|
|
|
|
|
|
|
|
|
|
## Running Tests
|
|
|
|
|
|
|
|
|
|
```bash
|
|
|
|
|
cd services/ha-diag-agent
|
|
|
|
|
pip install -e ".[dev]"
|
|
|
|
|
pytest tests/ -v
|
|
|
|
|
```
|
|
|
|
|
|
|
|
|
|
## Optional YAML config
|
|
|
|
|
|
|
|
|
|
Place `/opt/homelab/config/ha-diag-agent/ha-diag-agent.yaml` on the node.
|
|
|
|
|
Values there are defaults; env vars take priority.
|
|
|
|
|
|
|
|
|
|
```yaml
|
|
|
|
|
ha_url: http://homeassistant.local:8123
|
|
|
|
|
location_tag: ken
|
|
|
|
|
check_interval: 60
|
|
|
|
|
```
|