- hosts/piha/runtime/ha-diag-agent/docker-compose.override.yml: mem_limit
128m, hardcoded events volume (/opt/homelab/events/piha:/events) to avoid
${NODE_NAME} shell-expansion issue in deploy-node.sh
- services/ha-diag-agent/env.example: per-host HA_URL comments (piha vs
chelsty-infra tailscale), HA_TOKEN source note
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
||
|---|---|---|
| .. | ||
| src/ha_diag | ||
| tests | ||
| DEPLOY.md | ||
| docker-compose.yml | ||
| Dockerfile | ||
| env.example | ||
| healthcheck.sh | ||
| pyproject.toml | ||
| README.md | ||
| service.yaml | ||
ha-diag-agent
Per-host Home Assistant diagnostic agent. Polls HA REST API on a schedule,
emits structured events to /opt/homelab/events/<node>/, and exposes an
HTTP API for health checks and manual check triggers.
Follows the same event-pipeline pattern as node-agent: filesystem-first,
no direct supervisor integration, events processed by the VPS observer.
Architecture
APScheduler (interval-based REST checks)
├─ HeartbeatCheck → pings /api/, emits ha_websocket_dead on failure
├─ UnavailableEntitiesCheck → entity unavailable > threshold
├─ SystemHealthCheck → /api/system_health per-integration status
├─ AutomationFailuresCheck → automation last-run error traces
└─ UpdatesAvailableCheck → pending HA/integration updates
WebSocketMonitor (persistent, long-running — Phase 4b)
└─ Maintains a live WS subscription to state_changed events
Any traffic = HA is alive. Watchdog fires ha_websocket_dead on
silence > 5min or on disconnect. Emits ha_websocket_recovered
when the connection is restored after a dead alert.
FastAPI (port 8087)
GET /health → liveness probe (includes ws_connected field)
POST /trigger/<check> → run a named check on demand
SQLite (/data/ha_diag.db)
entity_baseline → last-known entity states
check_history → per-check run log
alerts_sent → dedup gate for alert events
The WebSocketMonitor is the only persistent-connection component; all other checks are APScheduler intervals (stateless REST polls).
Event Types
| Type | Severity | Trigger |
|---|---|---|
ha_websocket_dead |
error | WS disconnect, silence > 5min, or /api/ unreachable |
ha_websocket_recovered |
info | WS reconnected after a dead alert (clears incident) |
ha_integration_failed |
error | Integration in error state |
ha_entity_unavailable_long |
warning | Entity unavailable > threshold |
ha_automation_failing |
warning | Automation last run errored |
ha_update_available |
info | HA or integration update pending |
ha_recorder_lag |
warning | Recorder write lag > threshold |
ha_system_health_degraded |
warning | System health check failed |
Event routing in supervisor (Phase 5) maps these to notify actions.
ha_websocket_recovered should be routed to clear any active ha_websocket_dead incident.
First-time deployment
See DEPLOY.md for the full procedure: HA token creation,
per-host .env config, deploy commands, verification steps, 48h shadow-mode
observation, and rollback.
Shadow mode (HA_DIAG_SHADOW_MODE, default true on the control-plane):
ha_websocket_dead events are downgraded to alert_only with a [SHADOW MODE]
note instead of queuing an automatic container_restart. Set to false in
/opt/homelab/config/control-plane/.env on the VPS when ready for live actions.
Deployment model
The agent is deployed per-host but targets a potentially remote HA instance:
| Node | Agent runs on | HA lives on | HA URL |
|---|---|---|---|
| piha | piha | piha (localhost) | http://localhost:8123 |
| chelsty-infra | chelsty-infra | chelsty-ha (HAOS VM, separate machine) | http://100.70.180.90:8123 |
chelsty-infra note: Home Assistant runs on chelsty-ha, a dedicated Home Assistant
OS VM. chelsty-infra is the hypervisor but does not run HA itself. The agent on
chelsty-infra reaches HA over the Tailscale network (100.70.180.90:8123). If chelsty-ha
gets a new Tailscale IP, update HA_URL in /opt/homelab/config/ha-diag-agent/.env on
chelsty-infra.
Deployment
# 1. Create config on target node
ssh oskar@<node-ip>
mkdir -p /opt/homelab/config/ha-diag-agent /var/lib/ha-diag-agent
cat > /opt/homelab/config/ha-diag-agent/.env << 'EOF'
HA_URL=http://homeassistant.local:8123 # or http://100.70.180.90:8123 for chelsty-infra
HA_TOKEN=<long-lived-token>
NODE_NAME=piha # or chelsty-infra
LOCATION_TAG=ken # or chelsty
CHECK_INTERVAL=60
EOF
# 2. Deploy
scripts/deploy/deploy.sh --service ha-diag-agent
# 3. Verify
docker ps --filter name=ha-diag-agent
curl http://localhost:8087/health
chelsty-infra note
chelsty-infra runs docker-compose v1 (1.29.2). Use docker-compose (hyphenated):
docker-compose -f docker-compose.yml up -d --build
HA long-lived token
In HA UI: Profile → Long-Lived Access Tokens → Create token.
Running Tests
cd services/ha-diag-agent
pip install -e ".[dev]"
pytest tests/ -v
Optional YAML config
Place /opt/homelab/config/ha-diag-agent/ha-diag-agent.yaml on the node.
Values there are defaults; env vars take priority.
ha_url: http://homeassistant.local:8123
location_tag: ken
check_interval: 60