homelab-codex-ws/services/ha-diag-agent
Oskar Kapala 9ec43b6829 fix(ha-diag-agent): structlog event kwarg collision + replace aioresponses
- main.py: rename event= to ha_event= in _log.warning() — structlog treats
  'event' as a reserved positional arg; the old name caused TypeError when
  any check returned unhealthy results (events were still emitted, but the
  check was logged as check_error instead of check_unhealthy)
- tests/test_ha_client.py: replace aioresponses with unittest.mock — aioresponses
  0.7.8 is incompatible with aiohttp >=3.12 (missing stream_writer kwarg)
- pyproject.toml: remove aioresponses from dev dependencies

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-03 19:19:34 +02:00
..
src/ha_diag fix(ha-diag-agent): structlog event kwarg collision + replace aioresponses 2026-06-03 19:19:34 +02:00
tests fix(ha-diag-agent): structlog event kwarg collision + replace aioresponses 2026-06-03 19:19:34 +02:00
DEPLOY.md feat(control-plane): shadow_mode for HA event auto-actions + deploy docs 2026-05-29 17:12:33 +02:00
docker-compose.yml feat(ha-diag-agent): scaffold service with HA REST client and event emitter 2026-05-29 12:26:34 +02:00
Dockerfile feat(ha-diag-agent): scaffold service with HA REST client and event emitter 2026-05-29 12:26:34 +02:00
env.example feat(ha-diag-agent): add piha deploy config 2026-06-03 19:19:27 +02:00
healthcheck.sh feat(ha-diag-agent): scaffold service with HA REST client and event emitter 2026-05-29 12:26:34 +02:00
pyproject.toml fix(ha-diag-agent): structlog event kwarg collision + replace aioresponses 2026-06-03 19:19:34 +02:00
README.md feat(control-plane): shadow_mode for HA event auto-actions + deploy docs 2026-05-29 17:12:33 +02:00
service.yaml feat(ha-diag-agent): UnavailableEntitiesCheck with root cause dedup 2026-05-29 13:41:55 +02:00

ha-diag-agent

Per-host Home Assistant diagnostic agent. Polls HA REST API on a schedule, emits structured events to /opt/homelab/events/<node>/, and exposes an HTTP API for health checks and manual check triggers.

Follows the same event-pipeline pattern as node-agent: filesystem-first, no direct supervisor integration, events processed by the VPS observer.

Architecture

APScheduler (interval-based REST checks)
  ├─ HeartbeatCheck            → pings /api/, emits ha_websocket_dead on failure
  ├─ UnavailableEntitiesCheck  → entity unavailable > threshold
  ├─ SystemHealthCheck         → /api/system_health per-integration status
  ├─ AutomationFailuresCheck   → automation last-run error traces
  └─ UpdatesAvailableCheck     → pending HA/integration updates

WebSocketMonitor (persistent, long-running — Phase 4b)
  └─ Maintains a live WS subscription to state_changed events
     Any traffic = HA is alive. Watchdog fires ha_websocket_dead on
     silence > 5min or on disconnect. Emits ha_websocket_recovered
     when the connection is restored after a dead alert.

FastAPI (port 8087)
  GET  /health             → liveness probe (includes ws_connected field)
  POST /trigger/<check>    → run a named check on demand

SQLite (/data/ha_diag.db)
  entity_baseline          → last-known entity states
  check_history            → per-check run log
  alerts_sent              → dedup gate for alert events

The WebSocketMonitor is the only persistent-connection component; all other checks are APScheduler intervals (stateless REST polls).

Event Types

Type Severity Trigger
ha_websocket_dead error WS disconnect, silence > 5min, or /api/ unreachable
ha_websocket_recovered info WS reconnected after a dead alert (clears incident)
ha_integration_failed error Integration in error state
ha_entity_unavailable_long warning Entity unavailable > threshold
ha_automation_failing warning Automation last run errored
ha_update_available info HA or integration update pending
ha_recorder_lag warning Recorder write lag > threshold
ha_system_health_degraded warning System health check failed

Event routing in supervisor (Phase 5) maps these to notify actions. ha_websocket_recovered should be routed to clear any active ha_websocket_dead incident.

First-time deployment

See DEPLOY.md for the full procedure: HA token creation, per-host .env config, deploy commands, verification steps, 48h shadow-mode observation, and rollback.

Shadow mode (HA_DIAG_SHADOW_MODE, default true on the control-plane): ha_websocket_dead events are downgraded to alert_only with a [SHADOW MODE] note instead of queuing an automatic container_restart. Set to false in /opt/homelab/config/control-plane/.env on the VPS when ready for live actions.

Deployment model

The agent is deployed per-host but targets a potentially remote HA instance:

Node Agent runs on HA lives on HA URL
piha piha piha (localhost) http://localhost:8123
chelsty-infra chelsty-infra chelsty-ha (HAOS VM, separate machine) http://100.70.180.90:8123

chelsty-infra note: Home Assistant runs on chelsty-ha, a dedicated Home Assistant OS VM. chelsty-infra is the hypervisor but does not run HA itself. The agent on chelsty-infra reaches HA over the Tailscale network (100.70.180.90:8123). If chelsty-ha gets a new Tailscale IP, update HA_URL in /opt/homelab/config/ha-diag-agent/.env on chelsty-infra.

Deployment

# 1. Create config on target node
ssh oskar@<node-ip>
mkdir -p /opt/homelab/config/ha-diag-agent /var/lib/ha-diag-agent
cat > /opt/homelab/config/ha-diag-agent/.env << 'EOF'
HA_URL=http://homeassistant.local:8123   # or http://100.70.180.90:8123 for chelsty-infra
HA_TOKEN=<long-lived-token>
NODE_NAME=piha                           # or chelsty-infra
LOCATION_TAG=ken                         # or chelsty
CHECK_INTERVAL=60
EOF

# 2. Deploy
scripts/deploy/deploy.sh --service ha-diag-agent

# 3. Verify
docker ps --filter name=ha-diag-agent
curl http://localhost:8087/health

chelsty-infra note

chelsty-infra runs docker-compose v1 (1.29.2). Use docker-compose (hyphenated):

docker-compose -f docker-compose.yml up -d --build

HA long-lived token

In HA UI: Profile → Long-Lived Access Tokens → Create token.

Running Tests

cd services/ha-diag-agent
pip install -e ".[dev]"
pytest tests/ -v

Optional YAML config

Place /opt/homelab/config/ha-diag-agent/ha-diag-agent.yaml on the node. Values there are defaults; env vars take priority.

ha_url: http://homeassistant.local:8123
location_tag: ken
check_interval: 60