History

Oskar Kapala 3499b2f280 feat(ha-diag-agent): three REST diagnostic checks + Phase 3 flag fixes New checks: - SystemHealthCheck (15min interval): detects newly-failing HA integrations via /api/system_health snapshot diff; transition-based dedup (ok→error fires, sustained error silent, error→ok clears alert) - UpdatesAvailableCheck (daily cron 09:00): per-update ha_update_available events with 7-day dedup; release notes truncated at 2000 chars - UpdatesDigestCheck (Sunday cron 09:00): single digest event with all pending updates; weekly ISO-week dedup, independent of daily dedup key - AutomationFailuresCheck (30min interval): detects automations with N consecutive failures (default 3) via /api/trace/automation/<id>; 6h cooldown per automation Phase 3 flag fixes: - Flag #1 (since field): UnavailableEntitiesCheck now uses min(state.last_changed, baseline.first_seen) as effective "since", giving accurate duration when agent was offline at entity's first fail - Flag #3 (registry cache): HAClient.get_entity_registry() caches response in-process with configurable TTL (default 300s); avoids repeated API calls across concurrent check cycles; invalidate_registry_cache() for manual invalidation Storage: system_health_snapshot table (component, last_status, last_seen_at, payload) created automatically on next Storage.open() call Config additions (all with defaults): entity_registry_cache_ttl=300, system_health_check_interval=900, automation_check_interval=1800, automation_failure_threshold=3, updates_check_hour=9, updates_check_minute=0, updates_cooldown_days=7 Tests: 95 unit tests pass (49 new), 13 integration tests pass (9 new); 3 skipped (live-HA token not set in CI) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>		2026-05-29 14:43:10 +02:00
..
src/ha_diag	feat(ha-diag-agent): three REST diagnostic checks + Phase 3 flag fixes	2026-05-29 14:43:10 +02:00
tests	feat(ha-diag-agent): three REST diagnostic checks + Phase 3 flag fixes	2026-05-29 14:43:10 +02:00
docker-compose.yml	feat(ha-diag-agent): scaffold service with HA REST client and event emitter	2026-05-29 12:26:34 +02:00
Dockerfile	feat(ha-diag-agent): scaffold service with HA REST client and event emitter	2026-05-29 12:26:34 +02:00
env.example	feat(ha-diag-agent): UnavailableEntitiesCheck with root cause dedup	2026-05-29 13:41:55 +02:00
healthcheck.sh	feat(ha-diag-agent): scaffold service with HA REST client and event emitter	2026-05-29 12:26:34 +02:00
pyproject.toml	feat(ha-diag-agent): test environment with dual HA Docker instances	2026-05-29 12:56:13 +02:00
README.md	feat(ha-diag-agent): test environment with dual HA Docker instances	2026-05-29 12:56:13 +02:00
service.yaml	feat(ha-diag-agent): UnavailableEntitiesCheck with root cause dedup	2026-05-29 13:41:55 +02:00

README.md

ha-diag-agent

Per-host Home Assistant diagnostic agent. Polls HA REST API on a schedule, emits structured events to /opt/homelab/events/<node>/, and exposes an HTTP API for health checks and manual check triggers.

Follows the same event-pipeline pattern as node-agent: filesystem-first, no direct supervisor integration, events processed by the VPS observer.

Architecture

APScheduler (every CHECK_INTERVAL s)
  └─ HeartbeatCheck       → pings /api/, emits ha_websocket_dead on failure
     [Phase 3: EntityUnavailableCheck, SystemHealthCheck, UpdateCheck, ...]

FastAPI (port 8087)
  GET  /health             → liveness probe
  POST /trigger/<check>    → run a named check on demand

SQLite (/data/ha_diag.db)
  entity_baseline          → last-known entity states
  check_history            → per-check run log
  alerts_sent              → dedup gate for alert events

Event Types

Type	Severity	Trigger
`ha_websocket_dead`	error	HA /api/ unreachable
`ha_integration_failed`	error	Integration in error state
`ha_entity_unavailable_long`	warning	Entity unavailable > threshold
`ha_automation_failing`	warning	Automation last run errored
`ha_update_available`	info	HA or integration update pending
`ha_recorder_lag`	warning	Recorder write lag > threshold
`ha_system_health_degraded`	warning	System health check failed

Event routing in supervisor (Phase 5) maps these to notify actions.

Deployment model

The agent is deployed per-host but targets a potentially remote HA instance:

Node	Agent runs on	HA lives on	HA URL
piha	piha	piha (localhost)	`http://localhost:8123`
chelsty-infra	chelsty-infra	chelsty-ha (HAOS VM, separate machine)	`http://100.70.180.90:8123`

chelsty-infra note: Home Assistant runs on chelsty-ha, a dedicated Home Assistant OS VM. chelsty-infra is the hypervisor but does not run HA itself. The agent on chelsty-infra reaches HA over the Tailscale network (100.70.180.90:8123). If chelsty-ha gets a new Tailscale IP, update HA_URL in /opt/homelab/config/ha-diag-agent/.env on chelsty-infra.

Deployment

# 1. Create config on target node
ssh oskar@<node-ip>
mkdir -p /opt/homelab/config/ha-diag-agent /var/lib/ha-diag-agent
cat > /opt/homelab/config/ha-diag-agent/.env << 'EOF'
HA_URL=http://homeassistant.local:8123   # or http://100.70.180.90:8123 for chelsty-infra
HA_TOKEN=<long-lived-token>
NODE_NAME=piha                           # or chelsty-infra
LOCATION_TAG=ken                         # or chelsty
CHECK_INTERVAL=60
EOF

# 2. Deploy
scripts/deploy/deploy.sh --service ha-diag-agent

# 3. Verify
docker ps --filter name=ha-diag-agent
curl http://localhost:8087/health

chelsty-infra note

chelsty-infra runs docker-compose v1 (1.29.2). Use docker-compose (hyphenated):

docker-compose -f docker-compose.yml up -d --build

HA long-lived token

In HA UI: Profile → Long-Lived Access Tokens → Create token.

Running Tests

cd services/ha-diag-agent
pip install -e ".[dev]"
pytest tests/ -v

Optional YAML config

Place /opt/homelab/config/ha-diag-agent/ha-diag-agent.yaml on the node. Values there are defaults; env vars take priority.

ha_url: http://homeassistant.local:8123
location_tag: ken
check_interval: 60