Commit graph

1 commit

Author SHA1 Message Date
Oskar Kapala 3499b2f280 feat(ha-diag-agent): three REST diagnostic checks + Phase 3 flag fixes
New checks:
- SystemHealthCheck (15min interval): detects newly-failing HA
  integrations via /api/system_health snapshot diff; transition-based
  dedup (ok→error fires, sustained error silent, error→ok clears alert)
- UpdatesAvailableCheck (daily cron 09:00): per-update ha_update_available
  events with 7-day dedup; release notes truncated at 2000 chars
- UpdatesDigestCheck (Sunday cron 09:00): single digest event with all
  pending updates; weekly ISO-week dedup, independent of daily dedup key
- AutomationFailuresCheck (30min interval): detects automations with
  N consecutive failures (default 3) via /api/trace/automation/<id>;
  6h cooldown per automation

Phase 3 flag fixes:
- Flag #1 (since field): UnavailableEntitiesCheck now uses
  min(state.last_changed, baseline.first_seen) as effective "since",
  giving accurate duration when agent was offline at entity's first fail
- Flag #3 (registry cache): HAClient.get_entity_registry() caches
  response in-process with configurable TTL (default 300s); avoids
  repeated API calls across concurrent check cycles; invalidate_registry_cache()
  for manual invalidation

Storage: system_health_snapshot table (component, last_status, last_seen_at,
payload) created automatically on next Storage.open() call

Config additions (all with defaults): entity_registry_cache_ttl=300,
system_health_check_interval=900, automation_check_interval=1800,
automation_failure_threshold=3, updates_check_hour=9,
updates_check_minute=0, updates_cooldown_days=7

Tests: 95 unit tests pass (49 new), 13 integration tests pass (9 new);
3 skipped (live-HA token not set in CI)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-29 14:43:10 +02:00