Commit graph

3 commits

Author SHA1 Message Date
Oskar Kapala 3499b2f280 feat(ha-diag-agent): three REST diagnostic checks + Phase 3 flag fixes
New checks:
- SystemHealthCheck (15min interval): detects newly-failing HA
  integrations via /api/system_health snapshot diff; transition-based
  dedup (ok→error fires, sustained error silent, error→ok clears alert)
- UpdatesAvailableCheck (daily cron 09:00): per-update ha_update_available
  events with 7-day dedup; release notes truncated at 2000 chars
- UpdatesDigestCheck (Sunday cron 09:00): single digest event with all
  pending updates; weekly ISO-week dedup, independent of daily dedup key
- AutomationFailuresCheck (30min interval): detects automations with
  N consecutive failures (default 3) via /api/trace/automation/<id>;
  6h cooldown per automation
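
Note on the transition-based dedup for SystemHealthCheck: a minimal sketch of the rule described above (ok→error fires, sustained error stays silent, error→ok clears). The class and method names are hypothetical, and treating a never-seen component as "ok" is an assumption; the real check diffs stored snapshots.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TransitionTracker:
    """Hypothetical helper: remembers the last status per component."""

    last_status: dict = field(default_factory=dict)

    def observe(self, component: str, status: str) -> Optional[str]:
        """Return "fire", "clear", or None for one new observation."""
        # Assumption: a component with no previous snapshot counts as "ok".
        previous = self.last_status.get(component, "ok")
        self.last_status[component] = status
        if previous == "ok" and status == "error":
            return "fire"    # newly failing: emit an alert
        if previous == "error" and status == "ok":
            return "clear"   # recovered: clear the alert
        return None          # no transition (sustained ok or sustained error)
```

Because only the edge is reported, a component that stays in error across many 15-minute cycles produces a single alert.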

Phase 3 flag fixes:
- Flag #1 (since field): UnavailableEntitiesCheck now uses
  min(state.last_changed, baseline.first_seen) as the effective "since",
  giving an accurate duration when the agent was offline at the entity's
  first failure
- Flag #3 (registry cache): HAClient.get_entity_registry() caches
  response in-process with configurable TTL (default 300s); avoids
  repeated API calls across concurrent check cycles; invalidate_registry_cache()
  for manual invalidation
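
Note on the registry cache: a rough sketch of an in-process TTL cache with manual invalidation, as described in Flag #3. It is synchronous for brevity, whereas HAClient is presumably async; TTLCache, get() and the injectable clock are hypothetical names, not the actual HAClient internals.

```python
import time
from typing import Any, Callable


class TTLCache:
    """Hypothetical single-value cache that refetches after ttl_seconds."""

    def __init__(self, ttl_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so tests can control time
        self._value: Any = None
        self._expires_at = 0.0
        self._valid = False

    def get(self, fetch: Callable[[], Any]) -> Any:
        """Return the cached value, calling fetch() only when stale."""
        now = self._clock()
        if not self._valid or now >= self._expires_at:
            self._value = fetch()
            self._expires_at = now + self._ttl
            self._valid = True
        return self._value

    def invalidate(self) -> None:
        """Force the next get() to refetch."""
        self._valid = False
```

Using a monotonic clock keeps the TTL immune to wall-clock adjustments on the host.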

Storage: system_health_snapshot table (component, last_status, last_seen_at,
payload) created automatically on next Storage.open() call
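
Note on the storage change: one way the table could be created idempotently at open time. Only the table and column names come from this commit; the column types, the primary key and the open_storage() helper are assumptions for illustration.

```python
import sqlite3

# Column names are from the commit message; types and key are assumed.
SCHEMA = """
CREATE TABLE IF NOT EXISTS system_health_snapshot (
    component    TEXT PRIMARY KEY,
    last_status  TEXT NOT NULL,
    last_seen_at TEXT NOT NULL,
    payload      TEXT
)
"""


def open_storage(path: str = ":memory:") -> sqlite3.Connection:
    """Hypothetical stand-in for Storage.open(): connect and ensure schema."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)  # IF NOT EXISTS makes repeated opens safe
    conn.commit()
    return conn
```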

Config additions (all with defaults): entity_registry_cache_ttl=300,
system_health_check_interval=900, automation_check_interval=1800,
automation_failure_threshold=3, updates_check_hour=9,
updates_check_minute=0, updates_cooldown_days=7
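
Note on the config additions: the defaults above, sketched as a frozen dataclass. Field names and values are taken from this commit; the class name DiagConfig is hypothetical, and the unit comments are inferred from the check descriptions (300s TTL, 15min and 30min intervals).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiagConfig:
    """Hypothetical container for the new settings and their defaults."""

    entity_registry_cache_ttl: int = 300      # seconds
    system_health_check_interval: int = 900   # seconds (15 min)
    automation_check_interval: int = 1800     # seconds (30 min)
    automation_failure_threshold: int = 3     # consecutive failures
    updates_check_hour: int = 9               # cron hour for update checks
    updates_check_minute: int = 0             # cron minute for update checks
    updates_cooldown_days: int = 7            # dedup window per update
```

An override then only names the fields that differ, for example DiagConfig(automation_failure_threshold=5).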

Tests: 95 unit tests pass (49 new), 13 integration tests pass (9 new);
3 skipped (live-HA token not set in CI)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-29 14:43:10 +02:00
Oskar Kapala 20f6761a67 feat(ha-diag-agent): UnavailableEntitiesCheck with root cause dedup
- shared aiohttp ClientSession in HAClient (Phase 1 Flag #2 fixed):
  make_session() factory, session injected at startup, closed on shutdown
- Check.run() → list[CheckResult]: clean multi-event interface
- first real diagnostic check: entity unavailable > 24h
  (INSERT OR IGNORE baseline preserves first-seen timestamp)
- root cause grouping: emit ha_integration_failed instead of N entity
  events when ≥50% of integration's entities are unavailable (≥3 min)
- alert deduplication via SQLite cooldown window (default 6h)
- recovery clears baseline + dedup for immediate re-alert
- configurable thresholds: duration, integration %, cooldown
- 38 unit tests + 7 integration tests (42 pass, 3 skip w/o live HA)
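
Note on the baseline: a small sketch of how INSERT OR IGNORE preserves the first-seen timestamp across repeated check cycles, and how recovery clears it. The table layout and function names are assumptions, not the actual Storage schema.

```python
import sqlite3


def open_baseline(path: str = ":memory:") -> sqlite3.Connection:
    """Create a hypothetical baseline table keyed by entity_id."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS baseline ("
        "entity_id TEXT PRIMARY KEY, first_seen TEXT NOT NULL)"
    )
    return conn


def record_unavailable(conn: sqlite3.Connection, entity_id: str, seen_at: str) -> str:
    """Store seen_at only on first sighting; return the stored first_seen."""
    # The primary-key conflict makes later inserts no-ops, so the
    # original timestamp survives every subsequent cycle.
    conn.execute(
        "INSERT OR IGNORE INTO baseline (entity_id, first_seen) VALUES (?, ?)",
        (entity_id, seen_at),
    )
    row = conn.execute(
        "SELECT first_seen FROM baseline WHERE entity_id = ?", (entity_id,)
    ).fetchone()
    return row[0]


def record_recovered(conn: sqlite3.Connection, entity_id: str) -> None:
    """Drop the baseline row so a later failure starts a fresh window."""
    conn.execute("DELETE FROM baseline WHERE entity_id = ?", (entity_id,))
```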

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-29 13:41:55 +02:00
Oskar Kapala 07bd498fd6 feat(ha-diag-agent): test environment with dual HA Docker instances
- dockerized ken + chelsty HA test instances with template fixtures
- snapshot/reset/wait scripts for fixture management
- integration test infrastructure with separate marker
- location_tag promoted from metadata to event payload (Phase 1 flag #3)
- chelsty-infra target_url points to chelsty-ha via tailnet (Phase 1 flag #1)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-29 12:56:13 +02:00