Oskar Kapala
|
3499b2f280
|
feat(ha-diag-agent): three REST diagnostic checks + Phase 3 flag fixes
New checks:
- SystemHealthCheck (15min interval): detects newly-failing HA
integrations via /api/system_health snapshot diff; transition-based
dedup (ok→error fires, sustained error silent, error→ok clears alert)
- UpdatesAvailableCheck (daily cron 09:00): per-update ha_update_available
events with 7-day dedup; release notes truncated at 2000 chars
- UpdatesDigestCheck (Sunday cron 09:00): single digest event with all
pending updates; weekly ISO-week dedup, independent of daily dedup key
- AutomationFailuresCheck (30min interval): detects automations with
N consecutive failures (default 3) via /api/trace/automation/<id>;
6h cooldown per automation
Phase 3 flag fixes:
- Flag #1 (since field): UnavailableEntitiesCheck now uses
min(state.last_changed, baseline.first_seen) as effective "since",
giving accurate duration when agent was offline at entity's first fail
- Flag #3 (registry cache): HAClient.get_entity_registry() caches
response in-process with configurable TTL (default 300s); avoids
repeated API calls across concurrent check cycles; invalidate_registry_cache()
for manual invalidation
Storage: system_health_snapshot table (component, last_status, last_seen_at,
payload) created automatically on next Storage.open() call
Config additions (all with defaults): entity_registry_cache_ttl=300,
system_health_check_interval=900, automation_check_interval=1800,
automation_failure_threshold=3, updates_check_hour=9,
updates_check_minute=0, updates_cooldown_days=7
Tests: 95 unit tests pass (49 new), 13 integration tests pass (9 new);
3 skipped (live-HA token not set in CI)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-05-29 14:43:10 +02:00 |
|
Oskar Kapala
|
20f6761a67
|
feat(ha-diag-agent): UnavailableEntitiesCheck with root cause dedup
- shared aiohttp ClientSession in HAClient (Phase 1 Flag #2 fixed):
make_session() factory, session injected at startup, closed on shutdown
- Check.run() → list[CheckResult]: clean multi-event interface
- first real diagnostic check: entity unavailable > 24h
(INSERT OR IGNORE baseline preserves first-seen timestamp)
- root cause grouping: emit ha_integration_failed instead of N entity
events when ≥50% of integration's entities are unavailable (≥3 min)
- alert deduplication via SQLite cooldown window (default 6h)
- recovery clears baseline + dedup for immediate re-alert
- configurable thresholds: duration, integration %, cooldown
- 38 unit tests + 7 integration tests (42 pass, 3 skip w/o live HA)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-05-29 13:41:55 +02:00 |
|