homelab-codex-ws/services
Oskar Kapala 3499b2f280 feat(ha-diag-agent): three REST diagnostic checks + Phase 3 flag fixes
New checks:
- SystemHealthCheck (15min interval): detects newly-failing HA
  integrations via /api/system_health snapshot diff; transition-based
  dedup (ok→error fires, sustained error silent, error→ok clears alert)
- UpdatesAvailableCheck (daily cron 09:00): per-update ha_update_available
  events with 7-day dedup; release notes truncated at 2000 chars
- UpdatesDigestCheck (Sunday cron 09:00): single digest event with all
  pending updates; weekly ISO-week dedup, independent of daily dedup key
- AutomationFailuresCheck (30min interval): detects automations with
  N consecutive failures (default 3) via /api/trace/automation/<id>;
  6h cooldown per automation

Phase 3 flag fixes:
- Flag #1 (since field): UnavailableEntitiesCheck now uses
  min(state.last_changed, baseline.first_seen) as effective "since",
  giving accurate duration when agent was offline at entity's first fail
- Flag #3 (registry cache): HAClient.get_entity_registry() caches
  response in-process with configurable TTL (default 300s); avoids
  repeated API calls across concurrent check cycles; invalidate_registry_cache()
  for manual invalidation

Storage: system_health_snapshot table (component, last_status, last_seen_at,
payload) created automatically on next Storage.open() call

Config additions (all with defaults): entity_registry_cache_ttl=300,
system_health_check_interval=900, automation_check_interval=1800,
automation_failure_threshold=3, updates_check_hour=9,
updates_check_minute=0, updates_cooldown_days=7

Tests: 95 unit tests pass (49 new), 13 integration tests pass (9 new);
3 skipped (live-HA token not set in CI)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-29 14:43:10 +02:00
..
agent-system Fix Copy for AI: materializer fetches from control-plane API instead of Redis 2026-05-27 16:07:51 +02:00
control-plane supervisor: also cancel pending actions for services removed from desired state 2026-05-27 15:19:13 +02:00
forgejo Add node capability model 2026-05-11 20:46:50 +02:00
ha-diag-agent feat(ha-diag-agent): three REST diagnostic checks + Phase 3 flag fixes 2026-05-29 14:43:10 +02:00
mosquitto Implement filesystem-first runtime event system 2026-05-12 13:38:25 +02:00
node-agent Fix ghost service keys from hash-prefixed Docker container names 2026-05-27 15:41:13 +02:00
node_exporter Fix pending actions: node_exporter, zigbee2mqtt, chelsty-ha monitoring 2026-05-27 15:10:48 +02:00
npm Add node capability model 2026-05-11 20:46:50 +02:00
ollama Add node capability model 2026-05-11 20:46:50 +02:00
planner-agent fix+debug(planner-agent): use base_url (not api_base) for litellm.acompletion, add print [TEMP] 2026-05-28 13:07:58 +02:00
stability-agent Fix stability agent fleet deploy scripts 2026-05-17 21:09:06 +02:00
zigbee2mqtt docs: compress CLAUDE.md + fix zigbee2mqtt coordinator docs 2026-05-29 14:17:23 +02:00
.gitkeep Add infrastructure standards and deployment conventions 2026-05-07 21:16:03 +02:00