homelab-codex-ws/services/ha-diag-agent/src/ha_diag/ha_client.py
Oskar Kapala 3499b2f280 feat(ha-diag-agent): three REST diagnostic checks + Phase 3 flag fixes
New checks:
- SystemHealthCheck (15min interval): detects newly-failing HA
  integrations via /api/system_health snapshot diff; transition-based
  dedup (ok→error fires, sustained error silent, error→ok clears alert)
- UpdatesAvailableCheck (daily cron 09:00): per-update ha_update_available
  events with 7-day dedup; release notes truncated at 2000 chars
- UpdatesDigestCheck (Sunday cron 09:00): single digest event with all
  pending updates; weekly ISO-week dedup, independent of daily dedup key
- AutomationFailuresCheck (30min interval): detects automations with
  N consecutive failures (default 3) via /api/trace/automation/<id>;
  6h cooldown per automation
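
The transition-based dedup described for SystemHealthCheck can be sketched roughly as below. This is a minimal illustration, not the agent's actual code: the helper name `diff_health`, the plain "ok"/"error" status strings, the "fire"/"clear" action labels, and the choice to treat an unseen component as previously "ok" are all assumptions.

```python
from __future__ import annotations


def diff_health(
    previous: dict[str, str],
    current: dict[str, str],
) -> list[tuple[str, str]]:
    """Compare two {component: status} snapshots; return (component, action) pairs.

    Hypothetical helper: "fire" on ok->error, "clear" on error->ok,
    and nothing while a status stays the same.
    """
    events: list[tuple[str, str]] = []
    for component, status in sorted(current.items()):
        # Assumption: a component with no earlier snapshot counts as "ok".
        before = previous.get(component, "ok")
        if before == "ok" and status == "error":
            events.append((component, "fire"))
        elif before == "error" and status == "ok":
            events.append((component, "clear"))
        # ok->ok and error->error fall through: sustained states stay silent.
    return events


previous = {"cloud": "ok", "zha": "error", "mqtt": "error"}
current = {"cloud": "error", "zha": "error", "mqtt": "ok"}
print(diff_health(previous, current))
# [('cloud', 'fire'), ('mqtt', 'clear')]
```

Keying the decision on the stored previous status, rather than on the time since the last alert, is what keeps a sustained error silent without needing a cooldown timer.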

Phase 3 flag fixes:
- Flag #1 (since field): UnavailableEntitiesCheck now uses
  min(state.last_changed, baseline.first_seen) as effective "since",
  giving accurate duration when agent was offline at entity's first fail
- Flag #3 (registry cache): HAClient.get_entity_registry() caches
  response in-process with configurable TTL (default 300s); avoids
  repeated API calls across concurrent check cycles; invalidate_registry_cache()
  for manual invalidation
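
As an illustration of the Flag #1 fix, the effective "since" value could be computed along these lines. The function name `effective_since` and the `None` handling for a missing baseline are assumptions; only the `min(state.last_changed, baseline.first_seen)` rule comes from the description above.

```python
from __future__ import annotations

from datetime import datetime, timezone


def effective_since(last_changed: datetime, first_seen: datetime | None) -> datetime:
    """Return the earlier of HA's last_changed and the agent's first_seen.

    Hypothetical helper. Both datetimes must be timezone-aware so that
    they can be compared.
    """
    if first_seen is None:
        # Assumption: with no stored baseline, fall back to HA's timestamp.
        return last_changed
    return min(last_changed, first_seen)


# The entity went unavailable at 08:00 while the agent was offline;
# the agent first observed it at 10:30.
last_changed = datetime(2026, 5, 29, 8, 0, tzinfo=timezone.utc)
first_seen = datetime(2026, 5, 29, 10, 30, tzinfo=timezone.utc)
print(effective_since(last_changed, first_seen).isoformat())
# 2026-05-29T08:00:00+00:00
```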

Storage: system_health_snapshot table (component, last_status, last_seen_at,
payload) created automatically on next Storage.open() call
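
A sketch of how such a table might be created idempotently on open. The storage backend is not shown here, so SQLite, the column types, the primary key, and the `open_storage` function are all assumptions; only the table and column names come from the note above.

```python
import sqlite3

# Assumed schema: column names from the commit message, types guessed.
SCHEMA = """
CREATE TABLE IF NOT EXISTS system_health_snapshot (
    component    TEXT PRIMARY KEY,
    last_status  TEXT NOT NULL,
    last_seen_at TEXT NOT NULL,
    payload      TEXT
)
"""


def open_storage(path: str = ":memory:") -> sqlite3.Connection:
    """Hypothetical stand-in for Storage.open(): connect and ensure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)  # IF NOT EXISTS makes repeated opens harmless
    conn.commit()
    return conn


conn = open_storage()
conn.execute(
    "INSERT OR REPLACE INTO system_health_snapshot VALUES (?, ?, ?, ?)",
    ("cloud", "ok", "2026-05-29T12:00:00+00:00", "{}"),
)
row = conn.execute(
    "SELECT component, last_status FROM system_health_snapshot"
).fetchone()
print(row)
# ('cloud', 'ok')
```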

Config additions (all with defaults): entity_registry_cache_ttl=300,
system_health_check_interval=900, automation_check_interval=1800,
automation_failure_threshold=3, updates_check_hour=9,
updates_check_minute=0, updates_cooldown_days=7
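
For reference, the new settings and their defaults could be modelled as a frozen dataclass like the one below. The class name `DiagConfig` and the field types are assumptions; the field names and default values are the ones listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiagConfig:
    """Hypothetical config container; names and defaults from the commit message."""

    entity_registry_cache_ttl: float = 300.0   # seconds
    system_health_check_interval: int = 900    # seconds (15 min)
    automation_check_interval: int = 1800      # seconds (30 min)
    automation_failure_threshold: int = 3      # consecutive failures
    updates_check_hour: int = 9                # daily cron hour
    updates_check_minute: int = 0              # daily cron minute
    updates_cooldown_days: int = 7             # per-update dedup window


# Override a single value; everything else keeps its default.
cfg = DiagConfig(automation_failure_threshold=5)
print(cfg.system_health_check_interval // 60)
# 15
```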

Tests: 95 unit tests pass (49 new), 13 integration tests pass (9 new);
3 skipped (live-HA token not set in CI)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-29 14:43:10 +02:00


from __future__ import annotations

import time
from typing import Any

import aiohttp


def make_session(token: str, timeout: float = 10.0) -> aiohttp.ClientSession:
    """Create a pre-configured ClientSession for use with HAClient."""
    return aiohttp.ClientSession(
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        timeout=aiohttp.ClientTimeout(total=timeout),
    )


class HAClient:
    """Async Home Assistant REST API client.

    Session lifecycle is managed externally: the caller creates the session
    via make_session() at startup and closes it on shutdown. HAClient is a
    session-borrower: it never opens or closes the session it receives.
    """

    def __init__(
        self,
        base_url: str,
        session: aiohttp.ClientSession,
        entity_registry_cache_ttl: float = 300.0,
    ) -> None:
        self._base_url = base_url.rstrip("/")
        self._session = session
        self._registry_cache_ttl = entity_registry_cache_ttl
        self._registry_cache: list[dict[str, Any]] | None = None
        self._registry_fetched_at: float = 0.0

    async def get_api_status(self) -> dict[str, Any]:
        """GET /api/ - returns {"message": "API running."} when HA is up."""
        async with self._session.get(f"{self._base_url}/api/") as resp:
            resp.raise_for_status()
            return await resp.json()

    async def get_states(self) -> list[dict[str, Any]]:
        """GET /api/states - full entity state list."""
        async with self._session.get(f"{self._base_url}/api/states") as resp:
            resp.raise_for_status()
            return await resp.json()

    async def get_system_health(self) -> dict[str, Any]:
        """GET /api/system_health - per-integration health summary."""
        async with self._session.get(f"{self._base_url}/api/system_health") as resp:
            resp.raise_for_status()
            return await resp.json()

    async def get_config(self) -> dict[str, Any]:
        """GET /api/config - HA configuration including version."""
        async with self._session.get(f"{self._base_url}/api/config") as resp:
            resp.raise_for_status()
            return await resp.json()

    async def get_entity_registry(self) -> list[dict[str, Any]]:
        """GET /api/config/entity_registry - entity registry entries.

        Each entry includes entity_id, platform (integration name), area_id,
        config_entry_id, and other metadata.

        Result is cached in-process for entity_registry_cache_ttl seconds to
        avoid hammering HA on every check cycle (Phase 3 Flag #3).
        """
        now = time.monotonic()
        if (
            self._registry_cache is not None
            and (now - self._registry_fetched_at) < self._registry_cache_ttl
        ):
            return self._registry_cache

        async with self._session.get(
            f"{self._base_url}/api/config/entity_registry"
        ) as resp:
            resp.raise_for_status()
            result = await resp.json()

        self._registry_cache = result
        self._registry_fetched_at = now
        return result

    def invalidate_registry_cache(self) -> None:
        """Force the next get_entity_registry() call to fetch fresh data."""
        self._registry_cache = None
        self._registry_fetched_at = 0.0

    async def get_automation_traces(self, automation_id: str) -> list[dict[str, Any]]:
        """GET /api/trace/automation/<id> - last run traces for an automation."""
        url = f"{self._base_url}/api/trace/automation/{automation_id}"
        async with self._session.get(url) as resp:
            resp.raise_for_status()
            return await resp.json()

    async def get_error_log(self) -> str:
        """GET /api/error_log - plaintext error log."""
        async with self._session.get(f"{self._base_url}/api/error_log") as resp:
            resp.raise_for_status()
            return await resp.text()
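
A caller such as AutomationFailuresCheck might evaluate the list returned by get_automation_traces() roughly as follows. This is a sketch under stated assumptions: the trace payload shape is not shown in this file, so the oldest-first ordering, the default rule that a trace with an "error" key counts as a failure, and the helper name `has_consecutive_failures` are all hypothetical.

```python
from __future__ import annotations

from typing import Any, Callable


def _has_error(trace: dict[str, Any]) -> bool:
    # Assumption: a failed run carries an "error" key in its trace dict.
    return "error" in trace


def has_consecutive_failures(
    traces: list[dict[str, Any]],
    threshold: int = 3,
    is_failure: Callable[[dict[str, Any]], bool] = _has_error,
) -> bool:
    """Return True when the `threshold` most recent traces are all failures.

    Hypothetical helper; assumes `traces` is ordered oldest-first.
    """
    if threshold <= 0 or len(traces) < threshold:
        return False
    return all(is_failure(t) for t in traces[-threshold:])


traces = [
    {"run_id": "1"},
    {"run_id": "2", "error": "timeout"},
    {"run_id": "3", "error": "timeout"},
    {"run_id": "4", "error": "timeout"},
]
print(has_consecutive_failures(traces))               # True
print(has_consecutive_failures(traces, threshold=4))  # False
```

Passing the failure predicate as a parameter keeps the counting logic independent of whatever fields the real trace payload turns out to use.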