feat(control-plane): shadow_mode for HA event auto-actions + deploy docs

- HA_DIAG_SHADOW_MODE env flag in supervisor (default true)
- shadow_mode downgrades container_restart actions to alert_only with
  [SHADOW MODE] note; same action_id and 30-min cooldown apply
- alert_only events unaffected (always routed normally)
- 3 new tests: shadow on/off for ha_websocket_dead, alert-only unaffected
- DEPLOY.md with token gen, per-host config, verification, 48h observation,
  production-mode enablement, rollback
- README.md updated with shadow mode flag summary and DEPLOY.md link

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
fix(telegram-bot): correct risk_level field + show description in alerts
2026-05-29 17:12:33 +02:00 · 2026-05-29 16:26:49 +02:00 · 2026-05-29 15:59:23 +02:00 · 2026-05-29 15:00:18 +02:00 · 2026-05-29 14:43:10 +02:00 · 2026-05-29 14:17:23 +02:00
63 changed files with 6488 additions and 81 deletions
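Review note: the shadow-mode gating described in the commit message can be sketched in isolation as below. This is a minimal sketch, not the supervisor code itself: the helper names `shadow_mode_enabled` and `planned_action_type` are hypothetical, while the event names and the flag parsing (`.lower() == "true"`) are taken from the `supervisor.py` hunk further down. One consequence worth keeping in mind is that only the literal string `true` (in any case) enables shadow mode, so a value such as `1` or `yes` switches it off.

```python
from typing import Mapping, Optional

# Event sets copied from the supervisor diff.
RESTART_EVENTS = {"ha_websocket_dead"}
ALERT_ONLY_EVENTS = {
    "ha_integration_failed",
    "ha_entity_unavailable_long",
    "ha_automation_failing",
    "ha_update_available",
    "ha_recorder_lag",
    "ha_system_health_degraded",
}


def shadow_mode_enabled(env: Mapping[str, str]) -> bool:
    """Same parsing as the diff: default is "true"; only "true" (any case) enables it."""
    return env.get("HA_DIAG_SHADOW_MODE", "true").lower() == "true"


def planned_action_type(event_type: str, shadow: bool) -> Optional[str]:
    """Return the action type an event would produce, or None if it is not routed here.

    Hypothetical helper: it ignores cooldowns, dedup and the transition window.
    """
    if event_type in RESTART_EVENTS:
        # Shadow mode downgrades the restart to an operator alert.
        return "alert_only" if shadow else "container_restart"
    if event_type in ALERT_ONLY_EVENTS:
        # Alert-only events are unaffected by the flag.
        return "alert_only"
    return None
```

For example, `planned_action_type("ha_websocket_dead", shadow_mode_enabled({}))` yields `alert_only`, because the flag defaults to shadow mode when the variable is unset.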
--- a/.gitignore
+++ b/.gitignore
@@ -15,6 +15,7 @@ __pycache__/
 *$py.class
 venv/
 .venv/
+*.egg-info/

 # Tools
 .aider*
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -17,43 +17,22 @@ GitOps-lite orchestration for a distributed homelab. The repo is the source of t
 | **CHELSTY-INFRA** | LTE edge hypervisor (site: chelsty); Zigbee2MQTT, Mosquitto, stability-agent — offline-first |
 | **CHELSTY-HA** | LTE Home Assistant VM (site: chelsty); connects to CHELSTY-INFRA MQTT broker — offline-first |

-All nodes communicate over Tailscale. CHELSTY-INFRA and CHELSTY-HA have an intermittent LTE uplink; their services (`zigbee2mqtt`, `mosquitto`, `homeassistant`, `stability-agent`) must never depend on SATURN, VPS, or Forgejo at runtime.
+All nodes communicate over Tailscale. CHELSTY-INFRA and CHELSTY-HA have an intermittent LTE uplink; their services must never depend on SATURN, VPS, or Forgejo at runtime. Full node capabilities: `hosts/<node>/capabilities.yaml`.

 ## Deployment

-### Run a fresh deployment on the current node
 ```bash
-scripts/deploy/deploy.sh
+scripts/deploy/deploy.sh                        # fresh deploy on current node
+scripts/deploy/deploy.sh --resume              # resume after interruption
+scripts/deploy/deploy.sh --stage verify        # specific stage only
+scripts/deploy/deploy.sh --service mosquitto   # specific service only
+./scripts/deploy/deploy-control-plane.sh --ssh # SATURN/SOLARIA → VPS
+./scripts/deploy/deploy-node.sh chelsty-infra  # CHELSTY nodes (individually)
+./scripts/bootstrap/prepare-node.sh            # general node bootstrap
+./scripts/bootstrap/chelsty-runtime.sh         # CHELSTY-specific bootstrap
 ```

-### Resume after interruption
-```bash
-scripts/deploy/deploy.sh --resume
-```
-
-### Run a specific stage only
-```bash
-scripts/deploy/deploy.sh --stage verify
-scripts/deploy/deploy.sh --stage diagnose
-```
-
-### Deploy a specific service
-```bash
-scripts/deploy/deploy.sh --service mosquitto
-```
-
-### Deploy from SATURN/SOLARIA to VPS (control plane)
-```bash
-./scripts/deploy/deploy-control-plane.sh --ssh
-```
-
-### Bootstrap a new node
-```bash
-./scripts/bootstrap/chelsty-runtime.sh  # CHELSTY-specific
-./scripts/bootstrap/prepare-node.sh     # General node prep
-```
-
-The staged deploy pipeline runs: **prepare → validate → deploy → verify → diagnose (on failure) → complete**. Stage state is persisted in `/opt/homelab/state/deploy/` allowing safe resumption.
+Pipeline stages: **prepare → validate → deploy → verify → diagnose (on failure) → complete**. Stage state persisted in `/opt/homelab/state/deploy/`.

 ## Service Structure

@@ -94,25 +73,31 @@ Agents must never execute destructive actions (restarts, deploys, config changes

 ## Event System

-Events are append-only JSON lines at:
-```
-/opt/homelab/events/YYYY-MM-DD/<node>/events.jsonl
-```
+Events are append-only JSON lines at `/opt/homelab/events/YYYY-MM-DD/<node>/events.jsonl`.

-Emit from shell:
-```bash
-source scripts/lib/events.sh
-emit_event "deployment_started" "info" "my-script.sh" "mosquitto" "cid-123" '{}'
-```
-
-Emit from Python:
-```python
-from scripts.lib.events import emit_event
-emit_event("service_unhealthy", "error", "monitor.py", "ollama", "cid-123", {"error": "OOM"})
-```
+Emit via `scripts/lib/events.sh` (shell) or `scripts/lib/events.py` (Python).

 Normalized event types: `deployment_started/completed/failed`, `service_unhealthy/recovered`, `node_offline/online`, `healthcheck_failed`, `remediation_started/completed`.

+### Supervisor event routing table
+
+| Event type | Source | Action generated | Cooldown |
+|---|---|---|---|
+| `containers_not_running` | stability-agent | `container_restart` | dedup via stable ID |
+| `mqtt_unreachable` | stability-agent | `container_restart` | dedup via stable ID |
+| `service_unhealthy` / other | stability-agent | `redeploy` | dedup via stable ID |
+| `disk_pressure` (high) | stability-agent | `disk_cleanup` | dedup via stable ID |
+| `ha_websocket_dead` | ha-diag-agent | `container_restart` (homeassistant) | 30 min after completion |
+| `ha_websocket_recovered` | ha-diag-agent | cancels matching restart | — |
+| `ha_integration_failed` | ha-diag-agent | `alert_only` | 1 hour |
+| `ha_entity_unavailable_long` | ha-diag-agent | `alert_only` | 1 hour |
+| `ha_automation_failing` | ha-diag-agent | `alert_only` | 1 hour |
+| `ha_update_available` | ha-diag-agent | `alert_only` | 1 hour |
+| `ha_recorder_lag` | ha-diag-agent | `alert_only` | 1 hour |
+| `ha_system_health_degraded` | ha-diag-agent | `alert_only` | 1 hour |
+
+HA events are routed directly from the events directory by the supervisor (not via world-state drift loop) to avoid conflicts with stability-agent's independent container health tracking. HA events are suppressed if `homeassistant` had a `containers_not_running` incident within the last 5 minutes (planned restart/update in progress).
+
 ## Discovery Entry Points for Agents

 When exploring the system, use these files in order:
@@ -124,29 +109,20 @@ When exploring the system, use these files in order:
 ## CHELSTY-Specific Rules

 - Zigbee coordinator is **SLZB-06U** over TCP (`192.168.1.105:6638`, `ezsp` adapter). Never use `/dev/ttyUSB0`.
-- Deploy CHELSTY nodes individually:
-  ```bash
-  ./scripts/deploy/deploy-node.sh chelsty-infra
-  ./scripts/deploy/deploy-node.sh chelsty-ha
-  ```
-- Bootstrap CHELSTY runtime:
-  ```bash
-  ./scripts/bootstrap/chelsty-runtime.sh
-  ```
+- CHELSTY nodes run **docker-compose v1** (1.29.2) — use `docker-compose` (hyphenated), not `docker compose`.
 - Critical backup sets: HA config+data, Zigbee2MQTT config+db+network key, Mosquitto config+persistence, SLZB-06U coordinator state.

 ## Runtime Path Conventions

-```
-/opt/homelab/
-├── data/<service>/     # Persistent volumes
-├── config/<service>/   # Secrets and host-local overrides (not in Git)
-├── logs/<service>/     # Service logs
-├── state/              # Deployment stage markers, agent heartbeats
-├── events/             # Append-only event store
-├── world/              # Observer output (synthesized state)
-└── actions/            # pending / approved / running / completed / failed
-```
+`/opt/homelab/` layout on each node:
+
+- `data/<service>/` — persistent volumes
+- `config/<service>/` — secrets and host-local overrides (not in Git)
+- `logs/<service>/` — service logs
+- `state/` — deployment stage markers, agent heartbeats
+- `events/` — append-only event store
+- `world/` — Observer output (synthesized state)
+- `actions/` — pending / approved / running / completed / failed

 ## Naming Conventions

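Review note: with the inline `emit_event` examples removed from CLAUDE.md, the append-only JSON-lines layout is now only described in prose. The sketch below shows one way such an append could look. It is an illustration under stated assumptions, not the contents of `scripts/lib/events.py`: the function name `append_event`, the use of a UTC date for the `YYYY-MM-DD` directory, and the event field names are all hypothetical.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def append_event(events_root: Path, node: str, event: dict) -> Path:
    """Append one event as a single JSON line under <root>/YYYY-MM-DD/<node>/events.jsonl.

    Assumption: the date directory uses UTC; the real helper may use local time.
    """
    day = datetime.now(timezone.utc).strftime("%Y-%m-%d")
    path = events_root / day / node / "events.jsonl"
    path.parent.mkdir(parents=True, exist_ok=True)
    # Mode "a" keeps the file append-only; one compact JSON object per line.
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(event, separators=(",", ":")) + "\n")
    return path
```

A caller would pass the events root (for example `Path("/opt/homelab/events")`), the node name, and a dict such as `{"type": "service_unhealthy", "severity": "error"}`; each call adds exactly one line and never rewrites earlier ones.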
--- a/hosts/chelsty-infra/services.yaml
+++ b/hosts/chelsty-infra/services.yaml
@@ -2,6 +2,22 @@ host: chelsty-infra
 site: chelsty

 services:
+  ha-diag-agent:
+    role: ha-diagnostic-agent
+    deployment_model: docker-compose
+    exposure: local-only
+    offline_required: false
+    depends_on:
+      local: []
+      external: [homeassistant]
+    config:
+      target_url: http://100.70.180.90:8123  # chelsty-ha via Tailscale (HAOS, separate VM)
+      location_tag: "chelsty"
+      events_dir: /opt/homelab/events/chelsty-infra
+    runtime:
+      config_path: /opt/homelab/config/ha-diag-agent
+      data_path: /var/lib/ha-diag-agent
+
  node-agent:
    role: node-stability-monitor
    # LTE node: node-agent monitors and emits events but does NO Docker cleanup.
--- a/hosts/piha/services.yaml
+++ b/hosts/piha/services.yaml
@@ -1,6 +1,22 @@
 host: piha

 services:
+  ha-diag-agent:
+    role: ha-diagnostic-agent
+    deployment_model: docker-compose
+    exposure: local-only
+    offline_required: false
+    depends_on:
+      local: []
+      external: [homeassistant]
+    config:
+      target_url: http://localhost:8123
+      location_tag: "ken"
+      events_dir: /opt/homelab/events/piha
+    runtime:
+      config_path: /opt/homelab/config/ha-diag-agent
+      data_path: /var/lib/ha-diag-agent
+
  node-agent:
    role: node-stability-monitor
    deployment_model: docker-compose
--- a/services/agent-system/telegram-bot/bot.py
+++ b/services/agent-system/telegram-bot/bot.py
@@ -54,6 +54,36 @@ async def post_api(path, data):
        logger.error(f"Error posting to {url}: {e}")
        return False

+def _format_pending_action(action_id: str, data: dict) -> str:
+    """Build the Telegram Markdown message for a pending action notification.
+
+    Extracted so it can be unit-tested without a live Telegram connection.
+    """
+    # Supervisor writes risk_level; action-model.md legacy schema used risk.
+    risk = data.get("risk_level") or data.get("risk", "unknown")
+    message = (
+        f"⚠️ *Pending Action*\n"
+        f"ID: `{action_id}`\n"
+        f"Type: `{data.get('type', 'unknown')}`\n"
+        f"Service: `{data.get('service', 'unknown')}`\n"
+        f"Node: `{data.get('node', 'unknown')}`\n"
+        f"Risk: *{risk}*\n"
+    )
+    # description carries the human-readable substance of the action (required for
+    # alert_only actions where it is the entire operator-visible message).
+    description = data.get("description", "")
+    if description:
+        truncated = description[:300] + ("..." if len(description) > 300 else "")
+        message += f"Description: `{truncated}`\n"
+    # Legacy details block (old action-model.md schema) — kept for backwards compat.
+    if "details" in data:
+        details_str = json.dumps(data["details"], indent=2)
+        if len(details_str) > 1000:
+            details_str = details_str[:1000] + "..."
+        message += f"\nDetails:\n```json\n{details_str}\n```"
+    return message
+
+
 class ApprovalBot:
    def __init__(self):
        self.pending_dir = ACTIONS_ROOT / "pending"
@@ -86,20 +116,7 @@ class ApprovalBot:

    async def notify_users(self, context: ContextTypes.DEFAULT_TYPE, action_id: str, data: dict):
        """Sends an approval request message to all allowed users."""
-        message = (
-            f"⚠️ *Pending Action*\n"
-            f"ID: `{action_id}`\n"
-            f"Type: `{data.get('type', 'unknown')}`\n"
-            f"Service: `{data.get('service', 'unknown')}`\n"
-            f"Node: `{data.get('node', 'unknown')}`\n"
-            f"Risk: *{data.get('risk', 'unknown')}*\n"
-        )
-
-        if "details" in data:
-            details_str = json.dumps(data['details'], indent=2)
-            if len(details_str) > 1000:
-                details_str = details_str[:1000] + "..."
-            message += f"\nDetails:\n```json\n{details_str}\n```"
+        message = _format_pending_action(action_id, data)

        keyboard = [
            [
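Review note: the two behavioural changes in `_format_pending_action` are the `risk_level` to `risk` fallback and the 300-character description cut-off. The sketch below isolates both so the edge cases are easy to see; `pick_risk` and `truncate` are hypothetical helper names, not functions in `bot.py`. Because the fallback uses `or`, an empty or `None` `risk_level` also falls through to the legacy `risk` key.

```python
def pick_risk(data: dict) -> str:
    """Prefer risk_level; fall back to the legacy risk key, then to "unknown"."""
    # `or` treats "" and None like a missing key, mirroring the expression in the diff.
    return data.get("risk_level") or data.get("risk", "unknown")


def truncate(text: str, limit: int = 300) -> str:
    """Cut text to `limit` characters and append "..." only when something was cut."""
    return text[:limit] + ("..." if len(text) > limit else "")
```

With these helpers, `pick_risk({"risk_level": "", "risk": "guarded"})` resolves to the legacy value, and a description of exactly 300 characters passes through `truncate` unchanged.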
--- a/services/agent-system/telegram-bot/tests/__init__.py
+++ b/services/agent-system/telegram-bot/tests/__init__.py
--- a/services/agent-system/telegram-bot/tests/conftest.py
+++ b/services/agent-system/telegram-bot/tests/conftest.py
@@ -0,0 +1,38 @@
+"""Stub telegram before bot.py is imported so pytest doesn't need the real package."""
+from __future__ import annotations
+
+import sys
+import types
+from unittest.mock import MagicMock
+
+
+def _make_telegram_stub() -> types.ModuleType:
+    mod = types.ModuleType("telegram")
+    mod.Update = MagicMock
+    mod.InlineKeyboardButton = MagicMock
+    mod.InlineKeyboardMarkup = MagicMock
+    return mod
+
+
+def _make_telegram_ext_stub() -> types.ModuleType:
+    mod = types.ModuleType("telegram.ext")
+    mod.ApplicationBuilder = MagicMock
+
+    # ContextTypes.DEFAULT_TYPE is referenced as a type annotation at class-body
+    # evaluation time, so it must be a real attribute, not a dynamic MagicMock attr.
+    ContextTypesMock = MagicMock()
+    ContextTypesMock.DEFAULT_TYPE = type(None)
+    mod.ContextTypes = ContextTypesMock
+
+    mod.CommandHandler = MagicMock
+    mod.CallbackQueryHandler = MagicMock
+    mod.MessageHandler = MagicMock
+    mod.filters = MagicMock()
+    return mod
+
+
+# Insert before any import of bot.py
+if "telegram" not in sys.modules:
+    sys.modules["telegram"] = _make_telegram_stub()
+if "telegram.ext" not in sys.modules:
+    sys.modules["telegram.ext"] = _make_telegram_ext_stub()
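Review note: the conftest relies on the fact that Python's import system consults `sys.modules` before searching for a package on disk. The sketch below shows the same technique on a made-up module so it can be run anywhere; `fake_sdk`, `Client` and `install_stub` are hypothetical names chosen for illustration.

```python
import sys
import types
from unittest.mock import MagicMock


def install_stub(name: str, **attrs) -> types.ModuleType:
    """Register a stub module under `name` unless a module is already loaded."""
    mod = types.ModuleType(name)
    for attr, value in attrs.items():
        setattr(mod, attr, value)
    # setdefault mirrors the `if name not in sys.modules` guard in conftest.py:
    # an already imported (real or stub) module is left untouched.
    return sys.modules.setdefault(name, mod)


# The stub must be installed before the first import of the module.
install_stub("fake_sdk", Client=MagicMock)

import fake_sdk  # resolved from sys.modules, no package on disk needed

client = fake_sdk.Client()
```

The ordering is the important part: pytest loads `conftest.py` before collecting the test modules in that directory, which is why the stubs are in place by the time `bot.py` is imported.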
--- a/services/agent-system/telegram-bot/tests/test_format.py
+++ b/services/agent-system/telegram-bot/tests/test_format.py
@@ -0,0 +1,116 @@
+"""Tests for _format_pending_action — no Telegram connection required.
+
+telegram stubs are set up in conftest.py before this module is imported.
+"""
+from __future__ import annotations
+
+import sys
+from pathlib import Path
+
+import pytest
+
+sys.path.insert(0, str(Path(__file__).parent.parent))
+from bot import _format_pending_action
+
+
+# ---------------------------------------------------------------------------
+# Bug 1 — risk_level field
+# ---------------------------------------------------------------------------
+
+def test_risk_level_shown_when_present():
+    data = {
+        "type": "container_restart", "service": "homeassistant",
+        "node": "chelsty-ha", "risk_level": "low",
+    }
+    msg = _format_pending_action("container-restart-chelsty-ha-homeassistant", data)
+    assert "Risk: *low*" in msg
+    assert "unknown" not in msg
+
+
+def test_risk_falls_back_to_legacy_risk_key():
+    data = {
+        "type": "redeploy", "service": "mosquitto",
+        "node": "chelsty-infra", "risk": "guarded",
+    }
+    msg = _format_pending_action("redeploy-chelsty-infra-mosquitto", data)
+    assert "Risk: *guarded*" in msg
+
+
+def test_risk_unknown_when_both_absent():
+    data = {"type": "redeploy", "service": "foo", "node": "bar"}
+    msg = _format_pending_action("redeploy-bar-foo", data)
+    assert "Risk: *unknown*" in msg
+
+
+# ---------------------------------------------------------------------------
+# Bug 2 — description field
+# ---------------------------------------------------------------------------
+
+def test_description_shown_for_alert_only():
+    data = {
+        "type": "alert_only", "service": "homeassistant",
+        "node": "chelsty-ha", "risk_level": "info",
+        "description": "3 entities unavailable for >1h",
+    }
+    msg = _format_pending_action("alert-ha-entity-unavailable-chelsty-ha", data)
+    assert "3 entities unavailable for >1h" in msg
+    assert "Description:" in msg
+
+
+def test_description_shown_for_container_restart():
+    data = {
+        "type": "container_restart", "service": "homeassistant",
+        "node": "chelsty-ha", "risk_level": "low",
+        "description": "Restart 'homeassistant' on chelsty-ha: HA WebSocket unresponsive",
+    }
+    msg = _format_pending_action("container-restart-chelsty-ha-homeassistant", data)
+    assert "HA WebSocket unresponsive" in msg
+
+
+def test_description_absent_no_crash():
+    data = {"type": "redeploy", "service": "foo", "node": "bar", "risk_level": "guarded"}
+    msg = _format_pending_action("redeploy-bar-foo", data)
+    assert "Description:" not in msg
+    assert "Risk: *guarded*" in msg
+
+
+def test_description_truncated_at_300_chars():
+    long_desc = "x" * 400
+    data = {
+        "type": "alert_only", "service": "homeassistant",
+        "node": "chelsty-ha", "risk_level": "info",
+        "description": long_desc,
+    }
+    msg = _format_pending_action("alert-ha-foo-chelsty-ha", data)
+    assert "x" * 300 in msg
+    assert "..." in msg
+    assert "x" * 301 not in msg
+
+
+# ---------------------------------------------------------------------------
+# Combined — real HA alert_only action shape
+# ---------------------------------------------------------------------------
+
+def test_ha_alert_only_full_action():
+    """Mirrors an actual alert_only action written by supervisor._generate_ha_alert_only."""
+    data = {
+        "action_id": "alert-ha-entity-unavailable-chelsty-ha",
+        "type": "alert_only",
+        "node": "chelsty-ha",
+        "service": "homeassistant",
+        "risk_level": "info",
+        "confidence": 1.0,
+        "description": "3 entities unavailable for >1h: sensor.power, binary_sensor.window",
+        "status": "pending",
+        "payload": {
+            "location_tag": "chelsty",
+            "reason": "ha_entity_unavailable_long",
+            "count": 3,
+        },
+    }
+    msg = _format_pending_action(data["action_id"], data)
+    assert "alert_only" in msg
+    assert "chelsty-ha" in msg
+    assert "Risk: *info*" in msg
+    assert "3 entities unavailable" in msg
+    assert "unknown" not in msg
--- a/services/control-plane/pyproject.toml
+++ b/services/control-plane/pyproject.toml
@@ -0,0 +1,19 @@
+[build-system]
+requires = ["setuptools>=68"]
+build-backend = "setuptools.build_meta"
+
+[project]
+name = "control-plane"
+version = "0.1.0"
+requires-python = ">=3.11"
+dependencies = [
+    "pyyaml>=6.0",
+]
+
+[project.optional-dependencies]
+dev = [
+    "pytest>=8.1",
+]
+
+[tool.pytest.ini_options]
+testpaths = ["tests"]
--- a/services/control-plane/src/executor.py
+++ b/services/control-plane/src/executor.py
@@ -101,6 +101,10 @@ class Executor:
                payload = data.get("payload", {})
                success, error_msg = self._execute_disk_cleanup(node, payload)

+            elif action_type == "alert_only":
+                # Operator acknowledged the alert; no automated execution needed.
+                success = True
+
            else:
                success = False
                error_msg = f"Unknown action type: {action_type}"
--- a/services/control-plane/src/supervisor.py
+++ b/services/control-plane/src/supervisor.py
@@ -9,6 +9,7 @@ from pathlib import Path
 RUNTIME_PATH = os.getenv("RUNTIME_PATH", "/opt/homelab")
 WORLD_DIR = Path(RUNTIME_PATH) / "world"
 ACTIONS_DIR = Path(RUNTIME_PATH) / "actions"
+EVENTS_DIR = Path(RUNTIME_PATH) / "events"
 REPO_ROOT = Path(os.getenv("REPO_ROOT", "/repo"))

 # Node alias map: maps alternative node names (as they appear in events/world state)
@@ -32,6 +33,53 @@ CONTAINER_RESTART_TRIGGERS = {"containers_not_running", "mqtt_unreachable"}
 # decide explicitly (e.g. adjust Frigate retain policy or purge HA recorder).
 NO_DISK_CLEANUP_NODES = {"chelsty-infra", "chelsty-ha"}

+# ---------------------------------------------------------------------------
+# HA diagnostic event routing (ha-diag-agent events)
+# ---------------------------------------------------------------------------
+
+# ha_websocket_dead: HA WebSocket unresponsive → restart the homeassistant container.
+# Separate from CONTAINER_RESTART_TRIGGERS because these events are routed directly
+# from the events dir (not via the world-state drift loop) to avoid conflicts with
+# the stability-agent's independent container health tracking on the same service key.
+HA_CONTAINER_RESTART_EVENTS = {"ha_websocket_dead"}
+
+# Alert-only events — operator notification, no automated action.
+HA_ALERT_ONLY_EVENTS = {
+    "ha_integration_failed",
+    "ha_entity_unavailable_long",
+    "ha_automation_failing",
+    "ha_update_available",
+    "ha_recorder_lag",
+    "ha_system_health_degraded",
+}
+
+# Stable action-ID suffix for each alert-only type
+_HA_ALERT_ID_SUFFIX = {
+    "ha_integration_failed":       "integration-failed",
+    "ha_entity_unavailable_long":  "entity-unavailable",
+    "ha_automation_failing":       "automation-failing",
+    "ha_update_available":         "update-available",
+    "ha_recorder_lag":             "recorder-lag",
+    "ha_system_health_degraded":   "system-health-degraded",
+}
+
+# 30-min cooldown after a container_restart completes; prevents restart loops
+# when HA repeatedly fails to connect (e.g. bad config, slow startup).
+HA_WEBSOCKET_RESTART_COOLDOWN = 1800
+
+# 1-hour cooldown for alert-only events; avoids repeated Telegram noise for
+# persistent conditions (e.g. an entity that stays unavailable for hours).
+HA_ALERT_COOLDOWN = 3600
+
+# Suppress ha_* events if homeassistant had a containers_not_running incident
+# within this window — HA is in a planned restart/update and alerts would be noise.
+HA_TRANSITION_WINDOW = 300  # 5 minutes
+
+# When True, events that would generate container_restart are downgraded to alert_only
+# with a "[SHADOW MODE]" note. Safe default for initial deployment; set
+# HA_DIAG_SHADOW_MODE=false on the control-plane node when ready for live actions.
+HA_DIAG_SHADOW_MODE = os.getenv("HA_DIAG_SHADOW_MODE", "true").lower() == "true"
+
 # Logging setup
 logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
 logger = logging.getLogger("supervisor")
@@ -41,7 +89,15 @@ class Supervisor:
    def __init__(self):
        self.desired_state = {"services": {}}
        self.actual_state = {"services": {}, "nodes": {}, "incidents": {}}
+        # In-memory set of already-routed HA event IDs; prevents re-processing
+        # on each reconcile cycle. Grows to at most ~hundreds of entries/day.
+        self._ha_processed_event_ids: set = set()
        self._ensure_dirs()
+        logger.info(
+            "shadow_mode=%s — HA container_restart actions %s",
+            HA_DIAG_SHADOW_MODE,
+            "downgraded to alert_only" if HA_DIAG_SHADOW_MODE else "enabled",
+        )

    def _ensure_dirs(self):
        ACTIONS_DIR.mkdir(parents=True, exist_ok=True)
@@ -242,6 +298,12 @@ class Supervisor:
        #    operator can see it was auto-resolved rather than silently dropped.
        self._cancel_resolved_pending_actions()

+        # 5. Route HA diagnostic events emitted by ha-diag-agent.
+        #    Processed directly from the events directory — not via the world-state
+        #    drift loop — to avoid conflicts with stability-agent's independent
+        #    container health tracking for the homeassistant service.
+        self._process_ha_events()
+
    # ------------------------------------------------------------------
    # Recommendation generation
    # ------------------------------------------------------------------
@@ -442,6 +504,247 @@ class Supervisor:
                except Exception as e:
                    logger.error(f"Failed to cancel action {action_file.name}: {e}")

+    # ------------------------------------------------------------------
+    # HA diagnostic event routing
+    # ------------------------------------------------------------------
+
+    def _process_ha_events(self):
+        """Scan the events directory for unprocessed ha_* events and route them."""
+        if not EVENTS_DIR.exists():
+            return
+        for event_file in sorted(EVENTS_DIR.glob("**/*.json")):
+            event_id = event_file.stem
+            if event_id in self._ha_processed_event_ids:
+                continue
+            self._ha_processed_event_ids.add(event_id)
+            try:
+                with open(event_file) as f:
+                    event = json.load(f)
+            except Exception as e:
+                logger.debug(f"Could not read event {event_file}: {e}")
+                continue
+            if not event.get("type", "").startswith("ha_"):
+                continue
+            self._route_ha_event(event)
+
+    def _route_ha_event(self, event: dict):
+        event_type = event.get("type", "")
+        node = event.get("node", "")
+        if not node:
+            return
+
+        if event_type in HA_CONTAINER_RESTART_EVENTS:
+            if self._is_ha_in_transition(node):
+                logger.debug(
+                    f"Suppressing {event_type} on {node}: homeassistant in transition"
+                )
+                return
+            if HA_DIAG_SHADOW_MODE:
+                logger.info(
+                    "shadow_mode: suppressed container_restart for %s", event_type
+                )
+                self._generate_ha_shadow_alert(node, event)
+            else:
+                self._generate_ha_container_restart(node, event)
+
+        elif event_type == "ha_websocket_recovered":
+            self._cancel_ha_container_restart(node)
+
+        elif event_type in HA_ALERT_ONLY_EVENTS:
+            if self._is_ha_in_transition(node):
+                logger.debug(
+                    f"Suppressing {event_type} on {node}: homeassistant in transition"
+                )
+                return
+            self._generate_ha_alert_only(node, event)
+
+    def _is_ha_in_transition(self, node: str) -> bool:
+        """Return True if homeassistant container had a recent containers_not_running incident.
+
+        Suppresses ha_* alerts during planned HA restarts/updates to avoid
+        flooding the operator with secondary diagnostic alerts.
+        """
+        svc_key = f"{node}/homeassistant"
+        svc_info = self.actual_state["services"].get(svc_key, {})
+        incident_id = svc_info.get("incident_id")
+        if not incident_id:
+            return False
+        incident = self.actual_state["incidents"].get(incident_id, {})
+        return (
+            incident.get("status") == "active"
+            and incident.get("trigger_type") == "containers_not_running"
+            and time.time() - (incident.get("last_occurrence") or 0) < HA_TRANSITION_WINDOW
+        )
+
+    def _ha_action_recently_completed(self, action_id: str, cooldown: int) -> bool:
+        """Return True if action completed/rejected/cancelled within the cooldown window."""
+        for state in ("completed", "rejected", "cancelled"):
+            path = ACTIONS_DIR / state / f"{action_id}.json"
+            if path.exists():
+                try:
+                    with open(path) as f:
+                        data = json.load(f)
+                    finished = (
+                        data.get("finished_at")
+                        or data.get("cancelled_at")
+                        or data.get("updated_at")
+                        or 0
+                    )
+                    if time.time() - finished < cooldown:
+                        return True
+                except Exception:
+                    pass
+        return False
+
+    def _generate_ha_container_restart(self, node: str, event: dict):
+        service = "homeassistant"
+        action_id = f"container-restart-{node}-{service}"
+
+        for state in ("pending", "approved", "running"):
+            if (ACTIONS_DIR / state / f"{action_id}.json").exists():
+                logger.debug(f"Skipping {action_id}: already in state '{state}'")
+                return
+
+        if self._ha_action_recently_completed(action_id, HA_WEBSOCKET_RESTART_COOLDOWN):
+            logger.debug(
+                f"Skipping {action_id}: within {HA_WEBSOCKET_RESTART_COOLDOWN}s cooldown"
+            )
+            return
+
+        payload = dict(event.get("payload", {}))
+        payload["reason"] = "ha_websocket_dead"
+        payload["svc_key"] = f"{node}/{service}"
+
+        container_name = self._get_container_name(service)
+        action = {
+            "action_id": action_id,
+            "timestamp": time.time(),
+            "type": "container_restart",
+            "node": node,
+            "service": service,
+            "container_name": container_name,
+            "risk_level": "low",
+            "confidence": 0.9,
+            "description": (
+                f"Restart '{container_name}' on {node}: HA WebSocket unresponsive"
+            ),
+            "status": "pending",
+            "payload": payload,
+        }
+        self._write_pending_action(action)
+
+    def _generate_ha_shadow_alert(self, node: str, event: dict):
+        """Shadow-mode downgrade: emit alert_only instead of container_restart.
+
+        Uses the same action_id and cooldown as the real restart so that
+        cooldown semantics are identical regardless of shadow mode state.
+        """
+        service = "homeassistant"
+        action_id = f"container-restart-{node}-{service}"
+
+        for state in ("pending", "approved", "running"):
+            if (ACTIONS_DIR / state / f"{action_id}.json").exists():
+                logger.debug(f"Skipping {action_id}: already in state '{state}'")
+                return
+
+        if self._ha_action_recently_completed(action_id, HA_WEBSOCKET_RESTART_COOLDOWN):
+            logger.debug(
+                f"Skipping {action_id}: within {HA_WEBSOCKET_RESTART_COOLDOWN}s cooldown"
+            )
+            return
+
+        payload = dict(event.get("payload", {}))
+        payload["reason"] = "ha_websocket_dead"
+        payload["svc_key"] = f"{node}/{service}"
+        payload["shadow_mode"] = True
+
+        action = {
+            "action_id": action_id,
+            "timestamp": time.time(),
+            "type": "alert_only",
+            "node": node,
+            "service": service,
+            "risk_level": "info",
+            "confidence": 0.9,
+            "description": (
+                f"[SHADOW MODE] would have triggered container_restart "
+                f"for {service} on {node}: HA WebSocket unresponsive"
+            ),
+            "status": "pending",
+            "payload": payload,
+        }
+        self._write_pending_action(action)
+
+    def _generate_ha_alert_only(self, node: str, event: dict):
+        event_type = event.get("type", "")
+        suffix = _HA_ALERT_ID_SUFFIX.get(event_type, event_type.replace("_", "-"))
+        action_id = f"alert-ha-{suffix}-{node}"
+
+        for state in ("pending", "approved", "running"):
+            if (ACTIONS_DIR / state / f"{action_id}.json").exists():
+                logger.debug(f"Skipping {action_id}: already in state '{state}'")
+                return
+
+        if self._ha_action_recently_completed(action_id, HA_ALERT_COOLDOWN):
+            logger.debug(
+                f"Skipping {action_id}: within {HA_ALERT_COOLDOWN}s cooldown"
+            )
+            return
+
+        payload = dict(event.get("payload", {}))
+        payload["reason"] = event_type
+
+        action = {
+            "action_id": action_id,
+            "timestamp": time.time(),
+            "type": "alert_only",
+            "node": node,
+            "service": event.get("service", "homeassistant"),
+            "risk_level": "info",
+            "confidence": 1.0,
+            "description": event.get(
+                "message", f"HA diagnostic alert: {event_type} on {node}"
+            ),
+            "status": "pending",
+            "payload": payload,
+        }
+        self._write_pending_action(action)
+
+    def _cancel_ha_container_restart(self, node: str):
+        """Move a pending ha_websocket_dead container_restart to cancelled on recovery."""
+        action_id = f"container-restart-{node}-homeassistant"
+        pending_path = ACTIONS_DIR / "pending" / f"{action_id}.json"
+        if not pending_path.exists():
+            return
+        cancelled_dir = ACTIONS_DIR / "cancelled"
+        cancelled_dir.mkdir(parents=True, exist_ok=True)
+        dest = cancelled_dir / f"{action_id}.json"
+        try:
+            with open(pending_path) as f:
+                action = json.load(f)
+            action["status"] = "cancelled"
+            action["cancelled_reason"] = "ha_websocket_recovered"
+            action["cancelled_at"] = time.time()
+            with open(dest, "w") as f:
+                json.dump(action, f, indent=2)
+            pending_path.unlink()
+            logger.info(f"Cancelled {action_id}: ha_websocket_recovered on {node}")
+        except Exception as e:
+            logger.error(f"Failed to cancel {action_id}: {e}")
+
+    def _write_pending_action(self, action: dict):
+        action_id = action["action_id"]
+        action_path = ACTIONS_DIR / "pending" / f"{action_id}.json"
+        try:
+            with open(action_path, "w") as f:
+                json.dump(action, f, indent=2)
+            logger.info(
+                f"Generated HA action: {action_id} "
+                f"(type={action['type']}, risk={action['risk_level']})"
+            )
+        except Exception as e:
+            logger.error(f"Failed to save action {action_id}: {e}")
+
    def loop(self, interval=30):
        logger.info("Starting supervisor loop")
        while True:
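
Illustrative sketch (not part of this commit): the gate applied in the hunk above before an alert is written checks for an existing action file in an active state, then checks the cooldown. The standalone approximation below shows that ordering. `should_emit` and `recently_completed` are hypothetical names, and measuring the cooldown from a `finished_at` field in `completed/<action_id>.json` is an assumption drawn from the test fixtures, since `_ha_action_recently_completed` itself is not shown in this hunk.

```python
import json
import time
from pathlib import Path
from typing import Optional

# States in which an existing action file blocks a duplicate (as in the hunk).
ACTIVE_STATES = ("pending", "approved", "running")


def recently_completed(actions_dir: Path, action_id: str,
                       cooldown_s: float, now: Optional[float] = None) -> bool:
    """Return True if completed/<action_id>.json finished within cooldown_s.

    Assumption: the completed action carries a numeric `finished_at` field.
    """
    path = actions_dir / "completed" / f"{action_id}.json"
    if not path.exists():
        return False
    try:
        finished_at = float(json.loads(path.read_text()).get("finished_at", 0))
    except (OSError, ValueError, TypeError, AttributeError):
        return False  # unreadable or malformed file: treat as no cooldown
    current = time.time() if now is None else now
    return (current - finished_at) < cooldown_s


def should_emit(actions_dir: Path, action_id: str,
                cooldown_s: float, now: Optional[float] = None) -> bool:
    """Dedup first (any active-state file), then the cooldown check."""
    if any((actions_dir / s / f"{action_id}.json").exists() for s in ACTIVE_STATES):
        return False
    return not recently_completed(actions_dir, action_id, cooldown_s, now)
```

Passing `now` explicitly keeps the check deterministic in tests; production code would leave it unset and fall back to `time.time()`.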
--- a/services/control-plane/tests/__init__.py
+++ b/services/control-plane/tests/__init__.py
--- a/services/control-plane/tests/test_supervisor_ha.py
+++ b/services/control-plane/tests/test_supervisor_ha.py
@@ -0,0 +1,395 @@
+"""Tests for HA diagnostic event routing in the supervisor."""
+from __future__ import annotations
+
+import json
+import sys
+import time
+from pathlib import Path
+
+import pytest
+
+# Add src/ to path so we can import supervisor without installing
+sys.path.insert(0, str(Path(__file__).parent.parent / "src"))
+import supervisor as supervisor_module
+from supervisor import Supervisor
+
+
+# ---------------------------------------------------------------------------
+# Helpers
+# ---------------------------------------------------------------------------
+
+def _make_event(event_type: str, node: str = "chelsty-ha", service: str = "homeassistant",
+                payload: dict | None = None, message: str = "") -> dict:
+    return {
+        "id": f"evt-{node}-{int(time.time())}-{event_type}-{service}-1",
+        "type": event_type,
+        "node": node,
+        "service": service,
+        "severity": "warning",
+        "timestamp": int(time.time()),
+        "message": message or f"Test event: {event_type}",
+        "payload": payload or {"location_tag": "chelsty"},
+    }
+
+
+def _write_event(events_dir: Path, event: dict) -> Path:
+    path = events_dir / f"{event['id']}.json"
+    path.write_text(json.dumps(event))
+    return path
+
+
+def _setup_supervisor(tmp_path: Path, monkeypatch) -> Supervisor:
+    """Return a Supervisor instance with all paths redirected to tmp_path."""
+    actions = tmp_path / "actions"
+    events = tmp_path / "events"
+    world = tmp_path / "world"
+    repo = tmp_path / "repo"
+    state = tmp_path / "state"
+
+    for d in (actions, events, world, repo / "hosts", state):
+        d.mkdir(parents=True, exist_ok=True)
+
+    monkeypatch.setattr(supervisor_module, "ACTIONS_DIR", actions)
+    monkeypatch.setattr(supervisor_module, "EVENTS_DIR", events)
+    monkeypatch.setattr(supervisor_module, "WORLD_DIR", world)
+    monkeypatch.setattr(supervisor_module, "REPO_ROOT", repo)
+
+    sup = Supervisor()
+    # Empty desired/actual state so reconcile drift loop is a no-op
+    sup.desired_state = {"services": {}}
+    sup.actual_state = {"services": {}, "nodes": {}, "incidents": {}}
+    return sup
+
+
+def _pending(tmp_path: Path, action_id: str) -> Path:
+    return tmp_path / "actions" / "pending" / f"{action_id}.json"
+
+
+def _read_action(tmp_path: Path, state: str, action_id: str) -> dict:
+    return json.loads((tmp_path / "actions" / state / f"{action_id}.json").read_text())
+
+
+# ---------------------------------------------------------------------------
+# 1. Each event type → correct action type
+# ---------------------------------------------------------------------------
+
+def test_ha_websocket_dead_generates_container_restart(tmp_path, monkeypatch):
+    monkeypatch.setattr(supervisor_module, "HA_DIAG_SHADOW_MODE", False)
+    sup = _setup_supervisor(tmp_path, monkeypatch)
+    events_dir = tmp_path / "events"
+    _write_event(events_dir, _make_event("ha_websocket_dead"))
+
+    sup._process_ha_events()
+
+    action_id = "container-restart-chelsty-ha-homeassistant"
+    assert _pending(tmp_path, action_id).exists()
+    action = _read_action(tmp_path, "pending", action_id)
+    assert action["type"] == "container_restart"
+    assert action["service"] == "homeassistant"
+    assert action["node"] == "chelsty-ha"
+
+
+@pytest.mark.parametrize("event_type,expected_suffix", [
+    ("ha_integration_failed",      "integration-failed"),
+    ("ha_entity_unavailable_long", "entity-unavailable"),
+    ("ha_automation_failing",      "automation-failing"),
+    ("ha_update_available",        "update-available"),
+    ("ha_recorder_lag",            "recorder-lag"),
+    ("ha_system_health_degraded",  "system-health-degraded"),
+])
+def test_alert_only_events_generate_alert_actions(
+    tmp_path, monkeypatch, event_type, expected_suffix
+):
+    sup = _setup_supervisor(tmp_path, monkeypatch)
+    _write_event(tmp_path / "events", _make_event(event_type))
+
+    sup._process_ha_events()
+
+    action_id = f"alert-ha-{expected_suffix}-chelsty-ha"
+    assert _pending(tmp_path, action_id).exists(), f"No pending action for {event_type}"
+    action = _read_action(tmp_path, "pending", action_id)
+    assert action["type"] == "alert_only"
+    assert action["node"] == "chelsty-ha"
+
+
+# ---------------------------------------------------------------------------
+# 2. Transition suppression
+# ---------------------------------------------------------------------------
+
+def test_ha_websocket_dead_suppressed_during_transition(tmp_path, monkeypatch):
+    sup = _setup_supervisor(tmp_path, monkeypatch)
+
+    # Set up world state: homeassistant has an active containers_not_running incident
+    inc_id = "inc-123-chelsty-ha-homeassistant"
+    sup.actual_state["services"]["chelsty-ha/homeassistant"] = {
+        "node": "chelsty-ha", "service": "homeassistant",
+        "status": "unhealthy", "incident_id": inc_id,
+    }
+    sup.actual_state["incidents"][inc_id] = {
+        "id": inc_id, "status": "active",
+        "trigger_type": "containers_not_running",
+        "last_occurrence": time.time() - 60,  # 1 min ago — within 5-min window
+    }
+
+    _write_event(tmp_path / "events", _make_event("ha_websocket_dead"))
+    sup._process_ha_events()
+
+    action_id = "container-restart-chelsty-ha-homeassistant"
+    assert not _pending(tmp_path, action_id).exists(), "Action should be suppressed during transition"
+
+
+def test_ha_alert_suppressed_during_transition(tmp_path, monkeypatch):
+    sup = _setup_supervisor(tmp_path, monkeypatch)
+
+    inc_id = "inc-456-chelsty-ha-homeassistant"
+    sup.actual_state["services"]["chelsty-ha/homeassistant"] = {
+        "node": "chelsty-ha", "service": "homeassistant",
+        "status": "unhealthy", "incident_id": inc_id,
+    }
+    sup.actual_state["incidents"][inc_id] = {
+        "id": inc_id, "status": "active",
+        "trigger_type": "containers_not_running",
+        "last_occurrence": time.time() - 30,
+    }
+
+    for event_type in supervisor_module.HA_ALERT_ONLY_EVENTS:
+        _write_event(tmp_path / "events", _make_event(event_type))
+
+    sup._process_ha_events()
+
+    for suffix in supervisor_module._HA_ALERT_ID_SUFFIX.values():
+        action_id = f"alert-ha-{suffix}-chelsty-ha"
+        assert not _pending(tmp_path, action_id).exists(), \
+            f"{action_id} should be suppressed"
+
+
+def test_transition_suppression_expires_after_window(tmp_path, monkeypatch):
+    """After 5 min, transition window expires and events are routed normally."""
+    sup = _setup_supervisor(tmp_path, monkeypatch)
+
+    inc_id = "inc-789-chelsty-ha-homeassistant"
+    sup.actual_state["services"]["chelsty-ha/homeassistant"] = {
+        "node": "chelsty-ha", "service": "homeassistant",
+        "status": "unhealthy", "incident_id": inc_id,
+    }
+    sup.actual_state["incidents"][inc_id] = {
+        "id": inc_id, "status": "active",
+        "trigger_type": "containers_not_running",
+        "last_occurrence": time.time() - 400,  # 6.7 min ago — outside window
+    }
+
+    _write_event(tmp_path / "events", _make_event("ha_websocket_dead"))
+    sup._process_ha_events()
+
+    action_id = "container-restart-chelsty-ha-homeassistant"
+    assert _pending(tmp_path, action_id).exists(), "Should not be suppressed after window"
+
+
+# ---------------------------------------------------------------------------
+# 3. Recovery cancellation
+# ---------------------------------------------------------------------------
+
+def test_ha_websocket_recovered_cancels_pending_restart(tmp_path, monkeypatch):
+    sup = _setup_supervisor(tmp_path, monkeypatch)
+    events_dir = tmp_path / "events"
+    actions = tmp_path / "actions"
+    (actions / "cancelled").mkdir(parents=True, exist_ok=True)
+
+    # Pre-create a pending container_restart for homeassistant
+    action_id = "container-restart-chelsty-ha-homeassistant"
+    pending_action = {
+        "action_id": action_id, "type": "container_restart",
+        "node": "chelsty-ha", "service": "homeassistant",
+        "status": "pending", "timestamp": time.time(),
+    }
+    _pending(tmp_path, action_id).write_text(json.dumps(pending_action))
+
+    _write_event(events_dir, _make_event("ha_websocket_recovered"))
+    sup._process_ha_events()
+
+    assert not _pending(tmp_path, action_id).exists(), "Pending action should be cancelled"
+    cancelled = actions / "cancelled" / f"{action_id}.json"
+    assert cancelled.exists()
+    data = json.loads(cancelled.read_text())
+    assert data["cancelled_reason"] == "ha_websocket_recovered"
+
+
+def test_ha_websocket_recovered_no_pending_action_is_noop(tmp_path, monkeypatch):
+    """Recovery event when no pending restart exists must not raise."""
+    sup = _setup_supervisor(tmp_path, monkeypatch)
+    _write_event(tmp_path / "events", _make_event("ha_websocket_recovered"))
+    sup._process_ha_events()  # should not raise
+
+
+# ---------------------------------------------------------------------------
+# 4. Cooldown
+# ---------------------------------------------------------------------------
+
+def test_ha_websocket_dead_cooldown_prevents_second_restart(tmp_path, monkeypatch):
+    """Two ha_websocket_dead events within 30 min → only one container_restart."""
+    sup = _setup_supervisor(tmp_path, monkeypatch)
+    events_dir = tmp_path / "events"
+    actions = tmp_path / "actions"
+    (actions / "completed").mkdir(parents=True, exist_ok=True)
+
+    # First event → action generated
+    _write_event(events_dir, _make_event("ha_websocket_dead", service="homeassistant"))
+    sup._process_ha_events()
+
+    action_id = "container-restart-chelsty-ha-homeassistant"
+    assert _pending(tmp_path, action_id).exists()
+
+    # Simulate: action completed recently (< 30 min ago)
+    action_data = json.loads(_pending(tmp_path, action_id).read_text())
+    action_data["status"] = "completed"
+    action_data["finished_at"] = time.time() - 60  # 1 min ago
+    (actions / "completed" / f"{action_id}.json").write_text(json.dumps(action_data))
+    _pending(tmp_path, action_id).unlink()
+
+    # Second event — should be suppressed by cooldown
+    event2 = _make_event("ha_websocket_dead", service="homeassistant")
+    event2["id"] = event2["id"] + "-2"  # different event ID
+    _write_event(events_dir, event2)
+    sup._process_ha_events()
+
+    assert not _pending(tmp_path, action_id).exists(), "Second restart within cooldown should be suppressed"
+
+
+def test_ha_websocket_dead_cooldown_expires(tmp_path, monkeypatch):
+    """After cooldown expires, a new ha_websocket_dead should generate an action."""
+    sup = _setup_supervisor(tmp_path, monkeypatch)
+    events_dir = tmp_path / "events"
+    actions = tmp_path / "actions"
+    (actions / "completed").mkdir(parents=True, exist_ok=True)
+
+    action_id = "container-restart-chelsty-ha-homeassistant"
+    # Pre-populate completed action with timestamp > 30 min ago
+    old_action = {
+        "action_id": action_id, "type": "container_restart",
+        "status": "completed", "finished_at": time.time() - 3700,  # ~62 min ago, past the 30-min cooldown
+    }
+    (actions / "completed" / f"{action_id}.json").write_text(json.dumps(old_action))
+
+    _write_event(events_dir, _make_event("ha_websocket_dead"))
+    sup._process_ha_events()
+
+    assert _pending(tmp_path, action_id).exists(), "Should generate new restart after cooldown"
+
+
+# ---------------------------------------------------------------------------
+# 5. Location tag preserved
+# ---------------------------------------------------------------------------
+
+def test_location_tag_preserved_in_container_restart_payload(tmp_path, monkeypatch):
+    sup = _setup_supervisor(tmp_path, monkeypatch)
+    _write_event(tmp_path / "events",
+                 _make_event("ha_websocket_dead", payload={"location_tag": "chelsty", "extra": "data"}))
+
+    sup._process_ha_events()
+
+    action = _read_action(tmp_path, "pending", "container-restart-chelsty-ha-homeassistant")
+    assert action["payload"]["location_tag"] == "chelsty"
+
+
+def test_location_tag_preserved_in_alert_only_payload(tmp_path, monkeypatch):
+    sup = _setup_supervisor(tmp_path, monkeypatch)
+    _write_event(tmp_path / "events",
+                 _make_event("ha_entity_unavailable_long",
+                             payload={"location_tag": "ken", "count": 3}))
+
+    sup._process_ha_events()
+
+    action = _read_action(tmp_path, "pending", "alert-ha-entity-unavailable-chelsty-ha")
+    assert action["payload"]["location_tag"] == "ken"
+
+
+# ---------------------------------------------------------------------------
+# 6. Dedup — same alert type twice → only one pending action
+# ---------------------------------------------------------------------------
+
+def test_alert_only_dedup_second_event_skipped(tmp_path, monkeypatch):
+    sup = _setup_supervisor(tmp_path, monkeypatch)
+    events_dir = tmp_path / "events"
+
+    event1 = _make_event("ha_entity_unavailable_long")
+    event2 = _make_event("ha_entity_unavailable_long")
+    event2["id"] = event2["id"] + "-2"
+    _write_event(events_dir, event1)
+    _write_event(events_dir, event2)
+
+    sup._process_ha_events()
+
+    action_id = "alert-ha-entity-unavailable-chelsty-ha"
+    assert _pending(tmp_path, action_id).exists()
+    # Only one file — not duplicated
+    pending_files = list((tmp_path / "actions" / "pending").glob("alert-ha-entity-unavailable*.json"))
+    assert len(pending_files) == 1
+
+
+# ---------------------------------------------------------------------------
+# 7. Shadow mode
+# ---------------------------------------------------------------------------
+
+def test_shadow_mode_websocket_dead_generates_alert_not_restart(tmp_path, monkeypatch):
+    """shadow_mode=True: ha_websocket_dead → alert_only with [SHADOW MODE], not container_restart."""
+    monkeypatch.setattr(supervisor_module, "HA_DIAG_SHADOW_MODE", True)
+    sup = _setup_supervisor(tmp_path, monkeypatch)
+    _write_event(tmp_path / "events", _make_event("ha_websocket_dead"))
+
+    sup._process_ha_events()
+
+    action_id = "container-restart-chelsty-ha-homeassistant"
+    assert _pending(tmp_path, action_id).exists(), "Shadow alert should be written"
+    action = _read_action(tmp_path, "pending", action_id)
+    assert action["type"] == "alert_only"
+    assert "[SHADOW MODE]" in action["description"]
+    assert action["payload"].get("shadow_mode") is True
+
+
+def test_no_shadow_mode_websocket_dead_generates_container_restart(tmp_path, monkeypatch):
+    """shadow_mode=False: ha_websocket_dead → container_restart (normal path)."""
+    monkeypatch.setattr(supervisor_module, "HA_DIAG_SHADOW_MODE", False)
+    sup = _setup_supervisor(tmp_path, monkeypatch)
+    _write_event(tmp_path / "events", _make_event("ha_websocket_dead"))
+
+    sup._process_ha_events()
+
+    action_id = "container-restart-chelsty-ha-homeassistant"
+    assert _pending(tmp_path, action_id).exists()
+    action = _read_action(tmp_path, "pending", action_id)
+    assert action["type"] == "container_restart"
+    assert "[SHADOW MODE]" not in action["description"]
+
+
+def test_shadow_mode_alert_only_events_unaffected(tmp_path, monkeypatch):
+    """shadow_mode=True: alert-only events (ha_entity_unavailable_long) are still routed normally."""
+    monkeypatch.setattr(supervisor_module, "HA_DIAG_SHADOW_MODE", True)
+    sup = _setup_supervisor(tmp_path, monkeypatch)
+    _write_event(tmp_path / "events", _make_event("ha_entity_unavailable_long"))
+
+    sup._process_ha_events()
+
+    action_id = "alert-ha-entity-unavailable-chelsty-ha"
+    assert _pending(tmp_path, action_id).exists()
+    action = _read_action(tmp_path, "pending", action_id)
+    assert action["type"] == "alert_only"
+    assert "[SHADOW MODE]" not in action["description"]
+
+
+# ---------------------------------------------------------------------------
+# 8. Non-HA events are ignored
+# ---------------------------------------------------------------------------
+
+def test_non_ha_events_not_routed(tmp_path, monkeypatch):
+    sup = _setup_supervisor(tmp_path, monkeypatch)
+    events_dir = tmp_path / "events"
+
+    for etype in ("service_unhealthy", "containers_not_running", "node_online", "deployment_failed"):
+        # _make_event already sets "type" from its first argument
+        _write_event(events_dir, _make_event(etype, service="mosquitto"))
+
+    sup._process_ha_events()
+
+    pending_files = list((tmp_path / "actions" / "pending").glob("*.json"))
+    assert pending_files == [], "Non-HA events should not generate actions via HA path"
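
Illustrative sketch (not part of this commit): the transition-suppression tests in section 2 above imply a predicate along the following lines. `in_transition` is a hypothetical name. The conditions it checks (an active incident with trigger type `containers_not_running` whose `last_occurrence` is less than 5 minutes old) are inferred from the fixtures and comments in those tests, so the supervisor's actual logic may differ.

```python
import time
from typing import Optional

TRANSITION_WINDOW_S = 300  # 5-minute window, taken from the test comments


def in_transition(actual_state: dict, node: str, service: str,
                  now: Optional[float] = None,
                  window_s: float = TRANSITION_WINDOW_S) -> bool:
    """Return True while the service's container incident is recent."""
    svc = actual_state.get("services", {}).get(f"{node}/{service}", {})
    incident = actual_state.get("incidents", {}).get(svc.get("incident_id"), {})
    if incident.get("status") != "active":
        return False
    if incident.get("trigger_type") != "containers_not_running":
        return False
    current = time.time() if now is None else now
    return (current - incident.get("last_occurrence", 0)) < window_s
```

With a fixture like the ones above, an incident last seen 60 s ago suppresses HA events, and one last seen 400 s ago does not.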
--- a/services/ha-diag-agent/DEPLOY.md
+++ b/services/ha-diag-agent/DEPLOY.md
@@ -0,0 +1,239 @@
+# ha-diag-agent Deployment Guide
+
+## Section 1: Prerequisites
+
+### HA long-lived access token
+
+The agent authenticates to Home Assistant with a long-lived token issued by a
+dedicated service account. Do not use a personal admin token.
+
+1. In HA: **Settings → People → Add Person**
+   - Name: `diag_agent`
+   - Do **not** add to any group (no admin rights needed)
+2. Log in to HA as `diag_agent`
+3. Go to **Profile → Long-Lived Access Tokens → Create token**
+   - Name: `ha-diag-agent`
+   - Copy the token — it is shown only once
+4. Store the token in the node's `.env` file (see Section 2)
+
+### Tailnet reachability check (chelsty-infra only)
+
+`chelsty-infra` reaches Home Assistant on `chelsty-ha` over Tailscale.
+Verify before deploying:
+
+```bash
+curl -sf http://100.70.180.90:8123/api/ \
+  -H "Authorization: Bearer <token>" | python3 -m json.tool
+# Expect: {"message": "API running."}
+```
+
+If the request times out, check that both nodes are on the Tailscale mesh
+(`tailscale status`) and that `chelsty-ha` is powered on.
+
+---
+
+## Section 2: Per-host config
+
+Create `/opt/homelab/config/ha-diag-agent/.env` on **each target node**:
+
+### piha
+
+```bash
+mkdir -p /opt/homelab/config/ha-diag-agent
+cat > /opt/homelab/config/ha-diag-agent/.env << 'EOF'
+HA_URL=http://localhost:8123
+HA_TOKEN=<long-lived-token-for-piha>
+NODE_NAME=piha
+LOCATION_TAG=ken
+CHECK_INTERVAL=60
+CHECK_INTERVAL_UNAVAILABLE=3600
+UNAVAILABLE_THRESHOLD_HOURS=24
+ALERT_COOLDOWN_HOURS=6
+LOG_LEVEL=info
+EOF
+chmod 600 /opt/homelab/config/ha-diag-agent/.env
+```
+
+### chelsty-infra
+
+```bash
+mkdir -p /opt/homelab/config/ha-diag-agent
+cat > /opt/homelab/config/ha-diag-agent/.env << 'EOF'
+HA_URL=http://100.70.180.90:8123
+HA_TOKEN=<long-lived-token-for-chelsty-ha>
+NODE_NAME=chelsty-infra
+LOCATION_TAG=chelsty
+CHECK_INTERVAL=60
+CHECK_INTERVAL_UNAVAILABLE=3600
+UNAVAILABLE_THRESHOLD_HOURS=24
+ALERT_COOLDOWN_HOURS=6
+LOG_LEVEL=info
+EOF
+chmod 600 /opt/homelab/config/ha-diag-agent/.env
+```
+
+> If `chelsty-ha` gets a new Tailscale IP, update `HA_URL` in this file and
+> restart the container.
+
+---
+
+## Section 3: Deploy procedure
+
+### From SATURN (standard flow)
+
+```bash
+# 1. Commit and push changes from SATURN
+git push
+
+# 2. SSH to target node
+ssh oskar@piha            # or chelsty-infra
+
+# 3. Pull latest and deploy
+cd ~/homelab-codex-ws
+git pull
+scripts/deploy/deploy.sh --service ha-diag-agent
+```
+
+### chelsty-infra (docker-compose v1)
+
+`chelsty-infra` runs docker-compose v1 (1.29.2). The deploy script calls
+`docker-compose` (hyphenated), which is correct. If you need to run manually:
+
+```bash
+cd ~/homelab-codex-ws/services/ha-diag-agent
+docker-compose up -d --build
+```
+
+---
+
+## Section 4: Verification
+
+```bash
+# Container is up
+docker ps | grep ha-diag-agent
+
+# Last 50 log lines
+docker logs ha-diag-agent --tail 50
+
+# FastAPI health endpoint
+curl http://localhost:8087/health
+# Expect: {"status": "ok", "ws_connected": true, ...}
+
+# Events are being written
+ls /opt/homelab/events/<node-name>/
+# Expect: ha_*.json files appearing within the first CHECK_INTERVAL seconds
+
+# Supervisor is picking up events (check on VPS / control-plane)
+tail -f /opt/homelab/logs/supervisor.log | grep ha_
+```
+
+---
+
+## Section 5: First-48h observation (shadow mode)
+
+The supervisor starts with `HA_DIAG_SHADOW_MODE=true` (default). During this
+window, `ha_websocket_dead` events are downgraded to `alert_only` actions
+tagged `[SHADOW MODE]` rather than triggering an automatic restart.
+
+Watch for these signals in Telegram:
+
+- `[SHADOW MODE] would have triggered container_restart for homeassistant` —
+  confirms the detection path works end-to-end
+- `ha_entity_unavailable_long` / `ha_integration_failed` / etc. — these are
+  always `alert_only` regardless of shadow mode; verify descriptions look
+  accurate and thresholds are reasonable
+
+Things to evaluate:
+
+| Question | Good sign |
+|----------|-----------|
+| Are shadow alerts firing at reasonable frequency? | ≤ 1 per 30 min per node |
+| Are there false positives? | No alerts during known-good uptime |
+| Are entity-unavailable alerts describing real entities? | Yes, names match HA UI |
+| Are integration-failed alerts genuine? | Yes, not noise from startup |
+
+Note any false positives or noisy thresholds before enabling production mode.
+
+---
+
+## Section 6: Enabling production mode
+
+`HA_DIAG_SHADOW_MODE` is an environment variable read by the supervisor
+container. The VPS supervisor env vars live in the version-controlled
+override file at `hosts/vps/runtime/control-plane/docker-compose.override.yml`
+(not in a runtime `.env` file — the supervisor has no `env_file:` directive).
+
+When the 48h observation period looks clean:
+
+**1. Edit the override file on SATURN:**
+
+```yaml
+# hosts/vps/runtime/control-plane/docker-compose.override.yml
+services:
+  supervisor:
+    environment:
+      - NODE_ALIAS_MAP={"node-2":"chelsty"}
+      - HA_DIAG_SHADOW_MODE=false      # add this line
+```
+
+**2. Commit and push from SATURN:**
+
+```bash
+git add hosts/vps/runtime/control-plane/docker-compose.override.yml
+git commit -m "feat(control-plane): disable HA shadow mode — production ready"
+git push
+```
+
+**3. Apply on VPS:**
+
+```bash
+ssh oskar@100.95.58.48
+cd ~/homelab-codex-ws && git pull
+docker compose \
+  -f services/control-plane/docker-compose.yml \
+  -f hosts/vps/runtime/control-plane/docker-compose.override.yml \
+  up -d supervisor
+```
+
+**4. Confirm:**
+
+```bash
+docker logs control-plane-supervisor --tail 5
+# Expect: shadow_mode=False — HA container_restart actions enabled
+```
+
+From this point, the next `ha_websocket_dead` event will generate a
+`container_restart` action in the approval queue. The 30-minute cooldown
+still applies after each restart.
+
+---
+
+## Section 7: Rollback
+
+If production mode causes unexpected behaviour:
+
+```bash
+# Option A — re-enable shadow mode
+# On SATURN: edit hosts/vps/runtime/control-plane/docker-compose.override.yml
+# Set HA_DIAG_SHADOW_MODE=true (or remove the line — default is true)
+# Commit, push, then on VPS:
+ssh oskar@100.95.58.48
+cd ~/homelab-codex-ws && git pull
+docker compose \
+  -f services/control-plane/docker-compose.yml \
+  -f hosts/vps/runtime/control-plane/docker-compose.override.yml \
+  up -d supervisor
+
+# Option B — stop ha-diag-agent entirely on affected nodes
+ssh oskar@<node>
+docker stop ha-diag-agent
+
+# Events written before rollback remain in /opt/homelab/events/<node>/ as
+# history only. The supervisor will not act on them again, because their
+# IDs are already recorded in _ha_processed_event_ids.
+```
+
+Any `container_restart` actions still in `pending/` after rollback can be
+manually rejected via the Telegram bot or by deleting the action files from
+`/opt/homelab/actions/pending/` on the VPS.
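
Illustrative sketch (not part of this commit): the shadow-mode behaviour described in Sections 5 and 6 can be approximated as a flag parser plus a downgrade step. `parse_shadow_flag` and `apply_shadow_mode` are hypothetical names, and the parsing rule is an assumption because this guide does not show how the supervisor reads `HA_DIAG_SHADOW_MODE`. The downgrade itself (same `action_id`, type `alert_only`, a `[SHADOW MODE]` description, `shadow_mode` set in the payload) follows the tests and the Telegram wording quoted above.

```python
import copy
from typing import Optional

# Only these values switch shadow mode off; anything else keeps it on.
_FALSY = {"0", "false", "no", "off"}


def parse_shadow_flag(raw: Optional[str]) -> bool:
    """Parse HA_DIAG_SHADOW_MODE; unset or empty means the default (true)."""
    if raw is None or not raw.strip():
        return True
    return raw.strip().lower() not in _FALSY


def apply_shadow_mode(action: dict, shadow: bool) -> dict:
    """Downgrade a container_restart to alert_only when shadow mode is on."""
    if not shadow or action.get("type") != "container_restart":
        return action
    downgraded = copy.deepcopy(action)  # leave the caller's dict untouched
    downgraded["type"] = "alert_only"
    downgraded["description"] = (
        "[SHADOW MODE] would have triggered container_restart "
        f"for {action.get('service', 'unknown')}"
    )
    downgraded.setdefault("payload", {})["shadow_mode"] = True
    return downgraded
```

Treating unrecognized values as "on" is a deliberate fail-safe in this sketch: a typo in the override file leaves the system in observation mode rather than enabling restarts.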
--- a/services/ha-diag-agent/Dockerfile
+++ b/services/ha-diag-agent/Dockerfile
@@ -0,0 +1,13 @@
+FROM python:3.11-slim
+
+WORKDIR /app
+
+COPY pyproject.toml .
+RUN mkdir -p src/ha_diag && touch src/ha_diag/__init__.py && \
+    pip install --no-cache-dir -e .
+
+COPY src/ src/
+
+ENV PYTHONUNBUFFERED=1
+
+CMD ["python", "-m", "ha_diag.main"]
--- a/services/ha-diag-agent/README.md
+++ b/services/ha-diag-agent/README.md
@@ -0,0 +1,131 @@
+# ha-diag-agent
+
+Per-host Home Assistant diagnostic agent. It polls the HA REST API on a
+schedule, keeps a WebSocket subscription open as a liveness signal, emits
+structured events to `/opt/homelab/events/<node>/`, and exposes an HTTP API
+for health checks and manual check triggers.
+
+Follows the same event-pipeline pattern as `node-agent`: filesystem-first,
+no direct supervisor integration, events processed by the VPS observer.
+
+## Architecture
+
+```
+APScheduler (interval-based REST checks)
+  ├─ HeartbeatCheck            → pings /api/, emits ha_websocket_dead on failure
+  ├─ UnavailableEntitiesCheck  → entity unavailable > threshold
+  ├─ SystemHealthCheck         → /api/system_health per-integration status
+  ├─ AutomationFailuresCheck   → automation last-run error traces
+  └─ UpdatesAvailableCheck     → pending HA/integration updates
+
+WebSocketMonitor (persistent, long-running — Phase 4b)
+  └─ Maintains a live WS subscription to state_changed events
+     Any traffic = HA is alive. Watchdog fires ha_websocket_dead on
+     silence > 5min or on disconnect. Emits ha_websocket_recovered
+     when the connection is restored after a dead alert.
+
+FastAPI (port 8087)
+  GET  /health             → liveness probe (includes ws_connected field)
+  POST /trigger/<check>    → run a named check on demand
+
+SQLite (/data/ha_diag.db)
+  entity_baseline          → last-known entity states
+  check_history            → per-check run log
+  alerts_sent              → dedup gate for alert events
+```
+
+The WebSocketMonitor is the only persistent-connection component; all other
+checks are APScheduler intervals (stateless REST polls).
+
+## Event Types
+
+| Type | Severity | Trigger |
+|------|----------|---------|
+| `ha_websocket_dead` | error | WS disconnect, silence > 5min, or /api/ unreachable |
+| `ha_websocket_recovered` | info | WS reconnected after a dead alert (clears incident) |
+| `ha_integration_failed` | error | Integration in error state |
+| `ha_entity_unavailable_long` | warning | Entity unavailable > threshold |
+| `ha_automation_failing` | warning | Automation last run errored |
+| `ha_update_available` | info | HA or integration update pending |
+| `ha_recorder_lag` | warning | Recorder write lag > threshold |
+| `ha_system_health_degraded` | warning | System health check failed |
+
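+Event routing in the supervisor (Phase 5) maps `ha_websocket_dead` to a
+`container_restart` action (downgraded to `alert_only` while shadow mode is on)
+and every other alert event to an `alert_only` action.
+`ha_websocket_recovered` cancels a still-pending `container_restart` for the node.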
+
+## First-time deployment
+
+See **[DEPLOY.md](DEPLOY.md)** for the full procedure: HA token creation,
+per-host `.env` config, deploy commands, verification steps, 48h shadow-mode
+observation, and rollback.
+
+**Shadow mode** (`HA_DIAG_SHADOW_MODE`, default `true` on the control-plane):
+`ha_websocket_dead` events are downgraded to `alert_only` with a `[SHADOW MODE]`
+note instead of queuing an automatic `container_restart`. When ready for live
+actions, set it to `false` in
+`hosts/vps/runtime/control-plane/docker-compose.override.yml` (the supervisor
+has no `env_file:` directive; see DEPLOY.md, Section 6).
+
+## Deployment model
+
+The agent is deployed **per-host** but targets a potentially remote HA instance:
+
+| Node | Agent runs on | HA lives on | HA URL |
+|------|--------------|-------------|--------|
+| piha | piha | piha (localhost) | `http://localhost:8123` |
+| chelsty-infra | chelsty-infra | chelsty-ha (HAOS VM, separate machine) | `http://100.70.180.90:8123` |
+
+**chelsty-infra note:** Home Assistant runs on `chelsty-ha`, a dedicated Home Assistant
+OS VM. `chelsty-infra` is the hypervisor but does not run HA itself. The agent on
+`chelsty-infra` reaches HA over the Tailscale network (`100.70.180.90:8123`). If `chelsty-ha`
+gets a new Tailscale IP, update `HA_URL` in `/opt/homelab/config/ha-diag-agent/.env` on
+`chelsty-infra`.
+
+## Deployment
+
+```bash
+# 1. Create config on target node
+ssh oskar@<node-ip>
+mkdir -p /opt/homelab/config/ha-diag-agent /var/lib/ha-diag-agent
+cat > /opt/homelab/config/ha-diag-agent/.env << 'EOF'
+HA_URL=http://homeassistant.local:8123   # or http://100.70.180.90:8123 for chelsty-infra
+HA_TOKEN=<long-lived-token>
+NODE_NAME=piha                           # or chelsty-infra
+LOCATION_TAG=ken                         # or chelsty
+CHECK_INTERVAL=60
+EOF
+
+# 2. Deploy
+scripts/deploy/deploy.sh --service ha-diag-agent
+
+# 3. Verify
+docker ps --filter name=ha-diag-agent
+curl http://localhost:8087/health
+```
+
+### chelsty-infra note
+
+`chelsty-infra` runs docker-compose v1 (1.29.2). Use `docker-compose` (hyphenated):
+```bash
+docker-compose -f docker-compose.yml up -d --build
+```
+
+### HA long-lived token
+
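+In the HA UI: Profile → Long-Lived Access Tokens → Create token. Issue it from
+a dedicated non-admin account, not a personal admin account (see DEPLOY.md,
+Section 1).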
+
+## Running Tests
+
+```bash
+cd services/ha-diag-agent
+pip install -e ".[dev]"
+pytest tests/ -v
+```
+
+## Optional YAML config
+
+Place `/opt/homelab/config/ha-diag-agent/ha-diag-agent.yaml` on the node.
+Values there are defaults; env vars take priority.
+
+```yaml
+ha_url: http://homeassistant.local:8123
+location_tag: ken
+check_interval: 60
+```
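
Illustrative sketch (not part of this commit): the precedence rule stated above, where YAML values are defaults and env vars take priority, amounts to a per-field lookup. `resolve_config` and `FIELDS` are hypothetical names; the agent's real loader is not shown in this README and may be built differently. The YAML file is represented here as an already-parsed dict so the example needs no third-party imports.

```python
import os
from typing import Mapping, Optional

# Field name -> type, mirroring the three keys in the YAML example above.
FIELDS = {"ha_url": str, "location_tag": str, "check_interval": int}


def resolve_config(yaml_defaults: Mapping[str, object],
                   environ: Optional[Mapping[str, str]] = None) -> dict:
    """Merge settings: an env var (upper-cased name) overrides the YAML value."""
    env = os.environ if environ is None else environ
    resolved = {}
    for name, cast in FIELDS.items():
        raw = env.get(name.upper())
        if raw is not None and raw != "":
            resolved[name] = cast(raw)                  # env var wins
        elif name in yaml_defaults:
            resolved[name] = cast(yaml_defaults[name])  # fall back to YAML
    return resolved
```

For example, with the YAML above and `CHECK_INTERVAL=30` in the environment, `check_interval` resolves to `30` while `ha_url` and `location_tag` keep their YAML values.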
--- a/services/ha-diag-agent/docker-compose.yml
+++ b/services/ha-diag-agent/docker-compose.yml
@@ -0,0 +1,30 @@
+services:
+  ha-diag-agent:
+    build: .
+    container_name: ha-diag-agent
+    restart: unless-stopped
+
+    env_file:
+      - /opt/homelab/config/ha-diag-agent/.env
+
+    ports:
+      - "8087:8087"
+
+    volumes:
+      # Events dir: host path includes node name; inside container always /events
+      - /opt/homelab/events/${NODE_NAME:-ha-diag}:/events
+      # SQLite baseline cache and check history
+      - /var/lib/ha-diag-agent:/data
+      # Optional YAML config (read-only)
+      - /opt/homelab/config/ha-diag-agent:/config:ro
+
+    healthcheck:
+      test:
+        - "CMD"
+        - "python"
+        - "-c"
+        - "import urllib.request; urllib.request.urlopen('http://localhost:8087/health', timeout=5)"
+      interval: 30s
+      timeout: 10s
+      retries: 3
+      start_period: 20s
--- a/services/ha-diag-agent/env.example
+++ b/services/ha-diag-agent/env.example
@@ -0,0 +1,27 @@
+# ha-diag-agent environment variables
+# Copy to /opt/homelab/config/ha-diag-agent/.env on the target node
+
+# Home Assistant connection (required)
+HA_URL=http://homeassistant.local:8123
+HA_TOKEN=your-long-lived-token-here
+HA_TIMEOUT=10.0
+
+# Node identity
+NODE_NAME=piha
+LOCATION_TAG=ken
+
+# Check intervals (seconds)
+CHECK_INTERVAL=60         # heartbeat check
+CHECK_INTERVAL_UNAVAILABLE=3600  # entity availability check (1h)
+
+# Unavailable entities thresholds
+UNAVAILABLE_THRESHOLD_HOURS=24   # alert after N hours unavailable
+INTEGRATION_FAILURE_THRESHOLD_PCT=0.5  # fraction of integration entities
+INTEGRATION_FAILURE_MIN_ENTITIES=3     # minimum count for integration event
+ALERT_COOLDOWN_HOURS=6           # suppress re-alert within N hours
+
+# API server
+PORT=8087
+
+# Logging: debug, info, warning, error
+LOG_LEVEL=info
--- a/services/ha-diag-agent/healthcheck.sh
+++ b/services/ha-diag-agent/healthcheck.sh
@ -0,0 +1,12 @@
+#!/bin/sh
+# Healthcheck: probe the FastAPI /health endpoint
+set -e
+PORT="${PORT:-8087}"
+python -c "
+import urllib.request, sys
+try:
+    r = urllib.request.urlopen('http://localhost:${PORT}/health', timeout=5)
+    sys.exit(0 if r.status == 200 else 1)
+except Exception:
+    sys.exit(1)
+"
--- a/services/ha-diag-agent/pyproject.toml
+++ b/services/ha-diag-agent/pyproject.toml
@ -0,0 +1,36 @@
+[build-system]
+requires = ["setuptools>=68"]
+build-backend = "setuptools.build_meta"
+
+[project]
+name = "ha-diag-agent"
+version = "0.1.0"
+requires-python = ">=3.11"
+dependencies = [
+    "aiohttp>=3.9",
+    "fastapi>=0.110",
+    "uvicorn[standard]>=0.29",
+    "pydantic>=2.6",
+    "pydantic-settings>=2.2",
+    "apscheduler>=3.10",
+    "aiosqlite>=0.20",
+    "structlog>=24.1",
+    "pyyaml>=6.0",
+]
+
+[project.optional-dependencies]
+dev = [
+    "pytest>=8.1",
+    "pytest-asyncio>=0.23",
+    "aioresponses>=0.7",
+]
+
+[tool.setuptools.packages.find]
+where = ["src"]
+
+[tool.pytest.ini_options]
+asyncio_mode = "auto"
+testpaths = ["tests"]
+markers = [
+    "integration: requires running HA instances — run with -m integration",
+]
--- a/services/ha-diag-agent/service.yaml
+++ b/services/ha-diag-agent/service.yaml
@ -0,0 +1,42 @@
+service:
+  name: ha-diag-agent
+  # Deployed per-host: piha (site: ken) and chelsty-infra (site: chelsty)
+  owner_node: per-host
+  exposure: local-only
+  monitor: true
+
+  dependencies:
+    - homeassistant
+
+  ports:
+    - 8087
+
+  healthcheck:
+    type: http
+    path: /health
+    interval: 30s
+    timeout: 10s
+    retries: 3
+    start_period: 20s
+
+  restart_policy: unless-stopped
+
+  persistence:
+    paths:
+      - /opt/homelab/events
+      - /var/lib/ha-diag-agent
+
+  runtime:
+    env_vars:
+      - HA_TOKEN                          # long-lived HA access token (required)
+      - HA_URL                            # http://homeassistant.local:8123
+      - NODE_NAME                         # canonical node name: piha, chelsty-infra
+      - LOCATION_TAG                      # human site label: ken, chelsty
+      - CHECK_INTERVAL                    # heartbeat interval seconds (default: 60)
+      - CHECK_INTERVAL_UNAVAILABLE        # entity check interval seconds (default: 3600)
+      - UNAVAILABLE_THRESHOLD_HOURS       # alert threshold (default: 24)
+      - INTEGRATION_FAILURE_THRESHOLD_PCT # fraction threshold (default: 0.5)
+      - INTEGRATION_FAILURE_MIN_ENTITIES  # min count for integration event (default: 3)
+      - ALERT_COOLDOWN_HOURS              # re-alert suppression (default: 6)
+      - PORT                              # FastAPI port (default: 8087)
+      - LOG_LEVEL                         # default: info
--- a/services/ha-diag-agent/src/ha_diag/__init__.py
+++ b/services/ha-diag-agent/src/ha_diag/__init__.py
--- a/services/ha-diag-agent/src/ha_diag/api.py
+++ b/services/ha-diag-agent/src/ha_diag/api.py
@ -0,0 +1,58 @@
+from __future__ import annotations
+
+from typing import TYPE_CHECKING
+
+from fastapi import FastAPI, HTTPException
+
+if TYPE_CHECKING:
+    from .checks.base import Check
+    from .monitors.base import Monitor
+
+app = FastAPI(title="ha-diag-agent", version="0.1.0")
+
+# Populated by main.py during startup
+_checks: dict[str, "Check"] = {}
+_ws_monitor: "Monitor | None" = None
+_node_name: str = "unknown"
+_location_tag: str = "default"
+
+
+def register_checks(checks: list["Check"], node_name: str, location_tag: str) -> None:
+    global _node_name, _location_tag
+    _checks.update({c.name: c for c in checks})
+    _node_name = node_name
+    _location_tag = location_tag
+
+
+def register_ws_monitor(monitor: "Monitor") -> None:
+    global _ws_monitor
+    _ws_monitor = monitor
+
+
+@app.get("/health")
+async def health() -> dict:
+    response: dict = {
+        "status": "ok",
+        "node": _node_name,
+        "location_tag": _location_tag,
+        "checks": list(_checks.keys()),
+    }
+    if _ws_monitor is not None:
+        response["ws_connected"] = _ws_monitor.is_healthy
+    return response
+
+
+@app.post("/trigger/{check_name}")
+async def trigger(check_name: str) -> dict:
+    check = _checks.get(check_name)
+    if check is None:
+        raise HTTPException(status_code=404, detail=f"Unknown check: {check_name!r}")
+    results = await check.run()  # list[CheckResult]; an empty list means the check passed
+    return {
+        "check": check_name,
+        "healthy": all(r.healthy for r in results),
+        "results": [
+            {"event_type": r.event_type, "severity": r.severity, "message": r.message, "payload": r.payload}
+            for r in results
+        ],
+    }
--- a/services/ha-diag-agent/src/ha_diag/checks/__init__.py
+++ b/services/ha-diag-agent/src/ha_diag/checks/__init__.py
--- a/services/ha-diag-agent/src/ha_diag/checks/automation_failures.py
+++ b/services/ha-diag-agent/src/ha_diag/checks/automation_failures.py
@ -0,0 +1,97 @@
+from __future__ import annotations
+
+from typing import TYPE_CHECKING, Any
+
+from ..ha_client import HAClient
+from ..models import CheckResult, HAEventType, Severity
+from ..storage import Storage
+from .base import Check
+
+if TYPE_CHECKING:
+    from ..config import Settings
+
+
+class AutomationFailuresCheck(Check):
+    """Detects automations with consecutive run failures.
+
+    For each enabled automation (state="on"), fetches its run traces. When the
+    automation_failure_threshold most recent traces all indicate failure, emits
+    ha_automation_failing, deduplicated per automation for alert_cooldown_hours.
+    """
+
+    name = "automation_failures"
+
+    def __init__(
+        self,
+        ha_client: HAClient,
+        storage: Storage,
+        settings: "Settings",
+    ) -> None:
+        self._client = ha_client
+        self._storage = storage
+        self._settings = settings
+
+    async def run(self) -> list[CheckResult]:
+        try:
+            all_states = await self._client.get_states()
+        except Exception:
+            return []
+
+        automations = [
+            s for s in all_states
+            if s["entity_id"].startswith("automation.") and s["state"] == "on"
+        ]
+
+        results: list[CheckResult] = []
+        cooldown_s = self._settings.alert_cooldown_hours * 3600
+        threshold = self._settings.automation_failure_threshold
+
+        for auto_state in automations:
+            eid = auto_state["entity_id"]
+            try:
+                traces = await self._client.get_automation_traces(eid)
+            except Exception:
+                continue
+
+            if not traces or len(traces) < threshold:
+                continue
+
+            recent = traces[:threshold]
+            failures = [t for t in recent if _is_trace_failure(t)]
+            if len(failures) < threshold:
+                continue
+
+            alert_key = f"automation_failing:{eid}"
+            if await self._storage.was_alert_sent(alert_key, cooldown_s):
+                continue
+
+            attrs = auto_state.get("attributes", {})
+            friendly_name = attrs.get("friendly_name", eid)
+            last_failures = [
+                {"timestamp": t.get("timestamp"), "error": t.get("error", "")}
+                for t in failures
+            ]
+
+            results.append(CheckResult(
+                healthy=False,
+                event_type=HAEventType.ha_automation_failing,
+                severity=Severity.warning,
+                message=(
+                    f"Automation '{friendly_name}' failed "
+                    f"{len(failures)} consecutive time(s)"
+                ),
+                payload={
+                    "entity_id": eid,
+                    "friendly_name": friendly_name,
+                    "last_failures": last_failures,
+                    "total_recent_failures": len(failures),
+                },
+            ))
+            await self._storage.mark_alert_sent(alert_key)
+
+        return results
+
+
+def _is_trace_failure(trace: dict[str, Any]) -> bool:
+    """A trace is a failure if it has a non-empty error or an explicit failed state."""
+    return bool(trace.get("error")) or trace.get("state") == "failed"
--- a/services/ha-diag-agent/src/ha_diag/checks/base.py
+++ b/services/ha-diag-agent/src/ha_diag/checks/base.py
@ -0,0 +1,20 @@
+from __future__ import annotations
+
+from abc import ABC, abstractmethod
+
+from ..models import CheckResult
+
+
+class Check(ABC):
+    """Base class for all HA diagnostic checks."""
+
+    name: str  # unique slug used in /trigger/<name> and check_history
+
+    @abstractmethod
+    async def run(self) -> list[CheckResult]:
+        """Execute the check and return results.
+
+        Empty list means the check passed cleanly.
+        Each CheckResult with event_type set causes an event to be emitted.
+        The caller (runner in main.py) handles emission and history recording.
+        """
--- a/services/ha-diag-agent/src/ha_diag/checks/heartbeat.py
+++ b/services/ha-diag-agent/src/ha_diag/checks/heartbeat.py
@ -0,0 +1,38 @@
+from __future__ import annotations
+
+from ..ha_client import HAClient
+from ..models import CheckResult, HAEventType, Severity
+from .base import Check
+
+
+class HeartbeatCheck(Check):
+    """Pings HA /api/ to verify the REST API is reachable.
+
+    Validates the end-to-end pipeline: shared HAClient → check → event emitter.
+    """
+
+    name = "heartbeat"
+
+    def __init__(self, ha_client: HAClient) -> None:
+        self._client = ha_client
+
+    async def run(self) -> list[CheckResult]:
+        try:
+            data = await self._client.get_api_status()
+            if isinstance(data, dict) and "message" in data:
+                return []
+            return [CheckResult(
+                healthy=False,
+                event_type=HAEventType.ha_websocket_dead,
+                severity=Severity.error,
+                message=f"HA API returned unexpected response: {data!r}",
+                payload={"response": str(data)},
+            )]
+        except Exception as exc:
+            return [CheckResult(
+                healthy=False,
+                event_type=HAEventType.ha_websocket_dead,
+                severity=Severity.error,
+                message=f"HA API unreachable: {exc}",
+                payload={"error": str(exc)},
+            )]
--- a/services/ha-diag-agent/src/ha_diag/checks/system_health.py
+++ b/services/ha-diag-agent/src/ha_diag/checks/system_health.py
@ -0,0 +1,110 @@
+from __future__ import annotations
+
+import json
+from typing import TYPE_CHECKING, Any
+
+from ..ha_client import HAClient
+from ..models import CheckResult, HAEventType, Severity
+from ..storage import Storage
+from .base import Check
+
+if TYPE_CHECKING:
+    from ..config import Settings
+
+
+class SystemHealthCheck(Check):
+    """Detects newly-failing HA integrations via /api/system_health.
+
+    Logic per run:
+    1. Fetch /api/system_health and parse per-component statuses.
+    2. Diff against stored snapshots in system_health_snapshot.
+    3. Emit ha_system_health_degraded on ok → error transitions.
+    4. Clear alerts_sent on error → ok recovery (next degradation re-alerts).
+    5. Update all component snapshots.
+
+    API errors (HA unreachable) return no results; HeartbeatCheck handles
+    HA reachability separately.
+    """
+
+    name = "system_health"
+
+    def __init__(
+        self,
+        ha_client: HAClient,
+        storage: Storage,
+        settings: "Settings",
+    ) -> None:
+        self._client = ha_client
+        self._storage = storage
+        self._settings = settings
+
+    async def run(self) -> list[CheckResult]:
+        try:
+            health_data = await self._client.get_system_health()
+        except Exception:
+            return []
+
+        statuses = _extract_component_statuses(health_data)
+        results: list[CheckResult] = []
+
+        for component, info in statuses.items():
+            status = info["status"]
+            details = info.get("details", {})
+            prev = await self._storage.get_system_health_snapshot(component)
+
+            if status == "error":
+                if prev is None or prev["last_status"] == "ok":
+                    results.append(CheckResult(
+                        healthy=False,
+                        event_type=HAEventType.ha_system_health_degraded,
+                        severity=Severity.warning,
+                        message=f"HA component '{component}' is degraded",
+                        payload={
+                            "component": component,
+                            "previous_status": prev["last_status"] if prev else "unknown",
+                            "current_status": "error",
+                            "details": details,
+                        },
+                    ))
+            elif status == "ok" and prev and prev["last_status"] == "error":
+                await self._storage.clear_alert(f"system_health:{component}")
+
+            await self._storage.upsert_system_health_snapshot(
+                component, status, json.dumps(details, default=str)
+            )
+
+        return results
+
+
+def _extract_component_statuses(
+    health_data: dict[str, Any],
+) -> dict[str, dict[str, Any]]:
+    """Parse HA /api/system_health into {component: {status, details}}.
+
+    Handles multiple HA response shapes:
+    - Typed:   {component: {"type": "result"|"error", "data": {...}}}
+    - Legacy:  {component: {"error": "msg"}} or {component: {plain_data}}
+    - Nested:  {"checks": {component: {...}}, "info": {...}}
+    """
+    checks = health_data.get("checks", health_data)
+    if not isinstance(checks, dict):
+        return {}
+
+    result: dict[str, dict[str, Any]] = {}
+    for component, value in checks.items():
+        if not isinstance(value, dict):
+            continue
+
+        if value.get("type") == "error" or value.get("error"):
+            result[component] = {
+                "status": "error",
+                "details": {"error": str(value.get("error") or value.get("type", "error"))},
+            }
+        else:
+            inner = value.get("data", value)
+            result[component] = {
+                "status": "ok",
+                "details": inner if isinstance(inner, dict) else value,
+            }
+
+    return result
--- a/services/ha-diag-agent/src/ha_diag/checks/unavailable_entities.py
+++ b/services/ha-diag-agent/src/ha_diag/checks/unavailable_entities.py
@ -0,0 +1,266 @@
+from __future__ import annotations
+
+import time
+from datetime import datetime, timezone
+from typing import TYPE_CHECKING, Any
+
+from ..ha_client import HAClient
+from ..models import CheckResult, HAEventType, Severity
+from ..storage import Storage
+from .base import Check
+
+if TYPE_CHECKING:
+    from ..config import Settings
+
+_BAD_STATES = frozenset({"unavailable", "unknown"})
+
+
+def _parse_last_changed_ts(value: str | None) -> float | None:
+    """Parse HA last_changed ISO string → Unix timestamp.
+
+    Returns None on missing or malformed input so callers can fall back
+    to the baseline first_seen without special-casing.
+    """
+    if not value:
+        return None
+    try:
+        return datetime.fromisoformat(value).timestamp()
+    except (ValueError, TypeError):
+        return None
+
+
+class UnavailableEntitiesCheck(Check):
+    """Detects entities stuck in unavailable/unknown state.
+
+    Logic:
+    1. Fetch all entity states from HA.
+    2. Maintain SQLite baseline: INSERT OR IGNORE to preserve first-seen timestamp.
+    3. Handle recoveries: clear baseline + alert dedup for entities back online.
+    4. Alert on entities unavailable > unavailable_threshold_hours.
+    5. Root-cause grouping: if >= integration_failure_threshold_pct of an
+       integration's entities are unavailable (and count >= min_entities), emit
+       ha_integration_failed instead of N individual ha_entity_unavailable_long
+       events.
+    6. Alert dedup: skip re-emitting the same alert within alert_cooldown_hours.
+    """
+
+    name = "unavailable_entities"
+
+    def __init__(
+        self,
+        ha_client: HAClient,
+        storage: Storage,
+        settings: "Settings",
+    ) -> None:
+        self._client = ha_client
+        self._storage = storage
+        self._settings = settings
+
+    # ------------------------------------------------------------------
+    # Public entry point
+    # ------------------------------------------------------------------
+
+    async def run(self) -> list[CheckResult]:
+        now = time.time()
+
+        try:
+            all_states = await self._client.get_states()
+        except Exception as exc:
+            return [CheckResult(
+                healthy=False,
+                event_type=HAEventType.ha_websocket_dead,
+                severity=Severity.error,
+                message=f"Failed to fetch entity states: {exc}",
+                payload={"error": str(exc)},
+            )]
+
+        integration_map, area_map = await self._load_registry()
+
+        unavailable: dict[str, dict[str, Any]] = {
+            s["entity_id"]: s for s in all_states if s["state"] in _BAD_STATES
+        }
+        available_ids: set[str] = {
+            s["entity_id"] for s in all_states if s["state"] not in _BAD_STATES
+        }
+
+        # Handle recoveries first
+        tracked = await self._storage.get_all_tracked_entity_ids()
+        for eid in tracked:
+            if eid in available_ids:
+                await self._handle_recovery(eid)
+
+        # Record new/continuing unavailable entities (INSERT OR IGNORE preserves timestamp)
+        for eid, state_data in unavailable.items():
+            await self._storage.set_entity_unavailable_since(
+                eid, state_data["state"], now
+            )
+
+        # Determine which entities have exceeded the alert threshold
+        to_alert: list[dict[str, Any]] = []
+        cooldown_s = self._settings.alert_cooldown_hours * 3600
+        threshold_h = self._settings.unavailable_threshold_hours
+
+        for eid, state_data in unavailable.items():
+            first_at = await self._storage.get_entity_first_unavailable_at(eid)
+            if first_at is None:
+                continue
+            # Phase 3 Flag #1: if HA reports an earlier last_changed (entity was
+            # already unavailable before the agent started/reconnected), use that
+            # as the authoritative "since" so duration is accurate.
+            last_changed_ts = _parse_last_changed_ts(state_data.get("last_changed"))
+            effective_since = (
+                min(last_changed_ts, first_at)
+                if last_changed_ts is not None
+                else first_at
+            )
+            duration_h = (now - effective_since) / 3600
+            if duration_h < threshold_h:
+                continue
+            alert_key = f"entity_unavailable:{eid}"
+            if await self._storage.was_alert_sent(alert_key, cooldown_s):
+                continue
+            to_alert.append({
+                "entity_id": eid,
+                "state": state_data["state"],
+                "first_at": effective_since,
+                "duration_h": duration_h,
+                "domain": eid.split(".")[0],
+                "integration": integration_map.get(eid),
+                "area_id": area_map.get(eid),
+            })
+
+        if not to_alert:
+            return []
+
+        return await self._build_results(to_alert, all_states, integration_map)
+
+    # ------------------------------------------------------------------
+    # Internal helpers
+    # ------------------------------------------------------------------
+
+    async def _load_registry(
+        self,
+    ) -> tuple[dict[str, str], dict[str, str]]:
+        """Fetch entity registry; return (integration_map, area_map).
+
+        Falls back to empty dicts when the endpoint is unavailable.
+        """
+        try:
+            registry = await self._client.get_entity_registry()
+            integration_map = {
+                e["entity_id"]: e.get("platform") or ""
+                for e in registry
+                if "entity_id" in e
+            }
+            area_map = {
+                e["entity_id"]: e.get("area_id") or ""
+                for e in registry
+                if "entity_id" in e
+            }
+            return integration_map, area_map
+        except Exception:
+            return {}, {}
+
+    async def _handle_recovery(self, entity_id: str) -> None:
+        await self._storage.clear_entity_unavailable(entity_id)
+        # Clear dedup so the next unavailability triggers an alert immediately
+        await self._storage.clear_alert(f"entity_unavailable:{entity_id}")
+
+    async def _build_results(
+        self,
+        to_alert: list[dict[str, Any]],
+        all_states: list[dict[str, Any]],
+        integration_map: dict[str, str],
+    ) -> list[CheckResult]:
+        results: list[CheckResult] = []
+        handled: set[str] = set()
+
+        # Build per-integration stats across ALL entities (not just to_alert)
+        total_per_integ: dict[str, int] = {}
+        unav_per_integ: dict[str, list[str]] = {}
+
+        for state in all_states:
+            eid = state["entity_id"]
+            integ = integration_map.get(eid)
+            if not integ:
+                continue
+            total_per_integ[integ] = total_per_integ.get(integ, 0) + 1
+            if state["state"] in _BAD_STATES:
+                unav_per_integ.setdefault(integ, []).append(eid)
+
+        min_ent = self._settings.integration_failure_min_entities
+        threshold_pct = self._settings.integration_failure_threshold_pct
+        cooldown_s = self._settings.alert_cooldown_hours * 3600
+
+        # Integration-level events
+        for integ, unav_ids in unav_per_integ.items():
+            total = total_per_integ.get(integ, 0)
+            pct = len(unav_ids) / total if total else 0
+
+            alerted_from_integ = [e for e in to_alert if e["integration"] == integ]
+            if not alerted_from_integ:
+                continue
+            if pct < threshold_pct or len(unav_ids) < min_ent:
+                continue
+
+            alert_key = f"integration_failed:{integ}"
+            if await self._storage.was_alert_sent(alert_key, cooldown_s):
+                handled.update(e["entity_id"] for e in alerted_from_integ)
+                continue
+
+            results.append(CheckResult(
+                healthy=False,
+                event_type=HAEventType.ha_integration_failed,
+                severity=Severity.error,
+                message=(
+                    f"Integration '{integ}' appears down: "
+                    f"{len(unav_ids)}/{total} entities unavailable"
+                ),
+                payload={
+                    "integration": integ,
+                    "affected_entities": unav_ids,
+                    "unavailable_count": len(unav_ids),
+                    "total_count": total,
+                    "unavailable_pct": round(pct, 2),
+                },
+            ))
+            await self._storage.mark_alert_sent(alert_key)
+            handled.update(e["entity_id"] for e in alerted_from_integ)
+
+        # Per-entity events for entities not covered by an integration event
+        for entity in to_alert:
+            eid = entity["entity_id"]
+            if eid in handled:
+                continue
+
+            since_iso = (
+                datetime.fromtimestamp(entity["first_at"], tz=timezone.utc)
+                .isoformat()
+                .replace("+00:00", "Z")
+            )
+
+            payload: dict[str, Any] = {
+                "entity_id": eid,
+                "state": entity["state"],
+                "since": since_iso,
+                "duration_hours": round(entity["duration_h"], 1),
+                "domain": entity["domain"],
+            }
+            if entity["integration"]:
+                payload["integration"] = entity["integration"]
+            if entity["area_id"]:
+                payload["area"] = entity["area_id"]
+
+            results.append(CheckResult(
+                healthy=False,
+                event_type=HAEventType.ha_entity_unavailable_long,
+                severity=Severity.warning,
+                message=(
+                    f"Entity {eid} unavailable for "
+                    f"{entity['duration_h']:.1f}h"
+                ),
+                payload=payload,
+            ))
+            await self._storage.mark_alert_sent(f"entity_unavailable:{eid}")
+
+        return results
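The root-cause grouping rule in `_build_results` can be isolated as a small pure function. This is a sketch under assumptions: `is_integration_failure` is a hypothetical name, the default arguments mirror the defaults in `env.example`, and the real check additionally requires that at least one of the integration's entities has already passed the per-entity alert threshold.

```python
def is_integration_failure(
    unavailable: int,
    total: int,
    threshold_fraction: float = 0.5,
    min_entities: int = 3,
) -> bool:
    """Return True when an integration should be reported as failed as a whole."""
    if total == 0:
        return False  # no known entities, nothing to group
    fraction = unavailable / total
    # Both conditions must hold: a large enough share of the integration's
    # entities is unavailable AND the absolute count is not trivially small.
    return fraction >= threshold_fraction and unavailable >= min_entities
```

With the defaults, 3 of 4 unavailable entities count as an integration failure, while 2 of 2 do not, because the absolute count stays below `min_entities`.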
--- a/services/ha-diag-agent/src/ha_diag/checks/updates_available.py
+++ b/services/ha-diag-agent/src/ha_diag/checks/updates_available.py
@ -0,0 +1,123 @@
+from __future__ import annotations
+
+from datetime import datetime
+from typing import TYPE_CHECKING, Any
+
+from ..ha_client import HAClient
+from ..models import CheckResult, HAEventType, Severity
+from ..storage import Storage
+from .base import Check
+
+if TYPE_CHECKING:
+    from ..config import Settings
+
+_MAX_RELEASE_NOTES = 2000
+
+
+class UpdatesAvailableCheck(Check):
+    """Detects available HA core/add-on updates via update.* entities.
+
+    Runs daily. Emits one ha_update_available event per pending update entity,
+    deduplicated for updates_cooldown_days. Returns no results if HA is unreachable.
+    """
+
+    name = "updates_available"
+
+    def __init__(
+        self,
+        ha_client: HAClient,
+        storage: Storage,
+        settings: "Settings",
+    ) -> None:
+        self._client = ha_client
+        self._storage = storage
+        self._settings = settings
+
+    async def run(self) -> list[CheckResult]:
+        updates = await self._fetch_active_updates()
+        if not updates:
+            return []
+
+        results: list[CheckResult] = []
+        cooldown_s = self._settings.updates_cooldown_days * 86400
+
+        for state in updates:
+            eid = state["entity_id"]
+            alert_key = f"update_available:{eid}"
+            if await self._storage.was_alert_sent(alert_key, cooldown_s):
+                continue
+            attrs = state.get("attributes", {})
+            results.append(CheckResult(
+                healthy=False,
+                event_type=HAEventType.ha_update_available,
+                severity=Severity.info,
+                message=(
+                    f"Update available: {attrs.get('title', eid)} "
+                    f"{attrs.get('installed_version', '?')} → "
+                    f"{attrs.get('latest_version', '?')}"
+                ),
+                payload=_build_update_payload(eid, attrs),
+            ))
+            await self._storage.mark_alert_sent(alert_key)
+
+        return results
+
+    async def _fetch_active_updates(self) -> list[dict[str, Any]]:
+        try:
+            all_states = await self._client.get_states()
+        except Exception:
+            return []
+        return [
+            s for s in all_states
+            if s["entity_id"].startswith("update.") and s["state"] == "on"
+        ]
+
+
+class UpdatesDigestCheck(UpdatesAvailableCheck):
+    """Weekly Sunday digest: single event listing all pending updates.
+
+    Deduped per ISO week (won't re-fire if triggered multiple times on the
+    same Sunday, e.g. manual + scheduled).
+    """
+
+    name = "updates_digest"
+
+    async def run(self) -> list[CheckResult]:
+        updates = await self._fetch_active_updates()
+        if not updates:
+            return []
+
+        week_key = datetime.now().strftime("%G-W%V")
+        alert_key = f"update_digest:{week_key}"
+        if await self._storage.was_alert_sent(alert_key, 6 * 86400):
+            return []
+
+        all_payloads = [
+            _build_update_payload(s["entity_id"], s.get("attributes", {}))
+            for s in updates
+        ]
+        await self._storage.mark_alert_sent(alert_key)
+        return [CheckResult(
+            healthy=False,
+            event_type=HAEventType.ha_update_available,
+            severity=Severity.info,
+            message=f"Weekly digest: {len(all_payloads)} update(s) available",
+            payload={"digest": True, "updates": all_payloads, "count": len(all_payloads)},
+        )]
+
+
+def _build_update_payload(entity_id: str, attrs: dict[str, Any]) -> dict[str, Any]:
+    payload: dict[str, Any] = {
+        "entity_id": entity_id,
+        "title": attrs.get("title", entity_id),
+        "installed_version": attrs.get("installed_version"),
+        "latest_version": attrs.get("latest_version"),
+        "in_progress": attrs.get("in_progress", False),
+        "auto_update": attrs.get("auto_update", False),
+    }
+    if attrs.get("release_url"):
+        payload["release_url"] = attrs["release_url"]
+    summary = attrs.get("release_summary")
+    if summary:
+        payload["release_summary"] = summary[:_MAX_RELEASE_NOTES]
+    return payload
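The weekly digest derives its dedup key from the ISO week via `strftime("%G-W%V")`. The sketch below shows what that format yields; `week_key` is a hypothetical helper name, and the dates are illustrative only.

```python
from datetime import datetime


def week_key(moment: datetime) -> str:
    """Format a datetime as ISO year and ISO week number, e.g. '2026-W22'."""
    # %G is the ISO 8601 year and %V the ISO 8601 week number. Using %Y here
    # instead of %G would give a wrong key for days around New Year.
    return moment.strftime("%G-W%V")


sunday = week_key(datetime(2026, 5, 31))       # Sunday, last day of its ISO week
monday = week_key(datetime(2026, 6, 1))        # Monday, first day of the next one
new_year = week_key(datetime(2027, 1, 3))      # still belongs to ISO year 2026
```

Because the key changes every Monday, a manual trigger and the scheduled run on the same Sunday map to the same `update_digest:<week>` alert key.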
--- a/services/ha-diag-agent/src/ha_diag/config.py
+++ b/services/ha-diag-agent/src/ha_diag/config.py
@ -0,0 +1,83 @@
+from __future__ import annotations
+
+import os
+from pathlib import Path
+
+import yaml
+from pydantic import field_validator
+from pydantic_settings import BaseSettings
+
+_CONFIG_YAML = Path("/config/ha-diag-agent.yaml")
+
+
+class Settings(BaseSettings):
+    # HA connection
+    ha_url: str = "http://homeassistant.local:8123"
+    ha_token: str = ""
+    ha_timeout: float = 10.0
+
+    # Node identity
+    node_name: str = "unknown"
+    location_tag: str = "default"
+
+    # Intervals (seconds)
+    check_interval: int = 60          # heartbeat check interval
+    check_interval_unavailable: int = 3600  # unavailable entities check interval
+
+    # Unavailable entities check thresholds
+    unavailable_threshold_hours: float = 24.0  # alert after N hours unavailable
+    integration_failure_threshold_pct: float = 0.5  # fraction (0-1) of integration entities unavailable
+    integration_failure_min_entities: int = 3   # min count to trigger integration event
+    alert_cooldown_hours: float = 6.0           # don't re-alert same entity within N hours
+
+    # Phase 3 Flag #3: entity registry cache TTL
+    entity_registry_cache_ttl: int = 300  # seconds
+
+    # SystemHealthCheck
+    system_health_check_interval: int = 900   # 15 min
+
+    # AutomationFailuresCheck
+    automation_check_interval: int = 1800     # 30 min
+    automation_failure_threshold: int = 3     # consecutive failures before alert
+
+    # UpdatesAvailableCheck
+    updates_check_hour: int = 9
+    updates_check_minute: int = 0
+    updates_cooldown_days: int = 7            # don't re-alert same update within N days
+
+    # WebSocket monitor
+    websocket_enabled: bool = True
+    websocket_silence_threshold_seconds: int = 300   # 5 min
+    websocket_watchdog_interval_seconds: int = 30
+    websocket_reconnect_initial_delay: float = 1.0
+    websocket_reconnect_max_delay: float = 60.0
+    websocket_reconnect_jitter: float = 0.2          # ±20% of delay
+    websocket_down_alert_repeat_minutes: int = 10
+
+    # API server
+    port: int = 8087
+    log_level: str = "info"
+
+    # Runtime paths (inside container)
+    events_dir: Path = Path("/events")
+    data_dir: Path = Path("/data")
+
+    model_config = {"extra": "ignore", "case_sensitive": False}
+
+    @field_validator("ha_url")
+    @classmethod
+    def strip_trailing_slash(cls, v: str) -> str:
+        return v.rstrip("/")
+
+    @classmethod
+    def load(cls) -> "Settings":
+        """Load settings: YAML file provides defaults; env vars override."""
+        if _CONFIG_YAML.exists():
+            try:
+                with _CONFIG_YAML.open() as f:
+                    data = yaml.safe_load(f) or {}
+                for k, v in data.items():
+                    os.environ.setdefault(k.upper(), str(v))
+            except Exception:
+                pass  # unreadable or malformed YAML: fall back to env vars and built-in defaults
+        return cls()
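
The precedence used by `Settings.load()` (file values act as defaults, exported variables win) can be sketched without YAML or pydantic. This is a minimal illustration, not the agent's code: `apply_defaults` is a hypothetical helper, and a plain dict stands in for `os.environ`.

```python
def apply_defaults(defaults: dict, environ: dict) -> None:
    """Copy file-provided defaults into environ without overriding existing keys."""
    for key, value in defaults.items():
        # setdefault only writes when the key is absent, so an exported
        # variable always takes precedence over the file value.
        environ.setdefault(key.upper(), str(value))


env = {"PORT": "9000"}  # value already exported by the operator
apply_defaults({"port": 8087, "log_level": "debug"}, env)
```

Because every value passes through `str()`, this approach suits scalar settings; nested YAML structures would need separate handling.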
--- a/services/ha-diag-agent/src/ha_diag/event_emitter.py
+++ b/services/ha-diag-agent/src/ha_diag/event_emitter.py
@ -0,0 +1,61 @@
+from __future__ import annotations
+
+import json
+import re
+import time
+from datetime import datetime, timezone
+from pathlib import Path
+from typing import Any
+
+from .models import EventRecord
+
+
+class EventEmitter:
+    """Writes atomic JSON event files to the events directory."""
+
+    def __init__(
+        self, events_dir: Path, node_name: str, location_tag: str = ""
+    ) -> None:
+        self._events_dir = events_dir
+        self._node_name = node_name
+        self._location_tag = location_tag
+        self._seq = 0
+        events_dir.mkdir(parents=True, exist_ok=True)
+
+    def _make_id(self, event_type: str, service: str) -> str:
+        # Sequence suffix guarantees uniqueness even when multiple events of the
+        # same type are emitted within the same second (ts has 1 s resolution).
+        self._seq += 1
+        ts = int(time.time())
+        svc_slug = re.sub(r"[^a-z0-9]", "-", (service or "ha").lower())[:32].strip("-")
+        return f"evt-{self._node_name}-{ts}-{event_type}-{svc_slug}-{self._seq}"
+
+    def emit(
+        self,
+        event_type: str,
+        severity: str,
+        service: str,
+        message: str,
+        payload: dict[str, Any] | None = None,
+    ) -> str:
+        event_id = self._make_id(event_type, service)
+        merged: dict[str, Any] = {}
+        if self._location_tag:
+            merged["location_tag"] = self._location_tag
+        merged.update(payload or {})
+        record = EventRecord(
+            id=event_id,
+            timestamp=int(time.time()),
+            date=datetime.now(timezone.utc).isoformat(),
+            type=event_type,
+            severity=severity,
+            node=self._node_name,
+            service=service,
+            message=message,
+            payload=merged,
+        )
+        path = self._events_dir / f"{event_id}.json"
+        tmp = path.with_suffix(".tmp")
+        tmp.write_text(json.dumps(record.model_dump(), indent=2))
+        tmp.rename(path)
+        return event_id
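
The write-then-rename pattern in `emit()` is what keeps a consumer of the events directory from reading a half-written file. A minimal standalone sketch follows; `write_json_atomically` is a hypothetical name, and it uses `os.replace` where the diff uses `Path.rename` (an assumption made here because `os.replace` also overwrites an existing target on every platform).

```python
import json
import os
import tempfile
from pathlib import Path


def write_json_atomically(path: Path, record: dict) -> None:
    """Write record to path so readers never observe a partial file."""
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(record, indent=2))
    # On POSIX the rename is atomic when both paths are on the same filesystem.
    os.replace(tmp, path)


with tempfile.TemporaryDirectory() as d:
    target = Path(d) / "evt-demo-1.json"
    write_json_atomically(target, {"id": "evt-demo-1", "severity": "info"})
    loaded = json.loads(target.read_text())
    leftovers = sorted(p.name for p in Path(d).iterdir())
```

A consumer that globs `*.json` never matches the `.tmp` file, so it only ever sees complete records.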
--- a/services/ha-diag-agent/src/ha_diag/ha_client.py
+++ b/services/ha-diag-agent/src/ha_diag/ha_client.py
@ -0,0 +1,104 @@
+from __future__ import annotations
+
+import time
+from typing import Any
+
+import aiohttp
+
+
+def make_session(token: str, timeout: float = 10.0) -> aiohttp.ClientSession:
+    """Create a pre-configured ClientSession for use with HAClient."""
+    return aiohttp.ClientSession(
+        headers={
+            "Authorization": f"Bearer {token}",
+            "Content-Type": "application/json",
+        },
+        timeout=aiohttp.ClientTimeout(total=timeout),
+    )
+
+
+class HAClient:
+    """Async Home Assistant REST API client.
+
+    Session lifecycle is managed externally — the caller creates the session
+    via make_session() at startup and closes it on shutdown.  HAClient is a
+    session-borrower: it never opens or closes the session it receives.
+    """
+
+    def __init__(
+        self,
+        base_url: str,
+        session: aiohttp.ClientSession,
+        entity_registry_cache_ttl: float = 300.0,
+    ) -> None:
+        self._base_url = base_url.rstrip("/")
+        self._session = session
+        self._registry_cache_ttl = entity_registry_cache_ttl
+        self._registry_cache: list[dict[str, Any]] | None = None
+        self._registry_fetched_at: float = 0.0
+
+    async def get_api_status(self) -> dict[str, Any]:
+        """GET /api/ — returns {"message": "API running."} when HA is up."""
+        async with self._session.get(f"{self._base_url}/api/") as resp:
+            resp.raise_for_status()
+            return await resp.json()
+
+    async def get_states(self) -> list[dict[str, Any]]:
+        """GET /api/states — full entity state list."""
+        async with self._session.get(f"{self._base_url}/api/states") as resp:
+            resp.raise_for_status()
+            return await resp.json()
+
+    async def get_system_health(self) -> dict[str, Any]:
+        """GET /api/system_health — per-integration health summary."""
+        async with self._session.get(f"{self._base_url}/api/system_health") as resp:
+            resp.raise_for_status()
+            return await resp.json()
+
+    async def get_config(self) -> dict[str, Any]:
+        """GET /api/config — HA configuration including version."""
+        async with self._session.get(f"{self._base_url}/api/config") as resp:
+            resp.raise_for_status()
+            return await resp.json()
+
+    async def get_entity_registry(self) -> list[dict[str, Any]]:
+        """GET /api/config/entity_registry — entity registry entries.
+
+        Each entry includes entity_id, platform (integration name), area_id,
+        config_entry_id, and other metadata.
+
+        Result is cached in-process for entity_registry_cache_ttl seconds to
+        avoid hammering HA on every check cycle (Phase 3 Flag #3).
+        """
+        now = time.monotonic()
+        if (
+            self._registry_cache is not None
+            and (now - self._registry_fetched_at) < self._registry_cache_ttl
+        ):
+            return self._registry_cache
+        async with self._session.get(
+            f"{self._base_url}/api/config/entity_registry"
+        ) as resp:
+            resp.raise_for_status()
+            result = await resp.json()
+        self._registry_cache = result
+        self._registry_fetched_at = now
+        return result
+
+    def invalidate_registry_cache(self) -> None:
+        """Force the next get_entity_registry() call to fetch fresh data."""
+        self._registry_cache = None
+        self._registry_fetched_at = 0.0
+
+    async def get_automation_traces(self, automation_id: str) -> list[dict[str, Any]]:
+        """GET /api/trace/automation/<id> — last run traces for an automation."""
+        url = f"{self._base_url}/api/trace/automation/{automation_id}"
+        async with self._session.get(url) as resp:
+            resp.raise_for_status()
+            return await resp.json()
+
+    async def get_error_log(self) -> str:
+        """GET /api/error_log — plaintext error log."""
+        async with self._session.get(f"{self._base_url}/api/error_log") as resp:
+            resp.raise_for_status()
+            return await resp.text()
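
The entity-registry cache in `get_entity_registry()` is a single-value TTL cache keyed on a monotonic clock. The sketch below isolates that idea in synchronous form so it can be exercised with a fake clock; `TTLCache` and its injectable `clock` parameter are illustrative assumptions, not part of the agent.

```python
import time
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class TTLCache(Generic[T]):
    """Hold one value and refetch it once it is older than ttl seconds."""

    def __init__(self, ttl: float, clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl
        self._clock = clock
        self._value: Optional[T] = None
        self._fetched_at = 0.0

    def get(self, fetch: Callable[[], T]) -> T:
        now = self._clock()
        if self._value is not None and (now - self._fetched_at) < self._ttl:
            return self._value
        self._value = fetch()
        self._fetched_at = now
        return self._value

    def invalidate(self) -> None:
        """Force the next get() to call fetch again."""
        self._value = None


fake_now = [100.0]
calls = []


def fetch():
    calls.append(1)
    return ["entry"]


cache = TTLCache(ttl=300.0, clock=lambda: fake_now[0])
cache.get(fetch)        # miss: fetches
fake_now[0] += 299.0
cache.get(fetch)        # still fresh: served from cache
fake_now[0] += 2.0
cache.get(fetch)        # expired: fetches again
```

A monotonic clock is used so that wall-clock adjustments (NTP steps, DST) cannot extend or shorten the cache lifetime.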
--- a/services/ha-diag-agent/src/ha_diag/main.py
+++ b/services/ha-diag-agent/src/ha_diag/main.py
@ -0,0 +1,204 @@
+from __future__ import annotations
+
+import asyncio
+import json
+import logging
+import time
+from datetime import datetime
+
+import structlog
+import uvicorn
+from apscheduler.schedulers.asyncio import AsyncIOScheduler
+
+from .api import app, register_checks, register_ws_monitor
+from .checks.automation_failures import AutomationFailuresCheck
+from .checks.heartbeat import HeartbeatCheck
+from .checks.system_health import SystemHealthCheck
+from .checks.unavailable_entities import UnavailableEntitiesCheck
+from .checks.updates_available import UpdatesAvailableCheck, UpdatesDigestCheck
+from .config import Settings
+from .event_emitter import EventEmitter
+from .ha_client import HAClient, make_session
+from .monitors import WebSocketMonitor
+from .storage import Storage
+
+_log = structlog.get_logger()
+
+
+def _configure_structlog(log_level: str) -> None:
+    structlog.configure(
+        processors=[
+            structlog.processors.add_log_level,
+            structlog.processors.TimeStamper(fmt="iso"),
+            structlog.processors.StackInfoRenderer(),
+            structlog.processors.format_exc_info,
+            structlog.processors.JSONRenderer(),
+        ],
+        logger_factory=structlog.PrintLoggerFactory(),
+    )
+    logging.basicConfig(level=getattr(logging, log_level.upper(), logging.INFO))
+
+
+async def _run_check_and_emit(
+    check, emitter: EventEmitter, storage: Storage
+) -> None:
+    """Run a check, emit events for each result, and record to check_history."""
+    try:
+        results = await check.run()
+        healthy = not any(r.event_type for r in results)
+        summary = "ok" if healthy else f"{sum(1 for r in results if r.event_type)} issue(s)"
+
+        await storage.record_check(
+            check_name=check.name,
+            ran_at=time.time(),
+            healthy=healthy,
+            message=summary,
+            payload=json.dumps([r.model_dump() for r in results]),
+        )
+
+        for result in results:
+            if result.event_type:
+                emitter.emit(
+                    event_type=result.event_type,
+                    severity=result.severity.value,
+                    service="homeassistant",
+                    message=result.message,
+                    payload=result.payload,
+                )
+                _log.warning(
+                    "check_unhealthy",
+                    check=check.name,
+                    event_type=result.event_type,
+                    msg=result.message,
+                )
+
+        if healthy:
+            _log.info("check_ok", check=check.name)
+
+    except Exception as exc:
+        _log.error("check_error", check=check.name, error=str(exc), exc_info=True)
+
+
+async def run(settings: Settings) -> None:
+    _configure_structlog(settings.log_level)
+    _log.info(
+        "ha_diag_agent_starting",
+        node=settings.node_name,
+        location=settings.location_tag,
+        ha_url=settings.ha_url,
+        heartbeat_interval=settings.check_interval,
+        unavailable_interval=settings.check_interval_unavailable,
+    )
+
+    storage = Storage(settings.data_dir / "ha_diag.db")
+    await storage.open()
+
+    emitter = EventEmitter(settings.events_dir, settings.node_name, settings.location_tag)
+
+    # Shared session — created once at startup, closed on shutdown
+    session = make_session(settings.ha_token, settings.ha_timeout)
+    ha_client = HAClient(
+        settings.ha_url, session,
+        entity_registry_cache_ttl=settings.entity_registry_cache_ttl,
+    )
+
+    heartbeat = HeartbeatCheck(ha_client)
+    unavailable = UnavailableEntitiesCheck(ha_client, storage, settings)
+    system_health = SystemHealthCheck(ha_client, storage, settings)
+    automation_failures = AutomationFailuresCheck(ha_client, storage, settings)
+    updates_daily = UpdatesAvailableCheck(ha_client, storage, settings)
+    updates_digest = UpdatesDigestCheck(ha_client, storage, settings)
+
+    all_checks = [heartbeat, unavailable, system_health, automation_failures,
+                  updates_daily, updates_digest]
+    register_checks(all_checks, settings.node_name, settings.location_tag)
+
+    ws_monitor = WebSocketMonitor(
+        ha_url=settings.ha_url,
+        token=settings.ha_token,
+        settings=settings,
+        emitter=emitter,
+        session=session,
+    )
+    register_ws_monitor(ws_monitor)
+
+    scheduler = AsyncIOScheduler()
+    scheduler.add_job(
+        _run_check_and_emit, "interval",
+        seconds=settings.check_interval,
+        args=[heartbeat, emitter, storage],
+        id="check_heartbeat",
+        next_run_time=datetime.now(),
+    )
+    scheduler.add_job(
+        _run_check_and_emit, "interval",
+        seconds=settings.check_interval_unavailable,
+        args=[unavailable, emitter, storage],
+        id="check_unavailable_entities",
+        next_run_time=datetime.now(),
+    )
+    scheduler.add_job(
+        _run_check_and_emit, "interval",
+        seconds=settings.system_health_check_interval,
+        args=[system_health, emitter, storage],
+        id="check_system_health",
+        next_run_time=datetime.now(),
+    )
+    scheduler.add_job(
+        _run_check_and_emit, "interval",
+        seconds=settings.automation_check_interval,
+        args=[automation_failures, emitter, storage],
+        id="check_automation_failures",
+        next_run_time=datetime.now(),
+    )
+    scheduler.add_job(
+        _run_check_and_emit, "cron",
+        hour=settings.updates_check_hour,
+        minute=settings.updates_check_minute,
+        args=[updates_daily, emitter, storage],
+        id="check_updates_available",
+    )
+    scheduler.add_job(
+        _run_check_and_emit, "cron",
+        day_of_week="sun",
+        hour=settings.updates_check_hour,
+        minute=settings.updates_check_minute,
+        args=[updates_digest, emitter, storage],
+        id="check_updates_digest",
+    )
+    scheduler.start()
+    _log.info(
+        "scheduler_started",
+        checks=[c.name for c in all_checks],
+        heartbeat_interval=settings.check_interval,
+        unavailable_interval=settings.check_interval_unavailable,
+        system_health_interval=settings.system_health_check_interval,
+        automation_interval=settings.automation_check_interval,
+        updates_hour=settings.updates_check_hour,
+    )
+
+    await ws_monitor.start()
+
+    config = uvicorn.Config(
+        app,
+        host="0.0.0.0",
+        port=settings.port,
+        log_level=settings.log_level.lower(),
+    )
+    server = uvicorn.Server(config)
+    try:
+        await server.serve()
+    finally:
+        await ws_monitor.stop()
+        scheduler.shutdown(wait=False)
+        await storage.close()
+        await session.close()
+
+
+def main() -> None:
+    settings = Settings.load()
+    asyncio.run(run(settings))
+
+
+if __name__ == "__main__":
+    main()
--- a/services/ha-diag-agent/src/ha_diag/models.py
+++ b/services/ha-diag-agent/src/ha_diag/models.py
@ -0,0 +1,43 @@
+from __future__ import annotations
+
+from enum import Enum
+from typing import Any
+
+from pydantic import BaseModel
+
+
+class Severity(str, Enum):
+    info = "info"
+    warning = "warning"
+    error = "error"
+
+
+class HAEventType(str, Enum):
+    ha_integration_failed = "ha_integration_failed"
+    ha_entity_unavailable_long = "ha_entity_unavailable_long"
+    ha_websocket_dead = "ha_websocket_dead"
+    ha_websocket_recovered = "ha_websocket_recovered"
+    ha_automation_failing = "ha_automation_failing"
+    ha_update_available = "ha_update_available"
+    ha_recorder_lag = "ha_recorder_lag"
+    ha_system_health_degraded = "ha_system_health_degraded"
+
+
+class EventRecord(BaseModel):
+    id: str
+    timestamp: int
+    date: str
+    type: str
+    severity: str
+    node: str
+    service: str
+    message: str
+    payload: dict[str, Any] = {}
+
+
+class CheckResult(BaseModel):
+    healthy: bool
+    event_type: str | None = None  # None means no event to emit
+    severity: Severity = Severity.info
+    message: str = ""
+    payload: dict[str, Any] = {}
--- a/services/ha-diag-agent/src/ha_diag/monitors/__init__.py
+++ b/services/ha-diag-agent/src/ha_diag/monitors/__init__.py
@ -0,0 +1,4 @@
+from .base import Monitor
+from .websocket_monitor import WebSocketMonitor
+
+__all__ = ["Monitor", "WebSocketMonitor"]
--- a/services/ha-diag-agent/src/ha_diag/monitors/base.py
+++ b/services/ha-diag-agent/src/ha_diag/monitors/base.py
@ -0,0 +1,24 @@
+from __future__ import annotations
+
+from abc import ABC, abstractmethod
+
+
+class Monitor(ABC):
+    """Base class for long-running background monitors.
+
+    Unlike checks (one-shot, APScheduler-driven), monitors maintain
+    persistent state — connections, subscriptions, background tasks.
+    """
+
+    @abstractmethod
+    async def start(self) -> None:
+        """Spawn background task(s). Idempotent if already started."""
+
+    @abstractmethod
+    async def stop(self) -> None:
+        """Cancel background tasks and wait for cleanup."""
+
+    @property
+    @abstractmethod
+    def is_healthy(self) -> bool:
+        """True when the monitor is running and its connection is live."""
--- a/services/ha-diag-agent/src/ha_diag/monitors/websocket_monitor.py
+++ b/services/ha-diag-agent/src/ha_diag/monitors/websocket_monitor.py
@ -0,0 +1,286 @@
+from __future__ import annotations
+
+import asyncio
+import json
+import random
+import time
+from datetime import datetime, timezone
+
+import aiohttp
+import structlog
+
+from ..config import Settings
+from ..event_emitter import EventEmitter
+from ..models import HAEventType, Severity
+from .base import Monitor
+
+_log = structlog.get_logger().bind(monitor="websocket")
+
+
+class _AuthError(Exception):
+    """Raised when HA returns auth_invalid during the WS handshake."""
+
+
+def _make_ws_url(ha_url: str) -> str:
+    if ha_url.startswith("https://"):
+        base = ha_url.replace("https://", "wss://", 1)
+    else:
+        base = ha_url.replace("http://", "ws://", 1)
+    return base.rstrip("/") + "/api/websocket"
+
+
+class WebSocketMonitor(Monitor):
+    """Persistent WebSocket connection to HA for real-time liveness monitoring.
+
+    Subscribes to state_changed events — any traffic proves HA is alive.
+    The watchdog fires ha_websocket_dead when the connection is silent for
+    longer than silence_threshold, or immediately on disconnect.
+    ha_websocket_recovered is emitted when the connection is restored after
+    a dead alert was sent (allows supervisor to clear active incidents).
+    """
+
+    def __init__(
+        self,
+        ha_url: str,
+        token: str,
+        settings: Settings,
+        emitter: EventEmitter,
+        session: aiohttp.ClientSession,
+    ) -> None:
+        self._ws_url = _make_ws_url(ha_url)
+        self._token = token
+        self._settings = settings
+        self._emitter = emitter
+        self._session = session
+
+        self._state: str = "disconnected"
+        self._last_event_monotonic: float = time.monotonic()
+        # 0.0 means no ha_websocket_dead has been emitted yet (for this session)
+        self._last_dead_alert_at: float = 0.0
+
+        self._stopping = False
+        self._msg_id = 0
+        self._main_task: asyncio.Task | None = None
+        self._watchdog_task: asyncio.Task | None = None
+
+    # ------------------------------------------------------------------
+    # Monitor ABC
+    # ------------------------------------------------------------------
+
+    async def start(self) -> None:
+        if not self._settings.websocket_enabled:
+            _log.info("ws_monitor_disabled")
+            return
+        self._stopping = False
+        self._last_event_monotonic = time.monotonic()
+        self._main_task = asyncio.create_task(
+            self._connection_loop(), name="ws_connection_loop"
+        )
+        self._watchdog_task = asyncio.create_task(
+            self._watchdog_loop(), name="ws_watchdog"
+        )
+        _log.info("ws_monitor_started", ws_url=self._ws_url)
+
+    async def stop(self) -> None:
+        self._stopping = True
+        self._state = "stopped"
+        tasks = [t for t in [self._main_task, self._watchdog_task] if t is not None]
+        for t in tasks:
+            t.cancel()
+        if tasks:
+            await asyncio.gather(*tasks, return_exceptions=True)
+        self._main_task = None
+        self._watchdog_task = None
+        _log.info("ws_monitor_stopped")
+
+    @property
+    def is_healthy(self) -> bool:
+        if not self._settings.websocket_enabled:
+            return True  # disabled monitors are not unhealthy
+        return self._state == "subscribed"
+
+    # ------------------------------------------------------------------
+    # Connection loop — reconnects with exponential back-off
+    # ------------------------------------------------------------------
+
+    async def _connection_loop(self) -> None:
+        delay = float(self._settings.websocket_reconnect_initial_delay)
+        while not self._stopping:
+            self._state = "connecting"
+            clean_close = False
+            try:
+                await self._connect_and_listen()
+                clean_close = True
+                delay = float(self._settings.websocket_reconnect_initial_delay)
+            except asyncio.CancelledError:
+                raise
+            except _AuthError as exc:
+                _log.error("ws_auth_failed", error=str(exc))
+                # Auth failures won't self-heal on fast retry — jump to max delay
+                delay = float(self._settings.websocket_reconnect_max_delay)
+            except Exception as exc:
+                _log.warning("ws_connect_error", error=str(exc))
+
+            self._state = "disconnected"
+            if not self._stopping:
+                self._on_disconnected()
+
+            if self._stopping:
+                break
+
+            if clean_close:
+                wait = 1.0  # brief pause before reconnecting after a clean HA close
+            else:
+                jitter_range = delay * self._settings.websocket_reconnect_jitter
+                wait = max(0.1, delay + random.uniform(-jitter_range, jitter_range))
+                delay = min(delay * 2, float(self._settings.websocket_reconnect_max_delay))
+
+            _log.debug("ws_reconnect_wait", seconds=round(wait, 2))
+            await asyncio.sleep(wait)
+
+    # ------------------------------------------------------------------
+    # Connect, auth, subscribe, receive
+    # ------------------------------------------------------------------
+
+    async def _connect_and_listen(self) -> None:
+        # Override the session-level timeout: WS must stay open indefinitely,
+        # only the initial TCP connect should be bounded.
+        ws_timeout = aiohttp.ClientTimeout(total=None, connect=10.0, sock_connect=10.0)
+        async with self._session.ws_connect(
+            self._ws_url,
+            timeout=ws_timeout,
+            heartbeat=30.0,
+        ) as ws:
+            self._state = "authenticating"
+
+            # Receive auth_required
+            try:
+                msg = await asyncio.wait_for(ws.receive_json(), timeout=10.0)
+            except (asyncio.TimeoutError, TypeError, json.JSONDecodeError) as exc:
+                raise ConnectionError(f"Failed to receive auth_required: {exc}") from exc
+
+            if msg.get("type") != "auth_required":
+                raise ConnectionError(
+                    f"Unexpected initial message type: {msg.get('type')!r}"
+                )
+
+            await ws.send_json({"type": "auth", "access_token": self._token})
+
+            # Receive auth_ok or auth_invalid
+            try:
+                msg = await asyncio.wait_for(ws.receive_json(), timeout=10.0)
+            except (asyncio.TimeoutError, TypeError, json.JSONDecodeError) as exc:
+                raise ConnectionError(f"Failed to receive auth response: {exc}") from exc
+
+            if msg.get("type") == "auth_invalid":
+                raise _AuthError(msg.get("message", "auth_invalid"))
+            if msg.get("type") != "auth_ok":
+                raise ConnectionError(
+                    f"Unexpected auth response type: {msg.get('type')!r}"
+                )
+
+            # Subscribe to state_changed events
+            self._msg_id += 1
+            await ws.send_json({
+                "id": self._msg_id,
+                "type": "subscribe_events",
+                "event_type": "state_changed",
+            })
+
+            # Mark connected — capture prior dead state before resetting
+            prev_dead_at = self._last_dead_alert_at
+            self._state = "subscribed"
+            self._last_event_monotonic = time.monotonic()
+
+            # Emit recovery if this reconnect follows a dead alert
+            if prev_dead_at > 0.0:
+                self._last_dead_alert_at = 0.0
+                self._emit_recovered()
+
+            _log.info("ws_subscribed", ws_url=self._ws_url)
+
+            # Receive loop — any TEXT message proves HA is alive
+            async for raw in ws:
+                if self._stopping:
+                    break
+                if raw.type == aiohttp.WSMsgType.TEXT:
+                    self._last_event_monotonic = time.monotonic()
+                elif raw.type in (aiohttp.WSMsgType.ERROR, aiohttp.WSMsgType.CLOSE):
+                    _log.warning("ws_closed_by_server", msg_type=raw.type.name)
+                    break
+
+    # ------------------------------------------------------------------
+    # Watchdog loop — detects silence while the WS appears connected
+    # ------------------------------------------------------------------
+
+    async def _watchdog_loop(self) -> None:
+        while not self._stopping:
+            try:
+                await asyncio.sleep(self._settings.websocket_watchdog_interval_seconds)
+            except asyncio.CancelledError:
+                raise
+
+            if self._state != "subscribed":
+                continue  # disconnects are handled by the connection loop
+
+            now = time.monotonic()
+            silent_secs = now - self._last_event_monotonic
+            if silent_secs <= self._settings.websocket_silence_threshold_seconds:
+                continue
+
+            cooldown = self._settings.websocket_down_alert_repeat_minutes * 60
+            if self._last_dead_alert_at == 0.0 or (now - self._last_dead_alert_at) >= cooldown:
+                self._emitter.emit(
+                    event_type=HAEventType.ha_websocket_dead.value,
+                    severity=Severity.error.value,
+                    service="homeassistant",
+                    message=(
+                        f"HA WebSocket silent for {silent_secs:.0f}s — no events received"
+                    ),
+                    payload=self._dead_payload(silent_secs),
+                )
+                self._last_dead_alert_at = now
+                _log.warning("ws_silent_dead_emitted", silent_seconds=round(silent_secs))
+
+    # ------------------------------------------------------------------
+    # Helpers
+    # ------------------------------------------------------------------
+
+    def _on_disconnected(self) -> None:
+        """Emit ha_websocket_dead on connection loss, respecting cooldown."""
+        if self._stopping:
+            return
+        now = time.monotonic()
+        cooldown = self._settings.websocket_down_alert_repeat_minutes * 60
+        if self._last_dead_alert_at == 0.0 or (now - self._last_dead_alert_at) >= cooldown:
+            silent_secs = now - self._last_event_monotonic
+            self._emitter.emit(
+                event_type=HAEventType.ha_websocket_dead.value,
+                severity=Severity.error.value,
+                service="homeassistant",
+                message=f"HA WebSocket disconnected — silent for {silent_secs:.0f}s",
+                payload=self._dead_payload(silent_secs),
+            )
+            self._last_dead_alert_at = now
+            _log.warning("ws_dead_emitted", silent_seconds=round(silent_secs))
+
+    def _emit_recovered(self) -> None:
+        self._emitter.emit(
+            event_type=HAEventType.ha_websocket_recovered.value,
+            severity=Severity.info.value,
+            service="homeassistant",
+            message="HA WebSocket reconnected and receiving events",
+            payload={"connection_state": "subscribed"},
+        )
+        _log.info("ws_recovered_emitted")
+
+    def _dead_payload(self, silent_secs: float) -> dict:
+        event_age = time.monotonic() - self._last_event_monotonic
+        last_event_wall = time.time() - event_age
+        return {
+            "silent_seconds": round(silent_secs),
+            "last_event_at": datetime.fromtimestamp(
+                last_event_wall, tz=timezone.utc
+            ).isoformat(),
+            "connection_state": self._state,
+        }
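
The reconnect schedule in `_connection_loop()` is exponential back-off with symmetric jitter and a cap. The sketch below extracts that arithmetic into a pure function so the resulting waits can be inspected; `next_wait` and the seeded `random.Random` are assumptions made for the illustration, while the default values mirror the settings in `config.py`.

```python
import random


def next_wait(
    delay: float, max_delay: float, jitter: float, rng: random.Random
) -> tuple[float, float]:
    """Return (seconds to sleep now, base delay to use after the next failure)."""
    jitter_range = delay * jitter
    wait = max(0.1, delay + rng.uniform(-jitter_range, jitter_range))
    return wait, min(delay * 2, max_delay)


rng = random.Random(0)  # seeded only to make the demo repeatable
delay = 1.0
schedule = []
for _ in range(8):
    wait, delay = next_wait(delay, max_delay=60.0, jitter=0.2, rng=rng)
    schedule.append(wait)
```

With these settings the base delay doubles from 1 s up to the 60 s cap, and each actual wait falls within 20% of its base, which spreads out reconnect attempts after an HA restart.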
--- a/services/ha-diag-agent/src/ha_diag/storage.py
+++ b/services/ha-diag-agent/src/ha_diag/storage.py
@ -0,0 +1,227 @@
+from __future__ import annotations
+
+import time
+from pathlib import Path
+from typing import Any
+
+import aiosqlite
+
+_SCHEMA = """
+CREATE TABLE IF NOT EXISTS system_health_snapshot (
+    component    TEXT PRIMARY KEY,
+    last_status  TEXT NOT NULL,
+    last_seen_at REAL NOT NULL,
+    payload      TEXT NOT NULL DEFAULT '{}'
+);
+
+CREATE TABLE IF NOT EXISTS entity_baseline (
+    entity_id   TEXT PRIMARY KEY,
+    -- state when entity first entered unavailable/unknown
+    state       TEXT NOT NULL,
+    -- timestamp when the entity FIRST entered its current bad state (INSERT OR IGNORE)
+    first_seen  REAL NOT NULL,
+    -- kept for legacy compat; not used by UnavailableEntitiesCheck
+    attributes  TEXT NOT NULL DEFAULT '{}',
+    updated_at  REAL NOT NULL
+);
+
+CREATE TABLE IF NOT EXISTS check_history (
+    id          INTEGER PRIMARY KEY AUTOINCREMENT,
+    check_name  TEXT NOT NULL,
+    ran_at      REAL NOT NULL,
+    healthy     INTEGER NOT NULL,
+    message     TEXT NOT NULL DEFAULT '',
+    payload     TEXT NOT NULL DEFAULT '{}'
+);
+
+CREATE TABLE IF NOT EXISTS alerts_sent (
+    alert_key   TEXT PRIMARY KEY,
+    sent_at     REAL NOT NULL
+);
+"""
+
+_MIGRATE_ENTITY_BASELINE = """
+ALTER TABLE entity_baseline ADD COLUMN first_seen REAL NOT NULL DEFAULT 0;
+"""
+
+
+class Storage:
+    def __init__(self, db_path: Path) -> None:
+        self._db_path = db_path
+        self._db: aiosqlite.Connection | None = None
+
+    async def open(self) -> None:
+        self._db_path.parent.mkdir(parents=True, exist_ok=True)
+        self._db = await aiosqlite.connect(self._db_path)
+        self._db.row_factory = aiosqlite.Row
+        await self._db.executescript(_SCHEMA)
+        # Add first_seen column to existing databases that pre-date Phase 3
+        try:
+            await self._db.execute(_MIGRATE_ENTITY_BASELINE)
+        except Exception:
+            pass  # column already exists
+        await self._db.commit()
+
+    async def close(self) -> None:
+        if self._db:
+            await self._db.close()
+            self._db = None
+
+    def _conn(self) -> aiosqlite.Connection:
+        if self._db is None:
+            raise RuntimeError("Storage not open — call await storage.open() first")
+        return self._db
+
+    # ------------------------------------------------------------------
+    # entity_baseline — tracks entities currently in bad state
+    # ------------------------------------------------------------------
+
+    async def set_entity_unavailable_since(
+        self, entity_id: str, state: str, first_seen: float
+    ) -> None:
+        """Record when an entity first entered unavailable/unknown state.
+
+        INSERT OR IGNORE: if the entity is already tracked, preserves the
+        original first_seen timestamp so duration is computed correctly.
+        """
+        await self._conn().execute(
+            """
+            INSERT OR IGNORE INTO entity_baseline
+                (entity_id, state, first_seen, attributes, updated_at)
+            VALUES (?, ?, ?, '{}', ?)
+            """,
+            (entity_id, state, first_seen, first_seen),
+        )
+        await self._conn().commit()
+
+    async def get_entity_first_unavailable_at(self, entity_id: str) -> float | None:
+        """Return when the entity first entered its bad state, or None if not tracked."""
+        async with self._conn().execute(
+            "SELECT first_seen FROM entity_baseline WHERE entity_id = ?",
+            (entity_id,),
+        ) as cur:
+            row = await cur.fetchone()
+            return float(row["first_seen"]) if row else None
+
+    async def clear_entity_unavailable(self, entity_id: str) -> None:
+        """Remove entity from unavailable tracking (entity has recovered)."""
+        await self._conn().execute(
+            "DELETE FROM entity_baseline WHERE entity_id = ?",
+            (entity_id,),
+        )
+        await self._conn().commit()
+
+    async def get_all_tracked_entity_ids(self) -> list[str]:
+        """Return all entity IDs currently tracked as unavailable/unknown."""
+        async with self._conn().execute(
+            "SELECT entity_id FROM entity_baseline"
+        ) as cur:
+            rows = await cur.fetchall()
+            return [r["entity_id"] for r in rows]
+
+    # Legacy upsert — kept for backwards compat with existing callers
+    async def upsert_entity_baseline(
+        self, entity_id: str, state: str, attributes: str, updated_at: float
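+    ) -> None:
+        """Insert a baseline row, or update state/attributes/updated_at if one exists.
+
+        first_seen is set to updated_at on insert and left untouched on conflict.
+        """
+        await self._conn().execute(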
+            """
+            INSERT INTO entity_baseline (entity_id, state, first_seen, attributes, updated_at)
+            VALUES (?, ?, ?, ?, ?)
+            ON CONFLICT(entity_id) DO UPDATE SET
+                state = excluded.state,
+                attributes = excluded.attributes,
+                updated_at = excluded.updated_at
+            """,
+            (entity_id, state, updated_at, attributes, updated_at),
+        )
+        await self._conn().commit()
+
+    async def get_entity_baseline(self, entity_id: str) -> dict[str, Any] | None:
+        async with self._conn().execute(
+            "SELECT * FROM entity_baseline WHERE entity_id = ?", (entity_id,)
+        ) as cur:
+            row = await cur.fetchone()
+            return dict(row) if row else None
+
+    # ------------------------------------------------------------------
+    # check_history
+    # ------------------------------------------------------------------
+
+    async def record_check(
+        self,
+        check_name: str,
+        ran_at: float,
+        healthy: bool,
+        message: str,
+        payload: str,
+    ) -> None:
+        await self._conn().execute(
+            """
+            INSERT INTO check_history (check_name, ran_at, healthy, message, payload)
+            VALUES (?, ?, ?, ?, ?)
+            """,
+            (check_name, ran_at, int(healthy), message, payload),
+        )
+        await self._conn().commit()
+
+    # ------------------------------------------------------------------
+    # alerts_sent (dedup gate)
+    # ------------------------------------------------------------------
+
+    async def was_alert_sent(self, alert_key: str, within_seconds: float) -> bool:
+        """Return True if alert_key was marked as sent within the last within_seconds."""
+        cutoff = time.time() - within_seconds
+        async with self._conn().execute(
+            "SELECT sent_at FROM alerts_sent WHERE alert_key = ? AND sent_at > ?",
+            (alert_key, cutoff),
+        ) as cur:
+            return (await cur.fetchone()) is not None
+
+    async def mark_alert_sent(self, alert_key: str) -> None:
+        await self._conn().execute(
+            """
+            INSERT INTO alerts_sent (alert_key, sent_at) VALUES (?, ?)
+            ON CONFLICT(alert_key) DO UPDATE SET sent_at = excluded.sent_at
+            """,
+            (alert_key, time.time()),
+        )
+        await self._conn().commit()
+
+    async def clear_alert(self, alert_key: str) -> None:
+        """Delete an alert record so the next occurrence triggers immediately."""
+        await self._conn().execute(
+            "DELETE FROM alerts_sent WHERE alert_key = ?", (alert_key,)
+        )
+        await self._conn().commit()
+
+    # ------------------------------------------------------------------
+    # system_health_snapshot — tracks last-known per-component status
+    # ------------------------------------------------------------------
+
+    async def get_system_health_snapshot(
+        self, component: str
+    ) -> dict[str, Any] | None:
+        """Return the stored snapshot for a component, or None if unseen."""
+        async with self._conn().execute(
+            "SELECT * FROM system_health_snapshot WHERE component = ?",
+            (component,),
+        ) as cur:
+            row = await cur.fetchone()
+            return dict(row) if row else None
+
+    async def upsert_system_health_snapshot(
+        self, component: str, last_status: str, payload: str
+    ) -> None:
+        """Insert or replace the snapshot for a component."""
+        await self._conn().execute(
+            """
+            INSERT INTO system_health_snapshot
+                (component, last_status, last_seen_at, payload)
+            VALUES (?, ?, ?, ?)
+            ON CONFLICT(component) DO UPDATE SET
+                last_status  = excluded.last_status,
+                last_seen_at = excluded.last_seen_at,
+                payload      = excluded.payload
+            """,
+            (component, last_status, time.time(), payload),
+        )
+        await self._conn().commit()
--- a/services/ha-diag-agent/tests/conftest.py
+++ b/services/ha-diag-agent/tests/conftest.py
@ -0,0 +1,64 @@
+"""Shared fixtures for ha-diag-agent tests."""
+from __future__ import annotations
+
+from pathlib import Path
+from typing import AsyncGenerator
+from unittest.mock import AsyncMock, MagicMock
+
+import pytest
+import pytest_asyncio
+
+from ha_diag.event_emitter import EventEmitter
+from ha_diag.storage import Storage
+
+
+# ---------------------------------------------------------------------------
+# Filesystem fixtures
+# ---------------------------------------------------------------------------
+
+
+@pytest.fixture
+def tmp_events_dir(tmp_path: Path) -> Path:
+    events = tmp_path / "events"
+    events.mkdir()
+    return events
+
+
+# ---------------------------------------------------------------------------
+# Storage fixture (tmp SQLite — fast, no mocking)
+# ---------------------------------------------------------------------------
+
+
+@pytest_asyncio.fixture
+async def storage(tmp_path: Path) -> AsyncGenerator[Storage, None]:
+    s = Storage(tmp_path / "test.db")
+    await s.open()
+    yield s
+    await s.close()
+
+
+# ---------------------------------------------------------------------------
+# EventEmitter fixture
+# ---------------------------------------------------------------------------
+
+
+@pytest.fixture
+def emitter(tmp_events_dir: Path) -> EventEmitter:
+    return EventEmitter(tmp_events_dir, node_name="test-node")
+
+
+# ---------------------------------------------------------------------------
+# Mock HA client fixture
+# ---------------------------------------------------------------------------
+
+
+@pytest.fixture
+def mock_ha_client():
+    """Plain HAClient mock — no context manager, just async methods."""
+    client = MagicMock()
+    client.get_api_status = AsyncMock(return_value={"message": "API running."})
+    client.get_states = AsyncMock(return_value=[])
+    client.get_entity_registry = AsyncMock(return_value=[])
+    client.get_system_health = AsyncMock(return_value={})
+    client.get_automation_traces = AsyncMock(return_value=[])
+    return client
--- a/services/ha-diag-agent/tests/integration/conftest.py
+++ b/services/ha-diag-agent/tests/integration/conftest.py
@ -0,0 +1,38 @@
+"""Integration test fixtures.
+
+Integration tests require real HA instances. Start them with:
+
+    docker compose -f tests/integration/docker-compose.ken.yml up -d
+    docker compose -f tests/integration/docker-compose.chelsty.yml up -d
+    tests/integration/scripts/wait-for-ha.sh http://localhost:8123
+    tests/integration/scripts/wait-for-ha.sh http://localhost:8124
+
+Then set TEST_HA_TOKEN (a long-lived HA token) and run:
+
+    pytest tests/ -m integration
+
+Tests that request the ha_token fixture are skipped when TEST_HA_TOKEN is unset;
+the aioresponses-based tests in this directory do not need it.
+"""
+from __future__ import annotations
+
+import os
+
+import pytest
+
+
+@pytest.fixture(scope="session")
+def ha_ken_url() -> str:
+    return os.getenv("TEST_HA_KEN_URL", "http://localhost:8123")
+
+
+@pytest.fixture(scope="session")
+def ha_chelsty_url() -> str:
+    return os.getenv("TEST_HA_CHELSTY_URL", "http://localhost:8124")
+
+
+@pytest.fixture(scope="session")
+def ha_token() -> str:
+    token = os.getenv("TEST_HA_TOKEN", "")
+    if not token:
+        pytest.skip("TEST_HA_TOKEN not set — skipping integration tests")
+    return token
--- a/services/ha-diag-agent/tests/integration/docker-compose.chelsty.yml
+++ b/services/ha-diag-agent/tests/integration/docker-compose.chelsty.yml
@ -0,0 +1,27 @@
+services:
+  ha-chelsty-init:
+    image: busybox
+    container_name: ha-test-chelsty-init
+    command: sh -c "cp -rn /fixtures/. /config/ && echo 'Fixtures copied'"
+    volumes:
+      - ./fixtures/chelsty:/fixtures:ro
+      - ha_chelsty_config:/config
+    restart: "no"
+
+  ha-chelsty:
+    image: ghcr.io/home-assistant/home-assistant:stable
+    container_name: ha-test-chelsty
+    privileged: true
+    depends_on:
+      ha-chelsty-init:
+        condition: service_completed_successfully
+    ports:
+      - "8124:8123"
+    volumes:
+      - ha_chelsty_config:/config
+    environment:
+      TZ: UTC
+    restart: "no"
+
+volumes:
+  ha_chelsty_config:
--- a/services/ha-diag-agent/tests/integration/docker-compose.ken.yml
+++ b/services/ha-diag-agent/tests/integration/docker-compose.ken.yml
@ -0,0 +1,27 @@
+services:
+  ha-ken-init:
+    image: busybox
+    container_name: ha-test-ken-init
+    command: sh -c "cp -rn /fixtures/. /config/ && echo 'Fixtures copied'"
+    volumes:
+      - ./fixtures/ken:/fixtures:ro
+      - ha_ken_config:/config
+    restart: "no"
+
+  ha-ken:
+    image: ghcr.io/home-assistant/home-assistant:stable
+    container_name: ha-test-ken
+    privileged: true
+    depends_on:
+      ha-ken-init:
+        condition: service_completed_successfully
+    ports:
+      - "8123:8123"
+    volumes:
+      - ha_ken_config:/config
+    environment:
+      TZ: UTC
+    restart: "no"
+
+volumes:
+  ha_ken_config:
--- a/services/ha-diag-agent/tests/integration/fixtures/chelsty/configuration.yaml
+++ b/services/ha-diag-agent/tests/integration/fixtures/chelsty/configuration.yaml
@ -0,0 +1,18 @@
+# Home Assistant test fixture — chelsty site
+# Used by integration tests only. Not for production.
+
+homeassistant:
+  name: "Test HA - Chelsty"
+  latitude: 0.0
+  longitude: 0.0
+  elevation: 0
+  unit_system: metric
+  time_zone: UTC
+  country: PL
+
+# Enable REST API
+api:
+
+# Disable analytics
+analytics:
+  reporting: false
--- a/services/ha-diag-agent/tests/integration/fixtures/ken/configuration.yaml
+++ b/services/ha-diag-agent/tests/integration/fixtures/ken/configuration.yaml
@ -0,0 +1,18 @@
+# Home Assistant test fixture — ken (piha) site
+# Used by integration tests only. Not for production.
+
+homeassistant:
+  name: "Test HA - Ken"
+  latitude: 0.0
+  longitude: 0.0
+  elevation: 0
+  unit_system: metric
+  time_zone: UTC
+  country: PL
+
+# Enable REST API
+api:
+
+# Disable analytics
+analytics:
+  reporting: false
--- a/services/ha-diag-agent/tests/integration/scripts/reset.sh
+++ b/services/ha-diag-agent/tests/integration/scripts/reset.sh
@ -0,0 +1,36 @@
+#!/bin/sh
+# Reset an HA Docker volume from a snapshot or fixture directory.
+# Usage: reset.sh <compose_file> <service_name> <fixture_dir>
+#
+# Stops the service, removes /config/.storage from its volume, copies the
+# fixture directory over the volume, then restarts the service.
+
+set -e
+
+COMPOSE_FILE="${1:?Usage: reset.sh <compose_file> <service_name> <fixture_dir>}"
+SERVICE="${2:?}"
+FIXTURE_DIR="${3:?}"
+COMPOSE_DIR="$(dirname "$COMPOSE_FILE")"
+
+printf 'Resetting %s from %s...\n' "$SERVICE" "$FIXTURE_DIR"
+
+# Stop the service (keep the init container stopped too)
+docker compose -f "$COMPOSE_FILE" stop "$SERVICE" 2>/dev/null || true
+
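+# Determine the volume name. `config --volumes` prints only the compose key
+# (e.g. ha_ken_config); Compose creates the volume as <project>_<key>, where the
+# project defaults to the compose file's directory name (assumes no -p override).
+VOLUME_KEY="$(docker compose -f "$COMPOSE_FILE" config --volumes 2>/dev/null | head -1)"
+PROJECT="${COMPOSE_PROJECT_NAME:-$(basename "$(cd "$COMPOSE_DIR" && pwd)")}"
+VOLUME_NAME="${VOLUME_KEY:+${PROJECT}_${VOLUME_KEY}}"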
+if [ -z "$VOLUME_NAME" ]; then
+    printf 'Could not determine volume name from %s\n' "$COMPOSE_FILE" >&2
+    exit 1
+fi
+
+# Remove HA's .storage state and copy the fixtures over the volume
+docker run --rm \
+    -v "$VOLUME_NAME":/config \
+    -v "$(realpath "$FIXTURE_DIR")":/fixtures:ro \
+    busybox \
+    sh -c "rm -rf /config/.storage && cp -r /fixtures/. /config/"
+
+# Restart the service
+docker compose -f "$COMPOSE_FILE" start "$SERVICE"
+printf 'Reset complete. Run wait-for-ha.sh to confirm readiness.\n'
--- a/services/ha-diag-agent/tests/integration/scripts/snapshot.sh
+++ b/services/ha-diag-agent/tests/integration/scripts/snapshot.sh
@ -0,0 +1,21 @@
+#!/bin/sh
+# Snapshot the current state of an HA Docker volume.
+# Usage: snapshot.sh <volume_name> [output_dir]
+#
+# Saves a tar.gz of the entire volume to output_dir (default: ./snapshots/).
+# To restore, extract the archive and pass its data/ directory to reset.sh as <fixture_dir>.
+
+set -e
+
+VOLUME="${1:?Usage: snapshot.sh <volume_name> [output_dir]}"
+OUTPUT_DIR="${2:-./snapshots}"
+SNAPSHOT_FILE="$OUTPUT_DIR/$VOLUME-$(date +%Y%m%d-%H%M%S).tar.gz"
+
+mkdir -p "$OUTPUT_DIR"
+printf 'Snapshotting volume %s -> %s\n' "$VOLUME" "$SNAPSHOT_FILE"
+
+docker run --rm \
+    -v "$VOLUME":/data:ro \
+    alpine \
+    tar czf - -C / data \
+    > "$SNAPSHOT_FILE"
+
+printf 'Snapshot saved: %s\n' "$SNAPSHOT_FILE"
--- a/services/ha-diag-agent/tests/integration/scripts/wait-for-ha.sh
+++ b/services/ha-diag-agent/tests/integration/scripts/wait-for-ha.sh
@ -0,0 +1,23 @@
+#!/bin/sh
+# Wait until a Home Assistant instance is ready (responds to /api/).
+# Usage: wait-for-ha.sh <url> [timeout_seconds]
+#
+# Exit 0 = HA ready, Exit 1 = timeout reached.
+
+URL="${1:-http://localhost:8123}"
+TIMEOUT="${2:-120}"
+
+elapsed=0
+printf 'Waiting for HA at %s (timeout %ss)...\n' "$URL" "$TIMEOUT"
+
+while [ "$elapsed" -lt "$TIMEOUT" ]; do
+    # No -f: without a token /api/ answers 401, which still proves HA is serving HTTP.
+    if curl -s --max-time 3 "$URL/api/" -o /dev/null 2>/dev/null; then
+        printf 'HA ready at %s (after %ss)\n' "$URL" "$elapsed"
+        exit 0
+    fi
+    sleep 2
+    elapsed=$((elapsed + 2))
+done
+
+printf 'Timeout: HA not ready at %s after %ss\n' "$URL" "$TIMEOUT" >&2
+exit 1
--- a/services/ha-diag-agent/tests/integration/test_automation_failures_integration.py
+++ b/services/ha-diag-agent/tests/integration/test_automation_failures_integration.py
@ -0,0 +1,167 @@
+"""Integration tests for AutomationFailuresCheck.
+
+Uses real aiosqlite Storage + EventEmitter + mocked HTTP.
+"""
+from __future__ import annotations
+
+import json
+from pathlib import Path
+from typing import AsyncGenerator
+
+import pytest
+import pytest_asyncio
+from aioresponses import aioresponses
+
+from ha_diag.checks.automation_failures import AutomationFailuresCheck
+from ha_diag.config import Settings
+from ha_diag.event_emitter import EventEmitter
+from ha_diag.ha_client import HAClient, make_session
+from ha_diag.models import HAEventType
+from ha_diag.storage import Storage
+
+HA_URL = "http://ha-test-ken:8123"
+
+
+def _settings(**overrides) -> Settings:
+    defaults: dict = {
+        "ha_url": HA_URL,
+        "ha_token": "test-token",
+        "node_name": "piha",
+        "location_tag": "ken",
+        "alert_cooldown_hours": 0.0,
+        "automation_failure_threshold": 3,
+        "check_interval": 60,
+        "check_interval_unavailable": 3600,
+    }
+    defaults.update(overrides)
+    return Settings(**defaults)
+
+
+@pytest_asyncio.fixture
+async def storage(tmp_path: Path) -> AsyncGenerator[Storage, None]:
+    s = Storage(tmp_path / "integration_test.db")
+    await s.open()
+    yield s
+    await s.close()
+
+
+@pytest.fixture
+def events_dir(tmp_path: Path) -> Path:
+    d = tmp_path / "events"
+    d.mkdir()
+    return d
+
+
+def _auto_states(*entity_ids: str) -> list[dict]:
+    return [
+        {
+            "entity_id": eid,
+            "state": "on",
+            "attributes": {"friendly_name": eid.split(".")[-1].replace("_", " ").title()},
+        }
+        for eid in entity_ids
+    ]
+
+
+def _fail_traces(n: int = 3) -> list[dict]:
+    return [
+        {
+            "run_id": f"run-{i}",
+            "timestamp": f"2026-05-27T{10+i:02d}:00:00+00:00",
+            "trigger": "state",
+            "state": "stopped",
+            "error": f"Script error #{i}",
+        }
+        for i in range(n)
+    ]
+
+
+def _ok_traces(n: int = 3) -> list[dict]:
+    return [
+        {
+            "run_id": f"run-{i}",
+            "timestamp": f"2026-05-27T{10+i:02d}:00:00+00:00",
+            "trigger": "state",
+            "state": "stopped",
+            "error": None,
+        }
+        for i in range(n)
+    ]
+
+
+@pytest.mark.integration
+async def test_failing_automation_emits_event_and_writes_file(
+    storage: Storage, events_dir: Path
+):
+    """3 consecutive failures → event file written with correct structure."""
+    states = _auto_states("automation.morning_lights")
+    traces = _fail_traces(3)
+    emitter = EventEmitter(events_dir, node_name="piha", location_tag="ken")
+
+    with aioresponses() as m:
+        m.get(f"{HA_URL}/api/states", payload=states)
+        m.get(f"{HA_URL}/api/trace/automation/automation.morning_lights", payload=traces)
+        async with make_session("test-token") as session:
+            client = HAClient(HA_URL, session)
+            check = AutomationFailuresCheck(client, storage, _settings())
+            results = await check.run()
+
+    assert len(results) == 1
+    r = results[0]
+    assert r.event_type == HAEventType.ha_automation_failing
+    assert r.payload["entity_id"] == "automation.morning_lights"
+    assert r.payload["total_recent_failures"] == 3
+
+    emitter.emit(
+        event_type=r.event_type,
+        severity=r.severity.value,
+        service="homeassistant",
+        message=r.message,
+        payload=r.payload,
+    )
+
+    files = list(events_dir.glob("*.json"))
+    assert len(files) == 1
+    data = json.loads(files[0].read_text())
+    assert data["type"] == "ha_automation_failing"
+    assert data["payload"]["location_tag"] == "ken"
+    assert "last_failures" in data["payload"]
+
+
+@pytest.mark.integration
+async def test_healthy_automation_no_event(storage: Storage):
+    """All recent runs successful → no event."""
+    states = _auto_states("automation.morning_lights")
+    traces = _ok_traces(3)
+
+    with aioresponses() as m:
+        m.get(f"{HA_URL}/api/states", payload=states)
+        m.get(f"{HA_URL}/api/trace/automation/automation.morning_lights", payload=traces)
+        async with make_session("test-token") as session:
+            client = HAClient(HA_URL, session)
+            check = AutomationFailuresCheck(client, storage, _settings())
+            results = await check.run()
+
+    assert results == []
+
+
+@pytest.mark.integration
+async def test_cooldown_suppresses_duplicate(storage: Storage):
+    """Second run within cooldown window → no duplicate event."""
+    states = _auto_states("automation.morning_lights")
+    traces = _fail_traces(3)
+    settings = _settings(alert_cooldown_hours=6.0)
+
+    for attempt in range(2):
+        with aioresponses() as m:
+            m.get(f"{HA_URL}/api/states", payload=states)
+            m.get(f"{HA_URL}/api/trace/automation/automation.morning_lights", payload=traces)
+            async with make_session("test-token") as session:
+                check = AutomationFailuresCheck(
+                    HAClient(HA_URL, session), storage, settings
+                )
+                results = await check.run()
+        if attempt == 0:
+            assert len(results) == 1
+        else:
+            assert results == []
--- a/services/ha-diag-agent/tests/integration/test_heartbeat_integration.py
+++ b/services/ha-diag-agent/tests/integration/test_heartbeat_integration.py
@ -0,0 +1,59 @@
+"""Integration tests for HeartbeatCheck against real HA instances.
+
+Requires:
+    - docker compose -f tests/integration/docker-compose.ken.yml up -d
+    - docker compose -f tests/integration/docker-compose.chelsty.yml up -d
+    - TEST_HA_TOKEN=<long-lived-token> pytest tests/ -m integration
+"""
+from __future__ import annotations
+
+import pytest
+
+from ha_diag.checks.heartbeat import HeartbeatCheck
+from ha_diag.event_emitter import EventEmitter
+from ha_diag.ha_client import HAClient, make_session
+from ha_diag.models import HAEventType
+
+
+@pytest.mark.integration
+async def test_heartbeat_ken_healthy(ha_ken_url: str, ha_token: str):
+    async with make_session(ha_token) as session:
+        client = HAClient(ha_ken_url, session)
+        check = HeartbeatCheck(client)
+        results = await check.run()
+    assert results == [], f"HA ken not healthy: {results}"
+
+
+@pytest.mark.integration
+async def test_heartbeat_chelsty_healthy(ha_chelsty_url: str, ha_token: str):
+    async with make_session(ha_token) as session:
+        client = HAClient(ha_chelsty_url, session)
+        check = HeartbeatCheck(client)
+        results = await check.run()
+    assert results == [], f"HA chelsty not healthy: {results}"
+
+
+@pytest.mark.integration
+async def test_heartbeat_emits_event_on_failure():
+    """Connecting to a closed port should yield ha_websocket_dead."""
+    async with make_session("bad-token") as session:
+        client = HAClient("http://127.0.0.1:19999", session)  # nothing here
+        check = HeartbeatCheck(client)
+        results = await check.run()
+    assert len(results) == 1
+    assert results[0].event_type == HAEventType.ha_websocket_dead
+
+
+@pytest.mark.integration
+async def test_heartbeat_event_written_to_filesystem(
+    ha_ken_url: str, ha_token: str, tmp_path
+):
+    emitter = EventEmitter(tmp_path / "events", node_name="test-piha", location_tag="ken")
+    async with make_session(ha_token) as session:
+        client = HAClient(ha_ken_url, session)
+        check = HeartbeatCheck(client)
+        results = await check.run()
+
+    # Healthy HA yields no results, so nothing is emitted and no event file is written
+    assert results == []
+    assert not list((tmp_path / "events").glob("*.json"))
--- a/services/ha-diag-agent/tests/integration/test_system_health_integration.py
+++ b/services/ha-diag-agent/tests/integration/test_system_health_integration.py
@ -0,0 +1,151 @@
+"""Integration tests for SystemHealthCheck using aioresponses.
+
+Uses real aiosqlite Storage + EventEmitter + mocked HTTP.
+Marked 'integration' because it exercises the full stack end-to-end.
+"""
+from __future__ import annotations
+
+import json
+from pathlib import Path
+from typing import AsyncGenerator
+
+import pytest
+import pytest_asyncio
+from aioresponses import aioresponses
+
+from ha_diag.checks.system_health import SystemHealthCheck
+from ha_diag.config import Settings
+from ha_diag.event_emitter import EventEmitter
+from ha_diag.ha_client import HAClient, make_session
+from ha_diag.models import HAEventType
+from ha_diag.storage import Storage
+
+HA_URL = "http://ha-test-ken:8123"
+
+
+def _settings(**overrides) -> Settings:
+    defaults: dict = {
+        "ha_url": HA_URL,
+        "ha_token": "test-token",
+        "node_name": "piha",
+        "location_tag": "ken",
+        "alert_cooldown_hours": 0.0,
+        "check_interval": 60,
+        "check_interval_unavailable": 3600,
+    }
+    defaults.update(overrides)
+    return Settings(**defaults)
+
+
+@pytest_asyncio.fixture
+async def storage(tmp_path: Path) -> AsyncGenerator[Storage, None]:
+    s = Storage(tmp_path / "integration_test.db")
+    await s.open()
+    yield s
+    await s.close()
+
+
+@pytest.fixture
+def events_dir(tmp_path: Path) -> Path:
+    d = tmp_path / "events"
+    d.mkdir()
+    return d
+
+
+@pytest.mark.integration
+async def test_system_health_ok_components_no_event(
+    storage: Storage, events_dir: Path
+):
+    """All components healthy on first run → no events emitted."""
+    health = {
+        "homeassistant": {"type": "result", "data": {"version": "2025.5.0"}},
+        "recorder": {"type": "result", "data": {"backlog": 0}},
+    }
+    emitter = EventEmitter(events_dir, node_name="piha", location_tag="ken")
+
+    with aioresponses() as m:
+        m.get(f"{HA_URL}/api/system_health", payload=health)
+        async with make_session("test-token") as session:
+            client = HAClient(HA_URL, session)
+            check = SystemHealthCheck(client, storage, _settings())
+            results = await check.run()
+
+    assert results == []
+    assert not list(events_dir.glob("*.json"))
+
+
+@pytest.mark.integration
+async def test_system_health_degraded_emits_event_and_writes_file(
+    storage: Storage, events_dir: Path
+):
+    """Component degrades: event emitted + file written with correct structure."""
+    # First run: all ok
+    health_ok = {"cloud": {"type": "result", "data": {}}}
+    health_err = {"cloud": {"type": "error", "error": "Cloud connection lost"}}
+    emitter = EventEmitter(events_dir, node_name="piha", location_tag="ken")
+
+    with aioresponses() as m:
+        m.get(f"{HA_URL}/api/system_health", payload=health_ok)
+        async with make_session("test-token") as session:
+            client = HAClient(HA_URL, session)
+            await SystemHealthCheck(client, storage, _settings()).run()
+
+    # Second run: cloud errors
+    with aioresponses() as m:
+        m.get(f"{HA_URL}/api/system_health", payload=health_err)
+        async with make_session("test-token") as session:
+            client = HAClient(HA_URL, session)
+            check = SystemHealthCheck(client, storage, _settings())
+            results = await check.run()
+
+    assert len(results) == 1
+    assert results[0].event_type == HAEventType.ha_system_health_degraded
+
+    emitter.emit(
+        event_type=results[0].event_type,
+        severity=results[0].severity.value,
+        service="homeassistant",
+        message=results[0].message,
+        payload=results[0].payload,
+    )
+
+    files = list(events_dir.glob("*.json"))
+    assert len(files) == 1
+    data = json.loads(files[0].read_text())
+    assert data["type"] == "ha_system_health_degraded"
+    assert data["payload"]["component"] == "cloud"
+    assert data["payload"]["location_tag"] == "ken"
+
+
+@pytest.mark.integration
+async def test_system_health_recovery_and_re_degradation(storage: Storage):
+    """Full ok→error→ok→error cycle: events fire on degradation, not on recovery."""
+    settings = _settings()
+
+    async def run_once(health):
+        with aioresponses() as m:
+            m.get(f"{HA_URL}/api/system_health", payload=health)
+            async with make_session("test-token") as session:
+                return await SystemHealthCheck(
+                    HAClient(HA_URL, session), storage, settings
+                ).run()
+
+    ok_h = {"cloud": {"type": "result", "data": {}}}
+    err_h = {"cloud": {"type": "error", "error": "timeout"}}
+
+    r1 = await run_once(ok_h)   # baseline ok
+    r2 = await run_once(err_h)  # first degradation
+    r3 = await run_once(err_h)  # sustained error (no dup)
+    r4 = await run_once(ok_h)   # recovery
+    r5 = await run_once(err_h)  # second degradation
+
+    assert r1 == []
+    assert len(r2) == 1
+    assert r3 == []
+    assert r4 == []
+    assert len(r5) == 1
--- a/services/ha-diag-agent/tests/integration/test_unavailable_entities_integration.py
+++ b/services/ha-diag-agent/tests/integration/test_unavailable_entities_integration.py
@ -0,0 +1,192 @@
+"""Functional integration test for UnavailableEntitiesCheck.
+
+Uses aioresponses for HA HTTP (controlled, deterministic) and real aiosqlite +
+EventEmitter (tests the full agent pipeline end-to-end without a live HA).
+Marked 'integration' because it exercises the complete multi-component stack.
+
+For a live-HA variant, start the ken testenv Docker instances, set
+TEST_HA_TOKEN, and extend with tests that call real HA endpoints.
+"""
+from __future__ import annotations
+
+import json
+import time
+from pathlib import Path
+from typing import AsyncGenerator
+
+import pytest
+import pytest_asyncio
+from aioresponses import aioresponses
+
+from ha_diag.checks.unavailable_entities import UnavailableEntitiesCheck
+from ha_diag.config import Settings
+from ha_diag.event_emitter import EventEmitter
+from ha_diag.ha_client import HAClient, make_session
+from ha_diag.models import HAEventType
+from ha_diag.storage import Storage
+
+HA_URL = "http://ha-test-ken:8123"
+
+
+def _settings(**overrides) -> Settings:
+    defaults: dict = {
+        "ha_url": HA_URL,
+        "ha_token": "test-token",
+        "node_name": "piha",
+        "location_tag": "ken",
+        "unavailable_threshold_hours": 0.0,
+        "integration_failure_threshold_pct": 0.5,
+        "integration_failure_min_entities": 3,
+        "alert_cooldown_hours": 0.0,
+        "check_interval": 60,
+        "check_interval_unavailable": 3600,
+    }
+    defaults.update(overrides)
+    return Settings(**defaults)
+
+
+@pytest_asyncio.fixture
+async def storage(tmp_path: Path) -> AsyncGenerator[Storage, None]:
+    s = Storage(tmp_path / "integration_test.db")
+    await s.open()
+    yield s
+    await s.close()
+
+
+@pytest.fixture
+def events_dir(tmp_path: Path) -> Path:
+    d = tmp_path / "events"
+    d.mkdir()
+    return d
+
+
+@pytest.mark.integration
+async def test_full_pipeline_integration_event(storage: Storage, events_dir: Path):
+    """3/3 zha entities unavailable → ha_integration_failed, 1 event file on disk."""
+    unavailable_entities = [
+        {"entity_id": f"light.test_{i}", "state": "unavailable", "attributes": {}}
+        for i in range(3)
+    ]
+    available_entities = [{"entity_id": "sensor.ok", "state": "on", "attributes": {}}]
+    all_states = unavailable_entities + available_entities
+    registry = [
+        {"entity_id": e["entity_id"], "platform": "zha", "area_id": "living_room"}
+        for e in unavailable_entities
+    ]
+
+    for e in unavailable_entities:
+        await storage.set_entity_unavailable_since(
+            e["entity_id"], "unavailable", time.time() - 25 * 3600
+        )
+
+    emitter = EventEmitter(events_dir, node_name="piha", location_tag="ken")
+
+    with aioresponses() as m:
+        m.get(f"{HA_URL}/api/states", payload=all_states)
+        m.get(f"{HA_URL}/api/config/entity_registry", payload=registry)
+        async with make_session("test-token") as session:
+            client = HAClient(HA_URL, session)
+            check = UnavailableEntitiesCheck(client, storage, _settings())
+            results = await check.run()
+
+    # 3/3 zha entities (100% >= 50%, count 3 >= 3) → integration event
+    assert len(results) == 1
+    assert results[0].event_type == HAEventType.ha_integration_failed
+    assert results[0].payload["integration"] == "zha"
+
+    emitter.emit(
+        event_type=results[0].event_type,
+        severity=results[0].severity.value,
+        service="homeassistant",
+        message=results[0].message,
+        payload=results[0].payload,
+    )
+
+    event_files = list(events_dir.glob("*.json"))
+    assert len(event_files) == 1
+    event_data = json.loads(event_files[0].read_text())
+    assert event_data["node"] == "piha"
+    assert event_data["payload"]["location_tag"] == "ken"
+    assert event_data["payload"]["integration"] == "zha"
+    assert event_data["type"] == "ha_integration_failed"
+
+
+@pytest.mark.integration
+async def test_full_pipeline_individual_entity_events(
+    storage: Storage, events_dir: Path
+):
+    """2 unavailable entities from different integrations → 2 individual events."""
+    states = [
+        {"entity_id": "light.zha_one", "state": "unavailable", "attributes": {}},
+        {"entity_id": "sensor.mqtt_one", "state": "unavailable", "attributes": {}},
+        {"entity_id": "switch.ok", "state": "on", "attributes": {}},
+    ]
+    registry = [
+        {"entity_id": "light.zha_one", "platform": "zha", "area_id": ""},
+        {"entity_id": "sensor.mqtt_one", "platform": "mqtt", "area_id": ""},
+    ]
+
+    for e in ["light.zha_one", "sensor.mqtt_one"]:
+        await storage.set_entity_unavailable_since(e, "unavailable", time.time() - 25 * 3600)
+
+    emitter = EventEmitter(events_dir, node_name="piha", location_tag="ken")
+
+    with aioresponses() as m:
+        m.get(f"{HA_URL}/api/states", payload=states)
+        m.get(f"{HA_URL}/api/config/entity_registry", payload=registry)
+        async with make_session("test-token") as session:
+            client = HAClient(HA_URL, session)
+            check = UnavailableEntitiesCheck(client, storage, _settings())
+            results = await check.run()
+
+    # Both integrations have only 1 entity each → below min_entities threshold
+    assert len(results) == 2
+    assert all(r.event_type == HAEventType.ha_entity_unavailable_long for r in results)
+
+    for result in results:
+        emitter.emit(
+            event_type=result.event_type,
+            severity=result.severity.value,
+            service="homeassistant",
+            message=result.message,
+            payload=result.payload,
+        )
+
+    files = list(events_dir.glob("*.json"))
+    assert len(files) == 2
+    for f in files:
+        data = json.loads(f.read_text())
+        assert data["payload"]["location_tag"] == "ken"
+        assert "entity_id" in data["payload"]
+        assert "since" in data["payload"]
+        assert data["payload"]["since"].endswith("Z")
+
+
+@pytest.mark.integration
+async def test_recovery_removes_tracking(storage: Storage, events_dir: Path):
+    """Entity recovers between check cycles → baseline cleared, no event next cycle."""
+    eid = "light.recoverable"
+    await storage.set_entity_unavailable_since(eid, "unavailable", time.time() - 25 * 3600)
+
+    # Cycle 1: entity unavailable → event
+    states_cycle1 = [{"entity_id": eid, "state": "unavailable", "attributes": {}}]
+    with aioresponses() as m:
+        m.get(f"{HA_URL}/api/states", payload=states_cycle1)
+        m.get(f"{HA_URL}/api/config/entity_registry", payload=[])
+        async with make_session("test-token") as session:
+            client = HAClient(HA_URL, session)
+            check = UnavailableEntitiesCheck(client, storage, _settings())
+            results1 = await check.run()
+    assert len(results1) == 1
+
+    # Cycle 2: entity recovered → no event, baseline cleared
+    states_cycle2 = [{"entity_id": eid, "state": "on", "attributes": {}}]
+    with aioresponses() as m:
+        m.get(f"{HA_URL}/api/states", payload=states_cycle2)
+        m.get(f"{HA_URL}/api/config/entity_registry", payload=[])
+        async with make_session("test-token") as session:
+            client = HAClient(HA_URL, session)
+            check2 = UnavailableEntitiesCheck(client, storage, _settings())
+            results2 = await check2.run()
+    assert results2 == []
+    assert await storage.get_entity_first_unavailable_at(eid) is None
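Review note: the two pipeline tests above pin down a grouping rule: an integration event fires only when enough of a platform's entities are unavailable (the comments mention a 50% ratio and a minimum count of 3); otherwise each entity gets its own event. The real `UnavailableEntitiesCheck` is not shown in this section, so the following is only a minimal sketch of that rule. The function name `group_unavailable` and its parameters are hypothetical.

```python
from collections import defaultdict


def group_unavailable(
    unavailable: list[str],
    platform_of: dict[str, str],
    min_entities: int = 3,   # assumed default, taken from the test comments
    min_ratio: float = 0.5,  # assumed default, taken from the test comments
) -> tuple[list[str], list[str]]:
    """Split unavailable entities into failed integrations and individual entities."""
    # Count how many registered entities each platform owns.
    total: dict[str, int] = defaultdict(int)
    for platform in platform_of.values():
        total[platform] += 1

    # Bucket the unavailable entities by platform ("" when unregistered).
    down: dict[str, list[str]] = defaultdict(list)
    for eid in unavailable:
        down[platform_of.get(eid, "")].append(eid)

    integrations: list[str] = []
    individuals: list[str] = []
    for platform, eids in down.items():
        ratio = len(eids) / total[platform] if total[platform] else 0.0
        if platform and len(eids) >= min_entities and ratio >= min_ratio:
            integrations.append(platform)
        else:
            individuals.extend(eids)
    return integrations, individuals
```

With three of three `zha` entities down this yields one integration; with one `zha` and one `mqtt` entity down it yields two individual entities, matching the expectations in the tests.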
--- a/services/ha-diag-agent/tests/integration/test_updates_available_integration.py
+++ b/services/ha-diag-agent/tests/integration/test_updates_available_integration.py
@ -0,0 +1,169 @@
+"""Integration tests for UpdatesAvailableCheck and UpdatesDigestCheck.
+
+Uses real aiosqlite Storage + EventEmitter + mocked HTTP.
+"""
+from __future__ import annotations
+
+import json
+from pathlib import Path
+from typing import AsyncGenerator
+
+import pytest
+import pytest_asyncio
+from aioresponses import aioresponses
+
+from ha_diag.checks.updates_available import UpdatesAvailableCheck, UpdatesDigestCheck
+from ha_diag.config import Settings
+from ha_diag.event_emitter import EventEmitter
+from ha_diag.ha_client import HAClient, make_session
+from ha_diag.models import HAEventType
+from ha_diag.storage import Storage
+
+HA_URL = "http://ha-test-ken:8123"
+
+
+def _settings(**overrides) -> Settings:
+    defaults: dict = {
+        "ha_url": HA_URL,
+        "ha_token": "test-token",
+        "node_name": "piha",
+        "location_tag": "ken",
+        "alert_cooldown_hours": 0.0,
+        "updates_cooldown_days": 0,
+        "check_interval": 60,
+        "check_interval_unavailable": 3600,
+    }
+    defaults.update(overrides)
+    return Settings(**defaults)
+
+
+@pytest_asyncio.fixture
+async def storage(tmp_path: Path) -> AsyncGenerator[Storage, None]:
+    s = Storage(tmp_path / "integration_test.db")
+    await s.open()
+    yield s
+    await s.close()
+
+
+@pytest.fixture
+def events_dir(tmp_path: Path) -> Path:
+    d = tmp_path / "events"
+    d.mkdir()
+    return d
+
+
+def _update_states(*entity_ids: str) -> list[dict]:
+    return [
+        {
+            "entity_id": eid,
+            "state": "on",
+            "attributes": {
+                "title": eid.split(".")[-1].replace("_", " ").title(),
+                "installed_version": "1.0.0",
+                "latest_version": "2.0.0",
+                "in_progress": False,
+                "auto_update": False,
+            },
+        }
+        for eid in entity_ids
+    ]
+
+
+@pytest.mark.integration
+async def test_individual_updates_written_to_disk(storage: Storage, events_dir: Path):
+    """2 pending updates → 2 event files with correct structure."""
+    states = _update_states("update.ha_core", "update.mosquitto")
+    emitter = EventEmitter(events_dir, node_name="piha", location_tag="ken")
+
+    with aioresponses() as m:
+        m.get(f"{HA_URL}/api/states", payload=states)
+        async with make_session("test-token") as session:
+            client = HAClient(HA_URL, session)
+            check = UpdatesAvailableCheck(client, storage, _settings())
+            results = await check.run()
+
+    assert len(results) == 2
+    for r in results:
+        assert r.event_type == HAEventType.ha_update_available
+        emitter.emit(
+            event_type=r.event_type,
+            severity=r.severity.value,
+            service="homeassistant",
+            message=r.message,
+            payload=r.payload,
+        )
+
+    files = list(events_dir.glob("*.json"))
+    assert len(files) == 2
+    for f in files:
+        data = json.loads(f.read_text())
+        assert data["type"] == "ha_update_available"
+        assert data["payload"]["location_tag"] == "ken"
+        assert "entity_id" in data["payload"]
+
+
+@pytest.mark.integration
+async def test_digest_writes_single_event_file(storage: Storage, events_dir: Path):
+    """Sunday digest → single event file with digest=True payload."""
+    states = _update_states("update.ha_core", "update.mosquitto", "update.esphome")
+    emitter = EventEmitter(events_dir, node_name="piha", location_tag="ken")
+
+    with aioresponses() as m:
+        m.get(f"{HA_URL}/api/states", payload=states)
+        async with make_session("test-token") as session:
+            client = HAClient(HA_URL, session)
+            check = UpdatesDigestCheck(client, storage, _settings())
+            results = await check.run()
+
+    assert len(results) == 1
+    r = results[0]
+    assert r.payload["digest"] is True
+    assert r.payload["count"] == 3
+
+    emitter.emit(
+        event_type=r.event_type,
+        severity=r.severity.value,
+        service="homeassistant",
+        message=r.message,
+        payload=r.payload,
+    )
+    files = list(events_dir.glob("*.json"))
+    assert len(files) == 1
+    data = json.loads(files[0].read_text())
+    assert data["payload"]["digest"] is True
+    assert len(data["payload"]["updates"]) == 3
+
+
+@pytest.mark.integration
+async def test_dedup_across_daily_and_digest_independent(storage: Storage):
+    """Daily dedup key doesn't suppress digest, and vice versa."""
+    states = _update_states("update.ha_core")
+    settings = _settings(updates_cooldown_days=7)
+
+    # Daily check
+    with aioresponses() as m:
+        m.get(f"{HA_URL}/api/states", payload=states)
+        async with make_session("test-token") as session:
+            r1 = await UpdatesAvailableCheck(
+                HAClient(HA_URL, session), storage, settings
+            ).run()
+    assert len(r1) == 1
+
+    # Daily again — cooldown active
+    with aioresponses() as m:
+        m.get(f"{HA_URL}/api/states", payload=states)
+        async with make_session("test-token") as session:
+            r2 = await UpdatesAvailableCheck(
+                HAClient(HA_URL, session), storage, settings
+            ).run()
+    assert r2 == []
+
+    # Digest — different key, should still fire
+    with aioresponses() as m:
+        m.get(f"{HA_URL}/api/states", payload=states)
+        async with make_session("test-token") as session:
+            r3 = await UpdatesDigestCheck(
+                HAClient(HA_URL, session), storage, settings
+            ).run()
+    assert len(r3) == 1
+    assert r3[0].payload["digest"] is True
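Review note: `test_dedup_across_daily_and_digest_independent` relies on the daily check and the digest check using separate dedup keys. The storage layer is not part of this section, so the following is only a sketch of the idea with an in-memory tracker. The class `CooldownTracker` and the key strings are hypothetical.

```python
import time
from typing import Callable


class CooldownTracker:
    """Remember when each key last fired and suppress repeats within a cooldown."""

    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self._clock = clock
        self._last_fired: dict[str, float] = {}

    def should_emit(self, key: str, cooldown_seconds: float) -> bool:
        now = self._clock()
        last = self._last_fired.get(key)
        if last is not None and now - last < cooldown_seconds:
            return False  # still cooling down for this key
        self._last_fired[key] = now
        return True


week = 7 * 24 * 3600
tracker = CooldownTracker()
first_daily = tracker.should_emit("update:update.ha_core", week)   # fires
second_daily = tracker.should_emit("update:update.ha_core", week)  # suppressed
digest = tracker.should_emit("digest:updates", week)               # separate key, fires
```

Because the cooldown is tracked per key, suppressing the daily event has no effect on the digest, which is the behaviour the test asserts.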
--- a/services/ha-diag-agent/tests/integration/test_websocket_monitor_integration.py
+++ b/services/ha-diag-agent/tests/integration/test_websocket_monitor_integration.py
@ -0,0 +1,186 @@
+"""Integration tests for WebSocketMonitor against real HA instances.
+
+Requires:
+    docker compose -f tests/integration/docker-compose.ken.yml up -d
+    tests/integration/scripts/wait-for-ha.sh http://localhost:8123
+    TEST_HA_TOKEN=<long-lived-token> pytest tests/ -m integration
+
+Container stop/restart tests additionally need Docker access from the host.
+"""
+from __future__ import annotations
+
+import asyncio
+import subprocess
+import time
+from pathlib import Path
+
+import pytest
+
+from ha_diag.config import Settings
+from ha_diag.event_emitter import EventEmitter
+from ha_diag.models import HAEventType
+from ha_diag.monitors.websocket_monitor import WebSocketMonitor
+from ha_diag.ha_client import make_session
+
+
+# ---------------------------------------------------------------------------
+# Helpers
+# ---------------------------------------------------------------------------
+
+
+def _make_settings(ha_url: str, ha_token: str, **overrides) -> Settings:
+    defaults: dict = {
+        "ha_url": ha_url,
+        "ha_token": ha_token,
+        "node_name": "test-piha",
+        "location_tag": "ken",
+        "websocket_enabled": True,
+        "websocket_silence_threshold_seconds": 30,    # low for fast test
+        "websocket_watchdog_interval_seconds": 5,
+        "websocket_reconnect_initial_delay": 1.0,
+        "websocket_reconnect_max_delay": 10.0,
+        "websocket_reconnect_jitter": 0.0,
+        "websocket_down_alert_repeat_minutes": 0,     # always re-alert
+    }
+    defaults.update(overrides)
+    return Settings(**defaults)
+
+
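+def _emitted_types(events_dir: Path) -> list[str]:
+    """Return the `type` field of every emitted event file, in filename order."""
+    import json  # local import keeps this helper self-contained
+
+    return [
+        json.loads(f.read_text())["type"]
+        for f in sorted(events_dir.glob("*.json"))
+    ]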
+
+
+# ---------------------------------------------------------------------------
+# Tests
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.integration
+async def test_ws_normal_operation_no_false_alerts(
+    ha_ken_url: str, ha_token: str, tmp_path: Path
+):
+    """Normal operation: monitor connects, subscribes, no dead alerts emitted."""
+    events_dir = tmp_path / "events"
+    events_dir.mkdir()
+    settings = _make_settings(ha_ken_url, ha_token)
+    emitter = EventEmitter(events_dir, node_name="test-piha", location_tag="ken")
+
+    async with make_session(ha_token) as session:
+        monitor = WebSocketMonitor(
+            ha_url=ha_ken_url,
+            token=ha_token,
+            settings=settings,
+            emitter=emitter,
+            session=session,
+        )
+        await monitor.start()
+        await asyncio.sleep(5)  # let it connect and settle
+        assert monitor.is_healthy, "Monitor should be subscribed and healthy"
+        await monitor.stop()
+
+    # No dead alerts during normal operation
+    types = _emitted_types(events_dir)
+    assert HAEventType.ha_websocket_dead.value not in types, (
+        f"Unexpected dead alerts during normal operation: {types}"
+    )
+
+
+@pytest.mark.integration
+async def test_ws_dead_emitted_when_ha_stops(ha_ken_url: str, ha_token: str, tmp_path: Path):
+    """Stopping the HA container triggers ha_websocket_dead."""
+    events_dir = tmp_path / "events"
+    events_dir.mkdir()
+    settings = _make_settings(ha_ken_url, ha_token)
+    emitter = EventEmitter(events_dir, node_name="test-piha", location_tag="ken")
+
+    async with make_session(ha_token) as session:
+        monitor = WebSocketMonitor(
+            ha_url=ha_ken_url,
+            token=ha_token,
+            settings=settings,
+            emitter=emitter,
+            session=session,
+        )
+        await monitor.start()
+        # Wait for initial subscription
+        for _ in range(20):
+            if monitor.is_healthy:
+                break
+            await asyncio.sleep(0.5)
+        assert monitor.is_healthy, "Monitor did not subscribe within 10s"
+
+        # Stop HA container
+        subprocess.run(
+            ["docker", "stop", "ha-test-ken"],
+            check=True, capture_output=True, timeout=30,
+        )
+        try:
+            # Wait for dead alert (up to 15s)
+            deadline = time.monotonic() + 15
+            while time.monotonic() < deadline:
+                types = _emitted_types(events_dir)
+                if HAEventType.ha_websocket_dead.value in types:
+                    break
+                await asyncio.sleep(0.5)
+
+            types = _emitted_types(events_dir)
+            assert HAEventType.ha_websocket_dead.value in types, (
+                "ha_websocket_dead not emitted after HA container stopped"
+            )
+        finally:
+            await monitor.stop()
+            subprocess.run(
+                ["docker", "start", "ha-test-ken"],
+                check=False, capture_output=True, timeout=30,
+            )
+
+
+@pytest.mark.integration
+async def test_ws_recovered_after_ha_restart(ha_ken_url: str, ha_token: str, tmp_path: Path):
+    """After HA restarts, monitor reconnects and emits ha_websocket_recovered."""
+    events_dir = tmp_path / "events"
+    events_dir.mkdir()
+    settings = _make_settings(ha_ken_url, ha_token)
+    emitter = EventEmitter(events_dir, node_name="test-piha", location_tag="ken")
+
+    async with make_session(ha_token) as session:
+        monitor = WebSocketMonitor(
+            ha_url=ha_ken_url,
+            token=ha_token,
+            settings=settings,
+            emitter=emitter,
+            session=session,
+        )
+        await monitor.start()
+        for _ in range(20):
+            if monitor.is_healthy:
+                break
+            await asyncio.sleep(0.5)
+        assert monitor.is_healthy
+
+        try:
+            # Stop then restart HA; inside the try so monitor.stop() always runs
+            subprocess.run(["docker", "stop", "ha-test-ken"], check=True, timeout=30)
+            await asyncio.sleep(2)
+            subprocess.run(["docker", "start", "ha-test-ken"], check=True, timeout=30)
+
+            # Wait for recovery (up to 60s — HA takes time to start)
+            deadline = time.monotonic() + 60
+            while time.monotonic() < deadline:
+                types = _emitted_types(events_dir)
+                if HAEventType.ha_websocket_recovered.value in types:
+                    break
+                await asyncio.sleep(1.0)
+
+            types = _emitted_types(events_dir)
+            assert HAEventType.ha_websocket_dead.value in types, (
+                "ha_websocket_dead not emitted after container stop"
+            )
+            assert HAEventType.ha_websocket_recovered.value in types, (
+                "ha_websocket_recovered not emitted after HA restarted"
+            )
+        finally:
+            await monitor.stop()
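Review note: the container tests above repeat the same hand-rolled deadline loop three times (wait for `is_healthy`, wait for a dead alert, wait for recovery). A shared helper might reduce that duplication. This is a minimal sketch only; the name `wait_until` is hypothetical and nothing like it exists in the diff.

```python
import asyncio
import time
from typing import Callable


async def wait_until(
    predicate: Callable[[], bool],
    timeout: float,
    interval: float = 0.5,
) -> bool:
    """Poll `predicate` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        await asyncio.sleep(interval)
    # One last check so a condition met during the final sleep still counts.
    return predicate()
```

A test could then write `assert await wait_until(lambda: monitor.is_healthy, timeout=10)` instead of the `for _ in range(20)` loop.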
--- a/services/ha-diag-agent/tests/test_automation_failures.py
+++ b/services/ha-diag-agent/tests/test_automation_failures.py
@ -0,0 +1,217 @@
+"""Unit tests for AutomationFailuresCheck."""
+from __future__ import annotations
+
+from pathlib import Path
+from unittest.mock import AsyncMock, MagicMock
+
+import pytest
+
+from ha_diag.checks.automation_failures import AutomationFailuresCheck, _is_trace_failure
+from ha_diag.config import Settings
+from ha_diag.models import HAEventType, Severity
+from ha_diag.storage import Storage
+
+
+# ---------------------------------------------------------------------------
+# Helpers
+# ---------------------------------------------------------------------------
+
+
+def _make_settings(**overrides) -> Settings:
+    defaults: dict = {
+        "ha_url": "http://test.local:8123",
+        "ha_token": "test",
+        "node_name": "test-node",
+        "location_tag": "test-loc",
+        "alert_cooldown_hours": 0.0,
+        "automation_failure_threshold": 3,
+        "check_interval": 60,
+        "check_interval_unavailable": 3600,
+    }
+    defaults.update(overrides)
+    return Settings(**defaults)
+
+
+def _make_client(states=None, traces_by_id=None, states_error=None):
+    client = MagicMock()
+    if states_error:
+        client.get_states = AsyncMock(side_effect=states_error)
+    else:
+        client.get_states = AsyncMock(return_value=states or [])
+
+    traces_map = traces_by_id or {}
+
+    async def _get_traces(eid):
+        if eid not in traces_map:
+            raise Exception(f"404 for {eid}")
+        return traces_map[eid]
+
+    client.get_automation_traces = AsyncMock(side_effect=_get_traces)
+    return client
+
+
+def _auto_state(entity_id: str, state: str = "on", friendly_name: str | None = None) -> dict:
+    attrs: dict = {}
+    if friendly_name:
+        attrs["friendly_name"] = friendly_name
+    return {"entity_id": entity_id, "state": state, "attributes": attrs}
+
+
+def _trace(error: str | None = None, state: str = "stopped") -> dict:
+    return {
+        "run_id": "abc",
+        "timestamp": "2026-05-27T10:00:00+00:00",
+        "trigger": "state",
+        "state": state if error is None else "stopped",
+        "error": error,
+    }
+
+
+def _fail(error: str = "Script error") -> dict:
+    return _trace(error=error)
+
+
+def _ok() -> dict:
+    return _trace(error=None)
+
+
+# ---------------------------------------------------------------------------
+# _is_trace_failure unit tests
+# ---------------------------------------------------------------------------
+
+
+def test_trace_with_error_is_failure():
+    assert _is_trace_failure({"error": "Something went wrong"}) is True
+
+
+def test_trace_with_state_failed_is_failure():
+    assert _is_trace_failure({"state": "failed", "error": None}) is True
+
+
+def test_trace_with_null_error_is_success():
+    assert _is_trace_failure({"error": None, "state": "stopped"}) is False
+
+
+def test_trace_with_empty_string_error_is_success():
+    assert _is_trace_failure({"error": "", "state": "stopped"}) is False
+
+
+def test_trace_with_no_keys_is_success():
+    assert _is_trace_failure({}) is False
+
+
+# ---------------------------------------------------------------------------
+# AutomationFailuresCheck.run() tests
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.asyncio
+async def test_no_automations_returns_empty(storage: Storage):
+    check = AutomationFailuresCheck(_make_client([]), storage, _make_settings())
+    assert await check.run() == []
+
+
+@pytest.mark.asyncio
+async def test_disabled_automation_skipped(storage: Storage):
+    states = [_auto_state("automation.morning_lights", state="off")]
+    check = AutomationFailuresCheck(_make_client(states, {}), storage, _make_settings())
+    assert await check.run() == []
+
+
+@pytest.mark.asyncio
+async def test_automation_with_no_traces_skipped(storage: Storage):
+    states = [_auto_state("automation.morning_lights")]
+    # _make_client raises exception for missing keys → graceful skip
+    check = AutomationFailuresCheck(_make_client(states, {}), storage, _make_settings())
+    assert await check.run() == []
+
+
+@pytest.mark.asyncio
+async def test_fewer_traces_than_threshold_skipped(storage: Storage):
+    states = [_auto_state("automation.a")]
+    traces = {"automation.a": [_fail(), _fail()]}  # 2 failures, threshold=3
+    check = AutomationFailuresCheck(_make_client(states, traces), storage, _make_settings())
+    assert await check.run() == []
+
+
+@pytest.mark.asyncio
+async def test_all_recent_failed_emits_event(storage: Storage):
+    states = [_auto_state("automation.a", friendly_name="Morning Lights")]
+    traces = {"automation.a": [_fail("step failed"), _fail("timeout"), _fail("no device")]}
+    check = AutomationFailuresCheck(_make_client(states, traces), storage, _make_settings())
+    results = await check.run()
+    assert len(results) == 1
+    r = results[0]
+    assert r.event_type == HAEventType.ha_automation_failing
+    assert r.severity == Severity.warning
+    assert r.payload["entity_id"] == "automation.a"
+    assert r.payload["friendly_name"] == "Morning Lights"
+    assert r.payload["total_recent_failures"] == 3
+    assert len(r.payload["last_failures"]) == 3
+
+
+@pytest.mark.asyncio
+async def test_partial_failures_no_event(storage: Storage):
+    states = [_auto_state("automation.a")]
+    # 2 failures, 1 success in recent 3 → not all failed
+    traces = {"automation.a": [_fail(), _ok(), _fail()]}
+    check = AutomationFailuresCheck(_make_client(states, traces), storage, _make_settings())
+    assert await check.run() == []
+
+
+@pytest.mark.asyncio
+async def test_cooldown_prevents_duplicate_event(storage: Storage):
+    states = [_auto_state("automation.a")]
+    traces = {"automation.a": [_fail(), _fail(), _fail()]}
+    settings = _make_settings(alert_cooldown_hours=6.0)
+    check = AutomationFailuresCheck(_make_client(states, traces), storage, settings)
+    r1 = await check.run()
+    r2 = await check.run()
+    assert len(r1) == 1
+    assert r2 == []
+
+
+@pytest.mark.asyncio
+async def test_multiple_failing_automations(storage: Storage):
+    states = [_auto_state("automation.a"), _auto_state("automation.b")]
+    traces = {
+        "automation.a": [_fail(), _fail(), _fail()],
+        "automation.b": [_fail(), _fail(), _fail()],
+    }
+    check = AutomationFailuresCheck(_make_client(states, traces), storage, _make_settings())
+    results = await check.run()
+    assert len(results) == 2
+    eids = {r.payload["entity_id"] for r in results}
+    assert eids == {"automation.a", "automation.b"}
+
+
+@pytest.mark.asyncio
+async def test_states_error_returns_empty(storage: Storage):
+    check = AutomationFailuresCheck(
+        _make_client(states_error=ConnectionError("down")), storage, _make_settings()
+    )
+    assert await check.run() == []
+
+
+@pytest.mark.asyncio
+async def test_custom_threshold(storage: Storage):
+    states = [_auto_state("automation.a")]
+    # threshold=2: 2 failures should trigger
+    traces = {"automation.a": [_fail(), _fail(), _ok()]}
+    settings = _make_settings(automation_failure_threshold=2)
+    check = AutomationFailuresCheck(_make_client(states, traces), storage, settings)
+    results = await check.run()
+    assert len(results) == 1
+
+
+@pytest.mark.asyncio
+async def test_failure_with_state_failed_field(storage: Storage):
+    states = [_auto_state("automation.a")]
+    traces = {"automation.a": [
+        {"run_id": "x", "state": "failed", "error": None, "timestamp": "2026-05-27T10:00:00Z"},
+        {"run_id": "y", "state": "failed", "error": None, "timestamp": "2026-05-27T09:00:00Z"},
+        {"run_id": "z", "state": "failed", "error": None, "timestamp": "2026-05-27T08:00:00Z"},
+    ]}
+    check = AutomationFailuresCheck(_make_client(states, traces), storage, _make_settings())
+    results = await check.run()
+    assert len(results) == 1
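Review note: the unit tests above fully describe the expected behaviour of `_is_trace_failure` and of the threshold rule, but the implementation lives in `ha_diag/checks/automation_failures.py`, which is not in this section. The sketch below is merely one implementation consistent with those tests. The function names are hypothetical, and it assumes traces arrive newest first, which is what `test_custom_threshold` implies.

```python
from typing import Any


def is_trace_failure(trace: dict[str, Any]) -> bool:
    """A trace failed if it has a non-empty error or its state is "failed"."""
    return bool(trace.get("error")) or trace.get("state") == "failed"


def all_recent_failed(traces: list[dict[str, Any]], threshold: int) -> bool:
    """True when the newest `threshold` traces exist and every one of them failed.

    Assumes `traces` is ordered newest first.
    """
    if len(traces) < threshold:
        return False  # not enough history to judge
    return all(is_trace_failure(t) for t in traces[:threshold])
```

Under these assumptions, two failures followed by a success trigger at `threshold=2` but not at `threshold=3`.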
--- a/services/ha-diag-agent/tests/test_event_emitter.py
+++ b/services/ha-diag-agent/tests/test_event_emitter.py
@ -0,0 +1,88 @@
+"""Tests for EventEmitter."""
+from __future__ import annotations
+
+import json
+from pathlib import Path
+
+import pytest
+
+from ha_diag.event_emitter import EventEmitter
+
+
+def test_emit_creates_json_file(tmp_events_dir: Path, emitter: EventEmitter):
+    event_id = emitter.emit(
+        event_type="ha_websocket_dead",
+        severity="error",
+        service="homeassistant",
+        message="HA unreachable",
+        payload={"error": "timeout"},
+    )
+    files = list(tmp_events_dir.glob("*.json"))
+    assert len(files) == 1
+    assert files[0].name == f"{event_id}.json"
+
+
+def test_emit_event_schema(tmp_events_dir: Path, emitter: EventEmitter):
+    event_id = emitter.emit(
+        event_type="ha_websocket_dead",
+        severity="error",
+        service="homeassistant",
+        message="HA unreachable",
+        payload={"error": "timeout"},
+    )
+    data = json.loads((tmp_events_dir / f"{event_id}.json").read_text())
+    assert data["id"] == event_id
+    assert data["type"] == "ha_websocket_dead"
+    assert data["severity"] == "error"
+    assert data["node"] == "test-node"
+    assert data["service"] == "homeassistant"
+    assert data["message"] == "HA unreachable"
+    assert data["payload"] == {"error": "timeout"}
+    assert "timestamp" in data
+    assert "date" in data
+
+
+def test_emit_multiple_events_unique_files(tmp_events_dir: Path, emitter: EventEmitter):
+    ids = [
+        emitter.emit("ha_websocket_dead", "error", "homeassistant", f"msg {i}")
+        for i in range(3)
+    ]
+    assert len(set(ids)) == 3
+    assert len(list(tmp_events_dir.glob("*.json"))) == 3
+
+
+def test_emit_no_tmp_file_left(tmp_events_dir: Path, emitter: EventEmitter):
+    emitter.emit("ha_websocket_dead", "error", "homeassistant", "msg")
+    assert not list(tmp_events_dir.glob("*.tmp"))
+
+
+def test_emitter_creates_events_dir(tmp_path: Path):
+    new_dir = tmp_path / "nested" / "events"
+    EventEmitter(new_dir, "my-node")  # constructing the emitter creates the directory
+    assert new_dir.exists()
+
+
+def test_location_tag_included_in_payload(tmp_events_dir: Path):
+    emitter = EventEmitter(tmp_events_dir, node_name="piha", location_tag="ken")
+    event_id = emitter.emit("ha_websocket_dead", "error", "homeassistant", "msg")
+    data = json.loads((tmp_events_dir / f"{event_id}.json").read_text())
+    assert data["payload"]["location_tag"] == "ken"
+
+
+def test_location_tag_empty_not_in_payload(tmp_events_dir: Path):
+    emitter = EventEmitter(tmp_events_dir, node_name="piha", location_tag="")
+    event_id = emitter.emit("ha_websocket_dead", "error", "homeassistant", "msg")
+    data = json.loads((tmp_events_dir / f"{event_id}.json").read_text())
+    assert "location_tag" not in data["payload"]
+
+
+def test_location_tag_does_not_override_explicit_payload_key(tmp_events_dir: Path):
+    emitter = EventEmitter(tmp_events_dir, node_name="piha", location_tag="ken")
+    event_id = emitter.emit(
+        "ha_websocket_dead", "error", "homeassistant", "msg",
+        payload={"location_tag": "override", "other": "value"},
+    )
+    data = json.loads((tmp_events_dir / f"{event_id}.json").read_text())
+    # Explicit payload key wins over the emitter's location_tag
+    assert data["payload"]["location_tag"] == "override"
+    assert data["payload"]["other"] == "value"
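Review note: three behaviours are tested above: the file is named after the event id, no `.tmp` file is left behind, and an explicit `location_tag` in the payload wins over the emitter's own. The real `EventEmitter` is not shown here, so this is only a sketch of one way to get those properties using a write-then-rename. The function `write_event` is hypothetical.

```python
import json
import os
import uuid
from pathlib import Path


def write_event(events_dir: Path, event: dict, location_tag: str = "") -> str:
    """Write `event` as <id>.json via a temporary file and an atomic rename."""
    events_dir.mkdir(parents=True, exist_ok=True)
    event_id = uuid.uuid4().hex

    payload = dict(event.get("payload") or {})
    if location_tag:
        # setdefault keeps an explicit payload key instead of overriding it.
        payload.setdefault("location_tag", location_tag)
    record = {**event, "id": event_id, "payload": payload}

    tmp_path = events_dir / f"{event_id}.tmp"
    tmp_path.write_text(json.dumps(record))
    # os.replace is atomic on the same filesystem, so readers never see a
    # half-written .json file and no .tmp file survives a successful write.
    os.replace(tmp_path, events_dir / f"{event_id}.json")
    return event_id
```

A consumer that globs `*.json` will therefore only ever pick up complete events.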
--- a/services/ha-diag-agent/tests/test_ha_client.py
+++ b/services/ha-diag-agent/tests/test_ha_client.py
@ -0,0 +1,135 @@
+"""Tests for HAClient using aioresponses to mock aiohttp."""
+from __future__ import annotations
+
+import pytest
+from aioresponses import aioresponses
+
+from ha_diag.ha_client import HAClient, make_session
+
+HA_URL = "http://homeassistant.test:8123"
+TOKEN = "test-token"
+
+
+@pytest.mark.asyncio
+async def test_get_api_status_ok():
+    with aioresponses() as m:
+        m.get(f"{HA_URL}/api/", payload={"message": "API running."})
+        async with make_session(TOKEN) as session:
+            client = HAClient(HA_URL, session)
+            result = await client.get_api_status()
+    assert result == {"message": "API running."}
+
+
+@pytest.mark.asyncio
+async def test_get_api_status_unauthorized():
+    with aioresponses() as m:
+        m.get(f"{HA_URL}/api/", status=401)
+        async with make_session(TOKEN) as session:
+            client = HAClient(HA_URL, session)
+            with pytest.raises(Exception):
+                await client.get_api_status()
+
+
+@pytest.mark.asyncio
+async def test_get_states_returns_list():
+    payload = [{"entity_id": "light.living_room", "state": "on"}]
+    with aioresponses() as m:
+        m.get(f"{HA_URL}/api/states", payload=payload)
+        async with make_session(TOKEN) as session:
+            client = HAClient(HA_URL, session)
+            states = await client.get_states()
+    assert isinstance(states, list)
+    assert states[0]["entity_id"] == "light.living_room"
+
+
+@pytest.mark.asyncio
+async def test_get_config_returns_dict():
+    payload = {"version": "2024.1.0", "location_name": "Home"}
+    with aioresponses() as m:
+        m.get(f"{HA_URL}/api/config", payload=payload)
+        async with make_session(TOKEN) as session:
+            client = HAClient(HA_URL, session)
+            config = await client.get_config()
+    assert config["version"] == "2024.1.0"
+
+
+@pytest.mark.asyncio
+async def test_get_entity_registry_returns_list():
+    payload = [
+        {"entity_id": "light.hall", "platform": "zha", "area_id": "hallway"},
+        {"entity_id": "sensor.temp", "platform": "mqtt", "area_id": None},
+    ]
+    with aioresponses() as m:
+        m.get(f"{HA_URL}/api/config/entity_registry", payload=payload)
+        async with make_session(TOKEN) as session:
+            client = HAClient(HA_URL, session)
+            registry = await client.get_entity_registry()
+    assert len(registry) == 2
+    assert registry[0]["platform"] == "zha"
+
+
+@pytest.mark.asyncio
+async def test_make_session_sets_auth_header():
+    """make_session injects the Bearer token in all requests."""
+    with aioresponses() as m:
+        m.get(f"{HA_URL}/api/", payload={"message": "API running."})
+        async with make_session("my-secret-token") as session:
+            client = HAClient(HA_URL, session)
+            await client.get_api_status()
+        # make_session stores the token as a default header, so every request carries it
+        assert session.headers.get("Authorization") == "Bearer my-secret-token"
+
+
+# ---------------------------------------------------------------------------
+# Entity registry TTL cache (Phase 3 Flag #3)
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.asyncio
+async def test_entity_registry_cached_on_second_call():
+    """Second call within TTL returns cache, making only one HTTP request."""
+    payload = [{"entity_id": "light.hall", "platform": "zha", "area_id": "hallway"}]
+    with aioresponses() as m:
+        m.get(f"{HA_URL}/api/config/entity_registry", payload=payload)
+        async with make_session(TOKEN) as session:
+            client = HAClient(HA_URL, session, entity_registry_cache_ttl=300.0)
+            r1 = await client.get_entity_registry()
+            r2 = await client.get_entity_registry()  # from cache — no second HTTP call
+    assert r1 == r2
+    # Only one response is mocked: aioresponses raises aiohttp's
+    # ClientConnectionError on an unmatched request, so a second HTTP call
+    # would fail the test. Reaching here means the cache was used.
+
+
+@pytest.mark.asyncio
+async def test_entity_registry_cache_bypassed_after_ttl():
+    """After TTL expiry, next call fetches fresh data."""
+    payload = [{"entity_id": "light.hall", "platform": "zha", "area_id": "hallway"}]
+    fresh = [{"entity_id": "light.porch", "platform": "zha", "area_id": "porch"}]
+    with aioresponses() as m:
+        # Mocks are consumed in registration order, so only a second network
+        # fetch can return `fresh`.
+        m.get(f"{HA_URL}/api/config/entity_registry", payload=payload)
+        m.get(f"{HA_URL}/api/config/entity_registry", payload=fresh)
+        async with make_session(TOKEN) as session:
+            client = HAClient(HA_URL, session, entity_registry_cache_ttl=0.0)
+            first = await client.get_entity_registry()   # fetches
+            second = await client.get_entity_registry()  # TTL=0 → fetches again
+    assert first[0]["entity_id"] == "light.hall"
+    assert second[0]["entity_id"] == "light.porch"
+
+
+@pytest.mark.asyncio
+async def test_invalidate_registry_cache_forces_refetch():
+    """invalidate_registry_cache() makes the next call hit the network."""
+    payload = [{"entity_id": "light.hall", "platform": "zha", "area_id": ""}]
+    fresh = [{"entity_id": "light.porch", "platform": "zha", "area_id": ""}]
+    with aioresponses() as m:
+        m.get(f"{HA_URL}/api/config/entity_registry", payload=payload)
+        m.get(f"{HA_URL}/api/config/entity_registry", payload=fresh)
+        async with make_session(TOKEN) as session:
+            client = HAClient(HA_URL, session, entity_registry_cache_ttl=300.0)
+            first = await client.get_entity_registry()
+            client.invalidate_registry_cache()
+            second = await client.get_entity_registry()  # must hit network again
+    assert first[0]["entity_id"] == "light.hall"
+    # Only a real refetch can return the second mocked response.
+    assert second[0]["entity_id"] == "light.porch"
+
+
+@pytest.mark.asyncio
+async def test_entity_registry_cache_default_ttl_is_300():
+    async with make_session(TOKEN) as session:
+        client = HAClient(HA_URL, session)
+    assert client._registry_cache_ttl == 300.0
--- a/services/ha-diag-agent/tests/test_heartbeat_check.py
+++ b/services/ha-diag-agent/tests/test_heartbeat_check.py
@@ -0,0 +1,62 @@
+"""Tests for HeartbeatCheck."""
+from __future__ import annotations
+
+from unittest.mock import AsyncMock, MagicMock
+
+import pytest
+
+from ha_diag.checks.heartbeat import HeartbeatCheck
+from ha_diag.models import HAEventType, Severity
+
+
+def _make_client(api_status=None, side_effect=None):
+    client = MagicMock()
+    if side_effect:
+        client.get_api_status = AsyncMock(side_effect=side_effect)
+    else:
+        client.get_api_status = AsyncMock(return_value=api_status)
+    return client
+
+
+@pytest.mark.asyncio
+async def test_heartbeat_ok_returns_empty_list():
+    client = _make_client(api_status={"message": "API running."})
+    check = HeartbeatCheck(client)
+    results = await check.run()
+    assert results == []
+
+
+@pytest.mark.asyncio
+async def test_heartbeat_connection_error():
+    client = _make_client(side_effect=ConnectionError("refused"))
+    check = HeartbeatCheck(client)
+    results = await check.run()
+    assert len(results) == 1
+    assert results[0].healthy is False
+    assert results[0].event_type == HAEventType.ha_websocket_dead
+    assert results[0].severity == Severity.error
+    assert "refused" in results[0].message
+
+
+@pytest.mark.asyncio
+async def test_heartbeat_unexpected_response():
+    client = _make_client(api_status={"unexpected": "key"})
+    check = HeartbeatCheck(client)
+    results = await check.run()
+    assert len(results) == 1
+    assert results[0].event_type == HAEventType.ha_websocket_dead
+
+
+@pytest.mark.asyncio
+async def test_heartbeat_timeout():
+    client = _make_client(side_effect=TimeoutError("timed out"))
+    check = HeartbeatCheck(client)
+    results = await check.run()
+    assert len(results) == 1
+    assert results[0].event_type == HAEventType.ha_websocket_dead
+    assert "timed out" in results[0].message
+
+
+def test_heartbeat_check_name():
+    check = HeartbeatCheck(MagicMock())
+    assert check.name == "heartbeat"
--- a/services/ha-diag-agent/tests/test_system_health.py
+++ b/services/ha-diag-agent/tests/test_system_health.py
@@ -0,0 +1,221 @@
+"""Unit tests for SystemHealthCheck."""
+from __future__ import annotations
+
+from pathlib import Path
+from unittest.mock import AsyncMock, MagicMock
+
+import pytest
+
+from ha_diag.checks.system_health import SystemHealthCheck, _extract_component_statuses
+from ha_diag.config import Settings
+from ha_diag.models import HAEventType, Severity
+from ha_diag.storage import Storage
+
+
+# ---------------------------------------------------------------------------
+# Helpers
+# ---------------------------------------------------------------------------
+
+
+def _make_settings(**overrides) -> Settings:
+    defaults: dict = {
+        "ha_url": "http://test.local:8123",
+        "ha_token": "test",
+        "node_name": "test-node",
+        "location_tag": "test-loc",
+        "alert_cooldown_hours": 0.0,
+        "check_interval": 60,
+        "check_interval_unavailable": 3600,
+    }
+    defaults.update(overrides)
+    return Settings(**defaults)
+
+
+def _make_client(health=None, error=None):
+    client = MagicMock()
+    if error:
+        client.get_system_health = AsyncMock(side_effect=error)
+    else:
+        client.get_system_health = AsyncMock(return_value=health or {})
+    return client
+
+
+def _ok_response(*components: str) -> dict:
+    return {c: {"type": "result", "data": {"ok": True}} for c in components}
+
+
+def _error_response(*components: str) -> dict:
+    return {c: {"type": "error", "error": f"{c} failed"} for c in components}
+
+
+# ---------------------------------------------------------------------------
+# _extract_component_statuses unit tests
+# ---------------------------------------------------------------------------
+
+
+def test_extract_typed_result_format():
+    data = {"recorder": {"type": "result", "data": {"backlog": 0}}}
+    result = _extract_component_statuses(data)
+    assert result["recorder"]["status"] == "ok"
+    assert result["recorder"]["details"] == {"backlog": 0}
+
+
+def test_extract_typed_error_format():
+    data = {"cloud": {"type": "error", "error": "Connection refused"}}
+    result = _extract_component_statuses(data)
+    assert result["cloud"]["status"] == "error"
+    assert "Connection refused" in result["cloud"]["details"]["error"]
+
+
+def test_extract_legacy_error_field():
+    data = {"cloud": {"error": "Timeout"}}
+    result = _extract_component_statuses(data)
+    assert result["cloud"]["status"] == "error"
+
+
+def test_extract_nested_checks_format():
+    data = {
+        "info": {"version": "2024.12.0"},
+        "checks": {
+            "homeassistant": {"type": "result", "data": {}},
+            "recorder": {"type": "error", "error": "DB locked"},
+        },
+    }
+    result = _extract_component_statuses(data)
+    assert "homeassistant" not in result or result.get("homeassistant", {}).get("status") == "ok"
+    assert result["recorder"]["status"] == "error"
+    assert "info" not in result
+
+
+def test_extract_plain_dict_treated_as_ok():
+    data = {"homeassistant": {"version": "2024.12.0", "docker": True}}
+    result = _extract_component_statuses(data)
+    assert result["homeassistant"]["status"] == "ok"
+
+
+def test_extract_non_dict_value_skipped():
+    data = {"scalar_component": "just-a-string"}
+    result = _extract_component_statuses(data)
+    assert "scalar_component" not in result
+
+
+# ---------------------------------------------------------------------------
+# SystemHealthCheck run() tests
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.asyncio
+async def test_first_run_no_snapshot_no_event_for_ok(storage: Storage):
+    """All components ok on first run — record snapshots, emit nothing."""
+    check = SystemHealthCheck(_make_client(_ok_response("homeassistant", "recorder")),
+                              storage, _make_settings())
+    results = await check.run()
+    assert results == []
+    snap = await storage.get_system_health_snapshot("homeassistant")
+    assert snap is not None
+    assert snap["last_status"] == "ok"
+
+
+@pytest.mark.asyncio
+async def test_first_run_error_component_emits_event(storage: Storage):
+    """Component in error on first run (no prior snapshot) → ha_system_health_degraded."""
+    check = SystemHealthCheck(_make_client(_error_response("cloud")), storage, _make_settings())
+    results = await check.run()
+    assert len(results) == 1
+    r = results[0]
+    assert r.event_type == HAEventType.ha_system_health_degraded
+    assert r.payload["component"] == "cloud"
+    assert r.payload["previous_status"] == "unknown"
+    assert r.payload["current_status"] == "error"
+    assert r.severity == Severity.warning
+
+
+@pytest.mark.asyncio
+async def test_ok_to_error_transition_emits_event(storage: Storage):
+    """Component transitions ok → error: an event is fired."""
+    client_ok = _make_client(_ok_response("cloud"))
+    client_err = _make_client(_error_response("cloud"))
+    settings = _make_settings()
+
+    await SystemHealthCheck(client_ok, storage, settings).run()
+    results = await SystemHealthCheck(client_err, storage, settings).run()
+
+    assert len(results) == 1
+    assert results[0].payload["previous_status"] == "ok"
+    assert results[0].payload["current_status"] == "error"
+
+
+@pytest.mark.asyncio
+async def test_sustained_error_no_duplicate_event(storage: Storage):
+    """Component stays in error across multiple runs: only the ok → error transition emits."""
+    client_ok = _make_client(_ok_response("cloud"))
+    client_err = _make_client(_error_response("cloud"))
+    settings = _make_settings()
+
+    await SystemHealthCheck(client_ok, storage, settings).run()
+    results1 = await SystemHealthCheck(client_err, storage, settings).run()
+    results2 = await SystemHealthCheck(client_err, storage, settings).run()
+    results3 = await SystemHealthCheck(client_err, storage, settings).run()
+
+    assert len(results1) == 1  # transition fires
+    assert results2 == []
+    assert results3 == []
+
+
+@pytest.mark.asyncio
+async def test_recovery_clears_alert_and_next_degradation_re_fires(storage: Storage):
+    """ok → error → ok → error: the second degradation fires a new event."""
+    settings = _make_settings()
+
+    # First degradation
+    await SystemHealthCheck(_make_client(_ok_response("cloud")), storage, settings).run()
+    r1 = await SystemHealthCheck(_make_client(_error_response("cloud")), storage, settings).run()
+    assert len(r1) == 1
+
+    # Recovery
+    r2 = await SystemHealthCheck(_make_client(_ok_response("cloud")), storage, settings).run()
+    assert r2 == []
+
+    # Second degradation
+    r3 = await SystemHealthCheck(_make_client(_error_response("cloud")), storage, settings).run()
+    assert len(r3) == 1
+    assert r3[0].payload["previous_status"] == "ok"
+
+
+@pytest.mark.asyncio
+async def test_multiple_degraded_components_multiple_events(storage: Storage):
+    health = {**_error_response("cloud", "recorder"), **_ok_response("homeassistant")}
+    check = SystemHealthCheck(_make_client(health), storage, _make_settings())
+    results = await check.run()
+    components = {r.payload["component"] for r in results}
+    assert components == {"cloud", "recorder"}
+    assert all(r.event_type == HAEventType.ha_system_health_degraded for r in results)
+
+
+@pytest.mark.asyncio
+async def test_api_error_returns_empty(storage: Storage):
+    """If /api/system_health is unreachable, return no results (not an error event)."""
+    check = SystemHealthCheck(
+        _make_client(error=Exception("timeout")), storage, _make_settings()
+    )
+    results = await check.run()
+    assert results == []
+
+
+@pytest.mark.asyncio
+async def test_payload_contains_details(storage: Storage):
+    health = {"recorder": {"type": "error", "error": "DB write lag 5000ms"}}
+    check = SystemHealthCheck(_make_client(health), storage, _make_settings())
+    results = await check.run()
+    assert len(results) == 1
+    assert "DB write lag" in results[0].payload["details"]["error"]
+
+
+@pytest.mark.asyncio
+async def test_snapshot_updated_after_recovery(storage: Storage):
+    """After a recovery cycle, snapshot shows last_status='ok'."""
+    settings = _make_settings()
+    await SystemHealthCheck(_make_client(_error_response("cloud")), storage, settings).run()
+    await SystemHealthCheck(_make_client(_ok_response("cloud")), storage, settings).run()
+    snap = await storage.get_system_health_snapshot("cloud")
+    assert snap["last_status"] == "ok"
--- a/services/ha-diag-agent/tests/test_unavailable_entities.py
+++ b/services/ha-diag-agent/tests/test_unavailable_entities.py
@@ -0,0 +1,493 @@
+"""Unit tests for UnavailableEntitiesCheck."""
+from __future__ import annotations
+
+import time
+from pathlib import Path
+from unittest.mock import AsyncMock, MagicMock
+
+import pytest
+
+from ha_diag.checks.unavailable_entities import UnavailableEntitiesCheck
+from ha_diag.config import Settings
+from ha_diag.models import HAEventType
+from ha_diag.storage import Storage
+
+
+# ---------------------------------------------------------------------------
+# Helpers
+# ---------------------------------------------------------------------------
+
+
+def _make_settings(**overrides) -> Settings:
+    """Settings with safe test defaults (alert immediately, no cooldown)."""
+    defaults: dict = {
+        "ha_url": "http://test.local:8123",
+        "ha_token": "test",
+        "node_name": "test-node",
+        "location_tag": "test-loc",
+        "unavailable_threshold_hours": 0.0,  # alert immediately
+        "integration_failure_threshold_pct": 0.5,
+        "integration_failure_min_entities": 3,
+        "alert_cooldown_hours": 0.0,  # no dedup window in most tests
+        "check_interval": 60,
+        "check_interval_unavailable": 3600,
+    }
+    defaults.update(overrides)
+    return Settings(**defaults)
+
+
+def _make_state(entity_id: str, state: str = "on") -> dict:
+    return {"entity_id": entity_id, "state": state, "attributes": {}}
+
+
+def _make_registry_entry(entity_id: str, platform: str, area_id: str = "") -> dict:
+    return {"entity_id": entity_id, "platform": platform, "area_id": area_id}
+
+
+def _make_client(states=None, registry=None, states_error=None):
+    client = MagicMock()
+    if states_error:
+        client.get_states = AsyncMock(side_effect=states_error)
+    else:
+        client.get_states = AsyncMock(return_value=states or [])
+    client.get_entity_registry = AsyncMock(return_value=registry or [])
+    return client
+
+
+# ---------------------------------------------------------------------------
+# Basic unavailability detection
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.asyncio
+async def test_no_unavailable_entities_returns_empty(storage: Storage):
+    states = [_make_state("light.a", "on"), _make_state("sensor.b", "off")]
+    check = UnavailableEntitiesCheck(_make_client(states), storage, _make_settings())
+    assert await check.run() == []
+
+
+@pytest.mark.asyncio
+async def test_first_cycle_records_baseline_no_event(storage: Storage):
+    """First observation of unavailable entity: record, don't alert yet."""
+    states = [_make_state("light.kitchen", "unavailable")]
+    settings = _make_settings(unavailable_threshold_hours=1.0)  # needs 1h before alert
+    check = UnavailableEntitiesCheck(_make_client(states), storage, settings)
+    results = await check.run()
+    assert results == []
+    # Baseline should be recorded
+    first_at = await storage.get_entity_first_unavailable_at("light.kitchen")
+    assert first_at is not None
+
+
+@pytest.mark.asyncio
+async def test_unavailable_below_threshold_no_event(storage: Storage):
+    states = [_make_state("light.kitchen", "unavailable")]
+    settings = _make_settings(unavailable_threshold_hours=24.0)
+    check = UnavailableEntitiesCheck(_make_client(states), storage, settings)
+
+    # Seed the baseline as if entity just became unavailable
+    await storage.set_entity_unavailable_since("light.kitchen", "unavailable", time.time())
+    results = await check.run()
+    assert results == []
+
+
+@pytest.mark.asyncio
+async def test_unavailable_above_threshold_emits_event(storage: Storage):
+    states = [_make_state("light.kitchen", "unavailable")]
+    check = UnavailableEntitiesCheck(
+        _make_client(states), storage, _make_settings()
+    )
+    # Seed baseline as if 25h ago
+    await storage.set_entity_unavailable_since(
+        "light.kitchen", "unavailable", time.time() - 25 * 3600
+    )
+    results = await check.run()
+    assert len(results) == 1
+    assert results[0].event_type == HAEventType.ha_entity_unavailable_long
+    assert results[0].payload["entity_id"] == "light.kitchen"
+    assert results[0].payload["duration_hours"] == pytest.approx(25.0, abs=0.1)
+    assert results[0].payload["domain"] == "light"
+
+
+@pytest.mark.asyncio
+async def test_unknown_state_treated_as_unavailable(storage: Storage):
+    states = [_make_state("sensor.temp", "unknown")]
+    await storage.set_entity_unavailable_since(
+        "sensor.temp", "unknown", time.time() - 25 * 3600
+    )
+    check = UnavailableEntitiesCheck(
+        _make_client(states), storage, _make_settings()
+    )
+    results = await check.run()
+    assert len(results) == 1
+    assert results[0].payload["state"] == "unknown"
+
+
+@pytest.mark.asyncio
+async def test_payload_contains_since_timestamp(storage: Storage):
+    first_at = time.time() - 27 * 3600
+    await storage.set_entity_unavailable_since("light.k", "unavailable", first_at)
+    states = [_make_state("light.k", "unavailable")]
+    check = UnavailableEntitiesCheck(
+        _make_client(states), storage, _make_settings()
+    )
+    results = await check.run()
+    assert len(results) == 1
+    assert "since" in results[0].payload
+    assert results[0].payload["since"].endswith("Z")  # ISO UTC timestamp
+
+
+# ---------------------------------------------------------------------------
+# Recovery
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.asyncio
+async def test_recovery_clears_baseline(storage: Storage):
+    await storage.set_entity_unavailable_since("light.k", "unavailable", time.time())
+    # Entity is now back online
+    states = [_make_state("light.k", "on")]
+    check = UnavailableEntitiesCheck(
+        _make_client(states), storage, _make_settings()
+    )
+    await check.run()
+    assert await storage.get_entity_first_unavailable_at("light.k") is None
+
+
+@pytest.mark.asyncio
+async def test_recovery_clears_alert_dedup(storage: Storage):
+    await storage.set_entity_unavailable_since(
+        "light.k", "unavailable", time.time() - 25 * 3600
+    )
+    await storage.mark_alert_sent("entity_unavailable:light.k")
+    # Entity recovers
+    states = [_make_state("light.k", "on")]
+    check = UnavailableEntitiesCheck(
+        _make_client(states), storage, _make_settings()
+    )
+    await check.run()
+    # Alert dedup should be gone
+    assert not await storage.was_alert_sent("entity_unavailable:light.k", 9999)
+
+
+# ---------------------------------------------------------------------------
+# Alert cooldown / deduplication
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.asyncio
+async def test_cooldown_prevents_duplicate_event(storage: Storage):
+    await storage.set_entity_unavailable_since(
+        "light.k", "unavailable", time.time() - 25 * 3600
+    )
+    settings = _make_settings(alert_cooldown_hours=6.0)
+    states = [_make_state("light.k", "unavailable")]
+
+    check = UnavailableEntitiesCheck(_make_client(states), storage, settings)
+
+    results1 = await check.run()
+    assert len(results1) == 1  # first alert fires
+
+    results2 = await check.run()
+    assert results2 == []  # cooldown active
+
+
+@pytest.mark.asyncio
+async def test_no_cooldown_allows_repeat_event(storage: Storage):
+    await storage.set_entity_unavailable_since(
+        "light.k", "unavailable", time.time() - 25 * 3600
+    )
+    settings = _make_settings(alert_cooldown_hours=0.0)
+    states = [_make_state("light.k", "unavailable")]
+
+    check = UnavailableEntitiesCheck(_make_client(states), storage, settings)
+    results1 = await check.run()
+    results2 = await check.run()
+    assert len(results1) == 1
+    assert len(results2) == 1
+
+
+# ---------------------------------------------------------------------------
+# Integration root-cause grouping
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.asyncio
+async def test_integration_failure_emits_single_event(storage: Storage):
+    """5/8 entities from zha unavailable → ha_integration_failed, not 5 entity events."""
+    zha_entities = [f"light.zha_{i}" for i in range(8)]
+    states = [
+        _make_state(eid, "unavailable" if i < 5 else "on")
+        for i, eid in enumerate(zha_entities)
+    ]
+    registry = [_make_registry_entry(eid, "zha") for eid in zha_entities]
+
+    # Seed baselines for unavailable entities as 25h ago
+    for eid in zha_entities[:5]:
+        await storage.set_entity_unavailable_since(eid, "unavailable", time.time() - 25 * 3600)
+
+    settings = _make_settings(
+        integration_failure_threshold_pct=0.5,
+        integration_failure_min_entities=3,
+    )
+    check = UnavailableEntitiesCheck(
+        _make_client(states, registry), storage, settings
+    )
+    results = await check.run()
+
+    assert len(results) == 1
+    assert results[0].event_type == HAEventType.ha_integration_failed
+    assert results[0].payload["integration"] == "zha"
+    assert results[0].payload["unavailable_count"] == 5
+    assert results[0].payload["total_count"] == 8
+    assert set(results[0].payload["affected_entities"]) == set(zha_entities[:5])
+
+
+@pytest.mark.asyncio
+async def test_integration_failure_below_pct_threshold(storage: Storage):
+    """2/8 entities from zha unavailable (25%) → per-entity events, not integration event."""
+    zha_entities = [f"light.zha_{i}" for i in range(8)]
+    states = [
+        _make_state(eid, "unavailable" if i < 2 else "on")
+        for i, eid in enumerate(zha_entities)
+    ]
+    registry = [_make_registry_entry(eid, "zha") for eid in zha_entities]
+
+    for eid in zha_entities[:2]:
+        await storage.set_entity_unavailable_since(eid, "unavailable", time.time() - 25 * 3600)
+
+    settings = _make_settings(
+        integration_failure_threshold_pct=0.5,
+        integration_failure_min_entities=3,
+    )
+    check = UnavailableEntitiesCheck(
+        _make_client(states, registry), storage, settings
+    )
+    results = await check.run()
+
+    # 2/8 = 25% is below the 50% threshold (and 2 < 3 is below the count minimum), so individual events
+    assert all(r.event_type == HAEventType.ha_entity_unavailable_long for r in results)
+    assert len(results) == 2
+
+
+@pytest.mark.asyncio
+async def test_integration_failure_below_count_threshold(storage: Storage):
+    """3/6 entities unavailable (50%) but min_entities=5 → per-entity events."""
+    zha_entities = [f"light.zha_{i}" for i in range(6)]
+    states = [
+        _make_state(eid, "unavailable" if i < 3 else "on")
+        for i, eid in enumerate(zha_entities)
+    ]
+    registry = [_make_registry_entry(eid, "zha") for eid in zha_entities]
+    for eid in zha_entities[:3]:
+        await storage.set_entity_unavailable_since(eid, "unavailable", time.time() - 25 * 3600)
+
+    settings = _make_settings(
+        integration_failure_threshold_pct=0.5,
+        integration_failure_min_entities=5,  # need 5, only have 3
+    )
+    check = UnavailableEntitiesCheck(
+        _make_client(states, registry), storage, settings
+    )
+    results = await check.run()
+    assert [r.event_type for r in results] == [HAEventType.ha_entity_unavailable_long] * 3
+
+
+@pytest.mark.asyncio
+async def test_entity_without_integration_gets_individual_event(storage: Storage):
+    """Entity not in entity registry gets per-entity event regardless of integration grouping."""
+    await storage.set_entity_unavailable_since(
+        "light.mystery", "unavailable", time.time() - 25 * 3600
+    )
+    states = [_make_state("light.mystery", "unavailable")]
+    # Empty registry — no integration info
+    check = UnavailableEntitiesCheck(
+        _make_client(states, []), storage, _make_settings()
+    )
+    results = await check.run()
+    assert len(results) == 1
+    assert results[0].event_type == HAEventType.ha_entity_unavailable_long
+    assert "integration" not in results[0].payload
+
+
+@pytest.mark.asyncio
+async def test_mixed_integrations_correctly_partitioned(storage: Storage):
+    """5 zha entities unavailable (triggers integration event) + 1 mqtt entity (individual)."""
+    zha_entities = [f"light.zha_{i}" for i in range(8)]
+    mqtt_entity = "sensor.mqtt_temp"
+    all_entities = zha_entities + [mqtt_entity]
+    states = (
+        [_make_state(eid, "unavailable" if i < 5 else "on") for i, eid in enumerate(zha_entities)]
+        + [_make_state(mqtt_entity, "unavailable")]
+    )
+    registry = (
+        [_make_registry_entry(eid, "zha") for eid in zha_entities]
+        + [_make_registry_entry(mqtt_entity, "mqtt")]
+    )
+    for eid in zha_entities[:5]:
+        await storage.set_entity_unavailable_since(eid, "unavailable", time.time() - 25 * 3600)
+    await storage.set_entity_unavailable_since(mqtt_entity, "unavailable", time.time() - 25 * 3600)
+
+    settings = _make_settings(
+        integration_failure_threshold_pct=0.5,
+        integration_failure_min_entities=3,
+    )
+    check = UnavailableEntitiesCheck(
+        _make_client(states, registry), storage, settings
+    )
+    results = await check.run()
+
+    event_types = {r.event_type for r in results}
+    assert HAEventType.ha_integration_failed in event_types
+    assert HAEventType.ha_entity_unavailable_long in event_types
+    # Exactly 2 events: 1 integration + 1 individual mqtt entity
+    assert len(results) == 2
+
+
+# ---------------------------------------------------------------------------
+# Error handling
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.asyncio
+async def test_ha_client_error_returns_dead_event(storage: Storage):
+    client = _make_client(states_error=ConnectionError("HA down"))
+    check = UnavailableEntitiesCheck(client, storage, _make_settings())
+    results = await check.run()
+    assert len(results) == 1
+    assert results[0].event_type == HAEventType.ha_websocket_dead
+
+
+@pytest.mark.asyncio
+async def test_registry_failure_falls_back_gracefully(storage: Storage):
+    """Registry endpoint failure → individual entity events without integration info."""
+    states = [_make_state("light.k", "unavailable")]
+    client = _make_client(states)
+    client.get_entity_registry = AsyncMock(side_effect=Exception("registry unavailable"))
+    await storage.set_entity_unavailable_since(
+        "light.k", "unavailable", time.time() - 25 * 3600
+    )
+    check = UnavailableEntitiesCheck(client, storage, _make_settings())
+    results = await check.run()
+    assert len(results) == 1
+    assert results[0].event_type == HAEventType.ha_entity_unavailable_long
+    assert "integration" not in results[0].payload
+
+
+# ---------------------------------------------------------------------------
+# Area / integration in payload
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.asyncio
+async def test_area_included_in_payload_when_known(storage: Storage):
+    await storage.set_entity_unavailable_since(
+        "light.hall", "unavailable", time.time() - 25 * 3600
+    )
+    states = [_make_state("light.hall", "unavailable")]
+    registry = [_make_registry_entry("light.hall", "zha", "hallway")]
+    check = UnavailableEntitiesCheck(
+        _make_client(states, registry), storage, _make_settings()
+    )
+    results = await check.run()
+    assert len(results) == 1
+    assert results[0].payload.get("area") == "hallway"
+    assert results[0].payload.get("integration") == "zha"
+
+
+@pytest.mark.asyncio
+async def test_area_omitted_when_unknown(storage: Storage):
+    await storage.set_entity_unavailable_since(
+        "light.k", "unavailable", time.time() - 25 * 3600
+    )
+    states = [_make_state("light.k", "unavailable")]
+    registry = [_make_registry_entry("light.k", "zha", "")]
+    check = UnavailableEntitiesCheck(
+        _make_client(states, registry), storage, _make_settings()
+    )
+    results = await check.run()
+    assert "area" not in results[0].payload
+
+
+# ---------------------------------------------------------------------------
+# Phase 3 Flag #1: since = min(last_changed, first_seen)
+# ---------------------------------------------------------------------------
+
+
+def _make_state_with_last_changed(
+    entity_id: str, state: str, last_changed_iso: str
+) -> dict:
+    return {
+        "entity_id": entity_id,
+        "state": state,
+        "attributes": {},
+        "last_changed": last_changed_iso,
+    }
+
+
+@pytest.mark.asyncio
+async def test_since_uses_last_changed_when_earlier_than_baseline(storage: Storage):
+    """Entity's last_changed predates our baseline → duration computed from last_changed."""
+    import datetime as dt
+
+    now = time.time()
+    # Baseline recorded 1h ago (agent just started)
+    await storage.set_entity_unavailable_since("light.k", "unavailable", now - 3600)
+
+    # HA says entity changed to unavailable 48h ago
+    lc_iso = (
+        dt.datetime.fromtimestamp(now - 48 * 3600, tz=dt.timezone.utc)
+        .isoformat()
+        .replace("+00:00", "Z")
+    )
+    states = [_make_state_with_last_changed("light.k", "unavailable", lc_iso)]
+    check = UnavailableEntitiesCheck(
+        _make_client(states), storage, _make_settings(unavailable_threshold_hours=0.0)
+    )
+    results = await check.run()
+
+    assert len(results) == 1
+    # Duration should be ~48h, not ~1h
+    assert results[0].payload["duration_hours"] == pytest.approx(48.0, abs=0.1)
+
+
+@pytest.mark.asyncio
+async def test_since_ignores_last_changed_when_later_than_baseline(storage: Storage):
+    """Baseline predates last_changed → use the baseline: the agent already saw the entity
+    unavailable before last_changed (e.g. HA reports a more recent last_changed)."""
+    import datetime as dt
+
+    now = time.time()
+    # Baseline recorded 48h ago
+    await storage.set_entity_unavailable_since("light.k", "unavailable", now - 48 * 3600)
+
+    # HA says last_changed is only 2h ago (shouldn't override the older baseline)
+    lc_iso = (
+        dt.datetime.fromtimestamp(now - 2 * 3600, tz=dt.timezone.utc)
+        .isoformat()
+        .replace("+00:00", "Z")
+    )
+    states = [_make_state_with_last_changed("light.k", "unavailable", lc_iso)]
+    check = UnavailableEntitiesCheck(
+        _make_client(states), storage, _make_settings(unavailable_threshold_hours=0.0)
+    )
+    results = await check.run()
+
+    assert len(results) == 1
+    # Duration should be ~48h (from baseline), not ~2h
+    assert results[0].payload["duration_hours"] == pytest.approx(48.0, abs=0.1)
+
+
+@pytest.mark.asyncio
+async def test_since_falls_back_gracefully_when_last_changed_missing(storage: Storage):
+    """No last_changed in state → uses baseline first_seen without error."""
+    await storage.set_entity_unavailable_since(
+        "light.k", "unavailable", time.time() - 25 * 3600
+    )
+    states = [_make_state("light.k", "unavailable")]  # no last_changed key
+    check = UnavailableEntitiesCheck(
+        _make_client(states), storage, _make_settings(unavailable_threshold_hours=0.0)
+    )
+    results = await check.run()
+    assert len(results) == 1
+    assert results[0].event_type == HAEventType.ha_entity_unavailable_long
--- a/services/ha-diag-agent/tests/test_updates_available.py
+++ b/services/ha-diag-agent/tests/test_updates_available.py
@@ -0,0 +1,256 @@
+"""Unit tests for UpdatesAvailableCheck and UpdatesDigestCheck."""
+from __future__ import annotations
+
+from pathlib import Path
+from unittest.mock import AsyncMock, MagicMock
+
+import pytest
+
+from ha_diag.checks.updates_available import (
+    UpdatesAvailableCheck,
+    UpdatesDigestCheck,
+    _build_update_payload,
+)
+from ha_diag.config import Settings
+from ha_diag.models import HAEventType
+from ha_diag.storage import Storage
+
+
+# ---------------------------------------------------------------------------
+# Helpers
+# ---------------------------------------------------------------------------
+
+
+def _make_settings(**overrides) -> Settings:
+    defaults: dict = {
+        "ha_url": "http://test.local:8123",
+        "ha_token": "test",
+        "node_name": "test-node",
+        "location_tag": "test-loc",
+        "alert_cooldown_hours": 0.0,
+        "updates_cooldown_days": 0,   # no dedup in most tests
+        "check_interval": 60,
+        "check_interval_unavailable": 3600,
+    }
+    defaults.update(overrides)
+    return Settings(**defaults)
+
+
+def _make_client(states=None, error=None):
+    client = MagicMock()
+    if error:
+        client.get_states = AsyncMock(side_effect=error)
+    else:
+        client.get_states = AsyncMock(return_value=states or [])
+    return client
+
+
+def _update_state(
+    entity_id: str = "update.homeassistant_core",
+    state: str = "on",
+    title: str = "Home Assistant Core",
+    installed: str = "2025.5.0",
+    latest: str = "2025.6.0",
+    release_summary: str | None = None,
+    release_url: str | None = None,
+) -> dict:
+    attrs: dict = {
+        "title": title,
+        "installed_version": installed,
+        "latest_version": latest,
+        "in_progress": False,
+        "auto_update": False,
+    }
+    if release_summary:
+        attrs["release_summary"] = release_summary
+    if release_url:
+        attrs["release_url"] = release_url
+    return {"entity_id": entity_id, "state": state, "attributes": attrs}
+
+
+# ---------------------------------------------------------------------------
+# _build_update_payload helper
+# ---------------------------------------------------------------------------
+
+
+def test_build_update_payload_basic():
+    attrs = {"title": "HA Core", "installed_version": "1.0", "latest_version": "2.0"}
+    p = _build_update_payload("update.ha_core", attrs)
+    assert p["entity_id"] == "update.ha_core"
+    assert p["title"] == "HA Core"
+    assert p["installed_version"] == "1.0"
+    assert p["latest_version"] == "2.0"
+
+
+def test_build_update_payload_release_summary_truncated():
+    long_notes = "x" * 3000
+    attrs = {"release_summary": long_notes}
+    p = _build_update_payload("update.ha_core", attrs)
+    assert len(p["release_summary"]) == 2000
+
+
+def test_build_update_payload_release_url_omitted_when_absent():
+    p = _build_update_payload("update.ha_core", {})
+    assert "release_url" not in p
+
+
+def test_build_update_payload_release_url_included_when_present():
+    attrs = {"release_url": "https://github.com/..."}
+    p = _build_update_payload("update.x", attrs)
+    assert p["release_url"] == "https://github.com/..."
+
+
+# ---------------------------------------------------------------------------
+# UpdatesAvailableCheck (daily individual events)
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.asyncio
+async def test_no_updates_returns_empty(storage: Storage):
+    states = [{"entity_id": "light.living_room", "state": "on", "attributes": {}}]
+    check = UpdatesAvailableCheck(_make_client(states), storage, _make_settings())
+    assert await check.run() == []
+
+
+@pytest.mark.asyncio
+async def test_update_off_state_not_emitted(storage: Storage):
+    states = [_update_state(state="off")]
+    check = UpdatesAvailableCheck(_make_client(states), storage, _make_settings())
+    assert await check.run() == []
+
+
+@pytest.mark.asyncio
+async def test_single_update_emits_event(storage: Storage):
+    states = [_update_state()]
+    check = UpdatesAvailableCheck(_make_client(states), storage, _make_settings())
+    results = await check.run()
+    assert len(results) == 1
+    assert results[0].event_type == HAEventType.ha_update_available
+    assert "2025.5.0" in results[0].message
+    assert "2025.6.0" in results[0].message
+
+
+@pytest.mark.asyncio
+async def test_multiple_updates_emit_multiple_events(storage: Storage):
+    states = [
+        _update_state("update.ha_core"),
+        _update_state("update.mosquitto", title="Mosquitto"),
+    ]
+    check = UpdatesAvailableCheck(_make_client(states), storage, _make_settings())
+    results = await check.run()
+    assert len(results) == 2
+    assert all(r.event_type == HAEventType.ha_update_available for r in results)
+
+
+@pytest.mark.asyncio
+async def test_cooldown_prevents_same_update_next_day(storage: Storage):
+    states = [_update_state()]
+    settings = _make_settings(updates_cooldown_days=7)
+    check = UpdatesAvailableCheck(_make_client(states), storage, settings)
+    r1 = await check.run()
+    r2 = await check.run()
+    assert len(r1) == 1
+    assert r2 == []
+
+
+@pytest.mark.asyncio
+async def test_no_cooldown_allows_repeat(storage: Storage):
+    states = [_update_state()]
+    check = UpdatesAvailableCheck(_make_client(states), storage, _make_settings(updates_cooldown_days=0))
+    r1 = await check.run()
+    r2 = await check.run()
+    assert len(r1) == 1
+    assert len(r2) == 1
+
+
+@pytest.mark.asyncio
+async def test_payload_contains_version_fields(storage: Storage):
+    states = [_update_state(installed="2025.5.0", latest="2025.6.0")]
+    check = UpdatesAvailableCheck(_make_client(states), storage, _make_settings())
+    results = await check.run()
+    p = results[0].payload
+    assert p["installed_version"] == "2025.5.0"
+    assert p["latest_version"] == "2025.6.0"
+    assert p["in_progress"] is False
+
+
+@pytest.mark.asyncio
+async def test_ha_error_returns_empty(storage: Storage):
+    check = UpdatesAvailableCheck(
+        _make_client(error=ConnectionError("HA down")), storage, _make_settings()
+    )
+    assert await check.run() == []
+
+
+# ---------------------------------------------------------------------------
+# UpdatesDigestCheck (Sunday digest)
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.asyncio
+async def test_digest_no_updates_returns_empty(storage: Storage):
+    check = UpdatesDigestCheck(_make_client([]), storage, _make_settings())
+    assert await check.run() == []
+
+
+@pytest.mark.asyncio
+async def test_digest_emits_single_event_for_all_updates(storage: Storage):
+    states = [
+        _update_state("update.ha_core"),
+        _update_state("update.mosquitto", title="Mosquitto"),
+        _update_state("update.esphome", title="ESPHome"),
+    ]
+    check = UpdatesDigestCheck(_make_client(states), storage, _make_settings())
+    results = await check.run()
+    assert len(results) == 1
+    p = results[0].payload
+    assert p["digest"] is True
+    assert p["count"] == 3
+    assert len(p["updates"]) == 3
+
+
+@pytest.mark.asyncio
+async def test_digest_payload_has_digest_true(storage: Storage):
+    states = [_update_state()]
+    check = UpdatesDigestCheck(_make_client(states), storage, _make_settings())
+    results = await check.run()
+    assert results[0].payload["digest"] is True
+
+
+@pytest.mark.asyncio
+async def test_digest_weekly_dedup_prevents_same_week_refiring(storage: Storage):
+    states = [_update_state()]
+    check = UpdatesDigestCheck(_make_client(states), storage, _make_settings())
+    r1 = await check.run()
+    r2 = await check.run()
+    assert len(r1) == 1
+    assert r2 == []
+
+
+@pytest.mark.asyncio
+async def test_digest_fires_independently_of_daily_dedup(storage: Storage):
+    """Daily cooldown on entity X doesn't suppress Sunday digest."""
+    states = [_update_state()]
+    settings = _make_settings(updates_cooldown_days=7)
+
+    # Daily check marks alert_key="update_available:update.homeassistant_core"
+    daily = UpdatesAvailableCheck(_make_client(states), storage, settings)
+    await daily.run()
+
+    # Digest uses different key "update_digest:{week}" — should still fire
+    digest = UpdatesDigestCheck(_make_client(states), storage, settings)
+    r = await digest.run()
+    assert len(r) == 1
+    assert r[0].payload["digest"] is True
+
+
+@pytest.mark.asyncio
+async def test_digest_name_is_updates_digest(storage: Storage):
+    check = UpdatesDigestCheck(_make_client([]), storage, _make_settings())
+    assert check.name == "updates_digest"
+
+
+@pytest.mark.asyncio
+async def test_daily_check_name_is_updates_available(storage: Storage):
+    check = UpdatesAvailableCheck(_make_client([]), storage, _make_settings())
+    assert check.name == "updates_available"
--- a/services/ha-diag-agent/tests/test_websocket_monitor.py
+++ b/services/ha-diag-agent/tests/test_websocket_monitor.py
@@ -0,0 +1,558 @@
+"""Unit tests for WebSocketMonitor."""
+from __future__ import annotations
+
+import asyncio
+import time
+from pathlib import Path
+from unittest.mock import AsyncMock, MagicMock, patch
+
+import aiohttp
+import pytest
+
+from ha_diag.config import Settings
+from ha_diag.event_emitter import EventEmitter
+from ha_diag.models import HAEventType
+from ha_diag.monitors.websocket_monitor import (
+    WebSocketMonitor,
+    _AuthError,
+    _make_ws_url,
+)
+
+
+# ---------------------------------------------------------------------------
+# Helpers
+# ---------------------------------------------------------------------------
+
+
+def _make_settings(**overrides) -> Settings:
+    defaults: dict = {
+        "ha_url": "http://test.local:8123",
+        "ha_token": "test-token",
+        "node_name": "test-node",
+        "location_tag": "test-loc",
+        "websocket_enabled": True,
+        "websocket_silence_threshold_seconds": 300,
+        "websocket_watchdog_interval_seconds": 30,
+        "websocket_reconnect_initial_delay": 1.0,
+        "websocket_reconnect_max_delay": 60.0,
+        "websocket_reconnect_jitter": 0.0,
+        "websocket_down_alert_repeat_minutes": 10,
+    }
+    defaults.update(overrides)
+    return Settings(**defaults)
+
+
+class FakeWS:
+    """Fake aiohttp ClientWebSocketResponse for unit tests."""
+
+    def __init__(self, auth_messages: list, event_messages: list | None = None):
+        self._auth_queue = list(auth_messages)
+        self._event_queue = list(event_messages or [])
+        self.sent: list = []
+
+    async def receive_json(self) -> dict:
+        if not self._auth_queue:
+            raise ConnectionError("FakeWS: no more auth messages")
+        return self._auth_queue.pop(0)
+
+    async def send_json(self, data: dict) -> None:
+        self.sent.append(data)
+
+    def __aiter__(self):
+        return self
+
+    async def __anext__(self):
+        if not self._event_queue:
+            raise StopAsyncIteration
+        item = self._event_queue.pop(0)
+        if isinstance(item, BaseException):
+            raise item
+        return item
+
+
+def _text_msg(data: str = '{"type":"event"}') -> aiohttp.WSMessage:
+    return aiohttp.WSMessage(type=aiohttp.WSMsgType.TEXT, data=data, extra=None)
+
+
+def _close_msg() -> aiohttp.WSMessage:
+    return aiohttp.WSMessage(type=aiohttp.WSMsgType.CLOSE, data=b"", extra=None)
+
+
+def _mock_session(fake_ws: FakeWS) -> MagicMock:
+    cm = MagicMock()
+    cm.__aenter__ = AsyncMock(return_value=fake_ws)
+    cm.__aexit__ = AsyncMock(return_value=False)
+    session = MagicMock()
+    session.ws_connect.return_value = cm
+    return session
+
+
+def _make_monitor(
+    settings: Settings | None = None,
+    session=None,
+    emitter: EventEmitter | None = None,
+    tmp_path: Path | None = None,
+) -> WebSocketMonitor:
+    if settings is None:
+        settings = _make_settings()
+    if emitter is None:
+        p = (tmp_path or Path("/tmp/ws_test_events")).absolute()
+        p.mkdir(parents=True, exist_ok=True)
+        emitter = EventEmitter(p, node_name="test-node")
+    if session is None:
+        session = MagicMock()
+    return WebSocketMonitor(
+        ha_url=settings.ha_url,
+        token=settings.ha_token,
+        settings=settings,
+        emitter=emitter,
+        session=session,
+    )
+
+
+# ---------------------------------------------------------------------------
+# URL derivation
+# ---------------------------------------------------------------------------
+
+
+def test_make_ws_url_http():
+    assert _make_ws_url("http://ha.local:8123") == "ws://ha.local:8123/api/websocket"
+
+
+def test_make_ws_url_https():
+    assert _make_ws_url("https://ha.example.com") == "wss://ha.example.com/api/websocket"
+
+
+def test_make_ws_url_strips_trailing_slash():
+    assert _make_ws_url("http://ha.local:8123/") == "ws://ha.local:8123/api/websocket"
+
+
+# ---------------------------------------------------------------------------
+# Auth flow (via _connect_and_listen)
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.asyncio
+async def test_normal_auth_sends_correct_messages(tmp_path):
+    """Happy path: sends auth + subscribe, ends in subscribed state."""
+    fake_ws = FakeWS(
+        [{"type": "auth_required"}, {"type": "auth_ok"}],
+        [_text_msg('{"type":"result","id":1,"success":true}')],
+    )
+    monitor = _make_monitor(session=_mock_session(fake_ws), tmp_path=tmp_path)
+
+    await monitor._connect_and_listen()
+
+    assert fake_ws.sent[0] == {"type": "auth", "access_token": "test-token"}
+    assert fake_ws.sent[1]["type"] == "subscribe_events"
+    assert fake_ws.sent[1]["event_type"] == "state_changed"
+    assert monitor._state == "subscribed"
+
+
+@pytest.mark.asyncio
+async def test_last_event_monotonic_updated_on_text_message(tmp_path):
+    """Receiving TEXT messages updates last_event_monotonic."""
+    fake_ws = FakeWS(
+        [{"type": "auth_required"}, {"type": "auth_ok"}],
+        [_text_msg(), _text_msg()],
+    )
+    monitor = _make_monitor(session=_mock_session(fake_ws), tmp_path=tmp_path)
+    before = time.monotonic()
+
+    await monitor._connect_and_listen()
+
+    assert monitor._last_event_monotonic >= before
+
+
+@pytest.mark.asyncio
+async def test_auth_invalid_raises_auth_error(tmp_path):
+    """auth_invalid → _AuthError propagates."""
+    fake_ws = FakeWS([
+        {"type": "auth_required"},
+        {"type": "auth_invalid", "message": "invalid token"},
+    ])
+    monitor = _make_monitor(session=_mock_session(fake_ws), tmp_path=tmp_path)
+
+    with pytest.raises(_AuthError, match="invalid token"):
+        await monitor._connect_and_listen()
+
+
+@pytest.mark.asyncio
+async def test_unexpected_initial_message_raises(tmp_path):
+    """Anything other than auth_required on connect → ConnectionError."""
+    fake_ws = FakeWS([{"type": "unexpected"}])
+    monitor = _make_monitor(session=_mock_session(fake_ws), tmp_path=tmp_path)
+
+    with pytest.raises(ConnectionError, match="Unexpected initial"):
+        await monitor._connect_and_listen()
+
+
+@pytest.mark.asyncio
+async def test_empty_auth_queue_raises_connection_error(tmp_path):
+    """Connection closed before auth_required → ConnectionError."""
+    fake_ws = FakeWS([])
+    monitor = _make_monitor(session=_mock_session(fake_ws), tmp_path=tmp_path)
+
+    with pytest.raises(ConnectionError):
+        await monitor._connect_and_listen()
+
+
+# ---------------------------------------------------------------------------
+# Disconnect / dead alerts (_on_disconnected)
+# ---------------------------------------------------------------------------
+
+
+def test_on_disconnected_emits_ha_websocket_dead(tmp_path):
+    emitter = MagicMock()
+    monitor = _make_monitor(emitter=emitter, tmp_path=tmp_path)
+    monitor._state = "disconnected"
+
+    monitor._on_disconnected()
+
+    emitter.emit.assert_called_once()
+    assert emitter.emit.call_args[1]["event_type"] == HAEventType.ha_websocket_dead.value
+
+
+def test_on_disconnected_within_cooldown_suppresses_second_emit(tmp_path):
+    emitter = MagicMock()
+    monitor = _make_monitor(
+        settings=_make_settings(websocket_down_alert_repeat_minutes=10),
+        emitter=emitter,
+        tmp_path=tmp_path,
+    )
+    monitor._state = "disconnected"
+
+    monitor._on_disconnected()  # first emit
+    emitter.emit.reset_mock()
+    monitor._on_disconnected()  # within cooldown → suppressed
+
+    emitter.emit.assert_not_called()
+
+
+def test_on_disconnected_after_cooldown_emits_again(tmp_path):
+    emitter = MagicMock()
+    monitor = _make_monitor(
+        settings=_make_settings(websocket_down_alert_repeat_minutes=10),
+        emitter=emitter,
+        tmp_path=tmp_path,
+    )
+    monitor._state = "disconnected"
+    monitor._on_disconnected()
+    # Backdate to simulate cooldown expiry
+    monitor._last_dead_alert_at = time.monotonic() - (10 * 60 + 5)
+    emitter.emit.reset_mock()
+
+    monitor._on_disconnected()
+
+    emitter.emit.assert_called_once()
+
+
+def test_on_disconnected_noop_when_stopping(tmp_path):
+    emitter = MagicMock()
+    monitor = _make_monitor(emitter=emitter, tmp_path=tmp_path)
+    monitor._stopping = True
+
+    monitor._on_disconnected()
+
+    emitter.emit.assert_not_called()
+
+
+# ---------------------------------------------------------------------------
+# Recovery
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.asyncio
+async def test_reconnect_after_dead_emits_recovered(tmp_path):
+    """Successful reconnect after a dead alert emits ha_websocket_recovered."""
+    emitter = MagicMock()
+    fake_ws = FakeWS([{"type": "auth_required"}, {"type": "auth_ok"}], [])
+    settings = _make_settings()
+    monitor = WebSocketMonitor(
+        ha_url=settings.ha_url,
+        token=settings.ha_token,
+        settings=settings,
+        emitter=emitter,
+        session=_mock_session(fake_ws),
+    )
+    monitor._last_dead_alert_at = time.monotonic() - 30.0  # prior dead was sent
+
+    await monitor._connect_and_listen()
+
+    emitted_types = [c[1]["event_type"] for c in emitter.emit.call_args_list]
+    assert HAEventType.ha_websocket_recovered.value in emitted_types
+    assert monitor._last_dead_alert_at == 0.0  # reset after recovery
+
+
+@pytest.mark.asyncio
+async def test_no_recovered_if_no_prior_dead(tmp_path):
+    """First-ever connect with no prior dead alert → no recovered emitted."""
+    emitter = MagicMock()
+    fake_ws = FakeWS([{"type": "auth_required"}, {"type": "auth_ok"}], [])
+    settings = _make_settings()
+    monitor = WebSocketMonitor(
+        ha_url=settings.ha_url,
+        token=settings.ha_token,
+        settings=settings,
+        emitter=emitter,
+        session=_mock_session(fake_ws),
+    )
+    monitor._last_dead_alert_at = 0.0
+
+    await monitor._connect_and_listen()
+
+    emitter.emit.assert_not_called()
+
+
+# ---------------------------------------------------------------------------
+# Watchdog loop
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.asyncio
+async def test_watchdog_emits_dead_when_silent_over_threshold(tmp_path):
+    """Watchdog detects silence > threshold and emits ha_websocket_dead."""
+    emitter = MagicMock()
+    settings = _make_settings(
+        websocket_silence_threshold_seconds=60,
+        websocket_watchdog_interval_seconds=30,
+        websocket_down_alert_repeat_minutes=0,
+    )
+    monitor = _make_monitor(settings=settings, emitter=emitter, tmp_path=tmp_path)
+    monitor._state = "subscribed"
+    monitor._last_event_monotonic = time.monotonic() - 120.0  # 120s > 60s threshold
+    monitor._last_dead_alert_at = 0.0
+
+    sleep_calls = 0
+
+    async def one_iteration(t):
+        nonlocal sleep_calls
+        sleep_calls += 1
+        if sleep_calls >= 2:
+            raise asyncio.CancelledError()
+
+    with patch("asyncio.sleep", side_effect=one_iteration):
+        with pytest.raises(asyncio.CancelledError):
+            await monitor._watchdog_loop()
+
+    emitter.emit.assert_called_once()
+    assert emitter.emit.call_args[1]["event_type"] == HAEventType.ha_websocket_dead.value
+
+
+@pytest.mark.asyncio
+async def test_watchdog_no_emit_when_events_recent(tmp_path):
+    """Watchdog does not emit when last event is within silence threshold."""
+    emitter = MagicMock()
+    settings = _make_settings(
+        websocket_silence_threshold_seconds=300,
+        websocket_watchdog_interval_seconds=30,
+        websocket_down_alert_repeat_minutes=0,
+    )
+    monitor = _make_monitor(settings=settings, emitter=emitter, tmp_path=tmp_path)
+    monitor._state = "subscribed"
+    monitor._last_event_monotonic = time.monotonic() - 10.0  # recent
+
+    sleep_calls = 0
+
+    async def one_iteration(t):
+        nonlocal sleep_calls
+        sleep_calls += 1
+        if sleep_calls >= 2:
+            raise asyncio.CancelledError()
+
+    with patch("asyncio.sleep", side_effect=one_iteration):
+        with pytest.raises(asyncio.CancelledError):
+            await monitor._watchdog_loop()
+
+    emitter.emit.assert_not_called()
+
+
+@pytest.mark.asyncio
+async def test_watchdog_skips_when_not_subscribed(tmp_path):
+    """Watchdog does not emit when state is not 'subscribed'."""
+    emitter = MagicMock()
+    settings = _make_settings(
+        websocket_silence_threshold_seconds=1,
+        websocket_watchdog_interval_seconds=30,
+        websocket_down_alert_repeat_minutes=0,
+    )
+    monitor = _make_monitor(settings=settings, emitter=emitter, tmp_path=tmp_path)
+    monitor._state = "disconnected"
+    monitor._last_event_monotonic = time.monotonic() - 9999.0  # very old
+
+    sleep_calls = 0
+
+    async def one_iteration(t):
+        nonlocal sleep_calls
+        sleep_calls += 1
+        if sleep_calls >= 2:
+            raise asyncio.CancelledError()
+
+    with patch("asyncio.sleep", side_effect=one_iteration):
+        with pytest.raises(asyncio.CancelledError):
+            await monitor._watchdog_loop()
+
+    emitter.emit.assert_not_called()
+
+
+@pytest.mark.asyncio
+async def test_watchdog_repeat_alert_respects_cooldown(tmp_path):
+    """Second watchdog dead alert fires only after cooldown."""
+    emitter = MagicMock()
+    settings = _make_settings(
+        websocket_silence_threshold_seconds=60,
+        websocket_watchdog_interval_seconds=30,
+        websocket_down_alert_repeat_minutes=10,
+    )
+    monitor = _make_monitor(settings=settings, emitter=emitter, tmp_path=tmp_path)
+    monitor._state = "subscribed"
+    monitor._last_event_monotonic = time.monotonic() - 3600.0  # 1hr silent
+    # Set last alert to just now → still within 10-min cooldown
+    monitor._last_dead_alert_at = time.monotonic()
+
+    sleep_calls = 0
+
+    async def one_iteration(t):
+        nonlocal sleep_calls
+        sleep_calls += 1
+        if sleep_calls >= 2:
+            raise asyncio.CancelledError()
+
+    with patch("asyncio.sleep", side_effect=one_iteration):
+        with pytest.raises(asyncio.CancelledError):
+            await monitor._watchdog_loop()
+
+    emitter.emit.assert_not_called()  # within cooldown
+
+
+# ---------------------------------------------------------------------------
+# Reconnect backoff
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.asyncio
+async def test_reconnect_backoff_doubles_each_attempt(tmp_path):
+    """Retry delay doubles on consecutive failures."""
+    delays: list[float] = []
+    call_count = 0
+
+    async def fail_connect():
+        nonlocal call_count
+        call_count += 1
+        raise ConnectionError("refused")
+
+    async def capture_sleep(t):
+        delays.append(t)
+        if call_count >= 3:
+            raise asyncio.CancelledError()
+
+    settings = _make_settings(
+        websocket_reconnect_initial_delay=1.0,
+        websocket_reconnect_max_delay=60.0,
+        websocket_reconnect_jitter=0.0,
+    )
+    monitor = _make_monitor(settings=settings, emitter=MagicMock(), tmp_path=tmp_path)
+    monitor._connect_and_listen = fail_connect
+
+    with patch("asyncio.sleep", side_effect=capture_sleep):
+        with pytest.raises(asyncio.CancelledError):
+            await monitor._connection_loop()
+
+    assert len(delays) >= 2
+    assert delays[0] == pytest.approx(1.0)
+    assert delays[1] == pytest.approx(2.0)
+
+
+@pytest.mark.asyncio
+async def test_reconnect_delay_capped_at_max(tmp_path):
+    """Delay never exceeds websocket_reconnect_max_delay."""
+    delays: list[float] = []
+    call_count = 0
+
+    async def fail_connect():
+        nonlocal call_count
+        call_count += 1
+        raise ConnectionError("refused")
+
+    async def capture_sleep(t):
+        delays.append(t)
+        if call_count >= 8:
+            raise asyncio.CancelledError()
+
+    settings = _make_settings(
+        websocket_reconnect_initial_delay=1.0,
+        websocket_reconnect_max_delay=8.0,
+        websocket_reconnect_jitter=0.0,
+    )
+    monitor = _make_monitor(settings=settings, emitter=MagicMock(), tmp_path=tmp_path)
+    monitor._connect_and_listen = fail_connect
+
+    with patch("asyncio.sleep", side_effect=capture_sleep):
+        with pytest.raises(asyncio.CancelledError):
+            await monitor._connection_loop()
+
+    assert max(delays) <= 8.0
+
+
+# ---------------------------------------------------------------------------
+# is_healthy
+# ---------------------------------------------------------------------------
+
+
+def test_is_healthy_true_when_subscribed(tmp_path):
+    monitor = _make_monitor(settings=_make_settings(websocket_enabled=True), tmp_path=tmp_path)
+    monitor._state = "subscribed"
+    assert monitor.is_healthy is True
+
+
+def test_is_healthy_false_when_disconnected(tmp_path):
+    monitor = _make_monitor(settings=_make_settings(websocket_enabled=True), tmp_path=tmp_path)
+    monitor._state = "disconnected"
+    assert monitor.is_healthy is False
+
+
+def test_is_healthy_false_when_connecting(tmp_path):
+    monitor = _make_monitor(settings=_make_settings(websocket_enabled=True), tmp_path=tmp_path)
+    monitor._state = "connecting"
+    assert monitor.is_healthy is False
+
+
+def test_is_healthy_true_when_disabled(tmp_path):
+    """Disabled monitor reports healthy — it's off, not broken."""
+    monitor = _make_monitor(settings=_make_settings(websocket_enabled=False), tmp_path=tmp_path)
+    monitor._state = "disconnected"
+    assert monitor.is_healthy is True
+
+
+# ---------------------------------------------------------------------------
+# start / stop lifecycle
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.asyncio
+async def test_stop_cancels_background_tasks(tmp_path):
+    """stop() cancels the main and watchdog tasks."""
+
+    async def hang():
+        await asyncio.sleep(9999)
+
+    monitor = _make_monitor(tmp_path=tmp_path)
+    monitor._main_task = asyncio.create_task(hang())
+    monitor._watchdog_task = asyncio.create_task(hang())
+
+    await monitor.stop()
+
+    assert monitor._main_task is None
+    assert monitor._watchdog_task is None
+
+
+@pytest.mark.asyncio
+async def test_start_no_tasks_when_disabled(tmp_path):
+    """start() with websocket_enabled=False does not spawn tasks."""
+    monitor = _make_monitor(
+        settings=_make_settings(websocket_enabled=False),
+        tmp_path=tmp_path,
+    )
+    await monitor.start()
+    assert monitor._main_task is None
+    assert monitor._watchdog_task is None
--- a/services/zigbee2mqtt/README.md
+++ b/services/zigbee2mqtt/README.md
@@ -3,8 +3,10 @@
 Zigbee to MQTT bridge, get rid of your proprietary Zigbee bridges.

 ## Usage
-Deployed on the `piha` node.

-Requires a Zigbee adapter (e.g., Sonoff ZBDongle-E) mapped to `/dev/ttyACM0`.
+Deployed on the `chelsty-infra` node (CHELSTY site).
+
+Coordinator: **SLZB-06U** over TCP at `192.168.1.105:6638` (`ezsp` adapter).
+Do not use USB paths (`/dev/ttyUSB0`, `/dev/ttyACM0`) — the coordinator is network-attached.

 Frontend is available on port 8080.
Author	SHA1	Message	Date
Oskar Kapala	52607a7cdd	feat(control-plane): shadow_mode for HA event auto-actions + deploy docs - HA_DIAG_SHADOW_MODE env flag in supervisor (default true) - shadow_mode downgrades container_restart actions to alert_only with [SHADOW MODE] note; same action_id and 30-min cooldown apply - alert_only events unaffected (always routed normally) - 3 new tests: shadow on/off for ha_websocket_dead, alert-only unaffected - DEPLOY.md with token gen, per-host config, verification, 48h observation, production-mode enablement, rollback - README.md updated with shadow mode flag summary and DEPLOY.md link Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-29 17:12:33 +02:00
Oskar Kapala	b9ed118b8c	fix(telegram-bot): correct risk_level field + show description in alerts - read risk_level with risk fallback (was: risk only → "unknown" for all actions written by supervisor which uses risk_level key) - include description field in alert format (was: alert_only payloads' substance was invisible — description carried the full message) - extract _format_pending_action() pure helper to enable unit testing without a live Telegram connection - 8 tests: risk_level present, risk fallback, both absent, description shown/absent, truncation, full HA alert_only shape, no-description no-crash - flagged during Phase 5 review of ha-diag-agent supervisor routing Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-29 16:26:49 +02:00
Oskar Kapala	bf1415e4c1	feat(control-plane): route ha-diag-agent events through supervisor - 8 HA event types mapped to existing action types - ha_websocket_dead → container_restart (homeassistant), 30-min cooldown - 6 events → alert_only (entity_unavailable, integration_failed, automation_failing, update_available, recorder_lag, system_health_degraded), 1-hour cooldown - ha_websocket_recovered → cancels matching pending container_restart - state-aware suppression: skip HA events when homeassistant has an active containers_not_running incident < 5 min ago (avoids alert storms during HA restarts/updates) - location_tag preserved through action pipeline for per-house telegram alerts - executor: alert_only acknowledged as no-op success - 18 tests covering all 8 event types, suppression, cooldown, dedup, location_tag, recovery cancellation - CLAUDE.md: supervisor event routing table added Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-29 15:59:23 +02:00
Oskar Kapala	31b48d162a	feat(ha-diag-agent): WebSocketMonitor for real-time HA liveness - persistent WS connection to HA with auth + state_changed subscription - watchdog detects silence > 5min → emits ha_websocket_dead - immediate ha_websocket_dead on disconnect, exponential reconnect with jitter - cooldown prevents alert spam (10min repeat window while HA stays down) - ha_websocket_recovered emitted on reconnect after a dead alert (allows supervisor to clear active incidents in Phase 5) - new monitors/ subpackage for long-running tasks (vs interval checks/) - /health endpoint now includes ws_connected field - 26 unit tests, 3 integration tests (real HA + container stop/restart) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-29 15:00:18 +02:00
Oskar Kapala	3499b2f280	feat(ha-diag-agent): three REST diagnostic checks + Phase 3 flag fixes New checks: - SystemHealthCheck (15min interval): detects newly-failing HA integrations via /api/system_health snapshot diff; transition-based dedup (ok→error fires, sustained error silent, error→ok clears alert) - UpdatesAvailableCheck (daily cron 09:00): per-update ha_update_available events with 7-day dedup; release notes truncated at 2000 chars - UpdatesDigestCheck (Sunday cron 09:00): single digest event with all pending updates; weekly ISO-week dedup, independent of daily dedup key - AutomationFailuresCheck (30min interval): detects automations with N consecutive failures (default 3) via /api/trace/automation/<id>; 6h cooldown per automation Phase 3 flag fixes: - Flag #1 (since field): UnavailableEntitiesCheck now uses min(state.last_changed, baseline.first_seen) as effective "since", giving accurate duration when agent was offline at entity's first fail - Flag #3 (registry cache): HAClient.get_entity_registry() caches response in-process with configurable TTL (default 300s); avoids repeated API calls across concurrent check cycles; invalidate_registry_cache() for manual invalidation Storage: system_health_snapshot table (component, last_status, last_seen_at, payload) created automatically on next Storage.open() call Config additions (all with defaults): entity_registry_cache_ttl=300, system_health_check_interval=900, automation_check_interval=1800, automation_failure_threshold=3, updates_check_hour=9, updates_check_minute=0, updates_cooldown_days=7 Tests: 95 unit tests pass (49 new), 13 integration tests pass (9 new); 3 skipped (live-HA token not set in CI) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-29 14:43:10 +02:00
Oskar Kapala	f41ec5d0c5	docs: compress CLAUDE.md + fix zigbee2mqtt coordinator docs - CLAUDE.md: collapsed 5-section deployment block to single annotated block, removed inline emit_event signatures (kept path + type list), flattened runtime path tree to bullets, condensed node table note to reference capabilities.yaml, added CHELSTY docker-compose v1 constraint; 156 → 113 lines (~750 → ~480 tokens) - fix: zigbee2mqtt/README.md updated to TCP coordinator (SLZB-06U at 192.168.1.105:6638, ezsp); removed stale /dev/ttyACM0 USB reference and corrected owner node from piha to chelsty-infra Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-29 14:17:23 +02:00
Oskar Kapala	20f6761a67	feat(ha-diag-agent): UnavailableEntitiesCheck with root cause dedup - shared aiohttp ClientSession in HAClient (Phase 1 Flag #2 fixed): make_session() factory, session injected at startup, closed on shutdown - Check.run() → list[CheckResult]: clean multi-event interface - first real diagnostic check: entity unavailable > 24h (INSERT OR IGNORE baseline preserves first-seen timestamp) - root cause grouping: emit ha_integration_failed instead of N entity events when ≥50% of integration's entities are unavailable (≥3 min) - alert deduplication via SQLite cooldown window (default 6h) - recovery clears baseline + dedup for immediate re-alert - configurable thresholds: duration, integration %, cooldown - 38 unit tests + 7 integration tests (42 pass, 3 skip w/o live HA) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-29 13:41:55 +02:00
Oskar Kapala	07bd498fd6	feat(ha-diag-agent): test environment with dual HA Docker instances - dockerized ken + chelsty HA test instances with template fixtures - snapshot/reset/wait scripts for fixture management - integration test infrastructure with separate marker - location_tag promoted from metadata to event payload (Phase 1 flag #3) - chelsty-infra target_url points to chelsty-ha via tailnet (Phase 1 flag #1) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-29 12:56:13 +02:00
Oskar Kapala	90c8e77bf7	chore: gitignore *.egg-info, remove committed egg-info Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-29 12:26:57 +02:00
Oskar Kapala	ab8895d28b	feat(ha-diag-agent): scaffold service with HA REST client and event emitter - new per-host service, follows node-agent pattern - 7 new HA event types defined (routing in supervisor — Phase 5) - HeartbeatCheck as pipeline validator (pings /api/, emits ha_websocket_dead) - service.yaml + host configs for piha (ken) and chelsty-infra (chelsty) - test scaffolding with aiohttp/aiosqlite mocks (15/15 passing) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-29 12:26:34 +02:00
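
The transition-based dedup described for SystemHealthCheck in commit 3499b2f280 (ok→error fires, sustained error stays silent, error→ok clears the alert) can be sketched roughly as below. This is an illustrative sketch only, not the code from the commit: the class name `TransitionDedup`, the `observe` method, the in-memory dict standing in for the `system_health_snapshot` table, and the choice to treat never-seen components as `ok` are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class TransitionDedup:
    """Remember the last status per component and report only transitions."""

    # Hypothetical in-memory stand-in for the system_health_snapshot table.
    last_status: Dict[str, str] = field(default_factory=dict)

    def observe(self, component: str, status: str) -> Optional[str]:
        """Return "fire", "clear", or None for a new observation."""
        # Assumption: a component seen for the first time is treated as "ok",
        # so a first observation of "error" counts as an ok -> error transition.
        previous = self.last_status.get(component, "ok")
        self.last_status[component] = status
        if previous == "ok" and status == "error":
            return "fire"   # newly failing: emit one alert
        if previous == "error" and status == "ok":
            return "clear"  # recovered: clear the active alert
        return None         # sustained ok or sustained error: stay silent


dedup = TransitionDedup()
actions = [dedup.observe("mqtt", s) for s in ["ok", "error", "error", "ok"]]
print(actions)  # [None, 'fire', None, 'clear']
```

Keying the decision on the previous snapshot rather than on a time window means a long outage produces exactly one alert, and recovery is reported as soon as the next check observes it.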