Compare commits
No commits in common. "52607a7cddf05e36156657afb8f4674a46ca3e2f" and "bd7f955e4e8815623ade91553d7425f7283c9c39" have entirely different histories.
52607a7cdd...bd7f955e4e
1
.gitignore
vendored
|
|
@@ -15,7 +15,6 @@ __pycache__/
|
||||||
*$py.class
|
*$py.class
|
||||||
venv/
|
venv/
|
||||||
.venv/
|
.venv/
|
||||||
*.egg-info/
|
|
||||||
|
|
||||||
# Tools
|
# Tools
|
||||||
.aider*
|
.aider*
|
||||||
|
|
|
||||||
106
CLAUDE.md
|
|
@@ -17,22 +17,43 @@ GitOps-lite orchestration for a distributed homelab. The repo is the source of t
|
||||||
| **CHELSTY-INFRA** | LTE edge hypervisor (site: chelsty); Zigbee2MQTT, Mosquitto, stability-agent — offline-first |
|
| **CHELSTY-INFRA** | LTE edge hypervisor (site: chelsty); Zigbee2MQTT, Mosquitto, stability-agent — offline-first |
|
||||||
| **CHELSTY-HA** | LTE Home Assistant VM (site: chelsty); connects to CHELSTY-INFRA MQTT broker — offline-first |
|
| **CHELSTY-HA** | LTE Home Assistant VM (site: chelsty); connects to CHELSTY-INFRA MQTT broker — offline-first |
|
||||||
|
|
||||||
All nodes communicate over Tailscale. CHELSTY-INFRA and CHELSTY-HA have an intermittent LTE uplink; their services must never depend on SATURN, VPS, or Forgejo at runtime. Full node capabilities: `hosts/<node>/capabilities.yaml`.
|
All nodes communicate over Tailscale. CHELSTY-INFRA and CHELSTY-HA have an intermittent LTE uplink; their services (`zigbee2mqtt`, `mosquitto`, `homeassistant`, `stability-agent`) must never depend on SATURN, VPS, or Forgejo at runtime.
|
||||||
|
|
||||||
## Deployment
|
## Deployment
|
||||||
|
|
||||||
|
### Run a fresh deployment on the current node
|
||||||
```bash
|
```bash
|
||||||
scripts/deploy/deploy.sh # fresh deploy on current node
|
scripts/deploy/deploy.sh
|
||||||
scripts/deploy/deploy.sh --resume # resume after interruption
|
|
||||||
scripts/deploy/deploy.sh --stage verify # specific stage only
|
|
||||||
scripts/deploy/deploy.sh --service mosquitto # specific service only
|
|
||||||
./scripts/deploy/deploy-control-plane.sh --ssh # SATURN/SOLARIA → VPS
|
|
||||||
./scripts/deploy/deploy-node.sh chelsty-infra # CHELSTY nodes (individually)
|
|
||||||
./scripts/bootstrap/prepare-node.sh # general node bootstrap
|
|
||||||
./scripts/bootstrap/chelsty-runtime.sh # CHELSTY-specific bootstrap
|
|
||||||
```
|
```
|
||||||
|
|
||||||
Pipeline stages: **prepare → validate → deploy → verify → diagnose (on failure) → complete**. Stage state persisted in `/opt/homelab/state/deploy/`.
|
### Resume after interruption
|
||||||
|
```bash
|
||||||
|
scripts/deploy/deploy.sh --resume
|
||||||
|
```
|
||||||
|
|
||||||
|
### Run a specific stage only
|
||||||
|
```bash
|
||||||
|
scripts/deploy/deploy.sh --stage verify
|
||||||
|
scripts/deploy/deploy.sh --stage diagnose
|
||||||
|
```
|
||||||
|
|
||||||
|
### Deploy a specific service
|
||||||
|
```bash
|
||||||
|
scripts/deploy/deploy.sh --service mosquitto
|
||||||
|
```
|
||||||
|
|
||||||
|
### Deploy from SATURN/SOLARIA to VPS (control plane)
|
||||||
|
```bash
|
||||||
|
./scripts/deploy/deploy-control-plane.sh --ssh
|
||||||
|
```
|
||||||
|
|
||||||
|
### Bootstrap a new node
|
||||||
|
```bash
|
||||||
|
./scripts/bootstrap/chelsty-runtime.sh # CHELSTY-specific
|
||||||
|
./scripts/bootstrap/prepare-node.sh # General node prep
|
||||||
|
```
|
||||||
|
|
||||||
|
The staged deploy pipeline runs: **prepare → validate → deploy → verify → diagnose (on failure) → complete**. Stage state is persisted in `/opt/homelab/state/deploy/` allowing safe resumption.
|
||||||
|
|
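The resumable pipeline described above can be sketched as follows. This is a minimal illustration and not the contents of `scripts/deploy/deploy.sh`: the `<stage>.done` marker files, the `run_pipeline` helper and the temporary state directory are assumptions made for the example; only the stage names come from the text.

```python
import tempfile
from pathlib import Path

# Stage order as listed in the text; "diagnose" only runs on failure.
STAGES = ["prepare", "validate", "deploy", "verify", "complete"]


def run_pipeline(state_dir: Path, handlers: dict, resume: bool = False) -> list:
    """Run stages in order, skipping already-completed ones when resuming.

    Hypothetical helper: one empty `<stage>.done` marker per finished stage.
    Returns the stages executed in this invocation.
    """
    state_dir.mkdir(parents=True, exist_ok=True)
    if not resume:
        # A fresh deploy discards markers from earlier runs.
        for marker in state_dir.glob("*.done"):
            marker.unlink()
    executed = []
    for stage in STAGES:
        marker = state_dir / f"{stage}.done"
        if marker.exists():
            continue  # finished in an earlier run
        executed.append(stage)
        if not handlers[stage]():
            # Failure: run diagnostics and stop without writing a marker.
            executed.append("diagnose")
            handlers["diagnose"]()
            break
        marker.touch()
    return executed


def demo() -> tuple:
    """Simulate a deploy that fails once and is then resumed."""
    attempts = {"deploy": 0}

    def flaky_deploy() -> bool:
        attempts["deploy"] += 1
        return attempts["deploy"] > 1  # fails on the first call only

    def ok() -> bool:
        return True

    handlers = {"prepare": ok, "validate": ok, "deploy": flaky_deploy,
                "verify": ok, "complete": ok, "diagnose": ok}
    with tempfile.TemporaryDirectory() as tmp:
        state = Path(tmp) / "state" / "deploy"
        first = run_pipeline(state, handlers)
        second = run_pipeline(state, handlers, resume=True)
    return first, second
```

In this sketch a resumed run re-executes the failed stage and everything after it, while stages that already left a marker are skipped.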
||||||
## Service Structure
|
## Service Structure
|
||||||
|
|
||||||
|
|
@@ -73,31 +94,25 @@ Agents must never execute destructive actions (restarts, deploys, config changes
|
||||||
|
|
||||||
## Event System
|
## Event System
|
||||||
|
|
||||||
Events are append-only JSON lines at `/opt/homelab/events/YYYY-MM-DD/<node>/events.jsonl`.
|
Events are append-only JSON lines at:
|
||||||
|
```
|
||||||
|
/opt/homelab/events/YYYY-MM-DD/<node>/events.jsonl
|
||||||
|
```
|
||||||
|
|
||||||
Emit via `scripts/lib/events.sh` (shell) or `scripts/lib/events.py` (Python).
|
Emit from shell:
|
||||||
|
```bash
|
||||||
|
source scripts/lib/events.sh
|
||||||
|
emit_event "deployment_started" "info" "my-script.sh" "mosquitto" "cid-123" '{}'
|
||||||
|
```
|
||||||
|
|
||||||
|
Emit from Python:
|
||||||
|
```python
|
||||||
|
from scripts.lib.events import emit_event
|
||||||
|
emit_event("service_unhealthy", "error", "monitor.py", "ollama", "cid-123", {"error": "OOM"})
|
||||||
|
```
|
||||||
|
|
||||||
Normalized event types: `deployment_started/completed/failed`, `service_unhealthy/recovered`, `node_offline/online`, `healthcheck_failed`, `remediation_started/completed`.
|
Normalized event types: `deployment_started/completed/failed`, `service_unhealthy/recovered`, `node_offline/online`, `healthcheck_failed`, `remediation_started/completed`.
|
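To make the append-only layout concrete, here is a minimal sketch of writing one JSON line per event under `YYYY-MM-DD/<node>/events.jsonl`. It is not the code in `scripts/lib/events.py`: the `append_event` helper, the record field names and the use of a UTC date are assumptions for illustration, and the demo writes to a temporary directory rather than `/opt/homelab/events/`.

```python
import json
import tempfile
import time
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def append_event(root: Path, node: str, event_type: str, severity: str,
                 source: str, service: str, correlation_id: str,
                 payload: dict, timestamp: Optional[float] = None) -> Path:
    """Append one event as a single JSON line (hypothetical record schema)."""
    ts = time.time() if timestamp is None else timestamp
    # Assumption: the date directory is derived from the UTC date.
    day = datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d")
    path = root / day / node / "events.jsonl"
    path.parent.mkdir(parents=True, exist_ok=True)
    record = {
        "timestamp": ts,
        "type": event_type,
        "severity": severity,
        "source": source,
        "node": node,
        "service": service,
        "correlation_id": correlation_id,
        "payload": payload,
    }
    # Mode "a" never rewrites earlier lines, which keeps the file append-only.
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, sort_keys=True) + "\n")
    return path


def demo() -> list:
    """Write two events with a fixed timestamp and read them back."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp) / "events"
        fixed = 1_700_000_000.0
        append_event(root, "chelsty-infra", "deployment_started", "info",
                     "my-script.sh", "mosquitto", "cid-123", {}, fixed)
        path = append_event(root, "chelsty-infra", "deployment_completed",
                            "info", "my-script.sh", "mosquitto", "cid-123",
                            {}, fixed)
        lines = path.read_text(encoding="utf-8").splitlines()
        return [json.loads(line) for line in lines]
```

Because each record is a complete line, a reader can process the file incrementally and ignore a trailing partial line if a write was interrupted.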
||||||
|
|
||||||
### Supervisor event routing table
|
|
||||||
|
|
||||||
| Event type | Source | Action generated | Cooldown |
|
|
||||||
|---|---|---|---|
|
|
||||||
| `containers_not_running` | stability-agent | `container_restart` | dedup via stable ID |
|
|
||||||
| `mqtt_unreachable` | stability-agent | `container_restart` | dedup via stable ID |
|
|
||||||
| `service_unhealthy` / other | stability-agent | `redeploy` | dedup via stable ID |
|
|
||||||
| `disk_pressure` (high) | stability-agent | `disk_cleanup` | dedup via stable ID |
|
|
||||||
| `ha_websocket_dead` | ha-diag-agent | `container_restart` (homeassistant) | 30 min after completion |
|
|
||||||
| `ha_websocket_recovered` | ha-diag-agent | cancels matching restart | — |
|
|
||||||
| `ha_integration_failed` | ha-diag-agent | `alert_only` | 1 hour |
|
|
||||||
| `ha_entity_unavailable_long` | ha-diag-agent | `alert_only` | 1 hour |
|
|
||||||
| `ha_automation_failing` | ha-diag-agent | `alert_only` | 1 hour |
|
|
||||||
| `ha_update_available` | ha-diag-agent | `alert_only` | 1 hour |
|
|
||||||
| `ha_recorder_lag` | ha-diag-agent | `alert_only` | 1 hour |
|
|
||||||
| `ha_system_health_degraded` | ha-diag-agent | `alert_only` | 1 hour |
|
|
||||||
|
|
||||||
HA events are routed directly from the events directory by the supervisor (not via world-state drift loop) to avoid conflicts with stability-agent's independent container health tracking. HA events are suppressed if `homeassistant` had a `containers_not_running` incident within the last 5 minutes (planned restart/update in progress).
|
|
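The routing table and the 5-minute suppression rule can be read as a lookup followed by a time check. The sketch below is illustrative only: `ROUTES` and `route_ha_event` are hypothetical names, cooldowns are given in seconds, and the cancelling event `ha_websocket_recovered` is left out for brevity.

```python
from typing import Optional, Tuple

# Suppression window after a containers_not_running incident (5 minutes).
HA_TRANSITION_WINDOW = 300

# event type -> (action type, cooldown in seconds), per the table above.
ROUTES = {
    "ha_websocket_dead": ("container_restart", 1800),
    "ha_integration_failed": ("alert_only", 3600),
    "ha_entity_unavailable_long": ("alert_only", 3600),
    "ha_automation_failing": ("alert_only", 3600),
    "ha_update_available": ("alert_only", 3600),
    "ha_recorder_lag": ("alert_only", 3600),
    "ha_system_health_degraded": ("alert_only", 3600),
}


def route_ha_event(event_type: str,
                   last_container_incident: Optional[float],
                   now: float) -> Optional[Tuple[str, int]]:
    """Return (action, cooldown), or None if the event is unknown or suppressed.

    `last_container_incident` is the time of the most recent
    containers_not_running incident for homeassistant, if any.
    """
    route = ROUTES.get(event_type)
    if route is None:
        return None
    if (last_container_incident is not None
            and now - last_container_incident < HA_TRANSITION_WINDOW):
        # A restart or update is probably in progress; stay quiet.
        return None
    return route
```

Keeping the table as data means a new alert type only needs one more entry, while the suppression rule stays in one place.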
||||||
|
|
||||||
## Discovery Entry Points for Agents
|
## Discovery Entry Points for Agents
|
||||||
|
|
||||||
When exploring the system, use these files in order:
|
When exploring the system, use these files in order:
|
||||||
|
|
@@ -109,20 +124,29 @@ When exploring the system, use these files in order:
|
||||||
## CHELSTY-Specific Rules
|
## CHELSTY-Specific Rules
|
||||||
|
|
||||||
- Zigbee coordinator is **SLZB-06U** over TCP (`192.168.1.105:6638`, `ezsp` adapter). Never use `/dev/ttyUSB0`.
|
- Zigbee coordinator is **SLZB-06U** over TCP (`192.168.1.105:6638`, `ezsp` adapter). Never use `/dev/ttyUSB0`.
|
||||||
- CHELSTY nodes run **docker-compose v1** (1.29.2) — use `docker-compose` (hyphenated), not `docker compose`.
|
- Deploy CHELSTY nodes individually:
|
||||||
|
```bash
|
||||||
|
./scripts/deploy/deploy-node.sh chelsty-infra
|
||||||
|
./scripts/deploy/deploy-node.sh chelsty-ha
|
||||||
|
```
|
||||||
|
- Bootstrap CHELSTY runtime:
|
||||||
|
```bash
|
||||||
|
./scripts/bootstrap/chelsty-runtime.sh
|
||||||
|
```
|
||||||
- Critical backup sets: HA config+data, Zigbee2MQTT config+db+network key, Mosquitto config+persistence, SLZB-06U coordinator state.
|
- Critical backup sets: HA config+data, Zigbee2MQTT config+db+network key, Mosquitto config+persistence, SLZB-06U coordinator state.
|
||||||
|
|
||||||
## Runtime Path Conventions
|
## Runtime Path Conventions
|
||||||
|
|
||||||
`/opt/homelab/` layout on each node:
|
```
|
||||||
|
/opt/homelab/
|
||||||
- `data/<service>/` — persistent volumes
|
├── data/<service>/ # Persistent volumes
|
||||||
- `config/<service>/` — secrets and host-local overrides (not in Git)
|
├── config/<service>/ # Secrets and host-local overrides (not in Git)
|
||||||
- `logs/<service>/` — service logs
|
├── logs/<service>/ # Service logs
|
||||||
- `state/` — deployment stage markers, agent heartbeats
|
├── state/ # Deployment stage markers, agent heartbeats
|
||||||
- `events/` — append-only event store
|
├── events/ # Append-only event store
|
||||||
- `world/` — Observer output (synthesized state)
|
├── world/ # Observer output (synthesized state)
|
||||||
- `actions/` — pending / approved / running / completed / failed
|
└── actions/ # pending / approved / running / completed / failed
|
||||||
|
```
|
||||||
|
|
||||||
## Naming Conventions
|
## Naming Conventions
|
||||||
|
|
||||||
|
|
|
||||||
|
|
@@ -2,22 +2,6 @@ host: chelsty-infra
|
||||||
site: chelsty
|
site: chelsty
|
||||||
|
|
||||||
services:
|
services:
|
||||||
ha-diag-agent:
|
|
||||||
role: ha-diagnostic-agent
|
|
||||||
deployment_model: docker-compose
|
|
||||||
exposure: local-only
|
|
||||||
offline_required: false
|
|
||||||
depends_on:
|
|
||||||
local: []
|
|
||||||
external: [homeassistant]
|
|
||||||
config:
|
|
||||||
target_url: http://100.70.180.90:8123 # chelsty-ha via Tailscale (HAOS, separate VM)
|
|
||||||
location_tag: "chelsty"
|
|
||||||
events_dir: /opt/homelab/events/chelsty-infra
|
|
||||||
runtime:
|
|
||||||
config_path: /opt/homelab/config/ha-diag-agent
|
|
||||||
data_path: /var/lib/ha-diag-agent
|
|
||||||
|
|
||||||
node-agent:
|
node-agent:
|
||||||
role: node-stability-monitor
|
role: node-stability-monitor
|
||||||
# LTE node: node-agent monitors and emits events but does NO Docker cleanup.
|
# LTE node: node-agent monitors and emits events but does NO Docker cleanup.
|
||||||
|
|
|
||||||
|
|
@@ -1,22 +1,6 @@
|
||||||
host: piha
|
host: piha
|
||||||
|
|
||||||
services:
|
services:
|
||||||
ha-diag-agent:
|
|
||||||
role: ha-diagnostic-agent
|
|
||||||
deployment_model: docker-compose
|
|
||||||
exposure: local-only
|
|
||||||
offline_required: false
|
|
||||||
depends_on:
|
|
||||||
local: []
|
|
||||||
external: [homeassistant]
|
|
||||||
config:
|
|
||||||
target_url: http://localhost:8123
|
|
||||||
location_tag: "ken"
|
|
||||||
events_dir: /opt/homelab/events/piha
|
|
||||||
runtime:
|
|
||||||
config_path: /opt/homelab/config/ha-diag-agent
|
|
||||||
data_path: /var/lib/ha-diag-agent
|
|
||||||
|
|
||||||
node-agent:
|
node-agent:
|
||||||
role: node-stability-monitor
|
role: node-stability-monitor
|
||||||
deployment_model: docker-compose
|
deployment_model: docker-compose
|
||||||
|
|
|
||||||
|
|
@@ -54,36 +54,6 @@ async def post_api(path, data):
|
||||||
logger.error(f"Error posting to {url}: {e}")
|
logger.error(f"Error posting to {url}: {e}")
|
||||||
return False
|
return False
|
||||||
|
|
||||||
def _format_pending_action(action_id: str, data: dict) -> str:
|
|
||||||
"""Build the Telegram Markdown message for a pending action notification.
|
|
||||||
|
|
||||||
Extracted so it can be unit-tested without a live Telegram connection.
|
|
||||||
"""
|
|
||||||
# Supervisor writes risk_level; action-model.md legacy schema used risk.
|
|
||||||
risk = data.get("risk_level") or data.get("risk", "unknown")
|
|
||||||
message = (
|
|
||||||
f"⚠️ *Pending Action*\n"
|
|
||||||
f"ID: `{action_id}`\n"
|
|
||||||
f"Type: `{data.get('type', 'unknown')}`\n"
|
|
||||||
f"Service: `{data.get('service', 'unknown')}`\n"
|
|
||||||
f"Node: `{data.get('node', 'unknown')}`\n"
|
|
||||||
f"Risk: *{risk}*\n"
|
|
||||||
)
|
|
||||||
# description carries the human-readable substance of the action (required for
|
|
||||||
# alert_only actions where it is the entire operator-visible message).
|
|
||||||
description = data.get("description", "")
|
|
||||||
if description:
|
|
||||||
truncated = description[:300] + ("..." if len(description) > 300 else "")
|
|
||||||
message += f"Description: `{truncated}`\n"
|
|
||||||
# Legacy details block (old action-model.md schema) — kept for backwards compat.
|
|
||||||
if "details" in data:
|
|
||||||
details_str = json.dumps(data["details"], indent=2)
|
|
||||||
if len(details_str) > 1000:
|
|
||||||
details_str = details_str[:1000] + "..."
|
|
||||||
message += f"\nDetails:\n```json\n{details_str}\n```"
|
|
||||||
return message
|
|
||||||
|
|
||||||
|
|
||||||
class ApprovalBot:
|
class ApprovalBot:
|
||||||
def __init__(self):
|
def __init__(self):
|
||||||
self.pending_dir = ACTIONS_ROOT / "pending"
|
self.pending_dir = ACTIONS_ROOT / "pending"
|
||||||
|
|
@@ -116,7 +86,20 @@ class ApprovalBot:
|
||||||
|
|
||||||
async def notify_users(self, context: ContextTypes.DEFAULT_TYPE, action_id: str, data: dict):
|
async def notify_users(self, context: ContextTypes.DEFAULT_TYPE, action_id: str, data: dict):
|
||||||
"""Sends an approval request message to all allowed users."""
|
"""Sends an approval request message to all allowed users."""
|
||||||
message = _format_pending_action(action_id, data)
|
message = (
|
||||||
|
f"⚠️ *Pending Action*\n"
|
||||||
|
f"ID: `{action_id}`\n"
|
||||||
|
f"Type: `{data.get('type', 'unknown')}`\n"
|
||||||
|
f"Service: `{data.get('service', 'unknown')}`\n"
|
||||||
|
f"Node: `{data.get('node', 'unknown')}`\n"
|
||||||
|
f"Risk: *{data.get('risk', 'unknown')}*\n"
|
||||||
|
)
|
||||||
|
|
||||||
|
if "details" in data:
|
||||||
|
details_str = json.dumps(data['details'], indent=2)
|
||||||
|
if len(details_str) > 1000:
|
||||||
|
details_str = details_str[:1000] + "..."
|
||||||
|
message += f"\nDetails:\n```json\n{details_str}\n```"
|
||||||
|
|
||||||
keyboard = [
|
keyboard = [
|
||||||
[
|
[
|
||||||
|
|
|
||||||
|
|
@@ -1,38 +0,0 @@
|
||||||
"""Stub telegram before bot.py is imported so pytest doesn't need the real package."""
|
|
||||||
from __future__ import annotations
|
|
||||||
|
|
||||||
import sys
|
|
||||||
import types
|
|
||||||
from unittest.mock import MagicMock
|
|
||||||
|
|
||||||
|
|
||||||
def _make_telegram_stub() -> types.ModuleType:
|
|
||||||
mod = types.ModuleType("telegram")
|
|
||||||
mod.Update = MagicMock
|
|
||||||
mod.InlineKeyboardButton = MagicMock
|
|
||||||
mod.InlineKeyboardMarkup = MagicMock
|
|
||||||
return mod
|
|
||||||
|
|
||||||
|
|
||||||
def _make_telegram_ext_stub() -> types.ModuleType:
|
|
||||||
mod = types.ModuleType("telegram.ext")
|
|
||||||
mod.ApplicationBuilder = MagicMock
|
|
||||||
|
|
||||||
# ContextTypes.DEFAULT_TYPE is referenced as a type annotation at class-body
|
|
||||||
# evaluation time, so it must be a real attribute, not a dynamic MagicMock attr.
|
|
||||||
ContextTypesMock = MagicMock()
|
|
||||||
ContextTypesMock.DEFAULT_TYPE = type(None)
|
|
||||||
mod.ContextTypes = ContextTypesMock
|
|
||||||
|
|
||||||
mod.CommandHandler = MagicMock
|
|
||||||
mod.CallbackQueryHandler = MagicMock
|
|
||||||
mod.MessageHandler = MagicMock
|
|
||||||
mod.filters = MagicMock()
|
|
||||||
return mod
|
|
||||||
|
|
||||||
|
|
||||||
# Insert before any import of bot.py
|
|
||||||
if "telegram" not in sys.modules:
|
|
||||||
sys.modules["telegram"] = _make_telegram_stub()
|
|
||||||
if "telegram.ext" not in sys.modules:
|
|
||||||
sys.modules["telegram.ext"] = _make_telegram_ext_stub()
|
|
||||||
|
|
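The stubbing approach used by the removed conftest file relies on Python consulting `sys.modules` before searching for a real package. A minimal, self-contained sketch of that mechanism follows; `install_stub` and the package name `fake_sdk_demo` are made up for illustration and are not part of the repository.

```python
import importlib
import sys
import types
from unittest.mock import MagicMock


def install_stub(name: str, attrs: dict) -> types.ModuleType:
    """Register a placeholder module under `name` unless one is already loaded."""
    if name in sys.modules:
        return sys.modules[name]
    mod = types.ModuleType(name)
    for key, value in attrs.items():
        setattr(mod, key, value)
    sys.modules[name] = mod
    return mod


# "fake_sdk_demo" is a hypothetical package that is not installed.
stub = install_stub("fake_sdk_demo", {"Client": MagicMock})

# The import system finds the entry in sys.modules and returns it as-is.
imported = importlib.import_module("fake_sdk_demo")
```

The stub must be installed before the module under test is imported, which is why pytest's `conftest.py` is a convenient place for it.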
@@ -1,116 +0,0 @@
|
||||||
"""Tests for _format_pending_action — no Telegram connection required.
|
|
||||||
|
|
||||||
telegram stubs are set up in conftest.py before this module is imported.
|
|
||||||
"""
|
|
||||||
from __future__ import annotations
|
|
||||||
|
|
||||||
import sys
|
|
||||||
from pathlib import Path
|
|
||||||
|
|
||||||
import pytest
|
|
||||||
|
|
||||||
sys.path.insert(0, str(Path(__file__).parent.parent))
|
|
||||||
from bot import _format_pending_action
|
|
||||||
|
|
||||||
|
|
||||||
# ---------------------------------------------------------------------------
|
|
||||||
# Bug 1 — risk_level field
|
|
||||||
# ---------------------------------------------------------------------------
|
|
||||||
|
|
||||||
def test_risk_level_shown_when_present():
|
|
||||||
data = {
|
|
||||||
"type": "container_restart", "service": "homeassistant",
|
|
||||||
"node": "chelsty-ha", "risk_level": "low",
|
|
||||||
}
|
|
||||||
msg = _format_pending_action("container-restart-chelsty-ha-homeassistant", data)
|
|
||||||
assert "Risk: *low*" in msg
|
|
||||||
assert "unknown" not in msg
|
|
||||||
|
|
||||||
|
|
||||||
def test_risk_falls_back_to_legacy_risk_key():
|
|
||||||
data = {
|
|
||||||
"type": "redeploy", "service": "mosquitto",
|
|
||||||
"node": "chelsty-infra", "risk": "guarded",
|
|
||||||
}
|
|
||||||
msg = _format_pending_action("redeploy-chelsty-infra-mosquitto", data)
|
|
||||||
assert "Risk: *guarded*" in msg
|
|
||||||
|
|
||||||
|
|
||||||
def test_risk_unknown_when_both_absent():
|
|
||||||
data = {"type": "redeploy", "service": "foo", "node": "bar"}
|
|
||||||
msg = _format_pending_action("redeploy-bar-foo", data)
|
|
||||||
assert "Risk: *unknown*" in msg
|
|
||||||
|
|
||||||
|
|
||||||
# ---------------------------------------------------------------------------
|
|
||||||
# Bug 2 — description field
|
|
||||||
# ---------------------------------------------------------------------------
|
|
||||||
|
|
||||||
def test_description_shown_for_alert_only():
|
|
||||||
data = {
|
|
||||||
"type": "alert_only", "service": "homeassistant",
|
|
||||||
"node": "chelsty-ha", "risk_level": "info",
|
|
||||||
"description": "3 entities unavailable for >1h",
|
|
||||||
}
|
|
||||||
msg = _format_pending_action("alert-ha-entity-unavailable-chelsty-ha", data)
|
|
||||||
assert "3 entities unavailable for >1h" in msg
|
|
||||||
assert "Description:" in msg
|
|
||||||
|
|
||||||
|
|
||||||
def test_description_shown_for_container_restart():
|
|
||||||
data = {
|
|
||||||
"type": "container_restart", "service": "homeassistant",
|
|
||||||
"node": "chelsty-ha", "risk_level": "low",
|
|
||||||
"description": "Restart 'homeassistant' on chelsty-ha: HA WebSocket unresponsive",
|
|
||||||
}
|
|
||||||
msg = _format_pending_action("container-restart-chelsty-ha-homeassistant", data)
|
|
||||||
assert "HA WebSocket unresponsive" in msg
|
|
||||||
|
|
||||||
|
|
||||||
def test_description_absent_no_crash():
|
|
||||||
data = {"type": "redeploy", "service": "foo", "node": "bar", "risk_level": "guarded"}
|
|
||||||
msg = _format_pending_action("redeploy-bar-foo", data)
|
|
||||||
assert "Description:" not in msg
|
|
||||||
assert "Risk: *guarded*" in msg
|
|
||||||
|
|
||||||
|
|
||||||
def test_description_truncated_at_300_chars():
|
|
||||||
long_desc = "x" * 400
|
|
||||||
data = {
|
|
||||||
"type": "alert_only", "service": "homeassistant",
|
|
||||||
"node": "chelsty-ha", "risk_level": "info",
|
|
||||||
"description": long_desc,
|
|
||||||
}
|
|
||||||
msg = _format_pending_action("alert-ha-foo-chelsty-ha", data)
|
|
||||||
assert "x" * 300 in msg
|
|
||||||
assert "..." in msg
|
|
||||||
assert "x" * 301 not in msg
|
|
||||||
|
|
||||||
|
|
||||||
# ---------------------------------------------------------------------------
|
|
||||||
# Combined — real HA alert_only action shape
|
|
||||||
# ---------------------------------------------------------------------------
|
|
||||||
|
|
||||||
def test_ha_alert_only_full_action():
|
|
||||||
"""Mirrors an actual alert_only action written by supervisor._generate_ha_alert_only."""
|
|
||||||
data = {
|
|
||||||
"action_id": "alert-ha-entity-unavailable-chelsty-ha",
|
|
||||||
"type": "alert_only",
|
|
||||||
"node": "chelsty-ha",
|
|
||||||
"service": "homeassistant",
|
|
||||||
"risk_level": "info",
|
|
||||||
"confidence": 1.0,
|
|
||||||
"description": "3 entities unavailable for >1h: sensor.power, binary_sensor.window",
|
|
||||||
"status": "pending",
|
|
||||||
"payload": {
|
|
||||||
"location_tag": "chelsty",
|
|
||||||
"reason": "ha_entity_unavailable_long",
|
|
||||||
"count": 3,
|
|
||||||
},
|
|
||||||
}
|
|
||||||
msg = _format_pending_action(data["action_id"], data)
|
|
||||||
assert "alert_only" in msg
|
|
||||||
assert "chelsty-ha" in msg
|
|
||||||
assert "Risk: *info*" in msg
|
|
||||||
assert "3 entities unavailable" in msg
|
|
||||||
assert "unknown" not in msg
|
|
||||||
|
|
@@ -1,19 +0,0 @@
|
||||||
[build-system]
|
|
||||||
requires = ["setuptools>=68"]
|
|
||||||
build-backend = "setuptools.build_meta"
|
|
||||||
|
|
||||||
[project]
|
|
||||||
name = "control-plane"
|
|
||||||
version = "0.1.0"
|
|
||||||
requires-python = ">=3.11"
|
|
||||||
dependencies = [
|
|
||||||
"pyyaml>=6.0",
|
|
||||||
]
|
|
||||||
|
|
||||||
[project.optional-dependencies]
|
|
||||||
dev = [
|
|
||||||
"pytest>=8.1",
|
|
||||||
]
|
|
||||||
|
|
||||||
[tool.pytest.ini_options]
|
|
||||||
testpaths = ["tests"]
|
|
||||||
|
|
@@ -101,10 +101,6 @@ class Executor:
|
||||||
payload = data.get("payload", {})
|
payload = data.get("payload", {})
|
||||||
success, error_msg = self._execute_disk_cleanup(node, payload)
|
success, error_msg = self._execute_disk_cleanup(node, payload)
|
||||||
|
|
||||||
elif action_type == "alert_only":
|
|
||||||
# Operator acknowledged the alert; no automated execution needed.
|
|
||||||
success = True
|
|
||||||
|
|
||||||
else:
|
else:
|
||||||
success = False
|
success = False
|
||||||
error_msg = f"Unknown action type: {action_type}"
|
error_msg = f"Unknown action type: {action_type}"
|
||||||
|
|
|
||||||
|
|
@@ -9,7 +9,6 @@ from pathlib import Path
|
||||||
RUNTIME_PATH = os.getenv("RUNTIME_PATH", "/opt/homelab")
|
RUNTIME_PATH = os.getenv("RUNTIME_PATH", "/opt/homelab")
|
||||||
WORLD_DIR = Path(RUNTIME_PATH) / "world"
|
WORLD_DIR = Path(RUNTIME_PATH) / "world"
|
||||||
ACTIONS_DIR = Path(RUNTIME_PATH) / "actions"
|
ACTIONS_DIR = Path(RUNTIME_PATH) / "actions"
|
||||||
EVENTS_DIR = Path(RUNTIME_PATH) / "events"
|
|
||||||
REPO_ROOT = Path(os.getenv("REPO_ROOT", "/repo"))
|
REPO_ROOT = Path(os.getenv("REPO_ROOT", "/repo"))
|
||||||
|
|
||||||
# Node alias map: maps alternative node names (as they appear in events/world state)
|
# Node alias map: maps alternative node names (as they appear in events/world state)
|
||||||
|
|
@@ -33,53 +32,6 @@ CONTAINER_RESTART_TRIGGERS = {"containers_not_running", "mqtt_unreachable"}
|
||||||
# decide explicitly (e.g. adjust Frigate retain policy or purge HA recorder).
|
# decide explicitly (e.g. adjust Frigate retain policy or purge HA recorder).
|
||||||
NO_DISK_CLEANUP_NODES = {"chelsty-infra", "chelsty-ha"}
|
NO_DISK_CLEANUP_NODES = {"chelsty-infra", "chelsty-ha"}
|
||||||
|
|
||||||
# ---------------------------------------------------------------------------
|
|
||||||
# HA diagnostic event routing (ha-diag-agent events)
|
|
||||||
# ---------------------------------------------------------------------------
|
|
||||||
|
|
||||||
# ha_websocket_dead: HA WebSocket unresponsive → restart the homeassistant container.
|
|
||||||
# Separate from CONTAINER_RESTART_TRIGGERS because these events are routed directly
|
|
||||||
# from the events dir (not via the world-state drift loop) to avoid conflicts with
|
|
||||||
# the stability-agent's independent container health tracking on the same service key.
|
|
||||||
HA_CONTAINER_RESTART_EVENTS = {"ha_websocket_dead"}
|
|
||||||
|
|
||||||
# Alert-only events — operator notification, no automated action.
|
|
||||||
HA_ALERT_ONLY_EVENTS = {
|
|
||||||
"ha_integration_failed",
|
|
||||||
"ha_entity_unavailable_long",
|
|
||||||
"ha_automation_failing",
|
|
||||||
"ha_update_available",
|
|
||||||
"ha_recorder_lag",
|
|
||||||
"ha_system_health_degraded",
|
|
||||||
}
|
|
||||||
|
|
||||||
# Stable action-ID suffix for each alert-only type
|
|
||||||
_HA_ALERT_ID_SUFFIX = {
|
|
||||||
"ha_integration_failed": "integration-failed",
|
|
||||||
"ha_entity_unavailable_long": "entity-unavailable",
|
|
||||||
"ha_automation_failing": "automation-failing",
|
|
||||||
"ha_update_available": "update-available",
|
|
||||||
"ha_recorder_lag": "recorder-lag",
|
|
||||||
"ha_system_health_degraded": "system-health-degraded",
|
|
||||||
}
|
|
||||||
|
|
||||||
# 30-min cooldown after a container_restart completes; prevents restart loops
|
|
||||||
# when HA repeatedly fails to connect (e.g. bad config, slow startup).
|
|
||||||
HA_WEBSOCKET_RESTART_COOLDOWN = 1800
|
|
||||||
|
|
||||||
# 1-hour cooldown for alert-only events; avoids repeated Telegram noise for
|
|
||||||
# persistent conditions (e.g. an entity that stays unavailable for hours).
|
|
||||||
HA_ALERT_COOLDOWN = 3600
|
|
||||||
|
|
||||||
# Suppress ha_* events if homeassistant had a containers_not_running incident
|
|
||||||
# within this window — HA is in a planned restart/update and alerts would be noise.
|
|
||||||
HA_TRANSITION_WINDOW = 300 # 5 minutes
|
|
||||||
|
|
||||||
# When True, events that would generate container_restart are downgraded to alert_only
|
|
||||||
# with a "[SHADOW MODE]" note. Safe default for initial deployment; set
|
|
||||||
# HA_DIAG_SHADOW_MODE=false on the control-plane node when ready for live actions.
|
|
||||||
HA_DIAG_SHADOW_MODE = os.getenv("HA_DIAG_SHADOW_MODE", "true").lower() == "true"
|
|
||||||
|
|
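A note on how the shadow-mode flag parses, shown as a small standalone sketch. The `env_flag` helper and its `environ` parameter are hypothetical and exist only so the behaviour can be checked without touching the real environment; the comparison mirrors the `.lower() == "true"` expression in the removed line.

```python
import os
from typing import Mapping, Optional


def env_flag(name: str, default: str = "true",
             environ: Optional[Mapping[str, str]] = None) -> bool:
    """Return True only when the variable (or its default) equals "true".

    The comparison is case-insensitive; any other value counts as False.
    """
    env = os.environ if environ is None else environ
    return env.get(name, default).lower() == "true"


# Unset: the safe default applies and shadow mode stays on.
unset = env_flag("HA_DIAG_SHADOW_MODE", environ={})
# Explicitly disabled on the control-plane node.
disabled = env_flag("HA_DIAG_SHADOW_MODE", environ={"HA_DIAG_SHADOW_MODE": "false"})
# Values such as "1" or "yes" do not equal "true", so they also disable it.
numeric = env_flag("HA_DIAG_SHADOW_MODE", environ={"HA_DIAG_SHADOW_MODE": "1"})
```

With this parsing, a typo in the value silently turns shadow mode off, so live restarts become possible; operators may prefer to set the variable to exactly `true` or `false`.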
||||||
# Logging setup
|
# Logging setup
|
||||||
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
|
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
|
||||||
logger = logging.getLogger("supervisor")
|
logger = logging.getLogger("supervisor")
|
||||||
|
|
@@ -89,15 +41,7 @@ class Supervisor:
|
||||||
def __init__(self):
|
def __init__(self):
|
||||||
self.desired_state = {"services": {}}
|
self.desired_state = {"services": {}}
|
||||||
self.actual_state = {"services": {}, "nodes": {}, "incidents": {}}
|
self.actual_state = {"services": {}, "nodes": {}, "incidents": {}}
|
||||||
# In-memory set of already-routed HA event IDs; prevents re-processing
|
|
||||||
# on each reconcile cycle. Grows to at most ~hundreds of entries/day.
|
|
||||||
self._ha_processed_event_ids: set = set()
|
|
||||||
self._ensure_dirs()
|
self._ensure_dirs()
|
||||||
logger.info(
|
|
||||||
"shadow_mode=%s — HA container_restart actions %s",
|
|
||||||
HA_DIAG_SHADOW_MODE,
|
|
||||||
"downgraded to alert_only" if HA_DIAG_SHADOW_MODE else "enabled",
|
|
||||||
)
|
|
||||||
|
|
||||||
def _ensure_dirs(self):
|
def _ensure_dirs(self):
|
||||||
ACTIONS_DIR.mkdir(parents=True, exist_ok=True)
|
ACTIONS_DIR.mkdir(parents=True, exist_ok=True)
|
||||||
|
|
@@ -298,12 +242,6 @@ class Supervisor:
|
||||||
# operator can see it was auto-resolved rather than silently dropped.
|
# operator can see it was auto-resolved rather than silently dropped.
|
||||||
self._cancel_resolved_pending_actions()
|
self._cancel_resolved_pending_actions()
|
||||||
|
|
||||||
# 5. Route HA diagnostic events emitted by ha-diag-agent.
|
|
||||||
# Processed directly from the events directory — not via the world-state
|
|
||||||
# drift loop — to avoid conflicts with stability-agent's independent
|
|
||||||
# container health tracking for the homeassistant service.
|
|
||||||
self._process_ha_events()
|
|
||||||
|
|
||||||
# ------------------------------------------------------------------
|
# ------------------------------------------------------------------
|
||||||
# Recommendation generation
|
# Recommendation generation
|
||||||
# ------------------------------------------------------------------
|
# ------------------------------------------------------------------
|
||||||
|
|
@@ -504,247 +442,6 @@ class Supervisor:
|
||||||
except Exception as e:
|
except Exception as e:
|
||||||
logger.error(f"Failed to cancel action {action_file.name}: {e}")
|
logger.error(f"Failed to cancel action {action_file.name}: {e}")
|
||||||
|
|
||||||
# ------------------------------------------------------------------
|
|
||||||
# HA diagnostic event routing
|
|
||||||
# ------------------------------------------------------------------
|
|
||||||
|
|
||||||
def _process_ha_events(self):
|
|
||||||
"""Scan the events directory for unprocessed ha_* events and route them."""
|
|
||||||
if not EVENTS_DIR.exists():
|
|
||||||
return
|
|
||||||
for event_file in sorted(EVENTS_DIR.glob("**/*.json")):
|
|
||||||
event_id = event_file.stem
|
|
||||||
if event_id in self._ha_processed_event_ids:
|
|
||||||
continue
|
|
||||||
self._ha_processed_event_ids.add(event_id)
|
|
||||||
try:
|
|
||||||
with open(event_file) as f:
|
|
||||||
event = json.load(f)
|
|
||||||
except Exception as e:
|
|
||||||
logger.debug(f"Could not read event {event_file}: {e}")
|
|
||||||
continue
|
|
||||||
if not event.get("type", "").startswith("ha_"):
|
|
||||||
continue
|
|
||||||
self._route_ha_event(event)
|
|
||||||
|
|
||||||
def _route_ha_event(self, event: dict):
|
|
||||||
event_type = event.get("type", "")
|
|
||||||
node = event.get("node", "")
|
|
||||||
if not node:
|
|
||||||
return
|
|
||||||
|
|
||||||
if event_type in HA_CONTAINER_RESTART_EVENTS:
|
|
||||||
if self._is_ha_in_transition(node):
|
|
||||||
logger.debug(
|
|
||||||
f"Suppressing {event_type} on {node}: homeassistant in transition"
|
|
||||||
)
|
|
||||||
return
|
|
||||||
if HA_DIAG_SHADOW_MODE:
|
|
||||||
logger.info(
|
|
||||||
"shadow_mode: suppressed container_restart for %s", event_type
|
|
||||||
)
|
|
||||||
self._generate_ha_shadow_alert(node, event)
|
|
||||||
else:
|
|
||||||
self._generate_ha_container_restart(node, event)
|
|
||||||
|
|
||||||
elif event_type == "ha_websocket_recovered":
|
|
||||||
self._cancel_ha_container_restart(node)
|
|
||||||
|
|
||||||
elif event_type in HA_ALERT_ONLY_EVENTS:
|
|
||||||
if self._is_ha_in_transition(node):
|
|
||||||
logger.debug(
|
|
||||||
f"Suppressing {event_type} on {node}: homeassistant in transition"
|
|
||||||
)
|
|
||||||
return
|
|
||||||
self._generate_ha_alert_only(node, event)
|
|
||||||
|
|
||||||
def _is_ha_in_transition(self, node: str) -> bool:
|
|
||||||
"""Return True if homeassistant container had a recent containers_not_running incident.
|
|
||||||
|
|
||||||
Suppresses ha_* alerts during planned HA restarts/updates to avoid
|
|
||||||
flooding the operator with secondary diagnostic alerts.
|
|
||||||
"""
|
|
||||||
svc_key = f"{node}/homeassistant"
|
|
||||||
svc_info = self.actual_state["services"].get(svc_key, {})
|
|
||||||
incident_id = svc_info.get("incident_id")
|
|
||||||
if not incident_id:
|
|
||||||
return False
|
|
||||||
incident = self.actual_state["incidents"].get(incident_id, {})
|
|
||||||
return (
|
|
||||||
incident.get("status") == "active"
|
|
||||||
and incident.get("trigger_type") == "containers_not_running"
|
|
||||||
and time.time() - (incident.get("last_occurrence") or 0) < HA_TRANSITION_WINDOW
|
|
||||||
)
|
|
||||||
|
|
||||||
def _ha_action_recently_completed(self, action_id: str, cooldown: int) -> bool:
|
|
||||||
"""Return True if action completed/rejected/cancelled within the cooldown window."""
|
|
||||||
for state in ("completed", "rejected", "cancelled"):
|
|
||||||
path = ACTIONS_DIR / state / f"{action_id}.json"
|
|
||||||
if path.exists():
|
|
||||||
try:
|
|
||||||
with open(path) as f:
|
|
||||||
data = json.load(f)
|
|
||||||
finished = (
|
|
||||||
data.get("finished_at")
|
|
||||||
or data.get("cancelled_at")
|
|
||||||
or data.get("updated_at")
|
|
||||||
or 0
|
|
||||||
)
|
|
||||||
if time.time() - finished < cooldown:
|
|
||||||
return True
|
|
||||||
except Exception:
|
|
||||||
pass
|
|
||||||
return False
|
|
||||||
|
|
||||||
    def _generate_ha_container_restart(self, node: str, event: dict):
        service = "homeassistant"
        action_id = f"container-restart-{node}-{service}"

        for state in ("pending", "approved", "running"):
            if (ACTIONS_DIR / state / f"{action_id}.json").exists():
                logger.debug(f"Skipping {action_id}: already in state '{state}'")
                return

        if self._ha_action_recently_completed(action_id, HA_WEBSOCKET_RESTART_COOLDOWN):
            logger.debug(
                f"Skipping {action_id}: within {HA_WEBSOCKET_RESTART_COOLDOWN}s cooldown"
            )
            return

        payload = dict(event.get("payload", {}))
        payload["reason"] = "ha_websocket_dead"
        payload["svc_key"] = f"{node}/{service}"

        container_name = self._get_container_name(service)
        action = {
            "action_id": action_id,
            "timestamp": time.time(),
            "type": "container_restart",
            "node": node,
            "service": service,
            "container_name": container_name,
            "risk_level": "low",
            "confidence": 0.9,
            "description": (
                f"Restart '{container_name}' on {node}: HA WebSocket unresponsive"
            ),
            "status": "pending",
            "payload": payload,
        }
        self._write_pending_action(action)

    def _generate_ha_shadow_alert(self, node: str, event: dict):
        """Shadow-mode downgrade: emit alert_only instead of container_restart.

        Uses the same action_id and cooldown as the real restart so that
        cooldown semantics are identical regardless of shadow mode state.
        """
        service = "homeassistant"
        action_id = f"container-restart-{node}-{service}"

        for state in ("pending", "approved", "running"):
            if (ACTIONS_DIR / state / f"{action_id}.json").exists():
                logger.debug(f"Skipping {action_id}: already in state '{state}'")
                return

        if self._ha_action_recently_completed(action_id, HA_WEBSOCKET_RESTART_COOLDOWN):
            logger.debug(
                f"Skipping {action_id}: within {HA_WEBSOCKET_RESTART_COOLDOWN}s cooldown"
            )
            return

        payload = dict(event.get("payload", {}))
        payload["reason"] = "ha_websocket_dead"
        payload["svc_key"] = f"{node}/{service}"
        payload["shadow_mode"] = True

        action = {
            "action_id": action_id,
            "timestamp": time.time(),
            "type": "alert_only",
            "node": node,
            "service": service,
            "risk_level": "info",
            "confidence": 0.9,
            "description": (
                f"[SHADOW MODE] would have triggered container_restart "
                f"for {service} on {node}: HA WebSocket unresponsive"
            ),
            "status": "pending",
            "payload": payload,
        }
        self._write_pending_action(action)

    def _generate_ha_alert_only(self, node: str, event: dict):
        event_type = event.get("type", "")
        suffix = _HA_ALERT_ID_SUFFIX.get(event_type, event_type.replace("_", "-"))
        action_id = f"alert-ha-{suffix}-{node}"

        for state in ("pending", "approved", "running"):
            if (ACTIONS_DIR / state / f"{action_id}.json").exists():
                logger.debug(f"Skipping {action_id}: already in state '{state}'")
                return

        if self._ha_action_recently_completed(action_id, HA_ALERT_COOLDOWN):
            logger.debug(
                f"Skipping {action_id}: within {HA_ALERT_COOLDOWN}s cooldown"
            )
            return

        payload = dict(event.get("payload", {}))
        payload["reason"] = event_type

        action = {
            "action_id": action_id,
            "timestamp": time.time(),
            "type": "alert_only",
            "node": node,
            "service": event.get("service", "homeassistant"),
            "risk_level": "info",
            "confidence": 1.0,
            "description": event.get(
                "message", f"HA diagnostic alert: {event_type} on {node}"
            ),
            "status": "pending",
            "payload": payload,
        }
        self._write_pending_action(action)

    def _cancel_ha_container_restart(self, node: str):
        """Move a pending ha_websocket_dead container_restart to cancelled on recovery."""
        action_id = f"container-restart-{node}-homeassistant"
        pending_path = ACTIONS_DIR / "pending" / f"{action_id}.json"
        if not pending_path.exists():
            return
        cancelled_dir = ACTIONS_DIR / "cancelled"
        cancelled_dir.mkdir(parents=True, exist_ok=True)
        dest = cancelled_dir / f"{action_id}.json"
        try:
            with open(pending_path) as f:
                action = json.load(f)
            action["status"] = "cancelled"
            action["cancelled_reason"] = "ha_websocket_recovered"
            action["cancelled_at"] = time.time()
            with open(dest, "w") as f:
                json.dump(action, f, indent=2)
            pending_path.unlink()
            logger.info(f"Cancelled {action_id}: ha_websocket_recovered on {node}")
        except Exception as e:
            logger.error(f"Failed to cancel {action_id}: {e}")

    def _write_pending_action(self, action: dict):
        action_id = action["action_id"]
        action_path = ACTIONS_DIR / "pending" / f"{action_id}.json"
        try:
            with open(action_path, "w") as f:
                json.dump(action, f, indent=2)
            logger.info(
                f"Generated HA action: {action_id} "
                f"(type={action['type']}, risk={action['risk_level']})"
            )
        except Exception as e:
            logger.error(f"Failed to save action {action_id}: {e}")

    def loop(self, interval=30):
        logger.info("Starting supervisor loop")
        while True:

@ -1,395 +0,0 @@
"""Tests for HA diagnostic event routing in the supervisor."""
from __future__ import annotations

import json
import sys
import time
from pathlib import Path

import pytest

# Add src/ to path so we can import supervisor without installing
sys.path.insert(0, str(Path(__file__).parent.parent / "src"))
import supervisor as supervisor_module
from supervisor import Supervisor


# ---------------------------------------------------------------------------
# Helpers
# ---------------------------------------------------------------------------

def _make_event(event_type: str, node: str = "chelsty-ha", service: str = "homeassistant",
                payload: dict | None = None, message: str = "") -> dict:
    return {
        "id": f"evt-{node}-{int(time.time())}-{event_type}-{service}-1",
        "type": event_type,
        "node": node,
        "service": service,
        "severity": "warning",
        "timestamp": int(time.time()),
        "message": message or f"Test event: {event_type}",
        "payload": payload or {"location_tag": "chelsty"},
    }


def _write_event(events_dir: Path, event: dict) -> Path:
    path = events_dir / f"{event['id']}.json"
    path.write_text(json.dumps(event))
    return path


def _setup_supervisor(tmp_path: Path, monkeypatch) -> Supervisor:
    """Return a Supervisor instance with all paths redirected to tmp_path."""
    actions = tmp_path / "actions"
    events = tmp_path / "events"
    world = tmp_path / "world"
    repo = tmp_path / "repo"
    state = tmp_path / "state"

    for d in (actions, events, world, repo / "hosts", state):
        d.mkdir(parents=True, exist_ok=True)

    monkeypatch.setattr(supervisor_module, "ACTIONS_DIR", actions)
    monkeypatch.setattr(supervisor_module, "EVENTS_DIR", events)
    monkeypatch.setattr(supervisor_module, "WORLD_DIR", world)
    monkeypatch.setattr(supervisor_module, "REPO_ROOT", repo)

    sup = Supervisor()
    # Empty desired/actual state so reconcile drift loop is a no-op
    sup.desired_state = {"services": {}}
    sup.actual_state = {"services": {}, "nodes": {}, "incidents": {}}
    return sup


def _pending(tmp_path: Path, action_id: str) -> Path:
    return tmp_path / "actions" / "pending" / f"{action_id}.json"


def _read_action(tmp_path: Path, state: str, action_id: str) -> dict:
    return json.loads((tmp_path / "actions" / state / f"{action_id}.json").read_text())


# ---------------------------------------------------------------------------
# 1. Each event type → correct action type
# ---------------------------------------------------------------------------

def test_ha_websocket_dead_generates_container_restart(tmp_path, monkeypatch):
    monkeypatch.setattr(supervisor_module, "HA_DIAG_SHADOW_MODE", False)
    sup = _setup_supervisor(tmp_path, monkeypatch)
    events_dir = tmp_path / "events"
    _write_event(events_dir, _make_event("ha_websocket_dead"))

    sup._process_ha_events()

    action_id = "container-restart-chelsty-ha-homeassistant"
    assert _pending(tmp_path, action_id).exists()
    action = _read_action(tmp_path, "pending", action_id)
    assert action["type"] == "container_restart"
    assert action["service"] == "homeassistant"
    assert action["node"] == "chelsty-ha"


@pytest.mark.parametrize("event_type,expected_suffix", [
    ("ha_integration_failed", "integration-failed"),
    ("ha_entity_unavailable_long", "entity-unavailable"),
    ("ha_automation_failing", "automation-failing"),
    ("ha_update_available", "update-available"),
    ("ha_recorder_lag", "recorder-lag"),
    ("ha_system_health_degraded", "system-health-degraded"),
])
def test_alert_only_events_generate_alert_actions(
    tmp_path, monkeypatch, event_type, expected_suffix
):
    sup = _setup_supervisor(tmp_path, monkeypatch)
    _write_event(tmp_path / "events", _make_event(event_type))

    sup._process_ha_events()

    action_id = f"alert-ha-{expected_suffix}-chelsty-ha"
    assert _pending(tmp_path, action_id).exists(), f"No pending action for {event_type}"
    action = _read_action(tmp_path, "pending", action_id)
    assert action["type"] == "alert_only"
    assert action["node"] == "chelsty-ha"


# ---------------------------------------------------------------------------
# 2. Transition suppression
# ---------------------------------------------------------------------------

def test_ha_websocket_dead_suppressed_during_transition(tmp_path, monkeypatch):
    sup = _setup_supervisor(tmp_path, monkeypatch)

    # Set up world state: homeassistant has an active containers_not_running incident
    inc_id = "inc-123-chelsty-ha-homeassistant"
    sup.actual_state["services"]["chelsty-ha/homeassistant"] = {
        "node": "chelsty-ha", "service": "homeassistant",
        "status": "unhealthy", "incident_id": inc_id,
    }
    sup.actual_state["incidents"][inc_id] = {
        "id": inc_id, "status": "active",
        "trigger_type": "containers_not_running",
        "last_occurrence": time.time() - 60,  # 1 min ago — within 5-min window
    }

    _write_event(tmp_path / "events", _make_event("ha_websocket_dead"))
    sup._process_ha_events()

    action_id = "container-restart-chelsty-ha-homeassistant"
    assert not _pending(tmp_path, action_id).exists(), "Action should be suppressed during transition"


def test_ha_alert_suppressed_during_transition(tmp_path, monkeypatch):
    sup = _setup_supervisor(tmp_path, monkeypatch)

    inc_id = "inc-456-chelsty-ha-homeassistant"
    sup.actual_state["services"]["chelsty-ha/homeassistant"] = {
        "node": "chelsty-ha", "service": "homeassistant",
        "status": "unhealthy", "incident_id": inc_id,
    }
    sup.actual_state["incidents"][inc_id] = {
        "id": inc_id, "status": "active",
        "trigger_type": "containers_not_running",
        "last_occurrence": time.time() - 30,
    }

    for event_type in supervisor_module.HA_ALERT_ONLY_EVENTS:
        _write_event(tmp_path / "events", _make_event(event_type))

    sup._process_ha_events()

    for suffix in supervisor_module._HA_ALERT_ID_SUFFIX.values():
        action_id = f"alert-ha-{suffix}-chelsty-ha"
        assert not _pending(tmp_path, action_id).exists(), \
            f"{action_id} should be suppressed"


def test_transition_suppression_expires_after_window(tmp_path, monkeypatch):
    """After 5 min, transition window expires and events are routed normally."""
    sup = _setup_supervisor(tmp_path, monkeypatch)

    inc_id = "inc-789-chelsty-ha-homeassistant"
    sup.actual_state["services"]["chelsty-ha/homeassistant"] = {
        "node": "chelsty-ha", "service": "homeassistant",
        "status": "unhealthy", "incident_id": inc_id,
    }
    sup.actual_state["incidents"][inc_id] = {
        "id": inc_id, "status": "active",
        "trigger_type": "containers_not_running",
        "last_occurrence": time.time() - 400,  # 6.7 min ago — outside window
    }

    _write_event(tmp_path / "events", _make_event("ha_websocket_dead"))
    sup._process_ha_events()

    action_id = "container-restart-chelsty-ha-homeassistant"
    assert _pending(tmp_path, action_id).exists(), "Should not be suppressed after window"


# ---------------------------------------------------------------------------
# 3. Recovery cancellation
# ---------------------------------------------------------------------------

def test_ha_websocket_recovered_cancels_pending_restart(tmp_path, monkeypatch):
    sup = _setup_supervisor(tmp_path, monkeypatch)
    events_dir = tmp_path / "events"
    actions = tmp_path / "actions"
    (actions / "cancelled").mkdir(parents=True, exist_ok=True)

    # Pre-create a pending container_restart for homeassistant
    action_id = "container-restart-chelsty-ha-homeassistant"
    pending_action = {
        "action_id": action_id, "type": "container_restart",
        "node": "chelsty-ha", "service": "homeassistant",
        "status": "pending", "timestamp": time.time(),
    }
    _pending(tmp_path, action_id).write_text(json.dumps(pending_action))

    _write_event(events_dir, _make_event("ha_websocket_recovered"))
    sup._process_ha_events()

    assert not _pending(tmp_path, action_id).exists(), "Pending action should be cancelled"
    cancelled = actions / "cancelled" / f"{action_id}.json"
    assert cancelled.exists()
    data = json.loads(cancelled.read_text())
    assert data["cancelled_reason"] == "ha_websocket_recovered"


def test_ha_websocket_recovered_no_pending_action_is_noop(tmp_path, monkeypatch):
    """Recovery event when no pending restart exists must not raise."""
    sup = _setup_supervisor(tmp_path, monkeypatch)
    _write_event(tmp_path / "events", _make_event("ha_websocket_recovered"))
    sup._process_ha_events()  # should not raise


# ---------------------------------------------------------------------------
# 4. Cooldown
# ---------------------------------------------------------------------------

def test_ha_websocket_dead_cooldown_prevents_second_restart(tmp_path, monkeypatch):
    """Two ha_websocket_dead events within 30 min → only one container_restart."""
    sup = _setup_supervisor(tmp_path, monkeypatch)
    events_dir = tmp_path / "events"
    actions = tmp_path / "actions"
    (actions / "completed").mkdir(parents=True, exist_ok=True)

    # First event → action generated
    _write_event(events_dir, _make_event("ha_websocket_dead", service="homeassistant"))
    sup._process_ha_events()

    action_id = "container-restart-chelsty-ha-homeassistant"
    assert _pending(tmp_path, action_id).exists()

    # Simulate: action completed recently (< 30 min ago)
    action_data = json.loads(_pending(tmp_path, action_id).read_text())
    action_data["status"] = "completed"
    action_data["finished_at"] = time.time() - 60  # 1 min ago
    (actions / "completed" / f"{action_id}.json").write_text(json.dumps(action_data))
    _pending(tmp_path, action_id).unlink()

    # Second event — should be suppressed by cooldown
    event2 = _make_event("ha_websocket_dead", service="homeassistant")
    event2["id"] = event2["id"] + "-2"  # different event ID
    _write_event(events_dir, event2)
    sup._process_ha_events()

    assert not _pending(tmp_path, action_id).exists(), "Second restart within cooldown should be suppressed"


def test_ha_websocket_dead_cooldown_expires(tmp_path, monkeypatch):
    """After cooldown expires, a new ha_websocket_dead should generate an action."""
    sup = _setup_supervisor(tmp_path, monkeypatch)
    events_dir = tmp_path / "events"
    actions = tmp_path / "actions"
    (actions / "completed").mkdir(parents=True, exist_ok=True)

    action_id = "container-restart-chelsty-ha-homeassistant"
    # Pre-populate completed action with timestamp > 30 min ago
    old_action = {
        "action_id": action_id, "type": "container_restart",
        "status": "completed", "finished_at": time.time() - 3700,  # > 30 min
    }
    (actions / "completed" / f"{action_id}.json").write_text(json.dumps(old_action))

    _write_event(events_dir, _make_event("ha_websocket_dead"))
    sup._process_ha_events()

    assert _pending(tmp_path, action_id).exists(), "Should generate new restart after cooldown"


# ---------------------------------------------------------------------------
# 5. Location tag preserved
# ---------------------------------------------------------------------------

def test_location_tag_preserved_in_container_restart_payload(tmp_path, monkeypatch):
    sup = _setup_supervisor(tmp_path, monkeypatch)
    _write_event(tmp_path / "events",
                 _make_event("ha_websocket_dead", payload={"location_tag": "chelsty", "extra": "data"}))

    sup._process_ha_events()

    action = _read_action(tmp_path, "pending", "container-restart-chelsty-ha-homeassistant")
    assert action["payload"]["location_tag"] == "chelsty"


def test_location_tag_preserved_in_alert_only_payload(tmp_path, monkeypatch):
    sup = _setup_supervisor(tmp_path, monkeypatch)
    _write_event(tmp_path / "events",
                 _make_event("ha_entity_unavailable_long",
                             payload={"location_tag": "ken", "count": 3}))

    sup._process_ha_events()

    action = _read_action(tmp_path, "pending", "alert-ha-entity-unavailable-chelsty-ha")
    assert action["payload"]["location_tag"] == "ken"


# ---------------------------------------------------------------------------
# 6. Dedup — same alert type twice → only one pending action
# ---------------------------------------------------------------------------

def test_alert_only_dedup_second_event_skipped(tmp_path, monkeypatch):
    sup = _setup_supervisor(tmp_path, monkeypatch)
    events_dir = tmp_path / "events"

    event1 = _make_event("ha_entity_unavailable_long")
    event2 = _make_event("ha_entity_unavailable_long")
    event2["id"] = event2["id"] + "-2"
    _write_event(events_dir, event1)
    _write_event(events_dir, event2)

    sup._process_ha_events()

    action_id = "alert-ha-entity-unavailable-chelsty-ha"
    assert _pending(tmp_path, action_id).exists()
    # Only one file — not duplicated
    pending_files = list((tmp_path / "actions" / "pending").glob("alert-ha-entity-unavailable*.json"))
    assert len(pending_files) == 1


# ---------------------------------------------------------------------------
# 7. Shadow mode
# ---------------------------------------------------------------------------

def test_shadow_mode_websocket_dead_generates_alert_not_restart(tmp_path, monkeypatch):
    """shadow_mode=True: ha_websocket_dead → alert_only with [SHADOW MODE], not container_restart."""
    monkeypatch.setattr(supervisor_module, "HA_DIAG_SHADOW_MODE", True)
    sup = _setup_supervisor(tmp_path, monkeypatch)
    _write_event(tmp_path / "events", _make_event("ha_websocket_dead"))

    sup._process_ha_events()

    action_id = "container-restart-chelsty-ha-homeassistant"
    assert _pending(tmp_path, action_id).exists(), "Shadow alert should be written"
    action = _read_action(tmp_path, "pending", action_id)
    assert action["type"] == "alert_only"
    assert "[SHADOW MODE]" in action["description"]
    assert action["payload"].get("shadow_mode") is True


def test_no_shadow_mode_websocket_dead_generates_container_restart(tmp_path, monkeypatch):
    """shadow_mode=False: ha_websocket_dead → container_restart (normal path)."""
    monkeypatch.setattr(supervisor_module, "HA_DIAG_SHADOW_MODE", False)
    sup = _setup_supervisor(tmp_path, monkeypatch)
    _write_event(tmp_path / "events", _make_event("ha_websocket_dead"))

    sup._process_ha_events()

    action_id = "container-restart-chelsty-ha-homeassistant"
    assert _pending(tmp_path, action_id).exists()
    action = _read_action(tmp_path, "pending", action_id)
    assert action["type"] == "container_restart"
    assert "[SHADOW MODE]" not in action["description"]


def test_shadow_mode_alert_only_events_unaffected(tmp_path, monkeypatch):
    """shadow_mode=True: alert-only events (ha_entity_unavailable_long) are still routed normally."""
    monkeypatch.setattr(supervisor_module, "HA_DIAG_SHADOW_MODE", True)
    sup = _setup_supervisor(tmp_path, monkeypatch)
    _write_event(tmp_path / "events", _make_event("ha_entity_unavailable_long"))

    sup._process_ha_events()

    action_id = "alert-ha-entity-unavailable-chelsty-ha"
    assert _pending(tmp_path, action_id).exists()
    action = _read_action(tmp_path, "pending", action_id)
    assert action["type"] == "alert_only"
    assert "[SHADOW MODE]" not in action["description"]


# ---------------------------------------------------------------------------
# 8. Non-HA events are ignored
# ---------------------------------------------------------------------------

def test_non_ha_events_not_routed(tmp_path, monkeypatch):
    sup = _setup_supervisor(tmp_path, monkeypatch)
    events_dir = tmp_path / "events"

    for etype in ("service_unhealthy", "containers_not_running", "node_online", "deployment_failed"):
        e = _make_event(etype, service="mosquitto")
        e["type"] = etype
        _write_event(events_dir, e)

    sup._process_ha_events()

    pending_files = list((tmp_path / "actions" / "pending").glob("*.json"))
    assert pending_files == [], "Non-HA events should not generate actions via HA path"

@ -1,239 +0,0 @@
# ha-diag-agent Deployment Guide

## Section 1: Prerequisites

### HA long-lived access token

The agent authenticates to Home Assistant with a long-lived token issued by a
dedicated service account. Do not use a personal admin token.

1. In HA: **Settings → People → Add Person**
   - Name: `diag_agent`
   - Do **not** add to any group (no admin rights needed)
2. Log in to HA as `diag_agent`
3. Go to **Profile → Long-Lived Access Tokens → Create token**
   - Name: `ha-diag-agent`
   - Copy the token; it is shown only once
4. Store the token in the node's `.env` file (see Section 2)

### Tailnet reachability check (chelsty-infra only)

`chelsty-infra` reaches Home Assistant on `chelsty-ha` over Tailscale.
Verify before deploying:

```bash
curl -sf http://100.70.180.90:8123/api/ \
  -H "Authorization: Bearer <token>" | python3 -m json.tool
# Expect: {"message": "API running."}
```

If the request times out, check that both nodes are on the Tailscale mesh
(`tailscale status`) and that `chelsty-ha` is powered on.
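
The same reachability check can be scripted. The following is a minimal sketch, not part of the agent: the helper names `build_api_request`, `is_api_running`, and `check_ha` are hypothetical, and it relies only on the `/api/` endpoint and the `{"message": "API running."}` response shown above.

```python
import json
import urllib.request


def build_api_request(base_url: str, token: str) -> urllib.request.Request:
    """Build a GET request for the HA REST API root with a bearer token."""
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/api/",
        headers={"Authorization": f"Bearer {token}"},
    )


def is_api_running(body: bytes) -> bool:
    """Return True if the body is the expected 'API running.' JSON object."""
    try:
        return json.loads(body).get("message") == "API running."
    except (ValueError, AttributeError):
        # Not JSON, or JSON that is not an object.
        return False


def check_ha(base_url: str, token: str, timeout: float = 5.0) -> bool:
    """Perform the request; any network or HTTP error counts as unreachable."""
    try:
        request = build_api_request(base_url, token)
        with urllib.request.urlopen(request, timeout=timeout) as resp:
            return is_api_running(resp.read())
    except OSError:
        return False
```

Calling `check_ha("http://100.70.180.90:8123", token)` from `chelsty-infra` should return `True` when the curl command above succeeds. A `False` result does not distinguish a timeout from a rejected token, so fall back to curl for diagnosis.
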

---

## Section 2: Per-host config

Create `/opt/homelab/config/ha-diag-agent/.env` on **each target node**:

### piha

```bash
mkdir -p /opt/homelab/config/ha-diag-agent
cat > /opt/homelab/config/ha-diag-agent/.env << 'EOF'
HA_URL=http://localhost:8123
HA_TOKEN=<long-lived-token-for-piha>
NODE_NAME=piha
LOCATION_TAG=ken
CHECK_INTERVAL=60
CHECK_INTERVAL_UNAVAILABLE=3600
UNAVAILABLE_THRESHOLD_HOURS=24
ALERT_COOLDOWN_HOURS=6
LOG_LEVEL=info
EOF
chmod 600 /opt/homelab/config/ha-diag-agent/.env
```

### chelsty-infra

```bash
mkdir -p /opt/homelab/config/ha-diag-agent
cat > /opt/homelab/config/ha-diag-agent/.env << 'EOF'
HA_URL=http://100.70.180.90:8123
HA_TOKEN=<long-lived-token-for-chelsty-ha>
NODE_NAME=chelsty-infra
LOCATION_TAG=chelsty
CHECK_INTERVAL=60
CHECK_INTERVAL_UNAVAILABLE=3600
UNAVAILABLE_THRESHOLD_HOURS=24
ALERT_COOLDOWN_HOURS=6
LOG_LEVEL=info
EOF
chmod 600 /opt/homelab/config/ha-diag-agent/.env
```

> If `chelsty-ha` gets a new Tailscale IP, update `HA_URL` in this file and
> restart the container.
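
Before deploying, it can help to confirm that the `.env` file is complete and that no `<placeholder>` value was left in. A minimal sketch follows; the `REQUIRED_KEYS` set and both helper names are assumptions made for illustration, not something the agent is known to enforce.

```python
# Keys assumed to be mandatory for this sketch; the agent may have defaults.
REQUIRED_KEYS = {"HA_URL", "HA_TOKEN", "NODE_NAME", "LOCATION_TAG"}


def parse_env(text: str) -> dict:
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        env[key.strip()] = value.strip()
    return env


def missing_keys(env: dict) -> list:
    """Return required keys that are absent, empty, or still a <placeholder>."""
    return sorted(
        key for key in REQUIRED_KEYS
        if not env.get(key) or env[key].startswith("<")
    )
```

For example, `missing_keys(parse_env(open("/opt/homelab/config/ha-diag-agent/.env").read()))` should return an empty list on a correctly filled file.
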

---

## Section 3: Deploy procedure

### From SATURN (standard flow)

```bash
# 1. Commit and push changes from SATURN
git push

# 2. SSH to target node
ssh oskar@piha  # or chelsty-infra

# 3. Pull latest and deploy
cd ~/homelab-codex-ws
git pull
scripts/deploy/deploy.sh --service ha-diag-agent
```

### chelsty-infra (docker-compose v1)

`chelsty-infra` runs docker-compose v1 (1.29.2). The deploy script calls
`docker-compose` (hyphenated), which is correct. If you need to run manually:

```bash
cd ~/homelab-codex-ws/services/ha-diag-agent
docker-compose up -d --build
```

---

## Section 4: Verification

```bash
# Container is up
docker ps | grep ha-diag-agent

# Last 50 log lines
docker logs ha-diag-agent --tail 50

# FastAPI health endpoint
curl http://localhost:8087/health
# Expect: {"status": "ok", "ws_connected": true, ...}

# Events are being written
ls /opt/homelab/events/<node-name>/
# Expect: ha_*.json files appearing within the first CHECK_INTERVAL seconds

# Supervisor is picking up events (check on VPS / control-plane)
tail -f /opt/homelab/logs/supervisor.log | grep ha_
```
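
The health check above can also be evaluated programmatically, for instance from a post-deploy script. This is a sketch under the assumption that the `/health` body contains at least the `status` and `ws_connected` fields shown in the expected output; `health_problems` is a hypothetical helper, and any other fields are ignored.

```python
import json


def health_problems(body: str) -> list:
    """Return human-readable problems found in a /health response body."""
    try:
        data = json.loads(body)
    except ValueError:
        return ["response is not valid JSON"]
    if not isinstance(data, dict):
        return ["response is not a JSON object"]
    problems = []
    if data.get("status") != "ok":
        problems.append(f"status is {data.get('status')!r}, expected 'ok'")
    if data.get("ws_connected") is not True:
        problems.append("ws_connected is not true")
    return problems
```

An empty list means both fields match the expected values. A non-empty list is a prompt to read `docker logs ha-diag-agent`.
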

---

## Section 5: First-48h observation (shadow mode)

The supervisor starts with `HA_DIAG_SHADOW_MODE=true` (default). During this
window, `ha_websocket_dead` events are downgraded to `alert_only` actions
tagged `[SHADOW MODE]` rather than triggering an automatic restart.
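
The downgrade rule can be summarised in a few lines. The sketch below is illustrative only: `shadow_mode_enabled` and `action_type_for` are hypothetical names, and the way the real supervisor parses the variable is an assumption (here, anything other than `false` keeps shadow mode on). It covers diagnostic events only; `ha_websocket_recovered` cancels a pending restart rather than creating an action.

```python
import os


def shadow_mode_enabled(environ=os.environ) -> bool:
    """Read HA_DIAG_SHADOW_MODE; unset means on (the documented default)."""
    value = environ.get("HA_DIAG_SHADOW_MODE", "true")
    return value.strip().lower() != "false"


def action_type_for(event_type: str, shadow: bool) -> str:
    """Map an HA diagnostic event type to the action type that gets queued."""
    if event_type == "ha_websocket_dead":
        # Only this event can lead to a restart, and only outside shadow mode.
        return "alert_only" if shadow else "container_restart"
    # All other diagnostic events are alert-only in both modes.
    return "alert_only"
```

With the default environment, `action_type_for("ha_websocket_dead", shadow_mode_enabled())` yields `"alert_only"`, which is the behaviour to expect during the observation window.
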
|
|
||||||
|
|
||||||
Watch for these signals in Telegram:
|
|
||||||
|
|
||||||
- `[SHADOW MODE] would have triggered container_restart for homeassistant` —
|
|
||||||
confirms the detection path works end-to-end
|
|
||||||
- `ha_entity_unavailable_long` / `ha_integration_failed` / etc. — these are
|
|
||||||
always `alert_only` regardless of shadow mode; verify descriptions look
|
|
||||||
accurate and thresholds are reasonable
|
|
||||||
|
|
||||||
Things to evaluate:
|
|
||||||
|
|
||||||
| Question | Good sign |
|
|
||||||
|----------|-----------|
|
|
||||||
| Are shadow alerts firing at reasonable frequency? | ≤ 1 per 30 min per node |
|
|
||||||
| Are there false positives? | No alerts during known-good uptime |
|
|
||||||
| Are entity-unavailable alerts describing real entities? | Yes, names match HA UI |
|
|
||||||
| Are integration-failed alerts genuine? | Yes, not noise from startup |
|
|
||||||
|
|
||||||
Note any false positives or noisy thresholds before enabling production mode.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Section 6: Enabling production mode
|
|
||||||
|
|
||||||
`HA_DIAG_SHADOW_MODE` is an environment variable read by the supervisor
|
|
||||||
container. The VPS supervisor env vars live in the version-controlled
|
|
||||||
override file at `hosts/vps/runtime/control-plane/docker-compose.override.yml`
|
|
||||||
(not in a runtime `.env` file — the supervisor has no `env_file:` directive).
|
|
||||||
|
|
||||||
When the 48h observation period looks clean:
|
|
||||||
|
|
||||||
**1. Edit the override file on SATURN:**
|
|
||||||
|
|
||||||
```yaml
|
|
||||||
# hosts/vps/runtime/control-plane/docker-compose.override.yml
|
|
||||||
services:
|
|
||||||
supervisor:
|
|
||||||
environment:
|
|
||||||
- NODE_ALIAS_MAP={"node-2":"chelsty"}
|
|
||||||
- HA_DIAG_SHADOW_MODE=false # add this line
|
|
||||||
```
|
|
||||||
|
|
||||||
**2. Commit and push from SATURN:**
|
|
||||||
|
|
||||||
```bash
|
|
||||||
git add hosts/vps/runtime/control-plane/docker-compose.override.yml
|
|
||||||
git commit -m "feat(control-plane): disable HA shadow mode — production ready"
|
|
||||||
git push
|
|
||||||
```
|
|
||||||
|
|
||||||
**3. Apply on VPS:**
|
|
||||||
|
|
||||||
```bash
|
|
||||||
ssh oskar@100.95.58.48
|
|
||||||
cd ~/homelab-codex-ws && git pull
|
|
||||||
docker compose \
|
|
||||||
-f services/control-plane/docker-compose.yml \
|
|
||||||
-f hosts/vps/runtime/control-plane/docker-compose.override.yml \
|
|
||||||
up -d supervisor
|
|
||||||
```
|
|
||||||
|
|
||||||
**4. Confirm:**
|
|
||||||
|
|
||||||
```bash
|
|
||||||
docker logs control-plane-supervisor --tail 5
|
|
||||||
# Expect: shadow_mode=False — HA container_restart actions enabled
|
|
||||||
```
|
|
||||||
|
|
||||||
From this point, the next `ha_websocket_dead` event will generate a
|
|
||||||
`container_restart` action in the approval queue. The 30-minute cooldown
|
|
||||||
still applies after each restart.
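
The cooldown rule mentioned above can be sketched as a simple time comparison. This is a minimal illustration only: `restart_allowed` and `RESTART_COOLDOWN_S` are hypothetical names, and the sketch assumes the cooldown is measured from the time of the previous restart, which this document does not spell out.

```python
from typing import Optional

# 30-minute cooldown quoted in the text above (hypothetical constant name).
RESTART_COOLDOWN_S = 30 * 60


def restart_allowed(last_restart_at: Optional[float], now: float) -> bool:
    """Return True when a new container_restart may be queued.

    Assumption: the cooldown window starts at the previous restart.
    """
    if last_restart_at is None:
        # No restart recorded yet, so nothing to cool down from.
        return True
    return (now - last_restart_at) >= RESTART_COOLDOWN_S
```

With this rule, a second `ha_websocket_dead` event ten minutes after a restart would not queue another restart, while one arriving thirty minutes later would.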
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Section 7: Rollback
|
|
||||||
|
|
||||||
If production mode causes unexpected behaviour:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
# Option A — re-enable shadow mode
|
|
||||||
# On SATURN: edit hosts/vps/runtime/control-plane/docker-compose.override.yml
|
|
||||||
# Set HA_DIAG_SHADOW_MODE=true (or remove the line — default is true)
|
|
||||||
# Commit, push, then on VPS:
|
|
||||||
ssh oskar@100.95.58.48
|
|
||||||
cd ~/homelab-codex-ws && git pull
|
|
||||||
docker compose \
|
|
||||||
-f services/control-plane/docker-compose.yml \
|
|
||||||
-f hosts/vps/runtime/control-plane/docker-compose.override.yml \
|
|
||||||
up -d supervisor
|
|
||||||
|
|
||||||
# Option B — stop ha-diag-agent entirely on affected nodes
|
|
||||||
ssh oskar@<node>
|
|
||||||
docker stop ha-diag-agent
|
|
||||||
|
|
||||||
# Events written before rollback remain in /opt/homelab/events/<node>/ as
# history only. The supervisor takes no further automated action on them,
# because their IDs are already recorded in _ha_processed_event_ids.
|
|
||||||
```
|
|
||||||
|
|
||||||
Any `container_restart` actions still in `pending/` after rollback can be
|
|
||||||
manually rejected via the Telegram bot or by deleting the action files from
|
|
||||||
`/opt/homelab/actions/pending/` on the VPS.
|
|
||||||
|
|
@ -1,13 +0,0 @@
|
||||||
FROM python:3.11-slim
|
|
||||||
|
|
||||||
WORKDIR /app
|
|
||||||
|
|
||||||
COPY pyproject.toml .
|
|
||||||
RUN mkdir -p src/ha_diag && touch src/ha_diag/__init__.py && \
|
|
||||||
pip install --no-cache-dir -e .
|
|
||||||
|
|
||||||
COPY src/ src/
|
|
||||||
|
|
||||||
ENV PYTHONUNBUFFERED=1
|
|
||||||
|
|
||||||
CMD ["python", "-m", "ha_diag.main"]
|
|
||||||
|
|
@ -1,131 +0,0 @@
|
||||||
# ha-diag-agent
|
|
||||||
|
|
||||||
Per-host Home Assistant diagnostic agent. Polls HA REST API on a schedule,
|
|
||||||
emits structured events to `/opt/homelab/events/<node>/`, and exposes an
|
|
||||||
HTTP API for health checks and manual check triggers.
|
|
||||||
|
|
||||||
Follows the same event-pipeline pattern as `node-agent`: filesystem-first,
|
|
||||||
no direct supervisor integration, events processed by the VPS observer.
|
|
||||||
|
|
||||||
## Architecture
|
|
||||||
|
|
||||||
```
|
|
||||||
APScheduler (interval-based REST checks)
|
|
||||||
├─ HeartbeatCheck → pings /api/, emits ha_websocket_dead on failure
|
|
||||||
├─ UnavailableEntitiesCheck → entity unavailable > threshold
|
|
||||||
├─ SystemHealthCheck → /api/system_health per-integration status
|
|
||||||
├─ AutomationFailuresCheck → automation last-run error traces
|
|
||||||
└─ UpdatesAvailableCheck → pending HA/integration updates
|
|
||||||
|
|
||||||
WebSocketMonitor (persistent, long-running — Phase 4b)
|
|
||||||
└─ Maintains a live WS subscription to state_changed events
|
|
||||||
Any traffic = HA is alive. Watchdog fires ha_websocket_dead on
|
|
||||||
silence > 5min or on disconnect. Emits ha_websocket_recovered
|
|
||||||
when the connection is restored after a dead alert.
|
|
||||||
|
|
||||||
FastAPI (port 8087)
|
|
||||||
GET /health → liveness probe (includes ws_connected field)
|
|
||||||
POST /trigger/<check> → run a named check on demand
|
|
||||||
|
|
||||||
SQLite (/data/ha_diag.db)
|
|
||||||
entity_baseline → last-known entity states
|
|
||||||
check_history → per-check run log
|
|
||||||
alerts_sent → dedup gate for alert events
|
|
||||||
```
|
|
||||||
|
|
||||||
The WebSocketMonitor is the only persistent-connection component; all other
|
|
||||||
checks are APScheduler intervals (stateless REST polls).
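
The watchdog behaviour described in the diagram (any traffic means HA is alive, silence beyond five minutes or a disconnect raises `ha_websocket_dead`, and a restored connection raises `ha_websocket_recovered`) could look roughly like the sketch below. It is not the agent's actual implementation: `SilenceWatchdog` and its method names are hypothetical, timestamps are passed in explicitly to keep the example testable, and the first message after a dead alert is assumed to count as "connection restored".

```python
from dataclasses import dataclass
from typing import Optional

SILENCE_LIMIT_S = 5 * 60  # "silence > 5min" from the architecture notes


@dataclass
class SilenceWatchdog:
    """Hypothetical helper that decides which WS event, if any, to emit."""

    last_traffic_at: float
    connected: bool = True
    dead_alerted: bool = False

    def on_message(self, now: float) -> Optional[str]:
        """Record traffic; report recovery if a dead alert was outstanding."""
        self.last_traffic_at = now
        self.connected = True
        if self.dead_alerted:
            self.dead_alerted = False
            return "ha_websocket_recovered"
        return None

    def on_disconnect(self) -> None:
        """Mark the connection as lost; the next poll() raises the alert."""
        self.connected = False

    def poll(self, now: float) -> Optional[str]:
        """Periodic check; emits ha_websocket_dead once per incident."""
        silent = (now - self.last_traffic_at) > SILENCE_LIMIT_S
        if (silent or not self.connected) and not self.dead_alerted:
            self.dead_alerted = True
            return "ha_websocket_dead"
        return None
```

Tracking `dead_alerted` keeps the sketch from emitting a dead event on every poll during one outage, and ensures a recovery event only follows an alert that was actually raised.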
|
|
||||||
|
|
||||||
## Event Types
|
|
||||||
|
|
||||||
| Type | Severity | Trigger |
|
|
||||||
|------|----------|---------|
|
|
||||||
| `ha_websocket_dead` | error | WS disconnect, silence > 5min, or /api/ unreachable |
|
|
||||||
| `ha_websocket_recovered` | info | WS reconnected after a dead alert (clears incident) |
|
|
||||||
| `ha_integration_failed` | error | Integration in error state |
|
|
||||||
| `ha_entity_unavailable_long` | warning | Entity unavailable > threshold |
|
|
||||||
| `ha_automation_failing` | warning | Automation last run errored |
|
|
||||||
| `ha_update_available` | info | HA or integration update pending |
|
|
||||||
| `ha_recorder_lag` | warning | Recorder write lag > threshold |
|
|
||||||
| `ha_system_health_degraded` | warning | System health check failed |
|
|
||||||
|
|
||||||
Event routing in supervisor (Phase 5) maps these to `notify` actions.
|
|
||||||
`ha_websocket_recovered` should be routed to clear any active `ha_websocket_dead` incident.
|
|
||||||
|
|
||||||
## First-time deployment
|
|
||||||
|
|
||||||
See **[DEPLOY.md](DEPLOY.md)** for the full procedure: HA token creation,
|
|
||||||
per-host `.env` config, deploy commands, verification steps, 48h shadow-mode
|
|
||||||
observation, and rollback.
|
|
||||||
|
|
||||||
**Shadow mode** (`HA_DIAG_SHADOW_MODE`, default `true` on the control-plane):
|
|
||||||
`ha_websocket_dead` events are downgraded to `alert_only` with a `[SHADOW MODE]`
|
|
||||||
note instead of queuing an automatic `container_restart`. Set to `false` in
|
|
||||||
`hosts/vps/runtime/control-plane/docker-compose.override.yml` when ready for live actions.
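
As a rough illustration of the shadow-mode rule, the sketch below maps an event type to the action the supervisor would plan. It is an assumption-laden sketch, not the supervisor's code: `parse_shadow_mode` and `plan_action` are hypothetical names, and the set of strings treated as "false" is a guess. Only the default of `true`, the `[SHADOW MODE]` note, and the `alert_only` / `container_restart` outcomes come from the documentation.

```python
from typing import Optional

# Hypothetical set of spellings treated as "disabled"; the real parser may differ.
_FALSE_VALUES = {"false", "0", "no", "off"}


def parse_shadow_mode(raw: Optional[str]) -> bool:
    """Interpret HA_DIAG_SHADOW_MODE; unset means shadow mode stays on."""
    if raw is None:
        return True
    return raw.strip().lower() not in _FALSE_VALUES


def plan_action(event_type: str, shadow_mode: bool) -> dict:
    """Return the planned action for an HA diagnostic event."""
    if event_type != "ha_websocket_dead":
        # Other HA events are alert-only regardless of shadow mode.
        return {"action": "alert_only", "note": ""}
    if shadow_mode:
        return {
            "action": "alert_only",
            "note": "[SHADOW MODE] would have triggered "
                    "container_restart for homeassistant",
        }
    return {"action": "container_restart", "note": ""}
```

For example, `plan_action("ha_websocket_dead", parse_shadow_mode(None))` yields an `alert_only` action carrying the `[SHADOW MODE]` note, because an unset variable keeps shadow mode enabled.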
|
|
||||||
|
|
||||||
## Deployment model
|
|
||||||
|
|
||||||
The agent is deployed **per-host** but targets a potentially remote HA instance:
|
|
||||||
|
|
||||||
| Node | Agent runs on | HA lives on | HA URL |
|
|
||||||
|------|--------------|-------------|--------|
|
|
||||||
| piha | piha | piha (localhost) | `http://localhost:8123` |
|
|
||||||
| chelsty-infra | chelsty-infra | chelsty-ha (HAOS VM, separate machine) | `http://100.70.180.90:8123` |
|
|
||||||
|
|
||||||
**chelsty-infra note:** Home Assistant runs on `chelsty-ha`, a dedicated Home Assistant
|
|
||||||
OS VM. `chelsty-infra` is the hypervisor but does not run HA itself. The agent on
|
|
||||||
`chelsty-infra` reaches HA over the Tailscale network (`100.70.180.90:8123`). If `chelsty-ha`
|
|
||||||
gets a new Tailscale IP, update `HA_URL` in `/opt/homelab/config/ha-diag-agent/.env` on
|
|
||||||
`chelsty-infra`.
|
|
||||||
|
|
||||||
## Deployment
|
|
||||||
|
|
||||||
```bash
|
|
||||||
# 1. Create config on target node
|
|
||||||
ssh oskar@<node-ip>
|
|
||||||
mkdir -p /opt/homelab/config/ha-diag-agent /var/lib/ha-diag-agent
|
|
||||||
cat > /opt/homelab/config/ha-diag-agent/.env << 'EOF'
|
|
||||||
HA_URL=http://homeassistant.local:8123 # or http://100.70.180.90:8123 for chelsty-infra
|
|
||||||
HA_TOKEN=<long-lived-token>
|
|
||||||
NODE_NAME=piha # or chelsty-infra
|
|
||||||
LOCATION_TAG=ken # or chelsty
|
|
||||||
CHECK_INTERVAL=60
|
|
||||||
EOF
|
|
||||||
|
|
||||||
# 2. Deploy
|
|
||||||
scripts/deploy/deploy.sh --service ha-diag-agent
|
|
||||||
|
|
||||||
# 3. Verify
|
|
||||||
docker ps --filter name=ha-diag-agent
|
|
||||||
curl http://localhost:8087/health
|
|
||||||
```
|
|
||||||
|
|
||||||
### chelsty-infra note
|
|
||||||
|
|
||||||
`chelsty-infra` runs docker-compose v1 (1.29.2). Use `docker-compose` (hyphenated):
|
|
||||||
```bash
|
|
||||||
docker-compose -f docker-compose.yml up -d --build
|
|
||||||
```
|
|
||||||
|
|
||||||
### HA long-lived token
|
|
||||||
|
|
||||||
In HA UI: Profile → Long-Lived Access Tokens → Create token.
|
|
||||||
|
|
||||||
## Running Tests
|
|
||||||
|
|
||||||
```bash
|
|
||||||
cd services/ha-diag-agent
|
|
||||||
pip install -e ".[dev]"
|
|
||||||
pytest tests/ -v
|
|
||||||
```
|
|
||||||
|
|
||||||
## Optional YAML config
|
|
||||||
|
|
||||||
Place `/opt/homelab/config/ha-diag-agent/ha-diag-agent.yaml` on the node.
|
|
||||||
Values there are defaults; env vars take priority.
|
|
||||||
|
|
||||||
```yaml
|
|
||||||
ha_url: http://homeassistant.local:8123
|
|
||||||
location_tag: ken
|
|
||||||
check_interval: 60
|
|
||||||
```
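
The precedence rule stated above (YAML values are defaults, env vars win) can be sketched as follows. This is only an illustration under assumptions: `resolve_setting` is a hypothetical helper, and the real agent lists `pydantic-settings` among its dependencies, so its actual loading and type coercion may well differ.

```python
import os
from typing import Any, Mapping, Optional


def resolve_setting(
    name: str,
    yaml_defaults: Mapping[str, Any],
    env: Optional[Mapping[str, str]] = None,
    fallback: Any = None,
) -> Any:
    """Resolve one setting: env var first, then YAML default, then fallback.

    YAML keys are lower-case (ha_url) and env vars upper-case (HA_URL),
    matching the examples in this README. Values from env are returned as
    strings; type coercion is deliberately left out of this sketch.
    """
    env = os.environ if env is None else env
    env_value = env.get(name.upper())
    if env_value is not None:
        return env_value
    return yaml_defaults.get(name.lower(), fallback)
```

With `check_interval: 60` in the YAML file and `CHECK_INTERVAL=30` in `.env`, this sketch returns `"30"`; with the env var absent it returns `60`.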
|
|
||||||
|
|
@ -1,30 +0,0 @@
|
||||||
services:
|
|
||||||
ha-diag-agent:
|
|
||||||
build: .
|
|
||||||
container_name: ha-diag-agent
|
|
||||||
restart: unless-stopped
|
|
||||||
|
|
||||||
env_file:
|
|
||||||
- /opt/homelab/config/ha-diag-agent/.env
|
|
||||||
|
|
||||||
ports:
|
|
||||||
- "8087:8087"
|
|
||||||
|
|
||||||
volumes:
|
|
||||||
# Events dir: host path includes node name; inside container always /events
|
|
||||||
- /opt/homelab/events/${NODE_NAME:-ha-diag}:/events
|
|
||||||
# SQLite baseline cache and check history
|
|
||||||
- /var/lib/ha-diag-agent:/data
|
|
||||||
# Optional YAML config (read-only)
|
|
||||||
- /opt/homelab/config/ha-diag-agent:/config:ro
|
|
||||||
|
|
||||||
healthcheck:
|
|
||||||
test:
|
|
||||||
- "CMD"
|
|
||||||
- "python"
|
|
||||||
- "-c"
|
|
||||||
- "import urllib.request; urllib.request.urlopen('http://localhost:8087/health', timeout=5)"
|
|
||||||
interval: 30s
|
|
||||||
timeout: 10s
|
|
||||||
retries: 3
|
|
||||||
start_period: 20s
|
|
||||||
|
|
@ -1,27 +0,0 @@
|
||||||
# ha-diag-agent environment variables
|
|
||||||
# Copy to /opt/homelab/config/ha-diag-agent/.env on the target node
|
|
||||||
|
|
||||||
# Home Assistant connection (required)
|
|
||||||
HA_URL=http://homeassistant.local:8123
|
|
||||||
HA_TOKEN=your-long-lived-token-here
|
|
||||||
HA_TIMEOUT=10.0
|
|
||||||
|
|
||||||
# Node identity
|
|
||||||
NODE_NAME=piha
|
|
||||||
LOCATION_TAG=ken
|
|
||||||
|
|
||||||
# Check intervals (seconds)
|
|
||||||
CHECK_INTERVAL=60 # heartbeat check
|
|
||||||
CHECK_INTERVAL_UNAVAILABLE=3600 # entity availability check (1h)
|
|
||||||
|
|
||||||
# Unavailable entities thresholds
|
|
||||||
UNAVAILABLE_THRESHOLD_HOURS=24 # alert after N hours unavailable
|
|
||||||
INTEGRATION_FAILURE_THRESHOLD_PCT=0.5 # fraction of integration entities
|
|
||||||
INTEGRATION_FAILURE_MIN_ENTITIES=3 # minimum count for integration event
|
|
||||||
ALERT_COOLDOWN_HOURS=6 # suppress re-alert within N hours
|
|
||||||
|
|
||||||
# API server
|
|
||||||
PORT=8087
|
|
||||||
|
|
||||||
# Logging: debug, info, warning, error
|
|
||||||
LOG_LEVEL=info
|
|
||||||
|
|
@ -1,12 +0,0 @@
|
||||||
#!/bin/sh
|
|
||||||
# Healthcheck: probe the FastAPI /health endpoint
|
|
||||||
set -e
|
|
||||||
PORT="${PORT:-8087}"
|
|
||||||
python -c "
|
|
||||||
import urllib.request, sys
|
|
||||||
try:
|
|
||||||
r = urllib.request.urlopen('http://localhost:${PORT}/health', timeout=5)
|
|
||||||
sys.exit(0 if r.status == 200 else 1)
|
|
||||||
except Exception:
|
|
||||||
sys.exit(1)
|
|
||||||
"
|
|
||||||
|
|
@ -1,36 +0,0 @@
|
||||||
[build-system]
|
|
||||||
requires = ["setuptools>=68"]
|
|
||||||
build-backend = "setuptools.build_meta"
|
|
||||||
|
|
||||||
[project]
|
|
||||||
name = "ha-diag-agent"
|
|
||||||
version = "0.1.0"
|
|
||||||
requires-python = ">=3.11"
|
|
||||||
dependencies = [
|
|
||||||
"aiohttp>=3.9",
|
|
||||||
"fastapi>=0.110",
|
|
||||||
"uvicorn[standard]>=0.29",
|
|
||||||
"pydantic>=2.6",
|
|
||||||
"pydantic-settings>=2.2",
|
|
||||||
"apscheduler>=3.10",
|
|
||||||
"aiosqlite>=0.20",
|
|
||||||
"structlog>=24.1",
|
|
||||||
"pyyaml>=6.0",
|
|
||||||
]
|
|
||||||
|
|
||||||
[project.optional-dependencies]
|
|
||||||
dev = [
|
|
||||||
"pytest>=8.1",
|
|
||||||
"pytest-asyncio>=0.23",
|
|
||||||
"aioresponses>=0.7",
|
|
||||||
]
|
|
||||||
|
|
||||||
[tool.setuptools.packages.find]
|
|
||||||
where = ["src"]
|
|
||||||
|
|
||||||
[tool.pytest.ini_options]
|
|
||||||
asyncio_mode = "auto"
|
|
||||||
testpaths = ["tests"]
|
|
||||||
markers = [
|
|
||||||
"integration: requires running HA instances — run with -m integration",
|
|
||||||
]
|
|
||||||
|
|
@ -1,42 +0,0 @@
|
||||||
service:
|
|
||||||
name: ha-diag-agent
|
|
||||||
# Deployed per-host: piha (site: ken) and chelsty-infra (site: chelsty)
|
|
||||||
owner_node: per-host
|
|
||||||
exposure: local-only
|
|
||||||
monitor: true
|
|
||||||
|
|
||||||
dependencies:
|
|
||||||
- homeassistant
|
|
||||||
|
|
||||||
ports:
|
|
||||||
- 8087
|
|
||||||
|
|
||||||
healthcheck:
|
|
||||||
type: http
|
|
||||||
path: /health
|
|
||||||
interval: 30s
|
|
||||||
timeout: 10s
|
|
||||||
retries: 3
|
|
||||||
start_period: 20s
|
|
||||||
|
|
||||||
restart_policy: unless-stopped
|
|
||||||
|
|
||||||
persistence:
|
|
||||||
paths:
|
|
||||||
- /opt/homelab/events
|
|
||||||
- /var/lib/ha-diag-agent
|
|
||||||
|
|
||||||
runtime:
|
|
||||||
env_vars:
|
|
||||||
- HA_TOKEN # long-lived HA access token (required)
|
|
||||||
- HA_URL # http://homeassistant.local:8123
|
|
||||||
- NODE_NAME # canonical node name: piha, chelsty-infra
|
|
||||||
- LOCATION_TAG # human site label: ken, chelsty
|
|
||||||
- CHECK_INTERVAL # heartbeat interval seconds (default: 60)
|
|
||||||
- CHECK_INTERVAL_UNAVAILABLE # entity check interval seconds (default: 3600)
|
|
||||||
- UNAVAILABLE_THRESHOLD_HOURS # alert threshold (default: 24)
|
|
||||||
- INTEGRATION_FAILURE_THRESHOLD_PCT # fraction threshold (default: 0.5)
|
|
||||||
- INTEGRATION_FAILURE_MIN_ENTITIES # min count for integration event (default: 3)
|
|
||||||
- ALERT_COOLDOWN_HOURS # re-alert suppression (default: 6)
|
|
||||||
- PORT # FastAPI port (default: 8087)
|
|
||||||
- LOG_LEVEL # default: info
|
|
||||||
|
|
@ -1,58 +0,0 @@
|
||||||
from __future__ import annotations
|
|
||||||
|
|
||||||
from typing import TYPE_CHECKING
|
|
||||||
|
|
||||||
from fastapi import FastAPI, HTTPException
|
|
||||||
|
|
||||||
if TYPE_CHECKING:
|
|
||||||
from .checks.base import Check
|
|
||||||
from .monitors.base import Monitor
|
|
||||||
|
|
||||||
app = FastAPI(title="ha-diag-agent", version="0.1.0")
|
|
||||||
|
|
||||||
# Populated by main.py during startup
|
|
||||||
_checks: dict[str, "Check"] = {}
|
|
||||||
_ws_monitor: "Monitor | None" = None
|
|
||||||
_node_name: str = "unknown"
|
|
||||||
_location_tag: str = "default"
|
|
||||||
|
|
||||||
|
|
||||||
def register_checks(checks: list["Check"], node_name: str, location_tag: str) -> None:
|
|
||||||
global _node_name, _location_tag
|
|
||||||
_checks.update({c.name: c for c in checks})
|
|
||||||
_node_name = node_name
|
|
||||||
_location_tag = location_tag
|
|
||||||
|
|
||||||
|
|
||||||
def register_ws_monitor(monitor: "Monitor") -> None:
|
|
||||||
global _ws_monitor
|
|
||||||
_ws_monitor = monitor
|
|
||||||
|
|
||||||
|
|
||||||
@app.get("/health")
|
|
||||||
async def health() -> dict:
|
|
||||||
response: dict = {
|
|
||||||
"status": "ok",
|
|
||||||
"node": _node_name,
|
|
||||||
"location_tag": _location_tag,
|
|
||||||
"checks": list(_checks.keys()),
|
|
||||||
}
|
|
||||||
if _ws_monitor is not None:
|
|
||||||
response["ws_connected"] = _ws_monitor.is_healthy
|
|
||||||
return response
|
|
||||||
|
|
||||||
|
|
||||||
@app.post("/trigger/{check_name}")
|
|
||||||
async def trigger(check_name: str) -> dict:
|
|
||||||
check = _checks.get(check_name)
|
|
||||||
if check is None:
|
|
||||||
raise HTTPException(status_code=404, detail=f"Unknown check: {check_name!r}")
|
|
||||||
    # Check.run() returns a list of CheckResult; an empty list means the check passed.
    results = await check.run()
    return {
        "check": check_name,
        "healthy": all(r.healthy for r in results),
        "results": [
            {
                "healthy": r.healthy,
                "event_type": r.event_type,
                "severity": r.severity,
                "message": r.message,
                "payload": r.payload,
            }
            for r in results
        ],
    }
|
|
||||||
|
|
@ -1,97 +0,0 @@
|
||||||
from __future__ import annotations
|
|
||||||
|
|
||||||
from typing import TYPE_CHECKING, Any
|
|
||||||
|
|
||||||
from ..ha_client import HAClient
|
|
||||||
from ..models import CheckResult, HAEventType, Severity
|
|
||||||
from ..storage import Storage
|
|
||||||
from .base import Check
|
|
||||||
|
|
||||||
if TYPE_CHECKING:
|
|
||||||
from ..config import Settings
|
|
||||||
|
|
||||||
|
|
||||||
class AutomationFailuresCheck(Check):
|
|
||||||
"""Detects automations with consecutive run failures.
|
|
||||||
|
|
||||||
For each enabled automation (state="on"), fetches the last N run traces.
|
|
||||||
When all N most-recent traces indicate failure, emits ha_automation_failing
|
|
||||||
at most once per automation per alert_cooldown_hours (default: 6).
|
|
||||||
"""
|
|
||||||
|
|
||||||
name = "automation_failures"
|
|
||||||
|
|
||||||
def __init__(
|
|
||||||
self,
|
|
||||||
ha_client: HAClient,
|
|
||||||
storage: Storage,
|
|
||||||
settings: "Settings",
|
|
||||||
) -> None:
|
|
||||||
self._client = ha_client
|
|
||||||
self._storage = storage
|
|
||||||
self._settings = settings
|
|
||||||
|
|
||||||
async def run(self) -> list[CheckResult]:
|
|
||||||
try:
|
|
||||||
all_states = await self._client.get_states()
|
|
||||||
except Exception:
|
|
||||||
return []
|
|
||||||
|
|
||||||
automations = [
|
|
||||||
s for s in all_states
|
|
||||||
if s["entity_id"].startswith("automation.") and s["state"] == "on"
|
|
||||||
]
|
|
||||||
|
|
||||||
results: list[CheckResult] = []
|
|
||||||
cooldown_s = self._settings.alert_cooldown_hours * 3600
|
|
||||||
threshold = self._settings.automation_failure_threshold
|
|
||||||
|
|
||||||
for auto_state in automations:
|
|
||||||
eid = auto_state["entity_id"]
|
|
||||||
try:
|
|
||||||
traces = await self._client.get_automation_traces(eid)
|
|
||||||
except Exception:
|
|
||||||
continue
|
|
||||||
|
|
||||||
if not traces or len(traces) < threshold:
|
|
||||||
continue
|
|
||||||
|
|
||||||
recent = traces[:threshold]
|
|
||||||
failures = [t for t in recent if _is_trace_failure(t)]
|
|
||||||
if len(failures) < threshold:
|
|
||||||
continue
|
|
||||||
|
|
||||||
alert_key = f"automation_failing:{eid}"
|
|
||||||
if await self._storage.was_alert_sent(alert_key, cooldown_s):
|
|
||||||
continue
|
|
||||||
|
|
||||||
attrs = auto_state.get("attributes", {})
|
|
||||||
friendly_name = attrs.get("friendly_name", eid)
|
|
||||||
last_failures = [
|
|
||||||
{"timestamp": t.get("timestamp"), "error": t.get("error", "")}
|
|
||||||
for t in failures
|
|
||||||
]
|
|
||||||
|
|
||||||
results.append(CheckResult(
|
|
||||||
healthy=False,
|
|
||||||
event_type=HAEventType.ha_automation_failing,
|
|
||||||
severity=Severity.warning,
|
|
||||||
message=(
|
|
||||||
f"Automation '{friendly_name}' failed "
|
|
||||||
f"{len(failures)} consecutive time(s)"
|
|
||||||
),
|
|
||||||
payload={
|
|
||||||
"entity_id": eid,
|
|
||||||
"friendly_name": friendly_name,
|
|
||||||
"last_failures": last_failures,
|
|
||||||
"total_recent_failures": len(failures),
|
|
||||||
},
|
|
||||||
))
|
|
||||||
await self._storage.mark_alert_sent(alert_key)
|
|
||||||
|
|
||||||
return results
|
|
||||||
|
|
||||||
|
|
||||||
def _is_trace_failure(trace: dict[str, Any]) -> bool:
|
|
||||||
"""A trace is a failure if it has a non-empty error or an explicit failed state."""
|
|
||||||
return bool(trace.get("error")) or trace.get("state") == "failed"
|
|
||||||
|
|
@ -1,20 +0,0 @@
|
||||||
from __future__ import annotations
|
|
||||||
|
|
||||||
from abc import ABC, abstractmethod
|
|
||||||
|
|
||||||
from ..models import CheckResult
|
|
||||||
|
|
||||||
|
|
||||||
class Check(ABC):
|
|
||||||
"""Base class for all HA diagnostic checks."""
|
|
||||||
|
|
||||||
name: str # unique slug used in /trigger/<name> and check_history
|
|
||||||
|
|
||||||
@abstractmethod
|
|
||||||
async def run(self) -> list[CheckResult]:
|
|
||||||
"""Execute the check and return results.
|
|
||||||
|
|
||||||
Empty list means the check passed cleanly.
|
|
||||||
Each CheckResult with event_type set causes an event to be emitted.
|
|
||||||
The caller (runner in main.py) handles emission and history recording.
|
|
||||||
"""
|
|
||||||
|
|
@ -1,38 +0,0 @@
|
||||||
from __future__ import annotations
|
|
||||||
|
|
||||||
from ..ha_client import HAClient
|
|
||||||
from ..models import CheckResult, HAEventType, Severity
|
|
||||||
from .base import Check
|
|
||||||
|
|
||||||
|
|
||||||
class HeartbeatCheck(Check):
|
|
||||||
"""Pings HA /api/ to verify the REST API is reachable.
|
|
||||||
|
|
||||||
Validates the end-to-end pipeline: shared HAClient → check → event emitter.
|
|
||||||
"""
|
|
||||||
|
|
||||||
name = "heartbeat"
|
|
||||||
|
|
||||||
def __init__(self, ha_client: HAClient) -> None:
|
|
||||||
self._client = ha_client
|
|
||||||
|
|
||||||
async def run(self) -> list[CheckResult]:
|
|
||||||
try:
|
|
||||||
data = await self._client.get_api_status()
|
|
||||||
if isinstance(data, dict) and "message" in data:
|
|
||||||
return []
|
|
||||||
return [CheckResult(
|
|
||||||
healthy=False,
|
|
||||||
event_type=HAEventType.ha_websocket_dead,
|
|
||||||
severity=Severity.error,
|
|
||||||
message=f"HA API returned unexpected response: {data!r}",
|
|
||||||
payload={"response": str(data)},
|
|
||||||
)]
|
|
||||||
except Exception as exc:
|
|
||||||
return [CheckResult(
|
|
||||||
healthy=False,
|
|
||||||
event_type=HAEventType.ha_websocket_dead,
|
|
||||||
severity=Severity.error,
|
|
||||||
message=f"HA API unreachable: {exc}",
|
|
||||||
payload={"error": str(exc)},
|
|
||||||
)]
|
|
||||||
|
|
@ -1,110 +0,0 @@
|
||||||
from __future__ import annotations
|
|
||||||
|
|
||||||
import json
|
|
||||||
from typing import TYPE_CHECKING, Any
|
|
||||||
|
|
||||||
from ..ha_client import HAClient
|
|
||||||
from ..models import CheckResult, HAEventType, Severity
|
|
||||||
from ..storage import Storage
|
|
||||||
from .base import Check
|
|
||||||
|
|
||||||
if TYPE_CHECKING:
|
|
||||||
from ..config import Settings
|
|
||||||
|
|
||||||
|
|
||||||
class SystemHealthCheck(Check):
|
|
||||||
"""Detects newly-failing HA integrations via /api/system_health.
|
|
||||||
|
|
||||||
Logic per run:
|
|
||||||
1. Fetch /api/system_health and parse per-component statuses.
|
|
||||||
2. Diff against stored snapshots in system_health_snapshot.
|
|
||||||
3. Emit ha_system_health_degraded on ok → error transitions, and when a component is first seen already in error.
|
|
||||||
4. Clear alerts_sent on error → ok recovery (next degradation re-alerts).
|
|
||||||
5. Update all component snapshots.
|
|
||||||
|
|
||||||
API errors (HA unreachable) return no results; HeartbeatCheck handles
|
|
||||||
HA reachability separately.
|
|
||||||
"""
|
|
||||||
|
|
||||||
name = "system_health"
|
|
||||||
|
|
||||||
def __init__(
|
|
||||||
self,
|
|
||||||
ha_client: HAClient,
|
|
||||||
storage: Storage,
|
|
||||||
settings: "Settings",
|
|
||||||
) -> None:
|
|
||||||
self._client = ha_client
|
|
||||||
self._storage = storage
|
|
||||||
self._settings = settings
|
|
||||||
|
|
||||||
async def run(self) -> list[CheckResult]:
|
|
||||||
try:
|
|
||||||
health_data = await self._client.get_system_health()
|
|
||||||
except Exception:
|
|
||||||
return []
|
|
||||||
|
|
||||||
statuses = _extract_component_statuses(health_data)
|
|
||||||
results: list[CheckResult] = []
|
|
||||||
|
|
||||||
for component, info in statuses.items():
|
|
||||||
status = info["status"]
|
|
||||||
details = info.get("details", {})
|
|
||||||
prev = await self._storage.get_system_health_snapshot(component)
|
|
||||||
|
|
||||||
if status == "error":
|
|
||||||
if prev is None or prev["last_status"] == "ok":
|
|
||||||
results.append(CheckResult(
|
|
||||||
healthy=False,
|
|
||||||
event_type=HAEventType.ha_system_health_degraded,
|
|
||||||
severity=Severity.warning,
|
|
||||||
message=f"HA component '{component}' is degraded",
|
|
||||||
payload={
|
|
||||||
"component": component,
|
|
||||||
"previous_status": prev["last_status"] if prev else "unknown",
|
|
||||||
"current_status": "error",
|
|
||||||
"details": details,
|
|
||||||
},
|
|
||||||
))
|
|
||||||
elif status == "ok" and prev and prev["last_status"] == "error":
|
|
||||||
await self._storage.clear_alert(f"system_health:{component}")
|
|
||||||
|
|
||||||
await self._storage.upsert_system_health_snapshot(
|
|
||||||
component, status, json.dumps(details, default=str)
|
|
||||||
)
|
|
||||||
|
|
||||||
return results
|
|
||||||
|
|
||||||
|
|
||||||
def _extract_component_statuses(
|
|
||||||
health_data: dict[str, Any],
|
|
||||||
) -> dict[str, dict[str, Any]]:
|
|
||||||
"""Parse HA /api/system_health into {component: {status, details}}.
|
|
||||||
|
|
||||||
Handles multiple HA response shapes:
|
|
||||||
- Typed: {component: {"type": "result"|"error", "data": {...}}}
|
|
||||||
- Legacy: {component: {"error": "msg"}} or {component: {plain_data}}
|
|
||||||
- Nested: {"checks": {component: {...}}, "info": {...}}
|
|
||||||
"""
|
|
||||||
checks = health_data.get("checks", health_data)
|
|
||||||
if not isinstance(checks, dict):
|
|
||||||
return {}
|
|
||||||
|
|
||||||
result: dict[str, dict[str, Any]] = {}
|
|
||||||
for component, value in checks.items():
|
|
||||||
if not isinstance(value, dict):
|
|
||||||
continue
|
|
||||||
|
|
||||||
if value.get("type") == "error" or value.get("error"):
|
|
||||||
result[component] = {
|
|
||||||
"status": "error",
|
|
||||||
"details": {"error": str(value.get("error") or value.get("type", "error"))},
|
|
||||||
}
|
|
||||||
else:
|
|
||||||
inner = value.get("data", value)
|
|
||||||
result[component] = {
|
|
||||||
"status": "ok",
|
|
||||||
"details": inner if isinstance(inner, dict) else value,
|
|
||||||
}
|
|
||||||
|
|
||||||
return result
|
|
||||||
|
|
@ -1,266 +0,0 @@
|
||||||
from __future__ import annotations
|
|
||||||
|
|
||||||
import time
|
|
||||||
from datetime import datetime, timezone
|
|
||||||
from typing import TYPE_CHECKING, Any
|
|
||||||
|
|
||||||
from ..ha_client import HAClient
|
|
||||||
from ..models import CheckResult, HAEventType, Severity
|
|
||||||
from ..storage import Storage
|
|
||||||
from .base import Check
|
|
||||||
|
|
||||||
if TYPE_CHECKING:
|
|
||||||
from ..config import Settings
|
|
||||||
|
|
||||||
_BAD_STATES = frozenset({"unavailable", "unknown"})
|
|
||||||
|
|
||||||
|
|
||||||
def _parse_last_changed_ts(value: str | None) -> float | None:
|
|
||||||
"""Parse HA last_changed ISO string → Unix timestamp.
|
|
||||||
|
|
||||||
Returns None on missing or malformed input so callers can fall back
|
|
||||||
to the baseline first_seen without special-casing.
|
|
||||||
"""
|
|
||||||
if not value:
|
|
||||||
return None
|
|
||||||
try:
|
|
||||||
return datetime.fromisoformat(value).timestamp()
|
|
||||||
except (ValueError, TypeError):
|
|
||||||
return None
|
|
||||||
|
|
||||||
|
|
||||||
class UnavailableEntitiesCheck(Check):
|
|
||||||
"""Detects entities stuck in unavailable/unknown state.
|
|
||||||
|
|
||||||
Logic:
|
|
||||||
1. Fetch all entity states from HA.
|
|
||||||
2. Maintain SQLite baseline: INSERT OR IGNORE to preserve first-seen timestamp.
|
|
||||||
3. Handle recoveries: clear baseline + alert dedup for entities back online.
|
|
||||||
4. Alert on entities unavailable > unavailable_threshold_hours.
|
|
||||||
5. Root-cause grouping: if >= integration_failure_threshold_pct of an
|
|
||||||
integration's entities are unavailable (and count >= min_entities), emit
|
|
||||||
ha_integration_failed instead of N individual ha_entity_unavailable_long
|
|
||||||
events.
|
|
||||||
6. Alert dedup: skip re-emitting the same alert within alert_cooldown_hours.
|
|
||||||
"""
|
|
||||||
|
|
||||||
name = "unavailable_entities"
|
|
||||||
|
|
||||||
def __init__(
|
|
||||||
self,
|
|
||||||
ha_client: HAClient,
|
|
||||||
storage: Storage,
|
|
||||||
settings: "Settings",
|
|
||||||
) -> None:
|
|
||||||
self._client = ha_client
|
|
||||||
self._storage = storage
|
|
||||||
self._settings = settings
|
|
||||||
|
|
||||||
# ------------------------------------------------------------------
|
|
||||||
# Public entry point
|
|
||||||
# ------------------------------------------------------------------
|
|
||||||
|
|
||||||
async def run(self) -> list[CheckResult]:
|
|
||||||
now = time.time()
|
|
||||||
|
|
||||||
try:
|
|
||||||
all_states = await self._client.get_states()
|
|
||||||
except Exception as exc:
|
|
||||||
return [CheckResult(
|
|
||||||
healthy=False,
|
|
||||||
event_type=HAEventType.ha_websocket_dead,
|
|
||||||
severity=Severity.error,
|
|
||||||
message=f"Failed to fetch entity states: {exc}",
|
|
||||||
payload={"error": str(exc)},
|
|
||||||
)]
|
|
||||||
|
|
||||||
integration_map, area_map = await self._load_registry()
|
|
||||||
|
|
||||||
unavailable: dict[str, dict[str, Any]] = {
|
|
||||||
s["entity_id"]: s for s in all_states if s["state"] in _BAD_STATES
|
|
||||||
}
|
|
||||||
available_ids: set[str] = {
|
|
||||||
s["entity_id"] for s in all_states if s["state"] not in _BAD_STATES
|
|
||||||
}
|
|
||||||
|
|
||||||
# Handle recoveries first
|
|
||||||
tracked = await self._storage.get_all_tracked_entity_ids()
|
|
||||||
for eid in tracked:
|
|
||||||
if eid in available_ids:
|
|
||||||
await self._handle_recovery(eid)
|
|
||||||
|
|
||||||
# Record new/continuing unavailable entities (INSERT OR IGNORE preserves timestamp)
|
|
||||||
for eid, state_data in unavailable.items():
|
|
||||||
await self._storage.set_entity_unavailable_since(
|
|
||||||
eid, state_data["state"], now
|
|
||||||
)
|
|
||||||
|
|
||||||
# Determine which entities have exceeded the alert threshold
|
|
||||||
to_alert: list[dict[str, Any]] = []
|
|
||||||
cooldown_s = self._settings.alert_cooldown_hours * 3600
|
|
||||||
threshold_h = self._settings.unavailable_threshold_hours
|
|
||||||
|
|
||||||
for eid, state_data in unavailable.items():
|
|
||||||
first_at = await self._storage.get_entity_first_unavailable_at(eid)
|
|
||||||
if first_at is None:
|
|
||||||
continue
|
|
||||||
# Phase 3 Flag #1: if HA reports an earlier last_changed (entity was
|
|
||||||
# already unavailable before the agent started/reconnected), use that
|
|
||||||
# as the authoritative "since" so duration is accurate.
|
|
||||||
last_changed_ts = _parse_last_changed_ts(state_data.get("last_changed"))
|
|
||||||
effective_since = (
|
|
||||||
min(last_changed_ts, first_at)
|
|
||||||
if last_changed_ts is not None
|
|
||||||
else first_at
|
|
||||||
)
|
|
||||||
duration_h = (now - effective_since) / 3600
|
|
||||||
if duration_h < threshold_h:
|
|
||||||
continue
|
|
||||||
alert_key = f"entity_unavailable:{eid}"
|
|
||||||
if await self._storage.was_alert_sent(alert_key, cooldown_s):
|
|
||||||
continue
|
|
||||||
to_alert.append({
|
|
||||||
"entity_id": eid,
|
|
||||||
"state": state_data["state"],
|
|
||||||
"first_at": effective_since,
|
|
||||||
"duration_h": duration_h,
|
|
||||||
"domain": eid.split(".")[0],
|
|
||||||
"integration": integration_map.get(eid),
|
|
||||||
"area_id": area_map.get(eid),
|
|
||||||
})
|
|
||||||
|
|
||||||
if not to_alert:
|
|
||||||
return []
|
|
||||||
|
|
||||||
return await self._build_results(to_alert, all_states, integration_map)
|
|
||||||
|
|
||||||
    # ------------------------------------------------------------------
    # Internal helpers
    # ------------------------------------------------------------------

    async def _load_registry(
        self,
    ) -> tuple[dict[str, str], dict[str, str]]:
        """Fetch entity registry; return (integration_map, area_map).

        Falls back to empty dicts when the endpoint is unavailable.
        """
        try:
            registry = await self._client.get_entity_registry()
            integration_map = {
                e["entity_id"]: e.get("platform") or ""
                for e in registry
                if "entity_id" in e
            }
            area_map = {
                e["entity_id"]: e.get("area_id") or ""
                for e in registry
                if "entity_id" in e
            }
            return integration_map, area_map
        except Exception:
            return {}, {}

    async def _handle_recovery(self, entity_id: str) -> None:
        await self._storage.clear_entity_unavailable(entity_id)
        # Clear dedup so the next unavailability triggers an alert immediately
        await self._storage.clear_alert(f"entity_unavailable:{entity_id}")

    async def _build_results(
        self,
        to_alert: list[dict[str, Any]],
        all_states: list[dict[str, Any]],
        integration_map: dict[str, str],
    ) -> list[CheckResult]:
        results: list[CheckResult] = []
        handled: set[str] = set()

        # Build per-integration stats across ALL entities (not just to_alert)
        total_per_integ: dict[str, int] = {}
        unav_per_integ: dict[str, list[str]] = {}

        for state in all_states:
            eid = state["entity_id"]
            integ = integration_map.get(eid)
            if not integ:
                continue
            total_per_integ[integ] = total_per_integ.get(integ, 0) + 1
            if state["state"] in _BAD_STATES:
                unav_per_integ.setdefault(integ, []).append(eid)

        min_ent = self._settings.integration_failure_min_entities
        threshold_pct = self._settings.integration_failure_threshold_pct
        cooldown_s = self._settings.alert_cooldown_hours * 3600

        # Integration-level events
        for integ, unav_ids in unav_per_integ.items():
            total = total_per_integ.get(integ, 0)
            pct = len(unav_ids) / total if total else 0

            alerted_from_integ = [e for e in to_alert if e["integration"] == integ]
            if not alerted_from_integ:
                continue
            if pct < threshold_pct or len(unav_ids) < min_ent:
                continue

            alert_key = f"integration_failed:{integ}"
            if await self._storage.was_alert_sent(alert_key, cooldown_s):
                handled.update(e["entity_id"] for e in alerted_from_integ)
                continue

            results.append(CheckResult(
                healthy=False,
                event_type=HAEventType.ha_integration_failed,
                severity=Severity.error,
                message=(
                    f"Integration '{integ}' appears down: "
                    f"{len(unav_ids)}/{total} entities unavailable"
                ),
                payload={
                    "integration": integ,
                    "affected_entities": unav_ids,
                    "unavailable_count": len(unav_ids),
                    "total_count": total,
                    "unavailable_pct": round(pct, 2),
                },
            ))
            await self._storage.mark_alert_sent(alert_key)
            handled.update(e["entity_id"] for e in alerted_from_integ)

        # Per-entity events for entities not covered by an integration event
        for entity in to_alert:
            eid = entity["entity_id"]
            if eid in handled:
                continue

            since_iso = (
                datetime.fromtimestamp(entity["first_at"], tz=timezone.utc)
                .isoformat()
                .replace("+00:00", "Z")
            )

            payload: dict[str, Any] = {
                "entity_id": eid,
                "state": entity["state"],
                "since": since_iso,
                "duration_hours": round(entity["duration_h"], 1),
                "domain": entity["domain"],
            }
            if entity["integration"]:
                payload["integration"] = entity["integration"]
            if entity["area_id"]:
                payload["area"] = entity["area_id"]

            results.append(CheckResult(
                healthy=False,
                event_type=HAEventType.ha_entity_unavailable_long,
                severity=Severity.warning,
                message=(
                    f"Entity {eid} unavailable for "
                    f"{entity['duration_h']:.1f}h"
                ),
                payload=payload,
            ))
            await self._storage.mark_alert_sent(f"entity_unavailable:{eid}")

        return results
||||||
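The integration-level gate in `_build_results` above can be read as a small pure function. The following is a minimal sketch, assuming the same two settings (a fraction threshold and a minimum entity count); the function name `integration_failed` and its defaults are hypothetical and only mirror the defaults visible in `Settings`.

```python
def integration_failed(
    unavailable: int,
    total: int,
    threshold_pct: float = 0.5,  # hypothetical default, mirrors Settings
    min_entities: int = 3,       # hypothetical default, mirrors Settings
) -> bool:
    """Return True when an integration would be reported as failed.

    Both conditions must hold: the unavailable share reaches the threshold
    AND the absolute number of unavailable entities reaches the minimum.
    """
    pct = unavailable / total if total else 0.0
    return pct >= threshold_pct and unavailable >= min_entities


if __name__ == "__main__":
    print(integration_failed(3, 4))   # 75% and 3 entities
    print(integration_failed(2, 2))   # 100% but only 2 entities
    print(integration_failed(3, 10))  # 3 entities but only 30%
```

The minimum-count condition keeps a small integration with one or two flaky entities from being escalated to an integration-level event; those entities still get per-entity events.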
|
|
@ -1,123 +0,0 @@
|
||||||
from __future__ import annotations

from datetime import datetime
from typing import TYPE_CHECKING, Any

from ..ha_client import HAClient
from ..models import CheckResult, HAEventType, Severity
from ..storage import Storage
from .base import Check

if TYPE_CHECKING:
    from ..config import Settings

_MAX_RELEASE_NOTES = 2000


class UpdatesAvailableCheck(Check):
    """Detects available HA core/add-on updates via update.* entities.

    Runs daily. Emits one ha_update_available event per update entity whose
    7-day dedup window has expired. Falls back gracefully when HA is down.
    """

    name = "updates_available"

    def __init__(
        self,
        ha_client: HAClient,
        storage: Storage,
        settings: "Settings",
    ) -> None:
        self._client = ha_client
        self._storage = storage
        self._settings = settings

    async def run(self) -> list[CheckResult]:
        updates = await self._fetch_active_updates()
        if not updates:
            return []

        results: list[CheckResult] = []
        cooldown_s = self._settings.updates_cooldown_days * 86400

        for state in updates:
            eid = state["entity_id"]
            alert_key = f"update_available:{eid}"
            if await self._storage.was_alert_sent(alert_key, cooldown_s):
                continue
            attrs = state.get("attributes", {})
            results.append(CheckResult(
                healthy=False,
                event_type=HAEventType.ha_update_available,
                severity=Severity.info,
                message=(
                    f"Update available: {attrs.get('title', eid)} "
                    f"{attrs.get('installed_version', '?')} → "
                    f"{attrs.get('latest_version', '?')}"
                ),
                payload=_build_update_payload(eid, attrs),
            ))
            await self._storage.mark_alert_sent(alert_key)

        return results

    async def _fetch_active_updates(self) -> list[dict[str, Any]]:
        try:
            all_states = await self._client.get_states()
        except Exception:
            return []
        return [
            s for s in all_states
            if s["entity_id"].startswith("update.") and s["state"] == "on"
        ]


class UpdatesDigestCheck(UpdatesAvailableCheck):
    """Weekly Sunday digest: single event listing all pending updates.

    Deduped per ISO week (won't re-fire if triggered multiple times on the
    same Sunday, e.g. manual + scheduled).
    """

    name = "updates_digest"

    async def run(self) -> list[CheckResult]:
        updates = await self._fetch_active_updates()
        if not updates:
            return []

        week_key = datetime.now().strftime("%G-W%V")
        alert_key = f"update_digest:{week_key}"
        if await self._storage.was_alert_sent(alert_key, 6 * 86400):
            return []

        all_payloads = [
            _build_update_payload(s["entity_id"], s.get("attributes", {}))
            for s in updates
        ]
        await self._storage.mark_alert_sent(alert_key)
        return [CheckResult(
            healthy=False,
            event_type=HAEventType.ha_update_available,
            severity=Severity.info,
            message=f"Weekly digest: {len(all_payloads)} update(s) available",
            payload={"digest": True, "updates": all_payloads, "count": len(all_payloads)},
        )]


def _build_update_payload(entity_id: str, attrs: dict[str, Any]) -> dict[str, Any]:
    payload: dict[str, Any] = {
        "entity_id": entity_id,
        "title": attrs.get("title", entity_id),
        "installed_version": attrs.get("installed_version"),
        "latest_version": attrs.get("latest_version"),
        "in_progress": attrs.get("in_progress", False),
        "auto_update": attrs.get("auto_update", False),
    }
    if attrs.get("release_url"):
        payload["release_url"] = attrs["release_url"]
    summary = attrs.get("release_summary")
    if summary:
        payload["release_summary"] = summary[:_MAX_RELEASE_NOTES]
    return payload
||||||
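The weekly digest above dedups on an ISO week key built with `strftime("%G-W%V")`. As a minimal sketch of what that key looks like near a year boundary, the same string can be derived from `datetime.isocalendar()`; the helper name `iso_week_key` is hypothetical and not part of the agent.

```python
from datetime import datetime


def iso_week_key(moment: datetime) -> str:
    """Build an ISO-8601 week key such as '2025-W01'.

    The ISO year can differ from the calendar year in the first and last
    days of a year, which is why the ISO year is used rather than
    moment.year.
    """
    iso_year, iso_week, _weekday = moment.isocalendar()
    return f"{iso_year}-W{iso_week:02d}"


if __name__ == "__main__":
    # Monday 30 Dec 2024 already belongs to ISO week 1 of 2025.
    print(iso_week_key(datetime(2024, 12, 30)))
    # Sunday 3 Jan 2021 still belongs to ISO week 53 of 2020.
    print(iso_week_key(datetime(2021, 1, 3)))
```

Because an ISO week runs Monday to Sunday, every trigger on the same Sunday maps to the same key, which is what the per-week dedup relies on.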
|
|
@ -1,83 +0,0 @@
|
||||||
from __future__ import annotations

import os
from pathlib import Path

import yaml
from pydantic import field_validator
from pydantic_settings import BaseSettings

_CONFIG_YAML = Path("/config/ha-diag-agent.yaml")


class Settings(BaseSettings):
    # HA connection
    ha_url: str = "http://homeassistant.local:8123"
    ha_token: str = ""
    ha_timeout: float = 10.0

    # Node identity
    node_name: str = "unknown"
    location_tag: str = "default"

    # Intervals (seconds)
    check_interval: int = 60  # heartbeat check interval
    check_interval_unavailable: int = 3600  # unavailable entities check interval

    # Unavailable entities check thresholds
    unavailable_threshold_hours: float = 24.0  # alert after N hours unavailable
    integration_failure_threshold_pct: float = 0.5  # fraction (0-1) of integration entities unavailable
    integration_failure_min_entities: int = 3  # min count to trigger integration event
    alert_cooldown_hours: float = 6.0  # don't re-alert same entity within N hours

    # Phase 3 Flag #3: entity registry cache TTL
    entity_registry_cache_ttl: int = 300  # seconds

    # SystemHealthCheck
    system_health_check_interval: int = 900  # 15 min

    # AutomationFailuresCheck
    automation_check_interval: int = 1800  # 30 min
    automation_failure_threshold: int = 3  # consecutive failures before alert

    # UpdatesAvailableCheck
    updates_check_hour: int = 9
    updates_check_minute: int = 0
    updates_cooldown_days: int = 7  # don't re-alert same update within N days

    # WebSocket monitor
    websocket_enabled: bool = True
    websocket_silence_threshold_seconds: int = 300  # 5 min
    websocket_watchdog_interval_seconds: int = 30
    websocket_reconnect_initial_delay: float = 1.0
    websocket_reconnect_max_delay: float = 60.0
    websocket_reconnect_jitter: float = 0.2  # ±20% of delay
    websocket_down_alert_repeat_minutes: int = 10

    # API server
    port: int = 8087
    log_level: str = "info"

    # Runtime paths (inside container)
    events_dir: Path = Path("/events")
    data_dir: Path = Path("/data")

    model_config = {"extra": "ignore", "case_sensitive": False}

    @field_validator("ha_url")
    @classmethod
    def strip_trailing_slash(cls, v: str) -> str:
        return v.rstrip("/")

    @classmethod
    def load(cls) -> "Settings":
        """Load settings: YAML file provides defaults; env vars override."""
        if _CONFIG_YAML.exists():
            try:
                with _CONFIG_YAML.open() as f:
                    data = yaml.safe_load(f) or {}
                for k, v in data.items():
                    os.environ.setdefault(k.upper(), str(v))
            except Exception:
                pass
        return cls()
||||||
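`Settings.load()` above gets "YAML provides defaults, env vars override" from `os.environ.setdefault`, which only writes a key that is not already set. A minimal sketch of that precedence, using plain dicts instead of the real environment and YAML file; the function name `apply_file_defaults` is hypothetical.

```python
def apply_file_defaults(file_values: dict, environ: dict) -> dict:
    """Copy file values into environ without overriding existing keys.

    Keys are upper-cased and values stringified, as in Settings.load().
    setdefault() leaves a key untouched when it is already present, so a
    value that was set in the environment wins over the file.
    """
    for key, value in file_values.items():
        environ.setdefault(key.upper(), str(value))
    return environ


if __name__ == "__main__":
    env = {"PORT": "9000"}  # pretend this was exported by the operator
    apply_file_defaults({"port": 8087, "log_level": "debug"}, env)
    print(env)
```

One consequence of stringifying every value is that nested YAML structures would arrive as their `str()` form, so this approach suits flat scalar settings like the ones in `Settings`.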
|
|
@ -1,61 +0,0 @@
|
||||||
from __future__ import annotations

import json
import re
import time
from datetime import datetime, timezone
from pathlib import Path
from typing import Any

from .models import EventRecord


class EventEmitter:
    """Writes atomic JSON event files to the events directory."""

    def __init__(
        self, events_dir: Path, node_name: str, location_tag: str = ""
    ) -> None:
        self._events_dir = events_dir
        self._node_name = node_name
        self._location_tag = location_tag
        self._seq = 0
        events_dir.mkdir(parents=True, exist_ok=True)

    def _make_id(self, event_type: str, service: str) -> str:
        # Sequence suffix guarantees uniqueness even when multiple events of the
        # same type are emitted within the same second (ts has 1 s resolution).
        self._seq += 1
        ts = int(time.time())
        svc_slug = re.sub(r"[^a-z0-9]", "-", (service or "ha").lower())[:32].strip("-")
        return f"evt-{self._node_name}-{ts}-{event_type}-{svc_slug}-{self._seq}"

    def emit(
        self,
        event_type: str,
        severity: str,
        service: str,
        message: str,
        payload: dict[str, Any] | None = None,
    ) -> str:
        event_id = self._make_id(event_type, service)
        merged: dict[str, Any] = {}
        if self._location_tag:
            merged["location_tag"] = self._location_tag
        merged.update(payload or {})
        record = EventRecord(
            id=event_id,
            timestamp=int(time.time()),
            date=datetime.now(timezone.utc).isoformat(),
            type=event_type,
            severity=severity,
            node=self._node_name,
            service=service,
            message=message,
            payload=merged,
        )
        path = self._events_dir / f"{event_id}.json"
        tmp = path.with_suffix(".tmp")
        tmp.write_text(json.dumps(record.model_dump(), indent=2))
        tmp.rename(path)
        return event_id
||||||
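`EventEmitter.emit()` above writes to a `.tmp` sibling and then renames it, so a reader of the events directory never sees a half-written `.json` file. A minimal standalone sketch of that write-then-rename pattern follows; it swaps in `os.replace` (which also overwrites an existing target on Windows) where the original uses `Path.rename`, and the helper name `write_json_atomically` is hypothetical.

```python
import json
import os
import tempfile
from pathlib import Path


def write_json_atomically(path: Path, record: dict) -> None:
    """Write record to path so that readers see either nothing or all of it.

    The data goes to a temporary sibling first; the final name appears only
    once the content is complete. The rename is atomic when both names are
    on the same filesystem.
    """
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(record, indent=2))
    os.replace(tmp, path)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        target = Path(d) / "evt-example.json"
        write_json_atomically(target, {"type": "demo", "payload": {}})
        print(sorted(p.name for p in Path(d).iterdir()))
```

A consumer that only globs `*.json` can therefore ignore `.tmp` files entirely.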
|
|
@ -1,104 +0,0 @@
|
||||||
from __future__ import annotations

import time
from typing import Any

import aiohttp


def make_session(token: str, timeout: float = 10.0) -> aiohttp.ClientSession:
    """Create a pre-configured ClientSession for use with HAClient."""
    return aiohttp.ClientSession(
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        timeout=aiohttp.ClientTimeout(total=timeout),
    )


class HAClient:
    """Async Home Assistant REST API client.

    Session lifecycle is managed externally: the caller creates the session
    via make_session() at startup and closes it on shutdown. HAClient is a
    session-borrower: it never opens or closes the session it receives.
    """

    def __init__(
        self,
        base_url: str,
        session: aiohttp.ClientSession,
        entity_registry_cache_ttl: float = 300.0,
    ) -> None:
        self._base_url = base_url.rstrip("/")
        self._session = session
        self._registry_cache_ttl = entity_registry_cache_ttl
        self._registry_cache: list[dict[str, Any]] | None = None
        self._registry_fetched_at: float = 0.0

    async def get_api_status(self) -> dict[str, Any]:
        """GET /api/ - returns {"message": "API running."} when HA is up."""
        async with self._session.get(f"{self._base_url}/api/") as resp:
            resp.raise_for_status()
            return await resp.json()

    async def get_states(self) -> list[dict[str, Any]]:
        """GET /api/states - full entity state list."""
        async with self._session.get(f"{self._base_url}/api/states") as resp:
            resp.raise_for_status()
            return await resp.json()

    async def get_system_health(self) -> dict[str, Any]:
        """GET /api/system_health - per-integration health summary."""
        async with self._session.get(f"{self._base_url}/api/system_health") as resp:
            resp.raise_for_status()
            return await resp.json()

    async def get_config(self) -> dict[str, Any]:
        """GET /api/config - HA configuration including version."""
        async with self._session.get(f"{self._base_url}/api/config") as resp:
            resp.raise_for_status()
            return await resp.json()

    async def get_entity_registry(self) -> list[dict[str, Any]]:
        """GET /api/config/entity_registry - entity registry entries.

        Each entry includes entity_id, platform (integration name), area_id,
        config_entry_id, and other metadata.

        Result is cached in-process for entity_registry_cache_ttl seconds to
        avoid hammering HA on every check cycle (Phase 3 Flag #3).
        """
        now = time.monotonic()
        if (
            self._registry_cache is not None
            and (now - self._registry_fetched_at) < self._registry_cache_ttl
        ):
            return self._registry_cache
        async with self._session.get(
            f"{self._base_url}/api/config/entity_registry"
        ) as resp:
            resp.raise_for_status()
            result = await resp.json()
        self._registry_cache = result
        self._registry_fetched_at = now
        return result

    def invalidate_registry_cache(self) -> None:
        """Force the next get_entity_registry() call to fetch fresh data."""
        self._registry_cache = None
        self._registry_fetched_at = 0.0

    async def get_automation_traces(self, automation_id: str) -> list[dict[str, Any]]:
        """GET /api/trace/automation/<id> - last run traces for an automation."""
        url = f"{self._base_url}/api/trace/automation/{automation_id}"
        async with self._session.get(url) as resp:
            resp.raise_for_status()
            return await resp.json()

    async def get_error_log(self) -> str:
        """GET /api/error_log - plaintext error log."""
        async with self._session.get(f"{self._base_url}/api/error_log") as resp:
            resp.raise_for_status()
            return await resp.text()
||||||
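The registry cache in `get_entity_registry()` above is a single-value TTL cache keyed on `time.monotonic()`. A minimal synchronous sketch of the same idea follows, with an injectable clock so the expiry can be exercised without sleeping; the class name `TTLCache` and its `clock` parameter are hypothetical and not part of `HAClient`.

```python
import time
from typing import Any, Callable


class TTLCache:
    """Cache one value and refetch it once it is older than ttl seconds."""

    def __init__(self, ttl: float, clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl
        self._clock = clock  # monotonic by default: immune to wall-clock jumps
        self._value: Any = None
        self._fetched_at = 0.0

    def get(self, fetch: Callable[[], Any]) -> Any:
        now = self._clock()
        if self._value is not None and (now - self._fetched_at) < self._ttl:
            return self._value  # still fresh
        self._value = fetch()
        self._fetched_at = now
        return self._value

    def invalidate(self) -> None:
        """Force the next get() to call fetch again."""
        self._value = None
        self._fetched_at = 0.0


if __name__ == "__main__":
    cache = TTLCache(ttl=300)
    print(cache.get(lambda: ["light.kitchen"]))
    print(cache.get(lambda: ["never fetched while the first value is fresh"]))
```

Using a monotonic clock rather than `time.time()` means an NTP correction or manual clock change cannot make the cached value look fresher or staler than it is.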
|
|
@ -1,204 +0,0 @@
|
||||||
from __future__ import annotations

import asyncio
import json
import logging
import time
from datetime import datetime

import structlog
import uvicorn
from apscheduler.schedulers.asyncio import AsyncIOScheduler

from .api import app, register_checks, register_ws_monitor
from .checks.automation_failures import AutomationFailuresCheck
from .checks.heartbeat import HeartbeatCheck
from .checks.system_health import SystemHealthCheck
from .checks.unavailable_entities import UnavailableEntitiesCheck
from .checks.updates_available import UpdatesAvailableCheck, UpdatesDigestCheck
from .config import Settings
from .event_emitter import EventEmitter
from .ha_client import HAClient, make_session
from .monitors import WebSocketMonitor
from .storage import Storage

_log = structlog.get_logger()


def _configure_structlog(log_level: str) -> None:
    structlog.configure(
        processors=[
            structlog.processors.add_log_level,
            structlog.processors.TimeStamper(fmt="iso"),
            structlog.processors.StackInfoRenderer(),
            structlog.processors.format_exc_info,
            structlog.processors.JSONRenderer(),
        ],
        logger_factory=structlog.PrintLoggerFactory(),
    )
    logging.basicConfig(level=getattr(logging, log_level.upper(), logging.INFO))


async def _run_check_and_emit(
    check, emitter: EventEmitter, storage: Storage
) -> None:
    """Run a check, emit events for each result, and record to check_history."""
    try:
        results = await check.run()
        healthy = not any(r.event_type for r in results)
        summary = f"{len(results)} issue(s)" if results else "ok"

        await storage.record_check(
            check_name=check.name,
            ran_at=time.time(),
            healthy=healthy,
            message=summary,
            payload=json.dumps([r.model_dump() for r in results]),
        )

        for result in results:
            if result.event_type:
                emitter.emit(
                    event_type=result.event_type,
                    severity=result.severity.value,
                    service="homeassistant",
                    message=result.message,
                    payload=result.payload,
                )
                _log.warning(
                    "check_unhealthy",
                    check=check.name,
                    # "event" is the name of structlog's positional message
                    # argument; passing it again as a keyword raises TypeError,
                    # so the event type is logged under its own key.
                    event_type=result.event_type,
                    msg=result.message,
                )

        if healthy:
            _log.info("check_ok", check=check.name)

    except Exception as exc:
        _log.error("check_error", check=check.name, error=str(exc), exc_info=True)


async def run(settings: Settings) -> None:
    _configure_structlog(settings.log_level)
    _log.info(
        "ha_diag_agent_starting",
        node=settings.node_name,
        location=settings.location_tag,
        ha_url=settings.ha_url,
        heartbeat_interval=settings.check_interval,
        unavailable_interval=settings.check_interval_unavailable,
    )

    storage = Storage(settings.data_dir / "ha_diag.db")
    await storage.open()

    emitter = EventEmitter(settings.events_dir, settings.node_name, settings.location_tag)

    # Shared session: created once at startup, closed on shutdown
    session = make_session(settings.ha_token, settings.ha_timeout)
    ha_client = HAClient(
        settings.ha_url, session,
        entity_registry_cache_ttl=settings.entity_registry_cache_ttl,
    )

    heartbeat = HeartbeatCheck(ha_client)
    unavailable = UnavailableEntitiesCheck(ha_client, storage, settings)
    system_health = SystemHealthCheck(ha_client, storage, settings)
    automation_failures = AutomationFailuresCheck(ha_client, storage, settings)
    updates_daily = UpdatesAvailableCheck(ha_client, storage, settings)
    updates_digest = UpdatesDigestCheck(ha_client, storage, settings)

    all_checks = [heartbeat, unavailable, system_health, automation_failures,
                  updates_daily, updates_digest]
    register_checks(all_checks, settings.node_name, settings.location_tag)

    ws_monitor = WebSocketMonitor(
        ha_url=settings.ha_url,
        token=settings.ha_token,
        settings=settings,
        emitter=emitter,
        session=session,
    )
    register_ws_monitor(ws_monitor)

    scheduler = AsyncIOScheduler()
    scheduler.add_job(
        _run_check_and_emit, "interval",
        seconds=settings.check_interval,
        args=[heartbeat, emitter, storage],
        id="check_heartbeat",
        next_run_time=datetime.now(),
    )
    scheduler.add_job(
        _run_check_and_emit, "interval",
        seconds=settings.check_interval_unavailable,
        args=[unavailable, emitter, storage],
        id="check_unavailable_entities",
        next_run_time=datetime.now(),
    )
    scheduler.add_job(
        _run_check_and_emit, "interval",
        seconds=settings.system_health_check_interval,
        args=[system_health, emitter, storage],
        id="check_system_health",
        next_run_time=datetime.now(),
    )
    scheduler.add_job(
        _run_check_and_emit, "interval",
        seconds=settings.automation_check_interval,
        args=[automation_failures, emitter, storage],
        id="check_automation_failures",
        next_run_time=datetime.now(),
    )
    scheduler.add_job(
        _run_check_and_emit, "cron",
        hour=settings.updates_check_hour,
        minute=settings.updates_check_minute,
        args=[updates_daily, emitter, storage],
        id="check_updates_available",
    )
    scheduler.add_job(
        _run_check_and_emit, "cron",
        day_of_week="sun",
        hour=settings.updates_check_hour,
        minute=settings.updates_check_minute,
        args=[updates_digest, emitter, storage],
        id="check_updates_digest",
    )
    scheduler.start()
    _log.info(
        "scheduler_started",
        checks=[c.name for c in all_checks],
        heartbeat_interval=settings.check_interval,
        unavailable_interval=settings.check_interval_unavailable,
        system_health_interval=settings.system_health_check_interval,
        automation_interval=settings.automation_check_interval,
        updates_hour=settings.updates_check_hour,
    )

    await ws_monitor.start()

    config = uvicorn.Config(
        app,
        host="0.0.0.0",
        port=settings.port,
        log_level=settings.log_level.lower(),
    )
    server = uvicorn.Server(config)
    try:
        await server.serve()
    finally:
        await ws_monitor.stop()
        scheduler.shutdown(wait=False)
        await storage.close()
        await session.close()


def main() -> None:
    settings = Settings.load()
    asyncio.run(run(settings))


if __name__ == "__main__":
    main()
|
|
@ -1,43 +0,0 @@
|
||||||
from __future__ import annotations

from enum import Enum
from typing import Any

from pydantic import BaseModel


class Severity(str, Enum):
    info = "info"
    warning = "warning"
    error = "error"


class HAEventType(str, Enum):
    ha_integration_failed = "ha_integration_failed"
    ha_entity_unavailable_long = "ha_entity_unavailable_long"
    ha_websocket_dead = "ha_websocket_dead"
    ha_websocket_recovered = "ha_websocket_recovered"
    ha_automation_failing = "ha_automation_failing"
    ha_update_available = "ha_update_available"
    ha_recorder_lag = "ha_recorder_lag"
    ha_system_health_degraded = "ha_system_health_degraded"


class EventRecord(BaseModel):
    id: str
    timestamp: int
    date: str
    type: str
    severity: str
    node: str
    service: str
    message: str
    payload: dict[str, Any] = {}


class CheckResult(BaseModel):
    healthy: bool
    event_type: str | None = None  # None means no event to emit
    severity: Severity = Severity.info
    message: str = ""
    payload: dict[str, Any] = {}
|
|
@ -1,4 +0,0 @@
|
||||||
from .base import Monitor
from .websocket_monitor import WebSocketMonitor

__all__ = ["Monitor", "WebSocketMonitor"]
|
|
@ -1,24 +0,0 @@
|
||||||
from __future__ import annotations

from abc import ABC, abstractmethod


class Monitor(ABC):
    """Base class for long-running background monitors.

    Unlike checks (one-shot, APScheduler-driven), monitors maintain
    persistent state: connections, subscriptions, background tasks.
    """

    @abstractmethod
    async def start(self) -> None:
        """Spawn background task(s). Idempotent if already started."""

    @abstractmethod
    async def stop(self) -> None:
        """Cancel background tasks and wait for cleanup."""

    @property
    @abstractmethod
    def is_healthy(self) -> bool:
        """True when the monitor is running and its connection is live."""
|
|
@ -1,286 +0,0 @@
|
||||||
from __future__ import annotations

import asyncio
import json
import random
import time
from datetime import datetime, timezone

import aiohttp
import structlog

from ..config import Settings
from ..event_emitter import EventEmitter
from ..models import HAEventType, Severity
from .base import Monitor

_log = structlog.get_logger().bind(monitor="websocket")


class _AuthError(Exception):
    """Raised when HA returns auth_invalid during the WS handshake."""


def _make_ws_url(ha_url: str) -> str:
    if ha_url.startswith("https://"):
        base = ha_url.replace("https://", "wss://", 1)
    else:
        base = ha_url.replace("http://", "ws://", 1)
    return base.rstrip("/") + "/api/websocket"


class WebSocketMonitor(Monitor):
    """Persistent WebSocket connection to HA for real-time liveness monitoring.

    Subscribes to state_changed events — any traffic proves HA is alive.
    The watchdog fires ha_websocket_dead when the connection is silent for
    longer than silence_threshold, or immediately on disconnect.
    ha_websocket_recovered is emitted when the connection is restored after
    a dead alert was sent (allows supervisor to clear active incidents).
    """

    def __init__(
        self,
        ha_url: str,
        token: str,
        settings: Settings,
        emitter: EventEmitter,
        session: aiohttp.ClientSession,
    ) -> None:
        self._ws_url = _make_ws_url(ha_url)
        self._token = token
        self._settings = settings
        self._emitter = emitter
        self._session = session

        self._state: str = "disconnected"
        self._last_event_monotonic: float = time.monotonic()
        # 0.0 means no ha_websocket_dead has been emitted yet (for this session)
        self._last_dead_alert_at: float = 0.0

        self._stopping = False
        self._msg_id = 0
        self._main_task: asyncio.Task | None = None
        self._watchdog_task: asyncio.Task | None = None

    # ------------------------------------------------------------------
    # Monitor ABC
    # ------------------------------------------------------------------

    async def start(self) -> None:
        if not self._settings.websocket_enabled:
            _log.info("ws_monitor_disabled")
            return
        self._stopping = False
        self._last_event_monotonic = time.monotonic()
        self._main_task = asyncio.create_task(
            self._connection_loop(), name="ws_connection_loop"
        )
        self._watchdog_task = asyncio.create_task(
            self._watchdog_loop(), name="ws_watchdog"
        )
        _log.info("ws_monitor_started", ws_url=self._ws_url)

    async def stop(self) -> None:
        self._stopping = True
        self._state = "stopped"
        tasks = [t for t in [self._main_task, self._watchdog_task] if t is not None]
        for t in tasks:
            t.cancel()
        if tasks:
            await asyncio.gather(*tasks, return_exceptions=True)
        self._main_task = None
        self._watchdog_task = None
        _log.info("ws_monitor_stopped")

    @property
    def is_healthy(self) -> bool:
        if not self._settings.websocket_enabled:
            return True  # disabled monitors are not unhealthy
        return self._state == "subscribed"

    # ------------------------------------------------------------------
    # Connection loop — reconnects with exponential back-off
    # ------------------------------------------------------------------

    async def _connection_loop(self) -> None:
        delay = float(self._settings.websocket_reconnect_initial_delay)
        while not self._stopping:
            self._state = "connecting"
            clean_close = False
            try:
                await self._connect_and_listen()
                clean_close = True
                delay = float(self._settings.websocket_reconnect_initial_delay)
            except asyncio.CancelledError:
                raise
            except _AuthError as exc:
                _log.error("ws_auth_failed", error=str(exc))
                # Auth failures won't self-heal on fast retry — jump to max delay
                delay = float(self._settings.websocket_reconnect_max_delay)
            except Exception as exc:
                _log.warning("ws_connect_error", error=str(exc))

            self._state = "disconnected"
            if not self._stopping:
                self._on_disconnected()

            if self._stopping:
                break

            if clean_close:
                wait = 1.0  # brief pause before reconnecting after a clean HA close
            else:
                jitter_range = delay * self._settings.websocket_reconnect_jitter
                wait = max(0.1, delay + random.uniform(-jitter_range, jitter_range))
                delay = min(delay * 2, float(self._settings.websocket_reconnect_max_delay))

            _log.debug("ws_reconnect_wait", seconds=round(wait, 2))
            await asyncio.sleep(wait)

    # ------------------------------------------------------------------
    # Connect, auth, subscribe, receive
    # ------------------------------------------------------------------

    async def _connect_and_listen(self) -> None:
        # Override the session-level timeout: WS must stay open indefinitely,
        # only the initial TCP connect should be bounded.
        ws_timeout = aiohttp.ClientTimeout(total=None, connect=10.0, sock_connect=10.0)
        async with self._session.ws_connect(
            self._ws_url,
            timeout=ws_timeout,
            heartbeat=30.0,
        ) as ws:
            self._state = "authenticating"

            # Receive auth_required
            try:
                msg = await asyncio.wait_for(ws.receive_json(), timeout=10.0)
            except (asyncio.TimeoutError, TypeError, json.JSONDecodeError) as exc:
                raise ConnectionError(f"Failed to receive auth_required: {exc}") from exc

            if msg.get("type") != "auth_required":
                raise ConnectionError(
                    f"Unexpected initial message type: {msg.get('type')!r}"
                )

            await ws.send_json({"type": "auth", "access_token": self._token})

            # Receive auth_ok or auth_invalid
            try:
                msg = await asyncio.wait_for(ws.receive_json(), timeout=10.0)
            except (asyncio.TimeoutError, TypeError, json.JSONDecodeError) as exc:
                raise ConnectionError(f"Failed to receive auth response: {exc}") from exc

            if msg.get("type") == "auth_invalid":
                raise _AuthError(msg.get("message", "auth_invalid"))
            if msg.get("type") != "auth_ok":
                raise ConnectionError(
                    f"Unexpected auth response type: {msg.get('type')!r}"
                )

            # Subscribe to state_changed events
            self._msg_id += 1
            await ws.send_json({
                "id": self._msg_id,
                "type": "subscribe_events",
                "event_type": "state_changed",
            })

            # Mark connected — capture prior dead state before resetting
            prev_dead_at = self._last_dead_alert_at
            self._state = "subscribed"
            self._last_event_monotonic = time.monotonic()

            # Emit recovery if this reconnect follows a dead alert
            if prev_dead_at > 0.0:
                self._last_dead_alert_at = 0.0
                self._emit_recovered()

            _log.info("ws_subscribed", ws_url=self._ws_url)

            # Receive loop — any TEXT message proves HA is alive
            async for raw in ws:
                if self._stopping:
                    break
                if raw.type == aiohttp.WSMsgType.TEXT:
                    self._last_event_monotonic = time.monotonic()
                elif raw.type in (aiohttp.WSMsgType.ERROR, aiohttp.WSMsgType.CLOSE):
                    _log.warning("ws_closed_by_server", msg_type=raw.type.name)
                    break

    # ------------------------------------------------------------------
    # Watchdog loop — detects silence while the WS appears connected
    # ------------------------------------------------------------------

    async def _watchdog_loop(self) -> None:
        while not self._stopping:
            try:
                await asyncio.sleep(self._settings.websocket_watchdog_interval_seconds)
            except asyncio.CancelledError:
                raise

            if self._state != "subscribed":
                continue  # disconnects are handled by the connection loop

            now = time.monotonic()
            silent_secs = now - self._last_event_monotonic
            if silent_secs <= self._settings.websocket_silence_threshold_seconds:
                continue

            cooldown = self._settings.websocket_down_alert_repeat_minutes * 60
            if self._last_dead_alert_at == 0.0 or (now - self._last_dead_alert_at) >= cooldown:
                self._emitter.emit(
                    event_type=HAEventType.ha_websocket_dead.value,
                    severity=Severity.error.value,
                    service="homeassistant",
                    message=(
                        f"HA WebSocket silent for {silent_secs:.0f}s — no events received"
                    ),
                    payload=self._dead_payload(silent_secs),
                )
                self._last_dead_alert_at = now
                _log.warning("ws_silent_dead_emitted", silent_seconds=round(silent_secs))

    # ------------------------------------------------------------------
    # Helpers
    # ------------------------------------------------------------------

    def _on_disconnected(self) -> None:
        """Emit ha_websocket_dead on connection loss, respecting cooldown."""
        if self._stopping:
            return
        now = time.monotonic()
        cooldown = self._settings.websocket_down_alert_repeat_minutes * 60
        if self._last_dead_alert_at == 0.0 or (now - self._last_dead_alert_at) >= cooldown:
            silent_secs = now - self._last_event_monotonic
            self._emitter.emit(
                event_type=HAEventType.ha_websocket_dead.value,
                severity=Severity.error.value,
                service="homeassistant",
                message=f"HA WebSocket disconnected — silent for {silent_secs:.0f}s",
                payload=self._dead_payload(silent_secs),
            )
            self._last_dead_alert_at = now
            _log.warning("ws_dead_emitted", silent_seconds=round(silent_secs))

    def _emit_recovered(self) -> None:
        self._emitter.emit(
            event_type=HAEventType.ha_websocket_recovered.value,
            severity=Severity.info.value,
            service="homeassistant",
            message="HA WebSocket reconnected and receiving events",
            payload={"connection_state": "subscribed"},
        )
        _log.info("ws_recovered_emitted")

    def _dead_payload(self, silent_secs: float) -> dict:
        event_age = time.monotonic() - self._last_event_monotonic
        last_event_wall = time.time() - event_age
        return {
            "silent_seconds": round(silent_secs),
            "last_event_at": datetime.fromtimestamp(
                last_event_wall, tz=timezone.utc
            ).isoformat(),
            "connection_state": self._state,
        }
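The reconnect schedule in `_connection_loop` above is easier to reason about when the wait computation is pulled out on its own. The sketch below isolates only that arithmetic (doubling, cap, jitter band) as a pure function. `backoff_waits` and its parameter names are hypothetical and are not part of the agent; the real loop also resets the delay after a clean close and jumps to the maximum delay on an auth failure, which this sketch leaves out.

```python
import random


def backoff_waits(
    initial: float,
    maximum: float,
    jitter: float,
    attempts: int,
    rng: random.Random,
) -> list[float]:
    """Return the waits for `attempts` consecutive failed connects.

    Hypothetical helper mirroring the else-branch of the reconnect loop:
    each wait is the current delay plus or minus `jitter * delay`, floored
    at 0.1 s, and the delay doubles up to `maximum` after every failure.
    """
    delay = initial
    waits: list[float] = []
    for _ in range(attempts):
        jitter_range = delay * jitter
        waits.append(max(0.1, delay + rng.uniform(-jitter_range, jitter_range)))
        delay = min(delay * 2, maximum)
    return waits


if __name__ == "__main__":
    # With jitter disabled the schedule is the plain doubling sequence.
    print(backoff_waits(1.0, 60.0, 0.0, 8, random.Random(0)))
```

With a non-zero jitter fraction, each wait stays within that fraction of its nominal delay, which keeps several agents from reconnecting in lockstep after a shared outage.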
@ -1,227 +0,0 @@
from __future__ import annotations

import time
from pathlib import Path
from typing import Any

import aiosqlite

_SCHEMA = """
CREATE TABLE IF NOT EXISTS system_health_snapshot (
    component TEXT PRIMARY KEY,
    last_status TEXT NOT NULL,
    last_seen_at REAL NOT NULL,
    payload TEXT NOT NULL DEFAULT '{}'
);

CREATE TABLE IF NOT EXISTS entity_baseline (
    entity_id TEXT PRIMARY KEY,
    -- state when entity first entered unavailable/unknown
    state TEXT NOT NULL,
    -- timestamp when the entity FIRST entered its current bad state (INSERT OR IGNORE)
    first_seen REAL NOT NULL,
    -- kept for legacy compat; not used by UnavailableEntitiesCheck
    attributes TEXT NOT NULL DEFAULT '{}',
    updated_at REAL NOT NULL
);

CREATE TABLE IF NOT EXISTS check_history (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    check_name TEXT NOT NULL,
    ran_at REAL NOT NULL,
    healthy INTEGER NOT NULL,
    message TEXT NOT NULL DEFAULT '',
    payload TEXT NOT NULL DEFAULT '{}'
);

CREATE TABLE IF NOT EXISTS alerts_sent (
    alert_key TEXT PRIMARY KEY,
    sent_at REAL NOT NULL
);
"""

_MIGRATE_ENTITY_BASELINE = """
ALTER TABLE entity_baseline ADD COLUMN first_seen REAL NOT NULL DEFAULT 0;
"""


class Storage:
    def __init__(self, db_path: Path) -> None:
        self._db_path = db_path
        self._db: aiosqlite.Connection | None = None

    async def open(self) -> None:
        self._db_path.parent.mkdir(parents=True, exist_ok=True)
        self._db = await aiosqlite.connect(self._db_path)
        self._db.row_factory = aiosqlite.Row
        await self._db.executescript(_SCHEMA)
        # Add first_seen column to existing databases that pre-date Phase 3
        try:
            await self._db.execute(_MIGRATE_ENTITY_BASELINE)
        except aiosqlite.OperationalError as exc:
            # Only "column already exists" is expected here; re-raise anything
            # else (locked database, I/O error) instead of hiding it.
            if "duplicate column name" not in str(exc):
                raise
        await self._db.commit()

    async def close(self) -> None:
        if self._db:
            await self._db.close()
            self._db = None

    def _conn(self) -> aiosqlite.Connection:
        if self._db is None:
            raise RuntimeError("Storage not open — call await storage.open() first")
        return self._db

    # ------------------------------------------------------------------
    # entity_baseline — tracks entities currently in bad state
    # ------------------------------------------------------------------

    async def set_entity_unavailable_since(
        self, entity_id: str, state: str, first_seen: float
    ) -> None:
        """Record when an entity first entered unavailable/unknown state.

        INSERT OR IGNORE: if the entity is already tracked, preserves the
        original first_seen timestamp so duration is computed correctly.
        """
        await self._conn().execute(
            """
            INSERT OR IGNORE INTO entity_baseline
                (entity_id, state, first_seen, attributes, updated_at)
            VALUES (?, ?, ?, '{}', ?)
            """,
            (entity_id, state, first_seen, first_seen),
        )
        await self._conn().commit()

    async def get_entity_first_unavailable_at(self, entity_id: str) -> float | None:
        """Return when the entity first entered its bad state, or None if not tracked."""
        async with self._conn().execute(
            "SELECT first_seen FROM entity_baseline WHERE entity_id = ?",
            (entity_id,),
        ) as cur:
            row = await cur.fetchone()
        return float(row["first_seen"]) if row else None

    async def clear_entity_unavailable(self, entity_id: str) -> None:
        """Remove entity from unavailable tracking (entity has recovered)."""
        await self._conn().execute(
            "DELETE FROM entity_baseline WHERE entity_id = ?",
            (entity_id,),
        )
        await self._conn().commit()

    async def get_all_tracked_entity_ids(self) -> list[str]:
        """Return all entity IDs currently tracked as unavailable/unknown."""
        async with self._conn().execute(
            "SELECT entity_id FROM entity_baseline"
        ) as cur:
            rows = await cur.fetchall()
        return [r["entity_id"] for r in rows]

    # Legacy upsert — kept for backwards compat with existing callers
    async def upsert_entity_baseline(
        self, entity_id: str, state: str, attributes: str, updated_at: float
    ) -> None:
        await self._conn().execute(
            """
            INSERT INTO entity_baseline (entity_id, state, first_seen, attributes, updated_at)
            VALUES (?, ?, ?, ?, ?)
            ON CONFLICT(entity_id) DO UPDATE SET
                state = excluded.state,
                attributes = excluded.attributes,
                updated_at = excluded.updated_at
            """,
            (entity_id, state, updated_at, attributes, updated_at),
        )
        await self._conn().commit()

    async def get_entity_baseline(self, entity_id: str) -> dict[str, Any] | None:
        async with self._conn().execute(
            "SELECT * FROM entity_baseline WHERE entity_id = ?", (entity_id,)
        ) as cur:
            row = await cur.fetchone()
        return dict(row) if row else None

    # ------------------------------------------------------------------
    # check_history
    # ------------------------------------------------------------------

    async def record_check(
        self,
        check_name: str,
        ran_at: float,
        healthy: bool,
        message: str,
        payload: str,
    ) -> None:
        await self._conn().execute(
            """
            INSERT INTO check_history (check_name, ran_at, healthy, message, payload)
            VALUES (?, ?, ?, ?, ?)
            """,
            (check_name, ran_at, int(healthy), message, payload),
        )
        await self._conn().commit()

    # ------------------------------------------------------------------
    # alerts_sent (dedup gate)
    # ------------------------------------------------------------------

    async def was_alert_sent(self, alert_key: str, within_seconds: float) -> bool:
        cutoff = time.time() - within_seconds
        async with self._conn().execute(
            "SELECT sent_at FROM alerts_sent WHERE alert_key = ? AND sent_at > ?",
            (alert_key, cutoff),
        ) as cur:
            return (await cur.fetchone()) is not None

    async def mark_alert_sent(self, alert_key: str) -> None:
        await self._conn().execute(
            """
            INSERT INTO alerts_sent (alert_key, sent_at) VALUES (?, ?)
            ON CONFLICT(alert_key) DO UPDATE SET sent_at = excluded.sent_at
            """,
            (alert_key, time.time()),
        )
        await self._conn().commit()

    async def clear_alert(self, alert_key: str) -> None:
        """Delete an alert record so the next occurrence triggers immediately."""
        await self._conn().execute(
            "DELETE FROM alerts_sent WHERE alert_key = ?", (alert_key,)
        )
        await self._conn().commit()

    # ------------------------------------------------------------------
    # system_health_snapshot — tracks last-known per-component status
    # ------------------------------------------------------------------

    async def get_system_health_snapshot(
        self, component: str
    ) -> dict[str, Any] | None:
        """Return the stored snapshot for a component, or None if unseen."""
        async with self._conn().execute(
            "SELECT * FROM system_health_snapshot WHERE component = ?",
            (component,),
        ) as cur:
            row = await cur.fetchone()
        return dict(row) if row else None

    async def upsert_system_health_snapshot(
        self, component: str, last_status: str, payload: str
    ) -> None:
        """Insert or replace the snapshot for a component."""
        await self._conn().execute(
            """
            INSERT INTO system_health_snapshot
                (component, last_status, last_seen_at, payload)
            VALUES (?, ?, ?, ?)
            ON CONFLICT(component) DO UPDATE SET
                last_status = excluded.last_status,
                last_seen_at = excluded.last_seen_at,
                payload = excluded.payload
            """,
            (component, last_status, time.time(), payload),
        )
        await self._conn().commit()
@ -1,64 +0,0 @@
"""Shared fixtures for ha-diag-agent tests."""
from __future__ import annotations

from pathlib import Path
from typing import AsyncGenerator
from unittest.mock import AsyncMock, MagicMock

import pytest
import pytest_asyncio

from ha_diag.event_emitter import EventEmitter
from ha_diag.storage import Storage


# ---------------------------------------------------------------------------
# Filesystem fixtures
# ---------------------------------------------------------------------------


@pytest.fixture
def tmp_events_dir(tmp_path: Path) -> Path:
    events = tmp_path / "events"
    events.mkdir()
    return events


# ---------------------------------------------------------------------------
# Storage fixture (tmp SQLite — fast, no mocking)
# ---------------------------------------------------------------------------


@pytest_asyncio.fixture
async def storage(tmp_path: Path) -> AsyncGenerator[Storage, None]:
    s = Storage(tmp_path / "test.db")
    await s.open()
    yield s
    await s.close()


# ---------------------------------------------------------------------------
# EventEmitter fixture
# ---------------------------------------------------------------------------


@pytest.fixture
def emitter(tmp_events_dir: Path) -> EventEmitter:
    return EventEmitter(tmp_events_dir, node_name="test-node")


# ---------------------------------------------------------------------------
# Mock HA client fixture
# ---------------------------------------------------------------------------


@pytest.fixture
def mock_ha_client():
    """Plain HAClient mock — no context manager, just async methods."""
    client = MagicMock()
    client.get_api_status = AsyncMock(return_value={"message": "API running."})
    client.get_states = AsyncMock(return_value=[])
    client.get_entity_registry = AsyncMock(return_value=[])
    client.get_system_health = AsyncMock(return_value={})
    client.get_automation_traces = AsyncMock(return_value=[])
    return client
@ -1,38 +0,0 @@
"""Integration test fixtures.

Integration tests require real HA instances. Start them with:

    docker compose -f tests/integration/docker-compose.ken.yml up -d
    docker compose -f tests/integration/docker-compose.chelsty.yml up -d
    tests/integration/scripts/wait-for-ha.sh http://localhost:8123
    tests/integration/scripts/wait-for-ha.sh http://localhost:8124

Then set TEST_HA_TOKEN (a long-lived HA token) and run:

    pytest tests/ -m integration

All tests in this module are automatically skipped when TEST_HA_TOKEN is unset.
"""
from __future__ import annotations

import os

import pytest


@pytest.fixture(scope="session")
def ha_ken_url() -> str:
    return os.getenv("TEST_HA_KEN_URL", "http://localhost:8123")


@pytest.fixture(scope="session")
def ha_chelsty_url() -> str:
    return os.getenv("TEST_HA_CHELSTY_URL", "http://localhost:8124")


@pytest.fixture(scope="session")
def ha_token() -> str:
    token = os.getenv("TEST_HA_TOKEN", "")
    if not token:
        pytest.skip("TEST_HA_TOKEN not set — skipping integration tests")
    return token
@ -1,27 +0,0 @@
services:
  ha-chelsty-init:
    image: busybox
    container_name: ha-test-chelsty-init
    command: sh -c "cp -rn /fixtures/. /config/ && echo 'Fixtures copied'"
    volumes:
      - ./fixtures/chelsty:/fixtures:ro
      - ha_chelsty_config:/config
    restart: "no"

  ha-chelsty:
    image: ghcr.io/home-assistant/home-assistant:stable
    container_name: ha-test-chelsty
    privileged: true
    depends_on:
      ha-chelsty-init:
        condition: service_completed_successfully
    ports:
      - "8124:8123"
    volumes:
      - ha_chelsty_config:/config
    environment:
      TZ: UTC
    restart: "no"

volumes:
  ha_chelsty_config:
@ -1,27 +0,0 @@
services:
  ha-ken-init:
    image: busybox
    container_name: ha-test-ken-init
    command: sh -c "cp -rn /fixtures/. /config/ && echo 'Fixtures copied'"
    volumes:
      - ./fixtures/ken:/fixtures:ro
      - ha_ken_config:/config
    restart: "no"

  ha-ken:
    image: ghcr.io/home-assistant/home-assistant:stable
    container_name: ha-test-ken
    privileged: true
    depends_on:
      ha-ken-init:
        condition: service_completed_successfully
    ports:
      - "8123:8123"
    volumes:
      - ha_ken_config:/config
    environment:
      TZ: UTC
    restart: "no"

volumes:
  ha_ken_config:
@ -1,18 +0,0 @@
# Home Assistant test fixture — chelsty site
# Used by integration tests only. Not for production.

homeassistant:
  name: "Test HA - Chelsty"
  latitude: 0.0
  longitude: 0.0
  elevation: 0
  unit_system: metric
  time_zone: UTC
  country: PL

# Enable REST API
api:

# Disable analytics
analytics:
  reporting: false
@ -1,18 +0,0 @@
# Home Assistant test fixture — ken (piha) site
# Used by integration tests only. Not for production.

homeassistant:
  name: "Test HA - Ken"
  latitude: 0.0
  longitude: 0.0
  elevation: 0
  unit_system: metric
  time_zone: UTC
  country: PL

# Enable REST API (no auth required for trusted networks in tests)
api:

# Disable analytics
analytics:
  reporting: false
@ -1,36 +0,0 @@
#!/bin/sh
# Reset an HA Docker volume from a snapshot or fixture directory.
# Usage: reset.sh <compose_file> <service_name> <fixture_dir>
#
# Stops the service, clears and repopulates its volume from the fixture
# directory, then restarts.

set -e

COMPOSE_FILE="${1:?Usage: reset.sh <compose_file> <service_name> <fixture_dir>}"
SERVICE="${2:?}"
FIXTURE_DIR="${3:?}"
COMPOSE_DIR="$(dirname "$COMPOSE_FILE")"

printf 'Resetting %s from %s...\n' "$SERVICE" "$FIXTURE_DIR"

# Stop the service (keep the init container stopped too)
docker compose -f "$COMPOSE_FILE" stop "$SERVICE" 2>/dev/null || true

# Determine the volume name from compose project + service
VOLUME_NAME="$(docker compose -f "$COMPOSE_FILE" config --volumes 2>/dev/null | head -1)"
if [ -z "$VOLUME_NAME" ]; then
    printf 'Could not determine volume name from %s\n' "$COMPOSE_FILE" >&2
    exit 1
fi

# Wipe and repopulate the volume
docker run --rm \
    -v "$VOLUME_NAME":/config \
    -v "$(realpath "$FIXTURE_DIR")":/fixtures:ro \
    busybox \
    sh -c "rm -rf /config/.storage && cp -r /fixtures/. /config/"

# Restart the service
docker compose -f "$COMPOSE_FILE" start "$SERVICE"
printf 'Reset complete. Run wait-for-ha.sh to confirm readiness.\n'
@ -1,21 +0,0 @@
#!/bin/sh
# Snapshot the current state of an HA Docker volume.
# Usage: snapshot.sh <volume_name> [output_dir]
#
# Saves a tar.gz of the entire volume to output_dir (default: ./snapshots/).
# Use reset.sh to restore.

VOLUME="${1:?Usage: snapshot.sh <volume_name> [output_dir]}"
OUTPUT_DIR="${2:-./snapshots}"
SNAPSHOT_FILE="$OUTPUT_DIR/$VOLUME-$(date +%Y%m%d-%H%M%S).tar.gz"

mkdir -p "$OUTPUT_DIR"
printf 'Snapshotting volume %s -> %s\n' "$VOLUME" "$SNAPSHOT_FILE"

docker run --rm \
    -v "$VOLUME":/data:ro \
    alpine \
    tar czf - -C / data \
    > "$SNAPSHOT_FILE"

printf 'Snapshot saved: %s\n' "$SNAPSHOT_FILE"
@ -1,23 +0,0 @@
#!/bin/sh
# Wait until a Home Assistant instance is ready (responds to /api/).
# Usage: wait-for-ha.sh <url> [timeout_seconds]
#
# Exit 0 = HA ready, Exit 1 = timeout reached.

URL="${1:-http://localhost:8123}"
TIMEOUT="${2:-120}"

elapsed=0
printf 'Waiting for HA at %s (timeout %ss)...\n' "$URL" "$TIMEOUT"

while [ "$elapsed" -lt "$TIMEOUT" ]; do
    if curl -sf --max-time 3 "$URL/api/" -o /dev/null 2>/dev/null; then
        printf 'HA ready at %s (after %ss)\n' "$URL" "$elapsed"
        exit 0
    fi
    sleep 2
    elapsed=$((elapsed + 2))
done

printf 'Timeout: HA not ready at %s after %ss\n' "$URL" "$TIMEOUT" >&2
exit 1
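The polling loop in `wait-for-ha.sh` above follows a common probe, sleep, count pattern. The sketch below shows the same control flow in Python with the probe and the sleep injected, so the timeout arithmetic can be exercised without a running HA instance. `wait_until_ready` and its parameters are hypothetical names introduced for illustration; in the shell script the probe is the `curl` call against `/api/`.

```python
import time
from typing import Callable


def wait_until_ready(
    probe: Callable[[], bool],
    timeout: float = 120.0,
    interval: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `probe` every `interval` seconds until it succeeds or time runs out.

    Like the shell script, elapsed time is counted in sleep intervals only,
    so time spent inside the probe itself is not included in the budget.
    """
    elapsed = 0.0
    while elapsed < timeout:
        if probe():
            return True
        sleep(interval)
        elapsed += interval
    return False


if __name__ == "__main__":
    # Simulated instance that becomes ready on the third probe.
    answers = iter([False, False, True])
    print(wait_until_ready(lambda: next(answers), timeout=10, sleep=lambda s: None))
```

Counting only the sleep intervals keeps the loop simple, at the cost that a slow probe (up to 3 s per attempt in the script) can stretch the real wall-clock wait beyond the nominal timeout.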
@ -1,167 +0,0 @@
"""Integration tests for AutomationFailuresCheck.

Uses real aiosqlite Storage + EventEmitter + mocked HTTP.
"""
from __future__ import annotations

import json
from pathlib import Path
from typing import AsyncGenerator

import pytest
import pytest_asyncio
from aioresponses import aioresponses

from ha_diag.checks.automation_failures import AutomationFailuresCheck
from ha_diag.config import Settings
from ha_diag.event_emitter import EventEmitter
from ha_diag.ha_client import HAClient, make_session
from ha_diag.models import HAEventType
from ha_diag.storage import Storage

HA_URL = "http://ha-test-ken:8123"


def _settings(**overrides) -> Settings:
    defaults: dict = {
        "ha_url": HA_URL,
        "ha_token": "test-token",
        "node_name": "piha",
        "location_tag": "ken",
        "alert_cooldown_hours": 0.0,
        "automation_failure_threshold": 3,
        "check_interval": 60,
        "check_interval_unavailable": 3600,
    }
    defaults.update(overrides)
    return Settings(**defaults)


@pytest_asyncio.fixture
async def storage(tmp_path: Path) -> AsyncGenerator[Storage, None]:
    s = Storage(tmp_path / "integration_test.db")
    await s.open()
    yield s
    await s.close()


@pytest.fixture
def events_dir(tmp_path: Path) -> Path:
    d = tmp_path / "events"
    d.mkdir()
    return d


def _auto_states(*entity_ids: str) -> list[dict]:
    return [
        {
            "entity_id": eid,
            "state": "on",
            "attributes": {"friendly_name": eid.split(".")[-1].replace("_", " ").title()},
        }
        for eid in entity_ids
    ]


def _fail_traces(n: int = 3) -> list[dict]:
    return [
        {
            "run_id": f"run-{i}",
            "timestamp": f"2026-05-27T{10+i:02d}:00:00+00:00",
            "trigger": "state",
            "state": "stopped",
            "error": f"Script error #{i}",
        }
        for i in range(n)
    ]


def _ok_traces(n: int = 3) -> list[dict]:
    return [
        {
            "run_id": f"run-{i}",
            "timestamp": f"2026-05-27T{10+i:02d}:00:00+00:00",
            "trigger": "state",
            "state": "stopped",
            "error": None,
        }
        for i in range(n)
    ]


@pytest.mark.integration
async def test_failing_automation_emits_event_and_writes_file(
    storage: Storage, events_dir: Path
):
    """3 consecutive failures → event file written with correct structure."""
    states = _auto_states("automation.morning_lights")
    traces = _fail_traces(3)
    emitter = EventEmitter(events_dir, node_name="piha", location_tag="ken")

    with aioresponses() as m:
        m.get(f"{HA_URL}/api/states", payload=states)
        m.get(f"{HA_URL}/api/trace/automation/automation.morning_lights", payload=traces)
        async with make_session("test-token") as session:
            client = HAClient(HA_URL, session)
            check = AutomationFailuresCheck(client, storage, _settings())
            results = await check.run()

    assert len(results) == 1
    r = results[0]
    assert r.event_type == HAEventType.ha_automation_failing
    assert r.payload["entity_id"] == "automation.morning_lights"
    assert r.payload["total_recent_failures"] == 3

    emitter.emit(
        event_type=r.event_type,
        severity=r.severity.value,
        service="homeassistant",
        message=r.message,
        payload=r.payload,
    )

    files = list(events_dir.glob("*.json"))
    assert len(files) == 1
    data = json.loads(files[0].read_text())
    assert data["type"] == "ha_automation_failing"
    assert data["payload"]["location_tag"] == "ken"
    assert "last_failures" in data["payload"]


@pytest.mark.integration
async def test_healthy_automation_no_event(storage: Storage):
    """All recent runs successful → no event."""
    states = _auto_states("automation.morning_lights")
    traces = _ok_traces(3)

    with aioresponses() as m:
        m.get(f"{HA_URL}/api/states", payload=states)
|
|
||||||
m.get(f"{HA_URL}/api/trace/automation/automation.morning_lights", payload=traces)
|
|
||||||
async with make_session("test-token") as session:
|
|
||||||
client = HAClient(HA_URL, session)
|
|
||||||
check = AutomationFailuresCheck(client, storage, _settings())
|
|
||||||
results = await check.run()
|
|
||||||
|
|
||||||
assert results == []
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.integration
|
|
||||||
async def test_cooldown_suppresses_duplicate(storage: Storage):
|
|
||||||
"""Second run within cooldown window → no duplicate event."""
|
|
||||||
states = _auto_states("automation.morning_lights")
|
|
||||||
traces = _fail_traces(3)
|
|
||||||
settings = _settings(alert_cooldown_hours=6.0)
|
|
||||||
|
|
||||||
for attempt in range(2):
|
|
||||||
with aioresponses() as m:
|
|
||||||
m.get(f"{HA_URL}/api/states", payload=states)
|
|
||||||
m.get(f"{HA_URL}/api/trace/automation/automation.morning_lights", payload=traces)
|
|
||||||
async with make_session("test-token") as session:
|
|
||||||
check = AutomationFailuresCheck(
|
|
||||||
HAClient(HA_URL, session), storage, settings
|
|
||||||
)
|
|
||||||
results = await check.run()
|
|
||||||
if attempt == 0:
|
|
||||||
assert len(results) == 1
|
|
||||||
else:
|
|
||||||
assert results == []
|
|
||||||
|
|
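The cooldown test above expects a second alert for the same automation to be suppressed while the cooldown window is open. The following is only a sketch of that idea, not the project's implementation: `CooldownTracker` and `should_alert` are hypothetical names, and the state is kept in memory, whereas the tests use the aiosqlite-backed `Storage` (whose API is not shown in this diff).

```python
from __future__ import annotations

import time


class CooldownTracker:
    """Hypothetical in-memory stand-in for persistent alert-dedup state."""

    def __init__(self, cooldown_hours: float) -> None:
        self._cooldown_s = cooldown_hours * 3600
        self._last_sent: dict[str, float] = {}

    def should_alert(self, key: str, now: float | None = None) -> bool:
        """Return True and record the alert unless `key` is still cooling down."""
        now = time.time() if now is None else now
        last = self._last_sent.get(key)
        if last is not None and now - last < self._cooldown_s:
            return False  # inside the cooldown window: suppress the duplicate
        self._last_sent[key] = now
        return True
```

With `cooldown_hours=0.0` the window is empty, so every call alerts. That would match how most tests in these files set `alert_cooldown_hours` to `0.0` to switch suppression off.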
@ -1,59 +0,0 @@
|
||||||
"""Integration tests for HeartbeatCheck against real HA instances.
|
|
||||||
|
|
||||||
Requires:
|
|
||||||
- docker compose -f tests/integration/docker-compose.ken.yml up -d
|
|
||||||
- docker compose -f tests/integration/docker-compose.chelsty.yml up -d
|
|
||||||
- TEST_HA_TOKEN=<long-lived-token> pytest tests/ -m integration
|
|
||||||
"""
|
|
||||||
from __future__ import annotations
|
|
||||||
|
|
||||||
import pytest
|
|
||||||
|
|
||||||
from ha_diag.checks.heartbeat import HeartbeatCheck
|
|
||||||
from ha_diag.event_emitter import EventEmitter
|
|
||||||
from ha_diag.ha_client import HAClient, make_session
|
|
||||||
from ha_diag.models import HAEventType
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.integration
|
|
||||||
async def test_heartbeat_ken_healthy(ha_ken_url: str, ha_token: str):
|
|
||||||
async with make_session(ha_token) as session:
|
|
||||||
client = HAClient(ha_ken_url, session)
|
|
||||||
check = HeartbeatCheck(client)
|
|
||||||
results = await check.run()
|
|
||||||
assert results == [], f"HA ken not healthy: {results}"
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.integration
|
|
||||||
async def test_heartbeat_chelsty_healthy(ha_chelsty_url: str, ha_token: str):
|
|
||||||
async with make_session(ha_token) as session:
|
|
||||||
client = HAClient(ha_chelsty_url, session)
|
|
||||||
check = HeartbeatCheck(client)
|
|
||||||
results = await check.run()
|
|
||||||
assert results == [], f"HA chelsty not healthy: {results}"
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.integration
|
|
||||||
async def test_heartbeat_emits_event_on_failure():
|
|
||||||
"""Connecting to a closed port should yield ha_websocket_dead."""
|
|
||||||
async with make_session("bad-token") as session:
|
|
||||||
client = HAClient("http://127.0.0.1:19999", session) # nothing here
|
|
||||||
check = HeartbeatCheck(client)
|
|
||||||
results = await check.run()
|
|
||||||
assert len(results) == 1
|
|
||||||
assert results[0].event_type == HAEventType.ha_websocket_dead
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.integration
|
|
||||||
async def test_heartbeat_event_written_to_filesystem(
|
|
||||||
ha_ken_url: str, ha_token: str, tmp_path
|
|
||||||
):
|
|
||||||
emitter = EventEmitter(tmp_path / "events", node_name="test-piha", location_tag="ken")
|
|
||||||
async with make_session(ha_token) as session:
|
|
||||||
client = HAClient(ha_ken_url, session)
|
|
||||||
check = HeartbeatCheck(client)
|
|
||||||
results = await check.run()
|
|
||||||
|
|
||||||
# Healthy HA → no events
|
|
||||||
assert results == []
|
|
||||||
assert not list((tmp_path / "events").glob("*.json"))
|
|
||||||
|
|
@ -1,151 +0,0 @@
|
||||||
"""Integration tests for SystemHealthCheck using aioresponses.
|
|
||||||
|
|
||||||
Uses real aiosqlite Storage + EventEmitter + mocked HTTP.
|
|
||||||
Marked 'integration' because it exercises the full stack end-to-end.
|
|
||||||
"""
|
|
||||||
from __future__ import annotations
|
|
||||||
|
|
||||||
import json
|
|
||||||
from pathlib import Path
|
|
||||||
from typing import AsyncGenerator
|
|
||||||
|
|
||||||
import pytest
|
|
||||||
import pytest_asyncio
|
|
||||||
from aioresponses import aioresponses
|
|
||||||
|
|
||||||
from ha_diag.checks.system_health import SystemHealthCheck
|
|
||||||
from ha_diag.config import Settings
|
|
||||||
from ha_diag.event_emitter import EventEmitter
|
|
||||||
from ha_diag.ha_client import HAClient, make_session
|
|
||||||
from ha_diag.models import HAEventType
|
|
||||||
from ha_diag.storage import Storage
|
|
||||||
|
|
||||||
HA_URL = "http://ha-test-ken:8123"
|
|
||||||
|
|
||||||
|
|
||||||
def _settings(**overrides) -> Settings:
|
|
||||||
defaults: dict = {
|
|
||||||
"ha_url": HA_URL,
|
|
||||||
"ha_token": "test-token",
|
|
||||||
"node_name": "piha",
|
|
||||||
"location_tag": "ken",
|
|
||||||
"alert_cooldown_hours": 0.0,
|
|
||||||
"check_interval": 60,
|
|
||||||
"check_interval_unavailable": 3600,
|
|
||||||
}
|
|
||||||
defaults.update(overrides)
|
|
||||||
return Settings(**defaults)
|
|
||||||
|
|
||||||
|
|
||||||
@pytest_asyncio.fixture
|
|
||||||
async def storage(tmp_path: Path) -> AsyncGenerator[Storage, None]:
|
|
||||||
s = Storage(tmp_path / "integration_test.db")
|
|
||||||
await s.open()
|
|
||||||
yield s
|
|
||||||
await s.close()
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.fixture
|
|
||||||
def events_dir(tmp_path: Path) -> Path:
|
|
||||||
d = tmp_path / "events"
|
|
||||||
d.mkdir()
|
|
||||||
return d
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.integration
|
|
||||||
async def test_system_health_ok_components_no_event(
|
|
||||||
storage: Storage, events_dir: Path
|
|
||||||
):
|
|
||||||
"""All components healthy on first run → no events emitted."""
|
|
||||||
health = {
|
|
||||||
"homeassistant": {"type": "result", "data": {"version": "2025.5.0"}},
|
|
||||||
"recorder": {"type": "result", "data": {"backlog": 0}},
|
|
||||||
}
|
|
||||||
emitter = EventEmitter(events_dir, node_name="piha", location_tag="ken")
|
|
||||||
|
|
||||||
with aioresponses() as m:
|
|
||||||
m.get(f"{HA_URL}/api/system_health", payload=health)
|
|
||||||
async with make_session("test-token") as session:
|
|
||||||
client = HAClient(HA_URL, session)
|
|
||||||
check = SystemHealthCheck(client, storage, _settings())
|
|
||||||
results = await check.run()
|
|
||||||
|
|
||||||
assert results == []
|
|
||||||
assert not list(events_dir.glob("*.json"))
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.integration
|
|
||||||
async def test_system_health_degraded_emits_event_and_writes_file(
|
|
||||||
storage: Storage, events_dir: Path
|
|
||||||
):
|
|
||||||
"""Component degrades: event emitted + file written with correct structure."""
|
|
||||||
# Payloads: healthy baseline (first run) and cloud error (second run)
|
|
||||||
health_ok = {"cloud": {"type": "result", "data": {}}}
|
|
||||||
health_err = {"cloud": {"type": "error", "error": "Cloud connection lost"}}
|
|
||||||
emitter = EventEmitter(events_dir, node_name="piha", location_tag="ken")
|
|
||||||
|
|
||||||
with aioresponses() as m:
|
|
||||||
m.get(f"{HA_URL}/api/system_health", payload=health_ok)
|
|
||||||
async with make_session("test-token") as session:
|
|
||||||
client = HAClient(HA_URL, session)
|
|
||||||
await SystemHealthCheck(client, storage, _settings()).run()
|
|
||||||
|
|
||||||
# Second run: cloud errors
|
|
||||||
with aioresponses() as m:
|
|
||||||
m.get(f"{HA_URL}/api/system_health", payload=health_err)
|
|
||||||
async with make_session("test-token") as session:
|
|
||||||
client = HAClient(HA_URL, session)
|
|
||||||
check = SystemHealthCheck(client, storage, _settings())
|
|
||||||
results = await check.run()
|
|
||||||
|
|
||||||
assert len(results) == 1
|
|
||||||
assert results[0].event_type == HAEventType.ha_system_health_degraded
|
|
||||||
|
|
||||||
emitter.emit(
|
|
||||||
event_type=results[0].event_type,
|
|
||||||
severity=results[0].severity.value,
|
|
||||||
service="homeassistant",
|
|
||||||
message=results[0].message,
|
|
||||||
payload=results[0].payload,
|
|
||||||
)
|
|
||||||
|
|
||||||
files = list(events_dir.glob("*.json"))
|
|
||||||
assert len(files) == 1
|
|
||||||
data = json.loads(files[0].read_text())
|
|
||||||
assert data["type"] == "ha_system_health_degraded"
|
|
||||||
assert data["payload"]["component"] == "cloud"
|
|
||||||
assert data["payload"]["location_tag"] == "ken"
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.integration
|
|
||||||
async def test_system_health_recovery_and_re_degradation(storage: Storage):
|
|
||||||
"""Full ok→error→ok→error cycle: events fire on degradation, not on recovery."""
|
|
||||||
def _run(health):
|
|
||||||
with aioresponses() as m:
|
|
||||||
m.get(f"{HA_URL}/api/system_health", payload=health)
|
|
||||||
return make_session("test-token"), health
|
|
||||||
|
|
||||||
settings = _settings()
|
|
||||||
|
|
||||||
async def run_once(health):
|
|
||||||
with aioresponses() as m:
|
|
||||||
m.get(f"{HA_URL}/api/system_health", payload=health)
|
|
||||||
async with make_session("test-token") as session:
|
|
||||||
return await SystemHealthCheck(
|
|
||||||
HAClient(HA_URL, session), storage, settings
|
|
||||||
).run()
|
|
||||||
|
|
||||||
ok_h = {"cloud": {"type": "result", "data": {}}}
|
|
||||||
err_h = {"cloud": {"type": "error", "error": "timeout"}}
|
|
||||||
|
|
||||||
r1 = await run_once(ok_h) # baseline ok
|
|
||||||
r2 = await run_once(err_h) # first degradation
|
|
||||||
r3 = await run_once(err_h) # sustained error (no dup)
|
|
||||||
r4 = await run_once(ok_h) # recovery
|
|
||||||
r5 = await run_once(err_h) # second degradation
|
|
||||||
|
|
||||||
assert r1 == []
|
|
||||||
assert len(r2) == 1
|
|
||||||
assert r3 == []
|
|
||||||
assert r4 == []
|
|
||||||
assert len(r5) == 1
|
|
||||||
|
|
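The recovery test above checks edge-triggered behaviour: an event fires when a component moves from ok to error, not while the error persists and not on recovery. A minimal sketch of that rule follows, assuming the payload shape used in these tests (`{"type": "result" | "error", ...}` per component). The function name `newly_degraded` and the plain-dict state are hypothetical; the real check keeps its state in `Storage`. How the real check treats a component that is already in error on the very first run is not shown in this diff; the sketch reports it.

```python
from __future__ import annotations


def newly_degraded(previous: dict[str, str], current: dict[str, dict]) -> list[str]:
    """Return components in error now that were not in error on the prior run.

    `previous` maps component name to "ok" or "error" and is updated in place.
    """
    degraded: list[str] = []
    for name, info in current.items():
        status = "error" if info.get("type") == "error" else "ok"
        if status == "error" and previous.get(name) != "error":
            degraded.append(name)  # ok -> error transition (or first sighting in error)
        previous[name] = status
    return degraded
```

Feeding it the same ok, error, error, ok, error sequence as the test yields one component on the second and fifth calls and an empty list otherwise.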
@ -1,192 +0,0 @@
|
||||||
"""Functional integration test for UnavailableEntitiesCheck.
|
|
||||||
|
|
||||||
Uses aioresponses for HA HTTP (controlled, deterministic) and real aiosqlite +
|
|
||||||
EventEmitter (tests the full agent pipeline end-to-end without a live HA).
|
|
||||||
Marked 'integration' because it exercises the complete multi-component stack.
|
|
||||||
|
|
||||||
For a live-HA variant, start the ken testenv Docker instances, set
|
|
||||||
TEST_HA_TOKEN, and extend with tests that call real HA endpoints.
|
|
||||||
"""
|
|
||||||
from __future__ import annotations
|
|
||||||
|
|
||||||
import json
|
|
||||||
import time
|
|
||||||
from pathlib import Path
|
|
||||||
from typing import AsyncGenerator
|
|
||||||
|
|
||||||
import pytest
|
|
||||||
import pytest_asyncio
|
|
||||||
from aioresponses import aioresponses
|
|
||||||
|
|
||||||
from ha_diag.checks.unavailable_entities import UnavailableEntitiesCheck
|
|
||||||
from ha_diag.config import Settings
|
|
||||||
from ha_diag.event_emitter import EventEmitter
|
|
||||||
from ha_diag.ha_client import HAClient, make_session
|
|
||||||
from ha_diag.models import HAEventType
|
|
||||||
from ha_diag.storage import Storage
|
|
||||||
|
|
||||||
HA_URL = "http://ha-test-ken:8123"
|
|
||||||
|
|
||||||
|
|
||||||
def _settings(**overrides) -> Settings:
|
|
||||||
defaults: dict = {
|
|
||||||
"ha_url": HA_URL,
|
|
||||||
"ha_token": "test-token",
|
|
||||||
"node_name": "piha",
|
|
||||||
"location_tag": "ken",
|
|
||||||
"unavailable_threshold_hours": 0.0,
|
|
||||||
"integration_failure_threshold_pct": 0.5,
|
|
||||||
"integration_failure_min_entities": 3,
|
|
||||||
"alert_cooldown_hours": 0.0,
|
|
||||||
"check_interval": 60,
|
|
||||||
"check_interval_unavailable": 3600,
|
|
||||||
}
|
|
||||||
defaults.update(overrides)
|
|
||||||
return Settings(**defaults)
|
|
||||||
|
|
||||||
|
|
||||||
@pytest_asyncio.fixture
|
|
||||||
async def storage(tmp_path: Path) -> AsyncGenerator[Storage, None]:
|
|
||||||
s = Storage(tmp_path / "integration_test.db")
|
|
||||||
await s.open()
|
|
||||||
yield s
|
|
||||||
await s.close()
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.fixture
|
|
||||||
def events_dir(tmp_path: Path) -> Path:
|
|
||||||
d = tmp_path / "events"
|
|
||||||
d.mkdir()
|
|
||||||
return d
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.integration
|
|
||||||
async def test_full_pipeline_integration_event(storage: Storage, events_dir: Path):
|
|
||||||
"""3/3 zha entities unavailable → ha_integration_failed, 1 event file on disk."""
|
|
||||||
unavailable_entities = [
|
|
||||||
{"entity_id": f"light.test_{i}", "state": "unavailable", "attributes": {}}
|
|
||||||
for i in range(3)
|
|
||||||
]
|
|
||||||
available_entities = [{"entity_id": "sensor.ok", "state": "on", "attributes": {}}]
|
|
||||||
all_states = unavailable_entities + available_entities
|
|
||||||
registry = [
|
|
||||||
{"entity_id": e["entity_id"], "platform": "zha", "area_id": "living_room"}
|
|
||||||
for e in unavailable_entities
|
|
||||||
]
|
|
||||||
|
|
||||||
for e in unavailable_entities:
|
|
||||||
await storage.set_entity_unavailable_since(
|
|
||||||
e["entity_id"], "unavailable", time.time() - 25 * 3600
|
|
||||||
)
|
|
||||||
|
|
||||||
emitter = EventEmitter(events_dir, node_name="piha", location_tag="ken")
|
|
||||||
|
|
||||||
with aioresponses() as m:
|
|
||||||
m.get(f"{HA_URL}/api/states", payload=all_states)
|
|
||||||
m.get(f"{HA_URL}/api/config/entity_registry", payload=registry)
|
|
||||||
async with make_session("test-token") as session:
|
|
||||||
client = HAClient(HA_URL, session)
|
|
||||||
check = UnavailableEntitiesCheck(client, storage, _settings())
|
|
||||||
results = await check.run()
|
|
||||||
|
|
||||||
# 3/3 zha entities (100% >= 50%, count 3 >= 3) → integration event
|
|
||||||
assert len(results) == 1
|
|
||||||
assert results[0].event_type == HAEventType.ha_integration_failed
|
|
||||||
assert results[0].payload["integration"] == "zha"
|
|
||||||
|
|
||||||
emitter.emit(
|
|
||||||
event_type=results[0].event_type,
|
|
||||||
severity=results[0].severity.value,
|
|
||||||
service="homeassistant",
|
|
||||||
message=results[0].message,
|
|
||||||
payload=results[0].payload,
|
|
||||||
)
|
|
||||||
|
|
||||||
event_files = list(events_dir.glob("*.json"))
|
|
||||||
assert len(event_files) == 1
|
|
||||||
event_data = json.loads(event_files[0].read_text())
|
|
||||||
assert event_data["node"] == "piha"
|
|
||||||
assert event_data["payload"]["location_tag"] == "ken"
|
|
||||||
assert event_data["payload"]["integration"] == "zha"
|
|
||||||
assert event_data["type"] == "ha_integration_failed"
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.integration
|
|
||||||
async def test_full_pipeline_individual_entity_events(
|
|
||||||
storage: Storage, events_dir: Path
|
|
||||||
):
|
|
||||||
"""2 unavailable entities from different integrations → 2 individual events."""
|
|
||||||
states = [
|
|
||||||
{"entity_id": "light.zha_one", "state": "unavailable", "attributes": {}},
|
|
||||||
{"entity_id": "sensor.mqtt_one", "state": "unavailable", "attributes": {}},
|
|
||||||
{"entity_id": "switch.ok", "state": "on", "attributes": {}},
|
|
||||||
]
|
|
||||||
registry = [
|
|
||||||
{"entity_id": "light.zha_one", "platform": "zha", "area_id": ""},
|
|
||||||
{"entity_id": "sensor.mqtt_one", "platform": "mqtt", "area_id": ""},
|
|
||||||
]
|
|
||||||
|
|
||||||
for e in ["light.zha_one", "sensor.mqtt_one"]:
|
|
||||||
await storage.set_entity_unavailable_since(e, "unavailable", time.time() - 25 * 3600)
|
|
||||||
|
|
||||||
emitter = EventEmitter(events_dir, node_name="piha", location_tag="ken")
|
|
||||||
|
|
||||||
with aioresponses() as m:
|
|
||||||
m.get(f"{HA_URL}/api/states", payload=states)
|
|
||||||
m.get(f"{HA_URL}/api/config/entity_registry", payload=registry)
|
|
||||||
async with make_session("test-token") as session:
|
|
||||||
client = HAClient(HA_URL, session)
|
|
||||||
check = UnavailableEntitiesCheck(client, storage, _settings())
|
|
||||||
results = await check.run()
|
|
||||||
|
|
||||||
# Both integrations have only 1 entity each → below min_entities threshold
|
|
||||||
assert len(results) == 2
|
|
||||||
assert all(r.event_type == HAEventType.ha_entity_unavailable_long for r in results)
|
|
||||||
|
|
||||||
for result in results:
|
|
||||||
emitter.emit(
|
|
||||||
event_type=result.event_type,
|
|
||||||
severity=result.severity.value,
|
|
||||||
service="homeassistant",
|
|
||||||
message=result.message,
|
|
||||||
payload=result.payload,
|
|
||||||
)
|
|
||||||
|
|
||||||
files = list(events_dir.glob("*.json"))
|
|
||||||
assert len(files) == 2
|
|
||||||
for f in files:
|
|
||||||
data = json.loads(f.read_text())
|
|
||||||
assert data["payload"]["location_tag"] == "ken"
|
|
||||||
assert "entity_id" in data["payload"]
|
|
||||||
assert "since" in data["payload"]
|
|
||||||
assert data["payload"]["since"].endswith("Z")
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.integration
|
|
||||||
async def test_recovery_removes_tracking(storage: Storage, events_dir: Path):
|
|
||||||
"""Entity recovers between check cycles → baseline cleared, no event next cycle."""
|
|
||||||
eid = "light.recoverable"
|
|
||||||
await storage.set_entity_unavailable_since(eid, "unavailable", time.time() - 25 * 3600)
|
|
||||||
|
|
||||||
# Cycle 1: entity unavailable → event
|
|
||||||
states_cycle1 = [{"entity_id": eid, "state": "unavailable", "attributes": {}}]
|
|
||||||
with aioresponses() as m:
|
|
||||||
m.get(f"{HA_URL}/api/states", payload=states_cycle1)
|
|
||||||
m.get(f"{HA_URL}/api/config/entity_registry", payload=[])
|
|
||||||
async with make_session("test-token") as session:
|
|
||||||
client = HAClient(HA_URL, session)
|
|
||||||
check = UnavailableEntitiesCheck(client, storage, _settings())
|
|
||||||
results1 = await check.run()
|
|
||||||
assert len(results1) == 1
|
|
||||||
|
|
||||||
# Cycle 2: entity recovered → no event, baseline cleared
|
|
||||||
states_cycle2 = [{"entity_id": eid, "state": "on", "attributes": {}}]
|
|
||||||
with aioresponses() as m:
|
|
||||||
m.get(f"{HA_URL}/api/states", payload=states_cycle2)
|
|
||||||
m.get(f"{HA_URL}/api/config/entity_registry", payload=[])
|
|
||||||
async with make_session("test-token") as session:
|
|
||||||
client = HAClient(HA_URL, session)
|
|
||||||
check2 = UnavailableEntitiesCheck(client, storage, _settings())
|
|
||||||
results2 = await check2.run()
|
|
||||||
assert results2 == []
|
|
||||||
assert await storage.get_entity_first_unavailable_at(eid) is None
|
|
||||||
|
|
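The pipeline tests above rely on a grouping rule: when enough entities of one integration are unavailable, a single `ha_integration_failed` event replaces the per-entity events. The sketch below shows one way such a rule could work, using the two settings from `_settings()` (`integration_failure_threshold_pct`, `integration_failure_min_entities`). The function `classify_unavailable` is hypothetical, and it assumes both thresholds are evaluated against the entities listed in the registry for each platform; the real check may count differently.

```python
from __future__ import annotations

from collections import defaultdict


def classify_unavailable(
    unavailable: set[str],
    registry: dict[str, str],  # entity_id -> platform (integration name)
    threshold_pct: float = 0.5,
    min_entities: int = 3,
) -> tuple[list[str], list[str]]:
    """Return (failed_integrations, entities_to_report_individually)."""
    by_platform: dict[str, list[str]] = defaultdict(list)
    for entity_id, platform in registry.items():
        by_platform[platform].append(entity_id)

    failed: list[str] = []
    covered: set[str] = set()
    for platform, entities in sorted(by_platform.items()):
        down = [e for e in entities if e in unavailable]
        # Assumption: an integration is large enough and mostly down.
        if len(entities) >= min_entities and len(down) / len(entities) >= threshold_pct:
            failed.append(platform)
            covered.update(down)

    # Entities not covered by an integration-level event are reported one by one.
    return failed, sorted(unavailable - covered)
```

Under these assumptions, three unavailable `zha` entities collapse into one integration-level result, while one `zha` plus one `mqtt` entity stay as two individual results.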
@ -1,169 +0,0 @@
|
||||||
"""Integration tests for UpdatesAvailableCheck and UpdatesDigestCheck.
|
|
||||||
|
|
||||||
Uses real aiosqlite Storage + EventEmitter + mocked HTTP.
|
|
||||||
"""
|
|
||||||
from __future__ import annotations
|
|
||||||
|
|
||||||
import json
|
|
||||||
from pathlib import Path
|
|
||||||
from typing import AsyncGenerator
|
|
||||||
|
|
||||||
import pytest
|
|
||||||
import pytest_asyncio
|
|
||||||
from aioresponses import aioresponses
|
|
||||||
|
|
||||||
from ha_diag.checks.updates_available import UpdatesAvailableCheck, UpdatesDigestCheck
|
|
||||||
from ha_diag.config import Settings
|
|
||||||
from ha_diag.event_emitter import EventEmitter
|
|
||||||
from ha_diag.ha_client import HAClient, make_session
|
|
||||||
from ha_diag.models import HAEventType
|
|
||||||
from ha_diag.storage import Storage
|
|
||||||
|
|
||||||
HA_URL = "http://ha-test-ken:8123"
|
|
||||||
|
|
||||||
|
|
||||||
def _settings(**overrides) -> Settings:
|
|
||||||
defaults: dict = {
|
|
||||||
"ha_url": HA_URL,
|
|
||||||
"ha_token": "test-token",
|
|
||||||
"node_name": "piha",
|
|
||||||
"location_tag": "ken",
|
|
||||||
"alert_cooldown_hours": 0.0,
|
|
||||||
"updates_cooldown_days": 0,
|
|
||||||
"check_interval": 60,
|
|
||||||
"check_interval_unavailable": 3600,
|
|
||||||
}
|
|
||||||
defaults.update(overrides)
|
|
||||||
return Settings(**defaults)
|
|
||||||
|
|
||||||
|
|
||||||
@pytest_asyncio.fixture
|
|
||||||
async def storage(tmp_path: Path) -> AsyncGenerator[Storage, None]:
|
|
||||||
s = Storage(tmp_path / "integration_test.db")
|
|
||||||
await s.open()
|
|
||||||
yield s
|
|
||||||
await s.close()
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.fixture
|
|
||||||
def events_dir(tmp_path: Path) -> Path:
|
|
||||||
d = tmp_path / "events"
|
|
||||||
d.mkdir()
|
|
||||||
return d
|
|
||||||
|
|
||||||
|
|
||||||
def _update_states(*entity_ids: str) -> list[dict]:
|
|
||||||
return [
|
|
||||||
{
|
|
||||||
"entity_id": eid,
|
|
||||||
"state": "on",
|
|
||||||
"attributes": {
|
|
||||||
"title": eid.split(".")[-1].replace("_", " ").title(),
|
|
||||||
"installed_version": "1.0.0",
|
|
||||||
"latest_version": "2.0.0",
|
|
||||||
"in_progress": False,
|
|
||||||
"auto_update": False,
|
|
||||||
},
|
|
||||||
}
|
|
||||||
for eid in entity_ids
|
|
||||||
]
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.integration
|
|
||||||
async def test_individual_updates_written_to_disk(storage: Storage, events_dir: Path):
|
|
||||||
"""2 pending updates → 2 event files with correct structure."""
|
|
||||||
states = _update_states("update.ha_core", "update.mosquitto")
|
|
||||||
emitter = EventEmitter(events_dir, node_name="piha", location_tag="ken")
|
|
||||||
|
|
||||||
with aioresponses() as m:
|
|
||||||
m.get(f"{HA_URL}/api/states", payload=states)
|
|
||||||
async with make_session("test-token") as session:
|
|
||||||
client = HAClient(HA_URL, session)
|
|
||||||
check = UpdatesAvailableCheck(client, storage, _settings())
|
|
||||||
results = await check.run()
|
|
||||||
|
|
||||||
assert len(results) == 2
|
|
||||||
for r in results:
|
|
||||||
assert r.event_type == HAEventType.ha_update_available
|
|
||||||
emitter.emit(
|
|
||||||
event_type=r.event_type,
|
|
||||||
severity=r.severity.value,
|
|
||||||
service="homeassistant",
|
|
||||||
message=r.message,
|
|
||||||
payload=r.payload,
|
|
||||||
)
|
|
||||||
|
|
||||||
files = list(events_dir.glob("*.json"))
|
|
||||||
assert len(files) == 2
|
|
||||||
for f in files:
|
|
||||||
data = json.loads(f.read_text())
|
|
||||||
assert data["type"] == "ha_update_available"
|
|
||||||
assert data["payload"]["location_tag"] == "ken"
|
|
||||||
assert "entity_id" in data["payload"]
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.integration
|
|
||||||
async def test_digest_writes_single_event_file(storage: Storage, events_dir: Path):
|
|
||||||
"""Sunday digest → single event file with digest=True payload."""
|
|
||||||
states = _update_states("update.ha_core", "update.mosquitto", "update.esphome")
|
|
||||||
emitter = EventEmitter(events_dir, node_name="piha", location_tag="ken")
|
|
||||||
|
|
||||||
with aioresponses() as m:
|
|
||||||
m.get(f"{HA_URL}/api/states", payload=states)
|
|
||||||
async with make_session("test-token") as session:
|
|
||||||
client = HAClient(HA_URL, session)
|
|
||||||
check = UpdatesDigestCheck(client, storage, _settings())
|
|
||||||
results = await check.run()
|
|
||||||
|
|
||||||
assert len(results) == 1
|
|
||||||
r = results[0]
|
|
||||||
assert r.payload["digest"] is True
|
|
||||||
assert r.payload["count"] == 3
|
|
||||||
|
|
||||||
emitter.emit(
|
|
||||||
event_type=r.event_type,
|
|
||||||
severity=r.severity.value,
|
|
||||||
service="homeassistant",
|
|
||||||
message=r.message,
|
|
||||||
payload=r.payload,
|
|
||||||
)
|
|
||||||
files = list(events_dir.glob("*.json"))
|
|
||||||
assert len(files) == 1
|
|
||||||
data = json.loads(files[0].read_text())
|
|
||||||
assert data["payload"]["digest"] is True
|
|
||||||
assert len(data["payload"]["updates"]) == 3
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.integration
|
|
||||||
async def test_dedup_across_daily_and_digest_independent(storage: Storage):
|
|
||||||
"""Daily dedup key doesn't suppress digest, and vice versa."""
|
|
||||||
states = _update_states("update.ha_core")
|
|
||||||
settings = _settings(updates_cooldown_days=7)
|
|
||||||
|
|
||||||
# Daily check
|
|
||||||
with aioresponses() as m:
|
|
||||||
m.get(f"{HA_URL}/api/states", payload=states)
|
|
||||||
async with make_session("test-token") as session:
|
|
||||||
r1 = await UpdatesAvailableCheck(
|
|
||||||
HAClient(HA_URL, session), storage, settings
|
|
||||||
).run()
|
|
||||||
assert len(r1) == 1
|
|
||||||
|
|
||||||
# Daily again — cooldown active
|
|
||||||
with aioresponses() as m:
|
|
||||||
m.get(f"{HA_URL}/api/states", payload=states)
|
|
||||||
async with make_session("test-token") as session:
|
|
||||||
r2 = await UpdatesAvailableCheck(
|
|
||||||
HAClient(HA_URL, session), storage, settings
|
|
||||||
).run()
|
|
||||||
assert r2 == []
|
|
||||||
|
|
||||||
# Digest — different key, should still fire
|
|
||||||
with aioresponses() as m:
|
|
||||||
m.get(f"{HA_URL}/api/states", payload=states)
|
|
||||||
async with make_session("test-token") as session:
|
|
||||||
r3 = await UpdatesDigestCheck(
|
|
||||||
HAClient(HA_URL, session), storage, settings
|
|
||||||
).run()
|
|
||||||
assert len(r3) == 1
|
|
||||||
assert r3[0].payload["digest"] is True
|
|
||||||
|
|
@ -1,186 +0,0 @@
|
||||||
"""Integration tests for WebSocketMonitor against real HA instances.
|
|
||||||
|
|
||||||
Requires:
|
|
||||||
docker compose -f tests/integration/docker-compose.ken.yml up -d
|
|
||||||
tests/integration/scripts/wait-for-ha.sh http://localhost:8123
|
|
||||||
TEST_HA_TOKEN=<long-lived-token> pytest tests/ -m integration
|
|
||||||
|
|
||||||
Container stop/restart tests additionally need Docker access from the host.
|
|
||||||
"""
|
|
||||||
from __future__ import annotations
|
|
||||||
|
|
||||||
import asyncio
|
|
||||||
import subprocess
|
|
||||||
import time
|
|
||||||
from pathlib import Path
|
|
||||||
|
|
||||||
import pytest
|
|
||||||
|
|
||||||
from ha_diag.config import Settings
|
|
||||||
from ha_diag.event_emitter import EventEmitter
|
|
||||||
from ha_diag.models import HAEventType
|
|
||||||
from ha_diag.monitors.websocket_monitor import WebSocketMonitor
|
|
||||||
from ha_diag.ha_client import make_session
|
|
||||||
|
|
||||||
|
|
||||||
# ---------------------------------------------------------------------------
|
|
||||||
# Helpers
|
|
||||||
# ---------------------------------------------------------------------------
|
|
||||||
|
|
||||||
|
|
||||||
def _make_settings(ha_url: str, ha_token: str, **overrides) -> Settings:
|
|
||||||
defaults: dict = {
|
|
||||||
"ha_url": ha_url,
|
|
||||||
"ha_token": ha_token,
|
|
||||||
"node_name": "test-piha",
|
|
||||||
"location_tag": "ken",
|
|
||||||
"websocket_enabled": True,
|
|
||||||
"websocket_silence_threshold_seconds": 30, # low for fast test
|
|
||||||
"websocket_watchdog_interval_seconds": 5,
|
|
||||||
"websocket_reconnect_initial_delay": 1.0,
|
|
||||||
"websocket_reconnect_max_delay": 10.0,
|
|
||||||
"websocket_reconnect_jitter": 0.0,
|
|
||||||
"websocket_down_alert_repeat_minutes": 0, # always re-alert
|
|
||||||
}
|
|
||||||
defaults.update(overrides)
|
|
||||||
return Settings(**defaults)
|
|
||||||
|
|
||||||
|
|
||||||
def _emitted_types(events_dir: Path) -> list[str]:
|
|
||||||
return [
|
|
||||||
__import__("json").loads(f.read_text())["type"]
|
|
||||||
for f in sorted(events_dir.glob("*.json"))
|
|
||||||
]
|
|
||||||
|
|
||||||
|
|
||||||
# ---------------------------------------------------------------------------
|
|
||||||
# Tests
|
|
||||||
# ---------------------------------------------------------------------------
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.integration
|
|
||||||
async def test_ws_normal_operation_no_false_alerts(
|
|
||||||
ha_ken_url: str, ha_token: str, tmp_path: Path
|
|
||||||
):
|
|
||||||
"""Normal operation: monitor connects, subscribes, no dead alerts emitted."""
|
|
||||||
events_dir = tmp_path / "events"
|
|
||||||
events_dir.mkdir()
|
|
||||||
settings = _make_settings(ha_ken_url, ha_token)
|
|
||||||
emitter = EventEmitter(events_dir, node_name="test-piha", location_tag="ken")
|
|
||||||
|
|
||||||
async with make_session(ha_token) as session:
|
|
||||||
monitor = WebSocketMonitor(
|
|
||||||
ha_url=ha_ken_url,
|
|
||||||
token=ha_token,
|
|
||||||
settings=settings,
|
|
||||||
emitter=emitter,
|
|
||||||
session=session,
|
|
||||||
)
|
|
||||||
await monitor.start()
|
|
||||||
await asyncio.sleep(5) # let it connect and settle
|
|
||||||
assert monitor.is_healthy, "Monitor should be subscribed and healthy"
|
|
||||||
await monitor.stop()
|
|
||||||
|
|
||||||
# No dead alerts during normal operation
|
|
||||||
types = _emitted_types(events_dir)
|
|
||||||
assert HAEventType.ha_websocket_dead.value not in types, (
|
|
||||||
f"Unexpected dead alerts during normal operation: {types}"
|
|
||||||
)
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.integration
|
|
||||||
async def test_ws_dead_emitted_when_ha_stops(ha_ken_url: str, ha_token: str, tmp_path: Path):
|
|
||||||
"""Stopping the HA container triggers ha_websocket_dead."""
|
|
||||||
events_dir = tmp_path / "events"
|
|
||||||
events_dir.mkdir()
|
|
||||||
settings = _make_settings(ha_ken_url, ha_token)
|
|
||||||
emitter = EventEmitter(events_dir, node_name="test-piha", location_tag="ken")
|
|
||||||
|
|
||||||
async with make_session(ha_token) as session:
|
|
||||||
monitor = WebSocketMonitor(
|
|
||||||
ha_url=ha_ken_url,
|
|
||||||
token=ha_token,
|
|
||||||
settings=settings,
|
|
||||||
emitter=emitter,
|
|
||||||
session=session,
|
|
||||||
)
|
|
||||||
await monitor.start()
|
|
||||||
# Wait for initial subscription
|
|
||||||
for _ in range(20):
|
|
||||||
if monitor.is_healthy:
|
|
||||||
break
|
|
||||||
await asyncio.sleep(0.5)
|
|
||||||
assert monitor.is_healthy, "Monitor did not subscribe within 10s"
|
|
||||||
|
|
||||||
# Stop HA container
|
|
||||||
subprocess.run(
|
|
||||||
["docker", "stop", "ha-test-ken"],
|
|
||||||
check=True, capture_output=True, timeout=30,
|
|
||||||
)
|
|
||||||
try:
|
|
||||||
# Wait for dead alert (up to 15s)
|
|
||||||
deadline = time.monotonic() + 15
|
|
||||||
while time.monotonic() < deadline:
|
|
||||||
types = _emitted_types(events_dir)
|
|
||||||
if HAEventType.ha_websocket_dead.value in types:
|
|
||||||
break
|
|
||||||
await asyncio.sleep(0.5)
|
|
||||||
|
|
||||||
types = _emitted_types(events_dir)
|
|
||||||
assert HAEventType.ha_websocket_dead.value in types, (
|
|
||||||
"ha_websocket_dead not emitted after HA container stopped"
|
|
||||||
)
|
|
||||||
finally:
|
|
||||||
await monitor.stop()
|
|
||||||
subprocess.run(
|
|
||||||
["docker", "start", "ha-test-ken"],
|
|
||||||
check=False, capture_output=True, timeout=30,
|
|
||||||
)
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.integration
|
|
||||||
async def test_ws_recovered_after_ha_restart(ha_ken_url: str, ha_token: str, tmp_path: Path):
|
|
||||||
"""After HA restarts, monitor reconnects and emits ha_websocket_recovered."""
|
|
||||||
events_dir = tmp_path / "events"
|
|
||||||
events_dir.mkdir()
|
|
||||||
settings = _make_settings(ha_ken_url, ha_token)
|
|
||||||
emitter = EventEmitter(events_dir, node_name="test-piha", location_tag="ken")
|
|
||||||
|
|
||||||
async with make_session(ha_token) as session:
|
|
||||||
monitor = WebSocketMonitor(
|
|
||||||
ha_url=ha_ken_url,
|
|
||||||
token=ha_token,
|
|
||||||
settings=settings,
|
|
||||||
emitter=emitter,
|
|
||||||
session=session,
|
|
||||||
)
|
|
||||||
await monitor.start()
|
|
||||||
for _ in range(20):
|
|
||||||
if monitor.is_healthy:
|
|
||||||
break
|
|
||||||
await asyncio.sleep(0.5)
|
|
||||||
assert monitor.is_healthy, "Monitor did not subscribe within 10s"
|
|
||||||
|
|
||||||
# Stop then restart HA
|
|
||||||
subprocess.run(["docker", "stop", "ha-test-ken"], check=True, timeout=30)
|
|
||||||
await asyncio.sleep(2)
|
|
||||||
subprocess.run(["docker", "start", "ha-test-ken"], check=True, timeout=30)
|
|
||||||
|
|
||||||
try:
|
|
||||||
# Wait for recovery (up to 60s — HA takes time to start)
|
|
||||||
deadline = time.monotonic() + 60
|
|
||||||
while time.monotonic() < deadline:
|
|
||||||
types = _emitted_types(events_dir)
|
|
||||||
if HAEventType.ha_websocket_recovered.value in types:
|
|
||||||
break
|
|
||||||
await asyncio.sleep(1.0)
|
|
||||||
|
|
||||||
types = _emitted_types(events_dir)
|
|
||||||
assert HAEventType.ha_websocket_dead.value in types, (
|
|
||||||
"ha_websocket_dead not emitted after container stop"
|
|
||||||
)
|
|
||||||
assert HAEventType.ha_websocket_recovered.value in types, (
|
|
||||||
"ha_websocket_recovered not emitted after HA restarted"
|
|
||||||
)
|
|
||||||
finally:
|
|
||||||
await monitor.stop()
|
|
||||||
|
|
@@ -1,217 +0,0 @@
|
||||||
"""Unit tests for AutomationFailuresCheck."""
|
|
||||||
from __future__ import annotations
|
|
||||||
|
|
||||||
from pathlib import Path
|
|
||||||
from unittest.mock import AsyncMock, MagicMock
|
|
||||||
|
|
||||||
import pytest
|
|
||||||
|
|
||||||
from ha_diag.checks.automation_failures import AutomationFailuresCheck, _is_trace_failure
|
|
||||||
from ha_diag.config import Settings
|
|
||||||
from ha_diag.models import HAEventType, Severity
|
|
||||||
from ha_diag.storage import Storage
|
|
||||||
|
|
||||||
|
|
||||||
# ---------------------------------------------------------------------------
|
|
||||||
# Helpers
|
|
||||||
# ---------------------------------------------------------------------------
|
|
||||||
|
|
||||||
|
|
||||||
def _make_settings(**overrides) -> Settings:
|
|
||||||
defaults: dict = {
|
|
||||||
"ha_url": "http://test.local:8123",
|
|
||||||
"ha_token": "test",
|
|
||||||
"node_name": "test-node",
|
|
||||||
"location_tag": "test-loc",
|
|
||||||
"alert_cooldown_hours": 0.0,
|
|
||||||
"automation_failure_threshold": 3,
|
|
||||||
"check_interval": 60,
|
|
||||||
"check_interval_unavailable": 3600,
|
|
||||||
}
|
|
||||||
defaults.update(overrides)
|
|
||||||
return Settings(**defaults)
|
|
||||||
|
|
||||||
|
|
||||||
def _make_client(states=None, traces_by_id=None, states_error=None):
|
|
||||||
client = MagicMock()
|
|
||||||
if states_error:
|
|
||||||
client.get_states = AsyncMock(side_effect=states_error)
|
|
||||||
else:
|
|
||||||
client.get_states = AsyncMock(return_value=states or [])
|
|
||||||
|
|
||||||
traces_map = traces_by_id or {}
|
|
||||||
|
|
||||||
async def _get_traces(eid):
|
|
||||||
if eid not in traces_map:
|
|
||||||
raise Exception(f"404 for {eid}")
|
|
||||||
return traces_map[eid]
|
|
||||||
|
|
||||||
client.get_automation_traces = AsyncMock(side_effect=_get_traces)
|
|
||||||
return client
|
|
||||||
|
|
||||||
|
|
||||||
def _auto_state(entity_id: str, state: str = "on", friendly_name: str | None = None) -> dict:
|
|
||||||
attrs: dict = {}
|
|
||||||
if friendly_name:
|
|
||||||
attrs["friendly_name"] = friendly_name
|
|
||||||
return {"entity_id": entity_id, "state": state, "attributes": attrs}
|
|
||||||
|
|
||||||
|
|
||||||
def _trace(error: str | None = None, state: str = "stopped") -> dict:
|
|
||||||
return {
|
|
||||||
"run_id": "abc",
|
|
||||||
"timestamp": "2026-05-27T10:00:00+00:00",
|
|
||||||
"trigger": "state",
|
|
||||||
"state": state if error is None else "stopped",
|
|
||||||
"error": error,
|
|
||||||
}
|
|
||||||
|
|
||||||
|
|
||||||
def _fail(error: str = "Script error") -> dict:
|
|
||||||
return _trace(error=error)
|
|
||||||
|
|
||||||
|
|
||||||
def _ok() -> dict:
|
|
||||||
return _trace(error=None)
|
|
||||||
|
|
||||||
|
|
||||||
# ---------------------------------------------------------------------------
|
|
||||||
# _is_trace_failure unit tests
|
|
||||||
# ---------------------------------------------------------------------------
|
|
||||||
|
|
||||||
|
|
||||||
def test_trace_with_error_is_failure():
|
|
||||||
assert _is_trace_failure({"error": "Something went wrong"}) is True
|
|
||||||
|
|
||||||
|
|
||||||
def test_trace_with_state_failed_is_failure():
|
|
||||||
assert _is_trace_failure({"state": "failed", "error": None}) is True
|
|
||||||
|
|
||||||
|
|
||||||
def test_trace_with_null_error_is_success():
|
|
||||||
assert _is_trace_failure({"error": None, "state": "stopped"}) is False
|
|
||||||
|
|
||||||
|
|
||||||
def test_trace_with_empty_string_error_is_success():
|
|
||||||
assert _is_trace_failure({"error": "", "state": "stopped"}) is False
|
|
||||||
|
|
||||||
|
|
||||||
def test_trace_with_no_keys_is_success():
|
|
||||||
assert _is_trace_failure({}) is False
|
|
||||||
|
|
||||||
|
|
||||||
# ---------------------------------------------------------------------------
|
|
||||||
# AutomationFailuresCheck.run() tests
|
|
||||||
# ---------------------------------------------------------------------------
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.asyncio
|
|
||||||
async def test_no_automations_returns_empty(storage: Storage):
|
|
||||||
check = AutomationFailuresCheck(_make_client([]), storage, _make_settings())
|
|
||||||
assert await check.run() == []
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.asyncio
|
|
||||||
async def test_disabled_automation_skipped(storage: Storage):
|
|
||||||
states = [_auto_state("automation.morning_lights", state="off")]
|
|
||||||
check = AutomationFailuresCheck(_make_client(states, {}), storage, _make_settings())
|
|
||||||
assert await check.run() == []
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.asyncio
|
|
||||||
async def test_automation_with_no_traces_skipped(storage: Storage):
|
|
||||||
states = [_auto_state("automation.morning_lights")]
|
|
||||||
# The get_automation_traces mock raises for entity IDs missing from traces_by_id, so the check skips this automation
|
|
||||||
check = AutomationFailuresCheck(_make_client(states, {}), storage, _make_settings())
|
|
||||||
assert await check.run() == []
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.asyncio
|
|
||||||
async def test_fewer_traces_than_threshold_skipped(storage: Storage):
|
|
||||||
states = [_auto_state("automation.a")]
|
|
||||||
traces = {"automation.a": [_fail(), _fail()]} # 2 failures, threshold=3
|
|
||||||
check = AutomationFailuresCheck(_make_client(states, traces), storage, _make_settings())
|
|
||||||
assert await check.run() == []
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.asyncio
|
|
||||||
async def test_all_recent_failed_emits_event(storage: Storage):
|
|
||||||
states = [_auto_state("automation.a", friendly_name="Morning Lights")]
|
|
||||||
traces = {"automation.a": [_fail("step failed"), _fail("timeout"), _fail("no device")]}
|
|
||||||
check = AutomationFailuresCheck(_make_client(states, traces), storage, _make_settings())
|
|
||||||
results = await check.run()
|
|
||||||
assert len(results) == 1
|
|
||||||
r = results[0]
|
|
||||||
assert r.event_type == HAEventType.ha_automation_failing
|
|
||||||
assert r.severity == Severity.warning
|
|
||||||
assert r.payload["entity_id"] == "automation.a"
|
|
||||||
assert r.payload["friendly_name"] == "Morning Lights"
|
|
||||||
assert r.payload["total_recent_failures"] == 3
|
|
||||||
assert len(r.payload["last_failures"]) == 3
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.asyncio
|
|
||||||
async def test_partial_failures_no_event(storage: Storage):
|
|
||||||
states = [_auto_state("automation.a")]
|
|
||||||
# 2 failures, 1 success in recent 3 → not all failed
|
|
||||||
traces = {"automation.a": [_fail(), _ok(), _fail()]}
|
|
||||||
check = AutomationFailuresCheck(_make_client(states, traces), storage, _make_settings())
|
|
||||||
assert await check.run() == []
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.asyncio
|
|
||||||
async def test_cooldown_prevents_duplicate_event(storage: Storage):
|
|
||||||
states = [_auto_state("automation.a")]
|
|
||||||
traces = {"automation.a": [_fail(), _fail(), _fail()]}
|
|
||||||
settings = _make_settings(alert_cooldown_hours=6.0)
|
|
||||||
check = AutomationFailuresCheck(_make_client(states, traces), storage, settings)
|
|
||||||
r1 = await check.run()
|
|
||||||
r2 = await check.run()
|
|
||||||
assert len(r1) == 1
|
|
||||||
assert r2 == []
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.asyncio
|
|
||||||
async def test_multiple_failing_automations(storage: Storage):
|
|
||||||
states = [_auto_state("automation.a"), _auto_state("automation.b")]
|
|
||||||
traces = {
|
|
||||||
"automation.a": [_fail(), _fail(), _fail()],
|
|
||||||
"automation.b": [_fail(), _fail(), _fail()],
|
|
||||||
}
|
|
||||||
check = AutomationFailuresCheck(_make_client(states, traces), storage, _make_settings())
|
|
||||||
results = await check.run()
|
|
||||||
assert len(results) == 2
|
|
||||||
eids = {r.payload["entity_id"] for r in results}
|
|
||||||
assert eids == {"automation.a", "automation.b"}
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.asyncio
|
|
||||||
async def test_states_error_returns_empty(storage: Storage):
|
|
||||||
check = AutomationFailuresCheck(
|
|
||||||
_make_client(states_error=ConnectionError("down")), storage, _make_settings()
|
|
||||||
)
|
|
||||||
assert await check.run() == []
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.asyncio
|
|
||||||
async def test_custom_threshold(storage: Storage):
|
|
||||||
states = [_auto_state("automation.a")]
|
|
||||||
# threshold=2: 2 failures should trigger
|
|
||||||
traces = {"automation.a": [_fail(), _fail(), _ok()]}
|
|
||||||
settings = _make_settings(automation_failure_threshold=2)
|
|
||||||
check = AutomationFailuresCheck(_make_client(states, traces), storage, settings)
|
|
||||||
results = await check.run()
|
|
||||||
assert len(results) == 1
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.asyncio
|
|
||||||
async def test_failure_with_state_failed_field(storage: Storage):
|
|
||||||
states = [_auto_state("automation.a")]
|
|
||||||
traces = {"automation.a": [
|
|
||||||
{"run_id": "x", "state": "failed", "error": None, "timestamp": "2026-05-27T10:00:00Z"},
|
|
||||||
{"run_id": "y", "state": "failed", "error": None, "timestamp": "2026-05-27T09:00:00Z"},
|
|
||||||
{"run_id": "z", "state": "failed", "error": None, "timestamp": "2026-05-27T08:00:00Z"},
|
|
||||||
]}
|
|
||||||
check = AutomationFailuresCheck(_make_client(states, traces), storage, _make_settings())
|
|
||||||
results = await check.run()
|
|
||||||
assert len(results) == 1
|
|
||||||
|
|
@@ -1,88 +0,0 @@
|
||||||
"""Tests for EventEmitter."""
|
|
||||||
from __future__ import annotations
|
|
||||||
|
|
||||||
import json
|
|
||||||
from pathlib import Path
|
|
||||||
|
|
||||||
import pytest
|
|
||||||
|
|
||||||
from ha_diag.event_emitter import EventEmitter
|
|
||||||
|
|
||||||
|
|
||||||
def test_emit_creates_json_file(tmp_events_dir: Path, emitter: EventEmitter):
|
|
||||||
event_id = emitter.emit(
|
|
||||||
event_type="ha_websocket_dead",
|
|
||||||
severity="error",
|
|
||||||
service="homeassistant",
|
|
||||||
message="HA unreachable",
|
|
||||||
payload={"error": "timeout"},
|
|
||||||
)
|
|
||||||
files = list(tmp_events_dir.glob("*.json"))
|
|
||||||
assert len(files) == 1
|
|
||||||
assert files[0].name == f"{event_id}.json"
|
|
||||||
|
|
||||||
|
|
||||||
def test_emit_event_schema(tmp_events_dir: Path, emitter: EventEmitter):
|
|
||||||
event_id = emitter.emit(
|
|
||||||
event_type="ha_websocket_dead",
|
|
||||||
severity="error",
|
|
||||||
service="homeassistant",
|
|
||||||
message="HA unreachable",
|
|
||||||
payload={"error": "timeout"},
|
|
||||||
)
|
|
||||||
data = json.loads((tmp_events_dir / f"{event_id}.json").read_text())
|
|
||||||
assert data["id"] == event_id
|
|
||||||
assert data["type"] == "ha_websocket_dead"
|
|
||||||
assert data["severity"] == "error"
|
|
||||||
assert data["node"] == "test-node"
|
|
||||||
assert data["service"] == "homeassistant"
|
|
||||||
assert data["message"] == "HA unreachable"
|
|
||||||
assert data["payload"] == {"error": "timeout"}
|
|
||||||
assert "timestamp" in data
|
|
||||||
assert "date" in data
|
|
||||||
|
|
||||||
|
|
||||||
def test_emit_multiple_events_unique_files(tmp_events_dir: Path, emitter: EventEmitter):
|
|
||||||
ids = [
|
|
||||||
emitter.emit("ha_websocket_dead", "error", "homeassistant", f"msg {i}")
|
|
||||||
for i in range(3)
|
|
||||||
]
|
|
||||||
assert len(set(ids)) == 3
|
|
||||||
assert len(list(tmp_events_dir.glob("*.json"))) == 3
|
|
||||||
|
|
||||||
|
|
||||||
def test_emit_no_tmp_file_left(tmp_events_dir: Path, emitter: EventEmitter):
|
|
||||||
emitter.emit("ha_websocket_dead", "error", "homeassistant", "msg")
|
|
||||||
assert not list(tmp_events_dir.glob("*.tmp"))
|
|
||||||
|
|
||||||
|
|
||||||
def test_emitter_creates_events_dir(tmp_path: Path):
|
|
||||||
new_dir = tmp_path / "nested" / "events"
|
|
||||||
emitter = EventEmitter(new_dir, "my-node")
|
|
||||||
assert new_dir.exists()
|
|
||||||
|
|
||||||
|
|
||||||
def test_location_tag_included_in_payload(tmp_events_dir: Path):
|
|
||||||
emitter = EventEmitter(tmp_events_dir, node_name="piha", location_tag="ken")
|
|
||||||
event_id = emitter.emit("ha_websocket_dead", "error", "homeassistant", "msg")
|
|
||||||
data = json.loads((tmp_events_dir / f"{event_id}.json").read_text())
|
|
||||||
assert data["payload"]["location_tag"] == "ken"
|
|
||||||
|
|
||||||
|
|
||||||
def test_location_tag_empty_not_in_payload(tmp_events_dir: Path):
|
|
||||||
emitter = EventEmitter(tmp_events_dir, node_name="piha", location_tag="")
|
|
||||||
event_id = emitter.emit("ha_websocket_dead", "error", "homeassistant", "msg")
|
|
||||||
data = json.loads((tmp_events_dir / f"{event_id}.json").read_text())
|
|
||||||
assert "location_tag" not in data["payload"]
|
|
||||||
|
|
||||||
|
|
||||||
def test_location_tag_does_not_override_explicit_payload_key(tmp_events_dir: Path):
|
|
||||||
emitter = EventEmitter(tmp_events_dir, node_name="piha", location_tag="ken")
|
|
||||||
event_id = emitter.emit(
|
|
||||||
"ha_websocket_dead", "error", "homeassistant", "msg",
|
|
||||||
payload={"location_tag": "override", "other": "value"},
|
|
||||||
)
|
|
||||||
data = json.loads((tmp_events_dir / f"{event_id}.json").read_text())
|
|
||||||
# Explicit payload key wins over the emitter's location_tag
|
|
||||||
assert data["payload"]["location_tag"] == "override"
|
|
||||||
assert data["payload"]["other"] == "value"
|
|
||||||
|
|
@@ -1,135 +0,0 @@
|
||||||
"""Tests for HAClient using aioresponses to mock aiohttp."""
|
|
||||||
from __future__ import annotations
|
|
||||||
|
|
||||||
import pytest
|
|
||||||
from aioresponses import aioresponses
|
|
||||||
|
|
||||||
from ha_diag.ha_client import HAClient, make_session
|
|
||||||
|
|
||||||
HA_URL = "http://homeassistant.test:8123"
|
|
||||||
TOKEN = "test-token"
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.asyncio
|
|
||||||
async def test_get_api_status_ok():
|
|
||||||
with aioresponses() as m:
|
|
||||||
m.get(f"{HA_URL}/api/", payload={"message": "API running."})
|
|
||||||
async with make_session(TOKEN) as session:
|
|
||||||
client = HAClient(HA_URL, session)
|
|
||||||
result = await client.get_api_status()
|
|
||||||
assert result == {"message": "API running."}
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.asyncio
|
|
||||||
async def test_get_api_status_unauthorized():
|
|
||||||
with aioresponses() as m:
|
|
||||||
m.get(f"{HA_URL}/api/", status=401)
|
|
||||||
async with make_session(TOKEN) as session:
|
|
||||||
client = HAClient(HA_URL, session)
|
|
||||||
with pytest.raises(Exception):
|
|
||||||
await client.get_api_status()
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.asyncio
|
|
||||||
async def test_get_states_returns_list():
|
|
||||||
payload = [{"entity_id": "light.living_room", "state": "on"}]
|
|
||||||
with aioresponses() as m:
|
|
||||||
m.get(f"{HA_URL}/api/states", payload=payload)
|
|
||||||
async with make_session(TOKEN) as session:
|
|
||||||
client = HAClient(HA_URL, session)
|
|
||||||
states = await client.get_states()
|
|
||||||
assert isinstance(states, list)
|
|
||||||
assert states[0]["entity_id"] == "light.living_room"
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.asyncio
|
|
||||||
async def test_get_config_returns_dict():
|
|
||||||
payload = {"version": "2024.1.0", "location_name": "Home"}
|
|
||||||
with aioresponses() as m:
|
|
||||||
m.get(f"{HA_URL}/api/config", payload=payload)
|
|
||||||
async with make_session(TOKEN) as session:
|
|
||||||
client = HAClient(HA_URL, session)
|
|
||||||
config = await client.get_config()
|
|
||||||
assert config["version"] == "2024.1.0"
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.asyncio
|
|
||||||
async def test_get_entity_registry_returns_list():
|
|
||||||
payload = [
|
|
||||||
{"entity_id": "light.hall", "platform": "zha", "area_id": "hallway"},
|
|
||||||
{"entity_id": "sensor.temp", "platform": "mqtt", "area_id": None},
|
|
||||||
]
|
|
||||||
with aioresponses() as m:
|
|
||||||
m.get(f"{HA_URL}/api/config/entity_registry", payload=payload)
|
|
||||||
async with make_session(TOKEN) as session:
|
|
||||||
client = HAClient(HA_URL, session)
|
|
||||||
registry = await client.get_entity_registry()
|
|
||||||
assert len(registry) == 2
|
|
||||||
assert registry[0]["platform"] == "zha"
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.asyncio
|
|
||||||
async def test_make_session_sets_auth_header():
|
|
||||||
"""make_session injects the Bearer token in all requests."""
|
|
||||||
with aioresponses() as m:
|
|
||||||
m.get(f"{HA_URL}/api/", payload={"message": "API running."})
|
|
||||||
async with make_session("my-secret-token") as session:
|
|
||||||
client = HAClient(HA_URL, session)
|
|
||||||
await client.get_api_status()
|
|
||||||
# Verify the session's default headers carry the Bearer token
|
|
||||||
assert session.headers.get("Authorization") == "Bearer my-secret-token"
|
|
||||||
|
|
||||||
|
|
||||||
# ---------------------------------------------------------------------------
|
|
||||||
# Entity registry TTL cache (Phase 3 Flag #3)
|
|
||||||
# ---------------------------------------------------------------------------
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.asyncio
|
|
||||||
async def test_entity_registry_cached_on_second_call():
|
|
||||||
"""Second call within TTL returns cache, making only one HTTP request."""
|
|
||||||
payload = [{"entity_id": "light.hall", "platform": "zha", "area_id": "hallway"}]
|
|
||||||
with aioresponses() as m:
|
|
||||||
m.get(f"{HA_URL}/api/config/entity_registry", payload=payload)
|
|
||||||
async with make_session(TOKEN) as session:
|
|
||||||
client = HAClient(HA_URL, session, entity_registry_cache_ttl=300.0)
|
|
||||||
r1 = await client.get_entity_registry()
|
|
||||||
r2 = await client.get_entity_registry() # from cache — no second HTTP call
|
|
||||||
assert r1 == r2
|
|
||||||
# aioresponses mocks are single-use by default, so an unmocked second request
|
|
||||||
# would raise aiohttp.ClientConnectionError; reaching here means the cache was used.
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.asyncio
|
|
||||||
async def test_entity_registry_cache_bypassed_after_ttl():
|
|
||||||
"""After TTL expiry, next call fetches fresh data."""
|
|
||||||
payload = [{"entity_id": "light.hall", "platform": "zha", "area_id": "hallway"}]
|
|
||||||
with aioresponses() as m:
|
|
||||||
m.get(f"{HA_URL}/api/config/entity_registry", payload=payload)
|
|
||||||
m.get(f"{HA_URL}/api/config/entity_registry", payload=payload)
|
|
||||||
async with make_session(TOKEN) as session:
|
|
||||||
client = HAClient(HA_URL, session, entity_registry_cache_ttl=0.0)
|
|
||||||
await client.get_entity_registry() # fetches
|
|
||||||
await client.get_entity_registry() # TTL=0 → fetches again
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.asyncio
|
|
||||||
async def test_invalidate_registry_cache_forces_refetch():
|
|
||||||
"""invalidate_registry_cache() makes the next call hit the network."""
|
|
||||||
payload = [{"entity_id": "light.hall", "platform": "zha", "area_id": ""}]
|
|
||||||
with aioresponses() as m:
|
|
||||||
m.get(f"{HA_URL}/api/config/entity_registry", payload=payload)
|
|
||||||
m.get(f"{HA_URL}/api/config/entity_registry", payload=payload)
|
|
||||||
async with make_session(TOKEN) as session:
|
|
||||||
client = HAClient(HA_URL, session, entity_registry_cache_ttl=300.0)
|
|
||||||
await client.get_entity_registry()
|
|
||||||
client.invalidate_registry_cache()
|
|
||||||
await client.get_entity_registry() # must hit network again
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.asyncio
|
|
||||||
async def test_entity_registry_cache_default_ttl_is_300():
|
|
||||||
async with make_session(TOKEN) as session:
|
|
||||||
client = HAClient(HA_URL, session)
|
|
||||||
assert client._registry_cache_ttl == 300.0
|
|
||||||
|
|
@@ -1,62 +0,0 @@
|
||||||
"""Tests for HeartbeatCheck."""
|
|
||||||
from __future__ import annotations
|
|
||||||
|
|
||||||
from unittest.mock import AsyncMock, MagicMock
|
|
||||||
|
|
||||||
import pytest
|
|
||||||
|
|
||||||
from ha_diag.checks.heartbeat import HeartbeatCheck
|
|
||||||
from ha_diag.models import HAEventType, Severity
|
|
||||||
|
|
||||||
|
|
||||||
def _make_client(api_status=None, side_effect=None):
|
|
||||||
client = MagicMock()
|
|
||||||
if side_effect:
|
|
||||||
client.get_api_status = AsyncMock(side_effect=side_effect)
|
|
||||||
else:
|
|
||||||
client.get_api_status = AsyncMock(return_value=api_status)
|
|
||||||
return client
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.asyncio
|
|
||||||
async def test_heartbeat_ok_returns_empty_list():
|
|
||||||
client = _make_client(api_status={"message": "API running."})
|
|
||||||
check = HeartbeatCheck(client)
|
|
||||||
results = await check.run()
|
|
||||||
assert results == []
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.asyncio
|
|
||||||
async def test_heartbeat_connection_error():
|
|
||||||
client = _make_client(side_effect=ConnectionError("refused"))
|
|
||||||
check = HeartbeatCheck(client)
|
|
||||||
results = await check.run()
|
|
||||||
assert len(results) == 1
|
|
||||||
assert results[0].healthy is False
|
|
||||||
assert results[0].event_type == HAEventType.ha_websocket_dead
|
|
||||||
assert results[0].severity == Severity.error
|
|
||||||
assert "refused" in results[0].message
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.asyncio
|
|
||||||
async def test_heartbeat_unexpected_response():
|
|
||||||
client = _make_client(api_status={"unexpected": "key"})
|
|
||||||
check = HeartbeatCheck(client)
|
|
||||||
results = await check.run()
|
|
||||||
assert len(results) == 1
|
|
||||||
assert results[0].event_type == HAEventType.ha_websocket_dead
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.asyncio
|
|
||||||
async def test_heartbeat_timeout():
|
|
||||||
client = _make_client(side_effect=TimeoutError("timed out"))
|
|
||||||
check = HeartbeatCheck(client)
|
|
||||||
results = await check.run()
|
|
||||||
assert len(results) == 1
|
|
||||||
assert results[0].event_type == HAEventType.ha_websocket_dead
|
|
||||||
assert "timed out" in results[0].message
|
|
||||||
|
|
||||||
|
|
||||||
def test_heartbeat_check_name():
|
|
||||||
check = HeartbeatCheck(MagicMock())
|
|
||||||
assert check.name == "heartbeat"
|
|
||||||
|
|
@@ -1,221 +0,0 @@
|
||||||
"""Unit tests for SystemHealthCheck."""
|
|
||||||
from __future__ import annotations
|
|
||||||
|
|
||||||
from pathlib import Path
|
|
||||||
from unittest.mock import AsyncMock, MagicMock
|
|
||||||
|
|
||||||
import pytest
|
|
||||||
|
|
||||||
from ha_diag.checks.system_health import SystemHealthCheck, _extract_component_statuses
|
|
||||||
from ha_diag.config import Settings
|
|
||||||
from ha_diag.models import HAEventType, Severity
|
|
||||||
from ha_diag.storage import Storage
|
|
||||||
|
|
||||||
|
|
||||||
# ---------------------------------------------------------------------------
|
|
||||||
# Helpers
|
|
||||||
# ---------------------------------------------------------------------------
|
|
||||||
|
|
||||||
|
|
||||||
def _make_settings(**overrides) -> Settings:
|
|
||||||
defaults: dict = {
|
|
||||||
"ha_url": "http://test.local:8123",
|
|
||||||
"ha_token": "test",
|
|
||||||
"node_name": "test-node",
|
|
||||||
"location_tag": "test-loc",
|
|
||||||
"alert_cooldown_hours": 0.0,
|
|
||||||
"check_interval": 60,
|
|
||||||
"check_interval_unavailable": 3600,
|
|
||||||
}
|
|
||||||
defaults.update(overrides)
|
|
||||||
return Settings(**defaults)
|
|
||||||
|
|
||||||
|
|
||||||
def _make_client(health=None, error=None):
|
|
||||||
client = MagicMock()
|
|
||||||
if error:
|
|
||||||
client.get_system_health = AsyncMock(side_effect=error)
|
|
||||||
else:
|
|
||||||
client.get_system_health = AsyncMock(return_value=health or {})
|
|
||||||
return client
|
|
||||||
|
|
||||||
|
|
||||||
def _ok_response(*components: str) -> dict:
|
|
||||||
return {c: {"type": "result", "data": {"ok": True}} for c in components}
|
|
||||||
|
|
||||||
|
|
||||||
def _error_response(*components: str) -> dict:
|
|
||||||
return {c: {"type": "error", "error": f"{c} failed"} for c in components}
|
|
||||||
|
|
||||||
|
|
||||||
# ---------------------------------------------------------------------------
|
|
||||||
# _extract_component_statuses unit tests
|
|
||||||
# ---------------------------------------------------------------------------
|
|
||||||
|
|
||||||
|
|
||||||
def test_extract_typed_result_format():
|
|
||||||
data = {"recorder": {"type": "result", "data": {"backlog": 0}}}
|
|
||||||
result = _extract_component_statuses(data)
|
|
||||||
assert result["recorder"]["status"] == "ok"
|
|
||||||
assert result["recorder"]["details"] == {"backlog": 0}
|
|
||||||
|
|
||||||
|
|
||||||
def test_extract_typed_error_format():
|
|
||||||
data = {"cloud": {"type": "error", "error": "Connection refused"}}
|
|
||||||
result = _extract_component_statuses(data)
|
|
||||||
assert result["cloud"]["status"] == "error"
|
|
||||||
assert "Connection refused" in result["cloud"]["details"]["error"]
|
|
||||||
|
|
||||||
|
|
||||||
def test_extract_legacy_error_field():
|
|
||||||
data = {"cloud": {"error": "Timeout"}}
|
|
||||||
result = _extract_component_statuses(data)
|
|
||||||
assert result["cloud"]["status"] == "error"
|
|
||||||
|
|
||||||
|
|
||||||
def test_extract_nested_checks_format():
|
|
||||||
data = {
|
|
||||||
"info": {"version": "2024.12.0"},
|
|
||||||
"checks": {
|
|
||||||
"homeassistant": {"type": "result", "data": {}},
|
|
||||||
"recorder": {"type": "error", "error": "DB locked"},
|
|
||||||
},
|
|
||||||
}
|
|
||||||
result = _extract_component_statuses(data)
|
|
||||||
assert "homeassistant" not in result or result.get("homeassistant", {}).get("status") == "ok"
|
|
||||||
assert result["recorder"]["status"] == "error"
|
|
||||||
assert "info" not in result
|
|
||||||
|
|
||||||
|
|
||||||
def test_extract_plain_dict_treated_as_ok():
|
|
||||||
data = {"homeassistant": {"version": "2024.12.0", "docker": True}}
|
|
||||||
result = _extract_component_statuses(data)
|
|
||||||
assert result["homeassistant"]["status"] == "ok"
|
|
||||||
|
|
||||||
|
|
||||||
def test_extract_non_dict_value_skipped():
|
|
||||||
data = {"scalar_component": "just-a-string"}
|
|
||||||
result = _extract_component_statuses(data)
|
|
||||||
assert "scalar_component" not in result
|
|
||||||
|
|
||||||
|
|
||||||
# ---------------------------------------------------------------------------
|
|
||||||
# SystemHealthCheck run() tests
|
|
||||||
# ---------------------------------------------------------------------------
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.asyncio
|
|
||||||
async def test_first_run_no_snapshot_no_event_for_ok(storage: Storage):
|
|
||||||
"""All components ok on first run — record snapshots, emit nothing."""
|
|
||||||
check = SystemHealthCheck(_make_client(_ok_response("homeassistant", "recorder")),
|
|
||||||
storage, _make_settings())
|
|
||||||
results = await check.run()
|
|
||||||
assert results == []
|
|
||||||
snap = await storage.get_system_health_snapshot("homeassistant")
|
|
||||||
assert snap is not None
|
|
||||||
assert snap["last_status"] == "ok"
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.asyncio
|
|
||||||
async def test_first_run_error_component_emits_event(storage: Storage):
|
|
||||||
"""Component in error on first run (no prior snapshot) → ha_system_health_degraded."""
|
|
||||||
check = SystemHealthCheck(_make_client(_error_response("cloud")), storage, _make_settings())
|
|
||||||
results = await check.run()
|
|
||||||
assert len(results) == 1
|
|
||||||
r = results[0]
|
|
||||||
assert r.event_type == HAEventType.ha_system_health_degraded
|
|
||||||
assert r.payload["component"] == "cloud"
|
|
||||||
assert r.payload["previous_status"] == "unknown"
|
|
||||||
assert r.payload["current_status"] == "error"
|
|
||||||
assert r.severity == Severity.warning
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.asyncio
|
|
||||||
async def test_ok_to_error_transition_emits_event(storage: Storage):
|
|
||||||
"""Component transitions ok → error → event fired."""
|
|
||||||
client_ok = _make_client(_ok_response("cloud"))
|
|
||||||
client_err = _make_client(_error_response("cloud"))
|
|
||||||
settings = _make_settings()
|
|
||||||
|
|
||||||
await SystemHealthCheck(client_ok, storage, settings).run()
|
|
||||||
results = await SystemHealthCheck(client_err, storage, settings).run()
|
|
||||||
|
|
||||||
assert len(results) == 1
|
|
||||||
assert results[0].payload["previous_status"] == "ok"
|
|
||||||
assert results[0].payload["current_status"] == "error"
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.asyncio
|
|
||||||
async def test_sustained_error_no_duplicate_event(storage: Storage):
|
|
||||||
"""Component stays in error across multiple runs — only first run emits."""
|
|
||||||
client_ok = _make_client(_ok_response("cloud"))
|
|
||||||
client_err = _make_client(_error_response("cloud"))
|
|
||||||
settings = _make_settings()
|
|
||||||
|
|
||||||
await SystemHealthCheck(client_ok, storage, settings).run()
|
|
||||||
results1 = await SystemHealthCheck(client_err, storage, settings).run()
|
|
||||||
results2 = await SystemHealthCheck(client_err, storage, settings).run()
|
|
||||||
results3 = await SystemHealthCheck(client_err, storage, settings).run()
|
|
||||||
|
|
||||||
assert len(results1) == 1 # transition fires
|
|
||||||
assert results2 == []
|
|
||||||
assert results3 == []
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.asyncio
|
|
||||||
async def test_recovery_clears_alert_and_next_degradation_re_fires(storage: Storage):
|
|
||||||
"""error → ok → error: second degradation fires a new event."""
|
|
||||||
settings = _make_settings()
|
|
||||||
|
|
||||||
# First degradation
|
|
||||||
await SystemHealthCheck(_make_client(_ok_response("cloud")), storage, settings).run()
|
|
||||||
r1 = await SystemHealthCheck(_make_client(_error_response("cloud")), storage, settings).run()
|
|
||||||
assert len(r1) == 1
|
|
||||||
|
|
||||||
# Recovery
|
|
||||||
r2 = await SystemHealthCheck(_make_client(_ok_response("cloud")), storage, settings).run()
|
|
||||||
assert r2 == []
|
|
||||||
|
|
||||||
# Second degradation
|
|
||||||
r3 = await SystemHealthCheck(_make_client(_error_response("cloud")), storage, settings).run()
|
|
||||||
assert len(r3) == 1
|
|
||||||
assert r3[0].payload["previous_status"] == "ok"
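
Review note: the four transition tests above (first run, ok to error, sustained error, recovery then re-degradation) all pin down the same rule. An event fires only when a component enters the error state, and the stored snapshot is updated on every run. The sketch below restates that rule in isolation. It is a minimal illustration, not the `SystemHealthCheck` implementation: the function name `health_transitions` and the plain-dict snapshot are hypothetical, and the "unknown" default is taken from the `previous_status` assertion in the first-run test.

```python
def health_transitions(snapshot: dict, current: dict) -> list:
    """Return degradation events and update the snapshot in place.

    `snapshot` maps component -> last seen status (hypothetical stand-in
    for the storage layer); `current` maps component -> status now.
    """
    events = []
    for component, status in current.items():
        # A component never seen before is treated as "unknown".
        previous = snapshot.get(component, "unknown")
        # Fire only on the edge into "error", never while it stays there.
        if status == "error" and previous != "error":
            events.append(
                {
                    "component": component,
                    "previous_status": previous,
                    "current_status": status,
                }
            )
        # Always record the latest status so recovery re-arms the alert.
        snapshot[component] = status
    return events
```

Because the snapshot is written on every run, a recovery to "ok" needs no explicit "clear alert" step in this sketch: the next degradation sees `previous_status == "ok"` and fires again.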
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.asyncio
|
|
||||||
async def test_multiple_degraded_components_multiple_events(storage: Storage):
|
|
||||||
health = {**_error_response("cloud", "recorder"), **_ok_response("homeassistant")}
|
|
||||||
check = SystemHealthCheck(_make_client(health), storage, _make_settings())
|
|
||||||
results = await check.run()
|
|
||||||
components = {r.payload["component"] for r in results}
|
|
||||||
assert components == {"cloud", "recorder"}
|
|
||||||
assert all(r.event_type == HAEventType.ha_system_health_degraded for r in results)
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.asyncio
|
|
||||||
async def test_api_error_returns_empty(storage: Storage):
|
|
||||||
"""If /api/system_health is unreachable, return no results (not an error event)."""
|
|
||||||
check = SystemHealthCheck(
|
|
||||||
_make_client(error=Exception("timeout")), storage, _make_settings()
|
|
||||||
)
|
|
||||||
results = await check.run()
|
|
||||||
assert results == []
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.asyncio
|
|
||||||
async def test_payload_contains_details(storage: Storage):
|
|
||||||
health = {"recorder": {"type": "error", "error": "DB write lag 5000ms"}}
|
|
||||||
check = SystemHealthCheck(_make_client(health), storage, _make_settings())
|
|
||||||
results = await check.run()
|
|
||||||
assert len(results) == 1
|
|
||||||
assert "DB write lag" in results[0].payload["details"]["error"]
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.asyncio
|
|
||||||
async def test_snapshot_updated_after_recovery(storage: Storage):
|
|
||||||
"""After a recovery cycle, snapshot shows last_status='ok'."""
|
|
||||||
settings = _make_settings()
|
|
||||||
await SystemHealthCheck(_make_client(_error_response("cloud")), storage, settings).run()
|
|
||||||
await SystemHealthCheck(_make_client(_ok_response("cloud")), storage, settings).run()
|
|
||||||
snap = await storage.get_system_health_snapshot("cloud")
|
|
||||||
assert snap["last_status"] == "ok"
|
|
||||||
|
|
@ -1,493 +0,0 @@
|
||||||
"""Unit tests for UnavailableEntitiesCheck."""
|
|
||||||
from __future__ import annotations
|
|
||||||
|
|
||||||
import time
|
|
||||||
from pathlib import Path
|
|
||||||
from unittest.mock import AsyncMock, MagicMock
|
|
||||||
|
|
||||||
import pytest
|
|
||||||
|
|
||||||
from ha_diag.checks.unavailable_entities import UnavailableEntitiesCheck
|
|
||||||
from ha_diag.config import Settings
|
|
||||||
from ha_diag.models import HAEventType
|
|
||||||
from ha_diag.storage import Storage
|
|
||||||
|
|
||||||
|
|
||||||
# ---------------------------------------------------------------------------
|
|
||||||
# Helpers
|
|
||||||
# ---------------------------------------------------------------------------
|
|
||||||
|
|
||||||
|
|
||||||
def _make_settings(**overrides) -> Settings:
|
|
||||||
"""Settings with safe test defaults (alert immediately, no cooldown)."""
|
|
||||||
defaults: dict = {
|
|
||||||
"ha_url": "http://test.local:8123",
|
|
||||||
"ha_token": "test",
|
|
||||||
"node_name": "test-node",
|
|
||||||
"location_tag": "test-loc",
|
|
||||||
"unavailable_threshold_hours": 0.0, # alert immediately
|
|
||||||
"integration_failure_threshold_pct": 0.5,
|
|
||||||
"integration_failure_min_entities": 3,
|
|
||||||
"alert_cooldown_hours": 0.0, # no dedup window in most tests
|
|
||||||
"check_interval": 60,
|
|
||||||
"check_interval_unavailable": 3600,
|
|
||||||
}
|
|
||||||
defaults.update(overrides)
|
|
||||||
return Settings(**defaults)
|
|
||||||
|
|
||||||
|
|
||||||
def _make_state(entity_id: str, state: str = "on") -> dict:
|
|
||||||
return {"entity_id": entity_id, "state": state, "attributes": {}}
|
|
||||||
|
|
||||||
|
|
||||||
def _make_registry_entry(entity_id: str, platform: str, area_id: str = "") -> dict:
|
|
||||||
return {"entity_id": entity_id, "platform": platform, "area_id": area_id}
|
|
||||||
|
|
||||||
|
|
||||||
def _make_client(states=None, registry=None, states_error=None):
|
|
||||||
client = MagicMock()
|
|
||||||
if states_error:
|
|
||||||
client.get_states = AsyncMock(side_effect=states_error)
|
|
||||||
else:
|
|
||||||
client.get_states = AsyncMock(return_value=states or [])
|
|
||||||
client.get_entity_registry = AsyncMock(return_value=registry or [])
|
|
||||||
return client
|
|
||||||
|
|
||||||
|
|
||||||
# ---------------------------------------------------------------------------
|
|
||||||
# Basic unavailability detection
|
|
||||||
# ---------------------------------------------------------------------------
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.asyncio
|
|
||||||
async def test_no_unavailable_entities_returns_empty(storage: Storage):
|
|
||||||
states = [_make_state("light.a", "on"), _make_state("sensor.b", "off")]
|
|
||||||
check = UnavailableEntitiesCheck(_make_client(states), storage, _make_settings())
|
|
||||||
assert await check.run() == []
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.asyncio
|
|
||||||
async def test_first_cycle_records_baseline_no_event(storage: Storage):
|
|
||||||
"""First observation of unavailable entity: record, don't alert yet."""
|
|
||||||
states = [_make_state("light.kitchen", "unavailable")]
|
|
||||||
settings = _make_settings(unavailable_threshold_hours=1.0) # needs 1h before alert
|
|
||||||
check = UnavailableEntitiesCheck(_make_client(states), storage, settings)
|
|
||||||
results = await check.run()
|
|
||||||
assert results == []
|
|
||||||
# Baseline should be recorded
|
|
||||||
first_at = await storage.get_entity_first_unavailable_at("light.kitchen")
|
|
||||||
assert first_at is not None
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.asyncio
|
|
||||||
async def test_unavailable_below_threshold_no_event(storage: Storage):
|
|
||||||
states = [_make_state("light.kitchen", "unavailable")]
|
|
||||||
settings = _make_settings(unavailable_threshold_hours=24.0)
|
|
||||||
check = UnavailableEntitiesCheck(_make_client(states), storage, settings)
|
|
||||||
|
|
||||||
# Seed the baseline as if entity just became unavailable
|
|
||||||
await storage.set_entity_unavailable_since("light.kitchen", "unavailable", time.time())
|
|
||||||
results = await check.run()
|
|
||||||
assert results == []
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.asyncio
|
|
||||||
async def test_unavailable_above_threshold_emits_event(storage: Storage):
|
|
||||||
states = [_make_state("light.kitchen", "unavailable")]
|
|
||||||
check = UnavailableEntitiesCheck(
|
|
||||||
_make_client(states), storage, _make_settings()
|
|
||||||
)
|
|
||||||
# Seed baseline as if 25h ago
|
|
||||||
await storage.set_entity_unavailable_since(
|
|
||||||
"light.kitchen", "unavailable", time.time() - 25 * 3600
|
|
||||||
)
|
|
||||||
results = await check.run()
|
|
||||||
assert len(results) == 1
|
|
||||||
assert results[0].event_type == HAEventType.ha_entity_unavailable_long
|
|
||||||
assert results[0].payload["entity_id"] == "light.kitchen"
|
|
||||||
assert results[0].payload["duration_hours"] == pytest.approx(25.0, abs=0.1)
|
|
||||||
assert results[0].payload["domain"] == "light"
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.asyncio
|
|
||||||
async def test_unknown_state_treated_as_unavailable(storage: Storage):
|
|
||||||
states = [_make_state("sensor.temp", "unknown")]
|
|
||||||
await storage.set_entity_unavailable_since(
|
|
||||||
"sensor.temp", "unknown", time.time() - 25 * 3600
|
|
||||||
)
|
|
||||||
check = UnavailableEntitiesCheck(
|
|
||||||
_make_client(states), storage, _make_settings()
|
|
||||||
)
|
|
||||||
results = await check.run()
|
|
||||||
assert len(results) == 1
|
|
||||||
assert results[0].payload["state"] == "unknown"
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.asyncio
|
|
||||||
async def test_payload_contains_since_timestamp(storage: Storage):
|
|
||||||
first_at = time.time() - 27 * 3600
|
|
||||||
await storage.set_entity_unavailable_since("light.k", "unavailable", first_at)
|
|
||||||
states = [_make_state("light.k", "unavailable")]
|
|
||||||
check = UnavailableEntitiesCheck(
|
|
||||||
_make_client(states), storage, _make_settings()
|
|
||||||
)
|
|
||||||
results = await check.run()
|
|
||||||
assert len(results) == 1
|
|
||||||
assert "since" in results[0].payload
|
|
||||||
assert "Z" in results[0].payload["since"] # ISO UTC timestamp
|
|
||||||
|
|
||||||
|
|
||||||
# ---------------------------------------------------------------------------
|
|
||||||
# Recovery
|
|
||||||
# ---------------------------------------------------------------------------
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.asyncio
|
|
||||||
async def test_recovery_clears_baseline(storage: Storage):
|
|
||||||
await storage.set_entity_unavailable_since("light.k", "unavailable", time.time())
|
|
||||||
# Entity is now back online
|
|
||||||
states = [_make_state("light.k", "on")]
|
|
||||||
check = UnavailableEntitiesCheck(
|
|
||||||
_make_client(states), storage, _make_settings()
|
|
||||||
)
|
|
||||||
await check.run()
|
|
||||||
assert await storage.get_entity_first_unavailable_at("light.k") is None
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.asyncio
|
|
||||||
async def test_recovery_clears_alert_dedup(storage: Storage):
|
|
||||||
await storage.set_entity_unavailable_since(
|
|
||||||
"light.k", "unavailable", time.time() - 25 * 3600
|
|
||||||
)
|
|
||||||
await storage.mark_alert_sent("entity_unavailable:light.k")
|
|
||||||
# Entity recovers
|
|
||||||
states = [_make_state("light.k", "on")]
|
|
||||||
check = UnavailableEntitiesCheck(
|
|
||||||
_make_client(states), storage, _make_settings()
|
|
||||||
)
|
|
||||||
await check.run()
|
|
||||||
# Alert dedup should be gone
|
|
||||||
assert not await storage.was_alert_sent("entity_unavailable:light.k", 9999)
|
|
||||||
|
|
||||||
|
|
||||||
# ---------------------------------------------------------------------------
|
|
||||||
# Alert cooldown / deduplication
|
|
||||||
# ---------------------------------------------------------------------------
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.asyncio
|
|
||||||
async def test_cooldown_prevents_duplicate_event(storage: Storage):
|
|
||||||
await storage.set_entity_unavailable_since(
|
|
||||||
"light.k", "unavailable", time.time() - 25 * 3600
|
|
||||||
)
|
|
||||||
settings = _make_settings(alert_cooldown_hours=6.0)
|
|
||||||
states = [_make_state("light.k", "unavailable")]
|
|
||||||
|
|
||||||
check = UnavailableEntitiesCheck(_make_client(states), storage, settings)
|
|
||||||
|
|
||||||
results1 = await check.run()
|
|
||||||
assert len(results1) == 1 # first alert fires
|
|
||||||
|
|
||||||
results2 = await check.run()
|
|
||||||
assert results2 == [] # cooldown active
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.asyncio
|
|
||||||
async def test_no_cooldown_allows_repeat_event(storage: Storage):
|
|
||||||
await storage.set_entity_unavailable_since(
|
|
||||||
"light.k", "unavailable", time.time() - 25 * 3600
|
|
||||||
)
|
|
||||||
settings = _make_settings(alert_cooldown_hours=0.0)
|
|
||||||
states = [_make_state("light.k", "unavailable")]
|
|
||||||
|
|
||||||
check = UnavailableEntitiesCheck(_make_client(states), storage, settings)
|
|
||||||
results1 = await check.run()
|
|
||||||
results2 = await check.run()
|
|
||||||
assert len(results1) == 1
|
|
||||||
assert len(results2) == 1
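
Review note: the cooldown and recovery tests above imply a small contract for the alert dedup store. A marked key suppresses repeats inside the cooldown window, a zero-hour window suppresses nothing, and clearing the key on recovery re-arms the alert. A minimal in-memory sketch of that contract follows. The class `AlertDedup` and its method names are hypothetical; the real `Storage.mark_alert_sent` / `was_alert_sent` signatures are only partly visible in this diff, and the second argument of `was_alert_sent` is assumed here to be a window in hours.

```python
import time
from typing import Optional


class AlertDedup:
    """In-memory stand-in for the alert dedup table (illustrative only)."""

    def __init__(self) -> None:
        self._sent_at = {}  # alert key -> unix timestamp of last alert

    def mark_sent(self, key: str, now: Optional[float] = None) -> None:
        self._sent_at[key] = time.time() if now is None else now

    def was_sent(self, key: str, cooldown_hours: float, now: Optional[float] = None) -> bool:
        """True if an alert for `key` went out within the cooldown window."""
        sent_at = self._sent_at.get(key)
        if sent_at is None:
            return False
        now = time.time() if now is None else now
        # Strict "<" means a 0-hour cooldown never suppresses anything.
        return (now - sent_at) < cooldown_hours * 3600

    def clear(self, key: str) -> None:
        """Forget the key, e.g. when the entity recovers."""
        self._sent_at.pop(key, None)
```

The strict comparison is the design choice that makes `alert_cooldown_hours=0.0` behave as "no dedup", which is what `test_no_cooldown_allows_repeat_event` relies on.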
|
|
||||||
|
|
||||||
|
|
||||||
# ---------------------------------------------------------------------------
|
|
||||||
# Integration root-cause grouping
|
|
||||||
# ---------------------------------------------------------------------------
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.asyncio
|
|
||||||
async def test_integration_failure_emits_single_event(storage: Storage):
|
|
||||||
"""5/8 entities from zha unavailable → ha_integration_failed, not 5 entity events."""
|
|
||||||
zha_entities = [f"light.zha_{i}" for i in range(8)]
|
|
||||||
states = [
|
|
||||||
_make_state(eid, "unavailable" if i < 5 else "on")
|
|
||||||
for i, eid in enumerate(zha_entities)
|
|
||||||
]
|
|
||||||
registry = [_make_registry_entry(eid, "zha") for eid in zha_entities]
|
|
||||||
|
|
||||||
# Seed baselines for unavailable entities as 25h ago
|
|
||||||
for eid in zha_entities[:5]:
|
|
||||||
await storage.set_entity_unavailable_since(eid, "unavailable", time.time() - 25 * 3600)
|
|
||||||
|
|
||||||
settings = _make_settings(
|
|
||||||
integration_failure_threshold_pct=0.5,
|
|
||||||
integration_failure_min_entities=3,
|
|
||||||
)
|
|
||||||
check = UnavailableEntitiesCheck(
|
|
||||||
_make_client(states, registry), storage, settings
|
|
||||||
)
|
|
||||||
results = await check.run()
|
|
||||||
|
|
||||||
assert len(results) == 1
|
|
||||||
assert results[0].event_type == HAEventType.ha_integration_failed
|
|
||||||
assert results[0].payload["integration"] == "zha"
|
|
||||||
assert results[0].payload["unavailable_count"] == 5
|
|
||||||
assert results[0].payload["total_count"] == 8
|
|
||||||
assert set(results[0].payload["affected_entities"]) == set(zha_entities[:5])
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.asyncio
|
|
||||||
async def test_integration_failure_below_pct_threshold(storage: Storage):
|
|
||||||
"""2/8 entities from zha unavailable (25%) → per-entity events, not integration event."""
|
|
||||||
zha_entities = [f"light.zha_{i}" for i in range(8)]
|
|
||||||
states = [
|
|
||||||
_make_state(eid, "unavailable" if i < 2 else "on")
|
|
||||||
for i, eid in enumerate(zha_entities)
|
|
||||||
]
|
|
||||||
registry = [_make_registry_entry(eid, "zha") for eid in zha_entities]
|
|
||||||
|
|
||||||
for eid in zha_entities[:2]:
|
|
||||||
await storage.set_entity_unavailable_since(eid, "unavailable", time.time() - 25 * 3600)
|
|
||||||
|
|
||||||
settings = _make_settings(
|
|
||||||
integration_failure_threshold_pct=0.5,
|
|
||||||
integration_failure_min_entities=3,
|
|
||||||
)
|
|
||||||
check = UnavailableEntitiesCheck(
|
|
||||||
_make_client(states, registry), storage, settings
|
|
||||||
)
|
|
||||||
results = await check.run()
|
|
||||||
|
|
||||||
# Below both thresholds (2/8 = 25% < 50%, and 2 < 3), so individual events are emitted
|
|
||||||
assert all(r.event_type == HAEventType.ha_entity_unavailable_long for r in results)
|
|
||||||
assert len(results) == 2
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.asyncio
|
|
||||||
async def test_integration_failure_below_count_threshold(storage: Storage):
|
|
||||||
"""3/6 entities unavailable (50%) but min_entities=5 → per-entity events."""
|
|
||||||
zha_entities = [f"light.zha_{i}" for i in range(6)]
|
|
||||||
states = [
|
|
||||||
_make_state(eid, "unavailable" if i < 3 else "on")
|
|
||||||
for i, eid in enumerate(zha_entities)
|
|
||||||
]
|
|
||||||
registry = [_make_registry_entry(eid, "zha") for eid in zha_entities]
|
|
||||||
for eid in zha_entities[:3]:
|
|
||||||
await storage.set_entity_unavailable_since(eid, "unavailable", time.time() - 25 * 3600)
|
|
||||||
|
|
||||||
settings = _make_settings(
|
|
||||||
integration_failure_threshold_pct=0.5,
|
|
||||||
integration_failure_min_entities=5, # need 5, only have 3
|
|
||||||
)
|
|
||||||
check = UnavailableEntitiesCheck(
|
|
||||||
_make_client(states, registry), storage, settings
|
|
||||||
)
|
|
||||||
results = await check.run()
|
|
||||||
assert len(results) == 3  # guard: all() alone passes vacuously on an empty list
assert all(r.event_type == HAEventType.ha_entity_unavailable_long for r in results)
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.asyncio
|
|
||||||
async def test_entity_without_integration_gets_individual_event(storage: Storage):
|
|
||||||
"""Entity not in entity registry gets per-entity event regardless of integration grouping."""
|
|
||||||
await storage.set_entity_unavailable_since(
|
|
||||||
"light.mystery", "unavailable", time.time() - 25 * 3600
|
|
||||||
)
|
|
||||||
states = [_make_state("light.mystery", "unavailable")]
|
|
||||||
# Empty registry — no integration info
|
|
||||||
check = UnavailableEntitiesCheck(
|
|
||||||
_make_client(states, []), storage, _make_settings()
|
|
||||||
)
|
|
||||||
results = await check.run()
|
|
||||||
assert len(results) == 1
|
|
||||||
assert results[0].event_type == HAEventType.ha_entity_unavailable_long
|
|
||||||
assert "integration" not in results[0].payload
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.asyncio
|
|
||||||
async def test_mixed_integrations_correctly_partitioned(storage: Storage):
|
|
||||||
"""5 zha entities unavailable (triggers integration event) + 1 mqtt entity (individual)."""
|
|
||||||
zha_entities = [f"light.zha_{i}" for i in range(8)]
|
|
||||||
mqtt_entity = "sensor.mqtt_temp"
|
|
||||||
all_entities = zha_entities + [mqtt_entity]
|
|
||||||
states = (
|
|
||||||
[_make_state(eid, "unavailable" if i < 5 else "on") for i, eid in enumerate(zha_entities)]
|
|
||||||
+ [_make_state(mqtt_entity, "unavailable")]
|
|
||||||
)
|
|
||||||
registry = (
|
|
||||||
[_make_registry_entry(eid, "zha") for eid in zha_entities]
|
|
||||||
+ [_make_registry_entry(mqtt_entity, "mqtt")]
|
|
||||||
)
|
|
||||||
for eid in zha_entities[:5]:
|
|
||||||
await storage.set_entity_unavailable_since(eid, "unavailable", time.time() - 25 * 3600)
|
|
||||||
await storage.set_entity_unavailable_since(mqtt_entity, "unavailable", time.time() - 25 * 3600)
|
|
||||||
|
|
||||||
settings = _make_settings(
|
|
||||||
integration_failure_threshold_pct=0.5,
|
|
||||||
integration_failure_min_entities=3,
|
|
||||||
)
|
|
||||||
check = UnavailableEntitiesCheck(
|
|
||||||
_make_client(states, registry), storage, settings
|
|
||||||
)
|
|
||||||
results = await check.run()
|
|
||||||
|
|
||||||
event_types = {r.event_type for r in results}
|
|
||||||
assert HAEventType.ha_integration_failed in event_types
|
|
||||||
assert HAEventType.ha_entity_unavailable_long in event_types
|
|
||||||
# Exactly 2 events: 1 integration + 1 individual mqtt entity
|
|
||||||
assert len(results) == 2
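
Review note: taken together, the grouping tests above describe a two-threshold rule. An integration is reported as failed only when its unavailable entities reach both a minimum count and a minimum share of that integration's registered entities; everything else, including entities missing from the registry, is reported individually. The sketch below is one reading of that rule, not the code under test. The function `partition_unavailable` is hypothetical, and the use of `>=` for both comparisons is an assumption (the tests cover 5/8, 2/8 and 3/6 with `min_entities=5`, which does not settle the boundary cases).

```python
from collections import Counter, defaultdict


def partition_unavailable(unavailable, platform_of, threshold_pct=0.5, min_entities=3):
    """Split unavailable entity ids into failed integrations and individuals.

    `platform_of` maps every registered entity id to its integration
    (platform); entities absent from it have no integration info.
    Returns (failed, individual): failed maps platform -> affected ids.
    """
    totals = Counter(platform_of.values())  # registered entities per platform
    by_platform = defaultdict(list)
    individual = []

    for entity_id in unavailable:
        platform = platform_of.get(entity_id)
        if platform is None:
            individual.append(entity_id)  # not in registry: always individual
        else:
            by_platform[platform].append(entity_id)

    failed = {}
    for platform, entity_ids in by_platform.items():
        ratio = len(entity_ids) / totals[platform]
        # Assumed boundary semantics: both thresholds are inclusive.
        if len(entity_ids) >= min_entities and ratio >= threshold_pct:
            failed[platform] = entity_ids
        else:
            individual.extend(entity_ids)
    return failed, individual
```

With the mixed scenario above (5 of 8 `zha` entities plus one `mqtt` entity), this yields one failed integration and one individual entity, matching the expected two events.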
|
|
||||||
|
|
||||||
|
|
||||||
# ---------------------------------------------------------------------------
|
|
||||||
# Error handling
|
|
||||||
# ---------------------------------------------------------------------------
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.asyncio
|
|
||||||
async def test_ha_client_error_returns_dead_event(storage: Storage):
|
|
||||||
client = _make_client(states_error=ConnectionError("HA down"))
|
|
||||||
check = UnavailableEntitiesCheck(client, storage, _make_settings())
|
|
||||||
results = await check.run()
|
|
||||||
assert len(results) == 1
|
|
||||||
assert results[0].event_type == HAEventType.ha_websocket_dead
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.asyncio
|
|
||||||
async def test_registry_failure_falls_back_gracefully(storage: Storage):
|
|
||||||
"""Registry endpoint failure → individual entity events without integration info."""
|
|
||||||
states = [_make_state("light.k", "unavailable")]
|
|
||||||
client = _make_client(states)
|
|
||||||
client.get_entity_registry = AsyncMock(side_effect=Exception("registry unavailable"))
|
|
||||||
await storage.set_entity_unavailable_since(
|
|
||||||
"light.k", "unavailable", time.time() - 25 * 3600
|
|
||||||
)
|
|
||||||
check = UnavailableEntitiesCheck(client, storage, _make_settings())
|
|
||||||
results = await check.run()
|
|
||||||
assert len(results) == 1
|
|
||||||
assert results[0].event_type == HAEventType.ha_entity_unavailable_long
|
|
||||||
assert "integration" not in results[0].payload
|
|
||||||
|
|
||||||
|
|
||||||
# ---------------------------------------------------------------------------
|
|
||||||
# Area / integration in payload
|
|
||||||
# ---------------------------------------------------------------------------
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.asyncio
|
|
||||||
async def test_area_included_in_payload_when_known(storage: Storage):
|
|
||||||
await storage.set_entity_unavailable_since(
|
|
||||||
"light.hall", "unavailable", time.time() - 25 * 3600
|
|
||||||
)
|
|
||||||
states = [_make_state("light.hall", "unavailable")]
|
|
||||||
registry = [_make_registry_entry("light.hall", "zha", "hallway")]
|
|
||||||
check = UnavailableEntitiesCheck(
|
|
||||||
_make_client(states, registry), storage, _make_settings()
|
|
||||||
)
|
|
||||||
results = await check.run()
|
|
||||||
assert len(results) == 1
|
|
||||||
assert results[0].payload.get("area") == "hallway"
|
|
||||||
assert results[0].payload.get("integration") == "zha"
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.asyncio
|
|
||||||
async def test_area_omitted_when_unknown(storage: Storage):
|
|
||||||
await storage.set_entity_unavailable_since(
|
|
||||||
"light.k", "unavailable", time.time() - 25 * 3600
|
|
||||||
)
|
|
||||||
states = [_make_state("light.k", "unavailable")]
|
|
||||||
registry = [_make_registry_entry("light.k", "zha", "")]
|
|
||||||
check = UnavailableEntitiesCheck(
|
|
||||||
_make_client(states, registry), storage, _make_settings()
|
|
||||||
)
|
|
||||||
results = await check.run()
|
|
||||||
assert "area" not in results[0].payload
|
|
||||||
|
|
||||||
|
|
||||||
# ---------------------------------------------------------------------------
|
|
||||||
# Phase 3 Flag #1: since = min(last_changed, first_seen)
|
|
||||||
# ---------------------------------------------------------------------------
|
|
||||||
|
|
||||||
|
|
||||||
def _make_state_with_last_changed(
|
|
||||||
entity_id: str, state: str, last_changed_iso: str
|
|
||||||
) -> dict:
|
|
||||||
return {
|
|
||||||
"entity_id": entity_id,
|
|
||||||
"state": state,
|
|
||||||
"attributes": {},
|
|
||||||
"last_changed": last_changed_iso,
|
|
||||||
}
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.asyncio
|
|
||||||
async def test_since_uses_last_changed_when_earlier_than_baseline(storage: Storage):
|
|
||||||
"""Entity's last_changed predates our baseline → duration computed from last_changed."""
|
|
||||||
import datetime as dt
|
|
||||||
|
|
||||||
now = time.time()
|
|
||||||
# Baseline recorded 1h ago (agent just started)
|
|
||||||
await storage.set_entity_unavailable_since("light.k", "unavailable", now - 3600)
|
|
||||||
|
|
||||||
# HA says entity changed to unavailable 48h ago
|
|
||||||
lc_iso = (
|
|
||||||
dt.datetime.fromtimestamp(now - 48 * 3600, tz=dt.timezone.utc)
|
|
||||||
.isoformat()
|
|
||||||
.replace("+00:00", "Z")
|
|
||||||
)
|
|
||||||
states = [_make_state_with_last_changed("light.k", "unavailable", lc_iso)]
|
|
||||||
check = UnavailableEntitiesCheck(
|
|
||||||
_make_client(states), storage, _make_settings(unavailable_threshold_hours=0.0)
|
|
||||||
)
|
|
||||||
results = await check.run()
|
|
||||||
|
|
||||||
assert len(results) == 1
|
|
||||||
# Duration should be ~48h, not ~1h
|
|
||||||
assert results[0].payload["duration_hours"] == pytest.approx(48.0, abs=0.1)
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.asyncio
|
|
||||||
async def test_since_ignores_last_changed_when_later_than_baseline(storage: Storage):
|
|
||||||
"""Baseline predates last_changed → use baseline (entity was unavailable before
|
|
||||||
last_changed, e.g. if HA reports last_changed as now for some reason)."""
|
|
||||||
import datetime as dt
|
|
||||||
|
|
||||||
now = time.time()
|
|
||||||
# Baseline recorded 48h ago
|
|
||||||
await storage.set_entity_unavailable_since("light.k", "unavailable", now - 48 * 3600)
|
|
||||||
|
|
||||||
# HA says last_changed is only 2h ago (shouldn't override the older baseline)
|
|
||||||
lc_iso = (
|
|
||||||
dt.datetime.fromtimestamp(now - 2 * 3600, tz=dt.timezone.utc)
|
|
||||||
.isoformat()
|
|
||||||
.replace("+00:00", "Z")
|
|
||||||
)
|
|
||||||
states = [_make_state_with_last_changed("light.k", "unavailable", lc_iso)]
|
|
||||||
check = UnavailableEntitiesCheck(
|
|
||||||
_make_client(states), storage, _make_settings(unavailable_threshold_hours=0.0)
|
|
||||||
)
|
|
||||||
results = await check.run()
|
|
||||||
|
|
||||||
assert len(results) == 1
|
|
||||||
# Duration should be ~48h (from baseline), not ~2h
|
|
||||||
assert results[0].payload["duration_hours"] == pytest.approx(48.0, abs=0.1)
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.asyncio
|
|
||||||
async def test_since_falls_back_gracefully_when_last_changed_missing(storage: Storage):
|
|
||||||
"""No last_changed in state → uses baseline first_seen without error."""
|
|
||||||
await storage.set_entity_unavailable_since(
|
|
||||||
"light.k", "unavailable", time.time() - 25 * 3600
|
|
||||||
)
|
|
||||||
states = [_make_state("light.k", "unavailable")] # no last_changed key
|
|
||||||
check = UnavailableEntitiesCheck(
|
|
||||||
_make_client(states), storage, _make_settings(unavailable_threshold_hours=0.0)
|
|
||||||
)
|
|
||||||
results = await check.run()
|
|
||||||
assert len(results) == 1
|
|
||||||
assert results[0].event_type == HAEventType.ha_entity_unavailable_long
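
Review note: the three "since" tests above fix the rule named in the section header, `since = min(last_changed, first_seen)`, with a fallback to the stored baseline when `last_changed` is absent. The sketch below shows that computation on its own. The helper name `effective_since` is hypothetical, and treating an unparseable timestamp the same as a missing one is an assumption that the tests do not cover.

```python
from datetime import datetime
from typing import Optional


def effective_since(first_seen: float, last_changed_iso: Optional[str]) -> float:
    """Return the earlier of the stored baseline and HA's last_changed.

    `first_seen` is a unix timestamp; `last_changed_iso` is an ISO 8601
    string such as "2025-01-01T12:00:00Z", or None when HA omits it.
    """
    if not last_changed_iso:
        return first_seen
    try:
        # Older fromisoformat versions reject a trailing "Z", so normalise it.
        parsed = datetime.fromisoformat(last_changed_iso.replace("Z", "+00:00"))
    except ValueError:
        # Assumption: a malformed timestamp falls back to the baseline.
        return first_seen
    return min(first_seen, parsed.timestamp())
```

Taking the minimum means an agent that started an hour ago still reports the full 48 hours when Home Assistant says the entity went unavailable two days earlier, while a `last_changed` that is newer than the baseline cannot shorten the reported duration.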
|
|
||||||
|
|
@ -1,256 +0,0 @@
|
||||||
"""Unit tests for UpdatesAvailableCheck and UpdatesDigestCheck."""
|
|
||||||
from __future__ import annotations
|
|
||||||
|
|
||||||
from pathlib import Path
|
|
||||||
from unittest.mock import AsyncMock, MagicMock
|
|
||||||
|
|
||||||
import pytest
|
|
||||||
|
|
||||||
from ha_diag.checks.updates_available import (
|
|
||||||
UpdatesAvailableCheck,
|
|
||||||
UpdatesDigestCheck,
|
|
||||||
_build_update_payload,
|
|
||||||
)
|
|
||||||
from ha_diag.config import Settings
|
|
||||||
from ha_diag.models import HAEventType
|
|
||||||
from ha_diag.storage import Storage
|
|
||||||
|
|
||||||
|
|
||||||
# ---------------------------------------------------------------------------
|
|
||||||
# Helpers
|
|
||||||
# ---------------------------------------------------------------------------
|
|
||||||
|
|
||||||
|
|
||||||
def _make_settings(**overrides) -> Settings:
|
|
||||||
defaults: dict = {
|
|
||||||
"ha_url": "http://test.local:8123",
|
|
||||||
"ha_token": "test",
|
|
||||||
"node_name": "test-node",
|
|
||||||
"location_tag": "test-loc",
|
|
||||||
"alert_cooldown_hours": 0.0,
|
|
||||||
"updates_cooldown_days": 0, # no dedup in most tests
|
|
||||||
"check_interval": 60,
|
|
||||||
"check_interval_unavailable": 3600,
|
|
||||||
}
|
|
||||||
defaults.update(overrides)
|
|
||||||
return Settings(**defaults)
|
|
||||||
|
|
||||||
|
|
||||||
def _make_client(states=None, error=None):
|
|
||||||
client = MagicMock()
|
|
||||||
if error:
|
|
||||||
client.get_states = AsyncMock(side_effect=error)
|
|
||||||
else:
|
|
||||||
client.get_states = AsyncMock(return_value=states or [])
|
|
||||||
return client
|
|
||||||
|
|
||||||
|
|
||||||
def _update_state(
|
|
||||||
entity_id: str = "update.homeassistant_core",
|
|
||||||
state: str = "on",
|
|
||||||
title: str = "Home Assistant Core",
|
|
||||||
installed: str = "2025.5.0",
|
|
||||||
latest: str = "2025.6.0",
|
|
||||||
release_summary: str | None = None,
|
|
||||||
release_url: str | None = None,
|
|
||||||
) -> dict:
|
|
||||||
attrs: dict = {
|
|
||||||
"title": title,
|
|
||||||
"installed_version": installed,
|
|
||||||
"latest_version": latest,
|
|
||||||
"in_progress": False,
|
|
||||||
"auto_update": False,
|
|
||||||
}
|
|
||||||
if release_summary:
|
|
||||||
attrs["release_summary"] = release_summary
|
|
||||||
if release_url:
|
|
||||||
attrs["release_url"] = release_url
|
|
||||||
return {"entity_id": entity_id, "state": state, "attributes": attrs}
|
|
||||||
|
|
||||||
|
|
||||||
# ---------------------------------------------------------------------------
|
|
||||||
# _build_update_payload helper
|
|
||||||
# ---------------------------------------------------------------------------
|
|
||||||
|
|
||||||
|
|
||||||
def test_build_update_payload_basic():
|
|
||||||
attrs = {"title": "HA Core", "installed_version": "1.0", "latest_version": "2.0"}
|
|
||||||
p = _build_update_payload("update.ha_core", attrs)
|
|
||||||
assert p["entity_id"] == "update.ha_core"
|
|
||||||
assert p["title"] == "HA Core"
|
|
||||||
assert p["installed_version"] == "1.0"
|
|
||||||
assert p["latest_version"] == "2.0"
|
|
||||||
|
|
||||||
|
|
||||||
def test_build_update_payload_release_summary_truncated():
|
|
||||||
long_notes = "x" * 3000
|
|
||||||
attrs = {"release_summary": long_notes}
|
|
||||||
p = _build_update_payload("update.ha_core", attrs)
|
|
||||||
assert len(p["release_summary"]) == 2000
|
|
||||||
|
|
||||||
|
|
||||||
def test_build_update_payload_release_url_omitted_when_absent():
|
|
||||||
p = _build_update_payload("update.ha_core", {})
|
|
||||||
assert "release_url" not in p
|
|
||||||
|
|
||||||
|
|
||||||
def test_build_update_payload_release_url_included_when_present():
|
|
||||||
attrs = {"release_url": "https://github.com/..."}
|
|
||||||
p = _build_update_payload("update.x", attrs)
|
|
||||||
assert p["release_url"] == "https://github.com/..."
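
Review note: the four `_build_update_payload` tests above only constrain a handful of behaviours: the core fields are copied from the entity attributes, `release_summary` is capped at 2000 characters, and `release_url` appears only when present. A minimal sketch consistent with those assertions follows. It is not the helper from `ha_diag.checks.updates_available`; the name `build_update_payload`, the constant `MAX_SUMMARY_CHARS`, and the defaults for missing attributes are all assumptions.

```python
MAX_SUMMARY_CHARS = 2000  # cap asserted by the truncation test above


def build_update_payload(entity_id: str, attrs: dict) -> dict:
    """Build an event payload from an `update.*` entity's attributes."""
    payload = {
        "entity_id": entity_id,
        "title": attrs.get("title"),
        "installed_version": attrs.get("installed_version"),
        "latest_version": attrs.get("latest_version"),
        "in_progress": bool(attrs.get("in_progress", False)),
    }
    summary = attrs.get("release_summary")
    if summary:
        # Plain slice: keeps the first 2000 characters, no ellipsis added.
        payload["release_summary"] = summary[:MAX_SUMMARY_CHARS]
    url = attrs.get("release_url")
    if url:
        # Optional key: omitted entirely rather than set to None.
        payload["release_url"] = url
    return payload
```

Omitting absent optional keys, rather than emitting them as `None`, is what lets a consumer test for `"release_url" in payload`, as the tests above do.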
|
|
||||||
|
|
||||||
|
|
||||||
# ---------------------------------------------------------------------------
|
|
||||||
# UpdatesAvailableCheck (daily individual events)
|
|
||||||
# ---------------------------------------------------------------------------
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.asyncio
|
|
||||||
async def test_no_updates_returns_empty(storage: Storage):
|
|
||||||
states = [{"entity_id": "light.living_room", "state": "on", "attributes": {}}]
|
|
||||||
    check = UpdatesAvailableCheck(_make_client(states), storage, _make_settings())
    assert await check.run() == []


@pytest.mark.asyncio
async def test_update_off_state_not_emitted(storage: Storage):
    states = [_update_state(state="off")]
    check = UpdatesAvailableCheck(_make_client(states), storage, _make_settings())
    assert await check.run() == []


@pytest.mark.asyncio
async def test_single_update_emits_event(storage: Storage):
    states = [_update_state()]
    check = UpdatesAvailableCheck(_make_client(states), storage, _make_settings())
    results = await check.run()
    assert len(results) == 1
    assert results[0].event_type == HAEventType.ha_update_available
    assert "2025.5.0" in results[0].message
    assert "2025.6.0" in results[0].message


@pytest.mark.asyncio
async def test_multiple_updates_emit_multiple_events(storage: Storage):
    states = [
        _update_state("update.ha_core"),
        _update_state("update.mosquitto", title="Mosquitto"),
    ]
    check = UpdatesAvailableCheck(_make_client(states), storage, _make_settings())
    results = await check.run()
    assert len(results) == 2
    assert all(r.event_type == HAEventType.ha_update_available for r in results)


@pytest.mark.asyncio
async def test_cooldown_prevents_same_update_next_day(storage: Storage):
    states = [_update_state()]
    settings = _make_settings(updates_cooldown_days=7)
    check = UpdatesAvailableCheck(_make_client(states), storage, settings)
    r1 = await check.run()
    r2 = await check.run()
    assert len(r1) == 1
    assert r2 == []


@pytest.mark.asyncio
async def test_no_cooldown_allows_repeat(storage: Storage):
    states = [_update_state()]
    check = UpdatesAvailableCheck(_make_client(states), storage, _make_settings(updates_cooldown_days=0))
    r1 = await check.run()
    r2 = await check.run()
    assert len(r1) == 1
    assert len(r2) == 1


@pytest.mark.asyncio
async def test_payload_contains_version_fields(storage: Storage):
    states = [_update_state(installed="2025.5.0", latest="2025.6.0")]
    check = UpdatesAvailableCheck(_make_client(states), storage, _make_settings())
    results = await check.run()
    p = results[0].payload
    assert p["installed_version"] == "2025.5.0"
    assert p["latest_version"] == "2025.6.0"
    assert p["in_progress"] is False


@pytest.mark.asyncio
async def test_ha_error_returns_empty(storage: Storage):
    check = UpdatesAvailableCheck(
        _make_client(error=ConnectionError("HA down")), storage, _make_settings()
    )
    assert await check.run() == []


# ---------------------------------------------------------------------------
# UpdatesDigestCheck (Sunday digest)
# ---------------------------------------------------------------------------


@pytest.mark.asyncio
async def test_digest_no_updates_returns_empty(storage: Storage):
    check = UpdatesDigestCheck(_make_client([]), storage, _make_settings())
    assert await check.run() == []


@pytest.mark.asyncio
async def test_digest_emits_single_event_for_all_updates(storage: Storage):
    states = [
        _update_state("update.ha_core"),
        _update_state("update.mosquitto", title="Mosquitto"),
        _update_state("update.esphome", title="ESPHome"),
    ]
    check = UpdatesDigestCheck(_make_client(states), storage, _make_settings())
    results = await check.run()
    assert len(results) == 1
    p = results[0].payload
    assert p["digest"] is True
    assert p["count"] == 3
    assert len(p["updates"]) == 3


@pytest.mark.asyncio
async def test_digest_payload_has_digest_true(storage: Storage):
    states = [_update_state()]
    check = UpdatesDigestCheck(_make_client(states), storage, _make_settings())
    results = await check.run()
    assert results[0].payload["digest"] is True


@pytest.mark.asyncio
async def test_digest_weekly_dedup_prevents_same_week_refiring(storage: Storage):
    states = [_update_state()]
    check = UpdatesDigestCheck(_make_client(states), storage, _make_settings())
    r1 = await check.run()
    r2 = await check.run()
    assert len(r1) == 1
    assert r2 == []


@pytest.mark.asyncio
async def test_digest_fires_independently_of_daily_dedup(storage: Storage):
    """Daily cooldown on entity X doesn't suppress Sunday digest."""
    states = [_update_state()]
    settings = _make_settings(updates_cooldown_days=7)

    # Daily check marks alert_key="update_available:update.homeassistant_core"
    daily = UpdatesAvailableCheck(_make_client(states), storage, settings)
    await daily.run()

    # Digest uses different key "update_digest:{week}" — should still fire
    digest = UpdatesDigestCheck(_make_client(states), storage, settings)
    r = await digest.run()
    assert len(r) == 1
    assert r[0].payload["digest"] is True


@pytest.mark.asyncio
async def test_digest_name_is_updates_digest(storage: Storage):
    check = UpdatesDigestCheck(_make_client([]), storage, _make_settings())
    assert check.name == "updates_digest"


@pytest.mark.asyncio
async def test_daily_check_name_is_updates_available(storage: Storage):
    check = UpdatesAvailableCheck(_make_client([]), storage, _make_settings())
    assert check.name == "updates_available"

@ -1,558 +0,0 @@
"""Unit tests for WebSocketMonitor."""
from __future__ import annotations

import asyncio
import time
from pathlib import Path
from unittest.mock import AsyncMock, MagicMock, patch

import aiohttp
import pytest

from ha_diag.config import Settings
from ha_diag.event_emitter import EventEmitter
from ha_diag.models import HAEventType
from ha_diag.monitors.websocket_monitor import (
    WebSocketMonitor,
    _AuthError,
    _make_ws_url,
)


# ---------------------------------------------------------------------------
# Helpers
# ---------------------------------------------------------------------------


def _make_settings(**overrides) -> Settings:
    defaults: dict = {
        "ha_url": "http://test.local:8123",
        "ha_token": "test-token",
        "node_name": "test-node",
        "location_tag": "test-loc",
        "websocket_enabled": True,
        "websocket_silence_threshold_seconds": 300,
        "websocket_watchdog_interval_seconds": 30,
        "websocket_reconnect_initial_delay": 1.0,
        "websocket_reconnect_max_delay": 60.0,
        "websocket_reconnect_jitter": 0.0,
        "websocket_down_alert_repeat_minutes": 10,
    }
    defaults.update(overrides)
    return Settings(**defaults)


class FakeWS:
    """Fake aiohttp ClientWebSocketResponse for unit tests."""

    def __init__(self, auth_messages: list, event_messages: list | None = None):
        self._auth_queue = list(auth_messages)
        self._event_queue = list(event_messages or [])
        self.sent: list = []

    async def receive_json(self) -> dict:
        if not self._auth_queue:
            raise ConnectionError("FakeWS: no more auth messages")
        return self._auth_queue.pop(0)

    async def send_json(self, data: dict) -> None:
        self.sent.append(data)

    def __aiter__(self):
        return self

    async def __anext__(self):
        if not self._event_queue:
            raise StopAsyncIteration
        item = self._event_queue.pop(0)
        if isinstance(item, BaseException):
            raise item
        return item


def _text_msg(data: str = '{"type":"event"}') -> aiohttp.WSMessage:
    return aiohttp.WSMessage(type=aiohttp.WSMsgType.TEXT, data=data, extra=None)


def _close_msg() -> aiohttp.WSMessage:
    return aiohttp.WSMessage(type=aiohttp.WSMsgType.CLOSE, data=b"", extra=None)


def _mock_session(fake_ws: FakeWS) -> MagicMock:
    cm = MagicMock()
    cm.__aenter__ = AsyncMock(return_value=fake_ws)
    cm.__aexit__ = AsyncMock(return_value=False)
    session = MagicMock()
    session.ws_connect.return_value = cm
    return session


def _make_monitor(
    settings: Settings | None = None,
    session=None,
    emitter: EventEmitter | None = None,
    tmp_path: Path | None = None,
) -> WebSocketMonitor:
    if settings is None:
        settings = _make_settings()
    if emitter is None:
        p = (tmp_path or Path("/tmp/ws_test_events")).absolute()
        p.mkdir(parents=True, exist_ok=True)
        emitter = EventEmitter(p, node_name="test-node")
    if session is None:
        session = MagicMock()
    return WebSocketMonitor(
        ha_url=settings.ha_url,
        token=settings.ha_token,
        settings=settings,
        emitter=emitter,
        session=session,
    )


# ---------------------------------------------------------------------------
# URL derivation
# ---------------------------------------------------------------------------


def test_make_ws_url_http():
    assert _make_ws_url("http://ha.local:8123") == "ws://ha.local:8123/api/websocket"


def test_make_ws_url_https():
    assert _make_ws_url("https://ha.example.com") == "wss://ha.example.com/api/websocket"


def test_make_ws_url_strips_trailing_slash():
    assert _make_ws_url("http://ha.local:8123/") == "ws://ha.local:8123/api/websocket"


# ---------------------------------------------------------------------------
# Auth flow (via _connect_and_listen)
# ---------------------------------------------------------------------------


@pytest.mark.asyncio
async def test_normal_auth_sends_correct_messages(tmp_path):
    """Happy path: sends auth + subscribe, ends in subscribed state."""
    fake_ws = FakeWS(
        [{"type": "auth_required"}, {"type": "auth_ok"}],
        [_text_msg('{"type":"result","id":1,"success":true}')],
    )
    monitor = _make_monitor(session=_mock_session(fake_ws), tmp_path=tmp_path)

    await monitor._connect_and_listen()

    assert fake_ws.sent[0] == {"type": "auth", "access_token": "test-token"}
    assert fake_ws.sent[1]["type"] == "subscribe_events"
    assert fake_ws.sent[1]["event_type"] == "state_changed"
    assert monitor._state == "subscribed"


@pytest.mark.asyncio
async def test_last_event_monotonic_updated_on_text_message(tmp_path):
    """Receiving TEXT messages updates last_event_monotonic."""
    fake_ws = FakeWS(
        [{"type": "auth_required"}, {"type": "auth_ok"}],
        [_text_msg(), _text_msg()],
    )
    monitor = _make_monitor(session=_mock_session(fake_ws), tmp_path=tmp_path)
    before = time.monotonic()

    await monitor._connect_and_listen()

    assert monitor._last_event_monotonic >= before


@pytest.mark.asyncio
async def test_auth_invalid_raises_auth_error(tmp_path):
    """auth_invalid → _AuthError propagates."""
    fake_ws = FakeWS([
        {"type": "auth_required"},
        {"type": "auth_invalid", "message": "invalid token"},
    ])
    monitor = _make_monitor(session=_mock_session(fake_ws), tmp_path=tmp_path)

    with pytest.raises(_AuthError, match="invalid token"):
        await monitor._connect_and_listen()


@pytest.mark.asyncio
async def test_unexpected_initial_message_raises(tmp_path):
    """Anything other than auth_required on connect → ConnectionError."""
    fake_ws = FakeWS([{"type": "unexpected"}])
    monitor = _make_monitor(session=_mock_session(fake_ws), tmp_path=tmp_path)

    with pytest.raises(ConnectionError, match="Unexpected initial"):
        await monitor._connect_and_listen()


@pytest.mark.asyncio
async def test_empty_auth_queue_raises_connection_error(tmp_path):
    """Connection closed before auth_required → ConnectionError."""
    fake_ws = FakeWS([])
    monitor = _make_monitor(session=_mock_session(fake_ws), tmp_path=tmp_path)

    with pytest.raises(ConnectionError):
        await monitor._connect_and_listen()


# ---------------------------------------------------------------------------
# Disconnect / dead alerts (_on_disconnected)
# ---------------------------------------------------------------------------


def test_on_disconnected_emits_ha_websocket_dead(tmp_path):
    emitter = MagicMock()
    monitor = _make_monitor(emitter=emitter, tmp_path=tmp_path)
    monitor._state = "disconnected"

    monitor._on_disconnected()

    emitter.emit.assert_called_once()
    assert emitter.emit.call_args[1]["event_type"] == HAEventType.ha_websocket_dead.value


def test_on_disconnected_within_cooldown_suppresses_second_emit(tmp_path):
    emitter = MagicMock()
    monitor = _make_monitor(
        settings=_make_settings(websocket_down_alert_repeat_minutes=10),
        emitter=emitter,
        tmp_path=tmp_path,
    )
    monitor._state = "disconnected"

    monitor._on_disconnected()  # first emit
    emitter.emit.reset_mock()
    monitor._on_disconnected()  # within cooldown → suppressed

    emitter.emit.assert_not_called()


def test_on_disconnected_after_cooldown_emits_again(tmp_path):
    emitter = MagicMock()
    monitor = _make_monitor(
        settings=_make_settings(websocket_down_alert_repeat_minutes=10),
        emitter=emitter,
        tmp_path=tmp_path,
    )
    monitor._state = "disconnected"
    monitor._on_disconnected()
    # Backdate to simulate cooldown expiry
    monitor._last_dead_alert_at = time.monotonic() - (10 * 60 + 5)
    emitter.emit.reset_mock()

    monitor._on_disconnected()

    emitter.emit.assert_called_once()


def test_on_disconnected_noop_when_stopping(tmp_path):
    emitter = MagicMock()
    monitor = _make_monitor(emitter=emitter, tmp_path=tmp_path)
    monitor._stopping = True

    monitor._on_disconnected()

    emitter.emit.assert_not_called()


# ---------------------------------------------------------------------------
# Recovery
# ---------------------------------------------------------------------------


@pytest.mark.asyncio
async def test_reconnect_after_dead_emits_recovered(tmp_path):
    """Successful reconnect after a dead alert emits ha_websocket_recovered."""
    emitter = MagicMock()
    fake_ws = FakeWS([{"type": "auth_required"}, {"type": "auth_ok"}], [])
    settings = _make_settings()
    monitor = WebSocketMonitor(
        ha_url=settings.ha_url,
        token=settings.ha_token,
        settings=settings,
        emitter=emitter,
        session=_mock_session(fake_ws),
    )
    monitor._last_dead_alert_at = time.monotonic() - 30.0  # prior dead was sent

    await monitor._connect_and_listen()

    emitted_types = [c[1]["event_type"] for c in emitter.emit.call_args_list]
    assert HAEventType.ha_websocket_recovered.value in emitted_types
    assert monitor._last_dead_alert_at == 0.0  # reset after recovery


@pytest.mark.asyncio
async def test_no_recovered_if_no_prior_dead(tmp_path):
    """First-ever connect with no prior dead alert → no recovered emitted."""
    emitter = MagicMock()
    fake_ws = FakeWS([{"type": "auth_required"}, {"type": "auth_ok"}], [])
    settings = _make_settings()
    monitor = WebSocketMonitor(
        ha_url=settings.ha_url,
        token=settings.ha_token,
        settings=settings,
        emitter=emitter,
        session=_mock_session(fake_ws),
    )
    monitor._last_dead_alert_at = 0.0

    await monitor._connect_and_listen()

    emitter.emit.assert_not_called()


# ---------------------------------------------------------------------------
# Watchdog loop
# ---------------------------------------------------------------------------


@pytest.mark.asyncio
async def test_watchdog_emits_dead_when_silent_over_threshold(tmp_path):
    """Watchdog detects silence > threshold and emits ha_websocket_dead."""
    emitter = MagicMock()
    settings = _make_settings(
        websocket_silence_threshold_seconds=60,
        websocket_watchdog_interval_seconds=30,
        websocket_down_alert_repeat_minutes=0,
    )
    monitor = _make_monitor(settings=settings, emitter=emitter, tmp_path=tmp_path)
    monitor._state = "subscribed"
    monitor._last_event_monotonic = time.monotonic() - 120.0  # 120s > 60s threshold
    monitor._last_dead_alert_at = 0.0

    sleep_calls = 0

    async def one_iteration(t):
        nonlocal sleep_calls
        sleep_calls += 1
        if sleep_calls >= 2:
            raise asyncio.CancelledError()

    with patch("asyncio.sleep", side_effect=one_iteration):
        with pytest.raises(asyncio.CancelledError):
            await monitor._watchdog_loop()

    emitter.emit.assert_called_once()
    assert emitter.emit.call_args[1]["event_type"] == HAEventType.ha_websocket_dead.value


@pytest.mark.asyncio
async def test_watchdog_no_emit_when_events_recent(tmp_path):
    """Watchdog does not emit when last event is within silence threshold."""
    emitter = MagicMock()
    settings = _make_settings(
        websocket_silence_threshold_seconds=300,
        websocket_watchdog_interval_seconds=30,
        websocket_down_alert_repeat_minutes=0,
    )
    monitor = _make_monitor(settings=settings, emitter=emitter, tmp_path=tmp_path)
    monitor._state = "subscribed"
    monitor._last_event_monotonic = time.monotonic() - 10.0  # recent

    sleep_calls = 0

    async def one_iteration(t):
        nonlocal sleep_calls
        sleep_calls += 1
        if sleep_calls >= 2:
            raise asyncio.CancelledError()

    with patch("asyncio.sleep", side_effect=one_iteration):
        with pytest.raises(asyncio.CancelledError):
            await monitor._watchdog_loop()

    emitter.emit.assert_not_called()


@pytest.mark.asyncio
async def test_watchdog_skips_when_not_subscribed(tmp_path):
    """Watchdog does not emit when state is not 'subscribed'."""
    emitter = MagicMock()
    settings = _make_settings(
        websocket_silence_threshold_seconds=1,
        websocket_watchdog_interval_seconds=30,
        websocket_down_alert_repeat_minutes=0,
    )
    monitor = _make_monitor(settings=settings, emitter=emitter, tmp_path=tmp_path)
    monitor._state = "disconnected"
    monitor._last_event_monotonic = time.monotonic() - 9999.0  # very old

    sleep_calls = 0

    async def one_iteration(t):
        nonlocal sleep_calls
        sleep_calls += 1
        if sleep_calls >= 2:
            raise asyncio.CancelledError()

    with patch("asyncio.sleep", side_effect=one_iteration):
        with pytest.raises(asyncio.CancelledError):
            await monitor._watchdog_loop()

    emitter.emit.assert_not_called()


@pytest.mark.asyncio
async def test_watchdog_repeat_alert_respects_cooldown(tmp_path):
    """Second watchdog dead alert fires only after cooldown."""
    emitter = MagicMock()
    settings = _make_settings(
        websocket_silence_threshold_seconds=60,
        websocket_watchdog_interval_seconds=30,
        websocket_down_alert_repeat_minutes=10,
    )
    monitor = _make_monitor(settings=settings, emitter=emitter, tmp_path=tmp_path)
    monitor._state = "subscribed"
    monitor._last_event_monotonic = time.monotonic() - 3600.0  # 1hr silent
    # Set last alert to just now → still within 10-min cooldown
    monitor._last_dead_alert_at = time.monotonic()

    sleep_calls = 0

    async def one_iteration(t):
        nonlocal sleep_calls
        sleep_calls += 1
        if sleep_calls >= 2:
            raise asyncio.CancelledError()

    with patch("asyncio.sleep", side_effect=one_iteration):
        with pytest.raises(asyncio.CancelledError):
            await monitor._watchdog_loop()

    emitter.emit.assert_not_called()  # within cooldown


# ---------------------------------------------------------------------------
# Reconnect backoff
# ---------------------------------------------------------------------------


@pytest.mark.asyncio
async def test_reconnect_backoff_doubles_each_attempt(tmp_path):
    """Retry delay doubles on consecutive failures."""
    delays: list[float] = []
    call_count = 0

    async def fail_connect():
        nonlocal call_count
        call_count += 1
        raise ConnectionError("refused")

    async def capture_sleep(t):
        delays.append(t)
        if call_count >= 3:
            raise asyncio.CancelledError()

    settings = _make_settings(
        websocket_reconnect_initial_delay=1.0,
        websocket_reconnect_max_delay=60.0,
        websocket_reconnect_jitter=0.0,
    )
    monitor = _make_monitor(settings=settings, emitter=MagicMock(), tmp_path=tmp_path)
    monitor._connect_and_listen = fail_connect

    with patch("asyncio.sleep", side_effect=capture_sleep):
        with pytest.raises(asyncio.CancelledError):
            await monitor._connection_loop()

    assert len(delays) >= 2
    assert delays[0] == pytest.approx(1.0)
    assert delays[1] == pytest.approx(2.0)


@pytest.mark.asyncio
async def test_reconnect_delay_capped_at_max(tmp_path):
    """Delay never exceeds websocket_reconnect_max_delay."""
    delays: list[float] = []
    call_count = 0

    async def fail_connect():
        nonlocal call_count
        call_count += 1
        raise ConnectionError("refused")

    async def capture_sleep(t):
        delays.append(t)
        if call_count >= 8:
            raise asyncio.CancelledError()

    settings = _make_settings(
        websocket_reconnect_initial_delay=1.0,
        websocket_reconnect_max_delay=8.0,
        websocket_reconnect_jitter=0.0,
    )
    monitor = _make_monitor(settings=settings, emitter=MagicMock(), tmp_path=tmp_path)
    monitor._connect_and_listen = fail_connect

    with patch("asyncio.sleep", side_effect=capture_sleep):
        with pytest.raises(asyncio.CancelledError):
            await monitor._connection_loop()

    assert max(delays) <= 8.0


# ---------------------------------------------------------------------------
# is_healthy
# ---------------------------------------------------------------------------


def test_is_healthy_true_when_subscribed(tmp_path):
    monitor = _make_monitor(settings=_make_settings(websocket_enabled=True), tmp_path=tmp_path)
    monitor._state = "subscribed"
    assert monitor.is_healthy is True


def test_is_healthy_false_when_disconnected(tmp_path):
    monitor = _make_monitor(settings=_make_settings(websocket_enabled=True), tmp_path=tmp_path)
    monitor._state = "disconnected"
    assert monitor.is_healthy is False


def test_is_healthy_false_when_connecting(tmp_path):
    monitor = _make_monitor(settings=_make_settings(websocket_enabled=True), tmp_path=tmp_path)
    monitor._state = "connecting"
    assert monitor.is_healthy is False


def test_is_healthy_true_when_disabled(tmp_path):
    """Disabled monitor reports healthy — it's off, not broken."""
    monitor = _make_monitor(settings=_make_settings(websocket_enabled=False), tmp_path=tmp_path)
    monitor._state = "disconnected"
    assert monitor.is_healthy is True


# ---------------------------------------------------------------------------
# start / stop lifecycle
# ---------------------------------------------------------------------------


@pytest.mark.asyncio
async def test_stop_cancels_background_tasks(tmp_path):
    """stop() cancels the main and watchdog tasks."""

    async def hang():
        await asyncio.sleep(9999)

    monitor = _make_monitor(tmp_path=tmp_path)
    monitor._main_task = asyncio.create_task(hang())
    monitor._watchdog_task = asyncio.create_task(hang())

    await monitor.stop()

    assert monitor._main_task is None
    assert monitor._watchdog_task is None


@pytest.mark.asyncio
async def test_start_no_tasks_when_disabled(tmp_path):
    """start() with websocket_enabled=False does not spawn tasks."""
    monitor = _make_monitor(
        settings=_make_settings(websocket_enabled=False),
        tmp_path=tmp_path,
    )
    await monitor.start()
    assert monitor._main_task is None
    assert monitor._watchdog_task is None
@ -3,10 +3,8 @@
 Zigbee to MQTT bridge, get rid of your proprietary Zigbee bridges.

 ## Usage
-
-Deployed on the `chelsty-infra` node (CHELSTY site).
-
-Coordinator: **SLZB-06U** over TCP at `192.168.1.105:6638` (`ezsp` adapter).
-Do not use USB paths (`/dev/ttyUSB0`, `/dev/ttyACM0`) — the coordinator is network-attached.
+Deployed on the `piha` node.
+
+Requires a Zigbee adapter (e.g., Sonoff ZBDongle-E) mapped to `/dev/ttyACM0`.

 Frontend is available on port 8080.