fix(stability-agent): run as uid 1000 with docker group access

stability-agent had no USER instruction and no user: in compose, running as root and writing root-owned files to /opt/homelab bind-mount.

- Dockerfile: add useradd -m -u 1000 homelab + USER homelab
- docker-compose.yml: add user: "1000:1000" and group_add: ["999"] (GID 999 = docker group on VPS) to retain docker.sock:ro access

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
fix(node-agent): run as uid 1000 with docker group access
2026-06-03 18:20:54 +02:00 · 2026-06-03 18:20:31 +02:00 · 2026-06-03 18:19:58 +02:00 · 2026-06-03 18:04:38 +02:00 · 2026-06-03 18:02:50 +02:00 · 2026-06-03 17:41:35 +02:00
177 changed files with 18942 additions and 470 deletions
--- a/.claude/skills/deploy/SKILL.md
+++ b/.claude/skills/deploy/SKILL.md
@@ -0,0 +1,43 @@
+---
+name: deploy
+description: Deploy, redeploy, or ship homelab services to a target node. Trigger on any request containing deploy / redeploy / wdróż / zredeployuj / ship for targets control-plane, vps, piha, solaria, or chelsty-infra.
+---
+
+Always invoke `scripts/deploy/deploy.sh <target> [--dry-run] [--no-gate]` as the **sole entry point**.
+Never call `deploy-control-plane.sh`, `deploy-node.sh`, or `deploy-local.sh` directly.
+
+## Targets
+
+| Target | What it deploys |
+|---|---|
+| `control-plane` | observer, supervisor, executor, operator-ui on VPS |
+| `vps` | all VPS GitOps services (node-agent, npm, outline, joplin, ai-cluster, …) |
+| `piha` | PIHA services (ha-diag-agent, node-agent, redis, …) |
+| `solaria` | SOLARIA compute services |
+| `chelsty-infra` | CHELSTY LTE edge node (30 s SSH timeout) |
+
+## Invocation
+
+```bash
+scripts/deploy/deploy.sh <target>            # full pipeline
+scripts/deploy/deploy.sh <target> --dry-run  # preflight + gate only
+scripts/deploy/deploy.sh <target> --no-gate  # emergency: bypass tests
+```
+
+## Exit Code Handling
+
+| Code | Meaning | Required action |
+|---|---|---|
+| 0 | Success | Report: target, commit hash, gate status, verify status, elapsed time |
+| 1 | Preflight failed | Fix the upstream issue (push commits, wake node, switch to master). Never bypass. |
+| 2 | Gate failed | Show exactly which test/build failed. Do **not** deploy. Fix the failure first. |
+| 3 | Execute failed | Show full deploy output. Ask user whether to investigate or rollback. |
+| 4 | Verify failed | Show docker ps output. Discuss rollback with the user. |
+| 5 | Sudo handoff | Print the exact manual command from stderr **verbatim** and stop. User must run it. |
+
+## Rules
+
+- Never pass `--no-gate` unless the user explicitly requests emergency/bypass mode.
+- Never deploy uncommitted or unpushed code — preflight enforces this; do not help circumvent it.
+- Canonical branch is `master` — preflight enforces this.
+- For exit 5: reproduce the handoff command exactly as printed to stderr, then stop.
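The exit-code table in `deploy/SKILL.md` above could be mirrored by a thin wrapper. What follows is a minimal Python sketch, assuming `deploy.sh` returns only the codes 0 to 5 listed in the table; `EXIT_ACTIONS`, `build_command`, and `describe_exit` are hypothetical names that do not exist in the repository, and the handling of unlisted codes is an assumption.

```python
from typing import List, Tuple

# Meanings and required actions, condensed from the table in deploy/SKILL.md.
EXIT_ACTIONS = {
    0: ("Success", "report target, commit hash, gate status, verify status, elapsed time"),
    1: ("Preflight failed", "fix the upstream issue; never bypass"),
    2: ("Gate failed", "show which test/build failed; do not deploy"),
    3: ("Execute failed", "show full deploy output; ask whether to investigate or roll back"),
    4: ("Verify failed", "show docker ps output; discuss rollback"),
    5: ("Sudo handoff", "print the manual command from stderr verbatim and stop"),
}

# Targets listed in the skill's Targets table.
TARGETS = ("control-plane", "vps", "piha", "solaria", "chelsty-infra")


def build_command(target: str, dry_run: bool = False, no_gate: bool = False) -> List[str]:
    """Build the argv for the sole entry point; rejects unknown targets."""
    if target not in TARGETS:
        raise ValueError(f"unknown target: {target}")
    argv = ["scripts/deploy/deploy.sh", target]
    if dry_run:
        argv.append("--dry-run")
    if no_gate:
        # Only valid when the user explicitly asked for emergency/bypass mode.
        argv.append("--no-gate")
    return argv


def describe_exit(code: int) -> Tuple[str, str]:
    """Map an exit code to (meaning, required action)."""
    # Assumption: codes outside the table are treated as unknown.
    return EXIT_ACTIONS.get(code, ("Unknown", "stop and show the raw output"))
```

A caller would pass the result of `build_command` to `subprocess.run` and feed the return code to `describe_exit`; keeping both functions free of side effects makes them easy to unit-test.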
--- a/.claude/skills/save-session/SKILL.md
+++ b/.claude/skills/save-session/SKILL.md
@@ -0,0 +1,65 @@
+---
+name: save-session
+description: Save and record the current work session to docs/sessions/. Trigger ONLY on explicit "save session", "zapisz sesję", or "wrap up" — never invoke proactively between tasks.
+---
+
+**Trigger condition**: user explicitly says "save session", "zapisz sesję", "wrap up", or equivalent.
+Never invoke proactively. Never invoke mid-task.
+
+## 1. Determine Session Boundary
+
+1. Read the latest entry file in `docs/sessions/` — use its last `## Session HH:MM` heading timestamp as the start boundary.
+2. Fallback if no previous entry exists: 24 hours ago.
+
+## 2. Collect Facts (deterministic only — no invention)
+
+Run exactly:
+```bash
+# All commits since boundary
+git --no-pager log --oneline <boundary>..HEAD
+
+# Changed file summary
+git --no-pager diff --stat <boundary>..HEAD
+```
+
+From the visible conversation transcript: deploys run and their outcomes, test results seen.
+
+## 3. Write the Session Entry
+
+**APPEND** to `docs/sessions/YYYY-MM-DD.md` (create the file if it doesn't exist for today).
+Never overwrite existing content.
+
+```markdown
+## Session HH:MM
+
+### Commits
+<output of git log --oneline>
+
+### Files changed
+<output of git diff --stat>
+
+### Deploys
+<list from transcript, or "None recorded">
+
+### Narrative
+> _user-provided summary_
+```
+
+The `> _user-provided summary_` placeholder is **mandatory**. Never fill it in. The user supplies the narrative separately if desired.
+
+## 4. What NOT to Touch
+
+- `backlog.md` — only on explicit "update backlog" instruction
+- `CLAUDE.md` — only on explicit "update CLAUDE.md" instruction
+- Any other file not listed above
+
+## 5. Commit
+
+Stage and commit **only** the session file:
+
+```bash
+git add docs/sessions/YYYY-MM-DD.md
+git commit -m "docs: session YYYY-MM-DD HH:MM"
+```
+
+No other files. No `git add -A`.
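The boundary lookup and append-only write described in `save-session/SKILL.md` above can be sketched in Python. This is an illustration under assumptions: headings match `## Session HH:MM` exactly, and the helper names (`last_session_time`, `render_entry`, `append_entry`) are hypothetical rather than part of the repository.

```python
import re
from typing import Optional

SESSION_HEADING = re.compile(r"^## Session (\d{2}:\d{2})\s*$", re.MULTILINE)
PLACEHOLDER = "> _user-provided summary_"


def last_session_time(entry_text: str) -> Optional[str]:
    """Return HH:MM from the last '## Session HH:MM' heading, or None."""
    matches = SESSION_HEADING.findall(entry_text)
    return matches[-1] if matches else None


def render_entry(hhmm: str, commits: str, files_changed: str, deploys: str = "") -> str:
    """Render one session entry; the narrative placeholder is never filled in."""
    return (
        f"## Session {hhmm}\n\n"
        f"### Commits\n{commits.strip()}\n\n"
        f"### Files changed\n{files_changed.strip()}\n\n"
        f"### Deploys\n{deploys.strip() or 'None recorded'}\n\n"
        f"### Narrative\n{PLACEHOLDER}\n"
    )


def append_entry(existing: str, entry: str) -> str:
    """Append without overwriting: existing content stays as the prefix."""
    if existing and not existing.endswith("\n"):
        existing += "\n"
    separator = "\n" if existing else ""
    return existing + separator + entry
```

When `last_session_time` returns `None`, the skill's fallback of 24 hours ago would apply; that branch is left to the caller.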
--- a/.claude/skills/worktree-aware/SKILL.md
+++ b/.claude/skills/worktree-aware/SKILL.md
@@ -0,0 +1,81 @@
+---
+name: worktree-aware
+description: >
+  Use when working in a git worktree checkout for a parallel agent task.
+  The presence of an .agent-task file in the current working directory indicates
+  a task worktree (NOT the main checkout). Encodes branch hygiene: commit only
+  to the assigned task branch, NEVER push origin master, NEVER touch the main
+  checkout at ~/homelab-codex-ws, NEVER manage worktrees yourself. On task
+  completion, report the branch name verbatim and stop — the human merges via
+  scripts/dev/agent.sh.
+---
+
+## When this applies
+
+- `.agent-task` present in your `cwd` → you are in a task worktree. Apply all rules below.
+- `.agent-task` absent → you are in the main checkout. Do NOT treat yourself as a task agent.
+  In the main checkout these rules do not apply.
+
+## Reading the marker
+
+`.agent-task` is a YAML file. Your assigned branch is the value of the `branch:` key, e.g.:
+
+```yaml
+task: my-feature
+branch: task/my-feature
+parent_commit: abc1234
+created_utc: 2026-06-03T10:00:00Z
+worktree_path: /home/oskar/homelab-codex-ws-my-feature
+```
+
+Always read this file first before taking any action.
+
+## Rules
+
+1. **Commit only to your branch.**
+   Before any `git commit`, run `git status` and confirm it says `On branch task/<name>`.
+   If it does not, stop immediately and report the discrepancy.
+
+2. **Push only to your branch.**
+   The only permitted push is `git push origin task/<name>`.
+   NEVER `git push origin master` or any other branch.
+
+3. **Do not touch the main checkout.**
+   `~/homelab-codex-ws/` is the main checkout — deploy-only, owned by the human.
+   Do not read from, write to, or execute commands inside it.
+
+4. **Stay scoped.**
+   Only change files directly related to your assigned task.
+   If you notice other problems, report them in your final summary as separate follow-up proposals.
+   Do not fix them in this worktree.
+
+5. **Never `git add -A`.**
+   Always stage specific files by name: `git add path/to/file`.
+
+6. **Do not manage worktrees.**
+   Never run `git worktree add/remove` or invoke `scripts/dev/agent.sh`.
+   Worktree lifecycle is the human's responsibility.
+
+7. **Final report before stopping.**
+   When the task is done, provide a structured report containing:
+   - Files changed (path and one-line summary of change)
+   - Tests run and results
+   - All commit hashes on the task branch
+   - **Branch name verbatim** (copy-paste ready)
+   - Follow-up items as bulleted proposals for separate tasks
+
+## Definition of Done
+
+- All commits are on `task/<name>` (verify with `git log --oneline master..task/<name>`)
+- Test suite passes
+- Branch pushed: `git push origin task/<name>`
+- Full report delivered in conversation
+
+## What you do NOT do
+
+- Merge branches
+- Create or push tags
+- Run deploys or healthchecks against production nodes
+- Delete branches or worktrees
+- Modify files in other worktrees
+- Push to `origin master` under any circumstances
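Rules 1 and 2 of `worktree-aware/SKILL.md` above reduce to comparing the current branch against the `branch:` key of `.agent-task`. Below is a minimal Python sketch, assuming the marker stays a flat `key: value` file as in the example (a real YAML parser would be needed for anything nested); the function names are hypothetical, and the current branch is assumed to be obtained separately, for example from `git branch --show-current`.

```python
from typing import Dict


def parse_agent_task(text: str) -> Dict[str, str]:
    """Parse the flat 'key: value' marker; values are kept as plain strings."""
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        # Split on the first colon only, so timestamps and paths stay intact.
        key, value = line.split(":", 1)
        fields[key.strip()] = value.strip()
    return fields


def on_assigned_branch(marker: Dict[str, str], current_branch: str) -> bool:
    """Rule 1: commits are allowed only on the branch named in the marker."""
    return bool(marker.get("branch")) and current_branch == marker["branch"]


def push_allowed(marker: Dict[str, str], remote: str, branch: str) -> bool:
    """Rule 2: the only permitted push is 'origin task/<name>'."""
    return (
        remote == "origin"
        and branch == marker.get("branch")
        and branch.startswith("task/")
    )
```

The extra `task/` prefix check is a deliberate belt-and-braces choice: even a marker that wrongly named `master` would not unlock a push to it.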
--- a/.gitignore
+++ b/.gitignore
@@ -15,6 +15,7 @@ __pycache__/
 *$py.class
 venv/
 .venv/
+*.egg-info/

 # Tools
 .aider*
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -0,0 +1,194 @@
+# CLAUDE.md
+
+This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
+
+## What This Repo Is
+
+GitOps-lite orchestration for a distributed homelab. The repo is the source of truth for infrastructure definitions; runtime state lives at `/opt/homelab/` on each execution node and is never committed.
+
+## Node Roles
+
+| Host | Role |
+|------|------|
+| **SATURN** | Primary control node — only node where commits are made |
+| **SOLARIA** | GPU/compute/AI workloads |
+| **PIHA** | Infra, monitoring |
+| **VPS** | Public ingress, reverse proxy, control plane host |
+| **CHELSTY-INFRA** | LTE edge hypervisor (site: chelsty); Zigbee2MQTT, Mosquitto, stability-agent — offline-first |
+| **CHELSTY-HA** | LTE Home Assistant VM (site: chelsty); connects to CHELSTY-INFRA MQTT broker — offline-first |
+
+All nodes communicate over Tailscale. CHELSTY-INFRA and CHELSTY-HA have an intermittent LTE uplink; their services must never depend on SATURN, VPS, or Forgejo at runtime. Full node capabilities: `hosts/<node>/capabilities.yaml`.
+
+## Deployment
+
+```bash
+scripts/deploy/deploy.sh                        # fresh deploy on current node
+scripts/deploy/deploy.sh --resume              # resume after interruption
+scripts/deploy/deploy.sh --stage verify        # specific stage only
+scripts/deploy/deploy.sh --service mosquitto   # specific service only
+./scripts/deploy/deploy-control-plane.sh --ssh # SATURN/SOLARIA → VPS
+./scripts/deploy/deploy-node.sh chelsty-infra  # CHELSTY nodes (individually)
+./scripts/bootstrap/prepare-node.sh            # general node bootstrap
+./scripts/bootstrap/chelsty-runtime.sh         # CHELSTY-specific bootstrap
+```
+
+Pipeline stages: **prepare → validate → deploy → verify → diagnose (on failure) → complete**. Stage state persisted in `/opt/homelab/state/deploy/`.
+
+## Service Structure
+
+Every service must follow this layout:
+
+```
+services/<service>/
+├── docker-compose.yml
+├── service.yaml       # Machine-readable contract (primary source of truth for agents)
+├── README.md
+├── env.example        # Template — never commit actual secrets
+└── healthcheck.sh     # Returns 0 (healthy) or 1 (unhealthy)
+```
+
+`service.yaml` defines `owner_node`, `exposure`, `dependencies`, `healthcheck`, `restart_policy`, `persistence.paths`, and `runtime.env_vars`. This is what AI agents read to understand how to manage a service.
+
+Host-specific runtime config and secrets live at `/opt/homelab/config/<service>/` on the target node (not in Git). Docker Compose overrides are version-controlled at `hosts/<node>/runtime/<service>/docker-compose.override.yml` in this repo and applied during deployment.
+
+## Agent System Architecture
+
+The platform uses a multi-agent model with **human-in-the-loop** for destructive actions:
+
+1. **Stability Agent** (`services/stability-agent/`) — Per-node watchdog. Monitors Docker containers, disk, Tailscale, MQTT. Emits filesystem events. Does NOT restart services autonomously.
+2. **Observer** (`services/control-plane/src/`) — Synthesizes world state from events into `/opt/homelab/world/{nodes,services,deployments,incidents}.json`.
+3. **Supervisor** — Detects drift between desired state (from `hosts/*/services.yaml`) and actual state (from Observer output). Writes `pending` action JSON files.
+4. **Executor** — Executes actions only after they transition to `approved`.
+5. **Operator UI** + **Telegram Bot** — Operators review and approve/reject pending actions.
+
+### Action approval flow
+```
+Agent → /opt/homelab/actions/pending/<id>.json
+      → Telegram notification → Operator approves
+      → /opt/homelab/actions/approved/<id>.json
+      → Executor runs → completed / failed
+```
+
+Agents must never execute destructive actions (restarts, deploys, config changes) without a corresponding approved action file.
+
+## Event System
+
+Events are append-only JSON lines at `/opt/homelab/events/YYYY-MM-DD/<node>/events.jsonl`.
+
+Emit via `scripts/lib/events.sh` (shell) or `scripts/lib/events.py` (Python).
+
+Normalized event types: `deployment_started/completed/failed`, `service_unhealthy/recovered`, `node_offline/online`, `healthcheck_failed`, `remediation_started/completed`.
+
+### Supervisor event routing table
+
+| Event type | Source | Action generated | Cooldown |
+|---|---|---|---|
+| `containers_not_running` | stability-agent | `container_restart` | dedup via stable ID |
+| `mqtt_unreachable` | stability-agent | `container_restart` | dedup via stable ID |
+| `service_unhealthy` / other | stability-agent | `redeploy` | dedup via stable ID |
+| `disk_pressure` (high) | stability-agent | `disk_cleanup` | dedup via stable ID |
+| `ha_websocket_dead` | ha-diag-agent | `container_restart` (homeassistant) | 30 min after completion |
+| `ha_websocket_recovered` | ha-diag-agent | cancels matching restart | — |
+| `ha_integration_failed` | ha-diag-agent | `alert_only` | 1 hour |
+| `ha_entity_unavailable_long` | ha-diag-agent | `alert_only` | 1 hour |
+| `ha_automation_failing` | ha-diag-agent | `alert_only` | 1 hour |
+| `ha_update_available` | ha-diag-agent | `alert_only` | 1 hour |
+| `ha_recorder_lag` | ha-diag-agent | `alert_only` | 1 hour |
+| `ha_system_health_degraded` | ha-diag-agent | `alert_only` | 1 hour |
+
+HA events are routed directly from the events directory by the supervisor (not via world-state drift loop) to avoid conflicts with stability-agent's independent container health tracking. HA events are suppressed if `homeassistant` had a `containers_not_running` incident within the last 5 minutes (planned restart/update in progress).
+
+## Discovery Entry Points for Agents
+
+When exploring the system, use these files in order:
+1. `inventory/topology.yaml` — node list, roles, mesh type
+2. `hosts/<node>/capabilities.yaml` — hardware and software constraints
+3. `hosts/<node>/services.yaml` — desired services and exposure classes for that host
+4. `services/<service>/service.yaml` — operational contract for a service
+
+## VPS-Specific Rules
+
+VPS has **4 GiB RAM, no swap**. Every repo-managed service must declare memory limits in its `hosts/vps/runtime/<service>/docker-compose.override.yml`.
+
+### Memory limit convention
+
+Use top-level Compose properties (not `deploy.resources.limits`, which requires Swarm mode):
+
+```yaml
+services:
+  myservice:
+    mem_limit: 256m      # cgroup ceiling; the container is OOM-killed on breach
+    oom_score_adj: -900  # makes the host OOM killer very unlikely to pick this container
+```
+
+Rules:
+- **Control-plane containers** (executor, observer, supervisor, operator-ui), **node-agent**, **stability-agent**: always set `oom_score_adj: -900` — these must never be a system-level OOM victim.
+- `mem_limit` still applies even with `oom_score_adj: -900`; the cgroup OOM killer is independent of the host OOM killer and will restart the container via Docker when the limit is exceeded.
+- Budget: OS+Docker reserves ~800 MiB; sum of all `mem_limit` values must stay ≤ 3200 MiB (3.1 GiB).
+
+### Repo-managed services on VPS
+
+All VPS services are now GitOps-managed. Service definitions live in `services/<name>/docker-compose.yml`; host-specific overrides (mem_limit, env) live in `hosts/vps/runtime/<name>/docker-compose.override.yml`.
+
+| Service | Compose stack | Data path |
+|---|---|---|
+| npm | `services/npm/` | `/home/dockeruser/docker/npm/{data,letsencrypt}` (bind mount) |
+| outline | `services/outline/` | Docker named volumes: `outline_outline_storage`, `outline_postgres_data`, `outline_redis_data` |
+| joplin | `services/joplin/` | Docker named volume: `joplin_postgres_data` |
+| ai-cluster | `services/ai-cluster/` | Mosquitto config bind: `/home/dockeruser/docker/ai-cluster/mosquitto/` |
+
+**Data migration rule**: data paths stay in place at cutover. Never move volumes or bind-mount sources without a dedicated migration plan.
+
+**Cutover checklist** (before running `docker compose up` for any migrated service):
+1. `git pull` on VPS
+2. Populate `/opt/homelab/config/<service>/.env` from the `env.example` template
+3. For ai-cluster: copy `/home/dockeruser/docker/ai-cluster/.env` to `/opt/homelab/config/ai-cluster/.env`
+4. For mosquitto: config stays at old bind path until explicitly migrated
+5. Verify named volumes exist: `docker volume ls | grep <project>`
+
+**ai-cluster architectural note**: compute workloads (codex-worker, planner-worker) belong on SOLARIA (GPU/compute node), not the 4 GB ingress VPS. Migrate when feasible; for now, hard mem_limits contain the blast radius.
+
+## CHELSTY-Specific Rules
+
+- Zigbee coordinator is **SLZB-06U** over TCP (`192.168.1.105:6638`, `ezsp` adapter). Never use `/dev/ttyUSB0`.
+- CHELSTY nodes run **docker-compose v1** (1.29.2) — use `docker-compose` (hyphenated), not `docker compose`.
+- Critical backup sets: HA config+data, Zigbee2MQTT config+db+network key, Mosquitto config+persistence, SLZB-06U coordinator state.
+
+## Runtime Path Conventions
+
+`/opt/homelab/` layout on each node:
+
+- `data/<service>/` — persistent volumes
+- `config/<service>/` — secrets and host-local overrides (not in Git)
+- `logs/<service>/` — service logs
+- `state/` — deployment stage markers, agent heartbeats
+- `events/` — append-only event store
+- `world/` — Observer output (synthesized state)
+- `actions/` — pending / approved / running / completed / failed
+
+## Definition of Done (services)
+
+Before any new or changed service is considered ready:
+
+1. **docker build + smoke run** — build the image locally and run it for a few seconds; confirm the process starts its main loop without crashing. This catches packaging/import errors (e.g. `ModuleNotFoundError`) before they reach a node.
+2. **pytest** — run the service's test suite. If no tests exist yet, add a minimal one (at minimum: import passes, core logic has at least one case). Tests live in `services/<service>/tests/`.
+3. **Never commit or deploy code that has never been run.** If a smoke run or test fails, fix it first.
+
+## Naming Conventions
+
+- Hosts: ALL CAPS (`SATURN`, `PIHA`)
+- Services: kebab-case (`stability-agent`, `zigbee2mqtt`)
+- Container names must match service names
+- Always `restart: unless-stopped` unless `service.yaml` says otherwise
+
+## Multi-agent worktree mode
+
+`~/homelab-codex-ws` (main checkout) is **deploy-only** and belongs to the human operator.
+Parallel agent tasks run in isolated git worktrees created by `scripts/dev/agent.sh new <name>`.
+
+If `.agent-task` exists in your current working directory, you are in a task worktree.
+**You must immediately read `.agent-task` and load `.claude/skills/worktree-aware/SKILL.md`
+before taking any action.** That skill defines all branch-hygiene rules for task worktrees.
+
+Worktree lifecycle commands: `agent.sh new | list | merge | clean`.
+Agents never invoke these — only the human does.
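The VPS memory budget rule in `CLAUDE.md` above (sum of all `mem_limit` values at most 3200 MiB) is simple arithmetic that could be checked before a deploy. The following Python sketch assumes the limits have already been read from the override files into a dict and that values are whole numbers with an optional `b`/`k`/`m`/`g` suffix; `parse_mem_limit`, `total_mib`, and `within_budget` are hypothetical names.

```python
from typing import Dict

MIB = 1024 * 1024
BUDGET_MIB = 3200  # ceiling stated in the VPS-specific rules

_UNITS = {"b": 1, "k": 1024, "m": 1024**2, "g": 1024**3}


def parse_mem_limit(value: str) -> int:
    """Convert a Compose byte value such as '256m' or '1gb' to bytes."""
    text = value.strip().lower()
    # Accept both the short ('m') and long ('mb') unit spellings.
    if text.endswith("b") and len(text) > 1 and text[-2] in "kmg":
        text = text[:-1]
    if text and text[-1] in _UNITS:
        return int(text[:-1]) * _UNITS[text[-1]]
    return int(text)  # a bare number is taken as bytes


def total_mib(limits: Dict[str, str]) -> float:
    """Sum all mem_limit values and express the total in MiB."""
    return sum(parse_mem_limit(v) for v in limits.values()) / MIB


def within_budget(limits: Dict[str, str], budget_mib: int = BUDGET_MIB) -> bool:
    """True when the combined limits fit under the stated budget."""
    return total_mib(limits) <= budget_mib
```

For example, three services limited to `256m`, `256m`, and `1g` total 1536 MiB and fit, while `3g` plus `512m` totals 3584 MiB and does not. These values are illustrative, not the repository's actual limits.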
--- a/README.md
+++ b/README.md
@@ -13,6 +13,22 @@ The homelab consists of several nodes connected via a Tailscale internal mesh.
 | **PIHA** | Infra Node | Core infrastructure services, automation, and monitoring. |
 | **VPS** | Edge Node | Public ingress, reverse proxy, and edge services. |

+## Agent System
+
+The homelab uses a multi-agent orchestration model with human-in-the-loop for destructive actions:
+
+| Agent | Node | Role |
+|-------|------|------|
+| **stability-agent** | all nodes | Per-node watchdog — monitors Docker, disk, Tailscale, MQTT; emits events |
+| **node-agent** | all nodes | Publishes container health events to Redis pub/sub |
+| **observer** | VPS | Synthesizes world state from events into `/opt/homelab/world/*.json` |
+| **supervisor** | VPS | Detects drift between desired and actual state; writes `pending` actions |
+| **planner-agent** | SOLARIA | LLM-powered diagnosis — listens to Redis, proposes remediation actions |
+| **executor** | VPS | Executes actions only after operator approval |
+| **operator-ui** + **telegram-bot** | VPS / PIHA | Operator reviews and approves/rejects pending actions |
+
+Action approval flow: `pending/` → operator approves → `approved/` → executor runs.
+
 ## Repository Structure

 - `docs/`: [Infrastructure Standards](docs/standards.md) and [Deployment Conventions](docs/deployment.md).
@@ -29,10 +45,13 @@ The homelab consists of several nodes connected via a Tailscale internal mesh.
 ## Documentation Index

 - [Infrastructure Standards](docs/standards.md)
+- [Agent Operating Procedures](docs/agents.md) (For AI/Non-Human Agents)
 - [Deployment Conventions](docs/deployment.md)
 - [Hardware](docs/hardware.md)
 - [Networking](docs/networking.md)
 - [Services](docs/services.md)
+- [Node Capabilities](docs/capabilities.md)
+- [Action Model](services/agent-system/action-model.md)

 ---
 *Note: This repository documents the state of the homelab. Runtime state lives outside the repository in `/opt/homelab`.*
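The action approval flow summarized in the README diff above (`pending/` to `approved/`, then the executor runs) maps onto three small filesystem operations. Here is a minimal Python sketch, assuming one JSON file per action under an actions root such as `/opt/homelab/actions/`; the function names and the action payload are hypothetical, and the real executor's schema is not shown in this diff.

```python
import json
from pathlib import Path
from typing import Optional


def propose(root: Path, action_id: str, action: dict) -> Path:
    """Agent side: write a pending action; nothing is executed here."""
    pending = root / "pending"
    pending.mkdir(parents=True, exist_ok=True)
    path = pending / f"{action_id}.json"
    path.write_text(json.dumps(action))
    return path


def approve(root: Path, action_id: str) -> Path:
    """Operator side: move the file from pending/ to approved/."""
    approved = root / "approved"
    approved.mkdir(parents=True, exist_ok=True)
    target = approved / f"{action_id}.json"
    (root / "pending" / f"{action_id}.json").rename(target)
    return target


def load_if_approved(root: Path, action_id: str) -> Optional[dict]:
    """Executor side: return the action only if an approved file exists."""
    path = root / "approved" / f"{action_id}.json"
    if not path.is_file():
        return None
    return json.loads(path.read_text())
```

The executor-side guard is the important part: an action that exists only under `pending/` yields `None`, so nothing destructive can run before the operator has moved the file.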
--- a/backups/zigbee/coordinator_backup.json
+++ b/backups/zigbee/coordinator_backup.json
@@ -0,0 +1,31 @@
+{
+  "metadata": {
+    "format": "zigpy/open-coordinator-backup",
+    "version": 1,
+    "source": "zigbee-herdsman@10.0.7",
+    "internal": {
+      "date": "2026-05-14T14:48:35.098Z",
+      "znpVersion": 1
+    }
+  },
+  "stack_specific": {
+    "zstack": {
+      "tclk_seed": "32d69cbe3f0e15471e5d43f9401e485a"
+    }
+  },
+  "coordinator_ieee": "00124b00257bf416",
+  "pan_id": "46bc",
+  "extended_pan_id": "087730b5f614ea4a",
+  "nwk_update_id": 0,
+  "security_level": 5,
+  "channel": 11,
+  "channel_mask": [
+    11
+  ],
+  "network_key": {
+    "key": "049909949a950d91522cf10cc369a724",
+    "sequence_number": 0,
+    "frame_counter": 0
+  },
+  "devices": []
+}
--- a/docs/agents.md
+++ b/docs/agents.md
@@ -0,0 +1,49 @@
+# Agent Operating Procedures
+
+This document defines the operating procedures, constraints, and interaction protocols for non-human agents (AI agents, autonomous scripts) within the Homelab Codex ecosystem.
+
+## 1. Core Principles for Agents
+
+1.  **Read-Only by Default**: Agents should assume read-only access to the `/opt/homelab` runtime unless explicitly executing an approved action.
+2.  **Git as Authority**: The repository on **SATURN** is the source of truth. Agents must not modify the runtime state on nodes directly without corresponding (or pending) Git state, unless it's an emergency mitigation.
+3.  **Human-in-the-Loop (HIL)**: All destructive or structural changes (restarts, deployments, config changes) must follow the [Action Approval Model](../services/agent-system/action-model.md).
+4.  **Idempotency**: All scripts and actions proposed or executed by agents MUST be idempotent.
+5.  **Context-Awareness**: Agents MUST read the `README.md` and `docs/agents.md` at the start of every session to align with current infrastructure standards.
+
+## 2. Agent Roles
+
+| Role | Responsibility | Scope |
+|------|----------------|-------|
+| **Observer** | Monitors health, logs, and events. | Read-only access to `/opt/homelab/events` and `logs`. |
+| **Stability Agent** | Local node watchdog, event emitter. | Local node runtime, `service.yaml` healthchecks. |
+| **Orchestrator** | High-level planning, workload placement. | Repository-wide, multi-node topology. |
+| **Materializer** | Translates high-level intent into Docker/System state. | Execution of `approved` actions. |
+
+## 3. Discovery Protocol
+
+Agents must use the following entry points to understand the system:
+
+1.  **Topology**: `inventory/topology.yaml` for node list and roles.
+2.  **Capabilities**: `hosts/<node>/capabilities.yaml` to understand hardware/software constraints.
+3.  **Service Contract**: `services/<service>/service.yaml` to understand how to check health and manage a service.
+4.  **Operational State**: `/opt/homelab/state/` on local nodes for real-time status.
+
+## 4. Interaction with Humans
+
+Agents communicate with the operator via the `agent-system/telegram-bot`. 
+
+- **Alerting**: Agents emit events to the event system. Critical events are forwarded to Telegram.
+- **Proposals**: When an agent identifies a need for change (e.g., "Service X is failing, suggest restart"), it creates a `pending` action in `/opt/homelab/actions/pending/`.
+- **Approval**: Agents must wait for the action status to transition to `approved` before execution.
+
+## 5. Decision Logic (Reasoning)
+
+When making decisions, agents MUST prioritize:
+1.  **Safety**: Do not violate power constraints (see `capabilities.yaml`).
+2.  **Stability**: Prefer keeping services on their `owner_node` unless it's down.
+3.  **Connectivity**: On intermittent nodes (CHELSTY), avoid actions requiring heavy WAN traffic during low-signal periods.
+
+## 6. Access Control for Agents
+
+- **Filesystem**: Agents should run as the `homelab` user or equivalent with restricted sudo access to `docker compose`.
+- **Secrets**: Agents MUST NOT attempt to read `.env` files unless specifically tasked with credential rotation. They should treat secrets as opaque handles.
--- a/docs/capabilities.md
+++ b/docs/capabilities.md
@@ -83,3 +83,10 @@ Future autonomous agents will use this metadata to:
 2.  **Generate Plans:** Create step-by-step deployment or migration plans based on hardware compatibility.
 3.  **Validate Topology:** Ensure that a proposed multi-node setup doesn't violate networking or operational constraints (e.g., don't put a DB on an intermittent node).
 4.  **Propose Failover:** Automatically suggest the best alternative node during an outage.
+
+## Agent Reasoning Logic
+
+When an agent parses `capabilities.yaml`, it should apply these heuristics:
+- **Intermittent Connectivity**: If `operational.connectivity == "intermittent"`, do not schedule high-bandwidth syncs or critical cloud-dependent services.
+- **Power Constraints**: If `operational.power_constraint == "low-power"`, avoid heavy LLM inference or continuous high-CPU tasks.
+- **Availability Target**: If `availability_target == "high"`, this node is a candidate for hosting control-plane failovers.
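The three heuristics added to `docs/capabilities.md` above can be expressed as a single pure function. This Python sketch assumes `capabilities.yaml` has already been parsed into a dict whose keys mirror the paths as written (`operational.connectivity`, `operational.power_constraint`, top-level `availability_target`); `placement_notes` is a hypothetical name.

```python
from typing import Any, Dict, List


def placement_notes(capabilities: Dict[str, Any]) -> List[str]:
    """Apply the three heuristics to one parsed capabilities.yaml."""
    operational = capabilities.get("operational", {})
    notes: List[str] = []
    if operational.get("connectivity") == "intermittent":
        notes.append("avoid high-bandwidth syncs and cloud-dependent services")
    if operational.get("power_constraint") == "low-power":
        notes.append("avoid heavy LLM inference and continuous high-CPU tasks")
    if capabilities.get("availability_target") == "high":
        notes.append("candidate for control-plane failover")
    return notes
```

Returning notes instead of a yes/no verdict keeps the function advisory, which fits the human-in-the-loop model: an agent can attach the notes to a proposal and leave the decision to the operator.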
--- a/docs/chelsty-runtime.md
+++ b/docs/chelsty-runtime.md
@@ -1,60 +1,154 @@
 # CHELSTY Runtime

-This document describes the runtime environment and deployment flow for CHELSTY, an offline-capable home automation edge node.
+This document describes the runtime environment and deployment flow for CHELSTY, an offline-capable home automation edge node split across two VMs.
+
+| Node | Role | Services |
+|------|------|----------|
+| `chelsty-infra` | LTE edge hypervisor | Mosquitto, Zigbee2MQTT, stability-agent, node-agent |
+| `chelsty-ha` | Home Assistant VM | homeassistant (no node-agent — see below) |
+
+Both nodes share an LTE uplink and must function fully offline (Zigbee, MQTT, HA automations) without any connectivity to SATURN, VPS, or Forgejo.

 ## Runtime Layout

-The CHELSTY runtime is located at `/opt/homelab`.
-
- `/opt/homelab/config/`: Service-specific configurations and compose overrides.
- `/opt/homelab/data/`: Persistent data for services.
- `/opt/homelab/logs/`: Service logs.
-
-### Key Service Locations
- **Mosquitto**: `/opt/homelab/config/mosquitto/`
- **Zigbee2MQTT**: `/opt/homelab/config/zigbee2mqtt/`
+```
+/opt/homelab/
+├── config/          # Service-specific configs and secrets (not in Git)
+│   ├── mosquitto/
+│   └── zigbee2mqtt/
+├── data/            # Persistent service data
+│   ├── mosquitto/   # Persistence DB, password file
+│   └── zigbee2mqtt/
+│       └── data/    # z2m config, coordinator backup, network key
+└── logs/
+```

 ## SLZB-06U Integration

-CHELSTY uses a SMLIGHT SLZB-06U Zigbee coordinator connected via Ethernet/TCP.
+CHELSTY uses a SMLIGHT SLZB-06U Zigbee coordinator connected over Ethernet/TCP.

- **Coordinator IP**: 192.168.1.105
- **Port**: 6638
- **Protocol**: TCP (ezsp adapter)
+- **Coordinator IP**: `192.168.1.105`
+- **Port**: `6638`
+- **Adapter**: `ezsp` (deprecated; migrating to `ember` is recommended and only requires setting `adapter: ember` in `configuration.yaml`)
+- **Zigbee2MQTT config key**: `serial.port: tcp://192.168.1.105:6638`

-Zigbee2MQTT is configured to connect to this coordinator over the local network.
+⚠️ Never use `/dev/ttyUSB0` — the coordinator is always TCP-only on this site.

-## Offline & LTE Assumptions
+## Networking Constraints

- **WAN Resilience**: All core automation (Zigbee, MQTT) runs locally on CHELSTY.
- **Connectivity**: LTE provides intermittent uplink for remote management and Tailscale access.
- **Home Assistant**: Runs in a separate VM, connecting to the Mosquitto broker on CHELSTY.
+### Mosquitto — `network_mode: host`
+Mosquitto runs with `network_mode: host` so that all containers on the same host can reach it at `localhost:1883`. **Do not change this.**
+
+### Zigbee2MQTT — bridge network + extra_hosts
+Zigbee2MQTT runs in a bridge-networked container (needed for port mapping compatibility with docker-compose v1). To reach the host-networked Mosquitto:
+
+```yaml
+# hosts/chelsty-infra/runtime/zigbee2mqtt/docker-compose.override.yml
+services:
+  zigbee2mqtt:
+    extra_hosts:
+      - "mosquitto:host-gateway"
+```
+
+This maps the `mosquitto` hostname inside the z2m container to the Docker host gateway IP, so `mqtt://mosquitto:1883` reaches the host-networked Mosquitto process.
+
+**Why not `network_mode: host` for z2m?**  
+chelsty-infra runs docker-compose v1 (1.29.2). In v1, `network_mode: host` cannot coexist with `ports:` declared in the base `docker-compose.yml` — raises `InvalidArgument`. The `extra_hosts` approach avoids this.
+
+## Zigbee2MQTT Config Location
+
+The `configuration.yaml` **must be writable** — z2m migrates and rewrites it on startup. It lives in the data directory:
+
+```
+/opt/homelab/data/zigbee2mqtt/data/configuration.yaml
+```
+
+This path is mounted read-write by the base `docker-compose.yml`:
+```yaml
+volumes:
+  - /opt/homelab/data/zigbee2mqtt/data:/app/data
+```
+
+Do **not** mount `configuration.yaml` as a separate `:ro` volume — z2m will fail with `EROFS`.
+
+### Minimal configuration.yaml
+```yaml
+homeassistant: true
+permit_join: false
+mqtt:
+  base_topic: zigbee2mqtt
+  server: mqtt://mosquitto:1883
+serial:
+  port: tcp://192.168.1.105:6638
+  adapter: ezsp
+frontend:
+  port: 8080
+advanced:
+  log_level: info
+```
+
+## chelsty-ha — No node-agent
+
+`chelsty-ha` does not have a node-agent deployed. Home Assistant is monitored indirectly: if MQTT goes silent on `chelsty-infra`, HA is likely down.
+
+In `hosts/chelsty-ha/services.yaml`:
+```yaml
+services:
+  homeassistant:
+    monitor: false   # No node-agent; suppresses supervisor action generation
+```
+
+Remove `monitor: false` once node-agent is bootstrapped on this VM.

 ## Deployment Flow

-1. **Initial Bootstrap**:
-   Run the bootstrap script on the CHELSTY node:
-   ```bash
-   ./scripts/bootstrap/chelsty-runtime.sh
-   ```
+### Initial Bootstrap
+```bash
+./scripts/bootstrap/chelsty-runtime.sh
+```

-2. **Manual Configuration**:
-   - Edit `/opt/homelab/config/zigbee2mqtt/.env` with MQTT credentials.
-   - Add Mosquitto user:
-     ```bash
-     sudo mosquitto_passwd -b /opt/homelab/data/mosquitto/config/password.txt <user> <password>
-     ```
+### Deploy services
+```bash
+./scripts/deploy/deploy-node.sh chelsty-infra
+./scripts/deploy/deploy-node.sh chelsty-ha
+```

-3. **Service Deployment**:
-   Use the staged deployment runtime:
-   ```bash
-   ./scripts/deploy/deploy-node.sh chelsty
-   ```
+### Manual (SSH) — chelsty-infra uses docker-compose v1
+```bash
+ssh oskar@100.122.201.22
+cd ~/homelab-codex-ws/services/<service>
+docker-compose -f docker-compose.yml \
+  -f ../../hosts/chelsty-infra/runtime/<service>/docker-compose.override.yml \
+  up -d --build --force-recreate
+```

-## Recovery Procedure
+> **Note:** `docker compose` (v2) is **not** available on chelsty-infra — always use `docker-compose` (hyphenated, v1 1.29.2).

-In case of runtime failure:
-1. Verify Docker and Compose plugin: `docker compose version`
-2. Re-run bootstrap script to ensure directory structure and basic configs.
-3. Check Mosquitto logs: `tail -f /opt/homelab/data/mosquitto/log/mosquitto.log`
-4. Verify SLZB-06U reachability: `ping 192.168.1.105`
+## Recovery Procedures
+
+### Mosquitto stopped
+```bash
+ssh oskar@100.122.201.22 "docker start mosquitto"
+# Ensure the restart policy is correct (run on chelsty-infra as well, not locally):
+ssh oskar@100.122.201.22 "docker update --restart unless-stopped mosquitto"
+```
+
+### Zigbee2MQTT won't start
+1. Check logs: `docker logs zigbee2mqtt --tail 50`
+2. Verify SLZB-06U reachable from host: `nc -zv 192.168.1.105 6638`
+3. Verify config is not empty: `cat /opt/homelab/data/zigbee2mqtt/data/configuration.yaml`
+4. If the config is missing, recreate it from the minimal template above
+
+### SLZB-06U unreachable
+An `EHOSTUNREACH` error for `192.168.1.105:6638` means the coordinator is offline or the LAN is down. Zigbee2MQTT keeps retrying, so no restart is needed once the coordinator returns.
+
+## Critical Backup Sets
+
+| Data | Path |
+|------|------|
+| HA config + DB | `/opt/homelab/data/homeassistant/` on chelsty-ha |
+| z2m config + coordinator backup + network key | `/opt/homelab/data/zigbee2mqtt/data/` |
+| Mosquitto persistence + password file | `/opt/homelab/data/mosquitto/` |
+| SLZB-06U coordinator state | Backup via SLZB-06U web UI at `192.168.1.105` |
+
+> ⚠️ The Zigbee network key is in `configuration.yaml` or `coordinator_backup.json` — losing it requires re-pairing all devices.
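The `nc -zv 192.168.1.105 6638` step from the Zigbee2MQTT recovery procedure above can also be scripted. The following is a minimal sketch, assuming Python 3 is available on the host; `tcp_reachable` is a hypothetical helper name and is not part of the repository.

```python
import socket


def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection raises OSError (refused, unreachable, timeout) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `tcp_reachable("192.168.1.105", 6638)` from chelsty-infra would be expected to return `False` while the coordinator is offline or the LAN is down, which matches the `EHOSTUNREACH` case described above.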
--- a/docs/chelsty-stability-agent.md
+++ b/docs/chelsty-stability-agent.md
@ -0,0 +1,42 @@
+### CHELSTY Stability Agent
+
+The stability-agent on CHELSTY provides local observability and health monitoring for the node's services and infrastructure.
+
+#### Purpose
+
+It acts as a filesystem-first watchdog that detects anomalies in the local runtime environment without taking autonomous destructive actions (like restarts). It serves as the primary data source for node-level stability metrics.
+
+#### Monitoring Scope
+
+*   **Docker Containers**: Monitors all local containers. If a container is not in the `running` state, a `containers_not_running` event is generated.
+*   **Disk Usage**: Monitors the root filesystem. Generates `disk_usage_high` events if usage exceeds the configured threshold.
+*   **Connectivity**: 
+    *   Checks if the Tailscale socket or interface is available.
+    *   Checks reachability of the local Mosquitto MQTT broker.
+*   **Zigbee2MQTT**: Specifically tracks the presence and status of the Zigbee2MQTT service.
+
+#### Storage and Integration
+
+*   **Heartbeat**: Updated every cycle at `/opt/homelab/state/stability-agent.heartbeat`.
+*   **State Summary**: A JSON summary of all latest checks at `/opt/homelab/state/stability-agent.json`.
+*   **Events**: Append-only JSON lines at `/opt/homelab/events/YYYY-MM-DD/chelsty-infra/events.jsonl`.
+
+#### Deployment
+
+The service is deployed via Docker Compose on CHELSTY. chelsty-infra only has docker-compose v1 (1.29.2), so use the hyphenated command.
+
+```bash
+cd services/stability-agent
+docker-compose up -d
+```
+
+#### Configuration
+
+Configuration is managed via environment variables in `docker-compose.override.yml` on the host.
+
+| Variable | Description | Default |
+|----------|-------------|---------|
+| `STABILITY_CHECK_INTERVAL` | Seconds between checks | `60` |
+| `DISK_THRESHOLD_PCT` | Disk usage alert threshold | `90` |
+| `MQTT_HOST` | MQTT broker hostname | `mosquitto` |
+| `MQTT_PORT` | MQTT broker port | `1883` |
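The disk check and event append described in `docs/chelsty-stability-agent.md` above could look roughly like this. It is a sketch under stated assumptions, not the agent's actual code: the event field names (`type`, `node`, `timestamp`, `payload`) and both function names are hypothetical, while the `disk_usage_high` event name, the `DISK_THRESHOLD_PCT` variable and the `events/YYYY-MM-DD/<node>/events.jsonl` layout come from the document.

```python
import json
import os
import shutil
import time


def check_disk(path="/", threshold_pct=None, node="chelsty-infra"):
    """Return a disk_usage_high event dict if usage exceeds the threshold, else None."""
    if threshold_pct is None:
        # Default of 90 mirrors the configuration table in the doc.
        threshold_pct = float(os.environ.get("DISK_THRESHOLD_PCT", "90"))
    usage = shutil.disk_usage(path)
    used_pct = usage.used / usage.total * 100
    if used_pct <= threshold_pct:
        return None
    return {
        "type": "disk_usage_high",
        "node": node,
        "timestamp": time.time(),
        "payload": {"path": path, "used_pct": round(used_pct, 1)},  # assumed payload shape
    }


def append_event(event, events_root="/opt/homelab/events"):
    """Append one event as a JSON line under <root>/<YYYY-MM-DD>/<node>/events.jsonl."""
    day = time.strftime("%Y-%m-%d", time.gmtime(event["timestamp"]))
    path = os.path.join(events_root, day, event["node"], "events.jsonl")
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(event) + "\n")
    return path
```

Opening the file in append mode keeps the log append-only, in line with the agent's stated policy of observing without taking destructive action.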
--- a/docs/observer-runtime.md
+++ b/docs/observer-runtime.md
@ -0,0 +1,98 @@
+# Observer Runtime
+
+The Observer Runtime is a lightweight agent responsible for synthesizing the operational world state of the homelab from raw events, logs, and state files.
+
+## Architecture
+
+The observer follows a filesystem-first approach, consuming append-only events and generating a normalized world model. It is designed to be idempotent, resumable, and resilient to intermittent node connectivity.
+
+### Inputs
+- `/opt/homelab/events/`: Normalized JSON events (one `.json` file per event, organized by date and node).
+- `/opt/homelab/state/observer_checkpoint.json`: Per-node checkpoint dict (see below).
+- Repository Inventory: `inventory/topology.yaml` and `hosts/*/services.yaml`.
+
+### World Model Output
+Generated under `/opt/homelab/world/`:
+- `nodes.json`: Current node availability, roles, disk/memory pressure, last seen timestamps. Dict keyed by node name.
+- `services.json`: Service health status and links to active incidents. Dict keyed by `"node/service"`.
+- `deployments.json`: Tracking of active and historical deployment runs by `correlation_id`.
+- `incidents.json`: Correlated operational issues, including repeat failures and resolution status.
+- `runtime-summary.json`: High-level overview for dashboards and planner agents.
+
+## Checkpoint Format
+
+The observer tracks per-node progress to avoid silently skipping event directories:
+
+```json
+{
+  "node_checkpoints": {
+    "vps":            "/opt/homelab/events/2026-05-27/vps/evt-vps-1234.json",
+    "piha":           "/opt/homelab/events/2026-05-27/piha/evt-piha-5678.json",
+    "chelsty-infra":  "/opt/homelab/events/2026-05-27/chelsty-infra/evt-chelsty-infra-9012.json"
+  }
+}
+```
+
+A single global checkpoint (`last_processed_file`) was replaced with this per-node dict because the old approach silently skipped any node directory that sorts alphabetically before the last-seen node (e.g. `piha/` would be skipped when the checkpoint pointed to `vps/`).
+
+**Reset:** Delete `/opt/homelab/state/observer_checkpoint.json`. The observer will reprocess all events and rebuild world state from scratch.
+
+## Event Types
+
+### Negative events (create/escalate incidents)
+- `service_unhealthy`, `healthcheck_failed` — open or increment an active incident
+- `deployment_failed` — record failure in deployments.json
+
+### Positive events (resolve state)
+- `service_healthy` — marks service status as `healthy` **and** resolves any active incident for that service
+- `service_recovered` — alias, same effect
+- `deployment_completed` — marks deployment as completed
+
+### Node events
+- `node_online`, `node_offline` — update node status in nodes.json
+- `disk_pressure_*` — set `disk_pressure` field on the node record
+
+## Incident Lifecycle
+
+1.  **Detection**: A `service_unhealthy` or `healthcheck_failed` event creates or increments an active incident.
+2.  **Correlation**: Multiple failure events for the same `node/service` are collapsed into one incident, incrementing `occurrence_count`.
+3.  **Resolution**: A `service_healthy` or `service_recovered` event resolves any active incident for that service, setting `status: resolved` and `resolved_at`.
+4.  **Expiry**: Resolved incidents older than 7 days are pruned from world state by `_prune_stale_world()`.
+
+### Example Incident JSON
+```json
+{
+  "inc-1715518800-vps-observer": {
+    "id": "inc-1715518800-vps-observer",
+    "node": "vps",
+    "service": "observer",
+    "status": "resolved",
+    "severity": "error",
+    "started_at": 1715518800.0,
+    "last_occurrence": 1715518860.0,
+    "occurrence_count": 2,
+    "trigger_type": "containers_not_running",
+    "resolved_at": 1715519100.0
+  }
+}
+```
+
+## World State Pruning
+
+`_prune_stale_world()` runs every reconcile cycle and removes:
+
+1. **Stale nodes** — nodes not present in `inventory/topology.yaml` (e.g. ghost nodes created when `NODE_NAME` was unset and fell back to the container's 12-char hex ID).
+2. **Services of stale nodes** — all `node/service` keys whose node was pruned.
+3. **Ghost service keys** — service keys whose service-name portion matches the pattern `<12hexchars>_<name>` (Docker internal stale-state artifacts, created when node-agent used `c.name` instead of the compose label).
+4. **Expired incidents** — resolved incidents older than 7 days.
+
+## Runtime Behavior
+
+### Idempotency
+The observer processes events in order. Deleting the checkpoint and restarting replays all events and produces the same world state.
+
+### Deployment Tracking
+Deployments are tracked via `correlation_id`. The observer synthesizes the start, end, and status of each deployment run from events.
+
+### Topology Filtering
+Events from nodes not listed in `inventory/topology.yaml` are discarded during pruning. This prevents transient bootstrap noise from polluting world state.
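The per-node checkpoint described in `docs/observer-runtime.md` above can be illustrated with a short sketch. It assumes the `events/<YYYY-MM-DD>/<node>/<event>.json` layout from the document and assumes that event paths sort lexically in processing order; `pending_events` is a hypothetical name, not the observer's real function.

```python
from pathlib import Path
from typing import Dict, List


def pending_events(events_root: Path, node_checkpoints: Dict[str, str]) -> Dict[str, List[str]]:
    """Group unprocessed event files by node, comparing each path to that node's checkpoint."""
    pending: Dict[str, List[str]] = {}
    # Sort by full path string so dates order first, then node, then file name.
    for path in sorted(events_root.glob("*/*/*.json"), key=str):
        node = path.parent.name
        # A node without a checkpoint has processed nothing yet.
        last_done = node_checkpoints.get(node, "")
        if str(path) > last_done:
            pending.setdefault(node, []).append(str(path))
    return pending
```

Because each path is compared only with the checkpoint of its own node, a checkpoint pointing into `vps/` no longer hides unprocessed files under `piha/`, which is the failure mode the document attributes to the single global checkpoint.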
--- a/docs/sessions/2026-05-27-planner-agent.md
+++ b/docs/sessions/2026-05-27-planner-agent.md
@ -0,0 +1,234 @@
+# SESSION: Building the planner-agent (LLM-based diagnostics)
+
+**DATE:** 2026-05-27  
+**RESULT:** planner-agent is running on SOLARIA (`healthy`) with Ollama as primary; the cloud fallback is ready to be enabled
+
+---
+
+## What was built
+
+### `services/planner-agent/src/llm_router.py`
+
+LLM routing module with a local-first fallback chain:
+
+- **`LLMRouter`**: main routing class, built on litellm
+- **`ModelConfig`**: configuration for a single model (name, timeout, api_base, extra_kwargs)
+- **`ModelMetrics`**: counters per model × outcome (`success`/`fallback`/`error`), plus success_rate
+- **`RouteResult`**: routing result with `content`, `model_used`, `attempts`, `latency_ms`
+- **`AttemptRecord`**: record of a single attempt (model, outcome, reason, latency_ms)
+- **`_extract_json_from_fence()`**: extracts JSON from ` ```json ``` ` blocks when the model does not reply with bare JSON
+
+Default chain: `ollama/qwen2.5:7b` (8s) → `claude-haiku-4-5-20251001` (30s) → `claude-sonnet-4-6` (30s)
+
+Metrics for every call are published to the Redis channel `llm_router_metrics`.
+
+### `services/planner-agent/src/planner.py`
+
+Main agent loop:
+
+- **`PlannerAgent`**: async agent: Redis sub → LLM diagnosis → pending action file → event
+- **`HealthEvent`**: normalized health event from Redis (node, service, event_type, severity, payload)
+- **`ActionProposal`**: action proposal with full metadata; `.to_action_file()` → executor format
+- **`CooldownTracker`**: 5-minute gate per `svc_key` (node/service); does NOT register a cooldown if the LLM failed
+- **`parse_event()`**: normalizes two input formats (node-agent / control-plane)
+- **`write_pending_action()`**: atomic write: `.tmp` → rename
+- **`emit_event()`**: writes a `remediation_started` event to the filesystem (no imports from control-plane)
+
+Pipeline:
+```
+Redis msg → parse_event() → benign skip → cooldown gate → _propose_action() (LLM)
+         → write_pending_action() → emit_event("remediation_started")
+```
+
+### Supporting files
+
+| File | Description |
+|------|------|
+| `service.yaml` | Operational contract: owner_node=solaria, deps=redis+ollama, healthcheck=file |
+| `docker-compose.yml` | env_file + extra_hosts:host-gateway + ANTHROPIC_API_KEY in environment |
+| `Dockerfile` | python:3.11-slim, litellm, redis, jsonschema, structlog |
+| `healthcheck.sh` | Checks the age of the heartbeat file (max 300s) |
+| `requirements.txt` | litellm, redis, jsonschema, structlog |
+| `tests/test_planner.py` | 49 unit tests |
+| `tests/test_llm_router.py` | 34 unit tests |
+
+---
+
+## Key architectural decisions
+
+### 1. HITL invariant (Human-in-the-loop)
+
+The planner writes **only** to `actions/pending/`. The executor requires a file in `actions/approved/`.
+The planner never executes an action on its own; this is a fundamental rule of the system.
+
+Implementation: `write_pending_action()` writes to `pending/`, and no code path touches `approved/`.
+
+### 2. Cooldown gate
+
+Per `svc_key` (= `node/service`), 5 minutes by default. Goal: avoid flooding the operator with repeated
+proposals for the same service.
+
+**Key decision:** the cooldown is NOT registered if the entire LLM chain failed.
+This lets the next event retry instead of being silently blocked
+for 5 minutes even though no proposal was produced.
+
+### 3. Fallback chain: local-first
+
+Order: Ollama (local GPU) → Haiku → Sonnet.
+
+Rationale:
+- Ollama sends no data to external services; low latency for simple cases
+- Haiku = fast, cheap cloud fallback
+- Sonnet = last resort for hard cases
+
+A model is rejected on: timeout, network error, refusal pattern, invalid JSON, schema error.
+
+### 4. No imports from control-plane
+
+`services/planner-agent/` is fully self-contained. It imports nothing from
+`services/control-plane/`. Event emission is implemented locally (a copy of the logic in
+`scripts/lib/events.py`).
+
+Rationale: the planner must keep working even when control-plane is offline; it also has a separate
+deployment cycle.
+
+### 5. structlog with PrintLoggerFactory
+
+We do not use `structlog.stdlib.add_logger_name`, because `PrintLogger` has no `.name` attribute.
+Instead the processor chain is: `add_log_level` → `TimeStamper` → `StackInfoRenderer`
+→ `format_exc_info` → `JSONRenderer`.
+
+### 6. NODE_NAME is read at call time, not import time
+
+`_emit_event_sync` reads `NODE_NAME` from the module-level `NODE_NAME` variable on every call
+(not as a default parameter). This makes it patchable in tests.
+
+---
+
+## Problems encountered and solutions
+
+### Problem: `localhost` inside the container does not reach the host
+
+**Context:** Ollama runs on SOLARIA at `localhost:11434`. A Docker container
+on the default bridge network cannot reach the host via `localhost`.
+
+**Solution:**
+1. Added `extra_hosts: - "host-gateway:host-gateway"` to docker-compose.yml
+2. `.env` uses `OLLAMA_HOST=http://host-gateway:11434`
+
+### Problem: `environment` vs `env_file` (duplicated variables)
+
+**Context:** The first version of docker-compose.yml had every variable hardcoded
+in the `environment` section with fallback values (`${VAR:-default}`). As a result,
+`.env` was optional rather than required.
+
+**Solution:** All runtime variables were removed from `environment` and moved to `env_file`.
+Only `ANTHROPIC_API_KEY` remains in `environment` (an optional secret that should not live in a file on disk).
+
+### Problem: `structlog.stdlib.add_logger_name` crashes with PrintLogger
+
+**Symptom:** `AttributeError: 'PrintLogger' object has no attribute 'name'`
+
+**Solution:** Removed `add_logger_name` from the processor chain. It is not
+compatible with `PrintLoggerFactory`.
+
+### Problem: verify stage fails right after startup
+
+**Symptom:** `deploy.sh` reports FAILED at verify because the heartbeat does not exist yet.
+
+**Cause:** Race condition: the agent needs a few seconds to start
+its loop and perform the first `touch()` of the heartbeat.
+
+**Solution:** This is not a real failure. The Docker healthcheck has `start_period: 30s`.
+The container shows `(healthy)` 30s after startup.
+
+### Problem: git pull with divergent branches on solaria
+
+**Symptom:** Solaria had 2 local commits that were not on Forgejo, plus manual changes in the working tree.
+`git pull` failed with "Need to specify how to reconcile divergent branches."
+
+**Solution:**
+```bash
+git checkout -- services/planner-agent/docker-compose.yml  # discard manual changes
+git fetch origin
+git rebase origin/master  # rebase local commits on top of master
+```
+
+---
+
+## Deployment status on SOLARIA
+
+```
+Container:  planner-agent   Up ~30m (healthy)
+Image:      planner-agent-planner-agent
+Node:       solaria (100.100.231.104)
+Heartbeat:  /opt/homelab/state/planner-agent.heartbeat  (age 0s)
+
+Channels subscribed:
+  - health_events
+  - world_updates
+
+LLM chain:
+  PRIMARY:  ollama/qwen2.5-coder:14b @ http://host-gateway:11434
+  FALLBACK: claude-haiku-4-5-20251001  (disabled: no ANTHROPIC_API_KEY)
+  FALLBACK: claude-sonnet-4-6          (disabled: no ANTHROPIC_API_KEY)
+
+Redis:      redis://100.108.208.3:6379  ✓ connected
+```
+
+---
+
+## Left for later
+
+### 1. ANTHROPIC_API_KEY: cloud fallback disabled
+
+Haiku and Sonnet are configured in the chain but have no API key.  
+When Ollama cannot cope (complex case / timeout), the chain fails with no fallback.
+
+To enable:
+```bash
+ssh oskar@100.100.231.104
+echo "ANTHROPIC_API_KEY=sk-ant-..." >> /opt/homelab/config/planner-agent/.env
+docker compose -f ~/homelab-codex-ws/services/planner-agent/docker-compose.yml up -d
+```
+
+### 2. End-to-end test with a real event
+
+The planner is connected to Redis and listening, but no event has yet
+gone through the full path (LLM call → pending action → operator UI).
+
+Test:
+```bash
+redis-cli -h 100.108.208.3 PUBLISH health_events '{
+  "type": "service_unhealthy",
+  "node": "piha",
+  "service": "mosquitto",
+  "severity": "error",
+  "payload": {"reason": "container exited"},
+  "timestamp": "2026-05-27T20:00:00Z"
+}'
+# Watch: docker logs planner-agent -f
+# Check: ls /opt/homelab/actions/pending/
+```
+
+### 3. Solaria local commits
+
+Solaria has 2 local commits (`feat: add ECC skills`, `fix: remove duplicate CLAUDE.md sections`)
+that are not on Forgejo. They were rebased on top of master but not pushed.
+They should be pushed, or reviewed and possibly squashed.
+
+### 4. Integration with operator UI / Telegram
+
+Proposals in `actions/pending/` do not yet have a notification channel to the operator.
+The Telegram bot should send a notification when a new file appears in `pending/`.
+
+---
+
+## Commits from this session
+
+```
+ff6fda1  planner-agent: use env_file, keep only ANTHROPIC_API_KEY in environment
+ca37fca  Add planner-agent: LLM-powered remediation planner
+         (llm_router.py, planner.py, tests, service.yaml, docker-compose.yml,
+          healthcheck.sh, Dockerfile)
+```
--- a/docs/sessions/2026-05-27.md
+++ b/docs/sessions/2026-05-27.md
@ -0,0 +1,103 @@
+# SESSION: Stabilizing the homelab multi-agent system
+
+**DATE:** 2026-05-27  
+**RESULT:** System NOMINAL (97/97 services, 0 errors)
+
+---
+
+## PROBLEMS FOUND
+
+- stability-agent did not generate remediation actions (only redeploy, no container_restart)
+- mosquitto on chelsty-infra went down and nothing restarted it (restart policy was `no`)
+- zigbee2mqtt had never been deployed on chelsty-infra
+- node-agent was an empty skeleton: it did not emit `service_healthy`, so `services.json` was always empty
+- ghost services: node-agent used `c.name` (which can return `<12hex>_real-name`) instead of the `com.docker.compose.service` label
+- the materializer on piha read from its local Redis instead of the control-plane API; Redis held 80 stale entries with ghost keys, so "Copy for AI" returned old data
+- observer used a single global checkpoint instead of per-node ones and silently skipped event directories that sort before the current checkpoint
+- supervisor did not cancel resolved actions, so the pending queue grew without bound
+- the `service_healthy` event did not close active incidents
+- NODE_ALIAS_MAP was not configured, causing a node-name mismatch between events and topology
+- chelsty-ha was wrongly in monitoring scope; it has no node-agent
+
+---
+
+## FIXES SHIPPED (commits in master)
+
+```
+7277bdc Fix Copy for AI: materializer fetches from control-plane API instead of Redis
+b40b832 Fix ghost service keys from hash-prefixed Docker container names
+28e9534 observer: service_healthy resolves active incidents
+46ae92b supervisor: also cancel pending actions for services removed from desired state
+410bfe7 zigbee2mqtt: config goes in data dir (writable), not separate ro mount
+b3912fe zigbee2mqtt: use extra_hosts host-gateway instead of network_mode: host
+61e07f4 zigbee2mqtt override: clear ports list for docker-compose v1 host network compat
+51002d4 Fix pending actions: node_exporter, zigbee2mqtt, chelsty-ha monitoring
+fb7828b supervisor: auto-cancel pending actions when drift is resolved
+2f19657 fix(node-agent): unique event IDs per service to prevent same-second overwrites
+267742c vps/node-agent: add network_mode: host for control-plane health probe
+4e8968f Fix service health tracking: emit service_healthy, control-plane endpoint, checkpoint migration
+f4a8db9 fix(observer): per-node-directory checkpoints replace single global checkpoint
+a5a3e22 fix(node-agent): skip SSH config file in rsync to avoid UID ownership errors
+2349de5 fix(node-agent): correct VPS_EVENTS_HOST to actual VPS Tailscale IP
+65bac4e fix(node-agent): mount host SSH key into container for event shipping
+96bf326 fix(observer+operator-ui): fix stale world state, dict→list API, event time filter
+ae33cce feat(node-agent): add runtime overrides for piha, solaria, chelsty-infra
+c5c080b feat(vps): add node-agent runtime override with NODE_NAME=vps
+01b7758 feat(node-agent): implement health monitor and safe cleanup policy
+```
+
+### Details of key fixes
+
+**fix(observer): per-node checkpoints**  
+A single global checkpoint `last_processed_file` silently skipped event directories that sort alphabetically before the last processed node (e.g. piha/ < vps/). It was replaced with a per-node dict `{"node_checkpoints": {"piha": "...", "vps": "..."}}`.
+
+**fix(observer): ghost key pruning**  
+`_prune_stale_world()` now removes services.json entries whose service key matches the pattern `<12hexchars>_<name>` (artifacts of Docker internal state tracking).
+
+**fix(node-agent): canonical container name**  
+`check_containers()` now uses the `com.docker.compose.service` label as the canonical name. Fallback: strip the hash prefix from `c.name`. Containers in the `created` state are skipped (Docker stale-state artifacts).
+
+**fix(node-agent): service_healthy emission**  
+Node-agent now emits `service_healthy` for every running managed container each cycle. Without this, `services.json` was always empty and the supervisor generated a flood of "missing service" redeploys.
+
+**fix(supervisor): auto-cancel resolved actions**  
+`_cancel_resolved_pending_actions()` moves pending actions to `cancelled/` when:
+- the service became healthy (`drift_resolved_auto`)
+- the service was removed from desired state (`service_removed_from_desired_state`)
+
+**fix(supervisor): monitor:false**  
+The `monitor: false` field in `services.yaml` excludes a service from supervisor action generation. Used for `homeassistant` on chelsty-ha (no node-agent).
+
+**fix(agent-system/materializer): control-plane API as source**  
+The materializer on piha now fetches data from the VPS control-plane API (`CONTROL_PLANE_URL=http://100.95.58.48:18180`) instead of the local Redis. Redis held 80 stale entries. The Redis path is kept as a fallback.
+
+**fix(chelsty-infra/zigbee2mqtt): mosquitto networking**  
+Mosquitto runs with `network_mode: host`, so bridge containers cannot reach it via localhost. Solution: `extra_hosts: - "mosquitto:host-gateway"` in the z2m override. We do not use `network_mode: host` for z2m because it conflicts with `ports:` in docker-compose v1 (1.29.2 on chelsty-infra).
+
+**fix(chelsty-infra/zigbee2mqtt): writable config**  
+z2m migrates and overwrites `configuration.yaml` at startup. The config must live in the data directory: `/opt/homelab/data/zigbee2mqtt/data/configuration.yaml` (read-write mount), not in a separate `:ro` volume.
+
+---
+
+## FINAL STATE
+
+| Node | Status | Services |
+|------|--------|---------|
+| vps | online | control-plane (4), node-agent, node_exporter, stability-agent |
+| piha | online | agent-system (4), node-agent, stability-agent, monitoring stack |
+| solaria | online | node-agent, stability-agent, AI workloads |
+| chelsty-infra | online | mosquitto, zigbee2mqtt (z2m connects once SLZB-06U is back online), node-agent, stability-agent |
+| chelsty-ha | - | homeassistant (monitor:false: no node-agent, HA monitored indirectly via MQTT) |
+
+**Action queue:** 0 pending, 0 approved, 0 running  
+**Incidents:** 0 active  
+**Ghost service keys:** 0  
+
+---
+
+## KNOWN LIMITATIONS / TODO
+
+- SLZB-06U (Zigbee coordinator) is offline: `192.168.1.105:6638` returns EHOSTUNREACH from chelsty-infra. Probably a hardware/network problem on the 192.168.1.0/24 side. z2m starts and serves an error page on :8080; it will connect automatically once the coordinator returns.
+- The `ezsp` adapter in the z2m config is deprecated; migrating to `ember` is recommended. No new configuration is needed, only changing the field to `adapter: ember` in `configuration.yaml`.
+- chelsty-ha has no node-agent. Add one when a machine is available or via manual bootstrap.
+- Redis on piha still contains old keys (`homelab:nodes:*`, `homelab:incidents:*`, etc.); the materializer no longer uses them in API mode, so they can be cleaned up.
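The canonical-name rule and the ghost-key pattern `<12hexchars>_<name>` from the session notes above can be expressed with one regular expression. This is a minimal sketch: the function names are hypothetical, lowercase hex is assumed for the prefix, and the real `check_containers()` and `_prune_stale_world()` may differ in detail.

```python
import re
from typing import Mapping

# Twelve hex characters, an underscore, then the real service name.
_HASH_PREFIX = re.compile(r"^[0-9a-f]{12}_(?P<name>.+)$")


def canonical_service_name(container_name: str, labels: Mapping[str, str]) -> str:
    """Prefer the compose service label; otherwise strip a hash prefix from the name."""
    label = labels.get("com.docker.compose.service")
    if label:
        return label
    match = _HASH_PREFIX.match(container_name)
    return match.group("name") if match else container_name


def is_ghost_key(service_key: str) -> bool:
    """True if the service part of a "node/service" key looks like <12hex>_<name>."""
    _, _, service = service_key.partition("/")
    return bool(_HASH_PREFIX.match(service))
```

For instance, `canonical_service_name("0123456789ab_mosquitto", {})` yields `"mosquitto"`, and `is_ghost_key("piha/0123456789ab_redis")` is true while `is_ghost_key("piha/redis")` is not.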
--- a/docs/stability-agent-rollout.md
+++ b/docs/stability-agent-rollout.md
@ -0,0 +1,62 @@
+# Stability Agent Multi-Node Rollout
+
+## Architecture Summary
+The `stability-agent` is a lightweight Python service that monitors node health (disk, Docker containers, Tailscale, MQTT) and publishes state to a central Redis instance running on **PIHA**.
+
+- **Source**: `services/stability-agent`
+- **State Path**: `/opt/homelab/state`
+- **Events Path**: `/opt/homelab/events`
+- **Redis Target**: `100.108.208.3:6379` (PIHA)
+
+## Why UI only showed CHELSTY
+Previously, the `stability-agent` had `NODE_NAME` defaulted to `chelsty` and was only deployed there. The Agent System UI materializer on PIHA filters nodes based on the Redis keys `homelab:nodes:<NODE_NAME>`. Without other agents publishing their specific `NODE_NAME`, the UI remained limited to the single active node.
+
+## Deployment
+
+Use the helper script to deploy or generate commands. The script uses explicit Tailscale IPs for remote targets (piha, chelsty, vps) and runs locally for solaria.
+
+```bash
+# Print commands
+./scripts/deploy/deploy-stability-agent.sh <node-name>
+
+# Deploy via SSH (executes ssh oskar@<ip>)
+./scripts/deploy/deploy-stability-agent.sh <node-name> --ssh
+```
+
+### Manual Steps per Node
+The manual steps are encapsulated in `services/stability-agent/deploy-local.sh`. On the target node:
+```bash
+cd /home/oskar/homelab-codex-ws
+git fetch origin
+git checkout master
+git pull origin master
+cd services/stability-agent
+./deploy-local.sh <node-name>
+```
+
+## Verification
+
+### Fleet Overview
+Run the verification script from any node with `redis-cli` access:
+```bash
+./scripts/deploy/verify-agent-fleet.sh
+```
+
+### Redis Inspection (on PIHA)
+```bash
+docker exec agent-system-redis redis-cli KEYS 'homelab:nodes:*'
+docker exec agent-system-redis redis-cli HGETALL homelab:nodes:<node-name>
+```
+
+Verify Web UI backend:
+```bash
+curl -s http://127.0.0.1:18180/nodes
+curl -k https://agents.okit.pl/nodes
+```
+
+## Troubleshooting
+
+- **Redis empty after compose down**: The `agent-system-redis` on PIHA uses transient storage if not configured with a volume. If it restarts, agents must republish their state (they do this automatically every `CHECK_INTERVAL`).
+- **Secrets**: `.env` files and local secrets are not committed to the repo. Ensure `MQTT_HOST` and other specific secrets are set via overrides if needed.
+- **Telegram**: Telegram bot notifications can remain disabled if `TELEGRAM_BOT_TOKEN` is absent.
+- **Docker Socket**: If the agent reports `unavailable` for Docker, ensure `/var/run/docker.sock` is mounted and the user has permissions.
--- a/docs/standards.md
+++ b/docs/standards.md
@ -49,9 +49,10 @@ Runtime state must live outside the repository to keep it immutable and clean.
 ## Service Standards

 1.  **Normalization**: Every service MUST follow the `services/<service>/` layout.
-2.  **Metadata**: Every service MUST have a `service.yaml` defining its operational contract.
-3.  **Healthchecks**: Every service MUST have a `healthcheck.sh` for verification.
-4.  **Secrets**: NEVER commit secrets to Git. Use `env.example` as a template and populate `/opt/homelab/config/<service>/.env` on the host.
+2.  **Metadata**: Every service MUST have a `service.yaml` defining its operational contract. This is the primary source of truth for AI agents.
+3.  **Healthchecks**: Every service MUST have a `healthcheck.sh` for verification. Agents use this to emit stability events.
+4.  **Actionability**: Any automated recovery action proposed by an agent must be backed by a `service.yaml` definition.
+5.  **Secrets**: NEVER commit secrets to Git. Use `env.example` as a template and populate `/opt/homelab/config/<service>/.env` on the host. Agents must treat these as "black box" configurations.

 ## Docker Compose Standards

--- a/docs/vps-control-plane.md
+++ b/docs/vps-control-plane.md
@ -0,0 +1,126 @@
+# VPS Control Plane
+
+The VPS Control Plane is the orchestration brain of the homelab platform. It runs on the Hetzner VPS (Tailscale IP: `100.95.58.48`) and provides observability, automated reconciliation, and a web-based operator interface.
+
+## Architecture
+
+The control plane consists of four core services running as a Docker Compose stack under `services/control-plane/`:
+
+| Container | Role |
+|-----------|------|
+| `control-plane-observer` | Synthesizes world state from events in `/opt/homelab/events/` |
+| `control-plane-supervisor` | Detects drift between desired state (`hosts/*/services.yaml`) and actual state (`world/services.json`); writes pending actions |
+| `control-plane-executor` | Executes approved actions from `/opt/homelab/actions/approved/` |
+| `control-plane-ui` | Web interface for system monitoring and action approval; serves port 18180 |
+
+All services use **filesystem-first** semantics with `/opt/homelab/` as the data exchange layer. All four run with `network_mode: host` and as UID 1000 (`homelab` user).
+
+## Supervisor Behavior
+
+### Desired State
+Loaded from `hosts/*/services.yaml` each reconcile cycle. Services with `monitor: false` are silently skipped — use this for services without a node-agent (e.g. `homeassistant` on `chelsty-ha`).
+
+### Drift Types
+- `missing_service` — service is in desired state but absent from `services.json`
+- `unhealthy_service` — service exists in `services.json` but `status != healthy`
+
+### Action Types
+| Trigger | Action type | Risk |
+|---------|-------------|------|
+| `containers_not_running`, `mqtt_unreachable` | `container_restart` | low |
+| Any other / unknown | `redeploy` | guarded |
+| Node `disk_pressure: high` | `disk_cleanup` | guarded |
+
+### Action ID Stability
+Action IDs are deterministic: `redeploy-{node}-{service}` or `container-restart-{node}-{service}`. The same drift always produces the same filename, making reconcile truly idempotent across supervisor restarts.
+
+### Auto-Cancel
+Pending `redeploy` and `container_restart` actions are automatically moved to `cancelled/` when:
+- **`drift_resolved_auto`** — the service becomes `healthy` in actual state
+- **`service_removed_from_desired_state`** — the service was removed from `services.yaml` or marked `monitor: false`
+
+Only `pending` actions are auto-cancelled. Approved/running actions have been committed to by the operator and are never cancelled automatically.
+
+### Node Name Resolution
+The supervisor supports a `NODE_ALIAS_MAP` environment variable (JSON string) to map event/world-state node names to canonical topology names:
+
+```bash
+NODE_ALIAS_MAP='{"node-2": "chelsty-infra", "node-1": "piha"}'
+```
+
+## Deployment
+
+### From SATURN (primary control node)
+```bash
+# Full deploy via SSH
+./scripts/deploy/deploy-control-plane.sh --ssh
+
+# Or manually:
+ssh oskar@100.95.58.48 "cd ~/homelab-codex-ws && git pull origin master && cd services/control-plane && docker compose up -d --build --force-recreate"
+```
+
+### Direct on VPS
+```bash
+cd ~/homelab-codex-ws/services/control-plane
+docker compose up -d --build --force-recreate
+```
+
+`deploy-local.sh` also creates the required `/opt/homelab/` directory structure and sets ownership to UID 1000 (requires `sudo`). If directories already exist, skip to the `docker compose` step directly.
+
+### Verification
+```bash
+# On VPS
+docker ps --filter "name=control-plane"
+curl -s http://localhost:18180/summary | python3 -m json.tool
+```
+
+## Action Approval Workflow
+
+```
+Supervisor writes → /opt/homelab/actions/pending/<id>.json
+                 → Operator UI (port 18180) or Telegram Bot notifies
+                 → Operator clicks Approve
+                 → /opt/homelab/actions/approved/<id>.json
+                 → Executor executes → completed / failed
+```
+
+Action states: `pending → approved → running → completed / failed`; a pending action can instead be rejected by the operator (`pending → rejected`).  
+Auto-cancel path: `pending → cancelled/`
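The lifecycle can be modelled as a small transition table. This is an illustrative sketch, not the executor's implementation; it assumes that `rejected` and `cancelled` are reachable only from `pending`, and the name `can_transition` is hypothetical.

```python
# Allowed next states per current state; terminal states have no entry.
ALLOWED_TRANSITIONS = {
    "pending": {"approved", "rejected", "cancelled"},
    "approved": {"running"},
    "running": {"completed", "failed"},
}


def can_transition(current: str, new: str) -> bool:
    """True if moving an action file from `current/` to `new/` is a legal step."""
    return new in ALLOWED_TRANSITIONS.get(current, set())
```

Keeping the table explicit makes it easy to reject illegal moves, such as auto-cancelling an action that has already been approved.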
+
+## Recovery
+
+### World state is stale or corrupt
+```bash
+# On VPS — delete checkpoint to force full replay
+rm /opt/homelab/state/observer_checkpoint.json
+docker restart control-plane-observer
+```
+
+### Flood of pending actions after bootstrap
+Check whether node-agent is running and emitting `service_healthy` events on each node. Without those events, the supervisor treats every service as missing and queues redeployments on every cycle.
+
+```bash
+# Check node-agent on each node
+ssh oskar@<node> "docker ps --filter name=node-agent && docker logs node-agent --tail 20"
+```
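Why a silent node-agent causes the flood can be seen in a simplified drift calculation. This sketch assumes the supervisor compares the desired service list against services that have reported `service_healthy`; the function name and the list-based inputs are hypothetical.

```python
def services_needing_redeploy(desired: list, healthy: list) -> list:
    """Every desired service without a healthy report counts as drift."""
    return sorted(set(desired) - set(healthy))


# With node-agent running, only the genuinely missing service is queued.
print(services_needing_redeploy(["mosquitto", "zigbee2mqtt", "frigate"],
                                ["mosquitto", "frigate"]))

# With no service_healthy events at all, the whole desired list is queued.
print(services_needing_redeploy(["mosquitto", "zigbee2mqtt", "frigate"], []))
```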
+
+### Rebuild from scratch
+```bash
+ssh oskar@100.95.58.48 "cd ~/homelab-codex-ws/services/control-plane && docker compose up -d --build --force-recreate"
+```
+
+## Integration
+
+### piha agent-system webui (port 18180 on piha)
+The `agent-system-runtime-materializer` on piha polls the VPS control-plane API every 10 seconds and mirrors world state to piha's local `/opt/homelab/world/`. This ensures the **"Copy for AI"** button in the piha webui (`agent-system-webui`) reflects the same clean state as the VPS API.
+
+Override: `hosts/piha/runtime/agent-system/docker-compose.override.yml` — sets `CONTROL_PLANE_URL=http://100.95.58.48:18180`.
+
+### Nginx Proxy Manager
+The operator UI at port 18180 can be proxied via NPM for external access. No WebSocket support required.
+
+### Log Locations
+- Container logs: `docker compose logs -f` (from `services/control-plane/`)
+- Runtime events: `/opt/homelab/events/YYYY-MM-DD/`
+- World state: `/opt/homelab/world/`
+- Action queue: `/opt/homelab/actions/{pending,approved,running,completed,failed,rejected,cancelled}/`
--- a/hosts/chelsty-ha/capabilities.yaml
+++ b/hosts/chelsty-ha/capabilities.yaml
@ -0,0 +1,24 @@
+host: chelsty-ha
+site: chelsty
+
+capabilities:
+  networking:
+    reachability: tailscale-only
+    tailscale_ip: 100.122.201.23
+    ingress_suitability: false
+    bandwidth: LTE
+
+  runtime:
+    container_engine: docker
+    os: debian
+
+  operational:
+    connectivity: intermittent
+    availability_target: best-effort
+    offline_first: true
+    uplink: lte
+
+  deployment:
+    suitability:
+      - homeassistant
+    restricted: false
--- a/hosts/chelsty-ha/host.yaml
+++ b/hosts/chelsty-ha/host.yaml
@ -0,0 +1,20 @@
+hostname: chelsty-ha
+site: chelsty
+
+roles:
+  - homeassistant
+
+network:
+  tailscale_ip: 100.122.201.23
+
+runtime:
+  root: /opt/homelab
+
+deployment:
+  mode: pull
+  managed_by: saturn
+
+  constraints:
+    connectivity:
+      intermittent: true
+      uplink: lte
--- a/hosts/chelsty-ha/services.yaml
+++ b/hosts/chelsty-ha/services.yaml
@ -0,0 +1,12 @@
+host: chelsty-ha
+site: chelsty
+
+services:
+  homeassistant:
+    role: home-automation-controller
+    offline_required: true
+    # monitor: false — chelsty-ha has no node-agent deployed, so there are no
+    # container-health events for the observer to track. HA is monitored
+    # indirectly via the chelsty-infra MQTT broker (if MQTT goes silent, HA
+    # is likely down). Re-enable once node-agent is bootstrapped on this VM.
+    monitor: false
--- a/hosts/chelsty-infra/capabilities.yaml
+++ b/hosts/chelsty-infra/capabilities.yaml
@ -1,3 +1,6 @@
+host: chelsty-infra
+site: chelsty
+
 capabilities:
  hardware:
    cpu:
@ -8,33 +11,34 @@ capabilities:
      total_gb: 16
    acceleration:
      type: none
-  
+
  virtualization:
    supported: true
    type: kvm
-  
+
  storage:
    persistence: persistent
    type: ssd
    capacity_gb: 250
-  
+
  networking:
    reachability: tailscale-only
    ingress_suitability: false
    bandwidth: LTE
-  
+
  runtime:
    container_engine: docker
    os: debian
-  
+
  operational:
    power_constraint: low-power
    connectivity: intermittent
    availability_target: best-effort
-  
+    offline_operation_required: true
+
  deployment:
    suitability:
      - staging
-      - homeassistant
+      - infra
      - edge
    restricted: false
--- a/hosts/chelsty-infra/host.yaml
+++ b/hosts/chelsty-infra/host.yaml
@ -1,9 +1,10 @@
-hostname: chelsty
+hostname: chelsty-infra
+site: chelsty

 roles:
  - edge
  - hypervisor
-  - homeassistant
+  - infra
  - staging

 network:
--- a/hosts/chelsty-infra/networking.yaml
+++ b/hosts/chelsty-infra/networking.yaml
@ -1,4 +1,4 @@
-host: chelsty
+host: chelsty-infra

 uplink:
  type: lte
@ -20,7 +20,7 @@ exposure_classes:

 networks:
  home_automation_lan:
-    purpose: Home Assistant, MQTT, Zigbee coordinator, and local device control.
+    purpose: MQTT broker, Zigbee coordinator, and local device control.
    offline_required: true
    internet_required_for_core_operation: false

--- a/hosts/chelsty-infra/paths.yaml
+++ b/hosts/chelsty-infra/paths.yaml
@ -1,4 +1,4 @@
-host: chelsty
+host: chelsty-infra

 runtime_root: /opt/homelab

@ -9,12 +9,6 @@ conventions:
  logs: /opt/homelab/logs

 services:
-  homeassistant:
-    data: /opt/homelab/data/homeassistant
-    config: /opt/homelab/config/homeassistant
-    logs: /opt/homelab/logs/homeassistant
-    backup_priority: critical
-
  zigbee2mqtt:
    data: /opt/homelab/data/zigbee2mqtt
    config: /opt/homelab/config/zigbee2mqtt
@ -27,13 +21,13 @@ services:
    logs: /opt/homelab/logs/mosquitto
    backup_priority: high

-backup_sets:
-  homeassistant:
-    include:
-      - /opt/homelab/config/homeassistant
-      - /opt/homelab/data/homeassistant
-    restore_note: Restore before starting the Home Assistant container.
+  stability-agent:
+    data: /opt/homelab/state
+    config: /opt/homelab/config/stability-agent
+    logs: /opt/homelab/events
+    backup_priority: low

+backup_sets:
  zigbee2mqtt:
    include:
      - /opt/homelab/config/zigbee2mqtt
--- a/hosts/chelsty-infra/runtime/frigate/config.yml
+++ b/hosts/chelsty-infra/runtime/frigate/config.yml
@ -0,0 +1,88 @@
+# Frigate NVR — chelsty-infra
+# Hardware decode: Intel UHD 630 via VAAPI (/dev/dri/renderD128)
+# Object detection: CPU (no Coral TPU)
+# Cameras: 2x Reolink RLC-540 (5MP, WiFi)
+#
+# Required env vars in /opt/homelab/config/frigate/frigate.env (FRIGATE_ prefix is mandatory):
+#   FRIGATE_CAMERA1_IP, FRIGATE_CAMERA1_USER, FRIGATE_CAMERA1_PASS
+#   FRIGATE_CAMERA2_IP, FRIGATE_CAMERA2_USER, FRIGATE_CAMERA2_PASS
+#   FRIGATE_MQTT_USER, FRIGATE_MQTT_PASS  (if mosquitto auth is enabled)
+
+mqtt:
+  enabled: true
+  host: 127.0.0.1
+  port: 1883
+  # user: "{FRIGATE_MQTT_USER}"
+  # password: "{FRIGATE_MQTT_PASS}"
+
+detectors:
+  cpu1:
+    type: cpu
+    num_threads: 3
+
+ffmpeg:
+  hwaccel_args: preset-vaapi
+  global_args:
+    - -hide_banner
+    - -loglevel
+    - warning
+
+record:
+  enabled: true
+  retain:
+    days: 7
+    mode: all
+  events:
+    retain:
+      default: 14
+      mode: motion
+
+snapshots:
+  enabled: true
+  retain:
+    default: 7
+  quality: 70
+
+objects:
+  track:
+    - person
+    - car
+    - bicycle
+  filters:
+    person:
+      min_area: 5000
+      max_area: 100000
+      threshold: 0.7
+
+cameras:
+  camera1:
+    ffmpeg:
+      inputs:
+        # Main stream: high-res recording
+        - path: rtsp://{FRIGATE_CAMERA1_USER}:{FRIGATE_CAMERA1_PASS}@{FRIGATE_CAMERA1_IP}:554/h264Preview_01_main
+          roles:
+            - record
+        # Sub stream: low-res detection (lower CPU cost)
+        - path: rtsp://{FRIGATE_CAMERA1_USER}:{FRIGATE_CAMERA1_PASS}@{FRIGATE_CAMERA1_IP}:554/h264Preview_01_sub
+          roles:
+            - detect
+    detect:
+      enabled: true
+      width: 640
+      height: 480
+      fps: 5
+
+  camera2:
+    ffmpeg:
+      inputs:
+        - path: rtsp://{FRIGATE_CAMERA2_USER}:{FRIGATE_CAMERA2_PASS}@{FRIGATE_CAMERA2_IP}:554/h264Preview_01_main
+          roles:
+            - record
+        - path: rtsp://{FRIGATE_CAMERA2_USER}:{FRIGATE_CAMERA2_PASS}@{FRIGATE_CAMERA2_IP}:554/h264Preview_01_sub
+          roles:
+            - detect
+    detect:
+      enabled: true
+      width: 640
+      height: 480
+      fps: 5
--- a/hosts/chelsty-infra/runtime/frigate/docker-compose.yml
+++ b/hosts/chelsty-infra/runtime/frigate/docker-compose.yml
@ -0,0 +1,25 @@
+services:
+  frigate:
+    container_name: frigate
+    image: ghcr.io/blakeblackshear/frigate:stable
+    restart: unless-stopped
+    privileged: true
+    shm_size: "256mb"
+    network_mode: host
+    devices:
+      - /dev/dri/renderD128:/dev/dri/renderD128
+    volumes:
+      - /etc/localtime:/etc/localtime:ro
+      - /opt/homelab/config/frigate/config.yml:/config/config.yml
+      - /opt/homelab/config/frigate:/config/credentials:ro
+      - /opt/homelab/data/frigate:/media/frigate
+    tmpfs:
+      - /tmp/cache
+    env_file:
+      - /opt/homelab/config/frigate/frigate.env
+    healthcheck:
+      test: ["CMD-SHELL", "wget -q --spider http://localhost:5000/api/version 2>&1 || exit 1"]
+      interval: 30s
+      timeout: 10s
+      retries: 3
+      start_period: 60s
--- a/hosts/chelsty-infra/runtime/mosquitto/docker-compose.override.yml
+++ b/hosts/chelsty-infra/runtime/mosquitto/docker-compose.override.yml
--- a/hosts/chelsty-infra/runtime/mosquitto/mosquitto.conf
+++ b/hosts/chelsty-infra/runtime/mosquitto/mosquitto.conf
--- a/hosts/chelsty-infra/runtime/node-agent/docker-compose.override.yml
+++ b/hosts/chelsty-infra/runtime/node-agent/docker-compose.override.yml
@ -0,0 +1,11 @@
+services:
+  node-agent:
+    environment:
+      - NODE_NAME=chelsty-infra
+      - NODE_TYPE=lte_node
+      - VPS_EVENTS_HOST=100.95.58.48
+      - VPS_EVENTS_USER=oskar
+      - VPS_EVENTS_PATH=/opt/homelab/events
+      - CHECK_INTERVAL=60
+    volumes:
+      - /home/oskar/.ssh:/root/.ssh:ro
--- a/hosts/chelsty-infra/runtime/stability-agent/docker-compose.override.yml
+++ b/hosts/chelsty-infra/runtime/stability-agent/docker-compose.override.yml
@ -0,0 +1,12 @@
+services:
+  stability-agent:
+    environment:
+      - NODE_NAME=chelsty-infra
+      - SITE_NAME=chelsty
+      - REDIS_HOST=100.108.208.3
+      - REDIS_PORT=6379
+      - REDIS_ENABLED=true
+      - STABILITY_CHECK_INTERVAL=60
+      - DISK_THRESHOLD_PCT=85
+      - MQTT_HOST=mosquitto
+      - MQTT_PORT=1883
--- a/hosts/chelsty-infra/runtime/zigbee2mqtt/configuration.yaml
+++ b/hosts/chelsty-infra/runtime/zigbee2mqtt/configuration.yaml
--- a/hosts/chelsty-infra/runtime/zigbee2mqtt/docker-compose.override.yml
+++ b/hosts/chelsty-infra/runtime/zigbee2mqtt/docker-compose.override.yml
@ -0,0 +1,21 @@
+services:
+  zigbee2mqtt:
+    # mosquitto runs with network_mode: host on chelsty-infra.
+    # extra_hosts maps the 'mosquitto' hostname to the host gateway IP so that
+    # mqtt://mosquitto:1883 in configuration.yaml reaches the host-networked
+    # mosquitto process. Requires Docker 20.10+ (present on chelsty-infra).
+    extra_hosts:
+      - "mosquitto:host-gateway"
+    environment:
+      - TZ=Europe/Warsaw
+    healthcheck:
+      test: ["CMD-SHELL", "wget -qO- http://localhost:8080 > /dev/null 2>&1 || exit 1"]
+      interval: 30s
+      timeout: 10s
+      retries: 3
+      start_period: 90s
+    # Note: volumes NOT overridden here.
+    # The base docker-compose.yml mounts /opt/homelab/data/zigbee2mqtt/data:/app/data
+    # (read-write). configuration.yaml must be placed in that directory on the node:
+    #   /opt/homelab/data/zigbee2mqtt/data/configuration.yaml
+    # z2m rewrites this file during migrations — read-only mount is not viable.
--- a/hosts/chelsty-infra/services.yaml
+++ b/hosts/chelsty-infra/services.yaml
@ -0,0 +1,37 @@
+host: chelsty-infra
+site: chelsty
+
+services:
+  ha-diag-agent:
+    role: ha-diagnostic-agent
+    deployment_model: docker-compose
+    exposure: local-only
+    offline_required: false
+    depends_on:
+      local: []
+      external: [homeassistant]
+    config:
+      target_url: http://100.70.180.90:8123  # chelsty-ha via Tailscale (HAOS, separate VM)
+      location_tag: "chelsty"
+      events_dir: /opt/homelab/events/chelsty-infra
+    runtime:
+      config_path: /opt/homelab/config/ha-diag-agent
+      data_path: /var/lib/ha-diag-agent
+
+  node-agent:
+    role: node-stability-monitor
+    # LTE node: node-agent monitors and emits events but does NO Docker cleanup.
+    # Disk pressure on chelsty-infra is typically Frigate recordings; Frigate's
+    # own retain policy is the correct remediation, not docker prune.
+    deployment_model: docker-compose
+    exposure: local-only
+    offline_required: true
+
+  mosquitto:
+    role: local-mqtt-broker
+
+  zigbee2mqtt:
+    role: zigbee-mqtt-bridge
+
+  frigate:
+    role: nvr
--- a/hosts/chelsty/runtime/zigbee2mqtt/docker-compose.override.yml
+++ b/hosts/chelsty/runtime/zigbee2mqtt/docker-compose.override.yml
@ -1,13 +0,0 @@
-services:
-  zigbee2mqtt:
-    volumes:
-      - ./configuration.yaml:/app/data/configuration.yaml:ro
-    environment:
-      - MQTT_USER=${MQTT_USER}
-      - MQTT_PASSWORD=${MQTT_PASSWORD}
-    # Healthcheck is already defined in base service, but we ensure compatibility
-    healthcheck:
-      test: ["CMD", "curl", "-f", "http://localhost:8080"]
-      interval: 10s
-      timeout: 5s
-      retries: 3
--- a/hosts/chelsty/services.yaml
+++ b/hosts/chelsty/services.yaml
@ -1,108 +0,0 @@
-host: chelsty
-
-exposure_classes:
-  local-only:
-    description: Reachable only from CHELSTY-local networks or container networks.
-    public_ingress: false
-    tailscale_required: false
-  tailscale-internal:
-    description: Reachable through the Tailscale mesh by approved tailnet clients.
-    public_ingress: false
-    tailscale_required: true
-  public:
-    description: Reachable from the public internet through an explicit ingress path.
-    public_ingress: true
-    tailscale_required: false
-
-operational_constraints:
-  uplink: lte
-  connectivity: intermittent
-  offline_operation_required: true
-  must_not_depend_on:
-    - saturn
-    - vps
-    - forgejo
-
-services:
-  homeassistant:
-    role: home-automation-controller
-    deployment_model: docker-compose
-    exposure: tailscale-internal
-    offline_required: true
-    depends_on:
-      local:
-        - mosquitto
-        - zigbee2mqtt
-      external: []
-    ports:
-      - name: http
-        container_port: 8123
-        protocol: tcp
-    runtime:
-      config_path: /opt/homelab/config/homeassistant
-      data_path: /opt/homelab/data/homeassistant
-      logs_path: /opt/homelab/logs/homeassistant
-    backup:
-      recommended: true
-      include:
-        - /opt/homelab/config/homeassistant
-        - /opt/homelab/data/homeassistant
-      notes:
-        - Back up before Home Assistant core, supervisor-equivalent, or integration upgrades.
-        - Keep local restore copies on CHELSTY because LTE connectivity may be unavailable during recovery.
-
-  zigbee2mqtt:
-    role: zigbee-mqtt-bridge
-    deployment_model: docker-compose
-    exposure: local-only
-    offline_required: true
-    depends_on:
-      local:
-        - mosquitto
-      external:
-        - slzb-06u
-    coordinator:
-      name: slzb-06u
-      connection: network
-      usb_device: null
-    ports:
-      - name: frontend
-        container_port: 8080
-        protocol: tcp
-        exposure: tailscale-internal
-    runtime:
-      config_path: /opt/homelab/config/zigbee2mqtt
-      data_path: /opt/homelab/data/zigbee2mqtt
-      logs_path: /opt/homelab/logs/zigbee2mqtt
-    backup:
-      recommended: true
-      include:
-        - /opt/homelab/config/zigbee2mqtt
-        - /opt/homelab/data/zigbee2mqtt
-      notes:
-        - Include configuration.yaml, database.db, coordinator backup files, and network key material.
-        - Restore Zigbee2MQTT state together with the SLZB-06U coordinator state when replacing hardware.
-
-  mosquitto:
-    role: local-mqtt-broker
-    deployment_model: docker-compose
-    exposure: local-only
-    offline_required: true
-    depends_on:
-      local: []
-      external: []
-    ports:
-      - name: mqtt
-        container_port: 1883
-        protocol: tcp
-    runtime:
-      config_path: /opt/homelab/config/mosquitto
-      data_path: /opt/homelab/data/mosquitto
-      logs_path: /opt/homelab/logs/mosquitto
-    backup:
-      recommended: true
-      include:
-        - /opt/homelab/config/mosquitto
-        - /opt/homelab/data/mosquitto
-      notes:
-        - Retain ACL, password, persistence, and bridge configuration if enabled.
--- a/hosts/piha/runtime/agent-system/docker-compose.override.yml
+++ b/hosts/piha/runtime/agent-system/docker-compose.override.yml
@ -0,0 +1,8 @@
+services:
+  runtime-materializer:
+    environment:
+      # Pull world state from the VPS control-plane API instead of local Redis.
+      # The observer on VPS is the authoritative writer; mirroring its API output
+      # here ensures the webui /snapshot matches the clean 97-service state that
+      # the control-plane /summary endpoint serves.
+      CONTROL_PLANE_URL: "http://100.95.58.48:18180"
--- a/hosts/piha/runtime/brain-watchdog/docker-compose.override.yml
+++ b/hosts/piha/runtime/brain-watchdog/docker-compose.override.yml
@ -0,0 +1,4 @@
+services:
+  brain-watchdog:
+    mem_limit: 64m
+    restart: unless-stopped
--- a/hosts/piha/runtime/node-agent/docker-compose.override.yml
+++ b/hosts/piha/runtime/node-agent/docker-compose.override.yml
@ -0,0 +1,11 @@
+services:
+  node-agent:
+    environment:
+      - NODE_NAME=piha
+      - NODE_TYPE=sd_card
+      - VPS_EVENTS_HOST=100.95.58.48
+      - VPS_EVENTS_USER=oskar
+      - VPS_EVENTS_PATH=/opt/homelab/events
+      - CHECK_INTERVAL=60
+    volumes:
+      - /home/oskar/.ssh:/root/.ssh:ro
--- a/hosts/piha/runtime/stability-agent/docker-compose.override.yml
+++ b/hosts/piha/runtime/stability-agent/docker-compose.override.yml
@ -0,0 +1,7 @@
+services:
+  stability-agent:
+    environment:
+      - NODE_NAME=piha
+      - REDIS_HOST=100.108.208.3
+      - REDIS_PORT=6379
+      - REDIS_ENABLED=true
--- a/hosts/piha/services.yaml
+++ b/hosts/piha/services.yaml
@ -0,0 +1,42 @@
+host: piha
+
+services:
+  ha-diag-agent:
+    role: ha-diagnostic-agent
+    deployment_model: docker-compose
+    exposure: local-only
+    offline_required: false
+    depends_on:
+      local: []
+      external: [homeassistant]
+    config:
+      target_url: http://localhost:8123
+      location_tag: "ken"
+      events_dir: /opt/homelab/events/piha
+    runtime:
+      config_path: /opt/homelab/config/ha-diag-agent
+      data_path: /var/lib/ha-diag-agent
+
+  node-agent:
+    role: node-stability-monitor
+    deployment_model: docker-compose
+    exposure: local-only
+    offline_required: true
+    depends_on:
+      local: []
+      external: []
+    runtime:
+      config_path: /opt/homelab/config/node-agent
+      data_path: /opt/homelab/state
+      logs_path: /opt/homelab/events
+
+  brain-watchdog:
+    role: control-plane-watchdog
+    deployment_model: docker-compose
+    exposure: local-only
+    offline_required: false
+    depends_on:
+      local: []
+      external: [control-plane]
+    runtime:
+      config_path: /opt/homelab/config/brain-watchdog
--- a/hosts/solaria/runtime/node-agent/docker-compose.override.yml
+++ b/hosts/solaria/runtime/node-agent/docker-compose.override.yml
@ -0,0 +1,11 @@
+services:
+  node-agent:
+    environment:
+      - NODE_NAME=solaria
+      - NODE_TYPE=ai_node
+      - VPS_EVENTS_HOST=100.95.58.48
+      - VPS_EVENTS_USER=oskar
+      - VPS_EVENTS_PATH=/opt/homelab/events
+      - CHECK_INTERVAL=60
+    volumes:
+      - /home/oskar/.ssh:/root/.ssh:ro
--- a/hosts/solaria/runtime/stability-agent/docker-compose.override.yml
+++ b/hosts/solaria/runtime/stability-agent/docker-compose.override.yml
@ -0,0 +1,7 @@
+services:
+  stability-agent:
+    environment:
+      - NODE_NAME=solaria
+      - REDIS_HOST=100.108.208.3
+      - REDIS_PORT=6379
+      - REDIS_ENABLED=true
--- a/hosts/solaria/services.yaml
+++ b/hosts/solaria/services.yaml
@ -0,0 +1,15 @@
+host: solaria
+
+services:
+  node-agent:
+    role: node-stability-monitor
+    deployment_model: docker-compose
+    exposure: local-only
+    offline_required: true
+    depends_on:
+      local: []
+      external: []
+    runtime:
+      config_path: /opt/homelab/config/node-agent
+      data_path: /opt/homelab/state
+      logs_path: /opt/homelab/events
--- a/hosts/vps/runtime/control-plane/docker-compose.override.yml
+++ b/hosts/vps/runtime/control-plane/docker-compose.override.yml
@ -0,0 +1,39 @@
+# Control-plane production overrides for the VPS deployment.
+#
+# NODE_ALIAS_MAP translates the node names that appear in raw event files
+# (written by node agents / seed scripts) to the canonical names used in
+# inventory/topology.yaml and hosts/*/services.yaml.
+#
+# Current live mapping (from /opt/homelab/events/ inspection):
+#   node-2  →  chelsty-infra   (zigbee2mqtt / mosquitto node)
+#
+# Add further entries when new nodes come online and their event-source names
+# differ from their topology names.  Format is a single-line JSON object, e.g.:
+#   NODE_ALIAS_MAP='{"node-2":"chelsty-infra","node-3":"piha"}'
+#
+# The executor inherits the canonical name from the action JSON written by the
+# supervisor, so NODE_ALIAS_MAP is only required on the supervisor service.
+#
+# Memory limits: VPS has 4 GiB RAM, no swap. oom_score_adj -900 makes the host
+# kernel OOM-killer very unlikely to pick control-plane containers. mem_limit
+# provides a per-container cgroup ceiling so a leaking process is killed and
+# restarted by Docker before it can exhaust host memory.
+
+services:
+  operator-ui:
+    mem_limit: 192m
+    oom_score_adj: -900
+
+  observer:
+    mem_limit: 192m
+    oom_score_adj: -900
+
+  supervisor:
+    mem_limit: 400m
+    oom_score_adj: -900
+    environment:
+      - NODE_ALIAS_MAP={"node-2":"chelsty-infra"}
+
+  executor:
+    mem_limit: 64m
+    oom_score_adj: -900
--- a/hosts/vps/runtime/control-plane/env.example
+++ b/hosts/vps/runtime/control-plane/env.example
@ -0,0 +1,7 @@
+# Control Plane Environment Variables
+PORT=8080
+HOMELAB_STATE_ROOT=/opt/homelab/state
+HOMELAB_EVENTS_ROOT=/opt/homelab/events
+HOMELAB_WORLD_ROOT=/opt/homelab/world
+HOMELAB_ACTIONS_ROOT=/opt/homelab/actions
+HOMELAB_CONFIG_ROOT=/opt/homelab/config
--- a/hosts/vps/runtime/node-agent/docker-compose.override.yml
+++ b/hosts/vps/runtime/node-agent/docker-compose.override.yml
@ -0,0 +1,16 @@
+services:
+  node-agent:
+    environment:
+      - NODE_NAME=vps
+      - CHECK_INTERVAL=60
+    # host network mode: node-agent on VPS shares the host's network namespace
+    # so that localhost:18180 resolves to the control-plane's exposed port.
+    # Without this, localhost inside the container is the container's own loopback
+    # and the _check_control_plane_health() probe would always fail.
+    network_mode: host
+    # Hard memory ceiling: node-agent mounts /opt/homelab/events/ (page cache)
+    # and may accumulate Python RSS over hours; the 640m cap ensures it is killed
+    # and auto-restarted by Docker before consuming host memory. oom_score_adj -900
+    # makes the host kernel OOM-killer very unlikely to pick it as a global victim.
+    mem_limit: 640m
+    oom_score_adj: -900
--- a/hosts/vps/runtime/stability-agent/docker-compose.override.yml
+++ b/hosts/vps/runtime/stability-agent/docker-compose.override.yml
@ -0,0 +1,9 @@
+services:
+  stability-agent:
+    environment:
+      - NODE_NAME=vps
+      - REDIS_HOST=100.108.208.3
+      - REDIS_PORT=6379
+      - REDIS_ENABLED=true
+    mem_limit: 96m
+    oom_score_adj: -900
--- a/hosts/vps/services.txt
+++ b/hosts/vps/services.txt
@ -1 +0,0 @@
-npm
--- a/hosts/vps/services.yaml
+++ b/hosts/vps/services.yaml
@ -0,0 +1,43 @@
+host: vps
+
+services:
+  node-agent:
+    role: node-stability-monitor
+    deployment_model: docker-compose
+    exposure: local-only
+    offline_required: true
+    depends_on:
+      local: []
+      external: []
+    runtime:
+      config_path: /opt/homelab/config/node-agent
+      data_path: /opt/homelab/state
+      logs_path: /opt/homelab/events
+
+  control-plane:
+    role: management-and-orchestration
+    deployment_model: docker-compose
+    exposure: tailscale-internal
+    offline_required: false
+    depends_on:
+      local:
+        - node-agent
+      external:
+        - piha:redis
+    ports:
+      - name: http
+        container_port: 18180
+        protocol: tcp
+    runtime:
+      config_path: /opt/homelab/config/control-plane
+      data_path: /opt/homelab/data/control-plane
+      logs_path: /opt/homelab/logs/control-plane
+
+  node_exporter:
+    role: metrics-exporter
+    deployment_model: docker-compose
+    exposure: local-only
+    offline_required: true
+    depends_on:
+      local: []
+      external: []
--- a/inventory/topology.yaml
+++ b/inventory/topology.yaml
@ -17,6 +17,10 @@ nodes:
    roles:
      - infra
      - monitoring
+    services:
+      - node-agent
+      - ha-diag-agent
+      - brain-watchdog

  solaria:
    roles:
@ -27,12 +31,25 @@ nodes:
    roles:
      - edge
      - ingress
+      - control-plane
+    services:
+      # Repo-managed GitOps services (hosts/vps/services.yaml is authoritative)
+      - node-agent
+      - control-plane       # executor, observer, supervisor, operator-ui
+      - node_exporter
+      - stability-agent
+      - npm                 # Nginx Proxy Manager — public ingress, TLS termination
+      - outline             # Team wiki (outline + postgres + redis)
+      - joplin              # Note sync server (joplin-server + postgres)
+      - ai-cluster          # AI workers: codex-worker, openclaw, planner-worker,
+                            # service-ops-worker, redis, mosquitto

-  chelsty:
+  chelsty-infra:
+    site: chelsty
    roles:
      - remote
      - hypervisor
-      - homeassistant
+      - infra
      - staging
    connectivity:
      uplink: lte
@ -40,10 +57,22 @@ nodes:
    home_automation:
      offline_operation_required: true
      services:
-        - homeassistant
        - zigbee2mqtt
        - mosquitto
      coordinator:
        model: SLZB-06U
        connection: network
        usb: false
+
+  chelsty-ha:
+    site: chelsty
+    roles:
+      - remote
+      - homeassistant
+    connectivity:
+      uplink: lte
+      intermittent: true
+    home_automation:
+      offline_operation_required: true
+      services:
+        - homeassistant
--- a/scripts/bootstrap/vps-control-plane.sh
+++ b/scripts/bootstrap/vps-control-plane.sh
@ -0,0 +1,75 @@
+#!/usr/bin/env bash
+# vps-control-plane.sh - Bootstrap script for VPS control plane
+
+set -e
+
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+REPO_ROOT="$(cd "$SCRIPT_DIR/../.." && pwd)"
+RUNTIME_DIR="/opt/homelab"
+VPS_CONFIG="$REPO_ROOT/hosts/vps/runtime"
+
+# Colors for output
+RED='\033[0;31m'
+GREEN='\033[0;32m'
+YELLOW='\033[1;33m'
+NC='\033[0m' # No Color
+
+log() { echo -e "${GREEN}[INFO]${NC} $1"; }
+warn() { echo -e "${YELLOW}[WARN]${NC} $1"; }
+error() { echo -e "${RED}[ERROR]${NC} $1"; exit 1; }
+
+log "Starting VPS control plane bootstrap..."
+
+# 1. Validate Docker availability
+if ! command -v docker &> /dev/null; then
+    error "Docker is not installed. Please install Docker first."
+fi
+
+# 2. Validate compose plugin
+if ! docker compose version &> /dev/null; then
+    error "Docker Compose plugin is not installed."
+fi
+
+log "Docker and Compose plugin verified."
+
+# 3. Create filesystem-first runtime structure
+log "Creating filesystem-first runtime structure in $RUNTIME_DIR..."
+sudo mkdir -p "$RUNTIME_DIR/events" \
+             "$RUNTIME_DIR/state" \
+             "$RUNTIME_DIR/world" \
+             "$RUNTIME_DIR/actions/pending" \
+             "$RUNTIME_DIR/actions/approved" \
+             "$RUNTIME_DIR/actions/running" \
+             "$RUNTIME_DIR/actions/completed" \
+             "$RUNTIME_DIR/actions/failed" \
+             "$RUNTIME_DIR/actions/rejected" \
+             "$RUNTIME_DIR/actions/cancelled" \
+             "$RUNTIME_DIR/config" \
+             "$RUNTIME_DIR/logs"
+
+# 4. Set permissions
+log "Setting permissions..."
+sudo chown -R "$USER:$USER" "$RUNTIME_DIR"
+chmod -R 755 "$RUNTIME_DIR"
+
+# 5. Install environment file
+log "Installing environment configuration..."
+if [ ! -f "$RUNTIME_DIR/config/control-plane.env" ]; then
+    cp "$VPS_CONFIG/control-plane/env.example" "$RUNTIME_DIR/config/control-plane.env"
+    log "Created $RUNTIME_DIR/config/control-plane.env from template."
+else
+    warn "Environment file already exists, skipping installation."
+fi
+
+# 6. Build and start the control plane
+log "Building and starting control plane services..."
+cd "$REPO_ROOT/services/control-plane"
+docker compose build
+docker compose up -d
+
+log "VPS control plane bootstrap complete!"
+
+echo -e "\n${YELLOW}Verification commands:${NC}"
+echo "1. Check container status: docker compose ps"
+echo "2. Check operator UI: curl http://localhost:18180/summary"
+echo "3. Validate world state: ls -l $RUNTIME_DIR/world"
+echo "4. Monitor events: tail -f $RUNTIME_DIR/events/*/*/*.json"
--- a/scripts/deploy/deploy-control-plane.sh
+++ b/scripts/deploy/deploy-control-plane.sh
@ -0,0 +1,23 @@
+#!/bin/bash
+# scripts/deploy/deploy-control-plane.sh
+set -e
+
+VPS_IP="100.95.58.48"
+REMOTE_USER="oskar"  # not USER: avoids clobbering the shell's own $USER
+REMOTE_REPO_PATH="/home/oskar/homelab-codex-ws"
+
+MODE=$1
+
+case "$MODE" in
+    "--ssh")
+        echo "Deploying to VPS ($VPS_IP) via SSH..."
+        ssh -t "$REMOTE_USER@$VPS_IP" "cd $REMOTE_REPO_PATH && git pull origin master && cd services/control-plane && bash deploy-local.sh"
+        ;;
+    "--print")
+        echo "ssh -t $REMOTE_USER@$VPS_IP \"cd $REMOTE_REPO_PATH && git pull origin master && cd services/control-plane && bash deploy-local.sh\""
+        ;;
+    *)
+        echo "Usage: $0 [--ssh|--print]"
+        exit 1
+        ;;
+esac
--- a/scripts/deploy/deploy-frigate.sh
+++ b/scripts/deploy/deploy-frigate.sh
@ -0,0 +1,26 @@
+#!/usr/bin/env bash
+# deploy-frigate.sh - Deploy Frigate NVR on chelsty-infra (print or SSH)
+
+MODE="print"
+[[ "$1" == "--ssh" ]] && MODE="ssh"
+
+TARGET="100.122.201.22"
+NODE="chelsty-infra"
+REPO_PATH="/home/oskar/homelab-codex-ws"
+SERVICE_PATH="$REPO_PATH/hosts/chelsty-infra/runtime/frigate"
+
+echo "HOST: $NODE"
+echo "MODE: $MODE"
+echo "TARGET: $TARGET"
+
+# Secrets must exist at /opt/homelab/config/frigate/frigate.env on the node
+# before first deploy. See config.yml for required variables.
+DEPLOY_CMD="cd $REPO_PATH && git fetch origin && git checkout master && git pull origin master && cd $SERVICE_PATH && docker compose pull && docker compose up -d"
+
+if [[ "$MODE" == "ssh" ]]; then
+    echo "--- Deploying Frigate to $NODE ($TARGET) via SSH ---"
+    ssh oskar@$TARGET "$DEPLOY_CMD"
+else
+    echo "# --- Deployment commands for $NODE ---"
+    echo "ssh oskar@$TARGET '$DEPLOY_CMD'"
+fi
--- a/scripts/deploy/deploy-node.sh
+++ b/scripts/deploy/deploy-node.sh
@ -8,6 +8,7 @@ set -e
 REPO_PATH="${HOME}/homelab-codex-ws"
 RUNTIME_PATH="/opt/homelab"
 HOSTNAME=$(hostname | tr '[:lower:]' '[:upper:]')
+HOST_DIR="${REPO_PATH}/hosts/$(hostname | tr '[:upper:]' '[:lower:]')"

 echo "--- Starting Deployment on ${HOSTNAME} ---"

@ -22,37 +23,47 @@ echo "Pulling latest changes..."
 git pull

 # 2. Identify Services
-# Based on our convention, we look for services assigned to this host
-# For now, we'll check if a 'services.txt' exists in the host folder
-SERVICE_LIST="${REPO_PATH}/hosts/$(hostname | tr '[:upper:]' '[:lower:]')/services.txt"
+SERVICES=()
+if [ -f "${HOST_DIR}/services.txt" ]; then
+    mapfile -t SERVICES < <(grep -v '^\s*#' "${HOST_DIR}/services.txt" | grep -v '^\s*$')
+elif [ -f "${HOST_DIR}/services.yaml" ]; then
+    SERVICES=($(python3 -c "
+import yaml, sys
+try:
+    with open('${HOST_DIR}/services.yaml', 'r') as f:
+        data = yaml.safe_load(f)
+        if data and 'services' in data:
+            if isinstance(data['services'], dict):
+                print(' '.join(data['services'].keys()))
+            elif isinstance(data['services'], list):
+                print(' '.join(data['services']))
+except Exception as e:
+    print(f'Error parsing YAML: {e}', file=sys.stderr)
+    sys.exit(1)
+"))
+fi

-if [ ! -f "$SERVICE_LIST" ]; then
-    echo "No services.txt found for ${HOSTNAME}. Skipping service deployment."
+if [ ${#SERVICES[@]} -eq 0 ]; then
+    echo "No services found for ${HOSTNAME}. Skipping service deployment."
    exit 0
 fi

 # 3. Deploy Services
-while IFS= read -r service || [ -n "$service" ]; do
-    [[ "$service" =~ ^#.*$ ]] && continue # Skip comments
-    [[ -z "$service" ]] && continue      # Skip empty lines
-
+for service in "${SERVICES[@]}"; do
    echo "Deploying service: ${service}..."
-    
+
    COMPOSE_FILE="${REPO_PATH}/services/${service}/docker-compose.yml"
-    
+
    if [ ! -f "$COMPOSE_FILE" ]; then
        echo "Warning: Compose file not found for ${service} at ${COMPOSE_FILE}"
        continue
    fi

-    # Target directory in runtime
    TARGET_DIR="${RUNTIME_PATH}/services/${service}"
    mkdir -p "$TARGET_DIR"

-    # We use the compose file from the repo directly
-    # but we can also handle overrides here
-    OVERRIDE_FILE="${RUNTIME_PATH}/config/${service}/docker-compose.override.yml"
-    
+    OVERRIDE_FILE="${HOST_DIR}/runtime/${service}/docker-compose.override.yml"
+
    COMPOSE_CMD="docker compose -f ${COMPOSE_FILE}"
    if [ -f "$OVERRIDE_FILE" ]; then
        echo "Using override file for ${service}"
@ -60,7 +71,6 @@ while IFS= read -r service || [ -n "$service" ]; do
    fi

    $COMPOSE_CMD up -d --remove-orphans
-
-done < "$SERVICE_LIST"
+done

 echo "--- Deployment Complete ---"
--- a/scripts/deploy/deploy-stability-agent.sh
+++ b/scripts/deploy/deploy-stability-agent.sh
@ -0,0 +1,55 @@
+#!/usr/bin/env bash
+# deploy-stability-agent.sh - Helper to deploy stability-agent (print or SSH)
+
+NODE=$1
+MODE="print"
+[[ "$2" == "--ssh" ]] && MODE="ssh"
+
+if [[ -z "$NODE" ]]; then
+    echo "Usage: $0 <node-name> [--ssh]"
+    echo "Supported nodes: chelsty, piha, solaria, vps"
+    exit 1
+fi
+
+case "$NODE" in
+    piha)    TARGET="100.108.208.3" ;;
+    chelsty) TARGET="100.122.201.22" ;;
+    vps)     TARGET="100.95.58.48" ;;
+    solaria) TARGET="local" ;;
+    *)
+        echo "Error: Unknown node '$NODE'"
+        echo "Supported nodes: chelsty, piha, solaria, vps"
+        exit 1
+        ;;
+esac
+
+echo "HOST: $NODE"
+echo "MODE: $MODE"
+echo "TARGET: $TARGET"
+
+REPO_PATH="/home/oskar/homelab-codex-ws"
+
+if [[ "$NODE" == "solaria" ]]; then
+    if [[ "$MODE" == "ssh" ]]; then
+        echo "--- Running local deployment for solaria ---"
+        cd "$REPO_PATH" && git fetch origin && git checkout master && git pull origin master && cd services/stability-agent && ./deploy-local.sh solaria
+    else
+        echo "# --- Deployment commands for solaria ---"
+        echo "cd $REPO_PATH"
+        echo "git fetch origin"
+        echo "git checkout master"
+        echo "git pull origin master"
+        echo "cd services/stability-agent"
+        echo "./deploy-local.sh solaria"
+    fi
+else
+    # Remote nodes
+    SSH_CMD="ssh oskar@$TARGET 'cd $REPO_PATH && git fetch origin && git checkout master && git pull origin master && cd services/stability-agent && ./deploy-local.sh $NODE'"
+    if [[ "$MODE" == "ssh" ]]; then
+        echo "--- Deploying to $NODE ($TARGET) via SSH ---"
+        eval "$SSH_CMD"
+    else
+        echo "# --- Deployment commands for $NODE ---"
+        echo "$SSH_CMD"
+    fi
+fi
--- a/scripts/deploy/deploy.sh
+++ b/scripts/deploy/deploy.sh
@ -1,270 +1,321 @@
 #!/usr/bin/env bash
-# deploy.sh - Staged deployment framework for homelab nodes.
+# scripts/deploy/deploy.sh — Saturn-side deploy dispatcher
+# Usage: deploy.sh <target> [--dry-run] [--no-gate]
+#   target ∈ {control-plane, vps, piha, solaria, chelsty-infra}
+# Exit codes: 0=ok  1=preflight  2=gate  3=execute  4=verify  5=handoff(sudo)

-set -o pipefail
+set -uo pipefail

-# --- Configuration ---
-export RUNTIME_PATH="/opt/homelab"
-export STATE_DIR="${RUNTIME_PATH}/state/deploy"
-export LOG_DIR="${RUNTIME_PATH}/logs/deploy"
-export REPO_PATH="${HOME}/homelab-codex-ws"
-export TIMESTAMP=$(date +%Y%m%d_%H%M%S)
-export LOG_FILE="${LOG_DIR}/deploy_${TIMESTAMP}.log"
+REPO_ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd)"
+SSH_USER="${SSH_USER:-oskar}"
+START_TIME=$(date +%s)
+TARGET=""
+DRY_RUN=false
+NO_GATE=false

-# --- Initialization ---
-mkdir -p "$STATE_DIR" "$LOG_DIR"
+usage() {
+    cat >&2 <<'EOF'
+Usage: deploy.sh <target> [--dry-run] [--no-gate]

-# Redirection for logging
-exec > >(tee -a "$LOG_FILE") 2>&1
+Targets:
+  control-plane   observer/supervisor/executor/operator-ui on VPS
+  vps             all VPS GitOps services
+  piha            PIHA services
+  solaria         SOLARIA compute services
+  chelsty-infra   CHELSTY edge node (LTE, longer SSH timeout)

-# --- Load Libraries ---
-LIB_PATH="${REPO_PATH}/scripts/lib"
-source "${LIB_PATH}/log.sh"
-source "${LIB_PATH}/state.sh"
-source "${LIB_PATH}/inventory.sh"
-source "${LIB_PATH}/compose.sh"
-source "${LIB_PATH}/diagnostics.sh"
+Flags:
+  --dry-run   run preflight + gate only; stop before deploy
+  --no-gate   skip pytest + docker build (emergency only; logged as WARNING)

-# --- CLI Parsing ---
-TARGET_HOST=$(hostname)
-TARGET_SERVICE=""
-RESUME=false
-REQUESTED_STAGE=""
+Exit codes: 0=ok  1=preflight  2=gate  3=execute  4=verify  5=handoff(sudo)
+EOF
+    exit 1
+}

 while [[ $# -gt 0 ]]; do
    case $1 in
-        --host)
-            TARGET_HOST="$2"
-            shift 2
-            ;;
-        --service)
-            TARGET_SERVICE="$2"
-            shift 2
-            ;;
-        --resume)
-            RESUME=true
-            shift
-            ;;
-        --stage)
-            REQUESTED_STAGE="$2"
-            shift 2
-            ;;
+        control-plane|vps|piha|solaria|chelsty-infra)
+            TARGET="$1"; shift ;;
+        --dry-run)
+            DRY_RUN=true; shift ;;
+        --no-gate)
+            NO_GATE=true; shift ;;
+        -h|--help)
+            usage ;;
        *)
-            if [[ "$1" =~ ^(prepare|validate|deploy|verify|diagnose|complete)$ ]]; then
-                REQUESTED_STAGE="$1"
-            fi
-            shift
-            ;;
+            echo "Unknown argument: $1" >&2
+            usage ;;
    esac
 done

-# --- Stages ---
+[[ -z "$TARGET" ]] && { echo "Error: target is required." >&2; usage; }

-stage_prepare() {
-    local host=$1
-    if is_stage_complete "prepare" && [[ "$RESUME" == "true" ]]; then
-        log "INFO" "Skipping PREPARE (already complete)"
+case "$TARGET" in
+    control-plane) SSH_HOST="vps" ;;
+    *)             SSH_HOST="$TARGET" ;;
+esac
+
+case "$TARGET" in
+    chelsty-*) SSH_TIMEOUT=30 ;;
+    *)         SSH_TIMEOUT=5 ;;
+esac
+
+# ── PREFLIGHT ────────────────────────────────────────────────────────────────
+
+preflight() {
+    echo "=== PREFLIGHT ==="
+
+    local branch
+    branch=$(git -C "$REPO_ROOT" rev-parse --abbrev-ref HEAD)
+    if [[ "$branch" != "master" ]]; then
+        echo "ERROR: On branch '${branch}', not master. Switch to master and push first." >&2
+        exit 1
+    fi
+    echo "[ok] branch: master"
+
+    if ! git -C "$REPO_ROOT" diff --quiet; then
+        echo "ERROR: Unstaged changes in working tree. Commit or stash before deploying." >&2
+        exit 1
+    fi
+    if ! git -C "$REPO_ROOT" diff --cached --quiet; then
+        echo "ERROR: Staged but uncommitted changes. Commit before deploying." >&2
+        exit 1
+    fi
+    echo "[ok] working tree clean"
+
+    git -C "$REPO_ROOT" fetch origin master --quiet
+    local unpushed
+    unpushed=$(git -C "$REPO_ROOT" log origin/master..HEAD --oneline)
+    if [[ -n "$unpushed" ]]; then
+        echo "ERROR: Unpushed commits on master:" >&2
+        echo "$unpushed" >&2
+        echo "Push first:  git push origin master" >&2
+        exit 1
+    fi
+    echo "[ok] no unpushed commits"
+
+    echo "Checking SSH: ${SSH_USER}@${SSH_HOST} (ConnectTimeout=${SSH_TIMEOUT}s)..."
+    if ! ssh -o "ConnectTimeout=${SSH_TIMEOUT}" -o BatchMode=yes \
+            "${SSH_USER}@${SSH_HOST}" true 2>/dev/null; then
+        echo "ERROR: Cannot reach ${SSH_HOST} via SSH (timeout ${SSH_TIMEOUT}s)." >&2
+        exit 1
+    fi
+    echo "[ok] ${SSH_HOST} reachable"
+}
+
+# ── GATE ─────────────────────────────────────────────────────────────────────
+
+gate() {
+    if [[ "$NO_GATE" == "true" ]]; then
+        echo "=== GATE: SKIPPED ==="
+        echo "WARNING: --no-gate active — pytest + docker build bypassed (emergency mode)." >&2
        return 0
    fi

-    log "INFO" "Stage: PREPARE ($host)"
-    set_stage "prepare"
-    
-    emit_event "deployment_started" "info" "deploy.sh" "all" "${TIMESTAMP}" "{\"stage\": \"prepare\"}"
+    echo "=== GATE ==="

-    cd "$REPO_PATH" || exit 1
-    log "INFO" "Pulling latest changes..."
-    if ! git pull; then
-        log "WARN" "Git pull failed, proceeding with local state (offline mode or network flap)"
-    fi
+    local services=()

-    # Ensure runtime directories exist
-    mkdir -p "${RUNTIME_PATH}/config" "${RUNTIME_PATH}/data" "${RUNTIME_PATH}/state" "${RUNTIME_PATH}/logs"
-
-    struct_log "prepare" "$host" "all" "success" "repo_updated"
-    mark_stage_complete "prepare"
-}
-
-stage_validate() {
-    local host=$1
-    if is_stage_complete "validate" && [[ "$RESUME" == "true" ]]; then
-        log "INFO" "Skipping VALIDATE (already complete)"
-        return 0
-    fi
-
-    log "INFO" "Stage: VALIDATE ($host)"
-    set_stage "validate"
-
-    for service in "${SERVICES[@]}"; do
-        log "INFO" "Validating $service..."
-        if [[ ! -d "${REPO_PATH}/services/$service" ]]; then
-            log "ERROR" "Service definition not found: $service"
-            struct_log "validate" "$host" "$service" "fail" "not_found"
-            return 1
-        fi
-    done
-
-    struct_log "validate" "$host" "all" "success" "validated"
-    mark_stage_complete "validate"
-}
-
-stage_deploy() {
-    local host=$1
-    if is_stage_complete "deploy" && [[ "$RESUME" == "true" ]]; then
-        log "INFO" "Skipping DEPLOY (already complete)"
-        return 0
-    fi
-
-    log "INFO" "Stage: DEPLOY ($host)"
-    set_stage "deploy"
-
-    local last_s=$(get_last_service)
-    local skip=false
-    if [[ "$RESUME" == "true" && -n "$last_s" ]]; then
-        skip=true
-    fi
-
-    for service in "${SERVICES[@]}"; do
-        if [[ "$skip" == "true" ]]; then
-            if [[ "$service" == "$last_s" ]]; then
-                skip=false
-                log "INFO" "Resuming from $service..."
-            else
-                log "INFO" "Skipping $service (already processed)"
-                continue
-            fi
-        fi
-
-        log "INFO" "Deploying $service..."
-        set_last_service "$service"
-
-        if ! run_compose_up "$service"; then
-            struct_log "deploy" "$host" "$service" "fail" "docker_compose_failed"
-            collect_diagnostics "$host" "$service"
-            return 1
-        fi
-
-        struct_log "deploy" "$host" "$service" "success" "deployed"
-    done
-    
-    set_last_service ""
-    mark_stage_complete "deploy"
-}
-
-stage_verify() {
-    local host=$1
-    if is_stage_complete "verify" && [[ "$RESUME" == "true" ]]; then
-        log "INFO" "Skipping VERIFY (already complete)"
-        return 0
-    fi
-
-    log "INFO" "Stage: VERIFY ($host)"
-    set_stage "verify"
-
-    for service in "${SERVICES[@]}"; do
-        log "INFO" "Verifying $service..."
-        local health_script="${REPO_PATH}/services/${service}/healthcheck.sh"
-        if [[ -f "$health_script" ]]; then
-            if ! bash "$health_script"; then
-                log "ERROR" "Healthcheck failed for $service"
-                struct_log "verify" "$host" "$service" "fail" "healthcheck_failed"
-                collect_diagnostics "$host" "$service"
-                return 1
-            fi
-        else
-            # Generic check if container is running
-            if ! docker ps --filter "name=$service" --filter "status=running" | grep -q "$service"; then
-                log "ERROR" "Container $service is not running"
-                struct_log "verify" "$host" "$service" "fail" "container_not_running"
-                collect_diagnostics "$host" "$service"
-                return 1
-            fi
-        fi
-        struct_log "verify" "$host" "$service" "success" "verified"
-    done
-    mark_stage_complete "verify"
-}
-
-stage_complete() {
-    local host=$1
-    log "INFO" "Stage: COMPLETE ($host)"
-    set_stage "complete"
-    struct_log "complete" "$host" "all" "success" "deployment_finished"
-    clear_deployment_state
-}
-
-# --- Execution Logic ---
-
-run_deployment() {
-    local start_stage=$1
-
-    # Sequential execution from start_stage
-    case "$start_stage" in
-        prepare)
-            stage_prepare "$TARGET_HOST" || return 1
-            ;&
-        validate)
-            stage_validate "$TARGET_HOST" || return 1
-            ;&
-        deploy)
-            stage_deploy "$TARGET_HOST" || return 1
-            ;&
-        verify)
-            stage_verify "$TARGET_HOST" || return 1
-            ;&
-        complete)
-            stage_complete "$TARGET_HOST" || return 1
-            ;;
-        *)
-            log "ERROR" "Invalid stage: $start_stage"
-            return 1
-            ;;
-    esac
-}
-
-# --- Main ---
-
-log "INFO" "--- Homelab Deployment Started (Host: $TARGET_HOST, Service: ${TARGET_SERVICE:-all}) ---"
-
-if ! load_inventory "$TARGET_HOST" "$TARGET_SERVICE"; then
-    log "ERROR" "Failed to load inventory"
-    exit 1
-fi
-
-EXIT_STATUS=0
-if [[ "$RESUME" == "true" ]]; then
-    CURRENT=$(get_stage)
-    log "INFO" "Resuming from state: $CURRENT"
-    case "$CURRENT" in
-        prepare|validate|deploy|verify)
-            run_deployment "$CURRENT" || EXIT_STATUS=1
-            ;;
-        complete|none)
-            log "INFO" "No interrupted deployment found. Starting from scratch..."
-            run_deployment "prepare" || EXIT_STATUS=1
-            ;;
-        *)
-            log "INFO" "Unknown state. Starting from prepare..."
-            run_deployment "prepare" || EXIT_STATUS=1
-            ;;
-    esac
-elif [[ -n "$REQUESTED_STAGE" ]]; then
-    if [[ "$REQUESTED_STAGE" == "diagnose" ]]; then
-        collect_diagnostics "$TARGET_HOST" "$TARGET_SERVICE"
+    if [[ "$TARGET" == "control-plane" ]]; then
+        services=("control-plane")
    else
-        run_deployment "$REQUESTED_STAGE" || EXIT_STATUS=1
+        local svc_yaml="${REPO_ROOT}/hosts/${TARGET}/services.yaml"
+        if [[ ! -f "$svc_yaml" ]]; then
+            echo "ERROR: ${svc_yaml} not found." >&2
+            exit 2
+        fi
+        local svc_list
+        svc_list=$(python3 -c "
+import yaml
+with open('${svc_yaml}') as f:
+    data = yaml.safe_load(f)
+svcs = (data or {}).get('services') or {}
+if isinstance(svcs, dict):
+    print('\n'.join(svcs.keys()))
+elif isinstance(svcs, list):
+    print('\n'.join(svcs))
+") || { echo "ERROR: could not parse ${svc_yaml}." >&2; exit 2; }
+        while IFS= read -r svc; do
+            [[ -z "$svc" ]] && continue
+            if [[ -f "${REPO_ROOT}/services/${svc}/Dockerfile" ]]; then
+                services+=("$svc")
+            fi
+        done <<< "$svc_list"
    fi
-else
-    # New deployment - clear previous state
-    clear_deployment_state
-    run_deployment "prepare" || EXIT_STATUS=1
+
+    if [[ ${#services[@]} -eq 0 ]]; then
+        echo "[info] No services with local Dockerfile found for ${TARGET} — gate trivially passes."
+        return 0
+    fi
+
+    echo "Services under gate: ${services[*]}"
+    local gate_failed=false
+
+    for svc in "${services[@]}"; do
+        local svc_dir="${REPO_ROOT}/services/${svc}"
+
+        if [[ -d "${svc_dir}/tests" ]]; then
+            echo "--- pytest: ${svc} ---"
+            if ! python3 -m pytest "${svc_dir}/tests" -q; then
+                echo "GATE FAIL: pytest failed for ${svc}" >&2
+                gate_failed=true
+            fi
+        fi
+
+        echo "--- docker build: ${svc} ---"
+        if ! docker build --quiet "${svc_dir}" >/dev/null; then
+            echo "GATE FAIL: docker build failed for ${svc}" >&2
+            gate_failed=true
+        fi
+    done
+
+    if [[ "$gate_failed" == "true" ]]; then
+        exit 2
+    fi
+    echo "[ok] gate passed"
+}
+
+# ── EXECUTE ──────────────────────────────────────────────────────────────────
+
+execute() {
+    echo "=== EXECUTE ==="
+
+    local cmd_output
+    local cmd_exit=0
+
+    if [[ "$TARGET" == "control-plane" ]]; then
+        echo "Running deploy-control-plane.sh --ssh..."
+        cmd_output=$("${REPO_ROOT}/scripts/deploy/deploy-control-plane.sh" --ssh 2>&1) \
+            || cmd_exit=$?
+    else
+        echo "SSHing to ${SSH_HOST}: git pull + deploy-node.sh..."
+        cmd_output=$(ssh -o "ConnectTimeout=${SSH_TIMEOUT}" -o BatchMode=yes \
+            "${SSH_USER}@${SSH_HOST}" \
+            'cd ~/homelab-codex-ws && git pull && ./scripts/deploy/deploy-node.sh' 2>&1) \
+            || cmd_exit=$?
+    fi
+
+    echo "$cmd_output"
+
+    if echo "$cmd_output" | grep -qF "[sudo] password"; then
+        echo "" >&2
+        echo "ERROR (exit 5): Deploy hit an interactive sudo prompt." >&2
+        echo "Run manually:" >&2
+        if [[ "$TARGET" == "control-plane" ]]; then
+            echo "  ssh -t ${SSH_USER}@${SSH_HOST} 'cd ~/homelab-codex-ws && git pull origin master && cd services/control-plane && bash deploy-local.sh'" >&2
+        else
+            echo "  ssh -t ${SSH_USER}@${SSH_HOST} 'cd ~/homelab-codex-ws && git pull && ./scripts/deploy/deploy-node.sh'" >&2
+        fi
+        exit 5
+    fi
+
+    if [[ $cmd_exit -ne 0 ]]; then
+        echo "ERROR: Deploy command exited ${cmd_exit}." >&2
+        exit 3
+    fi
+
+    echo "[ok] execute completed"
+}
+
+# ── VERIFY ───────────────────────────────────────────────────────────────────
+
+verify() {
+    echo "=== VERIFY ==="
+
+    local ps_output
+    local ps_exit=0
+    ps_output=$(ssh -o "ConnectTimeout=${SSH_TIMEOUT}" -o BatchMode=yes \
+        "${SSH_USER}@${SSH_HOST}" \
+        'docker ps --format "{{.Names}}\t{{.Status}}"' 2>&1) \
+        || ps_exit=$?
+
+    if [[ $ps_exit -ne 0 ]]; then
+        echo "ERROR: docker ps failed on ${SSH_HOST}:" >&2
+        echo "$ps_output" >&2
+        exit 4
+    fi
+
+    echo "$ps_output"
+
+    local failed=false
+
+    local not_up
+    not_up=$(echo "$ps_output" | grep -v '^$' | grep -v $'\tUp' || true)
+    if [[ -n "$not_up" ]]; then
+        echo "ERROR: Containers not in Up state:" >&2
+        echo "$not_up" >&2
+        failed=true
+    fi
+
+    local unhealthy
+    unhealthy=$(echo "$ps_output" | grep '(unhealthy)' || true)
+    if [[ -n "$unhealthy" ]]; then
+        echo "ERROR: Unhealthy containers:" >&2
+        echo "$unhealthy" >&2
+        failed=true
+    fi
+
+    if [[ "$TARGET" == "control-plane" ]]; then
+        for cp_svc in supervisor observer executor operator-ui; do
+            if ! echo "$ps_output" | grep -q "$cp_svc"; then
+                echo "ERROR: control-plane component absent from docker ps: ${cp_svc}" >&2
+                failed=true
+            fi
+        done
+    fi
+
+    if [[ "$failed" == "true" ]]; then
+        echo "" >&2
+        echo "Full docker ps output above." >&2
+        exit 4
+    fi
+
+    echo "[ok] all containers healthy"
+}
+
+# ── REPORT ───────────────────────────────────────────────────────────────────
+
+report() {
+    local mode="${1:-deploy}"
+    local end_time
+    end_time=$(date +%s)
+    local elapsed
+    elapsed=$(( end_time - START_TIME ))
+    local commit_hash
+    commit_hash=$(git -C "$REPO_ROOT" rev-parse --short HEAD)
+    local gate_s verify_s
+
+    if [[ "$NO_GATE" == "true" ]]; then
+        gate_s="skip"
+    else
+        gate_s="ok"
+    fi
+
+    if [[ "$mode" == "dry-run" ]]; then
+        verify_s="skip(dry-run)"
+    else
+        verify_s="green"
+    fi
+
+    echo ""
+    if [[ "$mode" == "dry-run" ]]; then
+        echo "DRY RUN OK | target=${TARGET} | commit=${commit_hash} | gate=${gate_s} | verify=${verify_s} | ${elapsed}s"
+    else
+        echo "DEPLOY OK  | target=${TARGET} | commit=${commit_hash} | gate=${gate_s} | verify=${verify_s} | ${elapsed}s"
+    fi
+}
+
+# ── MAIN ─────────────────────────────────────────────────────────────────────
+
+preflight
+gate
+
+if [[ "$DRY_RUN" == "true" ]]; then
+    report dry-run
+    exit 0
 fi

-if [[ $EXIT_STATUS -eq 0 ]]; then
-    print_summary "$TARGET_HOST" "SUCCESS"
-    log "INFO" "--- Homelab Deployment Finished Successfully ---"
-else
-    print_summary "$TARGET_HOST" "FAILED"
-    log "ERROR" "--- Homelab Deployment Failed ---"
-    exit 1
-fi
+execute
+verify
+report
--- a/scripts/deploy/orchestrate-deploy.sh
+++ b/scripts/deploy/orchestrate-deploy.sh
@ -1,15 +1,30 @@
 #!/usr/bin/env bash
 # orchestrate-deploy.sh - To be run on SATURN
-# Triggers deployment on remote execution nodes.
+# Triggers deployment on remote execution nodes via inventory.

 set -e

-HOSTS=("solaria" "piha" "vps")
-USER="oskar" # Default user
+REPO_PATH="${HOME}/homelab-codex-ws"
+USER="oskar"

-for HOST in "${HOSTS[@]}"; do
+while IFS=' ' read -r HOST TAG; do
    echo ">>> Triggering deployment on ${HOST}..."
-    ssh "${USER}@${HOST}" "bash ~/homelab-codex-ws/scripts/deploy/deploy-node.sh"
-done
+    if [[ "$TAG" == "lte" ]]; then
+        ssh -n -o ConnectTimeout=30 "${USER}@${HOST}" "bash ~/homelab-codex-ws/scripts/deploy/deploy-node.sh" || \
+            echo "WARNING: Deployment on ${HOST} failed or timed out (LTE/intermittent node, skipping)"
+    else
+        ssh -n "${USER}@${HOST}" "bash ~/homelab-codex-ws/scripts/deploy/deploy-node.sh"
+    fi
+done < <(python3 -c "
+import yaml, sys
+with open('${REPO_PATH}/inventory/topology.yaml') as f:
+    data = yaml.safe_load(f)
+skip = {'saturn', 'solaria'}
+for name, info in (data.get('nodes') or {}).items():
+    if name in skip:
+        continue
+    uplink = ((info or {}).get('connectivity') or {}).get('uplink', '')
+    print(name, 'lte' if uplink == 'lte' else 'standard')
+")

 echo ">>> All deployments triggered."
--- a/scripts/deploy/verify-agent-fleet.sh
+++ b/scripts/deploy/verify-agent-fleet.sh
@ -0,0 +1,68 @@
+#!/usr/bin/env bash
+# verify-agent-fleet.sh - Check the status of stability agents across the fleet
+
+REDIS_CMD="docker exec agent-system-redis redis-cli --raw"
+
+# Check if docker is available
+if ! command -v docker &> /dev/null; then
+    echo "Error: docker command not found."
+    exit 1
+fi
+
+# Check if container is running
+if ! docker ps --filter "name=agent-system-redis" --format "{{.Names}}" | grep -q "agent-system-redis"; then
+    echo "Error: agent-system-redis container not found or not running."
+    echo "This script must be run on PIHA (the node hosting the Redis container)."
+    exit 1
+fi
+
+REQUIRED_NODES=("piha" "chelsty" "solaria" "vps")
+MISSING_NODES=0
+
+echo "--- Homelab Agent Fleet Status ---"
+printf "%-10s %-15s %-10s %-10s %-30s\n" "NODE" "HOSTNAME" "HEALTH" "STATUS" "LAST_SEEN"
+printf "%s\n" "--------------------------------------------------------------------------------"
+
+for NODE in "${REQUIRED_NODES[@]}"; do
+    KEY="homelab:nodes:$NODE"
+    
+    # Check if key exists
+    EXISTS=$($REDIS_CMD EXISTS "$KEY" 2>/dev/null | tr -d '\r\n')
+
+    if [[ "$EXISTS" != "1" ]]; then
+        printf "%-10s %-15s %-10s %-10s %-30s\n" "$NODE" "MISSING" "N/A" "N/A" "N/A"
+        MISSING_NODES=$((MISSING_NODES + 1))
+        continue
+    fi
+
+    HOSTNAME=$($REDIS_CMD HGET "$KEY" hostname 2>/dev/null | tr -d '\r\n')
+    HEALTH=$($REDIS_CMD HGET "$KEY" health 2>/dev/null | tr -d '\r\n')
+    STATUS=$($REDIS_CMD HGET "$KEY" status 2>/dev/null | tr -d '\r\n')
+    LAST_SEEN=$($REDIS_CMD HGET "$KEY" last_seen 2>/dev/null | tr -d '\r\n')
+
+    printf "%-10s %-15s %-10s %-10s %-30s\n" "$NODE" "$HOSTNAME" "$HEALTH" "$STATUS" "$LAST_SEEN"
+done
+
+echo ""
+echo "--- Control Plane Summary ---"
+if command -v jq >/dev/null; then
+    curl -s http://127.0.0.1:18180/summary | jq .
+else
+    curl -s http://127.0.0.1:18180/summary
+fi
+
+echo ""
+echo "--- Control Plane Nodes ---"
+if command -v jq >/dev/null; then
+    curl -s http://127.0.0.1:18180/nodes | jq .
+else
+    curl -s http://127.0.0.1:18180/nodes
+fi
+
+if [[ $MISSING_NODES -gt 0 ]]; then
+    echo ""
+    echo "Error: $MISSING_NODES required nodes are missing from Redis."
+    exit 1
+fi
+
+exit 0
--- a/scripts/dev/agent.sh
+++ b/scripts/dev/agent.sh
@ -0,0 +1,361 @@
+#!/usr/bin/env bash
+# Multi-agent worktree manager.
+# EXIT: 0 ok, 1 preflight, 2 operation failed.
+set -euo pipefail
+
+trap 'echo "agent.sh: failed at line $LINENO (exit $?)" >&2' ERR
+
+RESERVED_NAMES=(master main HEAD list merge clean new)
+MAX_WORKTREES=4
+
+die()    { echo "ERROR: $1" >&2; exit "${2:-2}"; }
+prefail(){ echo "PREFLIGHT: $*" >&2; exit 1; }
+
+# ── helpers ──────────────────────────────────────────────────────────────────
+
+is_main_checkout() {
+  local git_dir common_dir
+  git_dir=$(git rev-parse --git-dir 2>/dev/null) || return 1
+  common_dir=$(git rev-parse --git-common-dir 2>/dev/null) || return 1
+  [ "$git_dir" = "$common_dir" ]
+}
+
+require_main_checkout() {
+  is_main_checkout || prefail "must run from the main checkout, not a worktree"
+}
+
+require_master_branch() {
+  local branch
+  branch=$(git rev-parse --abbrev-ref HEAD)
+  [ "$branch" = "master" ] || prefail "must be on master (currently on '$branch')"
+}
+
+require_clean_tree() {
+  local dirty
+  dirty=$(git status --porcelain)
+  [ -z "$dirty" ] || prefail "working tree is not clean — stash or commit first"
+}
+
+worktree_paths() {
+  # list worktree paths (excluding main); || true prevents grep exit-1 when empty
+  local main_path
+  main_path=$(git rev-parse --show-toplevel)
+  git worktree list --porcelain \
+    | awk '/^worktree /{p=$2} /^$/{print p}' \
+    | grep -v "^${main_path}$" \
+    || true
+}
+
+worktree_count() {
+  worktree_paths | wc -l
+}
+
+branch_exists_local()  { git show-ref --verify --quiet "refs/heads/$1"; }
+branch_exists_remote() { git ls-remote --exit-code origin "$1" >/dev/null 2>&1; }
+
+utc_now() { date -u +"%Y-%m-%dT%H:%M:%SZ"; }
+
+age_str() {
+  local created_utc="$1"
+  local now_ts created_ts diff_s
+  now_ts=$(date -u +%s)
+  # replace T with a space for `date -d` (the trailing Z is left in place)
+  created_ts=$(date -u -d "${created_utc//T/ }" +%s 2>/dev/null) || { echo "?"; return; }
+  diff_s=$(( now_ts - created_ts ))
+  if   (( diff_s < 60 ));   then echo "${diff_s}s"
+  elif (( diff_s < 3600 )); then echo "$(( diff_s/60 ))m"
+  elif (( diff_s < 86400 )); then echo "$(( diff_s/3600 ))h"
+  else echo "$(( diff_s/86400 ))d"
+  fi
+}
+
+validate_name() {
+  local name="$1"
+  if ! [[ "$name" =~ ^[a-z][a-z0-9-]*$ ]]; then
+    prefail "name '$name' must match ^[a-z][a-z0-9-]*$"
+  fi
+  for r in "${RESERVED_NAMES[@]}"; do
+    if [ "$name" = "$r" ]; then
+      prefail "'$name' is a reserved word"
+    fi
+  done
+}
+
+# ── subcommands ───────────────────────────────────────────────────────────────
+
+cmd_new() {
+  local name="${1:-}"
+  [ -n "$name" ] || { usage; exit 1; }
+
+  validate_name "$name"
+  require_main_checkout
+  require_master_branch
+  require_clean_tree
+
+  # worktree limit
+  local count
+  count=$(worktree_count)
+  if (( count >= MAX_WORKTREES )); then
+    echo "ERROR: already at maximum of $MAX_WORKTREES active worktrees:" >&2
+    cmd_list
+    exit 1
+  fi
+
+  # branch collision
+  if branch_exists_local "task/$name"; then
+    prefail "branch task/$name already exists locally"
+  fi
+  git fetch origin master --quiet
+  if branch_exists_remote "refs/heads/task/$name"; then
+    prefail "branch task/$name already exists on origin"
+  fi
+
+  # directory collision
+  local main_path wt_path
+  main_path=$(git rev-parse --show-toplevel)
+  wt_path="$(dirname "$main_path")/homelab-codex-ws-${name}"
+  [ ! -e "$wt_path" ] || prefail "directory $wt_path already exists"
+
+  # create worktree
+  git worktree add -b "task/$name" "$wt_path" origin/master \
+    || die "git worktree add failed"
+
+  # write marker
+  local parent_commit
+  parent_commit=$(git rev-parse origin/master)
+  cat > "$wt_path/.agent-task" <<EOF
+task: $name
+branch: task/$name
+parent_commit: $parent_commit
+created_utc: $(utc_now)
+worktree_path: $wt_path
+EOF
+
+  echo ""
+  echo "Worktree created: $wt_path"
+  echo "Branch:           task/$name"
+  echo ""
+  echo "── Start Claude Code in this worktree ──────────────────────────────────────"
+  echo "cd ~/homelab-codex-ws-${name} && claude --dangerously-skip-permissions \"You are in the worktree for task '${name}' (branch task/${name}). FIRST read .agent-task and .claude/skills/worktree-aware/SKILL.md, and only then start working. Commit only to your own branch; do not push to origin master.\""
+  echo "─────────────────────────────────────────────────────────────────────────────"
+}
+
+cmd_list() {
+  local main_path
+  main_path=$(git rev-parse --show-toplevel)
+
+  # fetch to get up-to-date ahead/behind
+  git fetch origin master --quiet 2>/dev/null || true
+
+  local paths
+  paths=$(worktree_paths)
+
+  if [ -z "$paths" ]; then
+    echo "(no active task worktrees)"
+    return
+  fi
+
+  printf "%-20s %-25s %-10s %-8s %-8s %-7s %s\n" \
+    "NAME" "BRANCH" "CREATED" "AGE" "STATUS" "A/B" "PARENT"
+
+  while IFS= read -r wt_path; do
+    [ -z "$wt_path" ] && continue
+
+    local marker="$wt_path/.agent-task"
+    local task_name branch parent_commit created_utc
+    if [ -f "$marker" ]; then
+      task_name=$(  grep '^task:'          "$marker" | awk '{print $2}')
+      branch=$(     grep '^branch:'        "$marker" | awk '{print $2}')
+      parent_commit=$(grep '^parent_commit:' "$marker" | awk '{print $2}')
+      created_utc=$(grep '^created_utc:'   "$marker" | awk '{print $2}')
+    else
+      task_name="(no marker)"
+      branch=$(git -C "$wt_path" rev-parse --abbrev-ref HEAD 2>/dev/null || echo "?")
+      parent_commit="?"
+      created_utc=""
+    fi
+
+    local status="clean"
+    local dirty
+    dirty=$(git -C "$wt_path" status --porcelain 2>/dev/null || echo "?")
+    [ -n "$dirty" ] && status="dirty"
+
+    local ahead behind ab
+    ahead=$(git -C "$wt_path" rev-list --count "origin/master..${branch}" 2>/dev/null || echo "?")
+    behind=$(git -C "$wt_path" rev-list --count "${branch}..origin/master" 2>/dev/null || echo "?")
+    ab="+${ahead}/-${behind}"
+
+    local age=""
+    [ -n "$created_utc" ] && age=$(age_str "$created_utc")
+
+    local short_parent="${parent_commit:0:7}"
+    local short_created="${created_utc:0:10}"
+
+    printf "%-20s %-25s %-10s %-8s %-8s %-7s %s\n" \
+      "$task_name" "$branch" "$short_created" "$age" "$status" "$ab" "$short_parent"
+  done <<< "$paths"
+}
+
+cmd_merge() {
+  local name="${1:-}"
+  [ -n "$name" ] || { usage; exit 1; }
+
+  require_main_checkout
+  require_master_branch
+  require_clean_tree
+
+  git fetch origin --quiet
+
+  branch_exists_local "task/$name" || die "branch task/$name not found locally" 1
+
+  local main_path wt_path
+  main_path=$(git rev-parse --show-toplevel)
+  wt_path="$(dirname "$main_path")/homelab-codex-ws-${name}"
+
+  # attempt ff-only merge
+  local merge_failed=0
+  git merge --ff-only "task/$name" || merge_failed=1
+
+  if (( merge_failed )); then
+    # abort any partial merge state
+    git merge --abort 2>/dev/null || true
+    echo ""
+    echo "ERROR: task/$name cannot be fast-forwarded into master." >&2
+    echo "       The branch has likely diverged from master." >&2
+    echo "" >&2
+    echo "Diagnose with:" >&2
+    echo "  git log master..task/$name        # commits only on task branch" >&2
+    echo "  git log task/$name..master        # commits master has that task doesn't" >&2
+    echo "" >&2
+    echo "Then decide: rebase task/$name onto master, or merge manually." >&2
+    echo "Worktree and branch are preserved — no changes made." >&2
+    exit 2
+  fi
+
+  echo "Merged task/$name into master (fast-forward)."
+
+  git push origin master || die "git push origin master failed"
+  echo "Pushed master to origin."
+
+  if [ -d "$wt_path" ]; then
+    git worktree remove "$wt_path" || die "git worktree remove $wt_path failed"
+    echo "Removed worktree: $wt_path"
+  else
+    echo "(worktree directory $wt_path not found — skipping worktree remove)"
+  fi
+
+  git branch -d "task/$name" || die "git branch -d task/$name failed"
+  echo "Deleted local branch task/$name."
+
+  git push origin --delete "task/$name" 2>/dev/null \
+    && echo "Deleted remote branch task/$name." \
+    || echo "(remote branch task/$name not found — nothing to delete)"
+
+  echo ""
+  echo "Done. task/$name merged and cleaned up."
+}
+
+cmd_clean() {
+  local main_path
+  main_path=$(git rev-parse --show-toplevel)
+  git fetch origin --quiet 2>/dev/null || true
+
+  local to_remove=()
+
+  # orphaned registered worktrees: branch deleted or fully merged into master
+  local paths
+  paths=$(worktree_paths)
+  while IFS= read -r wt_path; do
+    [ -z "$wt_path" ] && continue
+    local branch
+    branch=$(git -C "$wt_path" rev-parse --abbrev-ref HEAD 2>/dev/null || echo "")
+    [ -z "$branch" ] && { to_remove+=("worktree:$wt_path (unreadable branch)"); continue; }
+
+    # branch gone locally?
+    if ! branch_exists_local "$branch"; then
+      to_remove+=("worktree:$wt_path (branch $branch no longer exists)")
+      continue
+    fi
+
+    # branch fully merged into master?
+    local ahead
+    ahead=$(git rev-list --count "origin/master..${branch}" 2>/dev/null || echo "1")
+    if [ "$ahead" = "0" ]; then
+      to_remove+=("worktree:$wt_path (branch $branch fully merged into origin/master)")
+    fi
+  done <<< "$paths"
+
+  # dangling directories: ../homelab-codex-ws-* not registered
+  local registered_paths
+  # Strip only the "worktree " prefix so paths containing spaces stay intact.
+  registered_paths=$(git worktree list --porcelain | sed -n 's/^worktree //p')
+  local parent_dir
+  parent_dir=$(dirname "$main_path")
+  while IFS= read -r candidate; do
+    [ -d "$candidate" ] || continue
+    # -x: whole-line match, so ws-foo is not hidden by a registered ws-foobar.
+    if ! grep -qxF -- "$candidate" <<< "$registered_paths"; then
+      to_remove+=("dangling:$candidate")
+    fi
+  done < <(find "$parent_dir" -maxdepth 1 -name "homelab-codex-ws-*" -type d 2>/dev/null)
+
+  if [ ${#to_remove[@]} -eq 0 ]; then
+    echo "Nothing to clean."
+    return 0
+  fi
+
+  echo "Found ${#to_remove[@]} item(s) to clean:"
+  for entry in "${to_remove[@]}"; do
+    echo "  $entry"
+  done
+  echo ""
+
+  local overall_rc=0
+  for entry in "${to_remove[@]}"; do
+    local kind="${entry%%:*}"
+    local path="${entry#*:}"
+    # strip trailing annotation in parens
+    local raw_path
+    raw_path="${path%% (*}"
+
+    local confirm
+    read -r -p "Remove $kind '$raw_path'? [y/N] " confirm
+    if [[ "$confirm" =~ ^[Yy]$ ]]; then
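+      local rc=0
+      if [ "$kind" = "worktree" ]; then
+        if ! git worktree remove --force "$raw_path" 2>/dev/null; then
+          echo "  WARNING: git worktree remove failed, trying rm -rf" >&2
+          # Fall back to deleting the directory, then drop the stale registration.
+          rm -rf "$raw_path" || rc=2
+          git worktree prune 2>/dev/null || true
+        fi
+      else
+        rm -rf "$raw_path" || rc=2
+      fi
+      if (( rc == 0 )); then
+        echo "  Removed."
+      else
+        echo "  FAILED to remove $raw_path" >&2
+        overall_rc=2
+      fi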
+    else
+      echo "  Skipped."
+    fi
+  done
+
+  return $overall_rc
+}
+
+usage() {
+  cat <<'EOF'
+Usage: agent.sh <subcommand> [args]
+
+  agent.sh new <name>    Create a new task worktree (branch task/<name>)
+  agent.sh list          List active task worktrees with status
+  agent.sh merge <name>  Fast-forward merge task/<name> into master and clean up
+  agent.sh clean         Remove orphaned or dangling worktrees (interactive)
+
+EXIT: 0 ok, 1 preflight, 2 operation failed.
+EOF
+}
+
+# ── dispatch ──────────────────────────────────────────────────────────────────
+
+SUBCOMMAND="${1:-}"
+shift || true
+
+case "$SUBCOMMAND" in
+  new)   cmd_new   "$@" ;;
+  list)  cmd_list  "$@" ;;
+  merge) cmd_merge "$@" ;;
+  clean) cmd_clean "$@" ;;
+  *)     usage; exit 1  ;;
+esac
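
The dangling-directory check in `cmd_clean` compares candidate directories against the paths reported by `git worktree list --porcelain`. A minimal sketch of that comparison in Python follows, assuming the porcelain output has already been captured as text. The function names and the `/srv/...` paths are hypothetical and exist only for illustration. It uses exact set membership, so a candidate whose name is a prefix of a registered worktree is still reported.

```python
def registered_worktrees(porcelain):
    """Extract worktree paths from `git worktree list --porcelain` output."""
    prefix = "worktree "
    # Slice off the prefix instead of splitting, so paths with spaces survive.
    return {
        line[len(prefix):]
        for line in porcelain.splitlines()
        if line.startswith(prefix)
    }


def dangling_dirs(candidates, porcelain):
    """Return candidate directories that git does not know as worktrees."""
    registered = registered_worktrees(porcelain)
    return sorted(c for c in candidates if c not in registered)


# Hypothetical porcelain output: the main checkout plus one task worktree.
sample = (
    "worktree /srv/homelab-codex-ws\n"
    "HEAD 1111111111111111111111111111111111111111\n"
    "branch refs/heads/master\n"
    "\n"
    "worktree /srv/homelab-codex-ws-foobar\n"
    "HEAD 2222222222222222222222222222222222222222\n"
    "branch refs/heads/task/foobar\n"
)
candidates = ["/srv/homelab-codex-ws-foo", "/srv/homelab-codex-ws-foobar"]

print(dangling_dirs(candidates, sample))  # prints ['/srv/homelab-codex-ws-foo']
```

A substring test would treat `homelab-codex-ws-foo` as registered because it is contained in `homelab-codex-ws-foobar`. Whole-path comparison avoids that.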
--- a/scripts/monitor/health-monitor.sh
+++ b/scripts/monitor/health-monitor.sh
@ -0,0 +1,338 @@
+#!/usr/bin/env bash
+# health-monitor.sh - Homelab node health monitor and safe disk cleanup
+#
+# Designed to run standalone on the host (cron or direct) or to be called by
+# the node-agent Python daemon. All cleanup decisions follow the conservative
+# policy agreed in the design review:
+#
+#  lte_node  (chelsty-infra, chelsty-ha) : NO cleanup at all
+#  sd_card   (piha, saturn)              : dangling images + stopped containers,
+#                                          rate-limited to once per 24 h
+#  ai_node   (solaria)                   : dangling images + stopped containers
+#                                          + build cache (NEVER -a)
+#  standard  (vps)                       : dangling images + stopped containers
+#                                          + build cache
+#
+# VPS additionally rotates control-plane filesystem artefacts:
+#   actions/completed + failed  > 7 days
+#   logs/deploy                 > 30 days
+#   events/**                   > 3 days AND past observer checkpoint
+#
+# NEVER TOUCHED (any node): /opt/homelab/data/, config/, state/,
+#   actions/pending|approved|running, Frigate recordings, Ollama models,
+#   Zigbee2MQTT data, Mosquitto data, HA database/config.
+
+set -euo pipefail
+
+# ---------------------------------------------------------------------------
+# Configuration
+# ---------------------------------------------------------------------------
+RUNTIME_PATH="${RUNTIME_PATH:-/opt/homelab}"
+EVENTS_DIR="${RUNTIME_PATH}/events"
+STATE_DIR="${RUNTIME_PATH}/state"
+LOGS_DIR="${RUNTIME_PATH}/logs"
+ACTIONS_DIR="${RUNTIME_PATH}/actions"
+
+NODE_NAME="${NODE_NAME:-$(hostname)}"
+TIMESTAMP=$(date +%s)
+DATE=$(date -u +%Y-%m-%dT%H:%M:%SZ)
+
+# Thresholds
+DISK_WARN_PCT=75
+DISK_CRIT_PCT=85
+MEM_WARN_PCT=85
+MEM_CRIT_PCT=95
+
+# Rate-limit file for SD-card nodes (max one Docker cleanup per 24 h)
+CLEANUP_LOCK="${STATE_DIR}/last-docker-cleanup"
+CLEANUP_INTERVAL=86400   # seconds
+
+# Node classifications
+LTE_NODES="chelsty-infra chelsty-ha"
+SD_CARD_NODES="piha saturn"
+AI_NODES="solaria"
+
+# ---------------------------------------------------------------------------
+# Helpers
+# ---------------------------------------------------------------------------
+
+log()  { echo "$(date -u +%H:%M:%S) [INFO]  $*"; }
+warn() { echo "$(date -u +%H:%M:%S) [WARN]  $*" >&2; }
+err()  { echo "$(date -u +%H:%M:%S) [ERROR] $*" >&2; }
+
+contains() {
+    local word="$1"; shift
+    for w in "$@"; do [[ "$w" == "$word" ]] && return 0; done
+    return 1
+}
+
+get_node_type() {
+    # shellcheck disable=SC2086
+    if contains "$NODE_NAME" $LTE_NODES;    then echo "lte_node";  return; fi
+    if contains "$NODE_NAME" $SD_CARD_NODES; then echo "sd_card";   return; fi
+    if contains "$NODE_NAME" $AI_NODES;     then echo "ai_node";   return; fi
+    echo "standard"
+}
+
+# ---------------------------------------------------------------------------
+# Event emission
+# ---------------------------------------------------------------------------
+
+emit_event() {
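+    local type="$1" severity="$2" service="${3:-}" message="$4"
+    # Note: "${5:-{}}" parses as "${5:-{}" followed by a literal "}", which
+    # appends a stray brace to every explicitly passed payload.
+    local payload="${5:-}"
+    [[ -n "${payload}" ]] || payload='{}'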
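+    # Include the service so that several events of one type in the same run
+    # (for example one per container) do not overwrite each other.
+    local id="evt-${NODE_NAME}-${TIMESTAMP}-${type}${service:+-${service}}"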
+    local dir="${EVENTS_DIR}/${NODE_NAME}"
+    mkdir -p "$dir"
+    cat > "${dir}/${id}.json" <<EOF
+{
+  "id": "${id}",
+  "timestamp": ${TIMESTAMP},
+  "date": "${DATE}",
+  "type": "${type}",
+  "severity": "${severity}",
+  "node": "${NODE_NAME}",
+  "service": "${service}",
+  "message": "${message}",
+  "payload": ${payload}
+}
+EOF
+}
+
+# ---------------------------------------------------------------------------
+# Health checks
+# ---------------------------------------------------------------------------
+
+check_disk() {
+    # Use /opt/homelab as the check target — it lives on the host filesystem
+    # and this path is correct both when running natively and in a container
+    # that mounts /opt/homelab from the host.
+    local mount="${RUNTIME_PATH}"
+    local usage_pct avail_mb total_mb
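+    # -P keeps each filesystem on a single line; -k fixes the block size at
+    # 1 KiB so that dividing by 1024 below yields MB.
+    usage_pct=$(df -Pk "${mount}" 2>/dev/null | awk 'NR==2 {gsub(/%/,"",$5); print $5}') || return
+    avail_mb=$(df  -Pk "${mount}" 2>/dev/null | awk 'NR==2 {printf "%d", $4/1024}')       || return
+    total_mb=$(df  -Pk "${mount}" 2>/dev/null | awk 'NR==2 {printf "%d", $2/1024}')       || return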
+
+    if [[ "${usage_pct}" -ge "${DISK_CRIT_PCT}" ]]; then
+        warn "Disk CRITICAL: ${usage_pct}% used (${avail_mb} MB free)"
+        emit_event "disk_pressure" "high" "" \
+            "Disk usage critical: ${usage_pct}% on ${mount} (${avail_mb} MB free)" \
+            "{\"usage_pct\": ${usage_pct}, \"avail_mb\": ${avail_mb}, \"total_mb\": ${total_mb}, \"mount\": \"${mount}\"}"
+    elif [[ "${usage_pct}" -ge "${DISK_WARN_PCT}" ]]; then
+        warn "Disk elevated: ${usage_pct}% used"
+        emit_event "disk_pressure" "medium" "" \
+            "Disk usage elevated: ${usage_pct}% on ${mount} (${avail_mb} MB free)" \
+            "{\"usage_pct\": ${usage_pct}, \"avail_mb\": ${avail_mb}, \"total_mb\": ${total_mb}, \"mount\": \"${mount}\"}"
+    fi
+    echo "${usage_pct}"
+}
+
+check_memory() {
+    local total avail pct avail_mb
+    total=$(awk '/^MemTotal/ {print $2}' /proc/meminfo)
+    avail=$(awk '/^MemAvailable/ {print $2}' /proc/meminfo)
+    pct=$(( (total - avail) * 100 / total ))
+    avail_mb=$(( avail / 1024 ))
+
+    if [[ "${pct}" -ge "${MEM_CRIT_PCT}" ]]; then
+        warn "Memory CRITICAL: ${pct}% used"
+        emit_event "high_memory" "high" "" \
+            "Memory usage critical: ${pct}% (${avail_mb} MB available)" \
+            "{\"usage_pct\": ${pct}, \"avail_mb\": ${avail_mb}, \"total_mb\": $((total/1024))}"
+    elif [[ "${pct}" -ge "${MEM_WARN_PCT}" ]]; then
+        warn "Memory elevated: ${pct}%"
+        emit_event "high_memory" "medium" "" \
+            "Memory usage elevated: ${pct}% (${avail_mb} MB available)" \
+            "{\"usage_pct\": ${pct}, \"avail_mb\": ${avail_mb}, \"total_mb\": $((total/1024))}"
+    fi
+    echo "${pct}"
+}
+
+check_cpu() {
+    # Two-sample /proc/stat delta for accurate instantaneous CPU usage.
+    local idle1 total1 idle2 total2 pct
+    read -r idle1 total1 < <(awk '/^cpu / {idle=$5; total=0; for(i=2;i<=NF;i++) total+=$i; print idle, total}' /proc/stat)
+    sleep 1
+    read -r idle2 total2 < <(awk '/^cpu / {idle=$5; total=0; for(i=2;i<=NF;i++) total+=$i; print idle, total}' /proc/stat)
+
+    local d_idle=$(( idle2 - idle1 ))
+    local d_total=$(( total2 - total1 ))
+    pct=$(( d_total > 0 ? 100 - d_idle * 100 / d_total : 0 ))
+
+    if [[ "${pct}" -ge 90 ]]; then
+        warn "CPU elevated: ${pct}%"
+        emit_event "high_cpu" "medium" "" \
+            "CPU usage elevated: ${pct}%" \
+            "{\"usage_pct\": ${pct}}"
+    fi
+    echo "${pct}"
+}
+
+check_containers() {
+    command -v docker &>/dev/null || return
+
+    # Exited containers that belong to a Compose project. The filter matches on
+    # the Compose label only; the restart policy is not inspected, so one-shot
+    # containers that exited normally are reported too.
+    local cname
+    while IFS= read -r cname; do
+        [[ -z "$cname" ]] && continue
+        warn "Container exited (should be running): ${cname}"
+        emit_event "containers_not_running" "high" "${cname}" \
+            "Container '${cname}' has exited (Compose-managed)" \
+            "{\"container\": \"${cname}\"}"
+    done < <(docker ps -a \
+        --filter "status=exited" \
+        --filter "label=com.docker.compose.project" \
+        --format "{{.Names}}" 2>/dev/null || true)
+
+    # Containers that are running but their health check is failing
+    while IFS= read -r cname; do
+        [[ -z "$cname" ]] && continue
+        warn "Container unhealthy: ${cname}"
+        emit_event "healthcheck_failed" "high" "${cname}" \
+            "Container '${cname}' is running but health check is failing" \
+            "{\"container\": \"${cname}\"}"
+    done < <(docker ps \
+        --filter "health=unhealthy" \
+        --format "{{.Names}}" 2>/dev/null || true)
+}
+
+# ---------------------------------------------------------------------------
+# Safe Docker cleanup (per policy)
+# ---------------------------------------------------------------------------
+
+_sd_card_rate_ok() {
+    if [[ -f "${CLEANUP_LOCK}" ]]; then
+        local last_ts elapsed
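+        last_ts=$(cat "${CLEANUP_LOCK}" 2>/dev/null || echo 0)
+        # Treat an empty or corrupt lock file as "never cleaned".
+        [[ "${last_ts}" =~ ^[0-9]+$ ]] || last_ts=0
+        elapsed=$(( TIMESTAMP - last_ts ))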
+        if [[ "${elapsed}" -lt "${CLEANUP_INTERVAL}" ]]; then
+            log "Docker cleanup skipped: last run ${elapsed}s ago (limit ${CLEANUP_INTERVAL}s)"
+            return 1
+        fi
+    fi
+    return 0
+}
+
+_mark_cleanup_done() {
+    echo "${TIMESTAMP}" > "${CLEANUP_LOCK}"
+}
+
+run_safe_cleanup() {
+    command -v docker &>/dev/null || return
+    local node_type
+    node_type=$(get_node_type)
+
+    case "${node_type}" in
+        lte_node)
+            # NO cleanup on LTE nodes. Anything pruned here might later have to
+            # be re-pulled over a metered/intermittent connection.
+            log "Skipping Docker cleanup: LTE node (${NODE_NAME})"
+            ;;
+
+        sd_card)
+            # Dangling images + stopped containers only.
+            # Rate-limited to once per 24 hours to protect SD card write endurance.
+            _sd_card_rate_ok || return
+            log "Running rate-limited Docker cleanup (SD card node)"
+            docker image prune -f     >/dev/null 2>&1 || true
+            docker container prune -f >/dev/null 2>&1 || true
+            _mark_cleanup_done
+            ;;
+
+        ai_node)
+            # Dangling images + stopped containers + build cache.
+            # NEVER docker image prune -a (would remove Ollama runtime images,
+            # requiring a multi-hour re-pull of model weights).
+            log "Running AI-node Docker cleanup (dangling images + containers + build cache)"
+            docker image prune -f     >/dev/null 2>&1 || true
+            docker container prune -f >/dev/null 2>&1 || true
+            docker builder prune -f   >/dev/null 2>&1 || true
+            ;;
+
+        standard)
+            # VPS and other standard nodes: full safe cleanup.
+            log "Running standard Docker cleanup"
+            docker image prune -f     >/dev/null 2>&1 || true
+            docker container prune -f >/dev/null 2>&1 || true
+            docker builder prune -f   >/dev/null 2>&1 || true
+            ;;
+    esac
+}
+
+# ---------------------------------------------------------------------------
+# VPS-specific: control-plane filesystem rotation
+# ---------------------------------------------------------------------------
+
+cleanup_control_plane_fs() {
+    log "Running control-plane filesystem rotation"
+
+    # Completed / failed actions older than 7 days
+    for status in completed failed; do
+        local dir="${ACTIONS_DIR}/${status}"
+        [[ -d "${dir}" ]] || continue
+        find "${dir}" -name "*.json" -mtime +7 -delete 2>/dev/null && \
+            log "Cleaned ${status} actions older than 7 days" || true
+    done
+
+    # Deploy logs older than 30 days
+    local deploy_logs="${LOGS_DIR}/deploy"
+    if [[ -d "${deploy_logs}" ]]; then
+        find "${deploy_logs}" -name "*.log" -mtime +30 -delete 2>/dev/null && \
+            log "Cleaned deploy logs older than 30 days" || true
+    fi
+
+    # Event files older than 3 days AND already past the observer checkpoint.
+    # The dual condition ensures we never delete an event the observer hasn't seen.
+    # observer.py writes one checkpoint per node directory under the
+    # "node_checkpoints" key, so each directory is compared against its own entry.
+    local checkpoint="${STATE_DIR}/observer_checkpoint.json"
+    if [[ -f "${checkpoint}" ]] && command -v python3 &>/dev/null; then
+        local checkpoints
+        checkpoints=$(python3 -c '
+import json, sys
+try:
+    d = json.load(open(sys.argv[1]))
+    for node, path in (d.get("node_checkpoints") or {}).items():
+        if path:
+            print(f"{node}\t{path}")
+except Exception:
+    pass
+' "${checkpoint}" 2>/dev/null || echo "")
+
+        if [[ -n "${checkpoints}" ]]; then
+            local node_dir last_processed f
+            while IFS=$'\t' read -r node_dir last_processed; do
+                [[ -n "${node_dir}" && -n "${last_processed}" ]] || continue
+                [[ -d "${EVENTS_DIR}/${node_dir}" ]] || continue
+                while IFS= read -r f; do
+                    # Only delete files that sort before this node's checkpoint
+                    # path (i.e., the observer has already processed them).
+                    if [[ "$f" < "${last_processed}" ]]; then
+                        rm -f "$f"
+                        log "Cleaned old event: $(basename "$f")"
+                    fi
+                done < <(find "${EVENTS_DIR}/${node_dir}" -name "*.json" -mtime +3)
+            done <<< "${checkpoints}"
+        else
+            log "No observer checkpoint set; skipping event file cleanup"
+        fi
+    fi
+}
+
+# ---------------------------------------------------------------------------
+# Main
+# ---------------------------------------------------------------------------
+
+mkdir -p "${EVENTS_DIR}/${NODE_NAME}" "${STATE_DIR}"
+
+log "Health check starting on ${NODE_NAME} (type=$(get_node_type))"
+
+disk_pct=$(check_disk || echo 0)
+mem_pct=$(check_memory || echo 0)
+cpu_pct=$(check_cpu || echo 0)
+check_containers
+
+run_safe_cleanup
+
+# VPS: also rotate control-plane filesystem artefacts
+if [[ "${NODE_NAME}" == "vps" ]]; then
+    cleanup_control_plane_fs
+fi
+
+# Emit a node_health heartbeat so the observer can update node status
+# and the supervisor can see up-to-date resource metrics.
+emit_event "node_health" "info" "" \
+    "Health check completed on ${NODE_NAME}" \
+    "{\"disk_pct\": ${disk_pct}, \"mem_pct\": ${mem_pct}, \"cpu_pct\": ${cpu_pct}}"
+
+log "Health check complete (disk=${disk_pct}% mem=${mem_pct}% cpu=${cpu_pct}%)"
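
`emit_event` interpolates its arguments straight into a JSON heredoc, so a message or service name containing a double quote or backslash would yield a file that is not valid JSON. As a minimal sketch of the same event schema built with a serializer, the Python below mirrors the field names from the heredoc. The function name `build_event` and its `now` parameter are hypothetical and not part of the repository.

```python
import json
import time


def build_event(node, etype, severity, message, service="", payload=None, now=None):
    """Build an event dict with the same fields emit_event writes."""
    ts = int(time.time() if now is None else now)
    return {
        "id": f"evt-{node}-{ts}-{etype}",
        "timestamp": ts,
        "date": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(ts)),
        "type": etype,
        "severity": severity,
        "node": node,
        "service": service,
        "message": message,
        # Default to an empty object, as the shell function intends.
        "payload": payload if payload is not None else {},
    }


event = build_event(
    "vps",
    "disk_pressure",
    "high",
    'Disk usage critical on "/opt/homelab"',  # embedded quotes are escaped by json.dumps
    payload={"usage_pct": 91},
    now=0,
)
text = json.dumps(event, indent=2)

# The serialized text parses back to the identical structure.
assert json.loads(text) == event
print(text)
```

Because `json.dumps` escapes every string, the caller does not need to hand-escape quotes the way the shell callers do for the payload argument.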
--- a/scripts/observer/observer.py
+++ b/scripts/observer/observer.py
@ -0,0 +1,520 @@
+import os
+import json
+import time
+import glob
+import logging
+import yaml
+from datetime import datetime, timezone
+from pathlib import Path
+
+
+def _atomic_write_json(path: Path, data) -> None:
+    """Write JSON atomically: write to a sibling .tmp, fsync, then os.replace."""
+    tmp = path.with_suffix(".tmp")
+    with open(tmp, "w") as f:
+        json.dump(data, f, indent=2)
+        f.flush()
+        os.fsync(f.fileno())
+    os.replace(tmp, path)
+
+
+def _parse_ts(ts) -> float:
+    """Return a Unix timestamp float from ts, which may be int/float or an ISO-8601 string.
+
+    Events from node-agent use int(time.time()); events from stability-agent / events.py
+    use ISO format ('2026-06-03T10:30:00Z').  Both appear in incident fields such as
+    last_occurrence and resolved_at, so any arithmetic on them must go through here.
+    Returns 0.0 on None or unparseable input so callers can use plain comparisons.
+    """
+    if ts is None:
+        return 0.0
+    if isinstance(ts, (int, float)):
+        return float(ts)
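+    try:
+        dt = datetime.fromisoformat(str(ts).replace("Z", "+00:00"))
+        if dt.tzinfo is None:
+            # Naive strings are assumed to be UTC, like the 'Z'-suffixed form;
+            # otherwise .timestamp() would interpret them as local time.
+            dt = dt.replace(tzinfo=timezone.utc)
+        return dt.timestamp()
+    except Exception:
+        return 0.0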
+
+# Constants and Paths
+RUNTIME_PATH = os.getenv("RUNTIME_PATH", "/opt/homelab")
+EVENTS_DIR = Path(RUNTIME_PATH) / "events"
+STATE_DIR = Path(RUNTIME_PATH) / "state"
+LOGS_DIR = Path(RUNTIME_PATH) / "logs"
+WORLD_DIR = Path(RUNTIME_PATH) / "world"
+OBSERVER_STATE_FILE = STATE_DIR / "observer_checkpoint.json"
+
+REPO_ROOT = Path(__file__).parent.parent.parent
+INVENTORY_TOPOLOGY = REPO_ROOT / "inventory" / "topology.yaml"
+
+# Logging setup
+logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
+logger = logging.getLogger("observer")
+
+class Observer:
+    def __init__(self):
+        # Per-node-directory checkpoint: {"vps": "last/file/path", "piha": "last/file/path"}
+        # Replaces the old single last_processed_file which silently skipped event dirs
+        # that sort alphabetically before the checkpoint (e.g. piha/ < vps/).
+        self.node_checkpoints: dict = {}
+        self.world_state = {
+            "nodes": {},
+            "services": {},
+            "deployments": {},
+            "incidents": {},
+            "summary": {
+                "last_update": datetime.now(timezone.utc).isoformat(),
+                "status": "initializing",
+                "active_incidents_count": 0
+            }
+        }
+        self.inventory = self._load_inventory()
+        self._ensure_dirs()
+        self._load_checkpoint()
+
+    def _ensure_dirs(self):
+        WORLD_DIR.mkdir(parents=True, exist_ok=True)
+        STATE_DIR.mkdir(parents=True, exist_ok=True)
+        EVENTS_DIR.mkdir(parents=True, exist_ok=True)
+        LOGS_DIR.mkdir(parents=True, exist_ok=True)
+
+    def _load_inventory(self):
+        inventory = {"nodes": {}, "services": {}}
+        try:
+            if INVENTORY_TOPOLOGY.exists():
+                with open(INVENTORY_TOPOLOGY, "r") as f:
+                    # safe_load returns None for an empty file.
+                    topo = yaml.safe_load(f) or {}
+                    for node_name, node_info in (topo.get("nodes") or {}).items():
+                        inventory["nodes"][node_name] = {
+                            "roles": node_info.get("roles", []),
+                            "connectivity": node_info.get("connectivity", {})
+                        }
+
+            # Load service assignments from hosts files
+            hosts_dir = REPO_ROOT / "hosts"
+            for host_dir in hosts_dir.iterdir():
+                if host_dir.is_dir():
+                    svc_file = host_dir / "services.yaml"
+                    if svc_file.exists():
+                        with open(svc_file, "r") as f:
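+                            # safe_load returns None for an empty file; one bad
+                            # services.yaml should not abort the remaining hosts.
+                            svc_data = yaml.safe_load(f) or {}
+                            # Assumption: fall back to the directory name when
+                            # the file has no "host" key.
+                            host_name = svc_data.get("host") or host_dir.name
+                            for svc_name, svc_info in (svc_data.get("services") or {}).items():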
+                                if host_name not in inventory["services"]:
+                                    inventory["services"][host_name] = {}
+                                inventory["services"][host_name][svc_name] = {
+                                    "role": svc_info.get("role"),
+                                    "exposure": svc_info.get("exposure")
+                                }
+        except Exception as e:
+            logger.error(f"Failed to load inventory: {e}")
+        return inventory
+
+    def _load_checkpoint(self):
+        if OBSERVER_STATE_FILE.exists():
+            try:
+                with open(OBSERVER_STATE_FILE, "r") as f:
+                    checkpoint = json.load(f)
+
+                if "node_checkpoints" in checkpoint:
+                    # New format: per-directory checkpoints.
+                    self.node_checkpoints = checkpoint["node_checkpoints"]
+                elif "last_processed_file" in checkpoint:
+                    # Migrate old single-file checkpoint: extract node dir from path.
+                    old = checkpoint["last_processed_file"]
+                    if old:
+                        try:
+                            node_dir = Path(old).relative_to(EVENTS_DIR).parts[0]
+                            self.node_checkpoints = {node_dir: old}
+                            logger.info(f"Migrated old checkpoint → node_checkpoints: {self.node_checkpoints}")
+                        except Exception:
+                            pass  # Bad path — start fresh
+
+                self._load_world_from_disk()
+            except Exception as e:
+                logger.error(f"Failed to load checkpoint: {e}")
+
+    def _load_world_from_disk(self):
+        # Optional: Load existing state to resume faster
+        files = {
+            "nodes": WORLD_DIR / "nodes.json",
+            "services": WORLD_DIR / "services.json",
+            "deployments": WORLD_DIR / "deployments.json",
+            "incidents": WORLD_DIR / "incidents.json",
+            "summary": WORLD_DIR / "runtime-summary.json"
+        }
+        for key, path in files.items():
+            if path.exists():
+                try:
+                    with open(path, "r") as f:
+                        self.world_state[key] = json.load(f)
+                except Exception as e:
+                    logger.error(f"Failed to load {key} state: {e}")
+
+    def _save_checkpoint(self):
+        try:
+            _atomic_write_json(OBSERVER_STATE_FILE, {"node_checkpoints": self.node_checkpoints})
+        except Exception as e:
+            logger.error(f"Failed to save checkpoint: {e}")
+
+    def _prune_stale_world(self):
+        """Remove world-state entries for nodes absent from the topology inventory.
+
+        Root cause this guards against: when NODE_NAME env var is unset, node_agent.py
+        falls back to socket.gethostname(), which inside a Docker container returns the
+        12-char hex container ID (e.g. 'be17cb6eb0f6') instead of the canonical host name
+        ('vps').  The observer ingests those events and creates ghost entries that never
+        expire on their own.
+
+        Also ages out resolved incidents older than 7 days to keep world state lean.
+        """
+        known_nodes = set(self.inventory["nodes"].keys())
+        if not known_nodes:
+            # Inventory failed to load — don't prune to avoid wiping valid state.
+            return
+
+        stale_nodes = [n for n in list(self.world_state["nodes"].keys())
+                       if n not in known_nodes]
+        for n in stale_nodes:
+            logger.info(f"Pruning stale node from world state: {n}")
+            del self.world_state["nodes"][n]
+
+        stale_svcs = [k for k in list(self.world_state["services"].keys())
+                      if k.split("/")[0] in stale_nodes]
+        for k in stale_svcs:
+            logger.info(f"Pruning stale service from world state: {k}")
+            del self.world_state["services"][k]
+
+        # Prune ghost service keys whose service-name portion is a hash-prefixed
+        # Docker stale-state artifact (e.g. "9e36297651e7_control-plane-observer").
+        # These are created when node-agent incorrectly uses c.name instead of the
+        # compose label, and accumulate on every container rebuild.
+        # Pattern: <node>/<12hexchars>_<real-name>
+        ghost_svcs = [
+            k for k in list(self.world_state["services"].keys())
+            if len(k.split("/", 1)) == 2
+            and len(k.split("/", 1)[1]) > 13
+            and k.split("/", 1)[1][12] == "_"
+            and all(ch in "0123456789abcdef" for ch in k.split("/", 1)[1][:12])
+        ]
+        for k in ghost_svcs:
+            logger.info(f"Pruning ghost (hash-prefixed) service key from world state: {k}")
+            del self.world_state["services"][k]
+
+        now = time.time()
+
+        try:
+            # Collect incident_ids currently referenced by any service entry.
+            linked_ids: set = {
+                svc.get("incident_id")
+                for svc in self.world_state["services"].values()
+                if svc.get("incident_id")
+            }
+
+            # Case 1 — service is healthy but still points at an active incident.
+            # process_event already calls _resolve_incident on service_healthy events,
+            # but if the observer restarted with on-disk state where the link was
+            # intact (inconsistency from a pre-atomic-write crash), it may not get
+            # resolved until the next service_healthy event is processed.  Resolve
+            # immediately — a healthy service cannot have an ongoing incident.
+            for svc_key, svc in self.world_state["services"].items():
+                if svc.get("status") != "healthy":
+                    continue
+                inc_id = svc.get("incident_id")
+                if not inc_id:
+                    continue
+                inc = self.world_state["incidents"].get(inc_id, {})
+                if inc.get("status") == "active":
+                    logger.info(
+                        f"Auto-resolving incident {inc_id} for {svc_key}: "
+                        f"service is healthy"
+                    )
+                    inc["status"] = "resolved"
+                    inc["resolved_at"] = now
+                    svc["incident_id"] = None
+                    linked_ids.discard(inc_id)
+
+            # Case 2 — orphaned active incident: no service entry links to it and
+            # last_occurrence is older than 5 minutes (guard against creation races).
+            # These are the stale records left behind when on-disk state was
+            # inconsistent: the service entry had incident_id cleared but incidents.json
+            # still had the record as "active".
+            for inc_id, inc in self.world_state["incidents"].items():
+                if inc.get("status") != "active":
+                    continue
+                if inc_id in linked_ids:
+                    continue
+                age = now - _parse_ts(inc.get("last_occurrence"))
+                if age > 300:  # 5-minute guard
+                    logger.info(
+                        f"Auto-resolving orphaned incident {inc_id} "
+                        f"(service={inc.get('service')}, node={inc.get('node')}): "
+                        f"no service references it, age={int(age)}s"
+                    )
+                    inc["status"] = "resolved"
+                    inc["resolved_at"] = now
+
+        except Exception as exc:
+            logger.error(f"Error during incident auto-resolve in _prune_stale_world: {exc}")
+
+        # Remove resolved incidents older than 7 days.
+        # Use _parse_ts so ISO-string resolved_at values are handled correctly.
+        stale_incidents = [
+            k for k, v in self.world_state["incidents"].items()
+            if v.get("status") == "resolved"
+            and now - _parse_ts(v.get("resolved_at")) > 7 * 86400
+        ]
+        for k in stale_incidents:
+            del self.world_state["incidents"][k]
+
+    def _save_world(self):
+        self.world_state["summary"]["last_update"] = datetime.now(timezone.utc).isoformat()
+        active_incidents = [
+            k for k, v in self.world_state["incidents"].items() if v.get("status") == "active"
+        ]
+        self.world_state["summary"]["active_incidents_count"] = len(active_incidents)
+        self.world_state["summary"]["node_count"] = len(self.world_state["nodes"])
+        self.world_state["summary"]["service_count"] = len(self.world_state["services"])
+
+        if active_incidents:
+            self.world_state["summary"]["status"] = "degraded"
+        else:
+            self.world_state["summary"]["status"] = "nominal"
+
+        files = {
+            "nodes.json": self.world_state["nodes"],
+            "services.json": self.world_state["services"],
+            "deployments.json": self.world_state["deployments"],
+            "incidents.json": self.world_state["incidents"],
+            "recommendations.json": [],
+            "runtime-summary.json": self.world_state["summary"]
+        }
+        for filename, data in files.items():
+            try:
+                _atomic_write_json(WORLD_DIR / filename, data)
+            except Exception as e:
+                logger.error(f"Failed to save {filename}: {e}")
+
+    def process_event(self, event):
+        etype = event.get("type")
+        node = event.get("node")
+        service = event.get("service")
+        severity = event.get("severity")
+        timestamp = event.get("timestamp")
+        cid = event.get("correlation_id")
+        payload = event.get("payload", {})
+
+        # 1. Update Node State
+        if node not in self.world_state["nodes"]:
+            self.world_state["nodes"][node] = {
+                "status": "unknown",
+                "last_seen": None,
+                "roles": self.inventory["nodes"].get(node, {}).get("roles", [])
+            }
+        self.world_state["nodes"][node]["last_seen"] = timestamp
+
+        if etype == "node_online":
+            self.world_state["nodes"][node]["status"] = "online"
+        elif etype == "node_offline":
+            self.world_state["nodes"][node]["status"] = "offline"
+
+        elif etype == "node_health":
+            # Regular heartbeat from node-agent; updates resource metrics.
+            # Clears disk_pressure if disk is now healthy (< warn threshold).
+            self.world_state["nodes"][node]["status"] = "online"
+            self.world_state["nodes"][node].update({
+                "disk_usage_pct": payload.get("disk_pct"),
+                "mem_usage_pct":  payload.get("mem_pct"),
+                "cpu_usage_pct":  payload.get("cpu_pct"),
+            })
+            if payload.get("disk_pct") is not None and payload["disk_pct"] < 75:
+                self.world_state["nodes"][node].pop("disk_pressure", None)
+
+        elif etype == "disk_pressure":
+            # Emitted when disk usage crosses 75 % (medium) or 85 % (high).
+            # The supervisor reads disk_pressure to generate disk_cleanup actions.
+            self.world_state["nodes"][node]["disk_pressure"] = severity
+            self.world_state["nodes"][node]["disk_usage_pct"] = payload.get("usage_pct")
+
+        elif etype == "high_memory":
+            # Memory pressure observation; recorded on the node for correlation.
+            # No automated action — operator decides if a container restart helps.
+            self.world_state["nodes"][node]["memory_pressure"] = severity
+            self.world_state["nodes"][node]["mem_usage_pct"] = payload.get("usage_pct")
+
+        elif etype == "high_cpu":
+            # CPU pressure observation; recorded for visibility.
+            self.world_state["nodes"][node]["cpu_pressure"] = severity
+            self.world_state["nodes"][node]["cpu_usage_pct"] = payload.get("usage_pct")
+
+        # 2. Update Service State
+        if service and service != "all":
+            svc_key = f"{node}/{service}"
+            if svc_key not in self.world_state["services"]:
+                self.world_state["services"][svc_key] = {
+                    "node": node,
+                    "service": service,
+                    "status": "unknown",
+                    "last_check": None,
+                    "incident_id": None
+                }
+            self.world_state["services"][svc_key]["last_check"] = timestamp
+
+            if etype == "service_recovered":
+                self.world_state["services"][svc_key]["status"] = "healthy"
+                self._resolve_incident(svc_key, timestamp)
+            elif etype == "service_healthy":
+                # Positive confirmation from node-agent that a managed container
+                # is running. This keeps services.json populated so the supervisor
+                # can correctly detect drift (absent entry = never reported = unknown,
+                # not the same as confirmed missing).
+                # Also resolve any active incident — if a service that had been
+                # unhealthy/crashing is now confirmed healthy, the incident is over.
+                self.world_state["services"][svc_key]["status"] = "healthy"
+                self._resolve_incident(svc_key, timestamp)
+            elif etype in ["service_unhealthy", "healthcheck_failed"]:
+                self.world_state["services"][svc_key]["status"] = "unhealthy"
+                self._handle_incident(svc_key, event)
+
+        # 3. Update Deployment State
+        if etype and etype.startswith("deployment_") and cid:
+            if cid not in self.world_state["deployments"]:
+                self.world_state["deployments"][cid] = {
+                    "node": node,
+                    "service": service,
+                    "status": "unknown",
+                    "started_at": None,
+                    "finished_at": None,
+                    "events": []
+                }
+            self.world_state["deployments"][cid]["events"].append({
+                "type": etype,
+                "timestamp": timestamp,
+                "payload": payload
+            })
+            if etype == "deployment_started":
+                self.world_state["deployments"][cid]["status"] = "in_progress"
+                self.world_state["deployments"][cid]["started_at"] = timestamp
+            elif etype == "deployment_completed":
+                self.world_state["deployments"][cid]["status"] = "completed"
+                self.world_state["deployments"][cid]["finished_at"] = timestamp
+            elif etype == "deployment_failed":
+                self.world_state["deployments"][cid]["status"] = "failed"
+                self.world_state["deployments"][cid]["finished_at"] = timestamp
+                # A deployment failure opens a new incident or updates the active one
+                self._handle_deployment_failure(event)
+
+    def _handle_incident(self, svc_key, event):
+        # Correlation: collapse repeated failures for the same service on the same node
+        active_incident = self.world_state["services"][svc_key].get("incident_id")
+        
+        if active_incident and active_incident in self.world_state["incidents"]:
+            incident = self.world_state["incidents"][active_incident]
+            if incident["status"] == "active":
+                incident["last_occurrence"] = event["timestamp"]
+                incident["occurrence_count"] = incident.get("occurrence_count", 1) + 1
+                incident["events"].append(event["timestamp"])
+                return
+
+        # Create new incident
+        incident_id = f"inc-{int(time.time())}-{event.get('node')}-{event.get('service')}"
+        self.world_state["incidents"][incident_id] = {
+            "id": incident_id,
+            "node": event.get("node"),
+            "service": event.get("service"),
+            "status": "active",
+            "severity": event.get("severity"),
+            # trigger_type records the event type that opened this incident so that
+            # the supervisor can choose the appropriate remediation action
+            # (e.g. container_restart for containers_not_running / mqtt_unreachable
+            # vs. a full redeploy for other causes).
+            "trigger_type": event.get("type"),
+            "started_at": event.get("timestamp"),
+            "last_occurrence": event.get("timestamp"),
+            "occurrence_count": 1,
+            "events": [event["timestamp"]],
+            "correlation_id": event.get("correlation_id")
+        }
+        self.world_state["services"][svc_key]["incident_id"] = incident_id
+
+    def _resolve_incident(self, svc_key, timestamp):
+        incident_id = self.world_state["services"][svc_key].get("incident_id")
+        if incident_id and incident_id in self.world_state["incidents"]:
+            if self.world_state["incidents"][incident_id]["status"] == "active":
+                self.world_state["incidents"][incident_id]["status"] = "resolved"
+                self.world_state["incidents"][incident_id]["resolved_at"] = timestamp
+        self.world_state["services"][svc_key]["incident_id"] = None
+
+    def _handle_deployment_failure(self, event):
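+        # Node-wide deployments (service missing or "all") have no service entry,
+        # so there is nothing to attach an incident to.
+        svc_key = f"{event.get('node')}/{event.get('service')}"
+        if svc_key not in self.world_state["services"]:
+            return
+        self._handle_incident(svc_key, event)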
+        
+        # Link diagnostics if available in payload
+        incident_id = self.world_state["services"][svc_key].get("incident_id")
+        if incident_id and incident_id in self.world_state["incidents"]:
+            payload = event.get("payload", {})
+            if "diagnostics_file" in payload:
+                self.world_state["incidents"][incident_id]["diagnostics_ref"] = payload["diagnostics_file"]
+            elif "error" in payload:
+                self.world_state["incidents"][incident_id]["last_error"] = payload["error"]
+
+    def run_once(self):
+        # Update heartbeat
+        heartbeat_file = STATE_DIR / "observer.heartbeat"
+        try:
+            heartbeat_file.touch()
+        except Exception as e:
+            logger.error(f"Failed to touch heartbeat file: {e}")
+
+        # Collect all event files grouped by node directory.
+        # Per-node checkpoints are compared within each directory independently,
+        # so late-arriving events from remote nodes (sorted earlier in the path)
+        # are never skipped just because another node's checkpoint is further ahead.
+        all_files = sorted(glob.glob(str(EVENTS_DIR / "**" / "*.json"), recursive=True))
+
+        new_files = []
+        for file_path in all_files:
+            try:
+                node_dir = str(Path(file_path).relative_to(EVENTS_DIR).parts[0])
+            except (IndexError, ValueError):
+                node_dir = "__unknown__"
+            last_for_node = self.node_checkpoints.get(node_dir, "")
+            if file_path > last_for_node:
+                new_files.append((node_dir, file_path))
+
+        if not new_files:
+            # Even if no new events, prune stale entries and refresh summary freshness.
+            self._prune_stale_world()
+            self._save_world()
+            return
+
+        logger.info(f"Processing {len(new_files)} new events across "
+                    f"{len({n for n, _ in new_files})} node(s)")
+        for node_dir, file_path in new_files:
+            try:
+                with open(file_path, "r") as f:
+                    event = json.load(f)
+                    self.process_event(event)
+                # Advance per-node checkpoint (only forward — no regression).
+                if file_path > self.node_checkpoints.get(node_dir, ""):
+                    self.node_checkpoints[node_dir] = file_path
+            except Exception as e:
+                logger.error(f"Error processing {file_path}: {e}")
+
+        self._save_checkpoint()
+        self._prune_stale_world()
+        self._save_world()
+
+    def loop(self, interval=5):
+        logger.info("Starting observer loop")
+        while True:
+            self.run_once()
+            time.sleep(interval)
+
+if __name__ == "__main__":
+    import sys
+    observer = Observer()
+    if "--run-once" in sys.argv:
+        observer.run_once()
+    else:
+        observer.loop()
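Review note: the per-node checkpoint filter in `run_once` can be exercised in isolation. The following is a minimal sketch, assuming event paths shaped `<events_dir>/<node>/<file>.json`. `select_new_files` is a hypothetical helper written for illustration and is not part of `observer.py`.

```python
from pathlib import Path


def select_new_files(all_files, events_dir, checkpoints):
    """Return (node_dir, path) pairs that sort after that group's checkpoint.

    Hypothetical helper mirroring the filter in run_once. It assumes the first
    path component below events_dir identifies the checkpoint group.
    """
    selected = []
    for file_path in sorted(all_files):
        try:
            node_dir = Path(file_path).relative_to(events_dir).parts[0]
        except (IndexError, ValueError):
            node_dir = "__unknown__"
        # Plain string comparison, as in run_once: only paths that sort
        # strictly after the stored checkpoint count as new.
        if file_path > checkpoints.get(node_dir, ""):
            selected.append((node_dir, file_path))
    return selected


events_dir = "/tmp/homelab/events"
files = [
    f"{events_dir}/piha/120000_node_online_1.json",
    f"{events_dir}/piha/120500_node_health_1.json",
    f"{events_dir}/vps/121000_node_health_1.json",
]
# The vps checkpoint is further ahead than the piha one; the late piha event
# must still be selected.
checkpoints = {
    "piha": f"{events_dir}/piha/120000_node_online_1.json",
    "vps": f"{events_dir}/vps/121000_node_health_1.json",
}
new_files = select_new_files(files, events_dir, checkpoints)
print(new_files)
```

Because each group is compared only against its own checkpoint, a single global "last processed path" cannot hide events from a node whose directory sorts earlier.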
--- a/scripts/observer/test_setup.sh
+++ b/scripts/observer/test_setup.sh
@ -0,0 +1,83 @@
+#!/usr/bin/env bash
+mkdir -p /tmp/homelab/events/2026-05-12/saturn
+mkdir -p /tmp/homelab/state
+mkdir -p /tmp/homelab/logs
+mkdir -p /tmp/homelab/world
+
+cat <<EOF > /tmp/homelab/events/2026-05-12/saturn/120000_node_online_1.json
+{
+  "timestamp": "2026-05-12T12:00:00Z",
+  "node": "saturn",
+  "type": "node_online",
+  "severity": "info",
+  "source": "system",
+  "service": "all",
+  "correlation_id": "init",
+  "payload": {}
+}
+EOF
+
+cat <<EOF > /tmp/homelab/events/2026-05-12/saturn/120500_service_unhealthy_1.json
+{
+  "timestamp": "2026-05-12T12:05:00Z",
+  "node": "saturn",
+  "type": "service_unhealthy",
+  "severity": "error",
+  "source": "healthcheck",
+  "service": "mosquitto",
+  "correlation_id": "hc-1",
+  "payload": {"error": "connection refused"}
+}
+EOF
+
+cat <<EOF > /tmp/homelab/events/2026-05-12/saturn/120600_service_unhealthy_2.json
+{
+  "timestamp": "2026-05-12T12:06:00Z",
+  "node": "saturn",
+  "type": "service_unhealthy",
+  "severity": "error",
+  "source": "healthcheck",
+  "service": "mosquitto",
+  "correlation_id": "hc-2",
+  "payload": {"error": "connection refused"}
+}
+EOF
+
+cat <<EOF > /tmp/homelab/events/2026-05-12/saturn/121000_service_recovered_1.json
+{
+  "timestamp": "2026-05-12T12:10:00Z",
+  "node": "saturn",
+  "type": "service_recovered",
+  "severity": "info",
+  "source": "healthcheck",
+  "service": "mosquitto",
+  "correlation_id": "hc-3",
+  "payload": {}
+}
+EOF
+
+cat <<EOF > /tmp/homelab/events/2026-05-12/saturn/121500_deployment_started_1.json
+{
+  "timestamp": "2026-05-12T12:15:00Z",
+  "node": "saturn",
+  "type": "deployment_started",
+  "severity": "info",
+  "source": "deploy_agent",
+  "service": "mosquitto",
+  "correlation_id": "deploy-1",
+  "payload": {"version": "2.0.18"}
+}
+EOF
+
+cat <<EOF > /tmp/homelab/events/2026-05-12/saturn/121600_deployment_failed_1.json
+{
+  "timestamp": "2026-05-12T12:16:00Z",
+  "node": "saturn",
+  "type": "deployment_failed",
+  "severity": "error",
+  "source": "deploy_agent",
+  "service": "mosquitto",
+  "correlation_id": "deploy-1",
+  "payload": {"error": "container crash", "diagnostics_file": "/opt/homelab/logs/diagnostics-deploy-1.log"}
+}
+EOF
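Review note: replaying this fixture should collapse the two `service_unhealthy` events into one incident, resolve it on `service_recovered`, and open a second incident on `deployment_failed`. The sketch below is a simplified stand-in for `_handle_incident` and `_resolve_incident`, not the observer itself. The `correlate` function and the sequential `inc-N` IDs are assumptions made for illustration (the observer derives IDs from the current time).

```python
FAILURE_TYPES = {"service_unhealthy", "healthcheck_failed", "deployment_failed"}
RECOVERY_TYPES = {"service_recovered", "service_healthy"}


def correlate(events):
    """Collapse repeated failures per node/service into one incident each."""
    incidents = {}
    active = {}  # svc_key -> id of the currently active incident
    for ev in events:
        svc_key = f"{ev['node']}/{ev['service']}"
        if ev["type"] in FAILURE_TYPES:
            inc_id = active.get(svc_key)
            if inc_id is None:
                # No active incident for this service: open a new one.
                inc_id = f"inc-{len(incidents) + 1}"
                incidents[inc_id] = {
                    "status": "active",
                    "trigger_type": ev["type"],
                    "started_at": ev["timestamp"],
                    "last_occurrence": ev["timestamp"],
                    "occurrence_count": 1,
                }
                active[svc_key] = inc_id
            else:
                # Repeat failure: update the existing incident instead.
                incidents[inc_id]["occurrence_count"] += 1
                incidents[inc_id]["last_occurrence"] = ev["timestamp"]
        elif ev["type"] in RECOVERY_TYPES:
            inc_id = active.pop(svc_key, None)
            if inc_id is not None:
                incidents[inc_id]["status"] = "resolved"
                incidents[inc_id]["resolved_at"] = ev["timestamp"]
    return incidents


# The same sequence the fixture writes, reduced to the fields used here.
fixture = [
    {"timestamp": "2026-05-12T12:00:00Z", "node": "saturn", "service": "all", "type": "node_online"},
    {"timestamp": "2026-05-12T12:05:00Z", "node": "saturn", "service": "mosquitto", "type": "service_unhealthy"},
    {"timestamp": "2026-05-12T12:06:00Z", "node": "saturn", "service": "mosquitto", "type": "service_unhealthy"},
    {"timestamp": "2026-05-12T12:10:00Z", "node": "saturn", "service": "mosquitto", "type": "service_recovered"},
    {"timestamp": "2026-05-12T12:15:00Z", "node": "saturn", "service": "mosquitto", "type": "deployment_started"},
    {"timestamp": "2026-05-12T12:16:00Z", "node": "saturn", "service": "mosquitto", "type": "deployment_failed"},
]
incidents = correlate(fixture)
for inc_id, inc in incidents.items():
    print(inc_id, inc["status"], inc["trigger_type"], inc["occurrence_count"])
```

Comparing this against `incidents.json` after `observer.py --run-once` gives a quick check on the fixture, bearing in mind that the real incident IDs will differ.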
--- a/services/agent-system/README.md
+++ b/services/agent-system/README.md
@ -0,0 +1,55 @@
+### Agent System
+Central runtime materializer and Operator Control Plane UI.
+
+#### Components
+- **Redis**: Central state store (on PIHA).
+- **Runtime Materializer**: Converts Redis state to JSON files in `/opt/homelab/world`.
+- **Web UI**: Exposes API endpoints and serves the Operator UI.
+- **Telegram Bot**: Provides operator commands and action approvals via Telegram.
+
+#### Configuration
+Environment variables should be set in `.env` (see `env.example`).
+Key variables for the Telegram Bot:
+- `TELEGRAM_BOT_TOKEN`: Your bot token from @BotFather.
+- `TELEGRAM_ALLOWED_USER_IDS`: Comma-separated list of authorized Telegram User IDs.
+- `CONTROL_PLANE_URL`: URL to the `agent-system-webui` (default: `http://webui:8080`).
+
+#### Telegram Commands
+- `/status`: Check bot and API connectivity.
+- `/summary`: System health overview.
+- `/nodes`: List homelab nodes and their status.
+- `/services`: Summary of services across nodes.
+- `/unhealthy`: List all unhealthy components.
+- `/incidents`: View active incidents.
+- `/actions`: Summary of operator actions.
+- `/help`: List all commands.
+
+#### Deployment (on PIHA)
+```bash
+cd services/agent-system
+./deploy.sh
+```
+
+#### Deployment (on CHELSTY)
+```bash
+cd services/stability-agent
+docker compose up -d --build
+```
+
+#### Verification
+The `deploy.sh` script automatically verifies the local endpoints.
+You can also manually check:
+```bash
+# Check runtime summary
+curl http://localhost:18180/summary
+
+# Check discovered nodes
+curl http://localhost:18180/nodes
+
+# Check discovered services
+curl http://localhost:18180/services
+```
+
+#### Directory Structure
+- `/opt/homelab/world`: Contains materialized JSON state.
+- `/opt/homelab/state`: Contains operator configuration and local heartbeats.
--- a/services/agent-system/action-model.md
+++ b/services/agent-system/action-model.md
@ -0,0 +1,52 @@
+### Action Approval Data Model
+
+Actions are JSON files stored in `/opt/homelab/actions/{status}/{action_id}.json`.
+
+#### Statuses
+- `pending`: Waiting for operator approval. AI agents create actions in this state.
+- `approved`: Approved by operator, ready for execution.
+- `rejected`: Rejected by operator, will not be executed.
+- `running`: Currently being executed by an agent (e.g. `stability-agent`).
+- `completed`: Successfully executed.
+- `failed`: Execution failed.
+
+#### Human-in-the-Loop (HIL) Protocol
+1. **Request**: Agent identifies a required change and writes a JSON file to `actions/pending/`.
+2. **Notification**: System notifies the human operator.
+3. **Audit**: Human reviews `details.reason` and `details.diff`.
+4. **Authorization**: Human moves file to `approved/`.
+5. **Execution**: Agent monitors `approved/` and executes the task.
+
+#### Schema
+```json
+{
+  "action_id": "string",
+  "service": "string",
+  "node": "string",
+  "type": "deploy_service | restart_service | rollback | scale",
+  "risk": "nominal | guarded | critical",
+  "status": "pending | approved | rejected | ...",
+  "created_at": <unix_seconds>,
+  "updated_at": <unix_seconds>,
+  "details": {
+    "image": "string",
+    "reason": "string",
+    "diff": "string"
+  },
+  "transition_history": [
+    {
+      "from": "string | null",
+      "to": "string",
+      "timestamp": <unix_seconds>,
+      "by": "string (system | operator-tg-12345 | webui)"
+    }
+  ]
+}
+```
+
+#### Workflow
+1. A system component (e.g. `runtime-materializer` or a future analyzer) creates a file in `actions/pending/`.
+2. `telegram-bot` detects the file, sends a message to allowed users.
+3. Operator clicks "Approve" or "Reject".
+4. `telegram-bot` moves the file to `actions/approved/` or `actions/rejected/` atomically, appending a transition to `transition_history`.
+5. The responsible agent (e.g. `stability-agent` on the target node) picks up the `approved` action, moves it to `running`, executes it, and finally moves it to `completed` or `failed`.
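Review note: the status move described in the workflow (update the document, append to `transition_history`, relocate the file) could look roughly like the sketch below. `transition_action` and the hidden `.tmp` file name are hypothetical and chosen for illustration. This is not the bot's actual implementation.

```python
import json
import os
import tempfile
import time
from pathlib import Path


def transition_action(actions_root, action_id, from_status, to_status, by):
    """Move an action between status directories and record the transition."""
    root = Path(actions_root)
    src = root / from_status / f"{action_id}.json"
    dst_dir = root / to_status
    dst_dir.mkdir(parents=True, exist_ok=True)

    action = json.loads(src.read_text())
    now = int(time.time())
    action["status"] = to_status
    action["updated_at"] = now
    action.setdefault("transition_history", []).append(
        {"from": from_status, "to": to_status, "timestamp": now, "by": by}
    )

    # Write to a hidden temp file in the destination directory, then rename it
    # into place so a reader never sees a half-written JSON document.
    tmp = dst_dir / f".{action_id}.json.tmp"
    tmp.write_text(json.dumps(action, indent=2))
    os.replace(tmp, dst_dir / f"{action_id}.json")
    # The source is removed last, so for a short window the action exists in
    # both directories rather than in neither.
    src.unlink()
    return action


# Demo in a throwaway directory standing in for /opt/homelab/actions.
root = Path(tempfile.mkdtemp())
(root / "pending").mkdir()
(root / "pending" / "test-1.json").write_text(json.dumps({
    "action_id": "test-1",
    "status": "pending",
    "transition_history": [
        {"from": None, "to": "pending", "timestamp": 0, "by": "system-test"}
    ],
}))
updated = transition_action(root, "test-1", "pending", "approved", "operator-tg-12345")
print(updated["status"], len(updated["transition_history"]))
```

Removing the source only after the rename means a crash mid-transition leaves a duplicate rather than a lost action. A consumer would then need to tolerate seeing the same `action_id` in two directories.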
--- a/services/agent-system/deploy.sh
+++ b/services/agent-system/deploy.sh
@ -0,0 +1,28 @@
+#!/bin/bash
+set -e
+
+echo ">>> Validating docker-compose configuration..."
+docker compose config
+
+echo ">>> Building and starting Agent System services..."
+docker compose up -d --build
+
+echo ">>> Services status:"
+docker ps --filter "name=agent-system" --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"
+
+if [ -z "$TELEGRAM_BOT_TOKEN" ]; then
+  echo ">>> Telegram bot status: DISABLED (token missing)"
+else
+  echo ">>> Telegram bot status: ENABLED"
+fi
+
+echo ">>> Verifying API endpoints..."
+sleep 5 # Give it a moment to start
+
+failed=0
+for ep in summary nodes services; do
+  echo "Checking /$ep..."
+  curl -s -f "http://localhost:18180/$ep" > /dev/null && echo "  OK" || { echo "  FAILED"; failed=1; }
+done
+[ "$failed" -eq 0 ] || { echo ">>> Verification failed." >&2; exit 1; }
+echo ">>> Deployment complete."
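Review note: the fixed `sleep 5` before verification could be replaced by polling until the endpoints respond. The sketch below shows one way to do that in Python under stated assumptions: `http_ok` and `wait_for_endpoints` are hypothetical helpers, and the attempt count and delay are arbitrary defaults. The probe is injectable so the loop can be tested without a running web UI.

```python
import time
import urllib.error
import urllib.request


def http_ok(url, timeout=5):
    """Return True if the URL answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_for_endpoints(base_url, paths, attempts=10, delay=1.0,
                       probe=http_ok, sleep=time.sleep):
    """Poll endpoints until all respond or attempts run out.

    Returns the list of paths that never became healthy.
    """
    pending = list(paths)
    for attempt in range(attempts):
        # Keep only the endpoints that still fail on this round.
        pending = [p for p in pending if not probe(f"{base_url}/{p}")]
        if not pending:
            break
        if attempt < attempts - 1:
            sleep(delay)
    return pending


# Demo with a fake probe that fails twice and then succeeds, so no network
# access is needed.
calls = {"n": 0}


def fake_probe(url):
    calls["n"] += 1
    return calls["n"] > 2


still_failing = wait_for_endpoints(
    "http://localhost:18180",
    ["summary", "nodes", "services"],
    probe=fake_probe,
    sleep=lambda seconds: None,
)
print(still_failing)
```

A caller would treat a non-empty return value as a failed deployment and exit non-zero.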
--- a/services/agent-system/docker-compose.yml
+++ b/services/agent-system/docker-compose.yml
@ -0,0 +1,47 @@
+services:
+  redis:
+    image: redis:7
+    container_name: agent-system-redis
+    ports:
+      - "6379:6379"
+    restart: unless-stopped
+
+  webui:
+    build: ./webui
+    container_name: agent-system-webui
+    ports:
+      - "18180:8080"
+    volumes:
+      - /opt/homelab:/opt/homelab
+    depends_on:
+      - redis
+    restart: unless-stopped
+
+  runtime-materializer:
+    build: ./runtime-materializer
+    container_name: agent-system-runtime-materializer
+    environment:
+      REDIS_HOST: redis
+      REDIS_PORT: "6379"
+      HOMELAB_WORLD_ROOT: /opt/homelab/world
+      WORLD_DIR: /opt/homelab/world
+      MATERIALIZE_INTERVAL: "10"
+    volumes:
+      - /opt/homelab:/opt/homelab
+    depends_on:
+      - redis
+    restart: unless-stopped
+
+  telegram-bot:
+    build: ./telegram-bot
+    container_name: agent-system-telegram-bot
+    environment:
+      TELEGRAM_BOT_TOKEN: ${TELEGRAM_BOT_TOKEN}
+      TELEGRAM_ALLOWED_USER_IDS: ${TELEGRAM_ALLOWED_USER_IDS}
+      CONTROL_PLANE_URL: ${CONTROL_PLANE_URL:-http://webui:8080}
+      ENABLE_LLM_FALLBACK: ${ENABLE_LLM_FALLBACK:-false}
+      OPENCLAW_BASE_URL: ${OPENCLAW_BASE_URL}
+      ACTIONS_ROOT: /opt/homelab/actions
+    volumes:
+      - /opt/homelab:/opt/homelab
+    restart: on-failure
--- a/services/agent-system/env.example
+++ b/services/agent-system/env.example
@ -0,0 +1,19 @@
+# Telegram Bot Configuration
+# Get token from @BotFather
+TELEGRAM_BOT_TOKEN=123456789:ABCdefGHIjklMNOpqrsTUVwxyz
+# Comma-separated list of Telegram User IDs
+TELEGRAM_ALLOWED_USER_IDS=12345678,87654321
+# Local control-plane API (default is internal compose address)
+CONTROL_PLANE_URL=http://webui:8080
+# Optional LLM fallback logic
+ENABLE_LLM_FALLBACK=false
+OPENCLAW_BASE_URL=http://openclaw.internal
+
+# Runtime Materializer Configuration
+REDIS_HOST=100.108.208.3
+REDIS_PORT=6379
+
+# Paths
+HOMELAB_ROOT=/opt/homelab
+ACTIONS_ROOT=/opt/homelab/actions
+WORLD_DIR=/opt/homelab/world
--- a/services/agent-system/runtime-materializer/Dockerfile
+++ b/services/agent-system/runtime-materializer/Dockerfile
@ -0,0 +1,16 @@
+FROM python:3.11-slim
+
+WORKDIR /app
+
+# Install the redis Python client used by materializer.py
+RUN pip install --no-cache-dir redis
+
+COPY materializer.py .
+
+# Ensure the world directory exists in the container (though it will likely be a volume)
+RUN mkdir -p /opt/homelab/world
+
+# Use unbuffered output to see logs in docker
+ENV PYTHONUNBUFFERED=1
+
+CMD ["python", "materializer.py"]
--- a/services/agent-system/runtime-materializer/materializer.py
+++ b/services/agent-system/runtime-materializer/materializer.py
@ -0,0 +1,251 @@
+import redis
+import json
+import os
+import time
+import argparse
+import urllib.request
+import urllib.error
+from datetime import datetime
+
+# Configuration from environment variables
+REDIS_HOST = os.environ.get("REDIS_HOST", "redis")
+REDIS_PORT = int(os.environ.get("REDIS_PORT", 6379))
+WORLD_DIR = os.environ.get("WORLD_DIR", "/opt/homelab/world")
+
+# When set, materialize from the control-plane HTTP API instead of Redis.
+# This is the authoritative source of truth: the observer writes clean world
+# state to the control-plane API, which the materializer mirrors locally so
+# the webui's /snapshot (and all other endpoints) reflect the same data.
+#
+# Example: CONTROL_PLANE_URL=http://100.95.58.48:18180
+CONTROL_PLANE_URL = os.environ.get("CONTROL_PLANE_URL", "").rstrip("/")
+
+
+def get_redis_client():
+    """Returns a Redis client with decoding enabled."""
+    return redis.Redis(
+        host=REDIS_HOST,
+        port=REDIS_PORT,
+        decode_responses=True,
+        socket_timeout=5
+    )
+
+def safe_json_loads(data, default=None):
+    """Safely loads JSON from a string."""
+    if not data:
+        return default
+    try:
+        if isinstance(data, (dict, list)):
+            return data
+        return json.loads(data)
+    except (json.JSONDecodeError, TypeError):
+        return data
+
+def normalize_health(health):
+    """Normalizes health values for the UI."""
+    if not health:
+        return "nominal"
+    h = str(health).lower()
+    if h in ["healthy", "ok", "running", "nominal"]:
+        return "nominal"
+    if h in ["degraded", "warning"]:
+        return "degraded"
+    return "error"
+
+
+def _fetch_json(url):
+    """Fetch JSON from a URL, returning parsed data or None on error."""
+    try:
+        with urllib.request.urlopen(url, timeout=10) as resp:
+            return json.loads(resp.read())
+    except Exception as e:
+        print(f"[{datetime.now().isoformat()}] Error fetching {url}: {e}")
+        return None
+
+
+def write_json(filename, data):
+    path = os.path.join(WORLD_DIR, filename)
+    with open(path, "w") as f:
+        json.dump(data, f, indent=2)
+
+
+def materialize_from_api():
+    """Mirror world state from the control-plane API to local world files.
+
+    The control-plane observer on VPS is the single authoritative writer of
+    world state. By fetching from its HTTP API we get the same clean, pruned
+    data that the /summary endpoint serves — no stale Redis artefacts.
+
+    Returns True if all fetches succeeded and files were written, False otherwise.
+    """
+    print(f"[{datetime.now().isoformat()}] Materializing from control-plane API: {CONTROL_PLANE_URL}")
+
+    endpoints = {
+        "nodes.json":          f"{CONTROL_PLANE_URL}/nodes",
+        "services.json":       f"{CONTROL_PLANE_URL}/services",
+        "incidents.json":      f"{CONTROL_PLANE_URL}/incidents",
+        "deployments.json":    f"{CONTROL_PLANE_URL}/deployments",
+        "recommendations.json":f"{CONTROL_PLANE_URL}/recommendations",
+        "runtime-summary.json":f"{CONTROL_PLANE_URL}/summary",
+        "events.json":         f"{CONTROL_PLANE_URL}/events",
+    }
+
+    fetched = {}
+    for filename, url in endpoints.items():
+        data = _fetch_json(url)
+        if data is None:
+            print(f"[{datetime.now().isoformat()}] Aborting: failed to fetch {url}")
+            return False
+        fetched[filename] = data
+
+    os.makedirs(WORLD_DIR, exist_ok=True)
+    for filename, data in fetched.items():
+        write_json(filename, data)
+
+    svc_count = len(fetched.get("services.json") or [])
+    print(f"[{datetime.now().isoformat()}] Materialized from API: {svc_count} services → {WORLD_DIR}")
+    return True
+
+
+def materialize():
+    """Reads state from Redis and writes JSON files to the world directory."""
+    print(f"[{datetime.now().isoformat()}] Materializing world state...")
+    try:
+        r = get_redis_client()
+
+        # 1. Nodes
+        nodes = []
+        node_keys = r.keys("homelab:nodes:*")
+        for key in node_keys:
+            node_data = r.hgetall(key)
+            if node_data:
+                # Normalize health
+                if "health" in node_data:
+                    node_data["health"] = normalize_health(node_data["health"])
+                # Parse JSON fields if they exist
+                if "capabilities" in node_data:
+                    node_data["capabilities"] = safe_json_loads(node_data["capabilities"], [])
+                if "checks" in node_data:
+                    node_data["checks"] = safe_json_loads(node_data["checks"], {})
+                nodes.append(node_data)
+
+        # 2. Services
+        services = []
+        service_keys = r.keys("homelab:services:*")
+        for key in service_keys:
+            svc_data = r.hgetall(key)
+            if svc_data:
+                # Normalize health
+                if "health" in svc_data:
+                    svc_data["health"] = normalize_health(svc_data["health"])
+                if "dependencies" in svc_data:
+                    svc_data["dependencies"] = safe_json_loads(svc_data["dependencies"], [])
+                if "recommendations" in svc_data:
+                    svc_data["recommendations"] = safe_json_loads(svc_data["recommendations"], [])
+                services.append(svc_data)
+
+        # 3. Events (Stream)
+        events = []
+        try:
+            # Get last 100 events from the stream
+            raw_events = r.xrevrange("homelab:events", count=100)
+            for event_id, data in raw_events:
+                event = data.copy()
+                event["id"] = event_id
+                if "details" in event:
+                    event["details"] = safe_json_loads(event["details"], {})
+                events.append(event)
+        except redis.exceptions.ResponseError:
+            # homelab:events might not be a stream or doesn't exist
+            pass
+
+        # 4. Incidents (Hash)
+        incidents = []
+        incident_keys = r.keys("homelab:incidents:*")
+        for key in incident_keys:
+            incident_data = r.hgetall(key)
+            if incident_data:
+                # Normalize health if present
+                if "health" in incident_data:
+                    incident_data["health"] = normalize_health(incident_data["health"])
+                incidents.append(incident_data)
+
+        # 5. Deployments (Hash)
+        deployments = []
+        deployment_keys = r.keys("homelab:deployments:*")
+        for key in deployment_keys:
+            dep_data = r.hgetall(key)
+            if dep_data:
+                deployments.append(dep_data)
+
+        # 6. Recommendations (Hash)
+        recommendations = []
+        recommendation_keys = r.keys("homelab:recommendations:*")
+        for key in recommendation_keys:
+            rec_data = r.hgetall(key)
+            if rec_data:
+                recommendations.append(rec_data)
+
+        # 7. Runtime Summary
+        unhealthy_services = [s for s in services if s.get("health") != "nominal"]
+        active_incidents = [i for i in incidents if i.get("status") not in ["resolved", "closed"]]
+
+        status = "nominal"
+        if len(active_incidents) > 0 or len(unhealthy_services) > 5:
+            status = "error"
+        elif len(unhealthy_services) > 0:
+            status = "degraded"
+
+        summary = {
+            "status": status,
+            "timestamp": datetime.utcnow().isoformat() + "Z",
+            "last_update": int(time.time()),
+            "node_count": len(nodes),
+            "service_count": len(services),
+            "active_incidents_count": len(active_incidents),
+            "unhealthy_services_count": len(unhealthy_services),
+            "incident_count": len(incidents),
+            "recent_events_count": len(events),
+            "stale": False
+        }
+
+        # Ensure directory exists
+        os.makedirs(WORLD_DIR, exist_ok=True)
+
+        write_json("runtime-summary.json", summary)
+        write_json("nodes.json", nodes)
+        write_json("services.json", services)
+        write_json("incidents.json", incidents)
+        write_json("events.json", events)
+        write_json("deployments.json", deployments)
+        write_json("recommendations.json", recommendations)
+
+        print(f"[{datetime.now().isoformat()}] Successfully materialized to {WORLD_DIR}")
+
+    except redis.exceptions.ConnectionError as e:
+        print(f"Redis connection error: {e}")
+    except Exception as e:
+        print(f"Unexpected error during materialization: {e}")
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser(description="Homelab Runtime Materializer")
+    parser.add_argument("--once", action="store_true", help="Run once and exit")
+    parser.add_argument("--interval", type=int, default=30, help="Sleep interval between runs (seconds)")
+    args = parser.parse_args()
+
+    if CONTROL_PLANE_URL:
+        print(f"Mode: control-plane API ({CONTROL_PLANE_URL})")
+        run_fn = materialize_from_api
+    else:
+        print(f"Mode: Redis ({REDIS_HOST}:{REDIS_PORT})")
+        run_fn = materialize
+
+    interval = int(os.environ.get("MATERIALIZE_INTERVAL", args.interval))
+
+    if args.once:
+        run_fn()
+    else:
+        print(f"Starting materializer loop (interval: {interval}s)...")
+        while True:
+            run_fn()
+            time.sleep(interval)
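Review note: `write_json` in the materializer writes directly to the target path, so a reader such as the web UI could see a partially written file. The observer already uses `_atomic_write_json` for the same world files. The following is a minimal sketch of an atomic variant, assuming readers open files by their final name only. `write_json_atomic` is a hypothetical name and the `0o644` mode is an assumption about who needs to read the files.

```python
import json
import os
import tempfile


def write_json_atomic(world_dir, filename, data):
    """Serialize to a temp file in the same directory, then rename into place."""
    path = os.path.join(world_dir, filename)
    fd, tmp_path = tempfile.mkstemp(
        dir=world_dir, prefix=f".{filename}.", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(data, f, indent=2)
            f.flush()
            os.fsync(f.fileno())
        # mkstemp creates the file with mode 0600; widen it so containers
        # running as a different user can still read the result (assumption).
        os.chmod(tmp_path, 0o644)
        os.replace(tmp_path, path)
    except BaseException:
        # Serialization or the rename failed: drop the temp file and leave
        # any existing target untouched.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return path


# Demo in a throwaway directory standing in for /opt/homelab/world.
world_dir = tempfile.mkdtemp()
target = write_json_atomic(world_dir, "runtime-summary.json", {"status": "nominal"})
with open(target) as f:
    print(json.load(f))
```

The temp file lives in the same directory as the target because `os.replace` is only atomic within one filesystem.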
--- a/services/agent-system/scripts/create-test-action.sh
+++ b/services/agent-system/scripts/create-test-action.sh
@ -0,0 +1,39 @@
+#!/bin/bash
+# Script to create a test pending action for Telegram bot verification.
+
+ACTIONS_PENDING_DIR=${ACTIONS_ROOT:-/opt/homelab/actions}/pending
+mkdir -p "$ACTIONS_PENDING_DIR"
+
+ACTION_ID="test-$(date +%s)"
+FILE_PATH="$ACTIONS_PENDING_DIR/$ACTION_ID.json"
+
+TIMESTAMP=$(date +%s)
+
+cat <<EOF > "$FILE_PATH"
+{
+  "action_id": "$ACTION_ID",
+  "service": "frigate",
+  "node": "chelsty",
+  "type": "deploy_service",
+  "risk": "guarded",
+  "status": "pending",
+  "created_at": $TIMESTAMP,
+  "updated_at": $TIMESTAMP,
+  "details": {
+    "image": "blakeblackshear/frigate:0.13.0",
+    "reason": "Security update for Frigate",
+    "diff": "image: blakeblackshear/frigate:0.12.0 -> 0.13.0"
+  },
+  "transition_history": [
+    {
+      "from": null,
+      "to": "pending",
+      "timestamp": $TIMESTAMP,
+      "by": "system-test"
+    }
+  ]
+}
+EOF
+
+echo "Test action created: $FILE_PATH"
+echo "If the telegram-bot is running and configured, you should receive a notification."
--- a/services/agent-system/telegram-bot/Dockerfile
+++ b/services/agent-system/telegram-bot/Dockerfile
@ -0,0 +1,10 @@
+FROM python:3.11-slim
+
+WORKDIR /app
+
+COPY requirements.txt .
+RUN pip install --no-cache-dir -r requirements.txt
+
+COPY bot.py .
+
+CMD ["python", "bot.py"]
--- a/services/agent-system/telegram-bot/bot.py
+++ b/services/agent-system/telegram-bot/bot.py
@ -0,0 +1,454 @@
+import os
+import json
+import time
+import asyncio
+import logging
+import urllib.request
+import urllib.error
+from pathlib import Path
+from telegram import Update, InlineKeyboardButton, InlineKeyboardMarkup
+from telegram.ext import ApplicationBuilder, ContextTypes, CommandHandler, CallbackQueryHandler, MessageHandler, filters
+
+# Setup logging
+logging.basicConfig(
+    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
+    level=logging.INFO
+)
+logger = logging.getLogger(__name__)
+
+# Configuration
+TOKEN = os.getenv("TELEGRAM_BOT_TOKEN")
+ALLOWED_IDS = [int(i.strip()) for i in os.getenv("TELEGRAM_ALLOWED_USER_IDS", "").split(",") if i.strip()]
+ACTIONS_ROOT = Path(os.getenv("ACTIONS_ROOT", "/opt/homelab/actions"))
+CONTROL_PLANE_URL = os.getenv("CONTROL_PLANE_URL", "http://webui:8080")
+ENABLE_LLM_FALLBACK = os.getenv("ENABLE_LLM_FALLBACK", "false").lower() == "true"
+OPENCLAW_BASE_URL = os.getenv("OPENCLAW_BASE_URL")
+
+async def fetch_api(path):
+    """Helper to fetch JSON from the Control Plane API."""
+    url = f"{CONTROL_PLANE_URL.rstrip('/')}/{path.lstrip('/')}"
+    try:
+        def do_request():
+            req = urllib.request.Request(url)
+            with urllib.request.urlopen(req, timeout=5) as response:
+                if response.status != 200:
+                    return None
+                return json.loads(response.read().decode())
+        return await asyncio.to_thread(do_request)
+    except Exception as e:
+        logger.error(f"Error fetching {url}: {e}")
+        return None
+
+async def post_api(path, data):
+    """Helper to POST JSON to the Control Plane API."""
+    url = f"{CONTROL_PLANE_URL.rstrip('/')}/{path.lstrip('/')}"
+    try:
+        body = json.dumps(data).encode("utf-8")
+        def do_request():
+            req = urllib.request.Request(url, data=body, method="POST")
+            req.add_header("Content-Type", "application/json")
+            with urllib.request.urlopen(req, timeout=5) as response:
+                return response.status == 200
+        return await asyncio.to_thread(do_request)
+    except Exception as e:
+        logger.error(f"Error posting to {url}: {e}")
+        return False
+
+def _format_pending_action(action_id: str, data: dict) -> str:
+    """Build the Telegram Markdown message for a pending action notification.
+
+    Extracted so it can be unit-tested without a live Telegram connection.
+    """
+    # Supervisor writes risk_level; action-model.md legacy schema used risk.
+    risk = data.get("risk_level") or data.get("risk", "unknown")
+    message = (
+        f"⚠️ *Pending Action*\n"
+        f"ID: `{action_id}`\n"
+        f"Type: `{data.get('type', 'unknown')}`\n"
+        f"Service: `{data.get('service', 'unknown')}`\n"
+        f"Node: `{data.get('node', 'unknown')}`\n"
+        f"Risk: *{risk}*\n"
+    )
+    # description carries the human-readable substance of the action (required for
+    # alert_only actions where it is the entire operator-visible message).
+    description = data.get("description", "")
+    if description:
+        truncated = description[:300] + ("..." if len(description) > 300 else "")
+        message += f"Description: `{truncated}`\n"
+    # Legacy details block (old action-model.md schema) — kept for backwards compat.
+    if "details" in data:
+        details_str = json.dumps(data["details"], indent=2)
+        if len(details_str) > 1000:
+            details_str = details_str[:1000] + "..."
+        message += f"\nDetails:\n```json\n{details_str}\n```"
+    return message
+
+
+class ApprovalBot:
+    def __init__(self):
+        self.pending_dir = ACTIONS_ROOT / "pending"
+        self.approved_dir = ACTIONS_ROOT / "approved"
+        self.rejected_dir = ACTIONS_ROOT / "rejected"
+        # Track which action IDs we have already notified in this session to avoid spam
+        self.notified_actions = set()
+
+    async def check_pending_actions(self, context: ContextTypes.DEFAULT_TYPE):
+        """Job that periodically checks for new pending action files."""
+        if not self.pending_dir.exists():
+            return
+
+        try:
+            for action_file in self.pending_dir.glob("*.json"):
+                action_id = action_file.stem
+                if action_id in self.notified_actions:
+                    continue
+
+                try:
+                    data = json.loads(action_file.read_text())
+                    # Only notify if it's truly pending
+                    if data.get("status") == "pending":
+                        await self.notify_users(context, action_id, data)
+                        self.notified_actions.add(action_id)
+                except Exception as e:
+                    logger.error(f"Error processing action file {action_file}: {e}")
+        except Exception as e:
+            logger.error(f"Error scanning pending directory: {e}")
+
+    async def notify_users(self, context: ContextTypes.DEFAULT_TYPE, action_id: str, data: dict):
+        """Sends an approval request message to all allowed users."""
+        message = _format_pending_action(action_id, data)
+
+        keyboard = [
+            [
+                InlineKeyboardButton("✅ Approve", callback_data=f"approve:{action_id}"),
+                InlineKeyboardButton("❌ Reject", callback_data=f"reject:{action_id}"),
+            ]
+        ]
+        reply_markup = InlineKeyboardMarkup(keyboard)
+
+        for user_id in ALLOWED_IDS:
+            try:
+                await context.bot.send_message(
+                    chat_id=user_id,
+                    text=message,
+                    parse_mode="Markdown",
+                    reply_markup=reply_markup
+                )
+                logger.info(f"Notified user {user_id} about action {action_id}")
+            except Exception as e:
+                logger.error(f"Failed to notify user {user_id}: {e}")
+
+    async def handle_callback(self, update: Update, context: ContextTypes.DEFAULT_TYPE):
+        """Handles button clicks for Approve/Reject."""
+        query = update.callback_query
+        user_id = query.from_user.id
+
+        if user_id not in ALLOWED_IDS:
+            await query.answer("Unauthorized", show_alert=True)
+            return
+
+        await query.answer()
+
+        cb_data = query.data
+        if ":" not in cb_data:
+            return
+
+        action, action_id = cb_data.split(":", 1)
+        target_status = "approved" if action == "approve" else "rejected"
+
+        # Use the API for the mutation if available; fall back to a local disk move otherwise
+        success = await post_api("/action/mutate", {"id": action_id, "status": target_status})
+        msg = "Success" if success else "API call failed"
+
+        if not success:
+            # Fallback to direct disk manipulation (original behavior)
+            success, msg = self.move_action(action_id, target_status, user_id, query.from_user.username or str(user_id))
+
+        if success:
+            status_text = "✅ Approved" if target_status == "approved" else "❌ Rejected"
+            await query.edit_message_text(
+                text=query.message.text_markdown + f"\n\n{status_text} by {query.from_user.first_name}",
+                parse_mode="Markdown"
+            )
+            # Remove from notified list as it's no longer pending
+            if action_id in self.notified_actions:
+                self.notified_actions.remove(action_id)
+        else:
+            await query.message.reply_text(f"Failed to process action {action_id}: {msg}")
+
+    def move_action(self, action_id, target_status, user_id, username):
+        """Moves action file and updates its status and history."""
+        source_path = self.pending_dir / f"{action_id}.json"
+        if not source_path.exists():
+            return False, "Action file no longer exists in pending."
+
+        target_dir = self.approved_dir if target_status == "approved" else self.rejected_dir
+        target_dir.mkdir(parents=True, exist_ok=True)
+        target_path = target_dir / f"{action_id}.json"
+
+        try:
+            data = json.loads(source_path.read_text())
+            current_status = data.get("status", "pending")
+
+            # Update data
+            data["status"] = target_status
+            data["updated_at"] = time.time()
+
+            history = data.get("transition_history", [])
+            history.append({
+                "from": current_status,
+                "to": target_status,
+                "timestamp": time.time(),
+                "by": f"tg:{username}"
+            })
+            data["transition_history"] = history
+
+            # Write the updated file to its new location first, then delete the original (not atomic)
+            target_path.write_text(json.dumps(data, indent=2))
+            source_path.unlink()
+            logger.info(f"Action {action_id} moved from {current_status} to {target_status} by {username}")
+            return True, "Success"
+        except Exception as e:
+            logger.error(f"Error moving action file: {e}")
+            return False, str(e)
+
+async def start_command(update: Update, context: ContextTypes.DEFAULT_TYPE):
+    """Simple start command to help users find their ID."""
+    user = update.effective_user
+    message = (
+        f"Hello {user.first_name}! 🤖\n"
+        f"Your Telegram User ID is: `{user.id}`\n\n"
+    )
+    if user.id in ALLOWED_IDS:
+        message += "✅ You are authorized to manage the homelab.\n\n"
+        message += "Use /help to see available commands."
+    else:
+        message += "❌ You are NOT authorized. Add your ID to `TELEGRAM_ALLOWED_USER_IDS`."
+
+    await update.message.reply_text(message, parse_mode="Markdown")
+
+async def status_command(update: Update, context: ContextTypes.DEFAULT_TYPE):
+    if update.effective_user.id not in ALLOWED_IDS: return
+    res = await fetch_api("/summary")
+    status = "✅ Online" if res else "❌ Unreachable"
+    message = (
+        f"🤖 *Telegram Bot Status*\n"
+        f"Control Plane API: {status}\n"
+        f"Target URL: `{CONTROL_PLANE_URL}`\n"
+    )
+    await update.message.reply_text(message, parse_mode="Markdown")
+
+async def summary_command(update: Update, context: ContextTypes.DEFAULT_TYPE):
+    if update.effective_user.id not in ALLOWED_IDS: return
+    data = await fetch_api("/summary")
+    if not data:
+        await update.message.reply_text("❌ Failed to fetch summary from Control Plane.")
+        return
+
+    msg = "📊 *System Summary*\n"
+    msg += f"Status: `{data.get('status', 'unknown')}`\n"
+    msg += f"Nodes: {data.get('node_count', 0)}\n"
+    msg += f"Services: {data.get('service_count', 0)}\n"
+    msg += f"Active Incidents: {data.get('active_incidents_count', 0)}\n"
+    if data.get('stale'):
+        msg += "\n⚠️ *Warning: Data is stale!*"
+
+    await update.message.reply_text(msg, parse_mode="Markdown")
+
+async def nodes_command(update: Update, context: ContextTypes.DEFAULT_TYPE):
+    if update.effective_user.id not in ALLOWED_IDS: return
+    nodes = await fetch_api("/nodes")
+    if nodes is None:
+        await update.message.reply_text("❌ Failed to fetch nodes.")
+        return
+
+    if not nodes:
+        await update.message.reply_text("No nodes discovered in the fleet.")
+        return
+
+    msg = "🖥️ *Nodes Status*\n"
+    for node in nodes:
+        health_icon = "✅" if node.get('health') == 'nominal' else "⚠️" if node.get('health') == 'degraded' else "❌"
+        msg += f"{health_icon} *{node.get('hostname')}*: `{node.get('status', 'unknown')}`\n"
+        msg += f"   Last seen: {node.get('last_seen', 'N/A')}\n"
+
+    await update.message.reply_text(msg, parse_mode="Markdown")
+
+async def services_command(update: Update, context: ContextTypes.DEFAULT_TYPE):
+    if update.effective_user.id not in ALLOWED_IDS: return
+    services = await fetch_api("/services")
+    if services is None:
+        await update.message.reply_text("❌ Failed to fetch services.")
+        return
+
+    # Summarize by node
+    nodes = {}
+    for s in services:
+        node = s.get("node", "unknown")
+        if node not in nodes: nodes[node] = []
+        nodes[node].append(s)
+
+    msg = "⚙️ *Services Summary*\n"
+    if not nodes:
+        msg += "No services discovered."
+    else:
+        for node, svc_list in sorted(nodes.items()):
+            nominal = len([s for s in svc_list if s.get("health") == "nominal"])
+            msg += f"• *{node}*: {nominal}/{len(svc_list)} nominal\n"
+
+    msg += "\nUse /unhealthy to see issues."
+    await update.message.reply_text(msg, parse_mode="Markdown")
+
+async def unhealthy_command(update: Update, context: ContextTypes.DEFAULT_TYPE):
+    if update.effective_user.id not in ALLOWED_IDS: return
+    services = await fetch_api("/services")
+    nodes = await fetch_api("/nodes")
+
+    msg = "⚠️ *Unhealthy Components*\n"
+    found = False
+
+    if services:
+        for s in services:
+            health = s.get("health", "").lower()
+            if health != "nominal":
+                msg += f"• Service *{s.get('name')}* on *{s.get('node')}*: `{health}`\n"
+                found = True
+
+    if nodes:
+        for n in nodes:
+            checks = n.get("checks", {})
+            if isinstance(checks, str):
+                try: checks = json.loads(checks)
+                except ValueError: checks = {}
+
+            docker = checks.get("docker", {})
+            if docker.get("status") == "ok":
+                for c in docker.get("containers", []):
+                    if c.get("state") != "running":
+                        msg += f"• Container *{c.get('name')}* on *{n.get('hostname')}*: `{c.get('state')}`\n"
+                        found = True
+
+    if not found:
+        msg += "All systems nominal. ✅"
+
+    await update.message.reply_text(msg, parse_mode="Markdown")
+
+async def incidents_command(update: Update, context: ContextTypes.DEFAULT_TYPE):
+    if update.effective_user.id not in ALLOWED_IDS: return
+    incidents = await fetch_api("/incidents")
+    if incidents is None:
+        await update.message.reply_text("❌ Failed to fetch incidents.")
+        return
+
+    active = [i for i in incidents if i.get("status") not in ("resolved", "closed")]
+    if not active:
+        await update.message.reply_text("No active incidents. ✅")
+        return
+
+    msg = "🚨 *Active Incidents*\n"
+    for inc in active:
+        severity = inc.get('severity', 'info').upper()
+        msg += f"• [{severity}] *{inc.get('type')}*: {inc.get('message')}\n"
+
+    await update.message.reply_text(msg, parse_mode="Markdown")
+
+async def actions_command(update: Update, context: ContextTypes.DEFAULT_TYPE):
+    if update.effective_user.id not in ALLOWED_IDS: return
+    actions = await fetch_api("/actions")
+    if actions is None:
+        await update.message.reply_text("❌ Actions endpoint unavailable.")
+        return
+
+    msg = "⚡ *Actions Summary*\n"
+    total = 0
+    for status, act_list in actions.items():
+        if act_list:
+            msg += f"• {status.capitalize()}: {len(act_list)}\n"
+            total += len(act_list)
+
+    if total == 0:
+        msg = "No actions recorded."
+
+    await update.message.reply_text(msg, parse_mode="Markdown")
+
+async def help_command(update: Update, context: ContextTypes.DEFAULT_TYPE):
+    msg = (
+        "📖 *Supported Commands*\n\n"
+        "/status - Check bot and API connectivity\n"
+        "/summary - System health overview\n"
+        "/nodes - List homelab nodes and their status\n"
+        "/services - Summary of services across nodes\n"
+        "/unhealthy - List all unhealthy components\n"
+        "/incidents - View active incidents\n"
+        "/actions - Summary of operator actions\n"
+        "/help - Show this help message\n\n"
+        "Free text will be handled by the guidance system."
+    )
+    await update.message.reply_text(msg, parse_mode="Markdown")
+
+async def handle_fallback(update: Update, context: ContextTypes.DEFAULT_TYPE):
+    """Handles non-command messages."""
+    if update.effective_user.id not in ALLOWED_IDS: return
+
+    if ENABLE_LLM_FALLBACK and OPENCLAW_BASE_URL:
+        # Placeholder for OpenClaw LLM fallback
+        # In a real scenario, this would call the LLM API
+        logger.info(f"LLM fallback requested for: {update.message.text}")
+
+    await update.message.reply_text(
+        "Use /summary, /nodes, /services, /unhealthy, /incidents, /actions."
+    )
+
+async def run_bot():
+    if not TOKEN:
+        logger.critical("TELEGRAM_BOT_TOKEN is not set. Telegram bot will not start.")
+        # Return without raising so the process exits cleanly when the token is absent.
+        # Requirement: "do not fail if Telegram token is absent, but telegram-bot should be disabled or exit cleanly"
+        return
+
+    bot_logic = ApprovalBot()
+
+    application = ApplicationBuilder().token(TOKEN).build()
+
+    application.add_handler(CommandHandler("start", start_command))
+    application.add_handler(CommandHandler("status", status_command))
+    application.add_handler(CommandHandler("summary", summary_command))
+    application.add_handler(CommandHandler("nodes", nodes_command))
+    application.add_handler(CommandHandler("services", services_command))
+    application.add_handler(CommandHandler("unhealthy", unhealthy_command))
+    application.add_handler(CommandHandler("incidents", incidents_command))
+    application.add_handler(CommandHandler("actions", actions_command))
+    application.add_handler(CommandHandler("help", help_command))
+
+    application.add_handler(MessageHandler(filters.TEXT & (~filters.COMMAND), handle_fallback))
+    application.add_handler(CallbackQueryHandler(bot_logic.handle_callback))
+
+    # Schedule the pending actions check
+    job_queue = application.job_queue
+    if job_queue:
+        job_queue.run_repeating(bot_logic.check_pending_actions, interval=10, first=5)
+    else:
+        logger.warning("JobQueue is not available. Periodic pending actions check will be skipped.")
+
+    logger.info("Starting Telegram Approval Bot...")
+    await application.initialize()
+    await application.start()
+    await application.updater.start_polling()
+
+    # Run until the application is stopped
+    stop_event = asyncio.Event()
+    try:
+        await stop_event.wait()
+    except (KeyboardInterrupt, SystemExit):
+        logger.info("Stopping bot...")
+    finally:
+        await application.stop()
+        await application.shutdown()
+
+if __name__ == "__main__":
+    try:
+        asyncio.run(run_bot())
+    except KeyboardInterrupt:
+        pass
+    except Exception as e:
+        logger.error(f"Fatal error: {e}")
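The `move_action` fallback in bot.py above writes the target file and then unlinks the source, so anything watching the target directory could briefly see a partially written file. A minimal sketch of a stricter variant follows, assuming the source and target directories live on the same filesystem; `move_json_atomically` is a hypothetical helper name, not part of this commit.

```python
import json
import os
import tempfile
from pathlib import Path


def move_json_atomically(source: Path, target: Path, data: dict) -> None:
    """Write data to a temp file beside target, rename it into place, then drop source.

    Hypothetical helper for illustration; not part of the commit.
    """
    target.parent.mkdir(parents=True, exist_ok=True)
    # The temp file lives in the target directory so os.replace never crosses
    # filesystems. The ".tmp" suffix keeps it out of any "*.json" glob.
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(data, handle, indent=2)
            handle.flush()
            os.fsync(handle.fileno())
        # Readers see either no target file or the complete one.
        os.replace(tmp_name, target)
    except BaseException:
        # Remove the partial temp file before re-raising.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    source.unlink(missing_ok=True)
```

The source is removed only after the rename succeeds, so a crash in between leaves the action in both directories rather than in neither.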
--- a/services/agent-system/telegram-bot/requirements.txt
+++ b/services/agent-system/telegram-bot/requirements.txt
@ -0,0 +1 @@
+python-telegram-bot[job-queue]==20.7
--- a/services/agent-system/telegram-bot/tests/__init__.py
+++ b/services/agent-system/telegram-bot/tests/__init__.py
--- a/services/agent-system/telegram-bot/tests/conftest.py
+++ b/services/agent-system/telegram-bot/tests/conftest.py
@ -0,0 +1,38 @@
+"""Stub telegram before bot.py is imported so pytest doesn't need the real package."""
+from __future__ import annotations
+
+import sys
+import types
+from unittest.mock import MagicMock
+
+
+def _make_telegram_stub() -> types.ModuleType:
+    mod = types.ModuleType("telegram")
+    mod.Update = MagicMock
+    mod.InlineKeyboardButton = MagicMock
+    mod.InlineKeyboardMarkup = MagicMock
+    return mod
+
+
+def _make_telegram_ext_stub() -> types.ModuleType:
+    mod = types.ModuleType("telegram.ext")
+    mod.ApplicationBuilder = MagicMock
+
+    # ContextTypes.DEFAULT_TYPE is referenced as a type annotation at class-body
+    # evaluation time, so it must be a real attribute, not a dynamic MagicMock attr.
+    ContextTypesMock = MagicMock()
+    ContextTypesMock.DEFAULT_TYPE = type(None)
+    mod.ContextTypes = ContextTypesMock
+
+    mod.CommandHandler = MagicMock
+    mod.CallbackQueryHandler = MagicMock
+    mod.MessageHandler = MagicMock
+    mod.filters = MagicMock()
+    return mod
+
+
+# Insert before any import of bot.py
+if "telegram" not in sys.modules:
+    sys.modules["telegram"] = _make_telegram_stub()
+if "telegram.ext" not in sys.modules:
+    sys.modules["telegram.ext"] = _make_telegram_ext_stub()
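The conftest.py above relies on Python consulting `sys.modules` before searching for a package on disk. The same technique can be sketched in isolation as follows; `install_stub` and the package name `fake_sdk` are hypothetical and chosen only for illustration.

```python
import sys
import types
from unittest.mock import MagicMock


def install_stub(name: str, **attrs) -> types.ModuleType:
    """Register a fake module under `name` unless one is already loaded."""
    if name in sys.modules:
        return sys.modules[name]
    mod = types.ModuleType(name)
    for key, value in attrs.items():
        setattr(mod, key, value)
    sys.modules[name] = mod
    return mod


# "fake_sdk" is a hypothetical package that is assumed not to be installed.
install_stub("fake_sdk", Client=MagicMock)
install_stub("fake_sdk.ext", Handler=MagicMock)

# These imports resolve to the stubs because the import system checks
# sys.modules before looking for a real package.
from fake_sdk import Client
from fake_sdk.ext import Handler
```

Checking `sys.modules` first means that a real installation, if already imported, is left untouched, which matches the guard used in conftest.py.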
--- a/services/agent-system/telegram-bot/tests/test_format.py
+++ b/services/agent-system/telegram-bot/tests/test_format.py
@ -0,0 +1,116 @@
+"""Tests for _format_pending_action — no Telegram connection required.
+
+telegram stubs are set up in conftest.py before this module is imported.
+"""
+from __future__ import annotations
+
+import sys
+from pathlib import Path
+
+import pytest
+
+sys.path.insert(0, str(Path(__file__).parent.parent))
+from bot import _format_pending_action
+
+
+# ---------------------------------------------------------------------------
+# Bug 1 — risk_level field
+# ---------------------------------------------------------------------------
+
+def test_risk_level_shown_when_present():
+    data = {
+        "type": "container_restart", "service": "homeassistant",
+        "node": "chelsty-ha", "risk_level": "low",
+    }
+    msg = _format_pending_action("container-restart-chelsty-ha-homeassistant", data)
+    assert "Risk: *low*" in msg
+    assert "unknown" not in msg
+
+
+def test_risk_falls_back_to_legacy_risk_key():
+    data = {
+        "type": "redeploy", "service": "mosquitto",
+        "node": "chelsty-infra", "risk": "guarded",
+    }
+    msg = _format_pending_action("redeploy-chelsty-infra-mosquitto", data)
+    assert "Risk: *guarded*" in msg
+
+
+def test_risk_unknown_when_both_absent():
+    data = {"type": "redeploy", "service": "foo", "node": "bar"}
+    msg = _format_pending_action("redeploy-bar-foo", data)
+    assert "Risk: *unknown*" in msg
+
+
+# ---------------------------------------------------------------------------
+# Bug 2 — description field
+# ---------------------------------------------------------------------------
+
+def test_description_shown_for_alert_only():
+    data = {
+        "type": "alert_only", "service": "homeassistant",
+        "node": "chelsty-ha", "risk_level": "info",
+        "description": "3 entities unavailable for >1h",
+    }
+    msg = _format_pending_action("alert-ha-entity-unavailable-chelsty-ha", data)
+    assert "3 entities unavailable for >1h" in msg
+    assert "Description:" in msg
+
+
+def test_description_shown_for_container_restart():
+    data = {
+        "type": "container_restart", "service": "homeassistant",
+        "node": "chelsty-ha", "risk_level": "low",
+        "description": "Restart 'homeassistant' on chelsty-ha: HA WebSocket unresponsive",
+    }
+    msg = _format_pending_action("container-restart-chelsty-ha-homeassistant", data)
+    assert "HA WebSocket unresponsive" in msg
+
+
+def test_description_absent_no_crash():
+    data = {"type": "redeploy", "service": "foo", "node": "bar", "risk_level": "guarded"}
+    msg = _format_pending_action("redeploy-bar-foo", data)
+    assert "Description:" not in msg
+    assert "Risk: *guarded*" in msg
+
+
+def test_description_truncated_at_300_chars():
+    long_desc = "x" * 400
+    data = {
+        "type": "alert_only", "service": "homeassistant",
+        "node": "chelsty-ha", "risk_level": "info",
+        "description": long_desc,
+    }
+    msg = _format_pending_action("alert-ha-foo-chelsty-ha", data)
+    assert "x" * 300 in msg
+    assert "..." in msg
+    assert "x" * 301 not in msg
+
+
+# ---------------------------------------------------------------------------
+# Combined — real HA alert_only action shape
+# ---------------------------------------------------------------------------
+
+def test_ha_alert_only_full_action():
+    """Mirrors an actual alert_only action written by supervisor._generate_ha_alert_only."""
+    data = {
+        "action_id": "alert-ha-entity-unavailable-chelsty-ha",
+        "type": "alert_only",
+        "node": "chelsty-ha",
+        "service": "homeassistant",
+        "risk_level": "info",
+        "confidence": 1.0,
+        "description": "3 entities unavailable for >1h: sensor.power, binary_sensor.window",
+        "status": "pending",
+        "payload": {
+            "location_tag": "chelsty",
+            "reason": "ha_entity_unavailable_long",
+            "count": 3,
+        },
+    }
+    msg = _format_pending_action(data["action_id"], data)
+    assert "alert_only" in msg
+    assert "chelsty-ha" in msg
+    assert "Risk: *info*" in msg
+    assert "3 entities unavailable" in msg
+    assert "unknown" not in msg
--- a/services/agent-system/webui/Dockerfile
+++ b/services/agent-system/webui/Dockerfile
@ -0,0 +1,7 @@
+FROM python:3.11-slim
+
+WORKDIR /app
+COPY web.py index.html ./
+
+EXPOSE 8080
+CMD ["python", "web.py"]
--- a/services/agent-system/webui/index.html
+++ b/services/agent-system/webui/index.html
@ -0,0 +1,769 @@
+<!doctype html>
+<html lang="en">
+<head>
+  <meta charset="utf-8">
+  <meta name="viewport" content="width=device-width, initial-scale=1">
+  <title>Operator Control Plane</title>
+  <style>
+    :root {
+      --bg-color: #0a0c0e;
+      --sidebar-color: #14171a;
+      --card-color: #1c2024;
+      --border-color: #2a3540;
+      --text-color: #e7edf3;
+      --text-muted: #94a3b8;
+      --accent-color: #3eaf7c;
+      --nominal: #3eaf7c;
+      --degraded: #e7c000;
+      --unstable: #e67e22;
+      --reconciling: #3498db;
+      --error: #c0392b;
+      --safe: #3eaf7c;
+      --guarded: #e67e22;
+      --dangerous: #c0392b;
+    }
+
+    body {
+      margin: 0;
+      font-family: 'Inter', system-ui, -apple-system, sans-serif;
+      background: var(--bg-color);
+      color: var(--text-color);
+      display: flex;
+      height: 100vh;
+      overflow: hidden;
+    }
+
+    /* Sidebar */
+    .sidebar {
+      width: 240px;
+      background: var(--sidebar-color);
+      border-right: 1px solid var(--border-color);
+      display: flex;
+      flex-direction: column;
+      flex-shrink: 0;
+    }
+
+    .sidebar-header {
+      padding: 24px;
+      font-weight: 800;
+      font-size: 14px;
+      letter-spacing: 0.1em;
+      color: var(--accent-color);
+      border-bottom: 1px solid var(--border-color);
+    }
+
+    .nav-list {
+      list-style: none;
+      padding: 12px 0;
+      margin: 0;
+      flex-grow: 1;
+    }
+
+    .nav-item {
+      padding: 12px 24px;
+      cursor: pointer;
+      font-size: 14px;
+      color: var(--text-muted);
+      transition: all 0.2s;
+      display: flex;
+      align-items: center;
+      gap: 12px;
+    }
+
+    .nav-item:hover {
+      background: rgba(255, 255, 255, 0.05);
+      color: var(--text-color);
+    }
+
+    .nav-item.active {
+      background: rgba(62, 175, 124, 0.1);
+      color: var(--accent-color);
+      border-left: 3px solid var(--accent-color);
+    }
+
+    .sidebar-footer {
+      padding: 16px;
+      border-top: 1px solid var(--border-color);
+      font-size: 12px;
+    }
+
+    /* Content Area */
+    .main-content {
+      flex-grow: 1;
+      display: flex;
+      flex-direction: column;
+      overflow: hidden;
+    }
+
+    header {
+      height: 64px;
+      border-bottom: 1px solid var(--border-color);
+      display: flex;
+      align-items: center;
+      padding: 0 24px;
+      justify-content: space-between;
+      background: var(--bg-color);
+    }
+
+    .view-title {
+      font-size: 18px;
+      font-weight: 600;
+    }
+
+    .content-scroll {
+      flex-grow: 1;
+      overflow-y: auto;
+      padding: 24px;
+    }
+
+    /* Cards & Grids */
+    .grid {
+      display: grid;
+      grid-template-columns: repeat(auto-fill, minmax(350px, 1fr));
+      gap: 20px;
+    }
+
+    .card {
+      background: var(--card-color);
+      border: 1px solid var(--border-color);
+      padding: 20px;
+      border-radius: 4px;
+      position: relative;
+    }
+
+    .card-header {
+      display: flex;
+      justify-content: space-between;
+      align-items: center;
+      margin-bottom: 16px;
+    }
+
+    .card-title {
+      font-weight: 700;
+      font-size: 16px;
+    }
+
+    /* Status Badges */
+    .badge {
+      padding: 4px 8px;
+      border-radius: 4px;
+      font-size: 11px;
+      font-weight: 700;
+      text-transform: uppercase;
+    }
+
+    .status-nominal { background: rgba(62, 175, 124, 0.1); color: var(--nominal); }
+    .status-degraded { background: rgba(231, 192, 0, 0.1); color: var(--degraded); }
+    .status-unstable { background: rgba(230, 126, 34, 0.1); color: var(--unstable); }
+    .status-reconciling { background: rgba(52, 152, 219, 0.1); color: var(--reconciling); }
+    .status-error { background: rgba(192, 57, 43, 0.1); color: var(--error); }
+
+    /* Timeline */
+    .timeline {
+      display: flex;
+      flex-direction: column;
+      gap: 12px;
+    }
+
+    .event {
+      padding: 12px;
+      border-left: 2px solid var(--border-color);
+      background: rgba(255, 255, 255, 0.02);
+      font-family: ui-monospace, monospace;
+      font-size: 13px;
+    }
+
+    .event.high { border-left-color: var(--error); }
+    .event.medium { border-left-color: var(--unstable); }
+    .event.low { border-left-color: var(--nominal); }
+
+    .event-header {
+      display: flex;
+      justify-content: space-between;
+      margin-bottom: 4px;
+      color: var(--text-muted);
+    }
+
+    /* Forms & Inputs */
+    .controls {
+      display: flex;
+      gap: 12px;
+      margin-top: 20px;
+    }
+
+    input, button {
+      background: var(--card-color);
+      border: 1px solid var(--border-color);
+      color: var(--text-color);
+      padding: 8px 16px;
+      font-size: 14px;
+      border-radius: 4px;
+    }
+
+    button {
+      cursor: pointer;
+      font-weight: 600;
+    }
+
+    button:hover { background: var(--border-color); }
+
+    .btn-primary { background: var(--accent-color); color: white; border: none; }
+    .btn-primary:hover { background: #359b6d; }
+
+    /* Utility */
+    .hidden { display: none !important; }
+    .mono { font-family: ui-monospace, monospace; }
+    .label { color: var(--text-muted); font-size: 12px; margin-bottom: 4px; }
+    .value { font-weight: 500; margin-bottom: 12px; }
+
+    .risk-safe { background: rgba(62, 175, 124, 0.1); color: var(--safe); }
+    .risk-guarded { background: rgba(230, 126, 34, 0.1); color: var(--guarded); }
+    .risk-dangerous { background: rgba(192, 57, 43, 0.1); color: var(--dangerous); }
+
+  </style>
+</head>
+<body>
+  <aside class="sidebar">
+    <div class="sidebar-header">HOMELAB OPERATOR</div>
+    <ul class="nav-list">
+      <li class="nav-item active" onclick="showView('dashboard', this)">
+        <span>Dashboard</span>
+      </li>
+      <li class="nav-item" onclick="showView('actions', this)">
+        <span>Action Queue</span>
+      </li>
+      <li class="nav-item" onclick="showView('nodes', this)">
+        <span>Nodes</span>
+      </li>
+      <li class="nav-item" onclick="showView('services', this)">
+        <span>Services</span>
+      </li>
+      <li class="nav-item" onclick="showView('deployments', this)">
+        <span>Deployments</span>
+      </li>
+      <li class="nav-item" onclick="showView('topology', this)">
+        <span>Topology</span>
+      </li>
+      <li class="nav-item" onclick="showView('events', this)">
+        <span>Events</span>
+      </li>
+      <li class="nav-item" onclick="showView('correlation', this)">
+        <span>Correlation</span>
+      </li>
+      <li class="nav-item" onclick="showView('recommendations', this)">
+        <span>Recommendations</span>
+      </li>
+      <li class="nav-item" onclick="showView('settings', this)">
+        <span>Settings</span>
+      </li>
+    </ul>
+    <div class="sidebar-footer">
+      <div id="summary-status">System Status: Loading...</div>
+    </div>
+  </aside>
+
+  <main class="main-content">
+    <div id="stale-banner" class="hidden" style="background:var(--error); color:white; padding:8px 24px; font-weight:bold; font-size:12px; text-align:center; letter-spacing:0.05em">
+      RUNTIME STATE IS STALE
+    </div>
+    <header>
+      <div style="display:flex; align-items:center; gap:20px">
+        <div class="view-title" id="current-view-title">Dashboard</div>
+        <select id="operator-mode" onchange="setOperatorMode(this.value)" style="background:var(--sidebar-color); border:1px solid var(--border-color); color:var(--accent-color); font-weight:bold; font-size:12px; padding:4px 8px">
+          <option value="observe">OBSERVE</option>
+          <option value="recommend">RECOMMEND</option>
+          <option value="approval" selected>APPROVAL</option>
+          <option value="autonomous">AUTONOMOUS</option>
+          <option value="maintenance">MAINTENANCE</option>
+        </select>
+      </div>
+      <div class="header-actions" style="display:flex; gap:8px; align-items:center">
+        <button onclick="refreshData()">Refresh</button>
+        <button id="copy-ai-btn" onclick="copyForAI()">Copy for AI</button>
+      </div>
+    </header>
+
+    <div class="content-scroll">
+      <!-- Dashboard View -->
+      <div id="view-dashboard" class="view">
+        <div class="grid">
+          <div class="card">
+            <div class="card-title">System Overview</div>
+            <div id="dashboard-summary" style="margin-top:20px"></div>
+          </div>
+          <div class="card">
+            <div class="card-title">Pending Actions</div>
+            <div id="dashboard-actions-summary" style="margin-top:20px"></div>
+          </div>
+          <div class="card">
+            <div class="card-title">Active Incidents</div>
+            <div id="dashboard-incidents" style="margin-top:20px"></div>
+          </div>
+        </div>
+      </div>
+
+      <!-- Actions View -->
+      <div id="view-actions" class="view hidden">
+        <div style="display:grid; grid-template-columns: 1fr 1fr; gap:24px">
+          <div>
+            <h3>Pending Approval</h3>
+            <div id="actions-pending" class="timeline"></div>
+          </div>
+          <div>
+            <h3>Active / History</h3>
+            <div id="actions-history" class="timeline"></div>
+          </div>
+        </div>
+      </div>
+
+      <!-- Nodes View -->
+      <div id="view-nodes" class="view hidden">
+        <div class="grid" id="nodes-list"></div>
+      </div>
+
+      <!-- Services View -->
+      <div id="view-services" class="view hidden">
+        <div class="grid" id="services-list"></div>
+      </div>
+
+      <!-- Deployments View -->
+      <div id="view-deployments" class="view hidden">
+        <div class="grid" id="deployments-list"></div>
+      </div>
+
+      <!-- Topology View -->
+      <div id="view-topology" class="view hidden">
+        <div class="card" style="min-height:500px">
+          <div class="card-title">Runtime Topology</div>
+          <div id="topology-map" style="margin-top:20px; display:flex; flex-wrap:wrap; gap:40px; justify-content:center"></div>
+        </div>
+      </div>
+
+      <!-- Events View -->
+      <div id="view-events" class="view hidden">
+        <div class="timeline" id="events-timeline"></div>
+      </div>
+
+      <!-- Correlation View -->
+      <div id="view-correlation" class="view hidden">
+        <div id="correlation-chains" class="grid"></div>
+      </div>
+
+      <!-- Recommendations View -->
+      <div id="view-recommendations" class="view hidden">
+        <div class="grid" id="recommendations-list"></div>
+      </div>
+
+      <!-- Settings View -->
+      <div id="view-settings" class="view hidden">
+        <div class="card">
+          <div class="card-title">Configuration</div>
+          <div id="settings-content" style="margin-top:20px"></div>
+        </div>
+      </div>
+    </div>
+  </main>
+
+  <script>
+    let currentView = 'dashboard';
+    const pollInterval = 5000;
+
+    function showView(viewId, el) {
+      document.querySelectorAll('.view').forEach(v => v.classList.add('hidden'));
+      document.getElementById('view-' + viewId).classList.remove('hidden');
+      document.querySelectorAll('.nav-item').forEach(i => i.classList.remove('active'));
+      if (el) el.classList.add('active');
+      currentView = viewId;
+      document.getElementById('current-view-title').textContent = viewId.charAt(0).toUpperCase() + viewId.slice(1);
+      refreshData();
+    }
+
+    async function fetchData(endpoint) {
+      try {
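+        const res = await fetch(endpoint, {cache: 'no-store'});
+        // Treat HTTP error responses as failures instead of trying to parse them as JSON.
+        if (!res.ok) throw new Error(`HTTP ${res.status}`);
+        return await res.json();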
+      } catch (e) {
+        console.error('Fetch error:', endpoint, e);
+        return null;
+      }
+    }
+
+    async function postData(endpoint, data) {
+      try {
+        const res = await fetch(endpoint, {
+          method: 'POST',
+          headers: {'Content-Type': 'application/json'},
+          body: JSON.stringify(data)
+        });
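+        // Treat HTTP error responses as failures instead of trying to parse them as JSON.
+        if (!res.ok) throw new Error(`HTTP ${res.status}`);
+        return await res.json();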
+      } catch (e) {
+        console.error('Post error:', endpoint, e);
+        return null;
+      }
+    }
+
+    async function mutateAction(id, status) {
+      const res = await postData('/action/mutate', {id, status});
+      if (res && res.status === 'ok') {
+        refreshData();
+      } else {
+        alert('Mutation failed');
+      }
+    }
+
+    async function setOperatorMode(mode) {
+      const res = await postData('/mode', {mode});
+      // Log success only after the server has confirmed the change.
+      if (res && res.status === 'ok') {
+        console.log('Operator mode set to:', mode);
+      } else {
+        console.error('Failed to set operator mode:', mode);
+      }
+    }
+
+    function formatTime(ts) {
+      if (!ts) return 'N/A';
+      return new Date(ts * 1000).toLocaleString();
+    }
+
+    function getStatusClass(status) {
+      status = (status || '').toLowerCase();
+      if (['nominal', 'healthy', 'ok', 'up'].includes(status)) return 'status-nominal';
+      if (['degraded', 'warning'].includes(status)) return 'status-degraded';
+      if (['unstable'].includes(status)) return 'status-unstable';
+      if (['reconciling'].includes(status)) return 'status-reconciling';
+      if (['error', 'down', 'failed'].includes(status)) return 'status-error';
+      return '';
+    }
+
+    async function refreshData() {
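+      // Always refresh the summary, regardless of the active view
+      const summary = await fetchData('/summary');
+      if (summary) {
+        const statusEl = document.getElementById('summary-status');
+        // The server returns {} when no runtime summary exists yet, so status may be missing.
+        statusEl.textContent = `System Status: ${(summary.status || 'unknown').toUpperCase()}`;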
+        statusEl.className = 'sidebar-footer ' + getStatusClass(summary.status);
+        
+        // Handle stale state
+        const staleBanner = document.getElementById('stale-banner');
+        if (summary.stale) {
+            staleBanner.classList.remove('hidden');
+            staleBanner.textContent = `CRITICAL: Runtime state is STALE (Last update: ${formatTime(summary.last_update)})`;
+        } else {
+            staleBanner.classList.add('hidden');
+        }
+
+        if (currentView === 'dashboard') {
+          const dashSummary = document.getElementById('dashboard-summary');
+          dashSummary.innerHTML = `
+            <div class="label">Nodes</div><div class="value">${summary.node_count ?? '?'}</div>
+            <div class="label">Services</div><div class="value">${summary.service_count ?? '?'}</div>
+            <div class="label">Last Update</div><div class="value">${formatTime(summary.last_update)}</div>
+          `;
+        }
+      }
+
+      if (currentView === 'dashboard' || currentView === 'actions') {
+          const actions = await fetchData('/actions');
+          if (actions) {
+              if (currentView === 'dashboard') {
+                  const dashActions = document.getElementById('dashboard-actions-summary');
+                  const pendingCount = actions.pending.length;
+                  dashActions.innerHTML = `
+                    <div class="label">Pending</div><div class="value" style="color:var(--guarded)">${pendingCount}</div>
+                    <div class="label">Running</div><div class="value" style="color:var(--reconciling)">${actions.running.length}</div>
+                  `;
+              }
+              if (currentView === 'actions') {
+                  const pendingEl = document.getElementById('actions-pending');
+                  const historyEl = document.getElementById('actions-history');
+                  
+                  pendingEl.innerHTML = actions.pending.map(a => `
+                    <div class="card" style="margin-bottom:12px">
+                        <div class="card-header">
+                            <div class="card-title">${(a.action_type || a.type || 'unknown').toUpperCase()}</div>
+                            <span class="badge risk-${a.risk_level}">${a.risk_level}</span>
+                        </div>
+                        <p>${a.description || a.action_type || 'No description'}</p>
+                        <div class="label">Target</div><div class="value">${a.node || (a.target && a.target.node) || 'unknown'} ${(a.service || (a.target && a.target.service)) || ''}</div>
+                        <div class="label">Confidence</div><div class="value">${Math.round((a.confidence || 0)*100)}%</div>
+                        <div class="controls">
+                            <button class="btn-primary" onclick="mutateAction('${a.id}', 'approved')">Approve</button>
+                            <button onclick="mutateAction('${a.id}', 'rejected')">Reject</button>
+                        </div>
+                    </div>
+                  `).join('') || 'No pending actions.';
+
+                  const history = [...actions.approved, ...actions.running, ...actions.completed, ...actions.failed, ...actions.rejected];
+                  historyEl.innerHTML = history.sort((a,b) => (b.timestamp || b.updated_at || 0) - (a.timestamp || a.updated_at || 0)).map(a => `
+                    <div class="event">
+                        <div class="event-header">
+                            <span>${(a.action_type || a.type || 'unknown').toUpperCase()}</span>
+                            <span class="badge ${getStatusClass(a.status)}">${a.status}</span>
+                        </div>
+                        <div>${a.description || a.action_type || 'No description'}</div>
+                        <small>${formatTime(a.timestamp || a.updated_at)} | Target: ${a.node || (a.target && a.target.node) || 'unknown'}</small>
+                        ${a.status === 'approved' ? `<div class="controls"><button class="btn-primary" onclick="mutateAction('${a.id}', 'running')">Execute</button></div>` : ''}
+                        ${a.transition_history ? `
+                            <div style="margin-top:8px; font-size:10px; color:var(--text-muted)">
+                                <strong>Trace:</strong> ${a.transition_history.map(h => `${h.from}->${h.to}`).join(' → ')}
+                            </div>
+                        ` : ''}
+                    </div>
+                  `).join('') || 'No history.';
+              }
+          }
+      }
+
+      // Incidents are rendered only on the dashboard, so skip the fetch for other views.
+      if (currentView === 'dashboard') {
+          const incidents = await fetchData('/incidents');
+          const dashIncidents = document.getElementById('dashboard-incidents');
+          if (!incidents || incidents.length === 0) {
+              dashIncidents.textContent = 'No active incidents.';
+          } else {
+              dashIncidents.innerHTML = incidents.map(inc => `
+                  <div class="event ${inc.severity}">
+                      <strong>${(inc.severity || 'unknown').toUpperCase()}:</strong> ${inc.message}<br>
+                      <small>${formatTime(inc.timestamp)} | Node: ${inc.node}</small>
+                  </div>
+              `).join('');
+          }
+      }
+
+      if (currentView === 'nodes') {
+        // fetchData returns null on failure; fall back to an empty list so .map() cannot throw.
+        const nodes = (await fetchData('/nodes')) || [];
+        const list = document.getElementById('nodes-list');
+        list.innerHTML = nodes.map(node => `
+          <div class="card">
+            <div class="card-header">
+              <div class="card-title">${node.hostname}</div>
+              <span class="badge ${getStatusClass(node.health)}">${node.health}</span>
+            </div>
+            <div class="label">ID</div><div class="value mono">${node.id}</div>
+            <div class="label">Capabilities</div><div class="value">${node.capabilities.join(', ')}</div>
+            <div class="label">Connectivity</div><div class="value">${node.connectivity}</div>
+            <div class="label">Incidents (24h)</div><div class="value">${node.incidents}</div>
+            <div class="label">Last Seen</div><div class="value">${formatTime(node.last_seen)}</div>
+            <div class="label">Runtime Status</div><div class="value">${node.status}</div>
+          </div>
+        `).join('');
+      }
+
+      if (currentView === 'services') {
+        // fetchData returns null on failure; fall back to an empty list so .map() cannot throw.
+        const services = (await fetchData('/services')) || [];
+        const list = document.getElementById('services-list');
+        list.innerHTML = services.map(svc => `
+          <div class="card">
+            <div class="card-header">
+              <div class="card-title">${svc.name}</div>
+              <span class="badge ${getStatusClass(svc.health)}">${svc.health}</span>
+            </div>
+            <div class="label">State (Desired/Actual)</div><div class="value">${svc.desired_state} / ${svc.actual_state}</div>
+            <div class="label">Deployment</div><div class="value">${svc.deployment_state}</div>
+            <div class="label">Dependencies</div><div class="value">${svc.dependencies.join(', ') || 'None'}</div>
+            <div class="label">Recommendations</div><div class="value">${svc.recommendations.join(', ') || 'None'}</div>
+          </div>
+        `).join('');
+      }
+
+      if (currentView === 'deployments') {
+        // fetchData returns null on failure; fall back to an empty list so .map() cannot throw.
+        const deps = (await fetchData('/deployments')) || [];
+        const list = document.getElementById('deployments-list');
+        list.innerHTML = deps.map(dep => `
+          <div class="card">
+            <div class="card-header">
+              <div class="card-title">${dep.service}</div>
+              <span class="badge ${dep.status === 'failed' ? 'status-error' : 'status-reconciling'}">${dep.status}</span>
+            </div>
+            <div class="label">ID</div><div class="value mono">${dep.id}</div>
+            <div class="label">Stage</div><div class="value">${dep.stage}</div>
+            <div class="label">Diagnostics</div><div class="value">${dep.diagnostics || 'No data'}</div>
+            <div class="label">Resumable</div><div class="value">${dep.resumable ? 'Yes' : 'No'}</div>
+            ${dep.resumable ? '<button class="btn-primary">Resume</button>' : ''}
+          </div>
+        `).join('');
+      }
+
+      if (currentView === 'events') {
+        // fetchData returns null on failure; fall back to an empty list so .map() cannot throw.
+        const events = (await fetchData('/events')) || [];
+        const timeline = document.getElementById('events-timeline');
+        timeline.innerHTML = events.map(ev => `
+          <div class="event ${ev.severity}">
+            <div class="event-header">
+              <span>${ev.type.toUpperCase()}</span>
+              <span>${formatTime(ev.timestamp)}</span>
+            </div>
+            <div>${ev.message}</div>
+            <div class="label" style="margin-top:8px">Node: ${ev.node} ${ev.service ? '| Service: ' + ev.service : ''}</div>
+          </div>
+        `).join('');
+      }
+
+      if (currentView === 'recommendations') {
+        // fetchData returns null on failure; fall back to an empty list so .map() cannot throw.
+        const recs = (await fetchData('/recommendations')) || [];
+        const list = document.getElementById('recommendations-list');
+        list.innerHTML = recs.map(rec => `
+          <div class="card">
+            <div class="card-header">
+              <div class="card-title">${rec.title}</div>
+              <span class="badge risk-${rec.risk_level}">${rec.risk_level}</span>
+            </div>
+            <p>${rec.description}</p>
+            <div class="label">Confidence</div><div class="value">${Math.round(rec.confidence * 100)}%</div>
+            <div class="label">Autonomous Eligible</div><div class="value">${rec.autonomous_eligible ? 'Yes' : 'No'}</div>
+            <div class="label">Blocked Actions</div><div class="value">${rec.blocked_actions.join(', ') || 'None'}</div>
+            <div class="controls">
+                <button class="btn-primary" ${rec.risk_level === 'dangerous' ? 'style="background:var(--dangerous)"' : ''}>Approve Action</button>
+            </div>
+          </div>
+        `).join('');
+      }
+
+      if (currentView === 'topology') {
+          const nodes = await fetchData('/nodes');
+          const services = await fetchData('/services');
+          const topMap = document.getElementById('topology-map');
+          if (nodes && services) {
+              topMap.innerHTML = nodes.map(node => {
+                  const nodeServices = services.filter(s => s.node === node.hostname || s.node === node.id);
+                  return `
+                    <div class="card" style="width:250px; border: 1px solid ${node.health === 'nominal' ? 'var(--border-color)' : 'var(--error)'}">
+                        <div class="card-header">
+                            <div class="card-title">${node.hostname}</div>
+                            <span class="badge ${getStatusClass(node.health)}">${node.health}</span>
+                        </div>
+                        <div class="label">Capabilities</div>
+                        <div class="value" style="font-size:11px">${node.capabilities.join(', ')}</div>
+                        <div class="label">Services</div>
+                        <div style="font-size:12px; margin-bottom:10px">
+                            ${nodeServices.length > 0 ? nodeServices.map(s => `
+                                <div style="display:flex; justify-content:space-between; margin-bottom:4px; padding:4px; background:rgba(255,255,255,0.03)">
+                                    <span>${s.name}</span>
+                                    <span class="${getStatusClass(s.health)}" style="font-size:10px">${s.health}</span>
+                                </div>
+                                ${s.dependencies.length > 0 ? `<div style="font-size:9px; color:var(--text-muted); margin-left:8px; margin-bottom:4px">dep: ${s.dependencies.join(', ')}</div>` : ''}
+                            `).join('') : '<div class="value">None</div>'}
+                        </div>
+                    </div>
+                  `;
+              }).join('');
+          }
+      }
+
+      if (currentView === 'correlation') {
+          const incidents = await fetchData('/incidents');
+          const actions = await fetchData('/actions');
+          const list = document.getElementById('correlation-chains');
+          if (incidents && actions) {
+              const allActions = Object.values(actions).flat();
+              list.innerHTML = incidents.map(inc => {
+                  const related = allActions.filter(a => a.correlation_chain && a.correlation_chain.includes(inc.id));
+                  return `
+                    <div class="card">
+                        <div class="card-header">
+                            <div class="card-title">Incident: ${inc.id || 'unknown'}</div>
+                            <span class="badge status-error">Active</span>
+                        </div>
+                        <p>${inc.message}</p>
+                        <div class="label">Related Actions</div>
+                        ${related.map(a => `
+                            <div class="event" style="margin-top:5px">
+                                <strong>${a.action_type || a.type || 'unknown'}</strong> (${a.status})<br>
+                                <small>${a.description || 'No description'}</small>
+                            </div>
+                        `).join('') || '<div class="value">No actions yet</div>'}
+                    </div>
+                  `;
+              }).join('');
+          }
+      }
+      if (currentView === 'settings') {
+          // fetchData returns null on failure; fall back to an empty object so property access cannot throw.
+          const config = (await fetchData('/config')) || {};
+          const content = document.getElementById('settings-content');
+          content.innerHTML = `
+              <div class="label">Auto Mode</div>
+              <div class="value">${config.auto_mode ? 'Enabled' : 'Disabled'}</div>
+              <div class="label">Action Thresholds</div>
+              <div class="value mono">${JSON.stringify(config.action_thresholds ?? {}, null, 2)}</div>
+              <div class="label">Telegram Integration</div>
+              <div class="value" style="color:var(--text-muted)">Ready for mobile approval flows. Hook: /api/v1/telegram/webhook</div>
+              <button onclick="alert('Settings update not implemented in this demo')">Edit Configuration</button>
+          `;
+      }
+    }
+
+    async function copyForAI() {
+      const btn = document.getElementById('copy-ai-btn');
+      const original = btn.textContent;
+      btn.textContent = 'Copying...';
+      btn.disabled = true;
+
+      try {
+        const snap = await fetchData('/snapshot');
+        if (!snap) throw new Error('snapshot fetch failed');
+
+        const now = new Date(snap.timestamp);
+        const dateStr = now.toISOString().slice(0, 16).replace('T', ' ');
+        const lines = [];
+
+        lines.push(`=== HOMELAB SNAPSHOT ${dateStr} ===`);
+
+        if (snap.nodes && snap.nodes.length > 0) {
+          lines.push('NODES: ' + snap.nodes.map(n =>
+            `${(n.hostname || n.id || '?').toUpperCase()} ${(n.health || 'unknown').toUpperCase()}`
+          ).join(', '));
+        } else {
+          lines.push('NODES: none');
+        }
+
+        if (snap.non_nominal_services && snap.non_nominal_services.length > 0) {
+          lines.push('ERRORS: ' + snap.non_nominal_services.map(s =>
+            `${s.name} (${s.node}) - ${s.health}`
+          ).join(', '));
+        } else {
+          lines.push(`ERRORS: none (${snap.nominal_service_count} nominal)`);
+        }
+
+        const activeIncidents = (snap.incidents || []).filter(i => !['resolved', 'closed'].includes(i.status));
+        if (activeIncidents.length > 0) {
+          lines.push('INCIDENTS: ' + activeIncidents.map(i =>
+            `[${i.severity}] ${i.message} (${i.node})`
+          ).join('; '));
+        } else {
+          lines.push('INCIDENTS: none');
+        }
+
+        if (snap.events && snap.events.length > 0) {
+          lines.push(`EVENTS (last ${snap.events.length}):`);
+          snap.events.forEach(ev => {
+            const ts = ev.timestamp
+              ? new Date(ev.timestamp * 1000).toISOString().slice(11, 19)
+              : '?';
+            const svc = ev.service ? '/' + ev.service : '';
+            lines.push(`  ${ts} [${ev.severity || ev.level || '?'}] ${ev.type} - ${ev.message || ''} (${ev.node || ''}${svc})`);
+          });
+        } else {
+          lines.push('EVENTS (last 10): none');
+        }
+
+        const s = snap.summary || {};
+        lines.push(`SUMMARY: status=${s.status || '?'} nodes=${s.node_count ?? '?'} services=${s.service_count ?? '?'} incidents=${s.incident_count ?? '?'}`);
+
+        await navigator.clipboard.writeText(lines.join('\n'));
+        btn.textContent = 'Copied!';
+        setTimeout(() => { btn.textContent = original; btn.disabled = false; }, 2000);
+      } catch (e) {
+        console.error('copyForAI error:', e);
+        btn.textContent = 'Error';
+        setTimeout(() => { btn.textContent = original; btn.disabled = false; }, 2000);
+      }
+    }
+
+    // Initial load
+    refreshData();
+    // Poll for updates
+    setInterval(refreshData, pollInterval);
+
+  </script>
+</body>
+</html>
--- a/services/agent-system/webui/web.py
+++ b/services/agent-system/webui/web.py
@ -0,0 +1,301 @@
+import json
+import os
+import time
+from datetime import datetime, timezone
+from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
+from pathlib import Path
+
+
+STATE_DIR = Path(os.getenv("HOMELAB_STATE_ROOT", "/opt/homelab/state"))
+EVENTS_DIR = Path(os.getenv("HOMELAB_EVENTS_ROOT", "/opt/homelab/events"))
+WORLD_DIR = Path(os.getenv("HOMELAB_WORLD_ROOT", "/opt/homelab/world"))
+ACTIONS_DIR = Path(os.getenv("HOMELAB_ACTIONS_ROOT", "/opt/homelab/actions"))
+CONFIG_DIR = Path(os.getenv("HOMELAB_CONFIG_ROOT", "/opt/homelab/config"))
+
+STATIC_DIR = Path(__file__).parent
+
+DEFAULT_CONFIG = {
+    "operator_mode": "approval",
+    "auto_mode": True,
+    "action_thresholds": {
+        "restart_ha": 0.8,
+        "check_network": 0.9,
+    },
+    "default_threshold": 0.9,
+    "allowed_auto_actions": ["restart_ha"],
+}
+
+
+def read_json_file(path, default=None):
+    if not path.exists():
+        return default if default is not None else []
+    try:
+        return json.loads(path.read_text())
+    except Exception:
+        return default if default is not None else []
+
+
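+def get_config():
+    # read_json_file falls back to DEFAULT_CONFIG when the file is missing or
+    # unreadable. Return a copy so callers that mutate the result (POST /config,
+    # POST /mode) never alter the module-level defaults.
+    config_path = STATE_DIR / "operator-config.json"
+    return dict(read_json_file(config_path, DEFAULT_CONFIG))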
+
+
+def save_config(config):
+    STATE_DIR.mkdir(parents=True, exist_ok=True)
+    (STATE_DIR / "operator-config.json").write_text(json.dumps(config, indent=2))
+
+
+def current_nodes():
+    return read_json_file(WORLD_DIR / "nodes.json")
+
+
+def current_services():
+    return read_json_file(WORLD_DIR / "services.json")
+
+
+def current_deployments():
+    return read_json_file(WORLD_DIR / "deployments.json")
+
+
+def current_incidents():
+    return read_json_file(WORLD_DIR / "incidents.json")
+
+
+def current_recommendations():
+    return read_json_file(WORLD_DIR / "recommendations.json")
+
+
+def current_summary():
+    path = WORLD_DIR / "runtime-summary.json"
+    summary = read_json_file(path, default={})
+    if summary:
+        last_update_val = summary.get("last_update")
+        if last_update_val:
+            try:
+                if isinstance(last_update_val, str):
+                    last_update = datetime.fromisoformat(last_update_val.replace('Z', '+00:00')).timestamp()
+                else:
+                    last_update = float(last_update_val)
+            except Exception:
+                last_update = os.path.getmtime(path)
+        else:
+            last_update = os.path.getmtime(path)
+        summary["last_update"] = last_update
+        summary["stale"] = (time.time() - last_update) > 60
+    return summary
+
+
+def current_events():
+    return read_json_file(WORLD_DIR / "events.json", default=[])
+
+
+def current_actions():
+    actions = {}
+    statuses = ["pending", "approved", "running", "completed", "failed", "rejected"]
+    for status in statuses:
+        actions[status] = []
+        status_dir = ACTIONS_DIR / status
+        if status_dir.exists():
+            for f in status_dir.glob("*.json"):
+                data = read_json_file(f)
+                if data:
+                    # Inject metadata for the UI. The id is the file stem because
+                    # mutate_action() locates actions by file name.
+                    data["id"] = f.stem
+                    data["status"] = status
+                    actions[status].append(data)
+    return actions
+
+
+def mutate_action(action_id, target_status):
+    statuses = ["pending", "approved", "running", "completed", "failed", "rejected"]
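+    if target_status not in statuses:
+        return False, f"Invalid target status: {target_status}"
+
+    # action_id comes from the request body and is used to build file paths,
+    # so reject anything that is not a plain file name (for example "../x").
+    if not isinstance(action_id, str) or Path(action_id).name != action_id:
+        return False, f"Invalid action id: {action_id}"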
+
+    # Find where the action is
+    source_path = None
+    current_status = None
+    for status in statuses:
+        p = ACTIONS_DIR / status / f"{action_id}.json"
+        if p.exists():
+            source_path = p
+            current_status = status
+            break
+
+    if not source_path:
+        return False, f"Action {action_id} not found"
+
+    target_dir = ACTIONS_DIR / target_status
+    target_dir.mkdir(parents=True, exist_ok=True)
+    target_path = target_dir / f"{action_id}.json"
+
+    try:
+        data = json.loads(source_path.read_text())
+        data["status"] = target_status
+        data["updated_at"] = time.time()
+        
+        # Keep history of transitions
+        history = data.get("transition_history", [])
+        history.append({
+            "from": current_status,
+            "to": target_status,
+            "timestamp": time.time()
+        })
+        data["transition_history"] = history
+
+        target_path.write_text(json.dumps(data, indent=2))
+        if source_path != target_path:
+            source_path.unlink()
+        return True, "Success"
+    except Exception as e:
+        return False, str(e)
+
+
+def get_snapshot():
+    nodes = current_nodes()
+    services = current_services()
+    incidents = current_incidents()
+    events = current_events()
+    summary = current_summary()
+
+    non_nominal = [s for s in services if s.get("health") != "nominal"]
+    nominal_count = len(services) - len(non_nominal)
+
+    return {
+        "timestamp": datetime.now(timezone.utc).isoformat(),
+        "summary": summary,
+        "nodes": nodes,
+        "non_nominal_services": non_nominal,
+        "nominal_service_count": nominal_count,
+        "total_service_count": len(services),
+        "incidents": incidents,
+        "events": events[:10],
+    }
+
+
+def send_json(status, payload, handler):
+    body = (json.dumps(payload) + "\n").encode("utf-8")
+    handler.send_response(status)
+    handler.send_header("Content-Type", "application/json")
+    handler.send_header("Content-Length", str(len(body)))
+    handler.end_headers()
+    handler.wfile.write(body)
+
+
+class Handler(BaseHTTPRequestHandler):
+    def do_GET(self):
+        if self.path == "/config":
+            send_json(200, get_config(), self)
+            return
+
+        if self.path == "/nodes":
+            send_json(200, current_nodes(), self)
+            return
+
+        if self.path == "/services":
+            send_json(200, current_services(), self)
+            return
+
+        if self.path == "/deployments":
+            send_json(200, current_deployments(), self)
+            return
+
+        if self.path == "/incidents":
+            send_json(200, current_incidents(), self)
+            return
+
+        if self.path == "/recommendations":
+            send_json(200, current_recommendations(), self)
+            return
+
+        if self.path == "/summary":
+            send_json(200, current_summary(), self)
+            return
+
+        if self.path == "/events":
+            send_json(200, current_events(), self)
+            return
+
+        if self.path == "/actions":
+            send_json(200, current_actions(), self)
+            return
+
+        if self.path == "/snapshot":
+            send_json(200, get_snapshot(), self)
+            return
+
+        if self.path in ("/", "/index.html"):
+            body = (STATIC_DIR / "index.html").read_bytes()
+            self.send_response(200)
+            self.send_header("Content-Type", "text/html; charset=utf-8")
+            self.send_header("Content-Length", str(len(body)))
+            self.end_headers()
+            self.wfile.write(body)
+            return
+
+        self.send_error(404)
+
+    def do_POST(self):
+        if self.path not in (
+            "/config",
+            "/action/mutate",
+            "/mode",
+        ):
+            self.send_error(404)
+            return
+
+        length = int(self.headers.get("Content-Length", "0"))
+        raw_body = self.rfile.read(length).decode("utf-8")
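+        try:
+            payload = json.loads(raw_body)
+        except json.JSONDecodeError:
+            self.send_error(400, "Invalid JSON")
+            return
+        # Every handler below calls dict methods on the payload.
+        if not isinstance(payload, dict):
+            self.send_error(400, "JSON object required")
+            return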
+
+        if self.path == "/config":
+            config = get_config()
+            config.update(payload)
+            save_config(config)
+            send_json(200, {"status": "ok"}, self)
+            return
+
+        if self.path == "/mode":
+            mode = payload.get("mode")
+            if not mode:
+                self.send_error(400, "mode is required")
+                return
+            config = get_config()
+            config["operator_mode"] = mode
+            save_config(config)
+            send_json(200, {"status": "ok"}, self)
+            return
+
+        if self.path == "/action/mutate":
+            action_id = payload.get("id")
+            target = payload.get("status")
+            if not action_id or not target:
+                self.send_error(400, "id and status are required")
+                return
+            success, msg = mutate_action(action_id, target)
+            if success:
+                send_json(200, {"status": "ok"}, self)
+            else:
+                self.send_error(500, msg)
+            return
+
+    def log_message(self, format, *args):
+        # Suppress the default per-request access log written to stderr.
+        return
+
+
+if __name__ == "__main__":
+    # Ensure directories exist
+    for d in [STATE_DIR, EVENTS_DIR, WORLD_DIR, ACTIONS_DIR, CONFIG_DIR]:
+        d.mkdir(parents=True, exist_ok=True)
+    for s in ["pending", "approved", "running", "completed", "failed", "rejected"]:
+        (ACTIONS_DIR / s).mkdir(parents=True, exist_ok=True)
+
+    port = int(os.getenv("PORT", "8080"))
+    print(f"Operator Control Plane starting on 0.0.0.0:{port}")
+    server = ThreadingHTTPServer(("0.0.0.0", port), Handler)
+    server.serve_forever()
--- a/services/brain-watchdog/Dockerfile
+++ b/services/brain-watchdog/Dockerfile
@ -0,0 +1,10 @@
+FROM python:3.11-slim
+
+WORKDIR /app
+
+COPY src/ src/
+
+ENV PYTHONUNBUFFERED=1
+ENV PYTHONPATH=/app/src
+
+CMD ["python", "-m", "brain_watchdog.main"]
--- a/services/brain-watchdog/docker-compose.yml
+++ b/services/brain-watchdog/docker-compose.yml
@ -0,0 +1,30 @@
+services:
+  brain-watchdog:
+    build: .
+    container_name: brain-watchdog
+    restart: unless-stopped
+
+    env_file:
+      - /opt/homelab/config/brain-watchdog/.env
+
+    volumes:
+      - brain_watchdog_data:/data
+
+    healthcheck:
+      test:
+        - "CMD"
+        - "python"
+        - "-c"
+        - |
+          import os, time, sys
+          p = '/data/state.json'
+          if not os.path.exists(p): sys.exit(1)
+          age = time.time() - os.path.getmtime(p)
+          sys.exit(0 if age < 300 else 1)
+      interval: 1m
+      timeout: 10s
+      retries: 3
+      start_period: 30s
+
+volumes:
+  brain_watchdog_data:
--- a/services/brain-watchdog/env.example
+++ b/services/brain-watchdog/env.example
@ -0,0 +1,7 @@
+CONTROL_PLANE_URL=
+STALE_THRESHOLD=600
+INTERVAL=60
+FAILS_BEFORE_ALERT=3
+TG_TOKEN=
+TG_CHAT_ID=
+HEALTHCHECKS_URL=
--- a/services/brain-watchdog/healthcheck.sh
+++ b/services/brain-watchdog/healthcheck.sh
@ -0,0 +1,10 @@
+#!/bin/sh
+# Healthy if state.json was written within the last 5 minutes.
+python -c "
+import os, time, sys
+p = '/data/state.json'
+if not os.path.exists(p):
+    sys.exit(1)
+age = time.time() - os.path.getmtime(p)
+sys.exit(0 if age < 300 else 1)
+"
--- a/services/brain-watchdog/pytest.ini
+++ b/services/brain-watchdog/pytest.ini
@ -0,0 +1,3 @@
+[pytest]
+pythonpath = src
+testpaths = tests
--- a/services/brain-watchdog/service.yaml
+++ b/services/brain-watchdog/service.yaml
@ -0,0 +1,34 @@
+service:
+  name: brain-watchdog
+  owner_node: piha
+  exposure: private
+  description: >
+    External watchdog for the control-plane on VPS. Queries /summary over
+    Tailscale and alerts via Telegram Bot API directly — no dependency on the
+    control-plane itself. Freshness is computed locally from last_update epoch.
+
+  dependencies:
+    - control-plane   # external — on VPS; deliberately untrusted for liveness
+
+  healthcheck:
+    type: docker
+    interval: 60s
+    timeout: 10s
+    retries: 3
+    start_period: 30s
+
+  restart_policy: unless-stopped
+
+  persistence:
+    paths:
+      - /data   # state.json: fail_count, alerted, last_ok
+
+  runtime:
+    env_vars:
+      - CONTROL_PLANE_URL     # Tailscale IP + port of operator-ui (required)
+      - STALE_THRESHOLD       # seconds before brain is considered stale (default: 600)
+      - INTERVAL              # poll interval seconds (default: 60)
+      - FAILS_BEFORE_ALERT    # consecutive failures before Telegram alert (default: 3)
+      - TG_TOKEN              # Telegram Bot API token (required)
+      - TG_CHAT_ID            # Telegram chat/user ID (required)
+      - HEALTHCHECKS_URL      # optional healthchecks.io ping URL
--- a/services/brain-watchdog/src/brain_watchdog/__init__.py
+++ b/services/brain-watchdog/src/brain_watchdog/__init__.py
--- a/services/brain-watchdog/src/brain_watchdog/main.py
+++ b/services/brain-watchdog/src/brain_watchdog/main.py
@ -0,0 +1,157 @@
+"""
+brain-watchdog: external watchdog for the control-plane on VPS.
+
+Runs on PIHA; queries /summary directly over Tailscale and alerts via
+Telegram Bot API without going through the control-plane itself.
+Never trusts the self-reported "status" field — freshness is computed
+locally from last_update epoch vs. time.time().
+"""
+
+import json
+import os
+import time
+import urllib.error
+import urllib.request
+from pathlib import Path
+
+CONTROL_PLANE_URL = os.environ["CONTROL_PLANE_URL"].rstrip("/")
+STALE_THRESHOLD = int(os.environ.get("STALE_THRESHOLD", "600"))
+INTERVAL = int(os.environ.get("INTERVAL", "60"))
+FAILS_BEFORE_ALERT = int(os.environ.get("FAILS_BEFORE_ALERT", "3"))
+TG_TOKEN = os.environ["TG_TOKEN"]
+TG_CHAT_ID = os.environ["TG_CHAT_ID"]
+HEALTHCHECKS_URL = os.environ.get("HEALTHCHECKS_URL", "").strip()
+
+STATE_FILE = Path("/data/state.json")
+
+
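+def load_state() -> dict:
+    # Start from defaults so a partial or corrupt state file cannot cause a
+    # KeyError later in the main loop.
+    state = {"fail_count": 0, "alerted": False, "last_ok": 0.0}
+    if STATE_FILE.exists():
+        try:
+            loaded = json.loads(STATE_FILE.read_text())
+            if isinstance(loaded, dict):
+                state.update(loaded)
+        except Exception:
+            pass
+    return state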
+
+
+def save_state(state: dict) -> None:
+    STATE_FILE.parent.mkdir(parents=True, exist_ok=True)
+    STATE_FILE.write_text(json.dumps(state))
+
+
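+def http_get(url: str, timeout: int = 10) -> tuple[int | None, dict | None]:
+    try:
+        with urllib.request.urlopen(url, timeout=timeout) as resp:
+            status, raw = resp.status, resp.read()
+    except urllib.error.HTTPError as exc:
+        return exc.code, None
+    except Exception:
+        return None, None
+    # Parse outside the network try-block so a malformed body is reported as
+    # invalid JSON by check() instead of as a connection error.
+    try:
+        body = json.loads(raw)
+    except ValueError:
+        return status, None
+    return status, body if isinstance(body, dict) else None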
+
+
+def send_telegram(message: str) -> bool:
+    url = f"https://api.telegram.org/bot{TG_TOKEN}/sendMessage"
+    payload = json.dumps(
+        {"chat_id": TG_CHAT_ID, "text": message, "parse_mode": "HTML"}
+    ).encode()
+    req = urllib.request.Request(
+        url, data=payload, headers={"Content-Type": "application/json"}
+    )
+    try:
+        with urllib.request.urlopen(req, timeout=10) as resp:
+            return resp.status == 200
+    except Exception as exc:
+        print(f"[telegram] send failed: {exc}", flush=True)
+        return False
+
+
+def ping_healthchecks() -> None:
+    if not HEALTHCHECKS_URL:
+        return
+    try:
+        urllib.request.urlopen(HEALTHCHECKS_URL, timeout=10)
+    except Exception as exc:
+        print(f"[healthchecks] ping failed: {exc}", flush=True)
+
+
+def check() -> tuple[bool, str]:
+    """Return (ok, human-readable reason).  Never reads 'status' field."""
+    status, body = http_get(f"{CONTROL_PLANE_URL}/summary")
+
+    if status is None:
+        return False, "panel unreachable (connection error)"
+
+    if status != 200:
+        return False, f"panel returned HTTP {status}"
+
+    if not body:
+        return False, "panel returned empty / invalid JSON"
+
+    raw = body.get("last_update")
+    if raw is None:
+        return False, "summary missing last_update field"
+
+    try:
+        last_update_ts = float(raw)
+    except (TypeError, ValueError):
+        return False, f"last_update not parseable: {raw!r}"
+
+    age = time.time() - last_update_ts
+    if age > STALE_THRESHOLD:
+        return False, (
+            f"brain stale: last update {int(age // 60)}m ago "
+            f"(threshold {STALE_THRESHOLD // 60}m)"
+        )
+
+    return True, f"ok (age {int(age)}s)"
+
+
+def main() -> None:
+    print(
+        f"[brain-watchdog] starting — "
+        f"url={CONTROL_PLANE_URL} "
+        f"stale_threshold={STALE_THRESHOLD}s "
+        f"interval={INTERVAL}s "
+        f"fails_before_alert={FAILS_BEFORE_ALERT}",
+        flush=True,
+    )
+    state = load_state()
+
+    while True:
+        ok, reason = check()
+        ts = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
+        print(f"[{ts}] {'OK  ' if ok else 'FAIL'} — {reason}", flush=True)
+
+        if ok:
+            if state["alerted"]:
+                sent = send_telegram(
+                    "✅ <b>brain-watchdog: control-plane RECOVERED</b>\n"
+                    f"{reason}"
+                )
+                if sent:
+                    print("[telegram] sent recovery alert", flush=True)
+            state["fail_count"] = 0
+            state["alerted"] = False
+            state["last_ok"] = time.time()
+            save_state(state)
+            ping_healthchecks()
+        else:
+            state["fail_count"] = state.get("fail_count", 0) + 1
+            save_state(state)
+
+            if state["fail_count"] >= FAILS_BEFORE_ALERT and not state["alerted"]:
+                sent = send_telegram(
+                    "🚨 <b>brain-watchdog: control-plane DOWN</b>\n"
+                    f"Reason: {reason}\n"
+                    f"Consecutive failures: {state['fail_count']}\n"
+                    f"URL: <code>{CONTROL_PLANE_URL}</code>"
+                )
+                if sent:
+                    state["alerted"] = True
+                    save_state(state)
+                    print("[telegram] sent alert", flush=True)
+
+        time.sleep(INTERVAL)
+
+
+if __name__ == "__main__":
+    main()
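
The alerting rules in the loop above can be read as a small state machine: a DOWN alert fires once after `FAILS_BEFORE_ALERT` consecutive failures, and a recovery message fires on the first success after an alert. The sketch below restates that logic as a pure function, for illustration only. `step` and its event names are hypothetical and not part of this commit, and the sketch assumes every alert is delivered (the real loop leaves `alerted` unset when the Telegram call fails, so it retries on the next failure).

```python
from __future__ import annotations


def step(state: dict, ok: bool, fails_before_alert: int = 3) -> tuple[dict, str | None]:
    """Return (new_state, event); event is 'down', 'recovered', or None."""
    new_state = dict(state)
    if ok:
        # First success after an alert produces exactly one recovery event.
        event = "recovered" if state.get("alerted") else None
        new_state["fail_count"] = 0
        new_state["alerted"] = False
        return new_state, event

    new_state["fail_count"] = state.get("fail_count", 0) + 1
    # Alert once when the threshold is reached; stay quiet while still down.
    if new_state["fail_count"] >= fails_before_alert and not state.get("alerted"):
        new_state["alerted"] = True
        return new_state, "down"
    return new_state, None


if __name__ == "__main__":
    state = {"fail_count": 0, "alerted": False}
    events = []
    for ok in [True, False, False, False, False, True, True]:
        state, event = step(state, ok)
        events.append(event)
    print(events)
```

Keeping the transition pure would make the debounce behaviour testable without patching `time.sleep` or the Telegram call, at the cost of one extra function.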
--- a/services/brain-watchdog/tests/__init__.py
+++ b/services/brain-watchdog/tests/__init__.py
--- a/services/brain-watchdog/tests/test_main.py
+++ b/services/brain-watchdog/tests/test_main.py
@ -0,0 +1,66 @@
+"""
+Tests for brain_watchdog.main.
+
+Module-level env vars are required at import time; set them before the first
+import of the module so tests can run without a real control-plane.
+"""
+import importlib.util
+import os
+import time
+from unittest.mock import patch
+
+os.environ.setdefault("CONTROL_PLANE_URL", "http://test-cp:8080")
+os.environ.setdefault("TG_TOKEN", "test_token")
+os.environ.setdefault("TG_CHAT_ID", "12345")
+
+import brain_watchdog.main as bwm
+
+
+def test_package_importable():
+    spec = importlib.util.find_spec("brain_watchdog")
+    assert spec is not None
+
+
+def test_check_ok_fresh():
+    now = time.time()
+    with patch.object(bwm, "http_get", return_value=(200, {"last_update": now - 10})):
+        ok, reason = bwm.check()
+    assert ok
+    assert "ok" in reason
+
+
+def test_check_fail_stale():
+    now = time.time()
+    stale_ts = now - (bwm.STALE_THRESHOLD + 120)
+    with patch.object(bwm, "http_get", return_value=(200, {"last_update": stale_ts})):
+        ok, reason = bwm.check()
+    assert not ok
+    assert "stale" in reason
+
+
+def test_check_fail_unreachable():
+    with patch.object(bwm, "http_get", return_value=(None, None)):
+        ok, reason = bwm.check()
+    assert not ok
+    assert "unreachable" in reason
+
+
+def test_check_fail_http_error():
+    with patch.object(bwm, "http_get", return_value=(503, None)):
+        ok, reason = bwm.check()
+    assert not ok
+    assert "503" in reason
+
+
+def test_check_fail_missing_last_update():
+    with patch.object(bwm, "http_get", return_value=(200, {"other": "data"})):
+        ok, reason = bwm.check()
+    assert not ok
+    assert "last_update" in reason
+
+
+def test_check_fail_unparseable_timestamp():
+    with patch.object(bwm, "http_get", return_value=(200, {"last_update": "not-a-number"})):
+        ok, reason = bwm.check()
+    assert not ok
+    assert "parseable" in reason
--- a/services/control-plane/Dockerfile
+++ b/services/control-plane/Dockerfile
@ -0,0 +1,24 @@
+FROM python:3.11-slim
+
+WORKDIR /app
+
+RUN pip install --no-cache-dir pyyaml
+
+# Create homelab user
+RUN useradd -m -u 1000 homelab
+
+# Copy sources
+COPY src/ /app/src/
+# The observer script and the scripts/ directory are not copied into the image.
+# The repo is expected to be bind-mounted at /repo (see docker-compose.yml), so
+# the observer, supervisor, and executor find them there.
+ENV REPO_ROOT=/repo
+ENV RUNTIME_PATH=/opt/homelab
+ENV PYTHONUNBUFFERED=1
+
+# Run as the unprivileged homelab user (uid 1000)
+USER homelab
+# Default command (overridden per service in docker-compose.yml)
+CMD ["python", "src/operator_ui.py"]
--- a/services/control-plane/deploy-local.sh
+++ b/services/control-plane/deploy-local.sh
@ -0,0 +1,73 @@
+#!/bin/bash
+# services/control-plane/deploy-local.sh
+set -e
+
+# 1. Validate it is deploying control-plane
+if [[ ! $(pwd) == *"/services/control-plane" ]]; then
+    echo "Error: Script must be run from services/control-plane directory"
+    exit 1
+fi
+
+if [[ ! -f "docker-compose.yml" ]]; then
+    echo "Error: docker-compose.yml not found"
+    exit 1
+fi
+
+echo "--- Preparing Control Plane Directories ---"
+# 2. Prepare required dirs
+# /opt/homelab/config
+# /opt/homelab/actions/{pending,approved,rejected,running,completed,failed}
+# /opt/homelab/world
+# /opt/homelab/state
+
+DIRS=(
+    "/opt/homelab/config"
+    "/opt/homelab/actions/pending"
+    "/opt/homelab/actions/approved"
+    "/opt/homelab/actions/rejected"
+    "/opt/homelab/actions/running"
+    "/opt/homelab/actions/completed"
+    "/opt/homelab/actions/failed"
+    "/opt/homelab/world"
+    "/opt/homelab/state"
+)
+
+for dir in "${DIRS[@]}"; do
+    if [ ! -d "$dir" ]; then
+        echo "Creating $dir"
+        sudo mkdir -p "$dir"
+    fi
+done
+
+# 3. chown/chmod for UID 1000 — self-healing: only calls sudo when actually needed
+echo "Checking /opt/homelab ownership..."
+_chown_needed=$(find /opt/homelab \( ! -uid 1000 -o ! -gid 1000 \) -print -quit 2>/dev/null)
+if [[ -n "$_chown_needed" ]]; then
+    echo "Found files not owned by 1000:1000 (e.g. $_chown_needed) — fixing..."
+    sudo chown -R 1000:1000 /opt/homelab
+else
+    echo "Ownership already correct, skipping chown"
+fi
+
+echo "Checking /opt/homelab directory permissions..."
+_chmod_needed=$(find /opt/homelab -type d ! -perm -775 -print -quit 2>/dev/null)
+if [[ -n "$_chmod_needed" ]]; then
+    echo "Found directories with wrong permissions (e.g. $_chmod_needed) — fixing..."
+    sudo find /opt/homelab -type d ! -perm -775 -exec chmod 775 {} + 2>/dev/null || true
+else
+    echo "Permissions already correct, skipping chmod"
+fi
+
+# 4. Run docker compose up -d --build --force-recreate
+echo "--- Starting Control Plane Services ---"
+COMPOSE_ARGS=(-f docker-compose.yml)
+OVERRIDE_FILE="../../hosts/vps/runtime/control-plane/docker-compose.override.yml"
+if [ -f "$OVERRIDE_FILE" ]; then
+    echo "Using override: $OVERRIDE_FILE"
+    COMPOSE_ARGS+=(-f "$OVERRIDE_FILE")
+fi
+docker compose "${COMPOSE_ARGS[@]}" up -d --build --force-recreate
+
+# 5. Print docker ps for control-plane containers
+echo "--- Deployment Status ---"
+docker ps --filter "name=control-plane"
--- a/services/control-plane/docker-compose.yml
+++ b/services/control-plane/docker-compose.yml
@ -0,0 +1,76 @@
+services:
+  operator-ui:
+    build: .
+    container_name: control-plane-ui
+    user: "1000:1000"
+    command: python src/operator_ui.py
+    ports:
+      - "18180:8080"
+    volumes:
+      - /opt/homelab:/opt/homelab
+    restart: unless-stopped
+    healthcheck:
+      test: ["CMD", "python", "-c", "import urllib.request; urllib.request.urlopen('http://127.0.0.1:8080/', timeout=3).read()"]
+      interval: 30s
+      timeout: 10s
+      retries: 3
+
+  observer:
+    build: .
+    container_name: control-plane-observer
+    user: "1000:1000"
+    command: python /repo/scripts/observer/observer.py
+    volumes:
+      - /opt/homelab:/opt/homelab
+      - ../..:/repo:ro
+    restart: unless-stopped
+    environment:
+      - REPO_ROOT=/repo
+      - RUNTIME_PATH=/opt/homelab
+    healthcheck:
+      test: ["CMD", "test", "-f", "/opt/homelab/state/observer.heartbeat"]
+      interval: 30s
+      timeout: 5s
+      retries: 3
+      start_period: 5s
+
+  supervisor:
+    build: .
+    container_name: control-plane-supervisor
+    user: "1000:1000"
+    command: python src/supervisor.py
+    volumes:
+      - /opt/homelab:/opt/homelab
+      - ../..:/repo:ro
+    restart: unless-stopped
+    environment:
+      - REPO_ROOT=/repo
+      - RUNTIME_PATH=/opt/homelab
+    healthcheck:
+      test: ["CMD", "test", "-f", "/opt/homelab/state/supervisor.heartbeat"]
+      interval: 60s
+      timeout: 5s
+      retries: 3
+      start_period: 10s
+
+  executor:
+    build: .
+    container_name: control-plane-executor
+    user: "1000:1000"
+    group_add:
+      - "999"
+    command: python src/executor.py
+    volumes:
+      - /opt/homelab:/opt/homelab
+      - ../..:/repo
+      - /var/run/docker.sock:/var/run/docker.sock
+    restart: unless-stopped
+    environment:
+      - REPO_ROOT=/repo
+      - RUNTIME_PATH=/opt/homelab
+    healthcheck:
+      test: ["CMD", "test", "-f", "/opt/homelab/state/executor.heartbeat"]
+      interval: 30s
+      timeout: 5s
+      retries: 3
+      start_period: 5s
--- a/services/control-plane/pyproject.toml
+++ b/services/control-plane/pyproject.toml
@ -0,0 +1,19 @@
+[build-system]
+requires = ["setuptools>=68"]
+build-backend = "setuptools.build_meta"
+
+[project]
+name = "control-plane"
+version = "0.1.0"
+requires-python = ">=3.11"
+dependencies = [
+    "pyyaml>=6.0",
+]
+
+[project.optional-dependencies]
+dev = [
+    "pytest>=8.1",
+]
+
+[tool.pytest.ini_options]
+testpaths = ["tests"]
--- a/services/control-plane/src/executor.py
+++ b/services/control-plane/src/executor.py
@ -0,0 +1,246 @@
+import os
+import json
+import time
+import logging
+import subprocess
+from pathlib import Path
+
+
+def _atomic_write_json(path: Path, data) -> None:
+    """Write JSON atomically: write to a sibling .tmp, fsync, then os.replace."""
+    tmp = path.with_suffix(".tmp")
+    with open(tmp, "w") as f:
+        json.dump(data, f, indent=2)
+        f.flush()
+        os.fsync(f.fileno())
+    os.replace(tmp, path)
+
+# Constants and Paths
+RUNTIME_PATH = os.getenv("RUNTIME_PATH", "/opt/homelab")
+ACTIONS_DIR = Path(RUNTIME_PATH) / "actions"
+REPO_ROOT = Path(os.getenv("REPO_ROOT", "/repo"))
+
+# SSH configuration
+# SSH_USER can be overridden per-deployment environment.
+SSH_USER = os.getenv("SSH_USER", "oskar")
+SSH_OPTIONS = [
+    "-o", "StrictHostKeyChecking=no",
+    "-o", "ConnectTimeout=10",
+    "-o", "BatchMode=yes",
+]
+
+# Logging setup
+logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
+logger = logging.getLogger("executor")
+
+
+class Executor:
+    def __init__(self):
+        self._ensure_dirs()
+
+    def _ensure_dirs(self):
+        for s in ["approved", "running", "completed", "failed", "rejected"]:
+            (ACTIONS_DIR / s).mkdir(parents=True, exist_ok=True)
+
+    def process_actions(self):
+        # Update heartbeat
+        heartbeat_file = ACTIONS_DIR.parent / "state" / "executor.heartbeat"
+        try:
+            heartbeat_file.touch()
+        except Exception as e:
+            logger.error(f"Failed to touch heartbeat file: {e}")
+
+        approved_dir = ACTIONS_DIR / "approved"
+        action_files = sorted(approved_dir.glob("*.json"))
+
+        for action_file in action_files:
+            self._execute_action(action_file)
+
+    def _execute_action(self, action_file):
+        action_id = action_file.stem
+        logger.info(f"Executing action: {action_id}")
+
+        # Move to running
+        running_path = ACTIONS_DIR / "running" / f"{action_id}.json"
+        try:
+            with open(action_file, "r") as f:
+                data = json.load(f)
+            data["status"] = "running"
+            data["started_at"] = time.time()
+            _atomic_write_json(running_path, data)
+            action_file.unlink()
+        except Exception as e:
+            logger.error(f"Failed to move {action_id} to running: {e}")
+            return
+
+        # Dispatch by action type
+        success = False
+        error_msg = ""
+        try:
+            action_type = data.get("type")
+            node = data.get("node")
+            service = data.get("service")
+
+            if action_type == "redeploy":
+                # Full service redeploy via the repo deploy script
+                cmd = [
+                    str(REPO_ROOT / "scripts" / "deploy" / "deploy-node.sh"),
+                    node,
+                    service
+                ]
+                logger.info(f"Running command: {' '.join(cmd)}")
+                result = subprocess.run(cmd, capture_output=True, text=True, cwd=str(REPO_ROOT))
+                if result.returncode == 0:
+                    success = True
+                else:
+                    success = False
+                    error_msg = result.stderr or result.stdout
+
+            elif action_type == "container_restart":
+                # Lightweight restart: SSH to node and docker restart the container.
+                # container_name is set by the supervisor; falls back to service name.
+                container_name = data.get("container_name") or service
+                success, error_msg = self._execute_container_restart(node, container_name)
+
+            elif action_type == "disk_cleanup":
+                # Operator-approved aggressive Docker cleanup (image prune -a +
+                # volume prune). Commands come from the action payload so the
+                # supervisor controls exactly what runs; the executor adds a
+                # safety check to reject anything touching protected paths.
+                payload = data.get("payload", {})
+                success, error_msg = self._execute_disk_cleanup(node, payload)
+
+            elif action_type == "alert_only":
+                # Operator acknowledged the alert; no automated execution needed.
+                success = True
+
+            else:
+                success = False
+                error_msg = f"Unknown action type: {action_type}"
+
+        except Exception as e:
+            success = False
+            error_msg = str(e)
+
+        # Move to completed/failed
+        target_status = "completed" if success else "failed"
+        target_path = ACTIONS_DIR / target_status / f"{action_id}.json"
+        try:
+            data["status"] = target_status
+            data["finished_at"] = time.time()
+            if not success:
+                data["error"] = error_msg
+            _atomic_write_json(target_path, data)
+            running_path.unlink()
+            logger.info(f"Action {action_id} {target_status}")
+        except Exception as e:
+            logger.error(f"Failed to move {action_id} to {target_status}: {e}")
+
+    def _execute_container_restart(self, node, container_name, retry_delay=10):
+        """
+        SSH to the target node and run `docker restart <container_name>`.
+
+        Attempts the restart up to 2 times (initial + 1 retry). If the first
+        attempt fails, waits retry_delay seconds then tries once more before
+        declaring the action failed.
+
+        Returns (success: bool, error_msg: str).
+        """
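+        # Quote the name so the remote shell receives it as a single argument
+        # even if the action file contains spaces or shell metacharacters.
+        import shlex
+
+        cmd = [
+            "ssh",
+            *SSH_OPTIONS,
+            f"{SSH_USER}@{node}",
+            f"docker restart {shlex.quote(container_name)}",
+        ]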
+        logger.info(f"SSH container restart: {' '.join(cmd)}")
+
+        max_attempts = 2
+        last_error = ""
+
+        for attempt in range(1, max_attempts + 1):
+            result = subprocess.run(cmd, capture_output=True, text=True)
+
+            if result.returncode == 0:
+                logger.info(
+                    f"Container '{container_name}' on {node} restarted successfully "
+                    f"(attempt {attempt}/{max_attempts})"
+                )
+                return True, ""
+
+            last_error = (result.stderr or result.stdout).strip()
+            logger.warning(
+                f"container_restart attempt {attempt}/{max_attempts} failed "
+                f"for '{container_name}' on {node}: {last_error}"
+            )
+
+            if attempt < max_attempts:
+                logger.info(f"Retrying in {retry_delay}s...")
+                time.sleep(retry_delay)
+
+        logger.error(
+            f"container_restart exhausted all {max_attempts} attempts "
+            f"for '{container_name}' on {node}"
+        )
+        return False, last_error
+
+    def _execute_disk_cleanup(self, node: str, payload: dict):
+        """
+        SSH to the target node and run the operator-approved disk cleanup
+        commands from the action payload.
+
+        Safety invariants enforced here regardless of payload content:
+          - No command may reference /opt/homelab/data, /opt/homelab/config,
+            or /opt/homelab/state (application data and configuration).
+          - No command may contain the substring "rm -rf /".
+        The check is a plain substring match, so it guards against mistakes
+        rather than acting as a sandbox. If any command fails the check, the
+        entire action is rejected (nothing is run) and the reason is recorded.
+
+        Returns (success: bool, error_msg: str).
+        """
+        commands = payload.get("commands", [
+            "docker image prune -a -f",
+            "docker volume prune -f",
+        ])
+
+        # Safety gate: reject commands that touch protected paths
+        FORBIDDEN = [
+            "/opt/homelab/data",
+            "/opt/homelab/config",
+            "/opt/homelab/state",
+            "rm -rf /",
+        ]
+        for cmd in commands:
+            for forbidden in FORBIDDEN:
+                if forbidden in cmd:
+                    msg = f"Rejected: command contains forbidden pattern '{forbidden}': {cmd}"
+                    logger.error(msg)
+                    return False, msg
+
+        full_command = " && ".join(commands)
+        cmd = [
+            "ssh",
+            *SSH_OPTIONS,
+            f"{SSH_USER}@{node}",
+            full_command,
+        ]
+        logger.info(f"Disk cleanup on {node}: {full_command}")
+
+        result = subprocess.run(cmd, capture_output=True, text=True)
+        if result.returncode == 0:
+            logger.info(f"Disk cleanup on {node} succeeded")
+            return True, ""
+
+        error_msg = (result.stderr or result.stdout).strip()
+        logger.error(f"Disk cleanup on {node} failed: {error_msg}")
+        return False, error_msg
+
+    def loop(self, interval=10):
+        logger.info("Starting executor loop")
+        while True:
+            self.process_actions()
+            time.sleep(interval)
+
+
+if __name__ == "__main__":
+    executor = Executor()
+    executor.loop()
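
The executor above consumes a file-based queue: it picks up `*.json` files from `actions/approved/`, reads `type`, `node`, `service`, and optionally `container_name` or `payload`, then moves each file through `running/` to `completed/` or `failed/`. As a rough illustration of the producer side, the sketch below writes an action file with the same write-then-rename pattern as `_atomic_write_json`. `enqueue_action`, the action ID, and the example field values are hypothetical; how actions actually reach `approved/` (supervisor, operator UI) is not shown in this part of the diff.

```python
import json
import os
import tempfile
from pathlib import Path


def enqueue_action(actions_dir: Path, action_id: str, action: dict) -> Path:
    """Atomically place an action file in <actions_dir>/approved/."""
    approved = actions_dir / "approved"
    approved.mkdir(parents=True, exist_ok=True)
    target = approved / f"{action_id}.json"
    # Write to a sibling .tmp first; the executor only globs *.json, so it
    # never sees a half-written file.
    tmp = target.with_suffix(".tmp")
    with open(tmp, "w") as f:
        json.dump(action, f, indent=2)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, target)
    return target


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        path = enqueue_action(
            Path(d),
            "example-0001",  # illustrative ID, not a format used by the repo
            {
                "type": "container_restart",
                "node": "piha",
                "service": "redis",
                "container_name": "redis",
            },
        )
        print(path.name)
        print(json.loads(path.read_text())["type"])
```

Because the rename is the last step, a crash mid-write leaves only a stray `.tmp` file, which the executor ignores.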
--- a/services/control-plane/src/index.html
+++ b/services/control-plane/src/index.html
@ -0,0 +1,701 @@
+<!doctype html>
+<html lang="en">
+<head>
+  <meta charset="utf-8">
+  <meta name="viewport" content="width=device-width, initial-scale=1">
+  <title>Operator Control Plane</title>
+  <style>
+    :root {
+      --bg-color: #0a0c0e;
+      --sidebar-color: #14171a;
+      --card-color: #1c2024;
+      --border-color: #2a3540;
+      --text-color: #e7edf3;
+      --text-muted: #94a3b8;
+      --accent-color: #3eaf7c;
+      --nominal: #3eaf7c;
+      --degraded: #e7c000;
+      --unstable: #e67e22;
+      --reconciling: #3498db;
+      --error: #c0392b;
+      --safe: #3eaf7c;
+      --guarded: #e67e22;
+      --dangerous: #c0392b;
+    }
+
+    body {
+      margin: 0;
+      font-family: 'Inter', system-ui, -apple-system, sans-serif;
+      background: var(--bg-color);
+      color: var(--text-color);
+      display: flex;
+      height: 100vh;
+      overflow: hidden;
+    }
+
+    /* Sidebar */
+    .sidebar {
+      width: 240px;
+      background: var(--sidebar-color);
+      border-right: 1px solid var(--border-color);
+      display: flex;
+      flex-direction: column;
+      flex-shrink: 0;
+    }
+
+    .sidebar-header {
+      padding: 24px;
+      font-weight: 800;
+      font-size: 14px;
+      letter-spacing: 0.1em;
+      color: var(--accent-color);
+      border-bottom: 1px solid var(--border-color);
+    }
+
+    .nav-list {
+      list-style: none;
+      padding: 12px 0;
+      margin: 0;
+      flex-grow: 1;
+    }
+
+    .nav-item {
+      padding: 12px 24px;
+      cursor: pointer;
+      font-size: 14px;
+      color: var(--text-muted);
+      transition: all 0.2s;
+      display: flex;
+      align-items: center;
+      gap: 12px;
+    }
+
+    .nav-item:hover {
+      background: rgba(255, 255, 255, 0.05);
+      color: var(--text-color);
+    }
+
+    .nav-item.active {
+      background: rgba(62, 175, 124, 0.1);
+      color: var(--accent-color);
+      border-left: 3px solid var(--accent-color);
+    }
+
+    .sidebar-footer {
+      padding: 16px;
+      border-top: 1px solid var(--border-color);
+      font-size: 12px;
+    }
+
+    /* Content Area */
+    .main-content {
+      flex-grow: 1;
+      display: flex;
+      flex-direction: column;
+      overflow: hidden;
+    }
+
+    header {
+      height: 64px;
+      border-bottom: 1px solid var(--border-color);
+      display: flex;
+      align-items: center;
+      padding: 0 24px;
+      justify-content: space-between;
+      background: var(--bg-color);
+    }
+
+    .view-title {
+      font-size: 18px;
+      font-weight: 600;
+    }
+
+    .content-scroll {
+      flex-grow: 1;
+      overflow-y: auto;
+      padding: 24px;
+    }
+
+    /* Cards & Grids */
+    .grid {
+      display: grid;
+      grid-template-columns: repeat(auto-fill, minmax(350px, 1fr));
+      gap: 20px;
+    }
+
+    .card {
+      background: var(--card-color);
+      border: 1px solid var(--border-color);
+      padding: 20px;
+      border-radius: 4px;
+      position: relative;
+    }
+
+    .card-header {
+      display: flex;
+      justify-content: space-between;
+      align-items: center;
+      margin-bottom: 16px;
+    }
+
+    .card-title {
+      font-weight: 700;
+      font-size: 16px;
+    }
+
+    /* Status Badges */
+    .badge {
+      padding: 4px 8px;
+      border-radius: 4px;
+      font-size: 11px;
+      font-weight: 700;
+      text-transform: uppercase;
+    }
+
+    .status-nominal { background: rgba(62, 175, 124, 0.1); color: var(--nominal); }
+    .status-degraded { background: rgba(231, 192, 0, 0.1); color: var(--degraded); }
+    .status-unstable { background: rgba(230, 126, 34, 0.1); color: var(--unstable); }
+    .status-reconciling { background: rgba(52, 152, 219, 0.1); color: var(--reconciling); }
+    .status-error { background: rgba(192, 57, 43, 0.1); color: var(--error); }
+
+    /* Timeline */
+    .timeline {
+      display: flex;
+      flex-direction: column;
+      gap: 12px;
+    }
+
+    .event {
+      padding: 12px;
+      border-left: 2px solid var(--border-color);
+      background: rgba(255, 255, 255, 0.02);
+      font-family: ui-monospace, monospace;
+      font-size: 13px;
+    }
+
+    .event.high { border-left-color: var(--error); }
+    .event.medium { border-left-color: var(--unstable); }
+    .event.low { border-left-color: var(--nominal); }
+
+    .event-header {
+      display: flex;
+      justify-content: space-between;
+      margin-bottom: 4px;
+      color: var(--text-muted);
+    }
+
+    /* Forms & Inputs */
+    .controls {
+      display: flex;
+      gap: 12px;
+      margin-top: 20px;
+    }
+
+    input, button {
+      background: var(--card-color);
+      border: 1px solid var(--border-color);
+      color: var(--text-color);
+      padding: 8px 16px;
+      font-size: 14px;
+      border-radius: 4px;
+    }
+
+    button {
+      cursor: pointer;
+      font-weight: 600;
+    }
+
+    button:hover { background: var(--border-color); }
+
+    .btn-primary { background: var(--accent-color); color: white; border: none; }
+    .btn-primary:hover { background: #359b6d; }
+
+    /* Utility */
+    .hidden { display: none !important; }
+    .mono { font-family: ui-monospace, monospace; }
+    .label { color: var(--text-muted); font-size: 12px; margin-bottom: 4px; }
+    .value { font-weight: 500; margin-bottom: 12px; }
+
+    .risk-safe { background: rgba(62, 175, 124, 0.1); color: var(--safe); }
+    .risk-guarded { background: rgba(230, 126, 34, 0.1); color: var(--guarded); }
+    .risk-dangerous { background: rgba(192, 57, 43, 0.1); color: var(--dangerous); }
+
+  </style>
+</head>
+<body>
+  <aside class="sidebar">
+    <div class="sidebar-header">HOMELAB OPERATOR</div>
+    <ul class="nav-list">
+      <li class="nav-item active" onclick="showView('dashboard', this)">
+        <span>Dashboard</span>
+      </li>
+      <li class="nav-item" onclick="showView('actions', this)">
+        <span>Action Queue</span>
+      </li>
+      <li class="nav-item" onclick="showView('nodes', this)">
+        <span>Nodes</span>
+      </li>
+      <li class="nav-item" onclick="showView('services', this)">
+        <span>Services</span>
+      </li>
+      <li class="nav-item" onclick="showView('deployments', this)">
+        <span>Deployments</span>
+      </li>
+      <li class="nav-item" onclick="showView('topology', this)">
+        <span>Topology</span>
+      </li>
+      <li class="nav-item" onclick="showView('events', this)">
+        <span>Events</span>
+      </li>
+      <li class="nav-item" onclick="showView('correlation', this)">
+        <span>Correlation</span>
+      </li>
+      <li class="nav-item" onclick="showView('recommendations', this)">
+        <span>Recommendations</span>
+      </li>
+      <li class="nav-item" onclick="showView('settings', this)">
+        <span>Settings</span>
+      </li>
+    </ul>
+    <div class="sidebar-footer">
+      <div id="summary-status">System Status: Loading...</div>
+    </div>
+  </aside>
+
+  <main class="main-content">
+    <div id="stale-banner" class="hidden" style="background:var(--error); color:white; padding:8px 24px; font-weight:bold; font-size:12px; text-align:center; letter-spacing:0.05em">
+      RUNTIME STATE IS STALE
+    </div>
+    <header>
+      <div style="display:flex; align-items:center; gap:20px">
+        <div class="view-title" id="current-view-title">Dashboard</div>
+        <select id="operator-mode" onchange="setOperatorMode(this.value)" style="background:var(--sidebar-color); border:1px solid var(--border-color); color:var(--accent-color); font-weight:bold; font-size:12px; padding:4px 8px">
+          <option value="observe">OBSERVE</option>
+          <option value="recommend">RECOMMEND</option>
+          <option value="approval" selected>APPROVAL</option>
+          <option value="autonomous">AUTONOMOUS</option>
+          <option value="maintenance">MAINTENANCE</option>
+        </select>
+      </div>
+      <div class="header-actions">
+        <button onclick="refreshData()">Refresh</button>
+      </div>
+    </header>
+
+    <div class="content-scroll">
+      <!-- Dashboard View -->
+      <div id="view-dashboard" class="view">
+        <div class="grid">
+          <div class="card">
+            <div class="card-title">System Overview</div>
+            <div id="dashboard-summary" style="margin-top:20px"></div>
+          </div>
+          <div class="card">
+            <div class="card-title">Pending Actions</div>
+            <div id="dashboard-actions-summary" style="margin-top:20px"></div>
+          </div>
+          <div class="card">
+            <div class="card-title">Active Incidents</div>
+            <div id="dashboard-incidents" style="margin-top:20px"></div>
+          </div>
+        </div>
+      </div>
+
+      <!-- Actions View -->
+      <div id="view-actions" class="view hidden">
+        <div style="display:grid; grid-template-columns: 1fr 1fr; gap:24px">
+          <div>
+            <h3>Pending Approval</h3>
+            <div id="actions-pending" class="timeline"></div>
+          </div>
+          <div>
+            <h3>Active / History</h3>
+            <div id="actions-history" class="timeline"></div>
+          </div>
+        </div>
+      </div>
+
+      <!-- Nodes View -->
+      <div id="view-nodes" class="view hidden">
+        <div class="grid" id="nodes-list"></div>
+      </div>
+
+      <!-- Services View -->
+      <div id="view-services" class="view hidden">
+        <div class="grid" id="services-list"></div>
+      </div>
+
+      <!-- Deployments View -->
+      <div id="view-deployments" class="view hidden">
+        <div class="grid" id="deployments-list"></div>
+      </div>
+
+      <!-- Topology View -->
+      <div id="view-topology" class="view hidden">
+        <div class="card" style="min-height:500px">
+          <div class="card-title">Runtime Topology</div>
+          <div id="topology-map" style="margin-top:20px; display:flex; flex-wrap:wrap; gap:40px; justify-content:center"></div>
+        </div>
+      </div>
+
+      <!-- Events View -->
+      <div id="view-events" class="view hidden">
+        <div class="timeline" id="events-timeline"></div>
+      </div>
+
+      <!-- Correlation View -->
+      <div id="view-correlation" class="view hidden">
+        <div id="correlation-chains" class="grid"></div>
+      </div>
+
+      <!-- Recommendations View -->
+      <div id="view-recommendations" class="view hidden">
+        <div class="grid" id="recommendations-list"></div>
+      </div>
+
+      <!-- Settings View -->
+      <div id="view-settings" class="view hidden">
+        <div class="card">
+          <div class="card-title">Configuration</div>
+          <div id="settings-content" style="margin-top:20px"></div>
+        </div>
+      </div>
+    </div>
+  </main>
+
+  <script>
+    let currentView = 'dashboard';
+    const pollInterval = 5000;
+
+    function showView(viewId, el) {
+      document.querySelectorAll('.view').forEach(v => v.classList.add('hidden'));
+      document.getElementById('view-' + viewId).classList.remove('hidden');
+      document.querySelectorAll('.nav-item').forEach(i => i.classList.remove('active'));
+      if (el) el.classList.add('active');
+      currentView = viewId;
+      document.getElementById('current-view-title').textContent = viewId.charAt(0).toUpperCase() + viewId.slice(1);
+      refreshData();
+    }
+
+    async function fetchData(endpoint) {
+      try {
+        const res = await fetch(endpoint, {cache: 'no-store'});
+        return await res.json();
+      } catch (e) {
+        console.error('Fetch error:', endpoint, e);
+        return null;
+      }
+    }
+
+    async function postData(endpoint, data) {
+      try {
+        const res = await fetch(endpoint, {
+          method: 'POST',
+          headers: {'Content-Type': 'application/json'},
+          body: JSON.stringify(data)
+        });
+        return await res.json();
+      } catch (e) {
+        console.error('Post error:', endpoint, e);
+        return null;
+      }
+    }
+
+    async function mutateAction(id, status) {
+      const res = await postData('/action/mutate', {id, status});
+      if (res && res.status === 'ok') {
+        refreshData();
+      } else {
+        alert('Mutation failed');
+      }
+    }
+
+    async function setOperatorMode(mode) {
+      console.log('Operator mode set to:', mode);
+      const res = await postData('/mode', {mode});
+      if (res && res.status === 'ok') {
+          console.log('Mode updated successfully');
+      }
+    }
+
+    function formatTime(ts) {
+      if (!ts) return 'N/A';
+      return new Date(ts * 1000).toLocaleString();
+    }
+
+    function getStatusClass(status) {
+      status = (status || '').toLowerCase();
+      if (['nominal', 'healthy', 'ok', 'up'].includes(status)) return 'status-nominal';
+      if (['degraded', 'warning'].includes(status)) return 'status-degraded';
+      if (['unstable'].includes(status)) return 'status-unstable';
+      if (['reconciling'].includes(status)) return 'status-reconciling';
+      if (['error', 'down', 'failed'].includes(status)) return 'status-error';
+      return '';
+    }
+
+    async function refreshData() {
+      // Refresh summary always
+      const summary = await fetchData('/summary');
+      if (summary) {
+        const statusEl = document.getElementById('summary-status');
+        statusEl.textContent = `System Status: ${(summary.status || 'unknown').toUpperCase()}`;
+        statusEl.className = 'sidebar-footer ' + getStatusClass(summary.status);
+        
+        // Handle stale state
+        const staleBanner = document.getElementById('stale-banner');
+        if (summary.stale) {
+            staleBanner.classList.remove('hidden');
+            staleBanner.textContent = `CRITICAL: Runtime state is STALE (Last update: ${formatTime(summary.last_update)})`;
+        } else {
+            staleBanner.classList.add('hidden');
+        }
+
+        if (currentView === 'dashboard') {
+          const dashSummary = document.getElementById('dashboard-summary');
+          dashSummary.innerHTML = `
+            <div class="label">Nodes</div><div class="value">${summary.node_count}</div>
+            <div class="label">Services</div><div class="value">${summary.service_count}</div>
+            <div class="label">Last Update</div><div class="value">${formatTime(summary.last_update)}</div>
+          `;
+        }
+      }
+
+      if (currentView === 'dashboard' || currentView === 'actions') {
+          const actions = await fetchData('/actions');
+          if (actions) {
+              if (currentView === 'dashboard') {
+                  const dashActions = document.getElementById('dashboard-actions-summary');
+                  const pendingCount = actions.pending.length;
+                  dashActions.innerHTML = `
+                    <div class="label">Pending</div><div class="value" style="color:var(--guarded)">${pendingCount}</div>
+                    <div class="label">Running</div><div class="value" style="color:var(--reconciling)">${actions.running.length}</div>
+                  `;
+              }
+              if (currentView === 'actions') {
+                  const pendingEl = document.getElementById('actions-pending');
+                  const historyEl = document.getElementById('actions-history');
+                  
+                  pendingEl.innerHTML = actions.pending.map(a => `
+                    <div class="card" style="margin-bottom:12px">
+                        <div class="card-header">
+                            <div class="card-title">${(a.action_type || a.type || 'unknown').toUpperCase()}</div>
+                            <span class="badge risk-${a.risk_level}">${a.risk_level}</span>
+                        </div>
+                        <p>${a.description || a.action_type || 'No description'}</p>
+                        <div class="label">Target</div><div class="value">${a.node || (a.target && a.target.node) || 'unknown'} ${(a.service || (a.target && a.target.service)) || ''}</div>
+                        <div class="label">Confidence</div><div class="value">${Math.round((a.confidence || 0)*100)}%</div>
+                        <div class="controls">
+                            <button class="btn-primary" onclick="mutateAction('${a.id}', 'approved')">Approve</button>
+                            <button onclick="mutateAction('${a.id}', 'rejected')">Reject</button>
+                        </div>
+                    </div>
+                  `).join('') || 'No pending actions.';
+
+                  const history = [...actions.approved, ...actions.running, ...actions.completed, ...actions.failed, ...actions.rejected];
+                  historyEl.innerHTML = history.sort((a,b) => (b.timestamp || b.updated_at || 0) - (a.timestamp || a.updated_at || 0)).map(a => `
+                    <div class="event">
+                        <div class="event-header">
+                            <span>${(a.action_type || a.type || 'unknown').toUpperCase()}</span>
+                            <span class="badge ${getStatusClass(a.status)}">${a.status}</span>
+                        </div>
+                        <div>${a.description || a.action_type || 'No description'}</div>
+                        <small>${formatTime(a.timestamp || a.updated_at)} | Target: ${a.node || (a.target && a.target.node) || 'unknown'}</small>
+                        ${a.status === 'approved' ? `<div class="controls"><button class="btn-primary" onclick="mutateAction('${a.id}', 'running')">Execute</button></div>` : ''}
+                        ${a.transition_history ? `
+                            <div style="margin-top:8px; font-size:10px; color:var(--text-muted)">
+                                <strong>Trace:</strong> ${a.transition_history.map(h => `${h.from}->${h.to}`).join(' → ')}
+                            </div>
+                        ` : ''}
+                    </div>
+                  `).join('') || 'No history.';
+              }
+          }
+      }
+
+      if (currentView === 'dashboard' || currentView === 'events') {
+          const incidents = await fetchData('/incidents');
+          if (currentView === 'dashboard') {
+              const dashIncidents = document.getElementById('dashboard-incidents');
+              if (!incidents || incidents.length === 0) {
+                  dashIncidents.textContent = 'No active incidents.';
+              } else {
+                  dashIncidents.innerHTML = incidents.map(inc => `
+                      <div class="event ${inc.severity}">
+                          <strong>${(inc.severity || 'unknown').toUpperCase()}:</strong> ${inc.message}<br>
+                          <small>${formatTime(inc.timestamp)} | Node: ${inc.node}</small>
+                      </div>
+                  `).join('');
+              }
+          }
+      }
+
+      if (currentView === 'nodes') {
+        const nodes = await fetchData('/nodes');
+        const list = document.getElementById('nodes-list');
+        list.innerHTML = (nodes || []).map(node => `
+          <div class="card">
+            <div class="card-header">
+              <div class="card-title">${node.hostname}</div>
+              <span class="badge ${getStatusClass(node.health)}">${node.health}</span>
+            </div>
+            <div class="label">ID</div><div class="value mono">${node.id}</div>
+            <div class="label">Capabilities</div><div class="value">${node.capabilities.join(', ')}</div>
+            <div class="label">Connectivity</div><div class="value">${node.connectivity}</div>
+            <div class="label">Incidents (24h)</div><div class="value">${node.incidents}</div>
+            <div class="label">Last Seen</div><div class="value">${formatTime(node.last_seen)}</div>
+            <div class="label">Runtime Status</div><div class="value">${node.status}</div>
+          </div>
+        `).join('');
+      }
+
+      if (currentView === 'services') {
+        const services = await fetchData('/services');
+        const list = document.getElementById('services-list');
+        list.innerHTML = (services || []).map(svc => `
+          <div class="card">
+            <div class="card-header">
+              <div class="card-title">${svc.name}</div>
+              <span class="badge ${getStatusClass(svc.health)}">${svc.health}</span>
+            </div>
+            <div class="label">State (Desired/Actual)</div><div class="value">${svc.desired_state} / ${svc.actual_state}</div>
+            <div class="label">Deployment</div><div class="value">${svc.deployment_state}</div>
+            <div class="label">Dependencies</div><div class="value">${svc.dependencies.join(', ') || 'None'}</div>
+            <div class="label">Recommendations</div><div class="value">${svc.recommendations.join(', ') || 'None'}</div>
+          </div>
+        `).join('');
+      }
+
+      if (currentView === 'deployments') {
+        const deps = await fetchData('/deployments');
+        const list = document.getElementById('deployments-list');
+        list.innerHTML = (deps || []).map(dep => `
+          <div class="card">
+            <div class="card-header">
+              <div class="card-title">${dep.service}</div>
+              <span class="badge ${dep.status === 'failed' ? 'status-error' : 'status-reconciling'}">${dep.status}</span>
+            </div>
+            <div class="label">ID</div><div class="value mono">${dep.id}</div>
+            <div class="label">Stage</div><div class="value">${dep.stage}</div>
+            <div class="label">Diagnostics</div><div class="value">${dep.diagnostics || 'No data'}</div>
+            <div class="label">Resumable</div><div class="value">${dep.resumable ? 'Yes' : 'No'}</div>
+            ${dep.resumable ? '<button class="btn-primary">Resume</button>' : ''}
+          </div>
+        `).join('');
+      }
+
+      if (currentView === 'events') {
+        const events = await fetchData('/events');
+        const timeline = document.getElementById('events-timeline');
+        timeline.innerHTML = (events || []).map(ev => `
+          <div class="event ${ev.severity}">
+            <div class="event-header">
+              <span>${(ev.type || 'unknown').toUpperCase()}</span>
+              <span>${formatTime(ev.timestamp)}</span>
+            </div>
+            <div>${ev.message}</div>
+            <div class="label" style="margin-top:8px">Node: ${ev.node} ${ev.service ? '| Service: ' + ev.service : ''}</div>
+          </div>
+        `).join('');
+      }
+
+      if (currentView === 'recommendations') {
+        const recs = await fetchData('/recommendations');
+        const list = document.getElementById('recommendations-list');
+        list.innerHTML = (recs || []).map(rec => `
+          <div class="card">
+            <div class="card-header">
+              <div class="card-title">${rec.title}</div>
+              <span class="badge risk-${rec.risk_level}">${rec.risk_level}</span>
+            </div>
+            <p>${rec.description}</p>
+            <div class="label">Confidence</div><div class="value">${Math.round((rec.confidence || 0) * 100)}%</div>
+            <div class="label">Autonomous Eligible</div><div class="value">${rec.autonomous_eligible ? 'Yes' : 'No'}</div>
+            <div class="label">Blocked Actions</div><div class="value">${rec.blocked_actions.join(', ') || 'None'}</div>
+            <div class="controls">
+                <button class="btn-primary" ${rec.risk_level === 'dangerous' ? 'style="background:var(--dangerous)"' : ''}>Approve Action</button>
+            </div>
+          </div>
+        `).join('');
+      }
+
+      if (currentView === 'topology') {
+          const nodes = await fetchData('/nodes');
+          const services = await fetchData('/services');
+          const topMap = document.getElementById('topology-map');
+          if (nodes && services) {
+              topMap.innerHTML = nodes.map(node => {
+                  const nodeServices = services.filter(s => s.node === node.hostname || s.node === node.id);
+                  return `
+                    <div class="card" style="width:250px; border: 1px solid ${node.health === 'nominal' ? 'var(--border-color)' : 'var(--error)'}">
+                        <div class="card-header">
+                            <div class="card-title">${node.hostname}</div>
+                            <span class="badge ${getStatusClass(node.health)}">${node.health}</span>
+                        </div>
+                        <div class="label">Capabilities</div>
+                        <div class="value" style="font-size:11px">${node.capabilities.join(', ')}</div>
+                        <div class="label">Services</div>
+                        <div style="font-size:12px; margin-bottom:10px">
+                            ${nodeServices.length > 0 ? nodeServices.map(s => `
+                                <div style="display:flex; justify-content:space-between; margin-bottom:4px; padding:4px; background:rgba(255,255,255,0.03)">
+                                    <span>${s.name}</span>
+                                    <span class="${getStatusClass(s.health)}" style="font-size:10px">${s.health}</span>
+                                </div>
+                                ${s.dependencies.length > 0 ? `<div style="font-size:9px; color:var(--text-muted); margin-left:8px; margin-bottom:4px">dep: ${s.dependencies.join(', ')}</div>` : ''}
+                            `).join('') : '<div class="value">None</div>'}
+                        </div>
+                    </div>
+                  `;
+              }).join('');
+          }
+      }
+
+      if (currentView === 'correlation') {
+          const incidents = await fetchData('/incidents');
+          const actions = await fetchData('/actions');
+          const list = document.getElementById('correlation-chains');
+          if (incidents && actions) {
+              const allActions = Object.values(actions).flat();
+              list.innerHTML = incidents.map(inc => {
+                  const related = allActions.filter(a => a.correlation_chain && a.correlation_chain.includes(inc.id));
+                  return `
+                    <div class="card">
+                        <div class="card-header">
+                            <div class="card-title">Incident: ${inc.id || 'INC-001'}</div>
+                            <span class="badge status-error">Active</span>
+                        </div>
+                        <p>${inc.message}</p>
+                        <div class="label">Related Actions</div>
+                        ${related.map(a => `
+                            <div class="event" style="margin-top:5px">
+                                <strong>${a.action_type || a.type || 'unknown'}</strong> (${a.status})<br>
+                                <small>${a.description}</small>
+                            </div>
+                        `).join('') || '<div class="value">No actions yet</div>'}
+                    </div>
+                  `;
+              }).join('');
+          }
+      }
+      if (currentView === 'settings') {
+          const config = (await fetchData('/config')) || {};
+          const content = document.getElementById('settings-content');
+          content.innerHTML = `
+              <div class="label">Auto Mode</div>
+              <div class="value">${config.auto_mode ? 'Enabled' : 'Disabled'}</div>
+              <div class="label">Action Thresholds</div>
+              <div class="value mono">${JSON.stringify(config.action_thresholds, null, 2)}</div>
+              <div class="label">Telegram Integration</div>
+              <div class="value" style="color:var(--text-muted)">Ready for mobile approval flows. Hook: /api/v1/telegram/webhook</div>
+              <button onclick="alert('Settings update not implemented in this demo')">Edit Configuration</button>
+          `;
+      }
+    }
+
+    // Initial load
+    refreshData();
+    // Poll for updates
+    setInterval(refreshData, pollInterval);
+
+  </script>
+</body>
+</html>
--- a/services/control-plane/src/operator_ui.py
+++ b/services/control-plane/src/operator_ui.py
@ -0,0 +1,426 @@
+import heapq
+import json
+import os
+import re
+import time
+from datetime import datetime
+from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
+from pathlib import Path
+
+
+STATE_DIR = Path(os.getenv("HOMELAB_STATE_ROOT", "/opt/homelab/state"))
+EVENTS_DIR = Path(os.getenv("HOMELAB_EVENTS_ROOT", "/opt/homelab/events"))
+WORLD_DIR = Path(os.getenv("HOMELAB_WORLD_ROOT", "/opt/homelab/world"))
+ACTIONS_DIR = Path(os.getenv("HOMELAB_ACTIONS_ROOT", "/opt/homelab/actions"))
+CONFIG_DIR = Path(os.getenv("HOMELAB_CONFIG_ROOT", "/opt/homelab/config"))
+
+STATIC_DIR = Path(__file__).parent
+
+_EVENT_TS_RE = re.compile(r"-(\d{9,11})-")
+
+DEFAULT_CONFIG = {
+    "operator_mode": "approval",
+    "auto_mode": True,
+    "action_thresholds": {
+        "restart_ha": 0.8,
+        "check_network": 0.9,
+    },
+    "default_threshold": 0.9,
+    "allowed_auto_actions": ["restart_ha"],
+}
+
+
+def read_json_file(path, default=None):
+    if not path.exists():
+        return default if default is not None else []
+    try:
+        return json.loads(path.read_text())
+    except Exception:
+        return default if default is not None else []
+
+
+def get_config():
+    config_path = STATE_DIR / "operator-config.json"
+    if config_path.exists():
+        return read_json_file(config_path, DEFAULT_CONFIG)
+    return DEFAULT_CONFIG
+
+
+def save_config(config):
+    STATE_DIR.mkdir(parents=True, exist_ok=True)
+    (STATE_DIR / "operator-config.json").write_text(json.dumps(config, indent=2))
+
+
+EVENTS_MAX_AGE_HOURS = int(os.getenv("EVENTS_MAX_AGE_HOURS", "24"))
+EVENTS_MAX_COUNT = int(os.getenv("EVENTS_MAX_COUNT", "200"))
+
+
+def _node_health(info):
+    status = info.get("status", "unknown")
+    if status == "offline":
+        return "error"
+    if info.get("disk_pressure") == "high":
+        return "degraded"
+    if status == "online":
+        return "nominal"
+    return status
+
+
+def current_nodes():
+    """Return nodes as a list of dicts shaped for the UI.
+
+    The observer stores nodes as a keyed dict {node_name: {...}}.  The frontend
+    calls .map() which requires an array, so we convert here rather than change
+    the on-disk format (which the supervisor also reads).
+    """
+    raw = read_json_file(WORLD_DIR / "nodes.json", default={})
+    if isinstance(raw, list):
+        return raw
+    result = []
+    for name, info in raw.items():
+        result.append({
+            "id":            name,
+            "hostname":      name,
+            "health":        _node_health(info),
+            "status":        info.get("status", "unknown"),
+            "capabilities":  info.get("roles", []),
+            "connectivity":  "tailscale",
+            "incidents":     0,
+            "last_seen":     info.get("last_seen"),
+            "disk_usage_pct": info.get("disk_usage_pct"),
+            "mem_usage_pct":  info.get("mem_usage_pct"),
+            "cpu_usage_pct":  info.get("cpu_usage_pct"),
+            "disk_pressure":  info.get("disk_pressure"),
+        })
+    return result
+
+
+def current_services():
+    """Return services as a list of dicts shaped for the UI.
+
+    Observer stores services as {"node/service": {...}}.  Converted to a list
+    with the fields the services and topology views expect.
+    """
+    raw = read_json_file(WORLD_DIR / "services.json", default={})
+    if isinstance(raw, list):
+        return raw
+    result = []
+    for key, info in raw.items():
+        svc_status = info.get("status", "unknown")
+        result.append({
+            "id":               key,
+            "name":             info.get("service", key),
+            "node":             info.get("node", ""),
+            "health":           ("nominal" if svc_status == "healthy"
+                                 else ("error" if svc_status == "unhealthy"
+                                       else svc_status)),
+            "desired_state":    "running",
+            "actual_state":     svc_status,
+            "deployment_state": "deployed",
+            "dependencies":     [],
+            "recommendations":  [],
+            "last_check":       info.get("last_check"),
+            "incident_id":      info.get("incident_id"),
+        })
+    return result
+
+
+def current_deployments():
+    """Return deployments as a list sorted newest-first."""
+    raw = read_json_file(WORLD_DIR / "deployments.json", default={})
+    if isinstance(raw, list):
+        return raw
+    result = []
+    for dep_id, info in raw.items():
+        result.append({
+            "id":          dep_id,
+            "service":     info.get("service", ""),
+            "node":        info.get("node", ""),
+            "status":      info.get("status", "unknown"),
+            "stage":       info.get("status", "unknown"),
+            "diagnostics": info.get("last_error", ""),
+            "resumable":   info.get("status") == "failed",
+            "started_at":  info.get("started_at"),
+            "finished_at": info.get("finished_at"),
+        })
+    return sorted(result, key=lambda x: x.get("started_at") or 0, reverse=True)
+
+
+def current_incidents():
+    """Return active incidents as a list sorted most-recent-first.
+
+    Only incidents with status='active' are returned; resolved and cancelled
+    records are excluded so the dashboard reflects the current operational state.
+    """
+    raw = read_json_file(WORLD_DIR / "incidents.json", default={})
+    if isinstance(raw, list):
+        return [i for i in raw if i.get("status") == "active"]
+    result = []
+    for inc in raw.values():
+        if inc.get("status") != "active":
+            continue
+        # Synthesise a human-readable message if not stored (observer doesn't set one).
+        if "message" not in inc:
+            inc = dict(inc)
+            inc["message"] = (
+                f"{inc.get('service', '?')} on {inc.get('node', '?')} "
+                f"is {inc.get('trigger_type', 'unhealthy')}"
+            )
+        result.append(inc)
+    return sorted(result, key=lambda x: x.get("last_occurrence") or 0, reverse=True)
+
+
+def current_recommendations():
+    return read_json_file(WORLD_DIR / "recommendations.json")
+
+
+def current_summary():
+    path = WORLD_DIR / "runtime-summary.json"
+    summary = read_json_file(path, default={})
+    if summary:
+        last_update_val = summary.get("last_update")
+        if last_update_val:
+            try:
+                if isinstance(last_update_val, str):
+                    last_update = datetime.fromisoformat(last_update_val.replace('Z', '+00:00')).timestamp()
+                else:
+                    last_update = float(last_update_val)
+            except Exception:
+                last_update = os.path.getmtime(path)
+        else:
+            last_update = os.path.getmtime(path)
+        summary["last_update"] = last_update
+        summary["stale"] = (time.time() - last_update) > 60
+    return summary
+
+
+def _event_file_ts(p: Path) -> int:
+    """Extract epoch timestamp from event filename: evt-<node>-<ts>-<type>-<svc>.json"""
+    m = _EVENT_TS_RE.search(p.stem)
+    return int(m.group(1)) if m else 0
+
+
+def current_events():
+    """Return the EVENTS_MAX_COUNT most-recent events, sorted newest-first.
+
+    Event files are named evt-<node>-<epoch>-<type>-<svc>.json.  The directory
+    can contain hundreds of thousands of files (one file per event, written by
+    node-agent).  Loading every file on each request causes catastrophic RSS
+    growth — 242 k files ≈ 420 MB of Python objects + 100 MB JSON serialisation.
+
+    Fix: use heapq.nlargest to stream through file paths (O(N_files * log
+    EVENTS_MAX_COUNT) time, O(EVENTS_MAX_COUNT) memory), extracting the epoch from
+    the filename without opening any file.  Only the winning files are then read.
+    """
+    if not EVENTS_DIR.exists():
+        return []
+
+    cutoff = time.time() - EVENTS_MAX_AGE_HOURS * 3600
+
+    # Stream all paths through a bounded heap of EVENTS_MAX_COUNT entries; the full list is never materialised.
+    candidates = heapq.nlargest(
+        EVENTS_MAX_COUNT,
+        EVENTS_DIR.glob("**/*.json"),
+        key=_event_file_ts,
+    )
+
+    events = []
+    for f in candidates:
+        data = read_json_file(f)
+        if data and (data.get("timestamp") or 0) > cutoff:
+            data["_source"] = f.name
+            events.append(data)
+
+    return sorted(events, key=lambda x: x.get("timestamp") or 0, reverse=True)
+
+
+def current_actions():
+    actions = {}
+    statuses = ["pending", "approved", "running", "completed", "failed", "rejected"]
+    for status in statuses:
+        actions[status] = []
+        status_dir = ACTIONS_DIR / status
+        if status_dir.exists():
+            for f in status_dir.glob("*.json"):
+                data = read_json_file(f)
+                if data:
+                    # Inject id and status metadata for the UI
+                    data["id"] = data.get("action_id") or f.stem
+                    data["status"] = status
+                    actions[status].append(data)
+    return actions
+
+
+def mutate_action(action_id, target_status):
+    statuses = ["pending", "approved", "running", "completed", "failed", "rejected"]
+    if target_status not in statuses:
+        return False, f"Invalid target status: {target_status}"
+
+    # Find where the action is
+    source_path = None
+    current_status = None
+    for status in statuses:
+        p = ACTIONS_DIR / status / f"{action_id}.json"
+        if p.exists():
+            source_path = p
+            current_status = status
+            break
+
+    if not source_path:
+        return False, f"Action {action_id} not found"
+
+    target_dir = ACTIONS_DIR / target_status
+    target_dir.mkdir(parents=True, exist_ok=True)
+    target_path = target_dir / f"{action_id}.json"
+
+    try:
+        data = json.loads(source_path.read_text())
+        data["status"] = target_status
+        data["updated_at"] = time.time()
+
+        # Keep history of transitions
+        history = data.get("transition_history", [])
+        history.append({
+            "from": current_status,
+            "to": target_status,
+            "timestamp": time.time()
+        })
+        data["transition_history"] = history
+
+        target_path.write_text(json.dumps(data, indent=2))
+        if source_path != target_path:
+            source_path.unlink()
+        return True, "Success"
+    except Exception as e:
+        return False, str(e)
+
+
+def send_json(status, payload, handler):
+    body = (json.dumps(payload) + "\n").encode("utf-8")
+    handler.send_response(status)
+    handler.send_header("Content-Type", "application/json")
+    handler.send_header("Content-Length", str(len(body)))
+    handler.end_headers()
+    handler.wfile.write(body)
+
+
+class Handler(BaseHTTPRequestHandler):
+    def do_GET(self):
+        if self.path == "/config":
+            send_json(200, get_config(), self)
+            return
+
+        if self.path == "/nodes":
+            send_json(200, current_nodes(), self)
+            return
+
+        if self.path == "/services":
+            send_json(200, current_services(), self)
+            return
+
+        if self.path == "/deployments":
+            send_json(200, current_deployments(), self)
+            return
+
+        if self.path == "/incidents":
+            send_json(200, current_incidents(), self)
+            return
+
+        if self.path == "/recommendations":
+            send_json(200, current_recommendations(), self)
+            return
+
+        if self.path == "/summary":
+            send_json(200, current_summary(), self)
+            return
+
+        if self.path == "/events":
+            send_json(200, current_events(), self)
+            return
+
+        if self.path == "/actions":
+            send_json(200, current_actions(), self)
+            return
+
+        if self.path in ("/", "/index.html"):
+            body = (STATIC_DIR / "index.html").read_bytes()
+            self.send_response(200)
+            self.send_header("Content-Type", "text/html; charset=utf-8")
+            self.send_header("Content-Length", str(len(body)))
+            self.end_headers()
+            self.wfile.write(body)
+            return
+
+        self.send_error(404)
+
+    def do_POST(self):
+        if self.path not in (
+            "/config",
+            "/action/mutate",
+            "/mode",
+        ):
+            self.send_error(404)
+            return
+
+        length = int(self.headers.get("Content-Length", "0"))
+        raw_body = self.rfile.read(length).decode("utf-8")
+        try:
+            payload = json.loads(raw_body)
+        except json.JSONDecodeError:
+            self.send_error(400, "Invalid JSON")
+            return
+
+        if self.path == "/config":
+            config = get_config()
+            config.update(payload)
+            save_config(config)
+            send_json(200, {"status": "ok"}, self)
+            return
+
+        if self.path == "/mode":
+            mode = payload.get("mode")
+            if not mode:
+                self.send_error(400, "mode is required")
+                return
+            config = get_config()
+            config["operator_mode"] = mode
+            save_config(config)
+            send_json(200, {"status": "ok"}, self)
+            return
+
+        if self.path == "/action/mutate":
+            action_id = payload.get("id")
+            target = payload.get("status")
+            if not action_id or not target:
+                self.send_error(400, "id and status are required")
+                return
+            success, msg = mutate_action(action_id, target)
+            if success:
+                send_json(200, {"status": "ok"}, self)
+            else:
+                self.send_error(500, msg)
+            return
+
+    def log_message(self, format, *args):
+        return
+
+
+class OperatorHTTPServer(ThreadingHTTPServer):
+    # Daemon request threads are not tracked in ThreadingMixIn's internal
+    # _threads list (only non-daemon threads are kept, for joining at
+    # server_close), so dead Thread objects cannot accumulate over time.
+    # ThreadingHTTPServer already sets this; it is restated here explicitly.
+    daemon_threads = True
+
+
+if __name__ == "__main__":
+    # Ensure directories exist
+    for d in [STATE_DIR, EVENTS_DIR, WORLD_DIR, ACTIONS_DIR, CONFIG_DIR]:
+        d.mkdir(parents=True, exist_ok=True)
+    for s in ["pending", "approved", "running", "completed", "failed", "rejected"]:
+        (ACTIONS_DIR / s).mkdir(parents=True, exist_ok=True)
+
+    port = int(os.getenv("PORT", "8080"))
+    print(f"Operator Control Plane starting on 0.0.0.0:{port}")
+    server = OperatorHTTPServer(("0.0.0.0", port), Handler)
+    server.serve_forever()
--- a/services/control-plane/src/supervisor.py
+++ b/services/control-plane/src/supervisor.py
@ -0,0 +1,771 @@
+import os
+import json
+import time
+import logging
+import yaml
+from pathlib import Path
+
+
+def _atomic_write_json(path: Path, data) -> None:
+    """Write JSON atomically: write to a sibling .tmp, fsync, then os.replace."""
+    tmp = path.with_suffix(".tmp")
+    with open(tmp, "w") as f:
+        json.dump(data, f, indent=2)
+        f.flush()
+        os.fsync(f.fileno())
+    os.replace(tmp, path)
+
+
+# Constants and Paths
+RUNTIME_PATH = os.getenv("RUNTIME_PATH", "/opt/homelab")
+WORLD_DIR = Path(RUNTIME_PATH) / "world"
+ACTIONS_DIR = Path(RUNTIME_PATH) / "actions"
+EVENTS_DIR = Path(RUNTIME_PATH) / "events"
+REPO_ROOT = Path(os.getenv("REPO_ROOT", "/repo"))
+
+# Node alias map: maps alternative node names (as they appear in events/world state)
+# to canonical topology node names (as they appear in hosts/*/services.yaml and topology.yaml).
+# Override at runtime via NODE_ALIAS_MAP env var as a JSON string, e.g.:
+#   NODE_ALIAS_MAP='{"node-2": "chelsty", "node-1": "piha"}'
+_NODE_ALIAS_ENV = os.getenv("NODE_ALIAS_MAP", "{}")
+try:
+    NODE_ALIAS_MAP = json.loads(_NODE_ALIAS_ENV)
+except Exception:
+    NODE_ALIAS_MAP = {}
+
+# Event trigger types that should result in a lightweight container_restart
+# rather than a full redeploy. The container is present but not running,
+# or a dependency (MQTT) is unreachable — a restart is the right first step.
+CONTAINER_RESTART_TRIGGERS = {"containers_not_running", "mqtt_unreachable"}
+
+# Nodes where automatic disk_cleanup actions must NOT be generated.
+# On chelsty nodes disk fullness is overwhelmingly caused by Frigate recordings
+# or the HA database — Docker cleanup will not help and the operator must
+# decide explicitly (e.g. adjust Frigate retain policy or purge HA recorder).
+NO_DISK_CLEANUP_NODES = {"chelsty-infra", "chelsty-ha"}
+
+# ---------------------------------------------------------------------------
+# HA diagnostic event routing (ha-diag-agent events)
+# ---------------------------------------------------------------------------
+
+# ha_websocket_dead: HA WebSocket unresponsive → restart the homeassistant container.
+# Separate from CONTAINER_RESTART_TRIGGERS because these events are routed directly
+# from the events dir (not via the world-state drift loop) to avoid conflicts with
+# the stability-agent's independent container health tracking on the same service key.
+HA_CONTAINER_RESTART_EVENTS = {"ha_websocket_dead"}
+
+# Alert-only events — operator notification, no automated action.
+HA_ALERT_ONLY_EVENTS = {
+    "ha_integration_failed",
+    "ha_entity_unavailable_long",
+    "ha_automation_failing",
+    "ha_update_available",
+    "ha_recorder_lag",
+    "ha_system_health_degraded",
+}
+
+# Stable action-ID suffix for each alert-only type
+_HA_ALERT_ID_SUFFIX = {
+    "ha_integration_failed":       "integration-failed",
+    "ha_entity_unavailable_long":  "entity-unavailable",
+    "ha_automation_failing":       "automation-failing",
+    "ha_update_available":         "update-available",
+    "ha_recorder_lag":             "recorder-lag",
+    "ha_system_health_degraded":   "system-health-degraded",
+}
+
+# 30-min cooldown after a container_restart completes; prevents restart loops
+# when HA repeatedly fails to connect (e.g. bad config, slow startup).
+HA_WEBSOCKET_RESTART_COOLDOWN = 1800
+
+# 1-hour cooldown for alert-only events; avoids repeated Telegram noise for
+# persistent conditions (e.g. an entity that stays unavailable for hours).
+HA_ALERT_COOLDOWN = 3600
+
+# Suppress ha_* events if homeassistant had a containers_not_running incident
+# within this window — HA is in a planned restart/update and alerts would be noise.
+HA_TRANSITION_WINDOW = 300  # 5 minutes
+
+# When True, events that would generate container_restart are downgraded to alert_only
+# with a "[SHADOW MODE]" note. Safe default for initial deployment; set
+# HA_DIAG_SHADOW_MODE=false on the control-plane node when ready for live actions.
+HA_DIAG_SHADOW_MODE = os.getenv("HA_DIAG_SHADOW_MODE", "true").lower() == "true"
+
+# Logging setup
+logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
+logger = logging.getLogger("supervisor")
+
+
+class Supervisor:
+    def __init__(self):
+        self.desired_state = {"services": {}}
+        self.actual_state = {"services": {}, "nodes": {}, "incidents": {}}
+        # In-memory set of already-routed HA event IDs; prevents re-processing
+        # on each reconcile cycle. Grows to at most ~hundreds of entries/day.
+        self._ha_processed_event_ids: set = set()
+        self._ensure_dirs()
+        logger.info(
+            "shadow_mode=%s — HA container_restart actions %s",
+            HA_DIAG_SHADOW_MODE,
+            "downgraded to alert_only" if HA_DIAG_SHADOW_MODE else "enabled",
+        )
+
+    def _ensure_dirs(self):
+        ACTIONS_DIR.mkdir(parents=True, exist_ok=True)
+        (ACTIONS_DIR / "pending").mkdir(parents=True, exist_ok=True)
+
+    # ------------------------------------------------------------------
+    # Node name resolution
+    # ------------------------------------------------------------------
+
+    def _resolve_node(self, name):
+        """Resolve an event/world-state node name to its canonical topology name."""
+        return NODE_ALIAS_MAP.get(name, name)
+
+    # ------------------------------------------------------------------
+    # Container name lookup
+    # ------------------------------------------------------------------
+
+    def _get_container_name(self, service):
+        """
+        Determine the Docker container name for a service.
+        Parses container_name from the service's docker-compose.yml.
+        Falls back to the service name if not found.
+        """
+        compose_path = REPO_ROOT / "services" / service / "docker-compose.yml"
+        if compose_path.exists():
+            try:
+                with open(compose_path, "r") as f:
+                    compose = yaml.safe_load(f)
+                for svc_block in compose.get("services", {}).values():
+                    cname = svc_block.get("container_name")
+                    if cname:
+                        return cname
+            except Exception as e:
+                logger.warning(f"Could not parse docker-compose for {service}: {e}")
+        # Convention: container name matches service name
+        return service
+
+    # ------------------------------------------------------------------
+    # State loading
+    # ------------------------------------------------------------------
+
+    def _load_desired_state(self):
+        services = {}
+        hosts_dir = REPO_ROOT / "hosts"
+        if not hosts_dir.exists():
+            logger.warning(f"Hosts directory {hosts_dir} does not exist")
+            return
+
+        for host_dir in hosts_dir.iterdir():
+            if host_dir.is_dir():
+                svc_file = host_dir / "services.yaml"
+                if svc_file.exists():
+                    try:
+                        with open(svc_file, "r") as f:
+                            data = yaml.safe_load(f)
+                            host_name = data.get("host")
+                            for svc_name, svc_info in data.get("services", {}).items():
+                                svc_info = svc_info or {}
+                                # monitor: false — service is documented as desired but
+                                # intentionally excluded from supervisor action generation.
+                                # Use this when a service is not yet bootstrapped on an
+                                # offline/LTE node so the queue stays clean until it is.
+                                if svc_info.get("monitor") is False:
+                                    logger.debug(
+                                        f"Skipping {host_name}/{svc_name}: monitor=false"
+                                    )
+                                    continue
+                                svc_key = f"{host_name}/{svc_name}"
+                                services[svc_key] = {
+                                    "node": host_name,
+                                    "service": svc_name,
+                                    "desired": "running"
+                                }
+                    except Exception as e:
+                        logger.error(f"Failed to load {svc_file}: {e}")
+        self.desired_state["services"] = services
+
+    def _load_actual_state(self) -> bool:
+        """Load world state from disk.  Returns False if any file is unreadable
+        (empty / mid-write truncation), in which case actual_state is NOT updated
+        so the caller can skip this reconcile cycle rather than treating missing
+        data as a real drift signal."""
+        files = {
+            "services": WORLD_DIR / "services.json",
+            "nodes": WORLD_DIR / "nodes.json",
+            "incidents": WORLD_DIR / "incidents.json"
+        }
+        raw = {}
+        for key, path in files.items():
+            if path.exists():
+                try:
+                    with open(path, "r") as f:
+                        raw[key] = json.load(f)
+                except Exception as e:
+                    logger.warning(
+                        f"World state {path.name} unreadable (truncated write?): {e} "
+                        f"— skipping reconcile cycle, keeping last known state"
+                    )
+                    return False
+            else:
+                raw[key] = {}
+
+        # Normalize node names in services using alias map so that
+        # event-sourced names (e.g. "node-2") resolve to canonical
+        # topology names (e.g. "chelsty") before comparison with desired state.
+        normalized_services = {}
+        for svc_key, svc_info in raw.get("services", {}).items():
+            svc_info = dict(svc_info)
+            raw_node = svc_info.get("node", "")
+            canonical_node = self._resolve_node(raw_node)
+            if canonical_node != raw_node:
+                logger.debug(f"Resolved node alias: {raw_node} → {canonical_node}")
+                svc_info["node"] = canonical_node
+                svc_name = svc_info.get("service") or svc_key.split("/", 1)[-1]
+                svc_key = f"{canonical_node}/{svc_name}"
+            normalized_services[svc_key] = svc_info
+
+        # Normalize node names in incidents as well
+        normalized_incidents = {}
+        for inc_id, inc in raw.get("incidents", {}).items():
+            inc = dict(inc)
+            raw_node = inc.get("node", "")
+            inc["node"] = self._resolve_node(raw_node)
+            normalized_incidents[inc_id] = inc
+
+        self.actual_state["services"] = normalized_services
+        self.actual_state["nodes"] = raw.get("nodes", {})
+        self.actual_state["incidents"] = normalized_incidents
+        return True
+
+    # ------------------------------------------------------------------
+    # Incident helpers
+    # ------------------------------------------------------------------
+
+    def _get_incident_trigger(self, svc_key):
+        """
+        Return the trigger_type of the active incident for a service, or None.
+        trigger_type is set by the observer when it creates an incident from
+        a specific event type (e.g. 'containers_not_running', 'mqtt_unreachable').
+        """
+        svc_info = self.actual_state["services"].get(svc_key, {})
+        incident_id = svc_info.get("incident_id")
+        if not incident_id:
+            return None
+        incident = self.actual_state["incidents"].get(incident_id, {})
+        if incident.get("status") == "active":
+            return incident.get("trigger_type")
+        return None
+
+    # ------------------------------------------------------------------
+    # Reconciliation loop
+    # ------------------------------------------------------------------
+
+    def reconcile(self):
+        # Update heartbeat
+        heartbeat_file = WORLD_DIR.parent / "state" / "supervisor.heartbeat"
+        try:
+            heartbeat_file.touch()
+        except Exception as e:
+            logger.error(f"Failed to touch heartbeat file: {e}")
+
+        self._load_desired_state()
+        if not self._load_actual_state():
+            return  # world state unreadable this cycle — skip to avoid false drift
+
+        drifts = []
+
+        # 1. Check for missing or unhealthy services
+        for svc_key, desired_info in self.desired_state["services"].items():
+            actual_info = self.actual_state["services"].get(svc_key)
+
+            if not actual_info:
+                drifts.append({
+                    "type": "missing_service",
+                    "svc_key": svc_key,
+                    "node": desired_info["node"],
+                    "service": desired_info["service"],
+                    "trigger_type": None,
+                })
+            elif actual_info.get("status") != "healthy":
+                trigger_type = self._get_incident_trigger(svc_key)
+                drifts.append({
+                    "type": "unhealthy_service",
+                    "svc_key": svc_key,
+                    "node": desired_info["node"],
+                    "service": desired_info["service"],
+                    "status": actual_info.get("status"),
+                    "trigger_type": trigger_type,
+                })
+
+        # 2. Generate service-level recommendations
+        for drift in drifts:
+            self._generate_recommendation(drift)
+
+        # 3. Generate node-level recommendations (disk pressure)
+        for node_name, node_info in self.actual_state["nodes"].items():
+            if node_name in NO_DISK_CLEANUP_NODES:
+                continue
+            if node_info.get("disk_pressure") == "high":
+                self._generate_disk_cleanup_recommendation(node_name)
+
+        # 4. Cancel pending actions whose drift has been resolved.
+        #    When a service becomes healthy again (because node-agent emits
+        #    service_healthy and the observer updates services.json), any
+        #    previously queued redeploy/container_restart action for that
+        #    service is no longer needed. Move it to "cancelled/" so the
+        #    operator can see it was auto-resolved rather than silently dropped.
+        self._cancel_resolved_pending_actions()
+
+        # 5. Route HA diagnostic events emitted by ha-diag-agent.
+        #    Processed directly from the events directory — not via the world-state
+        #    drift loop — to avoid conflicts with stability-agent's independent
+        #    container health tracking for the homeassistant service.
+        self._process_ha_events()
+
+    # ------------------------------------------------------------------
+    # Recommendation generation
+    # ------------------------------------------------------------------
+
+    def _generate_recommendation(self, drift):
+        node = drift["node"]
+        service = drift["service"]
+        trigger_type = drift.get("trigger_type")
+
+        # Choose action type first so we can build the stable, deterministic ID.
+        # Stable IDs mean reconcile is truly idempotent: the same drift always
+        # produces the same filename, so we never create duplicates even across
+        # restarts of the supervisor.
+        if trigger_type in CONTAINER_RESTART_TRIGGERS:
+            action_id = f"container-restart-{node}-{service}"
+        else:
+            action_id = f"redeploy-{node}-{service}"
+
+        # Skip if an action for this ID is already live in any active state
+        # (pending → approved → running).  This prevents re-creation after
+        # a human approves an action that hasn't executed yet.
+        for state in ("pending", "approved", "running"):
+            if (ACTIONS_DIR / state / f"{action_id}.json").exists():
+                logger.debug(f"Skipping {action_id}: already in state '{state}'")
+                return
+
+        if trigger_type in CONTAINER_RESTART_TRIGGERS:
+            # Lightweight remediation: the container exists but is not running
+            # (containers_not_running) or its MQTT dependency is unreachable
+            # (mqtt_unreachable). A docker restart is sufficient and low-risk.
+            container_name = self._get_container_name(service)
+            action = {
+                "action_id": action_id,
+                "timestamp": time.time(),
+                "type": "container_restart",
+                "node": node,
+                "service": service,
+                "container_name": container_name,
+                "risk_level": "low",
+                "confidence": 0.95,
+                "description": (
+                    f"Restart container '{container_name}' on {node} "
+                    f"(service: {service}, reason: {trigger_type})"
+                ),
+                "status": "pending",
+                "payload": {
+                    "reason": trigger_type,
+                    "svc_key": drift["svc_key"],
+                },
+            }
+        else:
+            # Full redeploy: container is running but service is broken,
+            # or the cause is unknown / not a simple restart candidate.
+            action = {
+                "action_id": action_id,
+                "timestamp": time.time(),
+                "type": "redeploy",
+                "node": node,
+                "service": service,
+                "risk_level": "guarded",
+                "confidence": 0.9,
+                "description": f"Redeploy {service} on {node} due to {drift['type']}",
+                "status": "pending",
+                "payload": {
+                    "reason": drift["type"],
+                    "svc_key": drift["svc_key"],
+                },
+            }
+
+        action_path = ACTIONS_DIR / "pending" / f"{action_id}.json"
+        try:
+            _atomic_write_json(action_path, action)
+            logger.info(
+                f"Generated recommendation: {action_id} "
+                f"(type={action['type']}, risk={action['risk_level']})"
+            )
+        except Exception as e:
+            logger.error(f"Failed to save recommendation {action_id}: {e}")
+
+    def _generate_disk_cleanup_recommendation(self, node: str):
+        """
+        Generate a disk_cleanup action when node-agent reports critical disk
+        pressure (>85 %) on a node that supports automated Docker cleanup.
+
+        This is an OPERATOR-APPROVED action (risk=guarded): it runs
+        `docker image prune -a -f` and `docker volume prune -f`, which are
+        more aggressive than the safe auto-cleanup the node-agent runs itself.
+
+        Nodes in NO_DISK_CLEANUP_NODES never reach this method (filtered in
+        reconcile) because their disk fullness is caused by application data
+        (Frigate, HA) that the operator must handle manually.
+        """
+        action_id = f"disk-cleanup-{node}"
+
+        for state in ("pending", "approved", "running"):
+            if (ACTIONS_DIR / state / f"{action_id}.json").exists():
+                logger.debug(f"Skipping {action_id}: already in state '{state}'")
+                return
+
+        action = {
+            "action_id":   action_id,
+            "timestamp":   time.time(),
+            "type":        "disk_cleanup",
+            "node":        node,
+            "service":     "",
+            "risk_level":  "guarded",
+            "confidence":  0.85,
+            "description": (
+                f"Aggressive disk cleanup on {node}: docker image prune -a "
+                f"and docker volume prune (requires operator approval)"
+            ),
+            "status": "pending",
+            "payload": {
+                "reason": "disk_pressure",
+                "commands": [
+                    "docker image prune -a -f",
+                    "docker volume prune -f",
+                ],
+            },
+        }
+
+        action_path = ACTIONS_DIR / "pending" / f"{action_id}.json"
+        try:
+            _atomic_write_json(action_path, action)
+            logger.info(
+                f"Generated disk cleanup recommendation: {action_id} "
+                f"(node={node}, risk=guarded)"
+            )
+        except Exception as e:
+            logger.error(f"Failed to save disk cleanup recommendation {action_id}: {e}")
+
+    def _cancel_resolved_pending_actions(self):
+        """
+        Auto-cancel pending service actions (redeploy / container_restart) whose
+        target service is now healthy in the actual state.
+
+        This keeps the action queue clean: when node-agent starts reporting
+        service_healthy for a container that previously had no world-state entry,
+        the pending 'missing_service' redeploy action that was generated before
+        the first health confirmation should be removed automatically rather than
+        sitting in the queue until an operator manually rejects it.
+
+        Only pending actions are considered — approved/running actions have already
+        been committed to by the operator and must not be cancelled automatically.
+        """
+        cancelled_dir = ACTIONS_DIR / "cancelled"
+        cancelled_dir.mkdir(parents=True, exist_ok=True)
+        pending_dir = ACTIONS_DIR / "pending"
+        if not pending_dir.exists():
+            return
+
+        for action_file in list(pending_dir.glob("*.json")):
+            try:
+                with open(action_file, "r") as f:
+                    action = json.load(f)
+            except Exception as e:
+                logger.error(f"Failed to read action {action_file.name}: {e}")
+                continue
+
+            action_type = action.get("type")
+            node = action.get("node")
+            service = action.get("service")
+
+            # Only auto-cancel service-level actions (not disk_cleanup)
+            if action_type not in ("redeploy", "container_restart"):
+                continue
+            if not node or not service:
+                continue
+
+            svc_key = f"{node}/{service}"
+
+            cancel_reason = None
+
+            # Case 1: service is no longer in desired state (removed from services.yaml
+            # or marked monitor:false). The action was generated under old config.
+            if svc_key not in self.desired_state["services"]:
+                cancel_reason = "service_removed_from_desired_state"
+
+            # Case 2: drift resolved — service is now healthy in actual state.
+            elif self.actual_state["services"].get(svc_key, {}).get("status") == "healthy":
+                cancel_reason = "drift_resolved_auto"
+
+            if cancel_reason:
+                dest = cancelled_dir / action_file.name
+                try:
+                    action["status"] = "cancelled"
+                    action["cancelled_reason"] = cancel_reason
+                    action["cancelled_at"] = time.time()
+                    _atomic_write_json(dest, action)
+                    action_file.unlink()
+                    logger.info(
+                        f"Auto-cancelled {action_file.name}: "
+                        f"{svc_key} — {cancel_reason}"
+                    )
+                except Exception as e:
+                    logger.error(f"Failed to cancel action {action_file.name}: {e}")
+
+    # ------------------------------------------------------------------
+    # HA diagnostic event routing
+    # ------------------------------------------------------------------
+
+    def _process_ha_events(self):
+        """Scan the events directory for unprocessed ha_* events and route them."""
+        if not EVENTS_DIR.exists():
+            return
+        for event_file in sorted(EVENTS_DIR.glob("**/*.json")):
+            event_id = event_file.stem
+            if event_id in self._ha_processed_event_ids:
+                continue
+            try:
+                with open(event_file) as f:
+                    event = json.load(f)
+            except Exception as e:
+                # Not marked as processed, so a file caught mid-write is retried next cycle.
+                logger.debug(f"Could not read event {event_file}: {e}")
+                continue
+            self._ha_processed_event_ids.add(event_id)
+            if not event.get("type", "").startswith("ha_"):
+                continue
+            self._route_ha_event(event)
+
+    def _route_ha_event(self, event: dict):
+        event_type = event.get("type", "")
+        node = event.get("node", "")
+        if not node:
+            return
+
+        if event_type in HA_CONTAINER_RESTART_EVENTS:
+            if self._is_ha_in_transition(node):
+                logger.debug(
+                    f"Suppressing {event_type} on {node}: homeassistant in transition"
+                )
+                return
+            if HA_DIAG_SHADOW_MODE:
+                logger.info(
+                    "shadow_mode: suppressed container_restart for %s", event_type
+                )
+                self._generate_ha_shadow_alert(node, event)
+            else:
+                self._generate_ha_container_restart(node, event)
+
+        elif event_type == "ha_websocket_recovered":
+            self._cancel_ha_container_restart(node)
+
+        elif event_type in HA_ALERT_ONLY_EVENTS:
+            if self._is_ha_in_transition(node):
+                logger.debug(
+                    f"Suppressing {event_type} on {node}: homeassistant in transition"
+                )
+                return
+            self._generate_ha_alert_only(node, event)
+
+    def _is_ha_in_transition(self, node: str) -> bool:
+        """Return True if homeassistant container had a recent containers_not_running incident.
+
+        Suppresses ha_* alerts during planned HA restarts/updates to avoid
+        flooding the operator with secondary diagnostic alerts.
+        """
+        svc_key = f"{node}/homeassistant"
+        svc_info = self.actual_state["services"].get(svc_key, {})
+        incident_id = svc_info.get("incident_id")
+        if not incident_id:
+            return False
+        incident = self.actual_state["incidents"].get(incident_id, {})
+        return (
+            incident.get("status") == "active"
+            and incident.get("trigger_type") == "containers_not_running"
+            and time.time() - (incident.get("last_occurrence") or 0) < HA_TRANSITION_WINDOW
+        )
+
+    def _ha_action_recently_completed(self, action_id: str, cooldown: int) -> bool:
+        """Return True if action completed/rejected/cancelled within the cooldown window."""
+        for state in ("completed", "rejected", "cancelled"):
+            path = ACTIONS_DIR / state / f"{action_id}.json"
+            if path.exists():
+                try:
+                    with open(path) as f:
+                        data = json.load(f)
+                    finished = (
+                        data.get("finished_at")
+                        or data.get("cancelled_at")
+                        or data.get("updated_at")
+                        or 0
+                    )
+                    if time.time() - finished < cooldown:
+                        return True
+                except Exception as e:
+                    logger.debug(f"Ignoring unreadable action file {path}: {e}")
+        return False
+
+    def _generate_ha_container_restart(self, node: str, event: dict):
+        service = "homeassistant"
+        action_id = f"container-restart-{node}-{service}"
+
+        for state in ("pending", "approved", "running"):
+            if (ACTIONS_DIR / state / f"{action_id}.json").exists():
+                logger.debug(f"Skipping {action_id}: already in state '{state}'")
+                return
+
+        if self._ha_action_recently_completed(action_id, HA_WEBSOCKET_RESTART_COOLDOWN):
+            logger.debug(
+                f"Skipping {action_id}: within {HA_WEBSOCKET_RESTART_COOLDOWN}s cooldown"
+            )
+            return
+
+        payload = dict(event.get("payload", {}))
+        payload["reason"] = "ha_websocket_dead"
+        payload["svc_key"] = f"{node}/{service}"
+
+        container_name = self._get_container_name(service)
+        action = {
+            "action_id": action_id,
+            "timestamp": time.time(),
+            "type": "container_restart",
+            "node": node,
+            "service": service,
+            "container_name": container_name,
+            "risk_level": "low",
+            "confidence": 0.9,
+            "description": (
+                f"Restart '{container_name}' on {node}: HA WebSocket unresponsive"
+            ),
+            "status": "pending",
+            "payload": payload,
+        }
+        self._write_pending_action(action)
+
+    def _generate_ha_shadow_alert(self, node: str, event: dict):
+        """Shadow-mode downgrade: emit alert_only instead of container_restart.
+
+        Uses the same action_id and cooldown as the real restart so that
+        cooldown semantics are identical regardless of shadow mode state.
+        """
+        service = "homeassistant"
+        action_id = f"container-restart-{node}-{service}"
+
+        for state in ("pending", "approved", "running"):
+            if (ACTIONS_DIR / state / f"{action_id}.json").exists():
+                logger.debug(f"Skipping {action_id}: already in state '{state}'")
+                return
+
+        if self._ha_action_recently_completed(action_id, HA_WEBSOCKET_RESTART_COOLDOWN):
+            logger.debug(
+                f"Skipping {action_id}: within {HA_WEBSOCKET_RESTART_COOLDOWN}s cooldown"
+            )
+            return
+
+        payload = dict(event.get("payload", {}))
+        payload["reason"] = "ha_websocket_dead"
+        payload["svc_key"] = f"{node}/{service}"
+        payload["shadow_mode"] = True
+
+        action = {
+            "action_id": action_id,
+            "timestamp": time.time(),
+            "type": "alert_only",
+            "node": node,
+            "service": service,
+            "risk_level": "info",
+            "confidence": 0.9,
+            "description": (
+                f"[SHADOW MODE] would have triggered container_restart "
+                f"for {service} on {node}: HA WebSocket unresponsive"
+            ),
+            "status": "pending",
+            "payload": payload,
+        }
+        self._write_pending_action(action)
+
+    def _generate_ha_alert_only(self, node: str, event: dict):
+        event_type = event.get("type", "")
+        suffix = _HA_ALERT_ID_SUFFIX.get(event_type, event_type.replace("_", "-"))
+        action_id = f"alert-ha-{suffix}-{node}"
+
+        for state in ("pending", "approved", "running"):
+            if (ACTIONS_DIR / state / f"{action_id}.json").exists():
+                logger.debug(f"Skipping {action_id}: already in state '{state}'")
+                return
+
+        if self._ha_action_recently_completed(action_id, HA_ALERT_COOLDOWN):
+            logger.debug(
+                f"Skipping {action_id}: within {HA_ALERT_COOLDOWN}s cooldown"
+            )
+            return
+
+        payload = dict(event.get("payload", {}))
+        payload["reason"] = event_type
+
+        action = {
+            "action_id": action_id,
+            "timestamp": time.time(),
+            "type": "alert_only",
+            "node": node,
+            "service": event.get("service", "homeassistant"),
+            "risk_level": "info",
+            "confidence": 1.0,
+            "description": event.get(
+                "message", f"HA diagnostic alert: {event_type} on {node}"
+            ),
+            "status": "pending",
+            "payload": payload,
+        }
+        self._write_pending_action(action)
+
+    def _cancel_ha_container_restart(self, node: str):
+        """Move a pending ha_websocket_dead container_restart to cancelled on recovery."""
+        action_id = f"container-restart-{node}-homeassistant"
+        pending_path = ACTIONS_DIR / "pending" / f"{action_id}.json"
+        if not pending_path.exists():
+            return
+        cancelled_dir = ACTIONS_DIR / "cancelled"
+        cancelled_dir.mkdir(parents=True, exist_ok=True)
+        dest = cancelled_dir / f"{action_id}.json"
+        try:
+            with open(pending_path) as f:
+                action = json.load(f)
+            action["status"] = "cancelled"
+            action["cancelled_reason"] = "ha_websocket_recovered"
+            action["cancelled_at"] = time.time()
+            _atomic_write_json(dest, action)
+            pending_path.unlink()
+            logger.info(f"Cancelled {action_id}: ha_websocket_recovered on {node}")
+        except Exception as e:
+            logger.error(f"Failed to cancel {action_id}: {e}")
+
+    def _write_pending_action(self, action: dict):
+        action_id = action["action_id"]
+        action_path = ACTIONS_DIR / "pending" / f"{action_id}.json"
+        try:
+            _atomic_write_json(action_path, action)
+            logger.info(
+                f"Generated HA action: {action_id} "
+                f"(type={action['type']}, risk={action['risk_level']})"
+            )
+        except Exception as e:
+            logger.error(f"Failed to save action {action_id}: {e}")
+
+    def loop(self, interval=30):
+        logger.info("Starting supervisor loop")
+        while True:
+            self.reconcile()
+            time.sleep(interval)
+
+
+if __name__ == "__main__":
+    supervisor = Supervisor()
+    supervisor.loop()
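
The three `_generate_ha_*` methods above share one gating pattern: skip if an action with the same id is already in `pending`, `approved`, or `running`, then skip if the same id finished within a cooldown window. The following is a minimal standalone sketch of that gate, not the supervisor's own code. The function name `should_emit`, the injectable `now` parameter, and the assumption that timestamps are numeric epoch seconds are illustrative only.

```python
from __future__ import annotations

import json
import time
from pathlib import Path

IN_FLIGHT_STATES = ("pending", "approved", "running")
FINISHED_STATES = ("completed", "rejected", "cancelled")


def should_emit(
    actions_dir: Path, action_id: str, cooldown: float, now: float | None = None
) -> bool:
    """Return True only if the action is neither in flight nor inside its cooldown.

    Hypothetical helper: assumes one JSON file per action under
    <actions_dir>/<state>/<action_id>.json with numeric epoch timestamps.
    """
    now = time.time() if now is None else now
    name = f"{action_id}.json"

    # 1. Dedup: an action with this id is already queued or executing.
    if any((actions_dir / state / name).exists() for state in IN_FLIGHT_STATES):
        return False

    # 2. Cooldown: the same action finished too recently.
    for state in FINISHED_STATES:
        path = actions_dir / state / name
        if not path.exists():
            continue
        try:
            data = json.loads(path.read_text())
        except (OSError, ValueError):
            continue  # unreadable record: no cooldown information available
        if not isinstance(data, dict):
            continue
        finished = (
            data.get("finished_at")
            or data.get("cancelled_at")
            or data.get("updated_at")
            or 0
        )
        if now - finished < cooldown:
            return False
    return True
```

Passing `now` explicitly keeps the check deterministic in tests. Because the shadow-mode alert reuses the `container-restart-…` action id, a gate like this one applies the same cooldown whether or not shadow mode is enabled.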
--- a/services/control-plane/tests/__init__.py
+++ b/services/control-plane/tests/__init__.py
--- a/services/control-plane/tests/test_incident_lifecycle.py
+++ b/services/control-plane/tests/test_incident_lifecycle.py
@@ -0,0 +1,333 @@
+"""Tests for incident lifecycle: auto-resolve, orphan detection, timestamp parsing."""
+from __future__ import annotations
+
+import json
+import sys
+import time
+from pathlib import Path
+
+import pytest
+
+# Observer lives outside the control-plane package; add scripts/ to path.
+sys.path.insert(0, str(Path(__file__).parent.parent.parent.parent / "scripts"))
+from observer.observer import Observer, _parse_ts, _atomic_write_json
+
+
+# ---------------------------------------------------------------------------
+# Helpers
+# ---------------------------------------------------------------------------
+
+def _make_observer(tmp_path: Path) -> Observer:
+    """Return an Observer with all runtime paths redirected to tmp_path."""
+    import observer.observer as obs_mod
+
+    world = tmp_path / "world"
+    state = tmp_path / "state"
+    events = tmp_path / "events"
+    logs = tmp_path / "logs"
+    repo = tmp_path / "repo"
+
+    for d in (world, state, events, logs, repo / "inventory", repo / "hosts"):
+        d.mkdir(parents=True, exist_ok=True)
+
+    # Minimal topology so inventory isn't empty (avoids prune-guard early-return)
+    (repo / "inventory" / "topology.yaml").write_text(
+        "nodes:\n  vps:\n    roles: [control-plane]\n    connectivity: {}\n"
+    )
+
+    original_world = obs_mod.WORLD_DIR
+    original_state = obs_mod.STATE_DIR
+    original_events = obs_mod.EVENTS_DIR
+    original_logs = obs_mod.LOGS_DIR
+    original_inventory = obs_mod.INVENTORY_TOPOLOGY
+    original_repo = obs_mod.REPO_ROOT
+
+    obs_mod.WORLD_DIR = world
+    obs_mod.STATE_DIR = state
+    obs_mod.EVENTS_DIR = events
+    obs_mod.LOGS_DIR = logs
+    obs_mod.INVENTORY_TOPOLOGY = repo / "inventory" / "topology.yaml"
+    obs_mod.REPO_ROOT = repo
+
+    obs = Observer()
+
+    # Restore the module-level constants: the Observer instance reads them at
+    # construction time, so patching them only around Observer() is sufficient.
+    obs_mod.WORLD_DIR = original_world
+    obs_mod.STATE_DIR = original_state
+    obs_mod.EVENTS_DIR = original_events
+    obs_mod.LOGS_DIR = original_logs
+    obs_mod.INVENTORY_TOPOLOGY = original_inventory
+    obs_mod.REPO_ROOT = original_repo
+
+    return obs
+
+
+def _make_observer_simple(tmp_path: Path):
+    """Return an Observer built with module-level paths redirected to tmp_path (not restored)."""
+    import observer.observer as obs_mod
+
+    world = tmp_path / "world"
+    state = tmp_path / "state"
+    events = tmp_path / "events"
+    logs = tmp_path / "logs"
+    repo = tmp_path / "repo"
+
+    for d in (world, state, events, logs, repo / "inventory", repo / "hosts"):
+        d.mkdir(parents=True, exist_ok=True)
+
+    (repo / "inventory" / "topology.yaml").write_text(
+        "nodes:\n  vps:\n    roles: [control-plane]\n    connectivity: {}\n"
+    )
+
+    # Patch before construction
+    obs_mod.WORLD_DIR = world
+    obs_mod.STATE_DIR = state
+    obs_mod.EVENTS_DIR = events
+    obs_mod.LOGS_DIR = logs
+    obs_mod.INVENTORY_TOPOLOGY = repo / "inventory" / "topology.yaml"
+    obs_mod.REPO_ROOT = repo
+
+    obs = Observer()
+    return obs
+
+
+# ---------------------------------------------------------------------------
+# 1. _parse_ts — timestamp normalisation
+# ---------------------------------------------------------------------------
+
+def test_parse_ts_int():
+    ts = int(time.time()) - 3600
+    assert abs(_parse_ts(ts) - ts) < 1
+
+
+def test_parse_ts_float():
+    ts = time.time() - 100.5
+    assert abs(_parse_ts(ts) - ts) < 0.01
+
+
+def test_parse_ts_iso_string():
+    # ISO format as emitted by events.py / stability-agent
+    from datetime import datetime, timezone
+    iso = "2026-06-01T00:03:22Z"
+    expected = datetime(2026, 6, 1, 0, 3, 22, tzinfo=timezone.utc).timestamp()
+    result = _parse_ts(iso)
+    assert result > 0
+    assert isinstance(result, float)
+    assert abs(result - expected) < 1
+
+
+def test_parse_ts_none_returns_zero():
+    assert _parse_ts(None) == 0.0
+
+
+def test_parse_ts_garbage_returns_zero():
+    assert _parse_ts("not-a-date") == 0.0
+
+
+def test_parse_ts_zero_int():
+    assert _parse_ts(0) == 0.0
+
+
+# ---------------------------------------------------------------------------
+# 2. Lifecycle: service_healthy event resolves linked incident
+# ---------------------------------------------------------------------------
+
+def test_service_healthy_resolves_active_incident(tmp_path):
+    obs = _make_observer_simple(tmp_path)
+    inc_id = "inc-111-vps-outline"
+    obs.world_state["services"]["vps/outline"] = {
+        "node": "vps", "service": "outline",
+        "status": "unhealthy", "last_check": None,
+        "incident_id": inc_id,
+    }
+    obs.world_state["incidents"][inc_id] = {
+        "id": inc_id, "node": "vps", "service": "outline",
+        "status": "active", "trigger_type": "service_unhealthy",
+        "started_at": int(time.time()) - 600,
+        "last_occurrence": int(time.time()) - 600,
+        "occurrence_count": 1, "events": [],
+    }
+
+    obs.process_event({
+        "type": "service_healthy",
+        "node": "vps",
+        "service": "outline",
+        "severity": "info",
+        "timestamp": int(time.time()),
+        "payload": {},
+    })
+
+    assert obs.world_state["services"]["vps/outline"]["status"] == "healthy"
+    assert obs.world_state["services"]["vps/outline"]["incident_id"] is None
+    assert obs.world_state["incidents"][inc_id]["status"] == "resolved"
+
+
+def test_service_healthy_does_not_resolve_other_incidents(tmp_path):
+    """service_healthy for service A must not touch incident for service B."""
+    obs = _make_observer_simple(tmp_path)
+    inc_b = "inc-222-vps-supervisor"
+    obs.world_state["services"]["vps/supervisor"] = {
+        "node": "vps", "service": "supervisor",
+        "status": "unhealthy", "last_check": None,
+        "incident_id": inc_b,
+    }
+    obs.world_state["incidents"][inc_b] = {
+        "id": inc_b, "status": "active",
+        "last_occurrence": int(time.time()) - 300,
+    }
+
+    obs.process_event({
+        "type": "service_healthy",
+        "node": "vps",
+        "service": "outline",   # different service
+        "severity": "info",
+        "timestamp": int(time.time()),
+        "payload": {},
+    })
+
+    assert obs.world_state["incidents"][inc_b]["status"] == "active"
+
+
+# ---------------------------------------------------------------------------
+# 3. _prune_stale_world: healthy-service-linked incident → immediate resolve
+# ---------------------------------------------------------------------------
+
+def test_prune_resolves_healthy_linked_incident(tmp_path):
+    """If a service is healthy but still points at an active incident, resolve it."""
+    obs = _make_observer_simple(tmp_path)
+    inc_id = "inc-333-vps-outline"
+    obs.world_state["services"]["vps/outline"] = {
+        "node": "vps", "service": "outline",
+        "status": "healthy",          # <-- healthy but incident_id still set
+        "last_check": None,
+        "incident_id": inc_id,
+    }
+    obs.world_state["incidents"][inc_id] = {
+        "id": inc_id, "status": "active",
+        "started_at": int(time.time()) - 7200,
+        "last_occurrence": int(time.time()) - 7200,
+    }
+
+    obs._prune_stale_world()
+
+    assert obs.world_state["services"]["vps/outline"]["incident_id"] is None
+    assert obs.world_state["incidents"][inc_id]["status"] == "resolved"
+
+
+def test_prune_resolves_healthy_linked_incident_iso_timestamp(tmp_path):
+    """Healthy-linked incident with ISO-string last_occurrence must still resolve."""
+    obs = _make_observer_simple(tmp_path)
+    inc_id = "inc-444-vps-outline"
+    obs.world_state["services"]["vps/outline"] = {
+        "node": "vps", "service": "outline",
+        "status": "healthy", "last_check": None, "incident_id": inc_id,
+    }
+    obs.world_state["incidents"][inc_id] = {
+        "id": inc_id, "status": "active",
+        "last_occurrence": "2026-06-01T00:03:22Z",  # ISO string from events.py
+    }
+
+    obs._prune_stale_world()   # must not raise TypeError
+
+    assert obs.world_state["incidents"][inc_id]["status"] == "resolved"
+
+
+# ---------------------------------------------------------------------------
+# 4. _prune_stale_world: orphaned incident (no service link) → resolve after 5 min
+# ---------------------------------------------------------------------------
+
+def test_prune_resolves_orphaned_incident_old_enough(tmp_path):
+    """Orphaned active incident older than 5 min must be auto-resolved."""
+    obs = _make_observer_simple(tmp_path)
+    inc_id = "inc-555-vps-supervisor"
+    # No service entry links to this incident
+    obs.world_state["incidents"][inc_id] = {
+        "id": inc_id, "status": "active", "node": "vps", "service": "supervisor",
+        "last_occurrence": int(time.time()) - 400,   # 6.7 min ago
+    }
+
+    obs._prune_stale_world()
+
+    assert obs.world_state["incidents"][inc_id]["status"] == "resolved"
+
+
+def test_prune_does_not_resolve_orphaned_incident_too_recent(tmp_path):
+    """Orphaned incident younger than 5 min must stay active (guard against race)."""
+    obs = _make_observer_simple(tmp_path)
+    inc_id = "inc-666-vps-supervisor"
+    obs.world_state["incidents"][inc_id] = {
+        "id": inc_id, "status": "active",
+        "last_occurrence": int(time.time()) - 60,   # 1 min ago — within guard
+    }
+
+    obs._prune_stale_world()
+
+    assert obs.world_state["incidents"][inc_id]["status"] == "active"
+
+
+def test_prune_resolves_orphaned_incident_iso_timestamp(tmp_path):
+    """Orphaned incident with ISO-string last_occurrence must resolve correctly."""
+    obs = _make_observer_simple(tmp_path)
+    inc_id = "inc-777-vps-outline"
+    # ISO timestamp well in the past (2026-06-01)
+    obs.world_state["incidents"][inc_id] = {
+        "id": inc_id, "status": "active",
+        "last_occurrence": "2026-06-01T00:03:22Z",
+    }
+
+    obs._prune_stale_world()   # must not raise TypeError
+
+    assert obs.world_state["incidents"][inc_id]["status"] == "resolved"
+
+
+def test_prune_does_not_touch_linked_incident(tmp_path):
+    """An active incident still linked from a non-healthy service must stay active."""
+    obs = _make_observer_simple(tmp_path)
+    inc_id = "inc-888-vps-outline"
+    obs.world_state["services"]["vps/outline"] = {
+        "node": "vps", "service": "outline",
+        "status": "unhealthy",   # <-- still unhealthy
+        "last_check": None,
+        "incident_id": inc_id,
+    }
+    obs.world_state["incidents"][inc_id] = {
+        "id": inc_id, "status": "active",
+        "last_occurrence": int(time.time()) - 3600,
+    }
+
+    obs._prune_stale_world()
+
+    assert obs.world_state["incidents"][inc_id]["status"] == "active"
+
+
+# ---------------------------------------------------------------------------
+# 5. 7-day stale incident prune with ISO resolved_at
+# ---------------------------------------------------------------------------
+
+def test_prune_removes_old_resolved_incident_iso_resolved_at(tmp_path):
+    """Resolved incidents with ISO-string resolved_at older than 7 days must be pruned."""
+    obs = _make_observer_simple(tmp_path)
+    inc_id = "inc-old-resolved"
+    obs.world_state["incidents"][inc_id] = {
+        "id": inc_id, "status": "resolved",
+        "resolved_at": "2026-05-01T00:00:00Z",  # >7 days before 2026-06-03
+    }
+
+    obs._prune_stale_world()
+
+    assert inc_id not in obs.world_state["incidents"]
+
+
+def test_prune_keeps_recently_resolved_incident(tmp_path):
+    """Resolved incidents within 7 days must be kept."""
+    obs = _make_observer_simple(tmp_path)
+    inc_id = "inc-recent-resolved"
+    obs.world_state["incidents"][inc_id] = {
+        "id": inc_id, "status": "resolved",
+        "resolved_at": time.time() - 86400,  # 1 day ago
+    }
+
+    obs._prune_stale_world()
+
+    assert inc_id in obs.world_state["incidents"]
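
The `_parse_ts` tests above pin down a contract: ints and floats pass through as epoch seconds, ISO-8601 strings with a trailing `Z` are read as UTC, and `None` or unparseable input yields `0.0`. The observer's real implementation is not part of this diff, so the following is only a sketch of one function that would satisfy those tests. The name `parse_ts` and the choice to treat naive strings as UTC are assumptions.

```python
from __future__ import annotations

from datetime import datetime, timezone


def parse_ts(value) -> float:
    """Normalise an int, float, or ISO-8601 string to epoch seconds; 0.0 on failure."""
    if isinstance(value, (int, float)):
        return float(value)
    if isinstance(value, str):
        # datetime.fromisoformat() rejects a trailing "Z" before Python 3.11,
        # so rewrite it as an explicit UTC offset first.
        text = value[:-1] + "+00:00" if value.endswith("Z") else value
        try:
            dt = datetime.fromisoformat(text)
        except ValueError:
            return 0.0
        if dt.tzinfo is None:
            # Assumption: timestamps without an offset are UTC.
            dt = dt.replace(tzinfo=timezone.utc)
        return dt.timestamp()
    return 0.0  # None and any other type
```

Returning `0.0` rather than raising means a comparison such as `time.time() - parse_ts(x) > threshold` treats a missing timestamp as very old, which matches how the orphaned-incident tests expect ISO-string and missing values to behave.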
--- a/services/control-plane/tests/test_state_reliability.py
+++ b/services/control-plane/tests/test_state_reliability.py
@@ -0,0 +1,199 @@
+"""Tests for atomic writes and resilient world-state loading in the supervisor."""
+from __future__ import annotations
+
+import json
+import sys
+import time
+from pathlib import Path
+
+import pytest
+
+sys.path.insert(0, str(Path(__file__).parent.parent / "src"))
+import supervisor as supervisor_module
+from supervisor import Supervisor, _atomic_write_json
+
+
+# ---------------------------------------------------------------------------
+# Helpers (reused from test_supervisor_ha)
+# ---------------------------------------------------------------------------
+
+def _setup_supervisor(tmp_path: Path, monkeypatch) -> Supervisor:
+    actions = tmp_path / "actions"
+    events = tmp_path / "events"
+    world = tmp_path / "world"
+    repo = tmp_path / "repo"
+
+    for d in (actions, events, world, repo / "hosts"):
+        d.mkdir(parents=True, exist_ok=True)
+
+    monkeypatch.setattr(supervisor_module, "ACTIONS_DIR", actions)
+    monkeypatch.setattr(supervisor_module, "EVENTS_DIR", events)
+    monkeypatch.setattr(supervisor_module, "WORLD_DIR", world)
+    monkeypatch.setattr(supervisor_module, "REPO_ROOT", repo)
+
+    sup = Supervisor()
+    sup.desired_state = {"services": {}}
+    sup.actual_state = {"services": {}, "nodes": {}, "incidents": {}}
+    return sup
+
+
+# ---------------------------------------------------------------------------
+# 1. atomic_write_json correctness
+# ---------------------------------------------------------------------------
+
+def test_atomic_write_json_produces_valid_json(tmp_path):
+    path = tmp_path / "out.json"
+    data = {"services": {"vps/outline": {"status": "healthy"}}, "count": 42}
+    _atomic_write_json(path, data)
+
+    assert path.exists(), "output file must exist after atomic write"
+    loaded = json.loads(path.read_text())
+    assert loaded == data
+
+
+def test_atomic_write_json_no_tmp_left_behind(tmp_path):
+    path = tmp_path / "world.json"
+    _atomic_write_json(path, {"ok": True})
+
+    tmp = path.with_suffix(".tmp")
+    assert not tmp.exists(), ".tmp must be cleaned up by os.replace"
+
+
+def test_atomic_write_json_overwrites_existing(tmp_path):
+    path = tmp_path / "state.json"
+    path.write_text('{"old": true}')
+    _atomic_write_json(path, {"new": True})
+    assert json.loads(path.read_text()) == {"new": True}
+
+
+def test_atomic_write_json_nested_structure(tmp_path):
+    path = tmp_path / "complex.json"
+    data = {
+        "nodes": {"vps": {"status": "online", "disk_usage_pct": 42}},
+        "incidents": {},
+        "list": [1, 2, 3],
+    }
+    _atomic_write_json(path, data)
+    assert json.loads(path.read_text()) == data
+
+
+# ---------------------------------------------------------------------------
+# 2. Resilient loader: empty / truncated file → skip cycle, no drift
+# ---------------------------------------------------------------------------
+
+def _populate_desired(sup: Supervisor, svc_key: str = "vps/outline"):
+    node, service = svc_key.split("/", 1)
+    sup.desired_state["services"][svc_key] = {
+        "node": node,
+        "service": service,
+        "desired": "running",
+    }
+
+
+def test_empty_services_json_skips_reconcile(tmp_path, monkeypatch):
+    """Empty services.json (truncated write) must not generate any redeploy action."""
+    sup = _setup_supervisor(tmp_path, monkeypatch)
+    _populate_desired(sup)
+
+    # Write empty services.json — simulates a mid-write truncation
+    (tmp_path / "world" / "services.json").write_text("")
+    (tmp_path / "world" / "nodes.json").write_text("{}")
+    (tmp_path / "world" / "incidents.json").write_text("{}")
+
+    sup.reconcile()
+
+    pending = list((tmp_path / "actions" / "pending").glob("*.json"))
+    assert pending == [], f"No actions should be generated on empty state file, got: {[p.name for p in pending]}"
+
+
+def test_truncated_services_json_skips_reconcile(tmp_path, monkeypatch):
+    """Partially-written (truncated mid-write) JSON must not generate any action."""
+    sup = _setup_supervisor(tmp_path, monkeypatch)
+    _populate_desired(sup)
+
+    (tmp_path / "world" / "services.json").write_text('{"vps/outline": {"status": "hea')
+    (tmp_path / "world" / "nodes.json").write_text("{}")
+    (tmp_path / "world" / "incidents.json").write_text("{}")
+
+    sup.reconcile()
+
+    pending = list((tmp_path / "actions" / "pending").glob("*.json"))
+    assert pending == [], f"No actions expected on truncated state, got: {[p.name for p in pending]}"
+
+
+def test_empty_incidents_json_skips_reconcile(tmp_path, monkeypatch):
+    """Empty incidents.json skips the full cycle, as does any unreadable world-state file."""
+    sup = _setup_supervisor(tmp_path, monkeypatch)
+    _populate_desired(sup)
+
+    (tmp_path / "world" / "services.json").write_text("{}")
+    (tmp_path / "world" / "nodes.json").write_text("{}")
+    (tmp_path / "world" / "incidents.json").write_text("")
+
+    sup.reconcile()
+
+    pending = list((tmp_path / "actions" / "pending").glob("*.json"))
+    assert pending == [], f"No actions expected when any state file is unreadable, got: {[p.name for p in pending]}"
+
+
+def test_load_actual_state_returns_false_on_empty_file(tmp_path, monkeypatch):
+    """_load_actual_state must return False (not raise) when a file is empty."""
+    sup = _setup_supervisor(tmp_path, monkeypatch)
+
+    (tmp_path / "world" / "services.json").write_text("")
+    (tmp_path / "world" / "nodes.json").write_text("{}")
+    (tmp_path / "world" / "incidents.json").write_text("{}")
+
+    result = sup._load_actual_state()
+    assert result is False
+
+
+def test_load_actual_state_returns_true_on_valid_files(tmp_path, monkeypatch):
+    """_load_actual_state returns True and populates actual_state on valid files."""
+    sup = _setup_supervisor(tmp_path, monkeypatch)
+
+    services = {"vps/outline": {"node": "vps", "service": "outline", "status": "healthy"}}
+    (tmp_path / "world" / "services.json").write_text(json.dumps(services))
+    (tmp_path / "world" / "nodes.json").write_text('{"vps": {"status": "online"}}')
+    (tmp_path / "world" / "incidents.json").write_text("{}")
+
+    result = sup._load_actual_state()
+    assert result is True
+    assert "vps/outline" in sup.actual_state["services"]
+
+
+def test_parse_failure_preserves_last_known_good_state(tmp_path, monkeypatch):
+    """When a file becomes unreadable, actual_state retains the previous good values."""
+    sup = _setup_supervisor(tmp_path, monkeypatch)
+
+    # First successful load
+    services = {"vps/outline": {"node": "vps", "service": "outline", "status": "healthy"}}
+    (tmp_path / "world" / "services.json").write_text(json.dumps(services))
+    (tmp_path / "world" / "nodes.json").write_text("{}")
+    (tmp_path / "world" / "incidents.json").write_text("{}")
+    assert sup._load_actual_state() is True
+    assert "vps/outline" in sup.actual_state["services"]
+
+    # File becomes empty (race condition)
+    (tmp_path / "world" / "services.json").write_text("")
+    assert sup._load_actual_state() is False
+
+    # State must be unchanged from the previous good load
+    assert "vps/outline" in sup.actual_state["services"], \
+        "Last-known-good state must be preserved on parse failure"
+
+
+def test_healthy_service_does_not_generate_action(tmp_path, monkeypatch):
+    """A desired service that appears healthy in world state generates no action."""
+    sup = _setup_supervisor(tmp_path, monkeypatch)
+    _populate_desired(sup)
+
+    services = {"vps/outline": {"node": "vps", "service": "outline", "status": "healthy"}}
+    (tmp_path / "world" / "services.json").write_text(json.dumps(services))
+    (tmp_path / "world" / "nodes.json").write_text("{}")
+    (tmp_path / "world" / "incidents.json").write_text("{}")
+
+    sup.reconcile()
+
+    pending = list((tmp_path / "actions" / "pending").glob("*.json"))
+    assert pending == [], "Healthy service must not generate any action"
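
The tests above rely on two behaviours: `_atomic_write_json` never leaves a half-written file or a stray `.tmp`, and the loader reports failure instead of raising when a file is empty or truncated. The supervisor's actual helpers are not shown in this part of the diff. The sketch below shows one common way to get both properties, assuming a sibling `.tmp` file and `os.replace`, as the test names suggest. `atomic_write_json` and `load_json_or_none` are hypothetical names.

```python
from __future__ import annotations

import json
import os
from pathlib import Path


def atomic_write_json(path: Path, data) -> None:
    """Write JSON so that readers see either the old file or the complete new one."""
    path.parent.mkdir(parents=True, exist_ok=True)
    tmp = path.with_suffix(".tmp")  # same directory, so the rename stays on one filesystem
    with open(tmp, "w") as f:
        json.dump(data, f)
        f.flush()
        os.fsync(f.fileno())  # push the bytes to disk before the rename
    os.replace(tmp, path)  # atomic rename; also removes the .tmp name


def load_json_or_none(path: Path):
    """Return the parsed JSON, or None if the file is missing, empty, or truncated."""
    try:
        return json.loads(path.read_text())
    except (OSError, ValueError):  # json.JSONDecodeError subclasses ValueError
        return None
```

A caller can keep its last-known-good state by assigning only when the result is not `None`, for example `loaded = load_json_or_none(p)` followed by `if loaded is not None: state = loaded`, and skipping the reconcile cycle otherwise.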