191 changed files with 518 additions and 19298 deletions
--- a/.claude/skills/deploy/SKILL.md
+++ b/.claude/skills/deploy/SKILL.md
@@ -1,43 +0,0 @@
----
-name: deploy
-description: Deploy, redeploy, or ship homelab services to a target node. Trigger on any request containing deploy / redeploy / wdróż / zredeployuj / ship for targets control-plane, vps, piha, solaria, or chelsty-infra.
----
-
-Always invoke `scripts/deploy/deploy.sh <target> [--dry-run] [--no-gate]` as the **sole entry point**.
-Never call `deploy-control-plane.sh`, `deploy-node.sh`, or `deploy-local.sh` directly.
-
-## Targets
-
-| Target | What it deploys |
-|---|---|
-| `control-plane` | observer, supervisor, executor, operator-ui on VPS |
-| `vps` | all VPS GitOps services (node-agent, npm, outline, joplin, ai-cluster, …) |
-| `piha` | PIHA services (ha-diag-agent, node-agent, redis, …) |
-| `solaria` | SOLARIA compute services |
-| `chelsty-infra` | CHELSTY LTE edge node (30 s SSH timeout) |
-
-## Invocation
-
-```bash
-scripts/deploy/deploy.sh <target>            # full pipeline
-scripts/deploy/deploy.sh <target> --dry-run  # preflight + gate only
-scripts/deploy/deploy.sh <target> --no-gate  # emergency: bypass tests
-```
-
-## Exit Code Handling
-
-| Code | Meaning | Required action |
-|---|---|---|
-| 0 | Success | Report: target, commit hash, gate status, verify status, elapsed time |
-| 1 | Preflight failed | Fix the upstream issue (push commits, wake node, switch to master). Never bypass. |
-| 2 | Gate failed | Show exactly which test/build failed. Do **not** deploy. Fix the failure first. |
-| 3 | Execute failed | Show full deploy output. Ask user whether to investigate or rollback. |
-| 4 | Verify failed | Show docker ps output. Discuss rollback with the user. |
-| 5 | Sudo handoff | Print the exact manual command from stderr **verbatim** and stop. User must run it. |
-
-## Rules
-
-- Never pass `--no-gate` unless the user explicitly requests emergency/bypass mode.
-- Never deploy uncommitted or unpushed code — preflight enforces this; do not help circumvent it.
-- Canonical branch is `master` — preflight enforces this.
-- For exit 5: reproduce the handoff command exactly as printed to stderr, then stop.
--- a/.claude/skills/node-onboarding/SKILL.md
+++ b/.claude/skills/node-onboarding/SKILL.md
@@ -1,152 +0,0 @@
----
-name: node-onboarding
-description: >
-  Use when the user wants to add or onboard a new node to homelab-codex —
-  repo manifest, Tailscale mesh, node-agent, monitoring, and UI registration.
-  Keywords: "nowy node", "dodaj node", "onboarding", "onboard node".
-living_doc: true
-maturity: partial  # PROVEN: 00-access, 20-base, 30-node-agent; WRITTEN: 40-register, 50-verify (live pending). Update after each step lands on a real node.
----
-
-> **Living document** — sections marked **SCAFFOLD** are stubs waiting for battle-testing on a real node.
-> Promote to **PROVEN** after each step passes end-to-end. Do not treat SCAFFOLD sections as authoritative.
-
-## Trigger
-
-User asks to onboard / add a new node. Load this skill before touching any onboarding script or node.yaml.
-
----
-
-## Workflow — one step at a time
-
-```
-preflight (read-only)
-  └─ 00-access        [PROVEN]
-       └─ 20-base     [PROVEN]
-            └─ 30-node-agent   [PROVEN]
-                 └─ 40-register     [WRITTEN — live pending]
-                      └─ 50-verify  [WRITTEN — live pending]
-```
-
-Never skip ahead. Each step must exit 0 before the next begins.
-
----
-
-## Invocation
-
-```bash
-# Full onboarding (all steps in order)
-scripts/onboard/onboard.sh --node <name>
-
-# Single step
-scripts/onboard/onboard.sh --node <name> --step 00-access
-
-# Resume from a step
-scripts/onboard/onboard.sh --node <name> --from 10-bootstrap-runtime
-
-# Dry-run — probes run for real; mutations are printed, not executed
-scripts/onboard/onboard.sh --node <name> --dry-run
-```
-
----
-
-## Step status table
-
-| Step | File | Status | What it does |
-|------|------|--------|--------------|
-| `00-preflight` | `steps/00-preflight.sh` | SCAFFOLD | Read-only: arch, RAM, docker, swap, MM runtime → YAML snippet for node.yaml |
-| `00-access` | `steps/00-access.sh` | **PROVEN** | SSH key → `first_contact`, install Tailscale, `tailscale up` (interactive URL), verify over mesh |
-| `10-bootstrap-runtime` | `steps/10-bootstrap-runtime.sh` | SCAFFOLD | Create `/opt/homelab/` layout, `chown <ssh_user>` |
-| `20-base` | `steps/20-base.sh` | **PROVEN** | swap→zram, `/opt/homelab/` layout, event dir `/opt/homelab/events/<node>/` |
-| `20-install-docker` | `steps/20-install-docker.sh` | SCAFFOLD | Install Docker Engine if `docker_present=false`; skip if already installed |
-| `30-node-agent` | `steps/30-node-agent.sh` | **PROVEN** | rsync base compose + override, `docker compose up -d --build`, verify container + events |
-| `40-register` | `steps/40-register.sh` | WRITTEN | Appends the node to `inventory/topology.yaml`, creates `hosts/<node>/services.yaml`, and commits on the branch (no push) |
-| `50-verify` | `steps/50-verify.sh` | WRITTEN | SSH node: container+events; SSH VPS: restart observer + heartbeat poll + world/nodes.json |
-
----
-
-## node.yaml — key fields
-
-```yaml
-name: LUSTRO                        # ALL CAPS
-role: edge                          # edge | compute | infra
-ssh_user: pi                        # existing user on the node
-first_contact: pi@192.168.31.19     # LAN IP — NEVER .local (mDNS unreliable in automation)
-tailscale:
-  hostname: lustro                  # mesh name; switch to this after tailscale up
-  ip:                               # fill after join
-deploy_autonomy: true               # false → print manual instructions and stop
-git_control: false                  # false → push-based from SATURN (edge nodes)
-hardware:
-  arch: arm64                       # filled by 00-preflight
-  ram_mb: 4096                      # filled by 00-preflight
-  swap:
-    kind: zram                      # zram | file | none
-  docker_present: true              # filled by 00-preflight
-  mm_runtime: systemd:magicmirror.service   # filled by 00-preflight; none if absent
-services:
-  node-agent:
-    runtime:
-      engine: docker
-      mem_limit: 256m               # mandatory on RAM-constrained hosts (≤4 GB)
-```
-
-preflight fills `arch`, `ram_mb`, `docker_present`, `mm_runtime` — do NOT guess these.
-
-Full schema: `scripts/onboard/README.md`.
-
----
-
-## Operational rules (PROVEN)
-
-**PLAN-FIRST** — before any mutation, show exactly what will touch the remote host.
-Always run `--dry-run` first; dry-run must print real commands (`run()` propagation).
-
-**Idempotency** — every step is safe to re-run. Keys, Tailscale join, Docker install → skip if already done.
-
-**Isolation** — do NOT touch existing services on the node (e.g. MagicMirror as systemd unit).
-
-**Worktree discipline** — onboarding is a feature. Work in a task worktree (`agent.sh new`), never in the main checkout (`~/homelab-codex-ws` is deploy-only). See [[worktree-aware]].
-
----
-
-## Gotchas (battle-tested)
-
-| Problem | Fix |
-|---------|-----|
-| mDNS `.local` resolve fail | Always use LAN IP in `first_contact`; `.local` OK interactively, not in automation |
-| uid=1000 collision on RPi OS | If `pi` already holds uid=1000 → USE that user, don't create `oskar`. node-agent `1000:1000` matches out-of-box; creating a second uid=1000 breaks MM ownership |
-| passwordless sudo not guaranteed | Verify `sudo -n true` exits 0 before any sudo-over-SSH step. RPi OS default may require password; ssh without TTY will hang |
-| swap file on SD card | Use zram, not a swap file (SD wear). Add migration to `10-bootstrap-runtime` |
-| RAM ≤4 GB with heavy app | `mem_limit` on node-agent is mandatory — same OOM profile as VPS |
-| Docker already installed | Check `docker_present` from preflight; skip install step if true |
-| SSH known-hosts warning in parsed output | Pass `-o LogLevel=ERROR` to SSH for new mesh hosts |
-| `yaml_get` drops value prefix after `:` | Strip only up to the first colon: `s/^[[:space:]]*[^:]*:[[:space:]]*//`; this keeps values such as `systemd:unit` intact |
-| `yaml_get` keeps inline YAML comments | Strip with `s/[[:space:]]\+#.*$//` after extraction (requires ≥1 space before `#`) |
-| dry-run stops at orchestrator level | `run()` wrapper + `export DRY_RUN=1` propagated to all step scripts; probes execute for real |
-| rsync push Permission denied to VPS events/ | ssh-user must be in the **group that owns `/opt/homelab/events/`** (aerbot/1000 on VPS). Symptom: silent WARNING in node-agent log, 292k files backlog, panel stale. Fix: `usermod -aG 1000 <user>` on VPS + re-login |
-| node-agent SSH key mount target | Mount the push key under the **container's HOME**: `/home/homelab/.ssh` (uid 1000 `homelab`), **NOT `/root/.ssh`** — ssh in `_ship_events_to_vps()` has no `-i` and only looks in `$HOME/.ssh`; a `/root/.ssh` mount is blind → `Permission denied` (lustro 2026-06-11, fix `a5a1352`). The new node's pubkey must also land in `authorized_keys` of `oskar@VPS` |
-| observer not seeing new node after topology.yaml edit | `_load_inventory()` runs once at `__init__`. After `git pull` on VPS (bind-mount is live), **`docker restart control-plane-observer`** is required — no redeploy needed |
-| worktree on wrong branch | Always check `git branch --show-current` on entry. One task = one worktree (`agent.sh new`). Never manually `git checkout` between task branches in the same worktree |
-
----
-
-## lib/ reference
-
-```
-lib/common.sh  — log/warn/die/step/dryrun, run(), yaml_get, ensure_line, git() wrapper
-lib/remote.sh  — rrun/rcopy/rsync_dir/rcheck (SSH wrappers; uses ONBOARD_SSH_USER / ONBOARD_HOST)
-```
-
-`run()` contract: in dry-run mode prints intent without executing; probes (ssh BatchMode=yes, `command -v`, status queries) always execute so the plan is realistic.
-
----
-
-## Definition of Done
-
-A node is fully onboarded when:
-
-1. `50-verify` exits 0 — event visible in control-plane UI and Telegram alert path confirmed.
-2. `hosts/<node>/node.yaml` committed with all preflight fields filled.
-3. `hosts/<node>/capabilities.yaml` present and accurate.
-4. Node appears in `inventory/topology.yaml`.
--- a/.claude/skills/save-session/SKILL.md
+++ b/.claude/skills/save-session/SKILL.md
@@ -1,65 +0,0 @@
----
-name: save-session
-description: Save and record the current work session to docs/sessions/. Trigger ONLY on explicit "save session", "zapisz sesję", or "wrap up" — never invoke proactively between tasks.
----
-
-**Trigger condition**: user explicitly says "save session", "zapisz sesję", "wrap up", or equivalent.
-Never invoke proactively. Never invoke mid-task.
-
-## 1. Determine Session Boundary
-
-1. Read the latest entry file in `docs/sessions/` — use its last `## Session HH:MM` heading timestamp as the start boundary.
-2. Fallback if no previous entry exists: 24 hours ago.
-
-## 2. Collect Facts (deterministic only — no invention)
-
-Run exactly:
-```bash
-# All commits since boundary
-git --no-pager log --oneline <boundary>..HEAD
-
-# Changed file summary
-git --no-pager diff --stat <boundary>..HEAD
-```
-
-From the visible conversation transcript: deploys run and their outcomes, test results seen.
-
-## 3. Write the Session Entry
-
-**APPEND** to `docs/sessions/YYYY-MM-DD.md` (create the file if it doesn't exist for today).
-Never overwrite existing content.
-
-```markdown
-## Session HH:MM
-
-### Commits
-<output of git log --oneline>
-
-### Files changed
-<output of git diff --stat>
-
-### Deploys
-<list from transcript, or "None recorded">
-
-### Narrative
-> _user-provided summary_
-```
-
-The `> _user-provided summary_` placeholder is **mandatory**. Never fill it in. The user supplies the narrative separately if desired.
-
-## 4. What NOT to Touch
-
-- `backlog.md` — only on explicit "update backlog" instruction
-- `CLAUDE.md` — only on explicit "update CLAUDE.md" instruction
-- Any other file not listed above
-
-## 5. Commit
-
-Stage and commit **only** the session file:
-
-```bash
-git add docs/sessions/YYYY-MM-DD.md
-git commit -m "docs: session YYYY-MM-DD HH:MM"
-```
-
-No other files. No `git add -A`.
--- a/.claude/skills/worktree-aware/SKILL.md
+++ b/.claude/skills/worktree-aware/SKILL.md
@@ -1,81 +0,0 @@
----
-name: worktree-aware
-description: >
-  Use when working in a git worktree checkout for a parallel agent task.
-  The presence of an .agent-task file in the current working directory indicates
-  a task worktree (NOT the main checkout). Encodes branch hygiene: commit only
-  to the assigned task branch, NEVER push origin master, NEVER touch the main
-  checkout at ~/homelab-codex-ws, NEVER manage worktrees yourself. On task
-  completion, report the branch name verbatim and stop — the human merges via
-  scripts/dev/agent.sh.
----
-
-## When this applies
-
-- `.agent-task` present in your `cwd` → you are in a task worktree. Apply all rules below.
-- `.agent-task` absent → you are in the main checkout. Do NOT treat yourself as a task agent.
-  In the main checkout these rules do not apply.
-
-## Reading the marker
-
-`.agent-task` is a YAML file. Your assigned branch is the value of the `branch:` key, e.g.:
-
-```yaml
-task: my-feature
-branch: task/my-feature
-parent_commit: abc1234
-created_utc: 2026-06-03T10:00:00Z
-worktree_path: /home/oskar/homelab-codex-ws-my-feature
-```
-
-Always read this file first before taking any action.
-
-## Rules
-
-1. **Commit only to your branch.**
-   Before any `git commit`, run `git status` and confirm it says `On branch task/<name>`.
-   If it does not, stop immediately and report the discrepancy.
-
-2. **Push only to your branch.**
-   The only permitted push is `git push origin task/<name>`.
-   NEVER `git push origin master` or any other branch.
-
-3. **Do not touch the main checkout.**
-   `~/homelab-codex-ws/` is the main checkout — deploy-only, owned by the human.
-   Do not read from, write to, or execute commands inside it.
-
-4. **Stay scoped.**
-   Only change files directly related to your assigned task.
-   If you notice other problems, report them in your final summary as separate follow-up proposals.
-   Do not fix them in this worktree.
-
-5. **Never `git add -A`.**
-   Always stage specific files by name: `git add path/to/file`.
-
-6. **Do not manage worktrees.**
-   Never run `git worktree add/remove` or invoke `scripts/dev/agent.sh`.
-   Worktree lifecycle is the human's responsibility.
-
-7. **Final report before stopping.**
-   When the task is done, provide a structured report containing:
-   - Files changed (path and one-line summary of change)
-   - Tests run and results
-   - All commit hashes on the task branch
-   - **Branch name verbatim** (copy-paste ready)
-   - Follow-up items as bulleted proposals for separate tasks
-
-## Definition of Done
-
-- All commits are on `task/<name>` (verify with `git log --oneline master..task/<name>`)
-- Test suite passes
-- Branch pushed: `git push origin task/<name>`
-- Full report delivered in conversation
-
-## What you do NOT do
-
-- Merge branches
-- Create or push tags
-- Run deploys or healthchecks against production nodes
-- Delete branches or worktrees
-- Modify files in other worktrees
-- Push to `origin master` under any circumstances
--- a/.gitignore
+++ b/.gitignore
@@ -15,13 +15,10 @@ __pycache__/
 *$py.class
 venv/
 .venv/
-*.egg-info/

 # Tools
 .aider*
 .codex
-# worktree task marker created by scripts/dev/agent.sh new — must stay untracked per worktree
-.agent-task

 # OS files
 .DS_Store
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -1,212 +0,0 @@
-# CLAUDE.md
-
-This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
-
-## What This Repo Is
-
-GitOps-lite orchestration for a distributed homelab. The repo is the source of truth for infrastructure definitions; runtime state lives at `/opt/homelab/` on each execution node and is never committed.
-
-## Node Roles
-
-| Host | Role |
-|------|------|
-| **SATURN** | Primary control node — only node where commits are made |
-| **SOLARIA** | GPU/compute/AI workloads |
-| **PIHA** | Infra, monitoring |
-| **VPS** | Public ingress, reverse proxy, control plane host |
-| **CHELSTY-INFRA** | LTE edge hypervisor (site: chelsty); Zigbee2MQTT, Mosquitto, stability-agent — offline-first |
-| **CHELSTY-HA** | LTE Home Assistant VM (site: chelsty); connects to CHELSTY-INFRA MQTT broker — offline-first |
-
-All nodes communicate over Tailscale. CHELSTY-INFRA and CHELSTY-HA have an intermittent LTE uplink; their services must never depend on SATURN, VPS, or Forgejo at runtime. Full node capabilities: `hosts/<node>/capabilities.yaml`.
-
-## Deployment
-
-```bash
-scripts/deploy/deploy.sh                        # fresh deploy on current node
-scripts/deploy/deploy.sh --resume              # resume after interruption
-scripts/deploy/deploy.sh --stage verify        # specific stage only
-scripts/deploy/deploy.sh --service mosquitto   # specific service only
-./scripts/deploy/deploy-control-plane.sh --ssh # SATURN/SOLARIA → VPS
-./scripts/deploy/deploy-node.sh chelsty-infra  # CHELSTY nodes (individually)
-./scripts/bootstrap/prepare-node.sh            # general node bootstrap
-./scripts/bootstrap/chelsty-runtime.sh         # CHELSTY-specific bootstrap
-scripts/onboard/onboard.sh --node <name>       # onboard a new node (idempotent, bash)
-scripts/onboard/onboard.sh --node <name> --step 00-access   # single step
-scripts/onboard/onboard.sh --node <name> --dry-run          # simulate
-```
-
-Pipeline stages: **prepare → validate → deploy → verify → diagnose (on failure) → complete**. Stage state persisted in `/opt/homelab/state/deploy/`.
-
-## Node Onboarding
-
-New nodes are onboarded via `scripts/onboard/` — an idempotent bash tool driven by
-`hosts/<node>/node.yaml` manifests (no Ansible). See `scripts/onboard/README.md` for
-the full schema, step status table, and gotchas.
-
-Key fields in `node.yaml`: `ssh_user`, `first_contact` (LAN IP — not `.local`),
-`tailscale.hostname`, `deploy_autonomy`, `git_control`, `hardware.*`.
-
-## Service Structure
-
-Every service must follow this layout:
-
-```
-services/<service>/
-├── docker-compose.yml
-├── service.yaml       # Machine-readable contract (primary source of truth for agents)
-├── README.md
-├── env.example        # Template — never commit actual secrets
-└── healthcheck.sh     # Returns 0 (healthy) or 1 (unhealthy)
-```
-
-`service.yaml` defines `owner_node`, `exposure`, `dependencies`, `healthcheck`, `restart_policy`, `persistence.paths`, and `runtime.env_vars`. This is what AI agents read to understand how to manage a service.
-
-Host-specific runtime config and secrets live at `/opt/homelab/config/<service>/` on the target node (not in Git). Docker Compose overrides are version-controlled at `hosts/<node>/runtime/<service>/docker-compose.override.yml` in this repo and applied during deployment.
-
-## Agent System Architecture
-
-The platform uses a multi-agent model with **human-in-the-loop** for destructive actions:
-
-1. **Stability Agent** (`services/stability-agent/`) — Per-node watchdog. Monitors Docker containers, disk, Tailscale, MQTT. Emits filesystem events. Does NOT restart services autonomously.
-2. **Observer** (`services/control-plane/src/`) — Synthesizes world state from events into `/opt/homelab/world/{nodes,services,deployments,incidents}.json`.
-3. **Supervisor** — Detects drift between desired state (from `hosts/*/services.yaml`) and actual state (from Observer output). Writes `pending` action JSON files.
-4. **Executor** — Executes actions only after they transition to `approved`.
-5. **Operator UI** + **Telegram Bot** — Operators review and approve/reject pending actions.
-
-### Action approval flow
-```
-Agent → /opt/homelab/actions/pending/<id>.json
-      → Telegram notification → Operator approves
-      → /opt/homelab/actions/approved/<id>.json
-      → Executor runs → completed / failed
-```
-
-Agents must never execute destructive actions (restarts, deploys, config changes) without a corresponding approved action file.
-
-## Event System
-
-Events are append-only JSON lines at `/opt/homelab/events/YYYY-MM-DD/<node>/events.jsonl`.
-
-Emit via `scripts/lib/events.sh` (shell) or `scripts/lib/events.py` (Python).
-
-Normalized event types: `deployment_started/completed/failed`, `service_unhealthy/recovered`, `node_offline/online`, `healthcheck_failed`, `remediation_started/completed`.
-
-### Supervisor event routing table
-
-| Event type | Source | Action generated | Cooldown |
-|---|---|---|---|
-| `containers_not_running` | stability-agent | `container_restart` | dedup via stable ID |
-| `mqtt_unreachable` | stability-agent | `container_restart` | dedup via stable ID |
-| `service_unhealthy` / other | stability-agent | `redeploy` | dedup via stable ID |
-| `disk_pressure` (high) | stability-agent | `disk_cleanup` | dedup via stable ID |
-| `ha_websocket_dead` | ha-diag-agent | `container_restart` (homeassistant) | 30 min after completion |
-| `ha_websocket_recovered` | ha-diag-agent | cancels matching restart | — |
-| `ha_integration_failed` | ha-diag-agent | `alert_only` | 1 hour |
-| `ha_entity_unavailable_long` | ha-diag-agent | `alert_only` | 1 hour |
-| `ha_automation_failing` | ha-diag-agent | `alert_only` | 1 hour |
-| `ha_update_available` | ha-diag-agent | `alert_only` | 1 hour |
-| `ha_recorder_lag` | ha-diag-agent | `alert_only` | 1 hour |
-| `ha_system_health_degraded` | ha-diag-agent | `alert_only` | 1 hour |
-
-HA events are routed directly from the events directory by the supervisor (not via world-state drift loop) to avoid conflicts with stability-agent's independent container health tracking. HA events are suppressed if `homeassistant` had a `containers_not_running` incident within the last 5 minutes (planned restart/update in progress).
-
-## Discovery Entry Points for Agents
-
-When exploring the system, use these files in order:
-1. `inventory/topology.yaml` — node list, roles, mesh type
-2. `hosts/<node>/capabilities.yaml` — hardware and software constraints
-3. `hosts/<node>/services.yaml` — desired services and exposure classes for that host
-4. `services/<service>/service.yaml` — operational contract for a service
-
-## VPS-Specific Rules
-
-VPS has **4 GiB RAM, no swap**. Every repo-managed service must declare memory limits in its `hosts/vps/runtime/<service>/docker-compose.override.yml`.
-
-### Memory limit convention
-
-Use top-level Compose properties (not `deploy.resources.limits`, which requires Swarm mode):
-
-```yaml
-services:
-  myservice:
-    mem_limit: 256m      # cgroup ceiling; Docker restarts on breach
-    oom_score_adj: -900  # makes the host OOM killer very unlikely to pick this container (only -1000 exempts it fully)
-```
-
-Rules:
-- **Control-plane containers** (executor, observer, supervisor, operator-ui), **node-agent**, **stability-agent**: always set `oom_score_adj: -900` — these must never be a system-level OOM victim.
-- `mem_limit` still applies even with `oom_score_adj: -900`; the cgroup OOM killer is independent of the host OOM killer and will restart the container via Docker when the limit is exceeded.
-- Budget: OS+Docker reserves ~800 MiB; sum of all `mem_limit` values must stay ≤ 3200 MiB (3.1 GiB).
-
-### Repo-managed services on VPS
-
-All VPS services are now GitOps-managed. Service definitions live in `services/<name>/docker-compose.yml`; host-specific overrides (mem_limit, env) live in `hosts/vps/runtime/<name>/docker-compose.override.yml`.
-
-| Service | Compose stack | Data path |
-|---|---|---|
-| npm | `services/npm/` | `/home/dockeruser/docker/npm/{data,letsencrypt}` (bind mount) |
-| outline | `services/outline/` | Docker named volumes: `outline_outline_storage`, `outline_postgres_data`, `outline_redis_data` |
-| joplin | `services/joplin/` | Docker named volume: `joplin_postgres_data` |
-| ai-cluster | `services/ai-cluster/` | Mosquitto config bind: `/home/dockeruser/docker/ai-cluster/mosquitto/` |
-
-**Data migration rule**: data paths stay in place at cutover. Never move volumes or bind-mount sources without a dedicated migration plan.
-
-**Cutover checklist** (before running `docker compose up` for any migrated service):
-1. `git pull` on VPS
-2. Populate `/opt/homelab/config/<service>/.env` from the `env.example` template
-3. For ai-cluster: copy `/home/dockeruser/docker/ai-cluster/.env` to `/opt/homelab/config/ai-cluster/.env`
-4. For mosquitto: config stays at old bind path until explicitly migrated
-5. Verify named volumes exist: `docker volume ls | grep <project>`
-
-**ai-cluster architectural note**: compute workloads (codex-worker, planner-worker) belong on SOLARIA (GPU/compute node), not the 4 GB ingress VPS. Migrate when feasible; for now, hard mem_limits contain the blast radius.
-
-## CHELSTY-Specific Rules
-
-- Zigbee coordinator is **SLZB-06U** over TCP (`192.168.1.105:6638`, `ezsp` adapter). Never use `/dev/ttyUSB0`.
-- CHELSTY nodes run **docker-compose v1** (1.29.2) — use `docker-compose` (hyphenated), not `docker compose`.
-- Critical backup sets: HA config+data, Zigbee2MQTT config+db+network key, Mosquitto config+persistence, SLZB-06U coordinator state.
-
-## Runtime Path Conventions
-
-`/opt/homelab/` layout on each node:
-
-- `data/<service>/` — persistent volumes
-- `config/<service>/` — secrets and host-local overrides (not in Git)
-- `logs/<service>/` — service logs
-- `state/` — deployment stage markers, agent heartbeats
-- `events/` — append-only event store
-- `world/` — Observer output (synthesized state)
-- `actions/` — pending / approved / running / completed / failed
-
-## Definition of Done (services)
-
-Before any new or changed service is considered ready:
-
-1. **docker build + smoke run** — build the image locally and run it for a few seconds; confirm the process starts its main loop without crashing. This catches packaging/import errors (e.g. `ModuleNotFoundError`) before they reach a node.
-2. **pytest** — run the service's test suite. If no tests exist yet, add a minimal one (at minimum: import passes, core logic has at least one case). Tests live in `services/<service>/tests/`.
-3. **Never commit or deploy code that has never been run.** If a smoke run or test fails, fix it first.
-
-## Naming Conventions
-
-- Hosts: ALL CAPS (`SATURN`, `PIHA`)
-- Services: kebab-case (`stability-agent`, `zigbee2mqtt`)
-- Container names must match service names
-- Always `restart: unless-stopped` unless `service.yaml` says otherwise
-
-## Multi-agent worktree mode
-
-`~/homelab-codex-ws` (main checkout) is **deploy-only** and belongs to the human operator.
-Parallel agent tasks run in isolated git worktrees created by `scripts/dev/agent.sh new <name>`.
-
-**DISCIPLINE RULE — enforced after 2026-06-08 session violation:**
-All feature/implementation work MUST happen in a task worktree, never directly in the main
-checkout. The main checkout is for reading context and running deploys only. If you are
-about to create a new branch or make implementation commits while `pwd` is
-`~/homelab-codex-ws`, stop and ask the operator to run `agent.sh new <name>` first.
-
-If `.agent-task` exists in your current working directory, you are in a task worktree.
-**You must immediately read `.agent-task` and load `.claude/skills/worktree-aware/SKILL.md`
-before taking any action.** That skill defines all branch-hygiene rules for task worktrees.
-
-Worktree lifecycle commands: `agent.sh new | list | merge | clean`.
-Agents never invoke these — only the human does.
--- a/README.md
+++ b/README.md
@@ -13,22 +13,6 @@ The homelab consists of several nodes connected via a Tailscale internal mesh.
 | **PIHA** | Infra Node | Core infrastructure services, automation, and monitoring. |
 | **VPS** | Edge Node | Public ingress, reverse proxy, and edge services. |

-## Agent System
-
-The homelab uses a multi-agent orchestration model with human-in-the-loop for destructive actions:
-
-| Agent | Node | Role |
-|-------|------|------|
-| **stability-agent** | all nodes | Per-node watchdog — monitors Docker, disk, Tailscale, MQTT; emits events |
-| **node-agent** | all nodes | Publishes container health events to Redis pub/sub |
-| **observer** | VPS | Synthesizes world state from events into `/opt/homelab/world/*.json` |
-| **supervisor** | VPS | Detects drift between desired and actual state; writes `pending` actions |
-| **planner-agent** | SOLARIA | LLM-powered diagnosis — listens to Redis, proposes remediation actions |
-| **executor** | VPS | Executes actions only after operator approval |
-| **operator-ui** + **telegram-bot** | VPS / PIHA | Operator reviews and approves/rejects pending actions |
-
-Action approval flow: `pending/` → operator approves → `approved/` → executor runs.
-
 ## Repository Structure

 - `docs/`: [Infrastructure Standards](docs/standards.md) and [Deployment Conventions](docs/deployment.md).
@@ -45,13 +29,10 @@ Action approval flow: `pending/` → operator approves → `approved/` → execu
 ## Documentation Index

 - [Infrastructure Standards](docs/standards.md)
-- [Agent Operating Procedures](docs/agents.md) (For AI/Non-Human Agents)
 - [Deployment Conventions](docs/deployment.md)
 - [Hardware](docs/hardware.md)
 - [Networking](docs/networking.md)
 - [Services](docs/services.md)
-- [Node Capabilities](docs/capabilities.md)
-- [Action Model](services/agent-system/action-model.md)

 ---
 *Note: This repository documents the state of the homelab. Runtime state lives outside the repository in `/opt/homelab`.*
--- a/docs/agents.md
+++ b/docs/agents.md
@@ -1,49 +0,0 @@
-# Agent Operating Procedures
-
-This document defines the operating procedures, constraints, and interaction protocols for non-human agents (AI agents, autonomous scripts) within the Homelab Codex ecosystem.
-
-## 1. Core Principles for Agents
-
-1.  **Read-Only by Default**: Agents should assume read-only access to the `/opt/homelab` runtime unless explicitly executing an approved action.
-2.  **Git as Authority**: The repository on **SATURN** is the source of truth. Agents must not modify the runtime state on nodes directly without corresponding (or pending) Git state, unless it's an emergency mitigation.
-3.  **Human-in-the-Loop (HIL)**: All destructive or structural changes (restarts, deployments, config changes) must follow the [Action Approval Model](../services/agent-system/action-model.md).
-4.  **Idempotency**: All scripts and actions proposed or executed by agents MUST be idempotent.
-5.  **Context-Awareness**: Agents MUST read the `README.md` and `docs/agents.md` at the start of every session to align with current infrastructure standards.
-
-## 2. Agent Roles
-
-| Role | Responsibility | Scope |
-|------|----------------|-------|
-| **Observer** | Monitors health, logs, and events. | Read-only access to `/opt/homelab/events` and `logs`. |
-| **Stability Agent** | Local node watchdog, event emitter. | Local node runtime, `service.yaml` healthchecks. |
-| **Orchestrator** | High-level planning, workload placement. | Repository-wide, multi-node topology. |
-| **Materializer** | Translates high-level intent into Docker/System state. | Execution of `approved` actions. |
-
-## 3. Discovery Protocol
-
-Agents must use the following entry points to understand the system:
-
-1.  **Topology**: `inventory/topology.yaml` for node list and roles.
-2.  **Capabilities**: `hosts/<node>/capabilities.yaml` to understand hardware/software constraints.
-3.  **Service Contract**: `services/<service>/service.yaml` to understand how to check health and manage a service.
-4.  **Operational State**: `/opt/homelab/state/` on local nodes for real-time status.
-
-## 4. Interaction with Humans
-
-Agents communicate with the operator via the `agent-system/telegram-bot`. 
-
-- **Alerting**: Agents emit events to the event system. Critical events are forwarded to Telegram.
-- **Proposals**: When an agent identifies a need for change (e.g., "Service X is failing, suggest restart"), it creates a `pending` action in `/opt/homelab/actions/pending/`.
-- **Approval**: Agents must wait for the action status to transition to `approved` before execution.
-
-## 5. Decision Logic (Reasoning)
-
-When making decisions, agents MUST prioritize:
-1.  **Safety**: Do not violate power constraints (see `capabilities.yaml`).
-2.  **Stability**: Prefer keeping services on their `owner_node` unless it's down.
-3.  **Connectivity**: On intermittent nodes (CHELSTY), avoid actions requiring heavy WAN traffic during low-signal periods.
-
-## 6. Access Control for Agents
-
-- **Filesystem**: Agents should run as the `homelab` user or equivalent with restricted sudo access to `docker compose`.
-- **Secrets**: Agents MUST NOT attempt to read `.env` files unless specifically tasked with credential rotation. They should treat secrets as opaque handles.
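The proposal and approval rules in the removed document can be sketched as a directory-based gate. This is a minimal sketch under stated assumptions: only `/opt/homelab/actions/pending/` is named in the text, so the `approved/` directory, the one-JSON-file-per-action naming, and the function names `propose`, `is_approved`, and `execute_if_approved` are all hypothetical.

```python
import json
from pathlib import Path
from typing import Callable

# Assumed layout: <root>/pending/<id>.json and <root>/approved/<id>.json.
ACTIONS_ROOT = Path("/opt/homelab/actions")


def propose(action_id: str, payload: dict, root: Path = ACTIONS_ROOT) -> Path:
    """Write a pending action atomically (temp file, then rename)."""
    pending = root / "pending"
    pending.mkdir(parents=True, exist_ok=True)
    final = pending / f"{action_id}.json"
    tmp = pending / f"{action_id}.json.tmp"
    tmp.write_text(json.dumps(payload), encoding="utf-8")
    tmp.replace(final)  # rename is atomic on the same filesystem
    return final


def is_approved(action_id: str, root: Path = ACTIONS_ROOT) -> bool:
    """An action counts as approved only if its file exists under approved/."""
    return (root / "approved" / f"{action_id}.json").is_file()


def execute_if_approved(action_id: str, run: Callable[[], None],
                        root: Path = ACTIONS_ROOT) -> bool:
    """Run the callback only for approved actions; report whether it ran."""
    if not is_approved(action_id, root):
        return False
    run()
    return True
```

Nothing in this sketch ever writes to `approved/`; in the model described above, that transition belongs to the human operator.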
--- a/docs/backlog.md
+++ b/docs/backlog.md
@ -1,123 +0,0 @@
-# Tech-debt backlog
-
-Central tracker for tech debt and known defects. Entries come from sessions; add each one with a date and context.
-
---
-
-## Active
-
-### 🔴 BLOCKING (fleet-wide time bomb): node-agent SSH mount breaks after recreate
-
-**Date**: 2026-06-11
-**Source**: lustro ssh shipping fix session
-**Problem**: solaria/piha/chelsty still run old **root** node-agent containers (piha Created
-2026-05-27, uid 0) that predate the addition of `user: "1000:1000"` to the base compose. Their override
-mounts the SSH key at `/root/.ssh`, which only works for uid 0. The first `--force-recreate`,
-host reboot, or image update will switch the container to uid 1000 (`homelab`, HOME=/home/homelab),
-and event shipping to the VPS will fail with "Permission denied", exactly as it did on lustro
-(fixed in `a5a1352`). The `ssh` call in `_ship_events_to_vps()` has no `-i` and looks for the key
-in `$HOME/.ssh`.
-**⚠️ DO NOT RECREATE node-agent on solaria/piha/chelsty before the fix.**
-**Fix**: unify the mount to `/home/homelab/.ssh` in every
-`hosts/*/runtime/node-agent/docker-compose.override.yml` (template: `hosts/lustro/`),
-OR add `-i $HOME/.ssh/id_rsa` in `_ship_events_to_vps()`.
-
---
-
-### ha-diag-agent deploy BLOCKED (placeholder token)
-
-**Date**: 2026-06-11
-**Source**: session. Deploy config merged (`5e9db5c`); `.env` created on piha
-(`/opt/homelab/config/ha-diag-agent/.env`, chmod 600), but token = PLACEHOLDER.
-**Blocker**: chelsty-ha is offline, so there is no token and no connection.
-**To decide**: the HA target, chelsty-ha vs HA Ken (`homeassistant5` on piha; from inside the container
-it is NOT `localhost`).
-**Before `shadow_mode=false`**: the restart target in supervisor must be the container name
-`homeassistant5`; a curl of the HA endpoint with the token must return HTTP 200.
-
---
-
-### observer-poison-quarantine: branch review (`78c9e4a`)
-
-**Date**: 2026-06-11
-**Source**: session. Codex's patch is preserved on `task/observer-poison-quarantine`, NOT in master.
-**To do**: verify whether the observer really hangs on a malformed event
-(poison was NOT the cause of the lustro outage; that hypothesis was unverified and was refuted by
-verify-before-fix). Real bug → merge; otherwise → drop the branch and worktree.
-
---
-
-### node_agent.py: minor shipping cleanup
-
-**Date**: 2026-06-11
-**Source**: lustro ssh shipping fix session
-1. **Stale comment** at `node_agent.py:546-548` claims the container "runs as root";
-   outdated since `user: "1000:1000"`.
-2. **Shipping success is logged at `logger.debug`** → raise it to `info` or add a counter.
-   Working shipping is invisible in the logs at INFO, which made diagnosis harder
-   (a silent failure looked identical to silent success).
-
---
-
-### event-bloat: clean up the drained lustro backlog on the VPS
-
-**Date**: 2026-06-11
-**Source**: session. After the shipping fix, 7600+ backlog files drained into
-`/opt/homelab/events/lustro/` on the VPS.
-**Fix**: delete the old files (the observer has already processed them); long term, add a retention policy
-in event-store.
-
---
-
-### rsync `--omit-dir-times` (node-agent)
-
-**Date**: 2026-06-09
-**Source**: fleet recovery session
-**Symptom**: rsync exit code 23 after every push: `set-times` on the `/opt/homelab/events/` directory
-returns EPERM (oskar does not own the directory; aerbot does). Files are copied correctly,
-but exit 23 clutters the logs and can mask real errors.
-**Fix**: add `--omit-dir-times` to the `rsync` call in `node_agent.py`.
-**Location**: `services/node-agent/src/node_agent.py`, the rsync call in the push loop.
-**Update 2026-06-11**: confirmed fleet-wide. Every node logs a false
-"Event shipping failed" (rsync code 23) each cycle even though the files go through; the
-`/opt/homelab/events/*` directories on the VPS belong to `aerbot`, so the client cannot set times on them.
-
---
-
-### Declarative record of `oskar ∈ aerbot` in the VPS manifest
-
-**Date**: 2026-06-09
-**Source**: fleet recovery. Root cause: oskar was outside the aerbot (1000) group → rsync Permission denied
-**Problem**: group membership is managed by hand (ad-hoc `usermod -aG 1000 oskar`).
-Nothing guarantees it survives a VPS reinstall or a user change.
-**Fix**: add a `users: oskar: groups: [aerbot]` section to `hosts/vps/host.yaml` or
-`hosts/vps/capabilities.yaml`, and enforce it in the VPS deploy/bootstrap script.
-Alternative: change the owner of `/opt/homelab/events/` to `oskar:oskar` and update
-the node-agent deploy scripts.
-
---
-
-### One worktree per task (agent.sh)
-
-**Date**: 2026-06-09
-**Source**: session. `homelab-codex-ws-node-onboarding` was used once for `task/node-onboarding` and
-once for `task/fix-event-bloat` via a manual `git checkout`.
-**Problem**: one worktree shared by two branches is an anti-pattern. `git branch`
-could point at the wrong branch; the `+` in the listing suggested "in another worktree", which was not true.
-This leads to committing on the wrong branch.
-**Fix**: enforce one task = one worktree (`agent.sh new <task-name>`). On entering
-a worktree, always run `git branch --show-current` and verify `.agent-task`.
-Long term: `agent.sh new` should refuse if the requested branch is already checked out.
-
---
-
-## Closed
-
-### Observer staleness: dead node shown as NOMINAL
-
-**Date**: 2026-06-08 (caught), status: OPEN as far as implementation goes
-**Problem**: observer/supervisor keeps the last known state; there is no heartbeat TTL.
-Chelsty-infra is silent, yet its NOMINAL status undermines trust in the panel.
-**Fix**: heartbeat TTL → once exceeded, mark the status `stale` or `down`.
-**Related**: brain-watchdog is blind to per-node freshness.
-*(Open as an implementation TODO; carried over from the 2026-06-08 session)*
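The heartbeat TTL fix proposed in the last backlog entry might look roughly like the function below. It is only a sketch: the two thresholds, the `nominal` label for a fresh node, and the name `node_status` are illustrative assumptions, since the entry names only the `stale` and `down` outcomes.

```python
def node_status(last_seen: float, now: float,
                stale_after: float = 120.0, down_after: float = 600.0) -> str:
    """Classify a node by the age of its last heartbeat (seconds since epoch).

    The thresholds are placeholders, not values taken from the backlog entry.
    """
    age = now - last_seen
    if age >= down_after:
        return "down"
    if age >= stale_after:
        return "stale"
    return "nominal"
```

Passing `now` in explicitly, rather than calling `time.time()` inside, keeps the check deterministic and easy to test.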
--- a/docs/capabilities.md
+++ b/docs/capabilities.md
@ -83,10 +83,3 @@ Future autonomous agents will use this metadata to:
 2.  **Generate Plans:** Create step-by-step deployment or migration plans based on hardware compatibility.
 3.  **Validate Topology:** Ensure that a proposed multi-node setup doesn't violate networking or operational constraints (e.g., don't put a DB on an intermittent node).
 4.  **Propose Failover:** Automatically suggest the best alternative node during an outage.
-
-## Agent Reasoning Logic
-
-When an agent parses `capabilities.yaml`, it should apply these heuristics:
-- **Intermittent Connectivity**: If `operational.connectivity == "intermittent"`, do not schedule high-bandwidth syncs or critical cloud-dependent services.
-- **Power Constraints**: If `operational.power_constraint == "low-power"`, avoid heavy LLM inference or continuous high-CPU tasks.
-- **Availability Target**: If `availability_target == "high"`, this node is a candidate for hosting control-plane failovers.
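The three heuristics in the removed section could be expressed as one small pure function. This is a minimal sketch, assuming `capabilities.yaml` has already been parsed into a dict; `placement_flags` and its output keys are hypothetical names, and `availability_target` is read from the top level exactly as the text writes it.

```python
def placement_flags(caps: dict) -> dict:
    """Derive scheduling hints from a parsed capabilities.yaml document."""
    operational = caps.get("operational", {})
    return {
        # Intermittent uplink: no high-bandwidth syncs or cloud-dependent services.
        "allow_high_bandwidth_sync": operational.get("connectivity") != "intermittent",
        # Low-power node: no heavy LLM inference or sustained high-CPU work.
        "allow_heavy_compute": operational.get("power_constraint") != "low-power",
        # High availability target: candidate for control-plane failover.
        "control_plane_failover_candidate": caps.get("availability_target") == "high",
    }
```

Missing keys fall through to the permissive default here; a stricter agent might treat an absent field as unknown and refuse to schedule.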
--- a/docs/chelsty-runtime.md
+++ b/docs/chelsty-runtime.md
@ -1,154 +1,60 @@
 # CHELSTY Runtime

-This document describes the runtime environment and deployment flow for CHELSTY, an offline-capable home automation edge node split across two VMs.
-
-| Node | Role | Services |
-|------|------|----------|
-| `chelsty-infra` | LTE edge hypervisor | Mosquitto, Zigbee2MQTT, stability-agent, node-agent |
-| `chelsty-ha` | Home Assistant VM | homeassistant (no node-agent — see below) |
-
-Both nodes share an LTE uplink and must function fully offline (Zigbee, MQTT, HA automations) without any connectivity to SATURN, VPS, or Forgejo.
+This document describes the runtime environment and deployment flow for CHELSTY, an offline-capable home automation edge node.

 ## Runtime Layout

-```
-/opt/homelab/
-├── config/          # Service-specific configs and secrets (not in Git)
-│   ├── mosquitto/
-│   └── zigbee2mqtt/
-├── data/            # Persistent service data
-│   ├── mosquitto/   # Persistence DB, password file
-│   └── zigbee2mqtt/
-│       └── data/    # z2m config, coordinator backup, network key
-└── logs/
-```
+The CHELSTY runtime is located at `/opt/homelab`.
+
+- `/opt/homelab/config/`: Service-specific configurations and compose overrides.
+- `/opt/homelab/data/`: Persistent data for services.
+- `/opt/homelab/logs/`: Service logs.
+
+### Key Service Locations
+- **Mosquitto**: `/opt/homelab/config/mosquitto/`
+- **Zigbee2MQTT**: `/opt/homelab/config/zigbee2mqtt/`

 ## SLZB-06U Integration

-CHELSTY uses a SMLIGHT SLZB-06U Zigbee coordinator connected over Ethernet/TCP.
+CHELSTY uses a SMLIGHT SLZB-06U Zigbee coordinator connected via Ethernet/TCP.

-- **Coordinator IP**: `192.168.1.105`
-- **Port**: `6638`
-- **Adapter**: `ezsp` (deprecated; migration to `ember` is recommended and requires only setting `adapter: ember` in `configuration.yaml`)
-- **Zigbee2MQTT config key**: `serial.port: tcp://192.168.1.105:6638`
+- **Coordinator IP**: 192.168.1.105
+- **Port**: 6638
+- **Protocol**: TCP (ezsp adapter)

-⚠️ Never use `/dev/ttyUSB0` — the coordinator is always TCP-only on this site.
+Zigbee2MQTT is configured to connect to this coordinator over the local network.

-## Networking Constraints
+## Offline & LTE Assumptions

-### Mosquitto — `network_mode: host`
-Mosquitto runs with `network_mode: host` so that all containers on the same host can reach it at `localhost:1883`. **Do not change this.**
-
-### Zigbee2MQTT — bridge network + extra_hosts
-Zigbee2MQTT runs in a bridge-networked container (needed for port mapping compatibility with docker-compose v1). To reach the host-networked Mosquitto:
-
-```yaml
-# hosts/chelsty-infra/runtime/zigbee2mqtt/docker-compose.override.yml
-services:
-  zigbee2mqtt:
-    extra_hosts:
-      - "mosquitto:host-gateway"
-```
-
-This maps the `mosquitto` hostname inside the z2m container to the Docker host gateway IP, so `mqtt://mosquitto:1883` reaches the host-networked Mosquitto process.
-
-**Why not `network_mode: host` for z2m?**  
-chelsty-infra runs docker-compose v1 (1.29.2). In v1, `network_mode: host` cannot coexist with `ports:` declared in the base `docker-compose.yml` — raises `InvalidArgument`. The `extra_hosts` approach avoids this.
-
-## Zigbee2MQTT Config Location
-
-The `configuration.yaml` **must be writable** — z2m migrates and rewrites it on startup. It lives in the data directory:
-
-```
-/opt/homelab/data/zigbee2mqtt/data/configuration.yaml
-```
-
-This path is mounted read-write by the base `docker-compose.yml`:
-```yaml
-volumes:
-  - /opt/homelab/data/zigbee2mqtt/data:/app/data
-```
-
-Do **not** mount `configuration.yaml` as a separate `:ro` volume — z2m will fail with `EROFS`.
-
-### Minimal configuration.yaml
-```yaml
-homeassistant: true
-permit_join: false
-mqtt:
-  base_topic: zigbee2mqtt
-  server: mqtt://mosquitto:1883
-serial:
-  port: tcp://192.168.1.105:6638
-  adapter: ezsp
-frontend:
-  port: 8080
-advanced:
-  log_level: info
-```
-
-## chelsty-ha — No node-agent
-
-`chelsty-ha` does not have a node-agent deployed. Home Assistant is monitored indirectly: if MQTT goes silent on `chelsty-infra`, HA is likely down.
-
-In `hosts/chelsty-ha/services.yaml`:
-```yaml
-services:
-  homeassistant:
-    monitor: false   # No node-agent; suppresses supervisor action generation
-```
-
-Remove `monitor: false` once node-agent is bootstrapped on this VM.
+- **WAN Resilience**: All core automation (Zigbee, MQTT) runs locally on CHELSTY.
+- **Connectivity**: LTE provides intermittent uplink for remote management and Tailscale access.
+- **Home Assistant**: Runs in a separate VM, connecting to the Mosquitto broker on CHELSTY.

 ## Deployment Flow

-### Initial Bootstrap
-```bash
-./scripts/bootstrap/chelsty-runtime.sh
-```
+1. **Initial Bootstrap**:
+   Run the bootstrap script on the CHELSTY node:
+   ```bash
+   ./scripts/bootstrap/chelsty-runtime.sh
+   ```

-### Deploy services
-```bash
-./scripts/deploy/deploy-node.sh chelsty-infra
-./scripts/deploy/deploy-node.sh chelsty-ha
-```
+2. **Manual Configuration**:
+   - Edit `/opt/homelab/config/zigbee2mqtt/.env` with MQTT credentials.
+   - Add Mosquitto user:
+     ```bash
+     sudo mosquitto_passwd -b /opt/homelab/data/mosquitto/config/password.txt <user> <password>
+     ```

-### Manual (SSH) — chelsty-infra uses docker-compose v1
-```bash
-ssh oskar@100.122.201.22
-cd ~/homelab-codex-ws/services/<service>
-docker-compose -f docker-compose.yml \
-  -f ../../hosts/chelsty-infra/runtime/<service>/docker-compose.override.yml \
-  up -d --build --force-recreate
-```
+3. **Service Deployment**:
+   Use the staged deployment runtime:
+   ```bash
+   ./scripts/deploy/deploy-node.sh chelsty
+   ```

-> **Note:** `docker compose` (v2) is **not** available on chelsty-infra — always use `docker-compose` (hyphenated, v1 1.29.2).
+## Recovery Procedure

-## Recovery Procedures
-
-### Mosquitto stopped
-```bash
-ssh oskar@100.122.201.22 "docker start mosquitto"
-# Ensure restart policy is correct:
-docker update --restart unless-stopped mosquitto
-```
-
-### Zigbee2MQTT won't start
-1. Check logs: `docker logs zigbee2mqtt --tail 50`
-2. Verify SLZB-06U reachable from host: `nc -zv 192.168.1.105 6638`
-3. Verify config is not empty: `cat /opt/homelab/data/zigbee2mqtt/data/configuration.yaml`
-4. If config missing, recreate from the minimal template above
-
-### SLZB-06U unreachable
-`192.168.1.105:6638` EHOSTUNREACH means the coordinator is offline or the LAN is down. Zigbee2MQTT will keep retrying — no restart needed once the coordinator returns.
-
-## Critical Backup Sets
-
-| Data | Path |
-|------|------|
-| HA config + DB | `/opt/homelab/data/homeassistant/` on chelsty-ha |
-| z2m config + coordinator backup + network key | `/opt/homelab/data/zigbee2mqtt/data/` |
-| Mosquitto persistence + password file | `/opt/homelab/data/mosquitto/` |
-| SLZB-06U coordinator state | Backup via SLZB-06U web UI at `192.168.1.105` |
-
-> ⚠️ The Zigbee network key is in `configuration.yaml` or `coordinator_backup.json` — losing it requires re-pairing all devices.
+In case of runtime failure:
+1. Verify Docker and Compose plugin: `docker compose version`
+2. Re-run bootstrap script to ensure directory structure and basic configs.
+3. Check Mosquitto logs: `tail -f /opt/homelab/data/mosquitto/log/mosquitto.log`
+4. Verify SLZB-06U reachability: `ping 192.168.1.105`
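The new recovery step checks the coordinator with `ping`, while the removed text probed the actual TCP port with `nc -zv 192.168.1.105 6638`. A port-level check can be sketched in Python as follows; `tcp_reachable` is a hypothetical helper name, and the 3-second timeout is an arbitrary choice.

```python
import socket


def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False


# Example call against the coordinator address given in this document:
# tcp_reachable("192.168.1.105", 6638)
```

A successful ping only shows that the host answers ICMP; opening the port also confirms that the coordinator's TCP listener is up.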
--- a/docs/chelsty-stability-agent.md
+++ b/docs/chelsty-stability-agent.md
@ -19,7 +19,7 @@ It acts as a filesystem-first watchdog that detects anomalies in the local runti

 *   **Heartbeat**: Updated every cycle at `/opt/homelab/state/stability-agent.heartbeat`.
 *   **State Summary**: A JSON summary of all latest checks at `/opt/homelab/state/stability-agent.json`.
-*   **Events**: Append-only JSON lines at `/opt/homelab/events/YYYY-MM-DD/chelsty-infra/events.jsonl`.
+*   **Events**: Append-only JSON lines at `/opt/homelab/events/YYYY-MM-DD/chelsty/events.jsonl`.

 #### Deployment

--- a/docs/observer-runtime.md
+++ b/docs/observer-runtime.md
@ -1,98 +0,0 @@
-# Observer Runtime
-
-The Observer Runtime is a lightweight agent responsible for synthesizing the operational world state of the homelab from raw events, logs, and state files.
-
-## Architecture
-
-The observer follows a filesystem-first approach, consuming append-only events and generating a normalized world model. It is designed to be idempotent, resumable, and resilient to intermittent node connectivity.
-
-### Inputs
-- `/opt/homelab/events/`: Normalized JSON events (one `.json` file per event, organized by date and node).
-- `/opt/homelab/state/observer_checkpoint.json`: Per-node checkpoint dict (see below).
-- Repository Inventory: `inventory/topology.yaml` and `hosts/*/services.yaml`.
-
-### World Model Output
-Generated under `/opt/homelab/world/`:
-- `nodes.json`: Current node availability, roles, disk/memory pressure, last seen timestamps. Dict keyed by node name.
-- `services.json`: Service health status and links to active incidents. Dict keyed by `"node/service"`.
-- `deployments.json`: Tracking of active and historical deployment runs by `correlation_id`.
-- `incidents.json`: Correlated operational issues, including repeat failures and resolution status.
-- `runtime-summary.json`: High-level overview for dashboards and planner agents.
-
-## Checkpoint Format
-
-The observer tracks per-node progress to avoid silently skipping event directories:
-
-```json
-{
-  "node_checkpoints": {
-    "vps":            "/opt/homelab/events/2026-05-27/vps/evt-vps-1234.json",
-    "piha":           "/opt/homelab/events/2026-05-27/piha/evt-piha-5678.json",
-    "chelsty-infra":  "/opt/homelab/events/2026-05-27/chelsty-infra/evt-chelsty-infra-9012.json"
-  }
-}
-```
-
-A single global checkpoint (`last_processed_file`) was replaced with this per-node dict because the old approach silently skipped any node directory that sorts alphabetically before the last-seen node (e.g. `piha/` would be skipped when the checkpoint pointed to `vps/`).
-
-**Reset:** Delete `/opt/homelab/state/observer_checkpoint.json`. The observer will reprocess all events and rebuild world state from scratch.
-
-## Event Types
-
-### Negative events (create/escalate incidents)
-- `service_unhealthy`, `healthcheck_failed`: open or increment an active incident
-- `deployment_failed`: record failure in deployments.json
-
-### Positive events (resolve state)
-- `service_healthy`: marks service status as `healthy` **and** resolves any active incident for that service
-- `service_recovered`: alias, same effect
-- `deployment_completed`: marks deployment as completed
-
-### Node events
-- `node_online`, `node_offline`: update node status in nodes.json
-- `disk_pressure_*`: set `disk_pressure` field on the node record
-
-## Incident Lifecycle
-
-1.  **Detection**: A `service_unhealthy` or `healthcheck_failed` event creates or increments an active incident.
-2.  **Correlation**: Multiple failure events for the same `node/service` are collapsed into one incident, incrementing `occurrence_count`.
-3.  **Resolution**: A `service_healthy` or `service_recovered` event resolves any active incident for that service, setting `status: resolved` and `resolved_at`.
-4.  **Expiry**: Resolved incidents older than 7 days are pruned from world state by `_prune_stale_world()`.
-
-### Example Incident JSON
-```json
-{
-  "inc-1715518800-vps-observer": {
-    "id": "inc-1715518800-vps-observer",
-    "node": "vps",
-    "service": "observer",
-    "status": "resolved",
-    "severity": "error",
-    "started_at": 1715518800.0,
-    "last_occurrence": 1715518860.0,
-    "occurrence_count": 2,
-    "trigger_type": "containers_not_running",
-    "resolved_at": 1715519100.0
-  }
-}
-```
-
-## World State Pruning
-
-`_prune_stale_world()` runs every reconcile cycle and removes:
-
-1. **Stale nodes** — nodes not present in `inventory/topology.yaml` (e.g. ghost nodes created when `NODE_NAME` was unset and fell back to the container's 12-char hex ID).
-2. **Services of stale nodes** — all `node/service` keys whose node was pruned.
-3. **Ghost service keys** — service keys whose service-name portion matches the pattern `<12hexchars>_<name>` (Docker internal stale-state artifacts, created when node-agent used `c.name` instead of the compose label).
-4. **Expired incidents** — resolved incidents older than 7 days.
-
-## Runtime Behavior
-
-### Idempotency
-The observer processes events in order. Deleting the checkpoint and restarting replays all events and produces the same world state.
-
-### Deployment Tracking
-Deployments are tracked via `correlation_id`. The observer synthesizes the start, end, and status of each deployment run from events.
-
-### Topology Filtering
-Events from nodes not listed in `inventory/topology.yaml` are discarded during pruning. This prevents transient bootstrap noise from polluting world state.
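The checkpoint section of the removed document explains why a single global checkpoint skipped whole node directories. The sketch below reproduces that effect and the per-node alternative. It assumes paths of the form `<events>/<date>/<node>/<file>.json` and plain string ordering; the helper names are hypothetical and this is not the observer's actual code.

```python
from pathlib import PurePosixPath


def node_of(path: str) -> str:
    """Node name is the directory directly above the event file."""
    return PurePosixPath(path).parent.name


def unprocessed_global(paths: list[str], last_processed_file: str) -> list[str]:
    """Old approach: one checkpoint compared against every path."""
    return [p for p in sorted(paths) if p > last_processed_file]


def unprocessed_per_node(paths: list[str], node_checkpoints: dict[str, str]) -> list[str]:
    """New approach: each path is compared only to its own node's checkpoint."""
    result = []
    for p in sorted(paths):
        checkpoint = node_checkpoints.get(node_of(p))
        if checkpoint is None or p > checkpoint:
            result.append(p)
    return result
```

With a global checkpoint pointing into `vps/`, any new file under `piha/` on the same date sorts before it and is never returned; the per-node lookup returns it.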
--- a/docs/sessions/2026-05-27-planner-agent.md
+++ b/docs/sessions/2026-05-27-planner-agent.md
@ -1,234 +0,0 @@
-# SESSION: Building planner-agent: LLM-based diagnostics
-
-**DATE:** 2026-05-27  
-**RESULT:** planner-agent runs on SOLARIA (`healthy`); Ollama is primary, cloud fallback is ready to be enabled
-
---
-
-## What was built
-
-### `services/planner-agent/src/llm_router.py`
-
-LLM routing module with a local-first fallback chain:
-
-- **`LLMRouter`**: main routing class, built on litellm
-- **`ModelConfig`**: configuration for one model (name, timeout, api_base, extra_kwargs)
-- **`ModelMetrics`**: counters per model × outcome (`success`/`fallback`/`error`); success_rate
-- **`RouteResult`**: routing result with `content`, `model_used`, `attempts`, `latency_ms`
-- **`AttemptRecord`**: record of one attempt (model, outcome, reason, latency_ms)
-- **`_extract_json_from_fence()`**: extracts JSON from ` ```json ``` ` blocks when the model does not reply with bare JSON
-
-Default chain: `ollama/qwen2.5:7b` (8s) → `claude-haiku-4-5-20251001` (30s) → `claude-sonnet-4-6` (30s)
-
-Metrics for every call are published on the Redis channel `llm_router_metrics`.
-
-### `services/planner-agent/src/planner.py`
-
-Main agent loop:
-
-- **`PlannerAgent`**: async agent: Redis sub → LLM diagnosis → pending action file → event
-- **`HealthEvent`**: normalized health event from Redis (node, service, event_type, severity, payload)
-- **`ActionProposal`**: action proposal with full metadata; `.to_action_file()` → executor format
-- **`CooldownTracker`**: 5-minute gate per `svc_key` (node/service); does NOT register when the LLM failed
-- **`parse_event()`**: normalizes two input formats (node-agent / control-plane)
-- **`write_pending_action()`**: atomic write: `.tmp` → rename
-- **`emit_event()`**: writes a `remediation_started` event to the filesystem (no imports from control-plane)
-
-Pipeline:
-```
-Redis msg → parse_event() → benign skip → cooldown gate → _propose_action() (LLM)
-         → write_pending_action() → emit_event("remediation_started")
-```
-
-### Companion files
-
-| File | Description |
-|------|------|
-| `service.yaml` | Operational contract: owner_node=solaria, deps=redis+ollama, healthcheck=file |
-| `docker-compose.yml` | env_file + extra_hosts:host-gateway + ANTHROPIC_API_KEY in environment |
-| `Dockerfile` | python:3.11-slim, litellm, redis, jsonschema, structlog |
-| `healthcheck.sh` | Checks the age of the heartbeat file (max 300s) |
-| `requirements.txt` | litellm, redis, jsonschema, structlog |
-| `tests/test_planner.py` | 49 unit tests |
-| `tests/test_llm_router.py` | 34 unit tests |
-
---
-
-## Key architectural decisions
-
-### 1. HITL invariant (Human-in-the-loop)
-
-The planner **only** writes to `actions/pending/`. The executor requires a file in `actions/approved/`.
-The planner never executes an action on its own; this is a fundamental rule of the system.
-
-Implementation: `write_pending_action()` writes to `pending/`; no code path touches `approved/`.
-
-### 2. Cooldown gate
-
-Per `svc_key` (= `node/service`), 5 minutes by default. Goal: avoid flooding the operator with repeated
-proposals for the same service.
-
-**Key decision:** the cooldown is NOT registered when the whole LLM chain failed.
-That way the next event can retry instead of being silently blocked
-for 5 minutes even though no proposal was produced.
-
-### 3. Fallback chain — local-first
-
-Order: Ollama (local GPU) → Haiku → Sonnet.
-
-Rationale:
-- Ollama sends no data to external services; low latency for simple cases
-- Haiku = fast, cheap cloud fallback
-- Sonnet = last resort for hard cases
-
-A model is rejected on: timeout, network error, refusal pattern, invalid JSON, schema error.
-
-### 4. No imports from control-plane
-
-`services/planner-agent/` is fully self-contained. It imports nothing from
-`services/control-plane/`. Event emission is implemented locally (a copy of the logic in
-`scripts/lib/events.py`).
-
-Rationale: the planner must work even when control-plane is offline; it also has a separate
-deployment cycle.
-
-### 5. structlog with PrintLoggerFactory
-
-We do not use `structlog.stdlib.add_logger_name`: `PrintLogger` has no `.name` attribute.
-Instead the processor chain is: `add_log_level` → `TimeStamper` → `StackInfoRenderer`
-→ `format_exc_info` → `JSONRenderer`.
-
-### 6. NODE_NAME read at call time, not import time
-
-`_emit_event_sync` reads the module-level `NODE_NAME` on every call
-(not as a default parameter). This allows patching it in tests.
-
---
-
-## Problems encountered and solutions
-
-### Problem: `localhost` inside the container does not reach the host
-
-**Context:** Ollama runs on SOLARIA at `localhost:11434`. A Docker container
-on the default bridge network cannot reach the host via `localhost`.
-
-**Solution:**
-1. Added `extra_hosts: - "host-gateway:host-gateway"` to docker-compose.yml
-2. `.env` uses `OLLAMA_HOST=http://host-gateway:11434`
-
-### Problem: `environment` vs `env_file`: duplicated variables
-
-**Context:** The first version of docker-compose.yml hardcoded every variable
-in the `environment` section with fallback values (`${VAR:-default}`). As a result,
-`.env` was optional rather than required.
-
-**Solution:** All runtime variables were removed from `environment` and moved to `env_file`.
-Only `ANTHROPIC_API_KEY` remains in `environment` (an optional secret that should not sit in a file on disk).
-
-### Problem: `structlog.stdlib.add_logger_name` crashes with PrintLogger
-
-**Symptom:** `AttributeError: 'PrintLogger' object has no attribute 'name'`
-
-**Solution:** Removed `add_logger_name` from the processor chain. It is not
-compatible with `PrintLoggerFactory`.
-
-### Problem: verify stage fails right after startup
-
-**Symptom:** `deploy.sh` reports FAILED at verify because the heartbeat does not exist yet.
-
-**Cause:** Race condition: the agent needs a few seconds to start
-its loop and perform the first heartbeat `touch()`.
-
-**Solution:** This is not a real error. The Docker healthcheck has `start_period: 30s`.
-The container shows `(healthy)` 30s after start.
-
-### Problem: git pull with divergent branches on solaria
-
-**Symptom:** Solaria had 2 local commits that were not on Forgejo, plus manual changes in the working tree.
-`git pull` failed with "Need to specify how to reconcile divergent branches."
-
-**Solution:**
-```bash
-git checkout -- services/planner-agent/docker-compose.yml  # discard manual changes
-git fetch origin
-git rebase origin/master  # rebase local commits on top of master
-```
-
---
-
-## Deployment status on SOLARIA
-
-```
-Container:  planner-agent   Up ~30m (healthy)
-Image:      planner-agent-planner-agent
-Node:       solaria (100.100.231.104)
-Heartbeat:  /opt/homelab/state/planner-agent.heartbeat  (age 0s)
-
-Channels subscribed:
-  - health_events
-  - world_updates
-
-LLM chain:
-  PRIMARY:  ollama/qwen2.5-coder:14b @ http://host-gateway:11434
-  FALLBACK: claude-haiku-4-5-20251001  (disabled: no ANTHROPIC_API_KEY)
-  FALLBACK: claude-sonnet-4-6          (disabled: no ANTHROPIC_API_KEY)
-
-Redis:      redis://100.108.208.3:6379  ✓ connected
-```
-
---
-
-## Left for later
-
-### 1. ANTHROPIC_API_KEY: cloud fallback disabled
-
-Haiku and Sonnet are configured in the chain but have no API key.  
-When Ollama cannot cope (complex case / timeout), the chain fails with no fallback.
-
-To enable:
-```bash
-ssh oskar@100.100.231.104
-echo "ANTHROPIC_API_KEY=sk-ant-..." >> /opt/homelab/config/planner-agent/.env
-docker compose -f ~/homelab-codex-ws/services/planner-agent/docker-compose.yml up -d
-```
-
-### 2. End-to-end test with a real event
-
-The planner is connected to Redis and listening, but no event has yet
-gone through the full path (LLM call → pending action → operator UI).
-
-Test:
-```bash
-redis-cli -h 100.108.208.3 PUBLISH health_events '{
-  "type": "service_unhealthy",
-  "node": "piha",
-  "service": "mosquitto",
-  "severity": "error",
-  "payload": {"reason": "container exited"},
-  "timestamp": "2026-05-27T20:00:00Z"
-}'
-# Watch: docker logs planner-agent -f
-# Check: ls /opt/homelab/actions/pending/
-```
-
-### 3. Solaria local commits
-
-Solaria has 2 local commits (`feat: add ECC skills`, `fix: remove duplicate CLAUDE.md sections`)
-that are not on Forgejo. They were rebased on top of master but not pushed.
-They should be pushed, or reviewed and possibly squashed.
-
-### 4. Integration with operator UI / Telegram
-
-Proposals in `actions/pending/` do not yet have a notification channel to the operator.
-The Telegram bot should send a notification when a new file appears in `pending/`.
-
---
-
-## Commits from this session
-
-```
-ff6fda1  planner-agent: use env_file, keep only ANTHROPIC_API_KEY in environment
-ca37fca  Add planner-agent: LLM-powered remediation planner
-         (llm_router.py, planner.py, tests, service.yaml, docker-compose.yml,
-          healthcheck.sh, Dockerfile)
-```
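The cooldown decision described in the removed session notes (do not register a cooldown when the whole LLM chain failed) can be illustrated as below. This is a sketch, not the code from `planner.py`: the class reuses the `CooldownTracker` name from the notes, but its methods, the injected clock, and the `handle` helper are assumptions.

```python
import time
from typing import Callable, Optional


class CooldownTracker:
    """Per-service gate: at most one proposal per cooldown window."""

    def __init__(self, cooldown_s: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._cooldown_s = cooldown_s
        self._clock = clock
        self._last: dict[str, float] = {}

    def allowed(self, svc_key: str) -> bool:
        last = self._last.get(svc_key)
        return last is None or self._clock() - last >= self._cooldown_s

    def record(self, svc_key: str) -> None:
        self._last[svc_key] = self._clock()


def handle(svc_key: str, tracker: CooldownTracker,
           propose: Callable[[], Optional[dict]]) -> Optional[dict]:
    """Run the LLM proposal step behind the cooldown gate.

    `propose` is assumed to return None when every model in the chain failed.
    """
    if not tracker.allowed(svc_key):
        return None
    proposal = propose()
    if proposal is not None:
        # Register the cooldown only after a proposal was actually produced,
        # so a failed chain does not block the next event for five minutes.
        tracker.record(svc_key)
    return proposal
```

Injecting the clock keeps the gate testable without sleeping through a real five-minute window.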
--- a/docs/sessions/2026-05-27.md
+++ b/docs/sessions/2026-05-27.md
@ -1,103 +0,0 @@
-# SESSION: Stabilizing the homelab multi-agent system
-
-**DATE:** 2026-05-27  
-**RESULT:** System NOMINAL (97/97 services, 0 errors)
-
---
-
-## PROBLEMS FOUND
-
-- stability-agent generated no remediation actions: only redeploy, no container_restart
-- mosquitto on chelsty-infra went down and nothing restarted it (restart policy was `no`)
-- zigbee2mqtt had never been deployed on chelsty-infra
-- node-agent was an empty skeleton: it did not emit `service_healthy`, so `services.json` was always empty
-- ghost services: node-agent used `c.name` (which can return `<12hex>_real-name`) instead of the `com.docker.compose.service` label
-- materializer on piha read from its local Redis instead of the control-plane API; Redis held 80 stale entries with ghost keys, and "Copy for AI" returned old data
-- observer used a single global checkpoint instead of per-node ones, silently skipping event directories that sort before the current checkpoint
-- supervisor did not cancel resolved actions, so the pending queue grew without bound
-- the `service_healthy` event did not close active incidents
-- NODE_ALIAS_MAP was not configured: node names in events did not match the topology
-- chelsty-ha was wrongly in monitoring scope; it has no node-agent
-
---
-
-## FIXES SHIPPED (commits in master)
-
-```
-7277bdc Fix Copy for AI: materializer fetches from control-plane API instead of Redis
-b40b832 Fix ghost service keys from hash-prefixed Docker container names
-28e9534 observer: service_healthy resolves active incidents
-46ae92b supervisor: also cancel pending actions for services removed from desired state
-410bfe7 zigbee2mqtt: config goes in data dir (writable), not separate ro mount
-b3912fe zigbee2mqtt: use extra_hosts host-gateway instead of network_mode: host
-61e07f4 zigbee2mqtt override: clear ports list for docker-compose v1 host network compat
-51002d4 Fix pending actions: node_exporter, zigbee2mqtt, chelsty-ha monitoring
-fb7828b supervisor: auto-cancel pending actions when drift is resolved
-2f19657 fix(node-agent): unique event IDs per service to prevent same-second overwrites
-267742c vps/node-agent: add network_mode: host for control-plane health probe
-4e8968f Fix service health tracking: emit service_healthy, control-plane endpoint, checkpoint migration
-f4a8db9 fix(observer): per-node-directory checkpoints replace single global checkpoint
-a5a3e22 fix(node-agent): skip SSH config file in rsync to avoid UID ownership errors
-2349de5 fix(node-agent): correct VPS_EVENTS_HOST to actual VPS Tailscale IP
-65bac4e fix(node-agent): mount host SSH key into container for event shipping
-96bf326 fix(observer+operator-ui): fix stale world state, dict→list API, event time filter
-ae33cce feat(node-agent): add runtime overrides for piha, solaria, chelsty-infra
-c5c080b feat(vps): add node-agent runtime override with NODE_NAME=vps
-01b7758 feat(node-agent): implement health monitor and safe cleanup policy
-```
-
-### Key fix details
-
-**fix(observer): per-node checkpoints**  
-A single global checkpoint, `last_processed_file`, silently skipped event directories that sort alphabetically before the last processed node (e.g. piha/ < vps/). It was replaced with a per-node dictionary: `{"node_checkpoints": {"piha": "...", "vps": "..."}}`.
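
A minimal sketch of why the global checkpoint dropped events, assuming event files are compared as `<node>/<file>` path strings. The function names and file names below are hypothetical and are not taken from the observer source.

```python
def unprocessed_global(files, last_processed_file):
    """Buggy variant: one checkpoint is compared against every node's files."""
    return [f for f in sorted(files) if f > last_processed_file]


def unprocessed_per_node(files, node_checkpoints):
    """Fixed variant: each file is compared only with its own node's checkpoint."""
    pending = []
    for f in sorted(files):
        node = f.split("/", 1)[0]
        # A node without a checkpoint yet has everything pending.
        if f > node_checkpoints.get(node, ""):
            pending.append(f)
    return pending


# Hypothetical event files; piha/ sorts before vps/.
files = ["piha/0002.json", "vps/0001.json", "vps/0002.json"]

# The global checkpoint sits on a vps/ file, so the new piha/ file is skipped.
print(unprocessed_global(files, "vps/0001.json"))

# Per-node checkpoints pick it up.
checkpoints = {"piha": "piha/0001.json", "vps": "vps/0001.json"}
print(unprocessed_per_node(files, checkpoints))
```

The per-node variant trades one string for a small dictionary, which is the shape of the `node_checkpoints` structure described above.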
-
-**fix(observer): ghost key pruning**  
-`_prune_stale_world()` now removes entries from services.json whose service key matches the pattern `<12hexchars>_<name>`; these are artifacts of Docker's internal state tracking.
-
-**fix(node-agent): canonical container name**  
-`check_containers()` now uses the `com.docker.compose.service` label as the canonical name. Fallback: strip the hash prefix from `c.name`. Containers in the `created` state are skipped (Docker stale-state artifacts).
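
The naming rule and the ghost-key pattern can be sketched as follows. This is an illustration only: `canonical_service_name` and `is_ghost_key` are hypothetical helpers, and the regex assumes the prefix is exactly 12 lowercase hex characters followed by an underscore.

```python
import re

# Assumed shape of the ghost prefix: 12 lowercase hex chars + "_".
GHOST_PREFIX = re.compile(r"^[0-9a-f]{12}_")


def canonical_service_name(container_name, labels):
    """Prefer the compose service label; fall back to stripping the hash prefix."""
    label = labels.get("com.docker.compose.service")
    if label:
        return label
    return GHOST_PREFIX.sub("", container_name)


def is_ghost_key(service_key):
    """True for keys such as '0123456789ab_redis' that should be pruned."""
    return GHOST_PREFIX.match(service_key) is not None


print(canonical_service_name("0123456789ab_zigbee2mqtt", {}))
print(canonical_service_name("x", {"com.docker.compose.service": "node-agent"}))
print(is_ghost_key("0123456789ab_redis"), is_ghost_key("node-agent"))
```

Using the label first means the fallback regex only matters for containers started outside Compose.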
-
-**fix(node-agent): service_healthy emission**  
-Node-agent now emits `service_healthy` for every running managed container on each cycle. Without this, `services.json` was always empty and the supervisor generated a flood of "missing service" redeploys.
-
-**fix(supervisor): auto-cancel resolved actions**  
-`_cancel_resolved_pending_actions()` moves pending actions to `cancelled/` when:
- the service became healthy (`drift_resolved_auto`)
- the service was removed from desired state (`service_removed_from_desired_state`)
-
-**fix(supervisor): monitor:false**  
-The `monitor: false` field in `services.yaml` excludes a service from supervisor action generation. Used for `homeassistant` on chelsty-ha (no node-agent).
-
-**fix(agent-system/materializer): control-plane API as source**  
-The materializer on piha now fetches data from the VPS control-plane API (`CONTROL_PLANE_URL=http://100.95.58.48:18180`) instead of the local Redis. Redis held 80 stale entries. The Redis path is kept as a fallback.
-
-**fix(chelsty-infra/zigbee2mqtt): mosquitto networking**  
-Mosquitto runs with `network_mode: host`, so bridge containers cannot reach it via localhost. Solution: `extra_hosts: - "mosquitto:host-gateway"` in the z2m override. We do not use `network_mode: host` for z2m because it conflicts with `ports:` in docker-compose v1 (1.29.2 on chelsty-infra).
-
-**fix(chelsty-infra/zigbee2mqtt): writable config**  
-z2m migrates and overwrites `configuration.yaml` at startup. The config must live in the data directory: `/opt/homelab/data/zigbee2mqtt/data/configuration.yaml` (read-write mount), not in a separate `:ro` volume.
-
---
-
-## FINAL STATE
-
-| Node | Status | Services |
-|------|--------|---------|
-| vps | online | control-plane (4), node-agent, node_exporter, stability-agent |
-| piha | online | agent-system (4), node-agent, stability-agent, monitoring stack |
-| solaria | online | node-agent, stability-agent, AI workloads |
-| chelsty-infra | online | mosquitto, zigbee2mqtt (z2m connects once SLZB-06U is back online), node-agent, stability-agent |
-| chelsty-ha | n/a | homeassistant (monitor:false; no node-agent, HA monitored indirectly via MQTT) |
-
-**Action queue:** 0 pending, 0 approved, 0 running  
-**Incidents:** 0 active  
-**Ghost service keys:** 0  
-
---
-
-## KNOWN LIMITATIONS / TODO
-
- SLZB-06U (Zigbee coordinator) is offline: `192.168.1.105:6638` returns EHOSTUNREACH from chelsty-infra. Probably a hardware/network problem on the 192.168.1.0/24 side. z2m starts and serves an error page on :8080; it will connect automatically once the coordinator is back.
- The `ezsp` adapter in the z2m config is deprecated; migrating to `ember` is recommended. It needs no new configuration, only changing the field to `adapter: ember` in `configuration.yaml`.
- chelsty-ha has no node-agent. Add one when a machine is available, or via manual bootstrap.
- Redis on piha still holds old `homelab:nodes:*`, `homelab:incidents:*` etc. keys; the materializer no longer uses them in API mode, so they can be cleaned up.
--- a/docs/sessions/2026-06-08-lustro-onboarding.md
+++ b/docs/sessions/2026-06-08-lustro-onboarding.md
@ -1,100 +0,0 @@
-# Session 2026-06-08: LUSTRO onboarding (RPi4 / Magic Mirror / KEN)
-
-## Goal
-
-Build a reusable node onboarding tool, `scripts/onboard/` (idempotent bash,
-NOT Ansible, a deliberate decision), driven by a declarative manifest,
-`hosts/<node>/node.yaml`. First real node: LUSTRO.
-
-## Node LUSTRO (facts from preflight)
-
- RPi4, aarch64, Debian bookworm, hostname pimirror2, KEN network 192.168.31.x
- RAM 4 GB (MM uses ~1.7 Gi; same profile as the VPS with the OOM on 2026-06-01, so `mem_limit` is mandatory)
- disk 58 G / 48% (headroom)
- docker 29.5.3 already installed (step `20-install-docker` is unnecessary for this node)
- user `pi`: uid=1000, passwordless sudo (confirmed: `sudo -n true`=0), groups docker+ollama
- Magic Mirror = systemd unit `magicmirror.service` (Electron running as pi), **UNTOUCHED** throughout the session
- swap = 200 M file `/var/swap` on the SD card, to be migrated to zram (card wear)
- Tailscale: installed in this session, Running, IP 100.99.85.73
-
-## Decisions
-
- **user = existing `pi`** (we do NOT create `oskar`: `pi` already holds uid 1000, owns
-  MM, and has docker+sudo; node-agent docker `1000:1000` fits out of the box).
-  A deliberate deviation from the "oskar everywhere" convention.
- node-agent runtime = docker
- `first_contact` = LAN IP `pi@192.168.31.19` (mDNS `.local` proved unreliable:
-  transient resolve failures); after `tailscale up` the mesh takes over contact (`pi@lustro`)
- Tailscale auth = interactive login (URL), no authkey
- swap target = zram
-
-## Status: 00-access CLOSED
-
-Idempotent; passed a live run and a clean re-run. Lustro is in the mesh, and the SATURN→lustro
-channel over Tailscale works without a password. Verify is clean (arch=aarch64).
-
-## Tool bugs fixed in this session
-
-1. **dry-run was shallow** (orchestrator only) → `run()` helper + `DRY_RUN=1` propagation
-   to steps (`lib/common.sh`, `onboard.sh`, `remote.sh`, `00-access.sh`)
-2. **`yaml_get` fallback** (without `yq`):
-   - inline-comment stripping: `[[:space:]]+#.*$` after the value
-   - PRE-EXISTING greedy-colon bug: `.*:` cut through the last colon, losing the prefix
-     in `systemd:magicmirror.service`; fix: `^[[:space:]]*[^:]*:[[:space:]]*`
-3. **`00-access` verify**: the ssh known-hosts warning leaked into the parsed `arch`
-   (`WARN "Unexpected arch 'Warning:Permanently…'"`); fix: `-o LogLevel=ERROR`
-   + clean stdout (no `2>&1`)
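
The greedy-colon bug in item 2 is easy to reproduce. The sketch below restates the two patterns in Python's `re` syntax (`\s` instead of `[[:space:]]`) purely for illustration; the real helper is bash, and the key name `unit` and both function names are hypothetical.

```python
import re


def yaml_get_value_greedy(line):
    """Buggy variant: '.*:' is greedy and eats everything up to the LAST colon."""
    return re.sub(r"^.*:", "", line).strip()


def yaml_get_value(line):
    """Fixed variant: drop the key up to the FIRST colon, then any inline comment."""
    value = re.sub(r"^\s*[^:]*:\s*", "", line)
    value = re.sub(r"\s+#.*$", "", value)
    return value.strip()


line = "  unit: systemd:magicmirror.service  # Magic Mirror"

print(yaml_get_value_greedy("unit: systemd:magicmirror.service"))
print(yaml_get_value(line))
```

`[^:]*` cannot cross a colon, so the match stops at the key separator and the `systemd:` prefix survives.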
-
-## Branch / commits
-
-`feat/node-onboarding` (6 commits):
-
-| Hash | Description |
-|------|------|
-| `adb8407` | scaffold: onboard.sh, lib/, steps/00-preflight, hosts/lustro/node.yaml draft |
-| `9012a36` | 00-access.sh + node.yaml ssh_user/first_contact/hardware |
-| `931fd46` | dry-run propagation: run() helper, DRY_RUN=0/1 |
-| `eed0ad0` | yaml_get fix: inline-comment + greedy-colon |
-| `1bed855` | first_contact: IP instead of mDNS .local |
-| `471ba09` | verify fix: LogLevel=ERROR, clean stdout |
-
-## OPEN: for the next session (in order)
-
-1. **WORKTREE HYGIENE** (first thing): the whole session ran in the MAIN checkout, against
-   the "main = deploy-only" rule. Decision unresolved:
-   - (A) rename `feat/` → `task/node-onboarding` + worktree + main→master
-     (full compliance with `agent.sh`; merge=FF)
-   - (B) stay on `feat/` + manual `git merge --ff-only`
-
-   `agent.sh new` creates `task/<name>` from `master` and does NOT accept an existing branch.
-   `git worktree list` has not been read yet (the path pattern is needed).
-
-2. **base step**: migrate swap from the 200 M file to zram; `/opt/homelab` + `chown pi`
-   (uid 1000 already matches); event dir `/opt/homelab/events/lustro/`
-3. **node-agent step**: docker override, user 1000:1000 (pi=1000), `mem_limit: 256m`
-4. **register step**: observer/supervisor inventory + redis sub + UI panel agents.okit.pl
-5. **verify step (50)**: end-to-end smoke test (event reached the control plane, visible in the UI,
-   real Telegram alert path)
-6. **mm-watch**: health check `systemctl is-active magicmirror.service`
-7. **odds and ends**: the URL banner in 00-access has an alignment defect; `locale pl_PL`
-   is not generated on lustro (harmless)
-
-## Learnings
-
-(also reflected in `scripts/onboard/README.md`)
-
- mDNS `.local` is unreliable for automation → use an IP or tailscale for `first_contact`, not `.local`
- existing node with a uid=1000 user: use it instead of creating `oskar` (uid collision)
- swap on SD = wear → zram
- dry-run MUST propagate to step scripts (`run()` wrapper), otherwise it is useless
- the yaml fallback without `yq` must strip inline comments and not be greedy on `:`
-
-## Update: worktree hygiene
- feat/node-onboarding → task/node-onboarding. The main checkout (~/homelab-codex-ws) is back on master (deploy-only). Onboarding work lives in ~/homelab-codex-ws-node-onboarding.
- Origin: task/ pushed with tracking, feat/ deleted.
- MINOR: the worktree was created manually (git worktree add), so agent.sh list shows "(no marker)"/parent=?. It works; at the final `agent.sh merge node-onboarding`, verify that the missing marker is not a problem; otherwise add a marker (pattern: ha-piha) or do a manual `git merge --ff-only`.
- NEXT: base step (zram, /opt/homelab, event dir /opt/homelab/events/lustro/), from the node-onboarding worktree.
- Separate future project: parent-layout refactor (bare + worktree under one directory); requires rewriting agent.sh and protecting the dirty ha-piha.
-
-## Tech debt caught in this session
- OBSERVER STALENESS: a dead node (chelsty-infra) shows NOMINAL in agents.okit.pl. The observer/supervisor keeps the last known state and does not degrade when heartbeats stop (events: only the VPS reports fresh data; chelsty is silent yet its status is NOMINAL). FIX (remote, software): heartbeat TTL → once exceeded, mark `stale`/`down`. Important: a false NOMINAL undermines trust in the status of all nodes. Move to the main tech-debt backlog if a separate one exists.
--- a/docs/sessions/2026-06-09-flota-recovery-lustro-register.md
+++ b/docs/sessions/2026-06-09-flota-recovery-lustro-register.md
@ -1,124 +0,0 @@
-# Session 2026-06-09: fleet recovery + LUSTRO register
-
-## Goal
-
-Diagnose the silent fleet reporting outage; finish the REGISTER step for LUSTRO
-(40-register.sh + 50-verify.sh); update the node-onboarding skill.
-
---
-
-## MAIN: 8-day silent fleet reporting outage, RESOLVED
-
-### Root cause
-
-`oskar` (uid 1002) was **not in the aerbot group (1000)** on the VPS.
-`/opt/homelab/events/*` = `aerbot:aerbot 775` → `oskar` falls under "other" (r-x).
-The `rsync` push from every node (as `oskar` over SSH) got **Permission denied** on
-write → `--remove-source-files` never cleared the backlog → **292,000 files** accumulated
-in the node-agent staging caches.
-
-### Fix
-
-```bash
-usermod -aG 1000 oskar    # on the VPS; SSH re-login required
-```
-
-### Verification
-
- VPS `events/piha`: 3443 files (growing)
- `piha` locally: 2 files (staging cleared)
- agents.okit.pl panel: vps / piha / solaria show a fresh Last Seen
-
-### Diagnosis: 5 layers, 4 refuted hypotheses
-
-Verify-before-fix refuted, in order:
-1. `authorized_keys` missing: the key was present and SSH worked (piha→VPS manually OK)
-2. Remote agent down: `rsync` processes visible in `ps`, logs show no crash
-3. VPS IP change: Tailscale IP unchanged, 100.95.58.48
-4. Bridge/relay cutoff: ping VPS→piha OK over the mesh
-
-The 5th layer (the permission error) was found by running `rsync` manually as `oskar` on the VPS →
-`Permission denied (13)` → `stat /opt/homelab/events/` → `aerbot:aerbot 775`.
-
-### Why the outage was SILENT (3 masking layers)
-
-| Layer | Mechanism |
-|---------|-----------|
-| (a) shipping fail | Logged as `WARNING`, not a crash; node-agent did not fail, it stayed quiet |
-| (b) observer staleness | A stale node is shown as NOMINAL: no heartbeat TTL, the observer keeps the last known state |
-| (c) brain-watchdog | Blind to per-node freshness: it does not monitor event freshness per node |
-
-### Remaining minor error
-
-`rsync` exit code 23: `set-times` on the directory = `EPERM` (oskar does not own
-`/opt/homelab/events/`; `aerbot` does). Cosmetic: rsync works correctly.
-**Fix**: add `--omit-dir-times` to the rsync call in node-agent (added to the backlog).
-
---
-
-## LUSTRO register: state after the session
-
-### Done
-
- `40-register.sh`: written and committed on `task/node-onboarding`
-  - Idempotent: grep topology, `[[ -f services.yaml ]]`, `git diff --quiet`
-  - Commits only `inventory/topology.yaml` + `hosts/lustro/services.yaml` on the current branch
-  - NO `git push` (the merge belongs to the operator)
- `50-verify.sh`: written and committed
-  - 4 checks: node-agent running, events, observer restart + heartbeat poll, world/nodes.json
-  - Pass/fail table; exit 1 on failure
- `40-deploy-node-agent.sh`: scaffold removed (deploy lives in 30-node-agent.sh)
- Dry run of `40-register.sh --dry-run` passed cleanly
-
-### Observer activation mechanism (investigated)
-
-The observer bind-mounts the repo root as `/repo:ro` from `services/control-plane/docker-compose.yml`
-(`../..:/repo:ro` → `/home/oskar/homelab-codex-ws` on the VPS). `_load_inventory()` is called
-once at startup. **Activation after merge**: `git pull` on the VPS + `docker restart control-plane-observer`,
-with no redeploy.
-
-### lustro entry in topology.yaml (minimal, 1:1 with piha)
-
-```yaml
-  lustro:
-    roles:
-      - edge
-    services:
-      - node-agent
-```
-
-### PENDING (tomorrow)
-
-1. Commit B: `onboard.sh --node lustro --step 40-register` live → commit on the branch
-2. `agent.sh merge task/node-onboarding` → master
-3. `git pull` on the VPS + `docker restart control-plane-observer`
-4. `onboard.sh --node lustro --step 50-verify` → lustro visible in agents.okit.pl
-
---
-
-## fix-event-bloat (task/fix-event-bloat)
-
-Commit `d483274` on the branch: batch rsync, backlog trim, timeout 120s, backlog warn.
-**PENDING**: review + fleet deploy.
-
---
-
-## OOM ai-cluster (live observation)
-
-Observed on the VPS during the session: cgroup OOM restart loop, python workers ~195 MB,
-0 swap. **PENDING**: migrate `ai-cluster` → SOLARIA + add swap on the VPS.
-
---
-
-## Session gotcha
-
-**Worktree branch confusion**: `~/homelab-codex-ws-node-onboarding` had been switched
-manually to `task/fix-event-bloat` (one worktree, two branches switched by hand).
-This is an anti-pattern; always check `git branch --show-current` when entering a worktree.
-Target: a separate worktree per task.
-
---
-
-## Tech debt caught in this session
-
-→ recorded in `docs/backlog.md`
--- a/docs/sessions/2026-06-11-lustro-ssh-shipping.md
+++ b/docs/sessions/2026-06-11-lustro-ssh-shipping.md
@ -1,114 +0,0 @@
-# Session 2026-06-10/11: lustro SSH shipping fix + ha-diag-agent piha
-
-## Goal
-
-Fix event shipping lustro → VPS; close out the ha-diag-agent deploy config on piha;
-keep poison-quarantine (Codex) for a separate review.
-
---
-
-## MAIN: LUSTRO event shipping FIXED (merged `a5a1352`)
-
-### Root cause
-
-`_ship_events_to_vps()` (`services/node-agent/src/node_agent.py`) calls `ssh` **without `-i`**,
-so the key is looked up in `$HOME/.ssh` = `/home/homelab/.ssh` (the container has run as
-uid 1000 `homelab` since `user: "1000:1000"` was added to the base
-`services/node-agent/docker-compose.yml`). The lustro override mounted the key at `/root/.ssh`,
-a **blind mount** that ssh never looks at → `oskar@100.95.58.48: Permission denied`.
-
-### Fix
-
-`hosts/lustro/runtime/node-agent/docker-compose.override.yml`:
-
-```yaml
- /home/pi/.ssh:/home/homelab/.ssh:ro   # was: /root/.ssh (blind mount)
-```
-
-The `pi@pimirror2` key was added to `authorized_keys` for `oskar@VPS`.
-The uid match (pi=1000 = homelab=1000) satisfies OpenSSH's strict ownership check.
-
-### Verification
-
- 5 nodes NOMINAL in world state; lustro in `/opt/homelab/world/nodes.json` (online, fresh `last_seen`)
- 7600+ backlog events drained to the VPS (`/opt/homelab/events/lustro/`)
- Staging on lustro drained to zero (`--remove-source-files` works)
- "Permission denied" disappeared from the node-agent logs
-
-### Diagnosis: the verify-before-fix lesson
-
-Both agents (Claude Code, Codex) wrongly blamed the observer (poison event / race)
-based on **stale state** (`events=2` from a manual test). Verify-before-fix refuted
-both hypotheses: `events/lustro` on the VPS was empty → the problem was in the **delivery** layer
-(SSH key), not in the observer.
-
---
-
-## ha-diag-agent piha: deploy config merged (`5e9db5c`), deploy UNFINISHED
-
- `.env` created on piha: `/opt/homelab/config/ha-diag-agent/.env`, chmod 600
- **BUT token = PLACEHOLDER**: chelsty-ha is offline → no token and no connection
- Before `shadow_mode=false`: the supervisor's restart target = container name
-  `homeassistant5`; curl against the endpoint with the token must return HTTP 200
- Decision PENDING: HA target = chelsty-ha vs HA Ken (`homeassistant5` on piha;
-  from the container it is NOT `localhost`)
-
---
-
-## observer poison-quarantine (Codex)
-
-Kept on branch `task/observer-poison-quarantine` (`78c9e4a`), **NOT in master**.
-For a separate review: does the observer really hang on a malformed event?
-(Poison was NOT the cause of the lustro issue; the hypothesis is unverified.)
-Real bug → merge; otherwise → drop.
-
---
-
-## 🔴 FLEET BOMB: discovered, NOT fixed (backlog, BLOCKING)
-
-solaria / piha / chelsty are still **old root containers** of node-agent
-(piha Created 2026-05-27, uid 0). Their `/root/.ssh` mount works only because
-the containers predate `user: "1000:1000"`. The first `--force-recreate` / host
-reboot / image update will switch them to uid 1000 and shipping will fail as on lustro.
-**DO NOT RECREATE without the fix.** Details and fix: `docs/backlog.md`.
-
---
-
-## Tech debt caught in this session
-
-→ recorded in `docs/backlog.md` (fleet bomb, ha-diag-agent blocked,
-poison-quarantine review, `--omit-dir-times`, stale comment in node_agent.py,
-shipping success logged at `logger.debug`, lustro event bloat on the VPS).
-
-## Session 20:19
-
-### Commits
-fa59625 docs(ha-diag-agent): replace curl verify commands with docker exec
-d7e0d31 fix(ha-diag-agent): remove host port mapping for 8087
-
-### Files changed
- services/ha-diag-agent/DEPLOY.md          | 4 ++--
- services/ha-diag-agent/README.md          | 4 ++--
- services/ha-diag-agent/docker-compose.yml | 3 ---
- services/ha-diag-agent/service.yaml       | 3 ---
- 4 files changed, 4 insertions(+), 10 deletions(-)
-
-### Deploys
-None recorded
-
-### Narrative
-> _user-provided summary_
-
-## Session 20:35
-
-### Commits
-(no new ones; commits d7e0d31 and fa59625 from this session landed in master before this entry)
-
-### Files changed
-(no changes; see Session 20:19)
-
-### Deploys
-None recorded
-
-### Narrative
-> _user-provided summary_
--- a/docs/stability-agent-rollout.md
+++ b/docs/stability-agent-rollout.md
@ -1,62 +0,0 @@
-# Stability Agent Multi-Node Rollout
-
-## Architecture Summary
-The `stability-agent` is a lightweight Python service that monitors node health (disk, Docker containers, Tailscale, MQTT) and publishes state to a central Redis instance running on **PIHA**.
-
- **Source**: `services/stability-agent`
- **State Path**: `/opt/homelab/state`
- **Events Path**: `/opt/homelab/events`
- **Redis Target**: `100.108.208.3:6379` (PIHA)
-
-## Why UI only showed CHELSTY
-Previously, the `stability-agent` had `NODE_NAME` defaulted to `chelsty` and was only deployed there. The Agent System UI materializer on PIHA filters nodes based on the Redis keys `homelab:nodes:<NODE_NAME>`. Without other agents publishing their specific `NODE_NAME`, the UI remained limited to the single active node.
-
-## Deployment
-
-Use the helper script to deploy or generate commands. The script uses explicit Tailscale IPs for remote targets (piha, chelsty, vps) and runs locally for solaria.
-
-```bash
-# Print commands
-./scripts/deploy/deploy-stability-agent.sh <node-name>
-
-# Deploy via SSH (executes ssh oskar@<ip>)
-./scripts/deploy/deploy-stability-agent.sh <node-name> --ssh
-```
-
-### Manual Steps per Node
-The manual steps are encapsulated in `services/stability-agent/deploy-local.sh`. On the target node:
-```bash
-cd /home/oskar/homelab-codex-ws
-git fetch origin
-git checkout master
-git pull origin master
-cd services/stability-agent
-./deploy-local.sh <node-name>
-```
-
-## Verification
-
-### Fleet Overview
-Run the verification script from any node with `redis-cli` access:
-```bash
-./scripts/deploy/verify-agent-fleet.sh
-```
-
-### Redis Inspection (on PIHA)
-```bash
-docker exec agent-system-redis redis-cli KEYS 'homelab:nodes:*'
-docker exec agent-system-redis redis-cli HGETALL homelab:nodes:<node-name>
-```
-
-Verify Web UI backend:
-```bash
-curl -s http://127.0.0.1:18180/nodes
-curl -k https://agents.okit.pl/nodes
-```
-
-## Troubleshooting
-
- **Redis empty after compose down**: The `agent-system-redis` on PIHA uses transient storage if not configured with a volume. If it restarts, agents must republish their state (they do this automatically every `CHECK_INTERVAL`).
- **Secrets**: `.env` files and local secrets are not committed to the repo. Ensure `MQTT_HOST` and other specific secrets are set via overrides if needed.
- **Telegram**: Telegram bot notifications can remain disabled if `TELEGRAM_BOT_TOKEN` is absent.
- **Docker Socket**: If the agent reports `unavailable` for Docker, ensure `/var/run/docker.sock` is mounted and the user has permissions.
--- a/docs/standards.md
+++ b/docs/standards.md
@ -49,10 +49,9 @@ Runtime state must live outside the repository to keep it immutable and clean.
 ## Service Standards

 1.  **Normalization**: Every service MUST follow the `services/<service>/` layout.
-2.  **Metadata**: Every service MUST have a `service.yaml` defining its operational contract. This is the primary source of truth for AI agents.
-3.  **Healthchecks**: Every service MUST have a `healthcheck.sh` for verification. Agents use this to emit stability events.
-4.  **Actionability**: Any automated recovery action proposed by an agent must be backed by a `service.yaml` definition.
-5.  **Secrets**: NEVER commit secrets to Git. Use `env.example` as a template and populate `/opt/homelab/config/<service>/.env` on the host. Agents must treat these as "black box" configurations.
+2.  **Metadata**: Every service MUST have a `service.yaml` defining its operational contract.
+3.  **Healthchecks**: Every service MUST have a `healthcheck.sh` for verification.
+4.  **Secrets**: NEVER commit secrets to Git. Use `env.example` as a template and populate `/opt/homelab/config/<service>/.env` on the host.

 ## Docker Compose Standards

--- a/docs/vps-control-plane.md
+++ b/docs/vps-control-plane.md
@ -1,126 +0,0 @@
-# VPS Control Plane
-
-The VPS Control Plane is the orchestration brain of the homelab platform. It runs on the Hetzner VPS (Tailscale IP: `100.95.58.48`) and provides observability, automated reconciliation, and a web-based operator interface.
-
-## Architecture
-
-The control plane consists of four core services running as a Docker Compose stack under `services/control-plane/`:
-
-| Container | Role |
-|-----------|------|
-| `control-plane-observer` | Synthesizes world state from events in `/opt/homelab/events/` |
-| `control-plane-supervisor` | Detects drift between desired state (`hosts/*/services.yaml`) and actual state (`world/services.json`); writes pending actions |
-| `control-plane-executor` | Executes approved actions from `/opt/homelab/actions/approved/` |
-| `control-plane-ui` | Web interface for system monitoring and action approval; serves port 18180 |
-
-All services use **filesystem-first** semantics with `/opt/homelab/` as the data exchange layer. All four run with `network_mode: host` and as UID 1000 (`homelab` user).
-
-## Supervisor Behavior
-
-### Desired State
-Loaded from `hosts/*/services.yaml` each reconcile cycle. Services with `monitor: false` are silently skipped — use this for services without a node-agent (e.g. `homeassistant` on `chelsty-ha`).
-
-### Drift Types
- `missing_service` — service is in desired state but absent from `services.json`
- `unhealthy_service` — service exists in `services.json` but `status != healthy`
-
-### Action Types
-| Trigger | Action type | Risk |
-|---------|-------------|------|
-| `containers_not_running`, `mqtt_unreachable` | `container_restart` | low |
-| Any other / unknown | `redeploy` | guarded |
-| Node `disk_pressure: high` | `disk_cleanup` | guarded |
-
-### Action ID Stability
-Action IDs are deterministic: `redeploy-{node}-{service}` or `container-restart-{node}-{service}`. The same drift always produces the same filename, making reconcile truly idempotent across supervisor restarts.
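
A small sketch of how deterministic IDs make reconcile idempotent, assuming the ID is simply the action type (underscores replaced by hyphens) joined with node and service. `action_id` and `write_pending` are hypothetical helpers, and the in-memory dict stands in for the `pending/` directory.

```python
def action_id(action_type, node, service):
    """Build a stable ID such as 'container-restart-piha-redis'."""
    prefix = action_type.replace("_", "-")
    return f"{prefix}-{node}-{service}"


def write_pending(pending, action_type, node, service):
    """Store the action under its ID; re-detecting the same drift overwrites, never duplicates."""
    aid = action_id(action_type, node, service)
    pending[aid] = {"type": action_type, "node": node, "service": service}
    return aid


pending = {}  # stand-in for /opt/homelab/actions/pending/
for _ in range(3):  # three reconcile cycles seeing the same drift
    write_pending(pending, "redeploy", "vps", "outline")
write_pending(pending, "container_restart", "piha", "redis")

print(sorted(pending))
```

Because the key depends only on the drift, a supervisor restart produces the same filenames rather than a second copy of each action.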
-
-### Auto-Cancel
-Pending `redeploy` and `container_restart` actions are automatically moved to `cancelled/` when:
- **`drift_resolved_auto`** — the service becomes `healthy` in actual state
- **`service_removed_from_desired_state`** — the service was removed from `services.yaml` or marked `monitor: false`
-
-Only `pending` actions are auto-cancelled. Approved/running actions have been committed to by the operator and are never cancelled automatically.
-
-### Node Name Resolution
-The supervisor supports a `NODE_ALIAS_MAP` environment variable (JSON string) to map event/world-state node names to canonical topology names:
-
-```bash
-NODE_ALIAS_MAP='{"node-2": "chelsty-infra", "node-1": "piha"}'
-```
-
-## Deployment
-
-### From SATURN (primary control node)
-```bash
-# Full deploy via SSH
-./scripts/deploy/deploy-control-plane.sh --ssh
-
-# Or manually:
-ssh oskar@100.95.58.48 "cd ~/homelab-codex-ws && git pull origin master && cd services/control-plane && docker compose up -d --build --force-recreate"
-```
-
-### Direct on VPS
-```bash
-cd ~/homelab-codex-ws/services/control-plane
-docker compose up -d --build --force-recreate
-```
-
-`deploy-local.sh` also creates the required `/opt/homelab/` directory structure and sets ownership to UID 1000 (requires `sudo`). If directories already exist, skip to the `docker compose` step directly.
-
-### Verification
-```bash
-# On VPS
-docker ps --filter "name=control-plane"
-curl -s http://localhost:18180/summary | python3 -m json.tool
-```
-
-## Action Approval Workflow
-
-```
-Supervisor writes → /opt/homelab/actions/pending/<id>.json
-                 → Operator UI (port 18180) or Telegram Bot notifies
-                 → Operator clicks Approve
-                 → /opt/homelab/actions/approved/<id>.json
-                 → Executor executes → completed / failed
-```
-
-Possible action states: `pending → approved → running → completed / failed / rejected`  
-Auto-cancel path: `pending → cancelled/`
-
-## Recovery
-
-### World state is stale or corrupt
-```bash
-# On VPS — delete checkpoint to force full replay
-rm /opt/homelab/state/observer_checkpoint.json
-docker restart control-plane-observer
-```
-
-### Flood of pending actions after bootstrap
-Check if node-agent is running and emitting `service_healthy` events on each node. Without `service_healthy`, the supervisor sees all services as missing and queues redeployments every cycle.
-
-```bash
-# Check node-agent on each node
-ssh oskar@<node> "docker ps --filter name=node-agent && docker logs node-agent --tail 20"
-```
-
-### Rebuild from scratch
-```bash
-ssh oskar@100.95.58.48 "cd ~/homelab-codex-ws/services/control-plane && docker compose up -d --build --force-recreate"
-```
-
-## Integration
-
-### piha agent-system webui (port 18180 on piha)
-The `agent-system-runtime-materializer` on piha polls the VPS control-plane API every 10 seconds and mirrors world state to piha's local `/opt/homelab/world/`. This ensures the **"Copy for AI"** button in the piha webui (`agent-system-webui`) reflects the same clean state as the VPS API.
-
-Override: `hosts/piha/runtime/agent-system/docker-compose.override.yml` — sets `CONTROL_PLANE_URL=http://100.95.58.48:18180`.
-
-### Nginx Proxy Manager
-The operator UI at port 18180 can be proxied via NPM for external access. No WebSocket support required.
-
-### Log Locations
- Container logs: `docker compose logs -f` (from `services/control-plane/`)
- Runtime events: `/opt/homelab/events/YYYY-MM-DD/`
- World state: `/opt/homelab/world/`
- Action queue: `/opt/homelab/actions/{pending,approved,running,completed,failed,cancelled}/`
--- a/hosts/chelsty-ha/capabilities.yaml
+++ b/hosts/chelsty-ha/capabilities.yaml
@ -1,24 +0,0 @@
-host: chelsty-ha
-site: chelsty
-
-capabilities:
-  networking:
-    reachability: tailscale-only
-    tailscale_ip: 100.122.201.23
-    ingress_suitability: false
-    bandwidth: LTE
-
-  runtime:
-    container_engine: docker
-    os: debian
-
-  operational:
-    connectivity: intermittent
-    availability_target: best-effort
-    offline_first: true
-    uplink: lte
-
-  deployment:
-    suitability:
-      - homeassistant
-    restricted: false
--- a/hosts/chelsty-ha/host.yaml
+++ b/hosts/chelsty-ha/host.yaml
@ -1,20 +0,0 @@
-hostname: chelsty-ha
-site: chelsty
-
-roles:
-  - homeassistant
-
-network:
-  tailscale_ip: 100.122.201.23
-
-runtime:
-  root: /opt/homelab
-
-deployment:
-  mode: pull
-  managed_by: saturn
-
-  constraints:
-    connectivity:
-      intermittent: true
-      uplink: lte
--- a/hosts/chelsty-ha/services.yaml
+++ b/hosts/chelsty-ha/services.yaml
@ -1,12 +0,0 @@
-host: chelsty-ha
-site: chelsty
-
-services:
-  homeassistant:
-    role: home-automation-controller
-    offline_required: true
-    # monitor: false — chelsty-ha has no node-agent deployed, so there are no
-    # container-health events for the observer to track. HA is monitored
-    # indirectly via the chelsty-infra MQTT broker (if MQTT goes silent, HA
-    # is likely down). Re-enable once node-agent is bootstrapped on this VM.
-    monitor: false
--- a/hosts/chelsty-infra/runtime/frigate/config.yml
+++ b/hosts/chelsty-infra/runtime/frigate/config.yml
@ -1,88 +0,0 @@
-# Frigate NVR — chelsty-infra
-# Hardware decode: Intel UHD 630 via VAAPI (/dev/dri/renderD128)
-# Object detection: CPU (no Coral TPU)
-# Cameras: 2x Reolink RLC-540 (5MP, WiFi)
-#
-# Required env vars in /opt/homelab/config/frigate/frigate.env:
-#   CAMERA1_IP, CAMERA1_USER, CAMERA1_PASS
-#   CAMERA2_IP, CAMERA2_USER, CAMERA2_PASS
-#   MQTT_USER, MQTT_PASS  (if mosquitto auth is enabled)
-
-mqtt:
-  enabled: true
-  host: 127.0.0.1
-  port: 1883
-  # user: "{MQTT_USER}"
-  # password: "{MQTT_PASS}"
-
-detectors:
-  cpu1:
-    type: cpu
-    num_threads: 3
-
-ffmpeg:
-  hwaccel_args: preset-vaapi
-  global_args:
-    - -hide_banner
-    - -loglevel
-    - warning
-
-record:
-  enabled: true
-  retain:
-    days: 7
-    mode: all
-  events:
-    retain:
-      default: 14
-      mode: motion
-
-snapshots:
-  enabled: true
-  retain:
-    default: 7
-  quality: 70
-
-objects:
-  track:
-    - person
-    - car
-    - bicycle
-  filters:
-    person:
-      min_area: 5000
-      max_area: 100000
-      threshold: 0.7
-
-cameras:
-  camera1:
-    ffmpeg:
-      inputs:
-        # Main stream — high-res recording
-        - path: rtsp://{CAMERA1_USER}:{CAMERA1_PASS}@{CAMERA1_IP}:554/h264Preview_01_main
-          roles:
-            - record
-        # Sub stream — low-res detection (lower CPU cost)
-        - path: rtsp://{CAMERA1_USER}:{CAMERA1_PASS}@{CAMERA1_IP}:554/h264Preview_01_sub
-          roles:
-            - detect
-    detect:
-      enabled: true
-      width: 640
-      height: 480
-      fps: 5
-
-  camera2:
-    ffmpeg:
-      inputs:
-        - path: rtsp://{CAMERA2_USER}:{CAMERA2_PASS}@{CAMERA2_IP}:554/h264Preview_01_main
-          roles:
-            - record
-        - path: rtsp://{CAMERA2_USER}:{CAMERA2_PASS}@{CAMERA2_IP}:554/h264Preview_01_sub
-          roles:
-            - detect
-    detect:
-      enabled: true
-      width: 640
-      height: 480
-      fps: 5
--- a/hosts/chelsty-infra/runtime/frigate/docker-compose.yml
+++ b/hosts/chelsty-infra/runtime/frigate/docker-compose.yml
@ -1,25 +0,0 @@
-services:
-  frigate:
-    container_name: frigate
-    image: ghcr.io/blakeblackshear/frigate:stable
-    restart: unless-stopped
-    privileged: true
-    shm_size: "256mb"
-    network_mode: host
-    devices:
-      - /dev/dri/renderD128:/dev/dri/renderD128
-    volumes:
-      - /etc/localtime:/etc/localtime:ro
-      - /opt/homelab/config/frigate/config.yml:/config/config.yml
-      - /opt/homelab/config/frigate:/config/credentials:ro
-      - /opt/homelab/data/frigate:/media/frigate
-    tmpfs:
-      - /tmp/cache
-    env_file:
-      - /opt/homelab/config/frigate/frigate.env
-    healthcheck:
-      test: ["CMD-SHELL", "wget -q --spider http://localhost:5000/api/version 2>&1 || exit 1"]
-      interval: 30s
-      timeout: 10s
-      retries: 3
-      start_period: 60s
--- a/hosts/chelsty-infra/runtime/node-agent/docker-compose.override.yml
+++ b/hosts/chelsty-infra/runtime/node-agent/docker-compose.override.yml
@ -1,11 +0,0 @@
-services:
-  node-agent:
-    environment:
-      - NODE_NAME=chelsty-infra
-      - NODE_TYPE=lte_node
-      - VPS_EVENTS_HOST=100.95.58.48
-      - VPS_EVENTS_USER=oskar
-      - VPS_EVENTS_PATH=/opt/homelab/events
-      - CHECK_INTERVAL=60
-    volumes:
-      - /home/oskar/.ssh:/home/homelab/.ssh:ro
--- a/hosts/chelsty-infra/runtime/zigbee2mqtt/docker-compose.override.yml
+++ b/hosts/chelsty-infra/runtime/zigbee2mqtt/docker-compose.override.yml
@ -1,21 +0,0 @@
-services:
-  zigbee2mqtt:
-    # mosquitto runs with network_mode: host on chelsty-infra.
-    # extra_hosts maps the 'mosquitto' hostname to the host gateway IP so that
-    # mqtt://mosquitto:1883 in configuration.yaml reaches the host-networked
-    # mosquitto process. Requires Docker 20.10+ (present on chelsty-infra).
-    extra_hosts:
-      - "mosquitto:host-gateway"
-    environment:
-      - TZ=Europe/Warsaw
-    healthcheck:
-      test: ["CMD-SHELL", "wget -qO- http://localhost:8080 > /dev/null 2>&1 || exit 1"]
-      interval: 30s
-      timeout: 10s
-      retries: 3
-      start_period: 90s
-    # Note: volumes NOT overridden here.
-    # The base docker-compose.yml mounts /opt/homelab/data/zigbee2mqtt/data:/app/data
-    # (read-write). configuration.yaml must be placed in that directory on the node:
-    #   /opt/homelab/data/zigbee2mqtt/data/configuration.yaml
-    # z2m rewrites this file during migrations — read-only mount is not viable.
--- a/hosts/chelsty-infra/services.yaml
+++ b/hosts/chelsty-infra/services.yaml
@ -1,37 +0,0 @@
-host: chelsty-infra
-site: chelsty
-
-services:
-  ha-diag-agent:
-    role: ha-diagnostic-agent
-    deployment_model: docker-compose
-    exposure: local-only
-    offline_required: false
-    depends_on:
-      local: []
-      external: [homeassistant]
-    config:
-      target_url: http://100.70.180.90:8123  # chelsty-ha via Tailscale (HAOS, separate VM)
-      location_tag: "chelsty"
-      events_dir: /opt/homelab/events/chelsty-infra
-    runtime:
-      config_path: /opt/homelab/config/ha-diag-agent
-      data_path: /var/lib/ha-diag-agent
-
-  node-agent:
-    role: node-stability-monitor
-    # LTE node: node-agent monitors and emits events but does NO Docker cleanup.
-    # Disk pressure on chelsty-infra is typically Frigate recordings; Frigate's
-    # own retain policy is the correct remediation, not docker prune.
-    deployment_model: docker-compose
-    exposure: local-only
-    offline_required: true
-
-  mosquitto:
-    role: local-mqtt-broker
-
-  zigbee2mqtt:
-    role: zigbee-mqtt-bridge
-
-  frigate:
-    role: nvr
--- a/hosts/chelsty-infra/capabilities.yaml
+++ b/hosts/chelsty-infra/capabilities.yaml
@ -1,6 +1,3 @@
-host: chelsty-infra
-site: chelsty
-
 capabilities:
  hardware:
    cpu:
@ -34,11 +31,10 @@ capabilities:
    power_constraint: low-power
    connectivity: intermittent
    availability_target: best-effort
-    offline_operation_required: true
  
  deployment:
    suitability:
      - staging
-      - infra
+      - homeassistant
      - edge
    restricted: false
--- a/hosts/chelsty-infra/host.yaml
+++ b/hosts/chelsty-infra/host.yaml
@ -1,10 +1,9 @@
-hostname: chelsty-infra
-site: chelsty
+hostname: chelsty

 roles:
  - edge
  - hypervisor
-  - infra
+  - homeassistant
  - staging

 network:
--- a/hosts/chelsty-infra/networking.yaml
+++ b/hosts/chelsty-infra/networking.yaml
@ -1,4 +1,4 @@
-host: chelsty-infra
+host: chelsty

 uplink:
  type: lte
@ -20,7 +20,7 @@ exposure_classes:

 networks:
  home_automation_lan:
-    purpose: MQTT broker, Zigbee coordinator, and local device control.
+    purpose: Home Assistant, MQTT, Zigbee coordinator, and local device control.
    offline_required: true
    internet_required_for_core_operation: false

--- a/hosts/chelsty-infra/paths.yaml
+++ b/hosts/chelsty-infra/paths.yaml
@ -1,4 +1,4 @@
-host: chelsty-infra
+host: chelsty

 runtime_root: /opt/homelab

@ -9,6 +9,12 @@ conventions:
  logs: /opt/homelab/logs

 services:
+  homeassistant:
+    data: /opt/homelab/data/homeassistant
+    config: /opt/homelab/config/homeassistant
+    logs: /opt/homelab/logs/homeassistant
+    backup_priority: critical
+
  zigbee2mqtt:
    data: /opt/homelab/data/zigbee2mqtt
    config: /opt/homelab/config/zigbee2mqtt
@ -21,13 +27,13 @@ services:
    logs: /opt/homelab/logs/mosquitto
    backup_priority: high

-  stability-agent:
-    data: /opt/homelab/state
-    config: /opt/homelab/config/stability-agent
-    logs: /opt/homelab/events
-    backup_priority: low
-
 backup_sets:
+  homeassistant:
+    include:
+      - /opt/homelab/config/homeassistant
+      - /opt/homelab/data/homeassistant
+    restore_note: Restore before starting the Home Assistant container.
+
  zigbee2mqtt:
    include:
      - /opt/homelab/config/zigbee2mqtt
--- a/hosts/chelsty-infra/runtime/mosquitto/docker-compose.override.yml
+++ b/hosts/chelsty-infra/runtime/mosquitto/docker-compose.override.yml
--- a/hosts/chelsty-infra/runtime/mosquitto/mosquitto.conf
+++ b/hosts/chelsty-infra/runtime/mosquitto/mosquitto.conf
--- a/hosts/chelsty-infra/runtime/stability-agent/docker-compose.override.yml
+++ b/hosts/chelsty-infra/runtime/stability-agent/docker-compose.override.yml
@ -1,11 +1,6 @@
 services:
  stability-agent:
    environment:
-      - NODE_NAME=chelsty-infra
-      - SITE_NAME=chelsty
-      - REDIS_HOST=100.108.208.3
-      - REDIS_PORT=6379
-      - REDIS_ENABLED=true
      - STABILITY_CHECK_INTERVAL=60
      - DISK_THRESHOLD_PCT=85
      - MQTT_HOST=mosquitto
--- a/hosts/chelsty-infra/runtime/zigbee2mqtt/configuration.yaml
+++ b/hosts/chelsty-infra/runtime/zigbee2mqtt/configuration.yaml
--- a/hosts/chelsty/runtime/zigbee2mqtt/docker-compose.override.yml
+++ b/hosts/chelsty/runtime/zigbee2mqtt/docker-compose.override.yml
@ -0,0 +1,13 @@
+services:
+  zigbee2mqtt:
+    volumes:
+      - ./configuration.yaml:/app/data/configuration.yaml:ro
+    environment:
+      - MQTT_USER=${MQTT_USER}
+      - MQTT_PASSWORD=${MQTT_PASSWORD}
+    # Healthcheck is already defined in base service, but we ensure compatibility
+    healthcheck:
+      test: ["CMD", "curl", "-f", "http://localhost:8080"]
+      interval: 10s
+      timeout: 5s
+      retries: 3
--- a/hosts/chelsty/services.yaml
+++ b/hosts/chelsty/services.yaml
@ -0,0 +1,126 @@
+host: chelsty
+
+exposure_classes:
+  local-only:
+    description: Reachable only from CHELSTY-local networks or container networks.
+    public_ingress: false
+    tailscale_required: false
+  tailscale-internal:
+    description: Reachable through the Tailscale mesh by approved tailnet clients.
+    public_ingress: false
+    tailscale_required: true
+  public:
+    description: Reachable from the public internet through an explicit ingress path.
+    public_ingress: true
+    tailscale_required: false
+
+operational_constraints:
+  uplink: lte
+  connectivity: intermittent
+  offline_operation_required: true
+  must_not_depend_on:
+    - saturn
+    - vps
+    - forgejo
+
+services:
+  homeassistant:
+    role: home-automation-controller
+    deployment_model: docker-compose
+    exposure: tailscale-internal
+    offline_required: true
+    depends_on:
+      local:
+        - mosquitto
+        - zigbee2mqtt
+      external: []
+    ports:
+      - name: http
+        container_port: 8123
+        protocol: tcp
+    runtime:
+      config_path: /opt/homelab/config/homeassistant
+      data_path: /opt/homelab/data/homeassistant
+      logs_path: /opt/homelab/logs/homeassistant
+    backup:
+      recommended: true
+      include:
+        - /opt/homelab/config/homeassistant
+        - /opt/homelab/data/homeassistant
+      notes:
+        - Back up before Home Assistant core, supervisor-equivalent, or integration upgrades.
+        - Keep local restore copies on CHELSTY because LTE connectivity may be unavailable during recovery.
+
+  zigbee2mqtt:
+    role: zigbee-mqtt-bridge
+    deployment_model: docker-compose
+    exposure: local-only
+    offline_required: true
+    depends_on:
+      local:
+        - mosquitto
+      external:
+        - slzb-06u
+    coordinator:
+      name: slzb-06u
+      connection: network
+      usb_device: null
+    ports:
+      - name: frontend
+        container_port: 8080
+        protocol: tcp
+        exposure: tailscale-internal
+    runtime:
+      config_path: /opt/homelab/config/zigbee2mqtt
+      data_path: /opt/homelab/data/zigbee2mqtt
+      logs_path: /opt/homelab/logs/zigbee2mqtt
+    backup:
+      recommended: true
+      include:
+        - /opt/homelab/config/zigbee2mqtt
+        - /opt/homelab/data/zigbee2mqtt
+      notes:
+        - Include configuration.yaml, database.db, coordinator backup files, and network key material.
+        - Restore Zigbee2MQTT state together with the SLZB-06U coordinator state when replacing hardware.
+
+  mosquitto:
+    role: local-mqtt-broker
+    deployment_model: docker-compose
+    exposure: local-only
+    offline_required: true
+    depends_on:
+      local: []
+      external: []
+    ports:
+      - name: mqtt
+        container_port: 1883
+        protocol: tcp
+    runtime:
+      config_path: /opt/homelab/config/mosquitto
+      data_path: /opt/homelab/data/mosquitto
+      logs_path: /opt/homelab/logs/mosquitto
+    backup:
+      recommended: true
+      include:
+        - /opt/homelab/config/mosquitto
+        - /opt/homelab/data/mosquitto
+      notes:
+        - Retain ACL, password, persistence, and bridge configuration if enabled.
+
+  stability-agent:
+    role: node-stability-monitor
+    deployment_model: docker-compose
+    exposure: local-only
+    offline_required: true
+    depends_on:
+      local:
+        - mosquitto
+      external: []
+    runtime:
+      config_path: null
+      data_path: /opt/homelab/state
+      logs_path: /opt/homelab/events
+    backup:
+      recommended: false
+      notes:
+        - Events and state are transient or can be reconstructed; high-frequency writes.
--- a/hosts/lustro/node.yaml
+++ b/hosts/lustro/node.yaml
@ -1,32 +0,0 @@
-# hosts/lustro/node.yaml — LUSTRO edge node manifest
-# First-contact bootstrap: scripts/onboard/onboard.sh --node lustro --step 00-access
-# Full onboarding:          scripts/onboard/onboard.sh --node lustro
-
-name: LUSTRO
-role: edge
-location: KEN
-
-ssh_user: pi
-first_contact: pi@192.168.31.19   # LAN IP at KEN; mDNS .local is unreliable; the mesh takes over after tailscale up
-
-tailscale:
-  hostname: lustro
-  # ip: TODO — fill after tailscale join (step 30-install-tailscale)
-
-deploy_autonomy: true   # onboard.sh may run mutating steps autonomously
-git_control: false       # node does NOT pull from Forgejo; push-based via SATURN
-
-hardware:
-  arch: arm64
-  ram_mb: 4096
-  swap:
-    kind: zram
-    mb: 2048
-  docker_present: true
-  mm_runtime: systemd:magicmirror.service
-
-services:
-  node-agent:
-    runtime:
-      engine: docker
-      mem_limit: 256m
--- a/hosts/lustro/runtime/node-agent/docker-compose.override.yml
+++ b/hosts/lustro/runtime/node-agent/docker-compose.override.yml
@ -1,23 +0,0 @@
-services:
-  node-agent:
-    # Docker GID on LUSTRO is 991 (not the Debian default 999).
-    # Compose concatenates group_add lists; 991 is what gives socket access here.
-    group_add:
-      - "991"
-    mem_limit: 256m   # RPi4 4 GiB; MagicMirror consumes ~1.9 GiB — agent must be bounded
-    environment:
-      - NODE_NAME=lustro
-      - NODE_TYPE=sd_card
-      - VPS_EVENTS_HOST=100.95.58.48
-      - VPS_EVENTS_USER=oskar
-      - VPS_EVENTS_PATH=/opt/homelab/events
-      - CHECK_INTERVAL=60
-    volumes:
-      # pi's SSH key for rsync event shipping to VPS (push-based node, no repo
-      # checkout). Container runs as uid 1000 (homelab, HOME=/home/homelab) per
-      # the base compose — ssh has no -i flag, so the key must land in
-      # /home/homelab/.ssh, NOT /root/.ssh. uid match (pi=1000) satisfies
-      # OpenSSH strict ownership checks on the mounted key.
-      - /home/pi/.ssh:/home/homelab/.ssh:ro
-      # Override ../.. from the base compose to the pushed deploy dir (no repo on node)
-      - /opt/homelab/deploy/node-agent:/repo:ro
--- a/hosts/lustro/services.yaml
+++ b/hosts/lustro/services.yaml
@ -1,15 +0,0 @@
-host: lustro
-
-services:
-  node-agent:
-    role: node-stability-monitor
-    deployment_model: docker-compose
-    exposure: local-only
-    offline_required: true
-    depends_on:
-      local: []
-      external: []
-    runtime:
-      config_path: /opt/homelab/config/node-agent
-      data_path: /opt/homelab/state
-      logs_path: /opt/homelab/events
--- a/hosts/piha/runtime/agent-system/docker-compose.override.yml
+++ b/hosts/piha/runtime/agent-system/docker-compose.override.yml
@ -1,8 +0,0 @@
-services:
-  runtime-materializer:
-    environment:
-      # Pull world state from the VPS control-plane API instead of local Redis.
-      # The observer on VPS is the authoritative writer; mirroring its API output
-      # here ensures the webui /snapshot matches the clean 97-service state that
-      # the control-plane /summary endpoint serves.
-      CONTROL_PLANE_URL: "http://100.95.58.48:18180"
--- a/hosts/piha/runtime/brain-watchdog/docker-compose.override.yml
+++ b/hosts/piha/runtime/brain-watchdog/docker-compose.override.yml
@ -1,4 +0,0 @@
-services:
-  brain-watchdog:
-    mem_limit: 64m
-    restart: unless-stopped
--- a/hosts/piha/runtime/ha-diag-agent/docker-compose.override.yml
+++ b/hosts/piha/runtime/ha-diag-agent/docker-compose.override.yml
@ -1,12 +0,0 @@
-services:
-  ha-diag-agent:
-    environment:
-      - NODE_NAME=piha
-    # Pin events to the piha-specific subdirectory; overrides the ${NODE_NAME}
-    # variable substitution in the base compose file which requires a shell env var.
-    volumes:
-      - /opt/homelab/events/piha:/events
-      - /var/lib/ha-diag-agent:/data
-      - /opt/homelab/config/ha-diag-agent:/config:ro
-    mem_limit: 128m
-    restart: unless-stopped
--- a/hosts/piha/runtime/node-agent/docker-compose.override.yml
+++ b/hosts/piha/runtime/node-agent/docker-compose.override.yml
@ -1,11 +0,0 @@
-services:
-  node-agent:
-    environment:
-      - NODE_NAME=piha
-      - NODE_TYPE=sd_card
-      - VPS_EVENTS_HOST=100.95.58.48
-      - VPS_EVENTS_USER=oskar
-      - VPS_EVENTS_PATH=/opt/homelab/events
-      - CHECK_INTERVAL=60
-    volumes:
-      - /home/oskar/.ssh:/home/homelab/.ssh:ro
--- a/hosts/piha/runtime/stability-agent/docker-compose.override.yml
+++ b/hosts/piha/runtime/stability-agent/docker-compose.override.yml
@ -1,7 +0,0 @@
-services:
-  stability-agent:
-    environment:
-      - NODE_NAME=piha
-      - REDIS_HOST=100.108.208.3
-      - REDIS_PORT=6379
-      - REDIS_ENABLED=true
--- a/hosts/piha/services.yaml
+++ b/hosts/piha/services.yaml
@ -1,42 +0,0 @@
-host: piha
-
-services:
-  ha-diag-agent:
-    role: ha-diagnostic-agent
-    deployment_model: docker-compose
-    exposure: local-only
-    offline_required: false
-    depends_on:
-      local: []
-      external: [homeassistant]
-    config:
-      target_url: http://localhost:8123
-      location_tag: "ken"
-      events_dir: /opt/homelab/events/piha
-    runtime:
-      config_path: /opt/homelab/config/ha-diag-agent
-      data_path: /var/lib/ha-diag-agent
-
-  node-agent:
-    role: node-stability-monitor
-    deployment_model: docker-compose
-    exposure: local-only
-    offline_required: true
-    depends_on:
-      local: []
-      external: []
-    runtime:
-      config_path: /opt/homelab/config/node-agent
-      data_path: /opt/homelab/state
-      logs_path: /opt/homelab/events
-
-  brain-watchdog:
-    role: control-plane-watchdog
-    deployment_model: docker-compose
-    exposure: private
-    offline_required: false
-    depends_on:
-      local: []
-      external: [control-plane]
-    runtime:
-      config_path: /opt/homelab/config/brain-watchdog
--- a/hosts/solaria/runtime/node-agent/docker-compose.override.yml
+++ b/hosts/solaria/runtime/node-agent/docker-compose.override.yml
@ -1,11 +0,0 @@
-services:
-  node-agent:
-    environment:
-      - NODE_NAME=solaria
-      - NODE_TYPE=ai_node
-      - VPS_EVENTS_HOST=100.95.58.48
-      - VPS_EVENTS_USER=oskar
-      - VPS_EVENTS_PATH=/opt/homelab/events
-      - CHECK_INTERVAL=60
-    volumes:
-      - /home/oskar/.ssh:/home/homelab/.ssh:ro
--- a/hosts/solaria/runtime/stability-agent/docker-compose.override.yml
+++ b/hosts/solaria/runtime/stability-agent/docker-compose.override.yml
@ -1,7 +0,0 @@
-services:
-  stability-agent:
-    environment:
-      - NODE_NAME=solaria
-      - REDIS_HOST=100.108.208.3
-      - REDIS_PORT=6379
-      - REDIS_ENABLED=true
--- a/hosts/solaria/services.yaml
+++ b/hosts/solaria/services.yaml
@ -1,15 +0,0 @@
-host: solaria
-
-services:
-  node-agent:
-    role: node-stability-monitor
-    deployment_model: docker-compose
-    exposure: local-only
-    offline_required: true
-    depends_on:
-      local: []
-      external: []
-    runtime:
-      config_path: /opt/homelab/config/node-agent
-      data_path: /opt/homelab/state
-      logs_path: /opt/homelab/events
--- a/hosts/vps/runtime/control-plane/docker-compose.override.yml
+++ b/hosts/vps/runtime/control-plane/docker-compose.override.yml
@ -1,39 +0,0 @@
-# Control-plane production overrides for the VPS deployment.
-#
-# NODE_ALIAS_MAP translates the node names that appear in raw event files
-# (written by node agents / seed scripts) to the canonical names used in
-# inventory/topology.yaml and hosts/*/services.yaml.
-#
-# Current live mapping (from /opt/homelab/events/ inspection):
-#   node-2  →  chelsty   (zigbee2mqtt / mosquitto / homeassistant node)
-#
-# Add further entries when new nodes come online and their event-source names
-# differ from their topology names.  Format is a single-line JSON object, e.g.:
-#   NODE_ALIAS_MAP='{"node-2":"chelsty","node-3":"piha"}'
-#
-# The executor inherits the canonical name from the action JSON written by the
-# supervisor, so NODE_ALIAS_MAP is only required on the supervisor service.
-#
-# Memory limits: VPS has 4 GiB RAM, no swap. oom_score_adj -900 ensures the
-# host kernel OOM-killer never targets control-plane containers. mem_limit
-# provides a per-container cgroup ceiling so a leaking process is restarted by
-# Docker before it can exhaust host memory.
-
-services:
-  operator-ui:
-    mem_limit: 192m
-    oom_score_adj: -900
-
-  observer:
-    mem_limit: 192m
-    oom_score_adj: -900
-
-  supervisor:
-    mem_limit: 400m
-    oom_score_adj: -900
-    environment:
-      - NODE_ALIAS_MAP={"node-2":"chelsty"}
-
-  executor:
-    mem_limit: 64m
-    oom_score_adj: -900
--- a/hosts/vps/runtime/control-plane/env.example
+++ b/hosts/vps/runtime/control-plane/env.example
@ -1,7 +0,0 @@
-# Control Plane Environment Variables
-PORT=8080
-HOMELAB_STATE_ROOT=/opt/homelab/state
-HOMELAB_EVENTS_ROOT=/opt/homelab/events
-HOMELAB_WORLD_ROOT=/opt/homelab/world
-HOMELAB_ACTIONS_ROOT=/opt/homelab/actions
-HOMELAB_CONFIG_ROOT=/opt/homelab/config
--- a/hosts/vps/runtime/node-agent/docker-compose.override.yml
+++ b/hosts/vps/runtime/node-agent/docker-compose.override.yml
@ -1,16 +0,0 @@
-services:
-  node-agent:
-    environment:
-      - NODE_NAME=vps
-      - CHECK_INTERVAL=60
-    # host network mode: node-agent on VPS shares the host's network namespace
-    # so that localhost:18180 resolves to the control-plane's exposed port.
-    # Without this, localhost inside the container is the container's own loopback
-    # and the _check_control_plane_health() probe would always fail.
-    network_mode: host
-    # HARD memory ceiling: node-agent mounts /opt/homelab/events/ (page cache)
-    # and may accumulate Python RSS over hours; 640m cap ensures it is killed and
-    # auto-restarted by Docker before consuming host memory. oom_score_adj -900
-    # prevents the host kernel OOM-killer from picking it as a global victim.
-    mem_limit: 640m
-    oom_score_adj: -900
--- a/hosts/vps/runtime/stability-agent/docker-compose.override.yml
+++ b/hosts/vps/runtime/stability-agent/docker-compose.override.yml
@ -1,9 +0,0 @@
-services:
-  stability-agent:
-    environment:
-      - NODE_NAME=vps
-      - REDIS_HOST=100.108.208.3
-      - REDIS_PORT=6379
-      - REDIS_ENABLED=true
-    mem_limit: 96m
-    oom_score_adj: -900
--- a/hosts/vps/services.txt
+++ b/hosts/vps/services.txt
@ -0,0 +1 @@
+npm
--- a/hosts/vps/services.yaml
+++ b/hosts/vps/services.yaml
@ -1,43 +0,0 @@
-host: vps
-
-services:
-  node-agent:
-    role: node-stability-monitor
-    deployment_model: docker-compose
-    exposure: local-only
-    offline_required: true
-    depends_on:
-      local: []
-      external: []
-    runtime:
-      config_path: /opt/homelab/config/node-agent
-      data_path: /opt/homelab/state
-      logs_path: /opt/homelab/events
-
-  control-plane:
-    role: management-and-orchestration
-    deployment_model: docker-compose
-    exposure: tailscale-internal
-    offline_required: false
-    depends_on:
-      local:
-        - node-agent
-      external:
-        - piha:redis
-    ports:
-      - name: http
-        container_port: 18180
-        protocol: tcp
-    runtime:
-      config_path: /opt/homelab/config/control-plane
-      data_path: /opt/homelab/data/control-plane
-      logs_path: /opt/homelab/logs/control-plane
-
-  node_exporter:
-    role: metrics-exporter
-    deployment_model: docker-compose
-    exposure: local-only
-    offline_required: true
-    depends_on:
-      local: []
-      external: []
--- a/inventory/topology.yaml
+++ b/inventory/topology.yaml
@ -17,10 +17,6 @@ nodes:
    roles:
      - infra
      - monitoring
-    services:
-      - node-agent
-      - ha-diag-agent
-      - brain-watchdog

  solaria:
    roles:
@ -31,25 +27,12 @@ nodes:
    roles:
      - edge
      - ingress
-      - control-plane
-    services:
-      # Repo-managed GitOps services (hosts/vps/services.yaml is authoritative)
-      - node-agent
-      - control-plane       # executor, observer, supervisor, operator-ui
-      - node_exporter
-      - stability-agent
-      - npm                 # Nginx Proxy Manager — public ingress, TLS termination
-      - outline             # Team wiki (outline + postgres + redis)
-      - joplin              # Note sync server (joplin-server + postgres)
-      - ai-cluster          # AI workers: codex-worker, openclaw, planner-worker,
-                            # service-ops-worker, redis, mosquitto

-  chelsty-infra:
-    site: chelsty
+  chelsty:
    roles:
      - remote
      - hypervisor
-      - infra
+      - homeassistant
      - staging
    connectivity:
      uplink: lte
@ -57,28 +40,10 @@ nodes:
    home_automation:
      offline_operation_required: true
      services:
+        - homeassistant
        - zigbee2mqtt
        - mosquitto
      coordinator:
        model: SLZB-06U
        connection: network
        usb: false
-
-  chelsty-ha:
-    site: chelsty
-    roles:
-      - remote
-      - homeassistant
-    connectivity:
-      uplink: lte
-      intermittent: true
-    home_automation:
-      offline_operation_required: true
-      services:
-        - homeassistant
-
-  lustro:
-    roles:
-      - edge
-    services:
-      - node-agent
--- a/scripts/bootstrap/vps-control-plane.sh
+++ b/scripts/bootstrap/vps-control-plane.sh
@ -1,75 +0,0 @@
-#!/usr/bin/env bash
-# vps-control-plane.sh - Bootstrap script for VPS control plane
-
-set -e
-
-SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
-REPO_ROOT="$(cd "$SCRIPT_DIR/../.." && pwd)"
-RUNTIME_DIR="/opt/homelab"
-VPS_CONFIG="$REPO_ROOT/hosts/vps/runtime"
-
-# Colors for output
-RED='\033[0;31m'
-GREEN='\033[0;32m'
-YELLOW='\033[1;33m'
-NC='\033[0m' # No Color
-
-log() { echo -e "${GREEN}[INFO]${NC} $1"; }
-warn() { echo -e "${YELLOW}[WARN]${NC} $1"; }
-error() { echo -e "${RED}[ERROR]${NC} $1"; exit 1; }
-
-log "Starting VPS control plane bootstrap..."
-
-# 1. Validate Docker availability
-if ! command -v docker &> /dev/null; then
-    error "Docker is not installed. Please install Docker first."
-fi
-
-# 2. Validate compose plugin
-if ! docker compose version &> /dev/null; then
-    error "Docker Compose plugin is not installed."
-fi
-
-log "Docker and Compose plugin verified."
-
-# 3. Create filesystem-first runtime structure
-log "Creating filesystem-first runtime structure in $RUNTIME_DIR..."
-sudo mkdir -p "$RUNTIME_DIR/events" \
-             "$RUNTIME_DIR/state" \
-             "$RUNTIME_DIR/world" \
-             "$RUNTIME_DIR/actions/pending" \
-             "$RUNTIME_DIR/actions/approved" \
-             "$RUNTIME_DIR/actions/running" \
-             "$RUNTIME_DIR/actions/completed" \
-             "$RUNTIME_DIR/actions/failed" \
-             "$RUNTIME_DIR/actions/rejected" \
-             "$RUNTIME_DIR/config" \
-             "$RUNTIME_DIR/logs"
-
-# 4. Set permissions
-log "Setting permissions..."
-sudo chown -R $USER:$USER "$RUNTIME_DIR"
-chmod -R 755 "$RUNTIME_DIR"
-
-# 5. Install environment file
-log "Installing environment configuration..."
-if [ ! -f "$RUNTIME_DIR/config/control-plane.env" ]; then
-    cp "$VPS_CONFIG/control-plane/env.example" "$RUNTIME_DIR/config/control-plane.env"
-    log "Created $RUNTIME_DIR/config/control-plane.env from template."
-else
-    warn "Environment file already exists, skipping installation."
-fi
-
-# 6. Build and start the control plane
-log "Building and starting control plane services..."
-cd "$REPO_ROOT/services/control-plane"
-docker compose build
-docker compose up -d
-
-log "VPS control plane bootstrap complete!"
-
-echo -e "\n${YELLOW}Verification commands:${NC}"
-echo "1. Check container status: docker compose ps"
-echo "2. Check operator UI: curl http://localhost:8080/summary"
-echo "3. Validate world state: ls -l $RUNTIME_DIR/world"
-echo "4. Monitor events: tail -f $RUNTIME_DIR/events/*/*/*.json"
--- a/scripts/deploy/deploy-control-plane.sh
+++ b/scripts/deploy/deploy-control-plane.sh
@ -1,23 +0,0 @@
-#!/bin/bash
-# scripts/deploy/deploy-control-plane.sh
-set -e
-
-VPS_IP="100.95.58.48"
-USER="oskar"
-REMOTE_REPO_PATH="/home/oskar/homelab-codex-ws"
-
-MODE=$1
-
-case "$MODE" in
-    "--ssh")
-        echo "Deploying to VPS ($VPS_IP) via SSH..."
-        ssh -t "$USER@$VPS_IP" "cd $REMOTE_REPO_PATH && git pull origin master && cd services/control-plane && bash deploy-local.sh"
-        ;;
-    "--print")
-        echo "ssh -t $USER@$VPS_IP \"cd $REMOTE_REPO_PATH && git pull origin master && cd services/control-plane && bash deploy-local.sh\""
-        ;;
-    *)
-        echo "Usage: $0 [--ssh|--print]"
-        exit 1
-        ;;
-esac
--- a/scripts/deploy/deploy-frigate.sh
+++ b/scripts/deploy/deploy-frigate.sh
@ -1,26 +0,0 @@
-#!/usr/bin/env bash
-# deploy-frigate.sh - Deploy Frigate NVR on chelsty-infra (print or SSH)
-
-MODE="print"
-[[ "$1" == "--ssh" ]] && MODE="ssh"
-
-TARGET="100.122.201.22"
-NODE="chelsty-infra"
-REPO_PATH="/home/oskar/homelab-codex-ws"
-SERVICE_PATH="$REPO_PATH/hosts/chelsty-infra/runtime/frigate"
-
-echo "HOST: $NODE"
-echo "MODE: $MODE"
-echo "TARGET: $TARGET"
-
-# Secrets must exist at /opt/homelab/config/frigate/frigate.env on the node
-# before first deploy. See config.yml for required variables.
-DEPLOY_CMD="cd $REPO_PATH && git fetch origin && git checkout master && git pull origin master && cd $SERVICE_PATH && docker-compose pull && docker-compose up -d"
-
-if [[ "$MODE" == "ssh" ]]; then
-    echo "--- Deploying Frigate to $NODE ($TARGET) via SSH ---"
-    ssh oskar@$TARGET "$DEPLOY_CMD"
-else
-    echo "# --- Deployment commands for $NODE ---"
-    echo "ssh oskar@$TARGET '$DEPLOY_CMD'"
-fi
--- a/scripts/deploy/deploy-node.sh
+++ b/scripts/deploy/deploy-node.sh
@ -8,7 +8,6 @@ set -e
 REPO_PATH="${HOME}/homelab-codex-ws"
 RUNTIME_PATH="/opt/homelab"
 HOSTNAME=$(hostname | tr '[:lower:]' '[:upper:]')
-HOST_DIR="${REPO_PATH}/hosts/$(hostname | tr '[:upper:]' '[:lower:]')"

 echo "--- Starting Deployment on ${HOSTNAME} ---"

@ -23,33 +22,20 @@ echo "Pulling latest changes..."
 git pull

 # 2. Identify Services
-SERVICES=()
-if [ -f "${HOST_DIR}/services.txt" ]; then
-    mapfile -t SERVICES < <(grep -v '^\s*#' "${HOST_DIR}/services.txt" | grep -v '^\s*$')
-elif [ -f "${HOST_DIR}/services.yaml" ]; then
-    SERVICES=($(python3 -c "
-import yaml, sys
-try:
-    with open('${HOST_DIR}/services.yaml', 'r') as f:
-        data = yaml.safe_load(f)
-        if data and 'services' in data:
-            if isinstance(data['services'], dict):
-                print(' '.join(data['services'].keys()))
-            elif isinstance(data['services'], list):
-                print(' '.join(data['services']))
-except Exception as e:
-    print(f'Error parsing YAML: {e}', file=sys.stderr)
-    sys.exit(1)
-"))
-fi
+# By convention, the services assigned to this host are listed in
+# hosts/<hostname>/services.txt; if that file is missing, nothing is deployed.
+SERVICE_LIST="${REPO_PATH}/hosts/$(hostname | tr '[:upper:]' '[:lower:]')/services.txt"

-if [ ${#SERVICES[@]} -eq 0 ]; then
-    echo "No services found for ${HOSTNAME}. Skipping service deployment."
+if [ ! -f "$SERVICE_LIST" ]; then
+    echo "No services.txt found for ${HOSTNAME}. Skipping service deployment."
    exit 0
 fi

 # 3. Deploy Services
-for service in "${SERVICES[@]}"; do
+while IFS= read -r service || [ -n "$service" ]; do
+    [[ "$service" =~ ^[[:space:]]*# ]] && continue   # Skip comments (leading whitespace allowed)
+    [[ "$service" =~ ^[[:space:]]*$ ]] && continue   # Skip empty or whitespace-only lines
+
    echo "Deploying service: ${service}..."
    
    COMPOSE_FILE="${REPO_PATH}/services/${service}/docker-compose.yml"
@ -59,10 +45,13 @@ for service in "${SERVICES[@]}"; do
        continue
    fi

+    # Target directory in runtime
    TARGET_DIR="${RUNTIME_PATH}/services/${service}"
    mkdir -p "$TARGET_DIR"

-    OVERRIDE_FILE="${HOST_DIR}/runtime/${service}/docker-compose.override.yml"
+    # The base compose file is used straight from the repo; a per-service
+    # override in the runtime config directory is picked up when present.
+    OVERRIDE_FILE="${RUNTIME_PATH}/config/${service}/docker-compose.override.yml"
    
    COMPOSE_CMD="docker compose -f ${COMPOSE_FILE}"
    if [ -f "$OVERRIDE_FILE" ]; then
@ -71,6 +60,7 @@ for service in "${SERVICES[@]}"; do
    fi

    $COMPOSE_CMD up -d --remove-orphans
-done
+
+done < "$SERVICE_LIST"

 echo "--- Deployment Complete ---"
--- a/scripts/deploy/deploy-stability-agent.sh
+++ b/scripts/deploy/deploy-stability-agent.sh
@ -1,55 +0,0 @@
-#!/usr/bin/env bash
-# deploy-stability-agent.sh - Helper to deploy stability-agent (print or SSH)
-
-NODE=$1
-MODE="print"
-[[ "$2" == "--ssh" ]] && MODE="ssh"
-
-if [[ -z "$NODE" ]]; then
-    echo "Usage: $0 <node-name> [--ssh]"
-    echo "Supported nodes: chelsty, piha, solaria, vps"
-    exit 1
-fi
-
-case "$NODE" in
-    piha)    TARGET="100.108.208.3" ;;
-    chelsty) TARGET="100.122.201.22" ;;
-    vps)     TARGET="100.95.58.48" ;;
-    solaria) TARGET="local" ;;
-    *)
-        echo "Error: Unknown node '$NODE'"
-        echo "Supported nodes: chelsty, piha, solaria, vps"
-        exit 1
-        ;;
-esac
-
-echo "HOST: $NODE"
-echo "MODE: $MODE"
-echo "TARGET: $TARGET"
-
-REPO_PATH="/home/oskar/homelab-codex-ws"
-
-if [[ "$NODE" == "solaria" ]]; then
-    if [[ "$MODE" == "ssh" ]]; then
-        echo "--- Running local deployment for solaria ---"
-        cd "$REPO_PATH" && git fetch origin && git checkout master && git pull origin master && cd services/stability-agent && ./deploy-local.sh solaria
-    else
-        echo "# --- Deployment commands for solaria ---"
-        echo "cd $REPO_PATH"
-        echo "git fetch origin"
-        echo "git checkout master"
-        echo "git pull origin master"
-        echo "cd services/stability-agent"
-        echo "./deploy-local.sh solaria"
-    fi
-else
-    # Remote nodes
-    SSH_CMD="ssh oskar@$TARGET 'cd $REPO_PATH && git fetch origin && git checkout master && git pull origin master && cd services/stability-agent && ./deploy-local.sh $NODE'"
-    if [[ "$MODE" == "ssh" ]]; then
-        echo "--- Deploying to $NODE ($TARGET) via SSH ---"
-        eval "$SSH_CMD"
-    else
-        echo "# --- Deployment commands for $NODE ---"
-        echo "$SSH_CMD"
-    fi
-fi
--- a/scripts/deploy/deploy.sh
+++ b/scripts/deploy/deploy.sh
@ -1,321 +1,270 @@
 #!/usr/bin/env bash
-# scripts/deploy/deploy.sh — Saturn-side deploy dispatcher
-# Usage: deploy.sh <target> [--dry-run] [--no-gate]
-#   target ∈ {control-plane, vps, piha, solaria, chelsty-infra}
-# Exit codes: 0=ok  1=preflight  2=gate  3=execute  4=verify  5=handoff(sudo)
+# deploy.sh - Staged deployment framework for homelab nodes.

-set -uo pipefail
+set -o pipefail

-REPO_ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd)"
-SSH_USER="${SSH_USER:-oskar}"
-START_TIME=$(date +%s)
-TARGET=""
-DRY_RUN=false
-NO_GATE=false
+# --- Configuration ---
+export RUNTIME_PATH="/opt/homelab"
+export STATE_DIR="${RUNTIME_PATH}/state/deploy"
+export LOG_DIR="${RUNTIME_PATH}/logs/deploy"
+export REPO_PATH="${HOME}/homelab-codex-ws"
+export TIMESTAMP=$(date +%Y%m%d_%H%M%S)
+export LOG_FILE="${LOG_DIR}/deploy_${TIMESTAMP}.log"

-usage() {
-    cat >&2 <<'EOF'
-Usage: deploy.sh <target> [--dry-run] [--no-gate]
+# --- Initialization ---
+mkdir -p "$STATE_DIR" "$LOG_DIR"

-Targets:
-  control-plane   observer/supervisor/executor/operator-ui on VPS
-  vps             all VPS GitOps services
-  piha            PIHA services
-  solaria         SOLARIA compute services
-  chelsty-infra   CHELSTY edge node (LTE, longer SSH timeout)
+# Mirror stdout and stderr to the log file while still printing to the terminal
+exec > >(tee -a "$LOG_FILE") 2>&1

-Flags:
-  --dry-run   run preflight + gate only; stop before deploy
-  --no-gate   skip pytest + docker build (emergency only; logged as WARNING)
+# --- Load Libraries ---
+LIB_PATH="${REPO_PATH}/scripts/lib"
+source "${LIB_PATH}/log.sh"
+source "${LIB_PATH}/state.sh"
+source "${LIB_PATH}/inventory.sh"
+source "${LIB_PATH}/compose.sh"
+source "${LIB_PATH}/diagnostics.sh"

-Exit codes: 0=ok  1=preflight  2=gate  3=execute  4=verify  5=handoff(sudo)
-EOF
-    exit 1
-}
+# --- CLI Parsing ---
+TARGET_HOST=$(hostname)
+TARGET_SERVICE=""
+RESUME=false
+REQUESTED_STAGE=""

 while [[ $# -gt 0 ]]; do
    case $1 in
-        control-plane|vps|piha|solaria|chelsty-infra)
-            TARGET="$1"; shift ;;
-        --dry-run)
-            DRY_RUN=true; shift ;;
-        --no-gate)
-            NO_GATE=true; shift ;;
-        -h|--help)
-            usage ;;
+        --host)
+            TARGET_HOST="$2"
+            shift 2
+            ;;
+        --service)
+            TARGET_SERVICE="$2"
+            shift 2
+            ;;
+        --resume)
+            RESUME=true
+            shift
+            ;;
+        --stage)
+            REQUESTED_STAGE="$2"
+            shift 2
+            ;;
        *)
-            echo "Unknown argument: $1" >&2
-            usage ;;
+            if [[ "$1" =~ ^(prepare|validate|deploy|verify|diagnose|complete)$ ]]; then
+                REQUESTED_STAGE="$1"
+            fi
+            shift
+            ;;
    esac
 done

-[[ -z "$TARGET" ]] && { echo "Error: target is required." >&2; usage; }
+# --- Stages ---

-case "$TARGET" in
-    control-plane) SSH_HOST="vps" ;;
-    *)             SSH_HOST="$TARGET" ;;
-esac
-
-case "$TARGET" in
-    chelsty-*) SSH_TIMEOUT=30 ;;
-    *)         SSH_TIMEOUT=5 ;;
-esac
-
-# ── PREFLIGHT ────────────────────────────────────────────────────────────────
-
-preflight() {
-    echo "=== PREFLIGHT ==="
-
-    local branch
-    branch=$(git -C "$REPO_ROOT" rev-parse --abbrev-ref HEAD)
-    if [[ "$branch" != "master" ]]; then
-        echo "ERROR: On branch '${branch}', not master. Switch to master and push first." >&2
-        exit 1
-    fi
-    echo "[ok] branch: master"
-
-    if ! git -C "$REPO_ROOT" diff --quiet; then
-        echo "ERROR: Unstaged changes in working tree. Commit or stash before deploying." >&2
-        exit 1
-    fi
-    if ! git -C "$REPO_ROOT" diff --cached --quiet; then
-        echo "ERROR: Staged but uncommitted changes. Commit before deploying." >&2
-        exit 1
-    fi
-    echo "[ok] working tree clean"
-
-    git -C "$REPO_ROOT" fetch origin master --quiet
-    local unpushed
-    unpushed=$(git -C "$REPO_ROOT" log origin/master..HEAD --oneline)
-    if [[ -n "$unpushed" ]]; then
-        echo "ERROR: Unpushed commits on master:" >&2
-        echo "$unpushed" >&2
-        echo "Push first:  git push origin master" >&2
-        exit 1
-    fi
-    echo "[ok] no unpushed commits"
-
-    echo "Checking SSH: ${SSH_USER}@${SSH_HOST} (ConnectTimeout=${SSH_TIMEOUT}s)..."
-    if ! ssh -o "ConnectTimeout=${SSH_TIMEOUT}" -o BatchMode=yes \
-            "${SSH_USER}@${SSH_HOST}" true 2>/dev/null; then
-        echo "ERROR: Cannot reach ${SSH_HOST} via SSH (timeout ${SSH_TIMEOUT}s)." >&2
-        exit 1
-    fi
-    echo "[ok] ${SSH_HOST} reachable"
-}
-
-# ── GATE ─────────────────────────────────────────────────────────────────────
-
-gate() {
-    if [[ "$NO_GATE" == "true" ]]; then
-        echo "=== GATE: SKIPPED ==="
-        echo "WARNING: --no-gate active — pytest + docker build bypassed (emergency mode)." >&2
+stage_prepare() {
+    local host=$1
+    if is_stage_complete "prepare" && [[ "$RESUME" == "true" ]]; then
+        log "INFO" "Skipping PREPARE (already complete)"
        return 0
    fi

-    echo "=== GATE ==="
+    log "INFO" "Stage: PREPARE ($host)"
+    set_stage "prepare"
    
-    local services=()
+    emit_event "deployment_started" "info" "deploy.sh" "all" "${TIMESTAMP}" "{\"stage\": \"prepare\"}"

-    if [[ "$TARGET" == "control-plane" ]]; then
-        services=("control-plane")
-    else
-        local svc_yaml="${REPO_ROOT}/hosts/${TARGET}/services.yaml"
-        if [[ ! -f "$svc_yaml" ]]; then
-            echo "ERROR: ${svc_yaml} not found." >&2
-            exit 2
-        fi
-        local svc_list
-        svc_list=$(python3 -c "
-import yaml
-with open('${svc_yaml}') as f:
-    data = yaml.safe_load(f)
-svcs = data.get('services', {})
-if isinstance(svcs, dict):
-    print('\n'.join(svcs.keys()))
-elif isinstance(svcs, list):
-    print('\n'.join(svcs))
-")
-        while IFS= read -r svc; do
-            [[ -z "$svc" ]] && continue
-            if [[ -f "${REPO_ROOT}/services/${svc}/Dockerfile" ]]; then
-                services+=("$svc")
-            fi
-        done <<< "$svc_list"
+    cd "$REPO_PATH" || exit 1
+    log "INFO" "Pulling latest changes..."
+    if ! git pull; then
+        log "WARN" "Git pull failed, proceeding with local state (offline mode or network flap)"
    fi

-    if [[ ${#services[@]} -eq 0 ]]; then
-        echo "[info] No services with local Dockerfile found for ${TARGET} — gate trivially passes."
+    # Ensure runtime directories exist
+    mkdir -p "${RUNTIME_PATH}/config" "${RUNTIME_PATH}/data" "${RUNTIME_PATH}/state" "${RUNTIME_PATH}/logs"
+
+    struct_log "prepare" "$host" "all" "success" "repo_updated"
+    mark_stage_complete "prepare"
+}
+
+stage_validate() {
+    local host=$1
+    if is_stage_complete "validate" && [[ "$RESUME" == "true" ]]; then
+        log "INFO" "Skipping VALIDATE (already complete)"
        return 0
    fi

-    echo "Services under gate: ${services[*]}"
-    local gate_failed=false
+    log "INFO" "Stage: VALIDATE ($host)"
+    set_stage "validate"

-    for svc in "${services[@]}"; do
-        local svc_dir="${REPO_ROOT}/services/${svc}"
-
-        if [[ -d "${svc_dir}/tests" ]]; then
-            echo "--- pytest: ${svc} ---"
-            if ! python3 -m pytest "${svc_dir}/tests" -q; then
-                echo "GATE FAIL: pytest failed for ${svc}" >&2
-                gate_failed=true
-            fi
-        fi
-
-        echo "--- docker build: ${svc} ---"
-        if ! docker build --quiet "${svc_dir}" >/dev/null; then
-            echo "GATE FAIL: docker build failed for ${svc}" >&2
-            gate_failed=true
+    for service in "${SERVICES[@]}"; do
+        log "INFO" "Validating $service..."
+        if [[ ! -d "${REPO_PATH}/services/$service" ]]; then
+            log "ERROR" "Service definition not found: $service"
+            struct_log "validate" "$host" "$service" "fail" "not_found"
+            return 1
        fi
    done

-    if [[ "$gate_failed" == "true" ]]; then
-        exit 2
-    fi
-    echo "[ok] gate passed"
+    struct_log "validate" "$host" "all" "success" "validated"
+    mark_stage_complete "validate"
 }

-# ── EXECUTE ──────────────────────────────────────────────────────────────────
+stage_deploy() {
+    local host=$1
+    if is_stage_complete "deploy" && [[ "$RESUME" == "true" ]]; then
+        log "INFO" "Skipping DEPLOY (already complete)"
+        return 0
+    fi

-execute() {
-    echo "=== EXECUTE ==="
+    log "INFO" "Stage: DEPLOY ($host)"
+    set_stage "deploy"

-    local cmd_output
-    local cmd_exit=0
+    local last_s; last_s=$(get_last_service)
+    local skip=false
+    if [[ "$RESUME" == "true" && -n "$last_s" ]]; then
+        skip=true
+    fi

-    if [[ "$TARGET" == "control-plane" ]]; then
-        echo "Running deploy-control-plane.sh --ssh..."
-        cmd_output=$("${REPO_ROOT}/scripts/deploy/deploy-control-plane.sh" --ssh 2>&1) \
-            || cmd_exit=$?
+    for service in "${SERVICES[@]}"; do
+        if [[ "$skip" == "true" ]]; then
+            if [[ "$service" == "$last_s" ]]; then
+                skip=false
+                log "INFO" "Resuming from $service..."
            else
-        echo "SSHing to ${SSH_HOST}: git pull + deploy-node.sh..."
-        cmd_output=$(ssh -o "ConnectTimeout=${SSH_TIMEOUT}" -o BatchMode=yes \
-            "${SSH_USER}@${SSH_HOST}" \
-            'cd ~/homelab-codex-ws && git pull && ./scripts/deploy/deploy-node.sh' 2>&1) \
-            || cmd_exit=$?
+                log "INFO" "Skipping $service (already processed)"
+                continue
+            fi
        fi

-    echo "$cmd_output"
+        log "INFO" "Deploying $service..."
+        set_last_service "$service"

-    if echo "$cmd_output" | grep -qF "[sudo] password"; then
-        echo "" >&2
-        echo "ERROR (exit 5): Deploy hit an interactive sudo prompt." >&2
-        echo "Run manually:" >&2
-        if [[ "$TARGET" == "control-plane" ]]; then
-            echo "  ssh -t ${SSH_USER}@${SSH_HOST} 'cd ~/homelab-codex-ws && git pull origin master && cd services/control-plane && bash deploy-local.sh'" >&2
-        else
-            echo "  ssh -t ${SSH_USER}@${SSH_HOST} 'cd ~/homelab-codex-ws && git pull && ./scripts/deploy/deploy-node.sh'" >&2
-        fi
-        exit 5
+        if ! run_compose_up "$service"; then
+            struct_log "deploy" "$host" "$service" "fail" "docker_compose_failed"
+            collect_diagnostics "$host" "$service"
+            return 1
        fi

-    if [[ $cmd_exit -ne 0 ]]; then
-        echo "ERROR: Deploy command exited ${cmd_exit}." >&2
-        exit 3
-    fi
-
-    echo "[ok] execute completed"
-}
-
-# ── VERIFY ───────────────────────────────────────────────────────────────────
-
-verify() {
-    echo "=== VERIFY ==="
-
-    local ps_output
-    local ps_exit=0
-    ps_output=$(ssh -o "ConnectTimeout=${SSH_TIMEOUT}" -o BatchMode=yes \
-        "${SSH_USER}@${SSH_HOST}" \
-        'docker ps --format "{{.Names}}\t{{.Status}}"' 2>&1) \
-        || ps_exit=$?
-
-    if [[ $ps_exit -ne 0 ]]; then
-        echo "ERROR: docker ps failed on ${SSH_HOST}:" >&2
-        echo "$ps_output" >&2
-        exit 4
-    fi
-
-    echo "$ps_output"
-
-    local failed=false
-
-    local not_up
-    not_up=$(echo "$ps_output" | grep -v '^$' | grep -v $'\tUp' || true)
-    if [[ -n "$not_up" ]]; then
-        echo "ERROR: Containers not in Up state:" >&2
-        echo "$not_up" >&2
-        failed=true
-    fi
-
-    local unhealthy
-    unhealthy=$(echo "$ps_output" | grep '(unhealthy)' || true)
-    if [[ -n "$unhealthy" ]]; then
-        echo "ERROR: Unhealthy containers:" >&2
-        echo "$unhealthy" >&2
-        failed=true
-    fi
-
-    if [[ "$TARGET" == "control-plane" ]]; then
-        for cp_svc in supervisor observer executor operator-ui; do
-            if ! echo "$ps_output" | grep -q "$cp_svc"; then
-                echo "ERROR: control-plane component absent from docker ps: ${cp_svc}" >&2
-                failed=true
-            fi
+        struct_log "deploy" "$host" "$service" "success" "deployed"
    done
-    fi
    
-    if [[ "$failed" == "true" ]]; then
-        echo "" >&2
-        echo "Full docker ps output above." >&2
-        exit 4
-    fi
-
-    echo "[ok] all containers healthy"
+    set_last_service ""
+    mark_stage_complete "deploy"
 }

-# ── REPORT ───────────────────────────────────────────────────────────────────
-
-report() {
-    local mode="${1:-deploy}"
-    local end_time
-    end_time=$(date +%s)
-    local elapsed
-    elapsed=$(( end_time - START_TIME ))
-    local commit_hash
-    commit_hash=$(git -C "$REPO_ROOT" rev-parse --short HEAD)
-    local gate_s verify_s
-
-    if [[ "$NO_GATE" == "true" ]]; then
-        gate_s="skip"
-    else
-        gate_s="ok"
+stage_verify() {
+    local host=$1
+    if is_stage_complete "verify" && [[ "$RESUME" == "true" ]]; then
+        log "INFO" "Skipping VERIFY (already complete)"
+        return 0
    fi

-    if [[ "$mode" == "dry-run" ]]; then
-        verify_s="skip(dry-run)"
-    else
-        verify_s="green"
-    fi
+    log "INFO" "Stage: VERIFY ($host)"
+    set_stage "verify"

-    echo ""
-    if [[ "$mode" == "dry-run" ]]; then
-        echo "DRY RUN OK | target=${TARGET} | commit=${commit_hash} | gate=${gate_s} | verify=${verify_s} | ${elapsed}s"
-    else
-        echo "DEPLOY OK  | target=${TARGET} | commit=${commit_hash} | gate=${gate_s} | verify=${verify_s} | ${elapsed}s"
+    for service in "${SERVICES[@]}"; do
+        log "INFO" "Verifying $service..."
+        local health_script="${REPO_PATH}/services/${service}/healthcheck.sh"
+        if [[ -f "$health_script" ]]; then
+            if ! bash "$health_script"; then
+                log "ERROR" "Healthcheck failed for $service"
+                struct_log "verify" "$host" "$service" "fail" "healthcheck_failed"
+                collect_diagnostics "$host" "$service"
+                return 1
            fi
+        else
+            # No healthcheck script: fall back to checking that a matching container is running
+            if ! docker ps --filter "name=$service" --filter "status=running" | grep -q "$service"; then
+                log "ERROR" "Container $service is not running"
+                struct_log "verify" "$host" "$service" "fail" "container_not_running"
+                collect_diagnostics "$host" "$service"
+                return 1
+            fi
+        fi
+        struct_log "verify" "$host" "$service" "success" "verified"
+    done
+    mark_stage_complete "verify"
 }

-# ── MAIN ─────────────────────────────────────────────────────────────────────
+stage_complete() {
+    local host=$1
+    log "INFO" "Stage: COMPLETE ($host)"
+    set_stage "complete"
+    struct_log "complete" "$host" "all" "success" "deployment_finished"
+    clear_deployment_state
+}

-preflight
-gate
+# --- Execution Logic ---

-if [[ "$DRY_RUN" == "true" ]]; then
-    report dry-run
-    exit 0
+run_deployment() {
+    local start_stage=$1
+
+    # Run stages in order from start_stage; ';&' falls through to the next case body
+    case "$start_stage" in
+        prepare)
+            stage_prepare "$TARGET_HOST" || return 1
+            ;&
+        validate)
+            stage_validate "$TARGET_HOST" || return 1
+            ;&
+        deploy)
+            stage_deploy "$TARGET_HOST" || return 1
+            ;&
+        verify)
+            stage_verify "$TARGET_HOST" || return 1
+            ;&
+        complete)
+            stage_complete "$TARGET_HOST" || return 1
+            ;;
+        *)
+            log "ERROR" "Invalid stage: $start_stage"
+            return 1
+            ;;
+    esac
+}
+
+# --- Main ---
+
+log "INFO" "--- Homelab Deployment Started (Host: $TARGET_HOST, Service: ${TARGET_SERVICE:-all}) ---"
+
+if ! load_inventory "$TARGET_HOST" "$TARGET_SERVICE"; then
+    log "ERROR" "Failed to load inventory"
+    exit 1
 fi

-execute
-verify
-report
+EXIT_STATUS=0
+if [[ "$RESUME" == "true" ]]; then
+    CURRENT=$(get_stage)
+    log "INFO" "Resuming from state: $CURRENT"
+    case "$CURRENT" in
+        prepare|validate|deploy|verify)
+            run_deployment "$CURRENT" || EXIT_STATUS=1
+            ;;
+        complete|none)
+            log "INFO" "No interrupted deployment found. Starting from scratch..."
+            run_deployment "prepare" || EXIT_STATUS=1
+            ;;
+        *)
+            log "INFO" "Unknown state. Starting from prepare..."
+            run_deployment "prepare" || EXIT_STATUS=1
+            ;;
+    esac
+elif [[ -n "$REQUESTED_STAGE" ]]; then
+    if [[ "$REQUESTED_STAGE" == "diagnose" ]]; then
+        collect_diagnostics "$TARGET_HOST" "$TARGET_SERVICE"
+    else
+        run_deployment "$REQUESTED_STAGE" || EXIT_STATUS=1
+    fi
+else
+    # New deployment - clear previous state
+    clear_deployment_state
+    run_deployment "prepare" || EXIT_STATUS=1
+fi
+
+if [[ $EXIT_STATUS -eq 0 ]]; then
+    print_summary "$TARGET_HOST" "SUCCESS"
+    log "INFO" "--- Homelab Deployment Finished Successfully ---"
+else
+    print_summary "$TARGET_HOST" "FAILED"
+    log "ERROR" "--- Homelab Deployment Failed ---"
+    exit 1
+fi
--- a/scripts/deploy/orchestrate-deploy.sh
+++ b/scripts/deploy/orchestrate-deploy.sh
@ -1,30 +1,15 @@
 #!/usr/bin/env bash
 # orchestrate-deploy.sh - To be run on SATURN
-# Triggers deployment on remote execution nodes via inventory.
+# Triggers deployment on remote execution nodes.

 set -e

-REPO_PATH="${HOME}/homelab-codex-ws"
-USER="oskar"
+HOSTS=("solaria" "piha" "vps")
+USER="oskar" # SSH user on the remote nodes

-while IFS=' ' read -r HOST TAG; do
+for HOST in "${HOSTS[@]}"; do
    echo ">>> Triggering deployment on ${HOST}..."
-    if [[ "$TAG" == "lte" ]]; then
-        ssh -o ConnectTimeout=30 "${USER}@${HOST}" "bash ~/homelab-codex-ws/scripts/deploy/deploy-node.sh" || \
-            echo "WARNING: Deployment on ${HOST} failed or timed out (LTE/intermittent node, skipping)"
-    else
    ssh "${USER}@${HOST}" "bash ~/homelab-codex-ws/scripts/deploy/deploy-node.sh"
-    fi
-done < <(python3 -c "
-import yaml, sys
-with open('${REPO_PATH}/inventory/topology.yaml') as f:
-    data = yaml.safe_load(f)
-skip = {'saturn', 'solaria'}
-for name, info in (data.get('nodes') or {}).items():
-    if name in skip:
-        continue
-    uplink = ((info or {}).get('connectivity') or {}).get('uplink', '')
-    print(name, 'lte' if uplink == 'lte' else 'standard')
-")
+done

 echo ">>> All deployments triggered."
--- a/scripts/deploy/verify-agent-fleet.sh
+++ b/scripts/deploy/verify-agent-fleet.sh
@ -1,68 +0,0 @@
-#!/usr/bin/env bash
-# verify-agent-fleet.sh - Check the status of stability agents across the fleet
-
-REDIS_CMD="docker exec agent-system-redis redis-cli --raw"
-
-# Check if docker is available
-if ! command -v docker &> /dev/null; then
-    echo "Error: docker command not found."
-    exit 1
-fi
-
-# Check if container is running
-if ! docker ps --filter "name=agent-system-redis" --format "{{.Names}}" | grep -q "agent-system-redis"; then
-    echo "Error: agent-system-redis container not found or not running."
-    echo "This script must be run on PIHA (the node hosting the Redis container)."
-    exit 1
-fi
-
-REQUIRED_NODES=("piha" "chelsty" "solaria" "vps")
-MISSING_NODES=0
-
-echo "--- Homelab Agent Fleet Status ---"
-printf "%-10s %-15s %-10s %-10s %-30s\n" "NODE" "HOSTNAME" "HEALTH" "STATUS" "LAST_SEEN"
-printf "%s\n" "--------------------------------------------------------------------------------"
-
-for NODE in "${REQUIRED_NODES[@]}"; do
-    KEY="homelab:nodes:$NODE"
-    
-    # Check if key exists
-    EXISTS=$($REDIS_CMD EXISTS "$KEY" 2>/dev/null | tr -d '\r\n')
-
-    if [[ "$EXISTS" != "1" ]]; then
-        printf "%-10s %-15s %-10s %-10s %-30s\n" "$NODE" "MISSING" "N/A" "N/A" "N/A"
-        MISSING_NODES=$((MISSING_NODES + 1))
-        continue
-    fi
-
-    HOSTNAME=$($REDIS_CMD HGET "$KEY" hostname 2>/dev/null | tr -d '\r\n')
-    HEALTH=$($REDIS_CMD HGET "$KEY" health 2>/dev/null | tr -d '\r\n')
-    STATUS=$($REDIS_CMD HGET "$KEY" status 2>/dev/null | tr -d '\r\n')
-    LAST_SEEN=$($REDIS_CMD HGET "$KEY" last_seen 2>/dev/null | tr -d '\r\n')
-
-    printf "%-10s %-15s %-10s %-10s %-30s\n" "$NODE" "$HOSTNAME" "$HEALTH" "$STATUS" "$LAST_SEEN"
-done
-
-echo ""
-echo "--- Control Plane Summary ---"
-if command -v jq >/dev/null; then
-    curl -s http://127.0.0.1:18180/summary | jq .
-else
-    curl -s http://127.0.0.1:18180/summary
-fi
-
-echo ""
-echo "--- Control Plane Nodes ---"
-if command -v jq >/dev/null; then
-    curl -s http://127.0.0.1:18180/nodes | jq .
-else
-    curl -s http://127.0.0.1:18180/nodes
-fi
-
-if [[ $MISSING_NODES -gt 0 ]]; then
-    echo ""
-    echo "Error: $MISSING_NODES required nodes are missing from Redis."
-    exit 1
-fi
-
-exit 0
--- a/scripts/dev/agent.sh
+++ b/scripts/dev/agent.sh
@ -1,361 +0,0 @@
-#!/usr/bin/env bash
-# Multi-agent worktree manager.
-# EXIT: 0 ok, 1 preflight, 2 operation failed.
-set -euo pipefail
-
-trap 'echo "agent.sh: failed at line $LINENO (exit $?)" >&2' ERR
-
-RESERVED_NAMES=(master main HEAD list merge clean new)
-MAX_WORKTREES=4
-
-die()    { echo "ERROR: $*" >&2; exit "${2:-2}"; }
-prefail(){ echo "PREFLIGHT: $*" >&2; exit 1; }
-
-# ── helpers ──────────────────────────────────────────────────────────────────
-
-is_main_checkout() {
-  local git_dir common_dir
-  git_dir=$(git rev-parse --git-dir 2>/dev/null) || return 1
-  common_dir=$(git rev-parse --git-common-dir 2>/dev/null) || return 1
-  [ "$git_dir" = "$common_dir" ]
-}
-
-require_main_checkout() {
-  is_main_checkout || prefail "must run from the main checkout, not a worktree"
-}
-
-require_master_branch() {
-  local branch
-  branch=$(git rev-parse --abbrev-ref HEAD)
-  [ "$branch" = "master" ] || prefail "must be on master (currently on '$branch')"
-}
-
-require_clean_tree() {
-  local dirty
-  dirty=$(git status --porcelain)
-  [ -z "$dirty" ] || prefail "working tree is not clean — stash or commit first"
-}
-
-worktree_paths() {
-  # list worktree paths (excluding main); || true prevents grep exit-1 when empty
-  local main_path
-  main_path=$(git rev-parse --show-toplevel)
-  git worktree list --porcelain \
-    | awk '/^worktree /{p=$2} /^$/{print p}' \
-    | grep -v "^${main_path}$" \
-    || true
-}
-
-worktree_count() {
-  worktree_paths | wc -l
-}
-
-branch_exists_local()  { git show-ref --verify --quiet "refs/heads/$1"; }
-branch_exists_remote() { git ls-remote --exit-code origin "$1" >/dev/null 2>&1; }
-
-utc_now() { date -u +"%Y-%m-%dT%H:%M:%SZ"; }
-
-age_str() {
-  local created_utc="$1"
-  local now_ts created_ts diff_s
-  now_ts=$(date -u +%s)
-  # strip Z, replace T with space for `date -d`
-  created_ts=$(date -u -d "${created_utc//T/ }" +%s 2>/dev/null) || { echo "?"; return; }
-  diff_s=$(( now_ts - created_ts ))
-  if   (( diff_s < 60 ));   then echo "${diff_s}s"
-  elif (( diff_s < 3600 )); then echo "$(( diff_s/60 ))m"
-  elif (( diff_s < 86400 )); then echo "$(( diff_s/3600 ))h"
-  else echo "$(( diff_s/86400 ))d"
-  fi
-}
-
-validate_name() {
-  local name="$1"
-  if ! [[ "$name" =~ ^[a-z][a-z0-9-]*$ ]]; then
-    prefail "name '$name' must match ^[a-z][a-z0-9-]*$"
-  fi
-  for r in "${RESERVED_NAMES[@]}"; do
-    if [ "$name" = "$r" ]; then
-      prefail "'$name' is a reserved word"
-    fi
-  done
-}
-
-# ── subcommands ───────────────────────────────────────────────────────────────
-
-cmd_new() {
-  local name="${1:-}"
-  [ -n "$name" ] || { usage; exit 1; }
-
-  validate_name "$name"
-  require_main_checkout
-  require_master_branch
-  require_clean_tree
-
-  # worktree limit
-  local count
-  count=$(worktree_count)
-  if (( count >= MAX_WORKTREES )); then
-    echo "ERROR: already at maximum of $MAX_WORKTREES active worktrees:" >&2
-    cmd_list
-    exit 1
-  fi
-
-  # branch collision
-  if branch_exists_local "task/$name"; then
-    prefail "branch task/$name already exists locally"
-  fi
-  git fetch origin master --quiet
-  if branch_exists_remote "refs/heads/task/$name"; then
-    prefail "branch task/$name already exists on origin"
-  fi
-
-  # directory collision
-  local main_path wt_path
-  main_path=$(git rev-parse --show-toplevel)
-  wt_path="$(dirname "$main_path")/homelab-codex-ws-${name}"
-  [ ! -e "$wt_path" ] || prefail "directory $wt_path already exists"
-
-  # create worktree
-  git worktree add -b "task/$name" "$wt_path" origin/master \
-    || die "git worktree add failed"
-
-  # write marker
-  local parent_commit
-  parent_commit=$(git rev-parse origin/master)
-  cat > "$wt_path/.agent-task" <<EOF
-task: $name
-branch: task/$name
-parent_commit: $parent_commit
-created_utc: $(utc_now)
-worktree_path: $wt_path
-EOF
-
-  echo ""
-  echo "Worktree created: $wt_path"
-  echo "Branch:           task/$name"
-  echo ""
-  echo "── Start Claude Code in this worktree ──────────────────────────────────────"
-  echo "cd ~/homelab-codex-ws-${name} && claude --dangerously-skip-permissions \"You are in the worktree for task '${name}' (branch task/${name}). FIRST read .agent-task and .claude/skills/worktree-aware/SKILL.md, and only then start working. Commit only to your own branch; do not push to origin master.\""
-  echo "─────────────────────────────────────────────────────────────────────────────"
-}
-
-cmd_list() {
-  local main_path
-  main_path=$(git rev-parse --show-toplevel)
-
-  # fetch to get up-to-date ahead/behind
-  git fetch origin master --quiet 2>/dev/null || true
-
-  local paths
-  paths=$(worktree_paths)
-
-  if [ -z "$paths" ]; then
-    echo "(no active task worktrees)"
-    return
-  fi
-
-  printf "%-20s %-25s %-10s %-8s %-8s %-7s %s\n" \
-    "NAME" "BRANCH" "CREATED" "AGE" "STATUS" "A/B" "PARENT"
-
-  while IFS= read -r wt_path; do
-    [ -z "$wt_path" ] && continue
-
-    local marker="$wt_path/.agent-task"
-    local task_name branch parent_commit created_utc
-    if [ -f "$marker" ]; then
-      task_name=$(  grep '^task:'          "$marker" | awk '{print $2}')
-      branch=$(     grep '^branch:'        "$marker" | awk '{print $2}')
-      parent_commit=$(grep '^parent_commit:' "$marker" | awk '{print $2}')
-      created_utc=$(grep '^created_utc:'   "$marker" | awk '{print $2}')
-    else
-      task_name="(no marker)"
-      branch=$(git -C "$wt_path" rev-parse --abbrev-ref HEAD 2>/dev/null || echo "?")
-      parent_commit="?"
-      created_utc=""
-    fi
-
-    local status="clean"
-    local dirty
-    dirty=$(git -C "$wt_path" status --porcelain 2>/dev/null || echo "?")
-    [ -n "$dirty" ] && status="dirty"
-
-    local ahead behind ab
-    ahead=$(git -C "$wt_path" rev-list --count "origin/master..${branch}" 2>/dev/null || echo "?")
-    behind=$(git -C "$wt_path" rev-list --count "${branch}..origin/master" 2>/dev/null || echo "?")
-    ab="+${ahead}/-${behind}"
-
-    local age=""
-    [ -n "$created_utc" ] && age=$(age_str "$created_utc")
-
-    local short_parent="${parent_commit:0:7}"
-    local short_created="${created_utc:0:10}"
-
-    printf "%-20s %-25s %-10s %-8s %-8s %-7s %s\n" \
-      "$task_name" "$branch" "$short_created" "$age" "$status" "$ab" "$short_parent"
-  done <<< "$paths"
-}
-
-cmd_merge() {
-  local name="${1:-}"
-  [ -n "$name" ] || { usage; exit 1; }
-
-  require_main_checkout
-  require_master_branch
-  require_clean_tree
-
-  git fetch origin --quiet
-
-  branch_exists_local "task/$name" || die "branch task/$name not found locally" 1
-
-  local main_path wt_path
-  main_path=$(git rev-parse --show-toplevel)
-  wt_path="$(dirname "$main_path")/homelab-codex-ws-${name}"
-
-  # attempt ff-only merge
-  local merge_failed=0
-  git merge --ff-only "task/$name" || merge_failed=1
-
-  if (( merge_failed )); then
-    # abort any partial merge state
-    git merge --abort 2>/dev/null || true
-    echo ""
-    echo "ERROR: task/$name cannot be fast-forwarded into master." >&2
-    echo "       The branch has likely diverged from master." >&2
-    echo "" >&2
-    echo "Diagnose with:" >&2
-    echo "  git log master..task/$name        # commits only on task branch" >&2
-    echo "  git log task/$name..master        # commits master has that task doesn't" >&2
-    echo "" >&2
-    echo "Then decide: rebase task/$name onto master, or merge manually." >&2
-    echo "Worktree and branch are preserved — no changes made." >&2
-    exit 2
-  fi
-
-  echo "Merged task/$name into master (fast-forward)."
-
-  git push origin master || die "git push origin master failed"
-  echo "Pushed master to origin."
-
-  if [ -d "$wt_path" ]; then
-    git worktree remove "$wt_path" || die "git worktree remove $wt_path failed"
-    echo "Removed worktree: $wt_path"
-  else
-    echo "(worktree directory $wt_path not found — skipping worktree remove)"
-  fi
-
-  git branch -d "task/$name" || die "git branch -d task/$name failed"
-  echo "Deleted local branch task/$name."
-
-  git push origin --delete "task/$name" 2>/dev/null \
-    && echo "Deleted remote branch task/$name." \
-    || echo "(remote branch task/$name not found — nothing to delete)"
-
-  echo ""
-  echo "Done. task/$name merged and cleaned up."
-}
-
-cmd_clean() {
-  local main_path
-  main_path=$(git rev-parse --show-toplevel)
-  git fetch origin --quiet 2>/dev/null || true
-
-  local to_remove=()
-
-  # orphaned registered worktrees: branch deleted or fully merged into master
-  local paths
-  paths=$(worktree_paths)
-  while IFS= read -r wt_path; do
-    [ -z "$wt_path" ] && continue
-    local branch
-    branch=$(git -C "$wt_path" rev-parse --abbrev-ref HEAD 2>/dev/null || echo "")
-    [ -z "$branch" ] && { to_remove+=("worktree:$wt_path (unreadable branch)"); continue; }
-
-    # branch gone locally?
-    if ! branch_exists_local "$branch"; then
-      to_remove+=("worktree:$wt_path (branch $branch no longer exists)")
-      continue
-    fi
-
-    # branch fully merged into master?
-    local ahead
-    ahead=$(git rev-list --count "origin/master..${branch}" 2>/dev/null || echo "1")
-    if [ "$ahead" = "0" ]; then
-      to_remove+=("worktree:$wt_path (branch $branch fully merged into origin/master)")
-    fi
-  done <<< "$paths"
-
-  # dangling directories: ../homelab-codex-ws-* not registered
-  local registered_paths
-  registered_paths=$(git worktree list --porcelain | awk '/^worktree /{print $2}')
-  local parent_dir
-  parent_dir=$(dirname "$main_path")
-  while IFS= read -r candidate; do
-    [ -d "$candidate" ] || continue
-    if ! echo "$registered_paths" | grep -qF "$candidate"; then
-      to_remove+=("dangling:$candidate")
-    fi
-  done < <(find "$parent_dir" -maxdepth 1 -name "homelab-codex-ws-*" -type d 2>/dev/null)
-
-  if [ ${#to_remove[@]} -eq 0 ]; then
-    echo "Nothing to clean."
-    return 0
-  fi
-
-  echo "Found ${#to_remove[@]} item(s) to clean:"
-  for entry in "${to_remove[@]}"; do
-    echo "  $entry"
-  done
-  echo ""
-
-  local overall_rc=0
-  for entry in "${to_remove[@]}"; do
-    local kind="${entry%%:*}"
-    local path="${entry#*:}"
-    # strip trailing annotation in parens
-    local raw_path
-    raw_path="${path%% (*}"
-
-    local confirm
-    read -r -p "Remove $kind '$raw_path'? [y/N] " confirm
-    if [[ "$confirm" =~ ^[Yy]$ ]]; then
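-      if [ "$kind" = "worktree" ]; then
-        git worktree remove --force "$raw_path" 2>/dev/null \
-          || { echo "  WARNING: git worktree remove failed, trying rm -rf"; rm -rf "$raw_path" || true; git worktree prune || true; }
-      else
-        rm -rf "$raw_path" || true
-      fi
-      # Report the real outcome so the documented exit code 2 (operation failed) is reachable
-      if [ -e "$raw_path" ]; then
-        echo "  FAILED to remove."
-        overall_rc=2
-      else
-        echo "  Removed."
-      fi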
-    else
-      echo "  Skipped."
-    fi
-  done
-
-  return $overall_rc
-}
-
-usage() {
-  cat <<'EOF'
-Usage: agent.sh <subcommand> [args]
-
-  agent.sh new <name>    Create a new task worktree (branch task/<name>)
-  agent.sh list          List active task worktrees with status
-  agent.sh merge <name>  Fast-forward merge task/<name> into master and clean up
-  agent.sh clean         Remove orphaned or dangling worktrees (interactive)
-
-EXIT: 0 ok, 1 preflight, 2 operation failed.
-EOF
-}
-
-# ── dispatch ──────────────────────────────────────────────────────────────────
-
-SUBCOMMAND="${1:-}"
-shift || true
-
-case "$SUBCOMMAND" in
-  new)   cmd_new   "$@" ;;
-  list)  cmd_list  "$@" ;;
-  merge) cmd_merge "$@" ;;
-  clean) cmd_clean "$@" ;;
-  *)     usage; exit 1  ;;
-esac
--- a/scripts/monitor/health-monitor.sh
+++ b/scripts/monitor/health-monitor.sh
@ -1,338 +0,0 @@
-#!/usr/bin/env bash
-# health-monitor.sh - Homelab node health monitor and safe disk cleanup
-#
-# Designed to run standalone on the host (cron or direct) or to be called by
-# the node-agent Python daemon. All cleanup decisions follow the conservative
-# policy agreed in the design review:
-#
-#  lte_node  (chelsty-infra, chelsty-ha) : NO cleanup at all
-#  sd_card   (piha, saturn)              : dangling images + stopped containers,
-#                                          rate-limited to once per 24 h
-#  ai_node   (solaria)                   : dangling images + stopped containers
-#                                          + build cache (NEVER -a)
-#  standard  (vps)                       : dangling images + stopped containers
-#                                          + build cache
-#
-# VPS additionally rotates control-plane filesystem artefacts:
-#   actions/completed + failed  > 7 days
-#   logs/deploy                 > 30 days
-#   events/**                   > 3 days AND past observer checkpoint
-#
-# NEVER TOUCHED (any node): /opt/homelab/data/, config/, state/,
-#   actions/pending|approved|running, Frigate recordings, Ollama models,
-#   Zigbee2MQTT data, Mosquitto data, HA database/config.
-
-set -euo pipefail
-
-# ---------------------------------------------------------------------------
-# Configuration
-# ---------------------------------------------------------------------------
-RUNTIME_PATH="${RUNTIME_PATH:-/opt/homelab}"
-EVENTS_DIR="${RUNTIME_PATH}/events"
-STATE_DIR="${RUNTIME_PATH}/state"
-LOGS_DIR="${RUNTIME_PATH}/logs"
-ACTIONS_DIR="${RUNTIME_PATH}/actions"
-
-NODE_NAME="${NODE_NAME:-$(hostname)}"
-TIMESTAMP=$(date +%s)
-DATE=$(date -u +%Y-%m-%dT%H:%M:%SZ)
-
-# Thresholds
-DISK_WARN_PCT=75
-DISK_CRIT_PCT=85
-MEM_WARN_PCT=85
-MEM_CRIT_PCT=95
-
-# Rate-limit file for SD-card nodes (max one Docker cleanup per 24 h)
-CLEANUP_LOCK="${STATE_DIR}/last-docker-cleanup"
-CLEANUP_INTERVAL=86400   # seconds
-
-# Node classifications
-LTE_NODES="chelsty-infra chelsty-ha"
-SD_CARD_NODES="piha saturn"
-AI_NODES="solaria"
-
-# ---------------------------------------------------------------------------
-# Helpers
-# ---------------------------------------------------------------------------
-
-log()  { echo "$(date -u +%H:%M:%S) [INFO]  $*"; }
-warn() { echo "$(date -u +%H:%M:%S) [WARN]  $*" >&2; }
-err()  { echo "$(date -u +%H:%M:%S) [ERROR] $*" >&2; }
-
-contains() {
-    local word="$1"; shift
-    for w in "$@"; do [[ "$w" == "$word" ]] && return 0; done
-    return 1
-}
-
-get_node_type() {
-    # shellcheck disable=SC2086
-    if contains "$NODE_NAME" $LTE_NODES;    then echo "lte_node";  return; fi
-    if contains "$NODE_NAME" $SD_CARD_NODES; then echo "sd_card";   return; fi
-    if contains "$NODE_NAME" $AI_NODES;     then echo "ai_node";   return; fi
-    echo "standard"
-}
-
-# ---------------------------------------------------------------------------
-# Event emission
-# ---------------------------------------------------------------------------
-
-emit_event() {
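-    local type="$1" severity="$2" service="${3:-}" message="$4" payload="${5:-}"
-    # Default applied separately: "${5:-{}}" ends the expansion at the first '}'
-    # and appends a stray '}' to every payload that is actually passed in.
-    [[ -n "${payload}" ]] || payload='{}'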
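-    # Include the service so several events of one type in the same run
-    # (e.g. one per container) do not overwrite each other's file.
-    local id="evt-${NODE_NAME}-${TIMESTAMP}-${type}${service:+-${service}}"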
-    local dir="${EVENTS_DIR}/${NODE_NAME}"
-    mkdir -p "$dir"
-    cat > "${dir}/${id}.json" <<EOF
-{
-  "id": "${id}",
-  "timestamp": ${TIMESTAMP},
-  "date": "${DATE}",
-  "type": "${type}",
-  "severity": "${severity}",
-  "node": "${NODE_NAME}",
-  "service": "${service}",
-  "message": "${message}",
-  "payload": ${payload}
-}
-EOF
-}
-
-# ---------------------------------------------------------------------------
-# Health checks
-# ---------------------------------------------------------------------------
-
-check_disk() {
-    # Use /opt/homelab as the check target — it lives on the host filesystem
-    # and this path is correct both when running natively and in a container
-    # that mounts /opt/homelab from the host.
-    local mount="${RUNTIME_PATH}"
-    local usage_pct avail_mb total_mb
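-    # -P keeps each filesystem on a single line even with long device names;
-    # -k pins the block size to 1 KiB so the MB arithmetic below holds.
-    usage_pct=$(df -Pk "${mount}" 2>/dev/null | awk 'NR==2 {gsub(/%/,"",$5); print $5}') || return
-    avail_mb=$(df -Pk  "${mount}" 2>/dev/null | awk 'NR==2 {printf "%d", $4/1024}')       || return
-    total_mb=$(df -Pk  "${mount}" 2>/dev/null | awk 'NR==2 {printf "%d", $2/1024}')       || return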
-
-    if [[ "${usage_pct}" -ge "${DISK_CRIT_PCT}" ]]; then
-        warn "Disk CRITICAL: ${usage_pct}% used (${avail_mb} MB free)"
-        emit_event "disk_pressure" "high" "" \
-            "Disk usage critical: ${usage_pct}% on ${mount} (${avail_mb} MB free)" \
-            "{\"usage_pct\": ${usage_pct}, \"avail_mb\": ${avail_mb}, \"total_mb\": ${total_mb}, \"mount\": \"${mount}\"}"
-    elif [[ "${usage_pct}" -ge "${DISK_WARN_PCT}" ]]; then
-        warn "Disk elevated: ${usage_pct}% used"
-        emit_event "disk_pressure" "medium" "" \
-            "Disk usage elevated: ${usage_pct}% on ${mount} (${avail_mb} MB free)" \
-            "{\"usage_pct\": ${usage_pct}, \"avail_mb\": ${avail_mb}, \"total_mb\": ${total_mb}, \"mount\": \"${mount}\"}"
-    fi
-    echo "${usage_pct}"
-}
-
-check_memory() {
-    local total avail pct avail_mb
-    total=$(awk '/^MemTotal/ {print $2}' /proc/meminfo)
-    avail=$(awk '/^MemAvailable/ {print $2}' /proc/meminfo)
-    pct=$(( (total - avail) * 100 / total ))
-    avail_mb=$(( avail / 1024 ))
-
-    if [[ "${pct}" -ge "${MEM_CRIT_PCT}" ]]; then
-        warn "Memory CRITICAL: ${pct}% used"
-        emit_event "high_memory" "high" "" \
-            "Memory usage critical: ${pct}% (${avail_mb} MB available)" \
-            "{\"usage_pct\": ${pct}, \"avail_mb\": ${avail_mb}, \"total_mb\": $((total/1024))}"
-    elif [[ "${pct}" -ge "${MEM_WARN_PCT}" ]]; then
-        warn "Memory elevated: ${pct}%"
-        emit_event "high_memory" "medium" "" \
-            "Memory usage elevated: ${pct}% (${avail_mb} MB available)" \
-            "{\"usage_pct\": ${pct}, \"avail_mb\": ${avail_mb}, \"total_mb\": $((total/1024))}"
-    fi
-    echo "${pct}"
-}
-
-check_cpu() {
-    # CPU usage averaged over a 1-second window, from the delta of two /proc/stat samples.
-    local idle1 total1 idle2 total2 pct
-    read -r idle1 total1 < <(awk '/^cpu / {idle=$5; total=0; for(i=2;i<=NF;i++) total+=$i; print idle, total}' /proc/stat)
-    sleep 1
-    read -r idle2 total2 < <(awk '/^cpu / {idle=$5; total=0; for(i=2;i<=NF;i++) total+=$i; print idle, total}' /proc/stat)
-
-    local d_idle=$(( idle2 - idle1 ))
-    local d_total=$(( total2 - total1 ))
-    pct=$(( d_total > 0 ? 100 - d_idle * 100 / d_total : 0 ))
-
-    if [[ "${pct}" -ge 90 ]]; then
-        warn "CPU elevated: ${pct}%"
-        emit_event "high_cpu" "medium" "" \
-            "CPU usage elevated: ${pct}%" \
-            "{\"usage_pct\": ${pct}}"
-    fi
-    echo "${pct}"
-}
-
-check_containers() {
-    command -v docker &>/dev/null || return
-
-    # Exited containers that belong to a Compose project. The filter does not inspect
-    # the restart policy; Compose-managed containers are assumed to be long-running.
-    local cname
-    while IFS= read -r cname; do
-        [[ -z "$cname" ]] && continue
-        warn "Container exited (should be running): ${cname}"
-        emit_event "containers_not_running" "high" "${cname}" \
-            "Container '${cname}' has exited unexpectedly (restart=unless-stopped)" \
-            "{\"container\": \"${cname}\"}"
-    done < <(docker ps -a \
-        --filter "status=exited" \
-        --filter "label=com.docker.compose.project" \
-        --format "{{.Names}}" 2>/dev/null || true)
-
-    # Containers that are running but their health check is failing
-    while IFS= read -r cname; do
-        [[ -z "$cname" ]] && continue
-        warn "Container unhealthy: ${cname}"
-        emit_event "healthcheck_failed" "high" "${cname}" \
-            "Container '${cname}' is running but health check is failing" \
-            "{\"container\": \"${cname}\"}"
-    done < <(docker ps \
-        --filter "health=unhealthy" \
-        --format "{{.Names}}" 2>/dev/null || true)
-}
-
-# ---------------------------------------------------------------------------
-# Safe Docker cleanup (per policy)
-# ---------------------------------------------------------------------------
-
-_sd_card_rate_ok() {
-    if [[ -f "${CLEANUP_LOCK}" ]]; then
-        local last_ts elapsed
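-        last_ts=$(cat "${CLEANUP_LOCK}" 2>/dev/null || echo 0)
-        # An empty or corrupt stamp would break the arithmetic below; treat it as "never run".
-        [[ "${last_ts}" =~ ^[0-9]+$ ]] || last_ts=0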
-        elapsed=$(( TIMESTAMP - last_ts ))
-        if [[ "${elapsed}" -lt "${CLEANUP_INTERVAL}" ]]; then
-            log "Docker cleanup skipped: last run ${elapsed}s ago (limit ${CLEANUP_INTERVAL}s)"
-            return 1
-        fi
-    fi
-    return 0
-}
-
-_mark_cleanup_done() {
-    echo "${TIMESTAMP}" > "${CLEANUP_LOCK}"
-}
-
-run_safe_cleanup() {
-    command -v docker &>/dev/null || return
-    local node_type
-    node_type=$(get_node_type)
-
-    case "${node_type}" in
-        lte_node)
-            # NO cleanup on LTE nodes. Any docker operation risks triggering
-            # a pull over a metered/intermittent connection.
-            log "Skipping Docker cleanup: LTE node (${NODE_NAME})"
-            ;;
-
-        sd_card)
-            # Dangling images + stopped containers only.
-            # Rate-limited to once per 24 hours to protect SD card write endurance.
-            _sd_card_rate_ok || return
-            log "Running rate-limited Docker cleanup (SD card node)"
-            docker image prune -f     >/dev/null 2>&1 || true
-            docker container prune -f >/dev/null 2>&1 || true
-            _mark_cleanup_done
-            ;;
-
-        ai_node)
-            # Dangling images + stopped containers + build cache.
-            # NEVER docker image prune -a (would remove Ollama runtime images,
-            # requiring a multi-hour re-pull of model weights).
-            log "Running AI-node Docker cleanup (dangling images + containers + build cache)"
-            docker image prune -f     >/dev/null 2>&1 || true
-            docker container prune -f >/dev/null 2>&1 || true
-            docker builder prune -f   >/dev/null 2>&1 || true
-            ;;
-
-        standard)
-            # VPS and other standard nodes: full safe cleanup.
-            log "Running standard Docker cleanup"
-            docker image prune -f     >/dev/null 2>&1 || true
-            docker container prune -f >/dev/null 2>&1 || true
-            docker builder prune -f   >/dev/null 2>&1 || true
-            ;;
-    esac
-}
-
-# ---------------------------------------------------------------------------
-# VPS-specific: control-plane filesystem rotation
-# ---------------------------------------------------------------------------
-
-cleanup_control_plane_fs() {
-    log "Running control-plane filesystem rotation"
-
-    # Completed / failed actions older than 7 days
-    for status in completed failed; do
-        local dir="${ACTIONS_DIR}/${status}"
-        [[ -d "${dir}" ]] || continue
-        find "${dir}" -name "*.json" -mtime +7 -delete 2>/dev/null && \
-            log "Cleaned ${status} actions older than 7 days" || true
-    done
-
-    # Deploy logs older than 30 days
-    local deploy_logs="${LOGS_DIR}/deploy"
-    if [[ -d "${deploy_logs}" ]]; then
-        find "${deploy_logs}" -name "*.log" -mtime +30 -delete 2>/dev/null && \
-            log "Cleaned deploy logs older than 30 days" || true
-    fi
-
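-    # Event files older than 3 days AND already past the observer checkpoint.
-    # The dual condition ensures we never delete an event the observer hasn't seen.
-    # The observer keeps one checkpoint per node directory under "node_checkpoints";
-    # the legacy single "last_processed_file" key is honoured for its own directory only.
-    local checkpoint="${STATE_DIR}/observer_checkpoint.json"
-    if [[ -f "${checkpoint}" ]] && command -v python3 &>/dev/null; then
-        local checkpoints
-        checkpoints=$(python3 -c '
-import json, os, sys
-try:
-    d = json.load(open(sys.argv[1]))
-except Exception:
-    sys.exit(0)
-events_dir = sys.argv[2].rstrip(os.sep)
-cps = dict(d.get("node_checkpoints") or {})
-old = d.get("last_processed_file")
-if old and not cps and old.startswith(events_dir + os.sep):
-    cps[old[len(events_dir) + 1:].split(os.sep)[0]] = old
-for node, path in cps.items():
-    if node and path and os.sep not in node and node not in (".", ".."):
-        print(node + "\t" + path)
-' "${checkpoint}" "${EVENTS_DIR}" 2>/dev/null || echo "")
-
-        if [[ -n "${checkpoints}" ]]; then
-            local node_dir last_processed f
-            while IFS=$'\t' read -r node_dir last_processed; do
-                [[ -d "${EVENTS_DIR}/${node_dir}" ]] || continue
-                # Only delete files that sort before this node's checkpoint path
-                # (i.e., the observer has already processed them).
-                while IFS= read -r f; do
-                    if [[ "$f" < "${last_processed}" ]]; then
-                        rm -f "$f"
-                        log "Cleaned old event: $(basename "$f")"
-                    fi
-                done < <(find "${EVENTS_DIR}/${node_dir}" -name "*.json" -mtime +3)
-            done <<< "${checkpoints}"
-        else
-            log "No observer checkpoint set; skipping event file cleanup"
-        fi
-    fi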
-}
-
-# ---------------------------------------------------------------------------
-# Main
-# ---------------------------------------------------------------------------
-
-mkdir -p "${EVENTS_DIR}/${NODE_NAME}" "${STATE_DIR}"
-
-log "Health check starting on ${NODE_NAME} (type=$(get_node_type))"
-
-disk_pct=$(check_disk || echo 0)
-mem_pct=$(check_memory || echo 0)
-cpu_pct=$(check_cpu || echo 0)
-check_containers
-
-run_safe_cleanup
-
-# VPS: also rotate control-plane filesystem artefacts
-if [[ "${NODE_NAME}" == "vps" ]]; then
-    cleanup_control_plane_fs
-fi
-
-# Emit a node_health heartbeat so the observer can update node status
-# and the supervisor can see up-to-date resource metrics.
-emit_event "node_health" "info" "" \
-    "Health check completed on ${NODE_NAME}" \
-    "{\"disk_pct\": ${disk_pct}, \"mem_pct\": ${mem_pct}, \"cpu_pct\": ${cpu_pct}}"
-
-log "Health check complete (disk=${disk_pct}% mem=${mem_pct}% cpu=${cpu_pct}%)"
--- a/scripts/observer/observer.py
+++ b/scripts/observer/observer.py
@ -1,546 +0,0 @@
-import os
-import json
-import time
-import glob
-import logging
-import yaml
-from datetime import datetime, timezone
-from pathlib import Path
-
-
-def _atomic_write_json(path: Path, data) -> None:
-    """Write JSON atomically: write to a sibling .tmp, fsync, then os.replace."""
-    tmp = path.with_suffix(".tmp")
-    with open(tmp, "w") as f:
-        json.dump(data, f, indent=2)
-        f.flush()
-        os.fsync(f.fileno())
-    os.replace(tmp, path)
-
-
-def _parse_ts(ts) -> float:
-    """Return a Unix timestamp float from ts, which may be int/float or an ISO-8601 string.
-
-    Events from node-agent use int(time.time()); events from stability-agent / events.py
-    use ISO format ('2026-06-03T10:30:00Z').  Both appear in incident fields such as
-    last_occurrence and resolved_at, so any arithmetic on them must go through here.
-    Returns 0.0 on None or unparseable input so callers can use plain comparisons.
-    """
-    if ts is None:
-        return 0.0
-    if isinstance(ts, (int, float)):
-        return float(ts)
-    try:
-        return datetime.fromisoformat(str(ts).replace("Z", "+00:00")).timestamp()
-    except Exception:
-        return 0.0
-
-# Constants and Paths
-RUNTIME_PATH = os.getenv("RUNTIME_PATH", "/opt/homelab")
-EVENTS_DIR = Path(RUNTIME_PATH) / "events"
-STATE_DIR = Path(RUNTIME_PATH) / "state"
-LOGS_DIR = Path(RUNTIME_PATH) / "logs"
-WORLD_DIR = Path(RUNTIME_PATH) / "world"
-OBSERVER_STATE_FILE = STATE_DIR / "observer_checkpoint.json"
-FAILED_EVENTS_DIR = STATE_DIR / "observer_failed_events"
-
-REPO_ROOT = Path(__file__).parent.parent.parent
-INVENTORY_TOPOLOGY = REPO_ROOT / "inventory" / "topology.yaml"
-
-# Logging setup
-logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
-logger = logging.getLogger("observer")
-
-class Observer:
-    def __init__(self):
-        # Per-node-directory checkpoint: {"vps": "last/file/path", "piha": "last/file/path"}
-        # Replaces the old single last_processed_file which silently skipped event dirs
-        # that sort alphabetically before the checkpoint (e.g. piha/ < vps/).
-        self.node_checkpoints: dict = {}
-        self.world_state = {
-            "nodes": {},
-            "services": {},
-            "deployments": {},
-            "incidents": {},
-            "summary": {
-                "last_update": datetime.now(timezone.utc).isoformat(),
-                "status": "initializing",
-                "active_incidents_count": 0
-            }
-        }
-        self.inventory = self._load_inventory()
-        self._ensure_dirs()
-        self._load_checkpoint()
-
-    def _ensure_dirs(self):
-        WORLD_DIR.mkdir(parents=True, exist_ok=True)
-        STATE_DIR.mkdir(parents=True, exist_ok=True)
-        EVENTS_DIR.mkdir(parents=True, exist_ok=True)
-        LOGS_DIR.mkdir(parents=True, exist_ok=True)
-        FAILED_EVENTS_DIR.mkdir(parents=True, exist_ok=True)
-
-    def _quarantine_event_file(self, file_path: str, node_dir: str, exc: Exception) -> None:
-        """Move an unreadable/unprocessable event out of the hot path."""
-        src = Path(file_path)
-        dest_dir = FAILED_EVENTS_DIR / node_dir
-        dest_dir.mkdir(parents=True, exist_ok=True)
-        dest = dest_dir / src.name
-        if dest.exists():
-            dest = dest_dir / f"{src.stem}-{int(time.time())}{src.suffix}"
-        try:
-            os.replace(src, dest)
-            logger.error(
-                "Quarantined bad event for node_dir=%s: %s -> %s (%s: %s)",
-                node_dir, src, dest, type(exc).__name__, exc,
-            )
-        except Exception as move_exc:
-            logger.error(
-                "Failed to quarantine bad event for node_dir=%s: %s (%s: %s); move error=%s: %s",
-                node_dir, src, type(exc).__name__, exc, type(move_exc).__name__, move_exc,
-            )
-
-    def _load_inventory(self):
-        inventory = {"nodes": {}, "services": {}}
-        try:
-            if INVENTORY_TOPOLOGY.exists():
-                with open(INVENTORY_TOPOLOGY, "r") as f:
-                    topo = yaml.safe_load(f) or {}  # an empty file parses to None
-                    for node_name, node_info in topo.get("nodes", {}).items():
-                        inventory["nodes"][node_name] = {
-                            "roles": node_info.get("roles", []),
-                            "connectivity": node_info.get("connectivity", {})
-                        }
-
-            # Load service assignments from hosts files
-            hosts_dir = REPO_ROOT / "hosts"
-            for host_dir in hosts_dir.iterdir():
-                if host_dir.is_dir():
-                    svc_file = host_dir / "services.yaml"
-                    if svc_file.exists():
-                        with open(svc_file, "r") as f:
-                            svc_data = yaml.safe_load(f) or {}  # an empty file parses to None
-                            host_name = svc_data.get("host")
-                            for svc_name, svc_info in svc_data.get("services", {}).items():
-                                if host_name not in inventory["services"]:
-                                    inventory["services"][host_name] = {}
-                                inventory["services"][host_name][svc_name] = {
-                                    "role": svc_info.get("role"),
-                                    "exposure": svc_info.get("exposure")
-                                }
-        except Exception as e:
-            logger.error(f"Failed to load inventory: {e}")
-        return inventory
-
-    def _load_checkpoint(self):
-        if OBSERVER_STATE_FILE.exists():
-            try:
-                with open(OBSERVER_STATE_FILE, "r") as f:
-                    checkpoint = json.load(f)
-
-                if "node_checkpoints" in checkpoint:
-                    # New format: per-directory checkpoints.
-                    self.node_checkpoints = checkpoint["node_checkpoints"]
-                elif "last_processed_file" in checkpoint:
-                    # Migrate old single-file checkpoint: extract node dir from path.
-                    old = checkpoint["last_processed_file"]
-                    if old:
-                        try:
-                            node_dir = Path(old).relative_to(EVENTS_DIR).parts[0]
-                            self.node_checkpoints = {node_dir: old}
-                            logger.info(f"Migrated old checkpoint → node_checkpoints: {self.node_checkpoints}")
-                        except Exception:
-                            pass  # Bad path — start fresh
-
-                self._load_world_from_disk()
-            except Exception as e:
-                logger.error(f"Failed to load checkpoint: {e}")
-
-    def _load_world_from_disk(self):
-        # Optional: Load existing state to resume faster
-        files = {
-            "nodes": WORLD_DIR / "nodes.json",
-            "services": WORLD_DIR / "services.json",
-            "deployments": WORLD_DIR / "deployments.json",
-            "incidents": WORLD_DIR / "incidents.json",
-            "summary": WORLD_DIR / "runtime-summary.json"
-        }
-        for key, path in files.items():
-            if path.exists():
-                try:
-                    with open(path, "r") as f:
-                        self.world_state[key] = json.load(f)
-                except Exception as e:
-                    logger.error(f"Failed to load {key} state: {e}")
-
-    def _save_checkpoint(self):
-        try:
-            _atomic_write_json(OBSERVER_STATE_FILE, {"node_checkpoints": self.node_checkpoints})
-        except Exception as e:
-            logger.error(f"Failed to save checkpoint: {e}")
-
-    def _prune_stale_world(self):
-        """Remove world-state entries for nodes absent from the topology inventory.
-
-        Root cause this guards against: when NODE_NAME env var is unset, node_agent.py
-        falls back to socket.gethostname(), which inside a Docker container returns the
-        12-char hex container ID (e.g. 'be17cb6eb0f6') instead of the canonical host name
-        ('vps').  The observer ingests those events and creates ghost entries that never
-        expire on their own.
-
-        Also ages out resolved incidents older than 7 days to keep world state lean.
-        """
-        known_nodes = set(self.inventory["nodes"].keys())
-        if not known_nodes:
-            # Inventory failed to load — don't prune to avoid wiping valid state.
-            return
-
-        stale_nodes = [n for n in list(self.world_state["nodes"].keys())
-                       if n not in known_nodes]
-        for n in stale_nodes:
-            logger.info(f"Pruning stale node from world state: {n}")
-            del self.world_state["nodes"][n]
-
-        stale_svcs = [k for k in list(self.world_state["services"].keys())
-                      if k.split("/")[0] in stale_nodes]
-        for k in stale_svcs:
-            logger.info(f"Pruning stale service from world state: {k}")
-            del self.world_state["services"][k]
-
-        # Prune ghost service keys whose service-name portion is a hash-prefixed
-        # Docker stale-state artifact (e.g. "9e36297651e7_control-plane-observer").
-        # These are created when node-agent incorrectly uses c.name instead of the
-        # compose label, and accumulate on every container rebuild.
-        # Pattern: <node>/<12hexchars>_<real-name>
-        ghost_svcs = [
-            k for k in list(self.world_state["services"].keys())
-            if len(k.split("/", 1)) == 2
-            and len(k.split("/", 1)[1]) > 13
-            and k.split("/", 1)[1][12] == "_"
-            and all(ch in "0123456789abcdef" for ch in k.split("/", 1)[1][:12])
-        ]
-        for k in ghost_svcs:
-            logger.info(f"Pruning ghost (hash-prefixed) service key from world state: {k}")
-            del self.world_state["services"][k]
-
-        now = time.time()
-
-        try:
-            # Collect incident_ids currently referenced by any service entry.
-            linked_ids: set = {
-                svc.get("incident_id")
-                for svc in self.world_state["services"].values()
-                if svc.get("incident_id")
-            }
-
-            # Case 1 — service is healthy but still points at an active incident.
-            # process_event already calls _resolve_incident on service_healthy events,
-            # but if the observer restarted with on-disk state where the link was
-            # intact (inconsistency from a pre-atomic-write crash), it may not get
-            # resolved until the next service_healthy event is processed.  Resolve
-            # immediately — a healthy service cannot have an ongoing incident.
-            for svc_key, svc in self.world_state["services"].items():
-                if svc.get("status") != "healthy":
-                    continue
-                inc_id = svc.get("incident_id")
-                if not inc_id:
-                    continue
-                inc = self.world_state["incidents"].get(inc_id, {})
-                if inc.get("status") == "active":
-                    logger.info(
-                        f"Auto-resolving incident {inc_id} for {svc_key}: "
-                        f"service is healthy"
-                    )
-                    inc["status"] = "resolved"
-                    inc["resolved_at"] = now
-                    svc["incident_id"] = None
-                    linked_ids.discard(inc_id)
-
-            # Case 2 — orphaned active incident: no service entry links to it and
-            # last_occurrence is older than 5 minutes (guard against creation races).
-            # These are the stale records left behind when on-disk state was
-            # inconsistent: the service entry had incident_id cleared but incidents.json
-            # still had the record as "active".
-            for inc_id, inc in self.world_state["incidents"].items():
-                if inc.get("status") != "active":
-                    continue
-                if inc_id in linked_ids:
-                    continue
-                age = now - _parse_ts(inc.get("last_occurrence"))
-                if age > 300:  # 5-minute guard
-                    logger.info(
-                        f"Auto-resolving orphaned incident {inc_id} "
-                        f"(service={inc.get('service')}, node={inc.get('node')}): "
-                        f"no service references it, age={int(age)}s"
-                    )
-                    inc["status"] = "resolved"
-                    inc["resolved_at"] = now
-
-        except Exception as exc:
-            logger.error(f"Error during incident auto-resolve in _prune_stale_world: {exc}")
-
-        # Remove resolved incidents older than 7 days.
-        # Use _parse_ts so ISO-string resolved_at values are handled correctly.
-        stale_incidents = [
-            k for k, v in self.world_state["incidents"].items()
-            if v.get("status") == "resolved"
-            and now - _parse_ts(v.get("resolved_at")) > 7 * 86400
-        ]
-        for k in stale_incidents:
-            del self.world_state["incidents"][k]
-
-    def _save_world(self):
-        self.world_state["summary"]["last_update"] = datetime.now(timezone.utc).isoformat()
-        active_incidents = [
-            k for k, v in self.world_state["incidents"].items() if v.get("status") == "active"
-        ]
-        self.world_state["summary"]["active_incidents_count"] = len(active_incidents)
-        self.world_state["summary"]["node_count"] = len(self.world_state["nodes"])
-        self.world_state["summary"]["service_count"] = len(self.world_state["services"])
-
-        if active_incidents:
-            self.world_state["summary"]["status"] = "degraded"
-        else:
-            self.world_state["summary"]["status"] = "nominal"
-
-        files = {
-            "nodes.json": self.world_state["nodes"],
-            "services.json": self.world_state["services"],
-            "deployments.json": self.world_state["deployments"],
-            "incidents.json": self.world_state["incidents"],
-            "recommendations.json": [],
-            "runtime-summary.json": self.world_state["summary"]
-        }
-        for filename, data in files.items():
-            try:
-                _atomic_write_json(WORLD_DIR / filename, data)
-            except Exception as e:
-                logger.error(f"Failed to save {filename}: {e}")
-
-    def process_event(self, event):
-        etype = event.get("type")
-        node = event.get("node")
-        service = event.get("service")
-        severity = event.get("severity")
-        timestamp = event.get("timestamp")
-        cid = event.get("correlation_id")
-        payload = event.get("payload", {})
-
-        # 1. Update Node State
-        if node not in self.world_state["nodes"]:
-            self.world_state["nodes"][node] = {
-                "status": "unknown",
-                "last_seen": None,
-                "roles": self.inventory["nodes"].get(node, {}).get("roles", [])
-            }
-        self.world_state["nodes"][node]["last_seen"] = timestamp
-
-        if etype == "node_online":
-            self.world_state["nodes"][node]["status"] = "online"
-        elif etype == "node_offline":
-            self.world_state["nodes"][node]["status"] = "offline"
-
-        elif etype == "node_health":
-            # Regular heartbeat from node-agent; updates resource metrics.
-            # Clears disk_pressure if disk is now healthy (< warn threshold).
-            self.world_state["nodes"][node]["status"] = "online"
-            self.world_state["nodes"][node].update({
-                "disk_usage_pct": payload.get("disk_pct"),
-                "mem_usage_pct":  payload.get("mem_pct"),
-                "cpu_usage_pct":  payload.get("cpu_pct"),
-            })
-            if (payload.get("disk_pct") or 0) < 75:
-                self.world_state["nodes"][node].pop("disk_pressure", None)
-
-        elif etype == "disk_pressure":
-            # Emitted when disk usage crosses 75 % (medium) or 85 % (high).
-            # The supervisor reads disk_pressure to generate disk_cleanup actions.
-            self.world_state["nodes"][node]["disk_pressure"] = severity
-            self.world_state["nodes"][node]["disk_usage_pct"] = payload.get("usage_pct")
-
-        elif etype == "high_memory":
-            # Memory pressure observation; recorded on the node for correlation.
-            # No automated action — operator decides if a container restart helps.
-            self.world_state["nodes"][node]["memory_pressure"] = severity
-            self.world_state["nodes"][node]["mem_usage_pct"] = payload.get("usage_pct")
-
-        elif etype == "high_cpu":
-            # CPU pressure observation; recorded for visibility.
-            self.world_state["nodes"][node]["cpu_pressure"] = severity
-            self.world_state["nodes"][node]["cpu_usage_pct"] = payload.get("usage_pct")
-
-        # 2. Update Service State
-        if service and service != "all":
-            svc_key = f"{node}/{service}"
-            if svc_key not in self.world_state["services"]:
-                self.world_state["services"][svc_key] = {
-                    "node": node,
-                    "service": service,
-                    "status": "unknown",
-                    "last_check": None,
-                    "incident_id": None
-                }
-            self.world_state["services"][svc_key]["last_check"] = timestamp
-
-            if etype == "service_recovered":
-                self.world_state["services"][svc_key]["status"] = "healthy"
-                self._resolve_incident(svc_key, timestamp)
-            elif etype == "service_healthy":
-                # Positive confirmation from node-agent that a managed container
-                # is running. This keeps services.json populated so the supervisor
-                # can correctly detect drift (absent entry = never reported = unknown,
-                # not the same as confirmed missing).
-                # Also resolve any active incident — if a service that had been
-                # unhealthy/crashing is now confirmed healthy, the incident is over.
-                self.world_state["services"][svc_key]["status"] = "healthy"
-                self._resolve_incident(svc_key, timestamp)
-            elif etype in ["service_unhealthy", "healthcheck_failed"]:
-                self.world_state["services"][svc_key]["status"] = "unhealthy"
-                self._handle_incident(svc_key, event)
-
-        # 3. Update Deployment State
-        if etype.startswith("deployment_") and cid:
-            if cid not in self.world_state["deployments"]:
-                self.world_state["deployments"][cid] = {
-                    "node": node,
-                    "service": service,
-                    "status": "unknown",
-                    "started_at": None,
-                    "finished_at": None,
-                    "events": []
-                }
-            self.world_state["deployments"][cid]["events"].append({
-                "type": etype,
-                "timestamp": timestamp,
-                "payload": payload
-            })
-            if etype == "deployment_started":
-                self.world_state["deployments"][cid]["status"] = "in_progress"
-                self.world_state["deployments"][cid]["started_at"] = timestamp
-            elif etype == "deployment_completed":
-                self.world_state["deployments"][cid]["status"] = "completed"
-                self.world_state["deployments"][cid]["finished_at"] = timestamp
-            elif etype == "deployment_failed":
-                self.world_state["deployments"][cid]["status"] = "failed"
-                self.world_state["deployments"][cid]["finished_at"] = timestamp
-                # Deployment failure often creates an incident
-                self._handle_deployment_failure(event)
-
-    def _handle_incident(self, svc_key, event):
-        # Correlation: collapse repeated failures for the same service on the same node
-        active_incident = self.world_state["services"][svc_key].get("incident_id")
-        
-        if active_incident and active_incident in self.world_state["incidents"]:
-            incident = self.world_state["incidents"][active_incident]
-            if incident["status"] == "active":
-                incident["last_occurrence"] = event["timestamp"]
-                incident["occurrence_count"] = incident.get("occurrence_count", 1) + 1
-                incident["events"].append(event["timestamp"])
-                return
-
-        # Create new incident
-        incident_id = f"inc-{int(time.time())}-{event.get('node')}-{event.get('service')}"
-        self.world_state["incidents"][incident_id] = {
-            "id": incident_id,
-            "node": event.get("node"),
-            "service": event.get("service"),
-            "status": "active",
-            "severity": event.get("severity"),
-            # trigger_type records the event type that opened this incident so that
-            # the supervisor can choose the appropriate remediation action
-            # (e.g. container_restart for containers_not_running / mqtt_unreachable
-            # vs. a full redeploy for other causes).
-            "trigger_type": event.get("type"),
-            "started_at": event.get("timestamp"),
-            "last_occurrence": event.get("timestamp"),
-            "occurrence_count": 1,
-            "events": [event["timestamp"]],
-            "correlation_id": event.get("correlation_id")
-        }
-        self.world_state["services"][svc_key]["incident_id"] = incident_id
-
-    def _resolve_incident(self, svc_key, timestamp):
-        incident_id = self.world_state["services"][svc_key].get("incident_id")
-        if incident_id and incident_id in self.world_state["incidents"]:
-            if self.world_state["incidents"][incident_id]["status"] == "active":
-                self.world_state["incidents"][incident_id]["status"] = "resolved"
-                self.world_state["incidents"][incident_id]["resolved_at"] = timestamp
-        self.world_state["services"][svc_key]["incident_id"] = None
-
-    def _handle_deployment_failure(self, event):
-        # Specific logic for deployment failures
-        svc_key = f"{event.get('node')}/{event.get('service')}"
-        self._handle_incident(svc_key, event)
-        
-        # Link diagnostics if available in payload
-        incident_id = self.world_state["services"][svc_key].get("incident_id")
-        if incident_id and incident_id in self.world_state["incidents"]:
-            payload = event.get("payload", {})
-            if "diagnostics_file" in payload:
-                self.world_state["incidents"][incident_id]["diagnostics_ref"] = payload["diagnostics_file"]
-            elif "error" in payload:
-                self.world_state["incidents"][incident_id]["last_error"] = payload["error"]
-
-    def run_once(self):
-        # Update heartbeat
-        heartbeat_file = STATE_DIR / "observer.heartbeat"
-        try:
-            heartbeat_file.touch()
-        except Exception as e:
-            logger.error(f"Failed to touch heartbeat file: {e}")
-
-        # Collect all event files grouped by node directory.
-        # Per-node checkpoints are compared within each directory independently,
-        # so late-arriving events from remote nodes (sorted earlier in the path)
-        # are never skipped just because another node's checkpoint is further ahead.
-        all_files = sorted(glob.glob(str(EVENTS_DIR / "**" / "*.json"), recursive=True))
-
-        new_files = []
-        for file_path in all_files:
-            try:
-                node_dir = str(Path(file_path).relative_to(EVENTS_DIR).parts[0])
-            except (IndexError, ValueError):
-                node_dir = "__unknown__"
-            last_for_node = self.node_checkpoints.get(node_dir, "")
-            if file_path > last_for_node:
-                new_files.append((node_dir, file_path))
-
-        if not new_files:
-            # Even if no new events, prune stale entries and refresh summary freshness.
-            self._prune_stale_world()
-            self._save_world()
-            return
-
-        logger.info(f"Processing {len(new_files)} new events across "
-                    f"{len({n for n, _ in new_files})} node(s)")
-        for node_dir, file_path in new_files:
-            try:
-                with open(file_path, "r") as f:
-                    event = json.load(f)
-                    self.process_event(event)
-                # Advance per-node checkpoint (only forward — no regression).
-                if file_path > self.node_checkpoints.get(node_dir, ""):
-                    self.node_checkpoints[node_dir] = file_path
-            except Exception as e:
-                logger.error(
-                    "Error processing node_dir=%s file=%s (%s: %s)",
-                    node_dir, file_path, type(e).__name__, e,
-                )
-                self._quarantine_event_file(file_path, node_dir, e)
-
-        self._save_checkpoint()
-        self._prune_stale_world()
-        self._save_world()
-
-    def loop(self, interval=5):
-        logger.info("Starting observer loop")
-        while True:
-            self.run_once()
-            time.sleep(interval)
-
-if __name__ == "__main__":
-    import sys
-    observer = Observer()
-    if "--run-once" in sys.argv:
-        observer.run_once()
-    else:
-        observer.loop()
--- a/scripts/observer/test_setup.sh
+++ b/scripts/observer/test_setup.sh
@ -1,83 +0,0 @@
-#!/usr/bin/env bash
-mkdir -p /tmp/homelab/events/2026-05-12/saturn
-mkdir -p /tmp/homelab/state
-mkdir -p /tmp/homelab/logs
-mkdir -p /tmp/homelab/world
-
-cat <<EOF > /tmp/homelab/events/2026-05-12/saturn/120000_node_online_1.json
-{
-  "timestamp": "2026-05-12T12:00:00Z",
-  "node": "saturn",
-  "type": "node_online",
-  "severity": "info",
-  "source": "system",
-  "service": "all",
-  "correlation_id": "init",
-  "payload": {}
-}
-EOF
-
-cat <<EOF > /tmp/homelab/events/2026-05-12/saturn/120500_service_unhealthy_1.json
-{
-  "timestamp": "2026-05-12T12:05:00Z",
-  "node": "saturn",
-  "type": "service_unhealthy",
-  "severity": "error",
-  "source": "healthcheck",
-  "service": "mosquitto",
-  "correlation_id": "hc-1",
-  "payload": {"error": "connection refused"}
-}
-EOF
-
-cat <<EOF > /tmp/homelab/events/2026-05-12/saturn/120600_service_unhealthy_2.json
-{
-  "timestamp": "2026-05-12T12:06:00Z",
-  "node": "saturn",
-  "type": "service_unhealthy",
-  "severity": "error",
-  "source": "healthcheck",
-  "service": "mosquitto",
-  "correlation_id": "hc-2",
-  "payload": {"error": "connection refused"}
-}
-EOF
-
-cat <<EOF > /tmp/homelab/events/2026-05-12/saturn/121000_service_recovered_1.json
-{
-  "timestamp": "2026-05-12T12:10:00Z",
-  "node": "saturn",
-  "type": "service_recovered",
-  "severity": "info",
-  "source": "healthcheck",
-  "service": "mosquitto",
-  "correlation_id": "hc-3",
-  "payload": {}
-}
-EOF
-
-cat <<EOF > /tmp/homelab/events/2026-05-12/saturn/121500_deployment_started_1.json
-{
-  "timestamp": "2026-05-12T12:15:00Z",
-  "node": "saturn",
-  "type": "deployment_started",
-  "severity": "info",
-  "source": "deploy_agent",
-  "service": "mosquitto",
-  "correlation_id": "deploy-1",
-  "payload": {"version": "2.0.18"}
-}
-EOF
-
-cat <<EOF > /tmp/homelab/events/2026-05-12/saturn/121600_deployment_failed_1.json
-{
-  "timestamp": "2026-05-12T12:16:00Z",
-  "node": "saturn",
-  "type": "deployment_failed",
-  "severity": "error",
-  "source": "deploy_agent",
-  "service": "mosquitto",
-  "correlation_id": "deploy-1",
-  "payload": {"error": "container crash", "diagnostics_file": "/opt/homelab/logs/diagnostics-deploy-1.log"}
-}
-EOF
--- a/scripts/onboard/README.md
+++ b/scripts/onboard/README.md
@ -1,139 +0,0 @@
-# scripts/onboard — Node Onboarding Tool
-
-Idempotent, declarative node onboarding in plain bash, without Ansible.
-Each node is described by a manifest at `hosts/<node>/node.yaml`; the
-`onboard.sh` script reads the manifest and runs the numbered steps in order.
-
-## Usage
-
-```bash
-scripts/onboard/onboard.sh --node <name> [--step <name>] [--from <step>] [--dry-run]
-```
-
-| Flag | Description |
-|------|-------------|
-| `--node <name>` | Node name (required); matches `hosts/<name>/node.yaml` |
-| `--step <name>` | Run only this one step (e.g. `00-access`) |
-| `--from <step>` | Start from this step and continue to the end |
-| `--dry-run` | Sets `DRY_RUN=1`; mutations are simulated by `run()`, probes run for real |
-
-```bash
-# Full onboarding
-scripts/onboard/onboard.sh --node lustro
-
-# A single step only
-scripts/onboard/onboard.sh --node lustro --step 00-access
-
-# From a given step onward
-scripts/onboard/onboard.sh --node lustro --from 10-bootstrap-runtime
-
-# Preview without changes (state probes still run for real, so the plan is realistic)
-scripts/onboard/onboard.sh --node lustro --dry-run
-```
-
-## hosts/\<node\>/node.yaml schema
-
-```yaml
-name: LUSTRO                        # node name (ALL CAPS)
-role: edge                          # edge | compute | infra
-location: KEN                       # location identifier
-
-ssh_user: pi                        # SSH user; may differ from "oskar" on edge nodes
-                                    # (uid=1000 collision: use the existing user)
-first_contact: pi@192.168.31.19     # SSH target before Tailscale; MUST be an IP, not .local
-                                    # (mDNS .local is unreliable in automation)
-tailscale:
-  hostname: lustro                  # name in the mesh; SSH target after tailscale up
-  ip:                               # filled in after join (optional)
-
-deploy_autonomy: true               # true = onboard.sh may perform mutations autonomously
-                                    # false = print manual instructions and stop
-git_control: false                  # true = node pulls from Forgejo
-                                    # false = push-based from SATURN (edge nodes)
-
-hardware:
-  arch: arm64                       # aarch64 | x86_64 | armv7l; filled in by 00-preflight
-  ram_mb: 4096                      # RAM in MB; filled in by 00-preflight
-  swap:
-    kind: zram                      # zram | file | none; zram recommended (SD wear)
-  docker_present: true              # is docker already installed?; filled in by 00-preflight
-  mm_runtime: systemd:magicmirror.service
-                                    # MagicMirror runtime: systemd:<unit> | pm2 | process | none
-                                    # filled in by 00-preflight
-
-services:
-  node-agent:
-    runtime:
-      engine: docker                # docker | docker-compose
-      mem_limit: 256m               # mandatory (RPi4 RAM profile is like the VPS; OOM risk)
-```
-
-### Field notes
-
-- **`ssh_user`**: on edge nodes with an existing uid=1000 user (e.g. `pi` on RPi OS), use
-  that user instead of creating `oskar`; docker group membership and the node-agent
-  `mem_limit` are designed around `1000:1000`.
-- **`first_contact`**: always an IP, never a `.local` hostname. mDNS proved unreliable
-  in automation (transient resolve failures). After `tailscale up`, use `tailscale.hostname`.
-- **`deploy_autonomy`**: when `false`, steps 10+ print manual instructions and exit
-  without mutating anything. Useful for nodes managed by someone else.
-- **`git_control`**: when `false`, steps with `git`/`repo`/`clone` in their name are skipped.
-
-## Step status
-
-| Step | File | Status | Description |
-|------|------|--------|-------------|
-| `00-access` | `steps/00-access.sh` | **DONE** | SSH key → `first_contact`, install Tailscale, `tailscale up` (interactive URL), verify `pi@<ts_hostname>` arch=aarch64 |
-| `00-preflight` | `steps/00-preflight.sh` | SCAFFOLD | Read-only: collects facts (arch, RAM, docker, swap, MM runtime), prints a report plus a YAML snippet to paste into node.yaml |
-| `10-bootstrap-runtime` | `steps/10-bootstrap-runtime.sh` | TODO | Creates the `/opt/homelab/` layout, `chown <ssh_user>` |
-| `20-install-docker` | `steps/20-install-docker.sh` | TODO | Installs Docker Engine if `docker_present=false`; skipped when already installed |
-| `30-install-tailscale` | `steps/30-install-tailscale.sh` | TODO | Superseded by `00-access` for new nodes; may be used for re-join |
-| `40-deploy-node-agent` | `steps/40-deploy-node-agent.sh` | TODO | Deploys node-agent in docker; user 1000:1000; `mem_limit` from node.yaml |
-| `50-verify` | `steps/50-verify.sh` | TODO | End-to-end smoke test: event reached the control plane, visible in the UI, Telegram alert path |
-
-## lib/ architecture
-
-```
-lib/common.sh   — log/warn/die/step/dryrun, run(), yaml_get, ensure_line, git() wrapper
-lib/remote.sh   — rrun/rcopy/rsync_dir/rcheck (SSH wrappers, ONBOARD_SSH_USER/HOST)
-```
-
-### run() and dry-run
-
-`DRY_RUN=1` is exported to all step scripts by the orchestrator.
-
-```bash
-# Wrap mutations in run(): in dry-run it prints the intent and does not execute
-run ssh-copy-id -i ~/.ssh/id_ed25519.pub pi@192.168.31.19
-
-# State probes (ssh BatchMode test, command -v, status query) ALWAYS run:
-# dry-run must show a realistic plan based on the current state
-if ssh -o BatchMode=yes pi@192.168.31.19 true 2>/dev/null; then
-    log "key already present — skip"
-fi
-```
-
-### yaml_get fallback without yq
-
-When `yq` is not available, a `grep`+`sed` fallback is used. Pitfalls:
-
-- Inline YAML comments (`key: value   # comment`) are stripped by
-  `s/[[:space:]]\+#.*$//`, which requires at least one whitespace character before `#`, so
-  `url#fragment` is left intact.
-- The parser is non-greedy on `:` (`s/^[[:space:]]*[^:]*:[[:space:]]*//`), so
-  values containing a colon (e.g. `systemd:magicmirror.service`) are read correctly.
-- Dot-paths (`tailscale.hostname`) work only with `yq`; the fallback matches on the last
-  segment (`hostname`). Field names in node.yaml must be unique.
-
-## Gotchas / Learnings
-
-| Problem | Solution |
-|---------|----------|
-| mDNS `.local` is unreliable | Use an IP in `first_contact`; `.local` is fine interactively, not in automation |
-| Existing uid=1000 on an edge node | Use that user; do not create `oskar` (uid collision, would break MM ownership) |
-| Swap file on SD card | Migrate to zram for wear reduction; add a step to `10-bootstrap-runtime` |
-| dry-run stops at the orchestrator | `run()` wrapper + `export DRY_RUN=1`; probes must also work in dry-run |
-| SSH known-hosts warning in parsed output | `-o LogLevel=ERROR` on SSH to a new host in the mesh |
-| `yaml_get` drops the value prefix when the value contains `:` | Non-greedy `^[[:space:]]*[^:]*:` instead of `.*:` |
-| `yaml_get` does not strip inline comments | `s/[[:space:]]\+#.*$//` after extracting the value |
-| RPi4 4 GB RAM, OOM risk | `mem_limit` in the node-agent override is mandatory (profile like the VPS) |
--- a/scripts/onboard/lib/common.sh
+++ b/scripts/onboard/lib/common.sh
@ -1,84 +0,0 @@
-#!/usr/bin/env bash
-# scripts/onboard/lib/common.sh — shared helpers for the onboarding tool
-
-set -euo pipefail
-
-# ── colour codes (disabled when not a tty) ──────────────────────────────────
-if [[ -t 1 ]]; then
-    _C_RESET='\033[0m'
-    _C_GREEN='\033[0;32m'
-    _C_YELLOW='\033[1;33m'
-    _C_RED='\033[0;31m'
-    _C_CYAN='\033[0;36m'
-    _C_BOLD='\033[1m'
-else
-    _C_RESET='' _C_GREEN='' _C_YELLOW='' _C_RED='' _C_CYAN='' _C_BOLD=''
-fi
-
-# ── logging ──────────────────────────────────────────────────────────────────
-log()  { echo -e "${_C_GREEN}[onboard]${_C_RESET} $(date +'%H:%M:%S') ${*}"; }
-warn() { echo -e "${_C_YELLOW}[WARN]${_C_RESET}    $(date +'%H:%M:%S') ${*}" >&2; }
-die()  { echo -e "${_C_RED}[ERROR]${_C_RESET}   $(date +'%H:%M:%S') ${*}" >&2; exit 1; }
-step() { echo -e "${_C_CYAN}${_C_BOLD}==> ${*}${_C_RESET}"; }
-dryrun() { echo -e "${_C_YELLOW}[dry-run]${_C_RESET} ${*}"; }
-
-# ── command detection ─────────────────────────────────────────────────────────
-have_cmd() { command -v "$1" >/dev/null 2>&1; }
-
-# ── dry-run execution wrapper ─────────────────────────────────────────────────
-# run CMD [ARGS…] — executes CMD in live mode; prints intent in dry-run.
-# Wrap MUTATIONS with this. Read-only probes (SSH BatchMode tests, status
-# queries, command -v checks) must run unconditionally — never wrap them.
-run() {
-    if [ "${DRY_RUN:-0}" = 1 ]; then
-        echo "[dry-run] would: $*"
-    else
-        "$@"
-    fi
-}
-export -f run
-
-# ── file helpers ──────────────────────────────────────────────────────────────
-# ensure_line FILE LINE — appends LINE to FILE if it is not already present (idempotent)
-ensure_line() {
-    local file="$1" line="$2"
-    [[ -f "$file" ]] || touch "$file"
-    grep -qxF "$line" "$file" || echo "$line" >> "$file"
-}
-
-# ── node.yaml parsing ─────────────────────────────────────────────────────────
-# require_node_yaml NODE — sets NODE_YAML; exits if not found
-require_node_yaml() {
-    local node="$1"
-    NODE_YAML="${REPO_ROOT}/hosts/${node,,}/node.yaml"
-    [[ -f "$NODE_YAML" ]] || die "node.yaml not found: $NODE_YAML"
-    export NODE_YAML
-}
-
-# yaml_get NODE_YAML KEY — read a scalar value from a YAML file
-# Uses yq when available; falls back to grep/sed for simple key: value pairs.
-# Supports dot-separated paths (e.g. tailscale.hostname) only in yq mode;
-# the grep fallback handles only the last path component.
-yaml_get() {
-    local file="$1" key="$2"
-    if have_cmd yq; then
-        yq -r ".${key} // empty" "$file" 2>/dev/null
-    else
-        # fallback: extract last segment of key, match "  key: value"
-        # Strip inline YAML comment (space(s)+'#'+rest) and surrounding whitespace.
-        # Pattern uses \+ (BRE one-or-more) so a bare '#' inside a value is preserved.
-        local leaf="${key##*.}"
-        grep -E "^\s*${leaf}:" "$file" | head -1 \
-            | sed -e 's/^[[:space:]]*[^:]*:[[:space:]]*//' \
-                  -e 's/[[:space:]]\+#.*$//' \
-                  -e 's/^[[:space:]]*//' \
-                  -e 's/[[:space:]]*$//' \
-            | tr -d '"' | tr -d "'"
-    fi
-}
-
-# ── git wrapper ────────────────────────────────────────────────────────────────
-# All git calls from onboarding scripts must go through this so --no-pager is
-# always set and there is no interactive output.
-git() { command git --no-pager "$@"; }
-export -f git
--- a/scripts/onboard/lib/remote.sh
+++ b/scripts/onboard/lib/remote.sh
@ -1,51 +0,0 @@
-#!/usr/bin/env bash
-# scripts/onboard/lib/remote.sh — SSH helpers for remote node operations
-# Requires: ONBOARD_SSH_USER, ONBOARD_SSH_HOST to be set by the caller.
-# Inherits: DRY_RUN ("1" enables dry-run; any other value, including the default "0", runs live)
-
-set -euo pipefail
-
-: "${ONBOARD_SSH_USER:?remote.sh: ONBOARD_SSH_USER is not set}"
-: "${ONBOARD_SSH_HOST:?remote.sh: ONBOARD_SSH_HOST is not set}"
-: "${DRY_RUN:=0}"
-
-_SSH_OPTS=(
-    -o StrictHostKeyChecking=accept-new
-    -o ConnectTimeout=10
-    -o BatchMode=yes
-)
-
-# rrun CMD [ARGS…] — run a command on the remote node via SSH
-rrun() {
-    if [ "${DRY_RUN:-0}" = 1 ]; then
-        dryrun "ssh ${ONBOARD_SSH_USER}@${ONBOARD_SSH_HOST} -- $*"
-        return 0
-    fi
-    ssh "${_SSH_OPTS[@]}" "${ONBOARD_SSH_USER}@${ONBOARD_SSH_HOST}" -- "$@"
-}
-
-# rcopy LOCAL_PATH REMOTE_PATH — copy a file to the remote node via scp
-rcopy() {
-    local src="$1" dst="$2"
-    if [ "${DRY_RUN:-0}" = 1 ]; then
-        dryrun "scp $src ${ONBOARD_SSH_USER}@${ONBOARD_SSH_HOST}:$dst"
-        return 0
-    fi
-    scp "${_SSH_OPTS[@]}" "$src" "${ONBOARD_SSH_USER}@${ONBOARD_SSH_HOST}:$dst"
-}
-
-# rsync_dir LOCAL_DIR REMOTE_DIR [EXTRA_RSYNC_ARGS…]
-rsync_dir() {
-    local src="$1" dst="$2"
-    shift 2
-    if [ "${DRY_RUN:-0}" = 1 ]; then
-        dryrun "rsync -az $src ${ONBOARD_SSH_USER}@${ONBOARD_SSH_HOST}:$dst $*"
-        return 0
-    fi
-    rsync -az -e "ssh ${_SSH_OPTS[*]}" "$src" "${ONBOARD_SSH_USER}@${ONBOARD_SSH_HOST}:$dst" "$@"
-}
-
-# rcheck: verify SSH connectivity; returns 0 if reachable (5 s timeout goes first because ssh keeps the first value given for an option)
-rcheck() {
-    ssh -o ConnectTimeout=5 "${_SSH_OPTS[@]}" "${ONBOARD_SSH_USER}@${ONBOARD_SSH_HOST}" -- true 2>/dev/null
-}
--- a/scripts/onboard/onboard.sh
+++ b/scripts/onboard/onboard.sh
@ -1,182 +0,0 @@
-#!/usr/bin/env bash
-# scripts/onboard/onboard.sh — node onboarding orchestrator
-#
-# Usage:
-#   onboard.sh --node <name> [--step <name>] [--from <step>] [--dry-run]
-#
-# Flags:
-#   --node   <name>   node name matching hosts/<name>/node.yaml  (required)
-#   --step   <name>   run only this step (e.g. 00-preflight)
-#   --from   <step>   start from this step, run all subsequent steps
-#   --dry-run         print what would be done without mutating anything
-#
-# Steps run in lexicographic order from scripts/onboard/steps/.
-# Steps that require deploy_autonomy=true are skipped (with a warning) when
-# that flag is false in node.yaml.  Steps that require git_control=true are
-# similarly gated.
-
-set -euo pipefail
-
-REPO_ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd)"
-STEPS_DIR="${REPO_ROOT}/scripts/onboard/steps"
-LIB_DIR="${REPO_ROOT}/scripts/onboard/lib"
-
-# ── load helpers ──────────────────────────────────────────────────────────────
-# shellcheck source=lib/common.sh
-source "${LIB_DIR}/common.sh"
-
-# ── defaults ──────────────────────────────────────────────────────────────────
-NODE_NAME=""
-ONLY_STEP=""
-FROM_STEP=""
-DRY_RUN=0
-export DRY_RUN REPO_ROOT
-
-# ── argument parsing ──────────────────────────────────────────────────────────
-usage() {
-    cat >&2 <<'EOF'
-Usage: onboard.sh --node <name> [--step <name>] [--from <step>] [--dry-run]
-
-  --node   <name>   node name matching hosts/<name>/node.yaml   (required)
-  --step   <name>   run only this single step (e.g. 00-preflight)
-  --from   <step>   start from this step, continue to end
-  --dry-run         no mutations; show what would run
-
-Examples:
-  onboard.sh --node lustro
-  onboard.sh --node lustro --step 00-preflight
-  onboard.sh --node lustro --from 20-install-docker
-  onboard.sh --node lustro --dry-run
-EOF
-    exit 1
-}
-
-while [[ $# -gt 0 ]]; do
-    case "$1" in
-        --node)    NODE_NAME="${2:?--node requires a value}";  shift 2 ;;
-        --step)    ONLY_STEP="${2:?--step requires a value}";  shift 2 ;;
-        --from)    FROM_STEP="${2:?--from requires a value}";  shift 2 ;;
-        --dry-run) DRY_RUN=1;                                 shift   ;;
-        -h|--help) usage ;;
-        *) die "Unknown argument: $1" ;;
-    esac
-done
-
-[[ -z "$NODE_NAME" ]] && { warn "--node is required"; usage; }
-
-export NODE_NAME
-
-# ── load node.yaml ────────────────────────────────────────────────────────────
-require_node_yaml "$NODE_NAME"
-
-log "Loading manifest: $NODE_YAML"
-
-DEPLOY_AUTONOMY=$(yaml_get "$NODE_YAML" "deploy_autonomy")
-GIT_CONTROL=$(yaml_get     "$NODE_YAML" "git_control")
-SSH_USER=$(yaml_get        "$NODE_YAML" "ssh_user")
-TS_HOSTNAME=$(yaml_get     "$NODE_YAML" "tailscale.hostname")
-
-DEPLOY_AUTONOMY="${DEPLOY_AUTONOMY:-false}"
-GIT_CONTROL="${GIT_CONTROL:-false}"
-
-[[ -z "$SSH_USER"    ]] && die "ssh_user not set in $NODE_YAML"
-[[ -z "$TS_HOSTNAME" ]] && die "tailscale.hostname not set in $NODE_YAML"
-
-export ONBOARD_SSH_USER="$SSH_USER"
-export ONBOARD_SSH_HOST="$TS_HOSTNAME"
-
-log "Node: ${NODE_NAME} | host: ${TS_HOSTNAME} | user: ${SSH_USER}"
-log "deploy_autonomy=${DEPLOY_AUTONOMY}  git_control=${GIT_CONTROL}  dry_run=${DRY_RUN}"
-
-# ── collect steps ─────────────────────────────────────────────────────────────
-# Steps are NN-name.sh files in lexicographic order.
-mapfile -t ALL_STEPS < <(find "$STEPS_DIR" -maxdepth 1 -name '[0-9][0-9]-*.sh' | sort)
-
-if [[ ${#ALL_STEPS[@]} -eq 0 ]]; then
-    die "No steps found in $STEPS_DIR"
-fi
-
-# Determine which steps to run based on flags.
-declare -a STEPS_TO_RUN=()
-
-for step_path in "${ALL_STEPS[@]}"; do
-    step_file=$(basename "$step_path" .sh)
-
-    if [[ -n "$ONLY_STEP" ]]; then
-        # Exact match on the step name without the .sh suffix (e.g. "00-preflight")
-        [[ "$step_file" == "$ONLY_STEP" ]] || continue
-    elif [[ -n "$FROM_STEP" ]]; then
-        # Skip steps before FROM_STEP
-        [[ "$step_file" < "$FROM_STEP" ]] && continue
-    fi
-
-    STEPS_TO_RUN+=("$step_path")
-done
-
-if [[ ${#STEPS_TO_RUN[@]} -eq 0 ]]; then
-    die "No matching steps found (--step='${ONLY_STEP}' --from='${FROM_STEP}')"
-fi
-
-log "Steps to run (${#STEPS_TO_RUN[@]}):"
-for s in "${STEPS_TO_RUN[@]}"; do
-    printf "    %s\n" "$(basename "$s")"
-done
-echo ""
-
-# ── step execution loop ───────────────────────────────────────────────────────
-# Steps numbered 10 and above are treated as mutating and require deploy_autonomy=true.
-# Steps with git, repo, or clone in their name also require git_control=true.
-# Steps numbered below 10 (e.g. 00-preflight) are never gated on deploy_autonomy.
-
-_step_needs_autonomy() {
-    local num="${1%%[^0-9]*}"   # leading digits
-    [[ "10#${num}" -ge 10 ]] 2>/dev/null   # 10# forces base 10 so 08/09 are not parsed as octal
-}
-
-_step_needs_git_control() {
-    local name="$1"
-    [[ "$name" == *"git"* || "$name" == *"repo"* || "$name" == *"clone"* ]]
-}
-
-FAILED_STEPS=()
-
-for step_path in "${STEPS_TO_RUN[@]}"; do
-    step_file=$(basename "$step_path" .sh)
-    step_num="${step_file%%[^0-9]*}"
-
-    # autonomy gate
-    if _step_needs_autonomy "$step_num" && [[ "$DEPLOY_AUTONOMY" != "true" ]]; then
-        warn "Skipping $step_file — deploy_autonomy=false in $NODE_YAML"
-        warn "Run this step manually or set deploy_autonomy: true"
-        continue
-    fi
-
-    # git_control gate
-    if _step_needs_git_control "$step_file" && [[ "$GIT_CONTROL" != "true" ]]; then
-        warn "Skipping $step_file — git_control=false in $NODE_YAML"
-        continue
-    fi
-
-    step "Running: $step_file"
-
-    if bash "$step_path"; then
-        log "$step_file — OK"
-    else
-        rc=$?
-        warn "$step_file — FAILED (exit $rc)"
-        FAILED_STEPS+=("$step_file")
-    fi
-
-    echo ""
-done
-
-# ── summary ───────────────────────────────────────────────────────────────────
-if [[ ${#FAILED_STEPS[@]} -gt 0 ]]; then
-    die "Onboarding finished with failures: ${FAILED_STEPS[*]}"
-fi
-
-if [ "${DRY_RUN:-0}" = 1 ]; then
-    log "Dry-run complete — no mutations performed."
-else
-    log "All steps completed successfully for node ${NODE_NAME}."
-fi
--- a/scripts/onboard/steps/00-access.sh
+++ b/scripts/onboard/steps/00-access.sh
@ -1,156 +0,0 @@
-#!/usr/bin/env bash
-# scripts/onboard/steps/00-access.sh — establish remote access channel
-#
-# Stages:
-#   1. ensure_ssh_key  — copy SATURN public key to first_contact (idempotent)
-#   2. ensure_tailscale — install Tailscale and join network (interactive auth URL)
-#   3. verify           — confirm SSH over Tailscale, assert arch=aarch64
-#
-# Dry-run convention (DRY_RUN=1):
-#   - Read-only probes (SSH BatchMode test, tailscale status, command -v) run ALWAYS
-#     so the plan reflects real current state ("key present → skip" vs "would: install")
-#   - Mutations (ssh-copy-id, curl installer, tailscale up) are wrapped with run()
-#
-# Does NOT configure NOPASSWD or /opt/homelab — those are later steps.
-# pi user on Raspberry Pi OS has passwordless sudo — required for `tailscale up`.
-
-set -euo pipefail
-
-STEP_NAME="00-access"
-
-: "${REPO_ROOT:?REPO_ROOT is not set — run via onboard.sh}"
-: "${NODE_YAML:?NODE_YAML is not set — run via onboard.sh}"
-: "${DRY_RUN:=0}"
-
-# Source common.sh when run standalone (orchestrator sources it before calling steps)
-if ! declare -f log >/dev/null 2>&1; then
-    # shellcheck source=../lib/common.sh
-    source "${REPO_ROOT}/scripts/onboard/lib/common.sh"
-fi
-
-# ── parse node.yaml ───────────────────────────────────────────────────────────
-FIRST_CONTACT=$(yaml_get "$NODE_YAML" "first_contact")
-TS_HOSTNAME=$(yaml_get   "$NODE_YAML" "tailscale.hostname")
-
-[[ -z "$FIRST_CONTACT" ]] && die "first_contact not set in $NODE_YAML"
-[[ -z "$TS_HOSTNAME"   ]] && die "tailscale.hostname not set in $NODE_YAML"
-
-FC_USER="${FIRST_CONTACT%%@*}"
-
-# ONBOARD_SSH_USER/HOST are set by the orchestrator to post-Tailscale coordinates;
-# for standalone invocation, fall back to the first_contact user and the Tailscale hostname.
-export ONBOARD_SSH_USER="${ONBOARD_SSH_USER:-${FC_USER}}"
-export ONBOARD_SSH_HOST="${ONBOARD_SSH_HOST:-${TS_HOSTNAME}}"
-
-# shellcheck source=../lib/remote.sh
-source "${REPO_ROOT}/scripts/onboard/lib/remote.sh"
-
-# ── SSH option arrays ─────────────────────────────────────────────────────────
-# No BatchMode — used for ssh-copy-id where a password prompt may appear
-_FC_SSH_NOKEY=(-o StrictHostKeyChecking=accept-new -o ConnectTimeout=10)
-# BatchMode — used for all probes and post-key-install operations
-_FC_SSH=(-o StrictHostKeyChecking=accept-new -o ConnectTimeout=10 -o BatchMode=yes)
-# Tailscale verify — LogLevel=ERROR suppresses the "Permanently added" known-hosts
-# INFO message that would otherwise leak into captured stdout on first connection
-_TS_SSH=(-o StrictHostKeyChecking=accept-new -o ConnectTimeout=10 -o BatchMode=yes -o LogLevel=ERROR)
-
-# ── tailscale state probe helper ──────────────────────────────────────────────
-# Always runs; returns BackendState or "unknown" on any SSH/parse failure.
-_ts_state() {
-    ssh "${_FC_SSH[@]}" "$FIRST_CONTACT" \
-        'tailscale status --json 2>/dev/null | python3 -c \
-         "import sys,json; print(json.load(sys.stdin).get(\"BackendState\",\"unknown\"))" \
-         2>/dev/null || echo "unknown"' 2>/dev/null || echo "unknown"
-}
-
-# ═══════════════════════════════════════════════════════════════════════════════
-# Stage 1 — ensure_ssh_key
-# ═══════════════════════════════════════════════════════════════════════════════
-step "[$STEP_NAME] 1/3 ensure_ssh_key → ${FIRST_CONTACT}"
-
-# Probe: test key-based auth — always runs so dry-run reports real current state
-if ssh "${_FC_SSH[@]}" "$FIRST_CONTACT" true 2>/dev/null; then
-    log "SSH key already accepted by ${FIRST_CONTACT} — skip"
-else
-    pubkeys=( "$HOME"/.ssh/id_*.pub )
-    [[ -f "${pubkeys[0]}" ]] || die "No public key found at ~/.ssh/id_*.pub on SATURN"
-
-    log "Key not yet installed on ${FIRST_CONTACT} (password prompt expected)"
-    # Mutation: install public key
-    run ssh-copy-id \
-        "${_FC_SSH_NOKEY[@]}" \
-        -i "${pubkeys[0]}" \
-        "$FIRST_CONTACT"
-    # Probe: verify key was installed (run() is a no-op in dry-run so this
-    # prints "would:" — avoids a false-failure after a skipped ssh-copy-id)
-    run ssh "${_FC_SSH[@]}" "$FIRST_CONTACT" true
-    log "Key installed and verified"
-fi
-
-# ═══════════════════════════════════════════════════════════════════════════════
-# Stage 2 — ensure_tailscale
-# ═══════════════════════════════════════════════════════════════════════════════
-step "[$STEP_NAME] 2/3 ensure_tailscale on ${FIRST_CONTACT} → hostname=${TS_HOSTNAME}"
-
-# Probe: check if tailscale binary present — always runs.
-# SSH auth failure (key not yet installed in dry-run) falls through to the
-# "not found" branch, which is correct for a fresh node.
-if ! ssh "${_FC_SSH[@]}" "$FIRST_CONTACT" 'command -v tailscale' >/dev/null 2>&1; then
-    log "Tailscale not found on ${FIRST_CONTACT}"
-    # Mutation: install tailscale
-    run ssh "${_FC_SSH[@]}" "$FIRST_CONTACT" \
-        'curl -fsSL https://tailscale.com/install.sh | sh'
-else
-    log "Tailscale already installed on ${FIRST_CONTACT}"
-fi
-
-# Probe: check backend state — always runs
-ts_state=$(_ts_state)
-if [[ "$ts_state" == "Running" ]]; then
-    log "Tailscale already active (BackendState=Running) — skip"
-else
-    warn "Tailscale BackendState=${ts_state} — joining network..."
-    echo ""
-    echo -e "${_C_BOLD}┌─────────────────────────────────────────────────────────────┐"
-    echo -e "│  ACTION REQUIRED: open the URL below in your browser to       │"
-    echo -e "│  authorize ${TS_HOSTNAME} in your Tailscale account.              │"
-    echo -e "└─────────────────────────────────────────────────────────────┘${_C_RESET}"
-    echo ""
-    # Mutation: tailscale up — blocks until user authenticates via printed URL
-    run ssh "${_FC_SSH[@]}" "$FIRST_CONTACT" "sudo tailscale up --hostname=${TS_HOSTNAME}"
-    echo ""
-
-    # Post-join state check — only meaningful after the mutation actually ran
-    if [ "${DRY_RUN:-0}" != 1 ]; then
-        ts_state2=$(_ts_state)
-        [[ "$ts_state2" == "Running" ]] \
-            || die "Tailscale still not active after tailscale up (BackendState=${ts_state2})"
-        log "Tailscale joined successfully (BackendState=Running)"
-    fi
-fi
-
-# ═══════════════════════════════════════════════════════════════════════════════
-# Stage 3 — verify over Tailscale
-# ═══════════════════════════════════════════════════════════════════════════════
-step "[$STEP_NAME] 3/3 verify SSH over Tailscale → ${ONBOARD_SSH_USER}@${TS_HOSTNAME}"
-
-# Probe: always runs — on a node already joined this works even in dry-run.
-# On a fresh node in dry-run mode Tailscale isn't set up yet, so SSH will fail;
-# that is reported as a warning (not a fatal error) to keep dry-run informative.
-# stderr is NOT merged (no 2>&1) — _TS_SSH uses LogLevel=ERROR so the
-# "Permanently added … to known hosts" INFO message is suppressed at source.
-if arch=$(ssh "${_TS_SSH[@]}" "${ONBOARD_SSH_USER}@${TS_HOSTNAME}" 'uname -m'); then
-    # Take the last non-empty stdout line to skip any unexpected preamble
-    arch=$(printf '%s' "$arch" | grep -v '^[[:space:]]*$' | tail -1 | tr -d '[:space:]')
-    if [[ "$arch" == "aarch64" ]]; then
-        log "Verify OK: ${ONBOARD_SSH_USER}@${TS_HOSTNAME} reachable, arch=${arch}"
-    else
-        msg="Unexpected arch '${arch}' on ${TS_HOSTNAME} — expected aarch64"
-        if [ "${DRY_RUN:-0}" = 1 ]; then warn "$msg"; else die "$msg"; fi
-    fi
-else
-    msg="Verify SSH to ${ONBOARD_SSH_USER}@${TS_HOSTNAME} failed (Tailscale not yet joined?)"
-    if [ "${DRY_RUN:-0}" = 1 ]; then warn "$msg"; else die "$msg"; fi
-fi
-
-log "[$STEP_NAME] done"
--- a/scripts/onboard/steps/00-preflight.sh
+++ b/scripts/onboard/steps/00-preflight.sh
@ -1,144 +0,0 @@
-#!/usr/bin/env bash
-# scripts/onboard/steps/00-preflight.sh — READ-ONLY remote node discovery
-#
-# Collects facts from the remote node and prints:
-#   1. A human-readable report block
-#   2. A machine-readable YAML snippet ready to paste into hosts/<node>/node.yaml
-#
-# NO mutations are performed on the remote host.
-# Depends on: lib/common.sh (sourced by orchestrator), lib/remote.sh (sourced here)
-
-set -euo pipefail
-
-STEP_NAME="00-preflight"
-
-# remote.sh is sourced here so individual steps can also be run standalone
-# (when REPO_ROOT is in the environment).
-: "${REPO_ROOT:?REPO_ROOT is not set — run via onboard.sh}"
-# shellcheck source=../lib/remote.sh
-source "${REPO_ROOT}/scripts/onboard/lib/remote.sh"
-
-step "[$STEP_NAME] Collecting facts from ${ONBOARD_SSH_USER}@${ONBOARD_SSH_HOST} (read-only)"
-
-# ── gather all facts in a single SSH session ──────────────────────────────────
-raw=$(rrun bash -s <<'REMOTE'
-set -euo pipefail
-
-# arch / bitness
-arch=$(uname -m)
-bits=$(getconf LONG_BIT)
-
-# RAM (kB → MB)
-mem_kb=$(grep MemTotal /proc/meminfo | awk '{print $2}')
-mem_mb=$(( mem_kb / 1024 ))
-
-# disk root
-disk_root=$(df -h / | awk 'NR==2{print $2" total, "$3" used, "$4" free ("$5" used)"}')
-
-# docker
-docker_present=false
-docker_info=""
-if command -v docker >/dev/null 2>&1; then
-    docker_present=true
-    docker_info=$(docker info --format '{{.ServerVersion}}' 2>/dev/null || echo "unknown")
-fi
-
-# tailscale
-tailscale_present=false
-tailscale_status=""
-if command -v tailscale >/dev/null 2>&1; then
-    tailscale_present=true
-    tailscale_status=$(tailscale status --json 2>/dev/null | python3 -c "import sys,json; d=json.load(sys.stdin); print(d.get('BackendState','unknown'))" 2>/dev/null || tailscale status 2>/dev/null | head -1 || echo "unknown")
-fi
-
-# Magic Mirror runtime detection
-mm_runtime="none"
-if systemctl is-active --quiet MagicMirror 2>/dev/null || systemctl is-active --quiet magicmirror 2>/dev/null; then
-    mm_runtime="systemd"
-elif command -v pm2 >/dev/null 2>&1 && pm2 list 2>/dev/null | grep -qi "MagicMirror"; then
-    mm_runtime="pm2"
-elif pgrep -fa "MagicMirror" >/dev/null 2>&1; then
-    mm_runtime="process"
-fi
-
-# swap
-swap_current="none"
-if command -v swapon >/dev/null 2>&1; then
-    swap_lines=$(swapon --show --noheadings 2>/dev/null || true)
-    if [[ -n "$swap_lines" ]]; then
-        swap_current="$swap_lines"
-    fi
-fi
-if command -v zramctl >/dev/null 2>&1; then
-    zram_lines=$(zramctl --noheadings 2>/dev/null || true)
-    [[ -n "$zram_lines" ]] && swap_current="${swap_current:+$swap_current; }zram: $zram_lines"
-fi
-
-# hostname / os
-hostname=$(hostname -f 2>/dev/null || hostname)
-os_pretty=$(grep PRETTY_NAME /etc/os-release 2>/dev/null | cut -d= -f2 | tr -d '"' || echo "unknown")
-
-cat <<EOF
-ARCH=$arch
-BITS=$bits
-MEM_MB=$mem_mb
-DISK_ROOT=$disk_root
-DOCKER_PRESENT=$docker_present
-DOCKER_VERSION=$docker_info
-TAILSCALE_PRESENT=$tailscale_present
-TAILSCALE_STATUS=$tailscale_status
-MM_RUNTIME=$mm_runtime
-SWAP_CURRENT=$swap_current
-HOSTNAME=$hostname
-OS=$os_pretty
-EOF
-REMOTE
-)
-
-# ── parse key=value output ────────────────────────────────────────────────────
-_val() { echo "$raw" | grep "^${1}=" | head -1 | cut -d= -f2-; }
-
-arch=$(_val ARCH)
-bits=$(_val BITS)
-mem_mb=$(_val MEM_MB)
-disk_root=$(_val DISK_ROOT)
-docker_present=$(_val DOCKER_PRESENT)
-docker_version=$(_val DOCKER_VERSION)
-tailscale_present=$(_val TAILSCALE_PRESENT)
-tailscale_status=$(_val TAILSCALE_STATUS)
-mm_runtime=$(_val MM_RUNTIME)
-swap_current=$(_val SWAP_CURRENT)
-remote_hostname=$(_val HOSTNAME)
-os_pretty=$(_val OS)
-
-# ── human-readable report ─────────────────────────────────────────────────────
-echo ""
-echo "┌─────────────────────────────────────────────────────┐"
-printf "│  Preflight report: %-33s│\n" "${ONBOARD_SSH_HOST}"
-echo "├─────────────────────────────────────────────────────┤"
-printf "│  hostname     : %-35s│\n" "$remote_hostname"
-printf "│  OS           : %-35s│\n" "$os_pretty"
-printf "│  arch         : %-35s│\n" "${arch} (${bits}-bit)"
-printf "│  RAM          : %-35s│\n" "${mem_mb} MB"
-printf "│  disk /       : %-35s│\n" "$disk_root"
-printf "│  docker       : %-35s│\n" "${docker_present} (v${docker_version})"
-printf "│  tailscale    : %-35s│\n" "${tailscale_present} / ${tailscale_status}"
-printf "│  MagicMirror  : %-35s│\n" "$mm_runtime"
-printf "│  swap         : %-35s│\n" "${swap_current:-none}"
-echo "└─────────────────────────────────────────────────────┘"
-echo ""
-
-# ── machine-readable YAML snippet ────────────────────────────────────────────
-echo "# ── paste into hosts/${NODE_NAME,,}/node.yaml ──"
-cat <<YAML
-hardware:
-  arch: ${arch}
-  ram_mb: ${mem_mb}
-  swap: ${swap_current:-none}
-  docker_present: ${docker_present}
-  docker_version: "${docker_version}"
-  tailscale_status: "${tailscale_status}"
-  mm_runtime: ${mm_runtime}
-YAML
-
-log "[$STEP_NAME] done — no changes made to remote host"
--- a/scripts/onboard/steps/10-bootstrap-runtime.sh
+++ b/scripts/onboard/steps/10-bootstrap-runtime.sh
@ -1,14 +0,0 @@
-#!/usr/bin/env bash
-# scripts/onboard/steps/10-bootstrap-runtime.sh — create /opt/homelab layout on remote node
-#
-# TODO: create /opt/homelab/{data,config,logs,state,events,world,actions/{pending,approved,running,completed,failed}}
-# TODO: set ownership to ssh_user (from node.yaml)
-# TODO: write /opt/homelab/state/node_name from node.yaml name field
-# TODO: idempotent — skip dirs that already exist
-
-set -euo pipefail
-: "${REPO_ROOT:?REPO_ROOT is not set — run via onboard.sh}"
-source "${REPO_ROOT}/scripts/onboard/lib/remote.sh"
-
-STEP_NAME="10-bootstrap-runtime"
-step "[$STEP_NAME] TODO — not yet implemented"
--- a/scripts/onboard/steps/20-base.sh
+++ b/scripts/onboard/steps/20-base.sh
@ -1,152 +0,0 @@
-#!/usr/bin/env bash
-# scripts/onboard/steps/20-base.sh — base system configuration for LUSTRO
-#
-# Stages:
-#   1. swap→zram  — disable dphys-swapfile, install + configure zram-tools
-#   2. /opt/homelab — create base directory, chown <ssh_user>:<ssh_user>
-#   3. event dir  — create /opt/homelab/events/<ts_hostname>, chown -R
-#
-# Dry-run convention:
-#   - Probes (state queries) run unconditionally — dry-run reflects real state
-#   - Mutations use rrun() which skips execution when DRY_RUN=1
-
-set -euo pipefail
-
-STEP_NAME="20-base"
-
-: "${REPO_ROOT:?REPO_ROOT is not set — run via onboard.sh}"
-: "${NODE_YAML:?NODE_YAML is not set — run via onboard.sh}"
-: "${DRY_RUN:=0}"
-
-# Source common.sh when run standalone (orchestrator sources it before calling steps)
-if ! declare -f log >/dev/null 2>&1; then
-    # shellcheck source=../lib/common.sh
-    source "${REPO_ROOT}/scripts/onboard/lib/common.sh"
-fi
-
-# ── parse node.yaml ───────────────────────────────────────────────────────────
-SSH_USER=$(yaml_get    "$NODE_YAML" "ssh_user")
-TS_HOSTNAME=$(yaml_get "$NODE_YAML" "tailscale.hostname")
-
-[[ -z "$SSH_USER"    ]] && die "ssh_user not set in $NODE_YAML"
-[[ -z "$TS_HOSTNAME" ]] && die "tailscale.hostname not set in $NODE_YAML"
-
-export ONBOARD_SSH_USER="${ONBOARD_SSH_USER:-${SSH_USER}}"
-export ONBOARD_SSH_HOST="${ONBOARD_SSH_HOST:-${TS_HOSTNAME}}"
-
-# shellcheck source=../lib/remote.sh
-source "${REPO_ROOT}/scripts/onboard/lib/remote.sh"
-
-# ── rprobe: read-only remote probe — always runs, even in dry-run ─────────────
-rprobe() {
-    ssh "${_SSH_OPTS[@]}" "${ONBOARD_SSH_USER}@${ONBOARD_SSH_HOST}" -- "$@"
-}
-
-# ═══════════════════════════════════════════════════════════════════════════════
-# Stage 1 — swap→zram
-# ═══════════════════════════════════════════════════════════════════════════════
-step "[$STEP_NAME] 1/3 swap→zram (PERCENT=50, algo=zstd)"
-
-# Guard by EFFECT: zram device present in swapon AND dphys-swapfile not active
-# → desired end-state already reached, skip the whole stage.
-_zram_active=0
-_dphys_active=0
-rprobe 'sudo swapon --show 2>/dev/null | grep -q /dev/zram' && _zram_active=1 || true
-rprobe 'systemctl is-active dphys-swapfile' >/dev/null 2>&1  && _dphys_active=1 || true
-
-if [[ "$_zram_active" -eq 1 && "$_dphys_active" -eq 0 ]]; then
-    log "zram already active, dphys-swapfile not active — skip"
-else
-    # Substage: disable dphys-swapfile if still active
-    if [[ "$_dphys_active" -eq 1 ]]; then
-        log "dphys-swapfile active — disabling"
-        rrun sudo dphys-swapfile swapoff
-        rrun sudo systemctl disable --now dphys-swapfile
-        if rprobe '[ -f /var/swap ]' 2>/dev/null; then
-            rrun sudo rm -f /var/swap
-            log "Removed /var/swap"
-        fi
-    else
-        log "dphys-swapfile not active — skip disable"
-    fi
-
-    # Substage: install zram-tools if package not present
-    # Use dpkg -l rather than command -v: zramswap binary may not be on PATH over SSH
-    if ! rprobe 'dpkg -l zram-tools 2>/dev/null | grep -q "^ii"' 2>/dev/null; then
-        log "zram-tools not installed — installing"
-        rrun sudo apt-get install -y zram-tools
-    else
-        log "zram-tools already installed"
-    fi
-
-    # Write config and (re)start zramswap
-    log "Writing /etc/default/zramswap (ALGO=zstd, PERCENT=50)"
-    rrun bash -c "printf '%s\n' 'ALGO=zstd' 'PERCENT=50' | sudo tee /etc/default/zramswap > /dev/null"
-    rrun sudo systemctl enable zramswap
-    rrun sudo systemctl restart zramswap
-fi
-
-# Verify (skipped in dry-run — mutations may not have run)
-if [ "${DRY_RUN:-0}" != 1 ]; then
-    if rprobe 'sudo swapon --show 2>/dev/null | grep -q /dev/zram'; then
-        log "Verify OK: zram swap active"
-        rprobe 'sudo swapon --show' || true
-    else
-        die "zram swap not active after setup — check: systemctl status zramswap on ${TS_HOSTNAME}"
-    fi
-    if rprobe 'systemctl is-active dphys-swapfile' >/dev/null 2>&1; then
-        warn "dphys-swapfile still reports active — manual inspection needed"
-    else
-        log "Verify OK: dphys-swapfile not active"
-    fi
-fi
-
-# ═══════════════════════════════════════════════════════════════════════════════
-# Stage 2 — /opt/homelab
-# ═══════════════════════════════════════════════════════════════════════════════
-step "[$STEP_NAME] 2/3 /opt/homelab (owner: ${SSH_USER}:${SSH_USER})"
-
-# Guard: exists AND owned by SSH_USER?
-_dir_ok=0
-if rprobe '[ -d /opt/homelab ]' 2>/dev/null; then
-    _owner=$(rprobe "stat -c '%U' /opt/homelab" 2>/dev/null || echo "")
-    if [[ "$_owner" == "$SSH_USER" ]]; then
-        _dir_ok=1
-        log "/opt/homelab exists, owner=${SSH_USER} — skip"
-    else
-        log "/opt/homelab exists but owner='${_owner}' — fixing"
-    fi
-else
-    log "/opt/homelab missing — creating"
-fi
-
-if [[ "$_dir_ok" -eq 0 ]]; then
-    rrun sudo mkdir -p /opt/homelab
-    rrun sudo chown "${SSH_USER}:${SSH_USER}" /opt/homelab
-fi
-
-# ═══════════════════════════════════════════════════════════════════════════════
-# Stage 3 — event dir
-# ═══════════════════════════════════════════════════════════════════════════════
-step "[$STEP_NAME] 3/3 event dir (/opt/homelab/events/${TS_HOSTNAME})"
-
-# Guard: event subdir exists AND /opt/homelab/events owned by SSH_USER?
-_evdir_ok=0
-if rprobe "[ -d /opt/homelab/events/${TS_HOSTNAME} ]" 2>/dev/null; then
-    _ev_owner=$(rprobe "stat -c '%U' /opt/homelab/events" 2>/dev/null || echo "")
-    if [[ "$_ev_owner" == "$SSH_USER" ]]; then
-        _evdir_ok=1
-        log "/opt/homelab/events/${TS_HOSTNAME} exists, owner=${SSH_USER} — skip"
-    else
-        log "/opt/homelab/events exists but owner='${_ev_owner}' — fixing"
-    fi
-else
-    log "/opt/homelab/events/${TS_HOSTNAME} missing — creating"
-fi
-
-if [[ "$_evdir_ok" -eq 0 ]]; then
-    rrun sudo mkdir -p "/opt/homelab/events/${TS_HOSTNAME}"
-    rrun sudo chown -R "${SSH_USER}:${SSH_USER}" /opt/homelab/events
-fi
-
-log "[$STEP_NAME] done"
--- a/scripts/onboard/steps/20-install-docker.sh
+++ b/scripts/onboard/steps/20-install-docker.sh
@ -1,16 +0,0 @@
-#!/usr/bin/env bash
-# scripts/onboard/steps/20-install-docker.sh — install Docker Engine on remote node
-#
-# TODO: skip if docker already present (check from 00-preflight facts or live rrun)
-# TODO: detect distro (Debian/Ubuntu/Raspberry Pi OS) and use appropriate apt repo
-# TODO: install docker-ce, docker-ce-cli, containerd.io
-# TODO: add ssh_user to docker group
-# TODO: enable + start docker.service
-# TODO: gate on deploy_autonomy=true in node.yaml (skip step if false, warn operator)
-
-set -euo pipefail
-: "${REPO_ROOT:?REPO_ROOT is not set — run via onboard.sh}"
-source "${REPO_ROOT}/scripts/onboard/lib/remote.sh"
-
-STEP_NAME="20-install-docker"
-step "[$STEP_NAME] TODO — not yet implemented"
--- a/scripts/onboard/steps/30-install-tailscale.sh
+++ b/scripts/onboard/steps/30-install-tailscale.sh
@ -1,16 +0,0 @@
-#!/usr/bin/env bash
-# scripts/onboard/steps/30-install-tailscale.sh — install and join Tailscale on remote node
-#
-# TODO: skip if tailscale already installed and connected
-# TODO: install via https://tailscale.com/install.sh (or distro pkg)
-# TODO: gate on operator-provided auth key (TAILSCALE_AUTH_KEY env var; never hardcode)
-# TODO: tailscale up --auth-key=$TAILSCALE_AUTH_KEY --hostname=<node.yaml name>
-# TODO: verify node appears in tailscale status within timeout
-# TODO: gate on deploy_autonomy=true in node.yaml
-
-set -euo pipefail
-: "${REPO_ROOT:?REPO_ROOT is not set — run via onboard.sh}"
-source "${REPO_ROOT}/scripts/onboard/lib/remote.sh"
-
-STEP_NAME="30-install-tailscale"
-step "[$STEP_NAME] TODO — not yet implemented"
--- a/scripts/onboard/steps/30-node-agent.sh
+++ b/scripts/onboard/steps/30-node-agent.sh
@ -1,136 +0,0 @@
-#!/usr/bin/env bash
-# scripts/onboard/steps/30-node-agent.sh — deploy node-agent to remote node
-#
-# Push-based deploy (git_control=false on LUSTRO): rsync services/node-agent/
-# and the host override to /opt/homelab/deploy/node-agent/ on the remote, then
-# docker compose build + up via SSH.  Mirrors the PIHA pattern but pushes files
-# instead of git-pulling them on the node.
-#
-# Stages:
-#   1. push   — rsync base compose+src, copy override to remote deploy dir
-#   2. up     — docker compose up -d --build (guarded: skip if already running)
-#   3. verify — container running + fresh event in /opt/homelab/events/<node>/
-#
-# Dry-run: probes run unconditionally; rsync/rrun mutations honour DRY_RUN.
-
-set -euo pipefail
-
-STEP_NAME="30-node-agent"
-
-: "${REPO_ROOT:?REPO_ROOT is not set — run via onboard.sh}"
-: "${NODE_YAML:?NODE_YAML is not set — run via onboard.sh}"
-: "${DRY_RUN:=0}"
-
-# Source common.sh when run standalone (orchestrator sources it before calling steps)
-if ! declare -f log >/dev/null 2>&1; then
-    # shellcheck source=../lib/common.sh
-    source "${REPO_ROOT}/scripts/onboard/lib/common.sh"
-fi
-
-# ── parse node.yaml ───────────────────────────────────────────────────────────
-SSH_USER=$(yaml_get    "$NODE_YAML" "ssh_user")
-TS_HOSTNAME=$(yaml_get "$NODE_YAML" "tailscale.hostname")
-
-[[ -z "$SSH_USER"    ]] && die "ssh_user not set in $NODE_YAML"
-[[ -z "$TS_HOSTNAME" ]] && die "tailscale.hostname not set in $NODE_YAML"
-
-export ONBOARD_SSH_USER="${ONBOARD_SSH_USER:-${SSH_USER}}"
-export ONBOARD_SSH_HOST="${ONBOARD_SSH_HOST:-${TS_HOSTNAME}}"
-
-# shellcheck source=../lib/remote.sh
-source "${REPO_ROOT}/scripts/onboard/lib/remote.sh"
-
-REMOTE_DEPLOY_DIR="/opt/homelab/deploy/node-agent"
-COMPOSE_BASE="${REMOTE_DEPLOY_DIR}/docker-compose.yml"
-COMPOSE_OVERRIDE="${REMOTE_DEPLOY_DIR}/docker-compose.override.yml"
-
-LOCAL_SVC_DIR="${REPO_ROOT}/services/node-agent"
-LOCAL_OVERRIDE="${REPO_ROOT}/hosts/${TS_HOSTNAME}/runtime/node-agent/docker-compose.override.yml"
-
-# ── rprobe: read-only remote probe — always runs, even in dry-run ─────────────
-rprobe() {
-    ssh "${_SSH_OPTS[@]}" "${ONBOARD_SSH_USER}@${ONBOARD_SSH_HOST}" -- "$@"
-}
-
-# ═══════════════════════════════════════════════════════════════════════════════
-# Stage 1 — push compose files to remote
-# ═══════════════════════════════════════════════════════════════════════════════
-step "[$STEP_NAME] 1/3 push compose → ${ONBOARD_SSH_HOST}:${REMOTE_DEPLOY_DIR}"
-
-# Guard by EFFECT: is node-agent already running?
-_running=0
-if rprobe "docker ps --filter name=^node-agent\$ --filter status=running --format '{{.Names}}' 2>/dev/null | grep -q node-agent" 2>/dev/null; then
-    _running=1
-    log "node-agent container already running — skip push+build+up"
-fi
-
-if [[ "$_running" -eq 0 ]]; then
-    [[ -f "$LOCAL_OVERRIDE" ]] \
-        || die "Override not found: $LOCAL_OVERRIDE"
-
-    # Ensure remote deploy dir exists (rsync does not create intermediate dirs)
-    # pi owns /opt/homelab, so no sudo needed
-    rrun mkdir -p "${REMOTE_DEPLOY_DIR}"
-
-    # Push base compose + Dockerfile + src/  (rsync_dir handles DRY_RUN)
-    rsync_dir "${LOCAL_SVC_DIR}/" "${REMOTE_DEPLOY_DIR}/"
-
-    # Push host-specific override  (rcopy handles DRY_RUN)
-    rcopy "${LOCAL_OVERRIDE}" "${REMOTE_DEPLOY_DIR}/docker-compose.override.yml"
-fi
-
-# ═══════════════════════════════════════════════════════════════════════════════
-# Stage 2 — docker compose build + up
-# ═══════════════════════════════════════════════════════════════════════════════
-step "[$STEP_NAME] 2/3 docker compose up node-agent"
-
-if [[ "$_running" -eq 1 ]]; then
-    log "node-agent already running — skip"
-else
-    # Build image on remote (arm64 native); then start the service.
-    # --build rebuilds if context changed; idempotent if image is current.
-    rrun docker compose \
-        -f "${COMPOSE_BASE}" \
-        -f "${COMPOSE_OVERRIDE}" \
-        up -d --build node-agent
-fi
-
-# ═══════════════════════════════════════════════════════════════════════════════
-# Stage 3 — verify
-# ═══════════════════════════════════════════════════════════════════════════════
-step "[$STEP_NAME] 3/3 verify"
-
-if [ "${DRY_RUN:-0}" = 1 ]; then
-    log "dry-run: skipping verify (mutations may not have run)"
-else
-    # Verify: container running (docker ps — not command -v)
-    if rprobe "docker ps --filter name=^node-agent\$ --filter status=running --format '{{.Names}}' 2>/dev/null | grep -q node-agent" 2>/dev/null; then
-        log "Verify OK: node-agent container running"
-        rprobe "docker ps --filter name=node-agent --format 'table {{.Names}}\t{{.Status}}\t{{.Image}}'" || true
-    else
-        die "node-agent container is NOT running — check: docker logs node-agent on ${TS_HOSTNAME}"
-    fi
-
-    # Verify: fresh events appear in /opt/homelab/events/<node>/ (confirms agent writes)
-    # First cycle runs at start then sleeps CHECK_INTERVAL; allow 90s.
-    log "Waiting for first event (up to 90 s, CHECK_INTERVAL=60)..."
-    _event_ok=0
-    for _i in $(seq 1 9); do
-        if rprobe "ls /opt/homelab/events/${TS_HOSTNAME}/*.json 2>/dev/null | head -1 | grep -q .json" 2>/dev/null; then
-            _event_ok=1
-            break
-        fi
-        sleep 10
-        log "  ... ${_i}0 s elapsed, waiting..."
-    done
-
-    if [[ "$_event_ok" -eq 1 ]]; then
-        log "Verify OK: events present in /opt/homelab/events/${TS_HOSTNAME}/"
-        rprobe "ls -lth /opt/homelab/events/${TS_HOSTNAME}/ | head -5" || true
-    else
-        warn "No events yet in /opt/homelab/events/${TS_HOSTNAME}/ after 90 s — agent may still be initialising (CHECK_INTERVAL=60)"
-        warn "Re-run verify manually: docker logs node-agent on ${TS_HOSTNAME}"
-    fi
-fi
-
-log "[$STEP_NAME] done"
--- a/scripts/onboard/steps/40-register.sh
+++ b/scripts/onboard/steps/40-register.sh
@ -1,140 +0,0 @@
-#!/usr/bin/env bash
-# scripts/onboard/steps/40-register.sh — add the node to the inventory and commit on the current branch
-#
-# Effects (all idempotent):
-#   1. Appends a <node> block to inventory/topology.yaml
-#   2. Creates hosts/<node>/services.yaml if it does not exist
-#   3. git add + git commit on the current branch (NO push — merging is the operator's job)
-#
-# The observer reload is deliberately outside this step — it is done manually after merge→master,
-# git pull on the VPS, and running 50-verify.sh.
-
-set -euo pipefail
-
-STEP_NAME="40-register"
-
-: "${REPO_ROOT:?REPO_ROOT is not set — run via onboard.sh}"
-: "${NODE_YAML:?NODE_YAML is not set — run via onboard.sh}"
-: "${DRY_RUN:=0}"
-
-if ! declare -f log >/dev/null 2>&1; then
-    # shellcheck source=../lib/common.sh
-    source "${REPO_ROOT}/scripts/onboard/lib/common.sh"
-fi
-
-NODE_ENTRY=$(yaml_get "${NODE_YAML}" "tailscale.hostname")
-[[ -z "${NODE_ENTRY}" ]] && die "tailscale.hostname not set in ${NODE_YAML}"
-
-TOPOLOGY="${REPO_ROOT}/inventory/topology.yaml"
-SERVICES_YAML="${REPO_ROOT}/hosts/${NODE_ENTRY}/services.yaml"
-
-# ── 1. inventory/topology.yaml ────────────────────────────────────────────────
-step "[${STEP_NAME}] 1/3 inventory/topology.yaml"
-
-_TOPOLOGY_BLOCK=$(cat << 'EOF'
-
-  PLACEHOLDER:
-    roles:
-      - edge
-    services:
-      - node-agent
-EOF
-)
-# Replace the PLACEHOLDER with the actual node name
-_TOPOLOGY_BLOCK="${_TOPOLOGY_BLOCK//PLACEHOLDER/${NODE_ENTRY}}"
-
-if grep -q "^  ${NODE_ENTRY}:" "${TOPOLOGY}"; then
-    log "${NODE_ENTRY} already present in topology.yaml — skip"
-else
-    if [ "${DRY_RUN:-0}" = 1 ]; then
-        dryrun "Would append to ${TOPOLOGY}:"
-        echo "${_TOPOLOGY_BLOCK}"
-    else
-        printf '%s\n' "${_TOPOLOGY_BLOCK}" >> "${TOPOLOGY}"
-        log "Appended ${NODE_ENTRY} block to topology.yaml"
-    fi
-fi
-
-# ── 2. hosts/<node>/services.yaml ────────────────────────────────────────────
-step "[${STEP_NAME}] 2/3 hosts/${NODE_ENTRY}/services.yaml"
-
-if [[ -f "${SERVICES_YAML}" ]]; then
-    log "services.yaml already exists — skip"
-else
-    if [ "${DRY_RUN:-0}" = 1 ]; then
-        dryrun "Would create ${SERVICES_YAML}:"
-        cat << EOF
-host: ${NODE_ENTRY}
-
-services:
-  node-agent:
-    role: node-stability-monitor
-    deployment_model: docker-compose
-    exposure: local-only
-    offline_required: true
-    depends_on:
-      local: []
-      external: []
-    runtime:
-      config_path: /opt/homelab/config/node-agent
-      data_path: /opt/homelab/state
-      logs_path: /opt/homelab/events
-EOF
-    else
-        mkdir -p "${REPO_ROOT}/hosts/${NODE_ENTRY}"
-        cat > "${SERVICES_YAML}" << EOF
-host: ${NODE_ENTRY}
-
-services:
-  node-agent:
-    role: node-stability-monitor
-    deployment_model: docker-compose
-    exposure: local-only
-    offline_required: true
-    depends_on:
-      local: []
-      external: []
-    runtime:
-      config_path: /opt/homelab/config/node-agent
-      data_path: /opt/homelab/state
-      logs_path: /opt/homelab/events
-EOF
-        log "Created ${SERVICES_YAML}"
-    fi
-fi
-
-# ── 3. git commit ─────────────────────────────────────────────────────────────
-step "[${STEP_NAME}] 3/3 git commit"
-
-cd "${REPO_ROOT}"
-
-_changed_files=()
-git diff --quiet "${TOPOLOGY}" 2>/dev/null                || _changed_files+=("inventory/topology.yaml")
-[[ -f "${SERVICES_YAML}" ]] && \
-    git ls-files --error-unmatch "${SERVICES_YAML}" >/dev/null 2>&1 || \
-    _changed_files+=("hosts/${NODE_ENTRY}/services.yaml")
-
-# Re-check: is anything staged or unstaged for these paths?
-_needs_commit=0
-if git diff --quiet -- "${TOPOLOGY}" "${SERVICES_YAML}" && git diff --cached --quiet -- "${TOPOLOGY}" "${SERVICES_YAML}"; then
-    # Nothing changed at all — may already be committed
-    if git ls-files --error-unmatch "${TOPOLOGY}" "${SERVICES_YAML}" >/dev/null 2>&1 && \
-       ! git diff HEAD -- "${TOPOLOGY}" "${SERVICES_YAML}" | grep -q .; then
-        log "Nothing to commit — ${NODE_ENTRY} already registered and committed"
-    else
-        _needs_commit=1
-    fi
-else
-    _needs_commit=1
-fi
-
-if [[ "${_needs_commit}" -eq 1 ]]; then
-    run git add "inventory/topology.yaml" "hosts/${NODE_ENTRY}/services.yaml"
-    run git commit -m "feat(onboard): register ${NODE_ENTRY} in topology + services.yaml"
-    if [ "${DRY_RUN:-0}" != 1 ]; then
-        log "Committed on $(git branch --show-current)"
-        log "Next: agent.sh merge task/node-onboarding → master, git pull VPS, run 50-verify.sh"
-    fi
-fi
-
-log "[${STEP_NAME}] done"
--- a/scripts/onboard/steps/50-verify.sh
+++ b/scripts/onboard/steps/50-verify.sh
@ -1,160 +0,0 @@
-#!/usr/bin/env bash
-# scripts/onboard/steps/50-verify.sh — observer restart + smoke test of the node in the panel
-#
-# Run AFTER: merge task/node-onboarding → master + git pull on the VPS.
-#
-# Checks:
-#   1. SSH <node>: node-agent container running
-#   2. SSH <node>: events present in /opt/homelab/events/<node>/
-#   3. SSH VPS:   docker restart control-plane-observer + poll observer.heartbeat
-#   4. SSH VPS:   <node> visible in /opt/homelab/world/nodes.json
-#
-# Exit 0 — all OK | Exit 1 — at least one FAIL (summary table)
-
-set -euo pipefail
-
-STEP_NAME="50-verify"
-
-: "${REPO_ROOT:?REPO_ROOT is not set — run via onboard.sh}"
-: "${NODE_YAML:?NODE_YAML is not set — run via onboard.sh}"
-: "${DRY_RUN:=0}"
-
-if ! declare -f log >/dev/null 2>&1; then
-    # shellcheck source=../lib/common.sh
-    source "${REPO_ROOT}/scripts/onboard/lib/common.sh"
-fi
-
-SSH_USER=$(yaml_get "${NODE_YAML}" "ssh_user")
-TS_HOSTNAME=$(yaml_get "${NODE_YAML}" "tailscale.hostname")
-[[ -z "${SSH_USER}"    ]] && die "ssh_user not set in ${NODE_YAML}"
-[[ -z "${TS_HOSTNAME}" ]] && die "tailscale.hostname not set in ${NODE_YAML}"
-
-VPS_SSH_USER="oskar"
-VPS_SSH_HOST="100.95.58.48"
-VPS_REPO_PATH="/home/oskar/homelab-codex-ws"
-
-_SSH_OPTS=(-o StrictHostKeyChecking=accept-new -o ConnectTimeout=10 -o BatchMode=yes)
-
-_ssh_node() { ssh "${_SSH_OPTS[@]}" "${SSH_USER}@${TS_HOSTNAME}" -- "$@"; }
-_ssh_vps()  { ssh "${_SSH_OPTS[@]}" "${VPS_SSH_USER}@${VPS_SSH_HOST}" -- "$@"; }
-
-declare -A RESULTS=()
-
-# ── 1. node-agent running on <node> ──────────────────────────────────────────
-step "[${STEP_NAME}] 1/4 ${TS_HOSTNAME}: node-agent container"
-
-if [ "${DRY_RUN:-0}" = 1 ]; then
-    dryrun "ssh ${SSH_USER}@${TS_HOSTNAME} docker ps --filter name=^node-agent\$"
-    RESULTS["node-agent-running"]="skip"
-elif _ssh_node "docker ps --filter name=^node-agent\$ --filter status=running --format '{{.Names}}'" 2>/dev/null \
-     | grep -q "node-agent"; then
-    log "OK: node-agent running"
-    _ssh_node "docker ps --filter name=node-agent --format 'table {{.Names}}\t{{.Status}}'" 2>/dev/null || true
-    RESULTS["node-agent-running"]="PASS"
-else
-    warn "FAIL: node-agent is not running on ${TS_HOSTNAME}"
-    RESULTS["node-agent-running"]="FAIL"
-fi
-
-# ── 2. events in /opt/homelab/events/<node>/ ──────────────────────────────────
-step "[${STEP_NAME}] 2/4 ${TS_HOSTNAME}: events"
-
-if [ "${DRY_RUN:-0}" = 1 ]; then
-    dryrun "ssh ${SSH_USER}@${TS_HOSTNAME} find /opt/homelab/events/${TS_HOSTNAME}/ -name '*.json'"
-    RESULTS["events-present"]="skip"
-elif _ssh_node "find /opt/homelab/events/${TS_HOSTNAME}/ -name '*.json' 2>/dev/null | head -1" 2>/dev/null \
-     | grep -q ".json"; then
-    _latest=$(_ssh_node "ls -t /opt/homelab/events/${TS_HOSTNAME}/*.json 2>/dev/null | head -1" || echo "?")
-    log "OK: events present (latest: ${_latest})"
-    RESULTS["events-present"]="PASS"
-else
-    warn "FAIL: no events in /opt/homelab/events/${TS_HOSTNAME}/"
-    RESULTS["events-present"]="FAIL"
-fi
-
-# ── 3. restart observer + healthcheck ─────────────────────────────────────────
-step "[${STEP_NAME}] 3/4 VPS: restart control-plane-observer"
-
-if [ "${DRY_RUN:-0}" = 1 ]; then
-    dryrun "ssh ${VPS_SSH_USER}@${VPS_SSH_HOST} docker restart control-plane-observer"
-    dryrun "poll /opt/homelab/state/observer.heartbeat (max 30s)"
-    RESULTS["observer-healthy"]="skip"
-else
-    log "Restarting control-plane-observer on the VPS..."
-    _ssh_vps "docker restart control-plane-observer"
-
-    log "Polling observer.heartbeat (max 30s)..."
-    _ok=0
-    for _i in $(seq 1 6); do
-        sleep 5
-        _age=$(_ssh_vps "python3 -c \
-            \"import os,time; s=os.stat('/opt/homelab/state/observer.heartbeat'); \
-              print(int(time.time()-s.st_mtime))\" 2>/dev/null" || echo "999")
-        if [[ "${_age}" -lt 20 ]]; then
-            log "OK: observer.heartbeat fresh (${_age}s ago)"
-            _ok=1
-            break
-        fi
-        log "  ... ${_i}×5s, heartbeat ${_age}s old..."
-    done
-
-    if [[ "${_ok}" -eq 1 ]]; then
-        RESULTS["observer-healthy"]="PASS"
-    else
-        warn "FAIL: observer.heartbeat not refreshed after 30s"
-        warn "Check: ssh ${VPS_SSH_USER}@${VPS_SSH_HOST} docker logs control-plane-observer --tail 30"
-        RESULTS["observer-healthy"]="FAIL"
-    fi
-fi
-
-# ── 4. <node> visible in world/nodes.json ─────────────────────────────────────
-step "[${STEP_NAME}] 4/4 VPS: ${TS_HOSTNAME} in world/nodes.json"
-
-if [ "${DRY_RUN:-0}" = 1 ]; then
-    dryrun "ssh ${VPS_SSH_USER}@${VPS_SSH_HOST} python3 -c \"json.load(.../world/nodes.json)['${TS_HOSTNAME}']\""
-    RESULTS["world-state"]="skip"
-else
-    _node_status=$(_ssh_vps "python3 -c \"
-import json, sys
-try:
-    d = json.load(open('/opt/homelab/world/nodes.json'))
-    node = d.get('${TS_HOSTNAME}', {})
-    print(node.get('status', 'missing'))
-except Exception as e:
-    print('error:' + str(e))
-\"" 2>/dev/null || echo "ssh-error")
-
-    case "${_node_status}" in
-        online|offline)
-            log "OK: ${TS_HOSTNAME} in world/nodes.json (status=${_node_status})"
-            RESULTS["world-state"]="PASS"
-            ;;
-        missing)
-            warn "FAIL: ${TS_HOSTNAME} has no entry in world/nodes.json"
-            warn "Possible cause: the observer has not processed the events yet (wait 60s and retry)"
-            RESULTS["world-state"]="FAIL"
-            ;;
-        *)
-            warn "FAIL: unexpected response: ${_node_status}"
-            RESULTS["world-state"]="FAIL"
-            ;;
-    esac
-fi
-
-# ── summary table ─────────────────────────────────────────────────────────────
-echo ""
-printf '%s\n' "══════════════════════════════════════════"
-printf "  %-30s %s\n" "CHECK" "RESULT"
-printf '%s\n' "──────────────────────────────────────────"
-for _key in "node-agent-running" "events-present" "observer-healthy" "world-state"; do
-    _val="${RESULTS[${_key}]:-???}"
-    printf "  %-30s %s\n" "${_key}" "${_val}"
-done
-printf '%s\n' "══════════════════════════════════════════"
-echo ""
-
-for _val in "${RESULTS[@]}"; do
-    [[ "${_val}" == "FAIL" ]] && { warn "Verify: at least one check failed"; exit 1; }
-done
-
-log "[${STEP_NAME}] Verify OK: ${TS_HOSTNAME} registered and visible in the panel"
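Step 3 of the removed verify script polls the mtime of `observer.heartbeat` six times at 5 s intervals and accepts an age below 20 s. The same check can be sketched in Python as below. This is an illustration only: the function names, the injectable `sleep` parameter, and running the check locally instead of over SSH are assumptions, not part of the script.

```python
import os
import time
from typing import Callable, Optional


def heartbeat_age(path: str, now: Optional[float] = None) -> int:
    """Return whole seconds elapsed since `path` was last modified."""
    current = time.time() if now is None else now
    return int(current - os.stat(path).st_mtime)


def wait_for_fresh_heartbeat(
    path: str,
    max_age: int = 20,
    attempts: int = 6,
    delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the heartbeat file is younger than `max_age` seconds.

    Mirrors the shell loop: sleep first, then check, up to `attempts` times.
    """
    for _ in range(attempts):
        sleep(delay)
        try:
            if heartbeat_age(path) < max_age:
                return True
        except OSError:
            # A missing or unreadable file counts as stale, similar to the
            # "999" fallback the shell script uses when the remote call fails.
            pass
    return False
```

Passing a no-op `sleep` keeps the loop fast in tests; with the defaults the total wait is at most 30 s, matching the limit the script logs.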
--- a/services/agent-system/README.md
+++ b/services/agent-system/README.md
@ -5,24 +5,6 @@ Central runtime materializer and Operator Control Plane UI.
 - **Redis**: Central state store (on PIHA).
 - **Runtime Materializer**: Converts Redis state to JSON files in `/opt/homelab/world`.
 - **Web UI**: Exposes API endpoints and serves the Operator UI.
- **Telegram Bot**: Provides operator commands and action approvals via Telegram.
-
-#### Configuration
-Environment variables should be set in `.env` (see `env.example`).
-Key variables for the Telegram Bot:
- `TELEGRAM_BOT_TOKEN`: Your bot token from @BotFather.
- `TELEGRAM_ALLOWED_USER_IDS`: Comma-separated list of authorized Telegram User IDs.
- `CONTROL_PLANE_URL`: URL to the `agent-system-webui` (default: `http://webui:8080`).
-
-#### Telegram Commands
- `/status`: Check bot and API connectivity.
- `/summary`: System health overview.
- `/nodes`: List homelab nodes and their status.
- `/services`: Summary of services across nodes.
- `/unhealthy`: List all unhealthy components.
- `/incidents`: View active incidents.
- `/actions`: Summary of operator actions.
- `/help`: List all commands.

 #### Deployment (on PIHA)
 ```bash
--- a/services/agent-system/action-model.md
+++ b/services/agent-system/action-model.md
@ -1,52 +0,0 @@
-### Action Approval Data Model
-
-Actions are JSON files stored in `/opt/homelab/actions/{status}/{action_id}.json`.
-
-#### Statuses
- `pending`: Waiting for operator approval. AI agents create actions in this state.
- `approved`: Approved by operator, ready for execution.
- `rejected`: Rejected by operator, will not be executed.
- `running`: Currently being executed by an agent (e.g. `materializer`).
- `completed`: Successfully executed.
- `failed`: Execution failed.
-
-#### Human-in-the-Loop (HIL) Protocol
-1. **Request**: Agent identifies a required change and writes a JSON to `actions/pending/`.
-2. **Notification**: System notifies the human operator.
-3. **Audit**: Human reviews `details.reason` and `details.diff`.
-4. **Authorization**: Human moves file to `approved/`.
-5. **Execution**: Agent monitors `approved/` and executes the task.
-
-#### Schema
-```json
-{
-  "action_id": "string",
-  "service": "string",
-  "node": "string",
-  "type": "deploy_service | restart_service | rollback | scale",
-  "risk": "nominal | guarded | critical",
-  "status": "pending | approved | rejected | ...",
-  "created_at": <unix_seconds>,
-  "updated_at": <unix_seconds>,
-  "details": {
-    "image": "string",
-    "reason": "string",
-    "diff": "string"
-  },
-  "transition_history": [
-    {
-      "from": "string | null",
-      "to": "string",
-      "timestamp": <unix_seconds>,
-      "by": "string (system | operator-tg-12345 | webui)"
-    }
-  ]
-}
-```
-
-#### Workflow
-1. A system component (e.g. `runtime-materializer` or a future analyzer) creates a file in `actions/pending/`.
-2. `telegram-bot` detects the file, sends a message to allowed users.
-3. Operator clicks "Approve" or "Reject".
-4. `telegram-bot` moves the file to `actions/approved/` or `actions/rejected/` atomically, appending a transition to `transition_history`.
-5. The responsible agent (e.g. `stability-agent` on the target node) picks up the `approved` action, moves it to `running`, executes it, and finally moves it to `completed` or `failed`.
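The workflow above says the action file is moved between status directories atomically while a transition is appended to `transition_history`. A minimal sketch of one way to do that follows, assuming the `{status}/{action_id}.json` layout from this document. The function name `transition_action` and the temp-file approach are assumptions for illustration, not the code the removed bot used.

```python
import json
import os
import tempfile
import time
from pathlib import Path


def transition_action(root: Path, action_id: str, src: str, dst: str, by: str) -> Path:
    """Move an action from status `src` to `dst`, recording the transition."""
    source = root / src / f"{action_id}.json"
    target_dir = root / dst
    target_dir.mkdir(parents=True, exist_ok=True)

    data = json.loads(source.read_text())
    now = int(time.time())
    data["status"] = dst
    data["updated_at"] = now
    data.setdefault("transition_history", []).append(
        {"from": src, "to": dst, "timestamp": now, "by": by}
    )

    # Write to a temp file inside the target directory so that os.replace
    # stays on one filesystem. The .tmp suffix keeps it out of *.json globs.
    fd, tmp = tempfile.mkstemp(dir=str(target_dir), suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(data, f, indent=2)

    target = target_dir / f"{action_id}.json"
    os.replace(tmp, target)  # readers see either no file or the complete file
    source.unlink()
    return target
```

With this ordering a reader scanning `approved/` never sees a half-written file. There is still a short window in which the action exists in both directories, so consumers should treat the `status` field as the tie-breaker.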
--- a/services/agent-system/deploy.sh
+++ b/services/agent-system/deploy.sh
@ -8,13 +8,7 @@ echo ">>> Building and starting Agent System services..."
 docker compose up -d --build

 echo ">>> Services status:"
-docker ps --filter "name=agent-system" --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"
-
-if [ -z "$TELEGRAM_BOT_TOKEN" ]; then
-  echo ">>> Telegram bot status: DISABLED (token missing)"
-else
-  echo ">>> Telegram bot status: ENABLED"
-fi
+docker ps --filter "name=agent-system"

 echo ">>> Verifying API endpoints..."
 sleep 5 # Give it a moment to start
--- a/services/agent-system/docker-compose.yml
+++ b/services/agent-system/docker-compose.yml
@ -31,17 +31,3 @@ services:
    depends_on:
      - redis
    restart: unless-stopped
-
-  telegram-bot:
-    build: ./telegram-bot
-    container_name: agent-system-telegram-bot
-    environment:
-      TELEGRAM_BOT_TOKEN: ${TELEGRAM_BOT_TOKEN}
-      TELEGRAM_ALLOWED_USER_IDS: ${TELEGRAM_ALLOWED_USER_IDS}
-      CONTROL_PLANE_URL: ${CONTROL_PLANE_URL:-http://webui:8080}
-      ENABLE_LLM_FALLBACK: ${ENABLE_LLM_FALLBACK:-false}
-      OPENCLAW_BASE_URL: ${OPENCLAW_BASE_URL}
-      ACTIONS_ROOT: /opt/homelab/actions
-    volumes:
-      - /opt/homelab:/opt/homelab
-    restart: on-failure
--- a/services/agent-system/env.example
+++ b/services/agent-system/env.example
@ -1,19 +0,0 @@
-# Telegram Bot Configuration
-# Get token from @BotFather
-TELEGRAM_BOT_TOKEN=123456789:ABCdefGHIjklMNOpqrsTUVwxyz
-# Comma-separated list of Telegram User IDs
-TELEGRAM_ALLOWED_USER_IDS=12345678,87654321
-# Local control-plane API (default is internal compose address)
-CONTROL_PLANE_URL=http://webui:8080
-# Optional LLM fallback logic
-ENABLE_LLM_FALLBACK=false
-OPENCLAW_BASE_URL=http://openclaw.internal
-
-# Runtime Materializer Configuration
-REDIS_HOST=100.108.208.3
-REDIS_PORT=6379
-
-# Paths
-HOMELAB_ROOT=/opt/homelab
-ACTIONS_ROOT=/opt/homelab/actions
-WORLD_DIR=/opt/homelab/world
--- a/services/agent-system/runtime-materializer/materializer.py
+++ b/services/agent-system/runtime-materializer/materializer.py
@ -3,8 +3,6 @@ import json
 import os
 import time
 import argparse
-import urllib.request
-import urllib.error
 from datetime import datetime

 # Configuration from environment variables
@ -12,15 +10,6 @@ REDIS_HOST = os.environ.get("REDIS_HOST", "redis")
 REDIS_PORT = int(os.environ.get("REDIS_PORT", 6379))
 WORLD_DIR = os.environ.get("WORLD_DIR", "/opt/homelab/world")

-# When set, materialize from the control-plane HTTP API instead of Redis.
-# This is the authoritative source of truth: the observer writes clean world
-# state to the control-plane API, which the materializer mirrors locally so
-# the webui's /snapshot (and all other endpoints) reflect the same data.
-#
-# Example: CONTROL_PLANE_URL=http://100.95.58.48:18180
-CONTROL_PLANE_URL = os.environ.get("CONTROL_PLANE_URL", "").rstrip("/")
-
-
 def get_redis_client():
    """Returns a Redis client with decoding enabled."""
    return redis.Redis(
@ -52,61 +41,6 @@ def normalize_health(health):
        return "degraded"
    return "error"

-
-def _fetch_json(url):
-    """Fetch JSON from a URL, returning parsed data or None on error."""
-    try:
-        with urllib.request.urlopen(url, timeout=10) as resp:
-            return json.loads(resp.read())
-    except Exception as e:
-        print(f"[{datetime.now().isoformat()}] Error fetching {url}: {e}")
-        return None
-
-
-def write_json(filename, data):
-    path = os.path.join(WORLD_DIR, filename)
-    with open(path, "w") as f:
-        json.dump(data, f, indent=2)
-
-
-def materialize_from_api():
-    """Mirror world state from the control-plane API to local world files.
-
-    The control-plane observer on VPS is the single authoritative writer of
-    world state. By fetching from its HTTP API we get the same clean, pruned
-    data that the /summary endpoint serves — no stale Redis artefacts.
-
-    Returns True if all fetches succeeded and files were written, False otherwise.
-    """
-    print(f"[{datetime.now().isoformat()}] Materializing from control-plane API: {CONTROL_PLANE_URL}")
-
-    endpoints = {
-        "nodes.json":          f"{CONTROL_PLANE_URL}/nodes",
-        "services.json":       f"{CONTROL_PLANE_URL}/services",
-        "incidents.json":      f"{CONTROL_PLANE_URL}/incidents",
-        "deployments.json":    f"{CONTROL_PLANE_URL}/deployments",
-        "recommendations.json":f"{CONTROL_PLANE_URL}/recommendations",
-        "runtime-summary.json":f"{CONTROL_PLANE_URL}/summary",
-        "events.json":         f"{CONTROL_PLANE_URL}/events",
-    }
-
-    fetched = {}
-    for filename, url in endpoints.items():
-        data = _fetch_json(url)
-        if data is None:
-            print(f"[{datetime.now().isoformat()}] Aborting: failed to fetch {url}")
-            return False
-        fetched[filename] = data
-
-    os.makedirs(WORLD_DIR, exist_ok=True)
-    for filename, data in fetched.items():
-        write_json(filename, data)
-
-    svc_count = len(fetched.get("services.json") or [])
-    print(f"[{datetime.now().isoformat()}] Materialized from API: {svc_count} services → {WORLD_DIR}")
-    return True
-
-
 def materialize():
    """Reads state from Redis and writes JSON files to the world directory."""
    print(f"[{datetime.now().isoformat()}] Materializing world state...")
@ -212,6 +146,11 @@ def materialize():
        # Ensure directory exists
        os.makedirs(WORLD_DIR, exist_ok=True)

+        def write_json(filename, data):
+            path = os.path.join(WORLD_DIR, filename)
+            with open(path, "w") as f:
+                json.dump(data, f, indent=2)
+
        write_json("runtime-summary.json", summary)
        write_json("nodes.json", nodes)
        write_json("services.json", services)
@ -233,19 +172,10 @@ if __name__ == "__main__":
    parser.add_argument("--interval", type=int, default=30, help="Sleep interval between runs (seconds)")
    args = parser.parse_args()

-    if CONTROL_PLANE_URL:
-        print(f"Mode: control-plane API ({CONTROL_PLANE_URL})")
-        run_fn = materialize_from_api
-    else:
-        print(f"Mode: Redis ({REDIS_HOST}:{REDIS_PORT})")
-        run_fn = materialize
-
-    interval = int(os.environ.get("MATERIALIZE_INTERVAL", args.interval))
-
    if args.once:
-        run_fn()
+        materialize()
    else:
-        print(f"Starting materializer loop (interval: {interval}s)...")
+        print(f"Starting materializer loop (interval: {args.interval}s)...")
        while True:
-            run_fn()
-            time.sleep(interval)
+            materialize()
+            time.sleep(args.interval)
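The removed `materialize_from_api` fetches every endpoint before writing any file, so a single failed fetch leaves the existing world files untouched. A sketch of that fetch-all-then-write pattern is shown below. The name `mirror_endpoints`, the injected `fetch` callable, and the temp-file-plus-`os.replace` write are assumptions added for illustration; the original wrote each file directly with `open(path, "w")`.

```python
import json
import os
from typing import Callable, Dict, Optional


def mirror_endpoints(
    endpoints: Dict[str, str],
    fetch: Callable[[str], Optional[object]],
    world_dir: str,
) -> bool:
    """Fetch all endpoints first; write files only if every fetch succeeded."""
    fetched = {}
    for filename, url in endpoints.items():
        data = fetch(url)
        if data is None:
            # Abort before touching the disk, as the removed code did.
            return False
        fetched[filename] = data

    os.makedirs(world_dir, exist_ok=True)
    for filename, data in fetched.items():
        final_path = os.path.join(world_dir, filename)
        tmp_path = final_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(data, f, indent=2)
        # Swap the finished file into place so readers never see a partial write.
        os.replace(tmp_path, final_path)
    return True
```

Injecting `fetch` keeps the function testable without a network. As in the original, an endpoint that legitimately returns JSON `null` is indistinguishable from a failure.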
--- a/services/agent-system/scripts/create-test-action.sh
+++ b/services/agent-system/scripts/create-test-action.sh
@ -1,39 +0,0 @@
-#!/bin/bash
-# Script to create a test pending action for Telegram bot verification.
-
-ACTIONS_PENDING_DIR=${ACTIONS_ROOT:-/opt/homelab/actions}/pending
-mkdir -p "$ACTIONS_PENDING_DIR"
-
-ACTION_ID="test-$(date +%s)"
-FILE_PATH="$ACTIONS_PENDING_DIR/$ACTION_ID.json"
-
-TIMESTAMP=$(date +%s)
-
-cat <<EOF > "$FILE_PATH"
-{
-  "action_id": "$ACTION_ID",
-  "service": "frigate",
-  "node": "chelsty",
-  "type": "deploy_service",
-  "risk": "guarded",
-  "status": "pending",
-  "created_at": $TIMESTAMP,
-  "updated_at": $TIMESTAMP,
-  "details": {
-    "image": "blakeblackshear/frigate:0.13.0",
-    "reason": "Security update for Frigate",
-    "diff": "image: blakeblackshear/frigate:0.12.0 -> 0.13.0"
-  },
-  "transition_history": [
-    {
-      "from": null,
-      "to": "pending",
-      "timestamp": $TIMESTAMP,
-      "by": "system-test"
-    }
-  ]
-}
-EOF
-
-echo "Test action created: $FILE_PATH"
-echo "If the telegram-bot is running and configured, you should receive a notification."
--- a/services/agent-system/telegram-bot/Dockerfile
+++ b/services/agent-system/telegram-bot/Dockerfile
@ -1,10 +0,0 @@
-FROM python:3.11-slim
-
-WORKDIR /app
-
-COPY requirements.txt .
-RUN pip install --no-cache-dir -r requirements.txt
-
-COPY bot.py .
-
-CMD ["python", "bot.py"]
--- a/services/agent-system/telegram-bot/bot.py
+++ b/services/agent-system/telegram-bot/bot.py
@ -1,454 +0,0 @@
-import os
-import json
-import time
-import asyncio
-import logging
-import urllib.request
-import urllib.error
-from pathlib import Path
-from telegram import Update, InlineKeyboardButton, InlineKeyboardMarkup
-from telegram.ext import ApplicationBuilder, ContextTypes, CommandHandler, CallbackQueryHandler, MessageHandler, filters
-
-# Setup logging
-logging.basicConfig(
-    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
-    level=logging.INFO
-)
-logger = logging.getLogger(__name__)
-
-# Configuration
-TOKEN = os.getenv("TELEGRAM_BOT_TOKEN")
-ALLOWED_IDS = [int(i.strip()) for i in os.getenv("TELEGRAM_ALLOWED_USER_IDS", "").split(",") if i.strip()]
-ACTIONS_ROOT = Path(os.getenv("ACTIONS_ROOT", "/opt/homelab/actions"))
-CONTROL_PLANE_URL = os.getenv("CONTROL_PLANE_URL", "http://webui:8080")
-ENABLE_LLM_FALLBACK = os.getenv("ENABLE_LLM_FALLBACK", "false").lower() == "true"
-OPENCLAW_BASE_URL = os.getenv("OPENCLAW_BASE_URL")
-
-async def fetch_api(path):
-    """Helper to fetch JSON from the Control Plane API."""
-    url = f"{CONTROL_PLANE_URL.rstrip('/')}/{path.lstrip('/')}"
-    try:
-        def do_request():
-            req = urllib.request.Request(url)
-            with urllib.request.urlopen(req, timeout=5) as response:
-                if response.status != 200:
-                    return None
-                return json.loads(response.read().decode())
-        return await asyncio.to_thread(do_request)
-    except Exception as e:
-        logger.error(f"Error fetching {url}: {e}")
-        return None
-
-async def post_api(path, data):
-    """Helper to POST JSON to the Control Plane API."""
-    url = f"{CONTROL_PLANE_URL.rstrip('/')}/{path.lstrip('/')}"
-    try:
-        body = json.dumps(data).encode("utf-8")
-        def do_request():
-            req = urllib.request.Request(url, data=body, method="POST")
-            req.add_header("Content-Type", "application/json")
-            with urllib.request.urlopen(req, timeout=5) as response:
-                return response.status == 200
-        return await asyncio.to_thread(do_request)
-    except Exception as e:
-        logger.error(f"Error posting to {url}: {e}")
-        return False
-
-def _format_pending_action(action_id: str, data: dict) -> str:
-    """Build the Telegram Markdown message for a pending action notification.
-
-    Extracted so it can be unit-tested without a live Telegram connection.
-    """
-    # Supervisor writes risk_level; action-model.md legacy schema used risk.
-    risk = data.get("risk_level") or data.get("risk", "unknown")
-    message = (
-        f"⚠️ *Pending Action*\n"
-        f"ID: `{action_id}`\n"
-        f"Type: `{data.get('type', 'unknown')}`\n"
-        f"Service: `{data.get('service', 'unknown')}`\n"
-        f"Node: `{data.get('node', 'unknown')}`\n"
-        f"Risk: *{risk}*\n"
-    )
-    # description carries the human-readable substance of the action (required for
-    # alert_only actions where it is the entire operator-visible message).
-    description = data.get("description", "")
-    if description:
-        truncated = description[:300] + ("..." if len(description) > 300 else "")
-        message += f"Description: `{truncated}`\n"
-    # Legacy details block (old action-model.md schema) — kept for backwards compat.
-    if "details" in data:
-        details_str = json.dumps(data["details"], indent=2)
-        if len(details_str) > 1000:
-            details_str = details_str[:1000] + "..."
-        message += f"\nDetails:\n```json\n{details_str}\n```"
-    return message
-
-
-class ApprovalBot:
-    def __init__(self):
-        self.pending_dir = ACTIONS_ROOT / "pending"
-        self.approved_dir = ACTIONS_ROOT / "approved"
-        self.rejected_dir = ACTIONS_ROOT / "rejected"
-        # Track which action IDs we have already notified in this session to avoid spam
-        self.notified_actions = set()
-
-    async def check_pending_actions(self, context: ContextTypes.DEFAULT_TYPE):
-        """Job that periodically checks for new pending action files."""
-        if not self.pending_dir.exists():
-            return
-
-        try:
-            for action_file in self.pending_dir.glob("*.json"):
-                action_id = action_file.stem
-                if action_id in self.notified_actions:
-                    continue
-
-                try:
-                    data = json.loads(action_file.read_text())
-                    # Only notify if it's truly pending
-                    if data.get("status") == "pending":
-                        await self.notify_users(context, action_id, data)
-                        self.notified_actions.add(action_id)
-                except Exception as e:
-                    logger.error(f"Error processing action file {action_file}: {e}")
-        except Exception as e:
-            logger.error(f"Error scanning pending directory: {e}")
-
-    async def notify_users(self, context: ContextTypes.DEFAULT_TYPE, action_id: str, data: dict):
-        """Sends an approval request message to all allowed users."""
-        message = _format_pending_action(action_id, data)
-
-        keyboard = [
-            [
-                InlineKeyboardButton("✅ Approve", callback_data=f"approve:{action_id}"),
-                InlineKeyboardButton("❌ Reject", callback_data=f"reject:{action_id}"),
-            ]
-        ]
-        reply_markup = InlineKeyboardMarkup(keyboard)
-
-        for user_id in ALLOWED_IDS:
-            try:
-                await context.bot.send_message(
-                    chat_id=user_id,
-                    text=message,
-                    parse_mode="Markdown",
-                    reply_markup=reply_markup
-                )
-                logger.info(f"Notified user {user_id} about action {action_id}")
-            except Exception as e:
-                logger.error(f"Failed to notify user {user_id}: {e}")
-
-    async def handle_callback(self, update: Update, context: ContextTypes.DEFAULT_TYPE):
-        """Handles button clicks for Approve/Reject."""
-        query = update.callback_query
-        user_id = query.from_user.id
-
-        if user_id not in ALLOWED_IDS:
-            await query.answer("Unauthorized", show_alert=True)
-            return
-
-        await query.answer()
-
-        cb_data = query.data
-        if ":" not in cb_data:
-            return
-
-        action, action_id = cb_data.split(":", 1)
-        target_status = "approved" if action == "approve" else "rejected"
-
-        # Use API for mutation if available, fallback to local disk move
-        success = await post_api("/action/mutate", {"id": action_id, "status": target_status})
-        msg = "Success" if success else "API call failed"
-
-        if not success:
-            # Fallback to direct disk manipulation (original behavior)
-            success, msg = self.move_action(action_id, target_status, user_id, query.from_user.username or str(user_id))
-
-        if success:
-            status_text = "✅ Approved" if target_status == "approved" else "❌ Rejected"
-            await query.edit_message_text(
-                text=query.message.text + f"\n\n{status_text} by {query.from_user.first_name}",
-                parse_mode="Markdown"
-            )
-            # Remove from notified list as it's no longer pending
-            if action_id in self.notified_actions:
-                self.notified_actions.remove(action_id)
-        else:
-            await query.message.reply_text(f"Failed to process action {action_id}: {msg}")
-
-    def move_action(self, action_id, target_status, user_id, username):
-        """Moves action file and updates its status and history."""
-        source_path = self.pending_dir / f"{action_id}.json"
-        if not source_path.exists():
-            return False, "Action file no longer exists in pending."
-
-        target_dir = self.approved_dir if target_status == "approved" else self.rejected_dir
-        target_dir.mkdir(parents=True, exist_ok=True)
-        target_path = target_dir / f"{action_id}.json"
-
-        try:
-            data = json.loads(source_path.read_text())
-            current_status = data.get("status", "pending")
-
-            # Update data
-            data["status"] = target_status
-            data["updated_at"] = time.time()
-
-            history = data.get("transition_history", [])
-            history.append({
-                "from": current_status,
-                "to": target_status,
-                "timestamp": time.time(),
-                "by": f"tg:{username}"
-            })
-            data["transition_history"] = history
-
-            # Not truly atomic: write to the new location first, then delete the old file
-            target_path.write_text(json.dumps(data, indent=2))
-            source_path.unlink()
-            logger.info(f"Action {action_id} moved from {current_status} to {target_status} by {username}")
-            return True, "Success"
-        except Exception as e:
-            logger.error(f"Error moving action file: {e}")
-            return False, str(e)
-
-async def start_command(update: Update, context: ContextTypes.DEFAULT_TYPE):
-    """Simple start command to help users find their ID."""
-    user = update.effective_user
-    message = (
-        f"Hello {user.first_name}! 🤖\n"
-        f"Your Telegram User ID is: `{user.id}`\n\n"
-    )
-    if user.id in ALLOWED_IDS:
-        message += "✅ You are authorized to manage the homelab.\n\n"
-        message += "Use /help to see available commands."
-    else:
-        message += "❌ You are NOT authorized. Add your ID to `TELEGRAM_ALLOWED_USER_IDS`."
-
-    await update.message.reply_text(message, parse_mode="Markdown")
-
-async def status_command(update: Update, context: ContextTypes.DEFAULT_TYPE):
-    if update.effective_user.id not in ALLOWED_IDS: return
-    res = await fetch_api("/summary")
-    status = "✅ Online" if res else "❌ Unreachable"
-    message = (
-        f"🤖 *Telegram Bot Status*\n"
-        f"Control Plane API: {status}\n"
-        f"Target URL: `{CONTROL_PLANE_URL}`\n"
-    )
-    await update.message.reply_text(message, parse_mode="Markdown")
-
-async def summary_command(update: Update, context: ContextTypes.DEFAULT_TYPE):
-    if update.effective_user.id not in ALLOWED_IDS: return
-    data = await fetch_api("/summary")
-    if not data:
-        await update.message.reply_text("❌ Failed to fetch summary from Control Plane.")
-        return
-
-    msg = "📊 *System Summary*\n"
-    msg += f"Status: `{data.get('status', 'unknown')}`\n"
-    msg += f"Nodes: {data.get('node_count', 0)}\n"
-    msg += f"Services: {data.get('service_count', 0)}\n"
-    msg += f"Active Incidents: {data.get('active_incidents_count', 0)}\n"
-    if data.get('stale'):
-        msg += "\n⚠️ *Warning: Data is stale!*"
-
-    await update.message.reply_text(msg, parse_mode="Markdown")
-
-async def nodes_command(update: Update, context: ContextTypes.DEFAULT_TYPE):
-    if update.effective_user.id not in ALLOWED_IDS: return
-    nodes = await fetch_api("/nodes")
-    if nodes is None:
-        await update.message.reply_text("❌ Failed to fetch nodes.")
-        return
-
-    if not nodes:
-        await update.message.reply_text("No nodes discovered in the fleet.")
-        return
-
-    msg = "🖥️ *Nodes Status*\n"
-    for node in nodes:
-        health_icon = "✅" if node.get('health') == 'nominal' else "⚠️" if node.get('health') == 'degraded' else "❌"
-        msg += f"{health_icon} *{node.get('hostname')}*: `{node.get('status', 'unknown')}`\n"
-        msg += f"   Last seen: {node.get('last_seen', 'N/A')}\n"
-
-    await update.message.reply_text(msg, parse_mode="Markdown")
-
-async def services_command(update: Update, context: ContextTypes.DEFAULT_TYPE):
-    if update.effective_user.id not in ALLOWED_IDS: return
-    services = await fetch_api("/services")
-    if services is None:
-        await update.message.reply_text("❌ Failed to fetch services.")
-        return
-
-    # Summarize by node
-    nodes = {}
-    for s in services:
-        node = s.get("node", "unknown")
-        if node not in nodes: nodes[node] = []
-        nodes[node].append(s)
-
-    msg = "⚙️ *Services Summary*\n"
-    if not nodes:
-        msg += "No services discovered."
-    else:
-        for node, svc_list in sorted(nodes.items()):
-            nominal = len([s for s in svc_list if s.get("health") == "nominal"])
-            msg += f"• *{node}*: {nominal}/{len(svc_list)} nominal\n"
-
-    msg += "\nUse /unhealthy to see issues."
-    await update.message.reply_text(msg, parse_mode="Markdown")
-
-async def unhealthy_command(update: Update, context: ContextTypes.DEFAULT_TYPE):
-    if update.effective_user.id not in ALLOWED_IDS: return
-    services = await fetch_api("/services")
-    nodes = await fetch_api("/nodes")
-
-    msg = "⚠️ *Unhealthy Components*\n"
-    found = False
-
-    if services:
-        for s in services:
-            health = s.get("health", "").lower()
-            if health != "nominal":
-                msg += f"• Service *{s.get('name')}* on *{s.get('node')}*: `{health}`\n"
-                found = True
-
-    if nodes:
-        for n in nodes:
-            checks = n.get("checks", {})
-            if isinstance(checks, str):
-                try: checks = json.loads(checks)
-                except: checks = {}
-
-            docker = checks.get("docker", {})
-            if docker.get("status") == "ok":
-                for c in docker.get("containers", []):
-                    if c.get("state") != "running":
-                        msg += f"• Container *{c.get('name')}* on *{n.get('hostname')}*: `{c.get('state')}`\n"
-                        found = True
-
-    if not found:
-        msg += "All systems nominal. ✅"
-
-    await update.message.reply_text(msg, parse_mode="Markdown")
-
-async def incidents_command(update: Update, context: ContextTypes.DEFAULT_TYPE):
-    if update.effective_user.id not in ALLOWED_IDS: return
-    incidents = await fetch_api("/incidents")
-    if incidents is None:
-        await update.message.reply_text("❌ Failed to fetch incidents.")
-        return
-
-    active = [i for i in incidents if i.get("status") not in ("resolved", "closed")]
-    if not active:
-        await update.message.reply_text("No active incidents. ✅")
-        return
-
-    msg = "🚨 *Active Incidents*\n"
-    for inc in active:
-        severity = inc.get('severity', 'info').upper()
-        msg += f"• [{severity}] *{inc.get('type')}*: {inc.get('message')}\n"
-
-    await update.message.reply_text(msg, parse_mode="Markdown")
-
-async def actions_command(update: Update, context: ContextTypes.DEFAULT_TYPE):
-    if update.effective_user.id not in ALLOWED_IDS: return
-    actions = await fetch_api("/actions")
-    if actions is None:
-        await update.message.reply_text("❌ Actions endpoint unavailable.")
-        return
-
-    msg = "⚡ *Actions Summary*\n"
-    total = 0
-    for status, act_list in actions.items():
-        if act_list:
-            msg += f"• {status.capitalize()}: {len(act_list)}\n"
-            total += len(act_list)
-
-    if total == 0:
-        msg = "No actions recorded."
-
-    await update.message.reply_text(msg, parse_mode="Markdown")
-
-async def help_command(update: Update, context: ContextTypes.DEFAULT_TYPE):
-    msg = (
-        "📖 *Supported Commands*\n\n"
-        "/status - Check bot and API connectivity\n"
-        "/summary - System health overview\n"
-        "/nodes - List homelab nodes and their status\n"
-        "/services - Summary of services across nodes\n"
-        "/unhealthy - List all unhealthy components\n"
-        "/incidents - View active incidents\n"
-        "/actions - Summary of operator actions\n"
-        "/help - Show this help message\n\n"
-        "Free text will be handled by the guidance system."
-    )
-    await update.message.reply_text(msg, parse_mode="Markdown")
-
-async def handle_fallback(update: Update, context: ContextTypes.DEFAULT_TYPE):
-    """Handles non-command messages."""
-    if update.effective_user.id not in ALLOWED_IDS: return
-
-    if ENABLE_LLM_FALLBACK and OPENCLAW_BASE_URL:
-        # Placeholder for OpenClaw LLM fallback
-        # In a real scenario, this would call the LLM API
-        logger.info(f"LLM fallback requested for: {update.message.text}")
-
-    await update.message.reply_text(
-        "Use /summary, /nodes, /services, /unhealthy, /incidents, /actions."
-    )
-
-async def run_bot():
-    if not TOKEN:
-        print("CRITICAL: TELEGRAM_BOT_TOKEN is not set. Telegram bot will not start.")
-        # Keep process alive to not crash compose if not desired, but here we just exit
-        # Requirement says: "do not fail if Telegram token is absent, but telegram-bot should be disabled or exit cleanly"
-        return
-
-    bot_logic = ApprovalBot()
-
-    application = ApplicationBuilder().token(TOKEN).build()
-
-    application.add_handler(CommandHandler("start", start_command))
-    application.add_handler(CommandHandler("status", status_command))
-    application.add_handler(CommandHandler("summary", summary_command))
-    application.add_handler(CommandHandler("nodes", nodes_command))
-    application.add_handler(CommandHandler("services", services_command))
-    application.add_handler(CommandHandler("unhealthy", unhealthy_command))
-    application.add_handler(CommandHandler("incidents", incidents_command))
-    application.add_handler(CommandHandler("actions", actions_command))
-    application.add_handler(CommandHandler("help", help_command))
-
-    application.add_handler(MessageHandler(filters.TEXT & (~filters.COMMAND), handle_fallback))
-    application.add_handler(CallbackQueryHandler(bot_logic.handle_callback))
-
-    # Schedule the pending actions check
-    job_queue = application.job_queue
-    if job_queue:
-        job_queue.run_repeating(bot_logic.check_pending_actions, interval=10, first=5)
-    else:
-        logger.warning("JobQueue is not available. Periodic pending actions check will be skipped.")
-
-    logger.info("Starting Telegram Approval Bot...")
-    await application.initialize()
-    await application.start()
-    await application.updater.start_polling()
-
-    # Run until the application is stopped
-    stop_event = asyncio.Event()
-    try:
-        await stop_event.wait()
-    except (KeyboardInterrupt, SystemExit):
-        logger.info("Stopping bot...")
-    finally:
-        await application.stop()
-        await application.shutdown()
-
-if __name__ == "__main__":
-    try:
-        asyncio.run(run_bot())
-    except KeyboardInterrupt:
-        pass
-    except Exception as e:
-        logger.error(f"Fatal error: {e}")
--- a/services/agent-system/telegram-bot/requirements.txt
+++ b/services/agent-system/telegram-bot/requirements.txt
@ -1 +0,0 @@
-python-telegram-bot[job-queue]==20.7
--- a/services/agent-system/telegram-bot/tests/__init__.py
+++ b/services/agent-system/telegram-bot/tests/__init__.py
--- a/services/agent-system/telegram-bot/tests/conftest.py
+++ b/services/agent-system/telegram-bot/tests/conftest.py
@ -1,38 +0,0 @@
-"""Stub telegram before bot.py is imported so pytest doesn't need the real package."""
-from __future__ import annotations
-
-import sys
-import types
-from unittest.mock import MagicMock
-
-
-def _make_telegram_stub() -> types.ModuleType:
-    mod = types.ModuleType("telegram")
-    mod.Update = MagicMock
-    mod.InlineKeyboardButton = MagicMock
-    mod.InlineKeyboardMarkup = MagicMock
-    return mod
-
-
-def _make_telegram_ext_stub() -> types.ModuleType:
-    mod = types.ModuleType("telegram.ext")
-    mod.ApplicationBuilder = MagicMock
-
-    # ContextTypes.DEFAULT_TYPE is referenced as a type annotation at class-body
-    # evaluation time, so it must be a real attribute, not a dynamic MagicMock attr.
-    ContextTypesMock = MagicMock()
-    ContextTypesMock.DEFAULT_TYPE = type(None)
-    mod.ContextTypes = ContextTypesMock
-
-    mod.CommandHandler = MagicMock
-    mod.CallbackQueryHandler = MagicMock
-    mod.MessageHandler = MagicMock
-    mod.filters = MagicMock()
-    return mod
-
-
-# Insert before any import of bot.py
-if "telegram" not in sys.modules:
-    sys.modules["telegram"] = _make_telegram_stub()
-if "telegram.ext" not in sys.modules:
-    sys.modules["telegram.ext"] = _make_telegram_ext_stub()
--- a/services/agent-system/telegram-bot/tests/test_format.py
+++ b/services/agent-system/telegram-bot/tests/test_format.py
@ -1,116 +0,0 @@
-"""Tests for _format_pending_action — no Telegram connection required.
-
-telegram stubs are set up in conftest.py before this module is imported.
-"""
-from __future__ import annotations
-
-import sys
-from pathlib import Path
-
-import pytest
-
-sys.path.insert(0, str(Path(__file__).parent.parent))
-from bot import _format_pending_action
-
-
-# ---------------------------------------------------------------------------
-# Bug 1 — risk_level field
-# ---------------------------------------------------------------------------
-
-def test_risk_level_shown_when_present():
-    data = {
-        "type": "container_restart", "service": "homeassistant",
-        "node": "chelsty-ha", "risk_level": "low",
-    }
-    msg = _format_pending_action("container-restart-chelsty-ha-homeassistant", data)
-    assert "Risk: *low*" in msg
-    assert "unknown" not in msg
-
-
-def test_risk_falls_back_to_legacy_risk_key():
-    data = {
-        "type": "redeploy", "service": "mosquitto",
-        "node": "chelsty-infra", "risk": "guarded",
-    }
-    msg = _format_pending_action("redeploy-chelsty-infra-mosquitto", data)
-    assert "Risk: *guarded*" in msg
-
-
-def test_risk_unknown_when_both_absent():
-    data = {"type": "redeploy", "service": "foo", "node": "bar"}
-    msg = _format_pending_action("redeploy-bar-foo", data)
-    assert "Risk: *unknown*" in msg
-
-
-# ---------------------------------------------------------------------------
-# Bug 2 — description field
-# ---------------------------------------------------------------------------
-
-def test_description_shown_for_alert_only():
-    data = {
-        "type": "alert_only", "service": "homeassistant",
-        "node": "chelsty-ha", "risk_level": "info",
-        "description": "3 entities unavailable for >1h",
-    }
-    msg = _format_pending_action("alert-ha-entity-unavailable-chelsty-ha", data)
-    assert "3 entities unavailable for >1h" in msg
-    assert "Description:" in msg
-
-
-def test_description_shown_for_container_restart():
-    data = {
-        "type": "container_restart", "service": "homeassistant",
-        "node": "chelsty-ha", "risk_level": "low",
-        "description": "Restart 'homeassistant' on chelsty-ha: HA WebSocket unresponsive",
-    }
-    msg = _format_pending_action("container-restart-chelsty-ha-homeassistant", data)
-    assert "HA WebSocket unresponsive" in msg
-
-
-def test_description_absent_no_crash():
-    data = {"type": "redeploy", "service": "foo", "node": "bar", "risk_level": "guarded"}
-    msg = _format_pending_action("redeploy-bar-foo", data)
-    assert "Description:" not in msg
-    assert "Risk: *guarded*" in msg
-
-
-def test_description_truncated_at_300_chars():
-    long_desc = "x" * 400
-    data = {
-        "type": "alert_only", "service": "homeassistant",
-        "node": "chelsty-ha", "risk_level": "info",
-        "description": long_desc,
-    }
-    msg = _format_pending_action("alert-ha-foo-chelsty-ha", data)
-    assert "x" * 300 in msg
-    assert "..." in msg
-    assert "x" * 301 not in msg
-
-
-# ---------------------------------------------------------------------------
-# Combined — real HA alert_only action shape
-# ---------------------------------------------------------------------------
-
-def test_ha_alert_only_full_action():
-    """Mirrors an actual alert_only action written by supervisor._generate_ha_alert_only."""
-    data = {
-        "action_id": "alert-ha-entity-unavailable-chelsty-ha",
-        "type": "alert_only",
-        "node": "chelsty-ha",
-        "service": "homeassistant",
-        "risk_level": "info",
-        "confidence": 1.0,
-        "description": "3 entities unavailable for >1h: sensor.power, binary_sensor.window",
-        "status": "pending",
-        "payload": {
-            "location_tag": "chelsty",
-            "reason": "ha_entity_unavailable_long",
-            "count": 3,
-        },
-    }
-    msg = _format_pending_action(data["action_id"], data)
-    assert "alert_only" in msg
-    assert "chelsty-ha" in msg
-    assert "Risk: *info*" in msg
-    assert "3 entities unavailable" in msg
-    assert "unknown" not in msg
--- a/services/agent-system/webui/index.html
+++ b/services/agent-system/webui/index.html
@ -277,9 +277,8 @@
           <option value="maintenance">MAINTENANCE</option>
         </select>
       </div>
-      <div class="header-actions" style="display:flex; gap:8px; align-items:center">
+      <div class="header-actions">
         <button onclick="refreshData()">Refresh</button>
-        <button id="copy-ai-btn" onclick="copyForAI()">Copy for AI</button>
       </div>
     </header>

@ -692,73 +691,6 @@
      }
    }

-    async function copyForAI() {
-      const btn = document.getElementById('copy-ai-btn');
-      const original = btn.textContent;
-      btn.textContent = 'Copying...';
-      btn.disabled = true;
-
-      try {
-        const snap = await fetchData('/snapshot');
-        if (!snap) throw new Error('snapshot fetch failed');
-
-        const now = new Date(snap.timestamp);
-        const dateStr = now.toISOString().slice(0, 16).replace('T', ' ');
-        const lines = [];
-
-        lines.push(`=== HOMELAB SNAPSHOT ${dateStr} ===`);
-
-        if (snap.nodes && snap.nodes.length > 0) {
-          lines.push('NODES: ' + snap.nodes.map(n =>
-            `${(n.hostname || n.id || '?').toUpperCase()} ${(n.health || 'unknown').toUpperCase()}`
-          ).join(', '));
-        } else {
-          lines.push('NODES: none');
-        }
-
-        if (snap.non_nominal_services && snap.non_nominal_services.length > 0) {
-          lines.push('ERRORS: ' + snap.non_nominal_services.map(s =>
-            `${s.name} (${s.node}) - ${s.health}`
-          ).join(', '));
-        } else {
-          lines.push(`ERRORS: none (${snap.nominal_service_count} nominal)`);
-        }
-
-        const activeIncidents = (snap.incidents || []).filter(i => !['resolved', 'closed'].includes(i.status));
-        if (activeIncidents.length > 0) {
-          lines.push('INCIDENTS: ' + activeIncidents.map(i =>
-            `[${i.severity}] ${i.message} (${i.node})`
-          ).join('; '));
-        } else {
-          lines.push('INCIDENTS: none');
-        }
-
-        if (snap.events && snap.events.length > 0) {
-          lines.push(`EVENTS (last ${snap.events.length}):`);
-          snap.events.forEach(ev => {
-            const ts = ev.timestamp
-              ? new Date(ev.timestamp * 1000).toISOString().slice(11, 19)
-              : '?';
-            const svc = ev.service ? '/' + ev.service : '';
-            lines.push(`  ${ts} [${ev.severity || ev.level || '?'}] ${ev.type} - ${ev.message || ''} (${ev.node || ''}${svc})`);
-          });
-        } else {
-          lines.push('EVENTS (last 10): none');
-        }
-
-        const s = snap.summary || {};
-        lines.push(`SUMMARY: status=${s.status || '?'} nodes=${s.node_count ?? '?'} services=${s.service_count ?? '?'} incidents=${s.incident_count ?? '?'}`);
-
-        await navigator.clipboard.writeText(lines.join('\n'));
-        btn.textContent = 'Copied!';
-        setTimeout(() => { btn.textContent = original; btn.disabled = false; }, 2000);
-      } catch (e) {
-        console.error('copyForAI error:', e);
-        btn.textContent = 'Error';
-        setTimeout(() => { btn.textContent = original; btn.disabled = false; }, 2000);
-      }
-    }
-
     // Initial load
     refreshData();
     // Poll for updates
--- a/services/agent-system/webui/web.py
+++ b/services/agent-system/webui/web.py
@ -1,7 +1,6 @@
 import json
 import os
 import time
-from datetime import datetime, timezone
 from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
 from pathlib import Path

@ -68,22 +67,12 @@ def current_recommendations():


 def current_summary():
-    path = WORLD_DIR / "runtime-summary.json"
-    summary = read_json_file(path, default={})
+    summary = read_json_file(WORLD_DIR / "runtime-summary.json", default={})
     if summary:
-        last_update_val = summary.get("last_update")
-        if last_update_val:
-            try:
-                if isinstance(last_update_val, str):
-                    last_update = datetime.fromisoformat(last_update_val.replace('Z', '+00:00')).timestamp()
-                else:
-                    last_update = float(last_update_val)
-            except Exception:
-                last_update = os.path.getmtime(path)
-        else:
-            last_update = os.path.getmtime(path)
-        summary["last_update"] = last_update
-        summary["stale"] = (time.time() - last_update) > 60
+        # Check for staleness
+        mtime = os.path.getmtime(WORLD_DIR / "runtime-summary.json")
+        summary["last_update"] = mtime
+        summary["stale"] = (time.time() - mtime) > 60  # Stale if older than 60s
     return summary


@ -152,28 +141,6 @@ def mutate_action(action_id, target_status):
         return False, str(e)


-def get_snapshot():
-    nodes = current_nodes()
-    services = current_services()
-    incidents = current_incidents()
-    events = current_events()
-    summary = current_summary()
-
-    non_nominal = [s for s in services if s.get("health") != "nominal"]
-    nominal_count = len(services) - len(non_nominal)
-
-    return {
-        "timestamp": datetime.now(timezone.utc).isoformat(),
-        "summary": summary,
-        "nodes": nodes,
-        "non_nominal_services": non_nominal,
-        "nominal_service_count": nominal_count,
-        "total_service_count": len(services),
-        "incidents": incidents,
-        "events": events[:10],
-    }
-
-
 def send_json(status, payload, handler):
     body = (json.dumps(payload) + "\n").encode("utf-8")
     handler.send_response(status)
@ -221,10 +188,6 @@ class Handler(BaseHTTPRequestHandler):
             send_json(200, current_actions(), self)
             return

-        if self.path == "/snapshot":
-            send_json(200, get_snapshot(), self)
-            return
-
         if self.path in ("/", "/index.html"):
             body = (STATIC_DIR / "index.html").read_bytes()
             self.send_response(200)
--- a/services/brain-watchdog/Dockerfile
+++ b/services/brain-watchdog/Dockerfile
@ -1,10 +0,0 @@
-FROM python:3.11-slim
-
-WORKDIR /app
-
-COPY src/ src/
-
-ENV PYTHONUNBUFFERED=1
-ENV PYTHONPATH=/app/src
-
-CMD ["python", "-m", "brain_watchdog.main"]
--- a/services/brain-watchdog/docker-compose.yml
+++ b/services/brain-watchdog/docker-compose.yml
@ -1,30 +0,0 @@
-services:
-  brain-watchdog:
-    build: .
-    container_name: brain-watchdog
-    restart: unless-stopped
-
-    env_file:
-      - /opt/homelab/config/brain-watchdog/.env
-
-    volumes:
-      - brain_watchdog_data:/data
-
-    healthcheck:
-      test:
-        - "CMD"
-        - "python"
-        - "-c"
-        - |
-          import os, time, json, sys
-          p = '/data/state.json'
-          if not os.path.exists(p): sys.exit(1)
-          age = time.time() - os.path.getmtime(p)
-          sys.exit(0 if age < 300 else 1)
-      interval: 1m
-      timeout: 10s
-      retries: 3
-      start_period: 30s
-
-volumes:
-  brain_watchdog_data: