docs: session summary 2026-05-27 + update observer/control-plane/chelsty docs

docs/sessions/2026-05-27.md (new):
- Full session record: problems found, all commits shipped, end state
- Written in Polish per operator preference for session notes
- Known limitations: SLZB-06U offline, ezsp→ember migration pending

docs/observer-runtime.md:
- Document per-node checkpoint format (replaces old global checkpoint)
- Add service_healthy / service_recovered resolution behavior
- Document ghost key pruning (_prune_stale_world patterns)
- Add event type reference table (negative vs positive)

docs/vps-control-plane.md:
- Add container names and network_mode: host detail
- Document monitor:false, NODE_ALIAS_MAP, auto-cancel behavior
- Add piha agent-system materializer integration note
- Rewrite recovery section with actionable bootstrap-flood diagnosis
- Add action state machine (pending→approved→running→completed/cancelled)

docs/chelsty-runtime.md:
- Add chelsty-infra/chelsty-ha node table
- Document docker-compose v1 constraint (always use docker-compose, not docker compose)
- Add mosquitto network_mode:host + z2m extra_hosts:host-gateway explanation
- Add z2m config writable requirement (EROFS failure mode documented)
- Add chelsty-ha monitor:false rationale
- Add minimal configuration.yaml template for z2m

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 16:18:31 +02:00 · 2026-05-27 16:18:31 +02:00 · 603e10a364
parent 7277bdc27f
commit 603e10a364
4 changed files with 396 additions and 122 deletions
--- a/docs/chelsty-runtime.md
+++ b/docs/chelsty-runtime.md
@@ -1,61 +1,154 @@
 # CHELSTY Runtime
-This document describes the runtime environment and deployment flow for CHELSTY, an offline-capable home automation edge node.
+This document describes the runtime environment and deployment flow for CHELSTY, an offline-capable home automation edge node split across two VMs.
 | Node | Role | Services |
 |------|------|----------|
 | `chelsty-infra` | LTE edge hypervisor | Mosquitto, Zigbee2MQTT, stability-agent, node-agent |
 | `chelsty-ha` | Home Assistant VM | homeassistant (no node-agent — see below) |
 Both nodes share an LTE uplink and must function fully offline (Zigbee, MQTT, HA automations) without any connectivity to SATURN, VPS, or Forgejo.
 ## Runtime Layout
-The CHELSTY runtime is located at `/opt/homelab`.
-
- `/opt/homelab/config/`: Service-specific configurations and compose overrides.
- `/opt/homelab/data/`: Persistent data for services.
- `/opt/homelab/logs/`: Service logs.
-
-### Key Service Locations
- **Mosquitto**: `/opt/homelab/config/mosquitto/`
- **Zigbee2MQTT**: `/opt/homelab/config/zigbee2mqtt/`
+```
+/opt/homelab/
+├── config/          # Service-specific configs and secrets (not in Git)
+│   ├── mosquitto/
+│   └── zigbee2mqtt/
+├── data/            # Persistent service data
+│   ├── mosquitto/   # Persistence DB, password file
+│   └── zigbee2mqtt/
+│       └── data/    # z2m config, coordinator backup, network key
+└── logs/
+```
 ## SLZB-06U Integration
-CHELSTY uses a SMLIGHT SLZB-06U Zigbee coordinator connected via Ethernet/TCP.
+CHELSTY uses a SMLIGHT SLZB-06U Zigbee coordinator connected over Ethernet/TCP.
- **Coordinator IP**: 192.168.1.105
+- **Coordinator IP**: `192.168.1.105`
- **Port**: 6638
+- **Port**: `6638`
- **Protocol**: TCP (ezsp adapter)
+- **Adapter**: `ezsp` (deprecated; migrating to `ember` is recommended and only requires setting `adapter: ember` in `configuration.yaml`)
 - **Zigbee2MQTT config key**: `serial.port: tcp://192.168.1.105:6638`
-Zigbee2MQTT is configured to connect to this coordinator over the local network.
+⚠️ Never use `/dev/ttyUSB0` — the coordinator is always TCP-only on this site.
-## Offline & LTE Assumptions
- **WAN Resilience**: All core automation (Zigbee, MQTT) runs locally on CHELSTY.
- **Connectivity**: LTE provides intermittent uplink for remote management and Tailscale access.
- **Home Assistant**: Runs on `chelsty-ha` node, connecting to the Mosquitto broker on `chelsty-infra`.
+## Networking Constraints
+### Mosquitto — `network_mode: host`
+Mosquitto runs with `network_mode: host`, so it listens directly on the host at port 1883. Host processes and host-networked containers reach it at `localhost:1883`; bridge-networked containers must go through the host gateway (see the Zigbee2MQTT section). **Do not change this.**
+
 ### Zigbee2MQTT — bridge network + extra_hosts
 Zigbee2MQTT runs in a bridge-networked container (needed for port mapping compatibility with docker-compose v1). To reach the host-networked Mosquitto:
 ```yaml
 # hosts/chelsty-infra/runtime/zigbee2mqtt/docker-compose.override.yml
 services:
  zigbee2mqtt:
    extra_hosts:
      - "mosquitto:host-gateway"
 ```
 This maps the `mosquitto` hostname inside the z2m container to the Docker host gateway IP, so `mqtt://mosquitto:1883` reaches the host-networked Mosquitto process.
 **Why not `network_mode: host` for z2m?**  
 chelsty-infra runs docker-compose v1 (1.29.2). In v1, `network_mode: host` cannot coexist with `ports:` declared in the base `docker-compose.yml` — raises `InvalidArgument`. The `extra_hosts` approach avoids this.
 ## Zigbee2MQTT Config Location
 The `configuration.yaml` **must be writable** — z2m migrates and rewrites it on startup. It lives in the data directory:
 ```
 /opt/homelab/data/zigbee2mqtt/data/configuration.yaml
 ```
 This path is mounted read-write by the base `docker-compose.yml`:
 ```yaml
 volumes:
  - /opt/homelab/data/zigbee2mqtt/data:/app/data
 ```
 Do **not** mount `configuration.yaml` as a separate `:ro` volume — z2m will fail with `EROFS`.
 ### Minimal configuration.yaml
 ```yaml
 homeassistant: true
 permit_join: false
 mqtt:
  base_topic: zigbee2mqtt
  server: mqtt://mosquitto:1883
 serial:
  port: tcp://192.168.1.105:6638
  adapter: ezsp
 frontend:
  port: 8080
 advanced:
  log_level: info
 ```
 ## chelsty-ha — No node-agent
 `chelsty-ha` does not have a node-agent deployed. Home Assistant is monitored indirectly: if MQTT goes silent on `chelsty-infra`, HA is likely down.
 In `hosts/chelsty-ha/services.yaml`:
 ```yaml
 services:
  homeassistant:
    monitor: false   # No node-agent; suppresses supervisor action generation
 ```
 Remove `monitor: false` once node-agent is bootstrapped on this VM.
 ## Deployment Flow
-1. **Initial Bootstrap**:
+### Initial Bootstrap
   Run the bootstrap script on the CHELSTY node:
 ```bash
 ./scripts/bootstrap/chelsty-runtime.sh
 ```
-2. **Manual Configuration**:
+### Deploy services
   - Edit `/opt/homelab/config/zigbee2mqtt/.env` with MQTT credentials.
   - Add Mosquitto user:
     ```bash
     sudo mosquitto_passwd -b /opt/homelab/data/mosquitto/config/password.txt <user> <password>
     ```
 3. **Service Deployment**:
   Use the staged deployment runtime:
 ```bash
 ./scripts/deploy/deploy-node.sh chelsty-infra
 ./scripts/deploy/deploy-node.sh chelsty-ha
 ```
-## Recovery Procedure
+### Manual (SSH) — chelsty-infra uses docker-compose v1
 ```bash
 ssh oskar@100.122.201.22
 cd ~/homelab-codex-ws/services/<service>
 docker-compose -f docker-compose.yml \
  -f ../../hosts/chelsty-infra/runtime/<service>/docker-compose.override.yml \
  up -d --build --force-recreate
 ```
-In case of runtime failure:
-1. Verify Docker and Compose plugin: `docker compose version`
-2. Re-run bootstrap script to ensure directory structure and basic configs.
-3. Check Mosquitto logs: `tail -f /opt/homelab/data/mosquitto/log/mosquitto.log`
-4. Verify SLZB-06U reachability: `ping 192.168.1.105`
+> **Note:** `docker compose` (v2) is **not** available on chelsty-infra; always use `docker-compose` (hyphenated, v1 1.29.2).
+
+## Recovery Procedures
+
+### Mosquitto stopped
 ```bash
 ssh oskar@100.122.201.22 "docker start mosquitto"
 # Ensure the restart policy is correct (run on chelsty-infra as well, not locally):
 ssh oskar@100.122.201.22 "docker update --restart unless-stopped mosquitto"
 ```
 ### Zigbee2MQTT won't start
 1. Check logs: `docker logs zigbee2mqtt --tail 50`
 2. Verify SLZB-06U reachable from host: `nc -zv 192.168.1.105 6638`
 3. Verify config is not empty: `cat /opt/homelab/data/zigbee2mqtt/data/configuration.yaml`
 4. If config missing, recreate from the minimal template above
 ### SLZB-06U unreachable
 An `EHOSTUNREACH` error for `192.168.1.105:6638` means the coordinator is offline or the LAN is down. Zigbee2MQTT keeps retrying, so no restart is needed once the coordinator returns.
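
The reachability check above (`nc -zv 192.168.1.105 6638`) can also be scripted, for example from a health probe. The following is only a minimal sketch: `tcp_reachable` is a hypothetical helper name, not part of node-agent, and the 3-second timeout is an assumed default.

```python
import socket


def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout`.

    Hypothetical helper, equivalent in spirit to `nc -zv <host> <port>`.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # OSError covers EHOSTUNREACH, ECONNREFUSED and timeouts alike.
        return False
```

Calling `tcp_reachable("192.168.1.105", 6638)` from chelsty-infra would then distinguish "coordinator unreachable" from a Zigbee2MQTT-side problem without touching the container.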
 ## Critical Backup Sets
 | Data | Path |
 |------|------|
 | HA config + DB | `/opt/homelab/data/homeassistant/` on chelsty-ha |
 | z2m config + coordinator backup + network key | `/opt/homelab/data/zigbee2mqtt/data/` |
 | Mosquitto persistence + password file | `/opt/homelab/data/mosquitto/` |
 | SLZB-06U coordinator state | Backup via SLZB-06U web UI at `192.168.1.105` |
 > ⚠️ The Zigbee network key is in `configuration.yaml` or `coordinator_backup.json` — losing it requires re-pairing all devices.
--- a/docs/observer-runtime.md
+++ b/docs/observer-runtime.md
@@ -7,57 +7,92 @@ The Observer Runtime is a lightweight agent responsible for synthesizing the ope
 The observer follows a filesystem-first approach, consuming append-only events and generating a normalized world model. It is designed to be idempotent, resumable, and resilient to intermittent node connectivity.
 ### Inputs
- `/opt/homelab/events/`: Normalized JSON events.
+- `/opt/homelab/events/`: Normalized JSON events (one `.json` file per event, organized by date and node).
- `/opt/homelab/state/`: Deployment stage markers and internal observer checkpoint.
+- `/opt/homelab/state/observer_checkpoint.json`: Per-node checkpoint dict (see below).
 - `/opt/homelab/logs/`: Detailed execution logs and diagnostics.
 - Repository Inventory: `inventory/topology.yaml` and `hosts/*/services.yaml`.
 ### World Model Output
 Generated under `/opt/homelab/world/`:
- `nodes.json`: Current node availability, roles, and last seen timestamps.
+- `nodes.json`: Current node availability, roles, disk/memory pressure, last seen timestamps. Dict keyed by node name.
- `services.json`: Service health status and links to active incidents.
+- `services.json`: Service health status and links to active incidents. Dict keyed by `"node/service"`.
 - `deployments.json`: Tracking of active and historical deployment runs by `correlation_id`.
 - `incidents.json`: Correlated operational issues, including repeat failures and resolution status.
 - `runtime-summary.json`: High-level overview for dashboards and planner agents.
-## Incident Lifecycle
-The observer implements lightweight incident correlation:
-1.  **Detection**: When a `service_unhealthy` or `healthcheck_failed` event is consumed, a new incident is created or an existing active incident for that service is updated.
-2.  **Correlation**: Multiple failure events for the same service on the same node are collapsed into a single incident, tracking the `occurrence_count`.
-3.  **Diagnostics**: Deployment failures (`deployment_failed`) automatically attach references to diagnostic files if present in the event payload.
-4.  **Resolution**: A `service_recovered` event for a service will transition any active incidents for that service to a `resolved` state.
-### Example Incident JSON
+## Checkpoint Format
+The observer tracks per-node progress to avoid silently skipping event directories:
 ```json
 {
-  "inc-1715518800-saturn-mosquitto": {
-    "id": "inc-1715518800-saturn-mosquitto",
-    "node": "saturn",
-    "service": "mosquitto",
-    "status": "resolved",
-    "severity": "error",
-    "started_at": "2026-05-12T12:05:00Z",
-    "last_occurrence": "2026-05-12T12:06:00Z",
-    "occurrence_count": 2,
-    "events": [
-      "2026-05-12T12:05:00Z",
-      "2026-05-12T12:06:00Z"
-    ],
-    "correlation_id": "hc-1",
-    "resolved_at": "2026-05-12T12:10:00Z"
+  "node_checkpoints": {
+    "vps":            "/opt/homelab/events/2026-05-27/vps/evt-vps-1234.json",
+    "piha":           "/opt/homelab/events/2026-05-27/piha/evt-piha-5678.json",
+    "chelsty-infra":  "/opt/homelab/events/2026-05-27/chelsty-infra/evt-chelsty-infra-9012.json"
   }
 }
 ```
 A single global checkpoint (`last_processed_file`) was replaced with this per-node dict because the old approach silently skipped any node directory that sorts alphabetically before the last-seen node (e.g. `piha/` would be skipped when the checkpoint pointed to `vps/`).
 **Reset:** Delete `/opt/homelab/state/observer_checkpoint.json`. The observer will reprocess all events and rebuild world state from scratch.
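
To make the skipping behaviour concrete, here is a small sketch comparing the two strategies. It is not the observer's actual code: `pending_global` and `pending_per_node` are hypothetical names, and it assumes event paths follow `.../events/<date>/<node>/<file>.json` and that files are processed in lexical path order.

```python
from pathlib import PurePosixPath


def node_of(event_path: str) -> str:
    """Node name is assumed to be the parent directory of the event file."""
    return PurePosixPath(event_path).parent.name


def pending_global(paths, last_processed_file):
    """Old behaviour: a single cursor over the globally sorted path list."""
    return [p for p in sorted(paths)
            if last_processed_file is None or p > last_processed_file]


def pending_per_node(paths, node_checkpoints):
    """New behaviour: every node directory has its own cursor."""
    pending = []
    for p in sorted(paths):
        cursor = node_checkpoints.get(node_of(p))
        if cursor is None or p > cursor:
            pending.append(p)
    return pending


base = "/opt/homelab/events/2026-05-27"
paths = [
    f"{base}/piha/evt-piha-0001.json",
    f"{base}/piha/evt-piha-0002.json",  # arrives after the vps event was processed
    f"{base}/vps/evt-vps-0001.json",
]
# Global cursor points at vps/, so the new piha event is never returned.
skipped = pending_global(paths, f"{base}/vps/evt-vps-0001.json")
# Per-node cursors pick it up.
found = pending_per_node(paths, {
    "piha": f"{base}/piha/evt-piha-0001.json",
    "vps": f"{base}/vps/evt-vps-0001.json",
})
```

With the global cursor, `skipped` is empty because every `piha/` path sorts before the `vps/` checkpoint; with per-node cursors, `found` contains the second piha event.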
 ## Event Types
 ### Negative events (create/escalate incidents)
 - `service_unhealthy`, `healthcheck_failed` — open or increment an active incident
 - `deployment_failed` — record failure in deployments.json
 ### Positive events (resolve state)
 - `service_healthy` — marks service status as `healthy` **and** resolves any active incident for that service
 - `service_recovered` — alias, same effect
 - `deployment_completed` — marks deployment as completed
 ### Node events
 - `node_online`, `node_offline` — update node status in nodes.json
 - `disk_pressure_*` — set `disk_pressure` field on the node record
 ## Incident Lifecycle
 1.  **Detection**: A `service_unhealthy` or `healthcheck_failed` event creates or increments an active incident.
 2.  **Correlation**: Multiple failure events for the same `node/service` are collapsed into one incident, incrementing `occurrence_count`.
 3.  **Resolution**: A `service_healthy` or `service_recovered` event resolves any active incident for that service, setting `status: resolved` and `resolved_at`.
 4.  **Expiry**: Resolved incidents older than 7 days are pruned from world state by `_prune_stale_world()`.
 ### Example Incident JSON
 ```json
 {
  "inc-1715518800-vps-observer": {
    "id": "inc-1715518800-vps-observer",
    "node": "vps",
    "service": "observer",
    "status": "resolved",
    "severity": "error",
    "started_at": 1715518800.0,
    "last_occurrence": 1715518860.0,
    "occurrence_count": 2,
    "trigger_type": "containers_not_running",
    "resolved_at": 1715519100.0
  }
 }
 ```
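
The lifecycle can be sketched as a small state update over the incidents dict. This is an illustration only: the event field names (`type`, `node`, `service`, `timestamp`), the `active` status value, and the ID scheme `inc-<epoch>-<node>-<service>` (inferred from the example above) are assumptions, not the observer's confirmed schema.

```python
NEGATIVE = {"service_unhealthy", "healthcheck_failed"}
POSITIVE = {"service_healthy", "service_recovered"}


def apply_event(incidents: dict, event: dict) -> None:
    """Mutate `incidents` in place according to a single event."""
    node, service = event["node"], event["service"]
    ts = event["timestamp"]
    # At most one active incident per node/service is assumed.
    active = next(
        (inc for inc in incidents.values()
         if inc["node"] == node and inc["service"] == service
         and inc["status"] == "active"),
        None,
    )
    if event["type"] in NEGATIVE:
        if active is None:
            # Detection: open a new incident.
            inc_id = f"inc-{int(ts)}-{node}-{service}"
            incidents[inc_id] = {
                "id": inc_id, "node": node, "service": service,
                "status": "active", "severity": "error",
                "started_at": ts, "last_occurrence": ts,
                "occurrence_count": 1,
            }
        else:
            # Correlation: collapse repeats into the existing incident.
            active["last_occurrence"] = ts
            active["occurrence_count"] += 1
    elif event["type"] in POSITIVE and active is not None:
        # Resolution: close the active incident.
        active["status"] = "resolved"
        active["resolved_at"] = ts
```

Feeding it `service_unhealthy`, `healthcheck_failed`, then `service_healthy` for `vps/observer` yields one resolved incident with `occurrence_count` 2, the same shape as the example record.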
 ## World State Pruning
 `_prune_stale_world()` runs every reconcile cycle and removes:
 1. **Stale nodes** — nodes not present in `inventory/topology.yaml` (e.g. ghost nodes created when `NODE_NAME` was unset and fell back to the container's 12-char hex ID).
 2. **Services of stale nodes** — all `node/service` keys whose node was pruned.
 3. **Ghost service keys** — service keys whose service-name portion matches the pattern `<12hexchars>_<name>` (Docker internal stale-state artifacts, created when node-agent used `c.name` instead of the compose label).
 4. **Expired incidents** — resolved incidents older than 7 days.
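
The four pruning rules can be sketched as a pure function over the world dicts. This is not the real `_prune_stale_world()`: `prune_world` is a hypothetical stand-in, the regex assumes lowercase hex, and `resolved_at` is assumed to be an epoch float as in the example incident.

```python
import re

# Service names like "0123456789ab_mosquitto" (12 hex chars, underscore, name).
GHOST_KEY = re.compile(r"^[0-9a-f]{12}_.+$")
SEVEN_DAYS = 7 * 24 * 3600


def prune_world(nodes, services, incidents, topology_nodes, now):
    """Return pruned copies of (nodes, services, incidents)."""
    # Rule 1: drop nodes that are not in the topology.
    live_nodes = {n: v for n, v in nodes.items() if n in topology_nodes}

    live_services = {}
    for key, value in services.items():
        node, _, service = key.partition("/")
        if node not in topology_nodes:
            continue  # Rule 2: service belongs to a stale node.
        if GHOST_KEY.match(service):
            continue  # Rule 3: ghost key with a hash prefix.
        live_services[key] = value

    # Rule 4: drop resolved incidents older than seven days.
    live_incidents = {
        inc_id: inc for inc_id, inc in incidents.items()
        if not (inc["status"] == "resolved"
                and now - inc["resolved_at"] > SEVEN_DAYS)
    }
    return live_nodes, live_services, live_incidents
```

Returning new dicts rather than mutating in place keeps the sketch easy to test; the real implementation may well edit world state directly.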
 ## Runtime Behavior
 ### Idempotency
-The observer processes events in order. If the world state is lost, deleting the checkpoint file (`/opt/homelab/state/observer_checkpoint.json`) will cause the observer to re-process all events and rebuild the world state.
+The observer processes events in order. Deleting the checkpoint and restarting replays all events and produces the same world state.
 ### Resumability
 The observer tracks the last processed event file for each node in its checkpoint. Upon restart, it continues from the next available event in each node directory.
 ### Deployment Tracking
-Deployments are tracked via `correlation_id`. The observer synthesizes the start, end, and status of each deployment run, providing a clear history of changes to the environment.
+Deployments are tracked via `correlation_id`. The observer synthesizes the start, end, and status of each deployment run from events.
 ### Topology Filtering
 World-state entries for nodes not listed in `inventory/topology.yaml` are removed during pruning. This prevents transient bootstrap noise from polluting world state.
--- a/docs/sessions/2026-05-27.md
+++ b/docs/sessions/2026-05-27.md
@@ -0,0 +1,103 @@
 # SESSION: Stabilizing the homelab multi-agent system
 **DATE:** 2026-05-27  
 **RESULT:** System NOMINAL (97/97 services, 0 errors)
 ---
 ## PROBLEMS FOUND
 - stability-agent did not generate remediation actions: only redeploy, no container_restart
 - mosquitto on chelsty-infra went down and nothing restarted it (restart policy was `no`)
 - zigbee2mqtt had never been deployed on chelsty-infra
 - node-agent was an empty skeleton: it did not emit `service_healthy`, so `services.json` was always empty
 - ghost services: node-agent used `c.name` (which can return `<12hex>_real-name`) instead of the `com.docker.compose.service` label
 - the materializer on piha read from its local Redis instead of the control-plane API; Redis held 80 stale entries with ghost keys, so "Copy for AI" returned old data
 - observer used a single global checkpoint instead of per-node ones; it silently skipped event directories that sort before the current checkpoint
 - supervisor did not cancel resolved actions, so the pending queue grew without bound
 - the `service_healthy` event did not close active incidents
 - NODE_ALIAS_MAP was not configured, causing a node-name mismatch between events and topology
 - chelsty-ha was wrongly in monitoring scope; it has no node-agent
 ---
 ## FIXES SHIPPED (commits in master)
 ```
 7277bdc Fix Copy for AI: materializer fetches from control-plane API instead of Redis
 b40b832 Fix ghost service keys from hash-prefixed Docker container names
 28e9534 observer: service_healthy resolves active incidents
 46ae92b supervisor: also cancel pending actions for services removed from desired state
 410bfe7 zigbee2mqtt: config goes in data dir (writable), not separate ro mount
 b3912fe zigbee2mqtt: use extra_hosts host-gateway instead of network_mode: host
 61e07f4 zigbee2mqtt override: clear ports list for docker-compose v1 host network compat
 51002d4 Fix pending actions: node_exporter, zigbee2mqtt, chelsty-ha monitoring
 fb7828b supervisor: auto-cancel pending actions when drift is resolved
 2f19657 fix(node-agent): unique event IDs per service to prevent same-second overwrites
 267742c vps/node-agent: add network_mode: host for control-plane health probe
 4e8968f Fix service health tracking: emit service_healthy, control-plane endpoint, checkpoint migration
 f4a8db9 fix(observer): per-node-directory checkpoints replace single global checkpoint
 a5a3e22 fix(node-agent): skip SSH config file in rsync to avoid UID ownership errors
 2349de5 fix(node-agent): correct VPS_EVENTS_HOST to actual VPS Tailscale IP
 65bac4e fix(node-agent): mount host SSH key into container for event shipping
 96bf326 fix(observer+operator-ui): fix stale world state, dict→list API, event time filter
 ae33cce feat(node-agent): add runtime overrides for piha, solaria, chelsty-infra
 c5c080b feat(vps): add node-agent runtime override with NODE_NAME=vps
 01b7758 feat(node-agent): implement health monitor and safe cleanup policy
 ```
 ### Details of key fixes
 **fix(observer): per-node checkpoints**  
 A single global `last_processed_file` checkpoint silently skipped event directories that sort alphabetically before the last processed node (e.g. piha/ < vps/). It was replaced with a per-node dict: `{"node_checkpoints": {"piha": "...", "vps": "..."}}`.
 **fix(observer): ghost key pruning**  
 `_prune_stale_world()` now removes `services.json` entries whose service key matches the pattern `<12hexchars>_<name>`, which are artifacts of Docker's internal state tracking.
 **fix(node-agent): canonical container name**  
 `check_containers()` now uses the `com.docker.compose.service` label as the canonical name. Fallback: strip the hash prefix from `c.name`. Containers in the `created` state are skipped (Docker stale-state artifacts).
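
The canonical-name rule can be illustrated with plain data instead of Docker SDK objects. This is a sketch under assumptions: `canonical_service_name` and `should_report` are hypothetical helpers, not the actual `check_containers()` code, and the hash prefix is assumed to be 12 lowercase hex characters followed by an underscore.

```python
import re

HASH_PREFIX = re.compile(r"^[0-9a-f]{12}_")
COMPOSE_SERVICE_LABEL = "com.docker.compose.service"


def canonical_service_name(name: str, labels: dict) -> str:
    """Prefer the compose service label; fall back to the de-prefixed name."""
    label = labels.get(COMPOSE_SERVICE_LABEL)
    if label:
        return label
    # Fallback: "0123456789ab_mosquitto" -> "mosquitto".
    return HASH_PREFIX.sub("", name)


def should_report(status: str) -> bool:
    """Containers in the `created` state are treated as stale artifacts."""
    return status != "created"
```

Using the label first means a renamed or re-created container still maps to the same `node/service` key in world state.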
 **fix(node-agent): service_healthy emission**  
 Node-agent now emits `service_healthy` for every running managed container on each cycle. Without this, `services.json` was always empty and the supervisor generated a flood of "missing service" redeploys.
 **fix(supervisor): auto-cancel resolved actions**  
 `_cancel_resolved_pending_actions()` moves pending actions to `cancelled/` when:
 - the service became healthy (`drift_resolved_auto`)
 - the service was removed from desired state (`service_removed_from_desired_state`)
 **fix(supervisor): monitor:false**  
 The `monitor: false` field in `services.yaml` excludes a service from supervisor action generation. Used for `homeassistant` on chelsty-ha (no node-agent).
 **fix(agent-system/materializer): control-plane API as source**  
 The materializer on piha now fetches data from the VPS control-plane API (`CONTROL_PLANE_URL=http://100.95.58.48:18180`) instead of the local Redis. Redis held 80 stale entries. The Redis path is kept as a fallback.
 **fix(chelsty-infra/zigbee2mqtt): mosquitto networking**  
 Mosquitto runs with `network_mode: host`, so bridge containers cannot reach it via localhost. Solution: `extra_hosts: - "mosquitto:host-gateway"` in the z2m override. `network_mode: host` is not used for z2m because it conflicts with `ports:` in docker-compose v1 (1.29.2 on chelsty-infra).
 **fix(chelsty-infra/zigbee2mqtt): writable config**  
 z2m migrates and rewrites `configuration.yaml` on startup. The config must live in the data directory: `/opt/homelab/data/zigbee2mqtt/data/configuration.yaml` (read-write mount), not in a separate `:ro` volume.
 ---
 ## END STATE
 | Node | Status | Services |
 |------|--------|---------|
 | vps | online | control-plane (4), node-agent, node_exporter, stability-agent |
 | piha | online | agent-system (4), node-agent, stability-agent, monitoring stack |
 | solaria | online | node-agent, stability-agent, AI workloads |
 | chelsty-infra | online | mosquitto, zigbee2mqtt (z2m connects once the SLZB-06U is back online), node-agent, stability-agent |
 | chelsty-ha | — | homeassistant (monitor:false; no node-agent, HA monitored indirectly via MQTT) |
 **Action queue:** 0 pending, 0 approved, 0 running  
 **Incidents:** 0 active  
 **Ghost service keys:** 0  
 ---
 ## KNOWN LIMITATIONS / TODO
 - SLZB-06U (Zigbee coordinator) is offline: `192.168.1.105:6638` returns EHOSTUNREACH from chelsty-infra. Probably a hardware or network problem on the 192.168.1.0/24 side. z2m starts and serves an error page on :8080; it will connect automatically once the coordinator returns.
 - The `ezsp` adapter in the z2m config is deprecated; migrating to `ember` is recommended. No new configuration is needed, only changing the field to `adapter: ember` in `configuration.yaml`.
 - chelsty-ha has no node-agent. Add one when a machine is available, or via manual bootstrap.
 - Redis on piha still holds old `homelab:nodes:*`, `homelab:incidents:*` etc. keys; the materializer no longer uses them in API mode, so they can be purged.
--- a/docs/vps-control-plane.md
+++ b/docs/vps-control-plane.md
@@ -1,83 +1,126 @@
 # VPS Control Plane
-The VPS Control Plane is the orchestration brain of the homelab platform. It runs on the Hetzner VPS and provides observability, automated reconciliation, and a web-based operator interface.
+The VPS Control Plane is the orchestration brain of the homelab platform. It runs on the Hetzner VPS (Tailscale IP: `100.95.58.48`) and provides observability, automated reconciliation, and a web-based operator interface.
 ## Architecture
-The control plane consists of four core services running as a Docker Compose stack:
+The control plane consists of four core services running as a Docker Compose stack under `services/control-plane/`:
-1.  **Observer**: Synthesizes world state from events.
+| Container | Role |
-2.  **Supervisor**: Detects drifts between desired and actual state.
+|-----------|------|
-3.  **Executor**: Executes approved actions from the queue.
+| `control-plane-observer` | Synthesizes world state from events in `/opt/homelab/events/` |
-4.  **Operator UI**: Web interface for system monitoring and action approval.
+| `control-plane-supervisor` | Detects drift between desired state (`hosts/*/services.yaml`) and actual state (`world/services.json`); writes pending actions |
 | `control-plane-executor` | Executes approved actions from `/opt/homelab/actions/approved/` |
 | `control-plane-ui` | Web interface for system monitoring and action approval; serves port 18180 |
-All services adhere to **filesystem-first** semantics, using `/opt/homelab/` as the primary data exchange and persistence layer.
+All services use **filesystem-first** semantics with `/opt/homelab/` as the data exchange layer. All four run with `network_mode: host` and as UID 1000 (`homelab` user).
-## Deployment Flow
-### 1. Prerequisites
- Target VPS node must be onboarded (Tailscale active, Docker installed).
- Repository cloned to `/home/oskar/homelab-codex-ws`.
-### 2. Bootstrap
-Run the local deployment script on the VPS to initialize the runtime filesystem and start the stack:
+## Supervisor Behavior
+### Desired State
+Loaded from `hosts/*/services.yaml` each reconcile cycle. Services with `monitor: false` are silently skipped — use this for services without a node-agent (e.g. `homeassistant` on `chelsty-ha`).
+### Drift Types
+- `missing_service` — service is in desired state but absent from `services.json`
 - `unhealthy_service` — service exists in `services.json` but `status != healthy`
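
The two drift types, together with the `monitor: false` skip, can be sketched as follows. This is a minimal illustration, not the supervisor's code: `detect_drift` is a hypothetical name, `desired` is assumed to be already parsed from `hosts/*/services.yaml` into `{node: {service: spec}}`, and `actual` is assumed to mirror `world/services.json` keyed by `"node/service"`.

```python
def detect_drift(desired: dict, actual: dict) -> list:
    """Return (node, service, drift_type) tuples for every monitored service."""
    drifts = []
    for node, services in desired.items():
        for service, spec in services.items():
            # An empty YAML value parses to None; treat it as "no options".
            if (spec or {}).get("monitor", True) is False:
                continue  # e.g. homeassistant on chelsty-ha
            record = actual.get(f"{node}/{service}")
            if record is None:
                drifts.append((node, service, "missing_service"))
            elif record.get("status") != "healthy":
                drifts.append((node, service, "unhealthy_service"))
    return drifts
```

The sketch also shows why an empty `services.json` floods the queue: every monitored service falls into the `missing_service` branch until node-agent starts emitting `service_healthy`.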
 ### Action Types
 | Trigger | Action type | Risk |
 |---------|-------------|------|
 | `containers_not_running`, `mqtt_unreachable` | `container_restart` | low |
 | Any other / unknown | `redeploy` | guarded |
 | Node `disk_pressure: high` | `disk_cleanup` | guarded |
 ### Action ID Stability
 Action IDs are deterministic: `redeploy-{node}-{service}` or `container-restart-{node}-{service}`. The same drift always produces the same filename, making reconcile truly idempotent across supervisor restarts.
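
A sketch of how the trigger table and the deterministic IDs fit together. The function names and the action dict layout are assumptions for illustration; `disk_cleanup` is left out because its ID format is not given here.

```python
from typing import Optional

RESTART_TRIGGERS = {"containers_not_running", "mqtt_unreachable"}


def plan_action(node: str, service: str, trigger: Optional[str] = None) -> dict:
    """Map a drift trigger to an action with a deterministic ID."""
    if trigger in RESTART_TRIGGERS:
        return {"id": f"container-restart-{node}-{service}",
                "type": "container_restart", "risk": "low",
                "node": node, "service": service}
    # Any other or unknown trigger falls back to a guarded redeploy.
    return {"id": f"redeploy-{node}-{service}",
            "type": "redeploy", "risk": "guarded",
            "node": node, "service": service}


def pending_path(action: dict, actions_root: str = "/opt/homelab/actions") -> str:
    """Same drift -> same ID -> same file, so re-planning overwrites in place."""
    return f"{actions_root}/pending/{action['id']}.json"
```

Because the filename depends only on node, service, and action type, running the planner twice for the same drift produces one pending file rather than two.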
 ### Auto-Cancel
 Pending `redeploy` and `container_restart` actions are automatically moved to `cancelled/` when:
 - **`drift_resolved_auto`** — the service becomes `healthy` in actual state
 - **`service_removed_from_desired_state`** — the service was removed from `services.yaml` or marked `monitor: false`
 Only `pending` actions are auto-cancelled. Approved/running actions have been committed to by the operator and are never cancelled automatically.
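
The auto-cancel rules reduce to a small decision function. As before, this is a hedged sketch: `cancel_reason` is a hypothetical name, and the action fields (`state`, `type`, `node`, `service`) and the order of the two checks are assumptions rather than the behaviour of `_cancel_resolved_pending_actions()`.

```python
from typing import Optional

AUTO_CANCELLABLE = {"redeploy", "container_restart"}


def cancel_reason(action: dict, desired_keys: set, actual: dict) -> Optional[str]:
    """Return the cancel reason for a pending action, or None to keep it."""
    if action["state"] != "pending":
        return None  # approved/running actions are never auto-cancelled
    if action["type"] not in AUTO_CANCELLABLE:
        return None
    key = f"{action['node']}/{action['service']}"
    if key not in desired_keys:
        # Removed from services.yaml or marked monitor: false.
        return "service_removed_from_desired_state"
    if actual.get(key, {}).get("status") == "healthy":
        return "drift_resolved_auto"
    return None
```

A caller would move the action file to `cancelled/` and record the returned reason whenever the result is not `None`.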
 ### Node Name Resolution
 The supervisor supports a `NODE_ALIAS_MAP` environment variable (JSON string) to map event/world-state node names to canonical topology names:
 ```bash
-cd services/control-plane
+NODE_ALIAS_MAP='{"node-2": "chelsty-infra", "node-1": "piha"}'
-bash deploy-local.sh
 ```
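
Reading the variable could look like the sketch below. The helper names are hypothetical; the only facts taken from the text are the variable name `NODE_ALIAS_MAP` and that its value is a JSON object mapping alias to canonical node name.

```python
import json
import os


def load_alias_map(environ=os.environ) -> dict:
    """Parse NODE_ALIAS_MAP (a JSON object string); empty or unset means no aliases."""
    raw = environ.get("NODE_ALIAS_MAP", "")
    if not raw.strip():
        return {}
    parsed = json.loads(raw)
    if not isinstance(parsed, dict):
        raise ValueError("NODE_ALIAS_MAP must be a JSON object")
    return {str(alias): str(canonical) for alias, canonical in parsed.items()}


def canonical_node(name: str, alias_map: dict) -> str:
    """Unknown names pass through unchanged."""
    return alias_map.get(name, name)
```

Passing names through unchanged when no alias exists keeps nodes that already report their topology name working without any configuration.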
-### 3. Verification
+## Deployment
 Verify the stack is healthy using the deployment script or check container status on the VPS:
 ### From SATURN (primary control node)
 ```bash
-# Check status via deploy script
+# Full deploy via SSH
 ./scripts/deploy/deploy-control-plane.sh --ssh
-# Manual status check on VPS
+# Or manually:
 ssh oskar@100.95.58.48 "cd ~/homelab-codex-ws && git pull origin master && cd services/control-plane && docker compose up -d --build --force-recreate"
 ```
 ### Direct on VPS
 ```bash
 cd ~/homelab-codex-ws/services/control-plane
 docker compose up -d --build --force-recreate
 ```
 `deploy-local.sh` (in `services/control-plane/`) additionally creates the required `/opt/homelab/` directory structure and sets ownership to UID 1000 (requires `sudo`) before starting the stack. If the directories already exist, run the `docker compose` step directly.
 ### Verification
 ```bash
 # On VPS
 docker ps --filter "name=control-plane"
 curl -s http://localhost:18180/summary | python3 -m json.tool
 ```
-## Operational Workflows
-### Action Approval
-1. Access the Operator UI (via Tailscale IP or Nginx Proxy Manager).
-2. Navigate to **Action Queue**.
-3. Review **Pending** actions recommended by the Supervisor.
-4. Click **Approve** to move actions to the execution queue.
-### Recovery Flow
-In case of control plane failure:
-1. Check logs using `docker logs`.
-2. Restart stack using the local deployment script: `bash deploy-local.sh`.
-3. Rebuild world state: Delete `/opt/homelab/state/observer_checkpoint.json` and redeploy.
+## Action Approval Workflow
+```
+Supervisor writes → /opt/homelab/actions/pending/<id>.json
+                 → Operator UI (port 18180) or Telegram Bot notifies
+                 → Operator clicks Approve
+                 → /opt/homelab/actions/approved/<id>.json
+                 → Executor executes → completed / failed
+```
+Possible action states: `pending → approved → running → completed / failed / rejected`  
+Auto-cancel path: `pending → cancelled/`
-### Upgrade Flow
+## Recovery
-To deploy updates from the SOLARIA/control host:
+
 ### World state is stale or corrupt
 ```bash
 # On VPS — delete checkpoint to force full replay
 rm /opt/homelab/state/observer_checkpoint.json
 docker restart control-plane-observer
 ```
 ### Flood of pending actions after bootstrap
 Check if node-agent is running and emitting `service_healthy` events on each node. Without `service_healthy`, the supervisor sees all services as missing and queues redeployments every cycle.
 ```bash
-./scripts/deploy/deploy-control-plane.sh --ssh
+# Check node-agent on each node
 ssh oskar@<node> "docker ps --filter name=node-agent && docker logs node-agent --tail 20"
 ```
-### Rollback Semantics
-Since the runtime is filesystem-first and append-only:
-1. Roll back the repository state to a previous commit.
-2. Restart the control plane stack.
-3. The supervisor will detect drift against the older (rolled-back) desired state and recommend actions to restore it.
+### Rebuild from scratch
+```bash
+ssh oskar@100.95.58.48 "cd ~/homelab-codex-ws/services/control-plane && docker compose up -d --build --force-recreate"
+```
 ## Runtime Safety
 - **Readonly Mounts**: Most services mount the repository as `:ro` to prevent accidental mutations.
 - **Least-Privilege**: UI, Observer, and Supervisor run as non-root `homelab` user (UID 1000).
 - **Filesystem Isolation**: Clear separation between `/repo` (code/inventory) and `/opt/homelab` (runtime state).
 ## Integration
 ### piha agent-system webui (port 18180 on piha)
 The `agent-system-runtime-materializer` on piha polls the VPS control-plane API every 10 seconds and mirrors world state to piha's local `/opt/homelab/world/`. This ensures the **"Copy for AI"** button in the piha webui (`agent-system-webui`) reflects the same clean state as the VPS API.
 Override: `hosts/piha/runtime/agent-system/docker-compose.override.yml` — sets `CONTROL_PLANE_URL=http://100.95.58.48:18180`.
 ### Nginx Proxy Manager
-Configure a proxy host in NPM to point to `http://control-plane-ui:8080`. Ensure Websockets are enabled if the UI uses them.
+The operator UI at port 18180 can be proxied via NPM for external access. No WebSocket support required.
 ### Log Locations
- Container logs: `docker compose logs`
+- Container logs: `docker compose logs -f` (from `services/control-plane/`)
 - Runtime events: `/opt/homelab/events/YYYY-MM-DD/`
 - World state: `/opt/homelab/world/`
- Diagnostics: `/opt/homelab/logs/`
+- Action queue: `/opt/homelab/actions/{pending,approved,running,completed,failed,cancelled}/`