docs: session summary 2026-05-27 + update observer/control-plane/chelsty docs

docs/sessions/2026-05-27.md (new):
- Full session record: problems found, all commits shipped, end state
- Written in Polish per operator preference for session notes
- Known limitations: SLZB-06U offline, ezsp→ember migration pending

docs/observer-runtime.md:
- Document per-node checkpoint format (replaces old global checkpoint)
- Add service_healthy / service_recovered resolution behavior
- Document ghost key pruning (_prune_stale_world patterns)
- Add event type reference table (negative vs positive)

docs/vps-control-plane.md:
- Add container names and network_mode: host detail
- Document monitor:false, NODE_ALIAS_MAP, auto-cancel behavior
- Add piha agent-system materializer integration note
- Rewrite recovery section with actionable bootstrap-flood diagnosis
- Add action state machine (pending→approved→running→completed/cancelled)

docs/chelsty-runtime.md:
- Add chelsty-infra/chelsty-ha node table
- Document docker-compose v1 constraint (always use docker-compose, not docker compose)
- Add mosquitto network_mode:host + z2m extra_hosts:host-gateway explanation
- Add z2m config writable requirement (EROFS failure mode documented)
- Add chelsty-ha monitor:false rationale
- Add minimal configuration.yaml template for z2m

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 16:18:31 +02:00 · 603e10a364
parent 7277bdc27f
commit 603e10a364
4 changed files with 396 additions and 122 deletions
--- a/docs/chelsty-runtime.md
+++ b/docs/chelsty-runtime.md
@@ -1,61 +1,154 @@
 # CHELSTY Runtime

-This document describes the runtime environment and deployment flow for CHELSTY, an offline-capable home automation edge node.
+This document describes the runtime environment and deployment flow for CHELSTY, an offline-capable home automation edge node split across two VMs.
+
+| Node | Role | Services |
+|------|------|----------|
+| `chelsty-infra` | LTE edge hypervisor | Mosquitto, Zigbee2MQTT, stability-agent, node-agent |
+| `chelsty-ha` | Home Assistant VM | homeassistant (no node-agent — see below) |
+
+Both nodes share an LTE uplink and must function fully offline (Zigbee, MQTT, HA automations) without any connectivity to SATURN, VPS, or Forgejo.

 ## Runtime Layout

-The CHELSTY runtime is located at `/opt/homelab`.
-
- `/opt/homelab/config/`: Service-specific configurations and compose overrides.
- `/opt/homelab/data/`: Persistent data for services.
- `/opt/homelab/logs/`: Service logs.
-
-### Key Service Locations
- **Mosquitto**: `/opt/homelab/config/mosquitto/`
- **Zigbee2MQTT**: `/opt/homelab/config/zigbee2mqtt/`
+```
+/opt/homelab/
+├── config/          # Service-specific configs and secrets (not in Git)
+│   ├── mosquitto/
+│   └── zigbee2mqtt/
+├── data/            # Persistent service data
+│   ├── mosquitto/   # Persistence DB, password file
+│   └── zigbee2mqtt/
+│       └── data/    # z2m config, coordinator backup, network key
+└── logs/
+```

 ## SLZB-06U Integration

-CHELSTY uses a SMLIGHT SLZB-06U Zigbee coordinator connected via Ethernet/TCP.
+CHELSTY uses a SMLIGHT SLZB-06U Zigbee coordinator connected over Ethernet/TCP.

- **Coordinator IP**: 192.168.1.105
- **Port**: 6638
- **Protocol**: TCP (ezsp adapter)
+- **Coordinator IP**: `192.168.1.105`
+- **Port**: `6638`
+- **Adapter**: `ezsp` (deprecated; migrating to `ember` is recommended and only requires setting `adapter: ember` in `configuration.yaml`)
+- **Zigbee2MQTT config key**: `serial.port: tcp://192.168.1.105:6638`

-Zigbee2MQTT is configured to connect to this coordinator over the local network.
+⚠️ Never use `/dev/ttyUSB0` — the coordinator is always TCP-only on this site.

-## Offline & LTE Assumptions
+## Networking Constraints

- **WAN Resilience**: All core automation (Zigbee, MQTT) runs locally on CHELSTY.
- **Connectivity**: LTE provides intermittent uplink for remote management and Tailscale access.
- **Home Assistant**: Runs on `chelsty-ha` node, connecting to the Mosquitto broker on `chelsty-infra`.
+### Mosquitto — `network_mode: host`
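+Mosquitto runs with `network_mode: host`, so host processes and other host-networked containers can reach it at `localhost:1883`. Bridge-networked containers cannot reach it through `localhost` and need the `host-gateway` mapping described in the next section. **Do not change this.**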
+
+### Zigbee2MQTT — bridge network + extra_hosts
+Zigbee2MQTT runs in a bridge-networked container (needed for port mapping compatibility with docker-compose v1). To reach the host-networked Mosquitto:
+
+```yaml
+# hosts/chelsty-infra/runtime/zigbee2mqtt/docker-compose.override.yml
+services:
+  zigbee2mqtt:
+    extra_hosts:
+      - "mosquitto:host-gateway"
+```
+
+This maps the `mosquitto` hostname inside the z2m container to the Docker host gateway IP, so `mqtt://mosquitto:1883` reaches the host-networked Mosquitto process.
+
+**Why not `network_mode: host` for z2m?**  
+chelsty-infra runs docker-compose v1 (1.29.2). In v1, `network_mode: host` cannot coexist with `ports:` declared in the base `docker-compose.yml`; the combination raises `InvalidArgument`. The `extra_hosts` approach avoids this.
+
+## Zigbee2MQTT Config Location
+
+The `configuration.yaml` **must be writable** — z2m migrates and rewrites it on startup. It lives in the data directory:
+
+```
+/opt/homelab/data/zigbee2mqtt/data/configuration.yaml
+```
+
+This path is mounted read-write by the base `docker-compose.yml`:
+```yaml
+volumes:
+  - /opt/homelab/data/zigbee2mqtt/data:/app/data
+```
+
+Do **not** mount `configuration.yaml` as a separate `:ro` volume — z2m will fail with `EROFS`.
+
+### Minimal configuration.yaml
+```yaml
+homeassistant: true
+permit_join: false
+mqtt:
+  base_topic: zigbee2mqtt
+  server: mqtt://mosquitto:1883
+serial:
+  port: tcp://192.168.1.105:6638
+  adapter: ezsp
+frontend:
+  port: 8080
+advanced:
+  log_level: info
+```
+
+## chelsty-ha — No node-agent
+
+`chelsty-ha` does not have a node-agent deployed. Home Assistant is monitored indirectly: if MQTT goes silent on `chelsty-infra`, HA is likely down.
+
+In `hosts/chelsty-ha/services.yaml`:
+```yaml
+services:
+  homeassistant:
+    monitor: false   # No node-agent; suppresses supervisor action generation
+```
+
+Remove `monitor: false` once node-agent is bootstrapped on this VM.

 ## Deployment Flow

-1. **Initial Bootstrap**:
-   Run the bootstrap script on the CHELSTY node:
-   ```bash
-   ./scripts/bootstrap/chelsty-runtime.sh
-   ```
+### Initial Bootstrap
+```bash
+./scripts/bootstrap/chelsty-runtime.sh
+```

-2. **Manual Configuration**:
-   - Edit `/opt/homelab/config/zigbee2mqtt/.env` with MQTT credentials.
-   - Add Mosquitto user:
-     ```bash
-     sudo mosquitto_passwd -b /opt/homelab/data/mosquitto/config/password.txt <user> <password>
-     ```
+### Deploy services
+```bash
+./scripts/deploy/deploy-node.sh chelsty-infra
+./scripts/deploy/deploy-node.sh chelsty-ha
+```

-3. **Service Deployment**:
-   Use the staged deployment runtime:
-   ```bash
-   ./scripts/deploy/deploy-node.sh chelsty-infra
-   ./scripts/deploy/deploy-node.sh chelsty-ha
-   ```
+### Manual (SSH) — chelsty-infra uses docker-compose v1
+```bash
+ssh oskar@100.122.201.22
+cd ~/homelab-codex-ws/services/<service>
+docker-compose -f docker-compose.yml \
+  -f ../../hosts/chelsty-infra/runtime/<service>/docker-compose.override.yml \
+  up -d --build --force-recreate
+```

-## Recovery Procedure
+> **Note:** `docker compose` (v2) is **not** available on chelsty-infra — always use `docker-compose` (hyphenated, v1 1.29.2).

-In case of runtime failure:
-1. Verify Docker and Compose plugin: `docker compose version`
-2. Re-run bootstrap script to ensure directory structure and basic configs.
-3. Check Mosquitto logs: `tail -f /opt/homelab/data/mosquitto/log/mosquitto.log`
-4. Verify SLZB-06U reachability: `ping 192.168.1.105`
+## Recovery Procedures
+
+### Mosquitto stopped
+```bash
+ssh oskar@100.122.201.22 "docker start mosquitto"
+# Ensure the restart policy is correct (also run on chelsty-infra, not locally):
+ssh oskar@100.122.201.22 "docker update --restart unless-stopped mosquitto"
+```
+
+### Zigbee2MQTT won't start
+1. Check logs: `docker logs zigbee2mqtt --tail 50`
+2. Verify SLZB-06U reachable from host: `nc -zv 192.168.1.105 6638`
+3. Verify config is not empty: `cat /opt/homelab/data/zigbee2mqtt/data/configuration.yaml`
+4. If config missing, recreate from the minimal template above
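The reachability check in step 2 can also be scripted, for example from a health probe. The following is a minimal sketch, not part of the repository: the function name `coordinator_reachable` and the 3-second default timeout are assumptions chosen for illustration. It does the same thing as `nc -zv`, opening a TCP connection and reporting whether that succeeded.

```python
import socket


def coordinator_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    Hypothetical helper: mirrors `nc -zv <host> <port>`.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, EHOSTUNREACH, and timeouts.
        return False
```

On chelsty-infra this would be called as `coordinator_reachable("192.168.1.105", 6638)`. A `False` result only says the TCP port did not answer; it does not distinguish a powered-off coordinator from a LAN fault.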
+
+### SLZB-06U unreachable
+An `EHOSTUNREACH` error for `192.168.1.105:6638` means the coordinator is offline or the LAN is down. Zigbee2MQTT will keep retrying, so no restart is needed once the coordinator returns.
+
+## Critical Backup Sets
+
+| Data | Path |
+|------|------|
+| HA config + DB | `/opt/homelab/data/homeassistant/` on chelsty-ha |
+| z2m config + coordinator backup + network key | `/opt/homelab/data/zigbee2mqtt/data/` |
+| Mosquitto persistence + password file | `/opt/homelab/data/mosquitto/` |
+| SLZB-06U coordinator state | Backup via SLZB-06U web UI at `192.168.1.105` |
+
+> ⚠️ The Zigbee network key is in `configuration.yaml` or `coordinator_backup.json` — losing it requires re-pairing all devices.
--- a/docs/observer-runtime.md
+++ b/docs/observer-runtime.md
@@ -7,57 +7,92 @@ The Observer Runtime is a lightweight agent responsible for synthesizing the ope
 The observer follows a filesystem-first approach, consuming append-only events and generating a normalized world model. It is designed to be idempotent, resumable, and resilient to intermittent node connectivity.

 ### Inputs
- `/opt/homelab/events/`: Normalized JSON events.
- `/opt/homelab/state/`: Deployment stage markers and internal observer checkpoint.
- `/opt/homelab/logs/`: Detailed execution logs and diagnostics.
+- `/opt/homelab/events/`: Normalized JSON events (one `.json` file per event, organized by date and node).
+- `/opt/homelab/state/observer_checkpoint.json`: Per-node checkpoint dict (see below).
 - Repository Inventory: `inventory/topology.yaml` and `hosts/*/services.yaml`.

 ### World Model Output
 Generated under `/opt/homelab/world/`:
- `nodes.json`: Current node availability, roles, and last seen timestamps.
- `services.json`: Service health status and links to active incidents.
+- `nodes.json`: Current node availability, roles, disk/memory pressure, last seen timestamps. Dict keyed by node name.
+- `services.json`: Service health status and links to active incidents. Dict keyed by `"node/service"`.
 - `deployments.json`: Tracking of active and historical deployment runs by `correlation_id`.
 - `incidents.json`: Correlated operational issues, including repeat failures and resolution status.
 - `runtime-summary.json`: High-level overview for dashboards and planner agents.

-## Incident Lifecycle
+## Checkpoint Format

-The observer implements lightweight incident correlation:
+The observer tracks per-node progress to avoid silently skipping event directories:

-1.  **Detection**: When a `service_unhealthy` or `healthcheck_failed` event is consumed, a new incident is created or an existing active incident for that service is updated.
-2.  **Correlation**: Multiple failure events for the same service on the same node are collapsed into a single incident, tracking the `occurrence_count`.
-3.  **Diagnostics**: Deployment failures (`deployment_failed`) automatically attach references to diagnostic files if present in the event payload.
-4.  **Resolution**: A `service_recovered` event for a service will transition any active incidents for that service to a `resolved` state.
-
-### Example Incident JSON
 ```json
 {
-  "inc-1715518800-saturn-mosquitto": {
-    "id": "inc-1715518800-saturn-mosquitto",
-    "node": "saturn",
-    "service": "mosquitto",
-    "status": "resolved",
-    "severity": "error",
-    "started_at": "2026-05-12T12:05:00Z",
-    "last_occurrence": "2026-05-12T12:06:00Z",
-    "occurrence_count": 2,
-    "events": [
-      "2026-05-12T12:05:00Z",
-      "2026-05-12T12:06:00Z"
-    ],
-    "correlation_id": "hc-1",
-    "resolved_at": "2026-05-12T12:10:00Z"
+  "node_checkpoints": {
+    "vps":            "/opt/homelab/events/2026-05-27/vps/evt-vps-1234.json",
+    "piha":           "/opt/homelab/events/2026-05-27/piha/evt-piha-5678.json",
+    "chelsty-infra":  "/opt/homelab/events/2026-05-27/chelsty-infra/evt-chelsty-infra-9012.json"
  }
 }
 ```

+A single global checkpoint (`last_processed_file`) was replaced with this per-node dict because the old approach silently skipped any node directory that sorts alphabetically before the last-seen node (e.g. `piha/` would be skipped when the checkpoint pointed to `vps/`).
+
+**Reset:** Delete `/opt/homelab/state/observer_checkpoint.json`. The observer will reprocess all events and rebuild world state from scratch.
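To illustrate why a per-node dict avoids the skipping problem, here is a minimal sketch of how unprocessed events could be selected. It is not the observer's actual code: the function names are hypothetical, and it assumes the `<date>/<node>/<event>.json` layout described above and that lexicographic path order matches processing order within a node.

```python
import json
from pathlib import Path


def load_checkpoints(path: Path) -> dict[str, str]:
    """Read the per-node checkpoint dict; a missing file means full replay."""
    if not path.exists():
        return {}
    return json.loads(path.read_text()).get("node_checkpoints", {})


def pending_events(events_root: Path, checkpoints: dict[str, str]) -> list[Path]:
    """Return unprocessed event files, oldest first within each node."""
    per_node: dict[str, list[Path]] = {}
    # Assumed layout: <events_root>/<YYYY-MM-DD>/<node>/<event>.json
    for event in events_root.glob("*/*/*.json"):
        per_node.setdefault(event.parent.name, []).append(event)

    pending: list[Path] = []
    for node in sorted(per_node):
        last_done = checkpoints.get(node)
        for event in sorted(per_node[node], key=str):
            # Each node is compared only against its own checkpoint, so a
            # checkpoint under vps/ can never hide files under piha/.
            if last_done is None or str(event) > last_done:
                pending.append(event)
    return pending
```

A node with no checkpoint entry is replayed from its first event, which is also what happens to every node after the reset described above.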
+
+## Event Types
+
+### Negative events (create/escalate incidents)
+- `service_unhealthy`, `healthcheck_failed` — open or increment an active incident
+- `deployment_failed` — record failure in deployments.json
+
+### Positive events (resolve state)
+- `service_healthy` — marks service status as `healthy` **and** resolves any active incident for that service
+- `service_recovered` — alias, same effect
+- `deployment_completed` — marks deployment as completed
+
+### Node events
+- `node_online`, `node_offline` — update node status in nodes.json
+- `disk_pressure_*` — set `disk_pressure` field on the node record
+
+## Incident Lifecycle
+
+1.  **Detection**: A `service_unhealthy` or `healthcheck_failed` event creates or increments an active incident.
+2.  **Correlation**: Multiple failure events for the same `node/service` are collapsed into one incident, incrementing `occurrence_count`.
+3.  **Resolution**: A `service_healthy` or `service_recovered` event resolves any active incident for that service, setting `status: resolved` and `resolved_at`.
+4.  **Expiry**: Resolved incidents older than 7 days are pruned from world state by `_prune_stale_world()`.
+
+### Example Incident JSON
+```json
+{
+  "inc-1715518800-vps-observer": {
+    "id": "inc-1715518800-vps-observer",
+    "node": "vps",
+    "service": "observer",
+    "status": "resolved",
+    "severity": "error",
+    "started_at": 1715518800.0,
+    "last_occurrence": 1715518860.0,
+    "occurrence_count": 2,
+    "trigger_type": "containers_not_running",
+    "resolved_at": 1715519100.0
+  }
+}
+```
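The lifecycle steps can be sketched as a small fold over events. This is an illustration under stated assumptions, not the observer's implementation: the event field names (`type`, `node`, `service`, `timestamp`), the `active` status value, and the function name `apply_event` are hypothetical. The incident ID shape follows the example above.

```python
import time

NEGATIVE = {"service_unhealthy", "healthcheck_failed"}
POSITIVE = {"service_healthy", "service_recovered"}


def apply_event(incidents: dict, event: dict) -> dict:
    """Fold one event into the incidents dict and return it."""
    node, service = event["node"], event["service"]
    ts = event.get("timestamp", time.time())
    # Find the active incident for this node/service pair, if any.
    active = next(
        (i for i in incidents.values()
         if i["node"] == node and i["service"] == service
         and i["status"] == "active"),
        None,
    )
    if event["type"] in NEGATIVE:
        if active is None:
            # Detection: open a new incident.
            inc_id = f"inc-{int(ts)}-{node}-{service}"
            incidents[inc_id] = {
                "id": inc_id, "node": node, "service": service,
                "status": "active", "severity": "error",
                "started_at": ts, "last_occurrence": ts,
                "occurrence_count": 1,
            }
        else:
            # Correlation: collapse repeats into the existing incident.
            active["last_occurrence"] = ts
            active["occurrence_count"] += 1
    elif event["type"] in POSITIVE and active is not None:
        # Resolution: either positive event type closes the incident.
        active["status"] = "resolved"
        active["resolved_at"] = ts
    return incidents
```

Feeding it two failure events followed by `service_healthy` for `vps/observer` yields one resolved incident with `occurrence_count` of 2, matching the shape of the example record.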
+
+## World State Pruning
+
+`_prune_stale_world()` runs every reconcile cycle and removes:
+
+1. **Stale nodes** — nodes not present in `inventory/topology.yaml` (e.g. ghost nodes created when `NODE_NAME` was unset and fell back to the container's 12-char hex ID).
+2. **Services of stale nodes** — all `node/service` keys whose node was pruned.
+3. **Ghost service keys** — service keys whose service-name portion matches the pattern `<12hexchars>_<name>` (Docker internal stale-state artifacts, created when node-agent used `c.name` instead of the compose label).
+4. **Expired incidents** — resolved incidents older than 7 days.
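Rules 1 to 3 can be sketched for `services.json` as follows. This is a hedged illustration rather than the body of `_prune_stale_world()`: the function name `prune_services`, the `known_nodes` parameter, and the assumption that the hex prefix is lowercase are all hypothetical.

```python
import re

# Service-name portion shaped like "<12 hex chars>_<name>".
# Assumes lowercase hex, as in Docker short container IDs.
GHOST_SERVICE = re.compile(r"^[0-9a-f]{12}_.+$")


def prune_services(services: dict, known_nodes: set[str]) -> dict:
    """Drop entries for unknown nodes and ghost service keys."""
    kept = {}
    for key, record in services.items():
        # Keys have the form "node/service".
        node, _, service = key.partition("/")
        if node not in known_nodes:
            continue  # node is absent from the topology
        if GHOST_SERVICE.match(service):
            continue  # hash-prefixed container-name artifact
        kept[key] = record
    return kept
```

With `known_nodes = {"vps", "piha"}`, a key such as `a1b2c3d4e5f6/observer` is dropped by the node check and `piha/0123456789ab_grafana` by the ghost pattern, while `piha/grafana` survives.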
+
 ## Runtime Behavior

 ### Idempotency
-The observer processes events in order. If the world state is lost, deleting the checkpoint file (`/opt/homelab/state/observer_checkpoint.json`) will cause the observer to re-process all events and rebuild the world state.
-
-### Resumability
-The observer tracks the last processed event file in its checkpoint. Upon restart, it continues from the next available event.
+The observer processes events in order. Deleting the checkpoint and restarting replays all events and produces the same world state.

 ### Deployment Tracking
-Deployments are tracked via `correlation_id`. The observer synthesizes the start, end, and status of each deployment run, providing a clear history of changes to the environment.
+Deployments are tracked via `correlation_id`. The observer synthesizes the start, end, and status of each deployment run from events.
+
+### Topology Filtering
+World-state entries for nodes not listed in `inventory/topology.yaml` are removed during pruning. This prevents transient bootstrap noise from polluting world state.
--- a/docs/sessions/2026-05-27.md
+++ b/docs/sessions/2026-05-27.md
@@ -0,0 +1,103 @@
+# SESSION: Stabilizing the homelab multi-agent system
+
+**DATE:** 2026-05-27  
+**RESULT:** System NOMINAL (97/97 services, 0 errors)
+
+---
+
+## PROBLEMS FOUND
+
+- stability-agent did not generate remediation actions: only redeploy, no container_restart
+- mosquitto on chelsty-infra went down and nothing restarted it (restart policy was `no`)
+- zigbee2mqtt had never been deployed on chelsty-infra
+- node-agent was an empty skeleton: it did not emit `service_healthy`, so `services.json` was always empty
+- ghost services: node-agent used `c.name` (which can return `<12hex>_real-name`) instead of the `com.docker.compose.service` label
+- the materializer on piha read from its local Redis instead of the control-plane API; Redis held 80 stale entries with ghost keys, so "Copy for AI" returned old data
+- observer used a single global checkpoint instead of per-node checkpoints, silently skipping event directories that sort before the current checkpoint
+- supervisor did not cancel resolved actions, so the pending queue grew without bound
+- the `service_healthy` event did not close active incidents
+- NODE_ALIAS_MAP was not configured, so node names in events did not match the topology
+- chelsty-ha was wrongly in monitoring scope: it has no node-agent
+
+---
+
+## FIXES SHIPPED (commits in master)
+
+```
+7277bdc Fix Copy for AI: materializer fetches from control-plane API instead of Redis
+b40b832 Fix ghost service keys from hash-prefixed Docker container names
+28e9534 observer: service_healthy resolves active incidents
+46ae92b supervisor: also cancel pending actions for services removed from desired state
+410bfe7 zigbee2mqtt: config goes in data dir (writable), not separate ro mount
+b3912fe zigbee2mqtt: use extra_hosts host-gateway instead of network_mode: host
+61e07f4 zigbee2mqtt override: clear ports list for docker-compose v1 host network compat
+51002d4 Fix pending actions: node_exporter, zigbee2mqtt, chelsty-ha monitoring
+fb7828b supervisor: auto-cancel pending actions when drift is resolved
+2f19657 fix(node-agent): unique event IDs per service to prevent same-second overwrites
+267742c vps/node-agent: add network_mode: host for control-plane health probe
+4e8968f Fix service health tracking: emit service_healthy, control-plane endpoint, checkpoint migration
+f4a8db9 fix(observer): per-node-directory checkpoints replace single global checkpoint
+a5a3e22 fix(node-agent): skip SSH config file in rsync to avoid UID ownership errors
+2349de5 fix(node-agent): correct VPS_EVENTS_HOST to actual VPS Tailscale IP
+65bac4e fix(node-agent): mount host SSH key into container for event shipping
+96bf326 fix(observer+operator-ui): fix stale world state, dict→list API, event time filter
+ae33cce feat(node-agent): add runtime overrides for piha, solaria, chelsty-infra
+c5c080b feat(vps): add node-agent runtime override with NODE_NAME=vps
+01b7758 feat(node-agent): implement health monitor and safe cleanup policy
+```
+
+### Details of key fixes
+
+**fix(observer): per-node checkpoints**  
+A single global `last_processed_file` checkpoint silently skipped event directories that sort alphabetically before the last processed node (e.g. piha/ < vps/). It was replaced with a per-node dict, `{"node_checkpoints": {"piha": "...", "vps": "..."}}`.
+
+**fix(observer): ghost key pruning**  
+`_prune_stale_world()` now removes entries from services.json whose service key matches the pattern `<12hexchars>_<name>`, which are artifacts of Docker internal state tracking.
+
+**fix(node-agent): canonical container name**  
+`check_containers()` now uses the `com.docker.compose.service` label as the canonical name. Fallback: strip the hash prefix from `c.name`. Containers in the `created` state are skipped (Docker stale-state artifacts).
+
+**fix(node-agent): service_healthy emission**  
+Node-agent now emits `service_healthy` for every running managed container on each cycle. Without it, `services.json` was always empty and the supervisor generated a flood of "missing service" redeploys.
+
+**fix(supervisor): auto-cancel resolved actions**  
+`_cancel_resolved_pending_actions()` moves pending actions to `cancelled/` when:
+- the service became healthy (`drift_resolved_auto`)
+- the service was removed from desired state (`service_removed_from_desired_state`)
+
+**fix(supervisor): monitor:false**  
+The `monitor: false` field in `services.yaml` excludes a service from supervisor action generation. Used for `homeassistant` on chelsty-ha (no node-agent).
+
+**fix(agent-system/materializer): control-plane API as source**  
+The materializer on piha now fetches data from the VPS control-plane API (`CONTROL_PLANE_URL=http://100.95.58.48:18180`) instead of the local Redis. Redis held 80 stale entries. The Redis path is kept as a fallback.
+
+**fix(chelsty-infra/zigbee2mqtt): mosquitto networking**  
+Mosquitto runs with `network_mode: host`, so bridge containers cannot reach it via localhost. Solution: `extra_hosts: - "mosquitto:host-gateway"` in the z2m override. `network_mode: host` is not used for z2m because it conflicts with `ports:` in docker-compose v1 (1.29.2 on chelsty-infra).
+
+**fix(chelsty-infra/zigbee2mqtt): writable config**  
+z2m migrates and rewrites `configuration.yaml` on startup. The config must live in the data directory, `/opt/homelab/data/zigbee2mqtt/data/configuration.yaml` (read-write mount), not in a separate `:ro` volume.
+
+---
+
+## END STATE
+
+| Node | Status | Services |
+|------|--------|----------|
+| vps | online | control-plane (4), node-agent, node_exporter, stability-agent |
+| piha | online | agent-system (4), node-agent, stability-agent, monitoring stack |
+| solaria | online | node-agent, stability-agent, AI workloads |
+| chelsty-infra | online | mosquitto, zigbee2mqtt (z2m connects once SLZB-06U is back online), node-agent, stability-agent |
+| chelsty-ha | n/a | homeassistant (monitor:false; no node-agent, HA is monitored indirectly via MQTT) |
+
+**Action queue:** 0 pending, 0 approved, 0 running  
+**Incidents:** 0 active  
+**Ghost service keys:** 0  
+
+---
+
+## KNOWN LIMITATIONS / TODO
+
+- SLZB-06U (Zigbee coordinator) is offline: `192.168.1.105:6638` returns EHOSTUNREACH from chelsty-infra. This is probably a hardware or network problem on the 192.168.1.0/24 side. z2m starts and serves an error page on :8080; it will connect automatically once the coordinator returns.
+- The `ezsp` adapter in the z2m config is deprecated; migrating to `ember` is recommended. No new configuration is needed, only setting `adapter: ember` in `configuration.yaml`.
+- chelsty-ha has no node-agent. Add one when a machine is available or via manual bootstrap.
+- Redis on piha still holds old keys (`homelab:nodes:*`, `homelab:incidents:*`, etc.). The materializer no longer uses them in API mode, so they can be cleared.
--- a/docs/vps-control-plane.md
+++ b/docs/vps-control-plane.md
@@ -1,83 +1,126 @@
 # VPS Control Plane

-The VPS Control Plane is the orchestration brain of the homelab platform. It runs on the Hetzner VPS and provides observability, automated reconciliation, and a web-based operator interface.
+The VPS Control Plane is the orchestration brain of the homelab platform. It runs on the Hetzner VPS (Tailscale IP: `100.95.58.48`) and provides observability, automated reconciliation, and a web-based operator interface.

 ## Architecture

-The control plane consists of four core services running as a Docker Compose stack:
+The control plane consists of four core services running as a Docker Compose stack under `services/control-plane/`:

-1.  **Observer**: Synthesizes world state from events.
-2.  **Supervisor**: Detects drifts between desired and actual state.
-3.  **Executor**: Executes approved actions from the queue.
-4.  **Operator UI**: Web interface for system monitoring and action approval.
+| Container | Role |
+|-----------|------|
+| `control-plane-observer` | Synthesizes world state from events in `/opt/homelab/events/` |
+| `control-plane-supervisor` | Detects drift between desired state (`hosts/*/services.yaml`) and actual state (`world/services.json`); writes pending actions |
+| `control-plane-executor` | Executes approved actions from `/opt/homelab/actions/approved/` |
+| `control-plane-ui` | Web interface for system monitoring and action approval; serves port 18180 |

-All services adhere to **filesystem-first** semantics, using `/opt/homelab/` as the primary data exchange and persistence layer.
+All services use **filesystem-first** semantics with `/opt/homelab/` as the data exchange layer. All four run with `network_mode: host` and as UID 1000 (`homelab` user).

-## Deployment Flow
+## Supervisor Behavior

-### 1. Prerequisites
- Target VPS node must be onboarded (Tailscale active, Docker installed).
- Repository cloned to `/home/oskar/homelab-codex-ws`.
+### Desired State
+Loaded from `hosts/*/services.yaml` each reconcile cycle. Services with `monitor: false` are silently skipped — use this for services without a node-agent (e.g. `homeassistant` on `chelsty-ha`).

-### 2. Bootstrap
-Run the local deployment script on the VPS to initialize the runtime filesystem and start the stack:
+### Drift Types
+- `missing_service` — service is in desired state but absent from `services.json`
+- `unhealthy_service` — service exists in `services.json` but `status != healthy`
+
+### Action Types
+| Trigger | Action type | Risk |
+|---------|-------------|------|
+| `containers_not_running`, `mqtt_unreachable` | `container_restart` | low |
+| Any other / unknown | `redeploy` | guarded |
+| Node `disk_pressure: high` | `disk_cleanup` | guarded |
+
+### Action ID Stability
+Action IDs are deterministic: `redeploy-{node}-{service}` or `container-restart-{node}-{service}`. The same drift always produces the same filename, so repeated reconcile cycles and supervisor restarts do not create duplicate actions.
+
+### Auto-Cancel
+Pending `redeploy` and `container_restart` actions are automatically moved to `cancelled/` when:
+- **`drift_resolved_auto`** — the service becomes `healthy` in actual state
+- **`service_removed_from_desired_state`** — the service was removed from `services.yaml` or marked `monitor: false`
+
+Only `pending` actions are auto-cancelled. Approved/running actions have been committed to by the operator and are never cancelled automatically.
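The deterministic IDs and the auto-cancel rules can be sketched together. This is a minimal illustration, not the supervisor's `_cancel_resolved_pending_actions()`: the action-file fields (`type`, `node`, `service`, `cancel_reason`) and both function names are assumptions. Only the directory names and the two reason strings come from the text above.

```python
import json
from pathlib import Path


def action_id(action_type: str, node: str, service: str) -> str:
    """Build a stable ID such as 'container-restart-vps-observer'."""
    return f"{action_type.replace('_', '-')}-{node}-{service}"


def cancel_resolved(actions_root: Path, services: dict, desired: set[str]) -> list[str]:
    """Move pending actions whose drift has gone away into cancelled/."""
    cancelled_dir = actions_root / "cancelled"
    cancelled_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    # Only pending/ is scanned; approved and running actions are untouched.
    for path in sorted((actions_root / "pending").glob("*.json")):
        action = json.loads(path.read_text())
        if action.get("type") not in {"redeploy", "container_restart"}:
            continue
        key = f"{action['node']}/{action['service']}"
        if key not in desired:
            reason = "service_removed_from_desired_state"
        elif services.get(key, {}).get("status") == "healthy":
            reason = "drift_resolved_auto"
        else:
            continue  # drift still present; leave the action pending
        action["cancel_reason"] = reason
        (cancelled_dir / path.name).write_text(json.dumps(action, indent=2))
        path.unlink()
        moved.append(path.name)
    return moved
```

Because the filename is derived purely from action type, node, and service, writing the same pending action twice overwrites one file instead of queueing a duplicate.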
+
+### Node Name Resolution
+The supervisor supports a `NODE_ALIAS_MAP` environment variable (JSON string) to map event/world-state node names to canonical topology names:

 ```bash
-cd services/control-plane
-bash deploy-local.sh
+NODE_ALIAS_MAP='{"node-2": "chelsty-infra", "node-1": "piha"}'
 ```
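A minimal sketch of how such a variable could be parsed and applied is shown below. The helper names `load_alias_map` and `canonical_node` are hypothetical, and treating an unset variable as "no aliases" is an assumption rather than documented supervisor behavior.

```python
import json
import os


def load_alias_map(env=os.environ) -> dict[str, str]:
    """Parse NODE_ALIAS_MAP; an unset or empty variable means no aliases."""
    raw = env.get("NODE_ALIAS_MAP", "").strip()
    if not raw:
        return {}
    aliases = json.loads(raw)
    if not isinstance(aliases, dict):
        raise ValueError("NODE_ALIAS_MAP must be a JSON object")
    return {str(k): str(v) for k, v in aliases.items()}


def canonical_node(name: str, aliases: dict[str, str]) -> str:
    """Map an event node name to its topology name, defaulting to itself."""
    return aliases.get(name, name)
```

With the example value above, `canonical_node("node-2", aliases)` resolves to `chelsty-infra`, while a name with no alias, such as `vps`, passes through unchanged.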

-### 3. Verification
-Verify the stack is healthy using the deployment script or check container status on the VPS:
+## Deployment

+### From SATURN (primary control node)
 ```bash
-# Check status via deploy script
+# Full deploy via SSH
 ./scripts/deploy/deploy-control-plane.sh --ssh

-# Manual status check on VPS
+# Or manually:
+ssh oskar@100.95.58.48 "cd ~/homelab-codex-ws && git pull origin master && cd services/control-plane && docker compose up -d --build --force-recreate"
+```
+
+### Direct on VPS
+```bash
+cd ~/homelab-codex-ws/services/control-plane
+docker compose up -d --build --force-recreate
+```
+
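+For a first-time bootstrap, run `bash deploy-local.sh` from `services/control-plane/` instead: it creates the required `/opt/homelab/` directory structure and sets ownership to UID 1000 (requires `sudo`) before starting the stack. If the directories already exist, run the `docker compose` command above directly.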
+
+### Verification
+```bash
+# On VPS
 docker ps --filter "name=control-plane"
+curl -s http://localhost:18180/summary | python3 -m json.tool
 ```

-## Operational Workflows
+## Action Approval Workflow

-### Action Approval
-1. Access the Operator UI (via Tailscale IP or Nginx Proxy Manager).
-2. Navigate to **Action Queue**.
-3. Review **Pending** actions recommended by the Supervisor.
-4. Click **Approve** to move actions to the execution queue.
+```
+Supervisor writes → /opt/homelab/actions/pending/<id>.json
+                 → Operator UI (port 18180) or Telegram Bot notifies
+                 → Operator clicks Approve
+                 → /opt/homelab/actions/approved/<id>.json
+                 → Executor executes → completed / failed
+```

-### Recovery Flow
-In case of control plane failure:
-1. Check logs using `docker logs`.
-2. Restart stack using the local deployment script: `bash deploy-local.sh`.
-3. Rebuild world state: Delete `/opt/homelab/state/observer_checkpoint.json` and redeploy.
+Possible action states: `pending → approved → running → completed / failed / rejected`  
+Auto-cancel path: `pending → cancelled/`

-### Upgrade Flow
-To deploy updates from the SOLARIA/control host:
+## Recovery
+
+### World state is stale or corrupt
+```bash
+# On VPS — delete checkpoint to force full replay
+rm /opt/homelab/state/observer_checkpoint.json
+docker restart control-plane-observer
+```
+
+### Flood of pending actions after bootstrap
+Check if node-agent is running and emitting `service_healthy` events on each node. Without `service_healthy`, the supervisor sees all services as missing and queues redeployments every cycle.

 ```bash
-./scripts/deploy/deploy-control-plane.sh --ssh
+# Check node-agent on each node
+ssh oskar@<node> "docker ps --filter name=node-agent && docker logs node-agent --tail 20"
 ```

-### Rollback Semantics
-Since the runtime is filesystem-first and append-only:
-1. Roll back the repository state to a previous commit.
-2. Restart the control plane stack.
-3. The supervisor will detect drift against the older (rolled-back) desired state and recommend actions to restore it.
-
-## Runtime Safety
-
- **Readonly Mounts**: Most services mount the repository as `:ro` to prevent accidental mutations.
- **Least-Privilege**: UI, Observer, and Supervisor run as non-root `homelab` user (UID 1000).
- **Filesystem Isolation**: Clear separation between `/repo` (code/inventory) and `/opt/homelab` (runtime state).
+### Rebuild from scratch
+```bash
+ssh oskar@100.95.58.48 "cd ~/homelab-codex-ws/services/control-plane && docker compose up -d --build --force-recreate"
+```

 ## Integration

+### piha agent-system webui (port 18180 on piha)
+The `agent-system-runtime-materializer` on piha polls the VPS control-plane API every 10 seconds and mirrors world state to piha's local `/opt/homelab/world/`. This ensures the **"Copy for AI"** button in the piha webui (`agent-system-webui`) reflects the same clean state as the VPS API.
+
+Override: `hosts/piha/runtime/agent-system/docker-compose.override.yml` — sets `CONTROL_PLANE_URL=http://100.95.58.48:18180`.
+
 ### Nginx Proxy Manager
-Configure a proxy host in NPM to point to `http://control-plane-ui:8080`. Ensure Websockets are enabled if the UI uses them.
+The operator UI at port 18180 can be proxied via NPM for external access. No WebSocket support required.

 ### Log Locations
- Container logs: `docker compose logs`
+- Container logs: `docker compose logs -f` (from `services/control-plane/`)
 - Runtime events: `/opt/homelab/events/YYYY-MM-DD/`
 - World state: `/opt/homelab/world/`
- Diagnostics: `/opt/homelab/logs/`
+- Action queue: `/opt/homelab/actions/{pending,approved,running,completed,failed,cancelled}/`