docs: session summary 2026-05-27 + update observer/control-plane/chelsty docs

docs/sessions/2026-05-27.md (new):
- Full session record: problems found, all commits shipped, end state
- Written in Polish per operator preference for session notes
- Known limitations: SLZB-06U offline, ezsp→ember migration pending

docs/observer-runtime.md:
- Document per-node checkpoint format (replaces old global checkpoint)
- Add service_healthy / service_recovered resolution behavior
- Document ghost key pruning (_prune_stale_world patterns)
- Add event type reference table (negative vs positive)

docs/vps-control-plane.md:
- Add container names and network_mode: host detail
- Document monitor:false, NODE_ALIAS_MAP, auto-cancel behavior
- Add piha agent-system materializer integration note
- Rewrite recovery section with actionable bootstrap-flood diagnosis
- Add action state machine (pending→approved→running→completed/cancelled)

docs/chelsty-runtime.md:
- Add chelsty-infra/chelsty-ha node table
- Document docker-compose v1 constraint (always use docker-compose, not docker compose)
- Add mosquitto network_mode:host + z2m extra_hosts:host-gateway explanation
- Add z2m config writable requirement (EROFS failure mode documented)
- Add chelsty-ha monitor:false rationale
- Add minimal configuration.yaml template for z2m

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Oskar Kapala 2026-05-27 16:18:31 +02:00
parent 7277bdc27f
commit 603e10a364
4 changed files with 396 additions and 122 deletions


@ -1,61 +1,154 @@
# CHELSTY Runtime # CHELSTY Runtime
This document describes the runtime environment and deployment flow for CHELSTY, an offline-capable home automation edge node. This document describes the runtime environment and deployment flow for CHELSTY, an offline-capable home automation edge node split across two VMs.
| Node | Role | Services |
|------|------|----------|
| `chelsty-infra` | LTE edge hypervisor | Mosquitto, Zigbee2MQTT, stability-agent, node-agent |
| `chelsty-ha` | Home Assistant VM | homeassistant (no node-agent — see below) |
Both nodes share an LTE uplink and must function fully offline (Zigbee, MQTT, HA automations) without any connectivity to SATURN, VPS, or Forgejo.
## Runtime Layout ## Runtime Layout
The CHELSTY runtime is located at `/opt/homelab`. ```
/opt/homelab/
- `/opt/homelab/config/`: Service-specific configurations and compose overrides. ├── config/ # Service-specific configs and secrets (not in Git)
- `/opt/homelab/data/`: Persistent data for services. │ ├── mosquitto/
- `/opt/homelab/logs/`: Service logs. │ └── zigbee2mqtt/
├── data/ # Persistent service data
### Key Service Locations │ ├── mosquitto/ # Persistence DB, password file
- **Mosquitto**: `/opt/homelab/config/mosquitto/` │ └── zigbee2mqtt/
- **Zigbee2MQTT**: `/opt/homelab/config/zigbee2mqtt/` │ └── data/ # z2m config, coordinator backup, network key
└── logs/
```
## SLZB-06U Integration ## SLZB-06U Integration
CHELSTY uses a SMLIGHT SLZB-06U Zigbee coordinator connected via Ethernet/TCP. CHELSTY uses a SMLIGHT SLZB-06U Zigbee coordinator connected over Ethernet/TCP.
- **Coordinator IP**: 192.168.1.105 - **Coordinator IP**: `192.168.1.105`
- **Port**: 6638 - **Port**: `6638`
- **Protocol**: TCP (ezsp adapter) - **Adapter**: `ezsp` (deprecated; migrating to `ember` is recommended and only requires setting `adapter: ember` in `configuration.yaml`)
- **Zigbee2MQTT config key**: `serial.port: tcp://192.168.1.105:6638`
Zigbee2MQTT is configured to connect to this coordinator over the local network. ⚠️ Never use `/dev/ttyUSB0` — the coordinator is always TCP-only on this site.
## Offline & LTE Assumptions ## Networking Constraints
- **WAN Resilience**: All core automation (Zigbee, MQTT) runs locally on CHELSTY. ### Mosquitto — `network_mode: host`
- **Connectivity**: LTE provides intermittent uplink for remote management and Tailscale access. Mosquitto runs with `network_mode: host`, so host processes and other host-networked containers can reach it at `localhost:1883`. Bridge-networked containers cannot reach it via `localhost` and need a hostname mapping instead. **Do not change this.**
- **Home Assistant**: Runs on `chelsty-ha` node, connecting to the Mosquitto broker on `chelsty-infra`.
### Zigbee2MQTT — bridge network + extra_hosts
Zigbee2MQTT runs in a bridge-networked container (needed for port mapping compatibility with docker-compose v1). To reach the host-networked Mosquitto:
```yaml
# hosts/chelsty-infra/runtime/zigbee2mqtt/docker-compose.override.yml
services:
zigbee2mqtt:
extra_hosts:
- "mosquitto:host-gateway"
```
This maps the `mosquitto` hostname inside the z2m container to the Docker host gateway IP, so `mqtt://mosquitto:1883` reaches the host-networked Mosquitto process.
**Why not `network_mode: host` for z2m?**
chelsty-infra runs docker-compose v1 (1.29.2). In v1, `network_mode: host` cannot coexist with `ports:` declared in the base `docker-compose.yml`; combining them raises `InvalidArgument`. The `extra_hosts` approach avoids this.
## Zigbee2MQTT Config Location
The `configuration.yaml` **must be writable** — z2m migrates and rewrites it on startup. It lives in the data directory:
```
/opt/homelab/data/zigbee2mqtt/data/configuration.yaml
```
This path is mounted read-write by the base `docker-compose.yml`:
```yaml
volumes:
- /opt/homelab/data/zigbee2mqtt/data:/app/data
```
Do **not** mount `configuration.yaml` as a separate `:ro` volume — z2m will fail with `EROFS`.
### Minimal configuration.yaml
```yaml
homeassistant: true
permit_join: false
mqtt:
base_topic: zigbee2mqtt
server: mqtt://mosquitto:1883
serial:
port: tcp://192.168.1.105:6638
adapter: ezsp
frontend:
port: 8080
advanced:
log_level: info
```
## chelsty-ha — No node-agent
`chelsty-ha` does not have a node-agent deployed. Home Assistant is monitored indirectly: if MQTT goes silent on `chelsty-infra`, HA is likely down.
In `hosts/chelsty-ha/services.yaml`:
```yaml
services:
homeassistant:
monitor: false # No node-agent; suppresses supervisor action generation
```
Remove `monitor: false` once node-agent is bootstrapped on this VM.
## Deployment Flow ## Deployment Flow
1. **Initial Bootstrap**: ### Initial Bootstrap
Run the bootstrap script on the CHELSTY node:
```bash ```bash
./scripts/bootstrap/chelsty-runtime.sh ./scripts/bootstrap/chelsty-runtime.sh
``` ```
2. **Manual Configuration**: ### Deploy services
- Edit `/opt/homelab/config/zigbee2mqtt/.env` with MQTT credentials.
- Add Mosquitto user:
```bash
sudo mosquitto_passwd -b /opt/homelab/data/mosquitto/config/password.txt <user> <password>
```
3. **Service Deployment**:
Use the staged deployment runtime:
```bash ```bash
./scripts/deploy/deploy-node.sh chelsty-infra ./scripts/deploy/deploy-node.sh chelsty-infra
./scripts/deploy/deploy-node.sh chelsty-ha ./scripts/deploy/deploy-node.sh chelsty-ha
``` ```
## Recovery Procedure ### Manual (SSH) — chelsty-infra uses docker-compose v1
```bash
ssh oskar@100.122.201.22
cd ~/homelab-codex-ws/services/<service>
docker-compose -f docker-compose.yml \
-f ../../hosts/chelsty-infra/runtime/<service>/docker-compose.override.yml \
up -d --build --force-recreate
```
In case of runtime failure: > **Note:** `docker compose` (v2) is **not** available on chelsty-infra — always use `docker-compose` (hyphenated, v1 1.29.2).
1. Verify Docker and Compose plugin: `docker compose version`
2. Re-run bootstrap script to ensure directory structure and basic configs. ## Recovery Procedures
3. Check Mosquitto logs: `tail -f /opt/homelab/data/mosquitto/log/mosquitto.log`
4. Verify SLZB-06U reachability: `ping 192.168.1.105` ### Mosquitto stopped
```bash
ssh oskar@100.122.201.22 "docker start mosquitto"
# Ensure the restart policy is correct (must also run on chelsty-infra, not locally):
ssh oskar@100.122.201.22 "docker update --restart unless-stopped mosquitto"
```
### Zigbee2MQTT won't start
1. Check logs: `docker logs zigbee2mqtt --tail 50`
2. Verify SLZB-06U reachable from host: `nc -zv 192.168.1.105 6638`
3. Verify config is not empty: `cat /opt/homelab/data/zigbee2mqtt/data/configuration.yaml`
4. If config missing, recreate from the minimal template above
### SLZB-06U unreachable
`192.168.1.105:6638` EHOSTUNREACH means the coordinator is offline or the LAN is down. Zigbee2MQTT will keep retrying — no restart needed once the coordinator returns.
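The reachability check can also be scripted. The following is a minimal sketch using only the Python standard library; the function name `coordinator_reachable` and the 3-second default timeout are illustrative assumptions, not part of the repository.

```python
import socket


def coordinator_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    Hypothetical helper: any OSError (EHOSTUNREACH, ECONNREFUSED,
    timeout) is treated as "unreachable".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example call for the coordinator documented above:
#   coordinator_reachable("192.168.1.105", 6638)
```

Like `nc -zv`, this only confirms that the TCP port accepts connections; it does not verify that the adapter protocol behind it is healthy.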
## Critical Backup Sets
| Data | Path |
|------|------|
| HA config + DB | `/opt/homelab/data/homeassistant/` on chelsty-ha |
| z2m config + coordinator backup + network key | `/opt/homelab/data/zigbee2mqtt/data/` |
| Mosquitto persistence + password file | `/opt/homelab/data/mosquitto/` |
| SLZB-06U coordinator state | Backup via SLZB-06U web UI at `192.168.1.105` |
> ⚠️ The Zigbee network key is in `configuration.yaml` or `coordinator_backup.json` — losing it requires re-pairing all devices.


@ -7,57 +7,92 @@ The Observer Runtime is a lightweight agent responsible for synthesizing the ope
The observer follows a filesystem-first approach, consuming append-only events and generating a normalized world model. It is designed to be idempotent, resumable, and resilient to intermittent node connectivity. The observer follows a filesystem-first approach, consuming append-only events and generating a normalized world model. It is designed to be idempotent, resumable, and resilient to intermittent node connectivity.
### Inputs ### Inputs
- `/opt/homelab/events/`: Normalized JSON events. - `/opt/homelab/events/`: Normalized JSON events (one `.json` file per event, organized by date and node).
- `/opt/homelab/state/`: Deployment stage markers and internal observer checkpoint. - `/opt/homelab/state/observer_checkpoint.json`: Per-node checkpoint dict (see below).
- `/opt/homelab/logs/`: Detailed execution logs and diagnostics.
- Repository Inventory: `inventory/topology.yaml` and `hosts/*/services.yaml`. - Repository Inventory: `inventory/topology.yaml` and `hosts/*/services.yaml`.
### World Model Output ### World Model Output
Generated under `/opt/homelab/world/`: Generated under `/opt/homelab/world/`:
- `nodes.json`: Current node availability, roles, and last seen timestamps. - `nodes.json`: Current node availability, roles, disk/memory pressure, last seen timestamps. Dict keyed by node name.
- `services.json`: Service health status and links to active incidents. - `services.json`: Service health status and links to active incidents. Dict keyed by `"node/service"`.
- `deployments.json`: Tracking of active and historical deployment runs by `correlation_id`. - `deployments.json`: Tracking of active and historical deployment runs by `correlation_id`.
- `incidents.json`: Correlated operational issues, including repeat failures and resolution status. - `incidents.json`: Correlated operational issues, including repeat failures and resolution status.
- `runtime-summary.json`: High-level overview for dashboards and planner agents. - `runtime-summary.json`: High-level overview for dashboards and planner agents.
## Incident Lifecycle ## Checkpoint Format
The observer implements lightweight incident correlation: The observer tracks per-node progress to avoid silently skipping event directories:
1. **Detection**: When a `service_unhealthy` or `healthcheck_failed` event is consumed, a new incident is created or an existing active incident for that service is updated.
2. **Correlation**: Multiple failure events for the same service on the same node are collapsed into a single incident, tracking the `occurrence_count`.
3. **Diagnostics**: Deployment failures (`deployment_failed`) automatically attach references to diagnostic files if present in the event payload.
4. **Resolution**: A `service_recovered` event for a service will transition any active incidents for that service to a `resolved` state.
### Example Incident JSON
```json ```json
{ {
"inc-1715518800-saturn-mosquitto": { "node_checkpoints": {
"id": "inc-1715518800-saturn-mosquitto", "vps": "/opt/homelab/events/2026-05-27/vps/evt-vps-1234.json",
"node": "saturn", "piha": "/opt/homelab/events/2026-05-27/piha/evt-piha-5678.json",
"service": "mosquitto", "chelsty-infra": "/opt/homelab/events/2026-05-27/chelsty-infra/evt-chelsty-infra-9012.json"
"status": "resolved",
"severity": "error",
"started_at": "2026-05-12T12:05:00Z",
"last_occurrence": "2026-05-12T12:06:00Z",
"occurrence_count": 2,
"events": [
"2026-05-12T12:05:00Z",
"2026-05-12T12:06:00Z"
],
"correlation_id": "hc-1",
"resolved_at": "2026-05-12T12:10:00Z"
} }
} }
``` ```
A single global checkpoint (`last_processed_file`) was replaced with this per-node dict because the old approach silently skipped any node directory that sorts alphabetically before the last-seen node (e.g. `piha/` would be skipped when the checkpoint pointed to `vps/`).
**Reset:** Delete `/opt/homelab/state/observer_checkpoint.json`. The observer will reprocess all events and rebuild world state from scratch.
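To illustrate why a per-node checkpoint avoids the skipping problem, here is a minimal sketch of how pending events could be selected. It assumes event paths follow the `<date>/<node>/<file>.json` layout shown above and that lexicographic path order matches processing order; the function names `pending_events` and `advance` are hypothetical and not taken from the observer source.

```python
from pathlib import PurePosixPath


def pending_events(event_paths, node_checkpoints):
    """Return event files that sort after their own node's checkpoint.

    Assumption: within a node, lexicographic path order equals
    processing order (date directory first, then file name).
    """
    pending = []
    for path in sorted(event_paths):
        node = PurePosixPath(path).parent.name  # .../<date>/<node>/<file>.json
        last = node_checkpoints.get(node)
        if last is None or path > last:
            pending.append(path)
    return pending


def advance(node_checkpoints, processed_path):
    """Return a new checkpoint dict with the node of processed_path updated."""
    node = PurePosixPath(processed_path).parent.name
    updated = dict(node_checkpoints)
    updated[node] = processed_path
    return updated
```

Because each node is compared only against its own last processed file, a checkpoint pointing into `vps/` no longer hides unprocessed files under `piha/`.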
## Event Types
### Negative events (create/escalate incidents)
- `service_unhealthy`, `healthcheck_failed` — open or increment an active incident
- `deployment_failed` — record failure in deployments.json
### Positive events (resolve state)
- `service_healthy` — marks service status as `healthy` **and** resolves any active incident for that service
- `service_recovered` — alias, same effect
- `deployment_completed` — marks deployment as completed
### Node events
- `node_online`, `node_offline` — update node status in nodes.json
- `disk_pressure_*` — set `disk_pressure` field on the node record
## Incident Lifecycle
1. **Detection**: A `service_unhealthy` or `healthcheck_failed` event creates or increments an active incident.
2. **Correlation**: Multiple failure events for the same `node/service` are collapsed into one incident, incrementing `occurrence_count`.
3. **Resolution**: A `service_healthy` or `service_recovered` event resolves any active incident for that service, setting `status: resolved` and `resolved_at`.
4. **Expiry**: Resolved incidents older than 7 days are pruned from world state by `_prune_stale_world()`.
### Example Incident JSON
```json
{
"inc-1715518800-vps-observer": {
"id": "inc-1715518800-vps-observer",
"node": "vps",
"service": "observer",
"status": "resolved",
"severity": "error",
"started_at": 1715518800.0,
"last_occurrence": 1715518860.0,
"occurrence_count": 2,
"trigger_type": "containers_not_running",
"resolved_at": 1715519100.0
}
}
```
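The lifecycle can be sketched as a small state update over the incidents dict. This is an illustration under stated assumptions: the event field names (`type`, `node`, `service`, `timestamp`) and the `active` status value are guesses, and `apply_event` is a hypothetical name rather than the observer's actual function.

```python
NEGATIVE_EVENTS = {"service_unhealthy", "healthcheck_failed"}
POSITIVE_EVENTS = {"service_healthy", "service_recovered"}


def apply_event(incidents, event):
    """Apply one event to the incidents dict (mutated in place and returned)."""
    node, service, ts = event["node"], event["service"], event["timestamp"]
    # Find the active incident for this node/service, if any.
    active = next(
        (inc for inc in incidents.values()
         if inc["node"] == node and inc["service"] == service
         and inc["status"] == "active"),
        None,
    )
    if event["type"] in NEGATIVE_EVENTS:
        if active is None:
            # Detection: open a new incident keyed by start time, node, service.
            inc_id = f"inc-{int(ts)}-{node}-{service}"
            incidents[inc_id] = {
                "id": inc_id, "node": node, "service": service,
                "status": "active", "started_at": ts,
                "last_occurrence": ts, "occurrence_count": 1,
            }
        else:
            # Correlation: collapse repeat failures into the same incident.
            active["last_occurrence"] = ts
            active["occurrence_count"] += 1
    elif event["type"] in POSITIVE_EVENTS and active is not None:
        # Resolution: a healthy/recovered event closes the incident.
        active["status"] = "resolved"
        active["resolved_at"] = ts
    return incidents
```

Feeding two failure events followed by one `service_healthy` event for `vps/observer` yields a single resolved incident with `occurrence_count` of 2, which matches the shape of the example JSON.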
## World State Pruning
`_prune_stale_world()` runs every reconcile cycle and removes:
1. **Stale nodes** — nodes not present in `inventory/topology.yaml` (e.g. ghost nodes created when `NODE_NAME` was unset and fell back to the container's 12-char hex ID).
2. **Services of stale nodes** — all `node/service` keys whose node was pruned.
3. **Ghost service keys** — service keys whose service-name portion matches the pattern `<12hexchars>_<name>` (Docker internal stale-state artifacts, created when node-agent used `c.name` instead of the compose label).
4. **Expired incidents** — resolved incidents older than 7 days.
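The pruning rules above can be sketched as two pure functions. This is not the actual `_prune_stale_world()` code: the helper names, the regular expression for the `<12hexchars>_<name>` pattern (lowercase hex assumed), and the use of epoch seconds for `resolved_at` are assumptions made for illustration.

```python
import re

# Assumed shape of a ghost service name: 12 lowercase hex chars, "_", real name.
GHOST_SERVICE = re.compile(r"^[0-9a-f]{12}_.+$")
SEVEN_DAYS = 7 * 86400


def prune_services(services, topology_nodes):
    """Drop services of unknown nodes and ghost service keys."""
    kept = {}
    for key, record in services.items():
        node, _, service = key.partition("/")  # keys look like "node/service"
        if node not in topology_nodes:
            continue  # rules 1 and 2: stale node and its services
        if GHOST_SERVICE.match(service):
            continue  # rule 3: ghost service key
        kept[key] = record
    return kept


def prune_incidents(incidents, now, max_age=SEVEN_DAYS):
    """Drop resolved incidents whose resolved_at is older than max_age seconds."""
    return {
        inc_id: inc for inc_id, inc in incidents.items()
        if not (inc.get("status") == "resolved"
                and now - inc.get("resolved_at", now) > max_age)
    }
```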
## Runtime Behavior ## Runtime Behavior
### Idempotency ### Idempotency
The observer processes events in order. If the world state is lost, deleting the checkpoint file (`/opt/homelab/state/observer_checkpoint.json`) will cause the observer to re-process all events and rebuild the world state. The observer processes events in order. Deleting the checkpoint and restarting replays all events and produces the same world state.
### Resumability
The observer tracks the last processed event file in its checkpoint. Upon restart, it continues from the next available event.
### Deployment Tracking ### Deployment Tracking
Deployments are tracked via `correlation_id`. The observer synthesizes the start, end, and status of each deployment run, providing a clear history of changes to the environment. Deployments are tracked via `correlation_id`. The observer synthesizes the start, end, and status of each deployment run from events.
### Topology Filtering
World-state entries for nodes not listed in `inventory/topology.yaml` are removed during pruning. This prevents transient bootstrap noise from polluting world state.

docs/sessions/2026-05-27.md Normal file

@ -0,0 +1,103 @@
# SESSION: Stabilizing the homelab multi-agent system
**DATE:** 2026-05-27
**RESULT:** System NOMINAL (97/97 services, 0 errors)
---
## PROBLEMS FOUND
- stability-agent generated no real repair actions: only redeploy, no container_restart
- mosquitto on chelsty-infra went down and nothing restarted it (restart policy was `no`)
- zigbee2mqtt had never been deployed on chelsty-infra
- node-agent was an empty skeleton: it did not emit `service_healthy`, so `services.json` was always empty
- ghost services: node-agent used `c.name` (which can return `<12hex>_real-name`) instead of the `com.docker.compose.service` label
- the materializer on piha read from its local Redis instead of the control-plane API; Redis held 80 stale entries with ghost keys, and "Copy for AI" returned old data
- observer used a single global checkpoint instead of per-node checkpoints, silently skipping event directories that sort before the current checkpoint
- supervisor did not cancel resolved actions, so the pending queue grew without bound
- the `service_healthy` event did not close active incidents
- NODE_ALIAS_MAP was not configured, causing a node-name mismatch between events and topology
- chelsty-ha was wrongly in monitoring scope; it has no node-agent
---
## FIXES SHIPPED (commits in master)
```
7277bdc Fix Copy for AI: materializer fetches from control-plane API instead of Redis
b40b832 Fix ghost service keys from hash-prefixed Docker container names
28e9534 observer: service_healthy resolves active incidents
46ae92b supervisor: also cancel pending actions for services removed from desired state
410bfe7 zigbee2mqtt: config goes in data dir (writable), not separate ro mount
b3912fe zigbee2mqtt: use extra_hosts host-gateway instead of network_mode: host
61e07f4 zigbee2mqtt override: clear ports list for docker-compose v1 host network compat
51002d4 Fix pending actions: node_exporter, zigbee2mqtt, chelsty-ha monitoring
fb7828b supervisor: auto-cancel pending actions when drift is resolved
2f19657 fix(node-agent): unique event IDs per service to prevent same-second overwrites
267742c vps/node-agent: add network_mode: host for control-plane health probe
4e8968f Fix service health tracking: emit service_healthy, control-plane endpoint, checkpoint migration
f4a8db9 fix(observer): per-node-directory checkpoints replace single global checkpoint
a5a3e22 fix(node-agent): skip SSH config file in rsync to avoid UID ownership errors
2349de5 fix(node-agent): correct VPS_EVENTS_HOST to actual VPS Tailscale IP
65bac4e fix(node-agent): mount host SSH key into container for event shipping
96bf326 fix(observer+operator-ui): fix stale world state, dict→list API, event time filter
ae33cce feat(node-agent): add runtime overrides for piha, solaria, chelsty-infra
c5c080b feat(vps): add node-agent runtime override with NODE_NAME=vps
01b7758 feat(node-agent): implement health monitor and safe cleanup policy
```
### Details of key fixes
**fix(observer): per-node checkpoints**
A single global checkpoint, `last_processed_file`, silently skipped event directories that sort alphabetically before the last processed node (e.g. piha/ < vps/). It was replaced by a per-node dict, `{"node_checkpoints": {"piha": "...", "vps": "..."}}`.
**fix(observer): ghost key pruning**
`_prune_stale_world()` now removes services.json entries whose service key matches the pattern `<12hexchars>_<name>`, which are artifacts of Docker internal state tracking.
**fix(node-agent): canonical container name**
`check_containers()` now uses the `com.docker.compose.service` label as the canonical name. Fallback: strip the hash prefix from `c.name`. Containers in the `created` state are skipped (Docker stale-state artifacts).
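A minimal sketch of the naming rule described above, written against plain values rather than the Docker SDK so it stays self-contained. The function name `canonical_service_name` and its parameters are hypothetical; the real `check_containers()` may be structured differently.

```python
import re

# Assumed form of the stale-state prefix: 12 lowercase hex chars plus "_".
HASH_PREFIX = re.compile(r"^[0-9a-f]{12}_")


def canonical_service_name(name, labels, status):
    """Return the canonical service name, or None if the container is skipped.

    name   -- container name as reported by Docker (c.name in the notes above)
    labels -- dict of container labels
    status -- container state string, e.g. "running" or "created"
    """
    if status == "created":
        return None  # stale-state artifact, not a real running service
    label = labels.get("com.docker.compose.service")
    if label:
        return label  # preferred: the compose service label
    return HASH_PREFIX.sub("", name)  # fallback: strip the hash prefix
```

Preferring the compose label keeps the service key stable even when Docker temporarily renames a container during recreation.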
**fix(node-agent): service_healthy emission**
Node-agent now emits `service_healthy` for every running managed container on each cycle. Without it, `services.json` was always empty and the supervisor generated a flood of "missing service" redeploys.
**fix(supervisor): auto-cancel resolved actions**
`_cancel_resolved_pending_actions()` moves pending actions to `cancelled/` when:
- the service became healthy (`drift_resolved_auto`)
- the service was removed from desired state (`service_removed_from_desired_state`)
**fix(supervisor): monitor:false**
The `monitor: false` field in `services.yaml` excludes a service from supervisor action generation. It is used for `homeassistant` on chelsty-ha (no node-agent).
**fix(agent-system/materializer): control-plane API as source**
The materializer on piha now fetches data from the VPS control-plane API (`CONTROL_PLANE_URL=http://100.95.58.48:18180`) instead of local Redis. Redis held 80 stale entries. The Redis path is kept as a fallback.
**fix(chelsty-infra/zigbee2mqtt): mosquitto networking**
Mosquitto runs with `network_mode: host`, so bridge containers cannot reach it via localhost. Solution: `extra_hosts: - "mosquitto:host-gateway"` in the z2m override. `network_mode: host` is not used for z2m because it conflicts with `ports:` under docker-compose v1 (1.29.2 on chelsty-infra).
**fix(chelsty-infra/zigbee2mqtt): writable config**
z2m migrates and overwrites `configuration.yaml` at startup. The config must live in the data directory, `/opt/homelab/data/zigbee2mqtt/data/configuration.yaml` (read-write mount), not in a separate `:ro` volume.
---
## END STATE
| Node | Status | Services |
|------|--------|----------|
| vps | online | control-plane (4), node-agent, node_exporter, stability-agent |
| piha | online | agent-system (4), node-agent, stability-agent, monitoring stack |
| solaria | online | node-agent, stability-agent, AI workloads |
| chelsty-infra | online | mosquitto, zigbee2mqtt (z2m connects once the SLZB-06U is back online), node-agent, stability-agent |
| chelsty-ha | n/a | homeassistant (monitor:false: no node-agent, HA monitored indirectly via MQTT) |
**Action queue:** 0 pending, 0 approved, 0 running
**Incidents:** 0 active
**Ghost service keys:** 0
---
## KNOWN LIMITATIONS / TODO
- SLZB-06U (Zigbee coordinator) offline: `192.168.1.105:6638` returns EHOSTUNREACH from chelsty-infra. Probably a hardware or network problem on the 192.168.1.0/24 side. z2m starts and serves an error page on :8080; it will connect automatically once the coordinator returns.
- The `ezsp` adapter in the z2m config is deprecated; migration to `ember` is recommended. No new configuration is needed, only changing the field to `adapter: ember` in `configuration.yaml`.
- chelsty-ha has no node-agent. Add one when a machine is available or via manual bootstrap.
- Redis on piha still contains old keys (`homelab:nodes:*`, `homelab:incidents:*`, etc.); the materializer no longer uses them in API mode, so they can be cleared.


@ -1,83 +1,126 @@
# VPS Control Plane # VPS Control Plane
The VPS Control Plane is the orchestration brain of the homelab platform. It runs on the Hetzner VPS and provides observability, automated reconciliation, and a web-based operator interface. The VPS Control Plane is the orchestration brain of the homelab platform. It runs on the Hetzner VPS (Tailscale IP: `100.95.58.48`) and provides observability, automated reconciliation, and a web-based operator interface.
## Architecture ## Architecture
The control plane consists of four core services running as a Docker Compose stack: The control plane consists of four core services running as a Docker Compose stack under `services/control-plane/`:
1. **Observer**: Synthesizes world state from events. | Container | Role |
2. **Supervisor**: Detects drifts between desired and actual state. |-----------|------|
3. **Executor**: Executes approved actions from the queue. | `control-plane-observer` | Synthesizes world state from events in `/opt/homelab/events/` |
4. **Operator UI**: Web interface for system monitoring and action approval. | `control-plane-supervisor` | Detects drift between desired state (`hosts/*/services.yaml`) and actual state (`world/services.json`); writes pending actions |
| `control-plane-executor` | Executes approved actions from `/opt/homelab/actions/approved/` |
| `control-plane-ui` | Web interface for system monitoring and action approval; serves port 18180 |
All services adhere to **filesystem-first** semantics, using `/opt/homelab/` as the primary data exchange and persistence layer. All services use **filesystem-first** semantics with `/opt/homelab/` as the data exchange layer. All four run with `network_mode: host` and as UID 1000 (`homelab` user).
## Deployment Flow ## Supervisor Behavior
### 1. Prerequisites ### Desired State
- Target VPS node must be onboarded (Tailscale active, Docker installed). Loaded from `hosts/*/services.yaml` each reconcile cycle. Services with `monitor: false` are silently skipped — use this for services without a node-agent (e.g. `homeassistant` on `chelsty-ha`).
- Repository cloned to `/home/oskar/homelab-codex-ws`.
### 2. Bootstrap ### Drift Types
Run the local deployment script on the VPS to initialize the runtime filesystem and start the stack: - `missing_service` — service is in desired state but absent from `services.json`
- `unhealthy_service` — service exists in `services.json` but `status != healthy`
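A minimal sketch of how the two drift types could be derived, including the `monitor: false` skip. The data shapes are assumptions: `desired` is taken to be a dict keyed by `"node/service"` with the per-service settings, `actual` mirrors `services.json`, and `detect_drift` is a hypothetical name.

```python
def detect_drift(desired, actual):
    """Return a list of (service_key, drift_type) tuples.

    desired -- {"node/service": {"monitor": bool, ...}} (assumed shape)
    actual  -- {"node/service": {"status": str, ...}}, as in services.json
    """
    drifts = []
    for key, spec in sorted(desired.items()):
        if spec.get("monitor", True) is False:
            continue  # monitor: false services are silently skipped
        record = actual.get(key)
        if record is None:
            drifts.append((key, "missing_service"))
        elif record.get("status") != "healthy":
            drifts.append((key, "unhealthy_service"))
    return drifts
```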
### Action Types
| Trigger | Action type | Risk |
|---------|-------------|------|
| `containers_not_running`, `mqtt_unreachable` | `container_restart` | low |
| Any other / unknown | `redeploy` | guarded |
| Node `disk_pressure: high` | `disk_cleanup` | guarded |
### Action ID Stability
Action IDs are deterministic: `redeploy-{node}-{service}` or `container-restart-{node}-{service}`. The same drift always produces the same filename, making reconcile truly idempotent across supervisor restarts.
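The mapping from trigger to action type and the deterministic ID scheme can be sketched together. This is an illustration, not the supervisor's code: `plan_action` and the returned dict layout are hypothetical, and the node-level `disk_cleanup` action is left out because its ID format is not documented here.

```python
LOW_RISK_TRIGGERS = {"containers_not_running", "mqtt_unreachable"}


def plan_action(node, service, trigger):
    """Build an action record whose ID depends only on type, node and service."""
    if trigger in LOW_RISK_TRIGGERS:
        action_type, risk, prefix = "container_restart", "low", "container-restart"
    else:
        # Any other or unknown trigger falls back to a guarded redeploy.
        action_type, risk, prefix = "redeploy", "guarded", "redeploy"
    return {
        "id": f"{prefix}-{node}-{service}",
        "type": action_type,
        "risk": risk,
        "node": node,
        "service": service,
    }
```

Since the ID contains no timestamp or random component, writing the same drift twice overwrites the same pending file instead of queueing a duplicate.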
### Auto-Cancel
Pending `redeploy` and `container_restart` actions are automatically moved to `cancelled/` when:
- **`drift_resolved_auto`** — the service becomes `healthy` in actual state
- **`service_removed_from_desired_state`** — the service was removed from `services.yaml` or marked `monitor: false`
Only `pending` actions are auto-cancelled. Approved/running actions have been committed to by the operator and are never cancelled automatically.
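The auto-cancel decision can be expressed as a pure function that returns the cancellation reason, if any. This sketch assumes it is only called for actions in the pending queue; `cancel_reason` and the dict shapes are hypothetical, while the two reason strings come from the list above.

```python
AUTO_CANCELLABLE_TYPES = {"redeploy", "container_restart"}


def cancel_reason(action, desired, actual):
    """Return the auto-cancel reason for a pending action, or None to keep it.

    action  -- {"type": ..., "node": ..., "service": ...} (assumed shape)
    desired -- {"node/service": {"monitor": bool, ...}}
    actual  -- {"node/service": {"status": str, ...}}
    """
    if action["type"] not in AUTO_CANCELLABLE_TYPES:
        return None
    key = f"{action['node']}/{action['service']}"
    spec = desired.get(key)
    if spec is None or spec.get("monitor", True) is False:
        return "service_removed_from_desired_state"
    if actual.get(key, {}).get("status") == "healthy":
        return "drift_resolved_auto"
    return None
```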
### Node Name Resolution
The supervisor supports a `NODE_ALIAS_MAP` environment variable (JSON string) to map event/world-state node names to canonical topology names:
```bash ```bash
cd services/control-plane NODE_ALIAS_MAP='{"node-2": "chelsty-infra", "node-1": "piha"}'
bash deploy-local.sh
``` ```
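For illustration, parsing the `NODE_ALIAS_MAP` JSON string and resolving a node name might look like the sketch below. The helper names `load_alias_map` and `canonical_node` are hypothetical, and treating an unset variable as an empty map is an assumption.

```python
import json
import os


def load_alias_map(environ=None):
    """Parse NODE_ALIAS_MAP (a JSON object string) into a dict of str -> str."""
    environ = os.environ if environ is None else environ
    raw = environ.get("NODE_ALIAS_MAP", "")
    if not raw.strip():
        return {}  # assumption: unset or empty means no aliases
    parsed = json.loads(raw)
    if not isinstance(parsed, dict):
        raise ValueError("NODE_ALIAS_MAP must be a JSON object")
    return {str(alias): str(name) for alias, name in parsed.items()}


def canonical_node(name, alias_map):
    """Map an event/world-state node name to its topology name."""
    return alias_map.get(name, name)
```

Names without an alias pass through unchanged, so nodes that already report their topology name need no entry in the map.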
### 3. Verification ## Deployment
Verify the stack is healthy using the deployment script or check container status on the VPS:
### From SATURN (primary control node)
```bash ```bash
# Check status via deploy script # Full deploy via SSH
./scripts/deploy/deploy-control-plane.sh --ssh ./scripts/deploy/deploy-control-plane.sh --ssh
# Manual status check on VPS # Or manually:
ssh oskar@100.95.58.48 "cd ~/homelab-codex-ws && git pull origin master && cd services/control-plane && docker compose up -d --build --force-recreate"
```
### Direct on VPS
```bash
cd ~/homelab-codex-ws/services/control-plane
docker compose up -d --build --force-recreate
```
`deploy-local.sh` also creates the required `/opt/homelab/` directory structure and sets ownership to UID 1000 (requires `sudo`). If directories already exist, skip to the `docker compose` step directly.
### Verification
```bash
# On VPS
docker ps --filter "name=control-plane" docker ps --filter "name=control-plane"
curl -s http://localhost:18180/summary | python3 -m json.tool
``` ```
## Operational Workflows ## Action Approval Workflow
### Action Approval ```
1. Access the Operator UI (via Tailscale IP or Nginx Proxy Manager). Supervisor writes → /opt/homelab/actions/pending/<id>.json
2. Navigate to **Action Queue**. → Operator UI (port 18180) or Telegram Bot notifies
3. Review **Pending** actions recommended by the Supervisor. → Operator clicks Approve
4. Click **Approve** to move actions to the execution queue. → /opt/homelab/actions/approved/<id>.json
→ Executor executes → completed / failed
```
### Recovery Flow Possible action states: `pending → approved → running → completed / failed / rejected`
In case of control plane failure: Auto-cancel path: `pending → cancelled/`
1. Check logs using `docker logs`.
2. Restart stack using the local deployment script: `bash deploy-local.sh`.
3. Rebuild world state: Delete `/opt/homelab/state/observer_checkpoint.json` and redeploy.
### Upgrade Flow ## Recovery
To deploy updates from the SOLARIA/control host:
### World state is stale or corrupt
```bash
# On VPS — delete checkpoint to force full replay
rm /opt/homelab/state/observer_checkpoint.json
docker restart control-plane-observer
```
### Flood of pending actions after bootstrap
Check if node-agent is running and emitting `service_healthy` events on each node. Without `service_healthy`, the supervisor sees all services as missing and queues redeployments every cycle.
```bash ```bash
./scripts/deploy/deploy-control-plane.sh --ssh # Check node-agent on each node
ssh oskar@<node> "docker ps --filter name=node-agent && docker logs node-agent --tail 20"
``` ```
### Rollback Semantics ### Rebuild from scratch
Since the runtime is filesystem-first and append-only: ```bash
1. Roll back the repository state to a previous commit. ssh oskar@100.95.58.48 "cd ~/homelab-codex-ws/services/control-plane && docker compose up -d --build --force-recreate"
2. Restart the control plane stack. ```
3. The supervisor will detect drift against the older (rolled-back) desired state and recommend actions to restore it.
## Runtime Safety
- **Readonly Mounts**: Most services mount the repository as `:ro` to prevent accidental mutations.
- **Least-Privilege**: UI, Observer, and Supervisor run as non-root `homelab` user (UID 1000).
- **Filesystem Isolation**: Clear separation between `/repo` (code/inventory) and `/opt/homelab` (runtime state).
## Integration ## Integration
### piha agent-system webui (port 18180 on piha)
The `agent-system-runtime-materializer` on piha polls the VPS control-plane API every 10 seconds and mirrors world state to piha's local `/opt/homelab/world/`. This ensures the **"Copy for AI"** button in the piha webui (`agent-system-webui`) reflects the same clean state as the VPS API.
Override: `hosts/piha/runtime/agent-system/docker-compose.override.yml` — sets `CONTROL_PLANE_URL=http://100.95.58.48:18180`.
### Nginx Proxy Manager ### Nginx Proxy Manager
Configure a proxy host in NPM to point to `http://control-plane-ui:8080`. Ensure Websockets are enabled if the UI uses them. The operator UI at port 18180 can be proxied via NPM for external access. No WebSocket support required.
### Log Locations ### Log Locations
- Container logs: `docker compose logs` - Container logs: `docker compose logs -f` (from `services/control-plane/`)
- Runtime events: `/opt/homelab/events/YYYY-MM-DD/` - Runtime events: `/opt/homelab/events/YYYY-MM-DD/`
- World state: `/opt/homelab/world/` - World state: `/opt/homelab/world/`
- Diagnostics: `/opt/homelab/logs/` - Action queue: `/opt/homelab/actions/{pending,approved,running,completed,failed,cancelled}/`