docs: session summary 2026-05-27 + update observer/control-plane/chelsty docs

docs/sessions/2026-05-27.md (new):
- Full session record: problems found, all commits shipped, end state
- Written in Polish per operator preference for session notes
- Known limitations: SLZB-06U offline, ezsp→ember migration pending

docs/observer-runtime.md:
- Document per-node checkpoint format (replaces old global checkpoint)
- Add service_healthy / service_recovered resolution behavior
- Document ghost key pruning (_prune_stale_world patterns)
- Add event type reference table (negative vs positive)

docs/vps-control-plane.md:
- Add container names and network_mode: host detail
- Document monitor:false, NODE_ALIAS_MAP, auto-cancel behavior
- Add piha agent-system materializer integration note
- Rewrite recovery section with actionable bootstrap-flood diagnosis
- Add action state machine (pending→approved→running→completed/cancelled)

docs/chelsty-runtime.md:
- Add chelsty-infra/chelsty-ha node table
- Document docker-compose v1 constraint (always use docker-compose, not docker compose)
- Add mosquitto network_mode:host + z2m extra_hosts:host-gateway explanation
- Add z2m config writable requirement (EROFS failure mode documented)
- Add chelsty-ha monitor:false rationale
- Add minimal configuration.yaml template for z2m

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This commit is contained in:
Oskar Kapala 2026-05-27 16:18:31 +02:00
parent 7277bdc27f
commit 603e10a364
4 changed files with 396 additions and 122 deletions

View file

@ -1,61 +1,154 @@
# CHELSTY Runtime
This document describes the runtime environment and deployment flow for CHELSTY, an offline-capable home automation edge node.
This document describes the runtime environment and deployment flow for CHELSTY, an offline-capable home automation edge node split across two VMs.
| Node | Role | Services |
|------|------|----------|
| `chelsty-infra` | LTE edge hypervisor | Mosquitto, Zigbee2MQTT, stability-agent, node-agent |
| `chelsty-ha` | Home Assistant VM | homeassistant (no node-agent — see below) |
Both nodes share an LTE uplink and must function fully offline (Zigbee, MQTT, HA automations) without any connectivity to SATURN, VPS, or Forgejo.
## Runtime Layout
The CHELSTY runtime is located at `/opt/homelab`.
- `/opt/homelab/config/`: Service-specific configurations and compose overrides.
- `/opt/homelab/data/`: Persistent data for services.
- `/opt/homelab/logs/`: Service logs.
### Key Service Locations
- **Mosquitto**: `/opt/homelab/config/mosquitto/`
- **Zigbee2MQTT**: `/opt/homelab/config/zigbee2mqtt/`
```
/opt/homelab/
├── config/ # Service-specific configs and secrets (not in Git)
│ ├── mosquitto/
│ └── zigbee2mqtt/
├── data/ # Persistent service data
│ ├── mosquitto/ # Persistence DB, password file
│ └── zigbee2mqtt/
│ └── data/ # z2m config, coordinator backup, network key
└── logs/
```
## SLZB-06U Integration
CHELSTY uses a SMLIGHT SLZB-06U Zigbee coordinator connected via Ethernet/TCP.
CHELSTY uses a SMLIGHT SLZB-06U Zigbee coordinator connected over Ethernet/TCP.
- **Coordinator IP**: 192.168.1.105
- **Port**: 6638
- **Protocol**: TCP (ezsp adapter)
- **Coordinator IP**: `192.168.1.105`
- **Port**: `6638`
- **Adapter**: `ezsp` (deprecated — migration to `ember` recommended, requires only changing `adapter: ember` in `configuration.yaml`)
- **Zigbee2MQTT config key**: `serial.port: tcp://192.168.1.105:6638`
Zigbee2MQTT is configured to connect to this coordinator over the local network.
⚠️ Never use `/dev/ttyUSB0` — the coordinator is always TCP-only on this site.
## Offline & LTE Assumptions
## Networking Constraints
- **WAN Resilience**: All core automation (Zigbee, MQTT) runs locally on CHELSTY.
- **Connectivity**: LTE provides intermittent uplink for remote management and Tailscale access.
- **Home Assistant**: Runs on `chelsty-ha` node, connecting to the Mosquitto broker on `chelsty-infra`.
### Mosquitto — `network_mode: host`
Mosquitto runs with `network_mode: host` so that all containers on the same host can reach it at `localhost:1883`. **Do not change this.**
### Zigbee2MQTT — bridge network + extra_hosts
Zigbee2MQTT runs in a bridge-networked container (needed for port mapping compatibility with docker-compose v1). To reach the host-networked Mosquitto:
```yaml
# hosts/chelsty-infra/runtime/zigbee2mqtt/docker-compose.override.yml
services:
zigbee2mqtt:
extra_hosts:
- "mosquitto:host-gateway"
```
This maps the `mosquitto` hostname inside the z2m container to the Docker host gateway IP, so `mqtt://mosquitto:1883` reaches the host-networked Mosquitto process.
**Why not `network_mode: host` for z2m?**
chelsty-infra runs docker-compose v1 (1.29.2). In v1, `network_mode: host` cannot coexist with `ports:` declared in the base `docker-compose.yml` — raises `InvalidArgument`. The `extra_hosts` approach avoids this.
## Zigbee2MQTT Config Location
The `configuration.yaml` **must be writable** — z2m migrates and rewrites it on startup. It lives in the data directory:
```
/opt/homelab/data/zigbee2mqtt/data/configuration.yaml
```
This path is mounted read-write by the base `docker-compose.yml`:
```yaml
volumes:
- /opt/homelab/data/zigbee2mqtt/data:/app/data
```
Do **not** mount `configuration.yaml` as a separate `:ro` volume — z2m will fail with `EROFS`.
### Minimal configuration.yaml
```yaml
homeassistant: true
permit_join: false
mqtt:
base_topic: zigbee2mqtt
server: mqtt://mosquitto:1883
serial:
port: tcp://192.168.1.105:6638
adapter: ezsp
frontend:
port: 8080
advanced:
log_level: info
```
## chelsty-ha — No node-agent
`chelsty-ha` does not have a node-agent deployed. Home Assistant is monitored indirectly: if MQTT goes silent on `chelsty-infra`, HA is likely down.
In `hosts/chelsty-ha/services.yaml`:
```yaml
services:
homeassistant:
monitor: false # No node-agent; suppresses supervisor action generation
```
Remove `monitor: false` once node-agent is bootstrapped on this VM.
## Deployment Flow
1. **Initial Bootstrap**:
Run the bootstrap script on the CHELSTY node:
```bash
./scripts/bootstrap/chelsty-runtime.sh
```
### Initial Bootstrap
```bash
./scripts/bootstrap/chelsty-runtime.sh
```
2. **Manual Configuration**:
- Edit `/opt/homelab/config/zigbee2mqtt/.env` with MQTT credentials.
- Add Mosquitto user:
```bash
sudo mosquitto_passwd -b /opt/homelab/data/mosquitto/config/password.txt <user> <password>
```
### Deploy services
```bash
./scripts/deploy/deploy-node.sh chelsty-infra
./scripts/deploy/deploy-node.sh chelsty-ha
```
3. **Service Deployment**:
Use the staged deployment runtime:
```bash
./scripts/deploy/deploy-node.sh chelsty-infra
./scripts/deploy/deploy-node.sh chelsty-ha
```
### Manual (SSH) — chelsty-infra uses docker-compose v1
```bash
ssh oskar@100.122.201.22
cd ~/homelab-codex-ws/services/<service>
docker-compose -f docker-compose.yml \
-f ../../hosts/chelsty-infra/runtime/<service>/docker-compose.override.yml \
up -d --build --force-recreate
```
## Recovery Procedure
> **Note:** `docker compose` (v2) is **not** available on chelsty-infra — always use `docker-compose` (hyphenated, v1 1.29.2).
In case of runtime failure:
1. Verify Docker and Compose plugin: `docker compose version`
2. Re-run bootstrap script to ensure directory structure and basic configs.
3. Check Mosquitto logs: `tail -f /opt/homelab/data/mosquitto/log/mosquitto.log`
4. Verify SLZB-06U reachability: `ping 192.168.1.105`
## Recovery Procedures
### Mosquitto stopped
```bash
ssh oskar@100.122.201.22 "docker start mosquitto"
# Ensure restart policy is correct:
docker update --restart unless-stopped mosquitto
```
### Zigbee2MQTT won't start
1. Check logs: `docker logs zigbee2mqtt --tail 50`
2. Verify SLZB-06U reachable from host: `nc -zv 192.168.1.105 6638`
3. Verify config is not empty: `cat /opt/homelab/data/zigbee2mqtt/data/configuration.yaml`
4. If config missing, recreate from the minimal template above
### SLZB-06U unreachable
`192.168.1.105:6638` EHOSTUNREACH means the coordinator is offline or the LAN is down. Zigbee2MQTT will keep retrying — no restart needed once the coordinator returns.
## Critical Backup Sets
| Data | Path |
|------|------|
| HA config + DB | `/opt/homelab/data/homeassistant/` on chelsty-ha |
| z2m config + coordinator backup + network key | `/opt/homelab/data/zigbee2mqtt/data/` |
| Mosquitto persistence + password file | `/opt/homelab/data/mosquitto/` |
| SLZB-06U coordinator state | Backup via SLZB-06U web UI at `192.168.1.105` |
> ⚠️ The Zigbee network key is in `configuration.yaml` or `coordinator_backup.json` — losing it requires re-pairing all devices.

View file

@ -7,57 +7,92 @@ The Observer Runtime is a lightweight agent responsible for synthesizing the ope
The observer follows a filesystem-first approach, consuming append-only events and generating a normalized world model. It is designed to be idempotent, resumable, and resilient to intermittent node connectivity.
### Inputs
- `/opt/homelab/events/`: Normalized JSON events.
- `/opt/homelab/state/`: Deployment stage markers and internal observer checkpoint.
- `/opt/homelab/logs/`: Detailed execution logs and diagnostics.
- `/opt/homelab/events/`: Normalized JSON events (one `.json` file per event, organized by date and node).
- `/opt/homelab/state/observer_checkpoint.json`: Per-node checkpoint dict (see below).
- Repository Inventory: `inventory/topology.yaml` and `hosts/*/services.yaml`.
### World Model Output
Generated under `/opt/homelab/world/`:
- `nodes.json`: Current node availability, roles, and last seen timestamps.
- `services.json`: Service health status and links to active incidents.
- `nodes.json`: Current node availability, roles, disk/memory pressure, last seen timestamps. Dict keyed by node name.
- `services.json`: Service health status and links to active incidents. Dict keyed by `"node/service"`.
- `deployments.json`: Tracking of active and historical deployment runs by `correlation_id`.
- `incidents.json`: Correlated operational issues, including repeat failures and resolution status.
- `runtime-summary.json`: High-level overview for dashboards and planner agents.
## Incident Lifecycle
## Checkpoint Format
The observer implements lightweight incident correlation:
The observer tracks per-node progress to avoid silently skipping event directories:
1. **Detection**: When a `service_unhealthy` or `healthcheck_failed` event is consumed, a new incident is created or an existing active incident for that service is updated.
2. **Correlation**: Multiple failure events for the same service on the same node are collapsed into a single incident, tracking the `occurrence_count`.
3. **Diagnostics**: Deployment failures (`deployment_failed`) automatically attach references to diagnostic files if present in the event payload.
4. **Resolution**: A `service_recovered` event for a service will transition any active incidents for that service to a `resolved` state.
### Example Incident JSON
```json
{
"inc-1715518800-saturn-mosquitto": {
"id": "inc-1715518800-saturn-mosquitto",
"node": "saturn",
"service": "mosquitto",
"status": "resolved",
"severity": "error",
"started_at": "2026-05-12T12:05:00Z",
"last_occurrence": "2026-05-12T12:06:00Z",
"occurrence_count": 2,
"events": [
"2026-05-12T12:05:00Z",
"2026-05-12T12:06:00Z"
],
"correlation_id": "hc-1",
"resolved_at": "2026-05-12T12:10:00Z"
"node_checkpoints": {
"vps": "/opt/homelab/events/2026-05-27/vps/evt-vps-1234.json",
"piha": "/opt/homelab/events/2026-05-27/piha/evt-piha-5678.json",
"chelsty-infra": "/opt/homelab/events/2026-05-27/chelsty-infra/evt-chelsty-infra-9012.json"
}
}
```
A single global checkpoint (`last_processed_file`) was replaced with this per-node dict because the old approach silently skipped any node directory that sorts alphabetically before the last-seen node (e.g. `piha/` would be skipped when the checkpoint pointed to `vps/`).
**Reset:** Delete `/opt/homelab/state/observer_checkpoint.json`. The observer will reprocess all events and rebuild world state from scratch.
## Event Types
### Negative events (create/escalate incidents)
- `service_unhealthy`, `healthcheck_failed` — open or increment an active incident
- `deployment_failed` — record failure in deployments.json
### Positive events (resolve state)
- `service_healthy` — marks service status as `healthy` **and** resolves any active incident for that service
- `service_recovered` — alias, same effect
- `deployment_completed` — marks deployment as completed
### Node events
- `node_online`, `node_offline` — update node status in nodes.json
- `disk_pressure_*` — set `disk_pressure` field on the node record
## Incident Lifecycle
1. **Detection**: A `service_unhealthy` or `healthcheck_failed` event creates or increments an active incident.
2. **Correlation**: Multiple failure events for the same `node/service` are collapsed into one incident, incrementing `occurrence_count`.
3. **Resolution**: A `service_healthy` or `service_recovered` event resolves any active incident for that service, setting `status: resolved` and `resolved_at`.
4. **Expiry**: Resolved incidents older than 7 days are pruned from world state by `_prune_stale_world()`.
### Example Incident JSON
```json
{
"inc-1715518800-vps-observer": {
"id": "inc-1715518800-vps-observer",
"node": "vps",
"service": "observer",
"status": "resolved",
"severity": "error",
"started_at": 1715518800.0,
"last_occurrence": 1715518860.0,
"occurrence_count": 2,
"trigger_type": "containers_not_running",
"resolved_at": 1715519100.0
}
}
```
## World State Pruning
`_prune_stale_world()` runs every reconcile cycle and removes:
1. **Stale nodes** — nodes not present in `inventory/topology.yaml` (e.g. ghost nodes created when `NODE_NAME` was unset and fell back to the container's 12-char hex ID).
2. **Services of stale nodes** — all `node/service` keys whose node was pruned.
3. **Ghost service keys** — service keys whose service-name portion matches the pattern `<12hexchars>_<name>` (Docker internal stale-state artifacts, created when node-agent used `c.name` instead of the compose label).
4. **Expired incidents** — resolved incidents older than 7 days.
## Runtime Behavior
### Idempotency
The observer processes events in order. If the world state is lost, deleting the checkpoint file (`/opt/homelab/state/observer_checkpoint.json`) will cause the observer to re-process all events and rebuild the world state.
### Resumability
The observer tracks the last processed event file in its checkpoint. Upon restart, it continues from the next available event.
The observer processes events in order. Deleting the checkpoint and restarting replays all events and produces the same world state.
### Deployment Tracking
Deployments are tracked via `correlation_id`. The observer synthesizes the start, end, and status of each deployment run, providing a clear history of changes to the environment.
Deployments are tracked via `correlation_id`. The observer synthesizes the start, end, and status of each deployment run from events.
### Topology Filtering
Events from nodes not listed in `inventory/topology.yaml` are discarded during pruning. This prevents transient bootstrap noise from polluting world state.

103
docs/sessions/2026-05-27.md Normal file
View file

@ -0,0 +1,103 @@
# SESSION: Stabilizacja systemu wieloagentowego homelabu
**DATE:** 2026-05-27
**RESULT:** System NOMINAL (97/97 services, 0 errors)
---
## PROBLEMS FOUND
- stability-agent nie generował akcji naprawczych — tylko redeploy, brak container_restart
- mosquitto na chelsty-infra padł i nikt go nie restartował (restart policy był `no`)
- zigbee2mqtt nigdy nie był wdrożony na chelsty-infra
- node-agent był pustym szkieletem — nie emitował `service_healthy`, więc `services.json` zawsze był pusty
- ghost services: node-agent używał `c.name` (może zwrócić `<12hex>_real-name`) zamiast etykiety `com.docker.compose.service`
- materializer na piha czytał ze swojego lokalnego Redis zamiast z control-plane API — Redis zawierał 80 przestarzałych wpisów z ghost kluczami; "Copy for AI" zwracał stare dane
- observer używał jednego globalnego checkpointu zamiast per-node — cicho pomijał katalogi z eventami sortujące się przed aktualnym checkpointem
- supervisor nie cancelował resolved actions — pending queue rósł bez końca
- `service_healthy` event nie zamykał aktywnych incydentów
- NODE_ALIAS_MAP nie był skonfigurowany — mismatch nazw nodów między eventem a topology
- chelsty-ha błędnie w scope monitoringu — nie ma na nim node-agenta
---
## FIXES SHIPPED (commits in master)
```
7277bdc Fix Copy for AI: materializer fetches from control-plane API instead of Redis
b40b832 Fix ghost service keys from hash-prefixed Docker container names
28e9534 observer: service_healthy resolves active incidents
46ae92b supervisor: also cancel pending actions for services removed from desired state
410bfe7 zigbee2mqtt: config goes in data dir (writable), not separate ro mount
b3912fe zigbee2mqtt: use extra_hosts host-gateway instead of network_mode: host
61e07f4 zigbee2mqtt override: clear ports list for docker-compose v1 host network compat
51002d4 Fix pending actions: node_exporter, zigbee2mqtt, chelsty-ha monitoring
fb7828b supervisor: auto-cancel pending actions when drift is resolved
2f19657 fix(node-agent): unique event IDs per service to prevent same-second overwrites
267742c vps/node-agent: add network_mode: host for control-plane health probe
4e8968f Fix service health tracking: emit service_healthy, control-plane endpoint, checkpoint migration
f4a8db9 fix(observer): per-node-directory checkpoints replace single global checkpoint
a5a3e22 fix(node-agent): skip SSH config file in rsync to avoid UID ownership errors
2349de5 fix(node-agent): correct VPS_EVENTS_HOST to actual VPS Tailscale IP
65bac4e fix(node-agent): mount host SSH key into container for event shipping
96bf326 fix(observer+operator-ui): fix stale world state, dict→list API, event time filter
ae33cce feat(node-agent): add runtime overrides for piha, solaria, chelsty-infra
c5c080b feat(vps): add node-agent runtime override with NODE_NAME=vps
01b7758 feat(node-agent): implement health monitor and safe cleanup policy
```
### Szczegóły kluczowych napraw
**fix(observer): per-node checkpoints**
Jeden globalny checkpoint `last_processed_file` cicho pomijał katalogi eventów sortujące się alfabetycznie przed ostatnim przetworzonym węzłem (np. piha/ < vps/). Zastąpiony słownikiem `{"node_checkpoints": {"piha": "...", "vps": "..."}}` per-node.
**fix(observer): ghost key pruning**
`_prune_stale_world()` teraz usuwa wpisy z services.json których klucz serwisu pasuje do wzorca `<12hexchars>_<name>` — artefakty z Docker internal state tracking.
**fix(node-agent): canonical container name**
`check_containers()` teraz używa `com.docker.compose.service` label jako nazwy kanonicznej. Fallback: strip hash prefix z `c.name`. Kontenery w stanie `created` są pomijane (Docker stale-state artifacts).
**fix(node-agent): service_healthy emission**
Node-agent teraz emituje `service_healthy` dla każdego uruchomionego zarządzanego kontenera co cykl. Bez tego `services.json` był zawsze pusty — supervisor generował flood "missing service" redeployów.
**fix(supervisor): auto-cancel resolved actions**
`_cancel_resolved_pending_actions()` przenosi pending akcje do `cancelled/` gdy:
- serwis stał się healthy (`drift_resolved_auto`)
- serwis został usunięty z desired state (`service_removed_from_desired_state`)
**fix(supervisor): monitor:false**
Pole `monitor: false` w `services.yaml` wyklucza serwis z generowania akcji supervisora. Używane dla `homeassistant` na chelsty-ha (brak node-agenta).
**fix(agent-system/materializer): control-plane API as source**
Materializer na piha teraz fetchuje dane z VPS control-plane API (`CONTROL_PLANE_URL=http://100.95.58.48:18180`) zamiast z lokalnego Redis. Redis zawierał 80 przestarzałych wpisów. Redis path zachowany jako fallback.
**fix(chelsty-infra/zigbee2mqtt): mosquitto networking**
Mosquitto działa z `network_mode: host` — kontenery bridge nie mogą go dosięgnąć przez localhost. Rozwiązanie: `extra_hosts: - "mosquitto:host-gateway"` w override z2m. Nie używamy `network_mode: host` dla z2m bo koliduje z `ports:` w docker-compose v1 (1.29.2 na chelsty-infra).
**fix(chelsty-infra/zigbee2mqtt): writable config**
z2m migruje i nadpisuje `configuration.yaml` przy starcie. Config musi być w katalogu z danymi: `/opt/homelab/data/zigbee2mqtt/data/configuration.yaml` (read-write mount), nie w osobnym `:ro` wolumenie.
---
## STAN KOŃCOWY
| Node | Status | Serwisy |
|------|--------|---------|
| vps | online | control-plane (4), node-agent, node_exporter, stability-agent |
| piha | online | agent-system (4), node-agent, stability-agent, monitoring stack |
| solaria | online | node-agent, stability-agent, AI workloads |
| chelsty-infra | online | mosquitto, zigbee2mqtt (z2m łączy się gdy SLZB-06U wróci online), node-agent, stability-agent |
| chelsty-ha | — | homeassistant (monitor:false — brak node-agenta, HA monitorowane pośrednio przez MQTT) |
**Action queue:** 0 pending, 0 approved, 0 running
**Incidents:** 0 active
**Ghost service keys:** 0
---
## ZNANE OGRANICZENIA / TODO
- SLZB-06U (Zigbee coordinator) offline — `192.168.1.105:6638` EHOSTUNREACH z chelsty-infra. Prawdopodobnie problem sprzętowy/sieciowy po stronie 192.168.1.0/24. z2m startuje i serwuje stronę błędu na :8080 — połączy się automatycznie gdy coordinator wróci.
- `ezsp` adapter w konfiguracji z2m jest deprecated — zalecana migracja do `ember`. Nie wymaga nowej konfiguracji, tylko zmiana pola `adapter: ember` w `configuration.yaml`.
- chelsty-ha nie ma node-agenta. Dodać gdy będzie dostępna maszyna lub manual bootstrap.
- Redis na piha nadal zawiera stare klucze `homelab:nodes:*`, `homelab:incidents:*` etc. — nie są już używane przez materializer w trybie API, można wyczyścić.

View file

@ -1,83 +1,126 @@
# VPS Control Plane
The VPS Control Plane is the orchestration brain of the homelab platform. It runs on the Hetzner VPS and provides observability, automated reconciliation, and a web-based operator interface.
The VPS Control Plane is the orchestration brain of the homelab platform. It runs on the Hetzner VPS (Tailscale IP: `100.95.58.48`) and provides observability, automated reconciliation, and a web-based operator interface.
## Architecture
The control plane consists of four core services running as a Docker Compose stack:
The control plane consists of four core services running as a Docker Compose stack under `services/control-plane/`:
1. **Observer**: Synthesizes world state from events.
2. **Supervisor**: Detects drifts between desired and actual state.
3. **Executor**: Executes approved actions from the queue.
4. **Operator UI**: Web interface for system monitoring and action approval.
| Container | Role |
|-----------|------|
| `control-plane-observer` | Synthesizes world state from events in `/opt/homelab/events/` |
| `control-plane-supervisor` | Detects drift between desired state (`hosts/*/services.yaml`) and actual state (`world/services.json`); writes pending actions |
| `control-plane-executor` | Executes approved actions from `/opt/homelab/actions/approved/` |
| `control-plane-ui` | Web interface for system monitoring and action approval; serves port 18180 |
All services adhere to **filesystem-first** semantics, using `/opt/homelab/` as the primary data exchange and persistence layer.
All services use **filesystem-first** semantics with `/opt/homelab/` as the data exchange layer. All four run with `network_mode: host` and as UID 1000 (`homelab` user).
## Deployment Flow
## Supervisor Behavior
### 1. Prerequisites
- Target VPS node must be onboarded (Tailscale active, Docker installed).
- Repository cloned to `/home/oskar/homelab-codex-ws`.
### Desired State
Loaded from `hosts/*/services.yaml` each reconcile cycle. Services with `monitor: false` are silently skipped — use this for services without a node-agent (e.g. `homeassistant` on `chelsty-ha`).
### 2. Bootstrap
Run the local deployment script on the VPS to initialize the runtime filesystem and start the stack:
### Drift Types
- `missing_service` — service is in desired state but absent from `services.json`
- `unhealthy_service` — service exists in `services.json` but `status != healthy`
### Action Types
| Trigger | Action type | Risk |
|---------|-------------|------|
| `containers_not_running`, `mqtt_unreachable` | `container_restart` | low |
| Any other / unknown | `redeploy` | guarded |
| Node `disk_pressure: high` | `disk_cleanup` | guarded |
### Action ID Stability
Action IDs are deterministic: `redeploy-{node}-{service}` or `container-restart-{node}-{service}`. The same drift always produces the same filename, making reconcile truly idempotent across supervisor restarts.
### Auto-Cancel
Pending `redeploy` and `container_restart` actions are automatically moved to `cancelled/` when:
- **`drift_resolved_auto`** — the service becomes `healthy` in actual state
- **`service_removed_from_desired_state`** — the service was removed from `services.yaml` or marked `monitor: false`
Only `pending` actions are auto-cancelled. Approved/running actions have been committed to by the operator and are never cancelled automatically.
### Node Name Resolution
The supervisor supports a `NODE_ALIAS_MAP` environment variable (JSON string) to map event/world-state node names to canonical topology names:
```bash
cd services/control-plane
bash deploy-local.sh
NODE_ALIAS_MAP='{"node-2": "chelsty-infra", "node-1": "piha"}'
```
### 3. Verification
Verify the stack is healthy using the deployment script or check container status on the VPS:
## Deployment
### From SATURN (primary control node)
```bash
# Check status via deploy script
# Full deploy via SSH
./scripts/deploy/deploy-control-plane.sh --ssh
# Manual status check on VPS
# Or manually:
ssh oskar@100.95.58.48 "cd ~/homelab-codex-ws && git pull origin master && cd services/control-plane && docker compose up -d --build --force-recreate"
```
### Direct on VPS
```bash
cd ~/homelab-codex-ws/services/control-plane
docker compose up -d --build --force-recreate
```
`deploy-local.sh` also creates the required `/opt/homelab/` directory structure and sets ownership to UID 1000 (requires `sudo`). If directories already exist, skip to the `docker compose` step directly.
### Verification
```bash
# On VPS
docker ps --filter "name=control-plane"
curl -s http://localhost:18180/summary | python3 -m json.tool
```
## Operational Workflows
## Action Approval Workflow
### Action Approval
1. Access the Operator UI (via Tailscale IP or Nginx Proxy Manager).
2. Navigate to **Action Queue**.
3. Review **Pending** actions recommended by the Supervisor.
4. Click **Approve** to move actions to the execution queue.
```
Supervisor writes → /opt/homelab/actions/pending/<id>.json
→ Operator UI (port 18180) or Telegram Bot notifies
→ Operator clicks Approve
→ /opt/homelab/actions/approved/<id>.json
→ Executor executes → completed / failed
```
### Recovery Flow
In case of control plane failure:
1. Check logs using `docker logs`.
2. Restart stack using the local deployment script: `bash deploy-local.sh`.
3. Rebuild world state: Delete `/opt/homelab/state/observer_checkpoint.json` and redeploy.
Possible action states: `pending → approved → running → completed / failed / rejected`
Auto-cancel path: `pending → cancelled/`
### Upgrade Flow
To deploy updates from the SOLARIA/control host:
## Recovery
### World state is stale or corrupt
```bash
# On VPS — delete checkpoint to force full replay
rm /opt/homelab/state/observer_checkpoint.json
docker restart control-plane-observer
```
### Flood of pending actions after bootstrap
Check if node-agent is running and emitting `service_healthy` events on each node. Without `service_healthy`, the supervisor sees all services as missing and queues redeployments every cycle.
```bash
./scripts/deploy/deploy-control-plane.sh --ssh
# Check node-agent on each node
ssh oskar@<node> "docker ps --filter name=node-agent && docker logs node-agent --tail 20"
```
### Rollback Semantics
Since the runtime is filesystem-first and append-only:
1. Roll back the repository state to a previous commit.
2. Restart the control plane stack.
3. The supervisor will detect drift against the older (rolled-back) desired state and recommend actions to restore it.
## Runtime Safety
- **Readonly Mounts**: Most services mount the repository as `:ro` to prevent accidental mutations.
- **Least-Privilege**: UI, Observer, and Supervisor run as non-root `homelab` user (UID 1000).
- **Filesystem Isolation**: Clear separation between `/repo` (code/inventory) and `/opt/homelab` (runtime state).
### Rebuild from scratch
```bash
ssh oskar@100.95.58.48 "cd ~/homelab-codex-ws/services/control-plane && docker compose up -d --build --force-recreate"
```
## Integration
### piha agent-system webui (port 18180 on piha)
The `agent-system-runtime-materializer` on piha polls the VPS control-plane API every 10 seconds and mirrors world state to piha's local `/opt/homelab/world/`. This ensures the **"Copy for AI"** button in the piha webui (`agent-system-webui`) reflects the same clean state as the VPS API.
Override: `hosts/piha/runtime/agent-system/docker-compose.override.yml` — sets `CONTROL_PLANE_URL=http://100.95.58.48:18180`.
### Nginx Proxy Manager
Configure a proxy host in NPM to point to `http://control-plane-ui:8080`. Ensure Websockets are enabled if the UI uses them.
The operator UI at port 18180 can be proxied via NPM for external access. No WebSocket support required.
### Log Locations
- Container logs: `docker compose logs`
- Container logs: `docker compose logs -f` (from `services/control-plane/`)
- Runtime events: `/opt/homelab/events/YYYY-MM-DD/`
- World state: `/opt/homelab/world/`
- Diagnostics: `/opt/homelab/logs/`
- Action queue: `/opt/homelab/actions/{pending,approved,running,completed,failed,cancelled}/`