diff --git a/docs/chelsty-runtime.md b/docs/chelsty-runtime.md
index 257868e..ceea3fe 100644
--- a/docs/chelsty-runtime.md
+++ b/docs/chelsty-runtime.md
@@ -1,61 +1,154 @@
 # CHELSTY Runtime
 
-This document describes the runtime environment and deployment flow for CHELSTY, an offline-capable home automation edge node.
+This document describes the runtime environment and deployment flow for CHELSTY, an offline-capable home automation edge node split across two VMs.
+
+| Node | Role | Services |
+|------|------|----------|
+| `chelsty-infra` | LTE edge hypervisor | Mosquitto, Zigbee2MQTT, stability-agent, node-agent |
+| `chelsty-ha` | Home Assistant VM | homeassistant (no node-agent — see below) |
+
+Both nodes share an LTE uplink and must function fully offline (Zigbee, MQTT, HA automations) without any connectivity to SATURN, VPS, or Forgejo.
 
 ## Runtime Layout
 
-The CHELSTY runtime is located at `/opt/homelab`.
-
-- `/opt/homelab/config/`: Service-specific configurations and compose overrides.
-- `/opt/homelab/data/`: Persistent data for services.
-- `/opt/homelab/logs/`: Service logs.
-
-### Key Service Locations
-- **Mosquitto**: `/opt/homelab/config/mosquitto/`
-- **Zigbee2MQTT**: `/opt/homelab/config/zigbee2mqtt/`
+```
+/opt/homelab/
+├── config/              # Service-specific configs and secrets (not in Git)
+│   ├── mosquitto/
+│   └── zigbee2mqtt/
+├── data/                # Persistent service data
+│   ├── mosquitto/       # Persistence DB, password file
+│   └── zigbee2mqtt/
+│       └── data/        # z2m config, coordinator backup, network key
+└── logs/
+```
 
 ## SLZB-06U Integration
 
-CHELSTY uses a SMLIGHT SLZB-06U Zigbee coordinator connected via Ethernet/TCP.
+CHELSTY uses a SMLIGHT SLZB-06U Zigbee coordinator connected over Ethernet/TCP.
 
-- **Coordinator IP**: 192.168.1.105
-- **Port**: 6638
-- **Protocol**: TCP (ezsp adapter)
+- **Coordinator IP**: `192.168.1.105`
+- **Port**: `6638`
+- **Adapter**: `ezsp` (deprecated — migration to `ember` recommended, requires only changing `adapter: ember` in `configuration.yaml`)
+- **Zigbee2MQTT config key**: `serial.port: tcp://192.168.1.105:6638`
 
-Zigbee2MQTT is configured to connect to this coordinator over the local network.
+⚠️ Never use `/dev/ttyUSB0` — the coordinator is always TCP-only on this site.
 
-## Offline & LTE Assumptions
+## Networking Constraints
 
-- **WAN Resilience**: All core automation (Zigbee, MQTT) runs locally on CHELSTY.
-- **Connectivity**: LTE provides intermittent uplink for remote management and Tailscale access.
-- **Home Assistant**: Runs on `chelsty-ha` node, connecting to the Mosquitto broker on `chelsty-infra`.
+### Mosquitto — `network_mode: host`
+Mosquitto runs with `network_mode: host`, so host processes and other host-networked containers can reach it at `localhost:1883`; bridge-networked containers cannot. **Do not change this.**
+
+### Zigbee2MQTT — bridge network + extra_hosts
+Zigbee2MQTT runs in a bridge-networked container (needed for port mapping compatibility with docker-compose v1). To reach the host-networked Mosquitto:
+
+```yaml
+# hosts/chelsty-infra/runtime/zigbee2mqtt/docker-compose.override.yml
+services:
+  zigbee2mqtt:
+    extra_hosts:
+      - "mosquitto:host-gateway"
+```
+
+This maps the `mosquitto` hostname inside the z2m container to the Docker host gateway IP, so `mqtt://mosquitto:1883` reaches the host-networked Mosquitto process.
+
+**Why not `network_mode: host` for z2m?**
+chelsty-infra runs docker-compose v1 (1.29.2). In v1, `network_mode: host` cannot coexist with `ports:` declared in the base `docker-compose.yml` — raises `InvalidArgument`. The `extra_hosts` approach avoids this.
+
+## Zigbee2MQTT Config Location
+
+The `configuration.yaml` **must be writable** — z2m migrates and rewrites it on startup. It lives in the data directory:
+
+```
+/opt/homelab/data/zigbee2mqtt/data/configuration.yaml
+```
+
+This path is mounted read-write by the base `docker-compose.yml`:
+```yaml
+volumes:
+  - /opt/homelab/data/zigbee2mqtt/data:/app/data
+```
+
+Do **not** mount `configuration.yaml` as a separate `:ro` volume — z2m will fail with `EROFS`.
+
+### Minimal configuration.yaml
+```yaml
+homeassistant: true
+permit_join: false
+mqtt:
+  base_topic: zigbee2mqtt
+  server: mqtt://mosquitto:1883
+serial:
+  port: tcp://192.168.1.105:6638
+  adapter: ezsp
+frontend:
+  port: 8080
+advanced:
+  log_level: info
+```
+
+## chelsty-ha — No node-agent
+
+`chelsty-ha` does not have a node-agent deployed. Home Assistant is monitored indirectly: if MQTT goes silent on `chelsty-infra`, HA is likely down.
+
+In `hosts/chelsty-ha/services.yaml`:
+```yaml
+services:
+  homeassistant:
+    monitor: false  # No node-agent; suppresses supervisor action generation
+```
+
+Remove `monitor: false` once node-agent is bootstrapped on this VM.
 
 ## Deployment Flow
 
-1. **Initial Bootstrap**:
-   Run the bootstrap script on the CHELSTY node:
-   ```bash
-   ./scripts/bootstrap/chelsty-runtime.sh
-   ```
+### Initial Bootstrap
+```bash
+./scripts/bootstrap/chelsty-runtime.sh
+```
 
-2. **Manual Configuration**:
-   - Edit `/opt/homelab/config/zigbee2mqtt/.env` with MQTT credentials.
-   - Add Mosquitto user:
-     ```bash
-     sudo mosquitto_passwd -b /opt/homelab/data/mosquitto/config/password.txt
-     ```
+### Deploy services
+```bash
+./scripts/deploy/deploy-node.sh chelsty-infra
+./scripts/deploy/deploy-node.sh chelsty-ha
+```
 
-3. **Service Deployment**:
-   Use the staged deployment runtime:
-   ```bash
-   ./scripts/deploy/deploy-node.sh chelsty-infra
-   ./scripts/deploy/deploy-node.sh chelsty-ha
-   ```
+### Manual (SSH) — chelsty-infra uses docker-compose v1
+```bash
+ssh oskar@100.122.201.22
+cd ~/homelab-codex-ws/services/
+docker-compose -f docker-compose.yml \
+  -f ../../hosts/chelsty-infra/runtime//docker-compose.override.yml \
+  up -d --build --force-recreate
+```
 
-## Recovery Procedure
+> **Note:** `docker compose` (v2) is **not** available on chelsty-infra — always use `docker-compose` (hyphenated, v1 1.29.2).
 
-In case of runtime failure:
-1. Verify Docker and Compose plugin: `docker compose version`
-2. Re-run bootstrap script to ensure directory structure and basic configs.
-3. Check Mosquitto logs: `tail -f /opt/homelab/data/mosquitto/log/mosquitto.log`
-4. Verify SLZB-06U reachability: `ping 192.168.1.105`
+## Recovery Procedures
+
+### Mosquitto stopped
+```bash
+ssh oskar@100.122.201.22 "docker start mosquitto"
+# Ensure restart policy is correct:
+docker update --restart unless-stopped mosquitto
+```
+
+### Zigbee2MQTT won't start
+1. Check logs: `docker logs zigbee2mqtt --tail 50`
+2. Verify SLZB-06U reachable from host: `nc -zv 192.168.1.105 6638`
+3. Verify config is not empty: `cat /opt/homelab/data/zigbee2mqtt/data/configuration.yaml`
+4. If config missing, recreate from the minimal template above
+
+### SLZB-06U unreachable
+`192.168.1.105:6638` EHOSTUNREACH means the coordinator is offline or the LAN is down. Zigbee2MQTT will keep retrying — no restart needed once the coordinator returns.
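The reachability check from the recovery steps (`nc -zv 192.168.1.105 6638`) can also be scripted. The following is a minimal sketch using only the Python standard library; the function name `coordinator_reachable` and the 3-second timeout are illustrative assumptions, not part of the repository.

```python
import socket


def coordinator_reachable(host: str = "192.168.1.105", port: int = 6638,
                          timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to the coordinator can be opened.

    Hypothetical helper: the defaults mirror the SLZB-06U address from this
    document; the timeout value is an assumption.
    """
    try:
        # Opening and immediately closing the socket is the same probe
        # that `nc -z` performs.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused, timed-out and EHOSTUNREACH connection attempts.
        return False
```

Calling `coordinator_reachable()` from the chelsty-infra host returns `False` while the coordinator or the LAN is down. No restart logic is attached on purpose, since Zigbee2MQTT retries on its own.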
+
+## Critical Backup Sets
+
+| Data | Path |
+|------|------|
+| HA config + DB | `/opt/homelab/data/homeassistant/` on chelsty-ha |
+| z2m config + coordinator backup + network key | `/opt/homelab/data/zigbee2mqtt/data/` |
+| Mosquitto persistence + password file | `/opt/homelab/data/mosquitto/` |
+| SLZB-06U coordinator state | Backup via SLZB-06U web UI at `192.168.1.105` |
+
+> ⚠️ The Zigbee network key is in `configuration.yaml` or `coordinator_backup.json` — losing it requires re-pairing all devices.
diff --git a/docs/observer-runtime.md b/docs/observer-runtime.md
index dc6b1e6..fd65653 100644
--- a/docs/observer-runtime.md
+++ b/docs/observer-runtime.md
@@ -7,57 +7,92 @@ The Observer Runtime is a lightweight agent responsible for synthesizing the ope
 The observer follows a filesystem-first approach, consuming append-only events and generating a normalized world model. It is designed to be idempotent, resumable, and resilient to intermittent node connectivity.
 
 ### Inputs
-- `/opt/homelab/events/`: Normalized JSON events.
-- `/opt/homelab/state/`: Deployment stage markers and internal observer checkpoint.
-- `/opt/homelab/logs/`: Detailed execution logs and diagnostics.
+- `/opt/homelab/events/`: Normalized JSON events (one `.json` file per event, organized by date and node).
+- `/opt/homelab/state/observer_checkpoint.json`: Per-node checkpoint dict (see below).
 - Repository Inventory: `inventory/topology.yaml` and `hosts/*/services.yaml`.
 
 ### World Model Output
 Generated under `/opt/homelab/world/`:
-- `nodes.json`: Current node availability, roles, and last seen timestamps.
-- `services.json`: Service health status and links to active incidents.
+- `nodes.json`: Current node availability, roles, disk/memory pressure, last seen timestamps. Dict keyed by node name.
+- `services.json`: Service health status and links to active incidents. Dict keyed by `"node/service"`.
 - `deployments.json`: Tracking of active and historical deployment runs by `correlation_id`.
 - `incidents.json`: Correlated operational issues, including repeat failures and resolution status.
 - `runtime-summary.json`: High-level overview for dashboards and planner agents.
 
-## Incident Lifecycle
+## Checkpoint Format
 
-The observer implements lightweight incident correlation:
+The observer tracks per-node progress to avoid silently skipping event directories:
 
-1. **Detection**: When a `service_unhealthy` or `healthcheck_failed` event is consumed, a new incident is created or an existing active incident for that service is updated.
-2. **Correlation**: Multiple failure events for the same service on the same node are collapsed into a single incident, tracking the `occurrence_count`.
-3. **Diagnostics**: Deployment failures (`deployment_failed`) automatically attach references to diagnostic files if present in the event payload.
-4. **Resolution**: A `service_recovered` event for a service will transition any active incidents for that service to a `resolved` state.
-
-### Example Incident JSON
 ```json
 {
-  "inc-1715518800-saturn-mosquitto": {
-    "id": "inc-1715518800-saturn-mosquitto",
-    "node": "saturn",
-    "service": "mosquitto",
-    "status": "resolved",
-    "severity": "error",
-    "started_at": "2026-05-12T12:05:00Z",
-    "last_occurrence": "2026-05-12T12:06:00Z",
-    "occurrence_count": 2,
-    "events": [
-      "2026-05-12T12:05:00Z",
-      "2026-05-12T12:06:00Z"
-    ],
-    "correlation_id": "hc-1",
-    "resolved_at": "2026-05-12T12:10:00Z"
+  "node_checkpoints": {
+    "vps": "/opt/homelab/events/2026-05-27/vps/evt-vps-1234.json",
+    "piha": "/opt/homelab/events/2026-05-27/piha/evt-piha-5678.json",
+    "chelsty-infra": "/opt/homelab/events/2026-05-27/chelsty-infra/evt-chelsty-infra-9012.json"
   }
 }
 ```
 
+A single global checkpoint (`last_processed_file`) was replaced with this per-node dict because the old approach silently skipped any node directory that sorts alphabetically before the last-seen node (e.g. `piha/` would be skipped when the checkpoint pointed to `vps/`).
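The per-node checkpoint logic can be illustrated with a short sketch. This is not the observer's actual code: the function name `pending_events` is hypothetical, and it assumes event paths follow the `.../events/<date>/<node>/<file>.json` layout shown above and that file names within one node sort chronologically.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def pending_events(event_files, node_checkpoints):
    """Group not-yet-processed event files by node.

    event_files: iterable of paths like .../events/<date>/<node>/<file>.json
    node_checkpoints: dict mapping node name -> last processed path
    """
    by_node = defaultdict(list)
    for path in sorted(event_files):
        # The node name is the directory that directly contains the event.
        node = PurePosixPath(path).parent.name
        by_node[node].append(path)

    pending = {}
    for node, files in by_node.items():
        last = node_checkpoints.get(node)
        # Compare only against this node's own checkpoint, so another
        # node's progress can never hide this node's directory.
        todo = [f for f in files if last is None or f > last]
        if todo:
            pending[node] = todo
    return pending
```

With a single global checkpoint pointing at a `vps/` file, every `piha/` path on the same date compares as smaller and would be dropped; keying the comparison by node removes that ordering dependency.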
+
+**Reset:** Delete `/opt/homelab/state/observer_checkpoint.json`. The observer will reprocess all events and rebuild world state from scratch.
+
+## Event Types
+
+### Negative events (create/escalate incidents)
+- `service_unhealthy`, `healthcheck_failed` — open or increment an active incident
+- `deployment_failed` — record failure in deployments.json
+
+### Positive events (resolve state)
+- `service_healthy` — marks service status as `healthy` **and** resolves any active incident for that service
+- `service_recovered` — alias, same effect
+- `deployment_completed` — marks deployment as completed
+
+### Node events
+- `node_online`, `node_offline` — update node status in nodes.json
+- `disk_pressure_*` — set `disk_pressure` field on the node record
+
+## Incident Lifecycle
+
+1. **Detection**: A `service_unhealthy` or `healthcheck_failed` event creates or increments an active incident.
+2. **Correlation**: Multiple failure events for the same `node/service` are collapsed into one incident, incrementing `occurrence_count`.
+3. **Resolution**: A `service_healthy` or `service_recovered` event resolves any active incident for that service, setting `status: resolved` and `resolved_at`.
+4. **Expiry**: Resolved incidents older than 7 days are pruned from world state by `_prune_stale_world()`.
+
+### Example Incident JSON
+```json
+{
+  "inc-1715518800-vps-observer": {
+    "id": "inc-1715518800-vps-observer",
+    "node": "vps",
+    "service": "observer",
+    "status": "resolved",
+    "severity": "error",
+    "started_at": 1715518800.0,
+    "last_occurrence": 1715518860.0,
+    "occurrence_count": 2,
+    "trigger_type": "containers_not_running",
+    "resolved_at": 1715519100.0
+  }
+}
+```
+
+## World State Pruning
+
+`_prune_stale_world()` runs every reconcile cycle and removes:
+
+1. **Stale nodes** — nodes not present in `inventory/topology.yaml` (e.g. ghost nodes created when `NODE_NAME` was unset and fell back to the container's 12-char hex ID).
+2. **Services of stale nodes** — all `node/service` keys whose node was pruned.
+3. **Ghost service keys** — service keys whose service-name portion matches the pattern `<12hexchars>_` (Docker internal stale-state artifacts, created when node-agent used `c.name` instead of the compose label).
+4. **Expired incidents** — resolved incidents older than 7 days.
+
 ## Runtime Behavior
 
 ### Idempotency
-The observer processes events in order. If the world state is lost, deleting the checkpoint file (`/opt/homelab/state/observer_checkpoint.json`) will cause the observer to re-process all events and rebuild the world state.
-
-### Resumability
-The observer tracks the last processed event file in its checkpoint. Upon restart, it continues from the next available event.
+The observer processes events in order. Deleting the checkpoint and restarting replays all events and produces the same world state.
 
 ### Deployment Tracking
-Deployments are tracked via `correlation_id`. The observer synthesizes the start, end, and status of each deployment run, providing a clear history of changes to the environment.
+Deployments are tracked via `correlation_id`. The observer synthesizes the start, end, and status of each deployment run from events.
+
+### Topology Filtering
+Events from nodes not listed in `inventory/topology.yaml` are discarded during pruning. This prevents transient bootstrap noise from polluting world state.
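The four pruning rules listed for `_prune_stale_world()` can be sketched as a pure function. This is an illustration under assumptions, not the observer's implementation: the name `prune_world`, its signature, and the lowercase-hex regular expression for the `<12hexchars>_` pattern are hypothetical. Timestamps are taken to be epoch seconds, as in the example incident JSON.

```python
import re
import time

# Assumed encoding of the "<12hexchars>_" ghost pattern.
GHOST_SERVICE = re.compile(r"^[0-9a-f]{12}_")
INCIDENT_TTL_SECONDS = 7 * 24 * 3600  # resolved incidents expire after 7 days


def prune_world(nodes, services, incidents, topology_nodes, now=None):
    """Return pruned copies of the nodes, services and incidents dicts."""
    now = time.time() if now is None else now

    # Rule 1: drop nodes that are not in the topology inventory.
    kept_nodes = {n: v for n, v in nodes.items() if n in topology_nodes}

    kept_services = {}
    for key, value in services.items():
        node, _, service = key.partition("/")
        if node not in topology_nodes:
            continue  # Rule 2: service belongs to a stale node
        if GHOST_SERVICE.match(service):
            continue  # Rule 3: hash-prefixed ghost service key
        kept_services[key] = value

    # Rule 4: drop resolved incidents older than the TTL.
    kept_incidents = {
        iid: inc for iid, inc in incidents.items()
        if not (inc.get("status") == "resolved"
                and now - inc.get("resolved_at", now) > INCIDENT_TTL_SECONDS)
    }
    return kept_nodes, kept_services, kept_incidents
```

Returning new dicts rather than mutating in place keeps the step safe to run every reconcile cycle: pruning an already pruned state is a no-op.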
diff --git a/docs/sessions/2026-05-27.md b/docs/sessions/2026-05-27.md
new file mode 100644
index 0000000..e1dcecc
--- /dev/null
+++ b/docs/sessions/2026-05-27.md
@@ -0,0 +1,103 @@
+# SESSION: Stabilizing the homelab multi-agent system
+
+**DATE:** 2026-05-27
+**RESULT:** System NOMINAL (97/97 services, 0 errors)
+
+---
+
+## PROBLEMS FOUND
+
+- stability-agent was not generating repair actions (only redeploy, no container_restart)
+- mosquitto on chelsty-infra went down and nothing restarted it (restart policy was `no`)
+- zigbee2mqtt had never been deployed on chelsty-infra
+- node-agent was an empty skeleton: it did not emit `service_healthy`, so `services.json` was always empty
+- ghost services: node-agent used `c.name` (which can return `<12hex>_real-name`) instead of the `com.docker.compose.service` label
+- the materializer on piha read from its local Redis instead of the control-plane API; Redis held 80 stale entries with ghost keys, so "Copy for AI" returned old data
+- observer used a single global checkpoint instead of per-node ones, silently skipping event directories that sort before the current checkpoint
+- supervisor did not cancel resolved actions, so the pending queue grew without bound
+- the `service_healthy` event did not close active incidents
+- NODE_ALIAS_MAP was not configured: node names in events did not match the topology
+- chelsty-ha was wrongly in monitoring scope: it has no node-agent
+
+---
+
+## FIXES SHIPPED (commits in master)
+
+```
+7277bdc Fix Copy for AI: materializer fetches from control-plane API instead of Redis
+b40b832 Fix ghost service keys from hash-prefixed Docker container names
+28e9534 observer: service_healthy resolves active incidents
+46ae92b supervisor: also cancel pending actions for services removed from desired state
+410bfe7 zigbee2mqtt: config goes in data dir (writable), not separate ro mount
+b3912fe zigbee2mqtt: use extra_hosts host-gateway instead of network_mode: host
+61e07f4 zigbee2mqtt override: clear ports list for docker-compose v1 host network compat
+51002d4 Fix pending actions: node_exporter, zigbee2mqtt, chelsty-ha monitoring
+fb7828b supervisor: auto-cancel pending actions when drift is resolved
+2f19657 fix(node-agent): unique event IDs per service to prevent same-second overwrites
+267742c vps/node-agent: add network_mode: host for control-plane health probe
+4e8968f Fix service health tracking: emit service_healthy, control-plane endpoint, checkpoint migration
+f4a8db9 fix(observer): per-node-directory checkpoints replace single global checkpoint
+a5a3e22 fix(node-agent): skip SSH config file in rsync to avoid UID ownership errors
+2349de5 fix(node-agent): correct VPS_EVENTS_HOST to actual VPS Tailscale IP
+65bac4e fix(node-agent): mount host SSH key into container for event shipping
+96bf326 fix(observer+operator-ui): fix stale world state, dict→list API, event time filter
+ae33cce feat(node-agent): add runtime overrides for piha, solaria, chelsty-infra
+c5c080b feat(vps): add node-agent runtime override with NODE_NAME=vps
+01b7758 feat(node-agent): implement health monitor and safe cleanup policy
+```
+
+### Details of key fixes
+
+**fix(observer): per-node checkpoints**
+A single global checkpoint `last_processed_file` silently skipped event directories that sort alphabetically before the last processed node (e.g. piha/ < vps/). Replaced with a per-node dict `{"node_checkpoints": {"piha": "...", "vps": "..."}}`.
+
+**fix(observer): ghost key pruning**
+`_prune_stale_world()` now removes services.json entries whose service key matches the pattern `<12hexchars>_`, which are artifacts of Docker internal state tracking.
+
+**fix(node-agent): canonical container name**
+`check_containers()` now uses the `com.docker.compose.service` label as the canonical name. Fallback: strip the hash prefix from `c.name`. Containers in the `created` state are skipped (Docker stale-state artifacts).
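The name-resolution rule from the canonical container name fix can be sketched without the Docker SDK by passing the container name and its labels in as plain values. The helper names `canonical_service_name` and `is_monitored` are hypothetical, and the regular expression is an assumed encoding of the 12-hex-character prefix; only the label key and the `created` status value come from Docker itself.

```python
import re

HASH_PREFIX = re.compile(r"^[0-9a-f]{12}_")  # assumed form of the hash prefix
COMPOSE_SERVICE_LABEL = "com.docker.compose.service"


def canonical_service_name(container_name, labels):
    """Prefer the compose service label; otherwise strip a hash prefix."""
    label = labels.get(COMPOSE_SERVICE_LABEL)
    if label:
        return label
    # Fallback for containers that carry no compose label.
    return HASH_PREFIX.sub("", container_name)


def is_monitored(status):
    """Skip containers that were created but never started."""
    return status != "created"
```

With this rule, `0123456789ab_mosquitto` and `mosquitto` resolve to the same service key, which is what prevents ghost entries from appearing in `services.json`.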
+
+**fix(node-agent): service_healthy emission**
+Node-agent now emits `service_healthy` for every running managed container each cycle. Without it, `services.json` was always empty and the supervisor generated a flood of "missing service" redeploys.
+
+**fix(supervisor): auto-cancel resolved actions**
+`_cancel_resolved_pending_actions()` moves pending actions to `cancelled/` when:
+- the service became healthy (`drift_resolved_auto`)
+- the service was removed from desired state (`service_removed_from_desired_state`)
+
+**fix(supervisor): monitor:false**
+The `monitor: false` field in `services.yaml` excludes a service from supervisor action generation. Used for `homeassistant` on chelsty-ha (no node-agent).
+
+**fix(agent-system/materializer): control-plane API as source**
+The materializer on piha now fetches data from the VPS control-plane API (`CONTROL_PLANE_URL=http://100.95.58.48:18180`) instead of local Redis. Redis held 80 stale entries. The Redis path is kept as a fallback.
+
+**fix(chelsty-infra/zigbee2mqtt): mosquitto networking**
+Mosquitto runs with `network_mode: host`, so bridge containers cannot reach it via localhost. Solution: `extra_hosts: - "mosquitto:host-gateway"` in the z2m override. We do not use `network_mode: host` for z2m because it conflicts with `ports:` in docker-compose v1 (1.29.2 on chelsty-infra).
+
+**fix(chelsty-infra/zigbee2mqtt): writable config**
+z2m migrates and overwrites `configuration.yaml` at startup. The config must live in the data directory: `/opt/homelab/data/zigbee2mqtt/data/configuration.yaml` (read-write mount), not in a separate `:ro` volume.
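The auto-cancel rule in the supervisor fix above can be sketched as a filesystem operation. This is a hedged illustration rather than the body of `_cancel_resolved_pending_actions()`: the function name `cancel_resolved`, the `node` and `service` fields inside each action file, and the shapes of `desired` and `actual` are assumptions.

```python
import json
import shutil
from pathlib import Path


def cancel_resolved(actions_root, desired, actual):
    """Move obsolete pending actions to cancelled/ and report why.

    desired: set of monitored "node/service" keys
    actual: dict of "node/service" -> {"status": ...} (services.json shape)
    Returns {action file name: cancel reason}.
    """
    root = Path(actions_root)
    cancelled_dir = root / "cancelled"
    cancelled_dir.mkdir(parents=True, exist_ok=True)

    reasons = {}
    for path in sorted((root / "pending").glob("*.json")):
        action = json.loads(path.read_text())
        # Hypothetical action schema: each file names its node and service.
        key = f"{action['node']}/{action['service']}"
        if key not in desired:
            reason = "service_removed_from_desired_state"
        elif actual.get(key, {}).get("status") == "healthy":
            reason = "drift_resolved_auto"
        else:
            continue  # drift still present, keep the action pending
        shutil.move(str(path), str(cancelled_dir / path.name))
        reasons[path.name] = reason
    return reasons
```

Only the `pending/` directory is scanned, so approved or running actions are never touched.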
+
+---
+
+## FINAL STATE
+
+| Node | Status | Services |
+|------|--------|---------|
+| vps | online | control-plane (4), node-agent, node_exporter, stability-agent |
+| piha | online | agent-system (4), node-agent, stability-agent, monitoring stack |
+| solaria | online | node-agent, stability-agent, AI workloads |
+| chelsty-infra | online | mosquitto, zigbee2mqtt (z2m connects once SLZB-06U is back online), node-agent, stability-agent |
+| chelsty-ha | — | homeassistant (monitor:false: no node-agent, HA monitored indirectly via MQTT) |
+
+**Action queue:** 0 pending, 0 approved, 0 running
+**Incidents:** 0 active
+**Ghost service keys:** 0
+
+---
+
+## KNOWN LIMITATIONS / TODO
+
+- SLZB-06U (Zigbee coordinator) offline: `192.168.1.105:6638` EHOSTUNREACH from chelsty-infra. Probably a hardware/network problem on the 192.168.1.0/24 side. z2m starts and serves an error page on :8080; it will connect automatically when the coordinator returns.
+- The `ezsp` adapter in the z2m config is deprecated; migrating to `ember` is recommended. It needs no new configuration, only changing the field to `adapter: ember` in `configuration.yaml`.
+- chelsty-ha has no node-agent. Add one once the machine is available, or bootstrap manually.
+- Redis on piha still contains old keys `homelab:nodes:*`, `homelab:incidents:*` etc.; the materializer no longer uses them in API mode, so they can be cleaned up.
diff --git a/docs/vps-control-plane.md b/docs/vps-control-plane.md
index 1507c5c..4caa252 100644
--- a/docs/vps-control-plane.md
+++ b/docs/vps-control-plane.md
@@ -1,83 +1,126 @@
 # VPS Control Plane
 
-The VPS Control Plane is the orchestration brain of the homelab platform. It runs on the Hetzner VPS and provides observability, automated reconciliation, and a web-based operator interface.
+The VPS Control Plane is the orchestration brain of the homelab platform. It runs on the Hetzner VPS (Tailscale IP: `100.95.58.48`) and provides observability, automated reconciliation, and a web-based operator interface.
 
 ## Architecture
 
-The control plane consists of four core services running as a Docker Compose stack:
+The control plane consists of four core services running as a Docker Compose stack under `services/control-plane/`:
 
-1. **Observer**: Synthesizes world state from events.
-2. **Supervisor**: Detects drifts between desired and actual state.
-3. **Executor**: Executes approved actions from the queue.
-4. **Operator UI**: Web interface for system monitoring and action approval.
+| Container | Role |
+|-----------|------|
+| `control-plane-observer` | Synthesizes world state from events in `/opt/homelab/events/` |
+| `control-plane-supervisor` | Detects drift between desired state (`hosts/*/services.yaml`) and actual state (`world/services.json`); writes pending actions |
+| `control-plane-executor` | Executes approved actions from `/opt/homelab/actions/approved/` |
+| `control-plane-ui` | Web interface for system monitoring and action approval; serves port 18180 |
 
-All services adhere to **filesystem-first** semantics, using `/opt/homelab/` as the primary data exchange and persistence layer.
+All services use **filesystem-first** semantics with `/opt/homelab/` as the data exchange layer. All four run with `network_mode: host` and as UID 1000 (`homelab` user).
 
-## Deployment Flow
+## Supervisor Behavior
 
-### 1. Prerequisites
-- Target VPS node must be onboarded (Tailscale active, Docker installed).
-- Repository cloned to `/home/oskar/homelab-codex-ws`.
+### Desired State
+Loaded from `hosts/*/services.yaml` each reconcile cycle. Services with `monitor: false` are silently skipped — use this for services without a node-agent (e.g. `homeassistant` on `chelsty-ha`).
 
-### 2. Bootstrap
-Run the local deployment script on the VPS to initialize the runtime filesystem and start the stack:
+### Drift Types
+- `missing_service` — service is in desired state but absent from `services.json`
+- `unhealthy_service` — service exists in `services.json` but `status != healthy`
+
+### Action Types
+| Trigger | Action type | Risk |
+|---------|-------------|------|
+| `containers_not_running`, `mqtt_unreachable` | `container_restart` | low |
+| Any other / unknown | `redeploy` | guarded |
+| Node `disk_pressure: high` | `disk_cleanup` | guarded |
+
+### Action ID Stability
+Action IDs are deterministic: `redeploy-{node}-{service}` or `container-restart-{node}-{service}`. The same drift always produces the same filename, making reconcile truly idempotent across supervisor restarts.
+
+### Auto-Cancel
+Pending `redeploy` and `container_restart` actions are automatically moved to `cancelled/` when:
+- **`drift_resolved_auto`** — the service becomes `healthy` in actual state
+- **`service_removed_from_desired_state`** — the service was removed from `services.yaml` or marked `monitor: false`
+
+Only `pending` actions are auto-cancelled. Approved/running actions have been committed to by the operator and are never cancelled automatically.
+
+### Node Name Resolution
+The supervisor supports a `NODE_ALIAS_MAP` environment variable (JSON string) to map event/world-state node names to canonical topology names:
 
 ```bash
-cd services/control-plane
-bash deploy-local.sh
+NODE_ALIAS_MAP='{"node-2": "chelsty-infra", "node-1": "piha"}'
 ```
 
-### 3. Verification
-Verify the stack is healthy using the deployment script or check container status on the VPS:
+## Deployment
 
+### From SATURN (primary control node)
 ```bash
-# Check status via deploy script
+# Full deploy via SSH
 ./scripts/deploy/deploy-control-plane.sh --ssh
 
-# Manual status check on VPS
+# Or manually:
+ssh oskar@100.95.58.48 "cd ~/homelab-codex-ws && git pull origin master && cd services/control-plane && docker compose up -d --build --force-recreate"
+```
+
+### Direct on VPS
+```bash
+cd ~/homelab-codex-ws/services/control-plane
+docker compose up -d --build --force-recreate
+```
+
+`deploy-local.sh` also creates the required `/opt/homelab/` directory structure and sets ownership to UID 1000 (requires `sudo`). If directories already exist, skip to the `docker compose` step directly.
+
+### Verification
+```bash
+# On VPS
 docker ps --filter "name=control-plane"
+curl -s http://localhost:18180/summary | python3 -m json.tool
 ```
 
-## Operational Workflows
+## Action Approval Workflow
 
-### Action Approval
-1. Access the Operator UI (via Tailscale IP or Nginx Proxy Manager).
-2. Navigate to **Action Queue**.
-3. Review **Pending** actions recommended by the Supervisor.
-4. Click **Approve** to move actions to the execution queue.
+```
+Supervisor writes → /opt/homelab/actions/pending/.json
+  → Operator UI (port 18180) or Telegram Bot notifies
+  → Operator clicks Approve
+  → /opt/homelab/actions/approved/.json
+  → Executor executes → completed / failed
+```
 
-### Recovery Flow
-In case of control plane failure:
-1. Check logs using `docker logs`.
-2. Restart stack using the local deployment script: `bash deploy-local.sh`.
-3. Rebuild world state: Delete `/opt/homelab/state/observer_checkpoint.json` and redeploy.
+Possible action states: `pending → approved → running → completed / failed / rejected`
+Auto-cancel path: `pending → cancelled/`
 
-### Upgrade Flow
-To deploy updates from the SOLARIA/control host:
+## Recovery
+
+### World state is stale or corrupt
+```bash
+# On VPS — delete checkpoint to force full replay
+rm /opt/homelab/state/observer_checkpoint.json
+docker restart control-plane-observer
+```
+
+### Flood of pending actions after bootstrap
+Check if node-agent is running and emitting `service_healthy` events on each node. Without `service_healthy`, the supervisor sees all services as missing and queues redeployments every cycle.
 
 ```bash
-./scripts/deploy/deploy-control-plane.sh --ssh
+# Check node-agent on each node
+ssh oskar@ "docker ps --filter name=node-agent && docker logs node-agent --tail 20"
 ```
 
-### Rollback Semantics
-Since the runtime is filesystem-first and append-only:
-1. Roll back the repository state to a previous commit.
-2. Restart the control plane stack.
-3. The supervisor will detect drift against the older (rolled-back) desired state and recommend actions to restore it.
-
-## Runtime Safety
-
-- **Readonly Mounts**: Most services mount the repository as `:ro` to prevent accidental mutations.
-- **Least-Privilege**: UI, Observer, and Supervisor run as non-root `homelab` user (UID 1000).
-- **Filesystem Isolation**: Clear separation between `/repo` (code/inventory) and `/opt/homelab` (runtime state).
+### Rebuild from scratch
+```bash
+ssh oskar@100.95.58.48 "cd ~/homelab-codex-ws/services/control-plane && docker compose up -d --build --force-recreate"
+```
 
 ## Integration
 
+### piha agent-system webui (port 18180 on piha)
+The `agent-system-runtime-materializer` on piha polls the VPS control-plane API every 10 seconds and mirrors world state to piha's local `/opt/homelab/world/`. This ensures the **"Copy for AI"** button in the piha webui (`agent-system-webui`) reflects the same clean state as the VPS API.
+
+Override: `hosts/piha/runtime/agent-system/docker-compose.override.yml` — sets `CONTROL_PLANE_URL=http://100.95.58.48:18180`.
+
 ### Nginx Proxy Manager
-Configure a proxy host in NPM to point to `http://control-plane-ui:8080`. Ensure Websockets are enabled if the UI uses them.
+The operator UI at port 18180 can be proxied via NPM for external access. No WebSocket support required.
 
 ### Log Locations
-- Container logs: `docker compose logs`
+- Container logs: `docker compose logs -f` (from `services/control-plane/`)
 - Runtime events: `/opt/homelab/events/YYYY-MM-DD/`
 - World state: `/opt/homelab/world/`
-- Diagnostics: `/opt/homelab/logs/`
+- Action queue: `/opt/homelab/actions/{pending,approved,running,completed,failed,cancelled}/`
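The supervisor rules from the Action Types, Action ID Stability and Node Name Resolution sections of this file can be combined into one small sketch. It is illustrative only: the function names `load_alias_map` and `plan_action` and the returned dict shape are assumptions, and the `disk_cleanup` path is left out because the document does not state its ID format.

```python
import json
import os

# Triggers that map to a low-risk container restart, per the Action Types table.
RESTART_TRIGGERS = {"containers_not_running", "mqtt_unreachable"}


def load_alias_map(env=None):
    """Parse NODE_ALIAS_MAP (a JSON string) into a dict; empty if unset."""
    env = os.environ if env is None else env
    return json.loads(env.get("NODE_ALIAS_MAP", "{}"))


def plan_action(node, service, trigger_type, alias_map):
    """Return a deterministic action record for one drifted service."""
    # Resolve event-side node names to canonical topology names first,
    # so the action ID is stable regardless of which alias reported it.
    node = alias_map.get(node, node)
    if trigger_type in RESTART_TRIGGERS:
        kind, prefix, risk = "container_restart", "container-restart", "low"
    else:
        kind, prefix, risk = "redeploy", "redeploy", "guarded"
    return {
        "id": f"{prefix}-{node}-{service}",
        "type": kind,
        "risk": risk,
        "node": node,
        "service": service,
    }
```

Because the ID is derived only from the action type, node and service, writing the action as `<id>.json` on every reconcile cycle overwrites the same file instead of growing the pending queue.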