# VPS Control Plane

The VPS Control Plane is the orchestration brain of the homelab platform. It runs on the Hetzner VPS (Tailscale IP: `100.95.58.48`) and provides observability, automated reconciliation, and a web-based operator interface.

## Architecture

The control plane consists of four core services running as a Docker Compose stack under `services/control-plane/`:

| Container | Role |
|-----------|------|
| `control-plane-observer` | Synthesizes world state from events in `/opt/homelab/events/` |
| `control-plane-supervisor` | Detects drift between desired state (`hosts/*/services.yaml`) and actual state (`world/services.json`); writes pending actions |
| `control-plane-executor` | Executes approved actions from `/opt/homelab/actions/approved/` |
| `control-plane-ui` | Web interface for system monitoring and action approval; serves port 18180 |

All services use **filesystem-first** semantics with `/opt/homelab/` as the data exchange layer. All four run with `network_mode: host` and as UID 1000 (`homelab` user).

## Supervisor Behavior

### Desired State
Desired state is loaded from `hosts/*/services.yaml` on each reconcile cycle. Services with `monitor: false` are silently skipped; use this for services that have no node-agent (e.g. `homeassistant` on `chelsty-ha`).

### Drift Types
- `missing_service`: the service is in desired state but absent from `services.json`
- `unhealthy_service`: the service exists in `services.json` but `status != healthy`

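The two drift types above can be sketched as a single comparison pass. This is a minimal illustration, not the supervisor's actual code: the `detect_drift` function, the shape of the `desired` list, the nested `actual[node][service]` layout, and the service names other than `homeassistant` are all assumptions made for the example.

```python
def detect_drift(desired, actual):
    """Return (node, service, drift_type) tuples.

    Assumed shapes (hypothetical, not taken from the real files):
      desired: list of {"node", "service", optional "monitor"} dicts
      actual:  {node: {service: {"status": ...}}}
    """
    drift = []
    for entry in desired:
        # Services marked monitor: false are skipped silently.
        if entry.get("monitor", True) is False:
            continue
        node, service = entry["node"], entry["service"]
        record = actual.get(node, {}).get(service)
        if record is None:
            drift.append((node, service, "missing_service"))
        elif record.get("status") != "healthy":
            drift.append((node, service, "unhealthy_service"))
    return drift


desired = [
    {"node": "piha", "service": "mosquitto"},
    {"node": "chelsty-infra", "service": "zigbee2mqtt"},
    {"node": "chelsty-ha", "service": "homeassistant", "monitor": False},
]
actual = {"piha": {"mosquitto": {"status": "unhealthy"}}}

for item in detect_drift(desired, actual):
    print(item)
# ('piha', 'mosquitto', 'unhealthy_service')
# ('chelsty-infra', 'zigbee2mqtt', 'missing_service')
```

Note that `homeassistant` produces no drift even though it is absent from the actual state, because it is excluded before the lookup.
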
### Action Types
| Trigger | Action type | Risk |
|---------|-------------|------|
| `containers_not_running`, `mqtt_unreachable` | `container_restart` | low |
| Any other / unknown | `redeploy` | guarded |
| Node `disk_pressure: high` | `disk_cleanup` | guarded |

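The table above reads naturally as a lookup with a guarded fallback. The sketch below is one way to express it; the function names `choose_service_action` and `choose_node_action` are hypothetical, and treating disk pressure as a separate node-level check is an assumption about how the supervisor separates the two cases.

```python
# Triggers that only need a container restart (low risk), per the table.
RESTART_TRIGGERS = {"containers_not_running", "mqtt_unreachable"}


def choose_service_action(trigger):
    """Map a service-level trigger to (action_type, risk)."""
    if trigger in RESTART_TRIGGERS:
        return ("container_restart", "low")
    # Any other or unknown trigger falls back to a guarded redeploy.
    return ("redeploy", "guarded")


def choose_node_action(disk_pressure):
    """Map a node-level disk_pressure value to (action_type, risk) or None."""
    if disk_pressure == "high":
        return ("disk_cleanup", "guarded")
    return None


print(choose_service_action("mqtt_unreachable"))  # ('container_restart', 'low')
print(choose_service_action("something_else"))    # ('redeploy', 'guarded')
print(choose_node_action("high"))                 # ('disk_cleanup', 'guarded')
```
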
### Action ID Stability
Action IDs are deterministic: `redeploy-{node}-{service}` or `container-restart-{node}-{service}`. The same drift always produces the same filename, so a reconcile cycle never queues a duplicate action, even across supervisor restarts.

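A short sketch of why deterministic IDs make queuing idempotent. The ID patterns come from the text above; the `queue_action` helper, the JSON payload, and the "skip if the file exists" check are assumptions used only to illustrate the idea, and a temporary directory stands in for `/opt/homelab/actions/pending/`.

```python
import json
import tempfile
from pathlib import Path


def action_id(action_type, node, service):
    """Build the deterministic ID, e.g. container-restart-{node}-{service}."""
    return f"{action_type.replace('_', '-')}-{node}-{service}"


def queue_action(pending_dir, action_type, node, service):
    """Write the pending action once; return False if it was already queued."""
    path = Path(pending_dir) / f"{action_id(action_type, node, service)}.json"
    if path.exists():
        return False
    path.write_text(json.dumps({"type": action_type, "node": node, "service": service}))
    return True


with tempfile.TemporaryDirectory() as pending:
    first = queue_action(pending, "container_restart", "piha", "mosquitto")
    second = queue_action(pending, "container_restart", "piha", "mosquitto")
    queued_files = sorted(p.name for p in Path(pending).iterdir())

print(first, second, queued_files)
# True False ['container-restart-piha-mosquitto.json']
```
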
### Auto-Cancel
Pending `redeploy` and `container_restart` actions are automatically moved to `cancelled/` when:
- **`drift_resolved_auto`**: the service becomes `healthy` in actual state
- **`service_removed_from_desired_state`**: the service was removed from `services.yaml` or marked `monitor: false`

Only `pending` actions are auto-cancelled. Approved and running actions reflect an operator decision and are never cancelled automatically.

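The auto-cancel rules above can be written as a small decision function. This is a hedged sketch: the `auto_cancel_reason` name, the action dict fields, and the `monitored` / `statuses` inputs are assumptions, and the order in which the two reasons are checked is a choice made for the example rather than documented behavior.

```python
CANCELLABLE_TYPES = {"redeploy", "container_restart"}


def auto_cancel_reason(action, monitored, statuses):
    """Return the cancel reason for an action, or None to leave it alone.

    monitored: set of (node, service) pairs currently in desired state
               with monitoring enabled (hypothetical input shape)
    statuses:  {(node, service): status} taken from actual state
    """
    # Only pending redeploy / container_restart actions are eligible.
    if action["state"] != "pending" or action["type"] not in CANCELLABLE_TYPES:
        return None
    key = (action["node"], action["service"])
    if key not in monitored:
        return "service_removed_from_desired_state"
    if statuses.get(key) == "healthy":
        return "drift_resolved_auto"
    return None


monitored = {("piha", "mosquitto")}
statuses = {("piha", "mosquitto"): "healthy"}

pending = {"state": "pending", "type": "redeploy", "node": "piha", "service": "mosquitto"}
approved = dict(pending, state="approved")
orphan = dict(pending, service="retired-service")

print(auto_cancel_reason(pending, monitored, statuses))   # drift_resolved_auto
print(auto_cancel_reason(approved, monitored, statuses))  # None
print(auto_cancel_reason(orphan, monitored, statuses))    # service_removed_from_desired_state
```
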
### Node Name Resolution
The supervisor supports a `NODE_ALIAS_MAP` environment variable (JSON string) to map event/world-state node names to canonical topology names:

```bash
NODE_ALIAS_MAP='{"node-2": "chelsty-infra", "node-1": "piha"}'
```

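As an illustration of how such a JSON alias map might be consumed, the sketch below parses the variable and resolves names, passing unknown names through unchanged. The helper names `load_alias_map` and `canonical_node` are hypothetical, and the pass-through behavior for unmapped names is an assumption.

```python
import json
import os


def load_alias_map(environ=None):
    """Parse NODE_ALIAS_MAP into a dict; an unset or empty value means no aliases."""
    environ = os.environ if environ is None else environ
    raw = environ.get("NODE_ALIAS_MAP", "")
    if not raw.strip():
        return {}
    parsed = json.loads(raw)
    if not isinstance(parsed, dict):
        raise ValueError("NODE_ALIAS_MAP must be a JSON object")
    return {str(alias): str(name) for alias, name in parsed.items()}


def canonical_node(name, aliases):
    """Resolve an event/world-state node name to its canonical topology name."""
    return aliases.get(name, name)


env = {"NODE_ALIAS_MAP": '{"node-2": "chelsty-infra", "node-1": "piha"}'}
aliases = load_alias_map(env)

print(canonical_node("node-2", aliases))      # chelsty-infra
print(canonical_node("chelsty-ha", aliases))  # chelsty-ha
```
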
## Deployment

### From SATURN (primary control node)
```bash
# Full deploy via SSH
./scripts/deploy/deploy-control-plane.sh --ssh

# Or manually:
ssh oskar@100.95.58.48 "cd ~/homelab-codex-ws && git pull origin master && cd services/control-plane && docker compose up -d --build --force-recreate"
```

### Direct on VPS
```bash
cd ~/homelab-codex-ws/services/control-plane
docker compose up -d --build --force-recreate
```

The `deploy-local.sh` script additionally creates the required `/opt/homelab/` directory structure and sets ownership to UID 1000 (requires `sudo`). If the directories already exist, run the `docker compose` step directly.

### Verification
```bash
# On VPS
docker ps --filter "name=control-plane"
curl -s http://localhost:18180/summary | python3 -m json.tool
```

## Action Approval Workflow

```
Supervisor writes → /opt/homelab/actions/pending/<id>.json
→ Operator UI (port 18180) or Telegram Bot notifies
→ Operator clicks Approve
→ /opt/homelab/actions/approved/<id>.json
→ Executor executes → completed / failed
```

Possible action states: `pending → approved → running → completed / failed / rejected`
Auto-cancel path: `pending → cancelled/`

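The states listed above can be modeled as a small transition table. This is only a sketch: the `ALLOWED` table and the `advance` helper are hypothetical, and placing `rejected` as a transition out of `pending` is an assumption, since the text lists the state without saying where it is entered from.

```python
# Allowed transitions per state. "rejected" leaving from "pending" is an
# assumption; "cancelled" is the auto-cancel path for pending actions.
ALLOWED = {
    "pending": {"approved", "rejected", "cancelled"},
    "approved": {"running"},
    "running": {"completed", "failed"},
}


def advance(state, new_state):
    """Return new_state if the transition is allowed, else raise ValueError."""
    if new_state not in ALLOWED.get(state, set()):
        raise ValueError(f"illegal transition: {state} -> {new_state}")
    return new_state


state = "pending"
for nxt in ("approved", "running", "completed"):
    state = advance(state, nxt)
print(state)  # completed
```

Terminal states have no entry in the table, so any attempt to move an action out of `completed`, `failed`, `rejected`, or `cancelled` raises an error in this sketch.
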
## Recovery

### World state is stale or corrupt
```bash
# On VPS: delete the checkpoint to force a full replay
rm /opt/homelab/state/observer_checkpoint.json
docker restart control-plane-observer
```

### Flood of pending actions after bootstrap
Check that the node-agent is running on each node and emitting `service_healthy` events. Without `service_healthy` events, the supervisor sees every service as missing and queues redeployments on every cycle.

```bash
# Check node-agent on each node
ssh oskar@<node> "docker ps --filter name=node-agent && docker logs node-agent --tail 20"
```

### Rebuild from scratch
```bash
ssh oskar@100.95.58.48 "cd ~/homelab-codex-ws/services/control-plane && docker compose up -d --build --force-recreate"
```

## Integration

### piha agent-system webui (port 18180 on piha)
The `agent-system-runtime-materializer` on piha polls the VPS control-plane API every 10 seconds and mirrors world state to piha's local `/opt/homelab/world/`. This ensures the **"Copy for AI"** button in the piha webui (`agent-system-webui`) reflects the same clean state as the VPS API.

Override: `hosts/piha/runtime/agent-system/docker-compose.override.yml`, which sets `CONTROL_PLANE_URL=http://100.95.58.48:18180`.

### Nginx Proxy Manager
The operator UI at port 18180 can be proxied via NPM for external access. No WebSocket support required.

### Log Locations
- Container logs: `docker compose logs -f` (from `services/control-plane/`)
- Runtime events: `/opt/homelab/events/YYYY-MM-DD/`
- World state: `/opt/homelab/world/`
- Action queue: `/opt/homelab/actions/{pending,approved,running,completed,failed,cancelled}/`