homelab-codex-ws/docs/vps-control-plane.md
Oskar Kapala 603e10a364 docs: session summary 2026-05-27 + update observer/control-plane/chelsty docs
docs/sessions/2026-05-27.md (new):
- Full session record: problems found, all commits shipped, end state
- Written in Polish per operator preference for session notes
- Known limitations: SLZB-06U offline, ezsp→ember migration pending

docs/observer-runtime.md:
- Document per-node checkpoint format (replaces old global checkpoint)
- Add service_healthy / service_recovered resolution behavior
- Document ghost key pruning (_prune_stale_world patterns)
- Add event type reference table (negative vs positive)

docs/vps-control-plane.md:
- Add container names and network_mode: host detail
- Document monitor:false, NODE_ALIAS_MAP, auto-cancel behavior
- Add piha agent-system materializer integration note
- Rewrite recovery section with actionable bootstrap-flood diagnosis
- Add action state machine (pending→approved→running→completed/cancelled)

docs/chelsty-runtime.md:
- Add chelsty-infra/chelsty-ha node table
- Document docker-compose v1 constraint (always use docker-compose, not docker compose)
- Add mosquitto network_mode:host + z2m extra_hosts:host-gateway explanation
- Add z2m config writable requirement (EROFS failure mode documented)
- Add chelsty-ha monitor:false rationale
- Add minimal configuration.yaml template for z2m

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 16:18:31 +02:00


# VPS Control Plane
The VPS Control Plane is the orchestration brain of the homelab platform. It runs on the Hetzner VPS (Tailscale IP: `100.95.58.48`) and provides observability, automated reconciliation, and a web-based operator interface.
## Architecture
The control plane consists of four core services running as a Docker Compose stack under `services/control-plane/`:
| Container | Role |
|-----------|------|
| `control-plane-observer` | Synthesizes world state from events in `/opt/homelab/events/` |
| `control-plane-supervisor` | Detects drift between desired state (`hosts/*/services.yaml`) and actual state (`world/services.json`); writes pending actions |
| `control-plane-executor` | Executes approved actions from `/opt/homelab/actions/approved/` |
| `control-plane-ui` | Web interface for system monitoring and action approval; serves port 18180 |
All services use **filesystem-first** semantics with `/opt/homelab/` as the data exchange layer. All four run with `network_mode: host` and as UID 1000 (`homelab` user).
## Supervisor Behavior
### Desired State
Loaded from `hosts/*/services.yaml` each reconcile cycle. Services with `monitor: false` are silently skipped — use this for services without a node-agent (e.g. `homeassistant` on `chelsty-ha`).
### Drift Types
- `missing_service` — service is in desired state but absent from `services.json`
- `unhealthy_service` — service exists in `services.json` but `status != healthy`
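The two drift types can be sketched as a comparison between desired and actual state. This is a minimal illustration, not the supervisor's actual code: the function name `detect_drift`, the in-memory dict shapes, and the sample service placement are assumptions made for the example.

```python
def detect_drift(desired, actual):
    """Return (node, service, drift_type) tuples for every drifted service.

    desired: {(node, service): {"monitor": bool}}  -- hypothetical shape
    actual:  {(node, service): {"status": str}}    -- hypothetical shape
    """
    drift = []
    for key, spec in sorted(desired.items()):
        # Services marked monitor: false are skipped without reporting drift.
        if spec.get("monitor", True) is False:
            continue
        node, service = key
        state = actual.get(key)
        if state is None:
            drift.append((node, service, "missing_service"))
        elif state.get("status") != "healthy":
            drift.append((node, service, "unhealthy_service"))
    return drift


# Illustrative data only; real state comes from services.yaml and services.json.
desired = {
    ("piha", "mosquitto"): {},
    ("chelsty-infra", "z2m"): {},
    ("chelsty-ha", "homeassistant"): {"monitor": False},
}
actual = {("piha", "mosquitto"): {"status": "unhealthy"}}

for item in detect_drift(desired, actual):
    print(item)
```

With this sample data, `homeassistant` produces no drift entry because it is excluded from monitoring, even though it is absent from the actual state.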
### Action Types
| Trigger | Action type | Risk |
|---------|-------------|------|
| `containers_not_running`, `mqtt_unreachable` | `container_restart` | low |
| Any other / unknown | `redeploy` | guarded |
| Node `disk_pressure: high` | `disk_cleanup` | guarded |
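The table above can be read as a small lookup with a fallback. The sketch below assumes triggers arrive as plain strings and that node-level disk pressure is evaluated separately from per-service triggers; the function names are hypothetical.

```python
# Triggers that only need a container restart (low risk), per the table above.
RESTART_TRIGGERS = {"containers_not_running", "mqtt_unreachable"}


def choose_service_action(trigger):
    """Map a service-level trigger to (action_type, risk)."""
    if trigger in RESTART_TRIGGERS:
        return ("container_restart", "low")
    # Any other or unknown trigger falls back to a guarded redeploy.
    return ("redeploy", "guarded")


def choose_node_action(disk_pressure):
    """Map a node's disk_pressure value to an action, or None if no action applies."""
    if disk_pressure == "high":
        return ("disk_cleanup", "guarded")
    return None


print(choose_service_action("mqtt_unreachable"))  # ('container_restart', 'low')
print(choose_service_action("something_else"))    # ('redeploy', 'guarded')
print(choose_node_action("high"))                 # ('disk_cleanup', 'guarded')
```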
### Action ID Stability
Action IDs are deterministic: `redeploy-{node}-{service}` or `container-restart-{node}-{service}`. The same drift always produces the same filename, so repeated reconcile cycles map to one action file instead of creating duplicates, including after a supervisor restart.
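A deterministic ID can be built purely from the action type, node, and service. The helper below is a hypothetical sketch; deriving the prefix by swapping underscores for hyphens is an assumption that happens to match both documented formats.

```python
def action_id(action_type, node, service):
    """Build a stable action ID from its inputs (no timestamps, no randomness)."""
    # Assumption: the ID prefix is the action type with "_" replaced by "-".
    prefix = action_type.replace("_", "-")
    return f"{prefix}-{node}-{service}"


# Node and service names here are illustrative.
print(action_id("redeploy", "piha", "mosquitto"))
# redeploy-piha-mosquitto
print(action_id("container_restart", "chelsty-infra", "z2m"))
# container-restart-chelsty-infra-z2m
```

Because the function has no hidden inputs, calling it twice for the same drift yields the same `<id>.json` filename.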
### Auto-Cancel
Pending `redeploy` and `container_restart` actions are automatically moved to `cancelled/` when:
- **`drift_resolved_auto`** — the service becomes `healthy` in actual state
- **`service_removed_from_desired_state`** — the service was removed from `services.yaml` or marked `monitor: false`
Only `pending` actions are auto-cancelled. Approved/running actions have been committed to by the operator and are never cancelled automatically.
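The auto-cancel rules can be expressed as a single decision function. This is a sketch under assumptions: the action dict fields (`state`, `type`, `node`, `service`), the state dict shapes, and the order in which the two reasons are checked are all hypothetical.

```python
AUTO_CANCELLABLE_TYPES = {"redeploy", "container_restart"}


def cancel_reason(action, desired, actual):
    """Return the auto-cancel reason for an action, or None to leave it alone."""
    # Approved and running actions are never cancelled automatically.
    if action["state"] != "pending":
        return None
    if action["type"] not in AUTO_CANCELLABLE_TYPES:
        return None
    key = (action["node"], action["service"])
    spec = desired.get(key)
    # Removed from services.yaml or marked monitor: false.
    if spec is None or spec.get("monitor", True) is False:
        return "service_removed_from_desired_state"
    # The service recovered on its own.
    if actual.get(key, {}).get("status") == "healthy":
        return "drift_resolved_auto"
    return None


action = {"state": "pending", "type": "redeploy", "node": "piha", "service": "mosquitto"}
desired = {("piha", "mosquitto"): {}}
actual = {("piha", "mosquitto"): {"status": "healthy"}}
print(cancel_reason(action, desired, actual))  # drift_resolved_auto
```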
### Node Name Resolution
The supervisor supports a `NODE_ALIAS_MAP` environment variable (JSON string) to map event/world-state node names to canonical topology names:
```bash
NODE_ALIAS_MAP='{"node-2": "chelsty-infra", "node-1": "piha"}'
```
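Resolving a node name through the alias map could look like the following. The helper names are hypothetical, and two behaviors are assumed rather than documented: an unset or empty variable means no aliases, and names without an alias pass through unchanged.

```python
import json
import os


def load_alias_map(environ=os.environ):
    """Parse NODE_ALIAS_MAP (a JSON object string) into a dict."""
    raw = environ.get("NODE_ALIAS_MAP", "")
    if not raw.strip():
        return {}  # Assumption: unset or empty means no aliases.
    return json.loads(raw)


def canonical_node(name, aliases):
    """Return the canonical topology name for an event/world-state node name."""
    return aliases.get(name, name)


aliases = load_alias_map(
    {"NODE_ALIAS_MAP": '{"node-2": "chelsty-infra", "node-1": "piha"}'}
)
print(canonical_node("node-2", aliases))         # chelsty-infra
print(canonical_node("chelsty-infra", aliases))  # chelsty-infra
```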
## Deployment
### From SATURN (primary control node)
```bash
# Full deploy via SSH
./scripts/deploy/deploy-control-plane.sh --ssh
# Or manually:
ssh oskar@100.95.58.48 "cd ~/homelab-codex-ws && git pull origin master && cd services/control-plane && docker compose up -d --build --force-recreate"
```
### Direct on VPS
```bash
cd ~/homelab-codex-ws/services/control-plane
docker compose up -d --build --force-recreate
```
Running `deploy-local.sh` instead also creates the required `/opt/homelab/` directory structure and sets its ownership to UID 1000 (requires `sudo`). If the directories already exist, the `docker compose` command above is enough on its own.
### Verification
```bash
# On VPS
docker ps --filter "name=control-plane"
curl -s http://localhost:18180/summary | python3 -m json.tool
```
## Action Approval Workflow
```
Supervisor writes → /opt/homelab/actions/pending/<id>.json
→ Operator UI (port 18180) or Telegram Bot notifies
→ Operator clicks Approve
→ /opt/homelab/actions/approved/<id>.json
→ Executor executes → completed / failed
```
Action states on the approval path: `pending → approved → running → completed` or `failed`. An action the operator declines ends in `rejected`.
Auto-cancel path: `pending → cancelled/`
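The states above form a small state machine that can be written as a transition table. This is an illustrative sketch, not the executor's code; in particular, treating `rejected` as a transition out of `pending` is an assumption based on the approval step being where the operator decides.

```python
# Allowed transitions; terminal states have no outgoing edges.
TRANSITIONS = {
    "pending": {"approved", "rejected", "cancelled"},  # "rejected" edge is assumed
    "approved": {"running"},
    "running": {"completed", "failed"},
}


def can_transition(src, dst):
    """Return True if an action may move from state src to state dst."""
    return dst in TRANSITIONS.get(src, set())


print(can_transition("pending", "cancelled"))   # True
print(can_transition("approved", "cancelled"))  # False
```

The second call returns `False` because only pending actions are eligible for auto-cancel.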
## Recovery
### World state is stale or corrupt
```bash
# On VPS — delete checkpoint to force full replay
rm /opt/homelab/state/observer_checkpoint.json
docker restart control-plane-observer
```
### Flood of pending actions after bootstrap
Check that the node-agent is running on each node and emitting `service_healthy` events. Without `service_healthy` events, the supervisor sees every service as missing and queues redeployments on each cycle.
```bash
# Check node-agent on each node
ssh oskar@<node> "docker ps --filter name=node-agent && docker logs node-agent --tail 20"
```
### Rebuild from scratch
```bash
ssh oskar@100.95.58.48 "cd ~/homelab-codex-ws/services/control-plane && docker compose up -d --build --force-recreate"
```
## Integration
### piha agent-system webui (port 18180 on piha)
The `agent-system-runtime-materializer` on piha polls the VPS control-plane API every 10 seconds and mirrors world state to piha's local `/opt/homelab/world/`. This ensures the **"Copy for AI"** button in the piha webui (`agent-system-webui`) reflects the same clean state as the VPS API.
Override: `hosts/piha/runtime/agent-system/docker-compose.override.yml` — sets `CONTROL_PLANE_URL=http://100.95.58.48:18180`.
### Nginx Proxy Manager
The operator UI at port 18180 can be proxied via NPM for external access. No WebSocket support required.
### Log Locations
- Container logs: `docker compose logs -f` (from `services/control-plane/`)
- Runtime events: `/opt/homelab/events/YYYY-MM-DD/`
- World state: `/opt/homelab/world/`
- Action queue: `/opt/homelab/actions/{pending,approved,running,completed,failed,cancelled}/`
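
For a quick look at the action queue, the per-state directories can be counted with a short script. This is a sketch that assumes every action is stored as one `*.json` file in its state directory; `queue_counts` is a hypothetical helper, not part of the control plane.

```python
from pathlib import Path

STATES = ("pending", "approved", "running", "completed", "failed", "cancelled")


def queue_counts(root="/opt/homelab/actions"):
    """Count *.json action files in each state directory under root."""
    base = Path(root)
    counts = {}
    for state in STATES:
        directory = base / state
        # A missing directory is reported as zero rather than raising.
        if directory.is_dir():
            counts[state] = sum(1 for _ in directory.glob("*.json"))
        else:
            counts[state] = 0
    return counts


if __name__ == "__main__":
    for state, count in queue_counts().items():
        print(f"{state:<10} {count}")
```

A persistently large `pending` count right after bootstrap points to the flood scenario described under Recovery.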