docs: session summary 2026-05-27 + update observer/control-plane/chelsty docs
docs/sessions/2026-05-27.md (new):
- Full session record: problems found, all commits shipped, end state
- Written in Polish per operator preference for session notes
- Known limitations: SLZB-06U offline, ezsp→ember migration pending

docs/observer-runtime.md:
- Document per-node checkpoint format (replaces old global checkpoint)
- Add service_healthy / service_recovered resolution behavior
- Document ghost key pruning (_prune_stale_world patterns)
- Add event type reference table (negative vs positive)

docs/vps-control-plane.md:
- Add container names and network_mode: host detail
- Document monitor:false, NODE_ALIAS_MAP, auto-cancel behavior
- Add piha agent-system materializer integration note
- Rewrite recovery section with actionable bootstrap-flood diagnosis
- Add action state machine (pending→approved→running→completed/cancelled)

docs/chelsty-runtime.md:
- Add chelsty-infra/chelsty-ha node table
- Document docker-compose v1 constraint (always use docker-compose, not docker compose)
- Add mosquitto network_mode:host + z2m extra_hosts:host-gateway explanation
- Add z2m config writable requirement (EROFS failure mode documented)
- Add chelsty-ha monitor:false rationale
- Add minimal configuration.yaml template for z2m

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Parent: 7277bdc27f
Commit: 603e10a364
@ -1,61 +1,154 @@
|
||||||
# CHELSTY Runtime
|
# CHELSTY Runtime
|
||||||
|
|
||||||
This document describes the runtime environment and deployment flow for CHELSTY, an offline-capable home automation edge node.
|
This document describes the runtime environment and deployment flow for CHELSTY, an offline-capable home automation edge node split across two VMs.
|
||||||
|
|
||||||
|
| Node | Role | Services |
|
||||||
|
|------|------|----------|
|
||||||
|
| `chelsty-infra` | LTE edge hypervisor | Mosquitto, Zigbee2MQTT, stability-agent, node-agent |
|
||||||
|
| `chelsty-ha` | Home Assistant VM | homeassistant (no node-agent — see below) |
|
||||||
|
|
||||||
|
Both nodes share an LTE uplink and must function fully offline (Zigbee, MQTT, HA automations) without any connectivity to SATURN, VPS, or Forgejo.
|
||||||
|
|
||||||
## Runtime Layout
|
## Runtime Layout
|
||||||
|
|
||||||
The CHELSTY runtime is located at `/opt/homelab`.
|
```
|
||||||
|
/opt/homelab/
|
||||||
- `/opt/homelab/config/`: Service-specific configurations and compose overrides.
|
├── config/ # Service-specific configs and secrets (not in Git)
|
||||||
- `/opt/homelab/data/`: Persistent data for services.
|
│ ├── mosquitto/
|
||||||
- `/opt/homelab/logs/`: Service logs.
|
│ └── zigbee2mqtt/
|
||||||
|
├── data/ # Persistent service data
|
||||||
### Key Service Locations
|
│ ├── mosquitto/ # Persistence DB, password file
|
||||||
- **Mosquitto**: `/opt/homelab/config/mosquitto/`
|
│ └── zigbee2mqtt/
|
||||||
- **Zigbee2MQTT**: `/opt/homelab/config/zigbee2mqtt/`
|
│ └── data/ # z2m config, coordinator backup, network key
|
||||||
|
└── logs/
|
||||||
|
```
|
||||||
|
|
||||||
## SLZB-06U Integration
|
## SLZB-06U Integration
|
||||||
|
|
||||||
CHELSTY uses a SMLIGHT SLZB-06U Zigbee coordinator connected via Ethernet/TCP.
|
CHELSTY uses a SMLIGHT SLZB-06U Zigbee coordinator connected over Ethernet/TCP.
|
||||||
|
|
||||||
- **Coordinator IP**: 192.168.1.105
|
- **Coordinator IP**: `192.168.1.105`
|
||||||
- **Port**: 6638
|
- **Port**: `6638`
|
||||||
- **Protocol**: TCP (ezsp adapter)
|
- **Adapter**: `ezsp` (deprecated; migration to `ember` is recommended and only requires setting `adapter: ember` in `configuration.yaml`)
|
||||||
|
- **Zigbee2MQTT config key**: `serial.port: tcp://192.168.1.105:6638`
|
||||||
|
|
||||||
Zigbee2MQTT is configured to connect to this coordinator over the local network.
|
⚠️ Never use `/dev/ttyUSB0` — the coordinator is always TCP-only on this site.
|
||||||
|
|
||||||
## Offline & LTE Assumptions
|
## Networking Constraints
|
||||||
|
|
||||||
- **WAN Resilience**: All core automation (Zigbee, MQTT) runs locally on CHELSTY.
|
### Mosquitto — `network_mode: host`
|
||||||
- **Connectivity**: LTE provides intermittent uplink for remote management and Tailscale access.
|
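Mosquitto runs with `network_mode: host`, so it listens directly on the host's network stack and is reachable at `localhost:1883` from the host and from other host-networked containers. Bridge-networked containers cannot reach it via `localhost` and must go through the host gateway. **Do not change this.**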
|
||||||
- **Home Assistant**: Runs on `chelsty-ha` node, connecting to the Mosquitto broker on `chelsty-infra`.
|
|
||||||
|
### Zigbee2MQTT — bridge network + extra_hosts
|
||||||
|
Zigbee2MQTT runs in a bridge-networked container (needed for port mapping compatibility with docker-compose v1). To reach the host-networked Mosquitto:
|
||||||
|
|
||||||
|
```yaml
|
||||||
|
# hosts/chelsty-infra/runtime/zigbee2mqtt/docker-compose.override.yml
|
||||||
|
services:
|
||||||
|
  zigbee2mqtt:
|
||||||
|
    extra_hosts:
|
||||||
|
      - "mosquitto:host-gateway"
|
||||||
|
```
|
||||||
|
|
||||||
|
This maps the `mosquitto` hostname inside the z2m container to the Docker host gateway IP, so `mqtt://mosquitto:1883` reaches the host-networked Mosquitto process.
|
||||||
|
|
||||||
|
**Why not `network_mode: host` for z2m?**
|
||||||
|
chelsty-infra runs docker-compose v1 (1.29.2). In v1, `network_mode: host` cannot coexist with `ports:` declared in the base `docker-compose.yml`; the combination raises `InvalidArgument`. The `extra_hosts` approach avoids this.
|
||||||
|
|
||||||
|
## Zigbee2MQTT Config Location
|
||||||
|
|
||||||
|
The `configuration.yaml` **must be writable** — z2m migrates and rewrites it on startup. It lives in the data directory:
|
||||||
|
|
||||||
|
```
|
||||||
|
/opt/homelab/data/zigbee2mqtt/data/configuration.yaml
|
||||||
|
```
|
||||||
|
|
||||||
|
This path is mounted read-write by the base `docker-compose.yml`:
|
||||||
|
```yaml
|
||||||
|
volumes:
|
||||||
|
  - /opt/homelab/data/zigbee2mqtt/data:/app/data
|
||||||
|
```
|
||||||
|
|
||||||
|
Do **not** mount `configuration.yaml` as a separate `:ro` volume — z2m will fail with `EROFS`.
|
||||||
|
|
||||||
|
### Minimal configuration.yaml
|
||||||
|
```yaml
|
||||||
|
homeassistant: true
|
||||||
|
permit_join: false
|
||||||
|
mqtt:
|
||||||
|
  base_topic: zigbee2mqtt
|
||||||
|
  server: mqtt://mosquitto:1883
|
||||||
|
serial:
|
||||||
|
  port: tcp://192.168.1.105:6638
|
||||||
|
  adapter: ezsp
|
||||||
|
frontend:
|
||||||
|
  port: 8080
|
||||||
|
advanced:
|
||||||
|
  log_level: info
|
||||||
|
```
|
||||||
|
|
||||||
|
## chelsty-ha — No node-agent
|
||||||
|
|
||||||
|
`chelsty-ha` does not have a node-agent deployed. Home Assistant is monitored indirectly: if MQTT goes silent on `chelsty-infra`, HA is likely down.
|
||||||
|
|
||||||
|
In `hosts/chelsty-ha/services.yaml`:
|
||||||
|
```yaml
|
||||||
|
services:
|
||||||
|
  homeassistant:
|
||||||
|
    monitor: false # No node-agent; suppresses supervisor action generation
|
||||||
|
```
|
||||||
|
|
||||||
|
Remove `monitor: false` once node-agent is bootstrapped on this VM.
|
||||||
|
|
||||||
## Deployment Flow
|
## Deployment Flow
|
||||||
|
|
||||||
1. **Initial Bootstrap**:
|
### Initial Bootstrap
|
||||||
Run the bootstrap script on the CHELSTY node:
|
|
||||||
```bash
|
```bash
|
||||||
./scripts/bootstrap/chelsty-runtime.sh
|
./scripts/bootstrap/chelsty-runtime.sh
|
||||||
```
|
```
|
||||||
|
|
||||||
2. **Manual Configuration**:
|
### Deploy services
|
||||||
- Edit `/opt/homelab/config/zigbee2mqtt/.env` with MQTT credentials.
|
|
||||||
- Add Mosquitto user:
|
|
||||||
```bash
|
|
||||||
sudo mosquitto_passwd -b /opt/homelab/data/mosquitto/config/password.txt <user> <password>
|
|
||||||
```
|
|
||||||
|
|
||||||
3. **Service Deployment**:
|
|
||||||
Use the staged deployment runtime:
|
|
||||||
```bash
|
```bash
|
||||||
./scripts/deploy/deploy-node.sh chelsty-infra
|
./scripts/deploy/deploy-node.sh chelsty-infra
|
||||||
./scripts/deploy/deploy-node.sh chelsty-ha
|
./scripts/deploy/deploy-node.sh chelsty-ha
|
||||||
```
|
```
|
||||||
|
|
||||||
## Recovery Procedure
|
### Manual (SSH) — chelsty-infra uses docker-compose v1
|
||||||
|
```bash
|
||||||
|
ssh oskar@100.122.201.22
|
||||||
|
cd ~/homelab-codex-ws/services/<service>
|
||||||
|
docker-compose -f docker-compose.yml \
|
||||||
|
-f ../../hosts/chelsty-infra/runtime/<service>/docker-compose.override.yml \
|
||||||
|
up -d --build --force-recreate
|
||||||
|
```
|
||||||
|
|
||||||
In case of runtime failure:
|
> **Note:** `docker compose` (v2) is **not** available on chelsty-infra — always use `docker-compose` (hyphenated, v1 1.29.2).
|
||||||
1. Verify Docker and Compose plugin: `docker compose version`
|
|
||||||
2. Re-run bootstrap script to ensure directory structure and basic configs.
|
## Recovery Procedures
|
||||||
3. Check Mosquitto logs: `tail -f /opt/homelab/data/mosquitto/log/mosquitto.log`
|
|
||||||
4. Verify SLZB-06U reachability: `ping 192.168.1.105`
|
### Mosquitto stopped
|
||||||
|
```bash
|
||||||
|
ssh oskar@100.122.201.22 "docker start mosquitto"
|
||||||
|
# Ensure restart policy is correct:
|
||||||
|
ssh oskar@100.122.201.22 "docker update --restart unless-stopped mosquitto"
|
||||||
|
```
|
||||||
|
|
||||||
|
### Zigbee2MQTT won't start
|
||||||
|
1. Check logs: `docker logs zigbee2mqtt --tail 50`
|
||||||
|
2. Verify SLZB-06U reachable from host: `nc -zv 192.168.1.105 6638`
|
||||||
|
3. Verify config is not empty: `cat /opt/homelab/data/zigbee2mqtt/data/configuration.yaml`
|
||||||
|
4. If config missing, recreate from the minimal template above
|
||||||
|
|
||||||
|
### SLZB-06U unreachable
|
||||||
|
An `EHOSTUNREACH` error for `192.168.1.105:6638` means the coordinator is offline or the LAN is down. Zigbee2MQTT keeps retrying, so no restart is needed once the coordinator returns.
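
The reachability check can also be scripted. The following is a minimal sketch, assuming a plain TCP connect is an adequate probe (it is roughly what `nc -zv` does); the function name `coordinator_reachable` is hypothetical and not part of the repository.

```python
import socket


def coordinator_reachable(host="192.168.1.105", port=6638, timeout=3.0):
    """Return True if a TCP connection to the coordinator can be opened."""
    try:
        # create_connection raises OSError (including EHOSTUNREACH,
        # connection refused, and timeouts) when the peer cannot be reached.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    print("reachable" if coordinator_reachable() else "unreachable")
```

A `False` result only says the TCP port did not answer; it does not distinguish a powered-off coordinator from a LAN fault.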
|
||||||
|
|
||||||
|
## Critical Backup Sets
|
||||||
|
|
||||||
|
| Data | Path |
|
||||||
|
|------|------|
|
||||||
|
| HA config + DB | `/opt/homelab/data/homeassistant/` on chelsty-ha |
|
||||||
|
| z2m config + coordinator backup + network key | `/opt/homelab/data/zigbee2mqtt/data/` |
|
||||||
|
| Mosquitto persistence + password file | `/opt/homelab/data/mosquitto/` |
|
||||||
|
| SLZB-06U coordinator state | Backup via SLZB-06U web UI at `192.168.1.105` |
|
||||||
|
|
||||||
|
> ⚠️ The Zigbee network key is in `configuration.yaml` or `coordinator_backup.json` — losing it requires re-pairing all devices.
|
||||||
|
|
|
||||||
|
|
@ -7,57 +7,92 @@ The Observer Runtime is a lightweight agent responsible for synthesizing the ope
|
||||||
The observer follows a filesystem-first approach, consuming append-only events and generating a normalized world model. It is designed to be idempotent, resumable, and resilient to intermittent node connectivity.
|
The observer follows a filesystem-first approach, consuming append-only events and generating a normalized world model. It is designed to be idempotent, resumable, and resilient to intermittent node connectivity.
|
||||||
|
|
||||||
### Inputs
|
### Inputs
|
||||||
- `/opt/homelab/events/`: Normalized JSON events.
|
- `/opt/homelab/events/`: Normalized JSON events (one `.json` file per event, organized by date and node).
|
||||||
- `/opt/homelab/state/`: Deployment stage markers and internal observer checkpoint.
|
- `/opt/homelab/state/observer_checkpoint.json`: Per-node checkpoint dict (see below).
|
||||||
- `/opt/homelab/logs/`: Detailed execution logs and diagnostics.
|
|
||||||
- Repository Inventory: `inventory/topology.yaml` and `hosts/*/services.yaml`.
|
- Repository Inventory: `inventory/topology.yaml` and `hosts/*/services.yaml`.
|
||||||
|
|
||||||
### World Model Output
|
### World Model Output
|
||||||
Generated under `/opt/homelab/world/`:
|
Generated under `/opt/homelab/world/`:
|
||||||
- `nodes.json`: Current node availability, roles, and last seen timestamps.
|
- `nodes.json`: Current node availability, roles, disk/memory pressure, last seen timestamps. Dict keyed by node name.
|
||||||
- `services.json`: Service health status and links to active incidents.
|
- `services.json`: Service health status and links to active incidents. Dict keyed by `"node/service"`.
|
||||||
- `deployments.json`: Tracking of active and historical deployment runs by `correlation_id`.
|
- `deployments.json`: Tracking of active and historical deployment runs by `correlation_id`.
|
||||||
- `incidents.json`: Correlated operational issues, including repeat failures and resolution status.
|
- `incidents.json`: Correlated operational issues, including repeat failures and resolution status.
|
||||||
- `runtime-summary.json`: High-level overview for dashboards and planner agents.
|
- `runtime-summary.json`: High-level overview for dashboards and planner agents.
|
||||||
|
|
||||||
## Incident Lifecycle
|
## Checkpoint Format
|
||||||
|
|
||||||
The observer implements lightweight incident correlation:
|
The observer tracks per-node progress to avoid silently skipping event directories:
|
||||||
|
|
||||||
1. **Detection**: When a `service_unhealthy` or `healthcheck_failed` event is consumed, a new incident is created or an existing active incident for that service is updated.
|
|
||||||
2. **Correlation**: Multiple failure events for the same service on the same node are collapsed into a single incident, tracking the `occurrence_count`.
|
|
||||||
3. **Diagnostics**: Deployment failures (`deployment_failed`) automatically attach references to diagnostic files if present in the event payload.
|
|
||||||
4. **Resolution**: A `service_recovered` event for a service will transition any active incidents for that service to a `resolved` state.
|
|
||||||
|
|
||||||
### Example Incident JSON
|
|
||||||
```json
|
```json
|
||||||
{
|
{
|
||||||
"inc-1715518800-saturn-mosquitto": {
|
"node_checkpoints": {
|
||||||
"id": "inc-1715518800-saturn-mosquitto",
|
"vps": "/opt/homelab/events/2026-05-27/vps/evt-vps-1234.json",
|
||||||
"node": "saturn",
|
"piha": "/opt/homelab/events/2026-05-27/piha/evt-piha-5678.json",
|
||||||
"service": "mosquitto",
|
"chelsty-infra": "/opt/homelab/events/2026-05-27/chelsty-infra/evt-chelsty-infra-9012.json"
|
||||||
"status": "resolved",
|
|
||||||
"severity": "error",
|
|
||||||
"started_at": "2026-05-12T12:05:00Z",
|
|
||||||
"last_occurrence": "2026-05-12T12:06:00Z",
|
|
||||||
"occurrence_count": 2,
|
|
||||||
"events": [
|
|
||||||
"2026-05-12T12:05:00Z",
|
|
||||||
"2026-05-12T12:06:00Z"
|
|
||||||
],
|
|
||||||
"correlation_id": "hc-1",
|
|
||||||
"resolved_at": "2026-05-12T12:10:00Z"
|
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
```
|
```
|
||||||
|
|
||||||
|
A single global checkpoint (`last_processed_file`) was replaced with this per-node dict because the old approach silently skipped any node directory that sorts alphabetically before the last-seen node (e.g. `piha/` would be skipped when the checkpoint pointed to `vps/`).
|
||||||
|
|
||||||
|
**Reset:** Delete `/opt/homelab/state/observer_checkpoint.json`. The observer will reprocess all events and rebuild world state from scratch.
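
The per-node resume logic can be sketched as follows. This is an illustration under stated assumptions, not the observer's actual code: the function name `pending_events` is hypothetical, and it assumes that event paths within one node directory sort lexicographically in processing order.

```python
from pathlib import PurePosixPath


def pending_events(event_paths, node_checkpoints):
    """Return event paths that have not been processed yet, judged per node.

    event_paths: iterable of paths like
        /opt/homelab/events/<date>/<node>/<file>.json
    node_checkpoints: dict mapping node name -> last processed path
    """
    pending = []
    for path in sorted(event_paths):
        # The node name is the directory directly above the event file.
        node = PurePosixPath(path).parent.name
        last = node_checkpoints.get(node)
        # Compare only against this node's own checkpoint, never a global one,
        # so a checkpoint under vps/ cannot hide unprocessed files under piha/.
        if last is None or path > last:
            pending.append(path)
    return pending
```

With an empty checkpoint dict every event is returned, which matches the reset behavior of deleting the checkpoint file.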
|
||||||
|
|
||||||
|
## Event Types
|
||||||
|
|
||||||
|
### Negative events (create/escalate incidents)
|
||||||
|
- `service_unhealthy`, `healthcheck_failed` — open or increment an active incident
|
||||||
|
- `deployment_failed` — record failure in deployments.json
|
||||||
|
|
||||||
|
### Positive events (resolve state)
|
||||||
|
- `service_healthy` — marks service status as `healthy` **and** resolves any active incident for that service
|
||||||
|
- `service_recovered` — alias, same effect
|
||||||
|
- `deployment_completed` — marks deployment as completed
|
||||||
|
|
||||||
|
### Node events
|
||||||
|
- `node_online`, `node_offline` — update node status in nodes.json
|
||||||
|
- `disk_pressure_*` — set `disk_pressure` field on the node record
|
||||||
|
|
||||||
|
## Incident Lifecycle
|
||||||
|
|
||||||
|
1. **Detection**: A `service_unhealthy` or `healthcheck_failed` event creates or increments an active incident.
|
||||||
|
2. **Correlation**: Multiple failure events for the same `node/service` are collapsed into one incident, incrementing `occurrence_count`.
|
||||||
|
3. **Resolution**: A `service_healthy` or `service_recovered` event resolves any active incident for that service, setting `status: resolved` and `resolved_at`.
|
||||||
|
4. **Expiry**: Resolved incidents older than 7 days are pruned from world state by `_prune_stale_world()`.
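
The detection, correlation, and resolution steps can be sketched as a single event handler. This is a hedged illustration: the function name `apply_event`, the `"active"` status value, and the ID scheme `inc-<timestamp>-<node>-<service>` are assumptions inferred from the example incident JSON in this section, not confirmed observer internals.

```python
NEGATIVE = ("service_unhealthy", "healthcheck_failed")
POSITIVE = ("service_healthy", "service_recovered")


def apply_event(incidents, event):
    """Update the incidents dict in place for one event and return it.

    event: dict with "type", "node", "service", "timestamp" (epoch seconds).
    """
    target = (event["node"], event["service"])
    # Find the active incident for this node/service pair, if any.
    active = next(
        (inc for inc in incidents.values()
         if (inc["node"], inc["service"]) == target
         and inc["status"] == "active"),
        None,
    )
    ts = event["timestamp"]
    if event["type"] in NEGATIVE:
        if active is None:
            # Detection: open a new incident.
            inc_id = f"inc-{int(ts)}-{event['node']}-{event['service']}"
            incidents[inc_id] = {
                "id": inc_id,
                "node": event["node"],
                "service": event["service"],
                "status": "active",
                "started_at": ts,
                "last_occurrence": ts,
                "occurrence_count": 1,
            }
        else:
            # Correlation: collapse repeat failures into the same incident.
            active["occurrence_count"] += 1
            active["last_occurrence"] = ts
    elif event["type"] in POSITIVE and active is not None:
        # Resolution: close the active incident.
        active["status"] = "resolved"
        active["resolved_at"] = ts
    return incidents
```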
|
||||||
|
|
||||||
|
### Example Incident JSON
|
||||||
|
```json
|
||||||
|
{
|
||||||
|
"inc-1715518800-vps-observer": {
|
||||||
|
"id": "inc-1715518800-vps-observer",
|
||||||
|
"node": "vps",
|
||||||
|
"service": "observer",
|
||||||
|
"status": "resolved",
|
||||||
|
"severity": "error",
|
||||||
|
"started_at": 1715518800.0,
|
||||||
|
"last_occurrence": 1715518860.0,
|
||||||
|
"occurrence_count": 2,
|
||||||
|
"trigger_type": "containers_not_running",
|
||||||
|
"resolved_at": 1715519100.0
|
||||||
|
}
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
## World State Pruning
|
||||||
|
|
||||||
|
`_prune_stale_world()` runs every reconcile cycle and removes:
|
||||||
|
|
||||||
|
1. **Stale nodes** — nodes not present in `inventory/topology.yaml` (e.g. ghost nodes created when `NODE_NAME` was unset and fell back to the container's 12-char hex ID).
|
||||||
|
2. **Services of stale nodes** — all `node/service` keys whose node was pruned.
|
||||||
|
3. **Ghost service keys** — service keys whose service-name portion matches the pattern `<12hexchars>_<name>` (Docker internal stale-state artifacts, created when node-agent used `c.name` instead of the compose label).
|
||||||
|
4. **Expired incidents** — resolved incidents older than 7 days.
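
The first three pruning rules can be sketched as a filter over `services.json` keys. This is a minimal sketch, not the real `_prune_stale_world()`: the regular expression assumes the 12-character prefix is lowercase hex, and the helper names are hypothetical.

```python
import re

# Assumed shape of a ghost service name: 12 lowercase hex chars, "_", real name.
GHOST_SERVICE = re.compile(r"^[0-9a-f]{12}_.+$")


def is_ghost_key(key):
    """True if the service part of a "node/service" key looks like a ghost."""
    _node, _, service = key.partition("/")
    return bool(GHOST_SERVICE.match(service))


def prune_services(services, topology_nodes):
    """Keep only services on known nodes whose names are not ghost keys."""
    return {
        key: value
        for key, value in services.items()
        if key.split("/", 1)[0] in topology_nodes and not is_ghost_key(key)
    }
```

A name such as `node_exporter` is kept, because the part before the underscore is not 12 hex characters.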
|
||||||
|
|
||||||
## Runtime Behavior
|
## Runtime Behavior
|
||||||
|
|
||||||
### Idempotency
|
### Idempotency
|
||||||
The observer processes events in order. If the world state is lost, deleting the checkpoint file (`/opt/homelab/state/observer_checkpoint.json`) will cause the observer to re-process all events and rebuild the world state.
|
The observer processes events in order. Deleting the checkpoint and restarting replays all events and produces the same world state.
|
||||||
|
|
||||||
### Resumability
|
|
||||||
The observer tracks the last processed event file in its checkpoint. Upon restart, it continues from the next available event.
|
|
||||||
|
|
||||||
### Deployment Tracking
|
### Deployment Tracking
|
||||||
Deployments are tracked via `correlation_id`. The observer synthesizes the start, end, and status of each deployment run, providing a clear history of changes to the environment.
|
Deployments are tracked via `correlation_id`. The observer synthesizes the start, end, and status of each deployment run from events.
|
||||||
|
|
||||||
|
### Topology Filtering
|
||||||
|
World-state entries for nodes not listed in `inventory/topology.yaml` are removed during pruning. This prevents transient bootstrap noise from polluting world state.
|
||||||
|
|
|
||||||
103
docs/sessions/2026-05-27.md
Normal file
|
|
@ -0,0 +1,103 @@
|
||||||
|
# SESSION: Stabilizing the homelab multi-agent system
|
||||||
|
|
||||||
|
**DATE:** 2026-05-27
|
||||||
|
**RESULT:** System NOMINAL (97/97 services, 0 errors)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## PROBLEMS FOUND
|
||||||
|
|
||||||
|
- stability-agent did not generate remediation actions: only redeploy, no container_restart
|
||||||
|
- mosquitto on chelsty-infra went down and nothing restarted it (restart policy was `no`)
|
||||||
|
- zigbee2mqtt had never been deployed on chelsty-infra
|
||||||
|
- node-agent was an empty skeleton: it did not emit `service_healthy`, so `services.json` was always empty
|
||||||
|
- ghost services: node-agent used `c.name` (which can return `<12hex>_real-name`) instead of the `com.docker.compose.service` label
|
||||||
|
- the materializer on piha read from its local Redis instead of the control-plane API; Redis held 80 stale entries with ghost keys, so "Copy for AI" returned old data
|
||||||
|
- observer used a single global checkpoint instead of per-node checkpoints, silently skipping event directories that sort before the current checkpoint
|
||||||
|
- supervisor did not cancel resolved actions, so the pending queue grew without bound
|
||||||
|
- the `service_healthy` event did not close active incidents
|
||||||
|
- NODE_ALIAS_MAP was not configured, causing a node-name mismatch between events and topology
|
||||||
|
- chelsty-ha was wrongly in monitoring scope: it has no node-agent
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## FIXES SHIPPED (commits in master)
|
||||||
|
|
||||||
|
```
|
||||||
|
7277bdc Fix Copy for AI: materializer fetches from control-plane API instead of Redis
|
||||||
|
b40b832 Fix ghost service keys from hash-prefixed Docker container names
|
||||||
|
28e9534 observer: service_healthy resolves active incidents
|
||||||
|
46ae92b supervisor: also cancel pending actions for services removed from desired state
|
||||||
|
410bfe7 zigbee2mqtt: config goes in data dir (writable), not separate ro mount
|
||||||
|
b3912fe zigbee2mqtt: use extra_hosts host-gateway instead of network_mode: host
|
||||||
|
61e07f4 zigbee2mqtt override: clear ports list for docker-compose v1 host network compat
|
||||||
|
51002d4 Fix pending actions: node_exporter, zigbee2mqtt, chelsty-ha monitoring
|
||||||
|
fb7828b supervisor: auto-cancel pending actions when drift is resolved
|
||||||
|
2f19657 fix(node-agent): unique event IDs per service to prevent same-second overwrites
|
||||||
|
267742c vps/node-agent: add network_mode: host for control-plane health probe
|
||||||
|
4e8968f Fix service health tracking: emit service_healthy, control-plane endpoint, checkpoint migration
|
||||||
|
f4a8db9 fix(observer): per-node-directory checkpoints replace single global checkpoint
|
||||||
|
a5a3e22 fix(node-agent): skip SSH config file in rsync to avoid UID ownership errors
|
||||||
|
2349de5 fix(node-agent): correct VPS_EVENTS_HOST to actual VPS Tailscale IP
|
||||||
|
65bac4e fix(node-agent): mount host SSH key into container for event shipping
|
||||||
|
96bf326 fix(observer+operator-ui): fix stale world state, dict→list API, event time filter
|
||||||
|
ae33cce feat(node-agent): add runtime overrides for piha, solaria, chelsty-infra
|
||||||
|
c5c080b feat(vps): add node-agent runtime override with NODE_NAME=vps
|
||||||
|
01b7758 feat(node-agent): implement health monitor and safe cleanup policy
|
||||||
|
```
|
||||||
|
|
||||||
|
### Details of key fixes
|
||||||
|
|
||||||
|
**fix(observer): per-node checkpoints**
|
||||||
|
A single global `last_processed_file` checkpoint silently skipped event directories that sort alphabetically before the last processed node (e.g. piha/ < vps/). It was replaced with a per-node dict: `{"node_checkpoints": {"piha": "...", "vps": "..."}}`.
|
||||||
|
|
||||||
|
**fix(observer): ghost key pruning**
|
||||||
|
`_prune_stale_world()` now removes entries from services.json whose service key matches the pattern `<12hexchars>_<name>` (artifacts of Docker internal state tracking).
|
||||||
|
|
||||||
|
**fix(node-agent): canonical container name**
|
||||||
|
`check_containers()` now uses the `com.docker.compose.service` label as the canonical name. Fallback: strip the hash prefix from `c.name`. Containers in the `created` state are skipped (Docker stale-state artifacts).
|
||||||
|
|
||||||
|
**fix(node-agent): service_healthy emission**
|
||||||
|
Node-agent now emits `service_healthy` for every running managed container each cycle. Without this, `services.json` was always empty and the supervisor generated a flood of "missing service" redeploys.
|
||||||
|
|
||||||
|
**fix(supervisor): auto-cancel resolved actions**
|
||||||
|
`_cancel_resolved_pending_actions()` moves pending actions to `cancelled/` when:
|
||||||
|
- the service has become healthy (`drift_resolved_auto`)
|
||||||
|
- the service has been removed from desired state (`service_removed_from_desired_state`)
|
||||||
|
|
||||||
|
**fix(supervisor): monitor:false**
|
||||||
|
The `monitor: false` field in `services.yaml` excludes a service from supervisor action generation. It is used for `homeassistant` on chelsty-ha (no node-agent).
|
||||||
|
|
||||||
|
**fix(agent-system/materializer): control-plane API as source**
|
||||||
|
The materializer on piha now fetches data from the VPS control-plane API (`CONTROL_PLANE_URL=http://100.95.58.48:18180`) instead of local Redis. Redis held 80 stale entries. The Redis path is kept as a fallback.
|
||||||
|
|
||||||
|
**fix(chelsty-infra/zigbee2mqtt): mosquitto networking**
|
||||||
|
Mosquitto runs with `network_mode: host`, so bridge containers cannot reach it via localhost. Solution: `extra_hosts: - "mosquitto:host-gateway"` in the z2m override. We do not use `network_mode: host` for z2m because it conflicts with `ports:` in docker-compose v1 (1.29.2 on chelsty-infra).
|
||||||
|
|
||||||
|
**fix(chelsty-infra/zigbee2mqtt): writable config**
|
||||||
|
z2m migrates and rewrites `configuration.yaml` on startup. The config must live in the data directory: `/opt/homelab/data/zigbee2mqtt/data/configuration.yaml` (read-write mount), not in a separate `:ro` volume.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## END STATE
|
||||||
|
|
||||||
|
| Node | Status | Services |
|
||||||
|
|------|--------|---------|
|
||||||
|
| vps | online | control-plane (4), node-agent, node_exporter, stability-agent |
|
||||||
|
| piha | online | agent-system (4), node-agent, stability-agent, monitoring stack |
|
||||||
|
| solaria | online | node-agent, stability-agent, AI workloads |
|
||||||
|
| chelsty-infra | online | mosquitto, zigbee2mqtt (z2m connects once the SLZB-06U is back online), node-agent, stability-agent |
|
||||||
|
| chelsty-ha | n/a | homeassistant (monitor:false: no node-agent, HA monitored indirectly via MQTT) |
|
||||||
|
|
||||||
|
**Action queue:** 0 pending, 0 approved, 0 running
|
||||||
|
**Incidents:** 0 active
|
||||||
|
**Ghost service keys:** 0
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## KNOWN LIMITATIONS / TODO
|
||||||
|
|
||||||
|
- SLZB-06U (Zigbee coordinator) is offline: `192.168.1.105:6638` returns EHOSTUNREACH from chelsty-infra. This is probably a hardware/network problem on the 192.168.1.0/24 side. z2m starts and serves an error page on :8080; it will connect automatically once the coordinator returns.
|
||||||
|
- The `ezsp` adapter in the z2m config is deprecated; migration to `ember` is recommended. It needs no new configuration, only setting `adapter: ember` in `configuration.yaml`.
|
||||||
|
- chelsty-ha has no node-agent. Add one when a machine is available or via manual bootstrap.
|
||||||
|
- Redis on piha still holds old `homelab:nodes:*`, `homelab:incidents:*` etc. keys; the materializer no longer uses them in API mode, so they can be cleared.
|
||||||
|
|
@ -1,83 +1,126 @@
|
||||||
# VPS Control Plane
|
# VPS Control Plane
|
||||||
|
|
||||||
The VPS Control Plane is the orchestration brain of the homelab platform. It runs on the Hetzner VPS and provides observability, automated reconciliation, and a web-based operator interface.
|
The VPS Control Plane is the orchestration brain of the homelab platform. It runs on the Hetzner VPS (Tailscale IP: `100.95.58.48`) and provides observability, automated reconciliation, and a web-based operator interface.
|
||||||
|
|
||||||
## Architecture
|
## Architecture
|
||||||
|
|
||||||
The control plane consists of four core services running as a Docker Compose stack:
|
The control plane consists of four core services running as a Docker Compose stack under `services/control-plane/`:
|
||||||
|
|
||||||
1. **Observer**: Synthesizes world state from events.
|
| Container | Role |
|
||||||
2. **Supervisor**: Detects drifts between desired and actual state.
|
|-----------|------|
|
||||||
3. **Executor**: Executes approved actions from the queue.
|
| `control-plane-observer` | Synthesizes world state from events in `/opt/homelab/events/` |
|
||||||
4. **Operator UI**: Web interface for system monitoring and action approval.
|
| `control-plane-supervisor` | Detects drift between desired state (`hosts/*/services.yaml`) and actual state (`world/services.json`); writes pending actions |
|
||||||
|
| `control-plane-executor` | Executes approved actions from `/opt/homelab/actions/approved/` |
|
||||||
|
| `control-plane-ui` | Web interface for system monitoring and action approval; serves port 18180 |
|
||||||
|
|
||||||
All services adhere to **filesystem-first** semantics, using `/opt/homelab/` as the primary data exchange and persistence layer.
|
All services use **filesystem-first** semantics with `/opt/homelab/` as the data exchange layer. All four run with `network_mode: host` and as UID 1000 (`homelab` user).
|
||||||
|
|
||||||
## Deployment Flow
|
## Supervisor Behavior
|
||||||
|
|
||||||
### 1. Prerequisites
|
### Desired State
|
||||||
- Target VPS node must be onboarded (Tailscale active, Docker installed).
|
Loaded from `hosts/*/services.yaml` each reconcile cycle. Services with `monitor: false` are silently skipped — use this for services without a node-agent (e.g. `homeassistant` on `chelsty-ha`).
|
||||||
- Repository cloned to `/home/oskar/homelab-codex-ws`.
|
|
||||||
|
|
||||||
### 2. Bootstrap
|
### Drift Types
|
||||||
Run the local deployment script on the VPS to initialize the runtime filesystem and start the stack:
|
- `missing_service` — service is in desired state but absent from `services.json`
|
||||||
|
- `unhealthy_service` — service exists in `services.json` but `status != healthy`
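
The two drift types, together with the `monitor: false` skip, can be sketched as a comparison of two dicts. This is a hedged sketch: the function name `detect_drift` and the flattened `"node/service"` shape of the desired state are assumptions made for illustration.

```python
def detect_drift(desired, actual):
    """Return a list of (key, drift_type) for monitored services in drift.

    desired: dict "node/service" -> service spec (may contain "monitor": False)
    actual:  dict "node/service" -> {"status": ...}, as in services.json
    """
    drifts = []
    for key, spec in sorted(desired.items()):
        # Services marked monitor: false never produce drift.
        if spec.get("monitor", True) is False:
            continue
        if key not in actual:
            drifts.append((key, "missing_service"))
        elif actual[key].get("status") != "healthy":
            drifts.append((key, "unhealthy_service"))
    return drifts
```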
|
||||||
|
|
||||||
|
### Action Types
|
||||||
|
| Trigger | Action type | Risk |
|
||||||
|
|---------|-------------|------|
|
||||||
|
| `containers_not_running`, `mqtt_unreachable` | `container_restart` | low |
|
||||||
|
| Any other / unknown | `redeploy` | guarded |
|
||||||
|
| Node `disk_pressure: high` | `disk_cleanup` | guarded |
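
The service-level rows of the table can be sketched as a small lookup. This is an illustration only: the function name `choose_action` is hypothetical, and the node-level `disk_cleanup` row is left out because it is driven by node state rather than by a service trigger.

```python
# Triggers that a plain container restart is expected to fix.
RESTART_TRIGGERS = {"containers_not_running", "mqtt_unreachable"}


def choose_action(trigger_type):
    """Map a service trigger to an (action_type, risk) pair."""
    if trigger_type in RESTART_TRIGGERS:
        return ("container_restart", "low")
    # Anything else, including unknown triggers, falls back to redeploy.
    return ("redeploy", "guarded")
```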
|
||||||
|
|
||||||
|
### Action ID Stability
|
||||||
|
Action IDs are deterministic: `redeploy-{node}-{service}` or `container-restart-{node}-{service}`. The same drift always produces the same filename, making reconcile truly idempotent across supervisor restarts.
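
A deterministic ID of this shape can be sketched in a few lines. The function name `action_id` is hypothetical; only the two ID formats quoted above come from the source.

```python
def action_id(action_type, node, service):
    """Build a deterministic action ID from the action type, node, and service."""
    # container_restart -> container-restart, redeploy -> redeploy
    prefix = action_type.replace("_", "-")
    return f"{prefix}-{node}-{service}"
```

Because the ID contains no timestamp or random part, a repeated reconcile for the same drift targets the same filename instead of adding a new one.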
|
||||||
|
|
||||||
|
### Auto-Cancel
|
||||||
|
Pending `redeploy` and `container_restart` actions are automatically moved to `cancelled/` when:
|
||||||
|
- **`drift_resolved_auto`** — the service becomes `healthy` in actual state
|
||||||
|
- **`service_removed_from_desired_state`** — the service was removed from `services.yaml` or marked `monitor: false`
|
||||||
|
|
||||||
|
Only `pending` actions are auto-cancelled. Approved/running actions have been committed to by the operator and are never cancelled automatically.
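
The auto-cancel decision can be sketched as a pure function. This is a minimal sketch under assumptions: the action dict fields (`state`, `type`, `node`, `service`) and the function name `cancel_reason` are hypothetical, and only the two reason strings come from the source.

```python
def cancel_reason(action, desired, actual):
    """Return the cancel reason for an action, or None to leave it alone.

    desired: set of "node/service" keys currently monitored
    actual:  dict "node/service" -> {"status": ...}, as in services.json
    """
    # Approved and running actions are never cancelled automatically.
    if action["state"] != "pending":
        return None
    if action["type"] not in ("redeploy", "container_restart"):
        return None
    key = f"{action['node']}/{action['service']}"
    if key not in desired:
        return "service_removed_from_desired_state"
    if actual.get(key, {}).get("status") == "healthy":
        return "drift_resolved_auto"
    return None
```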
|
||||||
|
|
||||||
|
### Node Name Resolution
|
||||||
|
The supervisor supports a `NODE_ALIAS_MAP` environment variable (JSON string) to map event/world-state node names to canonical topology names:
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
cd services/control-plane
|
NODE_ALIAS_MAP='{"node-2": "chelsty-infra", "node-1": "piha"}'
|
||||||
bash deploy-local.sh
|
|
||||||
```
|
```
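
Alias resolution of this kind can be sketched as follows. The helper names are hypothetical, and treating an unset variable as an empty map is an assumption; only the `NODE_ALIAS_MAP` variable and its JSON format come from the source.

```python
import json
import os


def load_alias_map(environ=os.environ):
    """Parse NODE_ALIAS_MAP (a JSON object string); empty map if unset."""
    return json.loads(environ.get("NODE_ALIAS_MAP", "{}"))


def canonical_node(name, alias_map):
    """Return the topology name for a node; unknown names pass through."""
    return alias_map.get(name, name)
```

Passing unknown names through unchanged means nodes that already use their canonical name need no entry in the map.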
|
||||||
|
|
||||||
### 3. Verification
|
## Deployment
|
||||||
Verify the stack is healthy using the deployment script or check container status on the VPS:
|
|
||||||
|
|
||||||
|
### From SATURN (primary control node)
|
||||||
```bash
|
```bash
|
||||||
# Check status via deploy script
|
# Full deploy via SSH
|
||||||
./scripts/deploy/deploy-control-plane.sh --ssh
|
./scripts/deploy/deploy-control-plane.sh --ssh
|
||||||
|
|
||||||
# Manual status check on VPS
|
# Or manually:
|
||||||
|
ssh oskar@100.95.58.48 "cd ~/homelab-codex-ws && git pull origin master && cd services/control-plane && docker compose up -d --build --force-recreate"
|
||||||
|
```
|
||||||
|
|
||||||
|
### Direct on VPS
|
||||||
|
```bash
|
||||||
|
cd ~/homelab-codex-ws/services/control-plane
|
||||||
|
docker compose up -d --build --force-recreate
|
||||||
|
```
|
||||||
|
|
||||||
|
`deploy-local.sh` also creates the required `/opt/homelab/` directory structure and sets ownership to UID 1000 (requires `sudo`). If directories already exist, skip to the `docker compose` step directly.
|
||||||
|
|
||||||
|
### Verification
|
||||||
|
```bash
|
||||||
|
# On VPS
|
||||||
docker ps --filter "name=control-plane"
|
docker ps --filter "name=control-plane"
|
||||||
|
curl -s http://localhost:18180/summary | python3 -m json.tool
|
||||||
```
|
```
|
||||||
|
|
||||||
## Operational Workflows
|
## Action Approval Workflow
|
||||||
|
|
||||||
### Action Approval
|
```
|
||||||
1. Access the Operator UI (via Tailscale IP or Nginx Proxy Manager).
|
Supervisor writes → /opt/homelab/actions/pending/<id>.json
|
||||||
2. Navigate to **Action Queue**.
|
→ Operator UI (port 18180) or Telegram Bot notifies
|
||||||
3. Review **Pending** actions recommended by the Supervisor.
|
→ Operator clicks Approve
|
||||||
4. Click **Approve** to move actions to the execution queue.
|
→ /opt/homelab/actions/approved/<id>.json
|
||||||
|
→ Executor executes → completed / failed
|
||||||
|
```
|
||||||
|
|
||||||
### Recovery Flow
|
Possible action states: `pending → approved → running → completed / failed / rejected`
|
||||||
In case of control plane failure:
|
Auto-cancel path: `pending → cancelled/`
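The states above can be expressed as a transition table. This is a sketch under one stated assumption: the source lists `rejected` after `running`, but here it is treated as an operator decision on a `pending` action. Adjust the table if the executor behaves differently. The name `can_transition` is hypothetical.

```python
# Allowed transitions; states missing from the keys are terminal.
TRANSITIONS = {
    # "rejected" from pending is an assumption; "cancelled" is the auto-cancel path.
    "pending": {"approved", "rejected", "cancelled"},
    "approved": {"running"},
    "running": {"completed", "failed"},
}


def can_transition(current: str, target: str) -> bool:
    """Return True if an action may move from current to target."""
    return target in TRANSITIONS.get(current, set())


assert can_transition("pending", "cancelled")
assert not can_transition("approved", "cancelled")
```

Keeping `cancelled` reachable only from `pending` mirrors the rule that approved and running actions are never cancelled automatically.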
|
||||||
1. Check logs using `docker logs`.
|
|
||||||
2. Restart stack using the local deployment script: `bash deploy-local.sh`.
|
|
||||||
3. Rebuild world state: Delete `/opt/homelab/state/observer_checkpoint.json` and redeploy.
|
|
||||||
|
|
||||||
### Upgrade Flow
|
## Recovery
|
||||||
To deploy updates from the SOLARIA/control host:
|
|
||||||
|
### World state is stale or corrupt
|
||||||
|
```bash
|
||||||
|
# On VPS — delete checkpoint to force full replay
|
||||||
|
rm /opt/homelab/state/observer_checkpoint.json
|
||||||
|
docker restart control-plane-observer
|
||||||
|
```
|
||||||
|
|
||||||
|
### Flood of pending actions after bootstrap
|
||||||
|
Check whether node-agent is running and emitting `service_healthy` events on each node. Without `service_healthy` events, the supervisor treats every service as missing and queues redeployments on every cycle.
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
./scripts/deploy/deploy-control-plane.sh --ssh
|
# Check node-agent on each node
|
||||||
|
ssh oskar@<node> "docker ps --filter name=node-agent && docker logs node-agent --tail 20"
|
||||||
```
|
```
|
||||||
|
|
||||||
### Rollback Semantics
|
### Rebuild from scratch
|
||||||
Since the runtime is filesystem-first and append-only:
|
```bash
|
||||||
1. Roll back the repository state to a previous commit.
|
ssh oskar@100.95.58.48 "cd ~/homelab-codex-ws/services/control-plane && docker compose up -d --build --force-recreate"
|
||||||
2. Restart the control plane stack.
|
```
|
||||||
3. The supervisor will detect drift against the older (rolled-back) desired state and recommend actions to restore it.
|
|
||||||
|
|
||||||
## Runtime Safety
|
|
||||||
|
|
||||||
- **Readonly Mounts**: Most services mount the repository as `:ro` to prevent accidental mutations.
|
|
||||||
- **Least-Privilege**: UI, Observer, and Supervisor run as non-root `homelab` user (UID 1000).
|
|
||||||
- **Filesystem Isolation**: Clear separation between `/repo` (code/inventory) and `/opt/homelab` (runtime state).
|
|
||||||
|
|
||||||
## Integration
|
## Integration
|
||||||
|
|
||||||
|
### piha agent-system webui (port 18180 on piha)
|
||||||
|
The `agent-system-runtime-materializer` on piha polls the VPS control-plane API every 10 seconds and mirrors world state to piha's local `/opt/homelab/world/`. This ensures the **"Copy for AI"** button in the piha webui (`agent-system-webui`) reflects the same clean state as the VPS API.
|
||||||
|
|
||||||
|
Override: `hosts/piha/runtime/agent-system/docker-compose.override.yml` — sets `CONTROL_PLANE_URL=http://100.95.58.48:18180`.
|
||||||
|
|
||||||
### Nginx Proxy Manager
|
### Nginx Proxy Manager
|
||||||
Configure a proxy host in NPM to point to `http://control-plane-ui:8080`. Ensure Websockets are enabled if the UI uses them.
|
The operator UI on port 18180 can be proxied through NPM for external access. WebSocket support is not required.
|
||||||
|
|
||||||
### Log Locations
|
### Log Locations
|
||||||
- Container logs: `docker compose logs`
|
- Container logs: `docker compose logs -f` (from `services/control-plane/`)
|
||||||
- Runtime events: `/opt/homelab/events/YYYY-MM-DD/`
|
- Runtime events: `/opt/homelab/events/YYYY-MM-DD/`
|
||||||
- World state: `/opt/homelab/world/`
|
- World state: `/opt/homelab/world/`
|
||||||
- Diagnostics: `/opt/homelab/logs/`
|
- Action queue: `/opt/homelab/actions/{pending,approved,running,completed,failed,cancelled}/`
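For a quick view of the queue, the state directories listed above can be counted directly. This sketch assumes every state directory stores one `<id>.json` file per action, as shown for `pending/` and `approved/`. The function name `queue_counts` is hypothetical.

```python
from pathlib import Path

STATES = ("pending", "approved", "running", "completed", "failed", "cancelled")


def queue_counts(root: str = "/opt/homelab/actions") -> dict:
    """Count action files per state directory; missing directories count as 0."""
    counts = {}
    for state in STATES:
        directory = Path(root) / state
        counts[state] = len(list(directory.glob("*.json"))) if directory.is_dir() else 0
    return counts


if __name__ == "__main__":
    for state, count in queue_counts().items():
        print(f"{state:10} {count}")
```

A `pending` count that keeps growing after bootstrap is the symptom described under Recovery.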
|
||||||
|
|
|
||||||