# Observer Runtime

The Observer Runtime is a lightweight agent responsible for synthesizing the operational world state of the homelab from raw events, logs, and state files.

## Architecture

The observer follows a filesystem-first approach, consuming append-only events and generating a normalized world model. It is designed to be idempotent, resumable, and resilient to intermittent node connectivity.

### Inputs

- `/opt/homelab/events/`: Normalized JSON events (one `.json` file per event, organized by date and node).
- `/opt/homelab/state/observer_checkpoint.json`: Per-node checkpoint dict (see below).
- Repository inventory: `inventory/topology.yaml` and `hosts/*/services.yaml`.

### World Model Output

Generated under `/opt/homelab/world/`:

- `nodes.json`: Current node availability, roles, disk/memory pressure, last seen timestamps. Dict keyed by node name.
- `services.json`: Service health status and links to active incidents. Dict keyed by `"node/service"`.
- `deployments.json`: Tracking of active and historical deployment runs by `correlation_id`.
- `incidents.json`: Correlated operational issues, including repeat failures and resolution status.
- `runtime-summary.json`: High-level overview for dashboards and planner agents.

## Checkpoint Format

The observer tracks per-node progress to avoid silently skipping event directories:

```json
{
  "node_checkpoints": {
    "vps": "/opt/homelab/events/2026-05-27/vps/evt-vps-1234.json",
    "piha": "/opt/homelab/events/2026-05-27/piha/evt-piha-5678.json",
    "chelsty-infra": "/opt/homelab/events/2026-05-27/chelsty-infra/evt-chelsty-infra-9012.json"
  }
}
```

This per-node dict replaces the earlier single global checkpoint (`last_processed_file`). The old approach compared every event path against one value, so it silently skipped new events in any node directory that sorts alphabetically before the last-seen node (e.g. `piha/` was skipped whenever the checkpoint pointed into `vps/`).

**Reset:** Delete `/opt/homelab/state/observer_checkpoint.json`. The observer will then reprocess all events and rebuild world state from scratch.

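
The difference between the two checkpoint schemes can be illustrated with a minimal sketch. It assumes the `events/<date>/<node>/<file>.json` layout described under Inputs and assumes that event paths are compared as plain strings; the function names `pending_global` and `pending_per_node` are hypothetical and are not taken from the observer's code.

```python
from pathlib import PurePosixPath


def node_of(event_path: str) -> str:
    # Assumed layout: .../events/<date>/<node>/<file>.json
    return PurePosixPath(event_path).parent.name


def pending_global(paths, last_processed_file):
    """Old scheme: one checkpoint is compared against every event path."""
    return [p for p in sorted(paths) if p > last_processed_file]


def pending_per_node(paths, node_checkpoints):
    """New scheme: each path is compared only with its own node's checkpoint."""
    return [p for p in sorted(paths) if p > node_checkpoints.get(node_of(p), "")]


root = "/opt/homelab/events/2026-05-27"
paths = [
    f"{root}/piha/evt-piha-5679.json",  # new, not yet processed
    f"{root}/vps/evt-vps-1234.json",    # already processed
]

# The global checkpoint points into vps/, so the new piha/ event is lost.
global_pending = pending_global(paths, f"{root}/vps/evt-vps-1234.json")

# Per-node checkpoints keep piha/ progress independent of vps/.
per_node_pending = pending_per_node(
    paths,
    {
        "piha": f"{root}/piha/evt-piha-5678.json",
        "vps": f"{root}/vps/evt-vps-1234.json",
    },
)

print(global_pending)    # []
print(per_node_pending)  # ['/opt/homelab/events/2026-05-27/piha/evt-piha-5679.json']
```

A node with no checkpoint entry falls back to the empty string here, so all of its events count as pending, which matches the reset behavior described above.
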
## Event Types

### Negative events (create/escalate incidents)

- `service_unhealthy`, `healthcheck_failed`: open or increment an active incident
- `deployment_failed`: record failure in `deployments.json`

### Positive events (resolve state)

- `service_healthy`: marks service status as `healthy` **and** resolves any active incident for that service
- `service_recovered`: alias of `service_healthy`, with the same effect
- `deployment_completed`: marks deployment as completed

### Node events

- `node_online`, `node_offline`: update node status in `nodes.json`
- `disk_pressure_*`: set `disk_pressure` field on the node record

## Incident Lifecycle

1. **Detection**: A `service_unhealthy` or `healthcheck_failed` event creates or increments an active incident.
2. **Correlation**: Multiple failure events for the same `node/service` are collapsed into one incident, incrementing `occurrence_count`.
3. **Resolution**: A `service_healthy` or `service_recovered` event resolves any active incident for that service, setting `status: resolved` and `resolved_at`.
4. **Expiry**: Resolved incidents older than 7 days are pruned from world state by `_prune_stale_world()`.

### Example Incident JSON

```json
{
  "inc-1715518800-vps-observer": {
    "id": "inc-1715518800-vps-observer",
    "node": "vps",
    "service": "observer",
    "status": "resolved",
    "severity": "error",
    "started_at": 1715518800.0,
    "last_occurrence": 1715518860.0,
    "occurrence_count": 2,
    "trigger_type": "containers_not_running",
    "resolved_at": 1715519100.0
  }
}
```

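
As a rough illustration of lifecycle steps 1 to 3, the sketch below replays three events and arrives at a record shaped like the example above. Several details are assumptions rather than documented behavior: the event field names (`type`, `node`, `service`, `timestamp`, `trigger_type`), the `"active"` status string, the fixed `"error"` severity, and the `inc-<epoch>-<node>-<service>` ID scheme, which is only inferred from the example. `apply_event` is a hypothetical name.

```python
NEGATIVE = {"service_unhealthy", "healthcheck_failed"}
POSITIVE = {"service_healthy", "service_recovered"}


def apply_event(incidents: dict, event: dict) -> None:
    """Mutate `incidents` in place according to a single event."""
    node, service, ts = event["node"], event["service"], event["timestamp"]
    # Find the active incident for this node/service, if there is one.
    active = next(
        (
            inc
            for inc in incidents.values()
            if inc["node"] == node
            and inc["service"] == service
            and inc["status"] == "active"
        ),
        None,
    )
    if event["type"] in NEGATIVE:
        if active is None:
            # Detection: open a new incident (ID scheme is an assumption).
            inc_id = f"inc-{int(ts)}-{node}-{service}"
            incidents[inc_id] = {
                "id": inc_id,
                "node": node,
                "service": service,
                "status": "active",
                "severity": "error",
                "started_at": ts,
                "last_occurrence": ts,
                "occurrence_count": 1,
                "trigger_type": event.get("trigger_type", event["type"]),
            }
        else:
            # Correlation: collapse repeat failures into the same incident.
            active["last_occurrence"] = ts
            active["occurrence_count"] += 1
    elif event["type"] in POSITIVE and active is not None:
        # Resolution: either positive event closes the incident.
        active["status"] = "resolved"
        active["resolved_at"] = ts


incidents = {}
for event in [
    {"type": "service_unhealthy", "node": "vps", "service": "observer",
     "timestamp": 1715518800.0, "trigger_type": "containers_not_running"},
    {"type": "healthcheck_failed", "node": "vps", "service": "observer",
     "timestamp": 1715518860.0},
    {"type": "service_recovered", "node": "vps", "service": "observer",
     "timestamp": 1715519100.0},
]:
    apply_event(incidents, event)

record = incidents["inc-1715518800-vps-observer"]
print(record["status"], record["occurrence_count"])  # resolved 2
```

A positive event with no matching active incident is a no-op in this sketch, so repeated `service_healthy` events do not create or modify incidents.
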
## World State Pruning

`_prune_stale_world()` runs on every reconcile cycle and removes:

1. **Stale nodes**: nodes not present in `inventory/topology.yaml` (e.g. ghost nodes created when `NODE_NAME` was unset and fell back to the container's 12-char hex ID).
2. **Services of stale nodes**: all `node/service` keys whose node was pruned.
3. **Ghost service keys**: service keys whose service-name portion matches the pattern `<12hexchars>_<name>` (Docker internal stale-state artifacts, created when node-agent used `c.name` instead of the compose label).
4. **Expired incidents**: resolved incidents older than 7 days.

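
The four rules above can be sketched as a pure function. This is not the actual `_prune_stale_world()` implementation: the function name `prune_stale_world`, the `world` dict shape, the regular expression chosen for `<12hexchars>_<name>`, and measuring incident age from `resolved_at` are all assumptions made for illustration.

```python
import re

# Assumed reading of "<12hexchars>_<name>": 12 lowercase hex chars, "_", a name.
GHOST_SERVICE = re.compile(r"^[0-9a-f]{12}_.+")
INCIDENT_TTL_SECONDS = 7 * 24 * 3600


def prune_stale_world(world: dict, known_nodes: set, now: float) -> dict:
    """Return a pruned copy of `world`; `known_nodes` comes from the topology."""
    # Rule 1: drop nodes that are not in the inventory.
    nodes = {n: v for n, v in world["nodes"].items() if n in known_nodes}

    services = {}
    for key, value in world["services"].items():
        node, _, service = key.partition("/")
        # Rule 2: drop services of pruned nodes. Rule 3: drop ghost keys.
        if node in nodes and not GHOST_SERVICE.match(service):
            services[key] = value

    # Rule 4: drop resolved incidents older than the TTL.
    incidents = {
        k: v
        for k, v in world["incidents"].items()
        if not (
            v["status"] == "resolved"
            and now - v["resolved_at"] > INCIDENT_TTL_SECONDS
        )
    }
    return {"nodes": nodes, "services": services, "incidents": incidents}


world = {
    "nodes": {"vps": {}, "3f2a9c1b7e4d": {}},
    "services": {
        "vps/observer": {},
        "3f2a9c1b7e4d/observer": {},
        "vps/0a1b2c3d4e5f_mosquitto": {},
    },
    "incidents": {
        "old": {"status": "resolved", "resolved_at": 0.0},
        "open": {"status": "active"},
    },
}
pruned = prune_stale_world(world, known_nodes={"vps"}, now=8 * 24 * 3600.0)
print(sorted(pruned["services"]))  # ['vps/observer']
```

Returning a new dict rather than mutating in place keeps the sketch easy to test; the real function may well mutate world state directly.
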
## Runtime Behavior

### Idempotency

Events are append-only and the observer processes them in order, so deleting the checkpoint and restarting replays all events and produces the same world state.

### Deployment Tracking

Deployments are tracked via `correlation_id`. The observer synthesizes the start, end, and status of each deployment run from events.

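
A minimal sketch of this synthesis is shown below. Only `deployment_completed` and `deployment_failed` are event types named in this document; the `deployment_started` type, the event field names, the `"running"` status, and the shape of each run record are assumptions, and `track_deployments` is a hypothetical name.

```python
# Terminal event types named in this document, mapped to a run status.
TERMINAL = {"deployment_completed": "completed", "deployment_failed": "failed"}


def track_deployments(events):
    """Fold deployment events into one record per correlation_id."""
    deployments = {}
    for event in sorted(events, key=lambda e: e["timestamp"]):
        if not event["type"].startswith("deployment_"):
            continue  # not a deployment event
        # The first event seen for a correlation_id marks the start of the run.
        run = deployments.setdefault(
            event["correlation_id"],
            {"started_at": event["timestamp"], "ended_at": None, "status": "running"},
        )
        if event["type"] in TERMINAL:
            run["status"] = TERMINAL[event["type"]]
            run["ended_at"] = event["timestamp"]
    return deployments


events = [
    {"type": "deployment_started", "correlation_id": "dep-1", "timestamp": 100.0},
    {"type": "deployment_completed", "correlation_id": "dep-1", "timestamp": 160.0},
    {"type": "deployment_started", "correlation_id": "dep-2", "timestamp": 200.0},
    {"type": "deployment_failed", "correlation_id": "dep-2", "timestamp": 230.0},
    {"type": "service_healthy", "node": "vps", "service": "observer", "timestamp": 240.0},
]
deployments = track_deployments(events)
print(deployments["dep-1"]["status"], deployments["dep-2"]["status"])  # completed failed
```

Because the result depends only on the sorted event list, replaying the same events yields the same records, in line with the idempotency property described above.
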
### Topology Filtering

World-state entries produced by events from nodes not listed in `inventory/topology.yaml` are removed by `_prune_stale_world()`. This prevents transient bootstrap noise from polluting world state.