# Observer Runtime

The Observer Runtime is a lightweight agent responsible for synthesizing the operational world state of the homelab from raw events, logs, and state files.

## Architecture

The observer follows a filesystem-first approach, consuming append-only events and generating a normalized world model. It is designed to be idempotent, resumable, and resilient to intermittent node connectivity.

### Inputs

- `/opt/homelab/events/`: Normalized JSON events (one `.json` file per event, organized by date and node).
- `/opt/homelab/state/observer_checkpoint.json`: Per-node checkpoint dict (see below).
- Repository Inventory: `inventory/topology.yaml` and `hosts/*/services.yaml`.

### World Model Output

Generated under `/opt/homelab/world/`:

- `nodes.json`: Current node availability, roles, disk/memory pressure, last seen timestamps. Dict keyed by node name.
- `services.json`: Service health status and links to active incidents. Dict keyed by `"node/service"`.
- `deployments.json`: Tracking of active and historical deployment runs by `correlation_id`.
- `incidents.json`: Correlated operational issues, including repeat failures and resolution status.
- `runtime-summary.json`: High-level overview for dashboards and planner agents.

## Checkpoint Format

The observer tracks per-node progress to avoid silently skipping event directories:

```json
{
  "node_checkpoints": {
    "vps": "/opt/homelab/events/2026-05-27/vps/evt-vps-1234.json",
    "piha": "/opt/homelab/events/2026-05-27/piha/evt-piha-5678.json",
    "chelsty-infra": "/opt/homelab/events/2026-05-27/chelsty-infra/evt-chelsty-infra-9012.json"
  }
}
```

A single global checkpoint (`last_processed_file`) was replaced with this per-node dict because the old approach silently skipped any node directory that sorts alphabetically before the last-seen node (e.g. `piha/` would be skipped when the checkpoint pointed to `vps/`).

**Reset:** Delete `/opt/homelab/state/observer_checkpoint.json`. The observer will reprocess all events and rebuild world state from scratch.

## Event Types

### Negative events (create/escalate incidents)

- `service_unhealthy`, `healthcheck_failed`: open or increment an active incident
- `deployment_failed`: record failure in `deployments.json`

### Positive events (resolve state)

- `service_healthy`: marks service status as `healthy` **and** resolves any active incident for that service
- `service_recovered`: alias, same effect
- `deployment_completed`: marks deployment as completed

### Node events

- `node_online`, `node_offline`: update node status in `nodes.json`
- `disk_pressure_*`: set `disk_pressure` field on the node record

## Incident Lifecycle

1. **Detection**: A `service_unhealthy` or `healthcheck_failed` event creates or increments an active incident.
2. **Correlation**: Multiple failure events for the same `node/service` are collapsed into one incident, incrementing `occurrence_count`.
3. **Resolution**: A `service_healthy` or `service_recovered` event resolves any active incident for that service, setting `status: resolved` and `resolved_at`.
4. **Expiry**: Resolved incidents older than 7 days are pruned from world state by `_prune_stale_world()`.

### Example Incident JSON

```json
{
  "inc-1715518800-vps-observer": {
    "id": "inc-1715518800-vps-observer",
    "node": "vps",
    "service": "observer",
    "status": "resolved",
    "severity": "error",
    "started_at": 1715518800.0,
    "last_occurrence": 1715518860.0,
    "occurrence_count": 2,
    "trigger_type": "containers_not_running",
    "resolved_at": 1715519100.0
  }
}
```

## World State Pruning

`_prune_stale_world()` runs every reconcile cycle and removes:

1. **Stale nodes**: nodes not present in `inventory/topology.yaml` (e.g. ghost nodes created when `NODE_NAME` was unset and fell back to the container's 12-char hex ID).
2. **Services of stale nodes**: all `node/service` keys whose node was pruned.
3. **Ghost service keys**: service keys whose service-name portion matches the pattern `<12hexchars>_` (Docker internal stale-state artifacts, created when node-agent used `c.name` instead of the compose label).
4. **Expired incidents**: resolved incidents older than 7 days.

## Runtime Behavior

### Idempotency

The observer processes events in order. Deleting the checkpoint and restarting replays all events and produces the same world state.

### Deployment Tracking

Deployments are tracked via `correlation_id`. The observer synthesizes the start, end, and status of each deployment run from events.

### Topology Filtering

Events from nodes not listed in `inventory/topology.yaml` are discarded during pruning. This prevents transient bootstrap noise from polluting world state.
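
The four pruning rules listed under World State Pruning, which also carry out the topology filtering described above, can be sketched roughly as follows. This is a minimal illustration, not the observer's actual `_prune_stale_world()`: the function name `prune_stale_world`, its parameters, and the `known_nodes` set (assumed to be the node names already parsed from `inventory/topology.yaml`) are hypothetical. It also assumes that the ghost pattern is a lowercase-hex prefix of the service name and that incident age is measured from `resolved_at`.

```python
import re
import time

# 7-day retention for resolved incidents, as stated in the Incident Lifecycle.
INCIDENT_TTL_SECONDS = 7 * 24 * 60 * 60

# Assumption: "<12hexchars>_" is a lowercase-hex prefix of the service name.
GHOST_SERVICE_RE = re.compile(r"^[0-9a-f]{12}_")


def prune_stale_world(nodes, services, incidents, known_nodes, now=None):
    """Return pruned copies of the node, service, and incident dicts.

    Hypothetical helper: `known_nodes` is the set of node names read from
    inventory/topology.yaml by the caller.
    """
    now = time.time() if now is None else now

    # Rule 1: drop nodes that are not in the topology inventory.
    kept_nodes = {name: rec for name, rec in nodes.items() if name in known_nodes}

    kept_services = {}
    for key, rec in services.items():
        # Service keys have the form "node/service".
        node, _, service = key.partition("/")
        # Rule 2: drop services that belong to a pruned (unknown) node.
        if node not in known_nodes:
            continue
        # Rule 3: drop ghost keys such as "vps/0123456789ab_web".
        if GHOST_SERVICE_RE.match(service):
            continue
        kept_services[key] = rec

    kept_incidents = {}
    for inc_id, inc in incidents.items():
        resolved_at = inc.get("resolved_at")
        # Rule 4: only resolved incidents can expire.
        # Assumption: age is counted from resolved_at.
        expired = (
            inc.get("status") == "resolved"
            and resolved_at is not None
            and now - resolved_at > INCIDENT_TTL_SECONDS
        )
        if not expired:
            kept_incidents[inc_id] = inc

    return kept_nodes, kept_services, kept_incidents
```

Returning new dicts instead of mutating the inputs keeps the sketch easy to test and makes repeated runs trivially idempotent: pruning an already pruned world state returns it unchanged.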