The Observer Runtime is a lightweight agent responsible for synthesizing the operational world state of the homelab from raw events, logs, and state files.
## Architecture
The observer follows a filesystem-first approach, consuming append-only events and generating a normalized world model. It is designed to be idempotent, resumable, and resilient to intermittent node connectivity.
A single global checkpoint (`last_processed_file`) was replaced with this per-node dict because the old approach silently skipped any node directory that sorts alphabetically before the last-seen node (e.g. `piha/` would be skipped when the checkpoint pointed to `vps/`).
**Reset:** Delete `/opt/homelab/state/observer_checkpoint.json`. The observer will reprocess all events and rebuild world state from scratch.
-`service_unhealthy`, `healthcheck_failed` — open or increment an active incident
-`deployment_failed` — record failure in deployments.json
### Positive events (resolve state)
-`service_healthy` — marks service status as `healthy`**and** resolves any active incident for that service
-`service_recovered` — alias, same effect
-`deployment_completed` — marks deployment as completed
### Node events
-`node_online`, `node_offline` — update node status in nodes.json
-`disk_pressure_*` — set `disk_pressure` field on the node record
## Incident Lifecycle
1.**Detection**: A `service_unhealthy` or `healthcheck_failed` event creates or increments an active incident.
2.**Correlation**: Multiple failure events for the same `node/service` are collapsed into one incident, incrementing `occurrence_count`.
3.**Resolution**: A `service_healthy` or `service_recovered` event resolves any active incident for that service, setting `status: resolved` and `resolved_at`.
4.**Expiry**: Resolved incidents older than 7 days are pruned from world state by `_prune_stale_world()`.
`_prune_stale_world()` runs every reconcile cycle and removes:
1.**Stale nodes** — nodes not present in `inventory/topology.yaml` (e.g. ghost nodes created when `NODE_NAME` was unset and fell back to the container's 12-char hex ID).
2.**Services of stale nodes** — all `node/service` keys whose node was pruned.
3.**Ghost service keys** — service keys whose service-name portion matches the pattern `<12hexchars>_<name>` (Docker internal stale-state artifacts, created when node-agent used `c.name` instead of the compose label).
4.**Expired incidents** — resolved incidents older than 7 days.
Deployments are tracked via `correlation_id`. The observer synthesizes the start, end, and status of each deployment run from events.
### Topology Filtering
Events from nodes not listed in `inventory/topology.yaml` are discarded during pruning. This prevents transient bootstrap noise from polluting world state.