Observer Runtime

The Observer Runtime is a lightweight agent responsible for synthesizing the operational world state of the homelab from raw events, logs, and state files.

Architecture

The observer follows a filesystem-first approach, consuming append-only events and generating a normalized world model. It is designed to be idempotent, resumable, and resilient to intermittent node connectivity.

Inputs

  • /opt/homelab/events/: Normalized JSON events.
  • /opt/homelab/state/: Deployment stage markers and the observer's internal checkpoint.
  • /opt/homelab/logs/: Detailed execution logs and diagnostics.
  • Repository Inventory: inventory/topology.yaml and hosts/*/services.yaml.

World Model Output

Generated under /opt/homelab/world/:

  • nodes.json: Current node availability, roles, and last seen timestamps.
  • services.json: Service health status and links to active incidents.
  • deployments.json: Tracking of active and historical deployment runs by correlation_id.
  • incidents.json: Correlated operational issues, including repeat failures and resolution status.
  • runtime-summary.json: High-level overview for dashboards and planner agents.

Incident Lifecycle

The observer implements lightweight incident correlation:

  1. Detection: When a service_unhealthy or healthcheck_failed event is consumed, a new incident is created or an existing active incident for that service is updated.
  2. Correlation: Multiple failure events for the same service on the same node are collapsed into a single incident, tracking the occurrence_count.
  3. Diagnostics: When a deployment_failed event is consumed, the observer attaches references to diagnostic files if the event payload includes them.
  4. Resolution: A service_recovered event transitions every active incident for that service to the resolved state.
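
The detection, correlation, and resolution steps above can be sketched as a small fold over events. This is a minimal illustration, not the observer's actual code: the event field names (type, node, service, timestamp), the "active" status value, and the incident ID scheme are assumptions, while the incident fields mirror the example JSON below. The diagnostics step is left out because the payload layout is not specified here.

```python
from typing import Any

# Event types named in the lifecycle description.
FAILURE_EVENTS = {"service_unhealthy", "healthcheck_failed"}

Incident = dict[str, Any]


def apply_event(incidents: dict[str, Incident], event: dict[str, Any]) -> None:
    """Fold one event into the incidents map, mutating it in place."""
    etype = event.get("type")          # assumed field name
    service = event.get("service")     # assumed field name
    ts = event.get("timestamp")        # assumed field name

    if etype in FAILURE_EVENTS:
        node = event.get("node")       # assumed field name
        # Correlation key: same service on the same node, still active.
        active = next(
            (inc for inc in incidents.values()
             if inc["status"] == "active"
             and inc["service"] == service
             and inc["node"] == node),
            None,
        )
        if active is None:
            # Detection: open a new incident (hypothetical ID scheme).
            inc_id = f"inc-{ts}-{node}-{service}"
            incidents[inc_id] = {
                "id": inc_id,
                "node": node,
                "service": service,
                "status": "active",
                "severity": "error",
                "started_at": ts,
                "last_occurrence": ts,
                "occurrence_count": 1,
                "events": [ts],
                "correlation_id": event.get("correlation_id"),
            }
        else:
            # Correlation: collapse the repeat failure into the open incident.
            active["last_occurrence"] = ts
            active["occurrence_count"] += 1
            active["events"].append(ts)
    elif etype == "service_recovered":
        # Resolution: matches on service name only, as the text states.
        for inc in incidents.values():
            if inc["status"] == "active" and inc["service"] == service:
                inc["status"] = "resolved"
                inc["resolved_at"] = ts
```

Feeding two failure events and one recovery event for mosquitto on saturn through apply_event yields a single resolved incident with an occurrence_count of 2, the same shape as the example that follows.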

Example Incident JSON

{
  "inc-1715518800-saturn-mosquitto": {
    "id": "inc-1715518800-saturn-mosquitto",
    "node": "saturn",
    "service": "mosquitto",
    "status": "resolved",
    "severity": "error",
    "started_at": "2026-05-12T12:05:00Z",
    "last_occurrence": "2026-05-12T12:06:00Z",
    "occurrence_count": 2,
    "events": [
      "2026-05-12T12:05:00Z",
      "2026-05-12T12:06:00Z"
    ],
    "correlation_id": "hc-1",
    "resolved_at": "2026-05-12T12:10:00Z"
  }
}

Runtime Behavior

Idempotency

The observer processes events in order, so replaying the same events produces the same world state. If the world state is lost, delete the checkpoint file (/opt/homelab/state/observer_checkpoint.json); on its next run the observer re-processes all events and rebuilds the world state.

Resumability

The observer tracks the last processed event file in its checkpoint. Upon restart, it continues from the next available event.
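
A checkpoint-driven loop along these lines would give both behaviors described above: resuming after a restart, and a full rebuild when the checkpoint is deleted. This is a sketch under stated assumptions: the checkpoint key last_event_file, the *.json event file pattern, and lexicographic file names matching event order are all hypothetical, as are the helper names load_checkpoint and process_pending.

```python
import json
from pathlib import Path
from typing import Callable, Optional


def load_checkpoint(checkpoint_path: Path) -> Optional[str]:
    """Return the last processed event file name, or None without a checkpoint."""
    if not checkpoint_path.exists():
        return None
    # "last_event_file" is an assumed key, not the documented schema.
    return json.loads(checkpoint_path.read_text()).get("last_event_file")


def process_pending(
    events_dir: Path,
    checkpoint_path: Path,
    handle: Callable[[dict], None],
) -> int:
    """Process every event file after the checkpoint; return how many were handled."""
    last = load_checkpoint(checkpoint_path)
    processed = 0
    # Assumes file names sort in event order (for example, timestamp prefixes).
    for path in sorted(events_dir.glob("*.json")):
        if last is not None and path.name <= last:
            continue  # already processed in an earlier run
        handle(json.loads(path.read_text()))
        # Write to a temp file, then rename, so a crash never leaves a
        # half-written checkpoint behind.
        tmp = checkpoint_path.with_suffix(".tmp")
        tmp.write_text(json.dumps({"last_event_file": path.name}))
        tmp.replace(checkpoint_path)
        processed += 1
    return processed
```

Advancing the checkpoint after each event means a crash mid-run repeats at most one event, which is harmless as long as the handler is idempotent.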

Deployment Tracking

Deployments are tracked via correlation_id. The observer synthesizes the start, end, and status of each deployment run, providing a clear history of changes to the environment.
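
As an illustration of how a deployment run could be synthesized from events, the sketch below groups events by correlation_id. Only deployment_failed is named in this document; the deployment_started and deployment_succeeded event types, the timestamp field, the status values, and the record layout are assumptions made for the example.

```python
from typing import Any

# Only "deployment_failed" appears in this document; the other event
# names and all status values here are hypothetical.
TERMINAL_STATUS = {
    "deployment_succeeded": "succeeded",
    "deployment_failed": "failed",
}


def track_deployment(
    deployments: dict[str, dict[str, Any]],
    event: dict[str, Any],
) -> None:
    """Update the per-run record for the event's correlation_id, in place."""
    cid = event.get("correlation_id")
    etype = event.get("type")
    if cid is None or (etype != "deployment_started" and etype not in TERMINAL_STATUS):
        return  # not a deployment lifecycle event

    # One record per correlation_id, created on first sight.
    run = deployments.setdefault(
        cid,
        {"correlation_id": cid, "status": "unknown",
         "started_at": None, "ended_at": None},
    )
    ts = event.get("timestamp")  # assumed field name
    if etype == "deployment_started":
        run["started_at"] = ts
        run["status"] = "running"
    else:
        run["ended_at"] = ts
        run["status"] = TERMINAL_STATUS[etype]
```

A run whose record still has status "running" counts as active; once a terminal event arrives, the record stays in the map as history.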