homelab-codex-ws/docs/observer-runtime.md
Oskar Kapala 603e10a364 docs: session summary 2026-05-27 + update observer/control-plane/chelsty docs
docs/observer-runtime.md:
- Document per-node checkpoint format (replaces old global checkpoint)
- Add service_healthy / service_recovered resolution behavior
- Document ghost key pruning (_prune_stale_world patterns)
- Add event type reference table (negative vs positive)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 16:18:31 +02:00


Observer Runtime

The Observer Runtime is a lightweight agent responsible for synthesizing the operational world state of the homelab from raw events, logs, and state files.

Architecture

The observer follows a filesystem-first approach, consuming append-only events and generating a normalized world model. It is designed to be idempotent, resumable, and resilient to intermittent node connectivity.

Inputs

  • /opt/homelab/events/: Normalized JSON events (one .json file per event, organized by date and node).
  • /opt/homelab/state/observer_checkpoint.json: Per-node checkpoint dict (see below).
  • Repository Inventory: inventory/topology.yaml and hosts/*/services.yaml.

World Model Output

Generated under /opt/homelab/world/:

  • nodes.json: Current node availability, roles, disk/memory pressure, last seen timestamps. Dict keyed by node name.
  • services.json: Service health status and links to active incidents. Dict keyed by "node/service".
  • deployments.json: Tracking of active and historical deployment runs by correlation_id.
  • incidents.json: Correlated operational issues, including repeat failures and resolution status.
  • runtime-summary.json: High-level overview for dashboards and planner agents.

Checkpoint Format

The observer tracks per-node progress to avoid silently skipping event directories:

{
  "node_checkpoints": {
    "vps":            "/opt/homelab/events/2026-05-27/vps/evt-vps-1234.json",
    "piha":           "/opt/homelab/events/2026-05-27/piha/evt-piha-5678.json",
    "chelsty-infra":  "/opt/homelab/events/2026-05-27/chelsty-infra/evt-chelsty-infra-9012.json"
  }
}

This per-node dict replaced a single global checkpoint (last_processed_file). The global checkpoint silently skipped any node directory that sorts alphabetically before the directory of the last-processed node. For example, once the checkpoint pointed into vps/, new events under piha/ were skipped.
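The difference between the two checkpoint styles can be illustrated with a minimal sketch. It is not the observer's actual code: the function names are hypothetical, and it assumes that event paths are compared as plain strings and that paths within one node directory sort in processing order.

```python
from pathlib import PurePosixPath


def pending_events_global(event_paths, last_processed_file):
    """Replaced behaviour (as assumed here): one checkpoint for all nodes."""
    return [p for p in sorted(event_paths) if p > last_processed_file]


def pending_events(event_paths, node_checkpoints):
    """Per-node behaviour: each path is compared with its own node's checkpoint.

    Assumes the <root>/<date>/<node>/<file>.json layout shown in the
    checkpoint example, so the node is the name of the parent directory.
    """
    pending = []
    for path in sorted(event_paths):
        node = PurePosixPath(path).parent.name
        last_done = node_checkpoints.get(node)  # None: nothing processed yet
        if last_done is None or path > last_done:
            pending.append(path)
    return pending


events = [
    "/opt/homelab/events/2026-05-27/piha/evt-piha-0002.json",
    "/opt/homelab/events/2026-05-27/vps/evt-vps-0001.json",
]
checkpoint_path = "/opt/homelab/events/2026-05-27/vps/evt-vps-0001.json"

# Global checkpoint: "piha/" sorts before "vps/", so the piha event is lost.
print(pending_events_global(events, checkpoint_path))
# Per-node checkpoint: piha has no entry yet, so its event is still pending.
print(pending_events(events, {"vps": checkpoint_path}))
```

With the global checkpoint nothing is pending, although the piha event was never processed. With the per-node dict the piha event is returned.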

Reset: Delete /opt/homelab/state/observer_checkpoint.json. The observer will reprocess all events and rebuild world state from scratch.

Event Types

Negative events (create/escalate incidents)

  • service_unhealthy, healthcheck_failed — open or increment an active incident
  • deployment_failed — record failure in deployments.json

Positive events (resolve state)

  • service_healthy — marks service status as healthy and resolves any active incident for that service
  • service_recovered — alias, same effect
  • deployment_completed — marks deployment as completed

Node events

  • node_online, node_offline — update node status in nodes.json
  • disk_pressure_* — set disk_pressure field on the node record
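
The three groups above could be expressed as a small lookup. This is a sketch only: the classify helper and the "unknown" fallback are assumptions for illustration, and disk_pressure_high is a made-up suffix used to exercise the disk_pressure_* wildcard.

```python
from fnmatch import fnmatchcase

# Event type names are taken from the lists above.
NEGATIVE = {"service_unhealthy", "healthcheck_failed", "deployment_failed"}
POSITIVE = {"service_healthy", "service_recovered", "deployment_completed"}
NODE_PATTERNS = ("node_online", "node_offline", "disk_pressure_*")


def classify(event_type):
    """Return "negative", "positive", "node", or "unknown" for an event type."""
    if event_type in NEGATIVE:
        return "negative"
    if event_type in POSITIVE:
        return "positive"
    if any(fnmatchcase(event_type, pattern) for pattern in NODE_PATTERNS):
        return "node"
    return "unknown"  # hypothetical fallback; the source does not define one


for name in ("service_recovered", "healthcheck_failed", "disk_pressure_high"):
    print(name, "->", classify(name))
```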

Incident Lifecycle

  1. Detection: A service_unhealthy or healthcheck_failed event creates or increments an active incident.
  2. Correlation: Multiple failure events for the same node/service are collapsed into one incident, incrementing occurrence_count.
  3. Resolution: A service_healthy or service_recovered event resolves any active incident for that service, setting status: resolved and resolved_at.
  4. Expiry: Resolved incidents older than 7 days are pruned from world state by _prune_stale_world().
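
Steps 1 to 3 can be sketched as a fold over events. The event field names (type, node, service, timestamp), the "active" status value, and the apply_event helper are assumptions made for this illustration. The incident ID format follows the example incident in this document.

```python
FAILURE_TYPES = ("service_unhealthy", "healthcheck_failed")
RECOVERY_TYPES = ("service_healthy", "service_recovered")


def apply_event(incidents, event):
    """Fold one event into the incidents dict (mutated in place)."""
    node, service, ts = event["node"], event["service"], event["timestamp"]
    # Find the active incident for this node/service pair, if there is one.
    active = next(
        (
            inc
            for inc in incidents.values()
            if inc["node"] == node
            and inc["service"] == service
            and inc["status"] == "active"
        ),
        None,
    )
    if event["type"] in FAILURE_TYPES:
        if active is None:
            # Detection: open a new incident.
            inc_id = f"inc-{int(ts)}-{node}-{service}"
            incidents[inc_id] = {
                "id": inc_id,
                "node": node,
                "service": service,
                "status": "active",
                "started_at": ts,
                "last_occurrence": ts,
                "occurrence_count": 1,
            }
        else:
            # Correlation: collapse repeat failures into the open incident.
            active["last_occurrence"] = ts
            active["occurrence_count"] += 1
    elif event["type"] in RECOVERY_TYPES and active is not None:
        # Resolution: close the open incident.
        active["status"] = "resolved"
        active["resolved_at"] = ts
    return incidents


incidents = {}
for event_type, ts in [
    ("service_unhealthy", 1715518800.0),
    ("healthcheck_failed", 1715518860.0),
    ("service_healthy", 1715519100.0),
]:
    apply_event(
        incidents,
        {"type": event_type, "node": "vps", "service": "observer", "timestamp": ts},
    )
```

After this replay, incidents holds a single resolved entry with occurrence_count 2. Fields such as severity and trigger_type are left out because the source does not say how they are derived.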

Example Incident JSON

{
  "inc-1715518800-vps-observer": {
    "id": "inc-1715518800-vps-observer",
    "node": "vps",
    "service": "observer",
    "status": "resolved",
    "severity": "error",
    "started_at": 1715518800.0,
    "last_occurrence": 1715518860.0,
    "occurrence_count": 2,
    "trigger_type": "containers_not_running",
    "resolved_at": 1715519100.0
  }
}

World State Pruning

_prune_stale_world() runs every reconcile cycle and removes:

  1. Stale nodes — nodes not present in inventory/topology.yaml (e.g. ghost nodes created when NODE_NAME was unset and fell back to the container's 12-char hex ID).
  2. Services of stale nodes — all node/service keys whose node was pruned.
  3. Ghost service keys — service keys whose service-name portion matches the pattern <12hexchars>_<name> (Docker internal stale-state artifacts, created when node-agent used c.name instead of the compose label).
  4. Expired incidents — resolved incidents older than 7 days.
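
The four rules can be sketched as follows. This is an illustration, not the real _prune_stale_world(): the function signature, the exact regular expression (lowercase hex is assumed), and the use of resolved_at as the age reference for expiry are all assumptions.

```python
import re

# Assumed reading of "<12hexchars>_<name>": 12 lowercase hex digits, "_", a name.
GHOST_SERVICE = re.compile(r"^[0-9a-f]{12}_.+")
INCIDENT_TTL_SECONDS = 7 * 24 * 3600


def prune_stale_world(nodes, services, incidents, topology_nodes, now):
    """Return pruned copies of the three world-state dicts."""
    # Rule 1: drop nodes that are not in the topology.
    kept_nodes = {name: rec for name, rec in nodes.items() if name in topology_nodes}

    kept_services = {}
    for key, rec in services.items():
        node, _, service_name = key.partition("/")  # keys look like "node/service"
        if node not in topology_nodes:
            continue  # Rule 2: service belongs to a stale node
        if GHOST_SERVICE.match(service_name):
            continue  # Rule 3: ghost service key
        kept_services[key] = rec

    # Rule 4: drop resolved incidents older than 7 days.
    kept_incidents = {
        inc_id: inc
        for inc_id, inc in incidents.items()
        if not (
            inc["status"] == "resolved"
            and now - inc["resolved_at"] > INCIDENT_TTL_SECONDS
        )
    }
    return kept_nodes, kept_services, kept_incidents


now = 2_000_000.0
nodes, services, incidents = prune_stale_world(
    nodes={"vps": {}, "a1b2c3d4e5f6": {}},
    services={
        "vps/observer": {},
        "a1b2c3d4e5f6/observer": {},
        "vps/0123456789ab_mosquitto": {},
    },
    incidents={
        "old": {"status": "resolved", "resolved_at": now - 8 * 24 * 3600},
        "recent": {"status": "resolved", "resolved_at": now - 24 * 3600},
        "open": {"status": "active"},
    },
    topology_nodes={"vps", "piha"},
    now=now,
)
```

In this run only the vps node, the vps/observer service, and the recent and open incidents survive.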

Runtime Behavior

Idempotency

The observer processes events in order. Deleting the checkpoint and restarting replays all events and produces the same world state.

Deployment Tracking

Deployments are tracked via correlation_id. The observer synthesizes the start, end, and status of each deployment run from events.
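
A minimal sketch of this synthesis is shown below. The event field names, the "running" status, the record layout, and the deployment_started event type in the sample data are assumptions; only deployment_completed and deployment_failed are named in this document.

```python
def track_deployments(events):
    """Group events by correlation_id and derive start, end, and status."""
    deployments = {}
    for event in sorted(events, key=lambda e: e["timestamp"]):
        cid = event.get("correlation_id")
        if cid is None:
            continue  # not part of a deployment run
        # The first event seen for a correlation_id marks the start of the run.
        run = deployments.setdefault(
            cid,
            {
                "correlation_id": cid,
                "status": "running",
                "started_at": event["timestamp"],
                "ended_at": None,
            },
        )
        if event["type"] == "deployment_completed":
            run.update(status="completed", ended_at=event["timestamp"])
        elif event["type"] == "deployment_failed":
            run.update(status="failed", ended_at=event["timestamp"])
    return deployments


sample = [
    {"type": "deployment_started", "correlation_id": "dep-1", "timestamp": 100.0},
    {"type": "deployment_completed", "correlation_id": "dep-1", "timestamp": 160.0},
    {"type": "deployment_started", "correlation_id": "dep-2", "timestamp": 200.0},
    {"type": "deployment_failed", "correlation_id": "dep-2", "timestamp": 230.0},
    {"type": "service_healthy", "timestamp": 240.0},
]
runs = track_deployments(sample)
```

Here dep-1 ends up completed and dep-2 failed, each with the timestamps of its first and last deployment event.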

Topology Filtering

World-state entries created by events from nodes not listed in inventory/topology.yaml are removed during pruning (the stale-node and stale-service rules under World State Pruning). This prevents transient bootstrap noise from polluting world state.