Observer Runtime
The Observer Runtime is a lightweight agent responsible for synthesizing the operational world state of the homelab from raw events, logs, and state files.
Architecture
The observer follows a filesystem-first approach, consuming append-only events and generating a normalized world model. It is designed to be idempotent, resumable, and resilient to intermittent node connectivity.
Inputs
- /opt/homelab/events/: Normalized JSON events (one .json file per event, organized by date and node).
- /opt/homelab/state/observer_checkpoint.json: Per-node checkpoint dict (see below).
- Repository Inventory: inventory/topology.yaml and hosts/*/services.yaml.
World Model Output
Generated under /opt/homelab/world/:
- nodes.json: Current node availability, roles, disk/memory pressure, last seen timestamps. Dict keyed by node name.
- services.json: Service health status and links to active incidents. Dict keyed by "node/service".
- deployments.json: Tracking of active and historical deployment runs by correlation_id.
- incidents.json: Correlated operational issues, including repeat failures and resolution status.
- runtime-summary.json: High-level overview for dashboards and planner agents.
Checkpoint Format
The observer tracks per-node progress to avoid silently skipping event directories:
{
  "node_checkpoints": {
    "vps": "/opt/homelab/events/2026-05-27/vps/evt-vps-1234.json",
    "piha": "/opt/homelab/events/2026-05-27/piha/evt-piha-5678.json",
    "chelsty-infra": "/opt/homelab/events/2026-05-27/chelsty-infra/evt-chelsty-infra-9012.json"
  }
}
A single global checkpoint (last_processed_file) was replaced with this per-node dict because the old approach silently skipped any node directory that sorts alphabetically before the last-seen node (e.g. piha/ would be skipped when the checkpoint pointed to vps/).
Reset: Delete /opt/homelab/state/observer_checkpoint.json. The observer will reprocess all events and rebuild world state from scratch.
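The per-node comparison described above can be sketched as follows. This is a minimal illustration, not the observer's actual code: the function name pending_events is hypothetical, the directory layout is inferred from the checkpoint paths shown above, and it assumes event file names sort lexicographically in processing order.

```python
from pathlib import Path


def pending_events(events_root: Path, node_checkpoints: dict[str, str]) -> list[Path]:
    """Return event files newer than each node's own checkpoint (hypothetical helper)."""
    pending: list[Path] = []
    # Assumed layout: <events_root>/<date>/<node>/<event>.json
    for path in sorted(events_root.glob("*/*/*.json")):
        node = path.parent.name
        last = node_checkpoints.get(node)
        # Compare only against this node's checkpoint. A single global
        # checkpoint pointing into vps/ would wrongly hide piha/ files here.
        if last is None or str(path) > last:
            pending.append(path)
    return pending
```

A node with no checkpoint entry has all of its events returned, which is also what happens for every node after the checkpoint file is deleted.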
Event Types
Negative events (create/escalate incidents)
- service_unhealthy, healthcheck_failed: open or increment an active incident
- deployment_failed: record failure in deployments.json
Positive events (resolve state)
- service_healthy: marks service status as healthy and resolves any active incident for that service
- service_recovered: alias of service_healthy, same effect
- deployment_completed: marks deployment as completed
Node events
- node_online, node_offline: update node status in nodes.json
- disk_pressure_*: set the disk_pressure field on the node record
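The three event groups above could be told apart with a small lookup like the one below. It is only a sketch: the classify function and the returned labels are hypothetical, and the wildcard in disk_pressure_* is assumed to behave like a shell-style glob.

```python
from fnmatch import fnmatchcase

# Event type names are taken from the lists above.
NEGATIVE = {"service_unhealthy", "healthcheck_failed", "deployment_failed"}
POSITIVE = {"service_healthy", "service_recovered", "deployment_completed"}
NODE_PATTERNS = ("node_online", "node_offline", "disk_pressure_*")


def classify(event_type: str) -> str:
    """Map an event type to 'negative', 'positive', 'node', or 'unknown'."""
    if event_type in NEGATIVE:
        return "negative"
    if event_type in POSITIVE:
        return "positive"
    # Glob matching covers the whole disk_pressure_* family.
    if any(fnmatchcase(event_type, pattern) for pattern in NODE_PATTERNS):
        return "node"
    return "unknown"
```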
Incident Lifecycle
- Detection: A service_unhealthy or healthcheck_failed event creates or increments an active incident.
- Correlation: Multiple failure events for the same node/service are collapsed into one incident, incrementing occurrence_count.
- Resolution: A service_healthy or service_recovered event resolves any active incident for that service, setting status: resolved and resolved_at.
- Expiry: Resolved incidents older than 7 days are pruned from world state by _prune_stale_world().
Example Incident JSON
{
  "inc-1715518800-vps-observer": {
    "id": "inc-1715518800-vps-observer",
    "node": "vps",
    "service": "observer",
    "status": "resolved",
    "severity": "error",
    "started_at": 1715518800.0,
    "last_occurrence": 1715518860.0,
    "occurrence_count": 2,
    "trigger_type": "containers_not_running",
    "resolved_at": 1715519100.0
  }
}
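The detection, correlation, and resolution steps can be sketched as a single update function. This is an illustration under assumptions: apply_event is a hypothetical name, the event fields node, service, type, and timestamp are assumed, the "active" status value is assumed, and the incident ID format is inferred from the example record. The severity and trigger_type fields are omitted.

```python
FAILURE_TYPES = {"service_unhealthy", "healthcheck_failed"}
RECOVERY_TYPES = {"service_healthy", "service_recovered"}


def apply_event(incidents: dict, event: dict) -> None:
    """Update the incidents dict in place for one service event (sketch)."""
    node, service, ts = event["node"], event["service"], event["timestamp"]
    # Find the currently active incident for this node/service, if any.
    active = next(
        (inc for inc in incidents.values()
         if inc["node"] == node and inc["service"] == service
         and inc["status"] == "active"),
        None,
    )
    if event["type"] in FAILURE_TYPES:
        if active is None:
            # Detection: open a new incident (ID format inferred from the example).
            inc_id = f"inc-{int(ts)}-{node}-{service}"
            incidents[inc_id] = {
                "id": inc_id, "node": node, "service": service,
                "status": "active", "started_at": ts,
                "last_occurrence": ts, "occurrence_count": 1,
            }
        else:
            # Correlation: collapse repeat failures into the same incident.
            active["last_occurrence"] = ts
            active["occurrence_count"] += 1
    elif event["type"] in RECOVERY_TYPES and active is not None:
        # Resolution: mark the incident resolved and record when.
        active["status"] = "resolved"
        active["resolved_at"] = ts
```

Feeding it two failures followed by one recovery for vps/observer yields a record with the same status, counts, and timestamps as the example above.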
World State Pruning
_prune_stale_world() runs every reconcile cycle and removes:
- Stale nodes: nodes not present in inventory/topology.yaml (e.g. ghost nodes created when NODE_NAME was unset and fell back to the container's 12-char hex ID).
- Services of stale nodes: all node/service keys whose node was pruned.
- Ghost service keys: service keys whose service-name portion matches the pattern <12hexchars>_<name> (Docker internal stale-state artifacts, created when node-agent used c.name instead of the compose label).
- Expired incidents: resolved incidents older than 7 days.
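The four pruning rules above might look roughly like this. It is a sketch, not the real _prune_stale_world(): the function name prune_world, the shape of the world dict, and the known_nodes argument are hypothetical. It also assumes the hex prefix is lowercase and that incident age is measured from resolved_at.

```python
import re

# Service-name portion of the form <12hexchars>_<name> (lowercase hex assumed).
GHOST_SERVICE = re.compile(r"^[0-9a-f]{12}_.+$")
RESOLVED_TTL_SECONDS = 7 * 24 * 3600


def prune_world(world: dict, known_nodes: set[str], now: float) -> dict:
    """Return a pruned copy of the world state (hypothetical helper)."""
    # Rule 1: drop nodes that are not in the topology inventory.
    nodes = {n: v for n, v in world["nodes"].items() if n in known_nodes}
    services = {}
    for key, value in world["services"].items():
        node, _, service = key.partition("/")
        # Rule 2: services of pruned nodes. Rule 3: ghost service keys.
        if node not in nodes or GHOST_SERVICE.match(service):
            continue
        services[key] = value
    # Rule 4: resolved incidents older than 7 days (age from resolved_at, assumed).
    incidents = {
        key: inc for key, inc in world["incidents"].items()
        if not (inc.get("status") == "resolved"
                and now - inc.get("resolved_at", now) > RESOLVED_TTL_SECONDS)
    }
    return {"nodes": nodes, "services": services, "incidents": incidents}
```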
Runtime Behavior
Idempotency
The observer processes events in order. Deleting the checkpoint and restarting replays all events and produces the same world state.
Deployment Tracking
Deployments are tracked via correlation_id. The observer synthesizes the start, end, and status of each deployment run from events.
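One way to picture this synthesis is the sketch below. It is not the observer's implementation: track_deployments is a hypothetical name, the event fields correlation_id, type, and timestamp are assumed, the start time is assumed to be the earliest event carrying that correlation_id, and the "running" and "failed" status values are assumptions.

```python
def track_deployments(events: list[dict]) -> dict[str, dict]:
    """Group events by correlation_id into deployment runs (sketch)."""
    deployments: dict[str, dict] = {}
    for event in events:
        cid = event.get("correlation_id")
        if cid is None:
            continue  # not part of any deployment run
        run = deployments.setdefault(
            cid,
            {"started_at": event["timestamp"], "ended_at": None, "status": "running"},
        )
        # Start is the earliest event seen for this run (assumption).
        run["started_at"] = min(run["started_at"], event["timestamp"])
        if event["type"] == "deployment_completed":
            run.update(status="completed", ended_at=event["timestamp"])
        elif event["type"] == "deployment_failed":
            run.update(status="failed", ended_at=event["timestamp"])
    return deployments
```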
Topology Filtering
Events from nodes not listed in inventory/topology.yaml are discarded during pruning. This prevents transient bootstrap noise from polluting world state.