# Stale State Semantics
In a local-first, filesystem-backed control plane, visibility depends on the freshness of the runtime state files.
## Detection
The Operator Control Plane monitors the modification time (mtime) of the `runtime-summary.json` file in `/opt/homelab/state/`.
- **Live**: The file was last modified 60 seconds ago or less.
- **Stale**: The file was last modified more than 60 seconds ago.
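The check above can be sketched as a small function. This is a minimal illustration, not the control plane's actual implementation: the function name `classify_state`, the constant names, and the choice to treat a missing file as stale are all assumptions made for the example.

```python
import time
from pathlib import Path

# Path and threshold taken from this document; the constant names are hypothetical.
STATE_FILE = Path("/opt/homelab/state/runtime-summary.json")
STALE_AFTER_SECONDS = 60


def classify_state(path=STATE_FILE, now=None, threshold=STALE_AFTER_SECONDS):
    """Return (status, age_seconds), where status is "live" or "stale".

    `now` is an epoch timestamp; it defaults to the current time and is
    a parameter only so the function can be exercised deterministically.
    """
    now = time.time() if now is None else now
    try:
        mtime = path.stat().st_mtime
    except FileNotFoundError:
        # Assumption: a missing state file is reported as stale with no age.
        return "stale", None
    age = now - mtime
    status = "live" if age <= threshold else "stale"
    return status, age
```

Calling `classify_state()` with no arguments checks the default path against the current clock. Note that the comparison relies on the clock of the machine doing the check agreeing with the clock that stamped the file.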
## UI Representation
When the state is detected as stale:
1. A **Critical Warning Banner** appears at the top of the console.
2. The exact time of the last successful update is displayed.
3. Health badges and metrics continue to show the last recorded state. Treat them with caution, since they may no longer reflect the current live state.
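As a rough illustration of the first two items, the banner text could be derived from the last update time as follows. The function name `stale_banner` and the exact wording and time format are hypothetical; the real console may render this differently.

```python
from datetime import datetime, timezone


def stale_banner(last_update_epoch, now_epoch):
    """Build a warning string showing when the state file was last updated.

    Both arguments are epoch timestamps in seconds. The message format
    is an assumption for illustration only.
    """
    last = datetime.fromtimestamp(last_update_epoch, tz=timezone.utc)
    age = int(now_epoch - last_update_epoch)
    minutes, seconds = divmod(age, 60)
    return (
        "CRITICAL: runtime state is stale. "
        f"Last successful update: {last:%Y-%m-%d %H:%M:%S} UTC "
        f"({minutes}m {seconds}s ago)."
    )
```

Showing both the absolute timestamp and the elapsed time lets an operator judge at a glance how far the displayed badges may have drifted from reality.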
## Causes of Staleness
- **Runtime Observer Failure**: The process responsible for writing state to the filesystem has crashed.
- **Node Isolation**: The node where the observer is running is offline or disconnected.
- **Filesystem Latency**: Slow or failing writes on the underlying storage layer (e.g., SD card degradation) delay or prevent state updates.
## Operator Response
1. Check the **Nodes** view to determine whether a specific observer node is offline.
2. Investigate the `homelab-codex-ws` runtime logs.
3. Manually verify critical services if the control plane remains stale for an extended period.