# Observer Runtime
The Observer Runtime is a lightweight agent responsible for synthesizing the operational world state of the homelab from raw events, logs, and state files.
## Architecture
The observer follows a filesystem-first approach, consuming append-only events and generating a normalized world model. It is designed to be idempotent, resumable, and resilient to intermittent node connectivity.
### Inputs
- `/opt/homelab/events/`: Normalized JSON events.
- `/opt/homelab/state/`: Deployment stage markers and the observer's internal checkpoint.
- `/opt/homelab/logs/`: Detailed execution logs and diagnostics.
- Repository Inventory: `inventory/topology.yaml` and `hosts/*/services.yaml`.
### World Model Output
Generated under `/opt/homelab/world/`:
- `nodes.json`: Current node availability, roles, and last seen timestamps.
- `services.json`: Service health status and links to active incidents.
- `deployments.json`: Tracking of active and historical deployment runs by `correlation_id`.
- `incidents.json`: Correlated operational issues, including repeat failures and resolution status.
- `runtime-summary.json`: High-level overview for dashboards and planner agents.
## Incident Lifecycle
The observer implements lightweight incident correlation:
1. **Detection**: When a `service_unhealthy` or `healthcheck_failed` event is consumed, a new incident is created or an existing active incident for that service is updated.
2. **Correlation**: Multiple failure events for the same service on the same node are collapsed into a single incident, tracking the `occurrence_count`.
3. **Diagnostics**: For deployment failures (`deployment_failed`), the observer attaches references to diagnostic files when the event payload includes them.
4. **Resolution**: A `service_recovered` event for a service will transition any active incidents for that service to a `resolved` state.
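
The detection, correlation, and resolution steps above can be sketched as a single fold over events. This is a minimal illustration, not the observer's actual code: the event field names (`type`, `node`, `service`, `timestamp`), the `"active"` status string, the ID scheme, and the choice to scope recovery to the same node are all assumptions made for the example.

```python
from datetime import datetime, timezone

# Event types named in this document that open or update an incident.
FAILURE_EVENTS = {"service_unhealthy", "healthcheck_failed"}


def _epoch(ts: str) -> int:
    """Convert an ISO-8601 UTC timestamp such as 2026-05-12T12:05:00Z to epoch seconds."""
    parsed = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")
    return int(parsed.replace(tzinfo=timezone.utc).timestamp())


def apply_event(incidents: dict, event: dict) -> dict:
    """Fold one event into the incidents mapping (mutates and returns it)."""
    node, service, ts = event["node"], event["service"], event["timestamp"]
    # Active incidents for this service on this node (node scoping is an assumption).
    active = [
        inc for inc in incidents.values()
        if inc["node"] == node
        and inc["service"] == service
        and inc["status"] == "active"
    ]
    if event["type"] in FAILURE_EVENTS:
        if active:
            # Correlation: collapse the repeat failure into the existing incident.
            inc = active[0]
            inc["occurrence_count"] += 1
            inc["last_occurrence"] = ts
            inc["events"].append(ts)
        else:
            # Detection: open a new incident (hypothetical ID scheme).
            inc_id = f"inc-{_epoch(ts)}-{node}-{service}"
            incidents[inc_id] = {
                "id": inc_id,
                "node": node,
                "service": service,
                "status": "active",
                "started_at": ts,
                "last_occurrence": ts,
                "occurrence_count": 1,
                "events": [ts],
            }
    elif event["type"] == "service_recovered":
        # Resolution: close every matching active incident.
        for inc in active:
            inc["status"] = "resolved"
            inc["resolved_at"] = ts
    return incidents
```

Feeding two failure events and one `service_recovered` event for the same node and service through `apply_event` leaves one resolved incident with an `occurrence_count` of 2. The sketch leaves out `severity`, `correlation_id`, and diagnostic references.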
### Example Incident JSON
```json
{
  "inc-1715518800-saturn-mosquitto": {
    "id": "inc-1715518800-saturn-mosquitto",
    "node": "saturn",
    "service": "mosquitto",
    "status": "resolved",
    "severity": "error",
    "started_at": "2026-05-12T12:05:00Z",
    "last_occurrence": "2026-05-12T12:06:00Z",
    "occurrence_count": 2,
    "events": [
      "2026-05-12T12:05:00Z",
      "2026-05-12T12:06:00Z"
    ],
    "correlation_id": "hc-1",
    "resolved_at": "2026-05-12T12:10:00Z"
  }
}
```
## Runtime Behavior
### Idempotency
The observer processes events in order. If the world state is lost, delete the checkpoint file (`/opt/homelab/state/observer_checkpoint.json`); on its next run the observer re-processes all events from the beginning and rebuilds the world state.
### Resumability
The observer tracks the last processed event file in its checkpoint. Upon restart, it continues from the next available event.
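
The checkpoint handling might look roughly like the sketch below. It assumes that event files are named so that lexicographic order matches processing order, and that the checkpoint stores a single hypothetical key, `last_event_file`. Neither detail is confirmed by this document. A missing checkpoint is treated as "start from the first event", which matches the rebuild behavior described under Idempotency.

```python
import json
import os
from pathlib import Path
from typing import List, Optional


def load_checkpoint(path: Path) -> Optional[str]:
    """Return the name of the last processed event file, or None if no checkpoint exists."""
    try:
        return json.loads(path.read_text())["last_event_file"]  # hypothetical key
    except FileNotFoundError:
        return None


def pending_events(events_dir: Path, last: Optional[str]) -> List[Path]:
    """List event files that sort after the checkpointed name (all of them if last is None)."""
    files = sorted(events_dir.glob("*.json"), key=lambda p: p.name)
    return [f for f in files if last is None or f.name > last]


def save_checkpoint(path: Path, event_file: Path) -> None:
    """Record event_file as processed, replacing the checkpoint in one step."""
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps({"last_event_file": event_file.name}))
    # os.replace swaps the file in one step, so a crash never leaves a half-written checkpoint.
    os.replace(tmp, path)
```

A processing loop would call `save_checkpoint` after each event has been applied to the world model, so that a restart resumes with the next unprocessed file.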
### Deployment Tracking
Deployments are tracked via `correlation_id`. The observer synthesizes the start, end, and status of each deployment run, providing a clear history of changes to the environment.
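
As a rough sketch of that synthesis, the function below groups deployment events by `correlation_id` and derives a start time, end time, and status for each run. Only `deployment_failed` is an event type named in this document. `deployment_started`, `deployment_succeeded`, the `type` and `timestamp` field names, and the status strings are hypothetical and used only for illustration.

```python
# Maps terminal event types to the status recorded for the run.
# deployment_succeeded is a hypothetical event type.
TERMINAL_STATUS = {
    "deployment_succeeded": "succeeded",
    "deployment_failed": "failed",
}


def track_deployments(events: list) -> dict:
    """Build a mapping of correlation_id to a deployment run summary from ordered events."""
    deployments = {}
    for event in events:
        cid = event.get("correlation_id")
        etype = event.get("type")
        # deployment_started is a hypothetical event type.
        if cid is None or (etype != "deployment_started" and etype not in TERMINAL_STATUS):
            continue
        run = deployments.setdefault(cid, {
            "correlation_id": cid,
            "started_at": None,
            "ended_at": None,
            "status": "unknown",
        })
        if etype == "deployment_started":
            run["started_at"] = event["timestamp"]
            run["status"] = "running"
        else:
            run["ended_at"] = event["timestamp"]
            run["status"] = TERMINAL_STATUS[etype]
    return deployments
```

Because the result is keyed by `correlation_id`, replaying the same ordered events produces the same mapping, in line with the rebuild behavior described above.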