# Observer Runtime

The Observer Runtime is a lightweight agent responsible for synthesizing the operational world state of the homelab from raw events, logs, and state files.

## Architecture

The observer follows a filesystem-first approach, consuming append-only events and generating a normalized world model. It is designed to be idempotent, resumable, and resilient to intermittent node connectivity.

### Inputs
- `/opt/homelab/events/`: Normalized JSON events.
- `/opt/homelab/state/`: Deployment stage markers and the observer's internal checkpoint.
- `/opt/homelab/logs/`: Detailed execution logs and diagnostics.
- Repository Inventory: `inventory/topology.yaml` and `hosts/*/services.yaml`.
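
As a rough sketch of how the events input might be read, the snippet below walks the events directory in filename order and parses each file as JSON. It assumes one event per `*.json` file and filenames that sort chronologically; neither is confirmed by this document, and `iter_events` is a hypothetical helper name.

```python
import json
from pathlib import Path
from typing import Iterator, Tuple

EVENTS_DIR = Path("/opt/homelab/events")  # path taken from the Inputs list


def iter_events(events_dir: Path = EVENTS_DIR) -> Iterator[Tuple[str, dict]]:
    """Yield (filename, event) pairs in filename order.

    Assumption: one JSON event per file, with names that sort chronologically.
    """
    for path in sorted(events_dir.glob("*.json")):
        try:
            event = json.loads(path.read_text(encoding="utf-8"))
        except json.JSONDecodeError:
            # This sketch skips files that do not parse; a real observer
            # might prefer to stop and retry them on the next run.
            continue
        yield path.name, event
```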

### World Model Output
Generated under `/opt/homelab/world/`:
- `nodes.json`: Current node availability, roles, and last-seen timestamps.
- `services.json`: Service health status and links to active incidents.
- `deployments.json`: Active and historical deployment runs, tracked by `correlation_id`.
- `incidents.json`: Correlated operational issues, including repeat failures and resolution status.
- `runtime-summary.json`: High-level overview for dashboards and planner agents.
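
One way a filesystem-first writer could keep these files readable while they are being regenerated is to write to a temporary file and rename it into place. The sketch below illustrates that idea only; this document does not say how the observer actually writes its output, and `write_world_file` is a hypothetical helper.

```python
import json
import os
import tempfile
from pathlib import Path

WORLD_DIR = Path("/opt/homelab/world")  # path taken from the section above


def write_world_file(name: str, data: dict, world_dir: Path = WORLD_DIR) -> Path:
    """Write `data` to world_dir/name via a temp file and an atomic rename."""
    world_dir.mkdir(parents=True, exist_ok=True)
    target = world_dir / name
    # Keep the temp file in the same directory so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=world_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(data, handle, indent=2, sort_keys=True)
        os.replace(tmp_name, target)
    except BaseException:
        # Remove the temp file if anything failed before the rename.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return target
```

Readers such as dashboards then see either the previous file or the new one, never a half-written mix.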

## Incident Lifecycle

The observer implements lightweight incident correlation:

1. **Detection**: When a `service_unhealthy` or `healthcheck_failed` event is consumed, a new incident is created or an existing active incident for that service is updated.
2. **Correlation**: Multiple failure events for the same service on the same node are collapsed into a single incident, tracking the `occurrence_count`.
3. **Diagnostics**: When a `deployment_failed` event is consumed, the observer attaches references to diagnostic files if the event payload includes them.
4. **Resolution**: A `service_recovered` event for a service will transition any active incidents for that service to a `resolved` state.
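
The detection, correlation, and resolution steps above can be sketched as a small in-place update function. This is an illustration under assumptions, not the observer's actual code: the event field names (`type`, `node`, `service`, `timestamp`), the `"active"` status value, the default `"error"` severity, and the ID scheme in `incident_id` are all guesses shaped after the example incident in this section. The diagnostics step is left out.

```python
from datetime import datetime

FAILURE_TYPES = {"service_unhealthy", "healthcheck_failed"}


def incident_id(node: str, service: str, timestamp: str) -> str:
    # Hypothetical ID scheme: inc-<epoch>-<node>-<service>.
    epoch = int(datetime.fromisoformat(timestamp.replace("Z", "+00:00")).timestamp())
    return f"inc-{epoch}-{node}-{service}"


def apply_event(incidents: dict, event: dict) -> None:
    """Apply one event to the incident map in place (assumed event fields)."""
    kind = event.get("type")
    service = event.get("service")
    ts = event.get("timestamp")

    if kind in FAILURE_TYPES:
        node = event.get("node")
        for inc in incidents.values():
            if inc["status"] == "active" and inc["node"] == node and inc["service"] == service:
                # Correlation: fold a repeat failure into the open incident.
                inc["occurrence_count"] += 1
                inc["last_occurrence"] = ts
                inc["events"].append(ts)
                return
        # Detection: no open incident for this node and service, so create one.
        new_id = incident_id(node, service, ts)
        incidents[new_id] = {
            "id": new_id, "node": node, "service": service,
            "status": "active", "severity": "error",
            "started_at": ts, "last_occurrence": ts,
            "occurrence_count": 1, "events": [ts],
            "correlation_id": event.get("correlation_id"),
        }
    elif kind == "service_recovered":
        # Resolution: close every open incident for the service.
        for inc in incidents.values():
            if inc["status"] == "active" and inc["service"] == service:
                inc["status"] = "resolved"
                inc["resolved_at"] = ts
```

Feeding it two failure events and one recovery for the same node and service leaves a single resolved incident with an `occurrence_count` of 2.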

### Example Incident JSON
```json
{
  "inc-1715518800-saturn-mosquitto": {
    "id": "inc-1715518800-saturn-mosquitto",
    "node": "saturn",
    "service": "mosquitto",
    "status": "resolved",
    "severity": "error",
    "started_at": "2026-05-12T12:05:00Z",
    "last_occurrence": "2026-05-12T12:06:00Z",
    "occurrence_count": 2,
    "events": [
      "2026-05-12T12:05:00Z",
      "2026-05-12T12:06:00Z"
    ],
    "correlation_id": "hc-1",
    "resolved_at": "2026-05-12T12:10:00Z"
  }
}
```

## Runtime Behavior

### Idempotency
The observer processes events in order. If the world state is lost, deleting the checkpoint file (`/opt/homelab/state/observer_checkpoint.json`) will cause the observer to re-process all events and rebuild the world state.
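
A minimal sketch of the rebuild step, assuming only that removing the checkpoint file is enough to trigger a full replay on the next run, as described above. `reset_checkpoint` is a hypothetical helper name.

```python
from pathlib import Path

# Checkpoint path as given in the text above.
CHECKPOINT = Path("/opt/homelab/state/observer_checkpoint.json")


def reset_checkpoint(checkpoint: Path = CHECKPOINT) -> bool:
    """Delete the checkpoint so the next run replays every event.

    Returns True if a checkpoint existed and was removed.
    """
    existed = checkpoint.exists()
    checkpoint.unlink(missing_ok=True)  # missing_ok needs Python 3.8+
    return existed
```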

### Resumability
The observer tracks the last processed event file in its checkpoint. Upon restart, it continues from the next available event.
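
The resume logic might look roughly like the sketch below. The checkpoint schema is not documented here, so the `last_event_file` key is a hypothetical choice, as are the helper names and the assumption that event filenames sort in processing order.

```python
import json
from pathlib import Path
from typing import List, Optional


def load_last_processed(checkpoint: Path) -> Optional[str]:
    """Return the last processed event filename, or None without a checkpoint."""
    if not checkpoint.exists():
        return None
    # "last_event_file" is a hypothetical key name.
    return json.loads(checkpoint.read_text(encoding="utf-8")).get("last_event_file")


def pending_event_files(events_dir: Path, checkpoint: Path) -> List[Path]:
    """List event files that sort after the checkpointed filename."""
    last = load_last_processed(checkpoint)
    files = sorted(events_dir.glob("*.json"))
    if last is None:
        return files  # no checkpoint: process everything
    return [path for path in files if path.name > last]


def save_checkpoint(checkpoint: Path, event_file: Path) -> None:
    """Record `event_file` as the most recently processed event."""
    checkpoint.write_text(
        json.dumps({"last_event_file": event_file.name}), encoding="utf-8"
    )
```

Saving the checkpoint only after an event has been fully applied means an interrupted run repeats at most that one event on restart.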

### Deployment Tracking
Deployments are tracked via `correlation_id`. The observer synthesizes the start, end, and status of each deployment run, providing a clear history of changes to the environment.
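
As an illustration, deployment runs could be folded together by `correlation_id` along these lines. Only `deployment_failed` is named in this document; the `deployment_started` and `deployment_succeeded` event types, the status strings, and the record fields are hypothetical.

```python
# Only "deployment_failed" appears in this document; the other two
# event names and all status strings are hypothetical.
STATUS_BY_EVENT = {
    "deployment_started": "running",
    "deployment_succeeded": "succeeded",
    "deployment_failed": "failed",
}


def track_deployment(deployments: dict, event: dict) -> None:
    """Fold one deployment event into the run keyed by its correlation_id."""
    cid = event.get("correlation_id")
    status = STATUS_BY_EVENT.get(event.get("type"))
    if cid is None or status is None:
        return  # not a deployment event, or nothing to correlate on
    run = deployments.setdefault(
        cid,
        {"correlation_id": cid, "status": status, "started_at": None, "ended_at": None},
    )
    run["status"] = status
    if status == "running":
        run["started_at"] = event.get("timestamp")
    else:
        run["ended_at"] = event.get("timestamp")
```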