# Homelab Event System
The homelab multi-agent platform uses a filesystem-first event architecture for observability, auditability, and agent reasoning.
## Architecture
Each event is stored as an individual JSON file on the local filesystem, so the system stays resilient to network outages and needs no external dependencies such as databases or message brokers.
### Filesystem Layout
Events are organized by date and node:
```
/opt/homelab/events/YYYY-MM-DD/node-name/TIMESTAMP_TYPE_UUID.json
```
- **Date-based partitioning** allows for easy archival and rotation.
- **Node-based partitioning** supports multi-node environments and offline synchronization.
- **Append-only** writes (each event is a new file; existing files are not modified) preserve an immutable audit trail.
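As an illustration of the layout above, here is a minimal sketch that builds the path for a new event. The helper name `event_path` is hypothetical, and the compact `YYYYMMDDTHHMMSSZ` form used for `TIMESTAMP` is an assumption, since the layout does not specify the exact format.

```python
from datetime import datetime, timezone
from pathlib import Path
import uuid

EVENTS_ROOT = Path("/opt/homelab/events")  # root directory from the layout above


def event_path(node, event_type, when=None, root=EVENTS_ROOT):
    """Return the file path for a new event under the date/node layout."""
    # Normalize to UTC so the date directory and the filename agree.
    when = (when or datetime.now(timezone.utc)).astimezone(timezone.utc)
    date_dir = when.strftime("%Y-%m-%d")
    # Assumption: compact UTC timestamp; the real filename format may differ.
    stamp = when.strftime("%Y%m%dT%H%M%SZ")
    name = f"{stamp}_{event_type}_{uuid.uuid4()}.json"
    return Path(root) / date_dir / node / name
```

Because event types themselves contain underscores, a filename in this form is easiest to parse from both ends (timestamp first, UUID last) rather than by splitting on every `_`.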
## Event Schema
Each event is a JSON object with the following fields:
| Field | Type | Description |
|------------------|--------|-------------------------------------------------------|
| `timestamp` | string | ISO 8601 UTC timestamp |
| `node` | string | Hostname of the node where the event originated |
| `type` | string | Normalized event type |
| `severity` | string | One of `info`, `warning`, `error`, `critical` |
| `source` | string | Component that emitted the event (e.g., `deploy.sh`) |
| `service` | string | Service name or `all` |
| `correlation_id` | string | Used to link related events (e.g., deployment run ID) |
| `payload` | object | Arbitrary event-specific data |
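To make the schema concrete, the following sketch assembles an event with the eight fields from the table and prints it as JSON. `build_event` is a hypothetical helper, not the library's API; defaulting `node` to the local hostname and `payload` to an empty object are assumptions.

```python
import json
import socket
from datetime import datetime, timezone


def build_event(event_type, severity, source, service, correlation_id,
                payload=None, node=None):
    """Assemble an event dict with the eight schema fields."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),  # ISO 8601, UTC
        "node": node or socket.gethostname(),  # assumption: default to local host
        "type": event_type,
        "severity": severity,
        "source": source,
        "service": service,
        "correlation_id": correlation_id,
        "payload": payload if payload is not None else {},
    }


event = build_event("deployment_started", "info", "deploy.sh",
                    "mosquitto", "deploy-882", {"version": "1.0"}, node="pi-01")
print(json.dumps(event, indent=2))
```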
### Normalized Event Types
- `deployment_started`: A deployment process has begun.
- `deployment_completed`: A deployment finished successfully.
- `deployment_failed`: A deployment failed at some stage.
- `service_unhealthy`: A healthcheck failed for a service.
- `service_recovered`: A service returned to healthy state.
- `node_offline`: Node detected it is losing connectivity (heartbeat loss).
- `node_online`: Node detected it is back online.
- `healthcheck_failed`: Generic healthcheck failure.
- `remediation_started`: An automated or manual fix is being applied.
- `remediation_completed`: Remediation finished.
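A consumer may want to reject malformed events before acting on them. The sketch below checks an event against the schema table and the type list above. `validate_event` is a hypothetical helper, and treating the type list as closed is an assumption; the real library may accept additional types.

```python
# Field names and JSON types taken from the schema table.
REQUIRED_FIELDS = {
    "timestamp": str, "node": str, "type": str, "severity": str,
    "source": str, "service": str, "correlation_id": str, "payload": dict,
}
SEVERITIES = ("info", "warning", "error", "critical")
# Assumption: the documented list of normalized types is exhaustive.
EVENT_TYPES = (
    "deployment_started", "deployment_completed", "deployment_failed",
    "service_unhealthy", "service_recovered", "node_offline", "node_online",
    "healthcheck_failed", "remediation_started", "remediation_completed",
)


def validate_event(event):
    """Return a list of problems; an empty list means none were found."""
    problems = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in event:
            problems.append(f"missing field: {field}")
        elif not isinstance(event[field], expected):
            problems.append(f"{field}: expected {expected.__name__}")
    if "severity" in event and event["severity"] not in SEVERITIES:
        problems.append(f"unknown severity: {event['severity']!r}")
    if "type" in event and event["type"] not in EVENT_TYPES:
        problems.append(f"unknown type: {event['type']!r}")
    return problems
```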
## Usage
### Shell Library
Source `scripts/lib/events.sh` to use the event library in bash scripts.
```bash
source scripts/lib/events.sh
# Emit an event (arguments: type, severity, source, service, correlation ID, JSON payload)
emit_event "deployment_started" "info" "my-script.sh" "mosquitto" "unique-cid" '{"version": "1.0"}'
# List events for today
list_events
```
### Python Library
Import `scripts.lib.events` in Python scripts.
```python
from scripts.lib.events import emit_event
emit_event(
    event_type="service_unhealthy",
    severity="error",
    source="monitor.py",
    service="ollama",
    correlation_id="12345",
    payload={"error": "OOM"},
)
```
## Operator & AI Agent Reasoning
The event system is designed to support operators today and AI agents in the future:
1. **Causal Chains**: By using `correlation_id`, agents can trace a failure back to a specific deployment or remediation attempt.
2. **Resumable Remediation**: Agents can check the latest `remediation_started` events to see what has already been tried.
3. **Auditability**: Every action taken by an operator or agent leaves a permanent record on the filesystem.
4. **Offline Capability**: Events are stored locally and can be synced when connectivity is restored.
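The causal-chain and resumable-remediation points above can be sketched as a small reader over the event directory. This is an illustration only: `load_events`, `trace`, and `remediation_attempts` are hypothetical names, not part of `scripts.lib.events`, and sorting by the `timestamp` string assumes all events use the same ISO 8601 UTC format.

```python
import json
from pathlib import Path

EVENTS_ROOT = "/opt/homelab/events"


def load_events(root=EVENTS_ROOT):
    """Yield every parseable event under root, skipping unreadable files."""
    # Layout: <root>/YYYY-MM-DD/<node>/<file>.json
    for path in sorted(Path(root).glob("*/*/*.json")):
        try:
            with path.open(encoding="utf-8") as fh:
                event = json.load(fh)
        except (OSError, ValueError):  # unreadable file or invalid JSON
            continue
        if isinstance(event, dict):
            yield event


def trace(correlation_id, root=EVENTS_ROOT):
    """Return all events sharing correlation_id, oldest first."""
    matches = [e for e in load_events(root)
               if e.get("correlation_id") == correlation_id]
    # Assumption: uniform ISO 8601 UTC strings sort chronologically as text.
    return sorted(matches, key=lambda e: e.get("timestamp", ""))


def remediation_attempts(correlation_id, root=EVENTS_ROOT):
    """Return the remediation_started events already recorded for a chain."""
    return [e for e in trace(correlation_id, root)
            if e.get("type") == "remediation_started"]
```

An agent could call `remediation_attempts("deploy-882")` before acting and skip fixes that were already tried.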
## Example Flow: Deployment Failure & Recovery
1. **Event 1**: `deployment_started` (CID: `deploy-882`)
2. **Event 2**: `deployment_failed` (CID: `deploy-882`, Payload: `{"stage": "verify", "error": "port 1883 not bound"}`)
3. **Event 3**: `remediation_started` (Source: `diagnostics.sh`, CID: `deploy-882`)
4. **Event 4**: `service_recovered` (Source: `healthcheck.sh`, Service: `mosquitto`, CID: `deploy-882`)
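The flow above can be written out as data and reduced to a single status. In this sketch only the fields given in the flow are filled in; the `chain_status` helper and its status labels are hypothetical, and the events are assumed to be listed in chronological order.

```python
FLOW = [
    {"type": "deployment_started", "correlation_id": "deploy-882"},
    {"type": "deployment_failed", "correlation_id": "deploy-882",
     "payload": {"stage": "verify", "error": "port 1883 not bound"}},
    {"type": "remediation_started", "source": "diagnostics.sh",
     "correlation_id": "deploy-882"},
    {"type": "service_recovered", "source": "healthcheck.sh",
     "service": "mosquitto", "correlation_id": "deploy-882"},
]

# Hypothetical mapping from the most recent event type to a status label.
STATUS_BY_LAST_TYPE = {
    "deployment_completed": "ok",
    "service_recovered": "recovered",
    "deployment_failed": "failed",
    "service_unhealthy": "failed",
    "healthcheck_failed": "failed",
}


def chain_status(events):
    """Summarize a chronologically ordered chain by its most recent event."""
    if not events:
        return "unknown"
    return STATUS_BY_LAST_TYPE.get(events[-1]["type"], "in_progress")
```

With this mapping, the chain reads as `failed` after Event 2, `in_progress` after Event 3, and `recovered` after Event 4.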