# Homelab Event System

The homelab multi-agent platform uses a filesystem-first event architecture for observability, auditability, and agent reasoning.

## Architecture

Events are stored as individual JSON files on the local filesystem. This ensures that the system is resilient to network outages and requires no external dependencies like databases or message brokers.

### Filesystem Layout

Events are organized by date and node:

```
/opt/homelab/events/YYYY-MM-DD/node-name/TIMESTAMP_TYPE_UUID.json
```

- **Date-based partitioning** allows for easy archival and rotation.
- **Node-based partitioning** supports multi-node environments and offline synchronization.
- **Append-only** nature ensures an immutable audit trail.
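
As an illustration of the layout above, the following sketch builds the path for a single event file. It is not taken from `scripts/lib/events.sh` or `scripts/lib/events.py`: the helper name `event_path` and the compact `TIMESTAMP` format (`YYYYMMDDTHHMMSSZ`) are assumptions, since the layout does not specify how the timestamp is encoded in the file name.

```python
from datetime import datetime, timezone
from pathlib import Path

EVENTS_ROOT = Path("/opt/homelab/events")  # root directory from the layout above


def event_path(node, event_type, when, event_id, root=EVENTS_ROOT):
    """Build <root>/YYYY-MM-DD/<node>/<TIMESTAMP>_<TYPE>_<UUID>.json.

    `when` should be a timezone-aware datetime; it is converted to UTC so
    that the date directory and the file name agree with each other.
    """
    when = when.astimezone(timezone.utc)
    day = when.strftime("%Y-%m-%d")
    # Assumed file-name timestamp format; the real library may differ.
    stamp = when.strftime("%Y%m%dT%H%M%SZ")
    return root / day / node / f"{stamp}_{event_type}_{event_id}.json"
```

Calling `event_path("node-a", "deployment_started", datetime.now(timezone.utc), str(uuid.uuid4()))` (with `uuid` imported) would give a unique path per event, which is what lets writers append new files without touching existing ones.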

## Event Schema

Each event is a JSON object with the following fields:

| Field            | Type   | Description                                           |
|------------------|--------|-------------------------------------------------------|
| `timestamp`      | string | ISO 8601 UTC timestamp                                |
| `node`           | string | Hostname of the node where the event originated       |
| `type`           | string | Normalized event type                                 |
| `severity`       | string | `info`, `warning`, `error`, `critical`                |
| `source`         | string | Component that emitted the event (e.g., `deploy.sh`)  |
| `service`        | string | Service name or `all`                                 |
| `correlation_id` | string | Used to link related events (e.g., deployment run ID) |
| `payload`        | object | Arbitrary event-specific data                         |

### Normalized Event Types

- `deployment_started`: A deployment process has begun.
- `deployment_completed`: A deployment finished successfully.
- `deployment_failed`: A deployment failed at some stage.
- `service_unhealthy`: A healthcheck failed for a service.
- `service_recovered`: A service returned to healthy state.
- `node_offline`: Node detected it is losing connectivity (heartbeat loss).
- `node_online`: Node detected it is back online.
- `healthcheck_failed`: Generic healthcheck failure.
- `remediation_started`: An automated or manual fix is being applied.
- `remediation_completed`: Remediation finished.
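
The schema and type list above can be checked mechanically. The sketch below is one possible validator, not part of the event library: `validate_event` is a hypothetical helper, and it assumes that every field in the table is required and that the list of normalized types is closed, neither of which is stated explicitly above.

```python
SEVERITIES = {"info", "warning", "error", "critical"}

EVENT_TYPES = {
    "deployment_started", "deployment_completed", "deployment_failed",
    "service_unhealthy", "service_recovered",
    "node_offline", "node_online", "healthcheck_failed",
    "remediation_started", "remediation_completed",
}

# Field name -> expected Python type, mirroring the schema table.
FIELDS = {
    "timestamp": str, "node": str, "type": str, "severity": str,
    "source": str, "service": str, "correlation_id": str, "payload": dict,
}


def validate_event(event):
    """Return a list of problems; an empty list means the event looks valid."""
    problems = []
    for name, expected in FIELDS.items():
        if name not in event:
            problems.append(f"missing field: {name}")
        elif not isinstance(event[name], expected):
            problems.append(f"{name} should be {expected.__name__}")
    # Only check membership for values that are already strings, so that a
    # missing or mistyped field is reported once rather than twice.
    event_type = event.get("type")
    if isinstance(event_type, str) and event_type not in EVENT_TYPES:
        problems.append(f"unknown type: {event_type!r}")
    severity = event.get("severity")
    if isinstance(severity, str) and severity not in SEVERITIES:
        problems.append(f"unknown severity: {severity!r}")
    return problems
```

Returning a list of problems instead of raising on the first one makes it easier to log everything wrong with a malformed event in a single pass.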

## Usage

### Shell Library

Source `scripts/lib/events.sh` to use the event library in bash scripts.

```bash
source scripts/lib/events.sh

# Emit an event (arguments: type, severity, source, service, correlation ID, JSON payload)
emit_event "deployment_started" "info" "my-script.sh" "mosquitto" "unique-cid" '{"version": "1.0"}'

# List events for today
list_events
```

### Python Library

Import `scripts.lib.events` in Python scripts.

```python
from scripts.lib.events import emit_event

emit_event(
    event_type="service_unhealthy",
    severity="error",
    source="monitor.py",
    service="ollama",
    correlation_id="12345",
    payload={"error": "OOM"},
)
```

## Operator & AI Agent Reasoning

The event system is designed to support operators and future AI agents:

1. **Causal Chains**: By using `correlation_id`, agents can trace a failure back to a specific deployment or remediation attempt.
2. **Resumable Remediation**: Agents can check the latest `remediation_started` events to see what has already been tried.
3. **Auditability**: Every action taken by an operator or agent leaves a permanent record on the filesystem.
4. **Offline Capability**: Events are stored locally and can be synced when connectivity is restored.
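
The first two points can be sketched in a few lines of Python. This is an illustrative reader, not part of the event library: `load_chain` and `remediation_attempts` are hypothetical names, and the code assumes the date/node directory layout described earlier and that all timestamps use one consistent ISO 8601 UTC format (so that sorting the strings sorts the events chronologically).

```python
import json
from pathlib import Path


def load_chain(root, correlation_id):
    """Return all events sharing correlation_id, oldest first."""
    chain = []
    # Assumed layout: <root>/<date>/<node>/<file>.json
    for path in Path(root).glob("*/*/*.json"):
        try:
            event = json.loads(path.read_text(encoding="utf-8"))
        except (OSError, ValueError):
            continue  # skip unreadable or partially written files
        if isinstance(event, dict) and event.get("correlation_id") == correlation_id:
            chain.append(event)
    # Consistently formatted ISO 8601 UTC strings sort chronologically.
    chain.sort(key=lambda e: str(e.get("timestamp", "")))
    return chain


def remediation_attempts(chain):
    """Events in the chain that mark the start of a remediation attempt."""
    return [e for e in chain if e.get("type") == "remediation_started"]
```

An agent could call `load_chain("/opt/homelab/events", "deploy-882")` to reconstruct a deployment's history and then inspect `remediation_attempts(...)` before deciding whether to retry a fix.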

## Example Flow: Deployment Failure & Recovery

1. **Event 1**: `deployment_started` (CID: `deploy-882`)
2. **Event 2**: `deployment_failed` (CID: `deploy-882`, Payload: `{"stage": "verify", "error": "port 1883 not bound"}`)
3. **Event 3**: `remediation_started` (Source: `diagnostics.sh`, CID: `deploy-882`)
4. **Event 4**: `service_recovered` (Source: `healthcheck.sh`, Service: `mosquitto`, CID: `deploy-882`)