Homelab Event System

The homelab multi-agent platform uses a filesystem-first event architecture for observability, auditability, and agent reasoning.

Architecture

Events are stored as individual JSON files on the local filesystem. This ensures that the system is resilient to network outages and requires no external dependencies like databases or message brokers.

Filesystem Layout

Events are organized by date and node:

/opt/homelab/events/YYYY-MM-DD/node-name/TIMESTAMP_TYPE_UUID.json
  • Date-based partitioning allows for easy archival and rotation.
  • Node-based partitioning supports multi-node environments and offline synchronization.
  • Event files are written once and never modified, which gives an append-only audit trail.
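
The layout above can be sketched as a small path builder. This is an illustration, not the code in scripts.lib.events: the function name event_path, the compact timestamp format, and the hex form of the UUID are assumptions, since the layout only specifies TIMESTAMP_TYPE_UUID.json.

```python
from datetime import datetime, timezone
from pathlib import Path
import uuid

EVENTS_ROOT = Path("/opt/homelab/events")


def event_path(node, event_type, when=None, event_id=None, root=EVENTS_ROOT):
    """Build the file path for one event (hypothetical helper)."""
    # Normalize to UTC so the date directory and the file name agree.
    # `when` is expected to be timezone-aware.
    when = (when or datetime.now(timezone.utc)).astimezone(timezone.utc)
    event_id = event_id or uuid.uuid4().hex
    # Assumed timestamp format: compact ISO 8601 without ':' so it is filename-safe.
    stamp = when.strftime("%Y%m%dT%H%M%SZ")
    date_dir = when.strftime("%Y-%m-%d")
    return Path(root) / date_dir / node / f"{stamp}_{event_type}_{event_id}.json"
```

Because the date and node are directory levels, archiving one day or syncing one node's events comes down to copying a single directory.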

Event Schema

Each event is a JSON object with the following fields:

Field            Type     Description
timestamp        string   ISO 8601 UTC timestamp
node             string   Hostname of the node where the event originated
type             string   Normalized event type
severity         string   info, warning, error, critical
source           string   Component that emitted the event (e.g., deploy.sh)
service          string   Service name or all
correlation_id   string   Used to link related events (e.g., deployment run ID)
payload          object   Arbitrary event-specific data
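
As a rough illustration of the schema, the sketch below checks a decoded event against the field table. The name validate_event is hypothetical, and treating every field as required is an assumption; the table does not say which fields are optional.

```python
# Field names and types taken from the schema table; JSON objects decode to dict.
REQUIRED_FIELDS = {
    "timestamp": str,
    "node": str,
    "type": str,
    "severity": str,
    "source": str,
    "service": str,
    "correlation_id": str,
    "payload": dict,
}
SEVERITIES = {"info", "warning", "error", "critical"}


def validate_event(event):
    """Return a list of problems; an empty list means the event fits the table."""
    problems = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in event:
            # Assumption: all fields are mandatory.
            problems.append(f"missing field: {field}")
        elif not isinstance(event[field], expected):
            problems.append(f"{field} should be {expected.__name__}")
    severity = event.get("severity")
    if isinstance(severity, str) and severity not in SEVERITIES:
        problems.append(f"unknown severity: {severity}")
    return problems
```

A check like this could run before an event is written, so that malformed events never reach the audit trail.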

Normalized Event Types

  • deployment_started: A deployment process has begun.
  • deployment_completed: A deployment finished successfully.
  • deployment_failed: A deployment failed at some stage.
  • service_unhealthy: A healthcheck failed for a service.
  • service_recovered: A service returned to healthy state.
  • node_offline: The node detected that it lost connectivity (heartbeat loss).
  • node_online: The node detected that connectivity was restored.
  • healthcheck_failed: Generic healthcheck failure.
  • remediation_started: An automated or manual fix is being applied.
  • remediation_completed: Remediation finished.

Usage

Shell Library

Source scripts/lib/events.sh to use the event library in bash scripts.

source scripts/lib/events.sh

# Emit an event
emit_event "deployment_started" "info" "my-script.sh" "mosquitto" "unique-cid" '{"version": "1.0"}'

# List events for today
list_events

Python Library

Import scripts.lib.events in Python scripts.

from scripts.lib.events import emit_event

emit_event(
    event_type="service_unhealthy",
    severity="error",
    source="monitor.py",
    service="ollama",
    correlation_id="12345",
    payload={"error": "OOM"}
)

Operator & AI Agent Reasoning

The event system is designed to support future AI agents:

  1. Causal Chains: By using correlation_id, agents can trace a failure back to a specific deployment or remediation attempt.
  2. Resumable Remediation: Agents can check the latest remediation_started events to see what has already been tried.
  3. Auditability: Every action taken by an operator or agent leaves a permanent record on the filesystem.
  4. Offline Capability: Events are stored locally and can be synced when connectivity is restored.
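
The first two points can be sketched in a few lines of Python. The helper names load_events, causal_chain, and remediation_attempts are hypothetical and not part of scripts.lib.events; the sketch also assumes that all timestamps share one ISO 8601 UTC format, so that sorting them as strings gives chronological order.

```python
import json
from pathlib import Path


def load_events(root="/opt/homelab/events"):
    """Read every event file under root (DATE/NODE/*.json)."""
    events = []
    for path in sorted(Path(root).glob("*/*/*.json")):
        try:
            event = json.loads(path.read_text(encoding="utf-8"))
        except (OSError, ValueError):
            continue  # unreadable, partially written, or not valid JSON
        if isinstance(event, dict):
            events.append(event)
    return events


def causal_chain(events, correlation_id):
    """Events sharing a correlation_id, oldest first."""
    matches = [e for e in events if e.get("correlation_id") == correlation_id]
    # Assumption: uniform ISO 8601 UTC strings, so string order is time order.
    return sorted(matches, key=lambda e: str(e.get("timestamp", "")))


def remediation_attempts(events, correlation_id):
    """remediation_started events already recorded for this correlation_id."""
    chain = causal_chain(events, correlation_id)
    return [e for e in chain if e.get("type") == "remediation_started"]
```

An agent could call causal_chain(load_events(), "deploy-882") to reconstruct a run, and check remediation_attempts before starting a fix that may already have been tried.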

Example Flow: Deployment Failure & Recovery

  1. Event 1: deployment_started (CID: deploy-882)
  2. Event 2: deployment_failed (CID: deploy-882, Payload: {"stage": "verify", "error": "port 1883 not bound"})
  3. Event 3: remediation_started (Source: diagnostics.sh, CID: deploy-882)
  4. Event 4: service_recovered (Source: healthcheck.sh, Service: mosquitto, CID: deploy-882)
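
One way an agent might interpret this flow is to walk the chain in order and check whether the last failure was followed by a recovery. The sketch below is only an illustration: is_resolved is a hypothetical helper, the grouping of types into failures and recoveries is an assumption, and the events carry only the fields listed in the flow rather than the full schema.

```python
# Assumed classification of the normalized event types.
FAILURE_TYPES = {"deployment_failed", "service_unhealthy", "healthcheck_failed"}
RECOVERY_TYPES = {"service_recovered"}


def is_resolved(chain):
    """True if the last failure in a time-ordered chain is followed by a recovery."""
    resolved = True
    for event in chain:
        if event["type"] in FAILURE_TYPES:
            resolved = False
        elif event["type"] in RECOVERY_TYPES:
            resolved = True
    return resolved


# The four events of the example flow, reduced to the fields shown in the list.
flow = [
    {"type": "deployment_started", "correlation_id": "deploy-882"},
    {
        "type": "deployment_failed",
        "correlation_id": "deploy-882",
        "payload": {"stage": "verify", "error": "port 1883 not bound"},
    },
    {"type": "remediation_started", "source": "diagnostics.sh", "correlation_id": "deploy-882"},
    {
        "type": "service_recovered",
        "source": "healthcheck.sh",
        "service": "mosquitto",
        "correlation_id": "deploy-882",
    },
]

print(is_resolved(flow))      # the full flow ends in a recovery
print(is_resolved(flow[:3]))  # remediation started, but no recovery yet
```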