### Stability Agent
A lightweight, filesystem-first watchdog and observer agent for homelab nodes.
#### Features
* **Continuous Monitoring**: Runs as a background service.
* **Docker Inspection**: Checks container status via a read-only Docker socket (optional).
* **Disk Usage**: Monitors local disk utilization.
* **Tailscale Check**: Verifies Tailscale availability (optional).
* **MQTT Reachability**: Checks connectivity to a configured MQTT broker (optional).
* **Redis Publishing**: Publishes runtime state and events to a central Redis server (PIHA).
* **Event Logging**: Writes append-only JSON events to `/opt/homelab/events/YYYY-MM-DD/<NODE_NAME>/`.
* **State Reporting**: Writes heartbeat and status summary to `/opt/homelab/state/`.
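
As a rough illustration of two of the checks listed above, the sketch below measures disk utilization and probes a broker port. It is not taken from the agent's source: the function names `disk_usage_pct` and `tcp_reachable` are hypothetical, and a plain TCP connect only shows that the port accepts connections, not that an MQTT session can be established.

```python
import shutil
import socket


def disk_usage_pct(path: str = "/") -> float:
    """Return used space on the filesystem containing `path`, as a percentage."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100.0


def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout`.

    Assumption: reachability is approximated by a successful TCP handshake.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False
```

A check loop could compare `disk_usage_pct("/")` against `DISK_THRESHOLD_PCT` and call `tcp_reachable(MQTT_HOST, MQTT_PORT)` once per interval.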
#### Deployment
Use the deployment helper script:
```bash
./scripts/deploy/deploy-stability-agent.sh <NODE_NAME>
```
#### Configuration
Environment variables:
* `STABILITY_CHECK_INTERVAL`: Interval between checks in seconds (default: 60).
* `DISK_THRESHOLD_PCT`: Disk usage percentage that triggers a warning (default: 90).
* `MQTT_HOST`: Hostname or IP of the MQTT broker to check.
* `MQTT_PORT`: Port of the MQTT broker (default: 1883).
* `REDIS_HOST`: Hostname or IP of the Redis server (e.g., PIHA at 100.108.208.3).
* `REDIS_PORT`: Port of the Redis server (default: 6379).
* `REDIS_ENABLED`: Whether to enable Redis publishing (default: true if `REDIS_HOST` is set).
* `NODE_NAME`: Name of the current node (default: chelsty).
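
The defaults above could be resolved along the following lines. This is a minimal sketch, not the agent's actual loader: `AgentConfig` and `load_config` are hypothetical names, and the set of strings accepted as "true" for `REDIS_ENABLED` is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class AgentConfig:
    check_interval: int
    disk_threshold_pct: int
    mqtt_host: Optional[str]
    mqtt_port: int
    redis_host: Optional[str]
    redis_port: int
    redis_enabled: bool
    node_name: str


def load_config(env: Mapping[str, str] = os.environ) -> AgentConfig:
    """Build a config from environment variables, applying the documented defaults."""
    redis_host = env.get("REDIS_HOST") or None
    raw_enabled = env.get("REDIS_ENABLED")
    if raw_enabled is None:
        # Documented default: enabled only when REDIS_HOST is set.
        redis_enabled = redis_host is not None
    else:
        # Assumption: these spellings count as true; anything else is false.
        redis_enabled = raw_enabled.strip().lower() in ("1", "true", "yes", "on")
    return AgentConfig(
        check_interval=int(env.get("STABILITY_CHECK_INTERVAL", "60")),
        disk_threshold_pct=int(env.get("DISK_THRESHOLD_PCT", "90")),
        mqtt_host=env.get("MQTT_HOST") or None,
        mqtt_port=int(env.get("MQTT_PORT", "1883")),
        redis_host=redis_host,
        redis_port=int(env.get("REDIS_PORT", "6379")),
        redis_enabled=redis_enabled,
        node_name=env.get("NODE_NAME", "chelsty"),
    )
```

Passing a plain dictionary instead of `os.environ` makes the defaults easy to check in isolation, for example `load_config({"REDIS_HOST": "100.108.208.3"})`.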
#### Verification
Verify Redis publishing with `redis-cli`, replacing `100.108.208.3` with your `REDIS_HOST` if it differs:
```bash
# Check node state
redis-cli -h 100.108.208.3 HGETALL homelab:nodes:<NODE_NAME>
# Check service discovery
redis-cli -h 100.108.208.3 HGETALL homelab:services:<NODE_NAME>:stability-agent
# Check event stream
redis-cli -h 100.108.208.3 XRANGE homelab:events - +
```
#### Safety
* No automatic restarts are performed.
* The Docker socket is accessed read-only.
* No configuration mutation.
* No secrets stored in the repository.
#### Event Schema
Events are written as JSON lines with the following fields:
* `id`: Unique event UUID.
* `timestamp`: ISO 8601 timestamp (UTC).
* `node`: `<NODE_NAME>`.
* `source`: `stability-agent`.
* `type`: Type of event (e.g., `disk_usage_high`, `containers_not_running`).
* `severity`: `info`, `warning`, or `error`.
* `message`: Human-readable description.
* `details`: Object containing specific check results.
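
To make the schema concrete, here is a small sketch that builds one event and appends it as a JSON line under the documented `/opt/homelab/events/YYYY-MM-DD/<NODE_NAME>/` layout. The helper names `build_event` and `append_event` are hypothetical, and the file name `events.jsonl` is an assumption, since only the directory is specified above.

```python
import json
import uuid
from datetime import datetime, timezone
from pathlib import Path

SEVERITIES = ("info", "warning", "error")


def build_event(node: str, event_type: str, severity: str,
                message: str, details: dict) -> dict:
    """Return an event dict with the fields listed in the schema."""
    if severity not in SEVERITIES:
        raise ValueError(f"severity must be one of {SEVERITIES}")
    return {
        "id": str(uuid.uuid4()),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "node": node,
        "source": "stability-agent",
        "type": event_type,
        "severity": severity,
        "message": message,
        "details": details,
    }


def append_event(event: dict,
                 root: Path = Path("/opt/homelab/events"),
                 filename: str = "events.jsonl") -> Path:
    """Append the event as one JSON line under root/YYYY-MM-DD/<node>/."""
    day = event["timestamp"][:10]  # date part of the ISO 8601 timestamp
    directory = Path(root) / day / event["node"]
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / filename  # hypothetical file name
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(event) + "\n")
    return path
```

Opening the file in append mode keeps earlier lines untouched, which matches the append-only behavior described under Features.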