### Stability Agent

A lightweight filesystem-first watchdog and observer agent for CHELSTY.

#### Features

* **Continuous Monitoring**: Runs as a background service.
* **Docker Inspection**: Checks container status via read-only Docker socket.
* **Disk Usage**: Monitors local disk utilization.
* **Tailscale Check**: Verifies Tailscale availability.
* **MQTT Reachability**: Checks connectivity to the local MQTT broker.
* **Zigbee2MQTT Monitoring**: Specifically monitors the Zigbee2MQTT container.
* **Redis Publishing**: (Optional) Publishes runtime state and events to a central Redis server.
* **Event Logging**: Writes append-only JSON events to `/opt/homelab/events/YYYY-MM-DD/chelsty/`.
* **State Reporting**: Writes heartbeat and status summary to `/opt/homelab/state/`.

#### Configuration

Environment variables:

* `STABILITY_CHECK_INTERVAL`: Interval between checks in seconds (default: 60).
* `DISK_THRESHOLD_PCT`: Disk usage percentage to trigger warning (default: 90).
* `MQTT_HOST`: Hostname or IP of the MQTT broker to check.
* `MQTT_PORT`: Port of the MQTT broker (default: 1883).
* `REDIS_HOST`: Hostname or IP of the Redis server (e.g., PIHA at 100.108.208.3).
* `REDIS_PORT`: Port of the Redis server (default: 6379).
* `REDIS_ENABLED`: Whether to enable Redis publishing (default: true if `REDIS_HOST` is set).
* `NODE_NAME`: Name of the current node (default: chelsty).

#### Verification

You can verify the Redis publishing using `redis-cli`:

```bash
# Check node state
redis-cli -h 100.108.208.3 HGETALL homelab:nodes:chelsty

# Check service discovery
redis-cli -h 100.108.208.3 HGETALL homelab:services:chelsty:stability-agent

# Check event stream
redis-cli -h 100.108.208.3 XRANGE homelab:events - +
```

#### Safety

* No automatic restarts are performed.
* Read-only access to Docker socket.
* No configuration mutation.
* No secrets stored in the repository.

#### Event Schema

Events are written as JSON lines with the following fields:

* `id`: Unique event UUID.
* `timestamp`: ISO 8601 timestamp (UTC).
* `node`: `chelsty`.
* `source`: `stability-agent`.
* `type`: Type of event (e.g., `disk_usage_high`, `containers_not_running`).
* `severity`: `info`, `warning`, or `error`.
* `message`: Human-readable description.
* `details`: Object containing specific check results.
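
As an illustration of the schema above, here is a minimal Python sketch that builds one event and appends it as a JSON line. It is not the agent's actual code: the helper names `build_event` and `append_event`, the file name `events.jsonl`, and the `used_pct` key inside `details` are assumptions made for the example. Only the field names, the severity values, and the directory layout come from this README.

```python
import json
import uuid
from datetime import datetime, timezone
from pathlib import Path

# Severity values listed in the schema.
ALLOWED_SEVERITIES = {"info", "warning", "error"}


def build_event(event_type, severity, message, details=None, node="chelsty"):
    """Return a dict carrying the eight schema fields (hypothetical helper)."""
    if severity not in ALLOWED_SEVERITIES:
        raise ValueError(f"unknown severity: {severity!r}")
    return {
        "id": str(uuid.uuid4()),
        # Timezone-aware UTC timestamp in ISO 8601 form.
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "node": node,
        "source": "stability-agent",
        "type": event_type,
        "severity": severity,
        "message": message,
        "details": details or {},
    }


def append_event(event, events_root="/opt/homelab/events", filename="events.jsonl"):
    """Append one event as a single JSON line under <root>/<YYYY-MM-DD>/<node>/.

    The file name is an assumption; the README only specifies the directory.
    """
    day = event["timestamp"][:10]  # YYYY-MM-DD prefix of the ISO timestamp
    target_dir = Path(events_root) / day / event["node"]
    target_dir.mkdir(parents=True, exist_ok=True)
    path = target_dir / filename
    # Mode "a" only ever adds to the end of the file, never rewrites it.
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(event, sort_keys=True) + "\n")
    return path
```

A call such as `append_event(build_event("disk_usage_high", "warning", "Disk usage above threshold", {"used_pct": 93}))` would add one line to that day's file. Keeping one JSON object per line means the log can be tailed or filtered line by line without parsing the whole file.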