Add CHELSTY stability agent
parent 90b2a5d0e9
commit 8d0f2379ba
42
docs/chelsty-stability-agent.md
Normal file
@@ -0,0 +1,42 @@
### CHELSTY Stability Agent

The stability-agent on CHELSTY provides local observability and health monitoring for the node's services and infrastructure.

#### Purpose

It acts as a filesystem-first watchdog: it detects anomalies in the local runtime environment and records them on disk, but takes no autonomous destructive action (such as restarting containers). It serves as the primary data source for node-level stability metrics.

#### Monitoring Scope

* **Docker Containers**: Monitors all local containers. If any container is not in the `running` state, a `containers_not_running` event is generated.
* **Disk Usage**: Monitors the root filesystem. Generates a `disk_usage_high` event when usage exceeds the configured threshold.
* **Connectivity**:
    * Checks whether the Tailscale socket or interface is available.
    * Checks reachability of the local Mosquitto MQTT broker.
* **Zigbee2MQTT**: Specifically tracks the presence and status of the Zigbee2MQTT service.
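
The container rule above can be sketched as a pure function over the list returned by Docker's `GET /containers/json?all=1` endpoint. This is a minimal illustration rather than the agent's own code: the function name `find_not_running` and the sample payload are assumptions, and only the `Names` and `State` fields of the response are relied on.

```python
def find_not_running(containers):
    """Return the names of containers whose State is not "running"."""
    not_running = []
    for c in containers:
        names = c.get("Names") or ["unknown"]
        if c.get("State") != "running":
            # Docker reports names with a leading slash, e.g. "/mosquitto".
            not_running.append(names[0].lstrip("/"))
    return not_running


# Hypothetical sample shaped like a Docker Engine API response.
sample = [
    {"Names": ["/mosquitto"], "State": "running"},
    {"Names": ["/zigbee2mqtt"], "State": "exited"},
]
print(find_not_running(sample))
```

Keeping the rule free of I/O makes it easy to test without a Docker socket.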

#### Storage and Integration

* **Heartbeat**: A timestamp file rewritten every check cycle at `/opt/homelab/state/stability-agent.heartbeat`.
* **State Summary**: A JSON summary of the latest result of every check at `/opt/homelab/state/stability-agent.json`.
* **Events**: Append-only JSON lines at `/opt/homelab/events/YYYY-MM-DD/chelsty/events.jsonl`.
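
Because all output lands on the filesystem, a consumer needs nothing more than file reads. The sketch below shows one possible reader, assuming the paths listed above and a date folder in UTC; the helper names `heartbeat_age_seconds` and `read_events` are hypothetical and not part of this commit.

```python
import json
import os
import time
from datetime import datetime, timezone


def heartbeat_age_seconds(heartbeat_path, now=None):
    """Seconds since the heartbeat file was last modified, or None if missing."""
    try:
        mtime = os.path.getmtime(heartbeat_path)
    except OSError:
        return None
    return (time.time() if now is None else now) - mtime


def read_events(events_base, node, day=None):
    """Load one day's events for a node; skip blank or malformed lines."""
    day = day or datetime.now(timezone.utc).strftime("%Y-%m-%d")
    path = os.path.join(events_base, day, node, "events.jsonl")
    events = []
    try:
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                try:
                    events.append(json.loads(line))
                except json.JSONDecodeError:
                    continue  # tolerate a partially written last line
    except FileNotFoundError:
        pass  # no events recorded for that day
    return events


if __name__ == "__main__":
    age = heartbeat_age_seconds("/opt/homelab/state/stability-agent.heartbeat")
    print("heartbeat age:", age)
    for event in read_events("/opt/homelab/events", "chelsty"):
        print(event.get("severity"), event.get("type"), event.get("message"))
```

Skipping undecodable lines is a design choice for append-only files: a reader may observe the last line while it is still being written.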

#### Deployment

The service is deployed via Docker Compose on CHELSTY.

```bash
cd services/stability-agent
docker compose up -d
```

#### Configuration

Configuration is managed via environment variables in `docker-compose.override.yml` on the host.

| Variable | Description | Default |
|----------|-------------|---------|
| `STABILITY_CHECK_INTERVAL` | Seconds between checks | `60` |
| `DISK_THRESHOLD_PCT` | Disk usage alert threshold (percent) | `90` |
| `MQTT_HOST` | MQTT broker hostname | none (the MQTT check is skipped when unset) |
| `MQTT_PORT` | MQTT broker port | `1883` |
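
As an illustration of how these variables might be consumed, the sketch below parses them into a typed object. It assumes, as the agent script in this commit does, that an unset or empty `MQTT_HOST` means the MQTT check is not configured; the names `AgentConfig` and `load_config` are hypothetical.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class AgentConfig:
    check_interval: int
    disk_threshold_pct: float
    mqtt_host: Optional[str]
    mqtt_port: int


def load_config(env: Mapping[str, str] = os.environ) -> AgentConfig:
    """Build the config from an environment mapping, with numeric defaults."""
    return AgentConfig(
        check_interval=int(env.get("STABILITY_CHECK_INTERVAL", "60")),
        disk_threshold_pct=float(env.get("DISK_THRESHOLD_PCT", "90")),
        # An empty or missing MQTT_HOST is treated as "not configured".
        mqtt_host=env.get("MQTT_HOST") or None,
        mqtt_port=int(env.get("MQTT_PORT", "1883")),
    )


print(load_config({"DISK_THRESHOLD_PCT": "85", "MQTT_HOST": "mosquitto"}))
```

Passing the environment as a mapping keeps the parser testable without touching the real process environment.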

@@ -0,0 +1,7 @@
services:
  stability-agent:
    environment:
      - STABILITY_CHECK_INTERVAL=60
      - DISK_THRESHOLD_PCT=85
      - MQTT_HOST=mosquitto
      - MQTT_PORT=1883
@@ -106,3 +106,21 @@ services:
        - /opt/homelab/data/mosquitto
      notes:
        - Retain ACL, password, persistence, and bridge configuration if enabled.

  stability-agent:
    role: node-stability-monitor
    deployment_model: docker-compose
    exposure: local-only
    offline_required: true
    depends_on:
      local:
        - mosquitto
      external: []
    runtime:
      config_path: null
      data_path: /opt/homelab/state
      logs_path: /opt/homelab/events
    backup:
      recommended: false
      notes:
        - Events and state are transient or can be reconstructed; high-frequency writes.
16
services/stability-agent/Dockerfile
Normal file
@@ -0,0 +1,16 @@
FROM python:3.11-slim

WORKDIR /app

# The current script needs nothing beyond the standard library.
# Add an install step here if third-party libraries are introduced later.

COPY src/stability_agent.py .
COPY healthcheck.sh .
RUN chmod +x healthcheck.sh

# Create the expected directories
RUN mkdir -p /opt/homelab/state /opt/homelab/events

# Run the agent
CMD ["python", "stability_agent.py"]
43
services/stability-agent/README.md
Normal file
@@ -0,0 +1,43 @@
### Stability Agent

A lightweight filesystem-first watchdog and observer agent for CHELSTY.

#### Features

* **Continuous Monitoring**: Runs as a background service.
* **Docker Inspection**: Checks container status via the Docker socket, which is mounted read-only.
* **Disk Usage**: Monitors local disk utilization.
* **Tailscale Check**: Verifies Tailscale availability.
* **MQTT Reachability**: Checks connectivity to the local MQTT broker.
* **Zigbee2MQTT Monitoring**: Tracks whether the Zigbee2MQTT container is present and running.
* **Event Logging**: Writes append-only JSON events to `/opt/homelab/events/YYYY-MM-DD/chelsty/`.
* **State Reporting**: Writes heartbeat and status summary to `/opt/homelab/state/`.

#### Configuration

Environment variables:

* `STABILITY_CHECK_INTERVAL`: Interval between checks in seconds (default: 60).
* `DISK_THRESHOLD_PCT`: Disk usage percentage above which a warning is emitted (default: 90).
* `MQTT_HOST`: Hostname or IP of the MQTT broker to check (no default; the MQTT check is skipped when unset).
* `MQTT_PORT`: Port of the MQTT broker (default: 1883).

#### Safety

* No automatic restarts are performed.
* The Docker socket is mounted read-only, and the agent only issues `GET` requests against it.
* No configuration mutation.
* No secrets stored in the repository.

#### Event Schema

Events are written as JSON lines with the following fields:

* `id`: Unique event UUID.
* `timestamp`: ISO 8601 timestamp (UTC).
* `node`: `chelsty`.
* `source`: `stability-agent`.
* `type`: Type of event (e.g., `disk_usage_high`, `containers_not_running`).
* `severity`: `info`, `warning`, or `error`.
* `message`: Human-readable description.
* `details`: Object containing specific check results.
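
To make the schema concrete, here is a small sketch that builds one event and checks it against the fields listed above. The helper names `make_event` and `validate_event` and the sample values are illustrative assumptions, not part of the agent.

```python
import json
import uuid
from datetime import datetime, timezone

REQUIRED_FIELDS = ("id", "timestamp", "node", "source", "type",
                   "severity", "message", "details")
SEVERITIES = {"info", "warning", "error"}


def make_event(event_type, severity, message, details=None):
    """Build an event dict carrying the fields listed above."""
    return {
        "id": str(uuid.uuid4()),
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%fZ"),
        "node": "chelsty",
        "source": "stability-agent",
        "type": event_type,
        "severity": severity,
        "message": message,
        "details": details or {},
    }


def validate_event(event):
    """Return a list of schema problems; an empty list means the event is valid."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in event]
    if event.get("severity") not in SEVERITIES:
        problems.append(f"invalid severity: {event.get('severity')!r}")
    if not isinstance(event.get("details", {}), dict):
        problems.append("details must be an object")
    return problems


event = make_event("disk_usage_high", "warning", "Disk usage is high: 91.5%",
                   details={"percent": 91.5})
line = json.dumps(event)  # one JSON line, as it would be appended to events.jsonl
print(validate_event(json.loads(line)))
```

A consumer could run a check like `validate_event` on each parsed line before trusting its fields.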
25
services/stability-agent/docker-compose.yml
Normal file
@@ -0,0 +1,25 @@
services:
  stability-agent:
    build: .
    container_name: stability-agent
    restart: unless-stopped
    volumes:
      - /opt/homelab:/opt/homelab
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - /var/run/tailscale/tailscaled.sock:/var/run/tailscale/tailscaled.sock:ro
    environment:
      - STABILITY_CHECK_INTERVAL=${STABILITY_CHECK_INTERVAL:-60}
      - DISK_THRESHOLD_PCT=${DISK_THRESHOLD_PCT:-90}
      - MQTT_HOST=${MQTT_HOST}
      - MQTT_PORT=${MQTT_PORT:-1883}
      - NODE_NAME=chelsty
    healthcheck:
      test: ["CMD", "/bin/sh", "/app/healthcheck.sh"]
      interval: 1m
      timeout: 10s
      retries: 3

volumes:
  opt_homelab:
    external: true
    name: homelab_data # This might vary, but /opt/homelab mount is preferred as direct path.
4
services/stability-agent/env.example
Normal file
@@ -0,0 +1,4 @@
STABILITY_CHECK_INTERVAL=60
DISK_THRESHOLD_PCT=90
MQTT_HOST=mosquitto
MQTT_PORT=1883
25
services/stability-agent/healthcheck.sh
Normal file
@@ -0,0 +1,25 @@
#!/bin/sh

HEARTBEAT_FILE="/opt/homelab/state/stability-agent.heartbeat"
MAX_AGE_SECONDS=300 # 5 minutes

if [ ! -f "$HEARTBEAT_FILE" ]; then
    echo "Heartbeat file missing"
    exit 1
fi

# Get current time in seconds
NOW=$(date +%s)

# Get file modification time in seconds
# GNU stat (Debian-based python:3.11-slim) and BusyBox stat both accept -c %Y
FILE_TIME=$(stat -c %Y "$HEARTBEAT_FILE")

AGE=$((NOW - FILE_TIME))

if [ "$AGE" -gt "$MAX_AGE_SECONDS" ]; then
    echo "Heartbeat is too old: ${AGE}s"
    exit 1
fi

exit 0
24
services/stability-agent/service.yaml
Normal file
@@ -0,0 +1,24 @@
service:
  name: stability-agent
  owner_node: chelsty
  exposure: private
  dependencies: []
  healthcheck:
    type: custom
    interval: 60s
    timeout: 10s
    retries: 3
  restart_policy: unless-stopped
  persistence:
    paths:
      - /opt/homelab/state
      - /opt/homelab/events
  runtime:
    directories:
      - /opt/homelab/state
      - /opt/homelab/events
    env_vars:
      - STABILITY_CHECK_INTERVAL
      - DISK_THRESHOLD_PCT
      - MQTT_HOST
      - MQTT_PORT
210
services/stability-agent/src/stability_agent.py
Normal file
@@ -0,0 +1,210 @@
import os
import time
import json
import datetime
import uuid
import socket
import shutil
import http.client

# Configuration from environment
CHECK_INTERVAL = int(os.environ.get("STABILITY_CHECK_INTERVAL", "60"))
DISK_THRESHOLD_PCT = float(os.environ.get("DISK_THRESHOLD_PCT", "90.0"))
MQTT_HOST = os.environ.get("MQTT_HOST")
MQTT_PORT = int(os.environ.get("MQTT_PORT", "1883"))
NODE_NAME = os.environ.get("NODE_NAME", "chelsty")
SOURCE = "stability-agent"

STATE_DIR = "/opt/homelab/state"
EVENTS_BASE_DIR = "/opt/homelab/events"
HEARTBEAT_FILE = os.path.join(STATE_DIR, "stability-agent.heartbeat")
STATUS_FILE = os.path.join(STATE_DIR, "stability-agent.json")

def get_timestamp():
    return datetime.datetime.utcnow().isoformat() + "Z"

def get_datestamp():
    return datetime.datetime.utcnow().strftime("%Y-%m-%d")

def emit_event(event_type, severity, message, service=None, details=None):
    event = {
        "id": str(uuid.uuid4()),
        "timestamp": get_timestamp(),
        "node": NODE_NAME,
        "source": SOURCE,
        "type": event_type,
        "severity": severity,
        "message": message,
        "details": details or {}
    }
    if service:
        event["service"] = service

    date_str = get_datestamp()
    event_dir = os.path.join(EVENTS_BASE_DIR, date_str, NODE_NAME)
    try:
        os.makedirs(event_dir, exist_ok=True)
        event_file = os.path.join(event_dir, "events.jsonl")
        with open(event_file, "a") as f:
            f.write(json.dumps(event) + "\n")
    except Exception as e:
        print(f"Failed to write event to filesystem: {e}")

    print(f"[{severity}] {message}")

def check_disk():
    total, used, free = shutil.disk_usage("/")
    percent = (used / total) * 100
    details = {
        "total_gb": total // (2**30),
        "used_gb": used // (2**30),
        "free_gb": free // (2**30),
        "percent": round(percent, 2)
    }

    if percent > DISK_THRESHOLD_PCT:
        emit_event("disk_usage_high", "warning", f"Disk usage is high: {details['percent']}%", details=details)

    return details

class DockerClient:
    def __init__(self, socket_path="/var/run/docker.sock"):
        self.socket_path = socket_path

    def _request(self, path):
        class UnixHTTPConnection(http.client.HTTPConnection):
            def __init__(self, socket_path):
                super().__init__("localhost")
                self.socket_path = socket_path
            def connect(self):
                self.sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
                self.sock.settimeout(5.0)
                self.sock.connect(self.socket_path)

        if not os.path.exists(self.socket_path):
            return None

        conn = UnixHTTPConnection(self.socket_path)
        try:
            conn.request("GET", path)
            res = conn.getresponse()
            if res.status == 200:
                return json.loads(res.read().decode())
            return None
        except Exception as e:
            print(f"Docker API error: {e}")
            return None
        finally:
            conn.close()

    def get_containers(self):
        return self._request("/containers/json?all=1")

def check_docker():
    client = DockerClient()
    containers = client.get_containers()
    if containers is None:
        emit_event("docker_socket_error", "error", "Could not connect to Docker socket or socket missing")
        return {"status": "error", "error": "Could not connect to Docker socket"}

    summary = []
    unhealthy_containers = []
    for c in containers:
        state = c.get("State", "")
        status = c.get("Status", "")
        name = c.get("Names", ["unknown"])[0].lstrip("/")

        container_info = {
            "name": name,
            "state": state,
            "status": status
        }
        summary.append(container_info)

        if state != "running":
            unhealthy_containers.append(container_info)

    if unhealthy_containers:
        names = [c["name"] for c in unhealthy_containers]
        # Only emit warning for containers that should be running?
        # For now, we report any non-running container found by Docker.
        emit_event("containers_not_running", "warning", f"Some containers are not running: {', '.join(names)}", details={"containers": unhealthy_containers})

    return {"status": "ok", "containers": summary}

def check_tailscale():
    # Check for tailscale socket or interface
    socket_path = "/var/run/tailscale/tailscaled.sock"
    socket_available = os.path.exists(socket_path)
    interface_available = os.path.exists("/sys/class/net/tailscale0")

    return {
        "available": socket_available or interface_available,
        "details": {
            "socket": socket_available,
            "interface": interface_available
        }
    }

def check_mqtt():
    if not MQTT_HOST:
        return {"configured": False}

    try:
        with socket.create_connection((MQTT_HOST, MQTT_PORT), timeout=5):
            return {"configured": True, "reachable": True}
    except Exception as e:
        emit_event("mqtt_unreachable", "error", f"MQTT broker at {MQTT_HOST}:{MQTT_PORT} is unreachable", details={"error": str(e)})
        return {"configured": True, "reachable": False, "error": str(e)}

def main():
    print(f"Starting stability-agent on {NODE_NAME}...")

    # Ensure directories exist
    os.makedirs(STATE_DIR, exist_ok=True)
    os.makedirs(EVENTS_BASE_DIR, exist_ok=True)

    while True:
        try:
            status = {
                "timestamp": get_timestamp(),
                "node": NODE_NAME,
                "checks": {}
            }

            status["checks"]["disk"] = check_disk()
            status["checks"]["docker"] = check_docker()
            status["checks"]["tailscale"] = check_tailscale()
            status["checks"]["mqtt"] = check_mqtt()

            # Zigbee2MQTT container check
            z2m_present = False
            z2m_running = False
            if status["checks"]["docker"]["status"] == "ok":
                for c in status["checks"]["docker"]["containers"]:
                    if "zigbee2mqtt" in c["name"]:
                        z2m_present = True
                        if c["state"] == "running":
                            z2m_running = True

            status["checks"]["zigbee2mqtt"] = {
                "present": z2m_present,
                "running": z2m_running
            }

            # Write heartbeat
            with open(HEARTBEAT_FILE, "w") as f:
                f.write(get_timestamp())

            # Write status summary
            with open(STATUS_FILE, "w") as f:
                json.dump(status, f, indent=2)

        except Exception as e:
            print(f"Error in main loop: {e}")
            emit_event("agent_error", "error", f"Internal agent error: {e}", details={"error": str(e)})

        time.sleep(CHECK_INTERVAL)

if __name__ == "__main__":
    main()