feat(node-agent): implement health monitor and safe cleanup policy

scripts/monitor/health-monitor.sh (new):
- Standalone bash health monitor: disk/RAM/CPU checks + docker container health
- Per-node-type cleanup policy enforced:
    lte_node (chelsty-infra, chelsty-ha): NO cleanup, no docker ops
    sd_card (piha, saturn): dangling images + containers, rate-limited once/24h
    ai_node (solaria): dangling + containers + build cache, NEVER -a
    standard (vps): dangling + containers + build cache + CP filesystem rotation
- VPS filesystem rotation: completed/failed actions >7d, deploy logs >30d,
  events >3d AND past observer checkpoint
- Emits structured JSON events (node_health, disk_pressure, high_memory, high_cpu,
  containers_not_running, healthcheck_failed)
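The per-node-type policy above can be read as a small lookup table. The following is only a sketch of that idea, not the script's actual code: the `CleanupPolicy` class, the `POLICIES` table, and `allowed_steps()` are hypothetical names, and "containers" is assumed to mean stopped containers.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class CleanupPolicy:
    """Which cleanup steps a node type may run (hypothetical structure)."""
    dangling_images: bool = False
    containers: bool = False          # assumed: stopped containers only
    build_cache: bool = False
    filesystem_rotation: bool = False
    min_interval_s: int = 0           # 0 means no rate limit


STEP_ORDER = ("dangling_images", "containers", "build_cache", "filesystem_rotation")

# Mirrors the four node types listed in the commit message.
POLICIES = {
    "lte_node": CleanupPolicy(),
    "sd_card": CleanupPolicy(dangling_images=True, containers=True,
                             min_interval_s=24 * 3600),
    "ai_node": CleanupPolicy(dangling_images=True, containers=True,
                             build_cache=True),
    "standard": CleanupPolicy(dangling_images=True, containers=True,
                              build_cache=True, filesystem_rotation=True),
}


def allowed_steps(node_type: str, last_cleanup_ts: Optional[float],
                  now: float) -> List[str]:
    """Return the cleanup steps permitted right now for this node type."""
    policy = POLICIES[node_type]
    rate_limited = (
        policy.min_interval_s > 0
        and last_cleanup_ts is not None
        and now - last_cleanup_ts < policy.min_interval_s
    )
    if rate_limited:
        return []
    return [name for name in STEP_ORDER if getattr(policy, name)]
```

Keeping the policy as data means an LTE node yields an empty step list by construction, so no Docker call is reachable for it.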
services/node-agent/ (new):
- Python daemon (node_agent.py): same policy as bash script, Docker SDK
  for container checks and cleanup, /proc for system metrics
- Optional event shipping to VPS via rsync+SSH (VPS_EVENTS_HOST env var)
- Dockerfile: python:3.11-slim + openssh-client + rsync + docker>=6.0
- docker-compose.yml: mounts docker socket, /opt/homelab, repo read-only
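To illustrate the rsync+SSH event shipping, here is a minimal sketch of how the command line could be assembled. `build_rsync_command()` is a hypothetical helper, not a function from node_agent.py; the defaults for user and remote path are taken from the compose file, and the `BatchMode=yes` option is an assumption made so an unattended daemon never blocks on a password prompt.

```python
import shlex
from typing import List, Optional


def build_rsync_command(events_dir: str, host: str, user: str = "oskar",
                        remote_path: str = "/opt/homelab/events",
                        ssh_key: Optional[str] = None) -> List[str]:
    """Build an argv list that copies local event files to the VPS."""
    ssh_parts = ["ssh", "-o", "BatchMode=yes"]   # assumed: never prompt
    if ssh_key:
        ssh_parts += ["-i", ssh_key]
    return [
        "rsync", "-az",
        "-e", shlex.join(ssh_parts),             # remote shell for rsync
        f"{events_dir.rstrip('/')}/",            # trailing slash: copy contents
        f"{user}@{host}:{remote_path.rstrip('/')}/",
    ]
```

The list can be passed to `subprocess.run(cmd, check=False, timeout=...)`; returning an argv list rather than a shell string avoids quoting problems in host names or paths.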
observer.py:
- Handle node_health: update node status + disk/mem/cpu metrics, clear disk_pressure
- Handle disk_pressure: record severity on node, clear when healthy
- Handle high_memory / high_cpu: record pressure level for correlation
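The observer handlers above amount to folding events into per-node state. The sketch below shows one way to do that; the event fields (`type`, `node`, `status`, `severity`, `level`) and the `apply_event()` function are assumptions for illustration, since the commit message does not give the event schema.

```python
from typing import Any, Dict

NodeState = Dict[str, Any]


def apply_event(nodes: Dict[str, NodeState], event: Dict[str, Any]) -> None:
    """Update in-memory node state from one event (hypothetical schema)."""
    node = nodes.setdefault(event["node"], {})
    kind = event["type"]
    if kind == "node_health":
        node["status"] = event.get("status", "unknown")
        for metric in ("disk", "mem", "cpu"):
            if metric in event:
                node[metric] = event[metric]
        # Assumed: a healthy report clears any recorded disk pressure.
        if node["status"] == "healthy":
            node.pop("disk_pressure", None)
    elif kind == "disk_pressure":
        node["disk_pressure"] = event["severity"]
    elif kind in ("high_memory", "high_cpu"):
        # Stored per event type so later correlation can look at both.
        node[kind] = event.get("level")
```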
supervisor.py:
- Add NO_DISK_CLEANUP_NODES = {chelsty-infra, chelsty-ha}
- reconcile() step 3: generate disk_cleanup actions for nodes with high disk pressure
- _generate_disk_cleanup_recommendation(): stable ID disk-cleanup-{node},
  checks all active states, risk=guarded (operator approval required)
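The stable ID is what keeps reconcile() from queueing the same cleanup twice. A rough sketch of that deduplication follows; the `ACTIVE_STATES` names, the shape of `existing`, and the returned action dict are assumptions, and only `NO_DISK_CLEANUP_NODES`, the ID format, and `risk=guarded` come from the commit message.

```python
from typing import Dict, Optional, Set

NO_DISK_CLEANUP_NODES = {"chelsty-infra", "chelsty-ha"}
ACTIVE_STATES = ("pending", "approved", "running")   # hypothetical state names


def recommend_disk_cleanup(node: str, pressure: str,
                           existing: Dict[str, Set[str]]) -> Optional[dict]:
    """Return a guarded disk_cleanup action, or None if none should be created.

    `existing` maps a state name to the set of action IDs in that state.
    """
    if node in NO_DISK_CLEANUP_NODES or pressure != "high":
        return None
    action_id = f"disk-cleanup-{node}"                # stable, one per node
    if any(action_id in existing.get(state, set()) for state in ACTIVE_STATES):
        return None                                   # already in flight
    return {"id": action_id, "type": "disk_cleanup",
            "node": node, "risk": "guarded"}
```

Because the ID depends only on the node name, repeated reconcile passes under sustained pressure produce at most one open action per node.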
executor.py:
- Handle disk_cleanup action type via _execute_disk_cleanup()
- Commands come from action payload; safety gate rejects any command touching
  /opt/homelab/data/, /opt/homelab/config/, /opt/homelab/state/, or rm -rf /
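A safety gate like the one described could look roughly as follows. This is an illustrative sketch, not the code in executor.py: `is_command_safe()` and `PROTECTED_DIRS` are hypothetical names, and the matching is deliberately conservative (a plain substring test that also rejects look-alike paths).

```python
import shlex

# Directories the commit message lists as off-limits for cleanup commands.
PROTECTED_DIRS = (
    "/opt/homelab/data",
    "/opt/homelab/config",
    "/opt/homelab/state",
)


def is_command_safe(command: str) -> bool:
    """Return False for commands that touch protected paths or remove '/'."""
    # Conservative substring match: also rejects e.g. /opt/homelab/data2.
    if any(path in command for path in PROTECTED_DIRS):
        return False
    try:
        tokens = shlex.split(command)
    except ValueError:
        return False          # unbalanced quotes: refuse rather than guess
    # Any rm invocation that names the filesystem root is rejected.
    if "rm" in tokens and "/" in tokens:
        return False
    return True
```

A denylist of this kind is a last line of defence only; it complements the `risk=guarded` operator approval rather than replacing it.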
hosts/*/services.yaml:
- Rename stability-agent -> node-agent on piha, vps, solaria, chelsty-infra
- Add node-agent to chelsty-ha (previously missing)
- Add cleanup policy notes to LTE node comments
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 13:15:06 +02:00
services:
  node-agent:
    build: .
    container_name: node-agent
2026-06-03 18:20:31 +02:00
    user: "1000:1000"
    group_add:
      - "999"
    restart: unless-stopped

    environment:
      - RUNTIME_PATH=/opt/homelab
      - REPO_ROOT=/repo
      # NODE_NAME must be set to the canonical topology node name, e.g.:
      # NODE_NAME=piha
      # The agent uses this to determine its cleanup policy (lte_node / sd_card /
      # ai_node / standard) and to tag emitted events with the correct node name.
      - NODE_NAME=${NODE_NAME:-}
      # NODE_TYPE overrides auto-detection if needed:
      # lte_node | sd_card | ai_node | standard
      - NODE_TYPE=${NODE_TYPE:-}
      # VPS event shipping (non-VPS nodes only).
      # Set VPS_EVENTS_HOST to the VPS Tailscale hostname or IP so that events
      # emitted on this node are rsynced to the VPS observer.
      # Also mount an SSH key (see commented volume below).
      - VPS_EVENTS_HOST=${VPS_EVENTS_HOST:-}
      - VPS_EVENTS_USER=${VPS_EVENTS_USER:-oskar}
      - VPS_EVENTS_PATH=${VPS_EVENTS_PATH:-/opt/homelab/events}
      # How often (seconds) to run a full health check cycle (default: 60)
      - CHECK_INTERVAL=${CHECK_INTERVAL:-60}

    volumes:
      # Runtime filesystem — events, state, actions, logs
      - /opt/homelab:/opt/homelab
      # Docker socket — required for container health checks and Docker cleanup
      - /var/run/docker.sock:/var/run/docker.sock
      # Repo (read-only) — scripts and host config accessible to agent
      - ../..:/repo:ro
      # SSH key for event shipping to VPS.
      # Uncomment and set SSH_KEY_PATH on nodes where VPS_EVENTS_HOST is set:
      # - ${SSH_KEY_PATH:-/home/oskar/.ssh/id_ed25519}:/root/.ssh/id_rsa:ro

    healthcheck:
      test: ["CMD", "test", "-f", "/opt/homelab/state/node-agent.heartbeat"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 15s
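The healthcheck above only tests that the heartbeat file exists, so a file left behind by a hung agent would still pass. If the agent rewrites the file on every check cycle (an assumption, not confirmed by the compose file), a freshness check on its modification time is stricter. A minimal sketch, with `heartbeat_is_fresh()` as a hypothetical helper:

```python
import os
import time
from typing import Optional


def heartbeat_is_fresh(path: str, max_age_s: float,
                       now: Optional[float] = None) -> bool:
    """Return True if the heartbeat file exists and was modified recently."""
    try:
        mtime = os.stat(path).st_mtime
    except OSError:
        return False                      # missing or unreadable: unhealthy
    current = time.time() if now is None else now
    return (current - mtime) <= max_age_s
```

With the default CHECK_INTERVAL of 60 seconds, a `max_age_s` of two or three intervals would tolerate one slow cycle while still flagging a stalled agent.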