homelab-codex-ws/services/node-agent/docker-compose.yml

services:
  node-agent:
    build: .
    container_name: node-agent
    restart: unless-stopped

    environment:
      - RUNTIME_PATH=/opt/homelab
      - REPO_ROOT=/repo
      # NODE_NAME must be set to the canonical topology node name, e.g.:
      #   NODE_NAME=piha
      # The agent uses this to determine its cleanup policy (lte_node / sd_card /
      # ai_node / standard) and to tag emitted events with the correct node name.
      - NODE_NAME=${NODE_NAME:-}
      # NODE_TYPE overrides auto-detection if needed:
      #   lte_node | sd_card | ai_node | standard
      - NODE_TYPE=${NODE_TYPE:-}
      # VPS event shipping (non-VPS nodes only).
      # Set VPS_EVENTS_HOST to the VPS Tailscale hostname or IP so that events
      # emitted on this node are rsynced to the VPS observer.
      # Also mount an SSH key (see commented volume below).
      - VPS_EVENTS_HOST=${VPS_EVENTS_HOST:-}
      - VPS_EVENTS_USER=${VPS_EVENTS_USER:-oskar}
      - VPS_EVENTS_PATH=${VPS_EVENTS_PATH:-/opt/homelab/events}
      # How often (seconds) to run a full health check cycle (default: 60)
      - CHECK_INTERVAL=${CHECK_INTERVAL:-60}
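      # Sketch: Compose resolves the ${VAR:-default} references above from the
      # shell environment or from a `.env` file in the project directory. A
      # `.env` for an SD-card node that ships events might look like this (the
      # host and interval values are hypothetical placeholders, not taken from
      # this repo):
      #   NODE_NAME=piha
      #   NODE_TYPE=sd_card
      #   VPS_EVENTS_HOST=vps.example.ts.net
      #   CHECK_INTERVAL=120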

    volumes:
      # Runtime filesystem — events, state, actions, logs
      - /opt/homelab:/opt/homelab
      # Docker socket — required for container health checks and Docker cleanup
      - /var/run/docker.sock:/var/run/docker.sock
      # Repo (read-only) — scripts and host config accessible to agent
      - ../..:/repo:ro
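      # SSH key for event shipping to VPS.
      # Uncomment and set SSH_KEY_PATH on nodes where VPS_EVENTS_HOST is set.
      # The key is mounted as id_rsa even though the default source is an ed25519
      # key; OpenSSH reads the key type from the file contents, so the name
      # mismatch is harmless.
      # - ${SSH_KEY_PATH:-/home/oskar/.ssh/id_ed25519}:/root/.ssh/id_rsa:ro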

    healthcheck:
      test: ["CMD", "test", "-f", "/opt/homelab/state/node-agent.heartbeat"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 15s
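      # Sketch, assuming node_agent.py rewrites the heartbeat file on every check
      # cycle (this file does not confirm that). `test -f` only proves the file
      # exists, and because /opt/homelab is a bind mount, a heartbeat left behind
      # by an earlier run also passes. If the agent does refresh the file's
      # mtime, a freshness check along these lines could replace the `test`
      # entry above. The 3-minute window is an illustrative margin over the
      # default CHECK_INTERVAL of 60 seconds, and `$$` keeps Compose from
      # interpolating the shell substitution:
      # test: ["CMD-SHELL", "test -n \"$$(find /opt/homelab/state/node-agent.heartbeat -mmin -3)\""]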

feat(node-agent): implement health monitor and safe cleanup policy

scripts/monitor/health-monitor.sh (new):
- Standalone bash health monitor: disk/RAM/CPU checks + docker container health
- Per-node-type cleanup policy enforced:
    lte_node (chelsty-infra, chelsty-ha): NO cleanup, no docker ops
    sd_card (piha, saturn): dangling images + containers, rate-limited once/24h
    ai_node (solaria): dangling + containers + build cache, NEVER -a
    standard (vps): dangling + containers + build cache + CP filesystem rotation
- VPS filesystem rotation: completed/failed actions >7d, deploy logs >30d,
  events >3d AND past observer checkpoint
- Emits structured JSON events (node_health, disk_pressure, high_memory,
  high_cpu, containers_not_running, healthcheck_failed)

services/node-agent/ (new):
- Python daemon (node_agent.py): same policy as bash script, Docker SDK for
  container checks and cleanup, /proc for system metrics
- Optional event shipping to VPS via rsync+SSH (VPS_EVENTS_HOST env var)
- Dockerfile: python:3.11-slim + openssh-client + rsync + docker>=6.0
- docker-compose.yml: mounts docker socket, /opt/homelab, repo read-only

observer.py:
- Handle node_health: update node status + disk/mem/cpu metrics, clear
  disk_pressure
- Handle disk_pressure: record severity on node, clear when healthy
- Handle high_memory / high_cpu: record pressure level for correlation

supervisor.py:
- Add NO_DISK_CLEANUP_NODES = {chelsty-infra, chelsty-ha}
- reconcile() step 3: generate disk_cleanup actions for nodes with high disk
  pressure
- _generate_disk_cleanup_recommendation(): stable ID disk-cleanup-{node},
  checks all active states, risk=guarded (operator approval required)

executor.py:
- Handle disk_cleanup action type via _execute_disk_cleanup()
- Commands come from action payload; safety gate rejects any command touching
  /opt/homelab/data/, /opt/homelab/config/, /opt/homelab/state/, or rm -rf /

hosts/*/services.yaml:
- Rename stability-agent -> node-agent on piha, vps, solaria, chelsty-infra
- Add node-agent to chelsty-ha (previously missing)
- Add cleanup policy notes to LTE node comments

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 13:15:06 +02:00