# homelab-codex-ws/hosts/chelsty-infra/services.yaml

host: chelsty-infra
site: chelsty

services:
  ha-diag-agent:
    role: ha-diagnostic-agent
    deployment_model: docker-compose
    exposure: local-only
    offline_required: false
    depends_on:
      local: []
      external: [homeassistant]
    config:
      target_url: http://100.70.180.90:8123  # chelsty-ha via Tailscale (HAOS, separate VM)
      location_tag: "chelsty"
      events_dir: /opt/homelab/events/chelsty-infra
    runtime:
      config_path: /opt/homelab/config/ha-diag-agent
      data_path: /var/lib/ha-diag-agent

  node-agent:
    role: node-stability-monitor
    # LTE node: node-agent monitors and emits events but does NO Docker cleanup.
    # Disk pressure on chelsty-infra is typically Frigate recordings; Frigate's
    # own retain policy is the correct remediation, not docker prune.
    deployment_model: docker-compose
    exposure: local-only
    offline_required: true
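    # Illustrative sketch only: the keys below are hypothetical and are NOT part
    # of this schema. They show one way the no-cleanup policy described in the
    # comments above could be declared explicitly rather than left as prose:
    #   cleanup:
    #     docker_prune: false    # assumption: monitoring and event emission only
    #     remediation: frigate-retain-policy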

  mosquitto:
    role: local-mqtt-broker

  zigbee2mqtt:
    role: zigbee-mqtt-bridge

  frigate:
    role: nvr

# Change history, deduplicated from the per-line blame annotations.
# Every commit carries: Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
#
# 2026-05-20 14:20:49 +02:00
#   refactor(hosts): split chelsty monolith into chelsty-ha and chelsty-infra
#   - remove legacy hosts/chelsty/ monolith
#   - chelsty-infra: add capabilities, networking, paths, runtime
#     (mosquitto, zigbee2mqtt, stability-agent)
#   - chelsty-ha: add capabilities
#   - align with site/node model
#   Blamed lines: host, site, services, mosquitto, zigbee2mqtt
#
# 2026-05-21 18:19:45 +02:00
#   feat: add Frigate NVR deployment for chelsty-infra
#   VAAPI decode via Intel UHD 630, CPU detection, 2x Reolink RLC-540
#   placeholders. MQTT to local mosquitto (127.0.0.1), 7-day recording
#   retention. Secrets in /opt/homelab/config/frigate/frigate.env on node.
#   Blamed lines: frigate
#
# 2026-05-27 13:15:06 +02:00
#   feat(node-agent): implement health monitor and safe cleanup policy
#   scripts/monitor/health-monitor.sh (new):
#   - Standalone bash health monitor: disk/RAM/CPU checks + docker container health
#   - Per-node-type cleanup policy enforced:
#       lte_node (chelsty-infra, chelsty-ha): NO cleanup, no docker ops
#       sd_card (piha, saturn): dangling images + containers, rate-limited once/24h
#       ai_node (solaria): dangling + containers + build cache, NEVER -a
#       standard (vps): dangling + containers + build cache + CP filesystem rotation
#   - VPS filesystem rotation: completed/failed actions >7d, deploy logs >30d,
#     events >3d AND past observer checkpoint
#   - Emits structured JSON events (node_health, disk_pressure, high_memory,
#     high_cpu, containers_not_running, healthcheck_failed)
#   services/node-agent/ (new):
#   - Python daemon (node_agent.py): same policy as bash script, Docker SDK for
#     container checks and cleanup, /proc for system metrics
#   - Optional event shipping to VPS via rsync+SSH (VPS_EVENTS_HOST env var)
#   - Dockerfile: python:3.11-slim + openssh-client + rsync + docker>=6.0
#   - docker-compose.yml: mounts docker socket, /opt/homelab, repo read-only
#   observer.py:
#   - Handle node_health: update node status + disk/mem/cpu metrics, clear disk_pressure
#   - Handle disk_pressure: record severity on node, clear when healthy
#   - Handle high_memory / high_cpu: record pressure level for correlation
#   supervisor.py:
#   - Add NO_DISK_CLEANUP_NODES = {chelsty-infra, chelsty-ha}
#   - reconcile() step 3: generate disk_cleanup actions for nodes with high disk pressure
#   - _generate_disk_cleanup_recommendation(): stable ID disk-cleanup-{node},
#     checks all active states, risk=guarded (operator approval required)
#   executor.py:
#   - Handle disk_cleanup action type via _execute_disk_cleanup()
#   - Commands come from action payload; safety gate rejects any command touching
#     /opt/homelab/data/, /opt/homelab/config/, /opt/homelab/state/, or rm -rf /
#   hosts/*/services.yaml:
#   - Rename stability-agent -> node-agent on piha, vps, solaria, chelsty-infra
#   - Add node-agent to chelsty-ha (previously missing)
#   - Add cleanup policy notes to LTE node comments
#   Blamed lines: node-agent block
#
# 2026-05-29 12:26:34 +02:00
#   feat(ha-diag-agent): scaffold service with HA REST client and event emitter
#   - new per-host service, follows node-agent pattern
#   - 7 new HA event types defined (routing in supervisor, Phase 5)
#   - HeartbeatCheck as pipeline validator (pings /api/, emits ha_websocket_dead)
#   - service.yaml + host configs for piha (ken) and chelsty-infra (chelsty)
#   - test scaffolding with aiohttp/aiosqlite mocks (15/15 passing)
#   Blamed lines: ha-diag-agent block except target_url
#
# 2026-05-29 12:56:13 +02:00
#   feat(ha-diag-agent): test environment with dual HA Docker instances
#   - dockerized ken + chelsty HA test instances with template fixtures
#   - snapshot/reset/wait scripts for fixture management
#   - integration test infrastructure with separate marker
#   - location_tag promoted from metadata to event payload (Phase 1 flag #3)
#   - chelsty-infra target_url points to chelsty-ha via tailnet (Phase 1 flag #1)
#   Blamed lines: target_url