Commit graph

12 commits

Author SHA1 Message Date
Oskar Kapala 1db9db7d03 fix(dashboard): read last_update from JSON content, not file mtime
operator_ui.py called .replace() on last_update without checking type —
an integer value (written by the materializer) raised AttributeError and
silently fell back to os.path.getmtime(), which was stuck at 5/29 after a
deploy with preserved timestamps. web.py had the same class of bug but
worse: it unconditionally replaced last_update with mtime, ignoring the
JSON field entirely. Both now branch on isinstance(str) and cast numeric
values directly to float, with mtime only as a last-resort fallback.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-31 22:10:50 +02:00
Oskar Kapala 52607a7cdd feat(control-plane): shadow_mode for HA event auto-actions + deploy docs
- HA_DIAG_SHADOW_MODE env flag in supervisor (default true)
- shadow_mode downgrades container_restart actions to alert_only with
  [SHADOW MODE] note; same action_id and 30-min cooldown apply
- alert_only events unaffected (always routed normally)
- 3 new tests: shadow on/off for ha_websocket_dead, alert-only unaffected
- DEPLOY.md with token gen, per-host config, verification, 48h observation,
  production-mode enablement, rollback
- README.md updated with shadow mode flag summary and DEPLOY.md link

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-29 17:12:33 +02:00
Oskar Kapala bf1415e4c1 feat(control-plane): route ha-diag-agent events through supervisor
- 8 HA event types mapped to existing action types
- ha_websocket_dead → container_restart (homeassistant), 30-min cooldown
- 6 events → alert_only (entity_unavailable, integration_failed,
  automation_failing, update_available, recorder_lag,
  system_health_degraded), 1-hour cooldown
- ha_websocket_recovered → cancels matching pending container_restart
- state-aware suppression: skip HA events when homeassistant has an
  active containers_not_running incident < 5 min ago (avoids alert
  storms during HA restarts/updates)
- location_tag preserved through action pipeline for per-house
  telegram alerts
- executor: alert_only acknowledged as no-op success
- 18 tests covering all 8 event types, suppression, cooldown,
  dedup, location_tag, recovery cancellation
- CLAUDE.md: supervisor event routing table added

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-29 15:59:23 +02:00
Oskar Kapala 46ae92b5c1 supervisor: also cancel pending actions for services removed from desired state
Previously _cancel_resolved_pending_actions() only cancelled actions where
the service became healthy. This left orphaned actions when a service was
removed from services.yaml or marked monitor:false.

Add Case 1: if the action's svc_key is no longer in desired_state (either
removed entirely or skipped due to monitor:false), cancel with reason
service_removed_from_desired_state.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 15:19:13 +02:00
Oskar Kapala 51002d4502 Fix pending actions: node_exporter, zigbee2mqtt, chelsty-ha monitoring
node_exporter (new service):
- Add services/node_exporter/docker-compose.yml matching solaria deployment
  (network_mode: host, pid: host, /:/host:ro,rslave mount)
- Add services/node_exporter/service.yaml

zigbee2mqtt chelsty-infra override:
- Fix network_mode: host (mosquitto runs on host network, port 1883 on localhost)
- Fix volume mount: ./configuration.yaml → absolute /opt/homelab/config/zigbee2mqtt/
  (secrets stay in runtime config dir, never in Git)
- Remove MQTT_USER/MQTT_PASSWORD (mosquitto uses allow_anonymous true)
- Extend healthcheck start_period to 60s (z2m takes time on first start)

chelsty-ha/services.yaml:
- Remove node-agent entry entirely (never deployed, no plans to bootstrap now)
- Keep homeassistant with monitor: false (no node-agent = no health events)

supervisor: respect monitor: false in services.yaml
- Skip action generation for services where monitor=false
- Cleans up chelsty-ha entries from action queue without removing desired-state docs

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 15:10:48 +02:00
Oskar Kapala fb7828b52b supervisor: auto-cancel pending actions when drift is resolved
When a service becomes healthy (node-agent emits service_healthy → observer
updates services.json), any previously queued redeploy/container_restart
action is stale. Without cleanup, the queue accumulates old actions that
require manual rejection.

_cancel_resolved_pending_actions() runs after each reconcile cycle:
- Reads all pending/*.json with type=redeploy or container_restart
- If the service is now healthy in actual_state, moves action to cancelled/
  with reason=drift_resolved_auto
- Only pending actions are touched; approved/running are left to the operator

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 14:58:55 +02:00
Oskar Kapala 96bf32614f fix(observer+operator-ui): fix stale world state, dict→list API, event time filter
Root cause of stale data:
- node_agent.py falls back to socket.gethostname() when NODE_NAME is unset.
  Inside a Docker container this returns the 12-char container ID (e.g.
  'be17cb6eb0f6'), not the host name.  Observer ingested those events and
  created ghost entries in world/nodes.json that never expired.

observer.py:
- _prune_stale_world(): removes node/service/incident entries for nodes absent
  from topology inventory; called on every run_once() cycle (both new-events
  and idle paths).  Resolved incidents older than 7 days are also aged out.
- _save_world(): now writes node_count and service_count to runtime-summary.json
  so the Dashboard's System Overview cards show real numbers instead of undefined.

operator_ui.py:
- current_nodes/services/deployments/incidents(): the observer stores world state
  as keyed dicts; the frontend calls .map() which requires an array.  All four
  functions now convert the dict to a properly-shaped list.  Each item has the
  fields the Nodes, Services, Topology, Deployments, and Correlation views expect
  (hostname, health, capabilities, desired_state, dependencies, etc.).
- current_incidents(): synthesises a human-readable 'message' field from node +
  service + trigger_type (observer does not store one; dashboard showed undefined).
- current_events(): adds a 24 h time filter (EVENTS_MAX_AGE_HOURS env var,
  default 24).  Without this, every event file ever written was returned,
  including events from ghost-node deploys.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 13:51:03 +02:00
Oskar Kapala 01b7758fe6 feat(node-agent): implement health monitor and safe cleanup policy
scripts/monitor/health-monitor.sh (new):
- Standalone bash health monitor: disk/RAM/CPU checks + docker container health
- Per-node-type cleanup policy enforced:
    lte_node  (chelsty-infra, chelsty-ha): NO cleanup, no docker ops
    sd_card   (piha, saturn): dangling images + containers, rate-limited once/24h
    ai_node   (solaria): dangling + containers + build cache, NEVER -a
    standard  (vps): dangling + containers + build cache + CP filesystem rotation
- VPS filesystem rotation: completed/failed actions >7d, deploy logs >30d,
  events >3d AND past observer checkpoint
- Emits structured JSON events (node_health, disk_pressure, high_memory, high_cpu,
  containers_not_running, healthcheck_failed)

services/node-agent/ (new):
- Python daemon (node_agent.py): same policy as bash script, Docker SDK
  for container checks and cleanup, /proc for system metrics
- Optional event shipping to VPS via rsync+SSH (VPS_EVENTS_HOST env var)
- Dockerfile: python:3.11-slim + openssh-client + rsync + docker>=6.0
- docker-compose.yml: mounts docker socket, /opt/homelab, repo read-only

observer.py:
- Handle node_health: update node status + disk/mem/cpu metrics, clear disk_pressure
- Handle disk_pressure: record severity on node, clear when healthy
- Handle high_memory / high_cpu: record pressure level for correlation

supervisor.py:
- Add NO_DISK_CLEANUP_NODES = {chelsty-infra, chelsty-ha}
- reconcile() step 3: generate disk_cleanup actions for nodes with high disk pressure
- _generate_disk_cleanup_recommendation(): stable ID disk-cleanup-{node},
  checks all active states, risk=guarded (operator approval required)

executor.py:
- Handle disk_cleanup action type via _execute_disk_cleanup()
- Commands come from action payload; safety gate rejects any command touching
  /opt/homelab/data/, /opt/homelab/config/, /opt/homelab/state/, or rm -rf /

hosts/*/services.yaml:
- Rename stability-agent -> node-agent on piha, vps, solaria, chelsty-infra
- Add node-agent to chelsty-ha (previously missing)
- Add cleanup policy notes to LTE node comments

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 13:15:06 +02:00
Oskar Kapala 7742bda245 feat(control-plane): add container_restart remediation
- observer: store trigger_type on incidents for supervisor routing
- supervisor: route containers_not_running/mqtt_unreachable to container_restart instead of redeploy
- supervisor: fix node alias normalization via NODE_ALIAS_MAP
- supervisor: fix pending action dedup (scan by content not filename)
- executor: implement container_restart via SSH docker restart with retry
- control-plane override: configure NODE_ALIAS_MAP for production

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 12:50:46 +02:00
oskar 9b39581b53 fix(supervisor): content-based action IDs to prevent 30s backlog accumulation
Timestamp in reconcile-{ts}-{node}-{service} meant dedup guard never fired.
Switch to reconcile-{node}-{service} and check pending/approved/running states.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 17:47:37 +02:00
Oskar Kapala 533b8e846d Add heartbeat updates and improve health checks in control-plane components 2026-05-12 20:59:46 +02:00
Oskar Kapala 2029457f57 Implement VPS control-plane deployment profile 2026-05-12 20:19:05 +02:00