Commit graph

26 commits

Author SHA1 Message Date
Oskar Kapala c9ee8eb06d fix(observer): quarantine malformed event files to prevent processing wedge
Recovery from bad merge of task/observer-poison-quarantine (c255a02)
which carried false deletes from a stale branch base. Re-applies only
the genuine observer changes on top of correct master state.

When an event file fails to parse (malformed JSON, truncated, corrupted),
the observer previously kept retrying on every cycle while the node's
checkpoint stayed pinned — all subsequent good events for that node lost.

Now: first parse failure -> atomic os.replace to STATE_DIR/observer_failed_events/<node>/
with collision handling. Checkpoint advances, downstream events flow.
Move failures are logged but don't crash the loop.

Complementary to the atomic_write_json fix on state files; this addresses
the same race-pattern on event files instead.

Regression test asserts: bad event quarantined to failed_events dir,
removed from hot path, subsequent good event processed (node online),
checkpoint moves to good event.
2026-06-12 13:11:15 +02:00
Oskar Kapala 7f17b65278 fix(control-plane): run executor as uid 1000 with docker group access
Executor was the only control-plane container running as root (uid=0),
writing root-owned files to /opt/homelab via bind-mount and triggering
false sudo on every deploy.

- Dockerfile: add USER homelab after useradd (useradd already present)
- docker-compose.yml: add user: "1000:1000" and group_add: ["999"]
  (GID 999 = docker group on VPS) so executor retains docker.sock access

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-03 18:19:58 +02:00
Oskar Kapala 00fc36df3a fix(deploy): skip sudo chown/chmod when /opt/homelab ownership is already correct
deploy-local.sh previously ran `sudo chown -R 1000:1000` and
`sudo chmod -R 775` unconditionally on every deploy, which blocked
non-TTY execution (CC/CI) on VPS where /opt/homelab is already 1000:1000.

Both steps are now conditional using `find ... -print -quit`:
- chown: runs only if any file/dir is NOT uid/gid 1000
- chmod: runs only if any directory is missing -775 permission bits

When everything is correct (steady state on VPS), both steps log
"already correct, skipping" and never invoke sudo.  If a new directory
was created by root (e.g. a manual mkdir, volume mount, or restart artefact),
the remediation path triggers automatically — the self-heal property is preserved.

Smoke-tested in Docker (ubuntu:22.04):
  Case 1 (1000:1000 + 775):  chown skipped, chmod skipped ✓
  Case 2 (root-owned subdir): chown triggered ✓
  Case 3 (700 dir perms):     chmod triggered ✓

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-03 15:44:44 +02:00
Oskar Kapala f5dcefc752 fix(observer): robust incident lifecycle + orphan auto-resolve
Two root causes for stale "active" incidents on the dashboard:

1. TypeError bug in _prune_stale_world: last_occurrence / resolved_at
   can be an ISO-8601 string (stability-agent via events.py) or a Unix
   int (node-agent).  The previous session's auto-resolve did plain
   `time.time() - last_occ` which raises TypeError for strings,
   silently preventing _save_world() from being called and leaving
   incidents perpetually "active" on disk.

   Fix: add _parse_ts(ts) -> float that handles int, float, and
   ISO-8601 strings uniformly. All timestamp arithmetic now goes through
   it; returns 0.0 on None / garbage to keep comparisons safe.

2. Orphaned active incidents: _resolve_incident clears service["incident_id"]
   and marks the incident "resolved" in memory, but if incidents.json was
   truncated mid-write (pre-atomic-write era), the observer loaded it at
   next startup with status="active" and no service entry pointing to it.
   No code ever touched these orphans again.

   Fix: _prune_stale_world now runs two cleanup passes each cycle:
   - Case 1 (healthy-linked): service.status=="healthy" AND incident_id
     still set → resolve immediately (service cannot have active incident)
   - Case 2 (orphaned): active incident with no service link AND
     last_occurrence > 5 min ago → resolve (5-min guard for creation race)

   Both cases are wrapped in try/except so a bug here never crashes the
   observer loop or blocks _save_world.

   Also fixes the 7-day stale-incident prune to use _parse_ts so
   ISO-string resolved_at values are handled correctly.

3. Operator UI: current_incidents() now filters to status=="active" only.
   Resolved incidents were previously included in the /incidents endpoint,
   making the dashboard show a wall of historical records as if active.

Nocturnal job investigation: _cleanup_control_plane_fs in node-agent runs
every 60s on VPS (not midnight-specific); it reads observer_checkpoint.json
(now written atomically) and deletes old event files. No non-atomic writes
found. Midnight clustering was likely external (logrotate / OS flush);
the supervisor's resilient loader already handles such transient issues.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-03 14:29:12 +02:00
Oskar Kapala 98437d46b2 test(control-plane): atomic write and resilient loader coverage
11 new test cases in test_state_reliability.py covering:
- atomic_write_json: produces valid JSON, no .tmp left behind, overwrites,
  works with nested structures
- _load_actual_state: returns False on empty / truncated file, returns True
  on valid files, preserves last-known-good state across a parse failure
- reconcile: empty/truncated services.json or incidents.json generates zero
  actions (skip-cycle semantics proven end-to-end)
- healthy service with valid world state generates no spurious action

All 32 tests (11 new + 21 existing) pass.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-03 12:27:05 +02:00
Oskar Kapala 5e97b4e448 fix(supervisor): atomic writes + skip cycle on unreadable world state
Two independent fixes for the false-alarm storm caused by race-condition
reads of truncated world state files:

1. Atomic writes: _atomic_write_json (write→fsync→os.replace) replaces
   all bare open('w')+json.dump calls in supervisor and executor, so the
   action-file pipeline is never visible in a half-written state.

2. Resilient loader: _load_actual_state now returns False when any world
   state file fails to parse (empty or truncated mid-write). reconcile()
   skips the entire drift check on False instead of treating {} as "all
   services missing". actual_state retains its last-known-good values so
   a single bad cycle does not wipe accumulated context.

   Before: parse error → raw[key]={} → all desired services missing →
     wall of redeploy actions → drift_resolved_auto churn on next cycle.
   After:  parse error → WARNING logged → cycle skipped → no actions.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-03 12:26:59 +02:00
Oskar Kapala 495741e7ac operator-ui: /events bez ladowania calego katalogu + daemon threads; epoch z regexa (fix chelsty-infra)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-01 16:34:52 +02:00
Oskar Kapala 43c5d45353 deploy: chmod/chown na /opt/homelab odporne na znikające pliki eventow 2026-06-01 14:35:19 +02:00
Oskar Kapala f64cec645e vps: mem_limit + oom_score_adj na serwisach in-repo; deploy-local stosuje override (stop OOM) 2026-06-01 14:23:58 +02:00
Oskar Kapala 1db9db7d03 fix(dashboard): read last_update from JSON content, not file mtime
operator_ui.py called .replace() on last_update without checking type —
an integer value (written by the materializer) raised AttributeError and
silently fell back to os.path.getmtime(), which was stuck at 5/29 after a
deploy with preserved timestamps. web.py had the same class of bug but
worse: it unconditionally replaced last_update with mtime, ignoring the
JSON field entirely. Both now branch on isinstance(str) and cast numeric
values directly to float, with mtime only as a last-resort fallback.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-31 22:10:50 +02:00
Oskar Kapala 52607a7cdd feat(control-plane): shadow_mode for HA event auto-actions + deploy docs
- HA_DIAG_SHADOW_MODE env flag in supervisor (default true)
- shadow_mode downgrades container_restart actions to alert_only with
  [SHADOW MODE] note; same action_id and 30-min cooldown apply
- alert_only events unaffected (always routed normally)
- 3 new tests: shadow on/off for ha_websocket_dead, alert-only unaffected
- DEPLOY.md with token gen, per-host config, verification, 48h observation,
  production-mode enablement, rollback
- README.md updated with shadow mode flag summary and DEPLOY.md link

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-29 17:12:33 +02:00
Oskar Kapala bf1415e4c1 feat(control-plane): route ha-diag-agent events through supervisor
- 8 HA event types mapped to existing action types
- ha_websocket_dead → container_restart (homeassistant), 30-min cooldown
- 6 events → alert_only (entity_unavailable, integration_failed,
  automation_failing, update_available, recorder_lag,
  system_health_degraded), 1-hour cooldown
- ha_websocket_recovered → cancels matching pending container_restart
- state-aware suppression: skip HA events when homeassistant has an
  active containers_not_running incident < 5 min ago (avoids alert
  storms during HA restarts/updates)
- location_tag preserved through action pipeline for per-house
  telegram alerts
- executor: alert_only acknowledged as no-op success
- 18 tests covering all 8 event types, suppression, cooldown,
  dedup, location_tag, recovery cancellation
- CLAUDE.md: supervisor event routing table added

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-29 15:59:23 +02:00
Oskar Kapala 46ae92b5c1 supervisor: also cancel pending actions for services removed from desired state
Previously _cancel_resolved_pending_actions() only cancelled actions where
the service became healthy. This left orphaned actions when a service was
removed from services.yaml or marked monitor:false.

Add Case 1: if the action's svc_key is no longer in desired_state (either
removed entirely or skipped due to monitor:false), cancel with reason
service_removed_from_desired_state.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 15:19:13 +02:00
Oskar Kapala 51002d4502 Fix pending actions: node_exporter, zigbee2mqtt, chelsty-ha monitoring
node_exporter (new service):
- Add services/node_exporter/docker-compose.yml matching solaria deployment
  (network_mode: host, pid: host, /:/host:ro,rslave mount)
- Add services/node_exporter/service.yaml

zigbee2mqtt chelsty-infra override:
- Fix network_mode: host (mosquitto runs on host network, port 1883 on localhost)
- Fix volume mount: ./configuration.yaml → absolute /opt/homelab/config/zigbee2mqtt/
  (secrets stay in runtime config dir, never in Git)
- Remove MQTT_USER/MQTT_PASSWORD (mosquitto uses allow_anonymous true)
- Extend healthcheck start_period to 60s (z2m takes time on first start)

chelsty-ha/services.yaml:
- Remove node-agent entry entirely (never deployed, no plans to bootstrap now)
- Keep homeassistant with monitor: false (no node-agent = no health events)

supervisor: respect monitor: false in services.yaml
- Skip action generation for services where monitor=false
- Cleans up chelsty-ha entries from action queue without removing desired-state docs

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 15:10:48 +02:00
Oskar Kapala fb7828b52b supervisor: auto-cancel pending actions when drift is resolved
When a service becomes healthy (node-agent emits service_healthy → observer
updates services.json), any previously queued redeploy/container_restart
action is stale. Without cleanup, the queue accumulates old actions that
require manual rejection.

_cancel_resolved_pending_actions() runs after each reconcile cycle:
- Reads all pending/*.json with type=redeploy or container_restart
- If the service is now healthy in actual_state, moves action to cancelled/
  with reason=drift_resolved_auto
- Only pending actions are touched; approved/running are left to the operator

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 14:58:55 +02:00
Oskar Kapala 96bf32614f fix(observer+operator-ui): fix stale world state, dict→list API, event time filter
Root cause of stale data:
- node_agent.py falls back to socket.gethostname() when NODE_NAME is unset.
  Inside a Docker container this returns the 12-char container ID (e.g.
  'be17cb6eb0f6'), not the host name.  Observer ingested those events and
  created ghost entries in world/nodes.json that never expired.

observer.py:
- _prune_stale_world(): removes node/service/incident entries for nodes absent
  from topology inventory; called on every run_once() cycle (both new-events
  and idle paths).  Resolved incidents older than 7 days are also aged out.
- _save_world(): now writes node_count and service_count to runtime-summary.json
  so the Dashboard's System Overview cards show real numbers instead of undefined.

operator_ui.py:
- current_nodes/services/deployments/incidents(): the observer stores world state
  as keyed dicts; the frontend calls .map() which requires an array.  All four
  functions now convert the dict to a properly-shaped list.  Each item has the
  fields the Nodes, Services, Topology, Deployments, and Correlation views expect
  (hostname, health, capabilities, desired_state, dependencies, etc.).
- current_incidents(): synthesises a human-readable 'message' field from node +
  service + trigger_type (observer does not store one; dashboard showed undefined).
- current_events(): adds a 24 h time filter (EVENTS_MAX_AGE_HOURS env var,
  default 24).  Without this, every event file ever written was returned,
  including events from ghost-node deploys.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 13:51:03 +02:00
Oskar Kapala 01b7758fe6 feat(node-agent): implement health monitor and safe cleanup policy
scripts/monitor/health-monitor.sh (new):
- Standalone bash health monitor: disk/RAM/CPU checks + docker container health
- Per-node-type cleanup policy enforced:
    lte_node  (chelsty-infra, chelsty-ha): NO cleanup, no docker ops
    sd_card   (piha, saturn): dangling images + containers, rate-limited once/24h
    ai_node   (solaria): dangling + containers + build cache, NEVER -a
    standard  (vps): dangling + containers + build cache + CP filesystem rotation
- VPS filesystem rotation: completed/failed actions >7d, deploy logs >30d,
  events >3d AND past observer checkpoint
- Emits structured JSON events (node_health, disk_pressure, high_memory, high_cpu,
  containers_not_running, healthcheck_failed)

services/node-agent/ (new):
- Python daemon (node_agent.py): same policy as bash script, Docker SDK
  for container checks and cleanup, /proc for system metrics
- Optional event shipping to VPS via rsync+SSH (VPS_EVENTS_HOST env var)
- Dockerfile: python:3.11-slim + openssh-client + rsync + docker>=6.0
- docker-compose.yml: mounts docker socket, /opt/homelab, repo read-only

observer.py:
- Handle node_health: update node status + disk/mem/cpu metrics, clear disk_pressure
- Handle disk_pressure: record severity on node, clear when healthy
- Handle high_memory / high_cpu: record pressure level for correlation

supervisor.py:
- Add NO_DISK_CLEANUP_NODES = {chelsty-infra, chelsty-ha}
- reconcile() step 3: generate disk_cleanup actions for nodes with high disk pressure
- _generate_disk_cleanup_recommendation(): stable ID disk-cleanup-{node},
  checks all active states, risk=guarded (operator approval required)

executor.py:
- Handle disk_cleanup action type via _execute_disk_cleanup()
- Commands come from action payload; safety gate rejects any command touching
  /opt/homelab/data/, /opt/homelab/config/, /opt/homelab/state/, or rm -rf /

hosts/*/services.yaml:
- Rename stability-agent -> node-agent on piha, vps, solaria, chelsty-infra
- Add node-agent to chelsty-ha (previously missing)
- Add cleanup policy notes to LTE node comments

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 13:15:06 +02:00
Oskar Kapala 7742bda245 feat(control-plane): add container_restart remediation
- observer: store trigger_type on incidents for supervisor routing
- supervisor: route containers_not_running/mqtt_unreachable to container_restart instead of redeploy
- supervisor: fix node alias normalization via NODE_ALIAS_MAP
- supervisor: fix pending action dedup (scan by content not filename)
- executor: implement container_restart via SSH docker restart with retry
- control-plane override: configure NODE_ALIAS_MAP for production

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 12:50:46 +02:00
oskar 9b39581b53 fix(supervisor): content-based action IDs to prevent 30s backlog accumulation
Timestamp in reconcile-{ts}-{node}-{service} meant dedup guard never fired.
Switch to reconcile-{node}-{service} and check pending/approved/running states.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 17:47:37 +02:00
oskar 9f20dcae05 Add control plane deploy script and fix UI healthcheck 2026-05-18 21:34:57 +02:00
oskar b7251ac416 Fix control plane UI healthcheck 2026-05-18 21:29:55 +02:00
Oskar Kapala 533b8e846d Add heartbeat updates and improve health checks in control-plane components 2026-05-12 20:59:46 +02:00
Oskar Kapala f4e6871d76 Add health check to control-plane Dockerfile fix syntax 2026-05-12 20:28:13 +02:00
Oskar Kapala 793559a4b5 Add health check to control-plane Dockerfile 2026-05-12 20:25:01 +02:00
Oskar Kapala 0cf1106b34 Update control-plane port mapping to 18180 2026-05-12 20:22:46 +02:00
Oskar Kapala 2029457f57 Implement VPS control-plane deployment profile 2026-05-12 20:19:05 +02:00