homelab-codex-ws

Author	SHA1	Message	Date
Oskar Kapala	c9ee8eb06d	fix(observer): quarantine malformed event files to prevent processing wedge Recovery from bad merge of task/observer-poison-quarantine (`c255a02`) which carried false deletes from a stale branch base. Re-applies only the genuine observer changes on top of correct master state. When an event file fails to parse (malformed JSON, truncated, corrupted), the observer previously kept retrying on every cycle while the node's checkpoint stayed pinned — all subsequent good events for that node lost. Now: first parse failure -> atomic os.replace to STATE_DIR/observer_failed_events/<node>/ with collision handling. Checkpoint advances, downstream events flow. Move failures are logged but don't crash the loop. Complementary to the atomic_write_json fix on state files; this addresses the same race-pattern on event files instead. Regression test asserts: bad event quarantined to failed_events dir, removed from hot path, subsequent good event processed (node online), checkpoint moves to good event.	2026-06-12 13:11:15 +02:00
Oskar Kapala	7f17b65278	fix(control-plane): run executor as uid 1000 with docker group access Executor was the only control-plane container running as root (uid=0), writing root-owned files to /opt/homelab via bind-mount and triggering false sudo on every deploy. - Dockerfile: add USER homelab after useradd (useradd already present) - docker-compose.yml: add user: "1000:1000" and group_add: ["999"] (GID 999 = docker group on VPS) so executor retains docker.sock access Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-03 18:19:58 +02:00
Oskar Kapala	00fc36df3a	fix(deploy): skip sudo chown/chmod when /opt/homelab ownership is already correct deploy-local.sh previously ran `sudo chown -R 1000:1000` and `sudo chmod -R 775` unconditionally on every deploy, which blocked non-TTY execution (CC/CI) on VPS where /opt/homelab is already 1000:1000. Both steps are now conditional using `find ... -print -quit`: - chown: runs only if any file/dir is NOT uid/gid 1000 - chmod: runs only if any directory is missing -775 permission bits When everything is correct (steady state on VPS), both steps log "already correct, skipping" and never invoke sudo. If a new directory was created by root (e.g. a manual mkdir, volume mount, or restart artefact), the remediation path triggers automatically — the self-heal property is preserved. Smoke-tested in Docker (ubuntu:22.04): Case 1 (1000:1000 + 775): chown skipped, chmod skipped ✓ Case 2 (root-owned subdir): chown triggered ✓ Case 3 (700 dir perms): chmod triggered ✓ Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-03 15:44:44 +02:00
Oskar Kapala	f5dcefc752	fix(observer): robust incident lifecycle + orphan auto-resolve Two root causes for stale "active" incidents on the dashboard: 1. TypeError bug in _prune_stale_world: last_occurrence / resolved_at can be an ISO-8601 string (stability-agent via events.py) or a Unix int (node-agent). The previous session's auto-resolve did plain `time.time() - last_occ` which raises TypeError for strings, silently preventing _save_world() from being called and leaving incidents perpetually "active" on disk. Fix: add _parse_ts(ts) -> float that handles int, float, and ISO-8601 strings uniformly. All timestamp arithmetic now goes through it; returns 0.0 on None / garbage to keep comparisons safe. 2. Orphaned active incidents: _resolve_incident clears service["incident_id"] and marks the incident "resolved" in memory, but if incidents.json was truncated mid-write (pre-atomic-write era), the observer loaded it at next startup with status="active" and no service entry pointing to it. No code ever touched these orphans again. Fix: _prune_stale_world now runs two cleanup passes each cycle: - Case 1 (healthy-linked): service.status=="healthy" AND incident_id still set → resolve immediately (service cannot have active incident) - Case 2 (orphaned): active incident with no service link AND last_occurrence > 5 min ago → resolve (5-min guard for creation race) Both cases are wrapped in try/except so a bug here never crashes the observer loop or blocks _save_world. Also fixes the 7-day stale-incident prune to use _parse_ts so ISO-string resolved_at values are handled correctly. 3. Operator UI: current_incidents() now filters to status=="active" only. Resolved incidents were previously included in the /incidents endpoint, making the dashboard show a wall of historical records as if active. Nocturnal job investigation: _cleanup_control_plane_fs in node-agent runs every 60s on VPS (not midnight-specific); it reads observer_checkpoint.json (now written atomically) and deletes old event files. No non-atomic writes found. Midnight clustering was likely external (logrotate / OS flush); the supervisor's resilient loader already handles such transient issues. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-03 14:29:12 +02:00
Oskar Kapala	98437d46b2	test(control-plane): atomic write and resilient loader coverage 11 new test cases in test_state_reliability.py covering: - atomic_write_json: produces valid JSON, no .tmp left behind, overwrites, works with nested structures - _load_actual_state: returns False on empty / truncated file, returns True on valid files, preserves last-known-good state across a parse failure - reconcile: empty/truncated services.json or incidents.json generates zero actions (skip-cycle semantics proven end-to-end) - healthy service with valid world state generates no spurious action All 32 tests (11 new + 21 existing) pass. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-03 12:27:05 +02:00
Oskar Kapala	5e97b4e448	fix(supervisor): atomic writes + skip cycle on unreadable world state Two independent fixes for the false-alarm storm caused by race-condition reads of truncated world state files: 1. Atomic writes: _atomic_write_json (write→fsync→os.replace) replaces all bare open('w')+json.dump calls in supervisor and executor, so the action-file pipeline is never visible in a half-written state. 2. Resilient loader: _load_actual_state now returns False when any world state file fails to parse (empty or truncated mid-write). reconcile() skips the entire drift check on False instead of treating {} as "all services missing". actual_state retains its last-known-good values so a single bad cycle does not wipe accumulated context. Before: parse error → raw[key]={} → all desired services missing → wall of redeploy actions → drift_resolved_auto churn on next cycle. After: parse error → WARNING logged → cycle skipped → no actions. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-03 12:26:59 +02:00
Oskar Kapala	495741e7ac	operator-ui: /events without loading the whole directory + daemon threads; epoch from regex (fix chelsty-infra) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-01 16:34:52 +02:00
Oskar Kapala	43c5d45353	deploy: chmod/chown on /opt/homelab resilient to disappearing event files	2026-06-01 14:35:19 +02:00
Oskar Kapala	f64cec645e	vps: mem_limit + oom_score_adj on in-repo services; deploy-local applies the override (stop OOM)	2026-06-01 14:23:58 +02:00
Oskar Kapala	1db9db7d03	fix(dashboard): read last_update from JSON content, not file mtime operator_ui.py called .replace() on last_update without checking type — an integer value (written by the materializer) raised AttributeError and silently fell back to os.path.getmtime(), which was stuck at 5/29 after a deploy with preserved timestamps. web.py had the same class of bug but worse: it unconditionally replaced last_update with mtime, ignoring the JSON field entirely. Both now branch on isinstance(str) and cast numeric values directly to float, with mtime only as a last-resort fallback. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-31 22:10:50 +02:00
Oskar Kapala	52607a7cdd	feat(control-plane): shadow_mode for HA event auto-actions + deploy docs - HA_DIAG_SHADOW_MODE env flag in supervisor (default true) - shadow_mode downgrades container_restart actions to alert_only with [SHADOW MODE] note; same action_id and 30-min cooldown apply - alert_only events unaffected (always routed normally) - 3 new tests: shadow on/off for ha_websocket_dead, alert-only unaffected - DEPLOY.md with token gen, per-host config, verification, 48h observation, production-mode enablement, rollback - README.md updated with shadow mode flag summary and DEPLOY.md link Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-29 17:12:33 +02:00
Oskar Kapala	bf1415e4c1	feat(control-plane): route ha-diag-agent events through supervisor - 8 HA event types mapped to existing action types - ha_websocket_dead → container_restart (homeassistant), 30-min cooldown - 6 events → alert_only (entity_unavailable, integration_failed, automation_failing, update_available, recorder_lag, system_health_degraded), 1-hour cooldown - ha_websocket_recovered → cancels matching pending container_restart - state-aware suppression: skip HA events when homeassistant has an active containers_not_running incident < 5 min ago (avoids alert storms during HA restarts/updates) - location_tag preserved through action pipeline for per-house telegram alerts - executor: alert_only acknowledged as no-op success - 18 tests covering all 8 event types, suppression, cooldown, dedup, location_tag, recovery cancellation - CLAUDE.md: supervisor event routing table added Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-29 15:59:23 +02:00
Oskar Kapala	46ae92b5c1	supervisor: also cancel pending actions for services removed from desired state Previously _cancel_resolved_pending_actions() only cancelled actions where the service became healthy. This left orphaned actions when a service was removed from services.yaml or marked monitor:false. Add Case 1: if the action's svc_key is no longer in desired_state (either removed entirely or skipped due to monitor:false), cancel with reason service_removed_from_desired_state. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-27 15:19:13 +02:00
Oskar Kapala	51002d4502	Fix pending actions: node_exporter, zigbee2mqtt, chelsty-ha monitoring node_exporter (new service): - Add services/node_exporter/docker-compose.yml matching solaria deployment (network_mode: host, pid: host, /:/host:ro,rslave mount) - Add services/node_exporter/service.yaml zigbee2mqtt chelsty-infra override: - Fix network_mode: host (mosquitto runs on host network, port 1883 on localhost) - Fix volume mount: ./configuration.yaml → absolute /opt/homelab/config/zigbee2mqtt/ (secrets stay in runtime config dir, never in Git) - Remove MQTT_USER/MQTT_PASSWORD (mosquitto uses allow_anonymous true) - Extend healthcheck start_period to 60s (z2m takes time on first start) chelsty-ha/services.yaml: - Remove node-agent entry entirely (never deployed, no plans to bootstrap now) - Keep homeassistant with monitor: false (no node-agent = no health events) supervisor: respect monitor: false in services.yaml - Skip action generation for services where monitor=false - Cleans up chelsty-ha entries from action queue without removing desired-state docs Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-27 15:10:48 +02:00
Oskar Kapala	fb7828b52b	supervisor: auto-cancel pending actions when drift is resolved When a service becomes healthy (node-agent emits service_healthy → observer updates services.json), any previously queued redeploy/container_restart action is stale. Without cleanup, the queue accumulates old actions that require manual rejection. _cancel_resolved_pending_actions() runs after each reconcile cycle: - Reads all pending/*.json with type=redeploy or container_restart - If the service is now healthy in actual_state, moves action to cancelled/ with reason=drift_resolved_auto - Only pending actions are touched; approved/running are left to the operator Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-27 14:58:55 +02:00
Oskar Kapala	96bf32614f	fix(observer+operator-ui): fix stale world state, dict→list API, event time filter Root cause of stale data: - node_agent.py falls back to socket.gethostname() when NODE_NAME is unset. Inside a Docker container this returns the 12-char container ID (e.g. 'be17cb6eb0f6'), not the host name. Observer ingested those events and created ghost entries in world/nodes.json that never expired. observer.py: - _prune_stale_world(): removes node/service/incident entries for nodes absent from topology inventory; called on every run_once() cycle (both new-events and idle paths). Resolved incidents older than 7 days are also aged out. - _save_world(): now writes node_count and service_count to runtime-summary.json so the Dashboard's System Overview cards show real numbers instead of undefined. operator_ui.py: - current_nodes/services/deployments/incidents(): the observer stores world state as keyed dicts; the frontend calls .map() which requires an array. All four functions now convert the dict to a properly-shaped list. Each item has the fields the Nodes, Services, Topology, Deployments, and Correlation views expect (hostname, health, capabilities, desired_state, dependencies, etc.). - current_incidents(): synthesises a human-readable 'message' field from node + service + trigger_type (observer does not store one; dashboard showed undefined). - current_events(): adds a 24 h time filter (EVENTS_MAX_AGE_HOURS env var, default 24). Without this, every event file ever written was returned, including events from ghost-node deploys. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-27 13:51:03 +02:00
Oskar Kapala	01b7758fe6	feat(node-agent): implement health monitor and safe cleanup policy scripts/monitor/health-monitor.sh (new): - Standalone bash health monitor: disk/RAM/CPU checks + docker container health - Per-node-type cleanup policy enforced: lte_node (chelsty-infra, chelsty-ha): NO cleanup, no docker ops sd_card (piha, saturn): dangling images + containers, rate-limited once/24h ai_node (solaria): dangling + containers + build cache, NEVER -a standard (vps): dangling + containers + build cache + CP filesystem rotation - VPS filesystem rotation: completed/failed actions >7d, deploy logs >30d, events >3d AND past observer checkpoint - Emits structured JSON events (node_health, disk_pressure, high_memory, high_cpu, containers_not_running, healthcheck_failed) services/node-agent/ (new): - Python daemon (node_agent.py): same policy as bash script, Docker SDK for container checks and cleanup, /proc for system metrics - Optional event shipping to VPS via rsync+SSH (VPS_EVENTS_HOST env var) - Dockerfile: python:3.11-slim + openssh-client + rsync + docker>=6.0 - docker-compose.yml: mounts docker socket, /opt/homelab, repo read-only observer.py: - Handle node_health: update node status + disk/mem/cpu metrics, clear disk_pressure - Handle disk_pressure: record severity on node, clear when healthy - Handle high_memory / high_cpu: record pressure level for correlation supervisor.py: - Add NO_DISK_CLEANUP_NODES = {chelsty-infra, chelsty-ha} - reconcile() step 3: generate disk_cleanup actions for nodes with high disk pressure - _generate_disk_cleanup_recommendation(): stable ID disk-cleanup-{node}, checks all active states, risk=guarded (operator approval required) executor.py: - Handle disk_cleanup action type via _execute_disk_cleanup() - Commands come from action payload; safety gate rejects any command touching /opt/homelab/data/, /opt/homelab/config/, /opt/homelab/state/, or rm -rf / hosts/*/services.yaml: - Rename stability-agent -> node-agent on piha, vps, solaria, chelsty-infra - Add node-agent to chelsty-ha (previously missing) - Add cleanup policy notes to LTE node comments Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-27 13:15:06 +02:00
Oskar Kapala	7742bda245	feat(control-plane): add container_restart remediation - observer: store trigger_type on incidents for supervisor routing - supervisor: route containers_not_running/mqtt_unreachable to container_restart instead of redeploy - supervisor: fix node alias normalization via NODE_ALIAS_MAP - supervisor: fix pending action dedup (scan by content not filename) - executor: implement container_restart via SSH docker restart with retry - control-plane override: configure NODE_ALIAS_MAP for production Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-27 12:50:46 +02:00
oskar	9b39581b53	fix(supervisor): content-based action IDs to prevent 30s backlog accumulation Timestamp in reconcile-{ts}-{node}-{service} meant dedup guard never fired. Switch to reconcile-{node}-{service} and check pending/approved/running states. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 17:47:37 +02:00
oskar	9f20dcae05	Add control plane deploy script and fix UI healthcheck	2026-05-18 21:34:57 +02:00
oskar	b7251ac416	Fix control plane UI healthcheck	2026-05-18 21:29:55 +02:00
Oskar Kapala	533b8e846d	Add heartbeat updates and improve health checks in control-plane components	2026-05-12 20:59:46 +02:00
Oskar Kapala	f4e6871d76	Add health check to control-plane Dockerfile fix syntax	2026-05-12 20:28:13 +02:00
Oskar Kapala	793559a4b5	Add health check to control-plane Dockerfile	2026-05-12 20:25:01 +02:00
Oskar Kapala	0cf1106b34	Update control-plane port mapping to 18180	2026-05-12 20:22:46 +02:00
Oskar Kapala	2029457f57	Implement VPS control-plane deployment profile	2026-05-12 20:19:05 +02:00
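
Commit f5dcefc752 describes a `_parse_ts(ts) -> float` helper that accepts int, float, and ISO-8601 string timestamps and returns 0.0 for None or garbage. The repository's actual code is not shown in this log, so the following is only a minimal sketch of that idea. The name `parse_ts`, the handling of a trailing "Z", and the choice to treat naive timestamps as UTC are assumptions made for illustration.

```python
from datetime import datetime, timezone


def parse_ts(ts):
    """Return ts as Unix seconds (float); 0.0 for None or unparseable input.

    Hypothetical sketch, not the repository's _parse_ts.
    """
    if isinstance(ts, bool):
        # bool is a subclass of int; treat it as garbage rather than 0/1.
        return 0.0
    if isinstance(ts, (int, float)):
        return float(ts)
    if isinstance(ts, str):
        text = ts.strip()
        if text.endswith("Z"):
            # datetime.fromisoformat() rejects "Z" before Python 3.11.
            text = text[:-1] + "+00:00"
        try:
            dt = datetime.fromisoformat(text)
        except ValueError:
            return 0.0
        if dt.tzinfo is None:
            # Assumption: timestamps without an offset are UTC.
            dt = dt.replace(tzinfo=timezone.utc)
        return dt.timestamp()
    return 0.0
```

With a helper of this shape, an age check such as `time.time() - parse_ts(last_occurrence)` no longer raises TypeError when the stored value is a string, which is the failure the commit message attributes to the plain subtraction.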

26 commits
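
Commit 5e97b4e448 names two techniques: an atomic write (write, fsync, then `os.replace`) and a loader that reports failure and keeps the last-known-good state when a file cannot be parsed. The sketch below illustrates both under stated assumptions. The function names `atomic_write_json` and `load_or_keep`, the temp-file naming, and the `(ok, state)` return shape are hypothetical and may differ from the repository's `_atomic_write_json` and `_load_actual_state`.

```python
import json
import os
import tempfile


def atomic_write_json(path, data):
    """Write data as JSON so readers never see a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the target directory so that os.replace
    # stays on one filesystem and can swap the file in a single step.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(data, fh)
            fh.flush()
            os.fsync(fh.fileno())  # push the bytes to disk before the swap
        os.replace(tmp_path, path)
    except BaseException:
        # Do not leave a stray .tmp file behind on failure.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise


def load_or_keep(path, last_good):
    """Return (ok, state); on a read or parse failure keep last_good."""
    try:
        with open(path, encoding="utf-8") as fh:
            return True, json.load(fh)
    except (OSError, json.JSONDecodeError):
        return False, last_good
```

A caller in the spirit of the commit message would skip its drift check for the cycle whenever `ok` is false, rather than treating an empty dict as "all services missing".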