homelab-codex-ws

Author	SHA1	Message	Date
Oskar Kapala	98437d46b2	test(control-plane): atomic write and resilient loader coverage 11 new test cases in test_state_reliability.py covering: - atomic_write_json: produces valid JSON, no .tmp left behind, overwrites, works with nested structures - _load_actual_state: returns False on empty / truncated file, returns True on valid files, preserves last-known-good state across a parse failure - reconcile: empty/truncated services.json or incidents.json generates zero actions (skip-cycle semantics proven end-to-end) - healthy service with valid world state generates no spurious action All 32 tests (11 new + 21 existing) pass. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-03 12:27:05 +02:00
Oskar Kapala	5e97b4e448	fix(supervisor): atomic writes + skip cycle on unreadable world state Two independent fixes for the false-alarm storm caused by race-condition reads of truncated world state files: 1. Atomic writes: _atomic_write_json (write→fsync→os.replace) replaces all bare open('w')+json.dump calls in supervisor and executor, so the action-file pipeline is never visible in a half-written state. 2. Resilient loader: _load_actual_state now returns False when any world state file fails to parse (empty or truncated mid-write). reconcile() skips the entire drift check on False instead of treating {} as "all services missing". actual_state retains its last-known-good values so a single bad cycle does not wipe accumulated context. Before: parse error → raw[key]={} → all desired services missing → wall of redeploy actions → drift_resolved_auto churn on next cycle. After: parse error → WARNING logged → cycle skipped → no actions. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-03 12:26:59 +02:00
Oskar Kapala	495741e7ac	operator-ui: /events without loading the whole directory + daemon threads; epoch from regex (fix chelsty-infra) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-01 16:34:52 +02:00
Oskar Kapala	43c5d45353	deploy: chmod/chown on /opt/homelab resilient to disappearing event files	2026-06-01 14:35:19 +02:00
Oskar Kapala	f64cec645e	vps: mem_limit + oom_score_adj on in-repo services; deploy-local applies the override (stop OOM)	2026-06-01 14:23:58 +02:00
Oskar Kapala	1db9db7d03	fix(dashboard): read last_update from JSON content, not file mtime operator_ui.py called .replace() on last_update without checking type — an integer value (written by the materializer) raised AttributeError and silently fell back to os.path.getmtime(), which was stuck at 5/29 after a deploy with preserved timestamps. web.py had the same class of bug but worse: it unconditionally replaced last_update with mtime, ignoring the JSON field entirely. Both now branch on isinstance(str) and cast numeric values directly to float, with mtime only as a last-resort fallback. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-31 22:10:50 +02:00
Oskar Kapala	52607a7cdd	feat(control-plane): shadow_mode for HA event auto-actions + deploy docs - HA_DIAG_SHADOW_MODE env flag in supervisor (default true) - shadow_mode downgrades container_restart actions to alert_only with [SHADOW MODE] note; same action_id and 30-min cooldown apply - alert_only events unaffected (always routed normally) - 3 new tests: shadow on/off for ha_websocket_dead, alert-only unaffected - DEPLOY.md with token gen, per-host config, verification, 48h observation, production-mode enablement, rollback - README.md updated with shadow mode flag summary and DEPLOY.md link Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-29 17:12:33 +02:00
Oskar Kapala	bf1415e4c1	feat(control-plane): route ha-diag-agent events through supervisor - 8 HA event types mapped to existing action types - ha_websocket_dead → container_restart (homeassistant), 30-min cooldown - 6 events → alert_only (entity_unavailable, integration_failed, automation_failing, update_available, recorder_lag, system_health_degraded), 1-hour cooldown - ha_websocket_recovered → cancels matching pending container_restart - state-aware suppression: skip HA events when homeassistant has an active containers_not_running incident < 5 min ago (avoids alert storms during HA restarts/updates) - location_tag preserved through action pipeline for per-house telegram alerts - executor: alert_only acknowledged as no-op success - 18 tests covering all 8 event types, suppression, cooldown, dedup, location_tag, recovery cancellation - CLAUDE.md: supervisor event routing table added Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-29 15:59:23 +02:00
Oskar Kapala	46ae92b5c1	supervisor: also cancel pending actions for services removed from desired state Previously _cancel_resolved_pending_actions() only cancelled actions where the service became healthy. This left orphaned actions when a service was removed from services.yaml or marked monitor:false. Add Case 1: if the action's svc_key is no longer in desired_state (either removed entirely or skipped due to monitor:false), cancel with reason service_removed_from_desired_state. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-27 15:19:13 +02:00
Oskar Kapala	51002d4502	Fix pending actions: node_exporter, zigbee2mqtt, chelsty-ha monitoring node_exporter (new service): - Add services/node_exporter/docker-compose.yml matching solaria deployment (network_mode: host, pid: host, /:/host:ro,rslave mount) - Add services/node_exporter/service.yaml zigbee2mqtt chelsty-infra override: - Fix network_mode: host (mosquitto runs on host network, port 1883 on localhost) - Fix volume mount: ./configuration.yaml → absolute /opt/homelab/config/zigbee2mqtt/ (secrets stay in runtime config dir, never in Git) - Remove MQTT_USER/MQTT_PASSWORD (mosquitto uses allow_anonymous true) - Extend healthcheck start_period to 60s (z2m takes time on first start) chelsty-ha/services.yaml: - Remove node-agent entry entirely (never deployed, no plans to bootstrap now) - Keep homeassistant with monitor: false (no node-agent = no health events) supervisor: respect monitor: false in services.yaml - Skip action generation for services where monitor=false - Cleans up chelsty-ha entries from action queue without removing desired-state docs Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-27 15:10:48 +02:00
Oskar Kapala	fb7828b52b	supervisor: auto-cancel pending actions when drift is resolved When a service becomes healthy (node-agent emits service_healthy → observer updates services.json), any previously queued redeploy/container_restart action is stale. Without cleanup, the queue accumulates old actions that require manual rejection. _cancel_resolved_pending_actions() runs after each reconcile cycle: - Reads all pending/*.json with type=redeploy or container_restart - If the service is now healthy in actual_state, moves action to cancelled/ with reason=drift_resolved_auto - Only pending actions are touched; approved/running are left to the operator Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-27 14:58:55 +02:00
Oskar Kapala	96bf32614f	fix(observer+operator-ui): fix stale world state, dict→list API, event time filter Root cause of stale data: - node_agent.py falls back to socket.gethostname() when NODE_NAME is unset. Inside a Docker container this returns the 12-char container ID (e.g. 'be17cb6eb0f6'), not the host name. Observer ingested those events and created ghost entries in world/nodes.json that never expired. observer.py: - _prune_stale_world(): removes node/service/incident entries for nodes absent from topology inventory; called on every run_once() cycle (both new-events and idle paths). Resolved incidents older than 7 days are also aged out. - _save_world(): now writes node_count and service_count to runtime-summary.json so the Dashboard's System Overview cards show real numbers instead of undefined. operator_ui.py: - current_nodes/services/deployments/incidents(): the observer stores world state as keyed dicts; the frontend calls .map() which requires an array. All four functions now convert the dict to a properly-shaped list. Each item has the fields the Nodes, Services, Topology, Deployments, and Correlation views expect (hostname, health, capabilities, desired_state, dependencies, etc.). - current_incidents(): synthesises a human-readable 'message' field from node + service + trigger_type (observer does not store one; dashboard showed undefined). - current_events(): adds a 24 h time filter (EVENTS_MAX_AGE_HOURS env var, default 24). Without this, every event file ever written was returned, including events from ghost-node deploys. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-27 13:51:03 +02:00
Oskar Kapala	01b7758fe6	feat(node-agent): implement health monitor and safe cleanup policy scripts/monitor/health-monitor.sh (new): - Standalone bash health monitor: disk/RAM/CPU checks + docker container health - Per-node-type cleanup policy enforced: lte_node (chelsty-infra, chelsty-ha): NO cleanup, no docker ops sd_card (piha, saturn): dangling images + containers, rate-limited once/24h ai_node (solaria): dangling + containers + build cache, NEVER -a standard (vps): dangling + containers + build cache + CP filesystem rotation - VPS filesystem rotation: completed/failed actions >7d, deploy logs >30d, events >3d AND past observer checkpoint - Emits structured JSON events (node_health, disk_pressure, high_memory, high_cpu, containers_not_running, healthcheck_failed) services/node-agent/ (new): - Python daemon (node_agent.py): same policy as bash script, Docker SDK for container checks and cleanup, /proc for system metrics - Optional event shipping to VPS via rsync+SSH (VPS_EVENTS_HOST env var) - Dockerfile: python:3.11-slim + openssh-client + rsync + docker>=6.0 - docker-compose.yml: mounts docker socket, /opt/homelab, repo read-only observer.py: - Handle node_health: update node status + disk/mem/cpu metrics, clear disk_pressure - Handle disk_pressure: record severity on node, clear when healthy - Handle high_memory / high_cpu: record pressure level for correlation supervisor.py: - Add NO_DISK_CLEANUP_NODES = {chelsty-infra, chelsty-ha} - reconcile() step 3: generate disk_cleanup actions for nodes with high disk pressure - _generate_disk_cleanup_recommendation(): stable ID disk-cleanup-{node}, checks all active states, risk=guarded (operator approval required) executor.py: - Handle disk_cleanup action type via _execute_disk_cleanup() - Commands come from action payload; safety gate rejects any command touching /opt/homelab/data/, /opt/homelab/config/, /opt/homelab/state/, or rm -rf / hosts/*/services.yaml: - Rename stability-agent -> node-agent on piha, vps, solaria, chelsty-infra - Add node-agent to chelsty-ha (previously missing) - Add cleanup policy notes to LTE node comments Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-27 13:15:06 +02:00
Oskar Kapala	7742bda245	feat(control-plane): add container_restart remediation - observer: store trigger_type on incidents for supervisor routing - supervisor: route containers_not_running/mqtt_unreachable to container_restart instead of redeploy - supervisor: fix node alias normalization via NODE_ALIAS_MAP - supervisor: fix pending action dedup (scan by content not filename) - executor: implement container_restart via SSH docker restart with retry - control-plane override: configure NODE_ALIAS_MAP for production Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-27 12:50:46 +02:00
oskar	9b39581b53	fix(supervisor): content-based action IDs to prevent 30s backlog accumulation Timestamp in reconcile-{ts}-{node}-{service} meant dedup guard never fired. Switch to reconcile-{node}-{service} and check pending/approved/running states. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 17:47:37 +02:00
oskar	9f20dcae05	Add control plane deploy script and fix UI healthcheck	2026-05-18 21:34:57 +02:00
oskar	b7251ac416	Fix control plane UI healthcheck	2026-05-18 21:29:55 +02:00
Oskar Kapala	533b8e846d	Add heartbeat updates and improve health checks in control-plane components	2026-05-12 20:59:46 +02:00
Oskar Kapala	f4e6871d76	Add health check to control-plane Dockerfile fix syntax	2026-05-12 20:28:13 +02:00
Oskar Kapala	793559a4b5	Add health check to control-plane Dockerfile	2026-05-12 20:25:01 +02:00
Oskar Kapala	0cf1106b34	Update control-plane port mapping to 18180	2026-05-12 20:22:46 +02:00
Oskar Kapala	2029457f57	Implement VPS control-plane deployment profile	2026-05-12 20:19:05 +02:00
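
Note on 5e97b4e448: the message describes an atomic write (write, fsync, os.replace) and a loader that returns False on any parse failure while keeping the last-known-good state. The following is a minimal sketch of that pattern, not the repository's code. The `.tmp` suffix, the `paths` mapping, and the function signatures are assumptions made for illustration.

```python
import json
import os
import tempfile


def atomic_write_json(path, data):
    """Write JSON so that readers never observe a half-written file."""
    tmp_path = f"{path}.tmp"  # assumed naming; must be on the same filesystem as path
    with open(tmp_path, "w", encoding="utf-8") as fh:
        json.dump(data, fh)
        fh.flush()
        os.fsync(fh.fileno())  # force the bytes to disk before the rename
    os.replace(tmp_path, path)  # atomic rename over the destination


def load_actual_state(paths, actual_state):
    """Parse every file first; touch actual_state only if all of them parse."""
    parsed = {}
    for key, path in paths.items():
        try:
            with open(path, encoding="utf-8") as fh:
                parsed[key] = json.load(fh)
        except (OSError, json.JSONDecodeError):
            return False  # caller skips the cycle; last-known-good state is kept
    actual_state.update(parsed)
    return True


# Usage: one good cycle, then one cycle that reads an empty (truncated) file.
workdir = tempfile.mkdtemp()
services_path = os.path.join(workdir, "services.json")
atomic_write_json(services_path, {"vps/web": {"health": "healthy"}})

state = {}
ok_first = load_actual_state({"services": services_path}, state)
open(services_path, "w").close()  # simulate a reader hitting a truncated file
ok_second = load_actual_state({"services": services_path}, state)
```

Parsing into a scratch dict before updating is what keeps a bad cycle from wiping accumulated context: a failure on the second file cannot leave the state half-refreshed.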

22 commits
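
Note on 52607a7cdd: the shadow-mode behaviour described there (HA_DIAG_SHADOW_MODE defaulting to true, container_restart downgraded to alert_only with a [SHADOW MODE] note, same action_id) could look roughly like the sketch below. The action dict layout, the accepted truthy strings, and both function names are assumptions, not taken from the supervisor source.

```python
import os


def shadow_mode_enabled(env=None):
    """Read the flag; an unset variable means shadow mode is on."""
    env = os.environ if env is None else env
    value = env.get("HA_DIAG_SHADOW_MODE", "true")
    return value.strip().lower() in ("1", "true", "yes")  # assumed parsing rule


def apply_shadow_mode(action, shadow):
    """Downgrade container_restart to alert_only; pass everything else through."""
    if shadow and action.get("type") == "container_restart":
        downgraded = dict(action)  # copy, so action_id and other fields carry over
        downgraded["type"] = "alert_only"
        downgraded["note"] = "[SHADOW MODE] would have run container_restart"
        return downgraded
    return action


# Usage with a hypothetical action payload.
restart = {"action_id": "ha-websocket-dead-piha", "type": "container_restart"}
shadowed = apply_shadow_mode(restart, shadow_mode_enabled({}))
live = apply_shadow_mode(restart, shadow_mode_enabled({"HA_DIAG_SHADOW_MODE": "false"}))
```

Because the action_id is unchanged, any cooldown or dedup keyed on it treats the downgraded action the same way as the real one.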