homelab-codex-ws

Author	SHA1	Message	Date
Oskar Kapala	c9ee8eb06d	fix(observer): quarantine malformed event files to prevent processing wedge Recovery from bad merge of task/observer-poison-quarantine (`c255a02`) which carried false deletes from a stale branch base. Re-applies only the genuine observer changes on top of correct master state. When an event file fails to parse (malformed JSON, truncated, corrupted), the observer previously kept retrying on every cycle while the node's checkpoint stayed pinned — all subsequent good events for that node lost. Now: first parse failure -> atomic os.replace to STATE_DIR/observer_failed_events/<node>/ with collision handling. Checkpoint advances, downstream events flow. Move failures are logged but don't crash the loop. Complementary to the atomic_write_json fix on state files; this addresses the same race-pattern on event files instead. Regression test asserts: bad event quarantined to failed_events dir, removed from hot path, subsequent good event processed (node online), checkpoint moves to good event.	2026-06-12 13:11:15 +02:00
Oskar Kapala	f5dcefc752	fix(observer): robust incident lifecycle + orphan auto-resolve Two root causes for stale "active" incidents on the dashboard: 1. TypeError bug in _prune_stale_world: last_occurrence / resolved_at can be an ISO-8601 string (stability-agent via events.py) or a Unix int (node-agent). The previous session's auto-resolve did plain `time.time() - last_occ` which raises TypeError for strings, silently preventing _save_world() from being called and leaving incidents perpetually "active" on disk. Fix: add _parse_ts(ts) -> float that handles int, float, and ISO-8601 strings uniformly. All timestamp arithmetic now goes through it; returns 0.0 on None / garbage to keep comparisons safe. 2. Orphaned active incidents: _resolve_incident clears service["incident_id"] and marks the incident "resolved" in memory, but if incidents.json was truncated mid-write (pre-atomic-write era), the observer loaded it at next startup with status="active" and no service entry pointing to it. No code ever touched these orphans again. Fix: _prune_stale_world now runs two cleanup passes each cycle: - Case 1 (healthy-linked): service.status=="healthy" AND incident_id still set → resolve immediately (service cannot have active incident) - Case 2 (orphaned): active incident with no service link AND last_occurrence > 5 min ago → resolve (5-min guard for creation race) Both cases are wrapped in try/except so a bug here never crashes the observer loop or blocks _save_world. Also fixes the 7-day stale-incident prune to use _parse_ts so ISO-string resolved_at values are handled correctly. 3. Operator UI: current_incidents() now filters to status=="active" only. Resolved incidents were previously included in the /incidents endpoint, making the dashboard show a wall of historical records as if active. Nocturnal job investigation: _cleanup_control_plane_fs in node-agent runs every 60s on VPS (not midnight-specific); it reads observer_checkpoint.json (now written atomically) and deletes old event files. No non-atomic writes found. Midnight clustering was likely external (logrotate / OS flush); the supervisor's resilient loader already handles such transient issues. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-03 14:29:12 +02:00
Oskar Kapala	ffb0608b9a	fix(observer): atomic writes for world state files All JSON state writes (services.json, nodes.json, incidents.json, deployments.json, runtime-summary.json, observer_checkpoint.json) now use _atomic_write_json: write to a .tmp sibling, fsync, then os.replace. This eliminates the truncated-write window that caused supervisors reading mid-write files to see empty/partial JSON. Also adds auto-resolution of phantom active incidents: if a service reports status=healthy and its incident's last_occurrence is >30 min old, the incident is resolved in _prune_stale_world. This clears false active incidents accumulated from previous race-condition reads. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-03 12:26:49 +02:00
Oskar Kapala	b40b832159	Fix ghost service keys from hash-prefixed Docker container names node-agent: use com.docker.compose.service label as canonical name - Add _canonical_container_name() method: prefers compose label, falls back to hash-prefix-stripped c.name - Replace bare c.name usage in check_containers() - Skip 'created'-state containers (Docker stale-state artifacts) observer: prune hash-prefixed ghost keys in _prune_stale_world() - Each reconcile cycle removes service keys matching <node>/<12hex>_<name> - Acts as safety net for entries already in services.json + future slippage control-plane/docker-compose.yml already has explicit container_name on all four services — no change needed there. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-27 15:41:13 +02:00
Oskar Kapala	28e9534765	observer: service_healthy resolves active incidents service_healthy is a positive health confirmation — if the service had an active incident (e.g. from earlier service_unhealthy events), that incident should be resolved when the service is confirmed healthy. Previously only service_recovered resolved incidents; service_healthy set status=healthy but left incidents open, keeping status='degraded'. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-27 15:20:19 +02:00
Oskar Kapala	4e8968f9c7	Fix service health tracking: emit service_healthy, control-plane endpoint check, cleanup checkpoint migration - node_agent: emit service_healthy for all running managed containers so observer populates services.json (previously empty → supervisor flooded action queue with missing_service redeploys for healthy services) - node_agent: VPS-only _check_control_plane_health() probes the HTTP endpoint to emit service_healthy/unhealthy for the 'control-plane' logical service (multi-container stack, container names don't match service name) - node_agent: fix _cleanup_control_plane_fs() to read new node_checkpoints format from observer checkpoint (was reading old last_processed_file key, always found nothing, never cleaned up old events) - observer: handle service_healthy event type → sets service status healthy without resolving incidents (unlike service_recovered which also resolves) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-27 14:49:56 +02:00
Oskar Kapala	f4a8db93e4	fix(observer): per-node-directory checkpoints replace single global checkpoint The old mechanism tracked a single 'last_processed_file' and used sorted filename order to find new events. Remote nodes ship events into subdirectories (events/piha/, events/chelsty-infra/) that sort alphabetically BEFORE the VPS directory (events/vps/). Once the checkpoint pointed to a vps/ file, all piha/ and chelsty-infra/ events were silently skipped forever. New mechanism: - node_checkpoints: {node_dir: last_processed_path} - Each node directory has its own independent cursor - New events = files whose path > that node's checkpoint - Backward-compatible: old 'last_processed_file' is migrated by extracting the node dir from the path on first load Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-27 14:16:58 +02:00
Oskar Kapala	96bf32614f	fix(observer+operator-ui): fix stale world state, dict→list API, event time filter Root cause of stale data: - node_agent.py falls back to socket.gethostname() when NODE_NAME is unset. Inside a Docker container this returns the 12-char container ID (e.g. 'be17cb6eb0f6'), not the host name. Observer ingested those events and created ghost entries in world/nodes.json that never expired. observer.py: - _prune_stale_world(): removes node/service/incident entries for nodes absent from topology inventory; called on every run_once() cycle (both new-events and idle paths). Resolved incidents older than 7 days are also aged out. - _save_world(): now writes node_count and service_count to runtime-summary.json so the Dashboard's System Overview cards show real numbers instead of undefined. operator_ui.py: - current_nodes/services/deployments/incidents(): the observer stores world state as keyed dicts; the frontend calls .map() which requires an array. All four functions now convert the dict to a properly-shaped list. Each item has the fields the Nodes, Services, Topology, Deployments, and Correlation views expect (hostname, health, capabilities, desired_state, dependencies, etc.). - current_incidents(): synthesises a human-readable 'message' field from node + service + trigger_type (observer does not store one; dashboard showed undefined). - current_events(): adds a 24 h time filter (EVENTS_MAX_AGE_HOURS env var, default 24). Without this, every event file ever written was returned, including events from ghost-node deploys. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-27 13:51:03 +02:00
Oskar Kapala	01b7758fe6	feat(node-agent): implement health monitor and safe cleanup policy scripts/monitor/health-monitor.sh (new): - Standalone bash health monitor: disk/RAM/CPU checks + docker container health - Per-node-type cleanup policy enforced: lte_node (chelsty-infra, chelsty-ha): NO cleanup, no docker ops sd_card (piha, saturn): dangling images + containers, rate-limited once/24h ai_node (solaria): dangling + containers + build cache, NEVER -a standard (vps): dangling + containers + build cache + CP filesystem rotation - VPS filesystem rotation: completed/failed actions >7d, deploy logs >30d, events >3d AND past observer checkpoint - Emits structured JSON events (node_health, disk_pressure, high_memory, high_cpu, containers_not_running, healthcheck_failed) services/node-agent/ (new): - Python daemon (node_agent.py): same policy as bash script, Docker SDK for container checks and cleanup, /proc for system metrics - Optional event shipping to VPS via rsync+SSH (VPS_EVENTS_HOST env var) - Dockerfile: python:3.11-slim + openssh-client + rsync + docker>=6.0 - docker-compose.yml: mounts docker socket, /opt/homelab, repo read-only observer.py: - Handle node_health: update node status + disk/mem/cpu metrics, clear disk_pressure - Handle disk_pressure: record severity on node, clear when healthy - Handle high_memory / high_cpu: record pressure level for correlation supervisor.py: - Add NO_DISK_CLEANUP_NODES = {chelsty-infra, chelsty-ha} - reconcile() step 3: generate disk_cleanup actions for nodes with high disk pressure - _generate_disk_cleanup_recommendation(): stable ID disk-cleanup-{node}, checks all active states, risk=guarded (operator approval required) executor.py: - Handle disk_cleanup action type via _execute_disk_cleanup() - Commands come from action payload; safety gate rejects any command touching /opt/homelab/data/, /opt/homelab/config/, /opt/homelab/state/, or rm -rf / hosts/*/services.yaml: - Rename stability-agent -> node-agent on piha, vps, solaria, chelsty-infra - Add node-agent to chelsty-ha (previously missing) - Add cleanup policy notes to LTE node comments Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-27 13:15:06 +02:00
Oskar Kapala	7742bda245	feat(control-plane): add container_restart remediation - observer: store trigger_type on incidents for supervisor routing - supervisor: route containers_not_running/mqtt_unreachable to container_restart instead of redeploy - supervisor: fix node alias normalization via NODE_ALIAS_MAP - supervisor: fix pending action dedup (scan by content not filename) - executor: implement container_restart via SSH docker restart with retry - control-plane override: configure NODE_ALIAS_MAP for production Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-27 12:50:46 +02:00
Oskar Kapala	533b8e846d	Add heartbeat updates and improve health checks in control-plane components	2026-05-12 20:59:46 +02:00
Oskar Kapala	8f5b905015	Implement observer runtime world synthesis engine	2026-05-12 14:07:03 +02:00

12 commits