homelab-codex-ws

Author	SHA1	Message	Date
Oskar Kapala	28e9534765	observer: service_healthy resolves active incidents service_healthy is a positive health confirmation — if the service had an active incident (e.g. from earlier service_unhealthy events), that incident should be resolved when the service is confirmed healthy. Previously only service_recovered resolved incidents; service_healthy set status=healthy but left incidents open, keeping status='degraded'. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-27 15:20:19 +02:00
Oskar Kapala	4e8968f9c7	Fix service health tracking: emit service_healthy, control-plane endpoint check, cleanup checkpoint migration - node_agent: emit service_healthy for all running managed containers so observer populates services.json (previously empty → supervisor flooded action queue with missing_service redeploys for healthy services) - node_agent: VPS-only _check_control_plane_health() probes the HTTP endpoint to emit service_healthy/unhealthy for the 'control-plane' logical service (multi-container stack, container names don't match service name) - node_agent: fix _cleanup_control_plane_fs() to read new node_checkpoints format from observer checkpoint (was reading old last_processed_file key, always found nothing, never cleaned up old events) - observer: handle service_healthy event type → sets service status healthy without resolving incidents (unlike service_recovered which also resolves) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-27 14:49:56 +02:00
Oskar Kapala	f4a8db93e4	fix(observer): per-node-directory checkpoints replace single global checkpoint The old mechanism tracked a single 'last_processed_file' and used sorted filename order to find new events. Remote nodes ship events into subdirectories (events/piha/, events/chelsty-infra/) that sort alphabetically BEFORE the VPS directory (events/vps/). Once the checkpoint pointed to a vps/ file, all piha/ and chelsty-infra/ events were silently skipped forever. New mechanism: - node_checkpoints: {node_dir: last_processed_path} - Each node directory has its own independent cursor - New events = files whose path > that node's checkpoint - Backward-compatible: old 'last_processed_file' is migrated by extracting the node dir from the path on first load Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-27 14:16:58 +02:00
Oskar Kapala	96bf32614f	fix(observer+operator-ui): fix stale world state, dict→list API, event time filter Root cause of stale data: - node_agent.py falls back to socket.gethostname() when NODE_NAME is unset. Inside a Docker container this returns the 12-char container ID (e.g. 'be17cb6eb0f6'), not the host name. Observer ingested those events and created ghost entries in world/nodes.json that never expired. observer.py: - _prune_stale_world(): removes node/service/incident entries for nodes absent from topology inventory; called on every run_once() cycle (both new-events and idle paths). Resolved incidents older than 7 days are also aged out. - _save_world(): now writes node_count and service_count to runtime-summary.json so the Dashboard's System Overview cards show real numbers instead of undefined. operator_ui.py: - current_nodes/services/deployments/incidents(): the observer stores world state as keyed dicts; the frontend calls .map() which requires an array. All four functions now convert the dict to a properly-shaped list. Each item has the fields the Nodes, Services, Topology, Deployments, and Correlation views expect (hostname, health, capabilities, desired_state, dependencies, etc.). - current_incidents(): synthesises a human-readable 'message' field from node + service + trigger_type (observer does not store one; dashboard showed undefined). - current_events(): adds a 24 h time filter (EVENTS_MAX_AGE_HOURS env var, default 24). Without this, every event file ever written was returned, including events from ghost-node deploys. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-27 13:51:03 +02:00
Oskar Kapala	01b7758fe6	feat(node-agent): implement health monitor and safe cleanup policy scripts/monitor/health-monitor.sh (new): - Standalone bash health monitor: disk/RAM/CPU checks + docker container health - Per-node-type cleanup policy enforced: lte_node (chelsty-infra, chelsty-ha): NO cleanup, no docker ops sd_card (piha, saturn): dangling images + containers, rate-limited once/24h ai_node (solaria): dangling + containers + build cache, NEVER -a standard (vps): dangling + containers + build cache + CP filesystem rotation - VPS filesystem rotation: completed/failed actions >7d, deploy logs >30d, events >3d AND past observer checkpoint - Emits structured JSON events (node_health, disk_pressure, high_memory, high_cpu, containers_not_running, healthcheck_failed) services/node-agent/ (new): - Python daemon (node_agent.py): same policy as bash script, Docker SDK for container checks and cleanup, /proc for system metrics - Optional event shipping to VPS via rsync+SSH (VPS_EVENTS_HOST env var) - Dockerfile: python:3.11-slim + openssh-client + rsync + docker>=6.0 - docker-compose.yml: mounts docker socket, /opt/homelab, repo read-only observer.py: - Handle node_health: update node status + disk/mem/cpu metrics, clear disk_pressure - Handle disk_pressure: record severity on node, clear when healthy - Handle high_memory / high_cpu: record pressure level for correlation supervisor.py: - Add NO_DISK_CLEANUP_NODES = {chelsty-infra, chelsty-ha} - reconcile() step 3: generate disk_cleanup actions for nodes with high disk pressure - _generate_disk_cleanup_recommendation(): stable ID disk-cleanup-{node}, checks all active states, risk=guarded (operator approval required) executor.py: - Handle disk_cleanup action type via _execute_disk_cleanup() - Commands come from action payload; safety gate rejects any command touching /opt/homelab/data/, /opt/homelab/config/, /opt/homelab/state/, or rm -rf / hosts/*/services.yaml: - Rename stability-agent -> node-agent on piha, vps, solaria, chelsty-infra - Add node-agent to chelsty-ha (previously missing) - Add cleanup policy notes to LTE node comments Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-27 13:15:06 +02:00
Oskar Kapala	7742bda245	feat(control-plane): add container_restart remediation - observer: store trigger_type on incidents for supervisor routing - supervisor: route containers_not_running/mqtt_unreachable to container_restart instead of redeploy - supervisor: fix node alias normalization via NODE_ALIAS_MAP - supervisor: fix pending action dedup (scan by content not filename) - executor: implement container_restart via SSH docker restart with retry - control-plane override: configure NODE_ALIAS_MAP for production Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-27 12:50:46 +02:00
Oskar Kapala	533b8e846d	Add heartbeat updates and improve health checks in control-plane components	2026-05-12 20:59:46 +02:00
Oskar Kapala	8f5b905015	Implement observer runtime world synthesis engine	2026-05-12 14:07:03 +02:00
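
The per-node-directory checkpoint mechanism described in commit f4a8db93e4 can be sketched roughly as follows. This is an illustrative sketch only, not the code in observer.py: the helper names (node_dir_of, migrate_checkpoint, new_events, advance) and the example file names are assumptions; only the keys node_checkpoints and last_processed_file and the events/<node>/ layout come from the commit message.

```python
from pathlib import PurePosixPath


def node_dir_of(path: str) -> str:
    """Return the node directory of a path such as 'events/piha/0001.json'."""
    return PurePosixPath(path).parent.name


def migrate_checkpoint(state: dict) -> dict:
    """Convert the old single-cursor format into per-node cursors.

    The old 'last_processed_file' value is assigned to the node directory
    extracted from its own path, as the commit message describes.
    """
    checkpoints = dict(state.get("node_checkpoints", {}))
    old = state.get("last_processed_file")
    if old and node_dir_of(old) not in checkpoints:
        checkpoints[node_dir_of(old)] = old
    return {"node_checkpoints": checkpoints}


def new_events(paths: list[str], checkpoints: dict[str, str]) -> list[str]:
    """Select files whose path sorts after their own node's checkpoint.

    A node without a checkpoint compares against the empty string, so all
    of its files count as new.
    """
    return sorted(p for p in paths if p > checkpoints.get(node_dir_of(p), ""))


def advance(checkpoints: dict[str, str], processed: list[str]) -> dict[str, str]:
    """Move each node's cursor to the greatest path processed for that node."""
    updated = dict(checkpoints)
    for p in processed:
        node = node_dir_of(p)
        if p > updated.get(node, ""):
            updated[node] = p
    return updated


if __name__ == "__main__":
    # Hypothetical event files; the old global cursor points into vps/.
    paths = [
        "events/piha/001.json",
        "events/vps/001.json",
        "events/vps/002.json",
        "events/chelsty-infra/005.json",
    ]
    state = migrate_checkpoint({"last_processed_file": "events/vps/001.json"})
    pending = new_events(paths, state["node_checkpoints"])
    print(pending)
    print(advance(state["node_checkpoints"], pending))
```

Because each node directory is compared only against its own cursor, events under piha/ and chelsty-infra/ are still picked up even though those directory names sort before vps/.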

8 commits
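
The disk_cleanup safety gate mentioned in commit 01b7758fe6 could look something like the sketch below. It is a guess at the shape of the check, not the code in executor.py: the names PROTECTED_PREFIXES, ROOT_WIPE, is_command_safe and filter_commands are hypothetical, and treating "rm -rf /" as a match only when "/" is the entire target is an assumption. Only the three protected paths and the "rm -rf /" rule come from the commit message.

```python
import re

# Paths the commit message lists as off-limits for cleanup commands.
PROTECTED_PREFIXES = (
    "/opt/homelab/data/",
    "/opt/homelab/config/",
    "/opt/homelab/state/",
)

# Assumption: reject 'rm -rf /' only when '/' is the whole target, so that
# a command such as 'rm -rf /tmp/cache' is not caught by this rule.
ROOT_WIPE = re.compile(r"\brm\s+-rf\s+/(\s|$)")


def is_command_safe(command: str) -> bool:
    """Return False if the command touches a protected path or wipes root."""
    if any(prefix in command for prefix in PROTECTED_PREFIXES):
        return False
    if ROOT_WIPE.search(command):
        return False
    return True


def filter_commands(commands: list[str]) -> tuple[list[str], list[str]]:
    """Split the commands from an action payload into (allowed, rejected)."""
    allowed = [c for c in commands if is_command_safe(c)]
    rejected = [c for c in commands if not is_command_safe(c)]
    return allowed, rejected


if __name__ == "__main__":
    # Hypothetical payload commands for illustration.
    allowed, rejected = filter_commands([
        "docker image prune -f",
        "rm -rf /opt/homelab/data/cache",
        "rm -rf /",
    ])
    print("allowed:", allowed)
    print("rejected:", rejected)
```

A stricter gate would use an allowlist of known cleanup commands instead of a denylist, since the commands arrive in the action payload.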