homelab-codex-ws

Author	SHA1	Message	Date
Oskar Kapala	c255a021d1	fix(observer): quarantine malformed event files to prevent processing wedge Was: malformed event (bad JSON / truncated / corrupted bytes) wedged the node's checkpoint forever — every cycle re-tried, logged, never advanced past the bad file; all subsequent good events for that node lost. Now: first parse failure -> atomic os.replace to STATE_DIR/observer_failed_events/<node>/ with collision handling. Checkpoint advances, downstream events flow. Move failures themselves are logged but don't crash the loop. Complementary to yesterday's atomic_write_json fix (state files); this addresses the same race-pattern on event files instead. Regression test asserts: bad event quarantined to failed_events dir, removed from hot path, subsequent good event processed (node online), checkpoint moves to good event.	2026-06-12 11:22:56 +02:00
Oskar Kapala	d60b28a949	feat(ha-diag-agent): add piha deploy config - hosts/piha/runtime/ha-diag-agent/docker-compose.override.yml: mem_limit 128m, hardcoded events volume (/opt/homelab/events/piha:/events) to avoid ${NODE_NAME} shell-expansion issue in deploy-node.sh - services/ha-diag-agent/env.example: per-host HA_URL comments (piha vs chelsty-infra tailscale), HA_TOKEN source note Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-11 14:10:06 +02:00
Oskar Kapala	039f9f7247	feat(piha): brain-watchdog — external watchdog for control-plane Polls /summary on VPS over Tailscale every 60s; computes freshness locally from last_update epoch (never trusts self-reported status). Alerts via Telegram Bot API directly after 3 consecutive failures; sends recovery message on heal. State (fail_count, alerted) persisted to volume so debounce survives restarts. - services/brain-watchdog/: Python service, no external deps (stdlib only) - hosts/piha/runtime/brain-watchdog/: override with mem_limit 64m - hosts/piha/services.yaml + inventory/topology.yaml: manifest entries Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-01 17:54:36 +02:00
Oskar Kapala	ab8895d28b	feat(ha-diag-agent): scaffold service with HA REST client and event emitter - new per-host service, follows node-agent pattern - 7 new HA event types defined (routing in supervisor — Phase 5) - HeartbeatCheck as pipeline validator (pings /api/, emits ha_websocket_dead) - service.yaml + host configs for piha (ken) and chelsty-infra (chelsty) - test scaffolding with aiohttp/aiosqlite mocks (15/15 passing) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-29 12:26:34 +02:00
Oskar Kapala	7277bdc27f	Fix Copy for AI: materializer fetches from control-plane API instead of Redis services/agent-system/runtime-materializer/materializer.py: - Add materialize_from_api() that fetches all world-state endpoints from the control-plane HTTP API (CONTROL_PLANE_URL env var) - When CONTROL_PLANE_URL is set, use API as source of truth instead of Redis - Redis path preserved as fallback for backward compat hosts/piha/runtime/agent-system/docker-compose.override.yml (new): - Inject CONTROL_PLANE_URL=http://100.95.58.48:18180 for runtime-materializer - piha webui /snapshot now mirrors VPS observer output (clean, ghost-free) Root cause: materializer read from Redis which held 80 stale service entries with hash-prefixed ghost keys (e.g. 0ccb8a88e079_control-plane-supervisor). Redis is never updated by the current observer pipeline; the control-plane API is the single authoritative world-state source. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-27 16:07:51 +02:00
Oskar Kapala	2349de518b	fix(node-agent): correct VPS_EVENTS_HOST to actual VPS Tailscale IP 100.108.208.3 is piha's Tailscale IP (piha hosts Forgejo+Redis). VPS's actual Tailscale IP is 100.95.58.48. All three node-agent overrides were pointing at piha itself, causing containers to SSH to their own host and fail auth. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-27 14:07:27 +02:00
Oskar Kapala	65bac4ebfe	fix(node-agent): mount host SSH key into container for event shipping Nodes ship events to VPS via rsync+SSH. The container runs as root and uses the default SSH identity, which must be at /root/.ssh/. Mount /home/oskar/.ssh from the host read-only so the existing authorized key is available inside the container. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-27 13:59:28 +02:00
Oskar Kapala	ae33cce889	feat(node-agent): add runtime overrides for piha, solaria, chelsty-infra - piha: NODE_TYPE=sd_card (rate-limited docker prune, once per day) - solaria: NODE_TYPE=ai_node (dangling+containers+build cache; never -a to preserve Ollama images) - chelsty-infra: NODE_TYPE=lte_node (NO cleanup, events-only) - All three: VPS_EVENTS_HOST set for event shipping via rsync+SSH Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-27 13:34:23 +02:00
Oskar Kapala	01b7758fe6	feat(node-agent): implement health monitor and safe cleanup policy scripts/monitor/health-monitor.sh (new): - Standalone bash health monitor: disk/RAM/CPU checks + docker container health - Per-node-type cleanup policy enforced: lte_node (chelsty-infra, chelsty-ha): NO cleanup, no docker ops sd_card (piha, saturn): dangling images + containers, rate-limited once/24h ai_node (solaria): dangling + containers + build cache, NEVER -a standard (vps): dangling + containers + build cache + CP filesystem rotation - VPS filesystem rotation: completed/failed actions >7d, deploy logs >30d, events >3d AND past observer checkpoint - Emits structured JSON events (node_health, disk_pressure, high_memory, high_cpu, containers_not_running, healthcheck_failed) services/node-agent/ (new): - Python daemon (node_agent.py): same policy as bash script, Docker SDK for container checks and cleanup, /proc for system metrics - Optional event shipping to VPS via rsync+SSH (VPS_EVENTS_HOST env var) - Dockerfile: python:3.11-slim + openssh-client + rsync + docker>=6.0 - docker-compose.yml: mounts docker socket, /opt/homelab, repo read-only observer.py: - Handle node_health: update node status + disk/mem/cpu metrics, clear disk_pressure - Handle disk_pressure: record severity on node, clear when healthy - Handle high_memory / high_cpu: record pressure level for correlation supervisor.py: - Add NO_DISK_CLEANUP_NODES = {chelsty-infra, chelsty-ha} - reconcile() step 3: generate disk_cleanup actions for nodes with high disk pressure - _generate_disk_cleanup_recommendation(): stable ID disk-cleanup-{node}, checks all active states, risk=guarded (operator approval required) executor.py: - Handle disk_cleanup action type via _execute_disk_cleanup() - Commands come from action payload; safety gate rejects any command touching /opt/homelab/data/, /opt/homelab/config/, /opt/homelab/state/, or rm -rf / hosts/*/services.yaml: - Rename stability-agent -> node-agent on piha, vps, solaria, chelsty-infra - Add node-agent to chelsty-ha (previously missing) - Add cleanup policy notes to LTE node comments Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-27 13:15:06 +02:00
oskar	c9ddfa9ac1	Roll out stability agent to homelab nodes	2026-05-17 15:54:19 +02:00
Oskar Kapala	bbdbdb8321	Add node capability model	2026-05-11 20:46:50 +02:00
Oskar Kapala	04732a30d6	Add normalized host inventory skeletons	2026-05-10 22:12:36 +02:00
Oskar Kapala	d0540f7eb8	Add infrastructure standards and deployment conventions	2026-05-07 21:16:03 +02:00
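
The brain-watchdog commit (039f9f7247) describes a debounce: freshness is computed locally from the last_update epoch, an alert fires after 3 consecutive failures, a recovery message follows on heal, and fail_count/alerted are persisted. A minimal stdlib-only sketch of that logic might look like the following. The function names, the state file layout, and the MAX_AGE_SECONDS cutoff are assumptions for illustration and are not taken from the repository; polling and Telegram delivery are left out.

```python
import json
import os
import tempfile

FAIL_THRESHOLD = 3       # from the commit message: alert after 3 consecutive failures
MAX_AGE_SECONDS = 180    # assumption: the real staleness cutoff is not stated in the log


def is_fresh(summary, now, max_age=MAX_AGE_SECONDS):
    """Judge freshness from last_update alone, ignoring any self-reported status."""
    last_update = summary.get("last_update")
    if not isinstance(last_update, (int, float)):
        return False
    return (now - last_update) <= max_age


def step(state, healthy):
    """Advance the debounce state; return (new_state, message or None)."""
    if healthy:
        # Only announce recovery if an alert was actually sent earlier.
        message = "recovered" if state["alerted"] else None
        return {"fail_count": 0, "alerted": False}, message
    fail_count = state["fail_count"] + 1
    if fail_count >= FAIL_THRESHOLD and not state["alerted"]:
        return {"fail_count": fail_count, "alerted": True}, "alert"
    return {"fail_count": fail_count, "alerted": state["alerted"]}, None


def save_state(path, state):
    """Persist state via a temp file plus os.replace so a restart never sees a partial file."""
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as fh:
        json.dump(state, fh)
    os.replace(tmp, path)


def load_state(path):
    """Load persisted state, falling back to a clean state if missing or unreadable."""
    try:
        with open(path) as fh:
            return json.load(fh)
    except (OSError, ValueError):
        return {"fail_count": 0, "alerted": False}
```

Keeping step() as a pure function of (state, healthy) makes the "alert once, recover once" behaviour easy to test without a network or a clock.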

13 commits
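
The quarantine behaviour described in c255a021d1 (first parse failure moves the file aside with os.replace, collisions are handled, move failures are logged without crashing, and the checkpoint then reaches later good events) could be sketched roughly as below. This is an illustration under assumptions, not the observer's actual code: quarantine(), process_events(), the numeric collision suffix, and the name-ordered checkpoint are all hypothetical.

```python
import json
import os


def quarantine(event_path, failed_dir):
    """Move a bad event file out of the hot path; return the new path, or None on failure."""
    try:
        os.makedirs(failed_dir, exist_ok=True)
        name = os.path.basename(event_path)
        target = os.path.join(failed_dir, name)
        suffix = 0
        # Collision handling (assumed scheme): append .1, .2, ... until the name is free.
        while os.path.exists(target):
            suffix += 1
            target = os.path.join(failed_dir, f"{name}.{suffix}")
        # os.replace is atomic when source and target are on the same filesystem.
        os.replace(event_path, target)
        return target
    except OSError as exc:
        # A failed move is logged but must not stop the processing loop.
        print(f"quarantine failed for {event_path}: {exc}")
        return None


def process_events(event_dir, failed_dir, handle):
    """Process event files in name order; return the name of the last good event handled."""
    checkpoint = None
    for name in sorted(os.listdir(event_dir)):
        path = os.path.join(event_dir, name)
        if not os.path.isfile(path):
            continue
        try:
            with open(path, "rb") as fh:
                event = json.loads(fh.read().decode("utf-8"))
        except (OSError, ValueError):
            # Bad JSON, truncated content, or corrupted bytes: set aside and keep going.
            quarantine(path, failed_dir)
            continue
        handle(event)
        checkpoint = name
    return checkpoint
```

The existence check followed by os.replace is only safe with a single writer to the failed-events directory, which this sketch assumes.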