homelab-codex-ws

Author	SHA1	Message	Date
Oskar Kapala	3499b2f280	feat(ha-diag-agent): three REST diagnostic checks + Phase 3 flag fixes New checks: - SystemHealthCheck (15min interval): detects newly-failing HA integrations via /api/system_health snapshot diff; transition-based dedup (ok→error fires, sustained error silent, error→ok clears alert) - UpdatesAvailableCheck (daily cron 09:00): per-update ha_update_available events with 7-day dedup; release notes truncated at 2000 chars - UpdatesDigestCheck (Sunday cron 09:00): single digest event with all pending updates; weekly ISO-week dedup, independent of daily dedup key - AutomationFailuresCheck (30min interval): detects automations with N consecutive failures (default 3) via /api/trace/automation/<id>; 6h cooldown per automation Phase 3 flag fixes: - Flag #1 (since field): UnavailableEntitiesCheck now uses min(state.last_changed, baseline.first_seen) as effective "since", giving accurate duration when agent was offline at entity's first fail - Flag #3 (registry cache): HAClient.get_entity_registry() caches response in-process with configurable TTL (default 300s); avoids repeated API calls across concurrent check cycles; invalidate_registry_cache() for manual invalidation Storage: system_health_snapshot table (component, last_status, last_seen_at, payload) created automatically on next Storage.open() call Config additions (all with defaults): entity_registry_cache_ttl=300, system_health_check_interval=900, automation_check_interval=1800, automation_failure_threshold=3, updates_check_hour=9, updates_check_minute=0, updates_cooldown_days=7 Tests: 95 unit tests pass (49 new), 13 integration tests pass (9 new); 3 skipped (live-HA token not set in CI) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-29 14:43:10 +02:00
Oskar Kapala	f41ec5d0c5	docs: compress CLAUDE.md + fix zigbee2mqtt coordinator docs - CLAUDE.md: collapsed 5-section deployment block to single annotated block, removed inline emit_event signatures (kept path + type list), flattened runtime path tree to bullets, condensed node table note to reference capabilities.yaml, added CHELSTY docker-compose v1 constraint; 156 → 113 lines (~750 → ~480 tokens) - fix: zigbee2mqtt/README.md updated to TCP coordinator (SLZB-06U at 192.168.1.105:6638, ezsp); removed stale /dev/ttyACM0 USB reference and corrected owner node from piha to chelsty-infra Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-29 14:17:23 +02:00
Oskar Kapala	20f6761a67	feat(ha-diag-agent): UnavailableEntitiesCheck with root cause dedup - shared aiohttp ClientSession in HAClient (Phase 1 Flag #2 fixed): make_session() factory, session injected at startup, closed on shutdown - Check.run() → list[CheckResult]: clean multi-event interface - first real diagnostic check: entity unavailable > 24h (INSERT OR IGNORE baseline preserves first-seen timestamp) - root cause grouping: emit ha_integration_failed instead of N entity events when ≥50% of integration's entities are unavailable (≥3 min) - alert deduplication via SQLite cooldown window (default 6h) - recovery clears baseline + dedup for immediate re-alert - configurable thresholds: duration, integration %, cooldown - 38 unit tests + 7 integration tests (42 pass, 3 skip w/o live HA) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-29 13:41:55 +02:00
Oskar Kapala	07bd498fd6	feat(ha-diag-agent): test environment with dual HA Docker instances - dockerized ken + chelsty HA test instances with template fixtures - snapshot/reset/wait scripts for fixture management - integration test infrastructure with separate marker - location_tag promoted from metadata to event payload (Phase 1 flag #3) - chelsty-infra target_url points to chelsty-ha via tailnet (Phase 1 flag #1) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-29 12:56:13 +02:00
Oskar Kapala	90c8e77bf7	chore: gitignore *.egg-info, remove committed egg-info Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-29 12:26:57 +02:00
Oskar Kapala	ab8895d28b	feat(ha-diag-agent): scaffold service with HA REST client and event emitter - new per-host service, follows node-agent pattern - 7 new HA event types defined (routing in supervisor — Phase 5) - HeartbeatCheck as pipeline validator (pings /api/, emits ha_websocket_dead) - service.yaml + host configs for piha (ken) and chelsty-infra (chelsty) - test scaffolding with aiohttp/aiosqlite mocks (15/15 passing) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-29 12:26:34 +02:00
Oskar Kapala	bd7f955e4e	fix+debug(planner-agent): use base_url (not api_base) for litellm.acompletion, add print [TEMP] litellm.acompletion() has base_url as a named param; api_base only works via **kwargs fallback path. Switching to base_url ensures the value lands correctly in completion_kwargs and reaches the ollama provider. Print() added (not logger) so base_url is always visible in docker logs regardless of log level. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-28 13:07:58 +02:00
Oskar Kapala	99200e6690	debug(planner-agent): log api_base before each litellm call [TEMP] Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-28 12:52:11 +02:00
Oskar Kapala	dcacac6965	fix(planner-agent): rename OLLAMA_HOST → OLLAMA_API_BASE (litellm convention) LiteLLM reads OLLAMA_API_BASE, not OLLAMA_HOST. - llm_router.py: DEFAULT_OLLAMA_HOST → DEFAULT_OLLAMA_API_BASE, param ollama_host → ollama_api_base - planner.py: env var os.getenv("OLLAMA_HOST") → os.getenv("OLLAMA_API_BASE"), param renamed accordingly - /opt/homelab/config/planner-agent/.env on SOLARIA updated in-place (not in git) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-28 11:34:08 +02:00
Oskar Kapala	e52b2e2259	fix(planner-agent): remove duplicate ANTHROPIC_API_KEY from environment Key is already provided via env_file: /opt/homelab/config/planner-agent/.env Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-28 10:57:08 +02:00
Oskar Kapala	5ccdfa0ca6	docs: add planner-agent docs and session summary 2026-05-27 - services/planner-agent/README.md: full service doc (what it does, LLM fallback chain, env vars, deploy steps, local run, redis-cli end-to-end test, healthcheck) - README.md: add Agent System section with all agents and their roles - docs/sessions/2026-05-27-planner-agent.md: session summary (built files, architectural decisions, problems + solutions, deployment status, pending work) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-27 22:35:59 +02:00
Oskar Kapala	ff6fda1f04	planner-agent: use env_file, keep only ANTHROPIC_API_KEY in environment All runtime vars (REDIS_URL, OLLAMA_HOST, OLLAMA_MODEL, NODE_NAME, COOLDOWN_SECONDS, RUNTIME_PATH) are sourced from the host-local /opt/homelab/config/planner-agent/.env via env_file. Only ANTHROPIC_API_KEY stays in environment (not in env_file — secret injected at runtime by the operator when needed). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-27 22:27:44 +02:00
Oskar Kapala	ca37fca5ce	feat(planner-agent): main loop with LLM routing and HITL action proposals services/planner-agent/src/planner.py: - PlannerAgent: async Redis pub/sub on health_events + world_updates - Pipeline: receive event → cooldown gate → LLMRouter → write pending action → emit remediation_started filesystem event - CooldownTracker: 5-min suppression per svc_key (configurable via env) - parse_event(): accepts node-agent shape A and world_updates shape B - PROPOSAL_SCHEMA: jsonschema enforced by LLMRouter before accepting response - SYSTEM_PROMPT: homelab topology + action rules (chelsty always requires_human, disk_pressure always notify, confidence<0.7 → requires_human) - write_pending_action(): atomic tmp→rename write, executor-compatible format - emit_event(): async wrapper around filesystem event write (no control-plane import) - _emit_event_sync() reads NODE_NAME at call time (not import) for testability - Benign events (service_healthy, node_online, ...) silently skipped - LLM chain failure: no cooldown recorded so next event can retry services/planner-agent/tests/test_planner.py (49 tests, 0 network): - TestCooldownTracker: 7 tests (ready/not-ready/elapsed/reset/independence) - TestHealthEvent, TestActionProposal, TestMapActionToExecutorType - TestParseEvent: both event shapes, missing fields, timestamp formats - TestBuildMessages: system prompt rules, payload inclusion - TestPlannerHandleEvent: benign skip, cooldown block, ignore/restart/redeploy/notify proposals, remediation event emission, LLM failure isolation, requires_human propagation, cooldown recording, model name in proposal - TestPlannerDispatch: valid JSON, invalid JSON, non-string data, missing node - TestWritePendingAction, TestEmitEvent: filesystem integration with tmp_path services/planner-agent/service.yaml: owner_node: solaria, dependencies: [redis, ollama] services/planner-agent/docker-compose.yml: env + healthcheck services/planner-agent/Dockerfile: python:3.11-slim services/planner-agent/healthcheck.sh: heartbeat file age check (300s) services/planner-agent/requirements.txt: litellm, redis, jsonschema, structlog Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-27 19:11:39 +02:00
Oskar Kapala	1bbc511bb7	feat(planner-agent): add llm_router.py with local-first fallback chain services/planner-agent/src/llm_router.py: - LLMRouter: async routing via litellm; chain = Qwen/Ollama → haiku → sonnet - Timeouts: 8s local, 30s cloud; asyncio.wait_for belt-and-suspenders - Rejection triggers: timeout, API error, refusal patterns, JSON schema fail - JSON fence extraction: recovers valid JSON from blocks - ModelMetrics: per-model success/fallback/error counters + success_rate() - Redis publish to 'llm_router_metrics' after every call (failure-safe) - redis_url=None disables Redis (useful in tests / edge nodes) - context= param adds caller label to all log lines for tracing services/planner-agent/tests/test_llm_router.py: - 34 tests, 0 network calls (litellm + Redis fully mocked) - Covers: primary success, JSON error fallback, refusal fallback, timeout fallback, API exception fallback, all-fail RuntimeError, schema validation, fence extraction, metrics recording, Redis publish, Redis failure isolation services/planner-agent/requirements.txt: - litellm>=1.40.0, redis>=5.0.0, jsonschema>=4.21.0 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-27 18:38:06 +02:00
Oskar Kapala	603e10a364	docs: session summary 2026-05-27 + update observer/control-plane/chelsty docs docs/sessions/2026-05-27.md (new): - Full session record: problems found, all commits shipped, end state - Written in Polish per operator preference for session notes - Known limitations: SLZB-06U offline, ezsp→ember migration pending docs/observer-runtime.md: - Document per-node checkpoint format (replaces old global checkpoint) - Add service_healthy / service_recovered resolution behavior - Document ghost key pruning (_prune_stale_world patterns) - Add event type reference table (negative vs positive) docs/vps-control-plane.md: - Add container names and network_mode: host detail - Document monitor:false, NODE_ALIAS_MAP, auto-cancel behavior - Add piha agent-system materializer integration note - Rewrite recovery section with actionable bootstrap-flood diagnosis - Add action state machine (pending→approved→running→completed/cancelled) docs/chelsty-runtime.md: - Add chelsty-infra/chelsty-ha node table - Document docker-compose v1 constraint (always use docker-compose, not docker compose) - Add mosquitto network_mode:host + z2m extra_hosts:host-gateway explanation - Add z2m config writable requirement (EROFS failure mode documented) - Add chelsty-ha monitor:false rationale - Add minimal configuration.yaml template for z2m Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-27 16:18:31 +02:00
Oskar Kapala	7277bdc27f	Fix Copy for AI: materializer fetches from control-plane API instead of Redis services/agent-system/runtime-materializer/materializer.py: - Add materialize_from_api() that fetches all world-state endpoints from the control-plane HTTP API (CONTROL_PLANE_URL env var) - When CONTROL_PLANE_URL is set, use API as source of truth instead of Redis - Redis path preserved as fallback for backward compat hosts/piha/runtime/agent-system/docker-compose.override.yml (new): - Inject CONTROL_PLANE_URL=http://100.95.58.48:18180 for runtime-materializer - piha webui /snapshot now mirrors VPS observer output (clean, ghost-free) Root cause: materializer read from Redis which held 80 stale service entries with hash-prefixed ghost keys (e.g. 0ccb8a88e079_control-plane-supervisor). Redis is never updated by the current observer pipeline; the control-plane API is the single authoritative world-state source. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-27 16:07:51 +02:00
Oskar Kapala	b40b832159	Fix ghost service keys from hash-prefixed Docker container names node-agent: use com.docker.compose.service label as canonical name - Add _canonical_container_name() method: prefers compose label, falls back to hash-prefix-stripped c.name - Replace bare c.name usage in check_containers() - Skip 'created'-state containers (Docker stale-state artifacts) observer: prune hash-prefixed ghost keys in _prune_stale_world() - Each reconcile cycle removes service keys matching <node>/<12hex>_<name> - Acts as safety net for entries already in services.json + future slippage control-plane/docker-compose.yml already has explicit container_name on all four services — no change needed there. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-27 15:41:13 +02:00
Oskar Kapala	28e9534765	observer: service_healthy resolves active incidents service_healthy is a positive health confirmation — if the service had an active incident (e.g. from earlier service_unhealthy events), that incident should be resolved when the service is confirmed healthy. Previously only service_recovered resolved incidents; service_healthy set status=healthy but left incidents open, keeping status='degraded'. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-27 15:20:19 +02:00
Oskar Kapala	46ae92b5c1	supervisor: also cancel pending actions for services removed from desired state Previously _cancel_resolved_pending_actions() only cancelled actions where the service became healthy. This left orphaned actions when a service was removed from services.yaml or marked monitor:false. Add Case 1: if the action's svc_key is no longer in desired_state (either removed entirely or skipped due to monitor:false), cancel with reason service_removed_from_desired_state. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-27 15:19:13 +02:00
Oskar Kapala	410bfe7065	zigbee2mqtt: config goes in data dir (writable), not separate ro mount z2m migrates configuration.yaml on startup and needs write access. Remove the separate :ro config mount; rely on the base compose's /opt/homelab/data/zigbee2mqtt/data:/app/data read-write mount instead. configuration.yaml must exist at that path on the node before first run. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-27 15:13:33 +02:00
Oskar Kapala	b3912fe0ce	zigbee2mqtt: use extra_hosts host-gateway instead of network_mode: host docker-compose v1 cannot clear the ports list from the base compose with ports: [] in an override, so network_mode: host caused InvalidArgument. Use extra_hosts with host-gateway instead: maps 'mosquitto' hostname to the Docker bridge gateway IP so mqtt://mosquitto:1883 reaches the host-networked mosquitto process from within the bridge-networked z2m container. Requires Docker 20.10+ (present on chelsty-infra). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-27 15:12:33 +02:00
Oskar Kapala	61e07f4318	zigbee2mqtt override: clear ports list for docker-compose v1 host network compat docker-compose v1 (1.29.2 on chelsty-infra) raises InvalidArgument when network_mode: host is combined with port_bindings from the base compose file. Add ports: [] in the override to clear the base ports list. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-27 15:11:42 +02:00
Oskar Kapala	51002d4502	Fix pending actions: node_exporter, zigbee2mqtt, chelsty-ha monitoring node_exporter (new service): - Add services/node_exporter/docker-compose.yml matching solaria deployment (network_mode: host, pid: host, /:/host:ro,rslave mount) - Add services/node_exporter/service.yaml zigbee2mqtt chelsty-infra override: - Fix network_mode: host (mosquitto runs on host network, port 1883 on localhost) - Fix volume mount: ./configuration.yaml → absolute /opt/homelab/config/zigbee2mqtt/ (secrets stay in runtime config dir, never in Git) - Remove MQTT_USER/MQTT_PASSWORD (mosquitto uses allow_anonymous true) - Extend healthcheck start_period to 60s (z2m takes time on first start) chelsty-ha/services.yaml: - Remove node-agent entry entirely (never deployed, no plans to bootstrap now) - Keep homeassistant with monitor: false (no node-agent = no health events) supervisor: respect monitor: false in services.yaml - Skip action generation for services where monitor=false - Cleans up chelsty-ha entries from action queue without removing desired-state docs Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-27 15:10:48 +02:00
Oskar Kapala	fb7828b52b	supervisor: auto-cancel pending actions when drift is resolved When a service becomes healthy (node-agent emits service_healthy → observer updates services.json), any previously queued redeploy/container_restart action is stale. Without cleanup, the queue accumulates old actions that require manual rejection. _cancel_resolved_pending_actions() runs after each reconcile cycle: - Reads all pending/*.json with type=redeploy or container_restart - If the service is now healthy in actual_state, moves action to cancelled/ with reason=drift_resolved_auto - Only pending actions are touched; approved/running are left to the operator Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-27 14:58:55 +02:00
Oskar Kapala	2f1965733f	fix(node-agent): unique event IDs per service to prevent same-second overwrites Multiple service_healthy (or containers_not_running) events emitted in the same second for different containers shared the same filename pattern evt-{node}-{ts}-{type}.json — the second write silently overwrote the first, so the observer only ever saw the last container checked per event type per cycle. Fix: include a sanitized service name slug in the ID so every event gets a unique file, e.g. evt-vps-1234-service_healthy-node-agent.json. Also adds import re (required for re.sub in the slug generation). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-27 14:55:22 +02:00
Oskar Kapala	267742c7d7	vps/node-agent: add network_mode: host for control-plane health probe The _check_control_plane_health() method probes localhost:18180, which is the control-plane's mapped port. Inside a bridged container, localhost resolves to the container's own loopback — the probe always fails. host network mode shares the VPS host's network namespace so that localhost:18180 correctly reaches the control-plane. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-27 14:52:32 +02:00
Oskar Kapala	4e8968f9c7	Fix service health tracking: emit service_healthy, control-plane endpoint check, cleanup checkpoint migration - node_agent: emit service_healthy for all running managed containers so observer populates services.json (previously empty → supervisor flooded action queue with missing_service redeploys for healthy services) - node_agent: VPS-only _check_control_plane_health() probes the HTTP endpoint to emit service_healthy/unhealthy for the 'control-plane' logical service (multi-container stack, container names don't match service name) - node_agent: fix _cleanup_control_plane_fs() to read new node_checkpoints format from observer checkpoint (was reading old last_processed_file key, always found nothing, never cleaned up old events) - observer: handle service_healthy event type → sets service status healthy without resolving incidents (unlike service_recovered which also resolves) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-27 14:49:56 +02:00
Oskar Kapala	f4a8db93e4	fix(observer): per-node-directory checkpoints replace single global checkpoint The old mechanism tracked a single 'last_processed_file' and used sorted filename order to find new events. Remote nodes ship events into subdirectories (events/piha/, events/chelsty-infra/) that sort alphabetically BEFORE the VPS directory (events/vps/). Once the checkpoint pointed to a vps/ file, all piha/ and chelsty-infra/ events were silently skipped forever. New mechanism: - node_checkpoints: {node_dir: last_processed_path} - Each node directory has its own independent cursor - New events = files whose path > that node's checkpoint - Backward-compatible: old 'last_processed_file' is migrated by extracting the node dir from the path on first load Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-27 14:16:58 +02:00
Oskar Kapala	a5a3e223dc	fix(node-agent): skip SSH config file in rsync to avoid UID ownership errors When ~/.ssh is mounted from the host oskar user into a container that runs as root, OpenSSH rejects ~/.ssh/config with 'Bad owner or permissions' because the file UID doesn't match the running process. Add -F /dev/null to the rsync SSH command to skip the config file entirely. Also add UserKnownHostsFile=/dev/null so no known_hosts write is attempted into a potentially read-only mounted .ssh dir. The key itself (/root/.ssh/id_rsa) is still read as an implicit default identity and is not affected by -F. Reproduces on chelsty-infra (has ~/.ssh/config); safe for all nodes. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-27 14:12:19 +02:00
Oskar Kapala	2349de518b	fix(node-agent): correct VPS_EVENTS_HOST to actual VPS Tailscale IP 100.108.208.3 is piha's Tailscale IP (piha hosts Forgejo+Redis). VPS's actual Tailscale IP is 100.95.58.48. All three node-agent overrides were pointing at piha itself, causing containers to SSH to their own host and fail auth. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-27 14:07:27 +02:00
Oskar Kapala	65bac4ebfe	fix(node-agent): mount host SSH key into container for event shipping Nodes ship events to VPS via rsync+SSH. The container runs as root and uses the default SSH identity, which must be at /root/.ssh/. Mount /home/oskar/.ssh from the host read-only so the existing authorized key is available inside the container. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-27 13:59:28 +02:00
Oskar Kapala	96bf32614f	fix(observer+operator-ui): fix stale world state, dict→list API, event time filter Root cause of stale data: - node_agent.py falls back to socket.gethostname() when NODE_NAME is unset. Inside a Docker container this returns the 12-char container ID (e.g. 'be17cb6eb0f6'), not the host name. Observer ingested those events and created ghost entries in world/nodes.json that never expired. observer.py: - _prune_stale_world(): removes node/service/incident entries for nodes absent from topology inventory; called on every run_once() cycle (both new-events and idle paths). Resolved incidents older than 7 days are also aged out. - _save_world(): now writes node_count and service_count to runtime-summary.json so the Dashboard's System Overview cards show real numbers instead of undefined. operator_ui.py: - current_nodes/services/deployments/incidents(): the observer stores world state as keyed dicts; the frontend calls .map() which requires an array. All four functions now convert the dict to a properly-shaped list. Each item has the fields the Nodes, Services, Topology, Deployments, and Correlation views expect (hostname, health, capabilities, desired_state, dependencies, etc.). - current_incidents(): synthesises a human-readable 'message' field from node + service + trigger_type (observer does not store one; dashboard showed undefined). - current_events(): adds a 24 h time filter (EVENTS_MAX_AGE_HOURS env var, default 24). Without this, every event file ever written was returned, including events from ghost-node deploys. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-27 13:51:03 +02:00
Oskar Kapala	ae33cce889	feat(node-agent): add runtime overrides for piha, solaria, chelsty-infra - piha: NODE_TYPE=sd_card (rate-limited docker prune, once per day) - solaria: NODE_TYPE=ai_node (dangling+containers+build cache; never -a to preserve Ollama images) - chelsty-infra: NODE_TYPE=lte_node (NO cleanup, events-only) - All three: VPS_EVENTS_HOST set for event shipping via rsync+SSH Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-27 13:34:23 +02:00
Oskar Kapala	c5c080b3e3	feat(vps): add node-agent runtime override with NODE_NAME=vps Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-27 13:18:19 +02:00
Oskar Kapala	01b7758fe6	feat(node-agent): implement health monitor and safe cleanup policy scripts/monitor/health-monitor.sh (new): - Standalone bash health monitor: disk/RAM/CPU checks + docker container health - Per-node-type cleanup policy enforced: lte_node (chelsty-infra, chelsty-ha): NO cleanup, no docker ops sd_card (piha, saturn): dangling images + containers, rate-limited once/24h ai_node (solaria): dangling + containers + build cache, NEVER -a standard (vps): dangling + containers + build cache + CP filesystem rotation - VPS filesystem rotation: completed/failed actions >7d, deploy logs >30d, events >3d AND past observer checkpoint - Emits structured JSON events (node_health, disk_pressure, high_memory, high_cpu, containers_not_running, healthcheck_failed) services/node-agent/ (new): - Python daemon (node_agent.py): same policy as bash script, Docker SDK for container checks and cleanup, /proc for system metrics - Optional event shipping to VPS via rsync+SSH (VPS_EVENTS_HOST env var) - Dockerfile: python:3.11-slim + openssh-client + rsync + docker>=6.0 - docker-compose.yml: mounts docker socket, /opt/homelab, repo read-only observer.py: - Handle node_health: update node status + disk/mem/cpu metrics, clear disk_pressure - Handle disk_pressure: record severity on node, clear when healthy - Handle high_memory / high_cpu: record pressure level for correlation supervisor.py: - Add NO_DISK_CLEANUP_NODES = {chelsty-infra, chelsty-ha} - reconcile() step 3: generate disk_cleanup actions for nodes with high disk pressure - _generate_disk_cleanup_recommendation(): stable ID disk-cleanup-{node}, checks all active states, risk=guarded (operator approval required) executor.py: - Handle disk_cleanup action type via _execute_disk_cleanup() - Commands come from action payload; safety gate rejects any command touching /opt/homelab/data/, /opt/homelab/config/, /opt/homelab/state/, or rm -rf / hosts/*/services.yaml: - Rename stability-agent -> node-agent on piha, vps, solaria, chelsty-infra - Add node-agent to chelsty-ha (previously missing) - Add cleanup policy notes to LTE node comments Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-27 13:15:06 +02:00
Oskar Kapala	7742bda245	feat(control-plane): add container_restart remediation - observer: store trigger_type on incidents for supervisor routing - supervisor: route containers_not_running/mqtt_unreachable to container_restart instead of redeploy - supervisor: fix node alias normalization via NODE_ALIAS_MAP - supervisor: fix pending action dedup (scan by content not filename) - executor: implement container_restart via SSH docker restart with retry - control-plane override: configure NODE_ALIAS_MAP for production Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-27 12:50:46 +02:00
oskar	98fe1f1846	fix: frigate config not read-only, mount from /opt/homelab	2026-05-22 11:31:31 +02:00
oskar	beb8b5cbaa	fix: remove --pull always flag incompatible with docker-compose v1	2026-05-21 22:07:49 +02:00
oskar	898deda05f	fix: deploy-frigate.sh use docker-compose v1 for chelsty-infra	2026-05-21 22:05:43 +02:00
oskar	f34399a30d	feat: add Frigate NVR deployment for chelsty-infra VAAPI decode via Intel UHD 630, CPU detection, 2x Reolink RLC-540 placeholders. MQTT to local mosquitto (127.0.0.1), 7-day recording retention. Secrets in /opt/homelab/config/frigate/frigate.env on node. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 18:19:45 +02:00
oskar	9b39581b53	fix(supervisor): content-based action IDs to prevent 30s backlog accumulation Timestamp in reconcile-{ts}-{node}-{service} meant dedup guard never fired. Switch to reconcile-{node}-{service} and check pending/approved/running states. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 17:47:37 +02:00
oskar	ae7446a04b	feat: add Copy for AI snapshot button to webui Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 12:05:37 +02:00
oskar	f21be4f4d4	ops: align vps desired state with control-plane architecture, remove legacy agent-system references Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 11:40:55 +02:00
oskar	8fb4d3d634	docs: add tech-debt.md, forgejo_runner temp disabled	2026-05-21 10:37:42 +02:00
oskar	35e57cc789	docs(CLAUDE.md): update node model and override path convention - split CHELSTY into CHELSTY-INFRA and CHELSTY-HA in node roles table - correct docker-compose override path to hosts/<node>/runtime/<service>/ Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-20 15:27:46 +02:00
oskar	b02c8bb50e	fix(deploy): inventory-aware orchestration and correct override paths - orchestrate-deploy.sh: read nodes from inventory/topology.yaml instead of hardcoded list - orchestrate-deploy.sh: LTE nodes (chelsty-infra, chelsty-ha) use ConnectTimeout=30, non-fatal on failure - deploy-node.sh: service discovery falls back to services.yaml if no services.txt - deploy-node.sh: override path corrected to hosts/<node>/runtime/<service>/ Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-20 14:50:01 +02:00
oskar	dc483ae31a	docs(chelsty): update docs and topology for site/node split - chelsty-runtime.md: references chelsty-infra and chelsty-ha nodes - chelsty-stability-agent.md: scoped to chelsty-infra - topology.yaml: chelsty monolith replaced with chelsty-infra + chelsty-ha	2026-05-20 14:23:57 +02:00
oskar	9d2f748557	refactor(hosts): split chelsty monolith into chelsty-ha and chelsty-infra - remove legacy hosts/chelsty/ monolith - chelsty-infra: add capabilities, networking, paths, runtime (mosquitto, zigbee2mqtt, stability-agent) - chelsty-ha: add capabilities - align with site/node model Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-20 14:20:49 +02:00
oskar	8a12b7ff17	docs: supplement documentation with AI agents in mind Co-authored-by: Junie <junie@jetbrains.com>	2026-05-20 12:06:23 +02:00
oskar	f65698925e	Fix control plane SSH deploy TTY	2026-05-18 21:41:47 +02:00
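
Commit f4a8db93e4 above replaces a single global observer checkpoint with one cursor per node directory. A minimal sketch of that idea follows, assuming event paths of the form events/<node>/<file>; the function names migrate_checkpoint and new_events are hypothetical and are not taken from observer.py.

```python
from pathlib import PurePosixPath


def migrate_checkpoint(cp: dict) -> dict:
    """Convert the old single-cursor format into per-node cursors.

    Old (assumed) shape: {"last_processed_file": "events/vps/evt-1.json"}
    New (assumed) shape: {"node_checkpoints": {"vps": "events/vps/evt-1.json"}}
    """
    if "node_checkpoints" in cp:
        return cp
    node_checkpoints = {}
    old = cp.get("last_processed_file")
    if old:
        # The node directory is the parent directory of the event file.
        node_checkpoints[PurePosixPath(old).parent.name] = old
    return {"node_checkpoints": node_checkpoints}


def new_events(paths, node_checkpoints):
    """Return paths newer than the checkpoint of their own node directory."""
    fresh = []
    for p in sorted(paths):
        node_dir = PurePosixPath(p).parent.name
        # Compare only against this node's cursor, never a global one.
        if p > node_checkpoints.get(node_dir, ""):
            fresh.append(p)
    return fresh
```

With a global cursor pointing at a vps/ file, any piha/ path would sort before it and be skipped; with per-node cursors a node with no checkpoint yet is compared against the empty string, so its events are picked up.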


108 commits
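
Commit 2f1965733f in the log fixes same-second overwrites by adding a sanitized service-name slug to each event ID. A rough sketch of that scheme is shown here; the helper name event_id and the exact sanitizing regex are assumptions, and only the evt-{node}-{ts}-{type} pattern and the example ID come from the commit message.

```python
import re
from typing import Optional


def event_id(node: str, ts: int, event_type: str, service: Optional[str] = None) -> str:
    """Build an event ID that stays unique per service within one second."""
    base = f"evt-{node}-{ts}-{event_type}"
    if not service:
        return base
    # Assumed sanitizer: collapse anything outside [A-Za-z0-9_-] into '-'.
    slug = re.sub(r"[^A-Za-z0-9_-]+", "-", service).strip("-")
    return f"{base}-{slug}"


if __name__ == "__main__":
    # Two containers checked in the same second now get distinct file names.
    print(event_id("vps", 1234, "service_healthy", "node-agent") + ".json")
    print(event_id("vps", 1234, "service_healthy", "control-plane") + ".json")
```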