homelab-codex-ws

Author	SHA1	Message	Date
Oskar Kapala	7f17b65278	fix(control-plane): run executor as uid 1000 with docker group access Executor was the only control-plane container running as root (uid=0), writing root-owned files to /opt/homelab via bind-mount and triggering false sudo on every deploy. - Dockerfile: add USER homelab after useradd (useradd already present) - docker-compose.yml: add user: "1000:1000" and group_add: ["999"] (GID 999 = docker group on VPS) so executor retains docker.sock access Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-03 18:19:58 +02:00
Oskar Kapala	e6a2443412	fix(dev): agent.sh worktree_count/paths grep exit-1 on empty set grep -cv (and grep -v) return exit code 1 when there are zero matches. With set -euo pipefail this silently aborted the script before count was returned, causing 'agent.sh new' to fail on a fresh repo with no existing worktrees. Fix: move the grep -v into worktree_paths with '|| true' so the function always exits 0, then derive worktree_count via wc -l.	2026-06-03 18:04:38 +02:00
Oskar Kapala	f9b145585f	fix(dev): agent.sh validate_name set -e safety + ERR trap Refactor [ test ] && prefail pattern to if/then/fi — set -euo pipefail was silently exiting after the loop because the failing-test compound propagated exit code 1 through the function return. Add ERR trap so future silent fails get diagnosed at the source.	2026-06-03 18:02:50 +02:00
Oskar Kapala	3b620ef7e3	docs(claude): multi-agent worktree mode section Main checkout = deploy-only. .agent-task marker triggers mandatory loading of worktree-aware skill. Only the human runs scripts/dev/agent.sh.	2026-06-03 17:41:35 +02:00
Oskar Kapala	745e52723c	feat(skills): worktree-aware skill for Claude Code Encodes branch hygiene for CC running in task worktrees: commit only to assigned branch, no push origin master, no touching main checkout, no git add -A, no worktree management, mandatory final report.	2026-06-03 17:41:35 +02:00
Oskar Kapala	1abe925f65	feat(dev): scripts/dev/agent.sh — multi-agent worktree dispatcher new/list/merge/clean. Decisions: branch task/<name>, sibling worktree ~/homelab-codex-ws-<name>, ff-only auto-merge, cap 4.	2026-06-03 17:41:35 +02:00
Oskar Kapala	1c69a5bc29	feat(skills): save-session skill for Claude Code Records session facts (git log, diff --stat, deploys from transcript) by appending to docs/sessions/YYYY-MM-DD.md with a mandatory narrative placeholder. Never touches backlog.md or CLAUDE.md without explicit instruction. Commits only the session file. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-03 16:06:46 +02:00
Oskar Kapala	02e7c28823	feat(skills): deploy skill for Claude Code Instructs CC to always route deploy/redeploy/ship/wdróż requests through scripts/deploy/deploy.sh, maps exit codes to required actions, and enforces no-bypass rules for gate and branch checks. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-03 16:06:40 +02:00
Oskar Kapala	db592fbc28	feat(deploy): Saturn-side dispatcher wrapper Replaces the per-node staged framework with a single entry point that runs from SATURN: preflight (branch/clean-tree/push/SSH), gate (pytest + docker build per service), execute (control-plane.sh --ssh or remote deploy-node.sh), verify (docker ps), and one-line report. Exit codes: 0=ok 1=preflight 2=gate 3=execute 4=verify 5=sudo-handoff. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-03 16:06:36 +02:00
Oskar Kapala	00fc36df3a	fix(deploy): skip sudo chown/chmod when /opt/homelab ownership is already correct deploy-local.sh previously ran `sudo chown -R 1000:1000` and `sudo chmod -R 775` unconditionally on every deploy, which blocked non-TTY execution (CC/CI) on VPS where /opt/homelab is already 1000:1000. Both steps are now conditional using `find ... -print -quit`: - chown: runs only if any file/dir is NOT uid/gid 1000 - chmod: runs only if any directory is missing -775 permission bits When everything is correct (steady state on VPS), both steps log "already correct, skipping" and never invoke sudo. If a new directory was created by root (e.g. a manual mkdir, volume mount, or restart artefact), the remediation path triggers automatically — the self-heal property is preserved. Smoke-tested in Docker (ubuntu:22.04): Case 1 (1000:1000 + 775): chown skipped, chmod skipped ✓ Case 2 (root-owned subdir): chown triggered ✓ Case 3 (700 dir perms): chmod triggered ✓ Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-03 15:44:44 +02:00
Oskar Kapala	f5dcefc752	fix(observer): robust incident lifecycle + orphan auto-resolve Two root causes for stale "active" incidents on the dashboard: 1. TypeError bug in _prune_stale_world: last_occurrence / resolved_at can be an ISO-8601 string (stability-agent via events.py) or a Unix int (node-agent). The previous session's auto-resolve did plain `time.time() - last_occ` which raises TypeError for strings, silently preventing _save_world() from being called and leaving incidents perpetually "active" on disk. Fix: add _parse_ts(ts) -> float that handles int, float, and ISO-8601 strings uniformly. All timestamp arithmetic now goes through it; returns 0.0 on None / garbage to keep comparisons safe. 2. Orphaned active incidents: _resolve_incident clears service["incident_id"] and marks the incident "resolved" in memory, but if incidents.json was truncated mid-write (pre-atomic-write era), the observer loaded it at next startup with status="active" and no service entry pointing to it. No code ever touched these orphans again. Fix: _prune_stale_world now runs two cleanup passes each cycle: - Case 1 (healthy-linked): service.status=="healthy" AND incident_id still set → resolve immediately (service cannot have active incident) - Case 2 (orphaned): active incident with no service link AND last_occurrence > 5 min ago → resolve (5-min guard for creation race) Both cases are wrapped in try/except so a bug here never crashes the observer loop or blocks _save_world. Also fixes the 7-day stale-incident prune to use _parse_ts so ISO-string resolved_at values are handled correctly. 3. Operator UI: current_incidents() now filters to status=="active" only. Resolved incidents were previously included in the /incidents endpoint, making the dashboard show a wall of historical records as if active. 
Nocturnal job investigation: _cleanup_control_plane_fs in node-agent runs every 60s on VPS (not midnight-specific); it reads observer_checkpoint.json (now written atomically) and deletes old event files. No non-atomic writes found. Midnight clustering was likely external (logrotate / OS flush); the supervisor's resilient loader already handles such transient issues. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-03 14:29:12 +02:00
Oskar Kapala	98437d46b2	test(control-plane): atomic write and resilient loader coverage 11 new test cases in test_state_reliability.py covering: - atomic_write_json: produces valid JSON, no .tmp left behind, overwrites, works with nested structures - _load_actual_state: returns False on empty / truncated file, returns True on valid files, preserves last-known-good state across a parse failure - reconcile: empty/truncated services.json or incidents.json generates zero actions (skip-cycle semantics proven end-to-end) - healthy service with valid world state generates no spurious action All 32 tests (11 new + 21 existing) pass. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-03 12:27:05 +02:00
Oskar Kapala	5e97b4e448	fix(supervisor): atomic writes + skip cycle on unreadable world state Two independent fixes for the false-alarm storm caused by race-condition reads of truncated world state files: 1. Atomic writes: _atomic_write_json (write→fsync→os.replace) replaces all bare open('w')+json.dump calls in supervisor and executor, so the action-file pipeline is never visible in a half-written state. 2. Resilient loader: _load_actual_state now returns False when any world state file fails to parse (empty or truncated mid-write). reconcile() skips the entire drift check on False instead of treating {} as "all services missing". actual_state retains its last-known-good values so a single bad cycle does not wipe accumulated context. Before: parse error → raw[key]={} → all desired services missing → wall of redeploy actions → drift_resolved_auto churn on next cycle. After: parse error → WARNING logged → cycle skipped → no actions. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-03 12:26:59 +02:00
Oskar Kapala	ffb0608b9a	fix(observer): atomic writes for world state files All JSON state writes (services.json, nodes.json, incidents.json, deployments.json, runtime-summary.json, observer_checkpoint.json) now use _atomic_write_json: write to a .tmp sibling, fsync, then os.replace. This eliminates the truncated-write window that caused supervisors reading mid-write files to see empty/partial JSON. Also adds auto-resolution of phantom active incidents: if a service reports status=healthy and its incident's last_occurrence is >30 min old, the incident is resolved in _prune_stale_world. This clears false active incidents accumulated from previous race-condition reads. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-03 12:26:49 +02:00
Oskar Kapala	f381023206	docs(claude): add Definition of Done for services (smoke test + pytest) Lesson from brain-watchdog: code that was never run had a packaging bug that caused a crash loop in production. New rule: docker build + short smoke-run + pytest before any commit or deploy. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-01 20:38:39 +02:00
Oskar Kapala	cb4ae756ab	test(brain-watchdog): add pytest suite covering import and check() logic 7 cases: package importable, fresh ok, stale, unreachable, HTTP error, missing last_update field, unparseable timestamp. pytest.ini sets pythonpath=src so tests run without PYTHONPATH set in the environment. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-01 20:38:24 +02:00
Oskar Kapala	cfe5e02372	fix(brain-watchdog): add PYTHONPATH=/app/src so brain_watchdog package is importable WORKDIR is /app but the package lives under src/; without PYTHONPATH set `python -m brain_watchdog.main` raised ModuleNotFoundError on startup. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-01 20:31:45 +02:00
Oskar Kapala	039f9f7247	feat(piha): brain-watchdog — external watchdog for control-plane Polls /summary on VPS over Tailscale every 60s; computes freshness locally from last_update epoch (never trusts self-reported status). Alerts via Telegram Bot API directly after 3 consecutive failures; sends recovery message on heal. State (fail_count, alerted) persisted to volume so debounce survives restarts. - services/brain-watchdog/: Python service, no external deps (stdlib only) - hosts/piha/runtime/brain-watchdog/: override with mem_limit 64m - hosts/piha/services.yaml + inventory/topology.yaml: manifest entries Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-01 17:54:36 +02:00
Oskar Kapala	495741e7ac	operator-ui: /events without loading the entire directory + daemon threads; epoch from regex (fix chelsty-infra) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-01 16:34:52 +02:00
Oskar Kapala	43c5d45353	deploy: chmod/chown on /opt/homelab resilient to disappearing event files	2026-06-01 14:35:19 +02:00
Oskar Kapala	f64cec645e	vps: mem_limit + oom_score_adj on in-repo services; deploy-local applies override (stop OOM)	2026-06-01 14:23:58 +02:00
Oskar Kapala	1db9db7d03	fix(dashboard): read last_update from JSON content, not file mtime operator_ui.py called .replace() on last_update without checking type — an integer value (written by the materializer) raised AttributeError and silently fell back to os.path.getmtime(), which was stuck at 5/29 after a deploy with preserved timestamps. web.py had the same class of bug but worse: it unconditionally replaced last_update with mtime, ignoring the JSON field entirely. Both now branch on isinstance(str) and cast numeric values directly to float, with mtime only as a last-resort fallback. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-31 22:10:50 +02:00
Oskar Kapala	52607a7cdd	feat(control-plane): shadow_mode for HA event auto-actions + deploy docs - HA_DIAG_SHADOW_MODE env flag in supervisor (default true) - shadow_mode downgrades container_restart actions to alert_only with [SHADOW MODE] note; same action_id and 30-min cooldown apply - alert_only events unaffected (always routed normally) - 3 new tests: shadow on/off for ha_websocket_dead, alert-only unaffected - DEPLOY.md with token gen, per-host config, verification, 48h observation, production-mode enablement, rollback - README.md updated with shadow mode flag summary and DEPLOY.md link Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-29 17:12:33 +02:00
Oskar Kapala	b9ed118b8c	fix(telegram-bot): correct risk_level field + show description in alerts - read risk_level with risk fallback (was: risk only → "unknown" for all actions written by supervisor which uses risk_level key) - include description field in alert format (was: alert_only payloads' substance was invisible — description carried the full message) - extract _format_pending_action() pure helper to enable unit testing without a live Telegram connection - 8 tests: risk_level present, risk fallback, both absent, description shown/absent, truncation, full HA alert_only shape, no-description no-crash - flagged during Phase 5 review of ha-diag-agent supervisor routing Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-29 16:26:49 +02:00
Oskar Kapala	bf1415e4c1	feat(control-plane): route ha-diag-agent events through supervisor - 8 HA event types mapped to existing action types - ha_websocket_dead → container_restart (homeassistant), 30-min cooldown - 6 events → alert_only (entity_unavailable, integration_failed, automation_failing, update_available, recorder_lag, system_health_degraded), 1-hour cooldown - ha_websocket_recovered → cancels matching pending container_restart - state-aware suppression: skip HA events when homeassistant has an active containers_not_running incident < 5 min ago (avoids alert storms during HA restarts/updates) - location_tag preserved through action pipeline for per-house telegram alerts - executor: alert_only acknowledged as no-op success - 18 tests covering all 8 event types, suppression, cooldown, dedup, location_tag, recovery cancellation - CLAUDE.md: supervisor event routing table added Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-29 15:59:23 +02:00
Oskar Kapala	31b48d162a	feat(ha-diag-agent): WebSocketMonitor for real-time HA liveness - persistent WS connection to HA with auth + state_changed subscription - watchdog detects silence > 5min → emits ha_websocket_dead - immediate ha_websocket_dead on disconnect, exponential reconnect with jitter - cooldown prevents alert spam (10min repeat window while HA stays down) - ha_websocket_recovered emitted on reconnect after a dead alert (allows supervisor to clear active incidents in Phase 5) - new monitors/ subpackage for long-running tasks (vs interval checks/) - /health endpoint now includes ws_connected field - 26 unit tests, 3 integration tests (real HA + container stop/restart) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-29 15:00:18 +02:00
Oskar Kapala	3499b2f280	feat(ha-diag-agent): three REST diagnostic checks + Phase 3 flag fixes New checks: - SystemHealthCheck (15min interval): detects newly-failing HA integrations via /api/system_health snapshot diff; transition-based dedup (ok→error fires, sustained error silent, error→ok clears alert) - UpdatesAvailableCheck (daily cron 09:00): per-update ha_update_available events with 7-day dedup; release notes truncated at 2000 chars - UpdatesDigestCheck (Sunday cron 09:00): single digest event with all pending updates; weekly ISO-week dedup, independent of daily dedup key - AutomationFailuresCheck (30min interval): detects automations with N consecutive failures (default 3) via /api/trace/automation/<id>; 6h cooldown per automation Phase 3 flag fixes: - Flag #1 (since field): UnavailableEntitiesCheck now uses min(state.last_changed, baseline.first_seen) as effective "since", giving accurate duration when agent was offline at entity's first fail - Flag #3 (registry cache): HAClient.get_entity_registry() caches response in-process with configurable TTL (default 300s); avoids repeated API calls across concurrent check cycles; invalidate_registry_cache() for manual invalidation Storage: system_health_snapshot table (component, last_status, last_seen_at, payload) created automatically on next Storage.open() call Config additions (all with defaults): entity_registry_cache_ttl=300, system_health_check_interval=900, automation_check_interval=1800, automation_failure_threshold=3, updates_check_hour=9, updates_check_minute=0, updates_cooldown_days=7 Tests: 95 unit tests pass (49 new), 13 integration tests pass (9 new); 3 skipped (live-HA token not set in CI) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-29 14:43:10 +02:00
Oskar Kapala	f41ec5d0c5	docs: compress CLAUDE.md + fix zigbee2mqtt coordinator docs - CLAUDE.md: collapsed 5-section deployment block to single annotated block, removed inline emit_event signatures (kept path + type list), flattened runtime path tree to bullets, condensed node table note to reference capabilities.yaml, added CHELSTY docker-compose v1 constraint; 156 → 113 lines (~750 → ~480 tokens) - fix: zigbee2mqtt/README.md updated to TCP coordinator (SLZB-06U at 192.168.1.105:6638, ezsp); removed stale /dev/ttyACM0 USB reference and corrected owner node from piha to chelsty-infra Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-29 14:17:23 +02:00
Oskar Kapala	20f6761a67	feat(ha-diag-agent): UnavailableEntitiesCheck with root cause dedup - shared aiohttp ClientSession in HAClient (Phase 1 Flag #2 fixed): make_session() factory, session injected at startup, closed on shutdown - Check.run() → list[CheckResult]: clean multi-event interface - first real diagnostic check: entity unavailable > 24h (INSERT OR IGNORE baseline preserves first-seen timestamp) - root cause grouping: emit ha_integration_failed instead of N entity events when ≥50% of integration's entities are unavailable (≥3 min) - alert deduplication via SQLite cooldown window (default 6h) - recovery clears baseline + dedup for immediate re-alert - configurable thresholds: duration, integration %, cooldown - 38 unit tests + 7 integration tests (42 pass, 3 skip w/o live HA) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-29 13:41:55 +02:00
Oskar Kapala	07bd498fd6	feat(ha-diag-agent): test environment with dual HA Docker instances - dockerized ken + chelsty HA test instances with template fixtures - snapshot/reset/wait scripts for fixture management - integration test infrastructure with separate marker - location_tag promoted from metadata to event payload (Phase 1 flag #3) - chelsty-infra target_url points to chelsty-ha via tailnet (Phase 1 flag #1) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-29 12:56:13 +02:00
Oskar Kapala	90c8e77bf7	chore: gitignore *.egg-info, remove committed egg-info Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-29 12:26:57 +02:00
Oskar Kapala	ab8895d28b	feat(ha-diag-agent): scaffold service with HA REST client and event emitter - new per-host service, follows node-agent pattern - 7 new HA event types defined (routing in supervisor — Phase 5) - HeartbeatCheck as pipeline validator (pings /api/, emits ha_websocket_dead) - service.yaml + host configs for piha (ken) and chelsty-infra (chelsty) - test scaffolding with aiohttp/aiosqlite mocks (15/15 passing) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-29 12:26:34 +02:00
Oskar Kapala	bd7f955e4e	fix+debug(planner-agent): use base_url (not api_base) for litellm.acompletion, add print [TEMP] litellm.acompletion() has base_url as a named param; api_base only works via **kwargs fallback path. Switching to base_url ensures the value lands correctly in completion_kwargs and reaches the ollama provider. Print() added (not logger) so base_url is always visible in docker logs regardless of log level. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-28 13:07:58 +02:00
Oskar Kapala	99200e6690	debug(planner-agent): log api_base before each litellm call [TEMP] Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-28 12:52:11 +02:00
Oskar Kapala	dcacac6965	fix(planner-agent): rename OLLAMA_HOST → OLLAMA_API_BASE (litellm convention) LiteLLM reads OLLAMA_API_BASE, not OLLAMA_HOST. - llm_router.py: DEFAULT_OLLAMA_HOST → DEFAULT_OLLAMA_API_BASE, param ollama_host → ollama_api_base - planner.py: env var os.getenv("OLLAMA_HOST") → os.getenv("OLLAMA_API_BASE"), param renamed accordingly - /opt/homelab/config/planner-agent/.env on SOLARIA updated in-place (not in git) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-28 11:34:08 +02:00
Oskar Kapala	e52b2e2259	fix(planner-agent): remove duplicate ANTHROPIC_API_KEY from environment Key is already provided via env_file: /opt/homelab/config/planner-agent/.env Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-28 10:57:08 +02:00
Oskar Kapala	5ccdfa0ca6	docs: add planner-agent docs and session summary 2026-05-27 - services/planner-agent/README.md: full service doc (what it does, LLM fallback chain, env vars, deploy steps, local run, redis-cli end-to-end test, healthcheck) - README.md: add Agent System section with all agents and their roles - docs/sessions/2026-05-27-planner-agent.md: session summary (built files, architectural decisions, problems + solutions, deployment status, pending work) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-27 22:35:59 +02:00
Oskar Kapala	ff6fda1f04	planner-agent: use env_file, keep only ANTHROPIC_API_KEY in environment All runtime vars (REDIS_URL, OLLAMA_HOST, OLLAMA_MODEL, NODE_NAME, COOLDOWN_SECONDS, RUNTIME_PATH) are sourced from the host-local /opt/homelab/config/planner-agent/.env via env_file. Only ANTHROPIC_API_KEY stays in environment (not in env_file — secret injected at runtime by the operator when needed). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-27 22:27:44 +02:00
Oskar Kapala	ca37fca5ce	feat(planner-agent): main loop with LLM routing and HITL action proposals services/planner-agent/src/planner.py: - PlannerAgent: async Redis pub/sub on health_events + world_updates - Pipeline: receive event → cooldown gate → LLMRouter → write pending action → emit remediation_started filesystem event - CooldownTracker: 5-min suppression per svc_key (configurable via env) - parse_event(): accepts node-agent shape A and world_updates shape B - PROPOSAL_SCHEMA: jsonschema enforced by LLMRouter before accepting response - SYSTEM_PROMPT: homelab topology + action rules (chelsty always requires_human, disk_pressure always notify, confidence<0.7 → requires_human) - write_pending_action(): atomic tmp→rename write, executor-compatible format - emit_event(): async wrapper around filesystem event write (no control-plane import) - _emit_event_sync() reads NODE_NAME at call time (not import) for testability - Benign events (service_healthy, node_online, ...) silently skipped - LLM chain failure: no cooldown recorded so next event can retry services/planner-agent/tests/test_planner.py (49 tests, 0 network): - TestCooldownTracker: 7 tests (ready/not-ready/elapsed/reset/independence) - TestHealthEvent, TestActionProposal, TestMapActionToExecutorType - TestParseEvent: both event shapes, missing fields, timestamp formats - TestBuildMessages: system prompt rules, payload inclusion - TestPlannerHandleEvent: benign skip, cooldown block, ignore/restart/redeploy/ notify proposals, remediation event emission, LLM failure isolation, requires_human propagation, cooldown recording, model name in proposal - TestPlannerDispatch: valid JSON, invalid JSON, non-string data, missing node - TestWritePendingAction, TestEmitEvent: filesystem integration with tmp_path services/planner-agent/service.yaml: owner_node: solaria, dependencies: [redis, ollama] services/planner-agent/docker-compose.yml: env + healthcheck services/planner-agent/Dockerfile: python:3.11-slim 
services/planner-agent/healthcheck.sh: heartbeat file age check (300s) services/planner-agent/requirements.txt: litellm, redis, jsonschema, structlog Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-27 19:11:39 +02:00
Oskar Kapala	1bbc511bb7	feat(planner-agent): add llm_router.py with local-first fallback chain services/planner-agent/src/llm_router.py: - LLMRouter: async routing via litellm; chain = Qwen/Ollama → haiku → sonnet - Timeouts: 8s local, 30s cloud; asyncio.wait_for belt-and-suspenders - Rejection triggers: timeout, API error, refusal patterns, JSON schema fail - JSON fence extraction: recovers valid JSON from blocks - ModelMetrics: per-model success/fallback/error counters + success_rate() - Redis publish to 'llm_router_metrics' after every call (failure-safe) - redis_url=None disables Redis (useful in tests / edge nodes) - context= param adds caller label to all log lines for tracing services/planner-agent/tests/test_llm_router.py: - 34 tests, 0 network calls (litellm + Redis fully mocked) - Covers: primary success, JSON error fallback, refusal fallback, timeout fallback, API exception fallback, all-fail RuntimeError, schema validation, fence extraction, metrics recording, Redis publish, Redis failure isolation services/planner-agent/requirements.txt: - litellm>=1.40.0, redis>=5.0.0, jsonschema>=4.21.0 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-27 18:38:06 +02:00
Oskar Kapala	603e10a364	docs: session summary 2026-05-27 + update observer/control-plane/chelsty docs docs/sessions/2026-05-27.md (new): - Full session record: problems found, all commits shipped, end state - Written in Polish per operator preference for session notes - Known limitations: SLZB-06U offline, ezsp→ember migration pending docs/observer-runtime.md: - Document per-node checkpoint format (replaces old global checkpoint) - Add service_healthy / service_recovered resolution behavior - Document ghost key pruning (_prune_stale_world patterns) - Add event type reference table (negative vs positive) docs/vps-control-plane.md: - Add container names and network_mode: host detail - Document monitor:false, NODE_ALIAS_MAP, auto-cancel behavior - Add piha agent-system materializer integration note - Rewrite recovery section with actionable bootstrap-flood diagnosis - Add action state machine (pending→approved→running→completed/cancelled) docs/chelsty-runtime.md: - Add chelsty-infra/chelsty-ha node table - Document docker-compose v1 constraint (always use docker-compose, not docker compose) - Add mosquitto network_mode:host + z2m extra_hosts:host-gateway explanation - Add z2m config writable requirement (EROFS failure mode documented) - Add chelsty-ha monitor:false rationale - Add minimal configuration.yaml template for z2m Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-27 16:18:31 +02:00
Oskar Kapala	7277bdc27f	Fix Copy for AI: materializer fetches from control-plane API instead of Redis services/agent-system/runtime-materializer/materializer.py: - Add materialize_from_api() that fetches all world-state endpoints from the control-plane HTTP API (CONTROL_PLANE_URL env var) - When CONTROL_PLANE_URL is set, use API as source of truth instead of Redis - Redis path preserved as fallback for backward compat hosts/piha/runtime/agent-system/docker-compose.override.yml (new): - Inject CONTROL_PLANE_URL=http://100.95.58.48:18180 for runtime-materializer - piha webui /snapshot now mirrors VPS observer output (clean, ghost-free) Root cause: materializer read from Redis which held 80 stale service entries with hash-prefixed ghost keys (e.g. 0ccb8a88e079_control-plane-supervisor). Redis is never updated by the current observer pipeline; the control-plane API is the single authoritative world-state source. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-27 16:07:51 +02:00
Oskar Kapala	b40b832159	Fix ghost service keys from hash-prefixed Docker container names node-agent: use com.docker.compose.service label as canonical name - Add _canonical_container_name() method: prefers compose label, falls back to hash-prefix-stripped c.name - Replace bare c.name usage in check_containers() - Skip 'created'-state containers (Docker stale-state artifacts) observer: prune hash-prefixed ghost keys in _prune_stale_world() - Each reconcile cycle removes service keys matching <node>/<12hex>_<name> - Acts as safety net for entries already in services.json + future slippage control-plane/docker-compose.yml already has explicit container_name on all four services — no change needed there. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-27 15:41:13 +02:00
Oskar Kapala	28e9534765	observer: service_healthy resolves active incidents service_healthy is a positive health confirmation — if the service had an active incident (e.g. from earlier service_unhealthy events), that incident should be resolved when the service is confirmed healthy. Previously only service_recovered resolved incidents; service_healthy set status=healthy but left incidents open, keeping status='degraded'. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-27 15:20:19 +02:00
Oskar Kapala	46ae92b5c1	supervisor: also cancel pending actions for services removed from desired state Previously _cancel_resolved_pending_actions() only cancelled actions where the service became healthy. This left orphaned actions when a service was removed from services.yaml or marked monitor:false. Add Case 1: if the action's svc_key is no longer in desired_state (either removed entirely or skipped due to monitor:false), cancel with reason service_removed_from_desired_state. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-27 15:19:13 +02:00
Oskar Kapala	410bfe7065	zigbee2mqtt: config goes in data dir (writable), not separate ro mount z2m migrates configuration.yaml on startup and needs write access. Remove the separate :ro config mount; rely on the base compose's /opt/homelab/data/zigbee2mqtt/data:/app/data read-write mount instead. configuration.yaml must exist at that path on the node before first run. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-27 15:13:33 +02:00
Oskar Kapala	b3912fe0ce	zigbee2mqtt: use extra_hosts host-gateway instead of network_mode: host docker-compose v1 cannot clear the ports list from the base compose with ports: [] in an override, so network_mode: host caused InvalidArgument. Use extra_hosts with host-gateway instead: maps 'mosquitto' hostname to the Docker bridge gateway IP so mqtt://mosquitto:1883 reaches the host-networked mosquitto process from within the bridge-networked z2m container. Requires Docker 20.10+ (present on chelsty-infra). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-27 15:12:33 +02:00
Oskar Kapala	61e07f4318	zigbee2mqtt override: clear ports list for docker-compose v1 host network compat docker-compose v1 (1.29.2 on chelsty-infra) raises InvalidArgument when network_mode: host is combined with port_bindings from the base compose file. Add ports: [] in the override to clear the base ports list. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-27 15:11:42 +02:00
Oskar Kapala	51002d4502	Fix pending actions: node_exporter, zigbee2mqtt, chelsty-ha monitoring node_exporter (new service): - Add services/node_exporter/docker-compose.yml matching solaria deployment (network_mode: host, pid: host, /:/host:ro,rslave mount) - Add services/node_exporter/service.yaml zigbee2mqtt chelsty-infra override: - Fix network_mode: host (mosquitto runs on host network, port 1883 on localhost) - Fix volume mount: ./configuration.yaml → absolute /opt/homelab/config/zigbee2mqtt/ (secrets stay in runtime config dir, never in Git) - Remove MQTT_USER/MQTT_PASSWORD (mosquitto uses allow_anonymous true) - Extend healthcheck start_period to 60s (z2m takes time on first start) chelsty-ha/services.yaml: - Remove node-agent entry entirely (never deployed, no plans to bootstrap now) - Keep homeassistant with monitor: false (no node-agent = no health events) supervisor: respect monitor: false in services.yaml - Skip action generation for services where monitor=false - Cleans up chelsty-ha entries from action queue without removing desired-state docs Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-27 15:10:48 +02:00
Oskar Kapala	fb7828b52b	supervisor: auto-cancel pending actions when drift is resolved When a service becomes healthy (node-agent emits service_healthy → observer updates services.json), any previously queued redeploy/container_restart action is stale. Without cleanup, the queue accumulates old actions that require manual rejection. _cancel_resolved_pending_actions() runs after each reconcile cycle: - Reads all pending/*.json with type=redeploy or container_restart - If the service is now healthy in actual_state, moves action to cancelled/ with reason=drift_resolved_auto - Only pending actions are touched; approved/running are left to the operator Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-27 14:58:55 +02:00
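
The atomic-write and resilient-loader commits above (ffb0608b9a, 5e97b4e448) describe a write→fsync→os.replace sequence and a loader that reports failure instead of returning {}. A minimal sketch of that pattern follows, assuming nothing about the repository's actual code: the names atomic_write_json and load_state and the (ok, state) return shape are illustrative, not the project's real signatures.

```python
import json
import os
import tempfile


def atomic_write_json(path, data):
    """Write JSON so that readers never observe a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live in the same directory so os.replace stays
    # on one filesystem and the rename is atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(data, handle)
            handle.flush()
            os.fsync(handle.fileno())  # force bytes to disk before the rename
        os.replace(tmp_path, path)  # atomically swap the new file into place
    except BaseException:
        # Leave no .tmp sibling behind if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def load_state(path, last_known_good):
    """Return (ok, state); on a read or parse failure keep the previous state."""
    try:
        with open(path, encoding="utf-8") as handle:
            return True, json.load(handle)
    except (OSError, ValueError):  # json.JSONDecodeError subclasses ValueError
        return False, last_known_good
```

A caller in the spirit of the "skip cycle" fix would check the first element of the tuple and skip the drift check when it is False, rather than treating an empty dict as "all services missing".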


134 commits
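
Commit f5dcefc752 above attributes stale "active" incidents to timestamp arithmetic that mixed ISO-8601 strings with Unix integers, and describes a _parse_ts(ts) -> float helper that returns 0.0 on None or garbage. A hedged sketch of such a normaliser is shown here; the function name parse_ts, the handling of a trailing "Z", and the choice to treat naive timestamps as UTC are assumptions made for illustration and may differ from the repository's implementation.

```python
from datetime import datetime, timezone


def parse_ts(ts):
    """Normalise an int, float, or ISO-8601 string to epoch seconds.

    Returns 0.0 for None or unparseable input so comparisons stay safe.
    """
    if isinstance(ts, bool):
        # bool is a subclass of int; treat it as garbage (assumption).
        return 0.0
    if isinstance(ts, (int, float)):
        return float(ts)
    if isinstance(ts, str):
        text = ts.strip()
        if text.endswith("Z"):
            # Older datetime.fromisoformat versions reject a trailing "Z".
            text = text[:-1] + "+00:00"
        try:
            parsed = datetime.fromisoformat(text)
        except ValueError:
            return 0.0
        if parsed.tzinfo is None:
            # Assumption: naive timestamps are interpreted as UTC.
            parsed = parsed.replace(tzinfo=timezone.utc)
        return parsed.timestamp()
    return 0.0
```

With a helper like this, an age check such as time.time() - parse_ts(last_occurrence) no longer raises TypeError when last_occurrence is a string. The 0.0 fallback makes unparseable timestamps look very old, so whether that is the right default depends on what the caller does with stale entries.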