homelab-codex-ws

Author	SHA1	Message	Date
Oskar Kapala	c255a021d1	fix(observer): quarantine malformed event files to prevent processing wedge Was: malformed event (bad JSON / truncated / corrupted bytes) wedged the node's checkpoint forever — every cycle re-tried, logged, never advanced past the bad file; all subsequent good events for that node lost. Now: first parse failure -> atomic os.replace to STATE_DIR/observer_failed_events/<node>/ with collision handling. Checkpoint advances, downstream events flow. Move failures themselves are logged but don't crash the loop. Complementary to yesterday's atomic_write_json fix (state files); this addresses the same race-pattern on event files instead. Regression test asserts: bad event quarantined to failed_events dir, removed from hot path, subsequent good event processed (node online), checkpoint moves to good event.	2026-06-12 11:22:56 +02:00
Oskar Kapala	31b5981174	docs: session 2026-06-11 20:35	2026-06-11 20:35:23 +02:00
Oskar Kapala	c1acee7acf	docs: session 2026-06-11 20:19	2026-06-11 20:19:26 +02:00
Oskar Kapala	fa59625aa6	docs(ha-diag-agent): replace curl verify commands with docker exec Port 8087 is no longer mapped to the host. Operator verify commands that used curl http://localhost:8087/health now use docker exec with Python's urllib (the image is python:3.11-slim, no curl binary). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-11 19:46:33 +02:00
Oskar Kapala	d7e0d3162f	fix(ha-diag-agent): remove host port mapping for 8087 Port 8087 conflicted with zigbee2mqtt on piha (8087:8080 mapping active for 7+ days), preventing ha-diag-agent from starting. Grep across the full repo confirms no external consumer (no nginx/npm proxy, no Prometheus scrape, no control-plane reference) uses this port. The Docker healthcheck runs inside the container network namespace and does not require a host-side mapping. Internal FastAPI binding on 8087 is unchanged. Removed: ports section from docker-compose.yml and service.yaml. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-11 19:46:28 +02:00
Oskar Kapala	a0bfd96870	docs: session 2026-06-11 — lustro ssh shipping fix + ha-diag-agent piha + backlog/flota-bomba Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 14:18:00 +02:00
Oskar Kapala	5e9db5c106	fix(ha-diag-agent): structlog event kwarg collision + replace aioresponses - main.py: rename event= to ha_event= in _log.warning() — structlog treats 'event' as a reserved positional arg; the old name caused TypeError when any check returned unhealthy results (events were still emitted, but the check was logged as check_error instead of check_unhealthy) - tests/test_ha_client.py: replace aioresponses with unittest.mock — aioresponses 0.7.8 is incompatible with aiohttp >=3.12 (missing stream_writer kwarg) - pyproject.toml: remove aioresponses from dev dependencies Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-11 14:10:06 +02:00
Oskar Kapala	d60b28a949	feat(ha-diag-agent): add piha deploy config - hosts/piha/runtime/ha-diag-agent/docker-compose.override.yml: mem_limit 128m, hardcoded events volume (/opt/homelab/events/piha:/events) to avoid ${NODE_NAME} shell-expansion issue in deploy-node.sh - services/ha-diag-agent/env.example: per-host HA_URL comments (piha vs chelsty-infra tailscale), HA_TOKEN source note Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-11 14:10:06 +02:00
Oskar Kapala	a5a1352e01	fix(lustro): mount SSH key at /home/homelab/.ssh for node-agent event shipping node-agent runs as uid 1000 (homelab) since the base compose sets user "1000:1000"; ssh in _ship_events_to_vps() has no -i flag and looks for keys in $HOME/.ssh = /home/homelab/.ssh. The old mount target /root/.ssh was never consulted, so rsync to VPS failed with 'Permission denied'. uid match (pi=1000 on RPi OS) keeps OpenSSH strict ownership checks happy. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 12:45:55 +02:00
Oskar Kapala	2ade5be4b4	feat(onboard): register lustro in topology + services.yaml	2026-06-10 13:02:04 +02:00
Oskar Kapala	5c2516d097	docs: session 2026-06-09 + skill/backlog update - docs/sessions/2026-06-09-flota-recovery-lustro-register.md: fleet recovery (root cause aerbot group, 3 masking layers), lustro register state + plan, fix-event-bloat and OOM pending, worktree gotcha - docs/backlog.md: new file, tech-debt tracker; entries: --omit-dir-times, oskar∈aerbot declaratively, worktree per task, observer staleness - .claude/skills/node-onboarding/SKILL.md: step table update (PROVEN: 20-base, 30-node-agent; WRITTEN: 40-register, 50-verify), 3 new gotchas (rsync perm, observer restart, worktree branch) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-09 20:38:35 +02:00
Oskar Kapala	1304c8449f	feat(onboard): implement 40-register + 50-verify, remove dead scaffold - 40-register.sh: idempotent; appends lustro to topology.yaml + creates hosts/<node>/services.yaml, commits on the current branch (no push) - 50-verify.sh: 4 checks: node-agent running, events, observer restart + heartbeat poll, world/nodes.json; pass/fail table; exit 1 on failure - 40-deploy-node-agent.sh: removed (dead scaffold; deploy lives in 30-node-agent.sh) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-09 20:36:00 +02:00
Oskar Kapala	a99bf9dadc	fix(onboard): 30-node-agent — mkdir -p deploy dir before rsync rsync fails with "No such file or directory" when intermediate dirs don't exist. /opt/homelab/deploy/ is not created by 20-base.sh. Add rrun mkdir -p before rsync_dir; pi owns /opt/homelab so no sudo. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-09 14:46:01 +02:00
Oskar Kapala	f6342749e6	feat(onboard): add 30-node-agent.sh + lustro node-agent override Push-based deploy step for LUSTRO (git_control=false): rsync services/node-agent/ and the host override to /opt/homelab/deploy/node-agent/ on the remote, then docker compose up --build via SSH. Guard by effect: skip push+build+up if node-agent container already running (docker ps filter, not command -v). Verify: container running + events appear in /opt/homelab/events/lustro/ within 90 s (confirms agent write path). Override (hosts/lustro/runtime/node-agent/docker-compose.override.yml): - group_add: ["991"] (docker GID on LUSTRO; 999 from base concatenated — harmless) - mem_limit: 256m (MagicMirror ~1.9 GiB; agent must be bounded) - /home/pi/.ssh:/root/.ssh:ro (not /home/oskar/.ssh — pi user) - /opt/homelab/deploy/node-agent:/repo:ro (no repo checkout on push-based node) - NODE_NAME=lustro, NODE_TYPE=sd_card, VPS_EVENTS_HOST=100.95.58.48 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-09 14:24:39 +02:00
Oskar Kapala	415479454a	fix(onboard): 20-base.sh: fix swap→zram idempotency guard The old guard compared a config literal (SIZE=) instead of checking the effect. A manually set-up zram was overlooked (dpkg -l vs command -v) and the config was overwritten unnecessarily. - Guard by effect: sudo swapon --show | grep /dev/zram + dphys inactive → the whole section is skipped without entering substages - Package detection via dpkg -l zram-tools (not command -v zramswap, because of PATH) - Config: PERCENT=50 (scales with RAM) instead of SIZE=; printf '%s\n' | sudo tee - All zram verification goes through sudo swapon --show (not zramctl) - Remove parsing of hardware.swap.mb (unused after the switch to PERCENT) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-09 13:30:12 +02:00
Oskar Kapala	d81ac27ebb	feat(onboard): implement 20-base.sh for LUSTRO — swap→zram, /opt/homelab, event dir Three idempotent stages with guards (probe-before-mutate), rrun() for all remote mutations, rprobe() for unconditional state queries. Reads hardware.swap.mb from node.yaml (default 2048 MB). Adds swap.mb: 2048 to hosts/lustro/node.yaml so the value is declarative. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-09 12:21:53 +02:00
Oskar Kapala	9b2a1b4e9a	docs(backlog): observer staleness — dead node shows NOMINAL (heartbeat TTL)	2026-06-09 12:16:59 +02:00
Oskar Kapala	85e056046c	docs(session): worktree hygiene update + marker gap note	2026-06-09 11:38:41 +02:00
Oskar Kapala	c466ed28d1	docs(skills): add node-onboarding skill (living doc) ECC-format skill for the node onboarding workflow. Covers full step sequence, operational rules, node.yaml key fields, gotchas from LUSTRO session, and Definition of Done. Marked as living doc — SCAFFOLD sections to be promoted to PROVEN as steps land on real nodes. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-09 10:14:42 +02:00
Oskar Kapala	d2fb2b3d41	docs: onboard README + CLAUDE.md worktree discipline reminder scripts/onboard/README.md (new): - Tool purpose and --node/--step/--from/--dry-run usage - Full node.yaml field schema with annotations (ssh_user uid-1000 gotcha, first_contact IP vs .local, deploy_autonomy/git_control gates) - Step status table (00-access DONE, 00-preflight SCAFFOLD, 10-50 TODO) - lib/ architecture: run() dry-run convention, yaml_get fallback caveats - Gotchas/Learnings table from session CLAUDE.md: - Node Onboarding section: onboard.sh commands, pointer to README - Multi-agent worktree mode: add explicit DISCIPLINE RULE — feature work must happen in agent.sh worktrees, not the main checkout; references the 2026-06-08 session that violated this Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-08 22:31:12 +02:00
Oskar Kapala	e59eb12da3	docs: session log 2026-06-08 — LUSTRO onboarding Records the onboarding session for LUSTRO (RPi4, KEN site): node facts from preflight, key decisions (user pi/uid-1000, IP over mDNS, zram target), 00-access status, tool bugs fixed (dry-run propagation, yaml_get greedy-colon + inline comment, ssh known-hosts in verify), open items for next session (worktree hygiene first, bootstrap-runtime, node-agent, register, verify, mm-watch). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-08 22:31:03 +02:00
Oskar Kapala	471ba09c4a	fix(onboard/00-access): suppress known-hosts warning in Tailscale verify On first SSH to a new mesh hostname, OpenSSH emits "Warning: Permanently added 'lustro' to the list of known hosts" on stderr. The previous code used 2>&1, merging it into the captured arch variable, which caused the arch assertion to fail with arch="Warning:Permanentlyadded...". Fix: - Add dedicated _TS_SSH opts array with -o LogLevel=ERROR, which suppresses INFO-level messages (known-hosts, banner) at source - Remove 2>&1 — stderr is no longer merged into the captured value - Run only `uname -m` instead of `echo ok && uname -m`; take the last non-empty stdout line to be robust against any remaining preamble - Change arch mismatch from warn to die in live mode (warn in dry-run) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-08 15:28:21 +02:00
Oskar Kapala	1bed8559fa	fix(onboard): lustro first_contact via LAN IP (mDNS unreliable)	2026-06-08 15:23:44 +02:00
Oskar Kapala	eed0ad0635	fix(onboard): fix yaml_get fallback: strip inline comments and fix greedy colon match Two bugs in the grep+sed fallback (triggered when yq is unavailable): 1. Greedy colon match: `s/.*: //` consumed everything up to and including the last `: ` in the line, so values containing a colon (e.g. `systemd:magicmirror.service`) were silently truncated to the portion after the last colon. Fix: `s/^[[:space:]]*[^:]*:[[:space:]]*//`, anchored at line start; key chars are `[^:]` (no colons), so only the first `: ` separator is removed. 2. Inline YAML comment not stripped: `first_contact: pi@pimirror2.local # ...` returned the full tail including `#`, breaking callers like ssh-copy-id. Fix: add `s/[[:space:]]\+#.*$//`, which requires at least one space before `#` to preserve bare `#` characters inside a value. Also add leading/trailing whitespace trim as a separate pass. Both bugs affect any node.yaml field that has an inline comment or a colon in its value; all ten fields in hosts/lustro/node.yaml now parse correctly. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-08 15:16:06 +02:00
Oskar Kapala	931fd46e62	fix(onboard): propagate dry-run into steps via run() helper DRY_RUN now uses 1/0 instead of "true"/"false" across all onboard scripts. common.sh: add run() — wraps mutations; prints "[dry-run] would: ..." when DRY_RUN=1. Exported via `export -f run` so child bash processes inherit it. onboard.sh: remove the `--dry-run → dryrun "Would execute" → continue` bypass. Steps now always execute; DRY_RUN=1 is exported so each step's own run() calls handle simulation. The orchestrator no longer needs to know step internals. remote.sh: update DRY_RUN checks to [ "${DRY_RUN:-0}" = 1 ] for consistency. 00-access.sh: remove all if/else DRY_RUN blocks; replace with: - Mutations (ssh-copy-id, curl install, tailscale up) wrapped in run() - Probes (SSH BatchMode test, command -v, _ts_state) run unconditionally so dry-run reports real current state ("key present → skip" vs "would: ...") - Stage 3 verify runs always; SSH failure is die in live mode, warn in dry-run (Tailscale not yet joined is expected on a fresh node) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-08 15:01:09 +02:00
Oskar Kapala	9012a36827	feat(onboard): add 00-access step + update lustro node.yaml 00-access.sh implements a 3-stage idempotent access bootstrap: 1. ensure_ssh_key — ssh-copy-id to first_contact (pi@pimirror2.local), skips if BatchMode key-auth already passes 2. ensure_tailscale — install via install.sh if missing, then tailscale up --hostname=lustro; prints interactive auth URL to operator, blocks until authenticated; skips if BackendState already Running 3. verify — SSH over Tailscale to pi@lustro, asserts 'ok' + arch=aarch64 Reads first_contact and tailscale.hostname from node.yaml. Respects --dry-run. No NOPASSWD or /opt/homelab mutations. hosts/lustro/node.yaml: fill known hardware facts (arm64, 4096 MB RAM, zram swap, docker_present, mm_runtime=systemd:magicmirror.service), add ssh_user=pi, first_contact=pi@pimirror2.local, services.node-agent.runtime engine=docker mem_limit=256m. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-08 14:43:16 +02:00
Oskar Kapala	adb84079ab	feat(onboard): add node onboarding scaffold (bash, idempotent) - scripts/onboard/onboard.sh: orchestrator with --node/--step/--from/--dry-run flags, deploy_autonomy + git_control gates, lexicographic step ordering - scripts/onboard/lib/common.sh: log/warn/die/step helpers, yaml_get (yq+grep/sed fallback), ensure_line, git() wrapper enforcing --no-pager - scripts/onboard/lib/remote.sh: rrun/rcopy/rsync_dir/rcheck SSH wrappers, dry-run aware - scripts/onboard/steps/00-preflight.sh: read-only fact collection (arch, RAM, disk, docker, tailscale, MagicMirror runtime, swap), human report + machine YAML snippet - scripts/onboard/steps/10-50: stub files with TODO headers, no mutations - hosts/lustro/node.yaml: LUSTRO edge node draft (KEN, role=edge, deploy_autonomy=true, git_control=false); hardware fields marked TODO for preflight population Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-08 14:23:21 +02:00
Oskar Kapala	58ac6edd7d	fix(stability-agent): run as uid 1000 with docker group access stability-agent had no USER instruction and no user: in compose, running as root and writing root-owned files to /opt/homelab bind-mount. - Dockerfile: add useradd -m -u 1000 homelab + USER homelab - docker-compose.yml: add user: "1000:1000" and group_add: ["999"] (GID 999 = docker group on VPS) to retain docker.sock:ro access Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-03 18:20:54 +02:00
Oskar Kapala	19fd8799d9	fix(node-agent): run as uid 1000 with docker group access node-agent had no USER instruction and no user: in compose, running as root and writing root-owned files to /opt/homelab bind-mount. - Dockerfile: add useradd -m -u 1000 homelab + USER homelab - docker-compose.yml: add user: "1000:1000" and group_add: ["999"] (GID 999 = docker group on VPS) to retain docker.sock access Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-03 18:20:31 +02:00
Oskar Kapala	7f17b65278	fix(control-plane): run executor as uid 1000 with docker group access Executor was the only control-plane container running as root (uid=0), writing root-owned files to /opt/homelab via bind-mount and triggering false sudo on every deploy. - Dockerfile: add USER homelab after useradd (useradd already present) - docker-compose.yml: add user: "1000:1000" and group_add: ["999"] (GID 999 = docker group on VPS) so executor retains docker.sock access Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-03 18:19:58 +02:00
Oskar Kapala	e6a2443412	fix(dev): agent.sh worktree_count/paths grep exit-1 on empty set grep -cv (and grep -v) return exit code 1 when there are zero matches. With set -euo pipefail this silently aborted the script before count was returned, causing 'agent.sh new' to fail on a fresh repo with no existing worktrees. Fix: move the grep -v into worktree_paths with '|| true' so the function always exits 0, then derive worktree_count via wc -l.	2026-06-03 18:04:38 +02:00
Oskar Kapala	f9b145585f	fix(dev): agent.sh validate_name set -e safety + ERR trap Refactor [ test ] && prefail pattern to if/then/fi — set -euo pipefail was silently exiting after the loop because the failing-test compound propagated exit code 1 through the function return. Add ERR trap so future silent fails get diagnosed at the source.	2026-06-03 18:02:50 +02:00
Oskar Kapala	3b620ef7e3	docs(claude): multi-agent worktree mode section Main checkout = deploy-only. .agent-task marker triggers mandatory loading of worktree-aware skill. Only the human runs scripts/dev/agent.sh.	2026-06-03 17:41:35 +02:00
Oskar Kapala	745e52723c	feat(skills): worktree-aware skill for Claude Code Encodes branch hygiene for CC running in task worktrees: commit only to assigned branch, no push origin master, no touching main checkout, no git add -A, no worktree management, mandatory final report.	2026-06-03 17:41:35 +02:00
Oskar Kapala	1abe925f65	feat(dev): scripts/dev/agent.sh — multi-agent worktree dispatcher new/list/merge/clean. Decisions: branch task/<name>, sibling worktree ~/homelab-codex-ws-<name>, ff-only auto-merge, cap 4.	2026-06-03 17:41:35 +02:00
Oskar Kapala	1c69a5bc29	feat(skills): save-session skill for Claude Code Records session facts (git log, diff --stat, deploys from transcript) by appending to docs/sessions/YYYY-MM-DD.md with a mandatory narrative placeholder. Never touches backlog.md or CLAUDE.md without explicit instruction. Commits only the session file. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-03 16:06:46 +02:00
Oskar Kapala	02e7c28823	feat(skills): deploy skill for Claude Code Instructs CC to always route deploy/redeploy/ship/wdróż requests through scripts/deploy/deploy.sh, maps exit codes to required actions, and enforces no-bypass rules for gate and branch checks. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-03 16:06:40 +02:00
Oskar Kapala	db592fbc28	feat(deploy): Saturn-side dispatcher wrapper Replaces the per-node staged framework with a single entry point that runs from SATURN: preflight (branch/clean-tree/push/SSH), gate (pytest + docker build per service), execute (control-plane.sh --ssh or remote deploy-node.sh), verify (docker ps), and one-line report. Exit codes: 0=ok 1=preflight 2=gate 3=execute 4=verify 5=sudo-handoff. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-03 16:06:36 +02:00
Oskar Kapala	00fc36df3a	fix(deploy): skip sudo chown/chmod when /opt/homelab ownership is already correct deploy-local.sh previously ran `sudo chown -R 1000:1000` and `sudo chmod -R 775` unconditionally on every deploy, which blocked non-TTY execution (CC/CI) on VPS where /opt/homelab is already 1000:1000. Both steps are now conditional using `find ... -print -quit`: - chown: runs only if any file/dir is NOT uid/gid 1000 - chmod: runs only if any directory is missing -775 permission bits When everything is correct (steady state on VPS), both steps log "already correct, skipping" and never invoke sudo. If a new directory was created by root (e.g. a manual mkdir, volume mount, or restart artefact), the remediation path triggers automatically — the self-heal property is preserved. Smoke-tested in Docker (ubuntu:22.04): Case 1 (1000:1000 + 775): chown skipped, chmod skipped ✓ Case 2 (root-owned subdir): chown triggered ✓ Case 3 (700 dir perms): chmod triggered ✓ Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-03 15:44:44 +02:00
Oskar Kapala	f5dcefc752	fix(observer): robust incident lifecycle + orphan auto-resolve Two root causes for stale "active" incidents on the dashboard: 1. TypeError bug in _prune_stale_world: last_occurrence / resolved_at can be an ISO-8601 string (stability-agent via events.py) or a Unix int (node-agent). The previous session's auto-resolve did plain `time.time() - last_occ` which raises TypeError for strings, silently preventing _save_world() from being called and leaving incidents perpetually "active" on disk. Fix: add _parse_ts(ts) -> float that handles int, float, and ISO-8601 strings uniformly. All timestamp arithmetic now goes through it; returns 0.0 on None / garbage to keep comparisons safe. 2. Orphaned active incidents: _resolve_incident clears service["incident_id"] and marks the incident "resolved" in memory, but if incidents.json was truncated mid-write (pre-atomic-write era), the observer loaded it at next startup with status="active" and no service entry pointing to it. No code ever touched these orphans again. Fix: _prune_stale_world now runs two cleanup passes each cycle: - Case 1 (healthy-linked): service.status=="healthy" AND incident_id still set → resolve immediately (service cannot have active incident) - Case 2 (orphaned): active incident with no service link AND last_occurrence > 5 min ago → resolve (5-min guard for creation race) Both cases are wrapped in try/except so a bug here never crashes the observer loop or blocks _save_world. Also fixes the 7-day stale-incident prune to use _parse_ts so ISO-string resolved_at values are handled correctly. 3. Operator UI: current_incidents() now filters to status=="active" only. Resolved incidents were previously included in the /incidents endpoint, making the dashboard show a wall of historical records as if active. 
Nocturnal job investigation: _cleanup_control_plane_fs in node-agent runs every 60s on VPS (not midnight-specific); it reads observer_checkpoint.json (now written atomically) and deletes old event files. No non-atomic writes found. Midnight clustering was likely external (logrotate / OS flush); the supervisor's resilient loader already handles such transient issues. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-03 14:29:12 +02:00
Oskar Kapala	98437d46b2	test(control-plane): atomic write and resilient loader coverage 11 new test cases in test_state_reliability.py covering: - atomic_write_json: produces valid JSON, no .tmp left behind, overwrites, works with nested structures - _load_actual_state: returns False on empty / truncated file, returns True on valid files, preserves last-known-good state across a parse failure - reconcile: empty/truncated services.json or incidents.json generates zero actions (skip-cycle semantics proven end-to-end) - healthy service with valid world state generates no spurious action All 32 tests (11 new + 21 existing) pass. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-03 12:27:05 +02:00
Oskar Kapala	5e97b4e448	fix(supervisor): atomic writes + skip cycle on unreadable world state Two independent fixes for the false-alarm storm caused by race-condition reads of truncated world state files: 1. Atomic writes: _atomic_write_json (write→fsync→os.replace) replaces all bare open('w')+json.dump calls in supervisor and executor, so the action-file pipeline is never visible in a half-written state. 2. Resilient loader: _load_actual_state now returns False when any world state file fails to parse (empty or truncated mid-write). reconcile() skips the entire drift check on False instead of treating {} as "all services missing". actual_state retains its last-known-good values so a single bad cycle does not wipe accumulated context. Before: parse error → raw[key]={} → all desired services missing → wall of redeploy actions → drift_resolved_auto churn on next cycle. After: parse error → WARNING logged → cycle skipped → no actions. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-03 12:26:59 +02:00
Oskar Kapala	ffb0608b9a	fix(observer): atomic writes for world state files All JSON state writes (services.json, nodes.json, incidents.json, deployments.json, runtime-summary.json, observer_checkpoint.json) now use _atomic_write_json: write to a .tmp sibling, fsync, then os.replace. This eliminates the truncated-write window that caused supervisors reading mid-write files to see empty/partial JSON. Also adds auto-resolution of phantom active incidents: if a service reports status=healthy and its incident's last_occurrence is >30 min old, the incident is resolved in _prune_stale_world. This clears false active incidents accumulated from previous race-condition reads. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-03 12:26:49 +02:00
Oskar Kapala	f381023206	docs(claude): add Definition of Done for services (smoke test + pytest) Lesson from brain-watchdog: code that was never run had a packaging bug that caused a crash loop in production. New rule: docker build + short smoke-run + pytest before any commit or deploy. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-01 20:38:39 +02:00
Oskar Kapala	cb4ae756ab	test(brain-watchdog): add pytest suite covering import and check() logic 7 cases: package importable, fresh ok, stale, unreachable, HTTP error, missing last_update field, unparseable timestamp. pytest.ini sets pythonpath=src so tests run without PYTHONPATH set in the environment. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-01 20:38:24 +02:00
Oskar Kapala	cfe5e02372	fix(brain-watchdog): add PYTHONPATH=/app/src so brain_watchdog package is importable WORKDIR is /app but the package lives under src/; without PYTHONPATH set `python -m brain_watchdog.main` raised ModuleNotFoundError on startup. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-01 20:31:45 +02:00
Oskar Kapala	039f9f7247	feat(piha): brain-watchdog — external watchdog for control-plane Polls /summary on VPS over Tailscale every 60s; computes freshness locally from last_update epoch (never trusts self-reported status). Alerts via Telegram Bot API directly after 3 consecutive failures; sends recovery message on heal. State (fail_count, alerted) persisted to volume so debounce survives restarts. - services/brain-watchdog/: Python service, no external deps (stdlib only) - hosts/piha/runtime/brain-watchdog/: override with mem_limit 64m - hosts/piha/services.yaml + inventory/topology.yaml: manifest entries Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-01 17:54:36 +02:00
Oskar Kapala	495741e7ac	operator-ui: /events without loading the entire directory + daemon threads; epoch from regex (fix chelsty-infra) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-01 16:34:52 +02:00
Oskar Kapala	43c5d45353	deploy: chmod/chown on /opt/homelab resilient to disappearing event files	2026-06-01 14:35:19 +02:00
Oskar Kapala	f64cec645e	vps: mem_limit + oom_score_adj on in-repo services; deploy-local applies override (stop OOM)	2026-06-01 14:23:58 +02:00
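
Commits ffb0608b9a and 5e97b4e448 describe the state-file fix as "write to a .tmp sibling, fsync, then os.replace", paired with a loader that reports a parse failure instead of treating the file as empty. The following is a minimal sketch of that pattern, not the repository's code: the function names `atomic_write_json` and `load_json_or_none` and the `None` return convention are assumptions made for illustration (the commit says the real loader returns False).

```python
import json
import os
import tempfile
from pathlib import Path


def atomic_write_json(path, data):
    """Write data as JSON so that readers never see a half-written file."""
    path = Path(path)
    # The temp file is a sibling so os.replace stays on one filesystem.
    tmp = path.with_name(path.name + ".tmp")
    with open(tmp, "w", encoding="utf-8") as fh:
        json.dump(data, fh)
        fh.flush()
        os.fsync(fh.fileno())  # push the bytes to disk before the rename
    os.replace(tmp, path)  # atomic swap: old content or new, never partial


def load_json_or_none(path):
    """Return parsed JSON, or None if the file is missing, empty, or corrupt."""
    try:
        with open(path, encoding="utf-8") as fh:
            return json.load(fh)
    except (OSError, ValueError):  # ValueError covers JSON and decode errors
        return None


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        target = Path(d) / "services.json"
        atomic_write_json(target, {"node-agent": {"status": "healthy"}})
        print(load_json_or_none(target))
        print(os.listdir(d))  # no .tmp file is left behind
```

A caller that receives `None` can skip the cycle and keep its last-known-good state, which is the behaviour the supervisor commit describes.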


163 commits
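
Commit c255a021d1 describes quarantining a malformed event file on the first parse failure, with collision handling, so that later good events still flow. The sketch below illustrates the idea under stated assumptions: the names `quarantine` and `process_events`, the `*.json` file pattern, and the numeric-suffix collision scheme are hypothetical, and checkpoint handling is left out entirely.

```python
import json
import os
from pathlib import Path


def quarantine(event_path, failed_dir):
    """Move a bad event file into failed_dir without overwriting older ones."""
    event_path = Path(event_path)
    failed_dir = Path(failed_dir)
    failed_dir.mkdir(parents=True, exist_ok=True)
    dest = failed_dir / event_path.name
    n = 1
    while dest.exists():  # collision: append a numeric suffix (assumed scheme)
        dest = failed_dir / f"{event_path.name}.{n}"
        n += 1
    os.replace(event_path, dest)  # atomic when both paths share a filesystem
    return dest


def process_events(event_dir, failed_dir):
    """Parse events in name order; quarantine failures instead of retrying."""
    good = []
    for path in sorted(Path(event_dir).glob("*.json")):
        try:
            good.append(json.loads(path.read_bytes()))
        except ValueError:  # bad JSON, truncation, or undecodable bytes
            try:
                quarantine(path, failed_dir)
            except OSError as exc:
                # A failed move is reported but must not stop the loop.
                print(f"quarantine failed for {path.name}: {exc}")
    return good
```

Because the bad file leaves the hot path, the next cycle no longer sees it, which is what lets a checkpoint move on to the following good event.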