Commit graph

152 commits

Author SHA1 Message Date
Oskar Kapala 1304c8449f feat(onboard): implement 40-register + 50-verify, remove dead scaffold
- 40-register.sh: idempotent — dopisuje lustro do topology.yaml + tworzy
  hosts/<node>/services.yaml, commituje na bieżącym branchu (bez push)
- 50-verify.sh: 4 checki — node-agent running, eventy, observer restart +
  heartbeat poll, world/nodes.json; tabela pass/fail; exit 1 on failure
- 40-deploy-node-agent.sh: usunięty (martwy scaffold; deploy w 30-node-agent.sh)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-09 20:36:00 +02:00
Oskar Kapala a99bf9dadc fix(onboard): 30-node-agent — mkdir -p deploy dir before rsync
rsync fails with "No such file or directory" when intermediate dirs
don't exist. /opt/homelab/deploy/ is not created by 20-base.sh.
Add rrun mkdir -p before rsync_dir; pi owns /opt/homelab so no sudo.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-09 14:46:01 +02:00
Oskar Kapala f6342749e6 feat(onboard): add 30-node-agent.sh + lustro node-agent override
Push-based deploy step for LUSTRO (git_control=false): rsync
services/node-agent/ and the host override to /opt/homelab/deploy/node-agent/
on the remote, then docker compose up --build via SSH.

Guard by effect: skip push+build+up if node-agent container already running
(docker ps filter, not command -v). Verify: container running + events appear
in /opt/homelab/events/lustro/ within 90 s (confirms agent write path).

Override (hosts/lustro/runtime/node-agent/docker-compose.override.yml):
- group_add: ["991"]  (docker GID on LUSTRO; 999 from base concatenated — harmless)
- mem_limit: 256m  (MagicMirror ~1.9 GiB; agent must be bounded)
- /home/pi/.ssh:/root/.ssh:ro  (not /home/oskar/.ssh — pi user)
- /opt/homelab/deploy/node-agent:/repo:ro  (no repo checkout on push-based node)
- NODE_NAME=lustro, NODE_TYPE=sd_card, VPS_EVENTS_HOST=100.95.58.48

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-09 14:24:39 +02:00
Oskar Kapala 415479454a fix(onboard): 20-base.sh — popraw guard idempotencji swap→zram
Stary guard porównywał literał konfigu (SIZE=) zamiast sprawdzać efekt.
Ręcznie postawiony zram był pomijany (dpkg -l vs command -v) i config
był nadpisywany niepotrzebnie.

- Guard by effect: sudo swapon --show | grep /dev/zram + dphys nieaktywny
  → cała sekcja skip bez wchodzenia w substages
- Detekcja pakietu przez dpkg -l zram-tools (nie command -v zramswap — PATH)
- Config: PERCENT=50 (skaluje z RAM) zamiast SIZE=; printf '%s\n' | sudo tee
- Wszystkie weryfikacje zram przez sudo swapon --show (nie zramctl)
- Usuń parsowanie hardware.swap.mb (nieużywane po przejściu na PERCENT)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-09 13:30:12 +02:00
Oskar Kapala d81ac27ebb feat(onboard): implement 20-base.sh for LUSTRO — swap→zram, /opt/homelab, event dir
Three idempotent stages with guards (probe-before-mutate), rrun() for all
remote mutations, rprobe() for unconditional state queries. Reads
hardware.swap.mb from node.yaml (default 2048 MB). Adds swap.mb: 2048
to hosts/lustro/node.yaml so the value is declarative.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-09 12:21:53 +02:00
Oskar Kapala 9b2a1b4e9a docs(backlog): observer staleness — dead node shows NOMINAL (heartbeat TTL) 2026-06-09 12:16:59 +02:00
Oskar Kapala 85e056046c docs(session): worktree hygiene update + marker gap note 2026-06-09 11:38:41 +02:00
Oskar Kapala c466ed28d1 docs(skills): add node-onboarding skill (living doc)
ECC-format skill for the node onboarding workflow. Covers full step
sequence, operational rules, node.yaml key fields, gotchas from LUSTRO
session, and Definition of Done. Marked as living doc — SCAFFOLD sections
to be promoted to PROVEN as steps land on real nodes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-09 10:14:42 +02:00
Oskar Kapala d2fb2b3d41 docs: onboard README + CLAUDE.md worktree discipline reminder
scripts/onboard/README.md (new):
- Tool purpose and --node/--step/--from/--dry-run usage
- Full node.yaml field schema with annotations (ssh_user uid-1000
  gotcha, first_contact IP vs .local, deploy_autonomy/git_control gates)
- Step status table (00-access DONE, 00-preflight SCAFFOLD, 10-50 TODO)
- lib/ architecture: run() dry-run convention, yaml_get fallback caveats
- Gotchas/Learnings table from session

CLAUDE.md:
- Node Onboarding section: onboard.sh commands, pointer to README
- Multi-agent worktree mode: add explicit DISCIPLINE RULE — feature
  work must happen in agent.sh worktrees, not the main checkout;
  references the 2026-06-08 session that violated this

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-08 22:31:12 +02:00
Oskar Kapala e59eb12da3 docs: session log 2026-06-08 — LUSTRO onboarding
Records the onboarding session for LUSTRO (RPi4, KEN site):
node facts from preflight, key decisions (user pi/uid-1000, IP
over mDNS, zram target), 00-access status, tool bugs fixed
(dry-run propagation, yaml_get greedy-colon + inline comment,
ssh known-hosts in verify), open items for next session
(worktree hygiene first, bootstrap-runtime, node-agent, register,
verify, mm-watch).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-08 22:31:03 +02:00
Oskar Kapala 471ba09c4a fix(onboard/00-access): suppress known-hosts warning in Tailscale verify
On first SSH to a new mesh hostname, OpenSSH emits
"Warning: Permanently added 'lustro' to the list of known hosts"
on stderr. The previous code used 2>&1, merging it into the captured
arch variable, which caused the arch assertion to fail with
arch="Warning:Permanentlyadded...".

Fix:
- Add dedicated _TS_SSH opts array with -o LogLevel=ERROR, which
  suppresses INFO-level messages (known-hosts, banner) at source
- Remove 2>&1 — stderr is no longer merged into the captured value
- Run only `uname -m` instead of `echo ok && uname -m`; take the last
  non-empty stdout line to be robust against any remaining preamble
- Change arch mismatch from warn to die in live mode (warn in dry-run)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-08 15:28:21 +02:00
Oskar Kapala 1bed8559fa fix(onboard): lustro first_contact via LAN IP (mDNS unreliable) 2026-06-08 15:23:44 +02:00
Oskar Kapala eed0ad0635 fix(onboard): fix yaml_get fallback — strip inline comments and fix greedy colon match
Two bugs in the grep+sed fallback (triggered when yq is unavailable):

1. Greedy colon match: `s/.*: *//` consumed the *last* `: ` in the line, so
   values containing a colon (e.g. `systemd:magicmirror.service`) were
   silently truncated to the portion after the last colon.
   Fix: `s/^[[:space:]]*[^:]*:[[:space:]]*//' — anchored at line start,
   key chars are `[^:]*` (no colons), so only the first `: ` separator is removed.

2. Inline YAML comment not stripped: `first_contact: pi@pimirror2.local   # ...`
   returned the full tail including `#`, breaking callers like ssh-copy-id.
   Fix: add `s/[[:space:]]\+#.*$//` — requires at least one space before `#`
   to preserve bare `#` characters inside a value.

Also add leading/trailing whitespace trim as a separate pass.
Both bugs affect any node.yaml field that has an inline comment or a colon
in its value; all ten fields in hosts/lustro/node.yaml now parse correctly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-08 15:16:06 +02:00
Oskar Kapala 931fd46e62 fix(onboard): propagate dry-run into steps via run() helper
DRY_RUN now uses 1/0 instead of "true"/"false" across all onboard scripts.

common.sh: add run() — wraps mutations; prints "[dry-run] would: ..." when
  DRY_RUN=1. Exported via `export -f run` so child bash processes inherit it.

onboard.sh: remove the `--dry-run → dryrun "Would execute" → continue` bypass.
  Steps now always execute; DRY_RUN=1 is exported so each step's own run()
  calls handle simulation. The orchestrator no longer needs to know step internals.

remote.sh: update DRY_RUN checks to [ "${DRY_RUN:-0}" = 1 ] for consistency.

00-access.sh: remove all if/else DRY_RUN blocks; replace with:
  - Mutations (ssh-copy-id, curl install, tailscale up) wrapped in run()
  - Probes (SSH BatchMode test, command -v, _ts_state) run unconditionally
    so dry-run reports real current state ("key present → skip" vs "would: ...")
  - Stage 3 verify runs always; SSH failure is die in live mode, warn in
    dry-run (Tailscale not yet joined is expected on a fresh node)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-08 15:01:09 +02:00
Oskar Kapala 9012a36827 feat(onboard): add 00-access step + update lustro node.yaml
00-access.sh implements a 3-stage idempotent access bootstrap:
  1. ensure_ssh_key  — ssh-copy-id to first_contact (pi@pimirror2.local),
     skips if BatchMode key-auth already passes
  2. ensure_tailscale — install via install.sh if missing, then tailscale up
     --hostname=lustro; prints interactive auth URL to operator, blocks until
     authenticated; skips if BackendState already Running
  3. verify — SSH over Tailscale to pi@lustro, asserts 'ok' + arch=aarch64

Reads first_contact and tailscale.hostname from node.yaml.
Respects --dry-run. No NOPASSWD or /opt/homelab mutations.

hosts/lustro/node.yaml: fill known hardware facts (arm64, 4096 MB RAM,
zram swap, docker_present, mm_runtime=systemd:magicmirror.service),
add ssh_user=pi, first_contact=pi@pimirror2.local,
services.node-agent.runtime engine=docker mem_limit=256m.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-08 14:43:16 +02:00
Oskar Kapala adb84079ab feat(onboard): add node onboarding scaffold (bash, idempotent)
- scripts/onboard/onboard.sh: orchestrator with --node/--step/--from/--dry-run flags,
  deploy_autonomy + git_control gates, lexicographic step ordering
- scripts/onboard/lib/common.sh: log/warn/die/step helpers, yaml_get (yq+grep/sed fallback),
  ensure_line, git() wrapper enforcing --no-pager
- scripts/onboard/lib/remote.sh: rrun/rcopy/rsync_dir/rcheck SSH wrappers, dry-run aware
- scripts/onboard/steps/00-preflight.sh: read-only fact collection (arch, RAM, disk, docker,
  tailscale, MagicMirror runtime, swap), human report + machine YAML snippet
- scripts/onboard/steps/10-50: stub files with TODO headers, no mutations
- hosts/lustro/node.yaml: LUSTRO edge node draft (KEN, role=edge, deploy_autonomy=true,
  git_control=false); hardware fields marked TODO for preflight population

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-08 14:23:21 +02:00
Oskar Kapala 58ac6edd7d fix(stability-agent): run as uid 1000 with docker group access
stability-agent had no USER instruction and no user: in compose, running
as root and writing root-owned files to /opt/homelab bind-mount.

- Dockerfile: add useradd -m -u 1000 homelab + USER homelab
- docker-compose.yml: add user: "1000:1000" and group_add: ["999"]
  (GID 999 = docker group on VPS) to retain docker.sock:ro access

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-03 18:20:54 +02:00
Oskar Kapala 19fd8799d9 fix(node-agent): run as uid 1000 with docker group access
node-agent had no USER instruction and no user: in compose, running
as root and writing root-owned files to /opt/homelab bind-mount.

- Dockerfile: add useradd -m -u 1000 homelab + USER homelab
- docker-compose.yml: add user: "1000:1000" and group_add: ["999"]
  (GID 999 = docker group on VPS) to retain docker.sock access

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-03 18:20:31 +02:00
Oskar Kapala 7f17b65278 fix(control-plane): run executor as uid 1000 with docker group access
Executor was the only control-plane container running as root (uid=0),
writing root-owned files to /opt/homelab via bind-mount and triggering
false sudo on every deploy.

- Dockerfile: add USER homelab after useradd (useradd already present)
- docker-compose.yml: add user: "1000:1000" and group_add: ["999"]
  (GID 999 = docker group on VPS) so executor retains docker.sock access

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-03 18:19:58 +02:00
Oskar Kapala e6a2443412 fix(dev): agent.sh worktree_count/paths grep exit-1 on empty set
grep -cv (and grep -v) return exit code 1 when there are zero matches.
With set -euo pipefail this silently aborted the script before count
was returned — causing 'agent.sh new' to fail on a fresh repo with no
existing worktrees.

Fix: move the grep -v into worktree_paths with '|| true' so the
function always exits 0, then derive worktree_count via wc -l.
2026-06-03 18:04:38 +02:00
Oskar Kapala f9b145585f fix(dev): agent.sh validate_name set -e safety + ERR trap
Refactor [ test ] && prefail pattern to if/then/fi — set -euo pipefail
was silently exiting after the loop because the failing-test compound
propagated exit code 1 through the function return.

Add ERR trap so future silent fails get diagnosed at the source.
2026-06-03 18:02:50 +02:00
Oskar Kapala 3b620ef7e3 docs(claude): multi-agent worktree mode section
Main checkout = deploy-only. .agent-task marker triggers mandatory
loading of worktree-aware skill. Only the human runs scripts/dev/agent.sh.
2026-06-03 17:41:35 +02:00
Oskar Kapala 745e52723c feat(skills): worktree-aware skill for Claude Code
Encodes branch hygiene for CC running in task worktrees: commit only to
assigned branch, no push origin master, no touching main checkout, no
git add -A, no worktree management, mandatory final report.
2026-06-03 17:41:35 +02:00
Oskar Kapala 1abe925f65 feat(dev): scripts/dev/agent.sh — multi-agent worktree dispatcher
new/list/merge/clean. Decisions: branch task/<name>, sibling worktree
~/homelab-codex-ws-<name>, ff-only auto-merge, cap 4.
2026-06-03 17:41:35 +02:00
Oskar Kapala 1c69a5bc29 feat(skills): save-session skill for Claude Code
Records session facts (git log, diff --stat, deploys from transcript)
by appending to docs/sessions/YYYY-MM-DD.md with a mandatory narrative
placeholder. Never touches backlog.md or CLAUDE.md without explicit
instruction. Commits only the session file.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-03 16:06:46 +02:00
Oskar Kapala 02e7c28823 feat(skills): deploy skill for Claude Code
Instructs CC to always route deploy/redeploy/ship/wdróż requests through
scripts/deploy/deploy.sh, maps exit codes to required actions, and
enforces no-bypass rules for gate and branch checks.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-03 16:06:40 +02:00
Oskar Kapala db592fbc28 feat(deploy): Saturn-side dispatcher wrapper
Replaces the per-node staged framework with a single entry point that
runs from SATURN: preflight (branch/clean-tree/push/SSH), gate (pytest +
docker build per service), execute (control-plane.sh --ssh or remote
deploy-node.sh), verify (docker ps), and one-line report.

Exit codes: 0=ok 1=preflight 2=gate 3=execute 4=verify 5=sudo-handoff.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-03 16:06:36 +02:00
Oskar Kapala 00fc36df3a fix(deploy): skip sudo chown/chmod when /opt/homelab ownership is already correct
deploy-local.sh previously ran `sudo chown -R 1000:1000` and
`sudo chmod -R 775` unconditionally on every deploy, which blocked
non-TTY execution (CC/CI) on VPS where /opt/homelab is already 1000:1000.

Both steps are now conditional using `find ... -print -quit`:
- chown: runs only if any file/dir is NOT uid/gid 1000
- chmod: runs only if any directory is missing -775 permission bits

When everything is correct (steady state on VPS), both steps log
"already correct, skipping" and never invoke sudo.  If a new directory
was created by root (e.g. a manual mkdir, volume mount, or restart artefact),
the remediation path triggers automatically — the self-heal property is preserved.

Smoke-tested in Docker (ubuntu:22.04):
  Case 1 (1000:1000 + 775):  chown skipped, chmod skipped ✓
  Case 2 (root-owned subdir): chown triggered ✓
  Case 3 (700 dir perms):     chmod triggered ✓

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-03 15:44:44 +02:00
Oskar Kapala f5dcefc752 fix(observer): robust incident lifecycle + orphan auto-resolve
Two root causes for stale "active" incidents on the dashboard:

1. TypeError bug in _prune_stale_world: last_occurrence / resolved_at
   can be an ISO-8601 string (stability-agent via events.py) or a Unix
   int (node-agent).  The previous session's auto-resolve did plain
   `time.time() - last_occ` which raises TypeError for strings,
   silently preventing _save_world() from being called and leaving
   incidents perpetually "active" on disk.

   Fix: add _parse_ts(ts) -> float that handles int, float, and
   ISO-8601 strings uniformly. All timestamp arithmetic now goes through
   it; returns 0.0 on None / garbage to keep comparisons safe.

2. Orphaned active incidents: _resolve_incident clears service["incident_id"]
   and marks the incident "resolved" in memory, but if incidents.json was
   truncated mid-write (pre-atomic-write era), the observer loaded it at
   next startup with status="active" and no service entry pointing to it.
   No code ever touched these orphans again.

   Fix: _prune_stale_world now runs two cleanup passes each cycle:
   - Case 1 (healthy-linked): service.status=="healthy" AND incident_id
     still set → resolve immediately (service cannot have active incident)
   - Case 2 (orphaned): active incident with no service link AND
     last_occurrence > 5 min ago → resolve (5-min guard for creation race)

   Both cases are wrapped in try/except so a bug here never crashes the
   observer loop or blocks _save_world.

   Also fixes the 7-day stale-incident prune to use _parse_ts so
   ISO-string resolved_at values are handled correctly.

3. Operator UI: current_incidents() now filters to status=="active" only.
   Resolved incidents were previously included in the /incidents endpoint,
   making the dashboard show a wall of historical records as if active.

Nocturnal job investigation: _cleanup_control_plane_fs in node-agent runs
every 60s on VPS (not midnight-specific); it reads observer_checkpoint.json
(now written atomically) and deletes old event files. No non-atomic writes
found. Midnight clustering was likely external (logrotate / OS flush);
the supervisor's resilient loader already handles such transient issues.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-03 14:29:12 +02:00
Oskar Kapala 98437d46b2 test(control-plane): atomic write and resilient loader coverage
11 new test cases in test_state_reliability.py covering:
- atomic_write_json: produces valid JSON, no .tmp left behind, overwrites,
  works with nested structures
- _load_actual_state: returns False on empty / truncated file, returns True
  on valid files, preserves last-known-good state across a parse failure
- reconcile: empty/truncated services.json or incidents.json generates zero
  actions (skip-cycle semantics proven end-to-end)
- healthy service with valid world state generates no spurious action

All 32 tests (11 new + 21 existing) pass.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-03 12:27:05 +02:00
Oskar Kapala 5e97b4e448 fix(supervisor): atomic writes + skip cycle on unreadable world state
Two independent fixes for the false-alarm storm caused by race-condition
reads of truncated world state files:

1. Atomic writes: _atomic_write_json (write→fsync→os.replace) replaces
   all bare open('w')+json.dump calls in supervisor and executor, so the
   action-file pipeline is never visible in a half-written state.

2. Resilient loader: _load_actual_state now returns False when any world
   state file fails to parse (empty or truncated mid-write). reconcile()
   skips the entire drift check on False instead of treating {} as "all
   services missing". actual_state retains its last-known-good values so
   a single bad cycle does not wipe accumulated context.

   Before: parse error → raw[key]={} → all desired services missing →
     wall of redeploy actions → drift_resolved_auto churn on next cycle.
   After:  parse error → WARNING logged → cycle skipped → no actions.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-03 12:26:59 +02:00
Oskar Kapala ffb0608b9a fix(observer): atomic writes for world state files
All JSON state writes (services.json, nodes.json, incidents.json,
deployments.json, runtime-summary.json, observer_checkpoint.json) now use
_atomic_write_json: write to a .tmp sibling, fsync, then os.replace.
This eliminates the truncated-write window that caused supervisors
reading mid-write files to see empty/partial JSON.

Also adds auto-resolution of phantom active incidents: if a service
reports status=healthy and its incident's last_occurrence is >30 min old,
the incident is resolved in _prune_stale_world. This clears false active
incidents accumulated from previous race-condition reads.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-03 12:26:49 +02:00
Oskar Kapala f381023206 docs(claude): add Definition of Done for services (smoke test + pytest)
Lesson from brain-watchdog: code that was never run had a packaging bug
that caused a crash loop in production. New rule: docker build + short
smoke-run + pytest before any commit or deploy.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-01 20:38:39 +02:00
Oskar Kapala cb4ae756ab test(brain-watchdog): add pytest suite covering import and check() logic
7 cases: package importable, fresh ok, stale, unreachable, HTTP error,
missing last_update field, unparseable timestamp. pytest.ini sets pythonpath=src
so tests run without PYTHONPATH set in the environment.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-01 20:38:24 +02:00
Oskar Kapala cfe5e02372 fix(brain-watchdog): add PYTHONPATH=/app/src so brain_watchdog package is importable
WORKDIR is /app but the package lives under src/; without PYTHONPATH set
`python -m brain_watchdog.main` raised ModuleNotFoundError on startup.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-01 20:31:45 +02:00
Oskar Kapala 039f9f7247 feat(piha): brain-watchdog — external watchdog for control-plane
Polls /summary on VPS over Tailscale every 60s; computes freshness
locally from last_update epoch (never trusts self-reported status).
Alerts via Telegram Bot API directly after 3 consecutive failures;
sends recovery message on heal. State (fail_count, alerted) persisted
to volume so debounce survives restarts.

- services/brain-watchdog/: Python service, no external deps (stdlib only)
- hosts/piha/runtime/brain-watchdog/: override with mem_limit 64m
- hosts/piha/services.yaml + inventory/topology.yaml: manifest entries

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-01 17:54:36 +02:00
Oskar Kapala 495741e7ac operator-ui: /events bez ladowania calego katalogu + daemon threads; epoch z regexa (fix chelsty-infra)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-01 16:34:52 +02:00
Oskar Kapala 43c5d45353 deploy: chmod/chown na /opt/homelab odporne na znikające pliki eventow 2026-06-01 14:35:19 +02:00
Oskar Kapala f64cec645e vps: mem_limit + oom_score_adj na serwisach in-repo; deploy-local stosuje override (stop OOM) 2026-06-01 14:23:58 +02:00
Oskar Kapala 1db9db7d03 fix(dashboard): read last_update from JSON content, not file mtime
operator_ui.py called .replace() on last_update without checking type —
an integer value (written by the materializer) raised AttributeError and
silently fell back to os.path.getmtime(), which was stuck at 5/29 after a
deploy with preserved timestamps. web.py had the same class of bug but
worse: it unconditionally replaced last_update with mtime, ignoring the
JSON field entirely. Both now branch on isinstance(str) and cast numeric
values directly to float, with mtime only as a last-resort fallback.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-31 22:10:50 +02:00
Oskar Kapala 52607a7cdd feat(control-plane): shadow_mode for HA event auto-actions + deploy docs
- HA_DIAG_SHADOW_MODE env flag in supervisor (default true)
- shadow_mode downgrades container_restart actions to alert_only with
  [SHADOW MODE] note; same action_id and 30-min cooldown apply
- alert_only events unaffected (always routed normally)
- 3 new tests: shadow on/off for ha_websocket_dead, alert-only unaffected
- DEPLOY.md with token gen, per-host config, verification, 48h observation,
  production-mode enablement, rollback
- README.md updated with shadow mode flag summary and DEPLOY.md link

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-29 17:12:33 +02:00
Oskar Kapala b9ed118b8c fix(telegram-bot): correct risk_level field + show description in alerts
- read risk_level with risk fallback (was: risk only → "unknown" for
  all actions written by supervisor which uses risk_level key)
- include description field in alert format (was: alert_only payloads'
  substance was invisible — description carried the full message)
- extract _format_pending_action() pure helper to enable unit testing
  without a live Telegram connection
- 8 tests: risk_level present, risk fallback, both absent, description
  shown/absent, truncation, full HA alert_only shape, no-description no-crash
- flagged during Phase 5 review of ha-diag-agent supervisor routing

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-29 16:26:49 +02:00
Oskar Kapala bf1415e4c1 feat(control-plane): route ha-diag-agent events through supervisor
- 8 HA event types mapped to existing action types
- ha_websocket_dead → container_restart (homeassistant), 30-min cooldown
- 6 events → alert_only (entity_unavailable, integration_failed,
  automation_failing, update_available, recorder_lag,
  system_health_degraded), 1-hour cooldown
- ha_websocket_recovered → cancels matching pending container_restart
- state-aware suppression: skip HA events when homeassistant has an
  active containers_not_running incident < 5 min ago (avoids alert
  storms during HA restarts/updates)
- location_tag preserved through action pipeline for per-house
  telegram alerts
- executor: alert_only acknowledged as no-op success
- 18 tests covering all 8 event types, suppression, cooldown,
  dedup, location_tag, recovery cancellation
- CLAUDE.md: supervisor event routing table added

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-29 15:59:23 +02:00
Oskar Kapala 31b48d162a feat(ha-diag-agent): WebSocketMonitor for real-time HA liveness
- persistent WS connection to HA with auth + state_changed subscription
- watchdog detects silence > 5min → emits ha_websocket_dead
- immediate ha_websocket_dead on disconnect, exponential reconnect with jitter
- cooldown prevents alert spam (10min repeat window while HA stays down)
- ha_websocket_recovered emitted on reconnect after a dead alert (allows
  supervisor to clear active incidents in Phase 5)
- new monitors/ subpackage for long-running tasks (vs interval checks/)
- /health endpoint now includes ws_connected field
- 26 unit tests, 3 integration tests (real HA + container stop/restart)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-29 15:00:18 +02:00
Oskar Kapala 3499b2f280 feat(ha-diag-agent): three REST diagnostic checks + Phase 3 flag fixes
New checks:
- SystemHealthCheck (15min interval): detects newly-failing HA
  integrations via /api/system_health snapshot diff; transition-based
  dedup (ok→error fires, sustained error silent, error→ok clears alert)
- UpdatesAvailableCheck (daily cron 09:00): per-update ha_update_available
  events with 7-day dedup; release notes truncated at 2000 chars
- UpdatesDigestCheck (Sunday cron 09:00): single digest event with all
  pending updates; weekly ISO-week dedup, independent of daily dedup key
- AutomationFailuresCheck (30min interval): detects automations with
  N consecutive failures (default 3) via /api/trace/automation/<id>;
  6h cooldown per automation

Phase 3 flag fixes:
- Flag #1 (since field): UnavailableEntitiesCheck now uses
  min(state.last_changed, baseline.first_seen) as effective "since",
  giving accurate duration when agent was offline at entity's first fail
- Flag #3 (registry cache): HAClient.get_entity_registry() caches
  response in-process with configurable TTL (default 300s); avoids
  repeated API calls across concurrent check cycles; invalidate_registry_cache()
  for manual invalidation

Storage: system_health_snapshot table (component, last_status, last_seen_at,
payload) created automatically on next Storage.open() call

Config additions (all with defaults): entity_registry_cache_ttl=300,
system_health_check_interval=900, automation_check_interval=1800,
automation_failure_threshold=3, updates_check_hour=9,
updates_check_minute=0, updates_cooldown_days=7

Tests: 95 unit tests pass (49 new), 13 integration tests pass (9 new);
3 skipped (live-HA token not set in CI)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-29 14:43:10 +02:00
Oskar Kapala f41ec5d0c5 docs: compress CLAUDE.md + fix zigbee2mqtt coordinator docs
- CLAUDE.md: collapsed 5-section deployment block to single annotated
  block, removed inline emit_event signatures (kept path + type list),
  flattened runtime path tree to bullets, condensed node table note to
  reference capabilities.yaml, added CHELSTY docker-compose v1
  constraint; 156 → 113 lines (~750 → ~480 tokens)
- fix: zigbee2mqtt/README.md updated to TCP coordinator (SLZB-06U at
  192.168.1.105:6638, ezsp); removed stale /dev/ttyACM0 USB reference
  and corrected owner node from piha to chelsty-infra

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-29 14:17:23 +02:00
Oskar Kapala 20f6761a67 feat(ha-diag-agent): UnavailableEntitiesCheck with root cause dedup
- shared aiohttp ClientSession in HAClient (Phase 1 Flag #2 fixed):
  make_session() factory, session injected at startup, closed on shutdown
- Check.run() → list[CheckResult]: clean multi-event interface
- first real diagnostic check: entity unavailable > 24h
  (INSERT OR IGNORE baseline preserves first-seen timestamp)
- root cause grouping: emit ha_integration_failed instead of N entity
  events when ≥50% of integration's entities are unavailable (≥3 min)
- alert deduplication via SQLite cooldown window (default 6h)
- recovery clears baseline + dedup for immediate re-alert
- configurable thresholds: duration, integration %, cooldown
- 38 unit tests + 7 integration tests (42 pass, 3 skip w/o live HA)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-29 13:41:55 +02:00
Oskar Kapala 07bd498fd6 feat(ha-diag-agent): test environment with dual HA Docker instances
- dockerized ken + chelsty HA test instances with template fixtures
- snapshot/reset/wait scripts for fixture management
- integration test infrastructure with separate marker
- location_tag promoted from metadata to event payload (Phase 1 flag #3)
- chelsty-infra target_url points to chelsty-ha via tailnet (Phase 1 flag #1)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-29 12:56:13 +02:00
Oskar Kapala 90c8e77bf7 chore: gitignore *.egg-info, remove committed egg-info
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-29 12:26:57 +02:00
Oskar Kapala ab8895d28b feat(ha-diag-agent): scaffold service with HA REST client and event emitter
- new per-host service, follows node-agent pattern
- 7 new HA event types defined (routing in supervisor — Phase 5)
- HeartbeatCheck as pipeline validator (pings /api/, emits ha_websocket_dead)
- service.yaml + host configs for piha (ken) and chelsty-infra (chelsty)
- test scaffolding with aiohttp/aiosqlite mocks (15/15 passing)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-29 12:26:34 +02:00