41 commits

Author SHA1 Message Date
Oskar Kapala d2fb2b3d41 docs: onboard README + CLAUDE.md worktree discipline reminder
scripts/onboard/README.md (new):
- Tool purpose and --node/--step/--from/--dry-run usage
- Full node.yaml field schema with annotations (ssh_user uid-1000
  gotcha, first_contact IP vs .local, deploy_autonomy/git_control gates)
- Step status table (00-access DONE, 00-preflight SCAFFOLD, 10-50 TODO)
- lib/ architecture: run() dry-run convention, yaml_get fallback caveats
- Gotchas/Learnings table from session

CLAUDE.md:
- Node Onboarding section: onboard.sh commands, pointer to README
- Multi-agent worktree mode: add explicit DISCIPLINE RULE — feature
  work must happen in agent.sh worktrees, not the main checkout;
  references the 2026-06-08 session that violated this

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-08 22:31:12 +02:00
Oskar Kapala 471ba09c4a fix(onboard/00-access): suppress known-hosts warning in Tailscale verify
On first SSH to a new mesh hostname, OpenSSH emits
"Warning: Permanently added 'lustro' to the list of known hosts"
on stderr. The previous code used 2>&1, merging it into the captured
arch variable, which caused the arch assertion to fail with
arch="Warning:Permanentlyadded...".

Fix:
- Add dedicated _TS_SSH opts array with -o LogLevel=ERROR, which
  suppresses INFO-level messages (known-hosts, banner) at source
- Remove 2>&1 — stderr is no longer merged into the captured value
- Run only `uname -m` instead of `echo ok && uname -m`; take the last
  non-empty stdout line to be robust against any remaining preamble
- Change arch mismatch from warn to die in live mode (warn in dry-run)
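
A rough Python rendering of the capture logic described above, for illustration only. The real step is a bash script; `build_ssh_argv`, `last_nonempty_line`, and `remote_arch` are hypothetical names, and only `-o LogLevel=ERROR` and `uname -m` come from this commit message.

```python
import subprocess


def build_ssh_argv(target: str) -> list[str]:
    """Build an ssh command line that keeps known-hosts notices off stderr."""
    # LogLevel=ERROR suppresses the "Permanently added ..." warning at the source.
    return ["ssh", "-o", "LogLevel=ERROR", target, "uname", "-m"]


def last_nonempty_line(stdout: str) -> str:
    """Return the last non-empty stdout line, ignoring any preamble."""
    lines = [ln.strip() for ln in stdout.splitlines() if ln.strip()]
    return lines[-1] if lines else ""


def remote_arch(target: str) -> str:
    """Run `uname -m` remotely; only stdout is captured, stderr is never merged."""
    proc = subprocess.run(
        build_ssh_argv(target), stdout=subprocess.PIPE, text=True, check=True
    )
    return last_nonempty_line(proc.stdout)
```

Because stderr is not redirected into the captured value, a stray warning can no longer end up inside the arch string, and taking the last non-empty line tolerates a banner that still arrives on stdout.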

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-08 15:28:21 +02:00
Oskar Kapala eed0ad0635 fix(onboard): fix yaml_get fallback — strip inline comments and fix greedy colon match
Two bugs in the grep+sed fallback (triggered when yq is unavailable):

1. Greedy colon match: `s/.*: *//` consumed everything through the *last* `:` in the
   line (` *` also matches zero spaces), so values containing a colon (e.g.
   `systemd:magicmirror.service`) were silently truncated to the portion after the
   last colon.
   Fix: `s/^[[:space:]]*[^:]*:[[:space:]]*//`, anchored at line start; the key is
   matched by `[^:]*` (no colons), so only the first `: ` separator is removed.

2. Inline YAML comment not stripped: `first_contact: pi@pimirror2.local   # ...`
   returned the full tail including `#`, breaking callers like ssh-copy-id.
   Fix: add `s/[[:space:]]\+#.*$//` — requires at least one space before `#`
   to preserve bare `#` characters inside a value.

Also add leading/trailing whitespace trim as a separate pass.
Both bugs affect any node.yaml field that has an inline comment or a colon
in its value; all ten fields in hosts/lustro/node.yaml now parse correctly.
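
The two bugs and their fixes can be reproduced with Python's `re` module. This is a sketch that mirrors the sed expressions quoted above, not the shell code itself; `yaml_get_old` and `yaml_get_fixed` are hypothetical names.

```python
import re


def yaml_get_old(line: str) -> str:
    """Old behaviour: the greedy match eats everything through the last colon."""
    return re.sub(r".*: *", "", line, count=1)


def yaml_get_fixed(line: str) -> str:
    """Fixed behaviour: strip the key, then the inline comment, then whitespace."""
    # Anchored at line start; the key may not contain ':', so only the
    # first separator is removed.
    value = re.sub(r"^[ \t]*[^:]*:[ \t]*", "", line, count=1)
    # A comment needs at least one whitespace character before '#'.
    value = re.sub(r"[ \t]+#.*$", "", value)
    return value.strip()


print(yaml_get_old("mm_runtime: systemd:magicmirror.service"))
print(yaml_get_fixed("mm_runtime: systemd:magicmirror.service"))
print(yaml_get_fixed("first_contact: pi@pimirror2.local   # inline comment"))
```

The first call shows the truncation (`magicmirror.service`); the other two return the full value and the value without its comment.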

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-08 15:16:06 +02:00
Oskar Kapala 931fd46e62 fix(onboard): propagate dry-run into steps via run() helper
DRY_RUN now uses 1/0 instead of "true"/"false" across all onboard scripts.

common.sh: add run() — wraps mutations; prints "[dry-run] would: ..." when
  DRY_RUN=1. Exported via `export -f run` so child bash processes inherit it.

onboard.sh: remove the `--dry-run → dryrun "Would execute" → continue` bypass.
  Steps now always execute; DRY_RUN=1 is exported so each step's own run()
  calls handle simulation. The orchestrator no longer needs to know step internals.

remote.sh: update DRY_RUN checks to [ "${DRY_RUN:-0}" = 1 ] for consistency.

00-access.sh: remove all if/else DRY_RUN blocks; replace with:
  - Mutations (ssh-copy-id, curl install, tailscale up) wrapped in run()
  - Probes (SSH BatchMode test, command -v, _ts_state) run unconditionally
    so dry-run reports real current state ("key present → skip" vs "would: ...")
  - Stage 3 verify runs always; SSH failure is die in live mode, warn in
    dry-run (Tailscale not yet joined is expected on a fresh node)
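
The run() convention can be sketched in Python as follows. This is a minimal illustration, assuming the same `DRY_RUN=1/0` environment variable; the actual helper lives in common.sh and is written in bash, and the `dry_run` parameter is a hypothetical addition for testing.

```python
import os
import subprocess
import sys
from typing import Optional


def run(argv: list[str], dry_run: Optional[bool] = None) -> bool:
    """Execute a mutation, or only describe it when dry-run is active.

    Returns True if the command was actually executed.
    """
    if dry_run is None:
        dry_run = os.environ.get("DRY_RUN", "0") == "1"
    if dry_run:
        print("[dry-run] would:", " ".join(argv))
        return False
    subprocess.run(argv, check=True)
    return True


# Probes stay outside run() and always execute; only mutations are wrapped.
run([sys.executable, "-c", "print('mutating')"], dry_run=True)
```

Keeping the decision inside the helper is what lets the orchestrator stay ignorant of step internals: each step calls run() for mutations and nothing else changes between live and dry-run mode.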

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-08 15:01:09 +02:00
Oskar Kapala 9012a36827 feat(onboard): add 00-access step + update lustro node.yaml
00-access.sh implements a 3-stage idempotent access bootstrap:
  1. ensure_ssh_key  — ssh-copy-id to first_contact (pi@pimirror2.local),
     skips if BatchMode key-auth already passes
  2. ensure_tailscale — install via install.sh if missing, then tailscale up
     --hostname=lustro; prints interactive auth URL to operator, blocks until
     authenticated; skips if BackendState already Running
  3. verify — SSH over Tailscale to pi@lustro, asserts 'ok' + arch=aarch64

Reads first_contact and tailscale.hostname from node.yaml.
Respects --dry-run. No NOPASSWD or /opt/homelab mutations.

hosts/lustro/node.yaml: fill known hardware facts (arm64, 4096 MB RAM,
zram swap, docker_present, mm_runtime=systemd:magicmirror.service),
add ssh_user=pi, first_contact=pi@pimirror2.local,
services.node-agent.runtime engine=docker mem_limit=256m.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-08 14:43:16 +02:00
Oskar Kapala adb84079ab feat(onboard): add node onboarding scaffold (bash, idempotent)
- scripts/onboard/onboard.sh: orchestrator with --node/--step/--from/--dry-run flags,
  deploy_autonomy + git_control gates, lexicographic step ordering
- scripts/onboard/lib/common.sh: log/warn/die/step helpers, yaml_get (yq+grep/sed fallback),
  ensure_line, git() wrapper enforcing --no-pager
- scripts/onboard/lib/remote.sh: rrun/rcopy/rsync_dir/rcheck SSH wrappers, dry-run aware
- scripts/onboard/steps/00-preflight.sh: read-only fact collection (arch, RAM, disk, docker,
  tailscale, MagicMirror runtime, swap), human report + machine YAML snippet
- scripts/onboard/steps/10-50: stub files with TODO headers, no mutations
- hosts/lustro/node.yaml: LUSTRO edge node draft (KEN, role=edge, deploy_autonomy=true,
  git_control=false); hardware fields marked TODO for preflight population

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-08 14:23:21 +02:00
Oskar Kapala e6a2443412 fix(dev): agent.sh worktree_count/paths grep exit-1 on empty set
grep -cv (and grep -v) return exit code 1 when they select zero lines.
With set -euo pipefail this silently aborted the script before count
was returned — causing 'agent.sh new' to fail on a fresh repo with no
existing worktrees.

Fix: move the grep -v into worktree_paths with '|| true' so the
function always exits 0, then derive worktree_count via wc -l.
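
The exit-code behaviour is easy to observe from Python. This sketch assumes a POSIX `grep` on PATH; `grep_v_count` is a hypothetical helper, not part of agent.sh.

```python
import subprocess


def grep_v_count(pattern: str, text: str) -> tuple[int, int]:
    """Run `grep -cv pattern` on text; return (exit_code, selected_line_count)."""
    proc = subprocess.run(
        ["grep", "-cv", pattern],
        input=text,
        stdout=subprocess.PIPE,
        text=True,
        check=False,  # exit code 1 is expected here, not an error
    )
    return proc.returncode, int(proc.stdout.strip() or 0)


# Every input line matches the pattern, so -v selects nothing.
print(grep_v_count("main", "main\n"))
```

grep still prints the count `0` in that case, but the non-zero exit status is what trips `set -e` / `pipefail` in a shell script.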
2026-06-03 18:04:38 +02:00
Oskar Kapala f9b145585f fix(dev): agent.sh validate_name set -e safety + ERR trap
Refactor [ test ] && prefail pattern to if/then/fi — set -euo pipefail
was silently exiting after the loop because the failing-test compound
propagated exit code 1 through the function return.

Add ERR trap so future silent fails get diagnosed at the source.
2026-06-03 18:02:50 +02:00
Oskar Kapala 1abe925f65 feat(dev): scripts/dev/agent.sh — multi-agent worktree dispatcher
new/list/merge/clean. Decisions: branch task/<name>, sibling worktree
~/homelab-codex-ws-<name>, ff-only auto-merge, cap 4.
2026-06-03 17:41:35 +02:00
Oskar Kapala db592fbc28 feat(deploy): Saturn-side dispatcher wrapper
Replaces the per-node staged framework with a single entry point that
runs from SATURN: preflight (branch/clean-tree/push/SSH), gate (pytest +
docker build per service), execute (control-plane.sh --ssh or remote
deploy-node.sh), verify (docker ps), and one-line report.

Exit codes: 0=ok 1=preflight 2=gate 3=execute 4=verify 5=sudo-handoff.
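
One way to picture the stage-to-exit-code mapping, as a sketch only: the wrapper itself is a shell script, `ExitCode` and `dispatch` are hypothetical names, and how the sudo-handoff code is raised is not modelled here.

```python
from enum import IntEnum
from typing import Callable


class ExitCode(IntEnum):
    OK = 0
    PREFLIGHT = 1
    GATE = 2
    EXECUTE = 3
    VERIFY = 4
    SUDO_HANDOFF = 5


def dispatch(stages: list[tuple[ExitCode, Callable[[], bool]]]) -> ExitCode:
    """Run stages in order; return the code of the first stage that fails."""
    for code, stage in stages:
        if not stage():
            return code
    return ExitCode.OK


result = dispatch([
    (ExitCode.PREFLIGHT, lambda: True),
    (ExitCode.GATE, lambda: False),   # e.g. a failing pytest run
    (ExitCode.EXECUTE, lambda: True),  # never reached
])
print(int(result))
```

A distinct code per stage lets a caller tell from the exit status alone how far the deploy got.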

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-03 16:06:36 +02:00
Oskar Kapala f5dcefc752 fix(observer): robust incident lifecycle + orphan auto-resolve
Two root causes for stale "active" incidents on the dashboard:

1. TypeError bug in _prune_stale_world: last_occurrence / resolved_at
   can be an ISO-8601 string (stability-agent via events.py) or a Unix
   int (node-agent).  The previous session's auto-resolve did plain
   `time.time() - last_occ` which raises TypeError for strings,
   silently preventing _save_world() from being called and leaving
   incidents perpetually "active" on disk.

   Fix: add _parse_ts(ts) -> float that handles int, float, and
   ISO-8601 strings uniformly. All timestamp arithmetic now goes through
   it; returns 0.0 on None / garbage to keep comparisons safe.

2. Orphaned active incidents: _resolve_incident clears service["incident_id"]
   and marks the incident "resolved" in memory, but if incidents.json was
   truncated mid-write (pre-atomic-write era), the observer loaded it at
   next startup with status="active" and no service entry pointing to it.
   No code ever touched these orphans again.

   Fix: _prune_stale_world now runs two cleanup passes each cycle:
   - Case 1 (healthy-linked): service.status=="healthy" AND incident_id
     still set → resolve immediately (service cannot have active incident)
   - Case 2 (orphaned): active incident with no service link AND
     last_occurrence > 5 min ago → resolve (5-min guard for creation race)

   Both cases are wrapped in try/except so a bug here never crashes the
   observer loop or blocks _save_world.

   Also fixes the 7-day stale-incident prune to use _parse_ts so
   ISO-string resolved_at values are handled correctly.

3. Operator UI: current_incidents() now filters to status=="active" only.
   Resolved incidents were previously included in the /incidents endpoint,
   making the dashboard show a wall of historical records as if active.
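
The timestamp normalisation and the two cleanup passes from items 1 and 2 could look roughly like this. It is a sketch, not the observer's code: `parse_ts` and `resolve_stale` are stand-in names, the dict shapes are assumed from the field names mentioned above, naive ISO strings are assumed to be UTC, and the try/except wrapping is omitted.

```python
import time
from datetime import datetime, timezone
from typing import Optional

ORPHAN_GUARD_SECONDS = 5 * 60  # 5-minute guard against the creation race


def parse_ts(ts) -> float:
    """Normalise int, float, or ISO-8601 string to Unix seconds; 0.0 on garbage."""
    if isinstance(ts, (int, float)) and not isinstance(ts, bool):
        return float(ts)
    if isinstance(ts, str):
        try:
            dt = datetime.fromisoformat(ts.replace("Z", "+00:00"))
        except ValueError:
            return 0.0
        if dt.tzinfo is None:
            dt = dt.replace(tzinfo=timezone.utc)  # assumption: naive means UTC
        return dt.timestamp()
    return 0.0


def resolve_stale(services: dict, incidents: dict, now: Optional[float] = None) -> list:
    """Apply both cleanup passes; return the IDs of the incidents resolved."""
    now = time.time() if now is None else now
    resolved, linked = [], set()
    for svc in services.values():
        inc_id = svc.get("incident_id")
        if not inc_id:
            continue
        linked.add(inc_id)
        # Case 1: a healthy service must not keep an incident link.
        if svc.get("status") == "healthy":
            svc["incident_id"] = None
            if incidents.get(inc_id, {}).get("status") == "active":
                incidents[inc_id]["status"] = "resolved"
                resolved.append(inc_id)
    # Case 2: active incident that no service points to, older than the guard.
    for inc_id, inc in incidents.items():
        if inc.get("status") != "active" or inc_id in linked:
            continue
        if now - parse_ts(inc.get("last_occurrence")) > ORPHAN_GUARD_SECONDS:
            inc["status"] = "resolved"
            resolved.append(inc_id)
    return resolved
```

Routing every comparison through one parser means a string timestamp can no longer raise TypeError in the middle of a cycle; an unparseable value becomes 0.0 and is simply treated as very old.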

Nocturnal job investigation: _cleanup_control_plane_fs in node-agent runs
every 60s on VPS (not midnight-specific); it reads observer_checkpoint.json
(now written atomically) and deletes old event files. No non-atomic writes
found. Midnight clustering was likely external (logrotate / OS flush);
the supervisor's resilient loader already handles such transient issues.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-03 14:29:12 +02:00
Oskar Kapala ffb0608b9a fix(observer): atomic writes for world state files
All JSON state writes (services.json, nodes.json, incidents.json,
deployments.json, runtime-summary.json, observer_checkpoint.json) now use
_atomic_write_json: write to a .tmp sibling, fsync, then os.replace.
This eliminates the truncated-write window that caused supervisors
reading mid-write files to see empty/partial JSON.
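
The write-to-temp, fsync, replace sequence described above is a standard pattern; a minimal sketch follows. The function name matches the commit message, but the body is an assumption about its shape, not a copy of the observer's code.

```python
import json
import os


def atomic_write_json(path: str, data) -> None:
    """Write JSON so readers see either the old file or the new one, never a partial."""
    tmp_path = path + ".tmp"  # sibling file, so it lives on the same filesystem
    with open(tmp_path, "w", encoding="utf-8") as fh:
        json.dump(data, fh)
        fh.flush()
        os.fsync(fh.fileno())  # force the bytes to disk before the rename
    os.replace(tmp_path, path)  # atomic rename over the destination
```

`os.replace` only swaps the name once the temporary file is complete, so a concurrent reader never opens a half-written file.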

Also adds auto-resolution of phantom active incidents: if a service
reports status=healthy and its incident's last_occurrence is >30 min old,
the incident is resolved in _prune_stale_world. This clears false active
incidents accumulated from previous race-condition reads.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-03 12:26:49 +02:00
Oskar Kapala b40b832159 Fix ghost service keys from hash-prefixed Docker container names
node-agent: use com.docker.compose.service label as canonical name
- Add _canonical_container_name() method: prefers compose label,
  falls back to hash-prefix-stripped c.name
- Replace bare c.name usage in check_containers()
- Skip 'created'-state containers (Docker stale-state artifacts)

observer: prune hash-prefixed ghost keys in _prune_stale_world()
- Each reconcile cycle removes service keys matching <node>/<12hex>_<name>
- Acts as safety net for entries already in services.json + future slippage
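
A sketch of both halves of this fix, under the assumption that the hash prefix is 12 lowercase hex characters followed by an underscore. The function names are simplified stand-ins for the methods named above, and labels are passed as a plain dict rather than a Docker SDK container object.

```python
import re

COMPOSE_SERVICE_LABEL = "com.docker.compose.service"
HASH_PREFIX = re.compile(r"^[0-9a-f]{12}_")
GHOST_KEY = re.compile(r"^[^/]+/[0-9a-f]{12}_.+$")  # <node>/<12hex>_<name>


def canonical_container_name(labels: dict, name: str) -> str:
    """Prefer the compose service label; otherwise strip a hash prefix from the name."""
    label = labels.get(COMPOSE_SERVICE_LABEL)
    if label:
        return label
    return HASH_PREFIX.sub("", name)


def prune_ghost_keys(services: dict) -> dict:
    """Drop service keys that still carry a hash-prefixed container name."""
    return {k: v for k, v in services.items() if not GHOST_KEY.match(k)}
```

The node-agent side stops new ghost keys from being emitted, while the observer-side prune removes the ones already persisted.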

control-plane/docker-compose.yml already has explicit container_name on
all four services — no change needed there.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 15:41:13 +02:00
Oskar Kapala 28e9534765 observer: service_healthy resolves active incidents
service_healthy is a positive health confirmation — if the service had
an active incident (e.g. from earlier service_unhealthy events), that
incident should be resolved when the service is confirmed healthy.

Previously only service_recovered resolved incidents; service_healthy
set status=healthy but left incidents open, keeping status='degraded'.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 15:20:19 +02:00
Oskar Kapala 4e8968f9c7 Fix service health tracking: emit service_healthy, control-plane endpoint check, cleanup checkpoint migration
- node_agent: emit service_healthy for all running managed containers so
  observer populates services.json (previously empty → supervisor flooded
  action queue with missing_service redeploys for healthy services)
- node_agent: VPS-only _check_control_plane_health() probes the HTTP
  endpoint to emit service_healthy/unhealthy for the 'control-plane' logical
  service (multi-container stack, container names don't match service name)
- node_agent: fix _cleanup_control_plane_fs() to read new node_checkpoints
  format from observer checkpoint (was reading old last_processed_file key,
  always found nothing, never cleaned up old events)
- observer: handle service_healthy event type → sets service status healthy
  without resolving incidents (unlike service_recovered which also resolves)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 14:49:56 +02:00
Oskar Kapala f4a8db93e4 fix(observer): per-node-directory checkpoints replace single global checkpoint
The old mechanism tracked a single 'last_processed_file' and used sorted
filename order to find new events.  Remote nodes ship events into
subdirectories (events/piha/, events/chelsty-infra/) that sort
alphabetically BEFORE the VPS directory (events/vps/).  Once the
checkpoint pointed to a vps/ file, all piha/ and chelsty-infra/ events
were silently skipped forever.

New mechanism:
- node_checkpoints: {node_dir: last_processed_path}
- Each node directory has its own independent cursor
- New events = files whose path > that node's checkpoint
- Backward-compatible: old 'last_processed_file' is migrated by extracting
  the node dir from the path on first load
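
The difference between the two mechanisms shows up in a few lines. This is an illustrative sketch with made-up file names; the checkpoint keys `node_checkpoints` and `last_processed_file` come from the message above, everything else is an assumption.

```python
import posixpath


def new_events_global(files, last_processed_file):
    """Old mechanism: one cursor over the globally sorted file list."""
    return [f for f in sorted(files) if f > last_processed_file]


def new_events_per_node(files, node_checkpoints):
    """New mechanism: each node directory is compared against its own cursor."""
    fresh = []
    for f in sorted(files):
        cursor = node_checkpoints.get(posixpath.dirname(f), "")
        if f > cursor:
            fresh.append(f)
    return fresh


def migrate(checkpoint: dict) -> dict:
    """Convert the old single-cursor format by extracting the node dir from the path."""
    if "node_checkpoints" in checkpoint:
        return checkpoint
    old = checkpoint.get("last_processed_file")
    return {"node_checkpoints": {posixpath.dirname(old): old} if old else {}}


files = ["events/piha/0002.json", "events/vps/0001.json", "events/vps/0002.json"]
print(new_events_global(files, "events/vps/0001.json"))
checkpoints = migrate({"last_processed_file": "events/vps/0001.json"})["node_checkpoints"]
print(new_events_per_node(files, checkpoints))
```

With the global cursor the piha event is never returned because its path sorts before the checkpoint; with per-node cursors it is picked up alongside the new vps event.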

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 14:16:58 +02:00
Oskar Kapala 96bf32614f fix(observer+operator-ui): fix stale world state, dict→list API, event time filter
Root cause of stale data:
- node_agent.py falls back to socket.gethostname() when NODE_NAME is unset.
  Inside a Docker container this returns the 12-char container ID (e.g.
  'be17cb6eb0f6'), not the host name.  Observer ingested those events and
  created ghost entries in world/nodes.json that never expired.

observer.py:
- _prune_stale_world(): removes node/service/incident entries for nodes absent
  from topology inventory; called on every run_once() cycle (both new-events
  and idle paths).  Resolved incidents older than 7 days are also aged out.
- _save_world(): now writes node_count and service_count to runtime-summary.json
  so the Dashboard's System Overview cards show real numbers instead of undefined.

operator_ui.py:
- current_nodes/services/deployments/incidents(): the observer stores world state
  as keyed dicts; the frontend calls .map() which requires an array.  All four
  functions now convert the dict to a properly-shaped list.  Each item has the
  fields the Nodes, Services, Topology, Deployments, and Correlation views expect
  (hostname, health, capabilities, desired_state, dependencies, etc.).
- current_incidents(): synthesises a human-readable 'message' field from node +
  service + trigger_type (observer does not store one; dashboard showed undefined).
- current_events(): adds a 24 h time filter (EVENTS_MAX_AGE_HOURS env var,
  default 24).  Without this, every event file ever written was returned,
  including events from ghost-node deploys.
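
The dict-to-list conversion and the age filter might be sketched like this. The record fields, the `timestamp` key, and both function names are assumptions for illustration; only `EVENTS_MAX_AGE_HOURS` and its default of 24 come from the message above.

```python
import os


def world_dict_to_list(world: dict, key_field: str) -> list:
    """Turn {key: record} into [{key_field: key, **record}] for array consumers."""
    return [{key_field: key, **record} for key, record in sorted(world.items())]


def recent_events(events: list, now: float, max_age_hours=None) -> list:
    """Keep only events newer than the cutoff (assumes Unix-second timestamps)."""
    if max_age_hours is None:
        max_age_hours = float(os.environ.get("EVENTS_MAX_AGE_HOURS", "24"))
    cutoff = now - max_age_hours * 3600
    return [e for e in events if e["timestamp"] >= cutoff]
```

Returning a list gives the frontend something it can call `.map()` on, and the cutoff keeps the events endpoint bounded as files accumulate.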

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 13:51:03 +02:00
Oskar Kapala 01b7758fe6 feat(node-agent): implement health monitor and safe cleanup policy
scripts/monitor/health-monitor.sh (new):
- Standalone bash health monitor: disk/RAM/CPU checks + docker container health
- Per-node-type cleanup policy enforced:
    lte_node  (chelsty-infra, chelsty-ha): NO cleanup, no docker ops
    sd_card   (piha, saturn): dangling images + containers, rate-limited once/24h
    ai_node   (solaria): dangling + containers + build cache, NEVER -a
    standard  (vps): dangling + containers + build cache + CP filesystem rotation
- VPS filesystem rotation: completed/failed actions >7d, deploy logs >30d,
  events >3d AND past observer checkpoint
- Emits structured JSON events (node_health, disk_pressure, high_memory, high_cpu,
  containers_not_running, healthcheck_failed)

services/node-agent/ (new):
- Python daemon (node_agent.py): same policy as bash script, Docker SDK
  for container checks and cleanup, /proc for system metrics
- Optional event shipping to VPS via rsync+SSH (VPS_EVENTS_HOST env var)
- Dockerfile: python:3.11-slim + openssh-client + rsync + docker>=6.0
- docker-compose.yml: mounts docker socket, /opt/homelab, repo read-only

observer.py:
- Handle node_health: update node status + disk/mem/cpu metrics, clear disk_pressure
- Handle disk_pressure: record severity on node, clear when healthy
- Handle high_memory / high_cpu: record pressure level for correlation

supervisor.py:
- Add NO_DISK_CLEANUP_NODES = {chelsty-infra, chelsty-ha}
- reconcile() step 3: generate disk_cleanup actions for nodes with high disk pressure
- _generate_disk_cleanup_recommendation(): stable ID disk-cleanup-{node},
  checks all active states, risk=guarded (operator approval required)

executor.py:
- Handle disk_cleanup action type via _execute_disk_cleanup()
- Commands come from action payload; safety gate rejects any command touching
  /opt/homelab/data/, /opt/homelab/config/, /opt/homelab/state/, or rm -rf /
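
A minimal sketch of such a safety gate. The protected paths are the ones listed above; the function name and the exact matching rules (substring match for paths, a regex for a bare `rm -rf /`) are assumptions rather than the executor's actual logic.

```python
import re

PROTECTED_PATHS = (
    "/opt/homelab/data/",
    "/opt/homelab/config/",
    "/opt/homelab/state/",
)
# Matches `rm -rf /` only when the slash is the whole argument.
RM_ROOT = re.compile(r"\brm\s+-rf\s+/(\s|$)")


def is_command_allowed(command: str) -> bool:
    """Reject commands that touch protected paths or try to remove the root."""
    if any(path in command for path in PROTECTED_PATHS):
        return False
    return RM_ROOT.search(command) is None


print(is_command_allowed("docker image prune -f"))
print(is_command_allowed("rm -rf /opt/homelab/data/events"))
```

Since the commands arrive in the action payload, checking them at execution time gives a last line of defence even after an operator has approved the action.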

hosts/*/services.yaml:
- Rename stability-agent -> node-agent on piha, vps, solaria, chelsty-infra
- Add node-agent to chelsty-ha (previously missing)
- Add cleanup policy notes to LTE node comments

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 13:15:06 +02:00
Oskar Kapala 7742bda245 feat(control-plane): add container_restart remediation
- observer: store trigger_type on incidents for supervisor routing
- supervisor: route containers_not_running/mqtt_unreachable to container_restart instead of redeploy
- supervisor: fix node alias normalization via NODE_ALIAS_MAP
- supervisor: fix pending action dedup (scan by content not filename)
- executor: implement container_restart via SSH docker restart with retry
- control-plane override: configure NODE_ALIAS_MAP for production
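
The routing and alias normalisation could be pictured as below. The two trigger types and the action names come from this message; treating `redeploy` as the fallback, the alias entries, and the function names are assumptions.

```python
RESTART_TRIGGERS = {"containers_not_running", "mqtt_unreachable"}


def normalize_node(name: str, alias_map: dict) -> str:
    """Map a node alias to its canonical name; unknown names pass through."""
    return alias_map.get(name, name)


def remediation_for(trigger_type: str) -> str:
    """Pick the remediation action from the incident's stored trigger_type."""
    if trigger_type in RESTART_TRIGGERS:
        return "container_restart"
    return "redeploy"  # assumed default for all other triggers


print(remediation_for("mqtt_unreachable"))
print(normalize_node("pi-ha", {"pi-ha": "piha"}))  # hypothetical alias entry
```

Storing `trigger_type` on the incident is what makes this routing possible: the supervisor can choose the lighter restart without re-deriving the cause.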

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 12:50:46 +02:00
oskar beb8b5cbaa fix: remove --pull always flag incompatible with docker-compose v1 2026-05-21 22:07:49 +02:00
oskar 898deda05f fix: deploy-frigate.sh use docker-compose v1 for chelsty-infra 2026-05-21 22:05:43 +02:00
oskar f34399a30d feat: add Frigate NVR deployment for chelsty-infra
VAAPI decode via Intel UHD 630, CPU detection, 2x Reolink RLC-540
placeholders. MQTT to local mosquitto (127.0.0.1), 7-day recording
retention. Secrets in /opt/homelab/config/frigate/frigate.env on node.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 18:19:45 +02:00
oskar b02c8bb50e fix(deploy): inventory-aware orchestration and correct override paths
- orchestrate-deploy.sh: read nodes from inventory/topology.yaml instead of hardcoded list
- orchestrate-deploy.sh: LTE nodes (chelsty-infra, chelsty-ha) use ConnectTimeout=30, non-fatal on failure
- deploy-node.sh: service discovery falls back to services.yaml if no services.txt
- deploy-node.sh: override path corrected to hosts/<node>/runtime/<service>/

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-20 14:50:01 +02:00
oskar f65698925e Fix control plane SSH deploy TTY 2026-05-18 21:41:47 +02:00
oskar 9f20dcae05 Add control plane deploy script and fix UI healthcheck 2026-05-18 21:34:57 +02:00
oskar c299a2cb85 Fix agent fleet verification via Redis container 2026-05-17 23:00:51 +02:00
oskar b129f03837 Fix stability agent fleet deploy scripts 2026-05-17 21:09:06 +02:00
oskar b7faac00c5 Add executable stability agent fleet deploy scripts 2026-05-17 17:32:10 +02:00
oskar 8f305ba3df Merge VPS control plane deployment and observer runtime 2026-05-17 17:30:04 +02:00
oskar c9ddfa9ac1 Roll out stability agent to homelab nodes 2026-05-17 15:54:19 +02:00
Oskar Kapala 533b8e846d Add heartbeat updates and improve health checks in control-plane components 2026-05-12 20:59:46 +02:00
Oskar Kapala 2029457f57 Implement VPS control-plane deployment profile 2026-05-12 20:19:05 +02:00
Oskar Kapala 8f5b905015 Implement observer runtime world synthesis engine 2026-05-12 14:07:03 +02:00
Oskar Kapala 431d777989 Implement filesystem-first runtime event system 2026-05-12 13:38:25 +02:00
Oskar Kapala 0eeb0ac600 Implement reproducible node onboarding 2026-05-12 13:18:00 +02:00
Oskar Kapala 81bce00bf3 Bootstrap CHELSTY runtime stack 2026-05-11 21:36:10 +02:00
Oskar Kapala b524a3886a Harden deployment runtime framework 2026-05-11 21:20:13 +02:00
Oskar Kapala 5947ddd03d Implement staged deployment runtime 2026-05-11 21:04:24 +02:00
Oskar Kapala bbdbdb8321 Add node capability model 2026-05-11 20:46:50 +02:00
Oskar Kapala d0540f7eb8 Add infrastructure standards and deployment conventions 2026-05-07 21:16:03 +02:00
Oskar Kapala 2b5d59ae27 Initial homelab workspace structure 2026-05-07 20:17:27 +02:00