homelab-codex-ws/.claude/skills/node-onboarding/SKILL.md

name: node-onboarding
description: Use when the user wants to add or onboard a new node to homelab-codex (repo manifest, Tailscale mesh, node-agent, monitoring, and UI registration). Keywords: "nowy node", "dodaj node", "onboarding", "onboard node".
living_doc: true
maturity: partial

Living document — sections marked SCAFFOLD are stubs waiting for battle-testing on a real node. Promote to PROVEN after each step passes end-to-end. Do not treat SCAFFOLD sections as authoritative.

Trigger

User asks to onboard / add a new node. Load this skill before touching any onboarding script or node.yaml.


Workflow — one step at a time

preflight (read-only)
  └─ 00-access        [PROVEN]
       └─ 20-base     [PROVEN]
            └─ 30-node-agent   [PROVEN]
                 └─ 40-register     [WRITTEN — live pending]
                      └─ 50-verify  [WRITTEN — live pending]

Never skip ahead. Each step must exit 0 before the next begins.
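
The gating rule above can be sketched as a loop that stops at the first non-zero exit. This is a minimal illustration, not the actual onboard.sh logic: `run_steps` and the `step_*` functions are hypothetical stand-ins for the real step scripts.

```shell
#!/usr/bin/env bash
# Sketch only: run_steps and the step_* functions are hypothetical stand-ins,
# not the real onboard.sh internals.

run_steps() {
  local step
  for step in "$@"; do
    echo "==> ${step}"
    # A non-zero exit from any step stops the chain immediately.
    if ! "${step}"; then
      echo "step ${step} failed; later steps are not run" >&2
      return 1
    fi
  done
}

step_ok()    { return 0; }
step_fail()  { return 3; }
step_never() { echo "should not run"; }

# step_never is never reached because step_fail exits non-zero.
run_steps step_ok step_fail step_never || echo "halted"
```

Returning from the loop on the first failure is what makes a later resume from a named step meaningful: everything before the failed step is known to have passed.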


Invocation

# Full onboarding (all steps in order)
scripts/onboard/onboard.sh --node <name>

# Single step
scripts/onboard/onboard.sh --node <name> --step 00-access

# Resume from a step
scripts/onboard/onboard.sh --node <name> --from 10-bootstrap-runtime

# Dry-run — probes run for real; mutations are printed, not executed
scripts/onboard/onboard.sh --node <name> --dry-run

Step status table

| Step | File | Status | What it does |
| --- | --- | --- | --- |
| 00-preflight | steps/00-preflight.sh | SCAFFOLD | Read-only: arch, RAM, docker, swap, MM runtime → YAML snippet for node.yaml |
| 00-access | steps/00-access.sh | PROVEN | SSH key → first_contact, install Tailscale, tailscale up (interactive URL), verify over mesh |
| 10-bootstrap-runtime | steps/10-bootstrap-runtime.sh | SCAFFOLD | Create /opt/homelab/ layout, chown <ssh_user> |
| 20-base | steps/20-base.sh | PROVEN | swap→zram, /opt/homelab/ layout, event dir /opt/homelab/events/<node>/ |
| 20-install-docker | steps/20-install-docker.sh | SCAFFOLD | Install Docker Engine if docker_present=false; skip if already installed |
| 30-node-agent | steps/30-node-agent.sh | PROVEN | rsync base compose + override, docker compose up -d --build, verify container + events |
| 40-register | steps/40-register.sh | WRITTEN | Appends the node to inventory/topology.yaml and creates hosts/<node>/services.yaml; commits on the branch (no push) |
| 50-verify | steps/50-verify.sh | WRITTEN | SSH node: container+events; SSH VPS: restart observer + heartbeat poll + world/nodes.json |

node.yaml — key fields

name: LUSTRO                        # ALL CAPS
role: edge                          # edge | compute | infra
ssh_user: pi                        # existing user on the node
first_contact: pi@192.168.31.19     # LAN IP — NEVER .local (mDNS unreliable in automation)
tailscale:
  hostname: lustro                  # mesh name; switch to this after tailscale up
  ip:                               # fill after join
deploy_autonomy: true               # false → print manual instructions and stop
git_control: false                  # false → push-based from SATURN (edge nodes)
hardware:
  arch: arm64                       # filled by 00-preflight
  ram_mb: 4096                      # filled by 00-preflight
  swap:
    kind: zram                      # zram | file | none
  docker_present: true              # filled by 00-preflight
  mm_runtime: systemd:magicmirror.service   # filled by 00-preflight; none if absent
services:
  node-agent:
    runtime:
      engine: docker
      mem_limit: 256m               # mandatory on RAM-constrained hosts (≤4 GB)

preflight fills arch, ram_mb, docker_present, mm_runtime — do NOT guess these.
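
As a rough illustration of how read-only probes can produce these fields, the sketch below prints a YAML fragment from local system queries. It is not the real steps/00-preflight.sh: the function name `preflight_snippet` and the `uname -m` to arm64/amd64 mapping are assumptions, and the swap and mm_runtime probes are left out.

```shell
#!/usr/bin/env bash
# Sketch only: hypothetical read-only probes, not the real steps/00-preflight.sh.

preflight_snippet() {
  local arch ram_mb docker_present
  # Map kernel machine names to the arm64/amd64 style used in node.yaml
  # (assumed mapping; unknown machines are passed through unchanged).
  case "$(uname -m)" in
    aarch64|arm64) arch=arm64 ;;
    x86_64)        arch=amd64 ;;
    *)             arch="$(uname -m)" ;;
  esac
  # MemTotal is reported in kB; ram_mb stays empty where /proc/meminfo is absent.
  ram_mb="$(awk '/^MemTotal:/ {print int($2 / 1024)}' /proc/meminfo 2>/dev/null)"
  if command -v docker >/dev/null 2>&1; then
    docker_present=true
  else
    docker_present=false
  fi
  cat <<EOF
hardware:
  arch: ${arch}
  ram_mb: ${ram_mb:-unknown}
  docker_present: ${docker_present}
EOF
}

preflight_snippet
```

Nothing in the function writes to the host, so it is safe to run repeatedly and in dry-run mode.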

Full schema: scripts/onboard/README.md.


Operational rules (PROVEN)

PLAN-FIRST — before any mutation, show exactly what will touch the remote host. Always run --dry-run first; dry-run must print real commands (run() propagation).

Idempotency — every step is safe to re-run. Keys, Tailscale join, Docker install → skip if already done.
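
One common way to get this property for file edits is a check-before-append helper. lib/common.sh lists an ensure_line helper; the version below is only a guess at its shape (the argument order and the grep-based check are assumptions), shown to make the re-run behaviour concrete.

```shell
#!/usr/bin/env bash
# Sketch only: a hypothetical ensure_line; the real helper in lib/common.sh may differ.

ensure_line() {
  local line="$1" file="$2"
  # -x: match the whole line, -F: fixed string, -q: quiet.
  # Append only when no identical line exists (or the file is missing).
  grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

demo="$(mktemp)"
ensure_line "example-line" "$demo"
ensure_line "example-line" "$demo"   # second call is a no-op
cat "$demo"                          # the line appears once
rm -f "$demo"
```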

Isolation — do NOT touch existing services on the node (e.g. MagicMirror as systemd unit).

Worktree discipline — onboarding is a feature. Work in a task worktree (agent.sh new), never in the main checkout (~/homelab-codex-ws is deploy-only). See worktree-aware.


Gotchas (battle-tested)

| Problem | Fix |
| --- | --- |
| mDNS .local resolution fails | Always use the LAN IP in first_contact; .local is OK interactively, not in automation |
| uid=1000 collision on RPi OS | If pi already holds uid=1000 → USE that user, don't create oskar. node-agent 1000:1000 matches out of the box; creating a second uid=1000 breaks MM ownership |
| passwordless sudo not guaranteed | Verify sudo -n true exits 0 before any sudo-over-SSH step. RPi OS default may require a password; ssh without a TTY will hang |
| swap file on SD card | Use zram, not a swap file (SD wear). Add migration to 10-bootstrap-runtime |
| RAM ≤4 GB with heavy app | mem_limit on node-agent is mandatory (same OOM profile as VPS) |
| Docker already installed | Check docker_present from preflight; skip the install step if true |
| SSH known-hosts warning in parsed output | Pass -o LogLevel=ERROR to SSH for new mesh hosts |
| yaml_get drops value prefix after : | Strip only up to the first colon (non-greedy): 's/^[[:space:]]*[^:]*:[[:space:]]*//' (handles systemd:unit correctly) |
| yaml_get keeps inline YAML comments | Strip with 's/[[:space:]]\+#.*$//' after extraction (requires ≥1 space before #) |
| dry-run stops at orchestrator level | run() wrapper + export DRY_RUN=1 propagated to all step scripts; probes execute for real |
| rsync push Permission denied to VPS events/ | ssh-user must be in the group that owns /opt/homelab/events/ (aerbot/1000 on VPS). Symptom: silent WARNING in node-agent log, 292k files backlog, panel stale. Fix: usermod -aG 1000 <user> on VPS + re-login |
| observer not seeing new node after topology.yaml edit | _load_inventory() runs once at __init__. After git pull on VPS (bind-mount is live), docker restart control-plane-observer is required; no redeploy needed |
| worktree on wrong branch | Always check git branch --show-current on entry. One task = one worktree (agent.sh new). Never manually git checkout between task branches in the same worktree |
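
The two yaml_get fixes in the table can be combined as shown below. This is a sketch under assumptions, not the helper from lib/common.sh: the signature `yaml_get <key> <file>` is hypothetical, it only handles flat `key: value` lines, and the comment-stripping pattern is written with the POSIX interval `\{1,\}` as a portable spelling of `\+`.

```shell
#!/usr/bin/env bash
# Sketch only: a hypothetical yaml_get for flat "key: value" lines.
# The real helper in lib/common.sh may differ.

yaml_get() {
  local key="$1" file="$2"
  # 1st expression: drop everything up to and including the FIRST colon,
  #                 so values such as systemd:unit keep their own colon.
  # 2nd expression: drop an inline comment that is preceded by whitespace.
  grep -E "^[[:space:]]*${key}:" "$file" | head -n 1 | sed \
    -e 's/^[[:space:]]*[^:]*:[[:space:]]*//' \
    -e 's/[[:space:]]\{1,\}#.*$//'
}

sample="$(mktemp)"
cat > "$sample" <<'EOF'
ssh_user: pi                        # existing user on the node
hardware:
  mm_runtime: systemd:magicmirror.service   # filled by 00-preflight
EOF

yaml_get ssh_user "$sample"     # prints: pi
yaml_get mm_runtime "$sample"   # prints: systemd:magicmirror.service
rm -f "$sample"
```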

lib/ reference

lib/common.sh  — log/warn/die/step/dryrun, run(), yaml_get, ensure_line, git() wrapper
lib/remote.sh  — rrun/rcopy/rsync_dir/rcheck (SSH wrappers; uses ONBOARD_SSH_USER / ONBOARD_HOST)

run() contract: in dry-run mode prints intent without executing; probes (ssh BatchMode=yes, command -v, status queries) always execute so the plan is realistic.
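
A minimal sketch of that contract follows. It assumes the DRY_RUN=1 convention mentioned in the gotchas; the output prefix `[dry-run]` and the probe helper `have_cmd` are illustrative names, and the real run() in lib/common.sh may be structured differently.

```shell
#!/usr/bin/env bash
# Sketch only: an illustrative run() wrapper; the real lib/common.sh may differ.

run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Print the command that would have been executed, then succeed.
    printf '[dry-run] %s\n' "$*"
    return 0
  fi
  "$@"
}

# Probes do not go through run(), so they execute even in dry-run mode.
have_cmd() { command -v "$1" >/dev/null 2>&1; }

export DRY_RUN=1   # exported so child step scripts inherit the mode
if have_cmd sh; then
  run mkdir -p /opt/homelab/events/demo   # printed, not executed
fi
```

Exporting the variable, rather than setting it only in the orchestrator, is what lets each step script make the same decision without extra arguments.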


Definition of Done

A node is fully onboarded when:

  1. 50-verify exits 0 — event visible in control-plane UI and Telegram alert path confirmed.
  2. hosts/<node>/node.yaml committed with all preflight fields filled.
  3. hosts/<node>/capabilities.yaml present and accurate.
  4. Node appears in inventory/topology.yaml.
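
Items 2 to 4 of this list lend themselves to a quick file-level check. The helper below is a hypothetical sketch (`check_done` is not part of the repository as far as this document shows): it only tests that the files exist and that the node name occurs in topology.yaml, and it does not run 50-verify or confirm that anything is committed or accurate. The one-line topology.yaml in the demo is made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: check_done is a hypothetical helper covering items 2-4;
# it does not run 50-verify and does not inspect git state.

check_done() {
  local repo="$1" node="$2" rc=0 f
  for f in "hosts/${node}/node.yaml" "hosts/${node}/capabilities.yaml"; do
    [ -f "${repo}/${f}" ] || { echo "missing: ${f}" >&2; rc=1; }
  done
  # -w: match the node name as a whole word.
  grep -qw -- "${node}" "${repo}/inventory/topology.yaml" 2>/dev/null \
    || { echo "node ${node} not found in inventory/topology.yaml" >&2; rc=1; }
  return "${rc}"
}

# Demo against a throwaway directory tree (contents are placeholders).
repo="$(mktemp -d)"
mkdir -p "${repo}/hosts/LUSTRO" "${repo}/inventory"
touch "${repo}/hosts/LUSTRO/node.yaml" "${repo}/hosts/LUSTRO/capabilities.yaml"
echo "  - name: LUSTRO" > "${repo}/inventory/topology.yaml"
check_done "${repo}" LUSTRO && echo "file checks passed"
rm -rf "${repo}"
```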