ECC-format skill for the node onboarding workflow. Covers full step sequence, operational rules, node.yaml key fields, gotchas from LUSTRO session, and Definition of Done. Marked as living doc — SCAFFOLD sections to be promoted to PROVEN as steps land on real nodes. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
| name | description | living_doc | maturity |
|---|---|---|---|
| node-onboarding | Use when the user wants to add or onboard a new node to homelab-codex — repo manifest, Tailscale mesh, node-agent, monitoring, and UI registration. Keywords: "nowy node", "dodaj node", "onboarding", "onboard node". | true | partial |
Living document — sections marked SCAFFOLD are stubs waiting for battle-testing on a real node. Promote to PROVEN after each step passes end-to-end. Do not treat SCAFFOLD sections as authoritative.
Trigger
User asks to onboard / add a new node. Load this skill before touching any onboarding script or node.yaml.
Workflow — one step at a time
preflight (read-only)
└─ 00-access [PROVEN]
└─ base [SCAFFOLD]
└─ node-agent [SCAFFOLD]
└─ register [SCAFFOLD]
└─ verify(50) [SCAFFOLD]
Never skip ahead. Each step must exit 0 before the next begins.
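The "exit 0 before the next begins" rule can be sketched as a small loop. This is an illustration only, not the actual onboard.sh: the function name run_steps and its argument layout are hypothetical, and the steps directory passed in is an assumption.

```shell
# Hypothetical sketch: run step scripts in order and stop at the first failure.
# Usage: run_steps <steps_dir> <step> [<step> ...]
run_steps() {
  local steps_dir="$1"
  shift
  local step rc
  for step in "$@"; do
    if [ ! -f "${steps_dir}/${step}.sh" ]; then
      echo "missing step script: ${step}" >&2
      return 2
    fi
    echo "==> ${step}"
    rc=0
    bash "${steps_dir}/${step}.sh" || rc=$?
    if [ "$rc" -ne 0 ]; then
      # Never skip ahead: later steps are not attempted.
      echo "step ${step} exited ${rc}; stopping" >&2
      return "$rc"
    fi
  done
}
```

A call such as `run_steps scripts/onboard/steps 00-preflight 00-access` would then halt at the first step that fails and return that step's exit code (the directory path here is assumed, not confirmed by this document).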
Invocation
# Full onboarding (all steps in order)
scripts/onboard/onboard.sh --node <name>
# Single step
scripts/onboard/onboard.sh --node <name> --step 00-access
# Resume from a step
scripts/onboard/onboard.sh --node <name> --from 10-bootstrap-runtime
# Dry-run — probes run for real; mutations are printed, not executed
scripts/onboard/onboard.sh --node <name> --dry-run
Step status table
| Step | File | Status | What it does |
|---|---|---|---|
| 00-preflight | steps/00-preflight.sh | SCAFFOLD | Read-only: arch, RAM, docker, swap, MM runtime → YAML snippet for node.yaml |
| 00-access | steps/00-access.sh | PROVEN | SSH key → first_contact, install Tailscale, tailscale up (interactive URL), verify over mesh |
| 10-bootstrap-runtime | steps/10-bootstrap-runtime.sh | SCAFFOLD | Create /opt/homelab/ layout, chown <ssh_user> |
| 20-install-docker | steps/20-install-docker.sh | SCAFFOLD | Install Docker Engine if docker_present=false; skip if already installed |
| 40-deploy-node-agent | steps/40-deploy-node-agent.sh | SCAFFOLD | Deploy node-agent container; user 1000:1000; mem_limit from node.yaml |
| 50-verify | steps/50-verify.sh | SCAFFOLD | End-to-end smoke: event reaches control plane, visible in UI, Telegram alert path |
node.yaml — key fields
name: LUSTRO                       # ALL CAPS
role: edge                         # edge | compute | infra
ssh_user: pi                       # existing user on the node
first_contact: pi@192.168.31.19    # LAN IP; NEVER .local (mDNS unreliable in automation)
tailscale:
  hostname: lustro                 # mesh name; switch to this after tailscale up
  ip:                              # fill after join
deploy_autonomy: true              # false → print manual instructions and stop
git_control: false                 # false → push-based from SATURN (edge nodes)
hardware:
  arch: arm64                      # filled by 00-preflight
  ram_mb: 4096                     # filled by 00-preflight
  swap:
    kind: zram                     # zram | file | none
  docker_present: true             # filled by 00-preflight
  mm_runtime: systemd:magicmirror.service   # filled by 00-preflight; none if absent
services:
  node-agent:
    runtime:
      engine: docker
      mem_limit: 256m              # mandatory on RAM-constrained hosts (≤4 GB)
preflight fills arch, ram_mb, docker_present, mm_runtime — do NOT guess these.
Full schema: scripts/onboard/README.md.
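The first_contact rule (LAN IP, never .local) is easy to check before any SSH attempt. A minimal sketch follows; check_first_contact is a hypothetical helper, not something the onboarding scripts are known to ship.

```shell
# Hypothetical guard: reject mDNS names in a first_contact value (user@host).
check_first_contact() {
  local value="$1"
  local host="${value#*@}"   # strip the "user@" part, if present
  case "$host" in
    '')
      echo "first_contact has no host part" >&2
      return 1
      ;;
    *.local|*.local.)
      echo "first_contact must be a LAN IP, got mDNS name: ${host}" >&2
      return 1
      ;;
  esac
  return 0
}
```

Running it against the value read from node.yaml before 00-access makes the mDNS mistake fail fast instead of surfacing as an SSH timeout.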
Operational rules (PROVEN)
PLAN-FIRST — before any mutation, show exactly what will touch the remote host.
Always run --dry-run first; dry-run must print real commands (run() propagation).
Idempotency — every step is safe to re-run. Keys, Tailscale join, Docker install → skip if already done.
Isolation — do NOT touch existing services on the node (e.g. MagicMirror as systemd unit).
Worktree discipline — onboarding is a feature. Work in a task worktree (agent.sh new), never in the main checkout (~/homelab-codex-ws is deploy-only). See worktree-aware.
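The idempotency rule above usually reduces to "check, then mutate only if needed". As one illustration, here is a sketch of what a helper like ensure_line could look like. The name appears in lib/common.sh, but this body and its argument order are assumptions rather than the real implementation.

```shell
# Hypothetical ensure_line <line> <file>: append the line only when an
# identical line is not already present, so re-running changes nothing.
ensure_line() {
  local line="$1" file="$2"
  touch "$file"
  # -x: whole-line match, -F: fixed string (no regex), -q: quiet
  grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}
```

Calling it twice with the same arguments leaves exactly one copy of the line in the file, which is the property every onboarding step is expected to have.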
Gotchas (battle-tested)
| Problem | Fix |
|---|---|
| mDNS .local resolve fail | Always use LAN IP in first_contact; .local OK interactively, not in automation |
| uid=1000 collision on RPi OS | If pi already holds uid=1000 → USE that user, don't create oskar. node-agent 1000:1000 matches out-of-box; creating a second uid=1000 breaks MM ownership |
| passwordless sudo not guaranteed | Verify sudo -n true exits 0 before any sudo-over-SSH step. RPi OS default may require password; ssh without TTY will hang |
| swap file on SD card | Use zram, not a swap file (SD wear). Add migration to 10-bootstrap-runtime |
| RAM ≤4 GB with heavy app | mem_limit on node-agent is mandatory — same OOM profile as VPS |
| Docker already installed | Check docker_present from preflight; skip install step if true |
| SSH known-hosts warning in parsed output | Pass -o LogLevel=ERROR to SSH for new mesh hosts |
| yaml_get drops value prefix after : | Strip only up to the first colon: s/^[[:space:]]*[^:]*:[[:space:]]*// (keeps systemd:unit intact) |
| yaml_get keeps inline YAML comments | Strip with s/[[:space:]]\+#.*$// after extraction (requires ≥1 space before #) |
| dry-run stops at orchestrator level | run() wrapper + export DRY_RUN=1 propagated to all step scripts; probes execute for real |
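The two yaml_get fixes in the table can be combined into one small function. This is a sketch under assumptions, not the lib/common.sh implementation: the argument order is hypothetical, it treats keys as flat (first match wins, nesting is ignored), and the comment strip is written as [[:space:]][[:space:]]* because \+ is a GNU sed extension. The third expression is an extra guard not mentioned in the source: for an empty value followed by a comment, the first substitution also consumes the spaces before #, so the comment would otherwise be returned as the value.

```shell
# Hypothetical yaml_get <file> <key>: flat-key lookup, first match wins.
yaml_get() {
  local file="$1" key="$2"
  grep -E "^[[:space:]]*${key}:" "$file" | head -n 1 | sed \
    -e 's/^[[:space:]]*[^:]*:[[:space:]]*//' \
    -e 's/[[:space:]][[:space:]]*#.*$//' \
    -e 's/^#.*$//' \
    -e 's/[[:space:]]*$//'
  # 1st -e: drop "key:" up to the FIRST colon only (systemd:unit survives)
  # 2nd -e: drop an inline comment preceded by at least one space
  # 3rd -e: drop a value that is nothing but a comment (assumed extra guard)
  # 4th -e: trim trailing whitespace
}
```

With a line such as `mm_runtime: systemd:magicmirror.service   # filled by 00-preflight`, the function prints `systemd:magicmirror.service`.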
lib/ reference
lib/common.sh — log/warn/die/step/dryrun, run(), yaml_get, ensure_line, git() wrapper
lib/remote.sh — rrun/rcopy/rsync_dir/rcheck (SSH wrappers; uses ONBOARD_SSH_USER / ONBOARD_HOST)
run() contract: in dry-run mode prints intent without executing; probes (ssh BatchMode=yes, command -v, status queries) always execute so the plan is realistic.
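The run() contract can be illustrated with a few lines of shell. This is a sketch only: the real wrapper in lib/common.sh may format its output differently, and have_cmd is a hypothetical probe helper added here to show the contrast.

```shell
# Hypothetical sketch of the dry-run wrapper.
# Mutations go through run(); with DRY_RUN=1 they are printed, not executed.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '[dry-run] %s\n' "$*"
    return 0
  fi
  "$@"
}

# Probes do NOT go through run(), so they execute even in dry-run mode
# and the printed plan reflects the real state of the host.
have_cmd() {
  command -v "$1" >/dev/null 2>&1
}
```

Because step scripts run as child processes, DRY_RUN has to be exported by the orchestrator; a plain shell variable would stop at the orchestrator level, which is the gotcha listed in the table above.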
Definition of Done
A node is fully onboarded when:
- 50-verify exits 0: event visible in control-plane UI and Telegram alert path confirmed.
- hosts/<node>/node.yaml committed with all preflight fields filled.
- hosts/<node>/capabilities.yaml present and accurate.
- Node appears in inventory/topology.yaml.
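The repo-side items of this checklist can be checked mechanically. The sketch below is hypothetical: dod_repo_check is not part of the onboarding scripts, and the topology check is a plain substring search because the layout of inventory/topology.yaml is not described here. It does not cover the 50-verify result, whether the files are committed, or the accuracy of capabilities.yaml.

```shell
# Hypothetical check of the repo-side Definition of Done items.
# Usage: dod_repo_check <repo_root> <node>
dod_repo_check() {
  local repo="$1" node="$2" rc=0 f
  for f in "hosts/${node}/node.yaml" \
           "hosts/${node}/capabilities.yaml" \
           "inventory/topology.yaml"; do
    if [ ! -s "${repo}/${f}" ]; then
      echo "missing or empty: ${f}" >&2
      rc=1
    fi
  done
  # Assumption: the node name appears literally somewhere in topology.yaml.
  if [ -f "${repo}/inventory/topology.yaml" ] &&
     ! grep -qF -- "$node" "${repo}/inventory/topology.yaml"; then
    echo "node ${node} not found in inventory/topology.yaml" >&2
    rc=1
  fi
  return "$rc"
}
```

A non-zero return lists every missing item at once rather than stopping at the first, which suits a final checklist better than a fail-fast step.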