| name | description | living_doc | maturity |
|---|---|---|---|
| node-onboarding | Use when the user wants to add or onboard a new node to homelab-codex — repo manifest, Tailscale mesh, node-agent, monitoring, and UI registration. Keywords: "nowy node", "dodaj node", "onboarding", "onboard node". | true | partial |
Living document — sections marked SCAFFOLD are stubs waiting for battle-testing on a real node. Promote to PROVEN after each step passes end-to-end. Do not treat SCAFFOLD sections as authoritative.
Trigger
User asks to onboard / add a new node. Load this skill before touching any onboarding script or node.yaml.
Workflow — one step at a time
preflight (read-only)
└─ 00-access [PROVEN]
└─ 20-base [PROVEN]
└─ 30-node-agent [PROVEN]
└─ 40-register [WRITTEN — live pending]
└─ 50-verify [WRITTEN — live pending]
Never skip ahead. Each step must exit 0 before the next begins.
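The "exit 0 before the next begins" rule can be sketched as a loop that stops at the first failing step. This is a minimal illustration, not the real `onboard.sh`: `run_steps` and the step-as-function convention are hypothetical names introduced here.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: run steps in order and stop at the first failure.
# run_steps and the stand-in steps below are illustrative, not part of onboard.sh.

run_steps() {
  local step
  for step in "$@"; do
    echo "==> ${step}"
    # A non-zero exit from any step aborts the whole sequence.
    if ! "$step"; then
      echo "step ${step} failed; stopping" >&2
      return 1
    fi
  done
}

# Stand-in steps for demonstration only.
step_ok()    { return 0; }
step_fail()  { return 3; }
step_never() { echo "should not run"; }
```

Calling `run_steps step_ok step_fail step_never` announces the first two steps, reports the failure on stderr, and never reaches `step_never`.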
Invocation
# Full onboarding (all steps in order)
scripts/onboard/onboard.sh --node <name>
# Single step
scripts/onboard/onboard.sh --node <name> --step 00-access
# Resume from a step
scripts/onboard/onboard.sh --node <name> --from 10-bootstrap-runtime
# Dry-run — probes run for real; mutations are printed, not executed
scripts/onboard/onboard.sh --node <name> --dry-run
Step status table
| Step | File | Status | What it does |
|---|---|---|---|
| 00-preflight | steps/00-preflight.sh | SCAFFOLD | Read-only: arch, RAM, docker, swap, MM runtime → YAML snippet for node.yaml |
| 00-access | steps/00-access.sh | PROVEN | SSH key → first_contact, install Tailscale, tailscale up (interactive URL), verify over mesh |
| 10-bootstrap-runtime | steps/10-bootstrap-runtime.sh | SCAFFOLD | Create /opt/homelab/ layout, chown <ssh_user> |
| 20-base | steps/20-base.sh | PROVEN | swap→zram, /opt/homelab/ layout, event dir /opt/homelab/events/<node>/ |
| 20-install-docker | steps/20-install-docker.sh | SCAFFOLD | Install Docker Engine if docker_present=false; skip if already installed |
| 30-node-agent | steps/30-node-agent.sh | PROVEN | rsync base compose + override, docker compose up -d --build, verify container + events |
| 40-register | steps/40-register.sh | WRITTEN | Appends the node to inventory/topology.yaml and creates hosts/<node>/services.yaml; commits on the branch (no push) |
| 50-verify | steps/50-verify.sh | WRITTEN | SSH node: container+events; SSH VPS: restart observer + heartbeat poll + world/nodes.json |
node.yaml — key fields
name: LUSTRO # ALL CAPS
role: edge # edge | compute | infra
ssh_user: pi # existing user on the node
first_contact: pi@192.168.31.19 # LAN IP — NEVER .local (mDNS unreliable in automation)
tailscale:
  hostname: lustro # mesh name; switch to this after tailscale up
  ip: # fill after join
deploy_autonomy: true # false → print manual instructions and stop
git_control: false # false → push-based from SATURN (edge nodes)
hardware:
  arch: arm64 # filled by 00-preflight
  ram_mb: 4096 # filled by 00-preflight
  swap:
    kind: zram # zram | file | none
  docker_present: true # filled by 00-preflight
  mm_runtime: systemd:magicmirror.service # filled by 00-preflight; none if absent
services:
  node-agent:
    runtime:
      engine: docker
      mem_limit: 256m # mandatory on RAM-constrained hosts (≤4 GB)
preflight fills arch, ram_mb, docker_present, mm_runtime — do NOT guess these.
Full schema: scripts/onboard/README.md.
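The `first_contact` rule (LAN IP, never `.local`) is easy to enforce before any SSH attempt. The sketch below assumes a hypothetical helper named `check_first_contact`; nothing in the source confirms that the onboarding scripts perform this check.

```shell
#!/usr/bin/env bash
# Hypothetical helper: refuse mDNS names in first_contact.
# check_first_contact is an illustrative name, not part of lib/common.sh.

check_first_contact() {
  local target="$1"
  local host="${target#*@}"   # drop an optional user@ prefix
  case "$host" in
    "")
      echo "first_contact is empty" >&2
      return 1
      ;;
    *.local)
      echo "first_contact must be a LAN IP, got mDNS name: ${host}" >&2
      return 1
      ;;
    *)
      return 0
      ;;
  esac
}
```

With this guard, `check_first_contact pi@192.168.31.19` succeeds, while a `.local` hostname is rejected with a message on stderr.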
Operational rules (PROVEN)
PLAN-FIRST — before any mutation, show exactly what will touch the remote host.
Always run --dry-run first; dry-run must print real commands (run() propagation).
Idempotency — every step is safe to re-run. Keys, Tailscale join, Docker install → skip if already done.
Isolation — do NOT touch existing services on the node (e.g. MagicMirror as systemd unit).
Worktree discipline — onboarding is a feature. Work in a task worktree (agent.sh new), never in the main checkout (~/homelab-codex-ws is deploy-only). See worktree-aware.
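The idempotency rule usually comes down to "check first, mutate only if needed". A minimal sketch of that pattern for appending a config line follows. The name `ensure_line_sketch` and its `<file> <line>` signature are assumptions; the real `ensure_line` in lib/common.sh may behave differently.

```shell
#!/usr/bin/env bash
# Sketch of an idempotent append. Assumed signature: ensure_line_sketch <file> <line>.
# This is NOT the real ensure_line from lib/common.sh.

ensure_line_sketch() {
  local file="$1" line="$2"
  # -x: match the whole line, -F: fixed string, -q: quiet.
  # A missing file counts as "line absent" and is created by the append.
  if grep -qxF -- "$line" "$file" 2>/dev/null; then
    return 0
  fi
  printf '%s\n' "$line" >> "$file"
}
```

Running the function twice with the same arguments leaves exactly one copy of the line in the file, so a re-run of the step is harmless.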
Gotchas (battle-tested)
| Problem | Fix |
|---|---|
| mDNS .local resolve fail | Always use LAN IP in first_contact; .local OK interactively, not in automation |
| uid=1000 collision on RPi OS | If pi already holds uid=1000 → USE that user, don't create oskar. node-agent 1000:1000 matches out-of-box; creating a second uid=1000 breaks MM ownership |
| passwordless sudo not guaranteed | Verify sudo -n true exits 0 before any sudo-over-SSH step. RPi OS default may require password; ssh without TTY will hang |
| swap file on SD card | Use zram, not a swap file (SD wear). Add migration to 10-bootstrap-runtime |
| RAM ≤4 GB with heavy app | mem_limit on node-agent is mandatory — same OOM profile as VPS |
| Docker already installed | Check docker_present from preflight; skip install step if true |
| SSH known-hosts warning in parsed output | Pass -o LogLevel=ERROR to SSH for new mesh hosts |
| yaml_get drops value prefix after : | Strip only up to the first colon: s/^[[:space:]]*[^:]*:[[:space:]]*// (handles systemd:unit correctly) |
| yaml_get keeps inline YAML comments | Strip with s/[[:space:]]\+#.*$// after extraction (requires ≥1 space before #) |
| dry-run stops at orchestrator level | run() wrapper + export DRY_RUN=1 propagated to all step scripts; probes execute for real |
| rsync push Permission denied to VPS events/ | ssh-user must be in the group that owns /opt/homelab/events/ (aerbot/1000 on VPS). Symptom: silent WARNING in node-agent log, 292k files backlog, panel stale. Fix: usermod -aG 1000 <user> on VPS + re-login |
| observer not seeing new node after topology.yaml edit | _load_inventory() runs once at __init__. After git pull on VPS (bind-mount is live), docker restart control-plane-observer is required — no redeploy needed |
| worktree on wrong branch | Always check git branch --show-current on entry. One task = one worktree (agent.sh new). Never manually git checkout between task branches in the same worktree |
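The two `yaml_get` fixes from the table can be combined into one flat-key lookup. This is a sketch under assumptions: `yaml_get_sketch` and its `<file> <key>` signature are hypothetical, the lookup ignores nesting and returns the first matching key, and the `\+` in the second expression needs GNU sed.

```shell
#!/usr/bin/env bash
# Sketch of a flat-key YAML lookup built from the two sed expressions above.
# Assumed signature: yaml_get_sketch <file> <key>; the real yaml_get may differ.

yaml_get_sketch() {
  local file="$1" key="$2"
  # First match only; leading indentation is allowed, nesting is ignored.
  grep -m1 -E "^[[:space:]]*${key}:" "$file" \
    | sed -e 's/^[[:space:]]*[^:]*:[[:space:]]*//' \
          -e 's/[[:space:]]\+#.*$//'
  # 1st expression: drop everything up to and including the FIRST colon,
  #                 so "systemd:magicmirror.service" keeps its own colon.
  # 2nd expression: drop an inline comment that is preceded by whitespace.
}
```

For a line such as `mm_runtime: systemd:magicmirror.service # filled by 00-preflight`, the function prints `systemd:magicmirror.service`. A key with no value and a trailing comment (for example `ip: # fill after join`) is not handled by this sketch, because the first expression already consumes the space before `#`.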
lib/ reference
lib/common.sh — log/warn/die/step/dryrun, run(), yaml_get, ensure_line, git() wrapper
lib/remote.sh — rrun/rcopy/rsync_dir/rcheck (SSH wrappers; uses ONBOARD_SSH_USER / ONBOARD_HOST)
run() contract: in dry-run mode prints intent without executing; probes (ssh BatchMode=yes, command -v, status queries) always execute so the plan is realistic.
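The contract above can be illustrated with a small wrapper. This is a sketch, not the code in lib/common.sh: `run_sketch` and `have_docker` are hypothetical names, and the `[dry-run]` prefix is an assumed output format.

```shell
#!/usr/bin/env bash
# Sketch of a dry-run wrapper: with DRY_RUN=1 mutations are printed, not executed.
# run_sketch and have_docker are illustrative names, not the real lib/common.sh API.

run_sketch() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '[dry-run] %s\n' "$*"
    return 0
  fi
  "$@"
}

# Probes do not go through the wrapper, so they execute even in dry-run mode
# and the printed plan reflects the real state of the host.
have_docker() {
  command -v docker >/dev/null 2>&1
}
```

A step script would then write `have_docker || run_sketch sudo apt-get install -y docker.io`: the probe always runs, while the install command is only printed when `DRY_RUN=1` is exported by the orchestrator. The install command here is a placeholder, not the one used by 20-install-docker.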
Definition of Done
A node is fully onboarded when:
- 50-verify exits 0: event visible in control-plane UI and Telegram alert path confirmed.
- hosts/<node>/node.yaml committed with all preflight fields filled.
- hosts/<node>/capabilities.yaml present and accurate.
- Node appears in inventory/topology.yaml.
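The file-level conditions can be checked locally before declaring a node done. The sketch below is hypothetical: `check_onboarded_files` is not part of the repo, it assumes the node name appears literally in inventory/topology.yaml, and it does not cover the 50-verify result or whether the file contents are accurate.

```shell
#!/usr/bin/env bash
# Hypothetical check of the file-level Definition of Done conditions only.
# Usage: check_onboarded_files <repo_root> <node>

check_onboarded_files() {
  local repo="$1" node="$2" missing=0
  local f
  for f in "hosts/${node}/node.yaml" "hosts/${node}/capabilities.yaml"; do
    if [ ! -f "${repo}/${f}" ]; then
      echo "missing: ${f}" >&2
      missing=1
    fi
  done
  # Assumption: the node name appears as a literal string in topology.yaml.
  if ! grep -qF -- "$node" "${repo}/inventory/topology.yaml" 2>/dev/null; then
    echo "not in inventory/topology.yaml: ${node}" >&2
    missing=1
  fi
  return "$missing"
}
```

The function returns 0 only when both host files exist and the topology file mentions the node; every unmet condition is listed on stderr so all gaps are visible in one run.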