fix(stability-agent): run as uid 1000 with docker group access

stability-agent had no USER instruction and no user: in compose, running as root and writing root-owned files to /opt/homelab bind-mount.

- Dockerfile: add useradd -m -u 1000 homelab + USER homelab
- docker-compose.yml: add user: "1000:1000" and group_add: ["999"] (GID 999 = docker group on VPS) to retain docker.sock:ro access

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
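Review note: the `group_add: ["999"]` entry only grants socket access on hosts where the docker group really has GID 999. A minimal sketch for checking that on another node, assuming `getent` is available there; both function names are illustrative and not part of this repo:

```bash
#!/usr/bin/env bash
# Hypothetical helpers, not part of the repo.

# Extract the GID (third field) from a getent-style group line,
# e.g. "docker:x:999:oskar".
gid_from_group_line() {
    printf '%s\n' "$1" | cut -d: -f3
}

# Return 0 if the host's docker group GID equals the expected value.
docker_gid_matches() {
    local expected="$1" line
    line=$(getent group docker) || return 1
    [ "$(gid_from_group_line "$line")" = "$expected" ]
}
```

On a node, `docker_gid_matches 999 || echo "adjust group_add" >&2` would flag a mismatch; with GNU coreutils, `stat -c %g /var/run/docker.sock` can cross-check the group that owns the socket.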
fix(node-agent): run as uid 1000 with docker group access
2026-06-03 18:20:54 +02:00 · 2026-06-03 18:20:31 +02:00 · 2026-06-03 18:19:58 +02:00 · 2026-06-03 18:04:38 +02:00 · 2026-06-03 18:02:50 +02:00 · 2026-06-03 17:41:35 +02:00
16 changed files with 1306 additions and 273 deletions
--- a/.claude/skills/deploy/SKILL.md
+++ b/.claude/skills/deploy/SKILL.md
@@ -0,0 +1,43 @@
 ---
 name: deploy
 description: Deploy, redeploy, or ship homelab services to a target node. Trigger on any request containing deploy / redeploy / wdróż / zredeployuj / ship for targets control-plane, vps, piha, solaria, or chelsty-infra.
 ---
 Always invoke `scripts/deploy/deploy.sh <target> [--dry-run] [--no-gate]` as the **sole entry point**.
 Never call `deploy-control-plane.sh`, `deploy-node.sh`, or `deploy-local.sh` directly.
 ## Targets
 | Target | What it deploys |
 |---|---|
 | `control-plane` | observer, supervisor, executor, operator-ui on VPS |
 | `vps` | all VPS GitOps services (node-agent, npm, outline, joplin, ai-cluster, …) |
 | `piha` | PIHA services (ha-diag-agent, node-agent, redis, …) |
 | `solaria` | SOLARIA compute services |
 | `chelsty-infra` | CHELSTY LTE edge node (30 s SSH timeout) |
 ## Invocation
 ```bash
 scripts/deploy/deploy.sh <target>            # full pipeline
 scripts/deploy/deploy.sh <target> --dry-run  # preflight + gate only
 scripts/deploy/deploy.sh <target> --no-gate  # emergency: bypass tests
 ```
 ## Exit Code Handling
 | Code | Meaning | Required action |
 |---|---|---|
 | 0 | Success | Report: target, commit hash, gate status, verify status, elapsed time |
 | 1 | Preflight failed | Fix the upstream issue (push commits, wake node, switch to master). Never bypass. |
 | 2 | Gate failed | Show exactly which test/build failed. Do **not** deploy. Fix the failure first. |
 | 3 | Execute failed | Show full deploy output. Ask user whether to investigate or rollback. |
 | 4 | Verify failed | Show docker ps output. Discuss rollback with the user. |
 | 5 | Sudo handoff | Print the exact manual command from stderr **verbatim** and stop. User must run it. |
 ## Rules
 - Never pass `--no-gate` unless the user explicitly requests emergency/bypass mode.
 - Never deploy uncommitted or unpushed code — preflight enforces this; do not help circumvent it.
 - Canonical branch is `master` — preflight enforces this.
 - For exit 5: reproduce the handoff command exactly as printed to stderr, then stop.
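Review note: the exit-code table above can be read as a simple dispatch. A sketch of that mapping, assuming a wrapper function named `deploy_exit_action` (hypothetical, not part of this PR); the action strings paraphrase the table rather than quote any script output:

```bash
#!/usr/bin/env bash
# Hypothetical helper, not part of the repo: maps a deploy.sh exit code
# to the required action from the Exit Code Handling table.
deploy_exit_action() {
    case "$1" in
        0) echo "report target, commit hash, gate status, verify status, elapsed time" ;;
        1) echo "fix the upstream issue; never bypass preflight" ;;
        2) echo "show the failing test/build; do not deploy" ;;
        3) echo "show full deploy output; ask: investigate or rollback" ;;
        4) echo "show docker ps output; discuss rollback" ;;
        5) echo "print the handoff command from stderr verbatim and stop" ;;
        *) echo "unknown exit code: $1" ;;
    esac
}

# Example use (commented out so that sourcing this file runs nothing):
#   scripts/deploy/deploy.sh vps --dry-run
#   deploy_exit_action "$?"
```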
--- a/.claude/skills/save-session/SKILL.md
+++ b/.claude/skills/save-session/SKILL.md
@@ -0,0 +1,65 @@
 ---
 name: save-session
 description: Save and record the current work session to docs/sessions/. Trigger ONLY on explicit "save session", "zapisz sesję", or "wrap up" — never invoke proactively between tasks.
 ---
 **Trigger condition**: user explicitly says "save session", "zapisz sesję", "wrap up", or equivalent.
 Never invoke proactively. Never invoke mid-task.
 ## 1. Determine Session Boundary
 1. Read the latest entry file in `docs/sessions/` — use its last `## Session HH:MM` heading timestamp as the start boundary.
 2. Fallback if no previous entry exists: 24 hours ago.
 ## 2. Collect Facts (deterministic only — no invention)
 Run exactly:
 ```bash
 # All commits since boundary
 git --no-pager log --oneline <boundary>..HEAD
 # Changed file summary
 git --no-pager diff --stat <boundary>..HEAD
 ```
 From the visible conversation transcript: deploys run and their outcomes, test results seen.
 ## 3. Write the Session Entry
 **APPEND** to `docs/sessions/YYYY-MM-DD.md` (create the file if it doesn't exist for today).
 Never overwrite existing content.
 ```markdown
 ## Session HH:MM
 ### Commits
 <output of git log --oneline>
 ### Files changed
 <output of git diff --stat>
 ### Deploys
 <list from transcript, or "None recorded">
 ### Narrative
 > _user-provided summary_
 ```
 The `> _user-provided summary_` placeholder is **mandatory**. Never fill it in. The user supplies the narrative separately if desired.
 ## 4. What NOT to Touch
 - `backlog.md` — only on explicit "update backlog" instruction
 - `CLAUDE.md` — only on explicit "update CLAUDE.md" instruction
 - Any other file not listed above
 ## 5. Commit
 Stage and commit **only** the session file:
 ```bash
 git add docs/sessions/YYYY-MM-DD.md
 git commit -m "docs: session YYYY-MM-DD HH:MM"
 ```
 No other files. No `git add -A`.
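Review note: section 1 defines the boundary as a timestamp, while `git log A..B` expects revisions on both sides of the range. One way to bridge the two is to pass the timestamp to `--since`. A sketch under that assumption; `session_boundary` is a hypothetical helper, and it assumes session files are named `YYYY-MM-DD.md` as in section 3:

```bash
#!/usr/bin/env bash
# Hypothetical helper, not part of the repo.

# Print "YYYY-MM-DD HH:MM" built from a session file's name and its last
# "## Session HH:MM" heading. Returns 1 if the file has no such heading.
session_boundary() {
    local file="$1" day hhmm
    day=$(basename "$file" .md)
    hhmm=$(grep -E '^## Session [0-9]{2}:[0-9]{2}' "$file" | tail -n 1 | awk '{print $3}')
    [ -n "$hhmm" ] || return 1
    echo "${day} ${hhmm}"
}

# Example use (commented out so that sourcing this file runs nothing):
#   boundary=$(session_boundary docs/sessions/2026-06-03.md) || boundary="24 hours ago"
#   git --no-pager log --oneline --since="$boundary"
```

The `24 hours ago` fallback mirrors step 2 of section 1; `--since` accepts both forms.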
--- a/.claude/skills/worktree-aware/SKILL.md
+++ b/.claude/skills/worktree-aware/SKILL.md
@@ -0,0 +1,81 @@
 ---
 name: worktree-aware
 description: >
  Use when working in a git worktree checkout for a parallel agent task.
  The presence of an .agent-task file in the current working directory indicates
  a task worktree (NOT the main checkout). Encodes branch hygiene: commit only
  to the assigned task branch, NEVER push origin master, NEVER touch the main
  checkout at ~/homelab-codex-ws, NEVER manage worktrees yourself. On task
  completion, report the branch name verbatim and stop — the human merges via
  scripts/dev/agent.sh.
 ---
 ## When this applies
 - `.agent-task` present in your `cwd` → you are in a task worktree. Apply all rules below.
 - `.agent-task` absent → you are in the main checkout. Do NOT treat yourself as a task agent.
  In the main checkout these rules do not apply.
 ## Reading the marker
 `.agent-task` is a YAML file. Your assigned branch is the value of the `branch:` key, e.g.:
 ```yaml
 task: my-feature
 branch: task/my-feature
 parent_commit: abc1234
 created_utc: 2026-06-03T10:00:00Z
 worktree_path: /home/oskar/homelab-codex-ws-my-feature
 ```
 Always read this file first before taking any action.
 ## Rules
 1. **Commit only to your branch.**
   Before any `git commit`, run `git status` and confirm it says `On branch task/<name>`.
   If it does not, stop immediately and report the discrepancy.
 2. **Push only to your branch.**
   The only permitted push is `git push origin task/<name>`.
   NEVER `git push origin master` or any other branch.
 3. **Do not touch the main checkout.**
   `~/homelab-codex-ws/` is the main checkout — deploy-only, owned by the human.
   Do not read from, write to, or execute commands inside it.
 4. **Stay scoped.**
   Only change files directly related to your assigned task.
   If you notice other problems, report them in your final summary as separate follow-up proposals.
   Do not fix them in this worktree.
 5. **Never `git add -A`.**
   Always stage specific files by name: `git add path/to/file`.
 6. **Do not manage worktrees.**
   Never run `git worktree add/remove` or invoke `scripts/dev/agent.sh`.
   Worktree lifecycle is the human's responsibility.
 7. **Final report before stopping.**
   When the task is done, provide a structured report containing:
   - Files changed (path and one-line summary of change)
   - Tests run and results
   - All commit hashes on the task branch
   - **Branch name verbatim** (copy-paste ready)
   - Follow-up items as bulleted proposals for separate tasks
 ## Definition of Done
 - All commits are on `task/<name>` (verify with `git log --oneline master..task/<name>`)
 - Test suite passes
 - Branch pushed: `git push origin task/<name>`
 - Full report delivered in conversation
 ## What you do NOT do
 - Merge branches
 - Create or push tags
 - Run deploys or healthchecks against production nodes
 - Delete branches or worktrees
 - Modify files in other worktrees
 - Push to `origin master` under any circumstances
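Review note: rule 1 (confirm the branch before committing) could also be checked mechanically. A sketch, assuming the marker keeps the flat `key: value` layout shown under "Reading the marker"; both function names are hypothetical and not part of this PR:

```bash
#!/usr/bin/env bash
# Hypothetical helpers illustrating rule 1; not part of the repo.

# Print the value of the `branch:` key from a marker file.
agent_task_branch() {
    awk '$1 == "branch:" { print $2; exit }' "$1"
}

# Return 0 only if the current branch equals the branch assigned
# in ./.agent-task. Fails if the marker is missing or has no branch key.
on_assigned_branch() {
    local assigned current
    assigned=$(agent_task_branch .agent-task) || return 1
    current=$(git rev-parse --abbrev-ref HEAD) || return 1
    [ -n "$assigned" ] && [ "$assigned" = "$current" ]
}
```

An agent could run `on_assigned_branch || echo "branch mismatch, stopping" >&2` before each `git commit`; this uses the same grep/awk-style parsing that `agent.sh list` applies to the marker.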
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -180,3 +180,15 @@ Before any new or changed service is considered ready:
 - Services: kebab-case (`stability-agent`, `zigbee2mqtt`)
 - Container names must match service names
 - Always `restart: unless-stopped` unless `service.yaml` says otherwise
 ## Multi-agent worktree mode
 `~/homelab-codex-ws` (main checkout) is **deploy-only** and belongs to the human operator.
 Parallel agent tasks run in isolated git worktrees created by `scripts/dev/agent.sh new <name>`.
 If `.agent-task` exists in your current working directory, you are in a task worktree.
 **You must immediately read `.agent-task` and load `.claude/skills/worktree-aware/SKILL.md`
 before taking any action.** That skill defines all branch-hygiene rules for task worktrees.
 Worktree lifecycle commands: `agent.sh new | list | merge | clean`.
 Agents never invoke these — only the human does.
--- a/scripts/deploy/deploy.sh
+++ b/scripts/deploy/deploy.sh
@@ -1,270 +1,321 @@
 #!/usr/bin/env bash
-# deploy.sh - Staged deployment framework for homelab nodes.
+# scripts/deploy/deploy.sh — Saturn-side deploy dispatcher
 # Usage: deploy.sh <target> [--dry-run] [--no-gate]
 #   target ∈ {control-plane, vps, piha, solaria, chelsty-infra}
 # Exit codes: 0=ok  1=preflight  2=gate  3=execute  4=verify  5=handoff(sudo)
-set -o pipefail
+set -uo pipefail
-# --- Configuration ---
+REPO_ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd)"
-export RUNTIME_PATH="/opt/homelab"
+SSH_USER="${SSH_USER:-oskar}"
-export STATE_DIR="${RUNTIME_PATH}/state/deploy"
+START_TIME=$(date +%s)
-export LOG_DIR="${RUNTIME_PATH}/logs/deploy"
+TARGET=""
-export REPO_PATH="${HOME}/homelab-codex-ws"
+DRY_RUN=false
-export TIMESTAMP=$(date +%Y%m%d_%H%M%S)
+NO_GATE=false
 export LOG_FILE="${LOG_DIR}/deploy_${TIMESTAMP}.log"
-# --- Initialization ---
+usage() {
-mkdir -p "$STATE_DIR" "$LOG_DIR"
+    cat >&2 <<'EOF'
 Usage: deploy.sh <target> [--dry-run] [--no-gate]
-# Redirection for logging
+Targets:
-exec > >(tee -a "$LOG_FILE") 2>&1
+  control-plane   observer/supervisor/executor/operator-ui on VPS
  vps             all VPS GitOps services
  piha            PIHA services
  solaria         SOLARIA compute services
  chelsty-infra   CHELSTY edge node (LTE, longer SSH timeout)
-# --- Load Libraries ---
+Flags:
-LIB_PATH="${REPO_PATH}/scripts/lib"
+  --dry-run   run preflight + gate only; stop before deploy
-source "${LIB_PATH}/log.sh"
+  --no-gate   skip pytest + docker build (emergency only; logged as WARNING)
 source "${LIB_PATH}/state.sh"
 source "${LIB_PATH}/inventory.sh"
 source "${LIB_PATH}/compose.sh"
 source "${LIB_PATH}/diagnostics.sh"
-# --- CLI Parsing ---
+Exit codes: 0=ok  1=preflight  2=gate  3=execute  4=verify  5=handoff(sudo)
-TARGET_HOST=$(hostname)
+EOF
-TARGET_SERVICE=""
+    exit 1
-RESUME=false
+}
 REQUESTED_STAGE=""
 while [[ $# -gt 0 ]]; do
    case $1 in
-        --host)
+        control-plane|vps|piha|solaria|chelsty-infra)
-            TARGET_HOST="$2"
+            TARGET="$1"; shift ;;
-            shift 2
+        --dry-run)
-            ;;
+            DRY_RUN=true; shift ;;
-        --service)
+        --no-gate)
-            TARGET_SERVICE="$2"
+            NO_GATE=true; shift ;;
-            shift 2
+        -h|--help)
-            ;;
+            usage ;;
        --resume)
            RESUME=true
            shift
            ;;
        --stage)
            REQUESTED_STAGE="$2"
            shift 2
            ;;
        *)
-            if [[ "$1" =~ ^(prepare|validate|deploy|verify|diagnose|complete)$ ]]; then
+            echo "Unknown argument: $1" >&2
-                REQUESTED_STAGE="$1"
+            usage ;;
            fi
            shift
            ;;
    esac
 done
-# --- Stages ---
+[[ -z "$TARGET" ]] && { echo "Error: target is required." >&2; usage; }
-stage_prepare() {
+case "$TARGET" in
-    local host=$1
+    control-plane) SSH_HOST="vps" ;;
-    if is_stage_complete "prepare" && [[ "$RESUME" == "true" ]]; then
+    *)             SSH_HOST="$TARGET" ;;
-        log "INFO" "Skipping PREPARE (already complete)"
+esac
        return 0
    fi
-    log "INFO" "Stage: PREPARE ($host)"
+case "$TARGET" in
-    set_stage "prepare"
+    chelsty-*) SSH_TIMEOUT=30 ;;
    *)         SSH_TIMEOUT=5 ;;
 esac
-    emit_event "deployment_started" "info" "deploy.sh" "all" "${TIMESTAMP}" "{\"stage\": \"prepare\"}"
+# ── PREFLIGHT ────────────────────────────────────────────────────────────────
-    cd "$REPO_PATH" || exit 1
+preflight() {
-    log "INFO" "Pulling latest changes..."
+    echo "=== PREFLIGHT ==="
    if ! git pull; then
        log "WARN" "Git pull failed, proceeding with local state (offline mode or network flap)"
    fi
-    # Ensure runtime directories exist
+    local branch
-    mkdir -p "${RUNTIME_PATH}/config" "${RUNTIME_PATH}/data" "${RUNTIME_PATH}/state" "${RUNTIME_PATH}/logs"
+    branch=$(git -C "$REPO_ROOT" rev-parse --abbrev-ref HEAD)
-
+    if [[ "$branch" != "master" ]]; then
-    struct_log "prepare" "$host" "all" "success" "repo_updated"
+        echo "ERROR: On branch '${branch}', not master. Switch to master and push first." >&2
    mark_stage_complete "prepare"
 }
 stage_validate() {
    local host=$1
    if is_stage_complete "validate" && [[ "$RESUME" == "true" ]]; then
        log "INFO" "Skipping VALIDATE (already complete)"
        return 0
    fi
    log "INFO" "Stage: VALIDATE ($host)"
    set_stage "validate"
    for service in "${SERVICES[@]}"; do
        log "INFO" "Validating $service..."
        if [[ ! -d "${REPO_PATH}/services/$service" ]]; then
            log "ERROR" "Service definition not found: $service"
            struct_log "validate" "$host" "$service" "fail" "not_found"
            return 1
        fi
    done
    struct_log "validate" "$host" "all" "success" "validated"
    mark_stage_complete "validate"
 }
 stage_deploy() {
    local host=$1
    if is_stage_complete "deploy" && [[ "$RESUME" == "true" ]]; then
        log "INFO" "Skipping DEPLOY (already complete)"
        return 0
    fi
    log "INFO" "Stage: DEPLOY ($host)"
    set_stage "deploy"
    local last_s=$(get_last_service)
    local skip=false
    if [[ "$RESUME" == "true" && -n "$last_s" ]]; then
        skip=true
    fi
    for service in "${SERVICES[@]}"; do
        if [[ "$skip" == "true" ]]; then
            if [[ "$service" == "$last_s" ]]; then
                skip=false
                log "INFO" "Resuming from $service..."
            else
                log "INFO" "Skipping $service (already processed)"
                continue
            fi
        fi
        log "INFO" "Deploying $service..."
        set_last_service "$service"
        if ! run_compose_up "$service"; then
            struct_log "deploy" "$host" "$service" "fail" "docker_compose_failed"
            collect_diagnostics "$host" "$service"
            return 1
        fi
        struct_log "deploy" "$host" "$service" "success" "deployed"
    done
    set_last_service ""
    mark_stage_complete "deploy"
 }
 stage_verify() {
    local host=$1
    if is_stage_complete "verify" && [[ "$RESUME" == "true" ]]; then
        log "INFO" "Skipping VERIFY (already complete)"
        return 0
    fi
    log "INFO" "Stage: VERIFY ($host)"
    set_stage "verify"
    for service in "${SERVICES[@]}"; do
        log "INFO" "Verifying $service..."
        local health_script="${REPO_PATH}/services/${service}/healthcheck.sh"
        if [[ -f "$health_script" ]]; then
            if ! bash "$health_script"; then
                log "ERROR" "Healthcheck failed for $service"
                struct_log "verify" "$host" "$service" "fail" "healthcheck_failed"
                collect_diagnostics "$host" "$service"
                return 1
            fi
        else
            # Generic check if container is running
            if ! docker ps --filter "name=$service" --filter "status=running" | grep -q "$service"; then
                log "ERROR" "Container $service is not running"
                struct_log "verify" "$host" "$service" "fail" "container_not_running"
                collect_diagnostics "$host" "$service"
                return 1
            fi
        fi
        struct_log "verify" "$host" "$service" "success" "verified"
    done
    mark_stage_complete "verify"
 }
 stage_complete() {
    local host=$1
    log "INFO" "Stage: COMPLETE ($host)"
    set_stage "complete"
    struct_log "complete" "$host" "all" "success" "deployment_finished"
    clear_deployment_state
 }
 # --- Execution Logic ---
 run_deployment() {
    local start_stage=$1
    # Sequential execution from start_stage
    case "$start_stage" in
        prepare)
            stage_prepare "$TARGET_HOST" || return 1
            ;&
        validate)
            stage_validate "$TARGET_HOST" || return 1
            ;&
        deploy)
            stage_deploy "$TARGET_HOST" || return 1
            ;&
        verify)
            stage_verify "$TARGET_HOST" || return 1
            ;&
        complete)
            stage_complete "$TARGET_HOST" || return 1
            ;;
        *)
            log "ERROR" "Invalid stage: $start_stage"
            return 1
            ;;
    esac
 }
 # --- Main ---
 log "INFO" "--- Homelab Deployment Started (Host: $TARGET_HOST, Service: ${TARGET_SERVICE:-all}) ---"
 if ! load_inventory "$TARGET_HOST" "$TARGET_SERVICE"; then
    log "ERROR" "Failed to load inventory"
        exit 1
 fi
 EXIT_STATUS=0
 if [[ "$RESUME" == "true" ]]; then
    CURRENT=$(get_stage)
    log "INFO" "Resuming from state: $CURRENT"
    case "$CURRENT" in
        prepare|validate|deploy|verify)
            run_deployment "$CURRENT" || EXIT_STATUS=1
            ;;
        complete|none)
            log "INFO" "No interrupted deployment found. Starting from scratch..."
            run_deployment "prepare" || EXIT_STATUS=1
            ;;
        *)
            log "INFO" "Unknown state. Starting from prepare..."
            run_deployment "prepare" || EXIT_STATUS=1
            ;;
    esac
 elif [[ -n "$REQUESTED_STAGE" ]]; then
    if [[ "$REQUESTED_STAGE" == "diagnose" ]]; then
        collect_diagnostics "$TARGET_HOST" "$TARGET_SERVICE"
    else
        run_deployment "$REQUESTED_STAGE" || EXIT_STATUS=1
    fi
-else
+    echo "[ok] branch: master"
-    # New deployment - clear previous state
+
-    clear_deployment_state
+    if ! git -C "$REPO_ROOT" diff --quiet; then
-    run_deployment "prepare" || EXIT_STATUS=1
+        echo "ERROR: Unstaged changes in working tree. Commit or stash before deploying." >&2
        exit 1
    fi
    if ! git -C "$REPO_ROOT" diff --cached --quiet; then
        echo "ERROR: Staged but uncommitted changes. Commit before deploying." >&2
        exit 1
    fi
    echo "[ok] working tree clean"
    git -C "$REPO_ROOT" fetch origin master --quiet
    local unpushed
    unpushed=$(git -C "$REPO_ROOT" log origin/master..HEAD --oneline)
    if [[ -n "$unpushed" ]]; then
        echo "ERROR: Unpushed commits on master:" >&2
        echo "$unpushed" >&2
        echo "Push first:  git push origin master" >&2
        exit 1
    fi
    echo "[ok] no unpushed commits"
    echo "Checking SSH: ${SSH_USER}@${SSH_HOST} (ConnectTimeout=${SSH_TIMEOUT}s)..."
    if ! ssh -o "ConnectTimeout=${SSH_TIMEOUT}" -o BatchMode=yes \
            "${SSH_USER}@${SSH_HOST}" true 2>/dev/null; then
        echo "ERROR: Cannot reach ${SSH_HOST} via SSH (timeout ${SSH_TIMEOUT}s)." >&2
        exit 1
    fi
    echo "[ok] ${SSH_HOST} reachable"
 }
 # ── GATE ─────────────────────────────────────────────────────────────────────
 gate() {
    if [[ "$NO_GATE" == "true" ]]; then
        echo "=== GATE: SKIPPED ==="
        echo "WARNING: --no-gate active — pytest + docker build bypassed (emergency mode)." >&2
        return 0
    fi
    echo "=== GATE ==="
    local services=()
    if [[ "$TARGET" == "control-plane" ]]; then
        services=("control-plane")
    else
        local svc_yaml="${REPO_ROOT}/hosts/${TARGET}/services.yaml"
        if [[ ! -f "$svc_yaml" ]]; then
            echo "ERROR: ${svc_yaml} not found." >&2
            exit 2
        fi
        local svc_list
        svc_list=$(python3 -c "
 import yaml
 with open('${svc_yaml}') as f:
    data = yaml.safe_load(f)
 svcs = data.get('services', {})
 if isinstance(svcs, dict):
    print('\n'.join(svcs.keys()))
 elif isinstance(svcs, list):
    print('\n'.join(svcs))
 ")
        while IFS= read -r svc; do
            [[ -z "$svc" ]] && continue
            if [[ -f "${REPO_ROOT}/services/${svc}/Dockerfile" ]]; then
                services+=("$svc")
            fi
        done <<< "$svc_list"
    fi
    if [[ ${#services[@]} -eq 0 ]]; then
        echo "[info] No services with local Dockerfile found for ${TARGET} — gate trivially passes."
        return 0
    fi
    echo "Services under gate: ${services[*]}"
    local gate_failed=false
    for svc in "${services[@]}"; do
        local svc_dir="${REPO_ROOT}/services/${svc}"
        if [[ -d "${svc_dir}/tests" ]]; then
            echo "--- pytest: ${svc} ---"
            if ! python3 -m pytest "${svc_dir}/tests" -q; then
                echo "GATE FAIL: pytest failed for ${svc}" >&2
                gate_failed=true
            fi
        fi
        echo "--- docker build: ${svc} ---"
        if ! docker build --quiet "${svc_dir}" >/dev/null; then
            echo "GATE FAIL: docker build failed for ${svc}" >&2
            gate_failed=true
        fi
    done
    if [[ "$gate_failed" == "true" ]]; then
        exit 2
    fi
    echo "[ok] gate passed"
 }
 # ── EXECUTE ──────────────────────────────────────────────────────────────────
 execute() {
    echo "=== EXECUTE ==="
    local cmd_output
    local cmd_exit=0
    if [[ "$TARGET" == "control-plane" ]]; then
        echo "Running deploy-control-plane.sh --ssh..."
        cmd_output=$("${REPO_ROOT}/scripts/deploy/deploy-control-plane.sh" --ssh 2>&1) \
            || cmd_exit=$?
    else
        echo "SSHing to ${SSH_HOST}: git pull + deploy-node.sh..."
        cmd_output=$(ssh -o "ConnectTimeout=${SSH_TIMEOUT}" -o BatchMode=yes \
            "${SSH_USER}@${SSH_HOST}" \
            'cd ~/homelab-codex-ws && git pull && ./scripts/deploy/deploy-node.sh' 2>&1) \
            || cmd_exit=$?
    fi
    echo "$cmd_output"
    if echo "$cmd_output" | grep -qF "[sudo] password"; then
        echo "" >&2
        echo "ERROR (exit 5): Deploy hit an interactive sudo prompt." >&2
        echo "Run manually:" >&2
        if [[ "$TARGET" == "control-plane" ]]; then
            echo "  ssh -t ${SSH_USER}@${SSH_HOST} 'cd ~/homelab-codex-ws && git pull origin master && cd services/control-plane && bash deploy-local.sh'" >&2
        else
            echo "  ssh -t ${SSH_USER}@${SSH_HOST} 'cd ~/homelab-codex-ws && git pull && ./scripts/deploy/deploy-node.sh'" >&2
        fi
        exit 5
    fi
    if [[ $cmd_exit -ne 0 ]]; then
        echo "ERROR: Deploy command exited ${cmd_exit}." >&2
        exit 3
    fi
    echo "[ok] execute completed"
 }
 # ── VERIFY ───────────────────────────────────────────────────────────────────
 verify() {
    echo "=== VERIFY ==="
    local ps_output
    local ps_exit=0
    ps_output=$(ssh -o "ConnectTimeout=${SSH_TIMEOUT}" -o BatchMode=yes \
        "${SSH_USER}@${SSH_HOST}" \
        'docker ps --format "{{.Names}}\t{{.Status}}"' 2>&1) \
        || ps_exit=$?
    if [[ $ps_exit -ne 0 ]]; then
        echo "ERROR: docker ps failed on ${SSH_HOST}:" >&2
        echo "$ps_output" >&2
        exit 4
    fi
    echo "$ps_output"
    local failed=false
    local not_up
    not_up=$(echo "$ps_output" | grep -v '^$' | grep -v $'\tUp' || true)
    if [[ -n "$not_up" ]]; then
        echo "ERROR: Containers not in Up state:" >&2
        echo "$not_up" >&2
        failed=true
    fi
    local unhealthy
    unhealthy=$(echo "$ps_output" | grep '(unhealthy)' || true)
    if [[ -n "$unhealthy" ]]; then
        echo "ERROR: Unhealthy containers:" >&2
        echo "$unhealthy" >&2
        failed=true
    fi
    if [[ "$TARGET" == "control-plane" ]]; then
        for cp_svc in supervisor observer executor operator-ui; do
            if ! echo "$ps_output" | grep -q "$cp_svc"; then
                echo "ERROR: control-plane component absent from docker ps: ${cp_svc}" >&2
                failed=true
            fi
        done
    fi
    if [[ "$failed" == "true" ]]; then
        echo "" >&2
        echo "Full docker ps output above." >&2
        exit 4
    fi
    echo "[ok] all containers healthy"
 }
 # ── REPORT ───────────────────────────────────────────────────────────────────
 report() {
    local mode="${1:-deploy}"
    local end_time
    end_time=$(date +%s)
    local elapsed
    elapsed=$(( end_time - START_TIME ))
    local commit_hash
    commit_hash=$(git -C "$REPO_ROOT" rev-parse --short HEAD)
    local gate_s verify_s
    if [[ "$NO_GATE" == "true" ]]; then
        gate_s="skip"
    else
        gate_s="ok"
    fi
    if [[ "$mode" == "dry-run" ]]; then
        verify_s="skip(dry-run)"
    else
        verify_s="green"
    fi
    echo ""
    if [[ "$mode" == "dry-run" ]]; then
        echo "DRY RUN OK | target=${TARGET} | commit=${commit_hash} | gate=${gate_s} | verify=${verify_s} | ${elapsed}s"
    else
        echo "DEPLOY OK  | target=${TARGET} | commit=${commit_hash} | gate=${gate_s} | verify=${verify_s} | ${elapsed}s"
    fi
 }
 # ── MAIN ─────────────────────────────────────────────────────────────────────
 preflight
 gate
 if [[ "$DRY_RUN" == "true" ]]; then
    report dry-run
    exit 0
 fi
-if [[ $EXIT_STATUS -eq 0 ]]; then
+execute
-    print_summary "$TARGET_HOST" "SUCCESS"
+verify
-    log "INFO" "--- Homelab Deployment Finished Successfully ---"
+report
 else
    print_summary "$TARGET_HOST" "FAILED"
    log "ERROR" "--- Homelab Deployment Failed ---"
    exit 1
 fi
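Review note: the new `verify()` checks control-plane components with `grep -q "$cp_svc"`, which also matches substrings of other container names. A sketch of a stricter variant, assuming container names equal service names (as the naming rules in CLAUDE.md require) and the tab-separated `docker ps --format "{{.Names}}\t{{.Status}}"` output used in this script; the function names are hypothetical:

```bash
#!/usr/bin/env bash
# Hypothetical helpers, not part of the repo. Both read the tab-separated
# "name<TAB>status" lines produced by docker ps on stdin.

# Print the names of containers whose status does not start with "Up".
not_up_containers() {
    awk -F'\t' 'NF && $2 !~ /^Up/ { print $1 }'
}

# Return 0 only if a container with exactly this name is listed.
has_container() {
    awk -F'\t' -v want="$1" '$1 == want { found = 1 } END { exit !found }'
}

# Example use (commented out so that sourcing this file runs nothing):
#   echo "$ps_output" | has_container observer || echo "observer missing" >&2
```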
--- a/scripts/dev/agent.sh
+++ b/scripts/dev/agent.sh
@@ -0,0 +1,361 @@
 #!/usr/bin/env bash
 # Multi-agent worktree manager.
 # EXIT: 0 ok, 1 preflight, 2 operation failed.
 set -euo pipefail
 trap 'echo "agent.sh: failed at line $LINENO (exit $?)" >&2' ERR
 RESERVED_NAMES=(master main HEAD list merge clean new)
 MAX_WORKTREES=4
 die()    { echo "ERROR: $1" >&2; exit "${2:-2}"; }
 prefail(){ echo "PREFLIGHT: $*" >&2; exit 1; }
 # ── helpers ──────────────────────────────────────────────────────────────────
 is_main_checkout() {
  local git_dir common_dir
  git_dir=$(git rev-parse --git-dir 2>/dev/null) || return 1
  common_dir=$(git rev-parse --git-common-dir 2>/dev/null) || return 1
  [ "$git_dir" = "$common_dir" ]
 }
 require_main_checkout() {
  is_main_checkout || prefail "must run from the main checkout, not a worktree"
 }
 require_master_branch() {
  local branch
  branch=$(git rev-parse --abbrev-ref HEAD)
  [ "$branch" = "master" ] || prefail "must be on master (currently on '$branch')"
 }
 require_clean_tree() {
  local dirty
  dirty=$(git status --porcelain)
  [ -z "$dirty" ] || prefail "working tree is not clean — stash or commit first"
 }
 worktree_paths() {
  # list worktree paths (excluding main); || true prevents grep exit-1 when empty
  local main_path
  main_path=$(git rev-parse --show-toplevel)
  git worktree list --porcelain \
    | awk '/^worktree /{p=$2} /^$/{print p}' \
    | grep -v "^${main_path}$" \
    || true
 }
 worktree_count() {
  worktree_paths | wc -l
 }
 branch_exists_local()  { git show-ref --verify --quiet "refs/heads/$1"; }
 branch_exists_remote() { git ls-remote --exit-code origin "$1" >/dev/null 2>&1; }
 utc_now() { date -u +"%Y-%m-%dT%H:%M:%SZ"; }
 age_str() {
  local created_utc="$1"
  local now_ts created_ts diff_s
  now_ts=$(date -u +%s)
  # replace T with a space for `date -d`; the trailing Z is kept and read as UTC
  created_ts=$(date -u -d "${created_utc//T/ }" +%s 2>/dev/null) || { echo "?"; return; }
  diff_s=$(( now_ts - created_ts ))
  if   (( diff_s < 60 ));   then echo "${diff_s}s"
  elif (( diff_s < 3600 )); then echo "$(( diff_s/60 ))m"
  elif (( diff_s < 86400 )); then echo "$(( diff_s/3600 ))h"
  else echo "$(( diff_s/86400 ))d"
  fi
 }
 validate_name() {
  local name="$1"
  if ! [[ "$name" =~ ^[a-z][a-z0-9-]*$ ]]; then
    prefail "name '$name' must match ^[a-z][a-z0-9-]*$"
  fi
  for r in "${RESERVED_NAMES[@]}"; do
    if [ "$name" = "$r" ]; then
      prefail "'$name' is a reserved word"
    fi
  done
 }
 # ── subcommands ───────────────────────────────────────────────────────────────
 cmd_new() {
  local name="${1:-}"
  [ -n "$name" ] || { usage; exit 1; }
  validate_name "$name"
  require_main_checkout
  require_master_branch
  require_clean_tree
  # worktree limit
  local count
  count=$(worktree_count)
  if (( count >= MAX_WORKTREES )); then
    echo "ERROR: already at maximum of $MAX_WORKTREES active worktrees:" >&2
    cmd_list
    exit 1
  fi
  # branch collision
  if branch_exists_local "task/$name"; then
    prefail "branch task/$name already exists locally"
  fi
  git fetch origin master --quiet
  if branch_exists_remote "refs/heads/task/$name"; then
    prefail "branch task/$name already exists on origin"
  fi
  # directory collision
  local main_path wt_path
  main_path=$(git rev-parse --show-toplevel)
  wt_path="$(dirname "$main_path")/homelab-codex-ws-${name}"
  [ ! -e "$wt_path" ] || prefail "directory $wt_path already exists"
  # create worktree
  git worktree add -b "task/$name" "$wt_path" origin/master \
    || die "git worktree add failed"
  # write marker
  local parent_commit
  parent_commit=$(git rev-parse origin/master)
  cat > "$wt_path/.agent-task" <<EOF
 task: $name
 branch: task/$name
 parent_commit: $parent_commit
 created_utc: $(utc_now)
 worktree_path: $wt_path
 EOF
  echo ""
  echo "Worktree created: $wt_path"
  echo "Branch:           task/$name"
  echo ""
  echo "── Start Claude Code in this worktree ──────────────────────────────────────"
  echo "cd ~/homelab-codex-ws-${name} && claude --dangerously-skip-permissions \"You are in the worktree for task '${name}' (branch task/${name}). FIRST read .agent-task and .claude/skills/worktree-aware/SKILL.md, and only then start working. Commit only to your own branch; do not push origin master.\""
  echo "─────────────────────────────────────────────────────────────────────────────"
 }
 cmd_list() {
  local main_path
  main_path=$(git rev-parse --show-toplevel)
  # fetch to get up-to-date ahead/behind
  git fetch origin master --quiet 2>/dev/null || true
  local paths
  paths=$(worktree_paths)
  if [ -z "$paths" ]; then
    echo "(no active task worktrees)"
    return
  fi
  printf "%-20s %-25s %-10s %-8s %-8s %-7s %s\n" \
    "NAME" "BRANCH" "CREATED" "AGE" "STATUS" "A/B" "PARENT"
  while IFS= read -r wt_path; do
    [ -z "$wt_path" ] && continue
    local marker="$wt_path/.agent-task"
    local task_name branch parent_commit created_utc
    if [ -f "$marker" ]; then
      task_name=$(  grep '^task:'          "$marker" | awk '{print $2}')
      branch=$(     grep '^branch:'        "$marker" | awk '{print $2}')
      parent_commit=$(grep '^parent_commit:' "$marker" | awk '{print $2}')
      created_utc=$(grep '^created_utc:'   "$marker" | awk '{print $2}')
    else
      task_name="(no marker)"
      branch=$(git -C "$wt_path" rev-parse --abbrev-ref HEAD 2>/dev/null || echo "?")
      parent_commit="?"
      created_utc=""
    fi
    local status="clean"
    local dirty
    dirty=$(git -C "$wt_path" status --porcelain 2>/dev/null || echo "?")
    [ -n "$dirty" ] && status="dirty"
    local ahead behind ab
    ahead=$(git -C "$wt_path" rev-list --count "origin/master..${branch}" 2>/dev/null || echo "?")
    behind=$(git -C "$wt_path" rev-list --count "${branch}..origin/master" 2>/dev/null || echo "?")
    ab="+${ahead}/-${behind}"
    local age=""
    [ -n "$created_utc" ] && age=$(age_str "$created_utc")
    local short_parent="${parent_commit:0:7}"
    local short_created="${created_utc:0:10}"
    printf "%-20s %-25s %-10s %-8s %-8s %-7s %s\n" \
      "$task_name" "$branch" "$short_created" "$age" "$status" "$ab" "$short_parent"
  done <<< "$paths"
 }
 cmd_merge() {
  local name="${1:-}"
  [ -n "$name" ] || { usage; exit 1; }
  require_main_checkout
  require_master_branch
  require_clean_tree
  git fetch origin --quiet
  branch_exists_local "task/$name" || die "branch task/$name not found locally" 1
  local main_path wt_path
  main_path=$(git rev-parse --show-toplevel)
  wt_path="$(dirname "$main_path")/homelab-codex-ws-${name}"
  # attempt ff-only merge
  local merge_failed=0
  git merge --ff-only "task/$name" || merge_failed=1
  if (( merge_failed )); then
    # abort any partial merge state
    git merge --abort 2>/dev/null || true
    echo "" >&2
    echo "ERROR: task/$name cannot be fast-forwarded into master." >&2
    echo "       The branch has likely diverged from master." >&2
    echo "" >&2
    echo "Diagnose with:" >&2
    echo "  git log master..task/$name        # commits only on task branch" >&2
    echo "  git log task/$name..master        # commits master has that task doesn't" >&2
    echo "" >&2
    echo "Then decide: rebase task/$name onto master, or merge manually." >&2
    echo "Worktree and branch are preserved — no changes made." >&2
    exit 2
  fi
  echo "Merged task/$name into master (fast-forward)."
  git push origin master || die "git push origin master failed"
  echo "Pushed master to origin."
  if [ -d "$wt_path" ]; then
    git worktree remove "$wt_path" || die "git worktree remove $wt_path failed"
    echo "Removed worktree: $wt_path"
  else
    echo "(worktree directory $wt_path not found — skipping worktree remove)"
  fi
  git branch -d "task/$name" || die "git branch -d task/$name failed"
  echo "Deleted local branch task/$name."
  git push origin --delete "task/$name" 2>/dev/null \
    && echo "Deleted remote branch task/$name." \
    || echo "(remote branch task/$name not found — nothing to delete)"
  echo ""
  echo "Done. task/$name merged and cleaned up."
 }
 cmd_clean() {
  local main_path
  main_path=$(git rev-parse --show-toplevel)
  git fetch origin --quiet 2>/dev/null || true
  local to_remove=()
  # orphaned registered worktrees: branch deleted or fully merged into master
  local paths
  paths=$(worktree_paths)
  while IFS= read -r wt_path; do
    [ -z "$wt_path" ] && continue
    local branch
    branch=$(git -C "$wt_path" rev-parse --abbrev-ref HEAD 2>/dev/null || echo "")
    [ -z "$branch" ] && { to_remove+=("worktree:$wt_path (unreadable branch)"); continue; }
    # branch gone locally?
    if ! branch_exists_local "$branch"; then
      to_remove+=("worktree:$wt_path (branch $branch no longer exists)")
      continue
    fi
    # branch fully merged into master?
    local ahead
    ahead=$(git rev-list --count "origin/master..${branch}" 2>/dev/null || echo "1")
    if [ "$ahead" = "0" ]; then
      to_remove+=("worktree:$wt_path (branch $branch fully merged into origin/master)")
    fi
  done <<< "$paths"
  # dangling directories: ../homelab-codex-ws-* not registered
  local registered_paths
  registered_paths=$(git worktree list --porcelain | awk '/^worktree /{print $2}')
  local parent_dir
  parent_dir=$(dirname "$main_path")
  while IFS= read -r candidate; do
    [ -d "$candidate" ] || continue
    # -x forces a whole-line match so ws-foo is not hidden by a registered ws-foobar
    if ! echo "$registered_paths" | grep -qxF "$candidate"; then
      to_remove+=("dangling:$candidate")
    fi
  done < <(find "$parent_dir" -maxdepth 1 -name "homelab-codex-ws-*" -type d 2>/dev/null)
  if [ ${#to_remove[@]} -eq 0 ]; then
    echo "Nothing to clean."
    return 0
  fi
  echo "Found ${#to_remove[@]} item(s) to clean:"
  for entry in "${to_remove[@]}"; do
    echo "  $entry"
  done
  echo ""
  local overall_rc=0
  for entry in "${to_remove[@]}"; do
    local kind="${entry%%:*}"
    local path="${entry#*:}"
    # strip trailing annotation in parens
    local raw_path
    raw_path="${path%% (*}"
    local confirm
    read -r -p "Remove $kind '$raw_path'? [y/N] " confirm
    if [[ "$confirm" =~ ^[Yy]$ ]]; then
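      local rm_ok=1  # flipped to 0 when removal fails, so exit code 2 is reported
      if [ "$kind" = "worktree" ]; then
        git worktree remove --force "$raw_path" 2>/dev/null \
          || { echo "  WARNING: git worktree remove failed, trying rm -rf"; rm -rf "$raw_path" || rm_ok=0; }
      else
        rm -rf "$raw_path" || rm_ok=0
      fi
      if (( rm_ok )); then
        echo "  Removed."
      else
        echo "  FAILED to remove $raw_path" >&2
        overall_rc=2
      fi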
    else
      echo "  Skipped."
    fi
  done
  return $overall_rc
 }
 usage() {
  cat <<'EOF'
 Usage: agent.sh <subcommand> [args]
  agent.sh new <name>    Create a new task worktree (branch task/<name>)
  agent.sh list          List active task worktrees with status
  agent.sh merge <name>  Fast-forward merge task/<name> into master and clean up
  agent.sh clean         Remove orphaned or dangling worktrees (interactive)
 EXIT: 0 ok, 1 preflight, 2 operation failed.
 EOF
 }
 # ── dispatch ──────────────────────────────────────────────────────────────────
 SUBCOMMAND="${1:-}"
 shift || true
 case "$SUBCOMMAND" in
  new)   cmd_new   "$@" ;;
  list)  cmd_list  "$@" ;;
  merge) cmd_merge "$@" ;;
  clean) cmd_clean "$@" ;;
  *)     usage; exit 1  ;;
 esac
--- a/scripts/observer/observer.py
+++ b/scripts/observer/observer.py
@ -17,6 +17,24 @@ def _atomic_write_json(path: Path, data) -> None:
        os.fsync(f.fileno())
    os.replace(tmp, path)
 def _parse_ts(ts) -> float:
    """Return a Unix timestamp float from ts, which may be int/float or an ISO-8601 string.
    Events from node-agent use int(time.time()); events from stability-agent / events.py
    use ISO format ('2026-06-03T10:30:00Z').  Both appear in incident fields such as
    last_occurrence and resolved_at, so any arithmetic on them must go through here.
    Returns 0.0 on None or unparseable input so callers can use plain comparisons.
    """
    if ts is None:
        return 0.0
    if isinstance(ts, (int, float)):
        return float(ts)
    try:
        return datetime.fromisoformat(str(ts).replace("Z", "+00:00")).timestamp()
    except Exception:
        return 0.0
 # Constants and Paths
 RUNTIME_PATH = os.getenv("RUNTIME_PATH", "/opt/homelab")
 EVENTS_DIR = Path(RUNTIME_PATH) / "events"
@ -184,32 +202,66 @@ class Observer:
        now = time.time()
-        # Auto-resolve active incidents for services that are currently healthy
+        try:
-        # and whose last_occurrence is older than 30 minutes.  These are phantom
+            # Collect incident_ids currently referenced by any service entry.
-        # incidents created by race-condition reads of truncated state files; they
+            linked_ids: set = {
-        # never receive a service_recovered event because the service was healthy
+                svc.get("incident_id")
-        # all along.
+                for svc in self.world_state["services"].values()
                if svc.get("incident_id")
            }
            # Case 1 — service is healthy but still points at an active incident.
            # process_event already calls _resolve_incident on service_healthy events,
            # but if the observer restarted with on-disk state where the link was
            # intact (inconsistency from a pre-atomic-write crash), it may not get
            # resolved until the next service_healthy event is processed.  Resolve
            # immediately — a healthy service cannot have an ongoing incident.
            for svc_key, svc in self.world_state["services"].items():
-            if svc.get("status") == "healthy":
+                if svc.get("status") != "healthy":
                    continue
                inc_id = svc.get("incident_id")
-                if inc_id and inc_id in self.world_state["incidents"]:
+                if not inc_id:
-                    inc = self.world_state["incidents"][inc_id]
+                    continue
-                    last_occ = inc.get("last_occurrence") or 0
+                inc = self.world_state["incidents"].get(inc_id, {})
-                    if (inc.get("status") == "active"
+                if inc.get("status") == "active":
-                            and (now - last_occ) > 1800):
                    logger.info(
-                            f"Auto-resolving stale incident {inc_id} for {svc_key}: "
+                        f"Auto-resolving incident {inc_id} for {svc_key}: "
-                            f"service healthy, last_occurrence >{int((now - last_occ) / 60)}min ago"
+                        f"service is healthy"
                    )
                    inc["status"] = "resolved"
                    inc["resolved_at"] = now
                    svc["incident_id"] = None
                    linked_ids.discard(inc_id)
            # Case 2 — orphaned active incident: no service entry links to it and
            # last_occurrence is older than 5 minutes (guard against creation races).
            # These are the stale records left behind when on-disk state was
            # inconsistent: the service entry had incident_id cleared but incidents.json
            # still had the record as "active".
            for inc_id, inc in self.world_state["incidents"].items():
                if inc.get("status") != "active":
                    continue
                if inc_id in linked_ids:
                    continue
                age = now - _parse_ts(inc.get("last_occurrence"))
                if age > 300:  # 5-minute guard
                    logger.info(
                        f"Auto-resolving orphaned incident {inc_id} "
                        f"(service={inc.get('service')}, node={inc.get('node')}): "
                        f"no service references it, age={int(age)}s"
                    )
                    inc["status"] = "resolved"
                    inc["resolved_at"] = now
        except Exception as exc:
            logger.error(f"Error during incident auto-resolve in _prune_stale_world: {exc}")
        # Remove resolved incidents older than 7 days.
        # Use _parse_ts so ISO-string resolved_at values are handled correctly.
        stale_incidents = [
            k for k, v in self.world_state["incidents"].items()
            if v.get("status") == "resolved"
-            and (now - (v.get("resolved_at") or now)) > 7 * 86400
+            and now - _parse_ts(v.get("resolved_at")) > 7 * 86400
        ]
        for k in stale_incidents:
            del self.world_state["incidents"][k]
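The two cleanup passes in the hunk above can be tried in isolation. The following is a minimal sketch, not the observer's code: `auto_resolve`, its parameters, and the sample incident ids are hypothetical, and timestamps are assumed to be already normalised to Unix floats.

```python
import time


def auto_resolve(services, incidents, now, orphan_guard_s=300):
    """Resolve incidents in place and return the ids that were resolved."""
    resolved = []
    # Incident ids still referenced by some service entry.
    linked = {s["incident_id"] for s in services.values() if s.get("incident_id")}

    # Case 1: a healthy service still points at an active incident.
    for svc in services.values():
        inc_id = svc.get("incident_id")
        if svc.get("status") != "healthy" or not inc_id:
            continue
        inc = incidents.get(inc_id, {})
        if inc.get("status") == "active":
            inc["status"] = "resolved"
            inc["resolved_at"] = now
            svc["incident_id"] = None
            linked.discard(inc_id)
            resolved.append(inc_id)

    # Case 2: an active incident that no service references, older than the guard.
    for inc_id, inc in incidents.items():
        if inc.get("status") != "active" or inc_id in linked:
            continue
        if now - float(inc.get("last_occurrence") or 0.0) > orphan_guard_s:
            inc["status"] = "resolved"
            inc["resolved_at"] = now
            resolved.append(inc_id)
    return resolved


now = time.time()
services = {
    "vps/outline": {"status": "healthy", "incident_id": "inc-a"},
    "vps/supervisor": {"status": "unhealthy", "incident_id": "inc-b"},
}
incidents = {
    "inc-a": {"status": "active", "last_occurrence": now - 7200},  # healthy-linked
    "inc-b": {"status": "active", "last_occurrence": now - 3600},  # still unhealthy
    "inc-c": {"status": "active", "last_occurrence": now - 400},   # orphan, past guard
    "inc-d": {"status": "active", "last_occurrence": now - 60},    # orphan, within guard
}
resolved_ids = auto_resolve(services, incidents, now)
print(resolved_ids)
```

In this sample only the healthy-linked incident and the orphan past the guard are resolved. The incident linked to the unhealthy service and the one-minute-old orphan stay active, which matches the expectations encoded in the lifecycle tests.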
--- a/services/control-plane/Dockerfile
+++ b/services/control-plane/Dockerfile
@ -20,4 +20,5 @@ ENV RUNTIME_PATH=/opt/homelab
 ENV PYTHONUNBUFFERED=1
 # Default command (will be overridden in docker-compose)
 USER homelab
 CMD ["python", "src/operator_ui.py"]
--- a/services/control-plane/deploy-local.sh
+++ b/services/control-plane/deploy-local.sh
@ -39,10 +39,24 @@ for dir in "${DIRS[@]}"; do
    fi
 done
-# 3. chown/chmod for UID 1000
+# 3. chown/chmod for UID 1000 — self-healing: only calls sudo when actually needed
-echo "Setting permissions for UID 1000 on /opt/homelab..."
+echo "Checking /opt/homelab ownership..."
-sudo chown -R 1000:1000 /opt/homelab
+_chown_needed=$(find /opt/homelab \( ! -uid 1000 -o ! -gid 1000 \) -print -quit 2>/dev/null)
-sudo chmod -R 775 /opt/homelab 2>/dev/null || true
+if [[ -n "$_chown_needed" ]]; then
    echo "Found files not owned by 1000:1000 (e.g. $_chown_needed) — fixing..."
    sudo chown -R 1000:1000 /opt/homelab
 else
    echo "Ownership already correct, skipping chown"
 fi
 echo "Checking /opt/homelab directory permissions..."
 _chmod_needed=$(find /opt/homelab -type d ! -perm -775 -print -quit 2>/dev/null)
 if [[ -n "$_chmod_needed" ]]; then
    echo "Found directories with wrong permissions (e.g. $_chmod_needed) — fixing..."
    sudo chmod -R 775 /opt/homelab 2>/dev/null || true
 else
    echo "Permissions already correct, skipping chmod"
 fi
 # 4. Run docker compose up -d --build --force-recreate
 echo "--- Starting Control Plane Services ---"
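The ownership probe above relies on `find ... -print -quit` stopping at the first mismatch instead of scanning the whole tree. As a rough Python illustration of the same early-exit idea (the helper name `first_wrong_owner` and the temporary directory layout are assumptions made for this sketch, not part of the deploy script):

```python
import os
import shutil
import tempfile
from typing import Optional


def first_wrong_owner(root: str, uid: int, gid: int) -> Optional[str]:
    """Return the first path under root not owned by uid:gid, or None."""
    def wrong(path: str) -> bool:
        st = os.lstat(path)  # lstat: inspect symlinks themselves, do not follow them
        return st.st_uid != uid or st.st_gid != gid

    if wrong(root):
        return root
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            if wrong(path):
                return path  # stop at the first hit, like find's -quit
    return None


# Demo on a throwaway tree; the expected owner is read from the tree's root.
demo_root = tempfile.mkdtemp()
os.makedirs(os.path.join(demo_root, "state"))
open(os.path.join(demo_root, "state", "a.json"), "w").close()
st = os.stat(demo_root)
clean = first_wrong_owner(demo_root, st.st_uid, st.st_gid)
mismatch = first_wrong_owner(demo_root, st.st_uid + 1, st.st_gid)
shutil.rmtree(demo_root)
print("uniform ownership ->", clean)
print("different expected uid ->", mismatch)
```

A caller would run the privileged fix only when the helper returns a path, which is what keeps the steady-state deploy free of `sudo`.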
--- a/services/control-plane/docker-compose.yml
+++ b/services/control-plane/docker-compose.yml
@ -56,6 +56,9 @@ services:
  executor:
    build: .
    container_name: control-plane-executor
    user: "1000:1000"
    group_add:
      - "999"
    command: python src/executor.py
    volumes:
      - /opt/homelab:/opt/homelab
--- a/services/control-plane/src/operator_ui.py
+++ b/services/control-plane/src/operator_ui.py
@ -147,12 +147,18 @@ def current_deployments():
 def current_incidents():
-    """Return incidents as a list sorted most-recent-first."""
+    """Return active incidents as a list sorted most-recent-first.
    Only incidents with status='active' are returned; resolved and cancelled
    records are excluded so the dashboard reflects the current operational state.
    """
    raw = read_json_file(WORLD_DIR / "incidents.json", default={})
    if isinstance(raw, list):
-        return raw
+        return [i for i in raw if i.get("status") == "active"]
    result = []
    for inc in raw.values():
        if inc.get("status") != "active":
            continue
        # Synthesise a human-readable message if not stored (observer doesn't set one).
        if "message" not in inc:
            inc = dict(inc)
--- a/services/control-plane/tests/test_incident_lifecycle.py
+++ b/services/control-plane/tests/test_incident_lifecycle.py
@ -0,0 +1,333 @@
 """Tests for incident lifecycle: auto-resolve, orphan detection, timestamp parsing."""
 from __future__ import annotations
 import json
 import sys
 import time
 from pathlib import Path
 import pytest
 # Observer lives outside the control-plane package; add scripts/ to path.
 sys.path.insert(0, str(Path(__file__).parent.parent.parent.parent / "scripts"))
 from observer.observer import Observer, _parse_ts, _atomic_write_json
 # ---------------------------------------------------------------------------
 # Helpers
 # ---------------------------------------------------------------------------
 def _make_observer(tmp_path: Path) -> Observer:
    """Return an Observer with all runtime paths redirected to tmp_path."""
    import observer.observer as obs_mod
    world = tmp_path / "world"
    state = tmp_path / "state"
    events = tmp_path / "events"
    logs = tmp_path / "logs"
    repo = tmp_path / "repo"
    for d in (world, state, events, logs, repo / "inventory", repo / "hosts"):
        d.mkdir(parents=True, exist_ok=True)
    # Minimal topology so inventory isn't empty (avoids prune-guard early-return)
    (repo / "inventory" / "topology.yaml").write_text(
        "nodes:\n  vps:\n    roles: [control-plane]\n    connectivity: {}\n"
    )
    original_world = obs_mod.WORLD_DIR
    original_state = obs_mod.STATE_DIR
    original_events = obs_mod.EVENTS_DIR
    original_logs = obs_mod.LOGS_DIR
    original_inventory = obs_mod.INVENTORY_TOPOLOGY
    original_repo = obs_mod.REPO_ROOT
    obs_mod.WORLD_DIR = world
    obs_mod.STATE_DIR = state
    obs_mod.EVENTS_DIR = events
    obs_mod.LOGS_DIR = logs
    obs_mod.INVENTORY_TOPOLOGY = repo / "inventory" / "topology.yaml"
    obs_mod.REPO_ROOT = repo
    obs = Observer()
    # Restore module-level constants (monkeypatching at module level is sufficient
    # for the Observer instance which captures paths at construction time via globals)
    obs_mod.WORLD_DIR = original_world
    obs_mod.STATE_DIR = original_state
    obs_mod.EVENTS_DIR = original_events
    obs_mod.LOGS_DIR = original_logs
    obs_mod.INVENTORY_TOPOLOGY = original_inventory
    obs_mod.REPO_ROOT = original_repo
    return obs
 def _make_observer_simple(tmp_path: Path):
    """Return an Observer built with the module-level paths redirected to tmp_path.

    Unlike _make_observer, the patched module globals are not restored afterwards.
    """
    import observer.observer as obs_mod
    world = tmp_path / "world"
    state = tmp_path / "state"
    events = tmp_path / "events"
    logs = tmp_path / "logs"
    repo = tmp_path / "repo"
    for d in (world, state, events, logs, repo / "inventory", repo / "hosts"):
        d.mkdir(parents=True, exist_ok=True)
    (repo / "inventory" / "topology.yaml").write_text(
        "nodes:\n  vps:\n    roles: [control-plane]\n    connectivity: {}\n"
    )
    # Patch before construction
    obs_mod.WORLD_DIR = world
    obs_mod.STATE_DIR = state
    obs_mod.EVENTS_DIR = events
    obs_mod.LOGS_DIR = logs
    obs_mod.INVENTORY_TOPOLOGY = repo / "inventory" / "topology.yaml"
    obs_mod.REPO_ROOT = repo
    obs = Observer()
    return obs
 # ---------------------------------------------------------------------------
 # 1. _parse_ts — timestamp normalisation
 # ---------------------------------------------------------------------------
 def test_parse_ts_int():
    ts = int(time.time()) - 3600
    assert abs(_parse_ts(ts) - ts) < 1
 def test_parse_ts_float():
    ts = time.time() - 100.5
    assert abs(_parse_ts(ts) - ts) < 0.01
 def test_parse_ts_iso_string():
    # ISO format as emitted by events.py / stability-agent
    from datetime import datetime, timezone
    iso = "2026-06-01T00:03:22Z"
    expected = datetime(2026, 6, 1, 0, 3, 22, tzinfo=timezone.utc).timestamp()
    result = _parse_ts(iso)
    assert result > 0
    assert isinstance(result, float)
    assert abs(result - expected) < 1
 def test_parse_ts_none_returns_zero():
    assert _parse_ts(None) == 0.0
 def test_parse_ts_garbage_returns_zero():
    assert _parse_ts("not-a-date") == 0.0
 def test_parse_ts_zero_int():
    assert _parse_ts(0) == 0.0
 # ---------------------------------------------------------------------------
 # 2. Lifecycle: service_healthy event resolves linked incident
 # ---------------------------------------------------------------------------
 def test_service_healthy_resolves_active_incident(tmp_path):
    obs = _make_observer_simple(tmp_path)
    inc_id = "inc-111-vps-outline"
    obs.world_state["services"]["vps/outline"] = {
        "node": "vps", "service": "outline",
        "status": "unhealthy", "last_check": None,
        "incident_id": inc_id,
    }
    obs.world_state["incidents"][inc_id] = {
        "id": inc_id, "node": "vps", "service": "outline",
        "status": "active", "trigger_type": "service_unhealthy",
        "started_at": int(time.time()) - 600,
        "last_occurrence": int(time.time()) - 600,
        "occurrence_count": 1, "events": [],
    }
    obs.process_event({
        "type": "service_healthy",
        "node": "vps",
        "service": "outline",
        "severity": "info",
        "timestamp": int(time.time()),
        "payload": {},
    })
    assert obs.world_state["services"]["vps/outline"]["status"] == "healthy"
    assert obs.world_state["services"]["vps/outline"]["incident_id"] is None
    assert obs.world_state["incidents"][inc_id]["status"] == "resolved"
 def test_service_healthy_does_not_resolve_other_incidents(tmp_path):
    """service_healthy for service A must not touch incident for service B."""
    obs = _make_observer_simple(tmp_path)
    inc_b = "inc-222-vps-supervisor"
    obs.world_state["services"]["vps/supervisor"] = {
        "node": "vps", "service": "supervisor",
        "status": "unhealthy", "last_check": None,
        "incident_id": inc_b,
    }
    obs.world_state["incidents"][inc_b] = {
        "id": inc_b, "status": "active",
        "last_occurrence": int(time.time()) - 300,
    }
    obs.process_event({
        "type": "service_healthy",
        "node": "vps",
        "service": "outline",   # different service
        "severity": "info",
        "timestamp": int(time.time()),
        "payload": {},
    })
    assert obs.world_state["incidents"][inc_b]["status"] == "active"
 # ---------------------------------------------------------------------------
 # 3. _prune_stale_world: healthy-service-linked incident → immediate resolve
 # ---------------------------------------------------------------------------
 def test_prune_resolves_healthy_linked_incident(tmp_path):
    """If a service is healthy but still points at an active incident, resolve it."""
    obs = _make_observer_simple(tmp_path)
    inc_id = "inc-333-vps-outline"
    obs.world_state["services"]["vps/outline"] = {
        "node": "vps", "service": "outline",
        "status": "healthy",          # <-- healthy but incident_id still set
        "last_check": None,
        "incident_id": inc_id,
    }
    obs.world_state["incidents"][inc_id] = {
        "id": inc_id, "status": "active",
        "started_at": int(time.time()) - 7200,
        "last_occurrence": int(time.time()) - 7200,
    }
    obs._prune_stale_world()
    assert obs.world_state["services"]["vps/outline"]["incident_id"] is None
    assert obs.world_state["incidents"][inc_id]["status"] == "resolved"
 def test_prune_resolves_healthy_linked_incident_iso_timestamp(tmp_path):
    """Healthy-linked incident with ISO-string last_occurrence must still resolve."""
    obs = _make_observer_simple(tmp_path)
    inc_id = "inc-444-vps-outline"
    obs.world_state["services"]["vps/outline"] = {
        "node": "vps", "service": "outline",
        "status": "healthy", "last_check": None, "incident_id": inc_id,
    }
    obs.world_state["incidents"][inc_id] = {
        "id": inc_id, "status": "active",
        "last_occurrence": "2026-06-01T00:03:22Z",  # ISO string from events.py
    }
    obs._prune_stale_world()   # must not raise TypeError
    assert obs.world_state["incidents"][inc_id]["status"] == "resolved"
 # ---------------------------------------------------------------------------
 # 4. _prune_stale_world: orphaned incident (no service link) → resolve after 5 min
 # ---------------------------------------------------------------------------
 def test_prune_resolves_orphaned_incident_old_enough(tmp_path):
    """Orphaned active incident older than 5 min must be auto-resolved."""
    obs = _make_observer_simple(tmp_path)
    inc_id = "inc-555-vps-supervisor"
    # No service entry links to this incident
    obs.world_state["incidents"][inc_id] = {
        "id": inc_id, "status": "active", "node": "vps", "service": "supervisor",
        "last_occurrence": int(time.time()) - 400,   # 6.7 min ago
    }
    obs._prune_stale_world()
    assert obs.world_state["incidents"][inc_id]["status"] == "resolved"
 def test_prune_does_not_resolve_orphaned_incident_too_recent(tmp_path):
    """Orphaned incident younger than 5 min must stay active (guard against race)."""
    obs = _make_observer_simple(tmp_path)
    inc_id = "inc-666-vps-supervisor"
    obs.world_state["incidents"][inc_id] = {
        "id": inc_id, "status": "active",
        "last_occurrence": int(time.time()) - 60,   # 1 min ago — within guard
    }
    obs._prune_stale_world()
    assert obs.world_state["incidents"][inc_id]["status"] == "active"
 def test_prune_resolves_orphaned_incident_iso_timestamp(tmp_path):
    """Orphaned incident with ISO-string last_occurrence must resolve correctly."""
    obs = _make_observer_simple(tmp_path)
    inc_id = "inc-777-vps-outline"
    # ISO timestamp well in the past (2026-06-01)
    obs.world_state["incidents"][inc_id] = {
        "id": inc_id, "status": "active",
        "last_occurrence": "2026-06-01T00:03:22Z",
    }
    obs._prune_stale_world()   # must not raise TypeError
    assert obs.world_state["incidents"][inc_id]["status"] == "resolved"
 def test_prune_does_not_touch_linked_incident(tmp_path):
    """An active incident still linked from a non-healthy service must stay active."""
    obs = _make_observer_simple(tmp_path)
    inc_id = "inc-888-vps-outline"
    obs.world_state["services"]["vps/outline"] = {
        "node": "vps", "service": "outline",
        "status": "unhealthy",   # <-- still unhealthy
        "last_check": None,
        "incident_id": inc_id,
    }
    obs.world_state["incidents"][inc_id] = {
        "id": inc_id, "status": "active",
        "last_occurrence": int(time.time()) - 3600,
    }
    obs._prune_stale_world()
    assert obs.world_state["incidents"][inc_id]["status"] == "active"
 # ---------------------------------------------------------------------------
 # 5. 7-day stale incident prune with ISO resolved_at
 # ---------------------------------------------------------------------------
 def test_prune_removes_old_resolved_incident_iso_resolved_at(tmp_path):
    """Resolved incidents with ISO-string resolved_at older than 7 days must be pruned."""
    obs = _make_observer_simple(tmp_path)
    inc_id = "inc-old-resolved"
    obs.world_state["incidents"][inc_id] = {
        "id": inc_id, "status": "resolved",
        "resolved_at": "2026-05-01T00:00:00Z",  # >7 days before 2026-06-03
    }
    obs._prune_stale_world()
    assert inc_id not in obs.world_state["incidents"]
 def test_prune_keeps_recently_resolved_incident(tmp_path):
    """Resolved incidents within 7 days must be kept."""
    obs = _make_observer_simple(tmp_path)
    inc_id = "inc-recent-resolved"
    obs.world_state["incidents"][inc_id] = {
        "id": inc_id, "status": "resolved",
        "resolved_at": time.time() - 86400,  # 1 day ago
    }
    obs._prune_stale_world()
    assert inc_id in obs.world_state["incidents"]
--- a/services/node-agent/Dockerfile
+++ b/services/node-agent/Dockerfile
@ -14,8 +14,11 @@ RUN apt-get update && apt-get install -y --no-install-recommends \
 # pyyaml      : may be needed for reading host config snippets
 RUN pip install --no-cache-dir "docker>=6.0" psutil pyyaml
 RUN useradd -m -u 1000 homelab
 COPY src/ /app/src/
 ENV PYTHONUNBUFFERED=1
 USER homelab
 CMD ["python", "src/node_agent.py"]
--- a/services/node-agent/docker-compose.yml
+++ b/services/node-agent/docker-compose.yml
@ -2,6 +2,9 @@ services:
  node-agent:
    build: .
    container_name: node-agent
    user: "1000:1000"
    group_add:
      - "999"
    restart: unless-stopped
    environment:
--- a/services/stability-agent/Dockerfile
+++ b/services/stability-agent/Dockerfile
@ -5,6 +5,8 @@ WORKDIR /app
 # No extra dependencies needed beyond standard library for the current script
 # But we might need them if we decide to use libraries later.
 RUN useradd -m -u 1000 homelab
 COPY src/stability_agent.py .
 COPY healthcheck.sh .
 RUN chmod +x healthcheck.sh
@ -12,5 +14,5 @@ RUN chmod +x healthcheck.sh
 # Create the expected directories
 RUN mkdir -p /opt/homelab/state /opt/homelab/events
-# Run the agent
+USER homelab
 CMD ["python", "stability_agent.py"]
--- a/services/stability-agent/docker-compose.yml
+++ b/services/stability-agent/docker-compose.yml
@ -2,6 +2,9 @@ services:
  stability-agent:
    build: .
    container_name: stability-agent
    user: "1000:1000"
    group_add:
      - "999"
    restart: unless-stopped
    volumes:
      - /opt/homelab:/opt/homelab
Author	SHA1	Message	Date
Oskar Kapala	58ac6edd7d	fix(stability-agent): run as uid 1000 with docker group access stability-agent had no USER instruction and no user: in compose, running as root and writing root-owned files to /opt/homelab bind-mount. - Dockerfile: add useradd -m -u 1000 homelab + USER homelab - docker-compose.yml: add user: "1000:1000" and group_add: ["999"] (GID 999 = docker group on VPS) to retain docker.sock:ro access Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-03 18:20:54 +02:00
Oskar Kapala	19fd8799d9	fix(node-agent): run as uid 1000 with docker group access node-agent had no USER instruction and no user: in compose, running as root and writing root-owned files to /opt/homelab bind-mount. - Dockerfile: add useradd -m -u 1000 homelab + USER homelab - docker-compose.yml: add user: "1000:1000" and group_add: ["999"] (GID 999 = docker group on VPS) to retain docker.sock access Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-03 18:20:31 +02:00
Oskar Kapala	7f17b65278	fix(control-plane): run executor as uid 1000 with docker group access Executor was the only control-plane container running as root (uid=0), writing root-owned files to /opt/homelab via bind-mount and triggering false sudo on every deploy. - Dockerfile: add USER homelab after useradd (useradd already present) - docker-compose.yml: add user: "1000:1000" and group_add: ["999"] (GID 999 = docker group on VPS) so executor retains docker.sock access Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-03 18:19:58 +02:00
Oskar Kapala	e6a2443412	fix(dev): agent.sh worktree_count/paths grep exit-1 on empty set grep -cv (and grep -v) return exit code 1 when there are zero matches. With set -euo pipefail this silently aborted the script before count was returned — causing 'agent.sh new' to fail on a fresh repo with no existing worktrees. Fix: move the grep -v into worktree_paths with '\|\| true' so the function always exits 0, then derive worktree_count via wc -l.	2026-06-03 18:04:38 +02:00
Oskar Kapala	f9b145585f	fix(dev): agent.sh validate_name set -e safety + ERR trap Refactor [ test ] && prefail pattern to if/then/fi — set -euo pipefail was silently exiting after the loop because the failing-test compound propagated exit code 1 through the function return. Add ERR trap so future silent fails get diagnosed at the source.	2026-06-03 18:02:50 +02:00
Oskar Kapala	3b620ef7e3	docs(claude): multi-agent worktree mode section Main checkout = deploy-only. .agent-task marker triggers mandatory loading of worktree-aware skill. Only the human runs scripts/dev/agent.sh.	2026-06-03 17:41:35 +02:00
Oskar Kapala	745e52723c	feat(skills): worktree-aware skill for Claude Code Encodes branch hygiene for CC running in task worktrees: commit only to assigned branch, no push origin master, no touching main checkout, no git add -A, no worktree management, mandatory final report.	2026-06-03 17:41:35 +02:00
Oskar Kapala	1abe925f65	feat(dev): scripts/dev/agent.sh — multi-agent worktree dispatcher new/list/merge/clean. Decisions: branch task/<name>, sibling worktree ~/homelab-codex-ws-<name>, ff-only auto-merge, cap 4.	2026-06-03 17:41:35 +02:00
Oskar Kapala	1c69a5bc29	feat(skills): save-session skill for Claude Code Records session facts (git log, diff --stat, deploys from transcript) by appending to docs/sessions/YYYY-MM-DD.md with a mandatory narrative placeholder. Never touches backlog.md or CLAUDE.md without explicit instruction. Commits only the session file. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-03 16:06:46 +02:00
Oskar Kapala	02e7c28823	feat(skills): deploy skill for Claude Code Instructs CC to always route deploy/redeploy/ship/wdróż requests through scripts/deploy/deploy.sh, maps exit codes to required actions, and enforces no-bypass rules for gate and branch checks. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-03 16:06:40 +02:00
Oskar Kapala	db592fbc28	feat(deploy): Saturn-side dispatcher wrapper Replaces the per-node staged framework with a single entry point that runs from SATURN: preflight (branch/clean-tree/push/SSH), gate (pytest + docker build per service), execute (control-plane.sh --ssh or remote deploy-node.sh), verify (docker ps), and one-line report. Exit codes: 0=ok 1=preflight 2=gate 3=execute 4=verify 5=sudo-handoff. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-03 16:06:36 +02:00
Oskar Kapala	00fc36df3a	fix(deploy): skip sudo chown/chmod when /opt/homelab ownership is already correct deploy-local.sh previously ran `sudo chown -R 1000:1000` and `sudo chmod -R 775` unconditionally on every deploy, which blocked non-TTY execution (CC/CI) on VPS where /opt/homelab is already 1000:1000. Both steps are now conditional using `find ... -print -quit`: - chown: runs only if any file/dir is NOT uid/gid 1000 - chmod: runs only if any directory is missing -775 permission bits When everything is correct (steady state on VPS), both steps log "already correct, skipping" and never invoke sudo. If a new directory was created by root (e.g. a manual mkdir, volume mount, or restart artefact), the remediation path triggers automatically — the self-heal property is preserved. Smoke-tested in Docker (ubuntu:22.04): Case 1 (1000:1000 + 775): chown skipped, chmod skipped ✓ Case 2 (root-owned subdir): chown triggered ✓ Case 3 (700 dir perms): chmod triggered ✓ Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-03 15:44:44 +02:00
Oskar Kapala	f5dcefc752	fix(observer): robust incident lifecycle + orphan auto-resolve Two root causes for stale "active" incidents on the dashboard: 1. TypeError bug in _prune_stale_world: last_occurrence / resolved_at can be an ISO-8601 string (stability-agent via events.py) or a Unix int (node-agent). The previous session's auto-resolve did plain `time.time() - last_occ` which raises TypeError for strings, silently preventing _save_world() from being called and leaving incidents perpetually "active" on disk. Fix: add _parse_ts(ts) -> float that handles int, float, and ISO-8601 strings uniformly. All timestamp arithmetic now goes through it; returns 0.0 on None / garbage to keep comparisons safe. 2. Orphaned active incidents: _resolve_incident clears service["incident_id"] and marks the incident "resolved" in memory, but if incidents.json was truncated mid-write (pre-atomic-write era), the observer loaded it at next startup with status="active" and no service entry pointing to it. No code ever touched these orphans again. Fix: _prune_stale_world now runs two cleanup passes each cycle: - Case 1 (healthy-linked): service.status=="healthy" AND incident_id still set → resolve immediately (service cannot have active incident) - Case 2 (orphaned): active incident with no service link AND last_occurrence > 5 min ago → resolve (5-min guard for creation race) Both cases are wrapped in try/except so a bug here never crashes the observer loop or blocks _save_world. Also fixes the 7-day stale-incident prune to use _parse_ts so ISO-string resolved_at values are handled correctly. 3. Operator UI: current_incidents() now filters to status=="active" only. Resolved incidents were previously included in the /incidents endpoint, making the dashboard show a wall of historical records as if active. 
Nocturnal job investigation: _cleanup_control_plane_fs in node-agent runs every 60s on VPS (not midnight-specific); it reads observer_checkpoint.json (now written atomically) and deletes old event files. No non-atomic writes found. Midnight clustering was likely external (logrotate / OS flush); the supervisor's resilient loader already handles such transient issues. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-03 14:29:12 +02:00
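The TypeError described in the f5dcefc752 message is straightforward to reproduce. The sketch below mirrors the `_parse_ts` helper from the observer diff under the hypothetical name `parse_ts`. The fixed `now` value and the sample ISO string are chosen only for illustration.

```python
from datetime import datetime, timezone


def parse_ts(ts) -> float:
    """Normalise an int/float Unix timestamp or an ISO-8601 string to a float."""
    if ts is None:
        return 0.0
    if isinstance(ts, (int, float)):
        return float(ts)
    try:
        # Older Python versions reject a trailing "Z", so map it to an explicit offset.
        return datetime.fromisoformat(str(ts).replace("Z", "+00:00")).timestamp()
    except ValueError:
        return 0.0


now = datetime(2026, 6, 3, 12, 0, 0, tzinfo=timezone.utc).timestamp()
iso = "2026-06-01T00:03:22Z"  # shape emitted by events.py per the commit message

try:
    now - iso  # float minus str: the failure mode behind the stale incidents
except TypeError as exc:
    print("raw subtraction fails:", type(exc).__name__)

age_seconds = int(now - parse_ts(iso))
print("age via parse_ts:", age_seconds, "s")
```

Returning 0.0 for missing or unparseable input makes such records look very old, so age comparisons against them succeed rather than raise.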