Commit graph

47 commits

Author SHA1 Message Date
Oskar Kapala c9ee8eb06d fix(observer): quarantine malformed event files to prevent processing wedge
Recovery from bad merge of task/observer-poison-quarantine (c255a02)
which carried false deletes from a stale branch base. Re-applies only
the genuine observer changes on top of correct master state.

When an event file fails to parse (malformed JSON, truncated, corrupted),
the observer previously kept retrying on every cycle while the node's
checkpoint stayed pinned — all subsequent good events for that node lost.

Now: first parse failure -> atomic os.replace to STATE_DIR/observer_failed_events/<node>/
with collision handling. Checkpoint advances, downstream events flow.
Move failures are logged but don't crash the loop.

Complementary to the atomic_write_json fix on state files; this addresses
the same race-pattern on event files instead.

Regression test asserts: bad event quarantined to failed_events dir,
removed from hot path, subsequent good event processed (node online),
checkpoint moves to good event.
2026-06-12 13:11:15 +02:00
Oskar Kapala 1304c8449f feat(onboard): implement 40-register + 50-verify, remove dead scaffold
- 40-register.sh: idempotent — dopisuje lustro do topology.yaml + tworzy
  hosts/<node>/services.yaml, commituje na bieżącym branchu (bez push)
- 50-verify.sh: 4 checki — node-agent running, eventy, observer restart +
  heartbeat poll, world/nodes.json; tabela pass/fail; exit 1 on failure
- 40-deploy-node-agent.sh: usunięty (martwy scaffold; deploy w 30-node-agent.sh)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-09 20:36:00 +02:00
Oskar Kapala a99bf9dadc fix(onboard): 30-node-agent — mkdir -p deploy dir before rsync
rsync fails with "No such file or directory" when intermediate dirs
don't exist. /opt/homelab/deploy/ is not created by 20-base.sh.
Add rrun mkdir -p before rsync_dir; pi owns /opt/homelab so no sudo.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-09 14:46:01 +02:00
Oskar Kapala f6342749e6 feat(onboard): add 30-node-agent.sh + lustro node-agent override
Push-based deploy step for LUSTRO (git_control=false): rsync
services/node-agent/ and the host override to /opt/homelab/deploy/node-agent/
on the remote, then docker compose up --build via SSH.

Guard by effect: skip push+build+up if node-agent container already running
(docker ps filter, not command -v). Verify: container running + events appear
in /opt/homelab/events/lustro/ within 90 s (confirms agent write path).

Override (hosts/lustro/runtime/node-agent/docker-compose.override.yml):
- group_add: ["991"]  (docker GID on LUSTRO; 999 from base concatenated — harmless)
- mem_limit: 256m  (MagicMirror ~1.9 GiB; agent must be bounded)
- /home/pi/.ssh:/root/.ssh:ro  (not /home/oskar/.ssh — pi user)
- /opt/homelab/deploy/node-agent:/repo:ro  (no repo checkout on push-based node)
- NODE_NAME=lustro, NODE_TYPE=sd_card, VPS_EVENTS_HOST=100.95.58.48

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-09 14:24:39 +02:00
Oskar Kapala 415479454a fix(onboard): 20-base.sh — popraw guard idempotencji swap→zram
Stary guard porównywał literał konfigu (SIZE=) zamiast sprawdzać efekt.
Ręcznie postawiony zram był pomijany (dpkg -l vs command -v) i config
był nadpisywany niepotrzebnie.

- Guard by effect: sudo swapon --show | grep /dev/zram + dphys nieaktywny
  → cała sekcja skip bez wchodzenia w substages
- Detekcja pakietu przez dpkg -l zram-tools (nie command -v zramswap — PATH)
- Config: PERCENT=50 (skaluje z RAM) zamiast SIZE=; printf '%s\n' | sudo tee
- Wszystkie weryfikacje zram przez sudo swapon --show (nie zramctl)
- Usuń parsowanie hardware.swap.mb (nieużywane po przejściu na PERCENT)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-09 13:30:12 +02:00
Oskar Kapala d81ac27ebb feat(onboard): implement 20-base.sh for LUSTRO — swap→zram, /opt/homelab, event dir
Three idempotent stages with guards (probe-before-mutate), rrun() for all
remote mutations, rprobe() for unconditional state queries. Reads
hardware.swap.mb from node.yaml (default 2048 MB). Adds swap.mb: 2048
to hosts/lustro/node.yaml so the value is declarative.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-09 12:21:53 +02:00
Oskar Kapala d2fb2b3d41 docs: onboard README + CLAUDE.md worktree discipline reminder
scripts/onboard/README.md (new):
- Tool purpose and --node/--step/--from/--dry-run usage
- Full node.yaml field schema with annotations (ssh_user uid-1000
  gotcha, first_contact IP vs .local, deploy_autonomy/git_control gates)
- Step status table (00-access DONE, 00-preflight SCAFFOLD, 10-50 TODO)
- lib/ architecture: run() dry-run convention, yaml_get fallback caveats
- Gotchas/Learnings table from session

CLAUDE.md:
- Node Onboarding section: onboard.sh commands, pointer to README
- Multi-agent worktree mode: add explicit DISCIPLINE RULE — feature
  work must happen in agent.sh worktrees, not the main checkout;
  references the 2026-06-08 session that violated this

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-08 22:31:12 +02:00
Oskar Kapala 471ba09c4a fix(onboard/00-access): suppress known-hosts warning in Tailscale verify
On first SSH to a new mesh hostname, OpenSSH emits
"Warning: Permanently added 'lustro' to the list of known hosts"
on stderr. The previous code used 2>&1, merging it into the captured
arch variable, which caused the arch assertion to fail with
arch="Warning:Permanentlyadded...".

Fix:
- Add dedicated _TS_SSH opts array with -o LogLevel=ERROR, which
  suppresses INFO-level messages (known-hosts, banner) at source
- Remove 2>&1 — stderr is no longer merged into the captured value
- Run only `uname -m` instead of `echo ok && uname -m`; take the last
  non-empty stdout line to be robust against any remaining preamble
- Change arch mismatch from warn to die in live mode (warn in dry-run)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-08 15:28:21 +02:00
Oskar Kapala eed0ad0635 fix(onboard): fix yaml_get fallback — strip inline comments and fix greedy colon match
Two bugs in the grep+sed fallback (triggered when yq is unavailable):

1. Greedy colon match: `s/.*: *//` consumed the *last* `: ` in the line, so
   values containing a colon (e.g. `systemd:magicmirror.service`) were
   silently truncated to the portion after the last colon.
   Fix: `s/^[[:space:]]*[^:]*:[[:space:]]*//' — anchored at line start,
   key chars are `[^:]*` (no colons), so only the first `: ` separator is removed.

2. Inline YAML comment not stripped: `first_contact: pi@pimirror2.local   # ...`
   returned the full tail including `#`, breaking callers like ssh-copy-id.
   Fix: add `s/[[:space:]]\+#.*$//` — requires at least one space before `#`
   to preserve bare `#` characters inside a value.

Also add leading/trailing whitespace trim as a separate pass.
Both bugs affect any node.yaml field that has an inline comment or a colon
in its value; all ten fields in hosts/lustro/node.yaml now parse correctly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-08 15:16:06 +02:00
Oskar Kapala 931fd46e62 fix(onboard): propagate dry-run into steps via run() helper
DRY_RUN now uses 1/0 instead of "true"/"false" across all onboard scripts.

common.sh: add run() — wraps mutations; prints "[dry-run] would: ..." when
  DRY_RUN=1. Exported via `export -f run` so child bash processes inherit it.

onboard.sh: remove the `--dry-run → dryrun "Would execute" → continue` bypass.
  Steps now always execute; DRY_RUN=1 is exported so each step's own run()
  calls handle simulation. The orchestrator no longer needs to know step internals.

remote.sh: update DRY_RUN checks to [ "${DRY_RUN:-0}" = 1 ] for consistency.

00-access.sh: remove all if/else DRY_RUN blocks; replace with:
  - Mutations (ssh-copy-id, curl install, tailscale up) wrapped in run()
  - Probes (SSH BatchMode test, command -v, _ts_state) run unconditionally
    so dry-run reports real current state ("key present → skip" vs "would: ...")
  - Stage 3 verify runs always; SSH failure is die in live mode, warn in
    dry-run (Tailscale not yet joined is expected on a fresh node)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-08 15:01:09 +02:00
Oskar Kapala 9012a36827 feat(onboard): add 00-access step + update lustro node.yaml
00-access.sh implements a 3-stage idempotent access bootstrap:
  1. ensure_ssh_key  — ssh-copy-id to first_contact (pi@pimirror2.local),
     skips if BatchMode key-auth already passes
  2. ensure_tailscale — install via install.sh if missing, then tailscale up
     --hostname=lustro; prints interactive auth URL to operator, blocks until
     authenticated; skips if BackendState already Running
  3. verify — SSH over Tailscale to pi@lustro, asserts 'ok' + arch=aarch64

Reads first_contact and tailscale.hostname from node.yaml.
Respects --dry-run. No NOPASSWD or /opt/homelab mutations.

hosts/lustro/node.yaml: fill known hardware facts (arm64, 4096 MB RAM,
zram swap, docker_present, mm_runtime=systemd:magicmirror.service),
add ssh_user=pi, first_contact=pi@pimirror2.local,
services.node-agent.runtime engine=docker mem_limit=256m.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-08 14:43:16 +02:00
Oskar Kapala adb84079ab feat(onboard): add node onboarding scaffold (bash, idempotent)
- scripts/onboard/onboard.sh: orchestrator with --node/--step/--from/--dry-run flags,
  deploy_autonomy + git_control gates, lexicographic step ordering
- scripts/onboard/lib/common.sh: log/warn/die/step helpers, yaml_get (yq+grep/sed fallback),
  ensure_line, git() wrapper enforcing --no-pager
- scripts/onboard/lib/remote.sh: rrun/rcopy/rsync_dir/rcheck SSH wrappers, dry-run aware
- scripts/onboard/steps/00-preflight.sh: read-only fact collection (arch, RAM, disk, docker,
  tailscale, MagicMirror runtime, swap), human report + machine YAML snippet
- scripts/onboard/steps/10-50: stub files with TODO headers, no mutations
- hosts/lustro/node.yaml: LUSTRO edge node draft (KEN, role=edge, deploy_autonomy=true,
  git_control=false); hardware fields marked TODO for preflight population

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-08 14:23:21 +02:00
Oskar Kapala e6a2443412 fix(dev): agent.sh worktree_count/paths grep exit-1 on empty set
grep -cv (and grep -v) return exit code 1 when there are zero matches.
With set -euo pipefail this silently aborted the script before count
was returned — causing 'agent.sh new' to fail on a fresh repo with no
existing worktrees.

Fix: move the grep -v into worktree_paths with '|| true' so the
function always exits 0, then derive worktree_count via wc -l.
2026-06-03 18:04:38 +02:00
Oskar Kapala f9b145585f fix(dev): agent.sh validate_name set -e safety + ERR trap
Refactor [ test ] && prefail pattern to if/then/fi — set -euo pipefail
was silently exiting after the loop because the failing-test compound
propagated exit code 1 through the function return.

Add ERR trap so future silent fails get diagnosed at the source.
2026-06-03 18:02:50 +02:00
Oskar Kapala 1abe925f65 feat(dev): scripts/dev/agent.sh — multi-agent worktree dispatcher
new/list/merge/clean. Decisions: branch task/<name>, sibling worktree
~/homelab-codex-ws-<name>, ff-only auto-merge, cap 4.
2026-06-03 17:41:35 +02:00
Oskar Kapala db592fbc28 feat(deploy): Saturn-side dispatcher wrapper
Replaces the per-node staged framework with a single entry point that
runs from SATURN: preflight (branch/clean-tree/push/SSH), gate (pytest +
docker build per service), execute (control-plane.sh --ssh or remote
deploy-node.sh), verify (docker ps), and one-line report.

Exit codes: 0=ok 1=preflight 2=gate 3=execute 4=verify 5=sudo-handoff.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-03 16:06:36 +02:00
Oskar Kapala f5dcefc752 fix(observer): robust incident lifecycle + orphan auto-resolve
Two root causes for stale "active" incidents on the dashboard:

1. TypeError bug in _prune_stale_world: last_occurrence / resolved_at
   can be an ISO-8601 string (stability-agent via events.py) or a Unix
   int (node-agent).  The previous session's auto-resolve did plain
   `time.time() - last_occ` which raises TypeError for strings,
   silently preventing _save_world() from being called and leaving
   incidents perpetually "active" on disk.

   Fix: add _parse_ts(ts) -> float that handles int, float, and
   ISO-8601 strings uniformly. All timestamp arithmetic now goes through
   it; returns 0.0 on None / garbage to keep comparisons safe.

2. Orphaned active incidents: _resolve_incident clears service["incident_id"]
   and marks the incident "resolved" in memory, but if incidents.json was
   truncated mid-write (pre-atomic-write era), the observer loaded it at
   next startup with status="active" and no service entry pointing to it.
   No code ever touched these orphans again.

   Fix: _prune_stale_world now runs two cleanup passes each cycle:
   - Case 1 (healthy-linked): service.status=="healthy" AND incident_id
     still set → resolve immediately (service cannot have active incident)
   - Case 2 (orphaned): active incident with no service link AND
     last_occurrence > 5 min ago → resolve (5-min guard for creation race)

   Both cases are wrapped in try/except so a bug here never crashes the
   observer loop or blocks _save_world.

   Also fixes the 7-day stale-incident prune to use _parse_ts so
   ISO-string resolved_at values are handled correctly.

3. Operator UI: current_incidents() now filters to status=="active" only.
   Resolved incidents were previously included in the /incidents endpoint,
   making the dashboard show a wall of historical records as if active.

Nocturnal job investigation: _cleanup_control_plane_fs in node-agent runs
every 60s on VPS (not midnight-specific); it reads observer_checkpoint.json
(now written atomically) and deletes old event files. No non-atomic writes
found. Midnight clustering was likely external (logrotate / OS flush);
the supervisor's resilient loader already handles such transient issues.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-03 14:29:12 +02:00
Oskar Kapala ffb0608b9a fix(observer): atomic writes for world state files
All JSON state writes (services.json, nodes.json, incidents.json,
deployments.json, runtime-summary.json, observer_checkpoint.json) now use
_atomic_write_json: write to a .tmp sibling, fsync, then os.replace.
This eliminates the truncated-write window that caused supervisors
reading mid-write files to see empty/partial JSON.

Also adds auto-resolution of phantom active incidents: if a service
reports status=healthy and its incident's last_occurrence is >30 min old,
the incident is resolved in _prune_stale_world. This clears false active
incidents accumulated from previous race-condition reads.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-03 12:26:49 +02:00
Oskar Kapala b40b832159 Fix ghost service keys from hash-prefixed Docker container names
node-agent: use com.docker.compose.service label as canonical name
- Add _canonical_container_name() method: prefers compose label,
  falls back to hash-prefix-stripped c.name
- Replace bare c.name usage in check_containers()
- Skip 'created'-state containers (Docker stale-state artifacts)

observer: prune hash-prefixed ghost keys in _prune_stale_world()
- Each reconcile cycle removes service keys matching <node>/<12hex>_<name>
- Acts as safety net for entries already in services.json + future slippage

control-plane/docker-compose.yml already has explicit container_name on
all four services — no change needed there.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 15:41:13 +02:00
Oskar Kapala 28e9534765 observer: service_healthy resolves active incidents
service_healthy is a positive health confirmation — if the service had
an active incident (e.g. from earlier service_unhealthy events), that
incident should be resolved when the service is confirmed healthy.

Previously only service_recovered resolved incidents; service_healthy
set status=healthy but left incidents open, keeping status='degraded'.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 15:20:19 +02:00
Oskar Kapala 4e8968f9c7 Fix service health tracking: emit service_healthy, control-plane endpoint check, cleanup checkpoint migration
- node_agent: emit service_healthy for all running managed containers so
  observer populates services.json (previously empty → supervisor flooded
  action queue with missing_service redeploys for healthy services)
- node_agent: VPS-only _check_control_plane_health() probes the HTTP
  endpoint to emit service_healthy/unhealthy for the 'control-plane' logical
  service (multi-container stack, container names don't match service name)
- node_agent: fix _cleanup_control_plane_fs() to read new node_checkpoints
  format from observer checkpoint (was reading old last_processed_file key,
  always found nothing, never cleaned up old events)
- observer: handle service_healthy event type → sets service status healthy
  without resolving incidents (unlike service_recovered which also resolves)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 14:49:56 +02:00
Oskar Kapala f4a8db93e4 fix(observer): per-node-directory checkpoints replace single global checkpoint
The old mechanism tracked a single 'last_processed_file' and used sorted
filename order to find new events.  Remote nodes ship events into
subdirectories (events/piha/, events/chelsty-infra/) that sort
alphabetically BEFORE the VPS directory (events/vps/).  Once the
checkpoint pointed to a vps/ file, all piha/ and chelsty-infra/ events
were silently skipped forever.

New mechanism:
- node_checkpoints: {node_dir: last_processed_path}
- Each node directory has its own independent cursor
- New events = files whose path > that node's checkpoint
- Backward-compatible: old 'last_processed_file' is migrated by extracting
  the node dir from the path on first load

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 14:16:58 +02:00
Oskar Kapala 96bf32614f fix(observer+operator-ui): fix stale world state, dict→list API, event time filter
Root cause of stale data:
- node_agent.py falls back to socket.gethostname() when NODE_NAME is unset.
  Inside a Docker container this returns the 12-char container ID (e.g.
  'be17cb6eb0f6'), not the host name.  Observer ingested those events and
  created ghost entries in world/nodes.json that never expired.

observer.py:
- _prune_stale_world(): removes node/service/incident entries for nodes absent
  from topology inventory; called on every run_once() cycle (both new-events
  and idle paths).  Resolved incidents older than 7 days are also aged out.
- _save_world(): now writes node_count and service_count to runtime-summary.json
  so the Dashboard's System Overview cards show real numbers instead of undefined.

operator_ui.py:
- current_nodes/services/deployments/incidents(): the observer stores world state
  as keyed dicts; the frontend calls .map() which requires an array.  All four
  functions now convert the dict to a properly-shaped list.  Each item has the
  fields the Nodes, Services, Topology, Deployments, and Correlation views expect
  (hostname, health, capabilities, desired_state, dependencies, etc.).
- current_incidents(): synthesises a human-readable 'message' field from node +
  service + trigger_type (observer does not store one; dashboard showed undefined).
- current_events(): adds a 24 h time filter (EVENTS_MAX_AGE_HOURS env var,
  default 24).  Without this, every event file ever written was returned,
  including events from ghost-node deploys.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 13:51:03 +02:00
Oskar Kapala 01b7758fe6 feat(node-agent): implement health monitor and safe cleanup policy
scripts/monitor/health-monitor.sh (new):
- Standalone bash health monitor: disk/RAM/CPU checks + docker container health
- Per-node-type cleanup policy enforced:
    lte_node  (chelsty-infra, chelsty-ha): NO cleanup, no docker ops
    sd_card   (piha, saturn): dangling images + containers, rate-limited once/24h
    ai_node   (solaria): dangling + containers + build cache, NEVER -a
    standard  (vps): dangling + containers + build cache + CP filesystem rotation
- VPS filesystem rotation: completed/failed actions >7d, deploy logs >30d,
  events >3d AND past observer checkpoint
- Emits structured JSON events (node_health, disk_pressure, high_memory, high_cpu,
  containers_not_running, healthcheck_failed)

services/node-agent/ (new):
- Python daemon (node_agent.py): same policy as bash script, Docker SDK
  for container checks and cleanup, /proc for system metrics
- Optional event shipping to VPS via rsync+SSH (VPS_EVENTS_HOST env var)
- Dockerfile: python:3.11-slim + openssh-client + rsync + docker>=6.0
- docker-compose.yml: mounts docker socket, /opt/homelab, repo read-only

observer.py:
- Handle node_health: update node status + disk/mem/cpu metrics, clear disk_pressure
- Handle disk_pressure: record severity on node, clear when healthy
- Handle high_memory / high_cpu: record pressure level for correlation

supervisor.py:
- Add NO_DISK_CLEANUP_NODES = {chelsty-infra, chelsty-ha}
- reconcile() step 3: generate disk_cleanup actions for nodes with high disk pressure
- _generate_disk_cleanup_recommendation(): stable ID disk-cleanup-{node},
  checks all active states, risk=guarded (operator approval required)

executor.py:
- Handle disk_cleanup action type via _execute_disk_cleanup()
- Commands come from action payload; safety gate rejects any command touching
  /opt/homelab/data/, /opt/homelab/config/, /opt/homelab/state/, or rm -rf /

hosts/*/services.yaml:
- Rename stability-agent -> node-agent on piha, vps, solaria, chelsty-infra
- Add node-agent to chelsty-ha (previously missing)
- Add cleanup policy notes to LTE node comments

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 13:15:06 +02:00
Oskar Kapala 7742bda245 feat(control-plane): add container_restart remediation
- observer: store trigger_type on incidents for supervisor routing
- supervisor: route containers_not_running/mqtt_unreachable to container_restart instead of redeploy
- supervisor: fix node alias normalization via NODE_ALIAS_MAP
- supervisor: fix pending action dedup (scan by content not filename)
- executor: implement container_restart via SSH docker restart with retry
- control-plane override: configure NODE_ALIAS_MAP for production

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 12:50:46 +02:00
oskar beb8b5cbaa fix: remove --pull always flag incompatible with docker-compose v1 2026-05-21 22:07:49 +02:00
oskar 898deda05f fix: deploy-frigate.sh use docker-compose v1 for chelsty-infra 2026-05-21 22:05:43 +02:00
oskar f34399a30d feat: add Frigate NVR deployment for chelsty-infra
VAAPI decode via Intel UHD 630, CPU detection, 2x Reolink RLC-540
placeholders. MQTT to local mosquitto (127.0.0.1), 7-day recording
retention. Secrets in /opt/homelab/config/frigate/frigate.env on node.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 18:19:45 +02:00
oskar b02c8bb50e fix(deploy): inventory-aware orchestration and correct override paths
- orchestrate-deploy.sh: read nodes from inventory/topology.yaml instead of hardcoded list
- orchestrate-deploy.sh: LTE nodes (chelsty-infra, chelsty-ha) use ConnectTimeout=30, non-fatal on failure
- deploy-node.sh: service discovery falls back to services.yaml if no services.txt
- deploy-node.sh: override path corrected to hosts/<node>/runtime/<service>/

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-20 14:50:01 +02:00
oskar f65698925e Fix control plane SSH deploy TTY 2026-05-18 21:41:47 +02:00
oskar 9f20dcae05 Add control plane deploy script and fix UI healthcheck 2026-05-18 21:34:57 +02:00
oskar c299a2cb85 Fix agent fleet verification via Redis container 2026-05-17 23:00:51 +02:00
oskar b129f03837 Fix stability agent fleet deploy scripts 2026-05-17 21:09:06 +02:00
oskar b7faac00c5 Add executable stability agent fleet deploy scripts 2026-05-17 17:32:10 +02:00
oskar 8f305ba3df Merge VPS control plane deployment and observer runtime 2026-05-17 17:30:04 +02:00
oskar c9ddfa9ac1 Roll out stability agent to homelab nodes 2026-05-17 15:54:19 +02:00
Oskar Kapala 533b8e846d Add heartbeat updates and improve health checks in control-plane components 2026-05-12 20:59:46 +02:00
Oskar Kapala 2029457f57 Implement VPS control-plane deployment profile 2026-05-12 20:19:05 +02:00
Oskar Kapala 8f5b905015 Implement observer runtime world synthesis engine 2026-05-12 14:07:03 +02:00
Oskar Kapala 431d777989 Implement filesystem-first runtime event system 2026-05-12 13:38:25 +02:00
Oskar Kapala 0eeb0ac600 Implement reproducible node onboarding 2026-05-12 13:18:00 +02:00
Oskar Kapala 81bce00bf3 Bootstrap CHELSTY runtime stack 2026-05-11 21:36:10 +02:00
Oskar Kapala b524a3886a Harden deployment runtime framework 2026-05-11 21:20:13 +02:00
Oskar Kapala 5947ddd03d Implement staged deployment runtime 2026-05-11 21:04:24 +02:00
Oskar Kapala bbdbdb8321 Add node capability model 2026-05-11 20:46:50 +02:00
Oskar Kapala d0540f7eb8 Add infrastructure standards and deployment conventions 2026-05-07 21:16:03 +02:00
Oskar Kapala 2b5d59ae27 Initial homelab workspace structure 2026-05-07 20:17:27 +02:00