Commit graph

91 commits

Author SHA1 Message Date
Oskar Kapala 28e9534765 observer: service_healthy resolves active incidents
service_healthy is a positive health confirmation — if the service had
an active incident (e.g. from earlier service_unhealthy events), that
incident should be resolved when the service is confirmed healthy.

Previously only service_recovered resolved incidents; service_healthy
set status=healthy but left incidents open, keeping status='degraded'.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 15:20:19 +02:00
Oskar Kapala 46ae92b5c1 supervisor: also cancel pending actions for services removed from desired state
Previously _cancel_resolved_pending_actions() only cancelled actions where
the service became healthy. This left orphaned actions when a service was
removed from services.yaml or marked monitor:false.

Add Case 1: if the action's svc_key is no longer in desired_state (either
removed entirely or skipped due to monitor:false), cancel with reason
service_removed_from_desired_state.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 15:19:13 +02:00
Oskar Kapala 410bfe7065 zigbee2mqtt: config goes in data dir (writable), not separate ro mount
z2m migrates configuration.yaml on startup and needs write access.
Remove the separate :ro config mount; rely on the base compose's
/opt/homelab/data/zigbee2mqtt/data:/app/data read-write mount instead.
configuration.yaml must exist at that path on the node before first run.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 15:13:33 +02:00
Oskar Kapala b3912fe0ce zigbee2mqtt: use extra_hosts host-gateway instead of network_mode: host
docker-compose v1 cannot clear the ports list from the base compose with
ports: [] in an override, so network_mode: host caused InvalidArgument.

Use extra_hosts with host-gateway instead: maps 'mosquitto' hostname to the
Docker bridge gateway IP so mqtt://mosquitto:1883 reaches the host-networked
mosquitto process from within the bridge-networked z2m container.
Requires Docker 20.10+ (present on chelsty-infra).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 15:12:33 +02:00
Oskar Kapala 61e07f4318 zigbee2mqtt override: clear ports list for docker-compose v1 host network compat
docker-compose v1 (1.29.2 on chelsty-infra) raises InvalidArgument when
network_mode: host is combined with port_bindings from the base compose file.
Add ports: [] in the override to clear the base ports list.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 15:11:42 +02:00
Oskar Kapala 51002d4502 Fix pending actions: node_exporter, zigbee2mqtt, chelsty-ha monitoring
node_exporter (new service):
- Add services/node_exporter/docker-compose.yml matching solaria deployment
  (network_mode: host, pid: host, /:/host:ro,rslave mount)
- Add services/node_exporter/service.yaml

zigbee2mqtt chelsty-infra override:
- Fix network_mode: host (mosquitto runs on host network, port 1883 on localhost)
- Fix volume mount: ./configuration.yaml → absolute /opt/homelab/config/zigbee2mqtt/
  (secrets stay in runtime config dir, never in Git)
- Remove MQTT_USER/MQTT_PASSWORD (mosquitto uses allow_anonymous true)
- Extend healthcheck start_period to 60s (z2m takes time on first start)

chelsty-ha/services.yaml:
- Remove node-agent entry entirely (never deployed, no plans to bootstrap now)
- Keep homeassistant with monitor: false (no node-agent = no health events)

supervisor: respect monitor: false in services.yaml
- Skip action generation for services where monitor=false
- Cleans up chelsty-ha entries from action queue without removing desired-state docs

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 15:10:48 +02:00
Oskar Kapala fb7828b52b supervisor: auto-cancel pending actions when drift is resolved
When a service becomes healthy (node-agent emits service_healthy → observer
updates services.json), any previously queued redeploy/container_restart
action is stale. Without cleanup, the queue accumulates old actions that
require manual rejection.

_cancel_resolved_pending_actions() runs after each reconcile cycle:
- Reads all pending/*.json with type=redeploy or container_restart
- If the service is now healthy in actual_state, moves action to cancelled/
  with reason=drift_resolved_auto
- Only pending actions are touched; approved/running are left to the operator

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 14:58:55 +02:00
Oskar Kapala 2f1965733f fix(node-agent): unique event IDs per service to prevent same-second overwrites
Multiple service_healthy (or containers_not_running) events emitted in the
same second for different containers shared the same filename pattern
evt-{node}-{ts}-{type}.json — the second write silently overwrote the first,
so the observer only ever saw the last container checked per event type per cycle.

Fix: include a sanitized service name slug in the ID so every event gets a
unique file, e.g. evt-vps-1234-service_healthy-node-agent.json.

Also adds import re (required for re.sub in the slug generation).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 14:55:22 +02:00
Oskar Kapala 267742c7d7 vps/node-agent: add network_mode: host for control-plane health probe
The _check_control_plane_health() method probes localhost:18180, which
is the control-plane's mapped port. Inside a bridged container, localhost
resolves to the container's own loopback — the probe always fails.

host network mode shares the VPS host's network namespace so that
localhost:18180 correctly reaches the control-plane.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 14:52:32 +02:00
Oskar Kapala 4e8968f9c7 Fix service health tracking: emit service_healthy, control-plane endpoint check, cleanup checkpoint migration
- node_agent: emit service_healthy for all running managed containers so
  observer populates services.json (previously empty → supervisor flooded
  action queue with missing_service redeploys for healthy services)
- node_agent: VPS-only _check_control_plane_health() probes the HTTP
  endpoint to emit service_healthy/unhealthy for the 'control-plane' logical
  service (multi-container stack, container names don't match service name)
- node_agent: fix _cleanup_control_plane_fs() to read new node_checkpoints
  format from observer checkpoint (was reading old last_processed_file key,
  always found nothing, never cleaned up old events)
- observer: handle service_healthy event type → sets service status healthy
  without resolving incidents (unlike service_recovered which also resolves)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 14:49:56 +02:00
Oskar Kapala f4a8db93e4 fix(observer): per-node-directory checkpoints replace single global checkpoint
The old mechanism tracked a single 'last_processed_file' and used sorted
filename order to find new events.  Remote nodes ship events into
subdirectories (events/piha/, events/chelsty-infra/) that sort
alphabetically BEFORE the VPS directory (events/vps/).  Once the
checkpoint pointed to a vps/ file, all piha/ and chelsty-infra/ events
were silently skipped forever.

New mechanism:
- node_checkpoints: {node_dir: last_processed_path}
- Each node directory has its own independent cursor
- New events = files whose path > that node's checkpoint
- Backward-compatible: old 'last_processed_file' is migrated by extracting
  the node dir from the path on first load

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 14:16:58 +02:00
Oskar Kapala a5a3e223dc fix(node-agent): skip SSH config file in rsync to avoid UID ownership errors
When ~/.ssh is mounted from the host oskar user into a container that
runs as root, OpenSSH rejects ~/.ssh/config with 'Bad owner or
permissions' because the file UID doesn't match the running process.

Add -F /dev/null to the rsync SSH command to skip the config file
entirely.  Also add UserKnownHostsFile=/dev/null so no known_hosts
write is attempted into a potentially read-only mounted .ssh dir.
The key itself (/root/.ssh/id_rsa) is still read as an implicit
default identity and is not affected by -F.

Reproduces on chelsty-infra (has ~/.ssh/config); safe for all nodes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 14:12:19 +02:00
Oskar Kapala 2349de518b fix(node-agent): correct VPS_EVENTS_HOST to actual VPS Tailscale IP
100.108.208.3 is piha's Tailscale IP (piha hosts Forgejo+Redis).
VPS's actual Tailscale IP is 100.95.58.48.  All three node-agent
overrides were pointing at piha itself, causing containers to SSH
to their own host and fail auth.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 14:07:27 +02:00
Oskar Kapala 65bac4ebfe fix(node-agent): mount host SSH key into container for event shipping
Nodes ship events to VPS via rsync+SSH. The container runs as root
and uses the default SSH identity, which must be at /root/.ssh/.
Mount /home/oskar/.ssh from the host read-only so the existing
authorized key is available inside the container.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 13:59:28 +02:00
Oskar Kapala 96bf32614f fix(observer+operator-ui): fix stale world state, dict→list API, event time filter
Root cause of stale data:
- node_agent.py falls back to socket.gethostname() when NODE_NAME is unset.
  Inside a Docker container this returns the 12-char container ID (e.g.
  'be17cb6eb0f6'), not the host name.  Observer ingested those events and
  created ghost entries in world/nodes.json that never expired.

observer.py:
- _prune_stale_world(): removes node/service/incident entries for nodes absent
  from topology inventory; called on every run_once() cycle (both new-events
  and idle paths).  Resolved incidents older than 7 days are also aged out.
- _save_world(): now writes node_count and service_count to runtime-summary.json
  so the Dashboard's System Overview cards show real numbers instead of undefined.

operator_ui.py:
- current_nodes/services/deployments/incidents(): the observer stores world state
  as keyed dicts; the frontend calls .map() which requires an array.  All four
  functions now convert the dict to a properly-shaped list.  Each item has the
  fields the Nodes, Services, Topology, Deployments, and Correlation views expect
  (hostname, health, capabilities, desired_state, dependencies, etc.).
- current_incidents(): synthesises a human-readable 'message' field from node +
  service + trigger_type (observer does not store one; dashboard showed undefined).
- current_events(): adds a 24 h time filter (EVENTS_MAX_AGE_HOURS env var,
  default 24).  Without this, every event file ever written was returned,
  including events from ghost-node deploys.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 13:51:03 +02:00
Oskar Kapala ae33cce889 feat(node-agent): add runtime overrides for piha, solaria, chelsty-infra
- piha: NODE_TYPE=sd_card (rate-limited docker prune, once per day)
- solaria: NODE_TYPE=ai_node (dangling+containers+build cache; never -a to preserve Ollama images)
- chelsty-infra: NODE_TYPE=lte_node (NO cleanup, events-only)
- All three: VPS_EVENTS_HOST set for event shipping via rsync+SSH

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 13:34:23 +02:00
Oskar Kapala c5c080b3e3 feat(vps): add node-agent runtime override with NODE_NAME=vps
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 13:18:19 +02:00
Oskar Kapala 01b7758fe6 feat(node-agent): implement health monitor and safe cleanup policy
scripts/monitor/health-monitor.sh (new):
- Standalone bash health monitor: disk/RAM/CPU checks + docker container health
- Per-node-type cleanup policy enforced:
    lte_node  (chelsty-infra, chelsty-ha): NO cleanup, no docker ops
    sd_card   (piha, saturn): dangling images + containers, rate-limited once/24h
    ai_node   (solaria): dangling + containers + build cache, NEVER -a
    standard  (vps): dangling + containers + build cache + CP filesystem rotation
- VPS filesystem rotation: completed/failed actions >7d, deploy logs >30d,
  events >3d AND past observer checkpoint
- Emits structured JSON events (node_health, disk_pressure, high_memory, high_cpu,
  containers_not_running, healthcheck_failed)

services/node-agent/ (new):
- Python daemon (node_agent.py): same policy as bash script, Docker SDK
  for container checks and cleanup, /proc for system metrics
- Optional event shipping to VPS via rsync+SSH (VPS_EVENTS_HOST env var)
- Dockerfile: python:3.11-slim + openssh-client + rsync + docker>=6.0
- docker-compose.yml: mounts docker socket, /opt/homelab, repo read-only

observer.py:
- Handle node_health: update node status + disk/mem/cpu metrics, clear disk_pressure
- Handle disk_pressure: record severity on node, clear when healthy
- Handle high_memory / high_cpu: record pressure level for correlation

supervisor.py:
- Add NO_DISK_CLEANUP_NODES = {chelsty-infra, chelsty-ha}
- reconcile() step 3: generate disk_cleanup actions for nodes with high disk pressure
- _generate_disk_cleanup_recommendation(): stable ID disk-cleanup-{node},
  checks all active states, risk=guarded (operator approval required)

executor.py:
- Handle disk_cleanup action type via _execute_disk_cleanup()
- Commands come from action payload; safety gate rejects any command touching
  /opt/homelab/data/, /opt/homelab/config/, /opt/homelab/state/, or rm -rf /

hosts/*/services.yaml:
- Rename stability-agent -> node-agent on piha, vps, solaria, chelsty-infra
- Add node-agent to chelsty-ha (previously missing)
- Add cleanup policy notes to LTE node comments

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 13:15:06 +02:00
Oskar Kapala 7742bda245 feat(control-plane): add container_restart remediation
- observer: store trigger_type on incidents for supervisor routing
- supervisor: route containers_not_running/mqtt_unreachable to container_restart instead of redeploy
- supervisor: fix node alias normalization via NODE_ALIAS_MAP
- supervisor: fix pending action dedup (scan by content not filename)
- executor: implement container_restart via SSH docker restart with retry
- control-plane override: configure NODE_ALIAS_MAP for production

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 12:50:46 +02:00
oskar 98fe1f1846 fix: frigate config not read-only, mount from /opt/homelab 2026-05-22 11:31:31 +02:00
oskar beb8b5cbaa fix: remove --pull always flag incompatible with docker-compose v1 2026-05-21 22:07:49 +02:00
oskar 898deda05f fix: deploy-frigate.sh use docker-compose v1 for chelsty-infra 2026-05-21 22:05:43 +02:00
oskar f34399a30d feat: add Frigate NVR deployment for chelsty-infra
VAAPI decode via Intel UHD 630, CPU detection, 2x Reolink RLC-540
placeholders. MQTT to local mosquitto (127.0.0.1), 7-day recording
retention. Secrets in /opt/homelab/config/frigate/frigate.env on node.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 18:19:45 +02:00
oskar 9b39581b53 fix(supervisor): content-based action IDs to prevent 30s backlog accumulation
Timestamp in reconcile-{ts}-{node}-{service} meant dedup guard never fired.
Switch to reconcile-{node}-{service} and check pending/approved/running states.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 17:47:37 +02:00
oskar ae7446a04b feat: add Copy for AI snapshot button to webui
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 12:05:37 +02:00
oskar f21be4f4d4 ops: align vps desired state with control-plane architecture, remove legacy agent-system references
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 11:40:55 +02:00
oskar 8fb4d3d634 docs: add tech-debt.md, forgejo_runner temp disabled 2026-05-21 10:37:42 +02:00
oskar 35e57cc789 docs(CLAUDE.md): update node model and override path convention
- split CHELSTY into CHELSTY-INFRA and CHELSTY-HA in node roles table
- correct docker-compose override path to hosts/<node>/runtime/<service>/

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-20 15:27:46 +02:00
oskar b02c8bb50e fix(deploy): inventory-aware orchestration and correct override paths
- orchestrate-deploy.sh: read nodes from inventory/topology.yaml instead of hardcoded list
- orchestrate-deploy.sh: LTE nodes (chelsty-infra, chelsty-ha) use ConnectTimeout=30, non-fatal on failure
- deploy-node.sh: service discovery falls back to services.yaml if no services.txt
- deploy-node.sh: override path corrected to hosts/<node>/runtime/<service>/

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-20 14:50:01 +02:00
oskar dc483ae31a docs(chelsty): update docs and topology for site/node split
- chelsty-runtime.md: references chelsty-infra and chelsty-ha nodes
- chelsty-stability-agent.md: scoped to chelsty-infra
- topology.yaml: chelsty monolith replaced with chelsty-infra + chelsty-ha
2026-05-20 14:23:57 +02:00
oskar 9d2f748557 refactor(hosts): split chelsty monolith into chelsty-ha and chelsty-infra
- remove legacy hosts/chelsty/ monolith
- chelsty-infra: add capabilities, networking, paths, runtime (mosquitto, zigbee2mqtt, stability-agent)
- chelsty-ha: add capabilities
- align with site/node model

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-20 14:20:49 +02:00
oskar 8a12b7ff17 docs: uzupelnij dokumentacje pod katem agentow AI
Co-authored-by: Junie <junie@jetbrains.com>
2026-05-20 12:06:23 +02:00
oskar f65698925e Fix control plane SSH deploy TTY 2026-05-18 21:41:47 +02:00
oskar 9f20dcae05 Add control plane deploy script and fix UI healthcheck 2026-05-18 21:34:57 +02:00
oskar b7251ac416 Fix control plane UI healthcheck 2026-05-18 21:29:55 +02:00
oskar 807b097eb4 Fix Telegram bot job queue dependency 2026-05-18 20:22:12 +02:00
oskar 5754994f8e Refactor Telegram bot to use control plane API 2026-05-17 23:42:52 +02:00
oskar c299a2cb85 Fix agent fleet verification via Redis container 2026-05-17 23:00:51 +02:00
oskar b129f03837 Fix stability agent fleet deploy scripts 2026-05-17 21:09:06 +02:00
oskar b7faac00c5 Add executable stability agent fleet deploy scripts 2026-05-17 17:32:10 +02:00
oskar 8f305ba3df Merge VPS control plane deployment and observer runtime 2026-05-17 17:30:04 +02:00
oskar c9ddfa9ac1 Roll out stability agent to homelab nodes 2026-05-17 15:54:19 +02:00
oskar 3233cf07cd Add Telegram approval bot for agent actions 2026-05-16 21:53:06 +02:00
oskar ac90acfac8 Merge Agent System UI runtime pipeline 2026-05-16 21:38:48 +02:00
oskar 12a775c834 Finish repo-first implementation of Agent System UI pipeline
Co-authored-by: Junie <junie@jetbrains.com>
2026-05-16 19:36:43 +02:00
oskar 41c05f42b5 Add agent system service with Redis materializer 2026-05-15 23:29:59 +02:00
oskar e8d6d6d473 Publish stability agent state to Redis 2026-05-15 22:52:12 +02:00
oskar 8d0f2379ba Add CHELSTY stability agent 2026-05-15 18:51:45 +02:00
oskar 90b2a5d0e9 Add Zigbee coordinator backup 2026-05-14 18:24:26 +02:00
oskar b726048d41 Adapt zigbee2mqtt for SLZB coordinator 2026-05-14 16:37:18 +02:00