Oskar Kapala
01b7758fe6
feat(node-agent): implement health monitor and safe cleanup policy
...
scripts/monitor/health-monitor.sh (new):
- Standalone bash health monitor: disk/RAM/CPU checks + docker container health
- Per-node-type cleanup policy enforced:
    lte_node (chelsty-infra, chelsty-ha): NO cleanup, no docker ops
    sd_card (piha, saturn): dangling images + containers, rate-limited once/24h
    ai_node (solaria): dangling + containers + build cache, NEVER -a
    standard (vps): dangling + containers + build cache + CP filesystem rotation
- VPS filesystem rotation: completed/failed actions >7d, deploy logs >30d,
  events >3d AND past observer checkpoint
- Emits structured JSON events (node_health, disk_pressure, high_memory, high_cpu,
  containers_not_running, healthcheck_failed)
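The per-node-type policy above can be read as a small lookup table. The sketch below is one possible encoding, not the code from health-monitor.sh or node_agent.py: the step names, `CleanupPolicy`, and `allowed_steps` are hypothetical, and "containers" is assumed to mean stopped containers.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical step names; the commit message does not list the exact commands.
DANGLING = "dangling_images"
CONTAINERS = "stopped_containers"  # assumption: "containers" means stopped ones
BUILD_CACHE = "build_cache"
CP_ROTATION = "cp_filesystem_rotation"


@dataclass(frozen=True)
class CleanupPolicy:
    steps: Tuple[str, ...]
    min_interval_s: int = 0  # 0 means no rate limit


POLICIES = {
    "lte_node": CleanupPolicy(steps=()),  # NO cleanup, no docker ops
    "sd_card": CleanupPolicy(steps=(DANGLING, CONTAINERS), min_interval_s=24 * 3600),
    "ai_node": CleanupPolicy(steps=(DANGLING, CONTAINERS, BUILD_CACHE)),
    "standard": CleanupPolicy(steps=(DANGLING, CONTAINERS, BUILD_CACHE, CP_ROTATION)),
}


def allowed_steps(node_type: str, now: float,
                  last_cleanup: Optional[float] = None) -> Tuple[str, ...]:
    """Return the cleanup steps permitted right now for the given node type."""
    policy = POLICIES[node_type]
    rate_limited = (
        policy.min_interval_s > 0
        and last_cleanup is not None
        and now - last_cleanup < policy.min_interval_s
    )
    return () if rate_limited else policy.steps
```

Keeping the policy as data rather than branching logic would let the bash script and the Python daemon be checked against the same table.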
services/node-agent/ (new):
- Python daemon (node_agent.py): same policy as bash script, Docker SDK
  for container checks and cleanup, /proc for system metrics
- Optional event shipping to VPS via rsync+SSH (VPS_EVENTS_HOST env var)
- Dockerfile: python:3.11-slim + openssh-client + rsync + docker>=6.0
- docker-compose.yml: mounts docker socket, /opt/homelab, repo read-only
observer.py:
- Handle node_health: update node status + disk/mem/cpu metrics, clear disk_pressure
- Handle disk_pressure: record severity on node, clear when healthy
- Handle high_memory / high_cpu: record pressure level for correlation
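The observer handlers listed above amount to a small state update per event. A minimal sketch follows, assuming hypothetical event fields (`type`, `node`, `status`, `severity`, `level`, `disk`, `mem`, `cpu`) and a plain dict as node state; the actual schema in observer.py is not shown in this log.

```python
from typing import Any, Dict

NodeState = Dict[str, Any]


def apply_event(nodes: Dict[str, NodeState], event: Dict[str, Any]) -> None:
    """Fold one health event into the per-node state (field names are assumptions)."""
    node = nodes.setdefault(event["node"], {})
    kind = event["type"]
    if kind == "node_health":
        node["status"] = event.get("status")
        node["metrics"] = {key: event.get(key) for key in ("disk", "mem", "cpu")}
        # Clear a previously recorded disk pressure once the node reports healthy.
        if node["status"] == "healthy":
            node.pop("disk_pressure", None)
    elif kind == "disk_pressure":
        node["disk_pressure"] = event["severity"]
    elif kind in ("high_memory", "high_cpu"):
        # Keep the level per signal so it can be correlated later.
        node.setdefault("pressure", {})[kind] = event.get("level")
```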
supervisor.py:
- Add NO_DISK_CLEANUP_NODES = {chelsty-infra, chelsty-ha}
- reconcile() step 3: generate disk_cleanup actions for nodes with high disk pressure
- _generate_disk_cleanup_recommendation(): stable ID disk-cleanup-{node},
  checks all active states, risk=guarded (operator approval required)
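A rough sketch of the recommendation step described above. `NO_DISK_CLEANUP_NODES`, the `disk-cleanup-{node}` ID, and `risk=guarded` come from the commit message; the function signature, the `"high"` pressure label, and the returned dict layout are assumptions made for illustration.

```python
from typing import Dict, Optional, Set

NO_DISK_CLEANUP_NODES = {"chelsty-infra", "chelsty-ha"}


def generate_disk_cleanup_recommendation(
    node: str, disk_pressure: Optional[str], active_ids: Set[str]
) -> Optional[Dict[str, str]]:
    """Return a guarded disk_cleanup action, or None if one must not be created."""
    if node in NO_DISK_CLEANUP_NODES:
        return None  # LTE nodes never get cleanup actions
    if disk_pressure != "high":  # assumption: pressure is recorded as a label
        return None
    action_id = f"disk-cleanup-{node}"  # stable ID: one open action per node
    if action_id in active_ids:
        return None  # already pending, approved, or running
    return {"id": action_id, "type": "disk_cleanup", "node": node, "risk": "guarded"}
```

Because the ID contains no timestamp, repeated reconcile cycles yield the same ID and the active-state check can suppress duplicates.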
executor.py:
- Handle disk_cleanup action type via _execute_disk_cleanup()
- Commands come from action payload; safety gate rejects any command touching
  /opt/homelab/data/, /opt/homelab/config/, /opt/homelab/state/, or rm -rf /
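One way such a safety gate could look is sketched below. The protected paths are the ones named above; the substring match, the regular expression for `rm -rf /`, and the names `is_command_safe` and `check_payload` are assumptions, not the code in executor.py.

```python
import re
from typing import Iterable, List

PROTECTED_PREFIXES = (
    "/opt/homelab/data/",
    "/opt/homelab/config/",
    "/opt/homelab/state/",
)

# Matches "rm -rf /" or "rm -fr /" where "/" is the whole argument, not a path prefix.
_RM_RF_ROOT = re.compile(r"\brm\s+-(?:rf|fr)\s+/(?:\s|$)")


def is_command_safe(command: str) -> bool:
    """Reject commands that mention a protected path or wipe the root filesystem."""
    if any(prefix in command for prefix in PROTECTED_PREFIXES):
        return False
    return _RM_RF_ROOT.search(command) is None


def check_payload(commands: Iterable[str]) -> List[str]:
    """Return the commands unchanged, or raise if any of them fails the gate."""
    commands = list(commands)
    rejected = [c for c in commands if not is_command_safe(c)]
    if rejected:
        raise ValueError(f"unsafe disk_cleanup commands rejected: {rejected}")
    return commands
```

A denylist like this is only a last line of defence; since commands arrive in the action payload, the operator approval step remains the main control.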
hosts/*/services.yaml:
- Rename stability-agent -> node-agent on piha, vps, solaria, chelsty-infra
- Add node-agent to chelsty-ha (previously missing)
- Add cleanup policy notes to LTE node comments
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 13:15:06 +02:00
Oskar Kapala
7742bda245
feat(control-plane): add container_restart remediation
...
- observer: store trigger_type on incidents for supervisor routing
- supervisor: route containers_not_running/mqtt_unreachable to container_restart instead of redeploy
- supervisor: fix node alias normalization via NODE_ALIAS_MAP
- supervisor: fix pending action dedup (scan by content not filename)
- executor: implement container_restart via SSH docker restart with retry
- control-plane override: configure NODE_ALIAS_MAP for production
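The supervisor routing and alias normalization above might be sketched as follows. The trigger names and `NODE_ALIAS_MAP` come from the commit message; the alias entry, the `redeploy` fallback for every other trigger, and the function names are assumptions.

```python
from typing import Dict

# Hypothetical alias entry; the production map lives in the control-plane override.
NODE_ALIAS_MAP: Dict[str, str] = {"pi-ha": "piha"}

RESTART_TRIGGERS = {"containers_not_running", "mqtt_unreachable"}


def normalize_node(name: str, alias_map: Dict[str, str] = NODE_ALIAS_MAP) -> str:
    """Map an alias to its canonical node name; unknown names pass through."""
    return alias_map.get(name, name)


def route_action(trigger_type: str) -> str:
    """Pick the remediation for an incident based on its stored trigger_type."""
    # Assumption: every trigger not listed above still falls back to redeploy.
    return "container_restart" if trigger_type in RESTART_TRIGGERS else "redeploy"
```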
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 12:50:46 +02:00
oskar
9b39581b53
fix(supervisor): content-based action IDs to prevent 30s backlog accumulation
...
Timestamp in reconcile-{ts}-{node}-{service} meant dedup guard never fired.
Switch to reconcile-{node}-{service} and check pending/approved/running states.
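The dedup fix can be illustrated with a short sketch. The ID format and the three checked states are from the message above; the in-memory `store` layout and the function names are assumptions, and the node and service names in the usage lines are examples only.

```python
from typing import Dict, Set

ACTIVE_STATES = ("pending", "approved", "running")


def make_action_id(node: str, service: str) -> str:
    """Content-based ID: identical for every reconcile cycle of the same target."""
    return f"reconcile-{node}-{service}"


def enqueue_if_new(node: str, service: str, store: Dict[str, Set[str]]) -> bool:
    """Add a pending action unless the same ID is already active in any state."""
    action_id = make_action_id(node, service)
    if any(action_id in store.get(state, set()) for state in ACTIVE_STATES):
        return False
    store.setdefault("pending", set()).add(action_id)
    return True


store: Dict[str, Set[str]] = {}
first = enqueue_if_new("piha", "mqtt", store)   # creates the action
second = enqueue_if_new("piha", "mqtt", store)  # next cycle: suppressed
```

With a timestamp in the ID, the second call would have produced a different string and the membership check could never match, which is the backlog described above.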
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 17:47:37 +02:00
oskar
ae7446a04b
feat: add "Copy for AI" snapshot button to webui
...
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 12:05:37 +02:00
oskar
8a12b7ff17
docs: supplement documentation for AI agents
...
Co-authored-by: Junie <junie@jetbrains.com>
2026-05-20 12:06:23 +02:00
oskar
9f20dcae05
Add control plane deploy script and fix UI healthcheck
2026-05-18 21:34:57 +02:00
oskar
b7251ac416
Fix control plane UI healthcheck
2026-05-18 21:29:55 +02:00
oskar
807b097eb4
Fix Telegram bot job queue dependency
2026-05-18 20:22:12 +02:00
oskar
5754994f8e
Refactor Telegram bot to use control plane API
2026-05-17 23:42:52 +02:00
oskar
b129f03837
Fix stability agent fleet deploy scripts
2026-05-17 21:09:06 +02:00
oskar
b7faac00c5
Add executable stability agent fleet deploy scripts
2026-05-17 17:32:10 +02:00
oskar
8f305ba3df
Merge VPS control plane deployment and observer runtime
2026-05-17 17:30:04 +02:00
oskar
c9ddfa9ac1
Roll out stability agent to homelab nodes
2026-05-17 15:54:19 +02:00
oskar
3233cf07cd
Add Telegram approval bot for agent actions
2026-05-16 21:53:06 +02:00
oskar
12a775c834
Finish repo-first implementation of Agent System UI pipeline
...
Co-authored-by: Junie <junie@jetbrains.com>
2026-05-16 19:36:43 +02:00
oskar
41c05f42b5
Add agent system service with Redis materializer
2026-05-15 23:29:59 +02:00
oskar
e8d6d6d473
Publish stability agent state to Redis
2026-05-15 22:52:12 +02:00
oskar
8d0f2379ba
Add CHELSTY stability agent
2026-05-15 18:51:45 +02:00
oskar
b726048d41
Adapt zigbee2mqtt for SLZB coordinator
2026-05-14 16:37:18 +02:00
Oskar Kapala
533b8e846d
Add heartbeat updates and improve health checks in control-plane components
2026-05-12 20:59:46 +02:00
Oskar Kapala
f4e6871d76
Add health check to control-plane Dockerfile (fix syntax)
2026-05-12 20:28:13 +02:00
Oskar Kapala
793559a4b5
Add health check to control-plane Dockerfile
2026-05-12 20:25:01 +02:00
Oskar Kapala
0cf1106b34
Update control-plane port mapping to 18180
2026-05-12 20:22:46 +02:00
Oskar Kapala
2029457f57
Implement VPS control-plane deployment profile
2026-05-12 20:19:05 +02:00
Oskar Kapala
431d777989
Implement filesystem-first runtime event system
2026-05-12 13:38:25 +02:00
Oskar Kapala
bbdbdb8321
Add node capability model
2026-05-11 20:46:50 +02:00
Oskar Kapala
d0540f7eb8
Add infrastructure standards and deployment conventions
2026-05-07 21:16:03 +02:00