Stability Agent Multi-Node Rollout

Architecture Summary

The stability-agent is a lightweight Python service that monitors node health (disk, Docker containers, Tailscale, MQTT) and publishes state to a central Redis instance running on PIHA.

  • Source: services/stability-agent
  • State Path: /opt/homelab/state
  • Events Path: /opt/homelab/events
  • Redis Target: 100.108.208.3:6379 (PIHA)
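The publishing path summarized above can be sketched roughly as follows. This is an illustration under stated assumptions, not the agent's actual code: the field names (`disk_used_pct`, `docker_socket`, `updated_at`) and both function names are hypothetical. The key pattern `homelab:nodes:<NODE_NAME>` and the use of a Redis hash come from the inspection commands later in this document. The `client` argument is assumed to behave like a redis-py `redis.Redis` instance connected to 100.108.208.3:6379.

```python
import os
import shutil
import time


def collect_snapshot(disk_path="/", docker_sock="/var/run/docker.sock"):
    """Gather a small health snapshot (all field names are hypothetical)."""
    usage = shutil.disk_usage(disk_path)
    # Guard against a zero-sized filesystem to avoid dividing by zero.
    used_pct = usage.used / usage.total * 100 if usage.total else 0.0
    return {
        "disk_used_pct": f"{used_pct:.1f}",
        "docker_socket": "present" if os.path.exists(docker_sock) else "missing",
        "updated_at": str(int(time.time())),
    }


def publish_snapshot(client, node_name, snapshot):
    """Write the snapshot into the node's hash and return the key used."""
    key = f"homelab:nodes:{node_name}"
    client.hset(key, mapping=snapshot)  # same call shape as redis-py's hset
    return key
```

Passing the client in keeps the sketch testable without a live Redis; a real agent would presumably call something like this once per CHECK_INTERVAL.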

Why the UI Only Showed CHELSTY

Previously, NODE_NAME in the stability-agent defaulted to chelsty, and the agent was deployed only on that node. The Agent System UI materializer on PIHA lists only nodes that have a Redis key of the form homelab:nodes:<NODE_NAME>. Because no other agent was publishing under its own NODE_NAME, the UI showed only the single active node.
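The effect of a shared default can be illustrated with a small sketch. It assumes the agent reads NODE_NAME from its environment and falls back to `chelsty`; the helper names `node_key` and `visible_nodes` are hypothetical and only model the key-based listing described above.

```python
DEFAULT_NODE_NAME = "chelsty"  # the old default described above
KEY_PREFIX = "homelab:nodes:"


def node_key(env):
    """Derive the Redis key an agent would publish under for a given environment."""
    return KEY_PREFIX + env.get("NODE_NAME", DEFAULT_NODE_NAME)


def visible_nodes(keys):
    """Node names a key-based listing would show, given the keys present in Redis."""
    return sorted({k[len(KEY_PREFIX):] for k in keys if k.startswith(KEY_PREFIX)})


# Agents without NODE_NAME collapse onto one key; explicit names keep them apart.
unset = [node_key({}), node_key({})]
named = [node_key({"NODE_NAME": "piha"}), node_key({"NODE_NAME": "vps"})]
print(visible_nodes(unset))  # ['chelsty']
print(visible_nodes(named))  # ['piha', 'vps']
```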

Deployment

Use the helper script either to print the deployment commands for a node or to run them. The script uses explicit Tailscale IPs for the remote targets (piha, chelsty, vps) and runs locally for solaria.

# Print commands
./scripts/deploy/deploy-stability-agent.sh <node-name>

# Deploy via SSH (executes ssh oskar@<ip>)
./scripts/deploy/deploy-stability-agent.sh <node-name> --ssh
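The local-versus-SSH dispatch described above might look roughly like the sketch below. It is not the script's real logic: the function name `deploy_command` is hypothetical, the remote command line is an assumption assembled from the manual steps in the next section, and the IP map must be supplied by the caller because only PIHA's address appears in this document.

```python
import shlex

REPO_DIR = "/home/oskar/homelab-codex-ws"
LOCAL_NODES = {"solaria"}  # runs locally according to the text above


def deploy_command(node, tailscale_ips, user="oskar"):
    """Return the argv list that would deploy the agent on `node`."""
    steps = (
        f"cd {REPO_DIR} && git fetch origin && git checkout master && "
        f"git pull origin master && cd services/stability-agent && "
        f"./deploy-local.sh {shlex.quote(node)}"
    )
    if node in LOCAL_NODES:
        return ["sh", "-c", steps]
    if node not in tailscale_ips:
        raise KeyError(f"no Tailscale IP configured for node {node!r}")
    return ["ssh", f"{user}@{tailscale_ips[node]}", steps]


# Only PIHA's address is listed in this document; others are left to the caller.
ips = {"piha": "100.108.208.3"}
print(deploy_command("piha", ips))
```

Returning an argv list rather than executing it mirrors the script's print mode; the result could be handed to `subprocess.run` to get the `--ssh` behaviour.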

Manual Steps per Node

The manual steps are wrapped by services/stability-agent/deploy-local.sh. Run the following on the target node:

cd /home/oskar/homelab-codex-ws
git fetch origin
git checkout master
git pull origin master
cd services/stability-agent
./deploy-local.sh <node-name>

Verification

Fleet Overview

Run the verification script from any node with redis-cli access:

./scripts/deploy/verify-agent-fleet.sh
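For readers who prefer to check the fleet from Python, the same idea can be sketched as below. This is not a port of verify-agent-fleet.sh, whose contents are not shown here. The expected node list is taken from the deployment section, `fleet_report` is a hypothetical name, and `client` is assumed to behave like a redis-py `redis.Redis` instance.

```python
EXPECTED_NODES = ("piha", "chelsty", "vps", "solaria")
KEY_PREFIX = "homelab:nodes:"


def fleet_report(client, expected=EXPECTED_NODES):
    """Map each expected node to True/False depending on whether its key exists."""
    present = set()
    # scan_iter walks the keyspace incrementally instead of blocking like KEYS.
    for raw in client.scan_iter(match=KEY_PREFIX + "*"):
        key = raw.decode() if isinstance(raw, bytes) else raw
        present.add(key[len(KEY_PREFIX):])
    return {node: node in present for node in expected}
```

With redis-py installed, a call such as `fleet_report(redis.Redis(host="100.108.208.3", port=6379))` would return a dictionary in which any `False` entry marks a node whose agent has not published yet.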

Redis Inspection (on PIHA)

docker exec agent-system-redis redis-cli KEYS 'homelab:nodes:*'
docker exec agent-system-redis redis-cli HGETALL homelab:nodes:<node-name>

Verify the Web UI backend:

curl -s http://127.0.0.1:18180/nodes
curl -k https://agents.okit.pl/nodes

Troubleshooting

  • Redis empty after compose down: The agent-system-redis container on PIHA keeps its data in transient storage unless it is configured with a volume, so the keys disappear when it restarts. Agents republish their state automatically every CHECK_INTERVAL, so the keys reappear without manual action.
  • Secrets: .env files and local secrets are not committed to the repo. Ensure MQTT_HOST and any other node-specific secrets are set via overrides if needed.
  • Telegram: Telegram bot notifications stay disabled when TELEGRAM_BOT_TOKEN is absent; they can be left that way.
  • Docker Socket: If the agent reports Docker as unavailable, ensure /var/run/docker.sock is mounted and the user running the agent has permission to access it.
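The Docker socket item above can be narrowed down with a quick check like the following. It is a diagnostic sketch, not part of the agent: the function name and the status strings are hypothetical, and it only inspects the filesystem, so a passing result does not prove that the Docker daemon itself is healthy.

```python
import os
import stat


def diagnose_docker_socket(path="/var/run/docker.sock"):
    """Return a short status string explaining why the socket may be unusable."""
    try:
        mode = os.stat(path).st_mode
    except FileNotFoundError:
        return "missing: socket is not mounted at this path"
    except PermissionError:
        return "unreadable: cannot stat the path"
    if not stat.S_ISSOCK(mode):
        return "not a socket: path exists but is not a Unix socket"
    if not os.access(path, os.R_OK | os.W_OK):
        return "permission denied: current user cannot read/write the socket"
    return "ok"


print(diagnose_docker_socket())
```

Run it as the same user (or inside the same container) as the agent, since both the mount and the permissions depend on that context.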