Compare commits
8 commits
c9ddfa9ac1
...
b7faac00c5
| Author | SHA1 | Date | |
|---|---|---|---|
|
|
b7faac00c5 | ||
|
|
8f305ba3df | ||
|
|
533b8e846d | ||
|
|
f4e6871d76 | ||
|
|
793559a4b5 | ||
|
|
0cf1106b34 | ||
|
|
2029457f57 | ||
|
|
8f5b905015 |
63
docs/observer-runtime.md
Normal file
63
docs/observer-runtime.md
Normal file
|
|
@ -0,0 +1,63 @@
|
||||||
|
# Observer Runtime
|
||||||
|
|
||||||
|
The Observer Runtime is a lightweight agent responsible for synthesizing the operational world state of the homelab from raw events, logs, and state files.
|
||||||
|
|
||||||
|
## Architecture
|
||||||
|
|
||||||
|
The observer follows a filesystem-first approach, consuming append-only events and generating a normalized world model. It is designed to be idempotent, resumable, and resilient to intermittent node connectivity.
|
||||||
|
|
||||||
|
### Inputs
|
||||||
|
- `/opt/homelab/events/`: Normalized JSON events.
|
||||||
|
- `/opt/homelab/state/`: Deployment stage markers and internal observer checkpoint.
|
||||||
|
- `/opt/homelab/logs/`: Detailed execution logs and diagnostics.
|
||||||
|
- Repository Inventory: `inventory/topology.yaml` and `hosts/*/services.yaml`.
|
||||||
|
|
||||||
|
### World Model Output
|
||||||
|
Generated under `/opt/homelab/world/`:
|
||||||
|
- `nodes.json`: Current node availability, roles, and last seen timestamps.
|
||||||
|
- `services.json`: Service health status and links to active incidents.
|
||||||
|
- `deployments.json`: Tracking of active and historical deployment runs by `correlation_id`.
|
||||||
|
- `incidents.json`: Correlated operational issues, including repeat failures and resolution status.
|
||||||
|
- `runtime-summary.json`: High-level overview for dashboards and planner agents.
|
||||||
|
|
||||||
|
## Incident Lifecycle
|
||||||
|
|
||||||
|
The observer implements lightweight incident correlation:
|
||||||
|
|
||||||
|
1. **Detection**: When a `service_unhealthy` or `healthcheck_failed` event is consumed, a new incident is created or an existing active incident for that service is updated.
|
||||||
|
2. **Correlation**: Multiple failure events for the same service on the same node are collapsed into a single incident, tracking the `occurrence_count`.
|
||||||
|
3. **Diagnostics**: Deployment failures (`deployment_failed`) automatically attach references to diagnostic files if present in the event payload.
|
||||||
|
4. **Resolution**: A `service_recovered` event for a service will transition any active incidents for that service to a `resolved` state.
|
||||||
|
|
||||||
|
### Example Incident JSON
|
||||||
|
```json
|
||||||
|
{
|
||||||
|
"inc-1715518800-saturn-mosquitto": {
|
||||||
|
"id": "inc-1715518800-saturn-mosquitto",
|
||||||
|
"node": "saturn",
|
||||||
|
"service": "mosquitto",
|
||||||
|
"status": "resolved",
|
||||||
|
"severity": "error",
|
||||||
|
"started_at": "2026-05-12T12:05:00Z",
|
||||||
|
"last_occurrence": "2026-05-12T12:06:00Z",
|
||||||
|
"occurrence_count": 2,
|
||||||
|
"events": [
|
||||||
|
"2026-05-12T12:05:00Z",
|
||||||
|
"2026-05-12T12:06:00Z"
|
||||||
|
],
|
||||||
|
"correlation_id": "hc-1",
|
||||||
|
"resolved_at": "2026-05-12T12:10:00Z"
|
||||||
|
}
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
## Runtime Behavior
|
||||||
|
|
||||||
|
### Idempotency
|
||||||
|
The observer processes events in order. If the world state is lost, deleting the checkpoint file (`/opt/homelab/state/observer_checkpoint.json`) will cause the observer to re-process all events and rebuild the world state.
|
||||||
|
|
||||||
|
### Resumability
|
||||||
|
The observer tracks the last processed event file in its checkpoint. Upon restart, it continues from the next available event.
|
||||||
|
|
||||||
|
### Deployment Tracking
|
||||||
|
Deployments are tracked via `correlation_id`. The observer synthesizes the start, end, and status of each deployment run, providing a clear history of changes to the environment.
|
||||||
|
|
@ -11,51 +11,35 @@ The `stability-agent` is a lightweight Python service that monitors node health
|
||||||
## Why UI only showed CHELSTY
|
## Why UI only showed CHELSTY
|
||||||
Previously, the `stability-agent` had `NODE_NAME` defaulted to `chelsty` and was only deployed there. The Agent System UI materializer on PIHA filters nodes based on the Redis keys `homelab:nodes:<NODE_NAME>`. Without other agents publishing their specific `NODE_NAME`, the UI remained limited to the single active node.
|
Previously, the `stability-agent` had `NODE_NAME` defaulted to `chelsty` and was only deployed there. The Agent System UI materializer on PIHA filters nodes based on the Redis keys `homelab:nodes:<NODE_NAME>`. Without other agents publishing their specific `NODE_NAME`, the UI remained limited to the single active node.
|
||||||
|
|
||||||
## Deployment Commands
|
## Deployment
|
||||||
|
|
||||||
Use the helper script to generate commands:
|
Use the helper script to deploy or generate commands:
|
||||||
```bash
|
```bash
|
||||||
|
# Print commands
|
||||||
./scripts/deploy/deploy-stability-agent.sh <node-name>
|
./scripts/deploy/deploy-stability-agent.sh <node-name>
|
||||||
|
|
||||||
|
# Deploy via SSH (requires SSH access to the node)
|
||||||
|
./scripts/deploy/deploy-stability-agent.sh <node-name> --ssh
|
||||||
```
|
```
|
||||||
|
|
||||||
### PIHA
|
### Manual Steps per Node
|
||||||
|
The manual steps are encapsulated in `services/stability-agent/deploy-local.sh`. On the target node:
|
||||||
```bash
|
```bash
|
||||||
cd ~/homelab-codex-ws
|
cd ~/homelab-codex-ws
|
||||||
git pull
|
git pull
|
||||||
cd services/stability-agent
|
cd services/stability-agent
|
||||||
NODE_NAME=piha REDIS_HOST=100.108.208.3 REDIS_PORT=6379 REDIS_ENABLED=true docker compose up -d --build --force-recreate
|
./deploy-local.sh
|
||||||
```
|
```
|
||||||
|
|
||||||
### CHELSTY
|
## Verification
|
||||||
|
|
||||||
|
### Fleet Overview
|
||||||
|
Run the verification script from any node with `redis-cli` access:
|
||||||
```bash
|
```bash
|
||||||
cd ~/homelab-codex-ws
|
./scripts/deploy/verify-agent-fleet.sh
|
||||||
git pull
|
|
||||||
cd services/stability-agent
|
|
||||||
NODE_NAME=chelsty REDIS_HOST=100.108.208.3 REDIS_PORT=6379 REDIS_ENABLED=true docker compose up -d --build --force-recreate
|
|
||||||
```
|
```
|
||||||
|
|
||||||
### SOLARIA
|
### Redis Inspection (on PIHA)
|
||||||
```bash
|
|
||||||
cd ~/homelab-codex-ws
|
|
||||||
git pull
|
|
||||||
cd services/stability-agent
|
|
||||||
NODE_NAME=solaria REDIS_HOST=100.108.208.3 REDIS_PORT=6379 REDIS_ENABLED=true docker compose up -d --build --force-recreate
|
|
||||||
```
|
|
||||||
|
|
||||||
### VPS
|
|
||||||
```bash
|
|
||||||
cd ~/homelab-codex-ws
|
|
||||||
git pull
|
|
||||||
cd services/stability-agent
|
|
||||||
NODE_NAME=vps REDIS_HOST=100.108.208.3 REDIS_PORT=6379 REDIS_ENABLED=true docker compose up -d --build --force-recreate
|
|
||||||
```
|
|
||||||
|
|
||||||
### SATURN (Optional)
|
|
||||||
Saturn is the orchestrator and can optionally run the stability-agent. If deployed, follow the same pattern with `NODE_NAME=saturn`.
|
|
||||||
|
|
||||||
## Verification (on PIHA)
|
|
||||||
|
|
||||||
Verify Redis keys:
|
|
||||||
```bash
|
```bash
|
||||||
docker exec agent-system-redis redis-cli KEYS 'homelab:nodes:*'
|
docker exec agent-system-redis redis-cli KEYS 'homelab:nodes:*'
|
||||||
docker exec agent-system-redis redis-cli HGETALL homelab:nodes:<node-name>
|
docker exec agent-system-redis redis-cli HGETALL homelab:nodes:<node-name>
|
||||||
|
|
|
||||||
78
docs/vps-control-plane.md
Normal file
78
docs/vps-control-plane.md
Normal file
|
|
@ -0,0 +1,78 @@
|
||||||
|
# VPS Control Plane
|
||||||
|
|
||||||
|
The VPS Control Plane is the orchestration brain of the homelab platform. It runs on the Hetzner VPS and provides observability, automated reconciliation, and a web-based operator interface.
|
||||||
|
|
||||||
|
## Architecture
|
||||||
|
|
||||||
|
The control plane consists of four core services running as a Docker Compose stack:
|
||||||
|
|
||||||
|
1. **Observer**: Synthesizes world state from events.
|
||||||
|
2. **Supervisor**: Detects drifts between desired and actual state.
|
||||||
|
3. **Executor**: Executes approved actions from the queue.
|
||||||
|
4. **Operator UI**: Web interface for system monitoring and action approval.
|
||||||
|
|
||||||
|
All services adhere to **filesystem-first** semantics, using `/opt/homelab/` as the primary data exchange and persistence layer.
|
||||||
|
|
||||||
|
## Deployment Flow
|
||||||
|
|
||||||
|
### 1. Prerequisites
|
||||||
|
- Target VPS node must be onboarded (Tailscale active, Docker installed).
|
||||||
|
- Repository cloned to `/home/oskar/homelab-codex-ws`.
|
||||||
|
|
||||||
|
### 2. Bootstrap
|
||||||
|
Run the bootstrap script to initialize the runtime filesystem and start the stack:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
./scripts/bootstrap/vps-control-plane.sh
|
||||||
|
```
|
||||||
|
|
||||||
|
### 3. Verification
|
||||||
|
Verify the stack is healthy:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
cd services/control-plane
|
||||||
|
docker compose ps
|
||||||
|
curl http://localhost:8080/summary
|
||||||
|
```
|
||||||
|
|
||||||
|
## Operational Workflows
|
||||||
|
|
||||||
|
### Action Approval
|
||||||
|
1. Access the Operator UI (via Tailscale IP or Nginx Proxy Manager).
|
||||||
|
2. Navigate to **Action Queue**.
|
||||||
|
3. Review **Pending** actions recommended by the Supervisor.
|
||||||
|
4. Click **Approve** to move actions to the execution queue.
|
||||||
|
|
||||||
|
### Recovery Flow
|
||||||
|
In case of control plane failure:
|
||||||
|
1. Check logs: `docker compose logs -f`.
|
||||||
|
2. Restart stack: `docker compose restart`.
|
||||||
|
3. Rebuild world state: Delete `/opt/homelab/state/observer_checkpoint.json` and restart the observer service.
|
||||||
|
|
||||||
|
### Upgrade Flow
|
||||||
|
1. Pull latest changes from git.
|
||||||
|
2. Run bootstrap script again: `./scripts/bootstrap/vps-control-plane.sh`.
|
||||||
|
- This will rebuild images and restart containers with new code.
|
||||||
|
|
||||||
|
### Rollback Semantics
|
||||||
|
Since the runtime is filesystem-first and append-only:
|
||||||
|
1. Roll back the repository state to a previous commit.
|
||||||
|
2. Restart the control plane stack.
|
||||||
|
3. The supervisor will detect drift against the older (rolled-back) desired state and recommend actions to restore it.
|
||||||
|
|
||||||
|
## Runtime Safety
|
||||||
|
|
||||||
|
- **Readonly Mounts**: Most services mount the repository as `:ro` to prevent accidental mutations.
|
||||||
|
- **Least-Privilege**: UI, Observer, and Supervisor run as non-root `homelab` user (UID 1000).
|
||||||
|
- **Filesystem Isolation**: Clear separation between `/repo` (code/inventory) and `/opt/homelab` (runtime state).
|
||||||
|
|
||||||
|
## Integration
|
||||||
|
|
||||||
|
### Nginx Proxy Manager
|
||||||
|
Configure a proxy host in NPM to point to `http://control-plane-ui:8080`. Ensure Websockets are enabled if the UI uses them.
|
||||||
|
|
||||||
|
### Log Locations
|
||||||
|
- Container logs: `docker compose logs`
|
||||||
|
- Runtime events: `/opt/homelab/events/YYYY-MM-DD/`
|
||||||
|
- World state: `/opt/homelab/world/`
|
||||||
|
- Diagnostics: `/opt/homelab/logs/`
|
||||||
7
hosts/vps/runtime/control-plane/env.example
Normal file
7
hosts/vps/runtime/control-plane/env.example
Normal file
|
|
@ -0,0 +1,7 @@
|
||||||
|
# Control Plane Environment Variables
|
||||||
|
PORT=8080
|
||||||
|
HOMELAB_STATE_ROOT=/opt/homelab/state
|
||||||
|
HOMELAB_EVENTS_ROOT=/opt/homelab/events
|
||||||
|
HOMELAB_WORLD_ROOT=/opt/homelab/world
|
||||||
|
HOMELAB_ACTIONS_ROOT=/opt/homelab/actions
|
||||||
|
HOMELAB_CONFIG_ROOT=/opt/homelab/config
|
||||||
75
scripts/bootstrap/vps-control-plane.sh
Executable file
75
scripts/bootstrap/vps-control-plane.sh
Executable file
|
|
@ -0,0 +1,75 @@
|
||||||
|
#!/usr/bin/env bash
|
||||||
|
# vps-control-plane.sh - Bootstrap script for VPS control plane
|
||||||
|
|
||||||
|
set -e
|
||||||
|
|
||||||
|
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
||||||
|
REPO_ROOT="$(cd "$SCRIPT_DIR/../.." && pwd)"
|
||||||
|
RUNTIME_DIR="/opt/homelab"
|
||||||
|
VPS_CONFIG="$REPO_ROOT/hosts/vps/runtime"
|
||||||
|
|
||||||
|
# Colors for output
|
||||||
|
RED='\033[0;31m'
|
||||||
|
GREEN='\033[0;32m'
|
||||||
|
YELLOW='\033[1;33m'
|
||||||
|
NC='\033[0m' # No Color
|
||||||
|
|
||||||
|
log() { echo -e "${GREEN}[INFO]${NC} $1"; }
|
||||||
|
warn() { echo -e "${YELLOW}[WARN]${NC} $1"; }
|
||||||
|
error() { echo -e "${RED}[ERROR]${NC} $1"; exit 1; }
|
||||||
|
|
||||||
|
log "Starting VPS control plane bootstrap..."
|
||||||
|
|
||||||
|
# 1. Validate Docker availability
|
||||||
|
if ! command -v docker &> /dev/null; then
|
||||||
|
error "Docker is not installed. Please install Docker first."
|
||||||
|
fi
|
||||||
|
|
||||||
|
# 2. Validate compose plugin
|
||||||
|
if ! docker compose version &> /dev/null; then
|
||||||
|
error "Docker Compose plugin is not installed."
|
||||||
|
fi
|
||||||
|
|
||||||
|
log "Docker and Compose plugin verified."
|
||||||
|
|
||||||
|
# 3. Create filesystem-first runtime structure
|
||||||
|
log "Creating filesystem-first runtime structure in $RUNTIME_DIR..."
|
||||||
|
sudo mkdir -p "$RUNTIME_DIR/events" \
|
||||||
|
"$RUNTIME_DIR/state" \
|
||||||
|
"$RUNTIME_DIR/world" \
|
||||||
|
"$RUNTIME_DIR/actions/pending" \
|
||||||
|
"$RUNTIME_DIR/actions/approved" \
|
||||||
|
"$RUNTIME_DIR/actions/running" \
|
||||||
|
"$RUNTIME_DIR/actions/completed" \
|
||||||
|
"$RUNTIME_DIR/actions/failed" \
|
||||||
|
"$RUNTIME_DIR/actions/rejected" \
|
||||||
|
"$RUNTIME_DIR/config" \
|
||||||
|
"$RUNTIME_DIR/logs"
|
||||||
|
|
||||||
|
# 4. Set permissions
|
||||||
|
log "Setting permissions..."
|
||||||
|
sudo chown -R $USER:$USER "$RUNTIME_DIR"
|
||||||
|
chmod -R 755 "$RUNTIME_DIR"
|
||||||
|
|
||||||
|
# 5. Install environment file
|
||||||
|
log "Installing environment configuration..."
|
||||||
|
if [ ! -f "$RUNTIME_DIR/config/control-plane.env" ]; then
|
||||||
|
cp "$VPS_CONFIG/control-plane/env.example" "$RUNTIME_DIR/config/control-plane.env"
|
||||||
|
log "Created $RUNTIME_DIR/config/control-plane.env from template."
|
||||||
|
else
|
||||||
|
warn "Environment file already exists, skipping installation."
|
||||||
|
fi
|
||||||
|
|
||||||
|
# 6. Build and start the control plane
|
||||||
|
log "Building and starting control plane services..."
|
||||||
|
cd "$REPO_ROOT/services/control-plane"
|
||||||
|
docker compose build
|
||||||
|
docker compose up -d
|
||||||
|
|
||||||
|
log "VPS control plane bootstrap complete!"
|
||||||
|
|
||||||
|
echo -e "\n${YELLOW}Verification commands:${NC}"
|
||||||
|
echo "1. Check container status: docker compose ps"
|
||||||
|
echo "2. Check operator UI: curl http://localhost:8080/summary"
|
||||||
|
echo "3. Validate world state: ls -l $RUNTIME_DIR/world"
|
||||||
|
echo "4. Monitor events: tail -f $RUNTIME_DIR/events/*/*/*.json"
|
||||||
|
|
@ -1,38 +1,43 @@
|
||||||
#!/usr/bin/env bash
|
#!/usr/bin/env bash
|
||||||
# deploy-stability-agent.sh - Helper to print deployment commands for stability-agent
|
# deploy-stability-agent.sh - Helper to deploy stability-agent (print or SSH)
|
||||||
|
|
||||||
NODE=$1
|
TARGET=$1
|
||||||
|
MODE="print"
|
||||||
REPO_PATH="~/homelab-codex-ws"
|
REPO_PATH="~/homelab-codex-ws"
|
||||||
|
|
||||||
if [[ -z "$NODE" ]]; then
|
if [[ "$2" == "--ssh" ]]; then
|
||||||
echo "Usage: $0 <node-name>"
|
MODE="ssh"
|
||||||
|
fi
|
||||||
|
|
||||||
|
if [[ -z "$TARGET" ]]; then
|
||||||
|
echo "Usage: $0 <node-name> [--ssh]"
|
||||||
echo "Supported nodes: chelsty, piha, solaria, vps"
|
echo "Supported nodes: chelsty, piha, solaria, vps"
|
||||||
exit 1
|
exit 1
|
||||||
fi
|
fi
|
||||||
|
|
||||||
case "$NODE" in
|
case "$TARGET" in
|
||||||
chelsty|piha|solaria|vps)
|
chelsty|piha|solaria|vps)
|
||||||
;;
|
;;
|
||||||
*)
|
*)
|
||||||
echo "Error: Unknown node '$NODE'"
|
echo "Error: Unknown node '$TARGET'"
|
||||||
echo "Supported nodes: chelsty, piha, solaria, vps"
|
echo "Supported nodes: chelsty, piha, solaria, vps"
|
||||||
exit 1
|
exit 1
|
||||||
;;
|
;;
|
||||||
esac
|
esac
|
||||||
|
|
||||||
echo "# --- Deployment commands for $NODE ---"
|
if [[ "$MODE" == "ssh" ]]; then
|
||||||
echo "cd $REPO_PATH"
|
echo "--- Deploying to $TARGET via SSH ---"
|
||||||
echo "git fetch origin"
|
ssh "$TARGET" "cd $REPO_PATH && git fetch origin && git checkout master && git pull && cd services/stability-agent && ./deploy-local.sh"
|
||||||
echo "git checkout master"
|
else
|
||||||
echo "git pull"
|
echo "# --- Deployment commands for $TARGET ---"
|
||||||
echo "cd services/stability-agent"
|
echo "cd $REPO_PATH"
|
||||||
echo ""
|
echo "git fetch origin"
|
||||||
echo "# Command (Docker Compose V2):"
|
echo "git checkout master"
|
||||||
echo "NODE_NAME=$NODE REDIS_HOST=100.108.208.3 REDIS_PORT=6379 REDIS_ENABLED=true docker compose up -d --build --force-recreate"
|
echo "git pull"
|
||||||
echo ""
|
echo "cd services/stability-agent"
|
||||||
echo "# Command (Docker Compose V1):"
|
echo "./deploy-local.sh"
|
||||||
echo "NODE_NAME=$NODE REDIS_HOST=100.108.208.3 REDIS_PORT=6379 REDIS_ENABLED=true docker-compose up -d --build --force-recreate"
|
echo ""
|
||||||
echo ""
|
echo "# Notes:"
|
||||||
echo "# Notes:"
|
echo "# - Run './deploy-local.sh' on the target host."
|
||||||
echo "# - If using host-specific overrides: add '-f ../../hosts/$NODE/runtime/stability-agent/docker-compose.override.yml'"
|
echo "# - Ensure /opt/homelab/state and /opt/homelab/events exist on the host."
|
||||||
echo "# - Ensure /opt/homelab/state and /opt/homelab/events exist on the host."
|
fi
|
||||||
|
|
|
||||||
44
scripts/deploy/verify-agent-fleet.sh
Executable file
44
scripts/deploy/verify-agent-fleet.sh
Executable file
|
|
@ -0,0 +1,44 @@
|
||||||
|
#!/usr/bin/env bash
|
||||||
|
# verify-agent-fleet.sh - Check the status of stability agents across the fleet
|
||||||
|
|
||||||
|
REDIS_HOST="100.108.208.3"
|
||||||
|
REDIS_PORT="6379"
|
||||||
|
|
||||||
|
echo "--- Homelab Agent Fleet Status ---"
|
||||||
|
|
||||||
|
# Check if redis-cli is available
|
||||||
|
if ! command -v redis-cli &> /dev/null; then
|
||||||
|
echo "Error: redis-cli not found. Please install it or run this on a node with Redis access."
|
||||||
|
echo "Expected Redis: $REDIS_HOST:$REDIS_PORT"
|
||||||
|
exit 1
|
||||||
|
fi
|
||||||
|
|
||||||
|
NODES=$(redis-cli -h "$REDIS_HOST" -p "$REDIS_PORT" --raw KEYS 'homelab:nodes:*' | sed 's/homelab:nodes://')
|
||||||
|
|
||||||
|
if [[ -z "$NODES" ]]; then
|
||||||
|
echo "No nodes found in Redis."
|
||||||
|
exit 0
|
||||||
|
fi
|
||||||
|
|
||||||
|
printf "%-15s | %-10s | %-20s | %-10s\n" "NODE" "STATUS" "LAST HEARTBEAT" "DOCKER"
|
||||||
|
printf "%s\n" "--------------------------------------------------------------------------------"
|
||||||
|
|
||||||
|
for NODE in $NODES; do
|
||||||
|
DATA=$(redis-cli -h "$REDIS_HOST" -p "$REDIS_PORT" HGETALL "homelab:nodes:$NODE")
|
||||||
|
|
||||||
|
# Simple parser for HGETALL output (alternating key/value)
|
||||||
|
STATUS=$(echo "$DATA" | grep -A 1 "status" | tail -n 1)
|
||||||
|
HEARTBEAT=$(echo "$DATA" | grep -A 1 "timestamp" | tail -n 1)
|
||||||
|
CHECKS=$(echo "$DATA" | grep -A 1 "checks" | tail -n 1)
|
||||||
|
|
||||||
|
DOCKER_STATUS="unknown"
|
||||||
|
if [[ "$CHECKS" == *"docker"* ]]; then
|
||||||
|
DOCKER_STATUS=$(echo "$CHECKS" | jq -r '.docker.status' 2>/dev/null || echo "error")
|
||||||
|
fi
|
||||||
|
|
||||||
|
printf "%-15s | %-10s | %-20s | %-10s\n" "$NODE" "$STATUS" "$HEARTBEAT" "$DOCKER_STATUS"
|
||||||
|
done
|
||||||
|
|
||||||
|
echo ""
|
||||||
|
echo "Events (last 5):"
|
||||||
|
redis-cli -h "$REDIS_HOST" -p "$REDIS_PORT" XREVRANGE homelab:events + - COUNT 5
|
||||||
318
scripts/observer/observer.py
Normal file
318
scripts/observer/observer.py
Normal file
|
|
@ -0,0 +1,318 @@
|
||||||
|
import os
|
||||||
|
import json
|
||||||
|
import time
|
||||||
|
import glob
|
||||||
|
import logging
|
||||||
|
import yaml
|
||||||
|
from datetime import datetime, timezone
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
# Constants and Paths
|
||||||
|
RUNTIME_PATH = os.getenv("RUNTIME_PATH", "/opt/homelab")
|
||||||
|
EVENTS_DIR = Path(RUNTIME_PATH) / "events"
|
||||||
|
STATE_DIR = Path(RUNTIME_PATH) / "state"
|
||||||
|
LOGS_DIR = Path(RUNTIME_PATH) / "logs"
|
||||||
|
WORLD_DIR = Path(RUNTIME_PATH) / "world"
|
||||||
|
OBSERVER_STATE_FILE = STATE_DIR / "observer_checkpoint.json"
|
||||||
|
|
||||||
|
REPO_ROOT = Path(__file__).parent.parent.parent
|
||||||
|
INVENTORY_TOPOLOGY = REPO_ROOT / "inventory" / "topology.yaml"
|
||||||
|
|
||||||
|
# Logging setup
|
||||||
|
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
|
||||||
|
logger = logging.getLogger("observer")
|
||||||
|
|
||||||
|
class Observer:
|
||||||
|
def __init__(self):
|
||||||
|
self.last_processed_file = None
|
||||||
|
self.world_state = {
|
||||||
|
"nodes": {},
|
||||||
|
"services": {},
|
||||||
|
"deployments": {},
|
||||||
|
"incidents": {},
|
||||||
|
"summary": {
|
||||||
|
"last_update": datetime.now(timezone.utc).isoformat(),
|
||||||
|
"status": "initializing",
|
||||||
|
"active_incidents_count": 0
|
||||||
|
}
|
||||||
|
}
|
||||||
|
self.inventory = self._load_inventory()
|
||||||
|
self._ensure_dirs()
|
||||||
|
self._load_checkpoint()
|
||||||
|
|
||||||
|
def _ensure_dirs(self):
|
||||||
|
WORLD_DIR.mkdir(parents=True, exist_ok=True)
|
||||||
|
STATE_DIR.mkdir(parents=True, exist_ok=True)
|
||||||
|
EVENTS_DIR.mkdir(parents=True, exist_ok=True)
|
||||||
|
LOGS_DIR.mkdir(parents=True, exist_ok=True)
|
||||||
|
|
||||||
|
def _load_inventory(self):
|
||||||
|
inventory = {"nodes": {}, "services": {}}
|
||||||
|
try:
|
||||||
|
if INVENTORY_TOPOLOGY.exists():
|
||||||
|
with open(INVENTORY_TOPOLOGY, "r") as f:
|
||||||
|
topo = yaml.safe_load(f)
|
||||||
|
for node_name, node_info in topo.get("nodes", {}).items():
|
||||||
|
inventory["nodes"][node_name] = {
|
||||||
|
"roles": node_info.get("roles", []),
|
||||||
|
"connectivity": node_info.get("connectivity", {})
|
||||||
|
}
|
||||||
|
|
||||||
|
# Load service assignments from hosts files
|
||||||
|
hosts_dir = REPO_ROOT / "hosts"
|
||||||
|
for host_dir in hosts_dir.iterdir():
|
||||||
|
if host_dir.is_dir():
|
||||||
|
svc_file = host_dir / "services.yaml"
|
||||||
|
if svc_file.exists():
|
||||||
|
with open(svc_file, "r") as f:
|
||||||
|
svc_data = yaml.safe_load(f)
|
||||||
|
host_name = svc_data.get("host")
|
||||||
|
for svc_name, svc_info in svc_data.get("services", {}).items():
|
||||||
|
if host_name not in inventory["services"]:
|
||||||
|
inventory["services"][host_name] = {}
|
||||||
|
inventory["services"][host_name][svc_name] = {
|
||||||
|
"role": svc_info.get("role"),
|
||||||
|
"exposure": svc_info.get("exposure")
|
||||||
|
}
|
||||||
|
except Exception as e:
|
||||||
|
logger.error(f"Failed to load inventory: {e}")
|
||||||
|
return inventory
|
||||||
|
|
||||||
|
def _load_checkpoint(self):
|
||||||
|
if OBSERVER_STATE_FILE.exists():
|
||||||
|
try:
|
||||||
|
with open(OBSERVER_STATE_FILE, "r") as f:
|
||||||
|
checkpoint = json.load(f)
|
||||||
|
self.last_processed_file = checkpoint.get("last_processed_file")
|
||||||
|
# We might want to persist partial world state,
|
||||||
|
# but for now we rebuild from events (idempotent)
|
||||||
|
# or we can load existing world state files.
|
||||||
|
self._load_world_from_disk()
|
||||||
|
except Exception as e:
|
||||||
|
logger.error(f"Failed to load checkpoint: {e}")
|
||||||
|
|
||||||
|
def _load_world_from_disk(self):
|
||||||
|
# Optional: Load existing state to resume faster
|
||||||
|
files = {
|
||||||
|
"nodes": WORLD_DIR / "nodes.json",
|
||||||
|
"services": WORLD_DIR / "services.json",
|
||||||
|
"deployments": WORLD_DIR / "deployments.json",
|
||||||
|
"incidents": WORLD_DIR / "incidents.json",
|
||||||
|
"summary": WORLD_DIR / "runtime-summary.json"
|
||||||
|
}
|
||||||
|
for key, path in files.items():
|
||||||
|
if path.exists():
|
||||||
|
try:
|
||||||
|
with open(path, "r") as f:
|
||||||
|
self.world_state[key] = json.load(f)
|
||||||
|
except Exception as e:
|
||||||
|
logger.error(f"Failed to load {key} state: {e}")
|
||||||
|
|
||||||
|
def _save_checkpoint(self):
|
||||||
|
try:
|
||||||
|
with open(OBSERVER_STATE_FILE, "w") as f:
|
||||||
|
json.dump({"last_processed_file": self.last_processed_file}, f)
|
||||||
|
except Exception as e:
|
||||||
|
logger.error(f"Failed to save checkpoint: {e}")
|
||||||
|
|
||||||
|
def _save_world(self):
|
||||||
|
self.world_state["summary"]["last_update"] = datetime.now(timezone.utc).isoformat()
|
||||||
|
active_incidents = [
|
||||||
|
k for k, v in self.world_state["incidents"].items() if v.get("status") == "active"
|
||||||
|
]
|
||||||
|
self.world_state["summary"]["active_incidents_count"] = len(active_incidents)
|
||||||
|
|
||||||
|
if active_incidents:
|
||||||
|
self.world_state["summary"]["status"] = "degraded"
|
||||||
|
else:
|
||||||
|
self.world_state["summary"]["status"] = "nominal"
|
||||||
|
|
||||||
|
files = {
|
||||||
|
"nodes.json": self.world_state["nodes"],
|
||||||
|
"services.json": self.world_state["services"],
|
||||||
|
"deployments.json": self.world_state["deployments"],
|
||||||
|
"incidents.json": self.world_state["incidents"],
|
||||||
|
"recommendations.json": [], # Placeholder to satisfy requirements
|
||||||
|
"runtime-summary.json": self.world_state["summary"]
|
||||||
|
}
|
||||||
|
for filename, data in files.items():
|
||||||
|
try:
|
||||||
|
with open(WORLD_DIR / filename, "w") as f:
|
||||||
|
json.dump(data, f, indent=2)
|
||||||
|
except Exception as e:
|
||||||
|
logger.error(f"Failed to save {filename}: {e}")
|
||||||
|
|
||||||
|
def process_event(self, event):
|
||||||
|
etype = event.get("type")
|
||||||
|
node = event.get("node")
|
||||||
|
service = event.get("service")
|
||||||
|
severity = event.get("severity")
|
||||||
|
timestamp = event.get("timestamp")
|
||||||
|
cid = event.get("correlation_id")
|
||||||
|
payload = event.get("payload", {})
|
||||||
|
|
||||||
|
# 1. Update Node State
|
||||||
|
if node not in self.world_state["nodes"]:
|
||||||
|
self.world_state["nodes"][node] = {
|
||||||
|
"status": "unknown",
|
||||||
|
"last_seen": None,
|
||||||
|
"roles": self.inventory["nodes"].get(node, {}).get("roles", [])
|
||||||
|
}
|
||||||
|
self.world_state["nodes"][node]["last_seen"] = timestamp
|
||||||
|
|
||||||
|
if etype == "node_online":
|
||||||
|
self.world_state["nodes"][node]["status"] = "online"
|
||||||
|
elif etype == "node_offline":
|
||||||
|
self.world_state["nodes"][node]["status"] = "offline"
|
||||||
|
|
||||||
|
# 2. Update Service State
|
||||||
|
if service and service != "all":
|
||||||
|
svc_key = f"{node}/{service}"
|
||||||
|
if svc_key not in self.world_state["services"]:
|
||||||
|
self.world_state["services"][svc_key] = {
|
||||||
|
"node": node,
|
||||||
|
"service": service,
|
||||||
|
"status": "unknown",
|
||||||
|
"last_check": None,
|
||||||
|
"incident_id": None
|
||||||
|
}
|
||||||
|
self.world_state["services"][svc_key]["last_check"] = timestamp
|
||||||
|
|
||||||
|
if etype == "service_recovered":
|
||||||
|
self.world_state["services"][svc_key]["status"] = "healthy"
|
||||||
|
self._resolve_incident(svc_key, timestamp)
|
||||||
|
elif etype in ["service_unhealthy", "healthcheck_failed"]:
|
||||||
|
self.world_state["services"][svc_key]["status"] = "unhealthy"
|
||||||
|
self._handle_incident(svc_key, event)
|
||||||
|
|
||||||
|
# 3. Update Deployment State
|
||||||
|
if etype.startswith("deployment_") and cid:
|
||||||
|
if cid not in self.world_state["deployments"]:
|
||||||
|
self.world_state["deployments"][cid] = {
|
||||||
|
"node": node,
|
||||||
|
"service": service,
|
||||||
|
"status": "unknown",
|
||||||
|
"started_at": None,
|
||||||
|
"finished_at": None,
|
||||||
|
"events": []
|
||||||
|
}
|
||||||
|
self.world_state["deployments"][cid]["events"].append({
|
||||||
|
"type": etype,
|
||||||
|
"timestamp": timestamp,
|
||||||
|
"payload": payload
|
||||||
|
})
|
||||||
|
if etype == "deployment_started":
|
||||||
|
self.world_state["deployments"][cid]["status"] = "in_progress"
|
||||||
|
self.world_state["deployments"][cid]["started_at"] = timestamp
|
||||||
|
elif etype == "deployment_completed":
|
||||||
|
self.world_state["deployments"][cid]["status"] = "completed"
|
||||||
|
self.world_state["deployments"][cid]["finished_at"] = timestamp
|
||||||
|
elif etype == "deployment_failed":
|
||||||
|
self.world_state["deployments"][cid]["status"] = "failed"
|
||||||
|
self.world_state["deployments"][cid]["finished_at"] = timestamp
|
||||||
|
# Deployment failure often creates an incident
|
||||||
|
self._handle_deployment_failure(event)
|
||||||
|
|
||||||
|
def _handle_incident(self, svc_key, event):
|
||||||
|
# Correlation: collapse repeated failures for the same service on the same node
|
||||||
|
active_incident = self.world_state["services"][svc_key].get("incident_id")
|
||||||
|
|
||||||
|
if active_incident and active_incident in self.world_state["incidents"]:
|
||||||
|
incident = self.world_state["incidents"][active_incident]
|
||||||
|
if incident["status"] == "active":
|
||||||
|
incident["last_occurrence"] = event["timestamp"]
|
||||||
|
incident["occurrence_count"] = incident.get("occurrence_count", 1) + 1
|
||||||
|
incident["events"].append(event["timestamp"])
|
||||||
|
return
|
||||||
|
|
||||||
|
# Create new incident
|
||||||
|
incident_id = f"inc-{int(time.time())}-{event.get('node')}-{event.get('service')}"
|
||||||
|
self.world_state["incidents"][incident_id] = {
|
||||||
|
"id": incident_id,
|
||||||
|
"node": event.get("node"),
|
||||||
|
"service": event.get("service"),
|
||||||
|
"status": "active",
|
||||||
|
"severity": event.get("severity"),
|
||||||
|
"started_at": event.get("timestamp"),
|
||||||
|
"last_occurrence": event.get("timestamp"),
|
||||||
|
"occurrence_count": 1,
|
||||||
|
"events": [event["timestamp"]],
|
||||||
|
"correlation_id": event.get("correlation_id")
|
||||||
|
}
|
||||||
|
self.world_state["services"][svc_key]["incident_id"] = incident_id
|
||||||
|
|
||||||
|
def _resolve_incident(self, svc_key, timestamp):
|
||||||
|
incident_id = self.world_state["services"][svc_key].get("incident_id")
|
||||||
|
if incident_id and incident_id in self.world_state["incidents"]:
|
||||||
|
if self.world_state["incidents"][incident_id]["status"] == "active":
|
||||||
|
self.world_state["incidents"][incident_id]["status"] = "resolved"
|
||||||
|
self.world_state["incidents"][incident_id]["resolved_at"] = timestamp
|
||||||
|
self.world_state["services"][svc_key]["incident_id"] = None
|
||||||
|
|
||||||
|
def _handle_deployment_failure(self, event):
|
||||||
|
# Specific logic for deployment failures
|
||||||
|
svc_key = f"{event.get('node')}/{event.get('service')}"
|
||||||
|
self._handle_incident(svc_key, event)
|
||||||
|
|
||||||
|
# Link diagnostics if available in payload
|
||||||
|
incident_id = self.world_state["services"][svc_key].get("incident_id")
|
||||||
|
if incident_id and incident_id in self.world_state["incidents"]:
|
||||||
|
payload = event.get("payload", {})
|
||||||
|
if "diagnostics_file" in payload:
|
||||||
|
self.world_state["incidents"][incident_id]["diagnostics_ref"] = payload["diagnostics_file"]
|
||||||
|
elif "error" in payload:
|
||||||
|
self.world_state["incidents"][incident_id]["last_error"] = payload["error"]
|
||||||
|
|
||||||
|
def run_once(self):
|
||||||
|
# Update heartbeat
|
||||||
|
heartbeat_file = STATE_DIR / "observer.heartbeat"
|
||||||
|
try:
|
||||||
|
heartbeat_file.touch()
|
||||||
|
except Exception as e:
|
||||||
|
logger.error(f"Failed to touch heartbeat file: {e}")
|
||||||
|
|
||||||
|
# Find all event files
|
||||||
|
event_files = sorted(glob.glob(str(EVENTS_DIR / "**" / "*.json"), recursive=True))
|
||||||
|
|
||||||
|
new_files = []
|
||||||
|
if self.last_processed_file:
|
||||||
|
try:
|
||||||
|
idx = event_files.index(self.last_processed_file)
|
||||||
|
new_files = event_files[idx+1:]
|
||||||
|
except ValueError:
|
||||||
|
# If last_processed_file is gone or not in list, process all
|
||||||
|
new_files = event_files
|
||||||
|
else:
|
||||||
|
new_files = event_files
|
||||||
|
|
||||||
|
if not new_files:
|
||||||
|
# Even if no new events, we update freshness of summary
|
||||||
|
self._save_world()
|
||||||
|
return
|
||||||
|
|
||||||
|
logger.info(f"Processing {len(new_files)} new events")
|
||||||
|
for file_path in new_files:
|
||||||
|
try:
|
||||||
|
with open(file_path, "r") as f:
|
||||||
|
event = json.load(f)
|
||||||
|
self.process_event(event)
|
||||||
|
self.last_processed_file = file_path
|
||||||
|
except Exception as e:
|
||||||
|
logger.error(f"Error processing {file_path}: {e}")
|
||||||
|
|
||||||
|
self._save_checkpoint()
|
||||||
|
self._save_world()
|
||||||
|
|
||||||
|
def loop(self, interval=5):
|
||||||
|
logger.info("Starting observer loop")
|
||||||
|
while True:
|
||||||
|
self.run_once()
|
||||||
|
time.sleep(interval)
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
import sys
|
||||||
|
observer = Observer()
|
||||||
|
if "--run-once" in sys.argv:
|
||||||
|
observer.run_once()
|
||||||
|
else:
|
||||||
|
observer.loop()
|
||||||
83
scripts/observer/test_setup.sh
Normal file
83
scripts/observer/test_setup.sh
Normal file
|
|
@ -0,0 +1,83 @@
|
||||||
|
#!/usr/bin/env bash
|
||||||
|
mkdir -p /tmp/homelab/events/2026-05-12/saturn
|
||||||
|
mkdir -p /tmp/homelab/state
|
||||||
|
mkdir -p /tmp/homelab/logs
|
||||||
|
mkdir -p /tmp/homelab/world
|
||||||
|
|
||||||
|
cat <<EOF > /tmp/homelab/events/2026-05-12/saturn/120000_node_online_1.json
|
||||||
|
{
|
||||||
|
"timestamp": "2026-05-12T12:00:00Z",
|
||||||
|
"node": "saturn",
|
||||||
|
"type": "node_online",
|
||||||
|
"severity": "info",
|
||||||
|
"source": "system",
|
||||||
|
"service": "all",
|
||||||
|
"correlation_id": "init",
|
||||||
|
"payload": {}
|
||||||
|
}
|
||||||
|
EOF
|
||||||
|
|
||||||
|
cat <<EOF > /tmp/homelab/events/2026-05-12/saturn/120500_service_unhealthy_1.json
|
||||||
|
{
|
||||||
|
"timestamp": "2026-05-12T12:05:00Z",
|
||||||
|
"node": "saturn",
|
||||||
|
"type": "service_unhealthy",
|
||||||
|
"severity": "error",
|
||||||
|
"source": "healthcheck",
|
||||||
|
"service": "mosquitto",
|
||||||
|
"correlation_id": "hc-1",
|
||||||
|
"payload": {"error": "connection refused"}
|
||||||
|
}
|
||||||
|
EOF
|
||||||
|
|
||||||
|
cat <<EOF > /tmp/homelab/events/2026-05-12/saturn/120600_service_unhealthy_2.json
|
||||||
|
{
|
||||||
|
"timestamp": "2026-05-12T12:06:00Z",
|
||||||
|
"node": "saturn",
|
||||||
|
"type": "service_unhealthy",
|
||||||
|
"severity": "error",
|
||||||
|
"source": "healthcheck",
|
||||||
|
"service": "mosquitto",
|
||||||
|
"correlation_id": "hc-2",
|
||||||
|
"payload": {"error": "connection refused"}
|
||||||
|
}
|
||||||
|
EOF
|
||||||
|
|
||||||
|
cat <<EOF > /tmp/homelab/events/2026-05-12/saturn/121000_service_recovered_1.json
|
||||||
|
{
|
||||||
|
"timestamp": "2026-05-12T12:10:00Z",
|
||||||
|
"node": "saturn",
|
||||||
|
"type": "service_recovered",
|
||||||
|
"severity": "info",
|
||||||
|
"source": "healthcheck",
|
||||||
|
"service": "mosquitto",
|
||||||
|
"correlation_id": "hc-3",
|
||||||
|
"payload": {}
|
||||||
|
}
|
||||||
|
EOF
|
||||||
|
|
||||||
|
cat <<EOF > /tmp/homelab/events/2026-05-12/saturn/121500_deployment_started_1.json
|
||||||
|
{
|
||||||
|
"timestamp": "2026-05-12T12:15:00Z",
|
||||||
|
"node": "saturn",
|
||||||
|
"type": "deployment_started",
|
||||||
|
"severity": "info",
|
||||||
|
"source": "deploy_agent",
|
||||||
|
"service": "mosquitto",
|
||||||
|
"correlation_id": "deploy-1",
|
||||||
|
"payload": {"version": "2.0.18"}
|
||||||
|
}
|
||||||
|
EOF
|
||||||
|
|
||||||
|
cat <<EOF > /tmp/homelab/events/2026-05-12/saturn/121600_deployment_failed_1.json
|
||||||
|
{
|
||||||
|
"timestamp": "2026-05-12T12:16:00Z",
|
||||||
|
"node": "saturn",
|
||||||
|
"type": "deployment_failed",
|
||||||
|
"severity": "error",
|
||||||
|
"source": "deploy_agent",
|
||||||
|
"service": "mosquitto",
|
||||||
|
"correlation_id": "deploy-1",
|
||||||
|
"payload": {"error": "container crash", "diagnostics_file": "/opt/homelab/logs/diagnostics-deploy-1.log"}
|
||||||
|
}
|
||||||
|
EOF
|
||||||
23
services/control-plane/Dockerfile
Normal file
23
services/control-plane/Dockerfile
Normal file
|
|
@ -0,0 +1,23 @@
|
||||||
|
FROM python:3.11-slim
|
||||||
|
|
||||||
|
WORKDIR /app
|
||||||
|
|
||||||
|
RUN pip install --no-cache-dir pyyaml
|
||||||
|
|
||||||
|
# Create homelab user
|
||||||
|
RUN useradd -m -u 1000 homelab
|
||||||
|
|
||||||
|
# Copy sources
|
||||||
|
COPY src/ /app/src/
|
||||||
|
# Also need the observer script if we want to run it from here,
|
||||||
|
# but I'll copy it from the repo during build or mount it.
|
||||||
|
# Actually, I'll copy the entire scripts/ directory to /repo/scripts
|
||||||
|
# so the supervisor/executor can find them.
|
||||||
|
|
||||||
|
# For simplicity, we'll assume the repo is mounted at /repo
|
||||||
|
ENV REPO_ROOT=/repo
|
||||||
|
ENV RUNTIME_PATH=/opt/homelab
|
||||||
|
ENV PYTHONUNBUFFERED=1
|
||||||
|
|
||||||
|
# Default command (will be overridden in docker-compose)
|
||||||
|
CMD ["python", "src/operator_ui.py"]
|
||||||
73
services/control-plane/docker-compose.yml
Normal file
73
services/control-plane/docker-compose.yml
Normal file
|
|
@ -0,0 +1,73 @@
|
||||||
|
services:
|
||||||
|
operator-ui:
|
||||||
|
build: .
|
||||||
|
container_name: control-plane-ui
|
||||||
|
user: "1000:1000"
|
||||||
|
command: python src/operator_ui.py
|
||||||
|
ports:
|
||||||
|
- "18180:8080"
|
||||||
|
volumes:
|
||||||
|
- /opt/homelab:/opt/homelab
|
||||||
|
restart: unless-stopped
|
||||||
|
healthcheck:
|
||||||
|
test: ["CMD", "curl", "-f", "http://localhost:8080/summary"]
|
||||||
|
interval: 30s
|
||||||
|
timeout: 10s
|
||||||
|
retries: 3
|
||||||
|
|
||||||
|
observer:
|
||||||
|
build: .
|
||||||
|
container_name: control-plane-observer
|
||||||
|
user: "1000:1000"
|
||||||
|
command: python /repo/scripts/observer/observer.py
|
||||||
|
volumes:
|
||||||
|
- /opt/homelab:/opt/homelab
|
||||||
|
- ../..:/repo:ro
|
||||||
|
restart: unless-stopped
|
||||||
|
environment:
|
||||||
|
- REPO_ROOT=/repo
|
||||||
|
- RUNTIME_PATH=/opt/homelab
|
||||||
|
healthcheck:
|
||||||
|
test: ["CMD", "test", "-f", "/opt/homelab/state/observer.heartbeat"]
|
||||||
|
interval: 30s
|
||||||
|
timeout: 5s
|
||||||
|
retries: 3
|
||||||
|
start_period: 5s
|
||||||
|
|
||||||
|
supervisor:
|
||||||
|
build: .
|
||||||
|
container_name: control-plane-supervisor
|
||||||
|
user: "1000:1000"
|
||||||
|
command: python src/supervisor.py
|
||||||
|
volumes:
|
||||||
|
- /opt/homelab:/opt/homelab
|
||||||
|
- ../..:/repo:ro
|
||||||
|
restart: unless-stopped
|
||||||
|
environment:
|
||||||
|
- REPO_ROOT=/repo
|
||||||
|
- RUNTIME_PATH=/opt/homelab
|
||||||
|
healthcheck:
|
||||||
|
test: ["CMD", "test", "-f", "/opt/homelab/state/supervisor.heartbeat"]
|
||||||
|
interval: 60s
|
||||||
|
timeout: 5s
|
||||||
|
retries: 3
|
||||||
|
start_period: 10s
|
||||||
|
|
||||||
|
executor:
|
||||||
|
build: .
|
||||||
|
container_name: control-plane-executor
|
||||||
|
command: python src/executor.py
|
||||||
|
volumes:
|
||||||
|
- /opt/homelab:/opt/homelab
|
||||||
|
- ../..:/repo
|
||||||
|
- /var/run/docker.sock:/var/run/docker.sock
|
||||||
|
restart: unless-stopped
|
||||||
|
environment:
|
||||||
|
- REPO_ROOT=/repo
|
||||||
|
- RUNTIME_PATH=/opt/homelab
|
||||||
|
healthcheck:
|
||||||
|
test: ["CMD", "test", "-f", "/opt/homelab/state/executor.heartbeat"]
|
||||||
|
interval: 30s
|
||||||
|
timeout: 5s
|
||||||
|
retries: 3
|
||||||
|
start_period: 5s
|
||||||
109
services/control-plane/src/executor.py
Normal file
109
services/control-plane/src/executor.py
Normal file
|
|
@ -0,0 +1,109 @@
|
||||||
|
import os
|
||||||
|
import json
|
||||||
|
import time
|
||||||
|
import logging
|
||||||
|
import subprocess
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
# Constants and Paths
|
||||||
|
RUNTIME_PATH = os.getenv("RUNTIME_PATH", "/opt/homelab")
|
||||||
|
ACTIONS_DIR = Path(RUNTIME_PATH) / "actions"
|
||||||
|
REPO_ROOT = Path(os.getenv("REPO_ROOT", "/repo"))
|
||||||
|
|
||||||
|
# Logging setup
|
||||||
|
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
|
||||||
|
logger = logging.getLogger("executor")
|
||||||
|
|
||||||
|
class Executor:
|
||||||
|
def __init__(self):
|
||||||
|
self._ensure_dirs()
|
||||||
|
|
||||||
|
def _ensure_dirs(self):
|
||||||
|
for s in ["approved", "running", "completed", "failed", "rejected"]:
|
||||||
|
(ACTIONS_DIR / s).mkdir(parents=True, exist_ok=True)
|
||||||
|
|
||||||
|
def process_actions(self):
|
||||||
|
# Update heartbeat
|
||||||
|
heartbeat_file = ACTIONS_DIR.parent / "state" / "executor.heartbeat"
|
||||||
|
try:
|
||||||
|
heartbeat_file.touch()
|
||||||
|
except Exception as e:
|
||||||
|
logger.error(f"Failed to touch heartbeat file: {e}")
|
||||||
|
|
||||||
|
approved_dir = ACTIONS_DIR / "approved"
|
||||||
|
action_files = sorted(approved_dir.glob("*.json"))
|
||||||
|
|
||||||
|
for action_file in action_files:
|
||||||
|
self._execute_action(action_file)
|
||||||
|
|
||||||
|
def _execute_action(self, action_file):
|
||||||
|
action_id = action_file.stem
|
||||||
|
logger.info(f"Executing action: {action_id}")
|
||||||
|
|
||||||
|
# Move to running
|
||||||
|
running_path = ACTIONS_DIR / "running" / f"{action_id}.json"
|
||||||
|
try:
|
||||||
|
with open(action_file, "r") as f:
|
||||||
|
data = json.load(f)
|
||||||
|
data["status"] = "running"
|
||||||
|
data["started_at"] = time.time()
|
||||||
|
with open(running_path, "w") as f:
|
||||||
|
json.dump(data, f, indent=2)
|
||||||
|
action_file.unlink()
|
||||||
|
except Exception as e:
|
||||||
|
logger.error(f"Failed to move {action_id} to running: {e}")
|
||||||
|
return
|
||||||
|
|
||||||
|
# Execute
|
||||||
|
success = False
|
||||||
|
error_msg = ""
|
||||||
|
try:
|
||||||
|
action_type = data.get("type")
|
||||||
|
node = data.get("node")
|
||||||
|
service = data.get("service")
|
||||||
|
|
||||||
|
if action_type == "redeploy":
|
||||||
|
# Call deploy-node.sh
|
||||||
|
cmd = [
|
||||||
|
str(REPO_ROOT / "scripts" / "deploy" / "deploy-node.sh"),
|
||||||
|
node,
|
||||||
|
service
|
||||||
|
]
|
||||||
|
logger.info(f"Running command: {' '.join(cmd)}")
|
||||||
|
result = subprocess.run(cmd, capture_output=True, text=True, cwd=str(REPO_ROOT))
|
||||||
|
if result.returncode == 0:
|
||||||
|
success = True
|
||||||
|
else:
|
||||||
|
success = False
|
||||||
|
error_msg = result.stderr or result.stdout
|
||||||
|
else:
|
||||||
|
success = False
|
||||||
|
error_msg = f"Unknown action type: {action_type}"
|
||||||
|
except Exception as e:
|
||||||
|
success = False
|
||||||
|
error_msg = str(e)
|
||||||
|
|
||||||
|
# Move to completed/failed
|
||||||
|
target_status = "completed" if success else "failed"
|
||||||
|
target_path = ACTIONS_DIR / target_status / f"{action_id}.json"
|
||||||
|
try:
|
||||||
|
data["status"] = target_status
|
||||||
|
data["finished_at"] = time.time()
|
||||||
|
if not success:
|
||||||
|
data["error"] = error_msg
|
||||||
|
with open(target_path, "w") as f:
|
||||||
|
json.dump(data, f, indent=2)
|
||||||
|
running_path.unlink()
|
||||||
|
logger.info(f"Action {action_id} {target_status}")
|
||||||
|
except Exception as e:
|
||||||
|
logger.error(f"Failed to move {action_id} to {target_status}: {e}")
|
||||||
|
|
||||||
|
def loop(self, interval=10):
|
||||||
|
logger.info("Starting executor loop")
|
||||||
|
while True:
|
||||||
|
self.process_actions()
|
||||||
|
time.sleep(interval)
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
executor = Executor()
|
||||||
|
executor.loop()
|
||||||
701
services/control-plane/src/index.html
Normal file
701
services/control-plane/src/index.html
Normal file
|
|
@ -0,0 +1,701 @@
|
||||||
|
<!doctype html>
|
||||||
|
<html lang="en">
|
||||||
|
<head>
|
||||||
|
<meta charset="utf-8">
|
||||||
|
<meta name="viewport" content="width=device-width, initial-scale=1">
|
||||||
|
<title>Operator Control Plane</title>
|
||||||
|
<style>
|
||||||
|
:root {
|
||||||
|
--bg-color: #0a0c0e;
|
||||||
|
--sidebar-color: #14171a;
|
||||||
|
--card-color: #1c2024;
|
||||||
|
--border-color: #2a3540;
|
||||||
|
--text-color: #e7edf3;
|
||||||
|
--text-muted: #94a3b8;
|
||||||
|
--accent-color: #3eaf7c;
|
||||||
|
--nominal: #3eaf7c;
|
||||||
|
--degraded: #e7c000;
|
||||||
|
--unstable: #e67e22;
|
||||||
|
--reconciling: #3498db;
|
||||||
|
--error: #c0392b;
|
||||||
|
--safe: #3eaf7c;
|
||||||
|
--guarded: #e67e22;
|
||||||
|
--dangerous: #c0392b;
|
||||||
|
}
|
||||||
|
|
||||||
|
body {
|
||||||
|
margin: 0;
|
||||||
|
font-family: 'Inter', system-ui, -apple-system, sans-serif;
|
||||||
|
background: var(--bg-color);
|
||||||
|
color: var(--text-color);
|
||||||
|
display: flex;
|
||||||
|
height: 100vh;
|
||||||
|
overflow: hidden;
|
||||||
|
}
|
||||||
|
|
||||||
|
/* Sidebar */
|
||||||
|
.sidebar {
|
||||||
|
width: 240px;
|
||||||
|
background: var(--sidebar-color);
|
||||||
|
border-right: 1px solid var(--border-color);
|
||||||
|
display: flex;
|
||||||
|
flex-direction: column;
|
||||||
|
flex-shrink: 0;
|
||||||
|
}
|
||||||
|
|
||||||
|
.sidebar-header {
|
||||||
|
padding: 24px;
|
||||||
|
font-weight: 800;
|
||||||
|
font-size: 14px;
|
||||||
|
letter-spacing: 0.1em;
|
||||||
|
color: var(--accent-color);
|
||||||
|
border-bottom: 1px solid var(--border-color);
|
||||||
|
}
|
||||||
|
|
||||||
|
.nav-list {
|
||||||
|
list-style: none;
|
||||||
|
padding: 12px 0;
|
||||||
|
margin: 0;
|
||||||
|
flex-grow: 1;
|
||||||
|
}
|
||||||
|
|
||||||
|
.nav-item {
|
||||||
|
padding: 12px 24px;
|
||||||
|
cursor: pointer;
|
||||||
|
font-size: 14px;
|
||||||
|
color: var(--text-muted);
|
||||||
|
transition: all 0.2s;
|
||||||
|
display: flex;
|
||||||
|
align-items: center;
|
||||||
|
gap: 12px;
|
||||||
|
}
|
||||||
|
|
||||||
|
.nav-item:hover {
|
||||||
|
background: rgba(255, 255, 255, 0.05);
|
||||||
|
color: var(--text-color);
|
||||||
|
}
|
||||||
|
|
||||||
|
.nav-item.active {
|
||||||
|
background: rgba(62, 175, 124, 0.1);
|
||||||
|
color: var(--accent-color);
|
||||||
|
border-left: 3px solid var(--accent-color);
|
||||||
|
}
|
||||||
|
|
||||||
|
.sidebar-footer {
|
||||||
|
padding: 16px;
|
||||||
|
border-top: 1px solid var(--border-color);
|
||||||
|
font-size: 12px;
|
||||||
|
}
|
||||||
|
|
||||||
|
/* Content Area */
|
||||||
|
.main-content {
|
||||||
|
flex-grow: 1;
|
||||||
|
display: flex;
|
||||||
|
flex-direction: column;
|
||||||
|
overflow: hidden;
|
||||||
|
}
|
||||||
|
|
||||||
|
header {
|
||||||
|
height: 64px;
|
||||||
|
border-bottom: 1px solid var(--border-color);
|
||||||
|
display: flex;
|
||||||
|
align-items: center;
|
||||||
|
padding: 0 24px;
|
||||||
|
justify-content: space-between;
|
||||||
|
background: var(--bg-color);
|
||||||
|
}
|
||||||
|
|
||||||
|
.view-title {
|
||||||
|
font-size: 18px;
|
||||||
|
font-weight: 600;
|
||||||
|
}
|
||||||
|
|
||||||
|
.content-scroll {
|
||||||
|
flex-grow: 1;
|
||||||
|
overflow-y: auto;
|
||||||
|
padding: 24px;
|
||||||
|
}
|
||||||
|
|
||||||
|
/* Cards & Grids */
|
||||||
|
.grid {
|
||||||
|
display: grid;
|
||||||
|
grid-template-columns: repeat(auto-fill, minmax(350px, 1fr));
|
||||||
|
gap: 20px;
|
||||||
|
}
|
||||||
|
|
||||||
|
.card {
|
||||||
|
background: var(--card-color);
|
||||||
|
border: 1px solid var(--border-color);
|
||||||
|
padding: 20px;
|
||||||
|
border-radius: 4px;
|
||||||
|
position: relative;
|
||||||
|
}
|
||||||
|
|
||||||
|
.card-header {
|
||||||
|
display: flex;
|
||||||
|
justify-content: space-between;
|
||||||
|
align-items: center;
|
||||||
|
margin-bottom: 16px;
|
||||||
|
}
|
||||||
|
|
||||||
|
.card-title {
|
||||||
|
font-weight: 700;
|
||||||
|
font-size: 16px;
|
||||||
|
}
|
||||||
|
|
||||||
|
/* Status Badges */
|
||||||
|
.badge {
|
||||||
|
padding: 4px 8px;
|
||||||
|
border-radius: 4px;
|
||||||
|
font-size: 11px;
|
||||||
|
font-weight: 700;
|
||||||
|
text-transform: uppercase;
|
||||||
|
}
|
||||||
|
|
||||||
|
.status-nominal { background: rgba(62, 175, 124, 0.1); color: var(--nominal); }
|
||||||
|
.status-degraded { background: rgba(231, 192, 0, 0.1); color: var(--degraded); }
|
||||||
|
.status-unstable { background: rgba(230, 126, 34, 0.1); color: var(--unstable); }
|
||||||
|
.status-reconciling { background: rgba(52, 152, 219, 0.1); color: var(--reconciling); }
|
||||||
|
.status-error { background: rgba(192, 57, 43, 0.1); color: var(--error); }
|
||||||
|
|
||||||
|
/* Timeline */
|
||||||
|
.timeline {
|
||||||
|
display: flex;
|
||||||
|
flex-direction: column;
|
||||||
|
gap: 12px;
|
||||||
|
}
|
||||||
|
|
||||||
|
.event {
|
||||||
|
padding: 12px;
|
||||||
|
border-left: 2px solid var(--border-color);
|
||||||
|
background: rgba(255, 255, 255, 0.02);
|
||||||
|
font-family: ui-monospace, monospace;
|
||||||
|
font-size: 13px;
|
||||||
|
}
|
||||||
|
|
||||||
|
.event.high { border-left-color: var(--error); }
|
||||||
|
.event.medium { border-left-color: var(--unstable); }
|
||||||
|
.event.low { border-left-color: var(--nominal); }
|
||||||
|
|
||||||
|
.event-header {
|
||||||
|
display: flex;
|
||||||
|
justify-content: space-between;
|
||||||
|
margin-bottom: 4px;
|
||||||
|
color: var(--text-muted);
|
||||||
|
}
|
||||||
|
|
||||||
|
/* Forms & Inputs */
|
||||||
|
.controls {
|
||||||
|
display: flex;
|
||||||
|
gap: 12px;
|
||||||
|
margin-top: 20px;
|
||||||
|
}
|
||||||
|
|
||||||
|
input, button {
|
||||||
|
background: var(--card-color);
|
||||||
|
border: 1px solid var(--border-color);
|
||||||
|
color: var(--text-color);
|
||||||
|
padding: 8px 16px;
|
||||||
|
font-size: 14px;
|
||||||
|
border-radius: 4px;
|
||||||
|
}
|
||||||
|
|
||||||
|
button {
|
||||||
|
cursor: pointer;
|
||||||
|
font-weight: 600;
|
||||||
|
}
|
||||||
|
|
||||||
|
button:hover { background: var(--border-color); }
|
||||||
|
|
||||||
|
.btn-primary { background: var(--accent-color); color: white; border: none; }
|
||||||
|
.btn-primary:hover { background: #359b6d; }
|
||||||
|
|
||||||
|
/* Utility */
|
||||||
|
.hidden { display: none !important; }
|
||||||
|
.mono { font-family: ui-monospace, monospace; }
|
||||||
|
.label { color: var(--text-muted); font-size: 12px; margin-bottom: 4px; }
|
||||||
|
.value { font-weight: 500; margin-bottom: 12px; }
|
||||||
|
|
||||||
|
.risk-safe { background: rgba(62, 175, 124, 0.1); color: var(--safe); }
|
||||||
|
.risk-guarded { background: rgba(230, 126, 34, 0.1); color: var(--guarded); }
|
||||||
|
.risk-dangerous { background: rgba(192, 57, 43, 0.1); color: var(--dangerous); }
|
||||||
|
|
||||||
|
</style>
|
||||||
|
</head>
|
||||||
|
<body>
|
||||||
|
<aside class="sidebar">
|
||||||
|
<div class="sidebar-header">HOMELAB OPERATOR</div>
|
||||||
|
<ul class="nav-list">
|
||||||
|
<li class="nav-item active" onclick="showView('dashboard', this)">
|
||||||
|
<span>Dashboard</span>
|
||||||
|
</li>
|
||||||
|
<li class="nav-item" onclick="showView('actions', this)">
|
||||||
|
<span>Action Queue</span>
|
||||||
|
</li>
|
||||||
|
<li class="nav-item" onclick="showView('nodes', this)">
|
||||||
|
<span>Nodes</span>
|
||||||
|
</li>
|
||||||
|
<li class="nav-item" onclick="showView('services', this)">
|
||||||
|
<span>Services</span>
|
||||||
|
</li>
|
||||||
|
<li class="nav-item" onclick="showView('deployments', this)">
|
||||||
|
<span>Deployments</span>
|
||||||
|
</li>
|
||||||
|
<li class="nav-item" onclick="showView('topology', this)">
|
||||||
|
<span>Topology</span>
|
||||||
|
</li>
|
||||||
|
<li class="nav-item" onclick="showView('events', this)">
|
||||||
|
<span>Events</span>
|
||||||
|
</li>
|
||||||
|
<li class="nav-item" onclick="showView('correlation', this)">
|
||||||
|
<span>Correlation</span>
|
||||||
|
</li>
|
||||||
|
<li class="nav-item" onclick="showView('recommendations', this)">
|
||||||
|
<span>Recommendations</span>
|
||||||
|
</li>
|
||||||
|
<li class="nav-item" onclick="showView('settings', this)">
|
||||||
|
<span>Settings</span>
|
||||||
|
</li>
|
||||||
|
</ul>
|
||||||
|
<div class="sidebar-footer">
|
||||||
|
<div id="summary-status">System Status: Loading...</div>
|
||||||
|
</div>
|
||||||
|
</aside>
|
||||||
|
|
||||||
|
<main class="main-content">
|
||||||
|
<div id="stale-banner" class="hidden" style="background:var(--error); color:white; padding:8px 24px; font-weight:bold; font-size:12px; text-align:center; letter-spacing:0.05em">
|
||||||
|
RUNTIME STATE IS STALE
|
||||||
|
</div>
|
||||||
|
<header>
|
||||||
|
<div style="display:flex; align-items:center; gap:20px">
|
||||||
|
<div class="view-title" id="current-view-title">Dashboard</div>
|
||||||
|
<select id="operator-mode" onchange="setOperatorMode(this.value)" style="background:var(--sidebar-color); border:1px solid var(--border-color); color:var(--accent-color); font-weight:bold; font-size:12px; padding:4px 8px">
|
||||||
|
<option value="observe">OBSERVE</option>
|
||||||
|
<option value="recommend">RECOMMEND</option>
|
||||||
|
<option value="approval" selected>APPROVAL</option>
|
||||||
|
<option value="autonomous">AUTONOMOUS</option>
|
||||||
|
<option value="maintenance">MAINTENANCE</option>
|
||||||
|
</select>
|
||||||
|
</div>
|
||||||
|
<div class="header-actions">
|
||||||
|
<button onclick="refreshData()">Refresh</button>
|
||||||
|
</div>
|
||||||
|
</header>
|
||||||
|
|
||||||
|
<div class="content-scroll">
|
||||||
|
<!-- Dashboard View -->
|
||||||
|
<div id="view-dashboard" class="view">
|
||||||
|
<div class="grid">
|
||||||
|
<div class="card">
|
||||||
|
<div class="card-title">System Overview</div>
|
||||||
|
<div id="dashboard-summary" style="margin-top:20px"></div>
|
||||||
|
</div>
|
||||||
|
<div class="card">
|
||||||
|
<div class="card-title">Pending Actions</div>
|
||||||
|
<div id="dashboard-actions-summary" style="margin-top:20px"></div>
|
||||||
|
</div>
|
||||||
|
<div class="card">
|
||||||
|
<div class="card-title">Active Incidents</div>
|
||||||
|
<div id="dashboard-incidents" style="margin-top:20px"></div>
|
||||||
|
</div>
|
||||||
|
</div>
|
||||||
|
</div>
|
||||||
|
|
||||||
|
<!-- Actions View -->
|
||||||
|
<div id="view-actions" class="view hidden">
|
||||||
|
<div style="display:grid; grid-template-columns: 1fr 1fr; gap:24px">
|
||||||
|
<div>
|
||||||
|
<h3>Pending Approval</h3>
|
||||||
|
<div id="actions-pending" class="timeline"></div>
|
||||||
|
</div>
|
||||||
|
<div>
|
||||||
|
<h3>Active / History</h3>
|
||||||
|
<div id="actions-history" class="timeline"></div>
|
||||||
|
</div>
|
||||||
|
</div>
|
||||||
|
</div>
|
||||||
|
|
||||||
|
<!-- Nodes View -->
|
||||||
|
<div id="view-nodes" class="view hidden">
|
||||||
|
<div class="grid" id="nodes-list"></div>
|
||||||
|
</div>
|
||||||
|
|
||||||
|
<!-- Services View -->
|
||||||
|
<div id="view-services" class="view hidden">
|
||||||
|
<div class="grid" id="services-list"></div>
|
||||||
|
</div>
|
||||||
|
|
||||||
|
<!-- Deployments View -->
|
||||||
|
<div id="view-deployments" class="view hidden">
|
||||||
|
<div class="grid" id="deployments-list"></div>
|
||||||
|
</div>
|
||||||
|
|
||||||
|
<!-- Topology View -->
|
||||||
|
<div id="view-topology" class="view hidden">
|
||||||
|
<div class="card" style="min-height:500px">
|
||||||
|
<div class="card-title">Runtime Topology</div>
|
||||||
|
<div id="topology-map" style="margin-top:20px; display:flex; flex-wrap:wrap; gap:40px; justify-content:center"></div>
|
||||||
|
</div>
|
||||||
|
</div>
|
||||||
|
|
||||||
|
<!-- Events View -->
|
||||||
|
<div id="view-events" class="view hidden">
|
||||||
|
<div class="timeline" id="events-timeline"></div>
|
||||||
|
</div>
|
||||||
|
|
||||||
|
<!-- Correlation View -->
|
||||||
|
<div id="view-correlation" class="view hidden">
|
||||||
|
<div id="correlation-chains" class="grid"></div>
|
||||||
|
</div>
|
||||||
|
|
||||||
|
<!-- Recommendations View -->
|
||||||
|
<div id="view-recommendations" class="view hidden">
|
||||||
|
<div class="grid" id="recommendations-list"></div>
|
||||||
|
</div>
|
||||||
|
|
||||||
|
<!-- Settings View -->
|
||||||
|
<div id="view-settings" class="view hidden">
|
||||||
|
<div class="card">
|
||||||
|
<div class="card-title">Configuration</div>
|
||||||
|
<div id="settings-content" style="margin-top:20px"></div>
|
||||||
|
</div>
|
||||||
|
</div>
|
||||||
|
</div>
|
||||||
|
</main>
|
||||||
|
|
||||||
|
<script>
|
||||||
|
let currentView = 'dashboard';
|
||||||
|
const pollInterval = 5000;
|
||||||
|
|
||||||
|
function showView(viewId, el) {
|
||||||
|
document.querySelectorAll('.view').forEach(v => v.classList.add('hidden'));
|
||||||
|
document.getElementById('view-' + viewId).classList.remove('hidden');
|
||||||
|
document.querySelectorAll('.nav-item').forEach(i => i.classList.remove('active'));
|
||||||
|
if (el) el.classList.add('active');
|
||||||
|
currentView = viewId;
|
||||||
|
document.getElementById('current-view-title').textContent = viewId.charAt(0).toUpperCase() + viewId.slice(1);
|
||||||
|
refreshData();
|
||||||
|
}
|
||||||
|
|
||||||
|
async function fetchData(endpoint) {
|
||||||
|
try {
|
||||||
|
const res = await fetch(endpoint, {cache: 'no-store'});
|
||||||
|
return await res.json();
|
||||||
|
} catch (e) {
|
||||||
|
console.error('Fetch error:', endpoint, e);
|
||||||
|
return null;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
async function postData(endpoint, data) {
|
||||||
|
try {
|
||||||
|
const res = await fetch(endpoint, {
|
||||||
|
method: 'POST',
|
||||||
|
headers: {'Content-Type': 'application/json'},
|
||||||
|
body: JSON.stringify(data)
|
||||||
|
});
|
||||||
|
return await res.json();
|
||||||
|
} catch (e) {
|
||||||
|
console.error('Post error:', endpoint, e);
|
||||||
|
return null;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
async function mutateAction(id, status) {
|
||||||
|
const res = await postData('/action/mutate', {id, status});
|
||||||
|
if (res && res.status === 'ok') {
|
||||||
|
refreshData();
|
||||||
|
} else {
|
||||||
|
alert('Mutation failed');
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
async function setOperatorMode(mode) {
|
||||||
|
console.log('Operator mode set to:', mode);
|
||||||
|
const res = await postData('/mode', {mode});
|
||||||
|
if (res && res.status === 'ok') {
|
||||||
|
console.log('Mode updated successfully');
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
function formatTime(ts) {
|
||||||
|
if (!ts) return 'N/A';
|
||||||
|
return new Date(ts * 1000).toLocaleString();
|
||||||
|
}
|
||||||
|
|
||||||
|
function getStatusClass(status) {
|
||||||
|
status = (status || '').toLowerCase();
|
||||||
|
if (['nominal', 'healthy', 'ok', 'up'].includes(status)) return 'status-nominal';
|
||||||
|
if (['degraded', 'warning'].includes(status)) return 'status-degraded';
|
||||||
|
if (['unstable'].includes(status)) return 'status-unstable';
|
||||||
|
if (['reconciling'].includes(status)) return 'status-reconciling';
|
||||||
|
if (['error', 'down', 'failed'].includes(status)) return 'status-error';
|
||||||
|
return '';
|
||||||
|
}
|
||||||
|
|
||||||
|
async function refreshData() {
|
||||||
|
// Refresh summary always
|
||||||
|
const summary = await fetchData('/summary');
|
||||||
|
if (summary) {
|
||||||
|
const statusEl = document.getElementById('summary-status');
|
||||||
|
statusEl.textContent = `System Status: ${summary.status.toUpperCase()}`;
|
||||||
|
statusEl.className = 'sidebar-footer ' + getStatusClass(summary.status);
|
||||||
|
|
||||||
|
// Handle stale state
|
||||||
|
const staleBanner = document.getElementById('stale-banner');
|
||||||
|
if (summary.stale) {
|
||||||
|
staleBanner.classList.remove('hidden');
|
||||||
|
staleBanner.textContent = `CRITICAL: Runtime state is STALE (Last update: ${formatTime(summary.last_update)})`;
|
||||||
|
} else {
|
||||||
|
staleBanner.classList.add('hidden');
|
||||||
|
}
|
||||||
|
|
||||||
|
if (currentView === 'dashboard') {
|
||||||
|
const dashSummary = document.getElementById('dashboard-summary');
|
||||||
|
dashSummary.innerHTML = `
|
||||||
|
<div class="label">Nodes</div><div class="value">${summary.node_count}</div>
|
||||||
|
<div class="label">Services</div><div class="value">${summary.service_count}</div>
|
||||||
|
<div class="label">Last Update</div><div class="value">${formatTime(summary.last_update)}</div>
|
||||||
|
`;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
if (currentView === 'dashboard' || currentView === 'actions') {
|
||||||
|
const actions = await fetchData('/actions');
|
||||||
|
if (actions) {
|
||||||
|
if (currentView === 'dashboard') {
|
||||||
|
const dashActions = document.getElementById('dashboard-actions-summary');
|
||||||
|
const pendingCount = actions.pending.length;
|
||||||
|
dashActions.innerHTML = `
|
||||||
|
<div class="label">Pending</div><div class="value" style="color:var(--guarded)">${pendingCount}</div>
|
||||||
|
<div class="label">Running</div><div class="value" style="color:var(--reconciling)">${actions.running.length}</div>
|
||||||
|
`;
|
||||||
|
}
|
||||||
|
if (currentView === 'actions') {
|
||||||
|
const pendingEl = document.getElementById('actions-pending');
|
||||||
|
const historyEl = document.getElementById('actions-history');
|
||||||
|
|
||||||
|
pendingEl.innerHTML = actions.pending.map(a => `
|
||||||
|
<div class="card" style="margin-bottom:12px">
|
||||||
|
<div class="card-header">
|
||||||
|
<div class="card-title">${(a.action_type || a.type || 'unknown').toUpperCase()}</div>
|
||||||
|
<span class="badge risk-${a.risk_level}">${a.risk_level}</span>
|
||||||
|
</div>
|
||||||
|
<p>${a.description || a.action_type || 'No description'}</p>
|
||||||
|
<div class="label">Target</div><div class="value">${a.node || (a.target && a.target.node) || 'unknown'} ${(a.service || (a.target && a.target.service)) || ''}</div>
|
||||||
|
<div class="label">Confidence</div><div class="value">${Math.round((a.confidence || 0)*100)}%</div>
|
||||||
|
<div class="controls">
|
||||||
|
<button class="btn-primary" onclick="mutateAction('${a.id}', 'approved')">Approve</button>
|
||||||
|
<button onclick="mutateAction('${a.id}', 'rejected')">Reject</button>
|
||||||
|
</div>
|
||||||
|
</div>
|
||||||
|
`).join('') || 'No pending actions.';
|
||||||
|
|
||||||
|
const history = [...actions.approved, ...actions.running, ...actions.completed, ...actions.failed, ...actions.rejected];
|
||||||
|
historyEl.innerHTML = history.sort((a,b) => (b.timestamp || b.updated_at || 0) - (a.timestamp || a.updated_at || 0)).map(a => `
|
||||||
|
<div class="event">
|
||||||
|
<div class="event-header">
|
||||||
|
<span>${(a.action_type || a.type || 'unknown').toUpperCase()}</span>
|
||||||
|
<span class="badge ${getStatusClass(a.status)}">${a.status}</span>
|
||||||
|
</div>
|
||||||
|
<div>${a.description || a.action_type || 'No description'}</div>
|
||||||
|
<small>${formatTime(a.timestamp || a.updated_at)} | Target: ${a.node || (a.target && a.target.node)}</small>
|
||||||
|
${a.status === 'approved' ? `<div class="controls"><button class="btn-primary" onclick="mutateAction('${a.id}', 'running')">Execute</button></div>` : ''}
|
||||||
|
${a.transition_history ? `
|
||||||
|
<div style="margin-top:8px; font-size:10px; color:var(--text-muted)">
|
||||||
|
<strong>Trace:</strong> ${a.transition_history.map(h => `${h.from}->${h.to}`).join(' → ')}
|
||||||
|
</div>
|
||||||
|
` : ''}
|
||||||
|
</div>
|
||||||
|
`).join('') || 'No history.';
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
if (currentView === 'dashboard' || currentView === 'events') {
|
||||||
|
const incidents = await fetchData('/incidents');
|
||||||
|
if (currentView === 'dashboard') {
|
||||||
|
const dashIncidents = document.getElementById('dashboard-incidents');
|
||||||
|
if (!incidents || incidents.length === 0) {
|
||||||
|
dashIncidents.textContent = 'No active incidents.';
|
||||||
|
} else {
|
||||||
|
dashIncidents.innerHTML = incidents.map(inc => `
|
||||||
|
<div class="event ${inc.severity}">
|
||||||
|
<strong>${inc.severity.toUpperCase()}:</strong> ${inc.message}<br>
|
||||||
|
<small>${formatTime(inc.timestamp)} | Node: ${inc.node}</small>
|
||||||
|
</div>
|
||||||
|
`).join('');
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
if (currentView === 'nodes') {
|
||||||
|
const nodes = await fetchData('/nodes');
|
||||||
|
const list = document.getElementById('nodes-list');
|
||||||
|
list.innerHTML = nodes.map(node => `
|
||||||
|
<div class="card">
|
||||||
|
<div class="card-header">
|
||||||
|
<div class="card-title">${node.hostname}</div>
|
||||||
|
<span class="badge ${getStatusClass(node.health)}">${node.health}</span>
|
||||||
|
</div>
|
||||||
|
<div class="label">ID</div><div class="value mono">${node.id}</div>
|
||||||
|
<div class="label">Capabilities</div><div class="value">${node.capabilities.join(', ')}</div>
|
||||||
|
<div class="label">Connectivity</div><div class="value">${node.connectivity}</div>
|
||||||
|
<div class="label">Incidents (24h)</div><div class="value">${node.incidents}</div>
|
||||||
|
<div class="label">Last Seen</div><div class="value">${formatTime(node.last_seen)}</div>
|
||||||
|
<div class="label">Runtime Status</div><div class="value">${node.status}</div>
|
||||||
|
</div>
|
||||||
|
`).join('');
|
||||||
|
}
|
||||||
|
|
||||||
|
if (currentView === 'services') {
|
||||||
|
const services = await fetchData('/services');
|
||||||
|
const list = document.getElementById('services-list');
|
||||||
|
list.innerHTML = services.map(svc => `
|
||||||
|
<div class="card">
|
||||||
|
<div class="card-header">
|
||||||
|
<div class="card-title">${svc.name}</div>
|
||||||
|
<span class="badge ${getStatusClass(svc.health)}">${svc.health}</span>
|
||||||
|
</div>
|
||||||
|
<div class="label">State (Desired/Actual)</div><div class="value">${svc.desired_state} / ${svc.actual_state}</div>
|
||||||
|
<div class="label">Deployment</div><div class="value">${svc.deployment_state}</div>
|
||||||
|
<div class="label">Dependencies</div><div class="value">${svc.dependencies.join(', ') || 'None'}</div>
|
||||||
|
<div class="label">Recommendations</div><div class="value">${svc.recommendations.join(', ') || 'None'}</div>
|
||||||
|
</div>
|
||||||
|
`).join('');
|
||||||
|
}
|
||||||
|
|
||||||
|
if (currentView === 'deployments') {
|
||||||
|
const deps = await fetchData('/deployments');
|
||||||
|
const list = document.getElementById('deployments-list');
|
||||||
|
list.innerHTML = deps.map(dep => `
|
||||||
|
<div class="card">
|
||||||
|
<div class="card-header">
|
||||||
|
<div class="card-title">${dep.service}</div>
|
||||||
|
<span class="badge ${dep.status === 'failed' ? 'status-error' : 'status-reconciling'}">${dep.status}</span>
|
||||||
|
</div>
|
||||||
|
<div class="label">ID</div><div class="value mono">${dep.id}</div>
|
||||||
|
<div class="label">Stage</div><div class="value">${dep.stage}</div>
|
||||||
|
<div class="label">Diagnostics</div><div class="value">${dep.diagnostics || 'No data'}</div>
|
||||||
|
<div class="label">Resumable</div><div class="value">${dep.resumable ? 'Yes' : 'No'}</div>
|
||||||
|
${dep.resumable ? '<button class="btn-primary">Resume</button>' : ''}
|
||||||
|
</div>
|
||||||
|
`).join('');
|
||||||
|
}
|
||||||
|
|
||||||
|
if (currentView === 'events') {
|
||||||
|
const events = await fetchData('/events');
|
||||||
|
const timeline = document.getElementById('events-timeline');
|
||||||
|
timeline.innerHTML = events.map(ev => `
|
||||||
|
<div class="event ${ev.severity}">
|
||||||
|
<div class="event-header">
|
||||||
|
<span>${ev.type.toUpperCase()}</span>
|
||||||
|
<span>${formatTime(ev.timestamp)}</span>
|
||||||
|
</div>
|
||||||
|
<div>${ev.message}</div>
|
||||||
|
<div class="label" style="margin-top:8px">Node: ${ev.node} ${ev.service ? '| Service: ' + ev.service : ''}</div>
|
||||||
|
</div>
|
||||||
|
`).join('');
|
||||||
|
}
|
||||||
|
|
||||||
|
if (currentView === 'recommendations') {
|
||||||
|
const recs = await fetchData('/recommendations');
|
||||||
|
const list = document.getElementById('recommendations-list');
|
||||||
|
list.innerHTML = recs.map(rec => `
|
||||||
|
<div class="card">
|
||||||
|
<div class="card-header">
|
||||||
|
<div class="card-title">${rec.title}</div>
|
||||||
|
<span class="badge risk-${rec.risk_level}">${rec.risk_level}</span>
|
||||||
|
</div>
|
||||||
|
<p>${rec.description}</p>
|
||||||
|
<div class="label">Confidence</div><div class="value">${Math.round(rec.confidence * 100)}%</div>
|
||||||
|
<div class="label">Autonomous Eligible</div><div class="value">${rec.autonomous_eligible ? 'Yes' : 'No'}</div>
|
||||||
|
<div class="label">Blocked Actions</div><div class="value">${rec.blocked_actions.join(', ') || 'None'}</div>
|
||||||
|
<div class="controls">
|
||||||
|
<button class="btn-primary" ${rec.risk_level === 'dangerous' ? 'style="background:var(--dangerous)"' : ''}>Approve Action</button>
|
||||||
|
</div>
|
||||||
|
</div>
|
||||||
|
`).join('');
|
||||||
|
}
|
||||||
|
|
||||||
|
if (currentView === 'topology') {
|
||||||
|
const nodes = await fetchData('/nodes');
|
||||||
|
const services = await fetchData('/services');
|
||||||
|
const topMap = document.getElementById('topology-map');
|
||||||
|
if (nodes && services) {
|
||||||
|
topMap.innerHTML = nodes.map(node => {
|
||||||
|
const nodeServices = services.filter(s => s.node === node.hostname || s.node === node.id);
|
||||||
|
return `
|
||||||
|
<div class="card" style="width:250px; border: 1px solid ${node.health === 'nominal' ? 'var(--border-color)' : 'var(--error)'}">
|
||||||
|
<div class="card-header">
|
||||||
|
<div class="card-title">${node.hostname}</div>
|
||||||
|
<span class="badge ${getStatusClass(node.health)}">${node.health}</span>
|
||||||
|
</div>
|
||||||
|
<div class="label">Capabilities</div>
|
||||||
|
<div class="value" style="font-size:11px">${node.capabilities.join(', ')}</div>
|
||||||
|
<div class="label">Services</div>
|
||||||
|
<div style="font-size:12px; margin-bottom:10px">
|
||||||
|
${nodeServices.length > 0 ? nodeServices.map(s => `
|
||||||
|
<div style="display:flex; justify-content:space-between; margin-bottom:4px; padding:4px; background:rgba(255,255,255,0.03)">
|
||||||
|
<span>${s.name}</span>
|
||||||
|
<span class="${getStatusClass(s.health)}" style="font-size:10px">${s.health}</span>
|
||||||
|
</div>
|
||||||
|
${s.dependencies.length > 0 ? `<div style="font-size:9px; color:var(--text-muted); margin-left:8px; margin-bottom:4px">dep: ${s.dependencies.join(', ')}</div>` : ''}
|
||||||
|
`).join('') : '<div class="value">None</div>'}
|
||||||
|
</div>
|
||||||
|
</div>
|
||||||
|
`;
|
||||||
|
}).join('');
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
if (currentView === 'correlation') {
|
||||||
|
const incidents = await fetchData('/incidents');
|
||||||
|
const actions = await fetchData('/actions');
|
||||||
|
const list = document.getElementById('correlation-chains');
|
||||||
|
if (incidents && actions) {
|
||||||
|
const allActions = Object.values(actions).flat();
|
||||||
|
list.innerHTML = incidents.map(inc => {
|
||||||
|
const related = allActions.filter(a => a.correlation_chain && a.correlation_chain.includes(inc.id));
|
||||||
|
return `
|
||||||
|
<div class="card">
|
||||||
|
<div class="card-header">
|
||||||
|
<div class="card-title">Incident: ${inc.id || 'INC-001'}</div>
|
||||||
|
<span class="badge status-error">Active</span>
|
||||||
|
</div>
|
||||||
|
<p>${inc.message}</p>
|
||||||
|
<div class="label">Related Actions</div>
|
||||||
|
${related.map(a => `
|
||||||
|
<div class="event" style="margin-top:5px">
|
||||||
|
<strong>${a.type}</strong> (${a.status})<br>
|
||||||
|
<small>${a.description}</small>
|
||||||
|
</div>
|
||||||
|
`).join('') || '<div class="value">No actions yet</div>'}
|
||||||
|
</div>
|
||||||
|
`;
|
||||||
|
}).join('');
|
||||||
|
}
|
||||||
|
}
|
||||||
|
if (currentView === 'settings') {
|
||||||
|
const config = await fetchData('/config');
|
||||||
|
const content = document.getElementById('settings-content');
|
||||||
|
content.innerHTML = `
|
||||||
|
<div class="label">Auto Mode</div>
|
||||||
|
<div class="value">${config.auto_mode ? 'Enabled' : 'Disabled'}</div>
|
||||||
|
<div class="label">Action Thresholds</div>
|
||||||
|
<div class="value mono">${JSON.stringify(config.action_thresholds, null, 2)}</div>
|
||||||
|
<div class="label">Telegram Integration</div>
|
||||||
|
<div class="value" style="color:var(--text-muted)">Ready for mobile approval flows. Hook: /api/v1/telegram/webhook</div>
|
||||||
|
<button onclick="alert('Settings update not implemented in this demo')">Edit Configuration</button>
|
||||||
|
`;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// Initial load
|
||||||
|
refreshData();
|
||||||
|
// Poll for updates
|
||||||
|
setInterval(refreshData, pollInterval);
|
||||||
|
|
||||||
|
</script>
|
||||||
|
</body>
|
||||||
|
</html>
|
||||||
283
services/control-plane/src/operator_ui.py
Normal file
283
services/control-plane/src/operator_ui.py
Normal file
|
|
@ -0,0 +1,283 @@
|
||||||
|
import json
|
||||||
|
import os
|
||||||
|
import time
|
||||||
|
from datetime import datetime
|
||||||
|
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
|
||||||
|
STATE_DIR = Path(os.getenv("HOMELAB_STATE_ROOT", "/opt/homelab/state"))
|
||||||
|
EVENTS_DIR = Path(os.getenv("HOMELAB_EVENTS_ROOT", "/opt/homelab/events"))
|
||||||
|
WORLD_DIR = Path(os.getenv("HOMELAB_WORLD_ROOT", "/opt/homelab/world"))
|
||||||
|
ACTIONS_DIR = Path(os.getenv("HOMELAB_ACTIONS_ROOT", "/opt/homelab/actions"))
|
||||||
|
CONFIG_DIR = Path(os.getenv("HOMELAB_CONFIG_ROOT", "/opt/homelab/config"))
|
||||||
|
|
||||||
|
STATIC_DIR = Path(__file__).parent
|
||||||
|
|
||||||
|
DEFAULT_CONFIG = {
|
||||||
|
"operator_mode": "approval",
|
||||||
|
"auto_mode": True,
|
||||||
|
"action_thresholds": {
|
||||||
|
"restart_ha": 0.8,
|
||||||
|
"check_network": 0.9,
|
||||||
|
},
|
||||||
|
"default_threshold": 0.9,
|
||||||
|
"allowed_auto_actions": ["restart_ha"],
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
def read_json_file(path, default=None):
|
||||||
|
if not path.exists():
|
||||||
|
return default if default is not None else []
|
||||||
|
try:
|
||||||
|
return json.loads(path.read_text())
|
||||||
|
except Exception:
|
||||||
|
return default if default is not None else []
|
||||||
|
|
||||||
|
|
||||||
|
def get_config():
|
||||||
|
config_path = STATE_DIR / "operator-config.json"
|
||||||
|
if config_path.exists():
|
||||||
|
return read_json_file(config_path, DEFAULT_CONFIG)
|
||||||
|
return DEFAULT_CONFIG
|
||||||
|
|
||||||
|
|
||||||
|
def save_config(config):
|
||||||
|
STATE_DIR.mkdir(parents=True, exist_ok=True)
|
||||||
|
(STATE_DIR / "operator-config.json").write_text(json.dumps(config, indent=2))
|
||||||
|
|
||||||
|
|
||||||
|
def current_nodes():
|
||||||
|
return read_json_file(WORLD_DIR / "nodes.json")
|
||||||
|
|
||||||
|
|
||||||
|
def current_services():
|
||||||
|
return read_json_file(WORLD_DIR / "services.json")
|
||||||
|
|
||||||
|
|
||||||
|
def current_deployments():
|
||||||
|
return read_json_file(WORLD_DIR / "deployments.json")
|
||||||
|
|
||||||
|
|
||||||
|
def current_incidents():
|
||||||
|
return read_json_file(WORLD_DIR / "incidents.json")
|
||||||
|
|
||||||
|
|
||||||
|
def current_recommendations():
|
||||||
|
return read_json_file(WORLD_DIR / "recommendations.json")
|
||||||
|
|
||||||
|
|
||||||
|
def current_summary():
|
||||||
|
summary = read_json_file(WORLD_DIR / "runtime-summary.json", default={})
|
||||||
|
if summary:
|
||||||
|
# Check for staleness from the summary's own timestamp if available
|
||||||
|
# otherwise use file mtime
|
||||||
|
last_update_str = summary.get("last_update")
|
||||||
|
if last_update_str:
|
||||||
|
try:
|
||||||
|
# Assuming ISO format from observer.py
|
||||||
|
last_update = datetime.fromisoformat(last_update_str.replace('Z', '+00:00')).timestamp()
|
||||||
|
except Exception:
|
||||||
|
last_update = os.path.getmtime(WORLD_DIR / "runtime-summary.json")
|
||||||
|
else:
|
||||||
|
last_update = os.path.getmtime(WORLD_DIR / "runtime-summary.json")
|
||||||
|
|
||||||
|
summary["last_update"] = last_update
|
||||||
|
summary["stale"] = (time.time() - last_update) > 60 # Stale if older than 60s
|
||||||
|
return summary
|
||||||
|
|
||||||
|
|
||||||
|
def current_events():
|
||||||
|
events = []
|
||||||
|
if EVENTS_DIR.exists():
|
||||||
|
for f in EVENTS_DIR.glob("**/*.json"):
|
||||||
|
data = read_json_file(f)
|
||||||
|
if data:
|
||||||
|
# Add source file for traceability
|
||||||
|
data["_source"] = f.name
|
||||||
|
events.append(data)
|
||||||
|
return sorted(events, key=lambda x: x.get("timestamp", ""), reverse=True)
|
||||||
|
|
||||||
|
|
||||||
|
def current_actions():
|
||||||
|
actions = {}
|
||||||
|
statuses = ["pending", "approved", "running", "completed", "failed", "rejected"]
|
||||||
|
for status in statuses:
|
||||||
|
actions[status] = []
|
||||||
|
status_dir = ACTIONS_DIR / status
|
||||||
|
if status_dir.exists():
|
||||||
|
for f in status_dir.glob("*.json"):
|
||||||
|
data = read_json_file(f)
|
||||||
|
if data:
|
||||||
|
# Injects some metadata for UI
|
||||||
|
data["id"] = data.get("action_id") or f.stem
|
||||||
|
data["status"] = status
|
||||||
|
actions[status].append(data)
|
||||||
|
return actions
|
||||||
|
|
||||||
|
|
||||||
|
def mutate_action(action_id, target_status):
|
||||||
|
statuses = ["pending", "approved", "running", "completed", "failed", "rejected"]
|
||||||
|
if target_status not in statuses:
|
||||||
|
return False, f"Invalid target status: {target_status}"
|
||||||
|
|
||||||
|
# Find where the action is
|
||||||
|
source_path = None
|
||||||
|
current_status = None
|
||||||
|
for status in statuses:
|
||||||
|
p = ACTIONS_DIR / status / f"{action_id}.json"
|
||||||
|
if p.exists():
|
||||||
|
source_path = p
|
||||||
|
current_status = status
|
||||||
|
break
|
||||||
|
|
||||||
|
if not source_path:
|
||||||
|
return False, f"Action {action_id} not found"
|
||||||
|
|
||||||
|
target_dir = ACTIONS_DIR / target_status
|
||||||
|
target_dir.mkdir(parents=True, exist_ok=True)
|
||||||
|
target_path = target_dir / f"{action_id}.json"
|
||||||
|
|
||||||
|
try:
|
||||||
|
data = json.loads(source_path.read_text())
|
||||||
|
data["status"] = target_status
|
||||||
|
data["updated_at"] = time.time()
|
||||||
|
|
||||||
|
# Keep history of transitions
|
||||||
|
history = data.get("transition_history", [])
|
||||||
|
history.append({
|
||||||
|
"from": current_status,
|
||||||
|
"to": target_status,
|
||||||
|
"timestamp": time.time()
|
||||||
|
})
|
||||||
|
data["transition_history"] = history
|
||||||
|
|
||||||
|
target_path.write_text(json.dumps(data, indent=2))
|
||||||
|
if source_path != target_path:
|
||||||
|
source_path.unlink()
|
||||||
|
return True, "Success"
|
||||||
|
except Exception as e:
|
||||||
|
return False, str(e)
|
||||||
|
|
||||||
|
|
||||||
|
def send_json(status, payload, handler):
|
||||||
|
body = (json.dumps(payload) + "\n").encode("utf-8")
|
||||||
|
handler.send_response(status)
|
||||||
|
handler.send_header("Content-Type", "application/json")
|
||||||
|
handler.send_header("Content-Length", str(len(body)))
|
||||||
|
handler.end_headers()
|
||||||
|
handler.wfile.write(body)
|
||||||
|
|
||||||
|
|
||||||
|
class Handler(BaseHTTPRequestHandler):
|
||||||
|
def do_GET(self):
|
||||||
|
if self.path == "/config":
|
||||||
|
send_json(200, get_config(), self)
|
||||||
|
return
|
||||||
|
|
||||||
|
if self.path == "/nodes":
|
||||||
|
send_json(200, current_nodes(), self)
|
||||||
|
return
|
||||||
|
|
||||||
|
if self.path == "/services":
|
||||||
|
send_json(200, current_services(), self)
|
||||||
|
return
|
||||||
|
|
||||||
|
if self.path == "/deployments":
|
||||||
|
send_json(200, current_deployments(), self)
|
||||||
|
return
|
||||||
|
|
||||||
|
if self.path == "/incidents":
|
||||||
|
send_json(200, current_incidents(), self)
|
||||||
|
return
|
||||||
|
|
||||||
|
if self.path == "/recommendations":
|
||||||
|
send_json(200, current_recommendations(), self)
|
||||||
|
return
|
||||||
|
|
||||||
|
if self.path == "/summary":
|
||||||
|
send_json(200, current_summary(), self)
|
||||||
|
return
|
||||||
|
|
||||||
|
if self.path == "/events":
|
||||||
|
send_json(200, current_events(), self)
|
||||||
|
return
|
||||||
|
|
||||||
|
if self.path == "/actions":
|
||||||
|
send_json(200, current_actions(), self)
|
||||||
|
return
|
||||||
|
|
||||||
|
if self.path in ("/", "/index.html"):
|
||||||
|
body = (STATIC_DIR / "index.html").read_bytes()
|
||||||
|
self.send_response(200)
|
||||||
|
self.send_header("Content-Type", "text/html; charset=utf-8")
|
||||||
|
self.send_header("Content-Length", str(len(body)))
|
||||||
|
self.end_headers()
|
||||||
|
self.wfile.write(body)
|
||||||
|
return
|
||||||
|
|
||||||
|
self.send_error(404)
|
||||||
|
|
||||||
|
def do_POST(self):
|
||||||
|
if self.path not in (
|
||||||
|
"/config",
|
||||||
|
"/action/mutate",
|
||||||
|
"/mode",
|
||||||
|
):
|
||||||
|
self.send_error(404)
|
||||||
|
return
|
||||||
|
|
||||||
|
length = int(self.headers.get("Content-Length", "0"))
|
||||||
|
raw_body = self.rfile.read(length).decode("utf-8")
|
||||||
|
try:
|
||||||
|
payload = json.loads(raw_body)
|
||||||
|
except json.JSONDecodeError:
|
||||||
|
self.send_error(400, "Invalid JSON")
|
||||||
|
return
|
||||||
|
|
||||||
|
if self.path == "/config":
|
||||||
|
config = get_config()
|
||||||
|
config.update(payload)
|
||||||
|
save_config(config)
|
||||||
|
send_json(200, {"status": "ok"}, self)
|
||||||
|
return
|
||||||
|
|
||||||
|
if self.path == "/mode":
|
||||||
|
mode = payload.get("mode")
|
||||||
|
if not mode:
|
||||||
|
self.send_error(400, "mode is required")
|
||||||
|
return
|
||||||
|
config = get_config()
|
||||||
|
config["operator_mode"] = mode
|
||||||
|
save_config(config)
|
||||||
|
send_json(200, {"status": "ok"}, self)
|
||||||
|
return
|
||||||
|
|
||||||
|
if self.path == "/action/mutate":
|
||||||
|
action_id = payload.get("id")
|
||||||
|
target = payload.get("status")
|
||||||
|
if not action_id or not target:
|
||||||
|
self.send_error(400, "id and status are required")
|
||||||
|
return
|
||||||
|
success, msg = mutate_action(action_id, target)
|
||||||
|
if success:
|
||||||
|
send_json(200, {"status": "ok"}, self)
|
||||||
|
else:
|
||||||
|
self.send_error(500, msg)
|
||||||
|
return
|
||||||
|
|
||||||
|
def log_message(self, format, *args):
|
||||||
|
return
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
# Ensure directories exist
|
||||||
|
for d in [STATE_DIR, EVENTS_DIR, WORLD_DIR, ACTIONS_DIR, CONFIG_DIR]:
|
||||||
|
d.mkdir(parents=True, exist_ok=True)
|
||||||
|
for s in ["pending", "approved", "running", "completed", "failed", "rejected"]:
|
||||||
|
(ACTIONS_DIR / s).mkdir(parents=True, exist_ok=True)
|
||||||
|
|
||||||
|
port = int(os.getenv("PORT", "8080"))
|
||||||
|
print(f"Operator Control Plane starting on 0.0.0.0:{port}")
|
||||||
|
server = ThreadingHTTPServer(("0.0.0.0", port), Handler)
|
||||||
|
server.serve_forever()
|
||||||
143
services/control-plane/src/supervisor.py
Normal file
143
services/control-plane/src/supervisor.py
Normal file
|
|
@ -0,0 +1,143 @@
|
||||||
|
import os
|
||||||
|
import json
|
||||||
|
import time
|
||||||
|
import logging
|
||||||
|
import yaml
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
# Constants and Paths
|
||||||
|
RUNTIME_PATH = os.getenv("RUNTIME_PATH", "/opt/homelab")
|
||||||
|
WORLD_DIR = Path(RUNTIME_PATH) / "world"
|
||||||
|
ACTIONS_DIR = Path(RUNTIME_PATH) / "actions"
|
||||||
|
REPO_ROOT = Path(os.getenv("REPO_ROOT", "/repo"))
|
||||||
|
|
||||||
|
# Logging setup
|
||||||
|
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
|
||||||
|
logger = logging.getLogger("supervisor")
|
||||||
|
|
||||||
|
class Supervisor:
|
||||||
|
def __init__(self):
|
||||||
|
self.desired_state = {"services": {}}
|
||||||
|
self.actual_state = {"services": {}, "nodes": {}, "incidents": {}}
|
||||||
|
self._ensure_dirs()
|
||||||
|
|
||||||
|
def _ensure_dirs(self):
|
||||||
|
ACTIONS_DIR.mkdir(parents=True, exist_ok=True)
|
||||||
|
(ACTIONS_DIR / "pending").mkdir(parents=True, exist_ok=True)
|
||||||
|
|
||||||
|
def _load_desired_state(self):
|
||||||
|
services = {}
|
||||||
|
hosts_dir = REPO_ROOT / "hosts"
|
||||||
|
if not hosts_dir.exists():
|
||||||
|
logger.warning(f"Hosts directory {hosts_dir} does not exist")
|
||||||
|
return
|
||||||
|
|
||||||
|
for host_dir in hosts_dir.iterdir():
|
||||||
|
if host_dir.is_dir():
|
||||||
|
svc_file = host_dir / "services.yaml"
|
||||||
|
if svc_file.exists():
|
||||||
|
try:
|
||||||
|
with open(svc_file, "r") as f:
|
||||||
|
data = yaml.safe_load(f)
|
||||||
|
host_name = data.get("host")
|
||||||
|
for svc_name, svc_info in data.get("services", {}).items():
|
||||||
|
svc_key = f"{host_name}/{svc_name}"
|
||||||
|
services[svc_key] = {
|
||||||
|
"node": host_name,
|
||||||
|
"service": svc_name,
|
||||||
|
"desired": "running"
|
||||||
|
}
|
||||||
|
except Exception as e:
|
||||||
|
logger.error(f"Failed to load {svc_file}: {e}")
|
||||||
|
self.desired_state["services"] = services
|
||||||
|
|
||||||
|
def _load_actual_state(self):
|
||||||
|
files = {
|
||||||
|
"services": WORLD_DIR / "services.json",
|
||||||
|
"nodes": WORLD_DIR / "nodes.json",
|
||||||
|
"incidents": WORLD_DIR / "incidents.json"
|
||||||
|
}
|
||||||
|
for key, path in files.items():
|
||||||
|
if path.exists():
|
||||||
|
try:
|
||||||
|
with open(path, "r") as f:
|
||||||
|
self.actual_state[key] = json.load(f)
|
||||||
|
except Exception as e:
|
||||||
|
logger.error(f"Failed to load {key} actual state: {e}")
|
||||||
|
|
||||||
|
def reconcile(self):
|
||||||
|
# Update heartbeat
|
||||||
|
heartbeat_file = WORLD_DIR.parent / "state" / "supervisor.heartbeat"
|
||||||
|
try:
|
||||||
|
heartbeat_file.touch()
|
||||||
|
except Exception as e:
|
||||||
|
logger.error(f"Failed to touch heartbeat file: {e}")
|
||||||
|
|
||||||
|
self._load_desired_state()
|
||||||
|
self._load_actual_state()
|
||||||
|
|
||||||
|
drifts = []
|
||||||
|
|
||||||
|
# 1. Check for missing or unhealthy services
|
||||||
|
for svc_key, desired_info in self.desired_state["services"].items():
|
||||||
|
actual_info = self.actual_state["services"].get(svc_key)
|
||||||
|
|
||||||
|
if not actual_info:
|
||||||
|
drifts.append({
|
||||||
|
"type": "missing_service",
|
||||||
|
"svc_key": svc_key,
|
||||||
|
"node": desired_info["node"],
|
||||||
|
"service": desired_info["service"]
|
||||||
|
})
|
||||||
|
elif actual_info.get("status") != "healthy":
|
||||||
|
drifts.append({
|
||||||
|
"type": "unhealthy_service",
|
||||||
|
"svc_key": svc_key,
|
||||||
|
"node": desired_info["node"],
|
||||||
|
"service": desired_info["service"],
|
||||||
|
"status": actual_info.get("status")
|
||||||
|
})
|
||||||
|
|
||||||
|
# 2. Generate recommendations
|
||||||
|
for drift in drifts:
|
||||||
|
self._generate_recommendation(drift)
|
||||||
|
|
||||||
|
def _generate_recommendation(self, drift):
|
||||||
|
action_id = f"reconcile-{int(time.time())}-{drift['node']}-{drift['service']}"
|
||||||
|
action_path = ACTIONS_DIR / "pending" / f"{action_id}.json"
|
||||||
|
|
||||||
|
if action_path.exists():
|
||||||
|
return # Already recommended
|
||||||
|
|
||||||
|
action = {
|
||||||
|
"action_id": action_id,
|
||||||
|
"timestamp": time.time(),
|
||||||
|
"type": "redeploy",
|
||||||
|
"node": drift["node"],
|
||||||
|
"service": drift["service"],
|
||||||
|
"risk_level": "guarded",
|
||||||
|
"confidence": 0.9,
|
||||||
|
"description": f"Redeploy {drift['service']} on {drift['node']} due to {drift['type']}",
|
||||||
|
"status": "pending",
|
||||||
|
"payload": {
|
||||||
|
"reason": drift["type"],
|
||||||
|
"svc_key": drift["svc_key"]
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
try:
|
||||||
|
with open(action_path, "w") as f:
|
||||||
|
json.dump(action, f, indent=2)
|
||||||
|
logger.info(f"Generated recommendation: {action_id}")
|
||||||
|
except Exception as e:
|
||||||
|
logger.error(f"Failed to save recommendation {action_id}: {e}")
|
||||||
|
|
||||||
|
def loop(self, interval=30):
|
||||||
|
logger.info("Starting supervisor loop")
|
||||||
|
while True:
|
||||||
|
self.reconcile()
|
||||||
|
time.sleep(interval)
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
supervisor = Supervisor()
|
||||||
|
supervisor.loop()
|
||||||
38
services/stability-agent/deploy-local.sh
Executable file
38
services/stability-agent/deploy-local.sh
Executable file
|
|
@ -0,0 +1,38 @@
|
||||||
|
#!/usr/bin/env bash
|
||||||
|
# deploy-local.sh - Local deployment script for stability-agent
|
||||||
|
# This script is intended to be run on the target node.
|
||||||
|
|
||||||
|
set -e
|
||||||
|
|
||||||
|
# Default values
|
||||||
|
NODE_NAME=${NODE_NAME:-$(hostname)}
|
||||||
|
REDIS_HOST=${REDIS_HOST:-100.108.208.3}
|
||||||
|
REDIS_PORT=${REDIS_PORT:-6379}
|
||||||
|
REDIS_ENABLED=${REDIS_ENABLED:-true}
|
||||||
|
|
||||||
|
echo "--- Deploying stability-agent on $NODE_NAME ---"
|
||||||
|
|
||||||
|
# Check for docker-compose or docker compose
|
||||||
|
if docker compose version >/dev/null 2>&1; then
|
||||||
|
DOCKER_COMPOSE="docker compose"
|
||||||
|
else
|
||||||
|
DOCKER_COMPOSE="docker-compose"
|
||||||
|
fi
|
||||||
|
|
||||||
|
# Use host-specific override if it exists
|
||||||
|
OVERRIDE_FILE="../../hosts/$NODE_NAME/runtime/stability-agent/docker-compose.override.yml"
|
||||||
|
COMPOSE_ARGS="-f docker-compose.yml"
|
||||||
|
|
||||||
|
if [ -f "$OVERRIDE_FILE" ]; then
|
||||||
|
echo "Using override file: $OVERRIDE_FILE"
|
||||||
|
COMPOSE_ARGS="$COMPOSE_ARGS -f $OVERRIDE_FILE"
|
||||||
|
fi
|
||||||
|
|
||||||
|
# Run deployment
|
||||||
|
NODE_NAME=$NODE_NAME \
|
||||||
|
REDIS_HOST=$REDIS_HOST \
|
||||||
|
REDIS_PORT=$REDIS_PORT \
|
||||||
|
REDIS_ENABLED=$REDIS_ENABLED \
|
||||||
|
$DOCKER_COMPOSE $COMPOSE_ARGS up -d --build --force-recreate
|
||||||
|
|
||||||
|
echo "Deployment finished."
|
||||||
Loading…
Reference in a new issue