Compare commits

...

2 commits

9 changed files with 1454 additions and 0 deletions

78
docs/vps-control-plane.md Normal file
View file

@ -0,0 +1,78 @@
# VPS Control Plane
The VPS Control Plane is the orchestration brain of the homelab platform. It runs on the Hetzner VPS and provides observability, automated reconciliation, and a web-based operator interface.
## Architecture
The control plane consists of four core services running as a Docker Compose stack:
1. **Observer**: Synthesizes world state from events.
2. **Supervisor**: Detects drifts between desired and actual state.
3. **Executor**: Executes approved actions from the queue.
4. **Operator UI**: Web interface for system monitoring and action approval.
All services adhere to **filesystem-first** semantics, using `/opt/homelab/` as the primary data exchange and persistence layer.
## Deployment Flow
### 1. Prerequisites
- Target VPS node must be onboarded (Tailscale active, Docker installed).
- Repository cloned to `/home/oskar/homelab-codex-ws`.
### 2. Bootstrap
Run the bootstrap script to initialize the runtime filesystem and start the stack:
```bash
./scripts/bootstrap/vps-control-plane.sh
```
### 3. Verification
Verify the stack is healthy:
```bash
cd services/control-plane
docker compose ps
curl http://localhost:8080/summary
```
## Operational Workflows
### Action Approval
1. Access the Operator UI (via Tailscale IP or Nginx Proxy Manager).
2. Navigate to **Action Queue**.
3. Review **Pending** actions recommended by the Supervisor.
4. Click **Approve** to move actions to the execution queue.
### Recovery Flow
In case of control plane failure:
1. Check logs: `docker compose logs -f`.
2. Restart stack: `docker compose restart`.
3. Rebuild world state: Delete `/opt/homelab/state/observer_checkpoint.json` and restart the observer service.
### Upgrade Flow
1. Pull latest changes from git.
2. Run bootstrap script again: `./scripts/bootstrap/vps-control-plane.sh`.
- This will rebuild images and restart containers with new code.
### Rollback Semantics
Since the runtime is filesystem-first and append-only:
1. Roll back the repository state to a previous commit.
2. Restart the control plane stack.
3. The supervisor will detect drift against the older (rolled-back) desired state and recommend actions to restore it.
## Runtime Safety
- **Readonly Mounts**: Most services mount the repository as `:ro` to prevent accidental mutations.
- **Least-Privilege**: UI, Observer, and Supervisor run as non-root `homelab` user (UID 1000).
- **Filesystem Isolation**: Clear separation between `/repo` (code/inventory) and `/opt/homelab` (runtime state).
## Integration
### Nginx Proxy Manager
Configure a proxy host in NPM to point to `http://control-plane-ui:8080`. Ensure Websockets are enabled if the UI uses them.
### Log Locations
- Container logs: `docker compose logs`
- Runtime events: `/opt/homelab/events/YYYY-MM-DD/`
- World state: `/opt/homelab/world/`
- Diagnostics: `/opt/homelab/logs/`

View file

@ -0,0 +1,7 @@
# Control Plane Environment Variables
PORT=8080
HOMELAB_STATE_ROOT=/opt/homelab/state
HOMELAB_EVENTS_ROOT=/opt/homelab/events
HOMELAB_WORLD_ROOT=/opt/homelab/world
HOMELAB_ACTIONS_ROOT=/opt/homelab/actions
HOMELAB_CONFIG_ROOT=/opt/homelab/config

View file

@ -0,0 +1,75 @@
#!/usr/bin/env bash
# vps-control-plane.sh - Bootstrap script for VPS control plane
set -e
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
REPO_ROOT="$(cd "$SCRIPT_DIR/../.." && pwd)"
RUNTIME_DIR="/opt/homelab"
VPS_CONFIG="$REPO_ROOT/hosts/vps/runtime"
# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m' # No Color
log() { echo -e "${GREEN}[INFO]${NC} $1"; }
warn() { echo -e "${YELLOW}[WARN]${NC} $1"; }
error() { echo -e "${RED}[ERROR]${NC} $1"; exit 1; }
log "Starting VPS control plane bootstrap..."
# 1. Validate Docker availability
if ! command -v docker &> /dev/null; then
error "Docker is not installed. Please install Docker first."
fi
# 2. Validate compose plugin
if ! docker compose version &> /dev/null; then
error "Docker Compose plugin is not installed."
fi
log "Docker and Compose plugin verified."
# 3. Create filesystem-first runtime structure
log "Creating filesystem-first runtime structure in $RUNTIME_DIR..."
sudo mkdir -p "$RUNTIME_DIR/events" \
"$RUNTIME_DIR/state" \
"$RUNTIME_DIR/world" \
"$RUNTIME_DIR/actions/pending" \
"$RUNTIME_DIR/actions/approved" \
"$RUNTIME_DIR/actions/running" \
"$RUNTIME_DIR/actions/completed" \
"$RUNTIME_DIR/actions/failed" \
"$RUNTIME_DIR/actions/rejected" \
"$RUNTIME_DIR/config" \
"$RUNTIME_DIR/logs"
# 4. Set permissions
log "Setting permissions..."
sudo chown -R $USER:$USER "$RUNTIME_DIR"
chmod -R 755 "$RUNTIME_DIR"
# 5. Install environment file
log "Installing environment configuration..."
if [ ! -f "$RUNTIME_DIR/config/control-plane.env" ]; then
cp "$VPS_CONFIG/control-plane/env.example" "$RUNTIME_DIR/config/control-plane.env"
log "Created $RUNTIME_DIR/config/control-plane.env from template."
else
warn "Environment file already exists, skipping installation."
fi
# 6. Build and start the control plane
log "Building and starting control plane services..."
cd "$REPO_ROOT/services/control-plane"
docker compose build
docker compose up -d
log "VPS control plane bootstrap complete!"
echo -e "\n${YELLOW}Verification commands:${NC}"
echo "1. Check container status: docker compose ps"
echo "2. Check operator UI: curl http://localhost:8080/summary"
echo "3. Validate world state: ls -l $RUNTIME_DIR/world"
echo "4. Monitor events: tail -f $RUNTIME_DIR/events/*/*/*.json"

View file

@ -0,0 +1,28 @@
FROM python:3.11-slim
WORKDIR /app
RUN apt-get update && apt-get install -y \
curl \
docker.io \
&& rm -rf /var/lib/apt/lists/*
RUN pip install --no-cache-dir pyyaml
# Create homelab user
RUN useradd -m -u 1000 homelab
# Copy sources
COPY src/ /app/src/
# Also need the observer script if we want to run it from here,
# but I'll copy it from the repo during build or mount it.
# Actually, I'll copy the entire scripts/ directory to /repo/scripts
# so the supervisor/executor can find them.
# For simplicity, we'll assume the repo is mounted at /repo
ENV REPO_ROOT=/repo
ENV RUNTIME_PATH=/opt/homelab
ENV PYTHONUNBUFFERED=1
# Default command (will be overridden in docker-compose)
CMD ["python", "src/operator_ui.py"]

View file

@ -0,0 +1,55 @@
services:
operator-ui:
build: .
container_name: control-plane-ui
user: "1000:1000"
command: python src/operator_ui.py
ports:
- "18180:8080"
volumes:
- /opt/homelab:/opt/homelab
restart: unless-stopped
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8080/summary"]
interval: 30s
timeout: 10s
retries: 3
observer:
build: .
container_name: control-plane-observer
user: "1000:1000"
command: python /repo/scripts/observer/observer.py
volumes:
- /opt/homelab:/opt/homelab
- ../..:/repo:ro
restart: unless-stopped
environment:
- REPO_ROOT=/repo
- RUNTIME_PATH=/opt/homelab
supervisor:
build: .
container_name: control-plane-supervisor
user: "1000:1000"
command: python src/supervisor.py
volumes:
- /opt/homelab:/opt/homelab
- ../..:/repo:ro
restart: unless-stopped
environment:
- REPO_ROOT=/repo
- RUNTIME_PATH=/opt/homelab
executor:
build: .
container_name: control-plane-executor
command: python src/executor.py
volumes:
- /opt/homelab:/opt/homelab
- ../..:/repo
- /var/run/docker.sock:/var/run/docker.sock
restart: unless-stopped
environment:
- REPO_ROOT=/repo
- RUNTIME_PATH=/opt/homelab

View file

@ -0,0 +1,102 @@
import os
import json
import time
import logging
import subprocess
from pathlib import Path
# Constants and Paths
RUNTIME_PATH = os.getenv("RUNTIME_PATH", "/opt/homelab")
ACTIONS_DIR = Path(RUNTIME_PATH) / "actions"
REPO_ROOT = Path(os.getenv("REPO_ROOT", "/repo"))
# Logging setup
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger("executor")
class Executor:
def __init__(self):
self._ensure_dirs()
def _ensure_dirs(self):
for s in ["approved", "running", "completed", "failed", "rejected"]:
(ACTIONS_DIR / s).mkdir(parents=True, exist_ok=True)
def process_actions(self):
approved_dir = ACTIONS_DIR / "approved"
action_files = sorted(approved_dir.glob("*.json"))
for action_file in action_files:
self._execute_action(action_file)
def _execute_action(self, action_file):
action_id = action_file.stem
logger.info(f"Executing action: {action_id}")
# Move to running
running_path = ACTIONS_DIR / "running" / f"{action_id}.json"
try:
with open(action_file, "r") as f:
data = json.load(f)
data["status"] = "running"
data["started_at"] = time.time()
with open(running_path, "w") as f:
json.dump(data, f, indent=2)
action_file.unlink()
except Exception as e:
logger.error(f"Failed to move {action_id} to running: {e}")
return
# Execute
success = False
error_msg = ""
try:
action_type = data.get("type")
node = data.get("node")
service = data.get("service")
if action_type == "redeploy":
# Call deploy-node.sh
cmd = [
str(REPO_ROOT / "scripts" / "deploy" / "deploy-node.sh"),
node,
service
]
logger.info(f"Running command: {' '.join(cmd)}")
result = subprocess.run(cmd, capture_output=True, text=True, cwd=str(REPO_ROOT))
if result.returncode == 0:
success = True
else:
success = False
error_msg = result.stderr or result.stdout
else:
success = False
error_msg = f"Unknown action type: {action_type}"
except Exception as e:
success = False
error_msg = str(e)
# Move to completed/failed
target_status = "completed" if success else "failed"
target_path = ACTIONS_DIR / target_status / f"{action_id}.json"
try:
data["status"] = target_status
data["finished_at"] = time.time()
if not success:
data["error"] = error_msg
with open(target_path, "w") as f:
json.dump(data, f, indent=2)
running_path.unlink()
logger.info(f"Action {action_id} {target_status}")
except Exception as e:
logger.error(f"Failed to move {action_id} to {target_status}: {e}")
def loop(self, interval=10):
logger.info("Starting executor loop")
while True:
self.process_actions()
time.sleep(interval)
if __name__ == "__main__":
executor = Executor()
executor.loop()

View file

@ -0,0 +1,701 @@
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>Operator Control Plane</title>
<style>
:root {
--bg-color: #0a0c0e;
--sidebar-color: #14171a;
--card-color: #1c2024;
--border-color: #2a3540;
--text-color: #e7edf3;
--text-muted: #94a3b8;
--accent-color: #3eaf7c;
--nominal: #3eaf7c;
--degraded: #e7c000;
--unstable: #e67e22;
--reconciling: #3498db;
--error: #c0392b;
--safe: #3eaf7c;
--guarded: #e67e22;
--dangerous: #c0392b;
}
body {
margin: 0;
font-family: 'Inter', system-ui, -apple-system, sans-serif;
background: var(--bg-color);
color: var(--text-color);
display: flex;
height: 100vh;
overflow: hidden;
}
/* Sidebar */
.sidebar {
width: 240px;
background: var(--sidebar-color);
border-right: 1px solid var(--border-color);
display: flex;
flex-direction: column;
flex-shrink: 0;
}
.sidebar-header {
padding: 24px;
font-weight: 800;
font-size: 14px;
letter-spacing: 0.1em;
color: var(--accent-color);
border-bottom: 1px solid var(--border-color);
}
.nav-list {
list-style: none;
padding: 12px 0;
margin: 0;
flex-grow: 1;
}
.nav-item {
padding: 12px 24px;
cursor: pointer;
font-size: 14px;
color: var(--text-muted);
transition: all 0.2s;
display: flex;
align-items: center;
gap: 12px;
}
.nav-item:hover {
background: rgba(255, 255, 255, 0.05);
color: var(--text-color);
}
.nav-item.active {
background: rgba(62, 175, 124, 0.1);
color: var(--accent-color);
border-left: 3px solid var(--accent-color);
}
.sidebar-footer {
padding: 16px;
border-top: 1px solid var(--border-color);
font-size: 12px;
}
/* Content Area */
.main-content {
flex-grow: 1;
display: flex;
flex-direction: column;
overflow: hidden;
}
header {
height: 64px;
border-bottom: 1px solid var(--border-color);
display: flex;
align-items: center;
padding: 0 24px;
justify-content: space-between;
background: var(--bg-color);
}
.view-title {
font-size: 18px;
font-weight: 600;
}
.content-scroll {
flex-grow: 1;
overflow-y: auto;
padding: 24px;
}
/* Cards & Grids */
.grid {
display: grid;
grid-template-columns: repeat(auto-fill, minmax(350px, 1fr));
gap: 20px;
}
.card {
background: var(--card-color);
border: 1px solid var(--border-color);
padding: 20px;
border-radius: 4px;
position: relative;
}
.card-header {
display: flex;
justify-content: space-between;
align-items: center;
margin-bottom: 16px;
}
.card-title {
font-weight: 700;
font-size: 16px;
}
/* Status Badges */
.badge {
padding: 4px 8px;
border-radius: 4px;
font-size: 11px;
font-weight: 700;
text-transform: uppercase;
}
.status-nominal { background: rgba(62, 175, 124, 0.1); color: var(--nominal); }
.status-degraded { background: rgba(231, 192, 0, 0.1); color: var(--degraded); }
.status-unstable { background: rgba(230, 126, 34, 0.1); color: var(--unstable); }
.status-reconciling { background: rgba(52, 152, 219, 0.1); color: var(--reconciling); }
.status-error { background: rgba(192, 57, 43, 0.1); color: var(--error); }
/* Timeline */
.timeline {
display: flex;
flex-direction: column;
gap: 12px;
}
.event {
padding: 12px;
border-left: 2px solid var(--border-color);
background: rgba(255, 255, 255, 0.02);
font-family: ui-monospace, monospace;
font-size: 13px;
}
.event.high { border-left-color: var(--error); }
.event.medium { border-left-color: var(--unstable); }
.event.low { border-left-color: var(--nominal); }
.event-header {
display: flex;
justify-content: space-between;
margin-bottom: 4px;
color: var(--text-muted);
}
/* Forms & Inputs */
.controls {
display: flex;
gap: 12px;
margin-top: 20px;
}
input, button {
background: var(--card-color);
border: 1px solid var(--border-color);
color: var(--text-color);
padding: 8px 16px;
font-size: 14px;
border-radius: 4px;
}
button {
cursor: pointer;
font-weight: 600;
}
button:hover { background: var(--border-color); }
.btn-primary { background: var(--accent-color); color: white; border: none; }
.btn-primary:hover { background: #359b6d; }
/* Utility */
.hidden { display: none !important; }
.mono { font-family: ui-monospace, monospace; }
.label { color: var(--text-muted); font-size: 12px; margin-bottom: 4px; }
.value { font-weight: 500; margin-bottom: 12px; }
.risk-safe { background: rgba(62, 175, 124, 0.1); color: var(--safe); }
.risk-guarded { background: rgba(230, 126, 34, 0.1); color: var(--guarded); }
.risk-dangerous { background: rgba(192, 57, 43, 0.1); color: var(--dangerous); }
</style>
</head>
<body>
<aside class="sidebar">
<div class="sidebar-header">HOMELAB OPERATOR</div>
<ul class="nav-list">
<li class="nav-item active" onclick="showView('dashboard', this)">
<span>Dashboard</span>
</li>
<li class="nav-item" onclick="showView('actions', this)">
<span>Action Queue</span>
</li>
<li class="nav-item" onclick="showView('nodes', this)">
<span>Nodes</span>
</li>
<li class="nav-item" onclick="showView('services', this)">
<span>Services</span>
</li>
<li class="nav-item" onclick="showView('deployments', this)">
<span>Deployments</span>
</li>
<li class="nav-item" onclick="showView('topology', this)">
<span>Topology</span>
</li>
<li class="nav-item" onclick="showView('events', this)">
<span>Events</span>
</li>
<li class="nav-item" onclick="showView('correlation', this)">
<span>Correlation</span>
</li>
<li class="nav-item" onclick="showView('recommendations', this)">
<span>Recommendations</span>
</li>
<li class="nav-item" onclick="showView('settings', this)">
<span>Settings</span>
</li>
</ul>
<div class="sidebar-footer">
<div id="summary-status">System Status: Loading...</div>
</div>
</aside>
<main class="main-content">
<div id="stale-banner" class="hidden" style="background:var(--error); color:white; padding:8px 24px; font-weight:bold; font-size:12px; text-align:center; letter-spacing:0.05em">
RUNTIME STATE IS STALE
</div>
<header>
<div style="display:flex; align-items:center; gap:20px">
<div class="view-title" id="current-view-title">Dashboard</div>
<select id="operator-mode" onchange="setOperatorMode(this.value)" style="background:var(--sidebar-color); border:1px solid var(--border-color); color:var(--accent-color); font-weight:bold; font-size:12px; padding:4px 8px">
<option value="observe">OBSERVE</option>
<option value="recommend">RECOMMEND</option>
<option value="approval" selected>APPROVAL</option>
<option value="autonomous">AUTONOMOUS</option>
<option value="maintenance">MAINTENANCE</option>
</select>
</div>
<div class="header-actions">
<button onclick="refreshData()">Refresh</button>
</div>
</header>
<div class="content-scroll">
<!-- Dashboard View -->
<div id="view-dashboard" class="view">
<div class="grid">
<div class="card">
<div class="card-title">System Overview</div>
<div id="dashboard-summary" style="margin-top:20px"></div>
</div>
<div class="card">
<div class="card-title">Pending Actions</div>
<div id="dashboard-actions-summary" style="margin-top:20px"></div>
</div>
<div class="card">
<div class="card-title">Active Incidents</div>
<div id="dashboard-incidents" style="margin-top:20px"></div>
</div>
</div>
</div>
<!-- Actions View -->
<div id="view-actions" class="view hidden">
<div style="display:grid; grid-template-columns: 1fr 1fr; gap:24px">
<div>
<h3>Pending Approval</h3>
<div id="actions-pending" class="timeline"></div>
</div>
<div>
<h3>Active / History</h3>
<div id="actions-history" class="timeline"></div>
</div>
</div>
</div>
<!-- Nodes View -->
<div id="view-nodes" class="view hidden">
<div class="grid" id="nodes-list"></div>
</div>
<!-- Services View -->
<div id="view-services" class="view hidden">
<div class="grid" id="services-list"></div>
</div>
<!-- Deployments View -->
<div id="view-deployments" class="view hidden">
<div class="grid" id="deployments-list"></div>
</div>
<!-- Topology View -->
<div id="view-topology" class="view hidden">
<div class="card" style="min-height:500px">
<div class="card-title">Runtime Topology</div>
<div id="topology-map" style="margin-top:20px; display:flex; flex-wrap:wrap; gap:40px; justify-content:center"></div>
</div>
</div>
<!-- Events View -->
<div id="view-events" class="view hidden">
<div class="timeline" id="events-timeline"></div>
</div>
<!-- Correlation View -->
<div id="view-correlation" class="view hidden">
<div id="correlation-chains" class="grid"></div>
</div>
<!-- Recommendations View -->
<div id="view-recommendations" class="view hidden">
<div class="grid" id="recommendations-list"></div>
</div>
<!-- Settings View -->
<div id="view-settings" class="view hidden">
<div class="card">
<div class="card-title">Configuration</div>
<div id="settings-content" style="margin-top:20px"></div>
</div>
</div>
</div>
</main>
<script>
let currentView = 'dashboard';
const pollInterval = 5000;
function showView(viewId, el) {
document.querySelectorAll('.view').forEach(v => v.classList.add('hidden'));
document.getElementById('view-' + viewId).classList.remove('hidden');
document.querySelectorAll('.nav-item').forEach(i => i.classList.remove('active'));
if (el) el.classList.add('active');
currentView = viewId;
document.getElementById('current-view-title').textContent = viewId.charAt(0).toUpperCase() + viewId.slice(1);
refreshData();
}
async function fetchData(endpoint) {
try {
const res = await fetch(endpoint, {cache: 'no-store'});
return await res.json();
} catch (e) {
console.error('Fetch error:', endpoint, e);
return null;
}
}
async function postData(endpoint, data) {
try {
const res = await fetch(endpoint, {
method: 'POST',
headers: {'Content-Type': 'application/json'},
body: JSON.stringify(data)
});
return await res.json();
} catch (e) {
console.error('Post error:', endpoint, e);
return null;
}
}
async function mutateAction(id, status) {
const res = await postData('/action/mutate', {id, status});
if (res && res.status === 'ok') {
refreshData();
} else {
alert('Mutation failed');
}
}
async function setOperatorMode(mode) {
console.log('Operator mode set to:', mode);
const res = await postData('/mode', {mode});
if (res && res.status === 'ok') {
console.log('Mode updated successfully');
}
}
function formatTime(ts) {
if (!ts) return 'N/A';
return new Date(ts * 1000).toLocaleString();
}
function getStatusClass(status) {
status = (status || '').toLowerCase();
if (['nominal', 'healthy', 'ok', 'up'].includes(status)) return 'status-nominal';
if (['degraded', 'warning'].includes(status)) return 'status-degraded';
if (['unstable'].includes(status)) return 'status-unstable';
if (['reconciling'].includes(status)) return 'status-reconciling';
if (['error', 'down', 'failed'].includes(status)) return 'status-error';
return '';
}
async function refreshData() {
// Refresh summary always
const summary = await fetchData('/summary');
if (summary) {
const statusEl = document.getElementById('summary-status');
statusEl.textContent = `System Status: ${summary.status.toUpperCase()}`;
statusEl.className = 'sidebar-footer ' + getStatusClass(summary.status);
// Handle stale state
const staleBanner = document.getElementById('stale-banner');
if (summary.stale) {
staleBanner.classList.remove('hidden');
staleBanner.textContent = `CRITICAL: Runtime state is STALE (Last update: ${formatTime(summary.last_update)})`;
} else {
staleBanner.classList.add('hidden');
}
if (currentView === 'dashboard') {
const dashSummary = document.getElementById('dashboard-summary');
dashSummary.innerHTML = `
<div class="label">Nodes</div><div class="value">${summary.node_count}</div>
<div class="label">Services</div><div class="value">${summary.service_count}</div>
<div class="label">Last Update</div><div class="value">${formatTime(summary.last_update)}</div>
`;
}
}
if (currentView === 'dashboard' || currentView === 'actions') {
const actions = await fetchData('/actions');
if (actions) {
if (currentView === 'dashboard') {
const dashActions = document.getElementById('dashboard-actions-summary');
const pendingCount = actions.pending.length;
dashActions.innerHTML = `
<div class="label">Pending</div><div class="value" style="color:var(--guarded)">${pendingCount}</div>
<div class="label">Running</div><div class="value" style="color:var(--reconciling)">${actions.running.length}</div>
`;
}
if (currentView === 'actions') {
const pendingEl = document.getElementById('actions-pending');
const historyEl = document.getElementById('actions-history');
pendingEl.innerHTML = actions.pending.map(a => `
<div class="card" style="margin-bottom:12px">
<div class="card-header">
<div class="card-title">${(a.action_type || a.type || 'unknown').toUpperCase()}</div>
<span class="badge risk-${a.risk_level}">${a.risk_level}</span>
</div>
<p>${a.description || a.action_type || 'No description'}</p>
<div class="label">Target</div><div class="value">${a.node || (a.target && a.target.node) || 'unknown'} ${(a.service || (a.target && a.target.service)) || ''}</div>
<div class="label">Confidence</div><div class="value">${Math.round((a.confidence || 0)*100)}%</div>
<div class="controls">
<button class="btn-primary" onclick="mutateAction('${a.id}', 'approved')">Approve</button>
<button onclick="mutateAction('${a.id}', 'rejected')">Reject</button>
</div>
</div>
`).join('') || 'No pending actions.';
const history = [...actions.approved, ...actions.running, ...actions.completed, ...actions.failed, ...actions.rejected];
historyEl.innerHTML = history.sort((a,b) => (b.timestamp || b.updated_at || 0) - (a.timestamp || a.updated_at || 0)).map(a => `
<div class="event">
<div class="event-header">
<span>${(a.action_type || a.type || 'unknown').toUpperCase()}</span>
<span class="badge ${getStatusClass(a.status)}">${a.status}</span>
</div>
<div>${a.description || a.action_type || 'No description'}</div>
<small>${formatTime(a.timestamp || a.updated_at)} | Target: ${a.node || (a.target && a.target.node)}</small>
${a.status === 'approved' ? `<div class="controls"><button class="btn-primary" onclick="mutateAction('${a.id}', 'running')">Execute</button></div>` : ''}
${a.transition_history ? `
<div style="margin-top:8px; font-size:10px; color:var(--text-muted)">
<strong>Trace:</strong> ${a.transition_history.map(h => `${h.from}->${h.to}`).join(' → ')}
</div>
` : ''}
</div>
`).join('') || 'No history.';
}
}
}
if (currentView === 'dashboard' || currentView === 'events') {
const incidents = await fetchData('/incidents');
if (currentView === 'dashboard') {
const dashIncidents = document.getElementById('dashboard-incidents');
if (!incidents || incidents.length === 0) {
dashIncidents.textContent = 'No active incidents.';
} else {
dashIncidents.innerHTML = incidents.map(inc => `
<div class="event ${inc.severity}">
<strong>${inc.severity.toUpperCase()}:</strong> ${inc.message}<br>
<small>${formatTime(inc.timestamp)} | Node: ${inc.node}</small>
</div>
`).join('');
}
}
}
if (currentView === 'nodes') {
const nodes = await fetchData('/nodes');
const list = document.getElementById('nodes-list');
list.innerHTML = nodes.map(node => `
<div class="card">
<div class="card-header">
<div class="card-title">${node.hostname}</div>
<span class="badge ${getStatusClass(node.health)}">${node.health}</span>
</div>
<div class="label">ID</div><div class="value mono">${node.id}</div>
<div class="label">Capabilities</div><div class="value">${node.capabilities.join(', ')}</div>
<div class="label">Connectivity</div><div class="value">${node.connectivity}</div>
<div class="label">Incidents (24h)</div><div class="value">${node.incidents}</div>
<div class="label">Last Seen</div><div class="value">${formatTime(node.last_seen)}</div>
<div class="label">Runtime Status</div><div class="value">${node.status}</div>
</div>
`).join('');
}
if (currentView === 'services') {
const services = await fetchData('/services');
const list = document.getElementById('services-list');
list.innerHTML = services.map(svc => `
<div class="card">
<div class="card-header">
<div class="card-title">${svc.name}</div>
<span class="badge ${getStatusClass(svc.health)}">${svc.health}</span>
</div>
<div class="label">State (Desired/Actual)</div><div class="value">${svc.desired_state} / ${svc.actual_state}</div>
<div class="label">Deployment</div><div class="value">${svc.deployment_state}</div>
<div class="label">Dependencies</div><div class="value">${svc.dependencies.join(', ') || 'None'}</div>
<div class="label">Recommendations</div><div class="value">${svc.recommendations.join(', ') || 'None'}</div>
</div>
`).join('');
}
if (currentView === 'deployments') {
const deps = await fetchData('/deployments');
const list = document.getElementById('deployments-list');
list.innerHTML = deps.map(dep => `
<div class="card">
<div class="card-header">
<div class="card-title">${dep.service}</div>
<span class="badge ${dep.status === 'failed' ? 'status-error' : 'status-reconciling'}">${dep.status}</span>
</div>
<div class="label">ID</div><div class="value mono">${dep.id}</div>
<div class="label">Stage</div><div class="value">${dep.stage}</div>
<div class="label">Diagnostics</div><div class="value">${dep.diagnostics || 'No data'}</div>
<div class="label">Resumable</div><div class="value">${dep.resumable ? 'Yes' : 'No'}</div>
${dep.resumable ? '<button class="btn-primary">Resume</button>' : ''}
</div>
`).join('');
}
if (currentView === 'events') {
const events = await fetchData('/events');
const timeline = document.getElementById('events-timeline');
timeline.innerHTML = events.map(ev => `
<div class="event ${ev.severity}">
<div class="event-header">
<span>${ev.type.toUpperCase()}</span>
<span>${formatTime(ev.timestamp)}</span>
</div>
<div>${ev.message}</div>
<div class="label" style="margin-top:8px">Node: ${ev.node} ${ev.service ? '| Service: ' + ev.service : ''}</div>
</div>
`).join('');
}
if (currentView === 'recommendations') {
const recs = await fetchData('/recommendations');
const list = document.getElementById('recommendations-list');
list.innerHTML = recs.map(rec => `
<div class="card">
<div class="card-header">
<div class="card-title">${rec.title}</div>
<span class="badge risk-${rec.risk_level}">${rec.risk_level}</span>
</div>
<p>${rec.description}</p>
<div class="label">Confidence</div><div class="value">${Math.round(rec.confidence * 100)}%</div>
<div class="label">Autonomous Eligible</div><div class="value">${rec.autonomous_eligible ? 'Yes' : 'No'}</div>
<div class="label">Blocked Actions</div><div class="value">${rec.blocked_actions.join(', ') || 'None'}</div>
<div class="controls">
<button class="btn-primary" ${rec.risk_level === 'dangerous' ? 'style="background:var(--dangerous)"' : ''}>Approve Action</button>
</div>
</div>
`).join('');
}
if (currentView === 'topology') {
const nodes = await fetchData('/nodes');
const services = await fetchData('/services');
const topMap = document.getElementById('topology-map');
if (nodes && services) {
topMap.innerHTML = nodes.map(node => {
const nodeServices = services.filter(s => s.node === node.hostname || s.node === node.id);
return `
<div class="card" style="width:250px; border: 1px solid ${node.health === 'nominal' ? 'var(--border-color)' : 'var(--error)'}">
<div class="card-header">
<div class="card-title">${node.hostname}</div>
<span class="badge ${getStatusClass(node.health)}">${node.health}</span>
</div>
<div class="label">Capabilities</div>
<div class="value" style="font-size:11px">${node.capabilities.join(', ')}</div>
<div class="label">Services</div>
<div style="font-size:12px; margin-bottom:10px">
${nodeServices.length > 0 ? nodeServices.map(s => `
<div style="display:flex; justify-content:space-between; margin-bottom:4px; padding:4px; background:rgba(255,255,255,0.03)">
<span>${s.name}</span>
<span class="${getStatusClass(s.health)}" style="font-size:10px">${s.health}</span>
</div>
${s.dependencies.length > 0 ? `<div style="font-size:9px; color:var(--text-muted); margin-left:8px; margin-bottom:4px">dep: ${s.dependencies.join(', ')}</div>` : ''}
`).join('') : '<div class="value">None</div>'}
</div>
</div>
`;
}).join('');
}
}
if (currentView === 'correlation') {
const incidents = await fetchData('/incidents');
const actions = await fetchData('/actions');
const list = document.getElementById('correlation-chains');
if (incidents && actions) {
const allActions = Object.values(actions).flat();
list.innerHTML = incidents.map(inc => {
const related = allActions.filter(a => a.correlation_chain && a.correlation_chain.includes(inc.id));
return `
<div class="card">
<div class="card-header">
<div class="card-title">Incident: ${inc.id || 'INC-001'}</div>
<span class="badge status-error">Active</span>
</div>
<p>${inc.message}</p>
<div class="label">Related Actions</div>
${related.map(a => `
<div class="event" style="margin-top:5px">
<strong>${a.type}</strong> (${a.status})<br>
<small>${a.description}</small>
</div>
`).join('') || '<div class="value">No actions yet</div>'}
</div>
`;
}).join('');
}
}
if (currentView === 'settings') {
const config = await fetchData('/config');
const content = document.getElementById('settings-content');
content.innerHTML = `
<div class="label">Auto Mode</div>
<div class="value">${config.auto_mode ? 'Enabled' : 'Disabled'}</div>
<div class="label">Action Thresholds</div>
<div class="value mono">${JSON.stringify(config.action_thresholds, null, 2)}</div>
<div class="label">Telegram Integration</div>
<div class="value" style="color:var(--text-muted)">Ready for mobile approval flows. Hook: /api/v1/telegram/webhook</div>
<button onclick="alert('Settings update not implemented in this demo')">Edit Configuration</button>
`;
}
}
// Initial load
refreshData();
// Poll for updates
setInterval(refreshData, pollInterval);
</script>
</body>
</html>

View file

@ -0,0 +1,272 @@
import json
import os
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from pathlib import Path
STATE_DIR = Path(os.getenv("HOMELAB_STATE_ROOT", "/opt/homelab/state"))
EVENTS_DIR = Path(os.getenv("HOMELAB_EVENTS_ROOT", "/opt/homelab/events"))
WORLD_DIR = Path(os.getenv("HOMELAB_WORLD_ROOT", "/opt/homelab/world"))
ACTIONS_DIR = Path(os.getenv("HOMELAB_ACTIONS_ROOT", "/opt/homelab/actions"))
CONFIG_DIR = Path(os.getenv("HOMELAB_CONFIG_ROOT", "/opt/homelab/config"))
STATIC_DIR = Path(__file__).parent
DEFAULT_CONFIG = {
"operator_mode": "approval",
"auto_mode": True,
"action_thresholds": {
"restart_ha": 0.8,
"check_network": 0.9,
},
"default_threshold": 0.9,
"allowed_auto_actions": ["restart_ha"],
}
def read_json_file(path, default=None):
if not path.exists():
return default if default is not None else []
try:
return json.loads(path.read_text())
except Exception:
return default if default is not None else []
def get_config():
config_path = STATE_DIR / "operator-config.json"
if config_path.exists():
return read_json_file(config_path, DEFAULT_CONFIG)
return DEFAULT_CONFIG
def save_config(config):
STATE_DIR.mkdir(parents=True, exist_ok=True)
(STATE_DIR / "operator-config.json").write_text(json.dumps(config, indent=2))
def current_nodes():
return read_json_file(WORLD_DIR / "nodes.json")
def current_services():
return read_json_file(WORLD_DIR / "services.json")
def current_deployments():
return read_json_file(WORLD_DIR / "deployments.json")
def current_incidents():
return read_json_file(WORLD_DIR / "incidents.json")
def current_recommendations():
return read_json_file(WORLD_DIR / "recommendations.json")
def current_summary():
summary = read_json_file(WORLD_DIR / "runtime-summary.json", default={})
if summary:
# Check for staleness
mtime = os.path.getmtime(WORLD_DIR / "runtime-summary.json")
summary["last_update"] = mtime
summary["stale"] = (time.time() - mtime) > 60 # Stale if older than 60s
return summary
def current_events():
events = []
if EVENTS_DIR.exists():
for f in EVENTS_DIR.glob("*.json"):
data = read_json_file(f)
if data:
# Add source file for traceability
data["_source"] = f.name
events.append(data)
return sorted(events, key=lambda x: x.get("timestamp", 0), reverse=True)
def current_actions():
actions = {}
statuses = ["pending", "approved", "running", "completed", "failed", "rejected"]
for status in statuses:
actions[status] = []
status_dir = ACTIONS_DIR / status
if status_dir.exists():
for f in status_dir.glob("*.json"):
data = read_json_file(f)
if data:
# Injects some metadata for UI
data["id"] = data.get("action_id") or f.stem
data["status"] = status
actions[status].append(data)
return actions
def mutate_action(action_id, target_status):
statuses = ["pending", "approved", "running", "completed", "failed", "rejected"]
if target_status not in statuses:
return False, f"Invalid target status: {target_status}"
# Find where the action is
source_path = None
current_status = None
for status in statuses:
p = ACTIONS_DIR / status / f"{action_id}.json"
if p.exists():
source_path = p
current_status = status
break
if not source_path:
return False, f"Action {action_id} not found"
target_dir = ACTIONS_DIR / target_status
target_dir.mkdir(parents=True, exist_ok=True)
target_path = target_dir / f"{action_id}.json"
try:
data = json.loads(source_path.read_text())
data["status"] = target_status
data["updated_at"] = time.time()
# Keep history of transitions
history = data.get("transition_history", [])
history.append({
"from": current_status,
"to": target_status,
"timestamp": time.time()
})
data["transition_history"] = history
target_path.write_text(json.dumps(data, indent=2))
if source_path != target_path:
source_path.unlink()
return True, "Success"
except Exception as e:
return False, str(e)
def send_json(status, payload, handler):
body = (json.dumps(payload) + "\n").encode("utf-8")
handler.send_response(status)
handler.send_header("Content-Type", "application/json")
handler.send_header("Content-Length", str(len(body)))
handler.end_headers()
handler.wfile.write(body)
class Handler(BaseHTTPRequestHandler):
def do_GET(self):
if self.path == "/config":
send_json(200, get_config(), self)
return
if self.path == "/nodes":
send_json(200, current_nodes(), self)
return
if self.path == "/services":
send_json(200, current_services(), self)
return
if self.path == "/deployments":
send_json(200, current_deployments(), self)
return
if self.path == "/incidents":
send_json(200, current_incidents(), self)
return
if self.path == "/recommendations":
send_json(200, current_recommendations(), self)
return
if self.path == "/summary":
send_json(200, current_summary(), self)
return
if self.path == "/events":
send_json(200, current_events(), self)
return
if self.path == "/actions":
send_json(200, current_actions(), self)
return
if self.path in ("/", "/index.html"):
body = (STATIC_DIR / "index.html").read_bytes()
self.send_response(200)
self.send_header("Content-Type", "text/html; charset=utf-8")
self.send_header("Content-Length", str(len(body)))
self.end_headers()
self.wfile.write(body)
return
self.send_error(404)
def do_POST(self):
if self.path not in (
"/config",
"/action/mutate",
"/mode",
):
self.send_error(404)
return
length = int(self.headers.get("Content-Length", "0"))
raw_body = self.rfile.read(length).decode("utf-8")
try:
payload = json.loads(raw_body)
except json.JSONDecodeError:
self.send_error(400, "Invalid JSON")
return
if self.path == "/config":
config = get_config()
config.update(payload)
save_config(config)
send_json(200, {"status": "ok"}, self)
return
if self.path == "/mode":
mode = payload.get("mode")
if not mode:
self.send_error(400, "mode is required")
return
config = get_config()
config["operator_mode"] = mode
save_config(config)
send_json(200, {"status": "ok"}, self)
return
if self.path == "/action/mutate":
action_id = payload.get("id")
target = payload.get("status")
if not action_id or not target:
self.send_error(400, "id and status are required")
return
success, msg = mutate_action(action_id, target)
if success:
send_json(200, {"status": "ok"}, self)
else:
self.send_error(500, msg)
return
def log_message(self, format, *args):
return
if __name__ == "__main__":
# Ensure directories exist
for d in [STATE_DIR, EVENTS_DIR, WORLD_DIR, ACTIONS_DIR, CONFIG_DIR]:
d.mkdir(parents=True, exist_ok=True)
for s in ["pending", "approved", "running", "completed", "failed", "rejected"]:
(ACTIONS_DIR / s).mkdir(parents=True, exist_ok=True)
port = int(os.getenv("PORT", "8080"))
print(f"Operator Control Plane starting on 0.0.0.0:{port}")
server = ThreadingHTTPServer(("0.0.0.0", port), Handler)
server.serve_forever()

View file

@ -0,0 +1,136 @@
import os
import json
import time
import logging
import yaml
from pathlib import Path
# Constants and Paths
RUNTIME_PATH = os.getenv("RUNTIME_PATH", "/opt/homelab")
WORLD_DIR = Path(RUNTIME_PATH) / "world"
ACTIONS_DIR = Path(RUNTIME_PATH) / "actions"
REPO_ROOT = Path(os.getenv("REPO_ROOT", "/repo"))
# Logging setup
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger("supervisor")
class Supervisor:
def __init__(self):
self.desired_state = {"services": {}}
self.actual_state = {"services": {}, "nodes": {}, "incidents": {}}
self._ensure_dirs()
def _ensure_dirs(self):
ACTIONS_DIR.mkdir(parents=True, exist_ok=True)
(ACTIONS_DIR / "pending").mkdir(parents=True, exist_ok=True)
def _load_desired_state(self):
services = {}
hosts_dir = REPO_ROOT / "hosts"
if not hosts_dir.exists():
logger.warning(f"Hosts directory {hosts_dir} does not exist")
return
for host_dir in hosts_dir.iterdir():
if host_dir.is_dir():
svc_file = host_dir / "services.yaml"
if svc_file.exists():
try:
with open(svc_file, "r") as f:
data = yaml.safe_load(f)
host_name = data.get("host")
for svc_name, svc_info in data.get("services", {}).items():
svc_key = f"{host_name}/{svc_name}"
services[svc_key] = {
"node": host_name,
"service": svc_name,
"desired": "running"
}
except Exception as e:
logger.error(f"Failed to load {svc_file}: {e}")
self.desired_state["services"] = services
def _load_actual_state(self):
files = {
"services": WORLD_DIR / "services.json",
"nodes": WORLD_DIR / "nodes.json",
"incidents": WORLD_DIR / "incidents.json"
}
for key, path in files.items():
if path.exists():
try:
with open(path, "r") as f:
self.actual_state[key] = json.load(f)
except Exception as e:
logger.error(f"Failed to load {key} actual state: {e}")
def reconcile(self):
self._load_desired_state()
self._load_actual_state()
drifts = []
# 1. Check for missing or unhealthy services
for svc_key, desired_info in self.desired_state["services"].items():
actual_info = self.actual_state["services"].get(svc_key)
if not actual_info:
drifts.append({
"type": "missing_service",
"svc_key": svc_key,
"node": desired_info["node"],
"service": desired_info["service"]
})
elif actual_info.get("status") != "healthy":
drifts.append({
"type": "unhealthy_service",
"svc_key": svc_key,
"node": desired_info["node"],
"service": desired_info["service"],
"status": actual_info.get("status")
})
# 2. Generate recommendations
for drift in drifts:
self._generate_recommendation(drift)
def _generate_recommendation(self, drift):
action_id = f"reconcile-{int(time.time())}-{drift['node']}-{drift['service']}"
action_path = ACTIONS_DIR / "pending" / f"{action_id}.json"
if action_path.exists():
return # Already recommended
action = {
"action_id": action_id,
"timestamp": time.time(),
"type": "redeploy",
"node": drift["node"],
"service": drift["service"],
"risk_level": "guarded",
"confidence": 0.9,
"description": f"Redeploy {drift['service']} on {drift['node']} due to {drift['type']}",
"status": "pending",
"payload": {
"reason": drift["type"],
"svc_key": drift["svc_key"]
}
}
try:
with open(action_path, "w") as f:
json.dump(action, f, indent=2)
logger.info(f"Generated recommendation: {action_id}")
except Exception as e:
logger.error(f"Failed to save recommendation {action_id}: {e}")
def loop(self, interval=30):
logger.info("Starting supervisor loop")
while True:
self.reconcile()
time.sleep(interval)
if __name__ == "__main__":
supervisor = Supervisor()
supervisor.loop()