Add node capability model #3

Merged
oskar merged 2 commits from capability-model into master 2026-05-11 20:56:47 +02:00
47 changed files with 1373 additions and 20 deletions

85
docs/capabilities.md Normal file

@ -0,0 +1,85 @@
# Node Capability Model
This document defines the capability model for the homelab infrastructure. The goal is to provide a declarative way to describe what each node can do, its constraints, and its suitability for various workloads.
## Overview
Capabilities are defined per host in `hosts/<hostname>/capabilities.yaml`. This metadata allows infrastructure tooling and future AI agents to reason about workload placement, recovery, and compatibility without hardcoding logic into the orchestration system.
## Schema Definition
The `capabilities.yaml` file follows this structure:
```yaml
capabilities:
  hardware:
    cpu:
      arch: <string> # e.g., x86_64, arm64
      cores: <int>
      threads: <int>
    memory:
      total_gb: <int>
    acceleration:
      type: <string> # e.g., none, cuda, tpu, vaapi
      model: <string> # e.g., "NVIDIA RTX 3060", "Coral Edge TPU"
  virtualization:
    supported: <boolean>
    type: <string> # e.g., kvm, docker-only
  storage:
    persistence: <string> # ephemeral, persistent, redundant
    type: <string> # ssd, hdd, nvme, sd-card
    capacity_gb: <int>
  networking:
    reachability: <string> # public, tailscale-only, lan-only
    ingress_suitability: <boolean>
    bandwidth: <string> # e.g., "1Gbps", "100Mbps", "LTE"
  runtime:
    container_engine: <string> # docker, podman, containerd
    os: <string> # debian, ubuntu, alpine, nixos
  operational:
    power_constraint: <string> # low-power, mains, battery-backed
    connectivity: <string> # stable, intermittent
    availability_target: <string> # high, medium, best-effort
  deployment:
    suitability: [<string>] # list of workload types (e.g., ai, database, edge, web)
    restricted: <boolean> # if true, only specific workloads are allowed
```
## Placement Reasoning Examples
### AI Workloads
A service requiring `cuda` acceleration will be matched against nodes where `capabilities.hardware.acceleration.type == "cuda"`.
* **Target:** `solaria`
### Public Ingress
A service requiring public exposure will look for `capabilities.networking.ingress_suitability == true`.
* **Target:** `vps`
### Low-Power Staging
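Staging workloads that should not consume significant power, or that tolerate intermittent connectivity, are matched against nodes where `capabilities.operational.power_constraint == "low-power"` and `capabilities.deployment.suitability` includes `staging`.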
* **Target:** `chelsty`
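The accelerator match described above can be sketched in shell. This is a minimal illustration, not the project's tooling: the function name `nodes_with_acceleration` is hypothetical, and the scan is a naive line-oriented read that assumes `type:` appears directly under `acceleration:` as in the schema. A real implementation would more likely use a YAML parser.

```shell
# Print the hostnames whose capabilities.yaml declares the wanted accelerator type.
# Assumption: files live at <hosts_dir>/<hostname>/capabilities.yaml.
nodes_with_acceleration() {
    hosts_dir=$1
    wanted=$2
    for file in "$hosts_dir"/*/capabilities.yaml; do
        [ -f "$file" ] || continue
        # Look only at the first "type:" key that follows "acceleration:".
        found=$(awk -v wanted="$wanted" '
            $1 == "acceleration:" { in_acc = 1; next }
            in_acc && $1 == "type:" {
                value = $2
                gsub(/"/, "", value)
                if (value == wanted) print "yes"
                in_acc = 0
            }
        ' "$file")
        if [ "$found" = "yes" ]; then
            # The parent directory name is the hostname.
            basename "$(dirname "$file")"
        fi
    done
}
```

For example, `nodes_with_acceleration hosts cuda` run from the repository root would list every host whose capability file sets `acceleration.type` to `cuda`.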
## Recovery Reasoning Examples
### Failover Strategy
If `saturn` (the primary orchestrator) fails:
1. Identify nodes with `roles: [control]` or `roles: [infra]`.
2. Check `capabilities.operational.availability_target == "high"`.
3. Propose migration of critical infra services to `piha`.
### Storage-Bound Services
If a node with `persistence: persistent` fails, the agent must check if there are other nodes with `persistence: persistent` and compatible `storage.type` before attempting recovery, or warn about potential data loss if moved to an `ephemeral` node.
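The storage check above could be expressed as a small decision function. This is a sketch under stated assumptions: the function name and the verdict strings are hypothetical, and "compatible `storage.type`" is interpreted here as "the same type", which the document does not define.

```shell
# Classify a candidate node for recovering a service from a failed persistent node.
# Arguments: failed node storage type, candidate persistence, candidate storage type.
# Prints one of: recover, warn-data-loss, skip.
storage_recovery_verdict() {
    failed_type=$1
    candidate_persistence=$2
    candidate_type=$3

    if [ "$candidate_persistence" = "ephemeral" ]; then
        # Moving persistent data to an ephemeral node risks losing it.
        echo "warn-data-loss"
    elif [ "$candidate_persistence" = "persistent" ] && [ "$candidate_type" = "$failed_type" ]; then
        # Assumption: compatibility means an identical storage.type.
        echo "recover"
    else
        echo "skip"
    fi
}
```

An agent would call this once per candidate and only propose recovery for nodes that yield `recover`, surfacing `warn-data-loss` to the operator instead of acting on it.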
## Future Usage by AI Agents
Future autonomous agents will use this metadata to:
1. **Evaluate Suitability:** Match service requirements (from `service.yaml`) against node capabilities.
2. **Generate Plans:** Create step-by-step deployment or migration plans based on hardware compatibility.
3. **Validate Topology:** Ensure that a proposed multi-node setup doesn't violate networking or operational constraints (e.g., don't put a DB on an intermittent node).
4. **Propose Failover:** Automatically suggest the best alternative node during an outage.

View file

@ -8,23 +8,46 @@ This document describes the GitOps-lite deployment process for the homelab.
 2. **Unidirectional Flow**: Changes flow from **SATURN** (commit node) to execution nodes.
 3. **Lightweight**: No complex orchestrators (no Kubernetes). Use `docker compose` and simple shell scripts.
 4. **Tailscale Mesh**: All hosts are connected via Tailscale, allowing secure communication without public port exposure.
+5. **Host Autonomy**: Services that must operate during WAN or Git outages keep their runtime dependencies on the execution node or local LAN.
-## Deployment Process
-### 1. Preparation (on SATURN)
-- Modify or create service definitions in `services/`.
-- Assign services to hosts by creating/updating `hosts/<hostname>/services.txt` (or similar mapping).
-- Commit and push changes to the Forgejo instance.
-### 2. Deployment (on Execution Node)
-Execution nodes run a deployment script (e.g., via cron or manual trigger) that:
-1. Performs a `git pull` from the source of truth.
-2. Identifies services assigned to this host.
-3. Symlinks or copies `services/<service>/docker-compose.yml` to `/opt/homelab/services/`.
-4. Runs `docker compose up -d --remove-orphans`.
+## Staged Deployment Framework
+The homelab uses a staged deployment framework located at `scripts/deploy/deploy.sh`. This script is designed to be resumable, stage-aware, and observable.
+### Deployment Stages
+1. **prepare**: Pulls the latest changes from Git, validates inventory, and prepares the local environment.
+2. **deploy**: Executes `docker compose` commands for all assigned services.
+3. **verify**: Checks the health and connectivity of deployed services.
+4. **diagnose**: Performs deep checks and resource analysis if something goes wrong.
+5. **rollback**: Reverts to a previous known-good state.
+6. **resume**: Automatically continues from the last successful stage.
+### State Tracking and Logging
+- **State**: Local node state is tracked in `/opt/homelab/state/deploy/current_stage`.
+- **Logs**: Detailed execution logs are stored in `/opt/homelab/logs/deploy/deploy_<timestamp>.log`.
+### Operational Semantics
+Deployment is **hybrid**:
+- **SATURN** acts as the orchestrator and source of truth.
+- **Nodes** execute the deployment locally using the `deploy.sh` script.
+- Human-in-the-loop is required for triggering and confirming deployments.
+### Recovery Workflow
+If a deployment fails:
+1. Run `deploy.sh diagnose` to identify the issue.
+2. Use the `recover-node` AI prompt to analyze logs and get recommendations.
+3. Either fix the issue and run `deploy.sh resume`, or use `deploy.sh rollback`.
+## Onboarding New Nodes
+Refer to `inventory/templates/how_to_add_new_node.yaml` for a detailed guide on adding new hardware to the mesh. The general flow is:
+1. Define node in `hosts/` and `inventory/topology.yaml` on SATURN.
+2. Bootstrap the node (Docker, Tailscale, Git).
+3. Run the staged deployment framework starting with `prepare`.
 ## Host-Local Overrides
@ -33,6 +56,57 @@ If a service requires host-specific configuration (e.g., unique device paths for
1. Create a `docker-compose.override.yml` in `/opt/homelab/config/<service>/`.
2. The deployment script should include this override if it exists.
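One way the deployment script could pick up the override is to build the `-f` arguments conditionally. The sketch below is illustrative only: `compose_file_args` is a hypothetical helper, and it assumes the deployed base file sits at `/opt/homelab/services/<service>/docker-compose.yml`, which this document does not state explicitly.

```shell
# Print the "-f" arguments for `docker compose`, adding the host-local
# override only when it exists. Later -f files override earlier ones.
# Arguments: service name, services root (optional), config root (optional).
compose_file_args() {
    service=$1
    services_root=${2:-/opt/homelab/services}
    config_root=${3:-/opt/homelab/config}

    args="-f $services_root/$service/docker-compose.yml"
    override="$config_root/$service/docker-compose.override.yml"
    if [ -f "$override" ]; then
        args="$args -f $override"
    fi
    echo "$args"
}
```

A caller could then run `docker compose $(compose_file_args mosquitto) up -d`. The unquoted substitution relies on word splitting, so this form only suits paths without spaces.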
For CHELSTY Home Assistant infrastructure, host-local configuration is the
authority for runtime identity, secrets, and local device endpoints:
- Home Assistant config: `/opt/homelab/config/homeassistant`
- Zigbee2MQTT config: `/opt/homelab/config/zigbee2mqtt`
- Mosquitto config: `/opt/homelab/config/mosquitto`
CHELSTY services must not require SATURN, VPS, or Forgejo to be reachable after
deployment has completed. Docker Compose definitions can still come from Git,
but Home Assistant automation, Zigbee control, and MQTT messaging must continue
locally while LTE or Tailscale connectivity is unavailable.
## Exposure Classes
Service inventory may declare one of these exposure classes:
- `local-only`: bind only to host, LAN, or container networks. This is the default for Zigbee2MQTT and Mosquitto.
- `tailscale-internal`: reachable over Tailscale only. This is appropriate for Home Assistant remote administration.
- `public`: reachable from the public internet through a deliberate ingress path, normally the VPS edge role.
Public exposure is not implied by a service existing in Git. It must be explicit
in host inventory and ingress configuration.
## CHELSTY Home Automation Deployment Notes
CHELSTY remains a Docker Compose execution node. No Kubernetes, Helm, Ansible,
or additional orchestration layer is required for Home Assistant infrastructure.
The SLZB-06U coordinator is network-connected over Ethernet or WiFi. Compose
files and host overrides should configure Zigbee2MQTT for a TCP/network
coordinator endpoint, not a USB serial device. Avoid `/dev/ttyUSB0` mappings.
Runtime paths follow the standard layout:
- `/opt/homelab/data/homeassistant`
- `/opt/homelab/config/homeassistant`
- `/opt/homelab/logs/homeassistant`
- `/opt/homelab/data/zigbee2mqtt`
- `/opt/homelab/config/zigbee2mqtt`
- `/opt/homelab/logs/zigbee2mqtt`
- `/opt/homelab/data/mosquitto`
- `/opt/homelab/config/mosquitto`
- `/opt/homelab/logs/mosquitto`
Recommended backup coverage:
- Home Assistant config and persistent data before upgrades or major integration changes.
- Zigbee2MQTT config, database, coordinator backup files, and Zigbee network key material.
- SLZB-06U firmware version, exported configuration, network address reservation, and coordinator state.
- Mosquitto config, ACL/password files, persistence data, and bridge configuration if enabled.
## Secrets Management
- **Do NOT commit secrets to Git.**

51
docs/lifecycle.md Normal file

@ -0,0 +1,51 @@
# Service Lifecycle and Recovery
This document defines the lifecycle of a service in the homelab and the procedures for operational recovery.
## Service Lifecycle
1. **Onboarding**:
   - Create `services/<service>/` directory.
   - Define `docker-compose.yml`, `service.yaml`, `README.md`, `env.example`, and `healthcheck.sh`.
   - Register service in `inventory/topology.yaml` or relevant host configs.
2. **Provisioning**:
   - Ensure `/opt/homelab/data/<service>` exists.
   - Ensure `/opt/homelab/config/<service>` exists and contains required secrets/configs.
   - Setup environment variables from `env.example` into `/opt/homelab/config/<service>/.env`.
3. **Deployment**:
   - `docker compose pull`
   - `docker compose up -d`
4. **Verification**:
   - Run `healthcheck.sh`.
   - Verify ports are reachable according to `service.yaml`.
5. **Maintenance**:
   - Periodic updates via `docker compose pull`.
   - Log monitoring via `docker compose logs -f`.
6. **Decommissioning**:
   - `docker compose down`.
   - Archive `/opt/homelab/data/<service>` if necessary.
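The provisioning step could look roughly like the following. This is a sketch, not the repository's script: `provision_service` is a hypothetical name, and the choices to never overwrite an existing `.env` and to restrict its permissions are assumptions made here to protect secrets already present on the host.

```shell
# Create the runtime directories for a service and seed its .env file.
# Arguments: service name, path to env.example, runtime root (optional).
provision_service() {
    service=$1
    env_example=$2
    root=${3:-/opt/homelab}

    mkdir -p "$root/data/$service" "$root/config/$service"

    env_file="$root/config/$service/.env"
    # Seed .env from the template only once; an existing file may hold real secrets.
    if [ ! -f "$env_file" ] && [ -f "$env_example" ]; then
        cp "$env_example" "$env_file"
        chmod 600 "$env_file"
    fi
}
```

After seeding, the operator still has to replace the placeholder values in `.env` by hand, since `env.example` contains no secrets.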
## Operational Recovery
### 1. Container Failure
If a service is unhealthy:
- Check `docker compose logs`.
- Restart: `docker compose restart`.
- Recreate: `docker compose up -d --force-recreate`.
### 2. Node Failure
If a host node fails:
- Services with `owner_node` matching the failed node must be recovered on a backup node or the node must be restored.
- Persistence data must be restored from backups to `/opt/homelab/data/<service>`.
### 3. Dependency Recovery
If a dependency fails:
- Services depending on it might report unhealthy status.
- Recover the dependency first.
- Re-verify dependent services.
## Persistent Data Conventions
- **Data**: `/opt/homelab/data/<service>` - Primary persistent state.
- **Config**: `/opt/homelab/config/<service>` - Local overrides and secrets.
- **Backups**: Standard backup routines should target `/opt/homelab/data/`.

75
docs/service-model.md Normal file

@ -0,0 +1,75 @@
# Service Model and Healthchecks
This document defines the normalized service model for the homelab.
## Service Layout
Each service must reside in its own directory under `services/`:
```text
services/<service>/
├── docker-compose.yml # Docker Compose definition
├── service.yaml # Service metadata and orchestration contract
├── README.md # Service documentation
├── env.example # Template for required environment variables
└── healthcheck.sh # Standardized healthcheck script
```
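The layout above lends itself to a simple pre-deployment check. The helper below is a hypothetical sketch (the name `check_service_layout` does not come from the repository); it only verifies that the five required files exist and does not validate their contents.

```shell
# Report any file missing from the required services/<service>/ layout.
# Returns 0 when the layout is complete and 1 otherwise.
check_service_layout() {
    service_dir=$1
    missing=0
    for name in docker-compose.yml service.yaml README.md env.example healthcheck.sh; do
        if [ ! -f "$service_dir/$name" ]; then
            echo "missing: $name"
            missing=1
        fi
    done
    return "$missing"
}
```

It could be run in a loop over `services/*/` before a commit, so that an incomplete service definition is caught on SATURN rather than on an execution node.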
## Service Metadata (`service.yaml`)
The `service.yaml` file provides a machine-readable contract for deployment and orchestration.
### Schema
```yaml
service:
  name: <string> # Canonical service name (kebab-case)
  owner_node: <string> # Preferred host node
  exposure: <class> # local-only, tailscale-internal, or public
  dependencies: [<service>] # List of required services
  ports:
    - container: <int>
      host: <int>
      protocol: <tcp|udp>
  healthcheck:
    type: <string> # local-only, container, http, mqtt
    endpoint: <string> # URL or topic if applicable
    interval: <duration>
    timeout: <duration>
    retries: <int>
  restart_policy: <string> # unless-stopped, always, etc.
  persistence:
    paths:
      - /opt/homelab/data/<service>/...
  runtime:
    directories: [<string>] # Required host directories to be created
    env_vars: [<string>] # List of required environment variables (keys only)
```
## Healthcheck Semantics
The `healthcheck.sh` script should return `0` for healthy and `1` for unhealthy. It should support different modes based on `service.yaml` definitions.
### 1. Local-only
Checks if the container is running and the process is alive within the host.
### 2. Container-level
Uses `docker inspect` or `docker exec` to check internal container health.
### 3. HTTP
Performs a `curl` against a specific endpoint (e.g., `/health` or `/`).
### 4. MQTT
Verifies that a specific topic is being updated or responds to a ping.
### 5. Dependency-aware
The healthcheck script may optionally check if its dependencies are healthy before reporting its own status.
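A `healthcheck.sh` that supports these modes might dispatch on the `healthcheck.type` value as sketched below. The function names, the `MQTT_HOST` variable, and the 5-second timeouts are assumptions for illustration; the sketch also assumes `curl` and `mosquitto_sub` are installed where the corresponding modes are used, and it omits the optional dependency-aware mode.

```shell
# Run the probe for one healthcheck type against a target.
# target is a container name, a URL, or an MQTT topic, depending on the type.
probe_service() {
    check_type=$1
    target=$2
    case "$check_type" in
        local-only)
            # Healthy when Docker reports the container as running.
            [ "$(docker inspect --format '{{.State.Running}}' "$target" 2>/dev/null)" = "true" ]
            ;;
        container)
            # Healthy when the container's own HEALTHCHECK reports "healthy".
            [ "$(docker inspect --format '{{.State.Health.Status}}' "$target" 2>/dev/null)" = "healthy" ]
            ;;
        http)
            # -f turns HTTP errors into a non-zero exit status.
            curl -fsS --max-time 5 -o /dev/null "$target"
            ;;
        mqtt)
            # Wait up to 5 seconds for a single message on the topic.
            mosquitto_sub -h "${MQTT_HOST:-localhost}" -t "$target" -C 1 -W 5 >/dev/null
            ;;
        *)
            echo "unknown healthcheck type: $check_type" >&2
            return 1
            ;;
    esac
}

# Normalize every failure to exit status 1, as the contract above requires.
run_healthcheck() {
    if probe_service "$1" "$2"; then
        return 0
    fi
    return 1
}
```

A service's `healthcheck.sh` could end with a call such as `run_healthcheck http http://localhost:8123/`, with the type and endpoint copied from its `service.yaml`.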
## Runtime Authority
`/opt/homelab/config/<service>` is the source of truth for:
- Secrets (not in Git)
- Host-local overrides
- Mutable configuration
Services should mount files from this directory as needed.

View file

@ -19,11 +19,14 @@ This document defines the standards and conventions for the homelab GitOps-lite
 /
 ├── docs/ # Infrastructure documentation
 ├── hosts/ # Host-specific configurations
-│   ├── saturn/
-│   ├── solaria/
-│   ├── piha/
-│   └── vps/
-├── services/ # Reusable service definitions (Docker Compose)
+├── inventory/ # Topology and templates
+├── services/ # Normalized service definitions
+│   └── <service>/
+│       ├── docker-compose.yml
+│       ├── service.yaml
+│       ├── README.md
+│       ├── env.example
+│       └── healthcheck.sh
 ├── scripts/ # Management and deployment scripts
 └── README.md
 ```
@ -37,18 +40,28 @@ Runtime state must live outside the repository to keep it immutable and clean.
 ├── services/ # Active docker-compose files (deployed from git)
 ├── data/ # Persistent volume data (backed up)
 ├── config/ # Host-local overrides and secrets (not in git)
+│   └── <service>/
+│       ├── .env # Merged environment variables
+│       └── overrides/ # Local configuration overrides
 └── logs/ # Service logs
 ```
+## Service Standards
+1. **Normalization**: Every service MUST follow the `services/<service>/` layout.
+2. **Metadata**: Every service MUST have a `service.yaml` defining its operational contract.
+3. **Healthchecks**: Every service MUST have a `healthcheck.sh` for verification.
+4. **Secrets**: NEVER commit secrets to Git. Use `env.example` as a template and populate `/opt/homelab/config/<service>/.env` on the host.
 ## Docker Compose Standards
 1. **File Naming**: Use `docker-compose.yml`.
-2. **Container Naming**: `service-name`.
+2. **Container Naming**: Match the service name.
-3. **Restarts**: Always use `restart: unless-stopped`.
+3. **Restarts**: Always use `restart: unless-stopped` unless specified otherwise in `service.yaml`.
 4. **Networking**:
    - Use `tailscale` internal mesh for inter-host communication.
    - Expose ports only when necessary.
-5. **Volumes**: Use named volumes or absolute paths to `/opt/homelab/data/service-name`.
+5. **Volumes**: Use absolute paths to `/opt/homelab/data/<service>`.
 ## Environment Variables

View file

@ -8,7 +8,7 @@
 | PIHA | Infrastructure and monitoring node |
 | SOLARIA | AI and compute node |
 | VPS | Public ingress and edge node |
-| CHELSTY | Virtualization and Home Assistant node |
+| CHELSTY | LTE-connected edge hypervisor and Home Assistant node |
 ## Architecture Principles
@ -21,6 +21,36 @@
- Deployment uses lightweight shell scripts.
- Avoid Kubernetes and heavy orchestration frameworks.
## CHELSTY Home Automation
CHELSTY hosts the local home automation control plane. Because it uses an LTE
uplink and may be intermittently connected, Home Assistant, Zigbee2MQTT, and
Mosquitto must continue operating without SATURN, VPS, or Forgejo.
The CHELSTY Home Assistant inventory is split across:
- `hosts/chelsty/services.yaml`
- `hosts/chelsty/networking.yaml`
- `hosts/chelsty/paths.yaml`
Service exposure is classified as:
- `local-only`: available only on local host, LAN, or container networks.
- `tailscale-internal`: available to approved Tailscale clients only.
- `public`: available from the public internet through explicit ingress.
Initial CHELSTY service intent:
| Service | Role | Exposure | Offline required |
|---|---|---|---|
| homeassistant | Home automation controller | tailscale-internal | yes |
| zigbee2mqtt | Zigbee to MQTT bridge | local-only | yes |
| mosquitto | Local MQTT broker | local-only | yes |
The Zigbee coordinator is an SLZB-06U network coordinator. It should be modeled
as an Ethernet/WiFi network device consumed by Zigbee2MQTT, not as a USB dongle.
Do not use `/dev/ttyUSB0` or other USB device mappings for this coordinator.
## Runtime Layout
Runtime data should live under:
@ -32,3 +62,12 @@ with separated:
- data
- config
- logs
CHELSTY follows the same layout:
- `/opt/homelab/data/<service>` for persistent service data.
- `/opt/homelab/config/<service>` for host-local configuration and secrets.
- `/opt/homelab/logs/<service>` for logs that should stay outside Git.
Critical backup sets on CHELSTY include Home Assistant config, Zigbee2MQTT
config and network state, Mosquitto config/data, and SLZB-06U coordinator state.

View file

@ -0,0 +1,40 @@
capabilities:
  hardware:
    cpu:
      arch: x86_64
      cores: 4
      threads: 4
    memory:
      total_gb: 16
    acceleration:
      type: none
  virtualization:
    supported: true
    type: kvm
  storage:
    persistence: persistent
    type: ssd
    capacity_gb: 250
  networking:
    reachability: tailscale-only
    ingress_suitability: false
    bandwidth: LTE
  runtime:
    container_engine: docker
    os: debian
  operational:
    power_constraint: low-power
    connectivity: intermittent
    availability_target: best-effort
  deployment:
    suitability:
      - staging
      - homeassistant
      - edge
    restricted: false

View file

@ -0,0 +1,57 @@
host: chelsty
uplink:
  type: lte
  connectivity: intermittent
  public_reachability: not-assumed
tailscale:
  enabled: true
  host_ip: 100.122.201.22
  role: internal-management
exposure_classes:
  local-only:
    description: LAN, host, or container-network access only.
  tailscale-internal:
    description: Tailnet access only; no public ingress dependency.
  public:
    description: Public internet exposure through an explicitly defined ingress host.
networks:
  home_automation_lan:
    purpose: Home Assistant, MQTT, Zigbee coordinator, and local device control.
    offline_required: true
    internet_required_for_core_operation: false
devices:
  slzb-06u:
    role: zigbee-coordinator
    vendor_model: SLZB-06U
    connection_type: network
    transport:
      primary: ethernet
      secondary: wifi
      usb: false
    address:
      hostname: slzb-06u.local
      ipv4: null
      port: 6638
      protocol: tcp
    consumers:
      - zigbee2mqtt
    placement: chelsty-home-automation-lan
    operational_notes:
      - Treat the coordinator as a network appliance, not a USB dongle.
      - Do not configure /dev/ttyUSB0 or other host USB device mappings for this coordinator.
      - Prefer static DHCP or a reserved IP once the LAN addressing plan is known.
    backup:
      recommended: true
      include:
        - coordinator firmware version
        - coordinator configuration export
        - Zigbee network backup from Zigbee2MQTT
        - device IEEE address and network parameters
      notes:
        - Keep a copy of coordinator state with the Zigbee2MQTT backup set.
        - Record the reserved IP or DNS name used by Zigbee2MQTT.

48
hosts/chelsty/paths.yaml Normal file

@ -0,0 +1,48 @@
host: chelsty
runtime_root: /opt/homelab
conventions:
  services: /opt/homelab/services
  data: /opt/homelab/data
  config: /opt/homelab/config
  logs: /opt/homelab/logs
services:
  homeassistant:
    data: /opt/homelab/data/homeassistant
    config: /opt/homelab/config/homeassistant
    logs: /opt/homelab/logs/homeassistant
    backup_priority: critical
  zigbee2mqtt:
    data: /opt/homelab/data/zigbee2mqtt
    config: /opt/homelab/config/zigbee2mqtt
    logs: /opt/homelab/logs/zigbee2mqtt
    backup_priority: critical
  mosquitto:
    data: /opt/homelab/data/mosquitto
    config: /opt/homelab/config/mosquitto
    logs: /opt/homelab/logs/mosquitto
    backup_priority: high
backup_sets:
  homeassistant:
    include:
      - /opt/homelab/config/homeassistant
      - /opt/homelab/data/homeassistant
    restore_note: Restore before starting the Home Assistant container.
  zigbee2mqtt:
    include:
      - /opt/homelab/config/zigbee2mqtt
      - /opt/homelab/data/zigbee2mqtt
    restore_note: Restore before starting Zigbee2MQTT so coordinator and network state remain aligned.
  slzb-06u:
    include:
      - SLZB-06U firmware version
      - SLZB-06U exported configuration
      - Zigbee network backup generated by Zigbee2MQTT
    restore_note: Restore or reconfigure coordinator state before permitting Zigbee2MQTT to reform the network.

108
hosts/chelsty/services.yaml Normal file

@ -0,0 +1,108 @@
host: chelsty
exposure_classes:
  local-only:
    description: Reachable only from CHELSTY-local networks or container networks.
    public_ingress: false
    tailscale_required: false
  tailscale-internal:
    description: Reachable through the Tailscale mesh by approved tailnet clients.
    public_ingress: false
    tailscale_required: true
  public:
    description: Reachable from the public internet through an explicit ingress path.
    public_ingress: true
    tailscale_required: false
operational_constraints:
  uplink: lte
  connectivity: intermittent
  offline_operation_required: true
  must_not_depend_on:
    - saturn
    - vps
    - forgejo
services:
  homeassistant:
    role: home-automation-controller
    deployment_model: docker-compose
    exposure: tailscale-internal
    offline_required: true
    depends_on:
      local:
        - mosquitto
        - zigbee2mqtt
      external: []
    ports:
      - name: http
        container_port: 8123
        protocol: tcp
    runtime:
      config_path: /opt/homelab/config/homeassistant
      data_path: /opt/homelab/data/homeassistant
      logs_path: /opt/homelab/logs/homeassistant
    backup:
      recommended: true
      include:
        - /opt/homelab/config/homeassistant
        - /opt/homelab/data/homeassistant
      notes:
        - Back up before Home Assistant core, supervisor-equivalent, or integration upgrades.
        - Keep local restore copies on CHELSTY because LTE connectivity may be unavailable during recovery.
  zigbee2mqtt:
    role: zigbee-mqtt-bridge
    deployment_model: docker-compose
    exposure: local-only
    offline_required: true
    depends_on:
      local:
        - mosquitto
      external:
        - slzb-06u
    coordinator:
      name: slzb-06u
      connection: network
      usb_device: null
    ports:
      - name: frontend
        container_port: 8080
        protocol: tcp
        exposure: tailscale-internal
    runtime:
      config_path: /opt/homelab/config/zigbee2mqtt
      data_path: /opt/homelab/data/zigbee2mqtt
      logs_path: /opt/homelab/logs/zigbee2mqtt
    backup:
      recommended: true
      include:
        - /opt/homelab/config/zigbee2mqtt
        - /opt/homelab/data/zigbee2mqtt
      notes:
        - Include configuration.yaml, database.db, coordinator backup files, and network key material.
        - Restore Zigbee2MQTT state together with the SLZB-06U coordinator state when replacing hardware.
  mosquitto:
    role: local-mqtt-broker
    deployment_model: docker-compose
    exposure: local-only
    offline_required: true
    depends_on:
      local: []
      external: []
    ports:
      - name: mqtt
        container_port: 1883
        protocol: tcp
    runtime:
      config_path: /opt/homelab/config/mosquitto
      data_path: /opt/homelab/data/mosquitto
      logs_path: /opt/homelab/logs/mosquitto
    backup:
      recommended: true
      include:
        - /opt/homelab/config/mosquitto
        - /opt/homelab/data/mosquitto
      notes:
        - Retain ACL, password, persistence, and bridge configuration if enabled.

View file

@ -0,0 +1,39 @@
capabilities:
  hardware:
    cpu:
      arch: arm64
      cores: 4
      threads: 4
    memory:
      total_gb: 4
    acceleration:
      type: none
  virtualization:
    supported: false
    type: docker-only
  storage:
    persistence: persistent
    type: sd-card
    capacity_gb: 32
  networking:
    reachability: tailscale-only
    ingress_suitability: false
    bandwidth: 1Gbps
  runtime:
    container_engine: docker
    os: debian
  operational:
    power_constraint: mains
    connectivity: stable
    availability_target: medium
  deployment:
    suitability:
      - infra
      - monitoring
    restricted: false

View file

@ -0,0 +1,40 @@
capabilities:
  hardware:
    cpu:
      arch: arm64
      cores: 8
      threads: 8
    memory:
      total_gb: 8
    acceleration:
      type: none
  virtualization:
    supported: false
    type: docker-only
  storage:
    persistence: persistent
    type: sd-card
    capacity_gb: 64
  networking:
    reachability: tailscale-only
    ingress_suitability: false
    bandwidth: 1Gbps
  runtime:
    container_engine: docker
    os: debian
  operational:
    power_constraint: mains
    connectivity: stable
    availability_target: high
  deployment:
    suitability:
      - control
      - development
      - infra
    restricted: false

View file

@ -0,0 +1,41 @@
capabilities:
  hardware:
    cpu:
      arch: x86_64
      cores: 12
      threads: 24
    memory:
      total_gb: 64
    acceleration:
      type: cuda
      model: "NVIDIA RTX 4070"
  virtualization:
    supported: true
    type: kvm
  storage:
    persistence: redundant
    type: nvme
    capacity_gb: 2000
  networking:
    reachability: tailscale-only
    ingress_suitability: false
    bandwidth: 1Gbps
  runtime:
    container_engine: docker
    os: ubuntu
  operational:
    power_constraint: mains
    connectivity: stable
    availability_target: medium
  deployment:
    suitability:
      - ai
      - compute
      - database
    restricted: false

View file

@ -0,0 +1,40 @@
capabilities:
  hardware:
    cpu:
      arch: x86_64
      cores: 2
      threads: 2
    memory:
      total_gb: 4
    acceleration:
      type: none
  virtualization:
    supported: false
    type: docker-only
  storage:
    persistence: persistent
    type: ssd
    capacity_gb: 80
  networking:
    reachability: public
    ingress_suitability: true
    bandwidth: 1Gbps
  runtime:
    container_engine: docker
    os: debian
  operational:
    power_constraint: mains
    connectivity: stable
    availability_target: high
  deployment:
    suitability:
      - edge
      - ingress
      - web
    restricted: true

View file

@ -0,0 +1,29 @@
---
title: How to Add a New Node to the Homelab
description: This guide outlines the process for onboarding a new execution node into the GitOps-lite environment.
phases:
  - phase: 1. Preparation (on SATURN)
    steps:
      - "Define Node Inventory: Create hosts/<hostname>/ directory"
      - "Add host.yaml with hardware metadata"
      - "Add networking.yaml with IP and Tailscale info"
      - "Add capabilities.yaml with node capability description"
      - "Add services.txt listing assigned services"
      - "Update inventory/topology.yaml"
      - "Commit and push changes to Forgejo"
  - phase: 2. Bootstrapping (on the New Node)
    steps:
      - "Install OS (Debian/Ubuntu recommended)"
      - "Configure SSH and user access"
      - "Install Docker, Docker Compose, Tailscale, Git"
      - "Join the tailnet"
      - "Clone repository: git clone <forgejo-url>/homelab-codex.git ~/homelab-codex-ws"
      - "Setup runtime: sudo mkdir -p /opt/homelab/{services,data,config,state,logs} && sudo chown -R $USER:$USER /opt/homelab"
  - phase: 3. Initial Deployment
    steps:
      - "Run prepare: ~/homelab-codex-ws/scripts/deploy/deploy.sh prepare"
      - "Run deploy: ~/homelab-codex-ws/scripts/deploy/deploy.sh deploy"
      - "Run verify: ~/homelab-codex-ws/scripts/deploy/deploy.sh verify"

View file

@ -0,0 +1,29 @@
---
bootstrap_checklist:
  pre_flight:
    - task: "Hardware connected and powered"
      done: false
    - task: "Base OS installed (Debian/Ubuntu)"
      done: false
    - task: "Network connectivity established"
      done: false
    - task: "SSH access configured"
      done: false
  onboarding:
    - task: "Tailscale installed and authenticated"
      done: false
    - task: "Docker and Compose V2 installed"
      done: false
    - task: "Git installed"
      done: false
    - task: "Repository cloned to ~/homelab-codex-ws"
      done: false
    - task: "Opt homelab structure created"
      done: false
  initial_run:
    - task: "deploy.sh prepare successful"
      done: false
    - task: "deploy.sh deploy successful"
      done: false
    - task: "deploy.sh verify successful"
      done: false

View file

@ -0,0 +1,18 @@
---
discovery_commands:
  cpu:
    - "lscpu"
    - "cat /proc/cpuinfo"
  memory:
    - "free -h"
  storage:
    - "lsblk"
    - "df -h"
  network:
    - "ip addr"
    - "tailscale status"
  gpu:
    - "nvidia-smi"
    - "lspci | grep -i vga"
  usb:
    - "lsusb"

View file

@ -0,0 +1,13 @@
---
node_preparation:
  actions:
    - name: update_system
      command: "sudo apt update && sudo apt upgrade -y"
    - name: install_dependencies
      command: "sudo apt install -y curl git docker.io docker-compose-v2 tailscale"
    - name: configure_docker_permissions
      command: "sudo usermod -aG docker $USER"
    - name: create_runtime_directories
      command: "sudo mkdir -p /opt/homelab/{services,data,config,state,logs} && sudo chown -R $USER:$USER /opt/homelab"
    - name: initialize_repo
      command: "git clone <repo_url> ~/homelab-codex-ws"

View file

@ -0,0 +1,13 @@
### System Prompt Addendum: Create Node
**Context**: You are assisting in adding a new node to the homelab.
**Task**: Generate the necessary inventory files for a new node.
**Requirements**:
1. Ask for: hostname, IP address, Tailscale IP, hardware specs (CPU/RAM/Storage), and intended role/services.
2. Generate `hosts/<hostname>/host.yaml` and `hosts/<hostname>/networking.yaml`.
3. Provide a snippet for `inventory/topology.yaml`.
4. Recommend services based on hardware (e.g., if GPU is present, suggest inference services).
**Output Format**: YAML blocks for each file.
**Restriction**: Do NOT execute any shell commands. Only provide the configuration.

View file

@ -0,0 +1,16 @@
### System Prompt Addendum: Deploy Node
**Context**: Orchestrating a deployment across one or more nodes.
**Task**: Generate the deployment plan and verification checklist.
**Requirements**:
1. Identify which nodes need updates based on git changes.
2. Recommend the sequence of stages (e.g., `prepare` on all, then `deploy` on edge nodes first).
3. Generate a human-readable checklist for the operator.
4. Define verification criteria for the `verify` stage.
**Output Format**:
- Deployment Plan (sequence of commands).
- Verification Checklist.
**Restriction**: Do NOT mutate infrastructure autonomously.

View file

@ -0,0 +1,17 @@
### System Prompt Addendum: Recover Node
**Context**: A homelab node is unresponsive or has suffered data loss.
**Task**: Analyze logs and state to recommend recovery steps.
**Requirements**:
1. Request the content of `/opt/homelab/logs/deploy/` (latest log) and `/opt/homelab/state/deploy/current_stage`.
2. Analyze the last failed stage.
3. Recommend specific `deploy.sh` commands (e.g., `rollback` or `resume`).
4. Provide manual recovery steps if automated stages fail.
**Output Format**:
- Analysis of the failure.
- Recommended action.
- Documentation of the recovery process.
**Restriction**: Do NOT auto-execute deployment.

View file

@ -30,6 +30,20 @@ nodes:
  chelsty:
    roles:
      - remote
      - hypervisor
      - homeassistant
      - staging
    connectivity:
      uplink: lte
      intermittent: true
    home_automation:
      offline_operation_required: true
      services:
        - homeassistant
        - zigbee2mqtt
        - mosquitto
      coordinator:
        model: SLZB-06U
        connection: network
        usb: false

110
scripts/deploy/deploy.sh Executable file

@ -0,0 +1,110 @@
#!/usr/bin/env bash
# deploy.sh - Staged deployment framework for homelab nodes.
# Usage: ./deploy.sh [stage]

set -e

# --- Configuration ---
RUNTIME_PATH="/opt/homelab"
STATE_DIR="${RUNTIME_PATH}/state/deploy"
LOG_DIR="${RUNTIME_PATH}/logs/deploy"
REPO_PATH="${HOME}/homelab-codex-ws"
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
LOG_FILE="${LOG_DIR}/deploy_${TIMESTAMP}.log"

# --- Initialization ---
mkdir -p "$STATE_DIR" "$LOG_DIR"

# Send stdout and stderr to both the terminal and the log file
exec > >(tee -a "$LOG_FILE") 2>&1

log() {
    echo "[$(date +'%Y-%m-%d %H:%M:%S')] $1"
}

set_state() {
    echo "$1" > "${STATE_DIR}/current_stage"
    log "State set to: $1"
}

get_state() {
    if [ -f "${STATE_DIR}/current_stage" ]; then
        cat "${STATE_DIR}/current_stage"
    else
        echo "none"
    fi
}

# --- Stages ---

stage_prepare() {
    log "Stage: PREPARE"
    set_state "prepare"
    # Skeleton: Pull latest changes, check dependencies, validate inventory
    log "Checking repository at $REPO_PATH..."
    # Run cd and git pull as separate commands: under `set -e`, a failing cd
    # on the left side of `&&` would not abort the script.
    cd "$REPO_PATH"
    git pull
    log "Preparation complete."
}

stage_deploy() {
    log "Stage: DEPLOY"
    set_state "deploy"
    # Skeleton: Iterate through services and run docker compose
    log "Deploying services defined for $(hostname)..."
    # Implementation detail: loop through services/ and run compose
    log "Deployment complete."
}

stage_verify() {
    log "Stage: VERIFY"
    set_state "verify"
    # Skeleton: Check container status, healthchecks, connectivity
    log "Verifying service health..."
    docker ps
    log "Verification complete."
}

stage_diagnose() {
    log "Stage: DIAGNOSE"
    # Skeleton: Check logs, resource usage, networking
    log "Running diagnostics..."
    docker stats --no-stream
    log "Diagnostics complete."
}

stage_rollback() {
    log "Stage: ROLLBACK"
    # Skeleton: Revert to previous git commit or previous state
    log "Rolling back changes..."
    log "Rollback complete."
}

stage_resume() {
    log "Stage: RESUME"
    # current_stage holds the last stage that was started, so resume
    # continues with the stage that follows it.
    local current
    current=$(get_state)
    log "Resuming from state: $current"
    case "$current" in
        "prepare") stage_deploy ;;
        "deploy") stage_verify ;;
        "verify") log "Last deployment was verified. Nothing to resume." ;;
        *) log "Unknown state or nothing to resume. Starting from prepare..."; stage_prepare ;;
    esac
}

# --- Main ---
COMMAND=${1:-resume}
log "--- Homelab Deployment Started (Command: $COMMAND) ---"

case "$COMMAND" in
    prepare) stage_prepare ;;
    deploy) stage_deploy ;;
    verify) stage_verify ;;
    diagnose) stage_diagnose ;;
    rollback) stage_rollback ;;
    resume) stage_resume ;;
    *) echo "Usage: $0 {prepare|deploy|verify|diagnose|rollback|resume}"; exit 1 ;;
esac

log "--- Homelab Deployment Finished ---"
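
The transitions in `stage_resume` form a small state machine. One way to make them testable without touching `/opt/homelab` is to isolate them in a pure function. This is only a sketch; `next_stage` and the `done` marker are hypothetical names, not part of the script above.

```shell
#!/usr/bin/env bash
# Hypothetical helper mirroring the transitions in stage_resume:
# given the recorded stage, print the stage to run next.
next_stage() {
    case "$1" in
        prepare) echo "deploy" ;;
        deploy)  echo "verify" ;;
        verify)  echo "done" ;;      # nothing left to resume
        *)       echo "prepare" ;;   # unknown or missing state: start over
    esac
}
```

With this in place, `stage_resume` could dispatch on `$(next_stage "$(get_state)")` instead of repeating the mapping inline.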

@ -0,0 +1,9 @@
# Forgejo
Forgejo is a self-hosted lightweight software forge. Easy to install and low maintenance.
## Usage
Deployed on the `saturn` node as the git source of truth.
Web UI is available on port 3000.
SSH for git is available on port 222.

@ -0,0 +1,15 @@
services:
  forgejo:
    image: codeberg.org/forgejo/forgejo:latest
    container_name: forgejo
    restart: unless-stopped
    environment:
      - USER_UID=1000
      - USER_GID=1000
    volumes:
      - /opt/homelab/data/forgejo/data:/data
      - /etc/timezone:/etc/timezone:ro
      - /etc/localtime:/etc/localtime:ro
    ports:
      - '3000:3000'
      - '222:22'

@ -0,0 +1,3 @@
USER_UID=1000
USER_GID=1000
# FORGEJO__database__DB_TYPE=sqlite3

@ -0,0 +1,17 @@
#!/bin/bash
# Healthcheck for Forgejo

# Check if the container is running
if ! docker ps --filter "name=forgejo" --filter "status=running" | grep -q "forgejo"; then
    echo "[FAIL] Forgejo container is not running"
    exit 1
fi

# Check API health endpoint (bounded to 10s so a hung endpoint cannot stall the check)
if ! curl -sf --max-time 10 http://localhost:3000/api/healthz > /dev/null; then
    echo "[FAIL] Forgejo API is not responding"
    exit 1
fi

echo "[OK] Forgejo is healthy"
exit 0

@ -0,0 +1,28 @@
service:
  name: forgejo
  owner_node: saturn
  exposure: private
  dependencies: []
  ports:
    - container: 3000
      host: 3000
      protocol: tcp
    - container: 22
      host: 222
      protocol: tcp
  healthcheck:
    type: http
    endpoint: http://localhost:3000/api/healthz
    interval: 1m
    timeout: 10s
    retries: 5
  restart_policy: unless-stopped
  persistence:
    paths:
      - /opt/homelab/data/forgejo/data
  runtime:
    directories:
      - /opt/homelab/data/forgejo/data
    env_vars:
      - USER_UID
      - USER_GID
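
The `owner_node` field is what ties a service to a host, so the "loop through services/" step that `deploy.sh` leaves as a skeleton could start from a lookup like the one below. This is a sketch only: `services_for_node` is a hypothetical helper, and it assumes a `services/<name>/service.yaml` layout with one `owner_node:` line per file.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list the services whose service.yaml declares the
# given owner_node. Uses grep rather than a YAML parser, so it assumes the
# field sits on a single line of its own.
services_for_node() {
    local root="$1" node="$2" file
    for file in "$root"/*/service.yaml; do
        [ -f "$file" ] || continue
        # Match the owner_node line exactly, ignoring leading indentation.
        if grep -Eq "^[[:space:]]*owner_node:[[:space:]]*${node}[[:space:]]*$" "$file"; then
            basename "$(dirname "$file")"
        fi
    done
}
```

Calling `services_for_node services "$(hostname)"` would then yield the services to deploy on the current machine, provided the hostname matches the node name used in the inventory.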

@ -0,0 +1,9 @@
# Mosquitto MQTT Broker
Eclipse Mosquitto is an open source (EPL/EDL licensed) message broker that implements the MQTT protocol versions 5.0, 3.1.1 and 3.1.
## Usage
Deployed on the `piha` node.
Port 1883 for standard MQTT.
Port 9001 for WebSockets.

@ -0,0 +1,12 @@
services:
  mosquitto:
    image: eclipse-mosquitto:latest
    container_name: mosquitto
    restart: unless-stopped
    ports:
      - '1883:1883'
      - '9001:9001'
    volumes:
      - /opt/homelab/data/mosquitto/config:/mosquitto/config
      - /opt/homelab/data/mosquitto/data:/mosquitto/data
      - /opt/homelab/data/mosquitto/log:/mosquitto/log

@ -0,0 +1,2 @@
# No specific environment variables required by default.
# Mosquitto is mainly configured via /opt/homelab/data/mosquitto/config/mosquitto.conf
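
Because the compose file bind-mounts the host's config directory over `/mosquitto/config`, the broker needs a `mosquitto.conf` on the host before ports 1883 and 9001 are usable. As far as I know, Mosquitto 2.x accepts only local connections unless a listener is configured. The helper below writes a minimal file; its name, the anonymous-access setting and the log file name are assumptions for illustration, not settings taken from this repository.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write a minimal mosquitto.conf to the given path.
write_mosquitto_conf() {
    local target="$1"
    mkdir -p "$(dirname "$target")"
    cat > "$target" <<'EOF'
# Standard MQTT listener
listener 1883
# WebSockets listener
listener 9001
protocol websockets
# ASSUMPTION: anonymous access on a private network only.
# Prefer password_file for anything reachable by untrusted clients.
allow_anonymous true
persistence true
persistence_location /mosquitto/data/
log_dest file /mosquitto/log/mosquitto.log
EOF
}
```

On the node this would be invoked as `write_mosquitto_conf /opt/homelab/data/mosquitto/config/mosquitto.conf` before the first `docker compose up`.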

@ -0,0 +1,17 @@
#!/bin/bash
# Healthcheck for Mosquitto

# Check if the container is running
if ! docker ps --filter "name=mosquitto" --filter "status=running" | grep -q "mosquitto"; then
    echo "[FAIL] Mosquitto container is not running"
    exit 1
fi

# Basic port check for 1883
if ! (echo > /dev/tcp/localhost/1883) >/dev/null 2>&1; then
    echo "[FAIL] Mosquitto port 1883 is not reachable"
    exit 1
fi

echo "[OK] Mosquitto is healthy"
exit 0

@ -0,0 +1,29 @@
service:
  name: mosquitto
  owner_node: piha
  exposure: private
  dependencies: []
  ports:
    - container: 1883
      host: 1883
      protocol: tcp
    - container: 9001
      host: 9001
      protocol: tcp
  healthcheck:
    type: container
    interval: 30s
    timeout: 10s
    retries: 3
  restart_policy: unless-stopped
  persistence:
    paths:
      - /opt/homelab/data/mosquitto/config
      - /opt/homelab/data/mosquitto/data
      - /opt/homelab/data/mosquitto/log
  runtime:
    directories:
      - /opt/homelab/data/mosquitto/config
      - /opt/homelab/data/mosquitto/data
      - /opt/homelab/data/mosquitto/log
    env_vars: []

13
services/npm/README.md Normal file

@ -0,0 +1,13 @@
# Nginx Proxy Manager (NPM)
Expose your services easily and securely with Nginx Proxy Manager.
## Features
- Secure HTTPS via Let's Encrypt
- Easy to use Web UI
- Advanced configuration for power users
## Usage
Deployed on the `vps` node for public ingress.
Web UI is available on port 81.

2
services/npm/env.example Normal file

@ -0,0 +1,2 @@
# No environment variables required for standard NPM deployment.
# Local overrides can be placed in /opt/homelab/config/npm/.env

@ -0,0 +1,17 @@
#!/bin/bash
# Healthcheck for Nginx Proxy Manager

# Check if the container is running
if ! docker ps --filter "name=npm" --filter "status=running" | grep -q "npm"; then
    echo "[FAIL] NPM container is not running"
    exit 1
fi

# Check Web UI responsiveness on port 81 (bounded to 10s so a hung UI cannot stall the check)
if ! curl -sf --max-time 10 http://localhost:81 > /dev/null; then
    echo "[FAIL] NPM Web UI is not responding"
    exit 1
fi

echo "[OK] NPM is healthy"
exit 0

31
services/npm/service.yaml Normal file

@ -0,0 +1,31 @@
service:
  name: npm
  owner_node: vps
  exposure: public
  dependencies: []
  ports:
    - container: 80
      host: 80
      protocol: tcp
    - container: 81
      host: 81
      protocol: tcp
    - container: 443
      host: 443
      protocol: tcp
  healthcheck:
    type: http
    endpoint: http://localhost:81
    interval: 30s
    timeout: 10s
    retries: 3
  restart_policy: unless-stopped
  persistence:
    paths:
      - /opt/homelab/data/npm/data
      - /opt/homelab/data/npm/letsencrypt
  runtime:
    directories:
      - /opt/homelab/data/npm/data
      - /opt/homelab/data/npm/letsencrypt
    env_vars: []
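
The `interval`, `timeout` and `retries` fields describe a retry loop that the healthcheck scripts themselves do not implement. The following sketch shows how a caller might apply them. `retry` is a hypothetical helper, and how the tooling actually interprets these fields is not specified in this change.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command up to <retries> times, sleeping
# <interval> seconds between attempts. Returns 0 on the first success
# and 1 if every attempt fails.
retry() {
    local retries="$1" interval="$2" attempt
    shift 2
    for (( attempt = 1; attempt <= retries; attempt++ )); do
        if "$@"; then
            return 0
        fi
        # Do not sleep after the final attempt.
        if (( attempt < retries )); then
            sleep "$interval"
        fi
    done
    return 1
}
```

For the values above this would read `retry 3 30 curl -sf --max-time 10 http://localhost:81`, with curl's `--max-time` standing in for the `timeout` field.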

13
services/ollama/README.md Normal file

@ -0,0 +1,13 @@
# Ollama
Get up and running with large language models locally.
## Usage
Deployed on the `solaria` node for GPU acceleration.
API is available on port 11434.
Example check:
```bash
curl http://localhost:11434/api/tags
```

@ -0,0 +1,16 @@
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    restart: unless-stopped
    ports:
      - '11434:11434'
    volumes:
      - /opt/homelab/data/ollama:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

@ -0,0 +1,2 @@
# No specific environment variables required by default.
# CUDA_VISIBLE_DEVICES=0

@ -0,0 +1,17 @@
#!/bin/bash
# Healthcheck for Ollama

# Check if the container is running
if ! docker ps --filter "name=ollama" --filter "status=running" | grep -q "ollama"; then
    echo "[FAIL] Ollama container is not running"
    exit 1
fi

# Check API responsiveness (bounded to 10s so a hung API cannot stall the check)
if ! curl -sf --max-time 10 http://localhost:11434/api/tags > /dev/null; then
    echo "[FAIL] Ollama API is not responding"
    exit 1
fi

echo "[OK] Ollama is healthy"
exit 0
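
All of the per-service healthchecks share the same contract: print a `[OK]` or `[FAIL]` line and exit 0 or 1. That makes them easy to aggregate, for example from the `verify` stage of `deploy.sh`. Below is a sketch of such a runner. `run_healthchecks` is hypothetical, and the `healthcheck.sh` file name is an assumption, since the file names are not shown here.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run every <root>/<service>/healthcheck.sh and report
# how many failed. Succeeds only if all checks passed.
run_healthchecks() {
    local root="$1" script failed=0
    for script in "$root"/*/healthcheck.sh; do
        [ -f "$script" ] || continue
        if ! bash "$script"; then
            failed=$(( failed + 1 ))
        fi
    done
    echo "failed checks: $failed"
    [ "$failed" -eq 0 ]
}
```

Every check runs even after an earlier one fails, so the operator sees the full picture in a single pass.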

@ -0,0 +1,23 @@
service:
  name: ollama
  owner_node: solaria
  exposure: private
  dependencies: []
  ports:
    - container: 11434
      host: 11434
      protocol: tcp
  healthcheck:
    type: http
    endpoint: http://localhost:11434/api/tags
    interval: 1m
    timeout: 10s
    retries: 3
  restart_policy: unless-stopped
  persistence:
    paths:
      - /opt/homelab/data/ollama
  runtime:
    directories:
      - /opt/homelab/data/ollama
    env_vars: []

@ -0,0 +1,10 @@
# Zigbee2MQTT
Zigbee to MQTT bridge, get rid of your proprietary Zigbee bridges.
## Usage
Deployed on the `piha` node.
Requires a Zigbee adapter (e.g., Sonoff ZBDongle-E) mapped to `/dev/ttyACM0`.
Frontend is available on port 8080.
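
Since the container maps a host device, it cannot start when the adapter is unplugged or enumerates under a different path. A pre-flight check gives a clearer message than the resulting Docker error. This is a minimal sketch; `adapter_present` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check: succeed only if the path is a character device,
# which is how serial adapters such as /dev/ttyACM0 appear on Linux.
adapter_present() {
    [ -c "$1" ]
}

# Example usage (commented out so the sketch has no side effects):
# adapter_present /dev/ttyACM0 || echo "Zigbee adapter not found at /dev/ttyACM0" >&2
```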

@ -0,0 +1,14 @@
services:
  zigbee2mqtt:
    container_name: zigbee2mqtt
    image: koenkk/zigbee2mqtt:latest
    restart: unless-stopped
    volumes:
      - /opt/homelab/data/zigbee2mqtt/data:/app/data
      - /run/udev:/run/udev:ro
    ports:
      - '8080:8080'
    devices:
      - /dev/ttyACM0:/dev/ttyACM0
    environment:
      - TZ=Europe/Stockholm

@ -0,0 +1,3 @@
TZ=Europe/Stockholm
# MQTT credentials if applicable
# Z2M_MQTT_SERVER=mqtt://mosquitto:1883

@ -0,0 +1,17 @@
#!/bin/bash
# Healthcheck for Zigbee2MQTT

# Check if the container is running
if ! docker ps --filter "name=zigbee2mqtt" --filter "status=running" | grep -q "zigbee2mqtt"; then
    echo "[FAIL] Zigbee2MQTT container is not running"
    exit 1
fi

# Check frontend responsiveness (bounded to 10s so a hung frontend cannot stall the check)
if ! curl -sf --max-time 10 http://localhost:8080 > /dev/null; then
    echo "[FAIL] Zigbee2MQTT frontend is not responding"
    exit 1
fi

echo "[OK] Zigbee2MQTT is healthy"
exit 0

@ -0,0 +1,25 @@
service:
  name: zigbee2mqtt
  owner_node: piha
  exposure: private
  dependencies:
    - mosquitto
  ports:
    - container: 8080
      host: 8080
      protocol: tcp
  healthcheck:
    type: http
    endpoint: http://localhost:8080
    interval: 30s
    timeout: 10s
    retries: 3
  restart_policy: unless-stopped
  persistence:
    paths:
      - /opt/homelab/data/zigbee2mqtt/data
  runtime:
    directories:
      - /opt/homelab/data/zigbee2mqtt/data
    env_vars:
      - TZ
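
The `dependencies` list implies a start order: `mosquitto` has to be up before `zigbee2mqtt`. A topological sort is one way to derive that order, and coreutils ships `tsort` for this purpose. Below is a sketch; `start_order` and its "service dependency" input format are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read "service dependency" pairs on stdin and print a
# start order in which every dependency precedes the services that need it.
start_order() {
    # tsort expects "before after" pairs, so swap the two columns.
    awk 'NF == 2 { print $2, $1 }' | tsort
}

# Example usage (commented out so the sketch prints nothing when sourced):
# printf 'zigbee2mqtt mosquitto\n' | start_order
```

A service without dependencies can be included by listing it as a pair with itself, for example `forgejo forgejo`, which `tsort` treats as a node without edges.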