agent-system/docs/operator/incident-remediation.md

25 lines
928 B
Markdown
Raw Permalink Normal View History

2026-05-12 18:01:37 +02:00
# Incident Remediation Guide
Guide for operators responding to system incidents using the Control Plane.
## Remediation Flow
### 1. Detection
Incidents appear in the **Active Incidents** card on the Dashboard and in the **Events** timeline.
### 2. Correlation
Use the **Correlation** view to see:
- The event chain leading to the incident.
- Automated recommendations generated in response.
- Any manual actions already taken.
### 3. Intervention
1. Review the recommended actions in the **Action Queue**.
2. If the automated recommendation is not sufficient, use the **Nodes** or **Services** view to manually trigger commands.
3. Observe the **Runtime Topology** to ensure no cascading failures occur during remediation.
### 4. Verification
Once actions are completed, verify the system state:
- Health badges should transition back to **Nominal**.
- The **System Status** in the sidebar should reflect a healthy state.