Incident Remediation Guide

This guide is for operators who use the Control Plane to respond to system incidents.

Remediation Flow

1. Detection

Incidents appear in the Active Incidents card on the Dashboard and in the Events timeline.

2. Correlation

Use the Correlation view to see:

  • The event chain leading to the incident.
  • Automated recommendations generated in response.
  • Any manual actions already taken.
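
As a rough mental model, the three items above can be pictured as one record per incident. The sketch below is illustrative only: `CorrelationRecord`, its fields, and `pending_recommendations` are hypothetical names, not part of the Control Plane, and it assumes that a manual action and a recommendation can be matched by name.

```python
from dataclasses import dataclass, field
from typing import List


# Hypothetical model: none of these names come from the Control Plane itself.
@dataclass
class CorrelationRecord:
    incident_id: str
    # Events leading to the incident, oldest first.
    event_chain: List[str] = field(default_factory=list)
    # Automated recommendations generated in response.
    recommendations: List[str] = field(default_factory=list)
    # Manual actions operators have already taken.
    manual_actions: List[str] = field(default_factory=list)


def pending_recommendations(record: CorrelationRecord) -> List[str]:
    """Return recommendations with no matching manual action (assumes shared names)."""
    taken = set(record.manual_actions)
    return [r for r in record.recommendations if r not in taken]
```

Reading the view this way makes it easier to spot recommendations that nobody has acted on before moving to intervention.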

3. Intervention

  1. Review the recommended actions in the Action Queue.
  2. If the automated recommendations are not sufficient, use the Nodes or Services view to trigger commands manually.
  3. Watch the Runtime Topology during remediation to confirm that no cascading failures occur.

4. Verification

Once actions are completed, verify the system state:

  • Health badges should transition back to Nominal.
  • The System Status in the sidebar should reflect a healthy state.
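
If verification needs to be scripted rather than watched by eye, the badge check can be sketched as a polling loop. This is a minimal sketch under stated assumptions: `fetch_badges` is a hypothetical callable that you would supply to return the current badge state per component, since this guide does not describe a Control Plane API for reading badges. Only the `Nominal` state name comes from the guide.

```python
import time
from typing import Callable, Dict

NOMINAL = "Nominal"  # the only badge state named in this guide


def all_nominal(badges: Dict[str, str]) -> bool:
    """True only if there is at least one badge and every badge is Nominal."""
    return bool(badges) and all(state == NOMINAL for state in badges.values())


def wait_until_nominal(
    fetch_badges: Callable[[], Dict[str, str]],  # hypothetical: supplied by the caller
    timeout_s: float = 300.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until every badge is Nominal, or give up once the timeout has passed."""
    deadline = clock() + timeout_s
    while True:
        if all_nominal(fetch_badges()):
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

An empty badge set is treated as unverified rather than healthy, so a failed fetch that returns nothing does not pass as a recovery.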