Self-Healing

Overview

Klaudia Self-Healing is the autonomous layer of the Komodor AI SRE platform.

While standard Root Cause Analysis identifies why an issue occurred, Self-Healing moves beyond diagnosis to active resolution.

By bridging the gap between detection and remediation, Klaudia can automatically execute precise fixes for recurring Kubernetes incidents, such as restarting stuck pods or rolling back failed deployments, without requiring manual intervention.

How It Helps You

Near-Zero MTTR: Resolve recurring issues instantly as they are detected, even during off-hours.
Reduced Cognitive Load: Automate repetitive "toil" so SREs and DevOps engineers can focus on strategic initiatives rather than manual fixes.
Safe Empowerment: Let developers get issues resolved safely within pre-defined guardrails and RBAC policies, reducing the risk of manual errors.
Consistency: Ensure that identical problems are met with standardized, policy-driven resolutions across your entire infrastructure.

How It Works

Klaudia follows a four-tier remediation model, ordered from least to most automated. Self-Healing is the highest tier:

1. General Command: Klaudia provides manual copy-paste commands for external tools (e.g., AWS CLI).
2. Guided Suggestion: Klaudia suggests a path but requests human input for specific data (e.g., choosing a label).
3. One-Click Remediation: A "Run" button allows you to execute a fix directly from the Komodor UI.
4. Autonomous Self-Healing: Once you authorize a policy, Klaudia detects the issue and executes the "One-Click" action automatically.
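
The tier model above can be read as a simple decision ladder. The following is a minimal sketch of that reading, not Komodor's implementation: the `Fix` fields and the `select_tier` function are hypothetical names introduced only for illustration, and the exact conditions Klaudia evaluates are an assumption.

```python
from dataclasses import dataclass
from enum import IntEnum


class Tier(IntEnum):
    """The four remediation tiers, from least to most automated."""
    GENERAL_COMMAND = 1
    GUIDED_SUGGESTION = 2
    ONE_CLICK = 3
    SELF_HEALING = 4


@dataclass(frozen=True)
class Fix:
    # Hypothetical fields, assumed for this sketch only.
    runs_in_platform: bool      # False if the fix needs an external tool (e.g., AWS CLI)
    needs_user_input: bool      # True if a human must supply data (e.g., choose a label)
    authorized_by_policy: bool  # True if an active Self-Healing policy covers the fix


def select_tier(fix: Fix) -> Tier:
    """Return the highest tier this fix qualifies for under the assumptions above."""
    if not fix.runs_in_platform:
        return Tier.GENERAL_COMMAND
    if fix.needs_user_input:
        return Tier.GUIDED_SUGGESTION
    if fix.authorized_by_policy:
        return Tier.SELF_HEALING
    return Tier.ONE_CLICK
```

In this reading, a policy only promotes a fix that already qualifies as One-Click; it never upgrades a fix that needs an external tool or human input.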

The Safety Engine

Policy-Driven: Actions only trigger if they match strict account-level settings (specific clusters, namespaces, or services).
RBAC Enforcement: Every automated action is executed using platform RBAC credentials. If a user doesn't have the permission to perform the action manually, Klaudia cannot perform it automatically.
Whitelisted Actions: Only safe, predefined operations (like scaling, restarting, or rolling back) are eligible for automation.
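
The three safety checks above compose as a single gate: an action runs automatically only if all of them pass. The sketch below illustrates that composition under stated assumptions. The `Policy` and `ActionRequest` types, the action names, and the tuple-based permission model are all hypothetical and do not reflect Komodor's internal data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    # Hypothetical policy shape: the scope plus the actions the admin allowed.
    clusters: frozenset[str]
    namespaces: frozenset[str]
    allowed_actions: frozenset[str]


@dataclass(frozen=True)
class ActionRequest:
    cluster: str
    namespace: str
    action: str


# Assumed allowlist of safe, predefined operations (names are illustrative).
AUTOMATABLE_ACTIONS = frozenset({"scale", "restart", "rollback"})


def may_auto_execute(
    request: ActionRequest,
    policy: Policy,
    rbac_permissions: set[tuple[str, str, str]],
) -> bool:
    """Return True only if the policy scope, allowlist, and RBAC checks all pass."""
    in_scope = (
        request.cluster in policy.clusters
        and request.namespace in policy.namespaces
    )
    allowlisted = (
        request.action in AUTOMATABLE_ACTIONS
        and request.action in policy.allowed_actions
    )
    # RBAC is modeled here as granted (cluster, namespace, action) tuples.
    permitted = (request.cluster, request.namespace, request.action) in rbac_permissions
    return in_scope and allowlisted and permitted
```

Because the checks are combined with `and`, a policy that lists an action outside the allowlist still cannot trigger it.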

Where to Find It

The Self-Healing feature is integrated into several areas of the Komodor platform:

Self-Healing Policies

Located in Account Settings → Klaudia AI → Self-Healing Policies, this is where admins define the "guardrails" for autonomous actions.

Access Control: This page is visible only to users with the manage:klaudia permission.
Policy Configuration: Admins can create policies that specify covered clusters, namespaces, and specific allowed actions, such as Helm Rollback or Restart Deployment.
Cluster Limits: Each cluster can have only one policy configured at a time.
Guardrails: Admins define which specific resources and actions are eligible for automation to ensure safety.
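
To make the one-policy-per-cluster rule concrete, here is a small sketch of a registry that rejects a second policy for a cluster that is already covered. `SelfHealingPolicy` and `PolicyRegistry` are hypothetical names used only for illustration; policies are configured through the Komodor UI, and this is not a Komodor API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SelfHealingPolicy:
    # Hypothetical shape mirroring the settings described above.
    name: str
    clusters: tuple[str, ...]
    namespaces: tuple[str, ...]
    allowed_actions: tuple[str, ...]


class PolicyRegistry:
    """Keeps at most one policy per cluster."""

    def __init__(self) -> None:
        self._by_cluster: dict[str, SelfHealingPolicy] = {}

    def add(self, policy: SelfHealingPolicy) -> None:
        # Check every cluster first so a rejected policy is not partially registered.
        conflicts = [c for c in policy.clusters if c in self._by_cluster]
        if conflicts:
            raise ValueError(
                f"clusters already covered by a policy: {', '.join(conflicts)}"
            )
        for cluster in policy.clusters:
            self._by_cluster[cluster] = policy

    def for_cluster(self, cluster: str) -> Optional[SelfHealingPolicy]:
        """Return the policy covering this cluster, or None if there is none."""
        return self._by_cluster.get(cluster)
```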

Self-Healing Events Page

Found under the Workload Health page, this dedicated tab provides a central dashboard for all autonomous actions. It allows you to track:

Event Status: Whether a remediation is in progress, resolved, or unsuccessful.
Contextual Links: Direct links to the original RCA and the investigation path that triggered the fix.
Full Application Audit: All autonomous actions are also logged in the General Audit page within the settings area for full transparency and compliance.

Self-Healing Event Drawer

Expanding an event provides a deep dive into the resolution:

Summary: A one-line explanation of the issue and remediation.
Timeline: Step-by-step history from detection to stabilization.
Applied Fix: Details on the action taken, status, and affected resources.
Policy Context: A link to the specific policy that authorized the action.
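
For readers who think in data structures, the drawer fields above map naturally onto a single event record. The following sketch is an assumed model, not Komodor's schema: the class names, field names, and the `record` and `finish` helpers are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum


class EventStatus(Enum):
    # The three states shown on the Self-Healing Events page.
    IN_PROGRESS = "in progress"
    RESOLVED = "resolved"
    UNSUCCESSFUL = "unsuccessful"


@dataclass(frozen=True)
class TimelineStep:
    at: datetime
    description: str


@dataclass
class SelfHealingEvent:
    summary: str                   # one-line explanation of issue and remediation
    applied_fix: str               # the action taken
    affected_resources: list[str]
    policy_name: str               # the policy that authorized the action
    status: EventStatus = EventStatus.IN_PROGRESS
    timeline: list[TimelineStep] = field(default_factory=list)

    def record(self, at: datetime, description: str) -> None:
        """Append one step to the detection-to-stabilization history."""
        self.timeline.append(TimelineStep(at, description))

    def finish(self, succeeded: bool) -> None:
        """Move the event out of the in-progress state."""
        self.status = EventStatus.RESOLVED if succeeded else EventStatus.UNSUCCESSFUL
```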

Overview Page

A dedicated Self-Healing card appears on the Overview page to provide immediate visibility.

Activity Summary: The card displays the number of self-healing events from the last 7 days.
Quick Access: Clicking the card opens a drawer with a filtered list of recent autonomous events.

Resource Timeline

Every autonomous fix is injected directly into the Resource Timeline (Deployment, StatefulSet, DaemonSet, Rollout, Job, or CronJob) as a "Klaudia Self Healing Action" event.

Visibility: These events can be found in a dedicated swimlane titled "Manual Actions and Self-Healing" (if the swimlane exists) and are also listed in the standard events list.

Use Cases

Below are common scenarios showing how Klaudia Self-Healing automates incident resolution.

1. Auto-Rollback Services

Scenario: A new deployment in a testing environment triggers a CrashLoopBackOff or CreateContainerConfigError due to configuration drift or a missing resource.
Policy Context: An active Self-Healing policy is configured for the testing cluster with Helm Rollback enabled as an allowed action.
Klaudia’s Action: Klaudia detects the failure, correlates it with the latest deploy, and automatically initiates a Helm Rollback to the last stable version.
Benefit: This keeps the testing pipeline moving without requiring manual intervention from a developer to revert the change.

2. Auto-Rerun Failed Jobs

Scenario: A critical batch Job fails at 3 AM due to a transient infrastructure issue, such as a node termination.
Policy Context: A "Production Job Policy" is active on the Production cluster, covering all jobs.
Klaudia’s Action: After confirming the root cause is transient and not an application bug, Klaudia triggers a "Rerun" action.
Benefit: The SRE team wakes up to a "Resolved" notification rather than an active incident during off-hours.
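
The two use cases above share one decision pattern: match the incident to a candidate action, then apply it only if the policy allows it. The sketch below illustrates that pattern under stated assumptions. The `Incident` fields, the `node_termination` label, and the action strings are hypothetical, and Klaudia's actual root-cause analysis is far richer than a set lookup.

```python
from dataclasses import dataclass
from typing import Optional

# Failure reasons from use case 1 that suggest reverting the latest deploy.
ROLLBACK_REASONS = frozenset({"CrashLoopBackOff", "CreateContainerConfigError"})
# Assumed labels for transient infrastructure causes (use case 2).
TRANSIENT_CAUSES = frozenset({"node_termination"})


@dataclass(frozen=True)
class Incident:
    kind: str                          # "Deployment" or "Job" in this sketch
    reason: str                        # failure reason or root-cause label
    follows_recent_deploy: bool = False


def choose_action(incident: Incident, allowed_actions: frozenset[str]) -> Optional[str]:
    """Return the action to run automatically, or None if nothing qualifies."""
    if (
        incident.kind == "Deployment"
        and incident.reason in ROLLBACK_REASONS
        and incident.follows_recent_deploy
    ):
        candidate = "Helm Rollback"
    elif incident.kind == "Job" and incident.reason in TRANSIENT_CAUSES:
        candidate = "Rerun"
    else:
        return None
    # The policy has the final word: an action it does not list never runs.
    return candidate if candidate in allowed_actions else None
```

A Job that failed because of an application bug returns `None` here, which corresponds to leaving the incident for a human rather than rerunning it blindly.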

Getting Started

Define Your First Policy: Navigate to Account Settings → Klaudia AI → Self-Healing Policies and click "New Policy".
Set the Scope: Select a cluster and the allowed actions.
Monitor: Track Klaudia's autonomous actions on the Self-Healing Events page.

Important Limitations & Notes

Multi-Command Restriction: Currently, One-Click Remediation supports single commands only. If a fix requires multiple steps (e.g., "Scale down" followed by "Delete PVC"), it will be presented as a Suggestion rather than a One-Click action.
RBAC Enforcement: All remediation actions, whether One-Click or Self-Healing, are executed using the user's or the platform's RBAC credentials. If you do not have permission to perform an action (e.g., delete pod) via kubectl, you cannot execute it through Klaudia.
Guardrails: Self-Healing actions only run if they match the strict policies defined in your account settings (Scope, Allowed Actions, etc.).
Auditing: All actions (manual or automatic) are audited by default, allowing org admins to review them.