How Klaudia Investigates

The Art of AI SRE Investigation

The following investigation process applies to all resource types listed below: workloads, add-ons, ML workflows, and native resources.

Detection & Analysis

  • Identifies failures: CrashLoopBackOff, OOMKill, scheduling failures, readiness probe errors, and more.
  • Conducts an independent investigation without requiring manual setup or steering.
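To make the failure types above concrete, here is a minimal Python sketch of how they appear in a Pod's `status` section. It is an illustration only and not Klaudia's detection logic. The status fields it reads (`conditions`, `containerStatuses`, `state.waiting.reason`, `lastState.terminated.reason`, `ready`) are standard Kubernetes Pod status fields; the function name `detect_failures`, the returned labels, and the sample pod are hypothetical.

```python
from typing import Any


def detect_failures(pod: dict[str, Any]) -> list[str]:
    """Return failure labels found in a Pod's `status` section (illustrative only)."""
    status = pod.get("status", {})
    failures: list[str] = []

    # Scheduling failure: the PodScheduled condition is reported as False.
    for cond in status.get("conditions", []):
        if cond.get("type") == "PodScheduled" and cond.get("status") == "False":
            failures.append("SchedulingFailure")

    for cs in status.get("containerStatuses", []):
        waiting = cs.get("state", {}).get("waiting") or {}
        last_term = cs.get("lastState", {}).get("terminated") or {}
        running = cs.get("state", {}).get("running")

        if waiting.get("reason") == "CrashLoopBackOff":
            failures.append("CrashLoopBackOff")
        if last_term.get("reason") == "OOMKilled":
            failures.append("OOMKill")
        # A container that is running but not ready often has a failing readiness probe.
        if running is not None and not cs.get("ready", False):
            failures.append("NotReady")

    return failures


# Hypothetical pod status used only for this example.
pod = {
    "status": {
        "conditions": [{"type": "PodScheduled", "status": "True"}],
        "containerStatuses": [
            {
                "name": "api",
                "ready": False,
                "state": {"waiting": {"reason": "CrashLoopBackOff"}},
                "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}},
            }
        ],
    }
}

print(detect_failures(pod))  # ['CrashLoopBackOff', 'OOMKill']
```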

AI-Driven Root Cause Analysis

  • Extracts logs, configurations, and errors automatically — no manual copy-paste into the investigation.
  • Analyzes new information and refines the investigation.
  • Mimics an expert SRE's workflow: hypothesizes, queries data, analyzes, and refines.
  • Repeats the process, narrowing down to the root cause.
  • References the exact resource where the problem originates, with a full evidence chain.
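The hypothesize, query, analyze, and refine loop described above can be pictured with a small toy model. The Python sketch below is an assumption-laden illustration, not Klaudia's implementation: the `Hypothesis` class, the `investigate` function, the evidence keys, and the resource names are all hypothetical. It keeps the first hypothesis the evidence supports, then moves on to that hypothesis's follow-ups, and returns the resource of the deepest supported hypothesis together with the chain that led there.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Hypothesis:
    name: str                          # human-readable statement being tested
    resource: str                      # resource this hypothesis points at
    check: Callable[[dict], bool]      # True if the evidence supports the hypothesis
    follow_ups: list["Hypothesis"] = field(default_factory=list)


def investigate(
    evidence: dict, hypotheses: list[Hypothesis]
) -> Optional[tuple[str, list[str]]]:
    """Narrow down step by step: pick a supported hypothesis, then refine it."""
    chain: list[str] = []
    root: Optional[Hypothesis] = None
    current = hypotheses
    while current:
        supported = next((h for h in current if h.check(evidence)), None)
        if supported is None:
            break  # nothing at this level is supported; stop refining
        chain.append(supported.name)
        root = supported
        current = supported.follow_ups
    if root is None:
        return None
    return root.resource, chain


# Hypothetical evidence and hypotheses for an OOM-killed container.
evidence = {"last_exit_reason": "OOMKilled", "memory_limit_mi": 128, "peak_memory_mi": 310}

limit_too_low = Hypothesis(
    name="memory limit below observed peak usage",
    resource="Deployment/api",
    check=lambda e: e["peak_memory_mi"] > e["memory_limit_mi"],
)
oom = Hypothesis(
    name="container was OOM-killed",
    resource="Pod/api-7d9f",
    check=lambda e: e["last_exit_reason"] == "OOMKilled",
    follow_ups=[limit_too_low],
)
bad_image = Hypothesis(
    name="image pull failed",
    resource="Pod/api-7d9f",
    check=lambda e: e.get("waiting_reason") == "ImagePullBackOff",
)

result = investigate(evidence, [bad_image, oom])
print(result)
# ('Deployment/api', ['container was OOM-killed', 'memory limit below observed peak usage'])
```

The returned chain plays the role of the evidence chain: each entry is a step that the collected data supported on the way to the originating resource.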

Suggested Remediation

  • Synthesizes findings from multiple sources into a single, coherent conclusion.
  • Provides precise, actionable fixes tied to the identified root cause — not generic suggestions.
  • Highlights specific misconfigurations, such as YAML syntax errors, incorrect settings, resource limit issues, and missing dependencies, along with cross-resource context.
  • Links directly to the affected resource for quick access.

Interactive Troubleshooting

  • Supports real-time follow-up questions within any investigation.
  • Lets you restart the analysis at any time or branch into adjacent issues without losing context.

One-Click Remediation

  • Executes remediation actions with a single click, directly from Klaudia's RCA output.
  • Verifies the state of the remediation and triggers other flows if needed.

Optimization and Efficiency

  • Indexes investigation context to memory, improving resolution of future incidents.

The Investigated Resources

For each resource, Klaudia runs the same investigation loop: detection, root cause analysis, evidence collection, remediation suggestions, one-click remediation, and interactive follow-up.

  • For workloads, Klaudia investigates Pods, ReplicaSets, Deployments, Jobs, CronJobs, StatefulSets, and DaemonSets.

    • Deployments, StatefulSets, DaemonSets — full RCA on pod failures, scheduling issues, resource limits, configuration errors, and readiness failures across all replica-based workloads.
    • Jobs & CronJobs — investigates failed runs, timeout conditions, and retry exhaustion, correlating failures back to config or upstream dependency changes.  

  • For storage, Klaudia investigates PVCs, PVs, and Storage Classes.

  • For configuration, Klaudia investigates ConfigMaps, Secrets, Resource Quotas, Limit Ranges, HPAs, and PDBs that may impact workload stability, scaling, or availability.

  • For networking and access control, Klaudia investigates Kubernetes Services, Endpoints, Endpoint Slices, Ingresses, Network Policies, Service Accounts, Roles, Cluster Roles, Role Bindings, and Cluster Role Bindings.

  • For Kubernetes add-ons, Klaudia investigates the add-ons themselves as first-class resources, not just the workloads they manage.
    This covers operational managers such as Argo CD, Helm, Cert Manager, External DNS, and Autoscalers.
    For example, with Argo CD: when a sync completes and a deployment immediately fails, Klaudia links the Argo CD application state to the failing workload and investigates both together.

  • For CRDs, Klaudia can investigate custom resource definition issues.

  • For ML & Workflow Systems, Klaudia runs RCA directly on ML and data workflow resources — not just the individual pods they spawn. 
    Supported: Argo Workflows, Argo Rollouts, Airflow, Spark, and custom workflow CRDs.
    Workflow-specific investigation capabilities:

    • Runs RCA directly on the workflow resource itself (Argo, Airflow, Spark, custom)
    • Automatically fetches all failing pods tied to the workflow run
    • Analyzes failures together to correlate root causes across steps and tasks — not as isolated pod failures
    • Includes the full workflow resource YAML, including CRDs for Spark and Argo, to produce a more complete RCA
    • This means a Spark job failing across multiple executors, or an Argo Workflow where three tasks fail for the same underlying reason, produces a single correlated root cause — not three separate alerts.
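The idea of analyzing failing pods together rather than in isolation can be sketched as grouping by a shared failure signature. The Python example below is a simplified illustration under stated assumptions, not how Klaudia correlates failures: the function `correlate_failures`, the record keys (`pod`, `reason`, `detail`), and the sample data are hypothetical. `CreateContainerConfigError` is a real Kubernetes container waiting reason, used here only as sample input.

```python
from collections import defaultdict


def correlate_failures(pod_failures: list[dict]) -> list[dict]:
    """Group the failing pods of one workflow run by (reason, detail)."""
    groups: dict[tuple[str, str], list[str]] = defaultdict(list)
    for failure in pod_failures:
        signature = (failure["reason"], failure["detail"])
        groups[signature].append(failure["pod"])

    # Largest group first: the most widely shared signature is the likeliest common cause.
    ordered = sorted(groups.items(), key=lambda item: -len(item[1]))
    return [
        {"reason": reason, "detail": detail, "pods": sorted(pods)}
        for (reason, detail), pods in ordered
    ]


# Hypothetical workflow run where three tasks fail for the same underlying reason.
failures = [
    {"pod": "etl-extract-1", "reason": "CreateContainerConfigError", "detail": 'secret "db-creds" not found'},
    {"pod": "etl-transform-1", "reason": "CreateContainerConfigError", "detail": 'secret "db-creds" not found'},
    {"pod": "etl-load-1", "reason": "CreateContainerConfigError", "detail": 'secret "db-creds" not found'},
]

correlated = correlate_failures(failures)
print(len(correlated), correlated[0]["pods"])
# 1 ['etl-extract-1', 'etl-load-1', 'etl-transform-1']
```

Three failing pods collapse into one entry, which mirrors the single correlated root cause described above. A real correlation would draw on far more than an exact string match, for example the workflow resource YAML and its CRDs.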

Investigating Failed Deploys and Workflows

When a deployment or rollout fails, Klaudia goes beyond object-level RCA to trace the failure back to the specific change that introduced it.

  • Detects Deployment Failures

    • Identifies failed deployments and flags them in the Komodor UI.
    • Tracks deployment events and failure points.

  • Provides Timeline & Event Breakdown

    • Displays a timeline of deployment events, including pod states such as:
      • Pending
      • Running (not ready)
      • Running (ready)
    • Shows key failure points (e.g., image update failure).

  • Correlates Events with Changes

    • Links failures to configuration changes (e.g., image updates, annotations).
    • Helps identify what changed before the failure occurred.

  • Suggests Remediation Steps

    • Highlights potential causes:
      • Incorrect image tags
      • Missing dependencies
      • Misconfigured environment variables
    • Provides recommendations to fix issues.

  • Enables Quick Investigation & Fixes

    • Offers direct actions such as:
      • Restarting the deployment
      • Editing YAML configurations
      • Detecting configuration drift
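Correlating a failure with the changes that preceded it can be illustrated as a time-window lookup. The Python sketch below is an assumption for illustration, not Komodor's or Klaudia's actual logic: the function `changes_before_failure`, the 30-minute default window, the record keys, and the change history are all hypothetical. It returns the changes that landed shortly before the failure, most recent first.

```python
from datetime import datetime, timedelta


def changes_before_failure(changes, failure_time, window=timedelta(minutes=30)):
    """Return changes inside the lookback window before the failure, most recent first."""
    candidates = [
        c for c in changes
        if failure_time - window <= c["time"] <= failure_time
    ]
    return sorted(candidates, key=lambda c: c["time"], reverse=True)


# Hypothetical change history for a single Deployment.
changes = [
    {"time": datetime(2025, 3, 17, 9, 0), "kind": "annotation", "summary": "owner annotation added"},
    {"time": datetime(2025, 3, 17, 13, 40), "kind": "env", "summary": "environment variable changed"},
    {"time": datetime(2025, 3, 17, 13, 45), "kind": "image", "summary": "image tag changed"},
    {"time": datetime(2025, 3, 17, 14, 5), "kind": "replicas", "summary": "replica count changed"},
]
failure_time = datetime(2025, 3, 17, 13, 49)

suspects = changes_before_failure(changes, failure_time)
print([c["kind"] for c in suspects])  # ['image', 'env']
```

The morning annotation falls outside the window and the replica change happened after the failure, so only the image and environment changes remain as candidates. The closest change before the failure is listed first.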
