The Art of AI SRE Investigation
The following investigation process applies to all resource types covered in this article: workloads, add-ons, ML workflows, and native resources.
Detection & Analysis
- Identifies failures: CrashLoopBackOff, OOMKill, scheduling failures, readiness probe errors, and more.
- Conducts an independent investigation without requiring manual setup or steering.
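As an illustration of what detecting these failure types can look like at the Kubernetes API level, here is a minimal Python sketch that classifies a Pod from its status fields. It is not Klaudia's implementation. The function name `classify_pod_failure` and the returned labels are assumptions made for this example; the status fields it reads (`phase`, `conditions`, `containerStatuses`) are standard Kubernetes Pod status fields.

```python
from typing import Optional


def classify_pod_failure(pod: dict) -> Optional[str]:
    """Return a coarse failure label for a Pod dict, or None if no failure is seen.

    Hypothetical helper for illustration only; labels are not Komodor's.
    """
    status = pod.get("status", {})
    conditions = status.get("conditions", [])

    # Scheduling failures surface on the PodScheduled condition.
    for cond in conditions:
        if (cond.get("type") == "PodScheduled"
                and cond.get("status") == "False"
                and cond.get("reason") == "Unschedulable"):
            return "SchedulingFailure"

    for cs in status.get("containerStatuses", []):
        waiting = cs.get("state", {}).get("waiting") or {}
        terminated = cs.get("state", {}).get("terminated") or {}
        last_terminated = cs.get("lastState", {}).get("terminated") or {}
        # Check OOM first: an OOM-killed container often also reports CrashLoopBackOff.
        if "OOMKilled" in (terminated.get("reason"), last_terminated.get("reason")):
            return "OOMKilled"
        if waiting.get("reason") == "CrashLoopBackOff":
            return "CrashLoopBackOff"

    # Running but not Ready usually points at a failing readiness probe.
    if status.get("phase") == "Running":
        for cond in conditions:
            if cond.get("type") == "Ready" and cond.get("status") == "False":
                return "NotReady"
    return None
```

The OOM check runs before the CrashLoopBackOff check on purpose, so that a container restarting because it was OOM-killed is reported with the more specific label.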
AI-Driven Root Cause Analysis
- Extracts logs, configurations, and errors automatically — no manual copy-paste into the investigation.
- Analyzes new information and refines the investigation.
- Mimics an expert SRE's workflow: hypothesizes, queries data, analyzes, and refines.
- Repeats the process, narrowing down to the root cause.
- References the exact resource where the problem originates, with a full evidence chain.
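The hypothesize, query, analyze, refine loop described above can be sketched generically. The following Python example is an illustration under stated assumptions, not a description of how Klaudia works internally: the `investigate` function, the `checks` mapping, and the sample evidence keys are all hypothetical. Each check inspects the collected evidence and, when confirmed, proposes narrower hypotheses to test next.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A check returns (confirmed, narrower_hypotheses_to_try_next).
Check = Callable[[Dict[str, str]], Tuple[bool, List[str]]]


def investigate(initial: List[str], checks: Dict[str, Check],
                evidence: Dict[str, str],
                max_steps: int = 20) -> Tuple[Optional[str], List[str]]:
    """Test hypotheses in order, narrowing whenever one is confirmed."""
    queue = list(initial)
    chain: List[str] = []   # confirmed hypotheses, broadest first
    tested = set()
    for _ in range(max_steps):
        if not queue:
            break
        hypothesis = queue.pop(0)
        check = checks.get(hypothesis)
        if hypothesis in tested or check is None:
            continue
        tested.add(hypothesis)
        confirmed, narrower = check(evidence)
        if confirmed:
            chain.append(hypothesis)
            queue = list(narrower)  # refine: only pursue narrower causes
    return (chain[-1] if chain else None), chain


# Hypothetical evidence and checks for a container that keeps restarting.
evidence = {"last_state": "OOMKilled", "memory_limit": "64Mi"}
checks: Dict[str, Check] = {
    "container crashes": lambda e: ("last_state" in e, ["bad image", "out of memory"]),
    "bad image": lambda e: (e.get("last_state") == "ImagePullBackOff", []),
    "out of memory": lambda e: (e.get("last_state") == "OOMKilled", ["memory limit too low"]),
    "memory limit too low": lambda e: (e.get("memory_limit") == "64Mi", []),
}
root_cause, evidence_chain = investigate(["container crashes"], checks, evidence)
```

In this sketch the last confirmed hypothesis is treated as the root cause, and the list of confirmed hypotheses plays the role of the evidence chain.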
Suggested Remediation
- Synthesizes findings from multiple sources into a single, coherent conclusion.
- Provides precise, actionable fixes tied to the identified root cause — not generic suggestions.
- Highlights specific misconfigurations, such as YAML syntax errors, incorrect settings, resource limit issues, and missing dependencies, along with cross-resource context.
- Links directly to the affected resource for quick access.
Interactive Troubleshooting
- Supports real-time follow-up questions within any investigation.
- Lets you restart the analysis at any time or branch into adjacent issues without losing context.
One-Click Remediation
- Executes remediation actions with a single click, directly from Klaudia's RCA output.
- Verifies the state of the remediation and triggers other flows if needed.
Optimization and Efficiency
- Indexes investigation context to memory to improve resolution of future incidents.
The Investigated Resources
For each resource, Klaudia runs the same investigation loop: detection, root cause analysis, evidence collection, remediation suggestions, one-click remediation, and interactive follow-up.
For workloads, Klaudia investigates Pods, ReplicaSets, Deployments, Jobs, CronJobs, StatefulSets, and DaemonSets.
- Deployments, StatefulSets, DaemonSets — full RCA on pod failures, scheduling issues, resource limits, configuration errors, and readiness failures across all replica-based workloads.
- Jobs & CronJobs — investigates failed runs, timeout conditions, and retry exhaustion, correlating failures back to config or upstream dependency changes.
For storage, Klaudia investigates PVCs, PVs, and Storage Classes.
For configuration, Klaudia investigates ConfigMaps, Secrets, Resource Quotas, Limit Ranges, HPAs, and PDBs that may impact workload stability, scaling, or availability.
For networking and access control, Klaudia investigates Kubernetes Services, Endpoints, Endpoint Slices, Ingresses, Network Policies, Service Accounts, Roles, Cluster Roles, Role Bindings, and Cluster Role Bindings.
For Kubernetes add-ons, Klaudia treats the add-ons themselves as first-class resources, not just the workloads they manage.
Klaudia investigates operational managers such as Argo CD, Helm, Cert Manager, External DNS, and Autoscalers.
For example, with Argo CD: when a sync completes and a deployment immediately fails, Klaudia links the Argo CD application state to the failing workload and investigates both together.
For CRDs, Klaudia can investigate custom resource definition issues.
For ML & Workflow Systems, Klaudia runs RCA directly on ML and data workflow resources — not just the individual pods they spawn.
Supported: Argo Workflows, Argo Rollouts, Airflow, Spark, and custom workflow CRDs.
Workflow-specific investigation capabilities:
- Runs RCA directly on the workflow resource itself (Argo, Airflow, Spark, custom)
- Automatically fetches all failing pods tied to the workflow run
- Analyzes failures together to correlate root causes across steps and tasks — not as isolated pod failures
- Includes the full workflow resource YAML, including CRDs for Spark and Argo, to produce a more complete RCA
This means a Spark job failing across multiple executors, or an Argo Workflow where three tasks fail for the same underlying reason, produces a single correlated root cause, not three separate alerts.
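To make the idea of correlating failures across workflow steps concrete, here is a minimal Python sketch that groups failing pods by a shared failure signature. It is a simplification for illustration, not Klaudia's correlation logic. The function `correlate_failures` and the input shape (dicts with `name`, `reason`, and `message` keys) are assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def correlate_failures(failed_pods: List[dict]) -> List[dict]:
    """Group failed pods that share the same (reason, message) signature.

    Each input dict is assumed to have 'name', 'reason', and 'message' keys.
    Returns one finding per distinct signature, largest group first.
    """
    groups: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for pod in failed_pods:
        groups[(pod["reason"], pod["message"])].append(pod["name"])

    findings = [
        {"reason": reason, "message": message, "pods": sorted(names)}
        for (reason, message), names in groups.items()
    ]
    # Largest group first; sorted() is stable, so ties keep first-seen order.
    return sorted(findings, key=lambda f: len(f["pods"]), reverse=True)
```

With this grouping, three tasks that fail with the same reason and message collapse into a single finding that lists all three pods, rather than three unrelated findings.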
Investigating Failed Deploys and Workflows
When a deployment or rollout fails, Klaudia goes beyond object-level RCA to trace the failure back to the specific change that introduced it.
Detects Deployment Failures
- Identifies failed deployments and flags them in the Komodor UI.
- Tracks deployment events and failure points.
Provides Timeline & Event Breakdown
- Displays a timeline of deployment events, including pod states like:
  - Pending
  - Running (not ready)
  - Running (ready)
- Shows key failure points (e.g., image update failure).
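The pod states shown in the timeline map naturally onto standard Kubernetes Pod status fields. The sketch below shows one way such labels could be derived from a Pod's `phase` and its `Ready` condition. The function name `timeline_state` is hypothetical, and this is an assumption about how the labels relate to the API, not a description of the Komodor UI's internals.

```python
def timeline_state(pod: dict) -> str:
    """Map a Pod dict to a timeline label based on its phase and Ready condition."""
    status = pod.get("status", {})
    phase = status.get("phase", "Unknown")
    if phase != "Running":
        # Pending, Succeeded, Failed, and Unknown are reported as-is.
        return phase
    ready = any(
        cond.get("type") == "Ready" and cond.get("status") == "True"
        for cond in status.get("conditions", [])
    )
    return "Running (ready)" if ready else "Running (not ready)"
```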
Correlates Events with Changes
- Links failures to configuration changes (e.g., image updates, annotations).
- Helps identify what changed before the failure occurred.
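One simple way to reason about "what changed before the failure" is to look for changes that landed within a time window preceding it. The following Python sketch illustrates that idea. The function `changes_before_failure`, the 30-minute default window, and the change record shape (dicts with `time` and `summary` keys) are all assumptions for this example rather than Komodor's actual data model.

```python
from datetime import datetime, timedelta
from typing import List


def changes_before_failure(changes: List[dict], failure_time: datetime,
                           window: timedelta = timedelta(minutes=30)) -> List[dict]:
    """Return changes inside the window before the failure, most recent first.

    Each change is assumed to be a dict with a 'time' (datetime) and a 'summary'.
    """
    earliest = failure_time - window
    candidates = [c for c in changes if earliest <= c["time"] <= failure_time]
    return sorted(candidates, key=lambda c: c["time"], reverse=True)
```

The most recent change before the failure is only a candidate cause; time proximity narrows the search but does not prove causation on its own.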
Suggests Remediation Steps
- Highlights potential causes:
  - Incorrect image tags
  - Missing dependencies
  - Misconfigured environment variables
- Provides recommendations to fix issues.
Enables Quick Investigation & Fixes
- Offers direct actions like:
  - Restarting the deployment
  - Editing YAML configurations
  - Detecting configuration drift
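Configuration drift, in general terms, means the live state of a resource no longer matches the desired state. A minimal Python sketch of detecting it is a recursive comparison of two nested configuration dicts. The function `find_drift` is hypothetical and is not how Komodor implements drift detection. It assumes both inputs are plain dicts with string keys, such as parsed YAML manifests.

```python
from typing import Any, List, Tuple


def find_drift(desired: dict, live: dict, path: str = "") -> List[Tuple[str, Any, Any]]:
    """Return (field_path, desired_value, live_value) for every differing field."""
    drift: List[Tuple[str, Any, Any]] = []
    for key in sorted(set(desired) | set(live)):
        here = f"{path}.{key}" if path else key
        want, have = desired.get(key), live.get(key)
        if isinstance(want, dict) and isinstance(have, dict):
            # Recurse into nested sections so only leaf differences are reported.
            drift.extend(find_drift(want, have, here))
        elif want != have:
            drift.append((here, want, have))
    return drift
```

A field missing on one side is reported with `None` for that side, which keeps the output uniform at the cost of not distinguishing "absent" from an explicit null.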