Overview
Klaudia is Komodor's AI SRE — available across all investigation surfaces in the platform. This page covers how each capability works in practice: workload and deployment investigation, full-stack RCA across Kubernetes core, compute, networking, GitOps, data services, and ML workloads, chat-based troubleshooting, autonomous remediation, organizational context, and external integrations.
Klaudia can be enabled or disabled per cluster.
For more complex cases, Klaudia may provide further details about the investigation in progress and show a delay banner.
What Klaudia Investigates
For each resource, Klaudia runs the same investigation loop: detection, root cause analysis, evidence collection, remediation suggestions, one-click remediation, and interactive follow-up.
For workloads, Klaudia investigates Pods, ReplicaSets, Deployments, Jobs, CronJobs, StatefulSets, and DaemonSets.
- Deployments, StatefulSets, DaemonSets — full RCA on pod failures, scheduling issues, resource limits, configuration errors, and readiness failures across all replica-based workloads.
- Jobs & CronJobs — investigates failed runs, timeout conditions, and retry exhaustion, correlating failures back to config or upstream dependency changes.
For storage, Klaudia investigates PVCs, PVs, and Storage Classes.
For configuration, Klaudia investigates ConfigMaps, Secrets, Resource Quotas, Limit Ranges, HPAs, and PDBs that may impact workload stability, scaling, or availability.
For networking and access control, Klaudia investigates Kubernetes Services, Endpoints, Endpoint Slices, Ingresses, Network Policies, Service Accounts, Roles, Cluster Roles, Role Bindings, and Cluster Role Bindings.
For Kubernetes add-ons, Klaudia investigates the add-ons themselves as first-class resources, not just the workloads they manage.
Klaudia investigates operational managers such as Argo CD, Helm, Cert Manager, External DNS, and Autoscalers.
For example, with Argo CD: when a sync completes and a deployment immediately fails, Klaudia links the Argo CD application state to the failing workload and investigates both together.
For CRDs, Klaudia can investigate custom resource definition issues.
For ML & Workflow Systems, Klaudia runs RCA directly on ML and data workflow resources — not just the individual pods they spawn.
Supported: Argo Workflows, Argo Rollouts, Airflow, Spark, and custom workflow CRDs.
Workflow-specific investigation capabilities:
- Runs RCA directly on the workflow resource itself (Argo, Airflow, Spark, custom)
- Automatically fetches all failing pods tied to the workflow run
- Analyzes failures together to correlate root causes across steps and tasks — not as isolated pod failures
- Includes the full workflow resource YAML, including CRDs for Spark and Argo, to produce a more complete RCA
This means a Spark job failing across multiple executors, or an Argo Workflow where three tasks fail for the same underlying reason, produces a single correlated root cause, not three separate alerts.
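As a rough illustration of how several pod failures from one workflow run can collapse into a single finding, consider the sketch below. It is not Klaudia's implementation: the `PodFailure` type, the `signature` normalization, and the sample messages are hypothetical and exist only to show the grouping idea.

```python
import re
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class PodFailure:
    """Hypothetical record of one failing pod in a workflow run."""
    pod: str
    message: str


def signature(message: str) -> str:
    """Normalize a message so pod-specific details do not split groups."""
    # Replace digit runs (ports, executor ids, counters) with a placeholder.
    return re.sub(r"\d+", "N", message.lower()).strip()


def correlate(failures: list[PodFailure]) -> dict[str, list[str]]:
    """Group failing pods by their normalized failure signature."""
    groups: dict[str, list[str]] = defaultdict(list)
    for failure in failures:
        groups[signature(failure.message)].append(failure.pod)
    return dict(groups)


# Three tasks of one (made-up) workflow run fail for the same reason.
failures = [
    PodFailure("etl-step-1", "Connection refused to db:5432"),
    PodFailure("etl-step-2", "connection refused to db:5432"),
    PodFailure("etl-step-3", "Connection refused to db:5432"),
]
findings = correlate(failures)
print(findings)
```

With this input the three failures share one signature, so the result holds a single group listing all three pods rather than three separate entries.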
How Klaudia Investigates
The following investigation process applies to all resource types listed above — workloads, add-ons, ML workflows, and native resources.
Detection & Analysis
- Identifies failures: CrashLoopBackOff, OOMKill, scheduling failures, readiness probe errors, and more.
- Conducts an independent investigation without requiring manual setup or steering.
AI-Driven Root Cause Analysis
- Extracts logs, configurations, and errors automatically — no manual copy-paste into the investigation.
- Analyzes new information and refines the investigation.
- Mimics an expert SRE's workflow: hypothesizes, queries data, analyzes, and refines.
- Repeats the process, narrowing down to the root cause.
- References the exact resource where the problem originates, with a full evidence chain.
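The hypothesize, query, analyze, and refine cycle listed above can be pictured as a loop that keeps narrowing a confirmed claim. The sketch below is only a mental model under stated assumptions: the `Hypothesis` type, the evidence keys, and the sample values are hypothetical and do not describe Klaudia's internals.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Hypothesis:
    """Hypothetical node: a claim, a test over evidence, and refinements."""
    claim: str
    holds: Callable[[dict], bool]
    refinements: list["Hypothesis"] = field(default_factory=list)


def investigate(candidates: list[Hypothesis], evidence: dict) -> list[str]:
    """Return the chain of confirmed claims, from symptom to root cause."""
    chain: list[str] = []
    while candidates:
        # Test each candidate against the evidence; keep the first that holds.
        confirmed = next((h for h in candidates if h.holds(evidence)), None)
        if confirmed is None:
            break  # nothing narrower is supported by the evidence
        chain.append(confirmed.claim)
        candidates = confirmed.refinements  # refine and repeat
    return chain


# Made-up evidence for a crashing container.
evidence = {"last_state": "OOMKilled", "memory_limit_mi": 128, "peak_usage_mi": 300}

oom = Hypothesis(
    "container was OOMKilled",
    lambda e: e["last_state"] == "OOMKilled",
    [
        Hypothesis(
            "memory limit is below observed peak usage",
            lambda e: e["memory_limit_mi"] < e["peak_usage_mi"],
        )
    ],
)
bad_image = Hypothesis(
    "image pull failed", lambda e: e["last_state"] == "ImagePullBackOff"
)

chain = investigate([bad_image, oom], evidence)
print(" -> ".join(chain))
```

The returned chain doubles as a simple evidence trail: each entry is a claim that the collected data supported before the loop moved to a narrower one.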
Suggested Remediation
- Synthesizes findings from multiple sources into a single, coherent conclusion.
- Provides precise, actionable fixes tied to the identified root cause — not generic suggestions.
- Highlights specific misconfigurations — YAML syntax errors, incorrect settings, resource limit issues, missing dependencies and cross-resource context.
- Links directly to the affected resource for quick access.
Interactive Troubleshooting
- Supports real-time follow-up questions within any investigation.
- Lets you restart the analysis at any time or branch into adjacent issues without losing context.
One-Click Remediation
- Executes remediation actions with a single click, directly from Klaudia's RCA output.
- Verifies the state of the remediation and triggers other flows if needed.
Optimization and efficiency
- Indexes investigation context to memory to improve resolution of future incidents.
Investigating Failed Deploys and Workflows
When a deployment or rollout fails, Klaudia goes beyond object-level RCA to trace the failure back to the specific change that introduced it.
Detects Deployment Failures
- Identifies failed deployments and flags them in the Komodor UI.
- Tracks deployment events and failure points.
Provides Timeline & Event Breakdown
- Displays a timeline of deployment events, including pod states like:
  - Pending
  - Running (not ready)
  - Running (ready)
- Shows key failure points (e.g., image update failure).
Correlates Events with Changes
- Links failures to configuration changes (e.g., image updates, annotations).
- Helps identify what changed before the failure occurred.
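One simple way to think about linking a failure to a preceding change is a time-window filter over the change history. The sketch below assumes a hypothetical `Change` record and a 30-minute window. Both are illustrative choices, not documented Klaudia behavior.

```python
from datetime import datetime, timedelta
from typing import NamedTuple


class Change(NamedTuple):
    """Hypothetical change record (for example an image or config update)."""
    at: datetime
    description: str


def changes_before(
    failure_at: datetime, changes: list[Change], window: timedelta
) -> list[Change]:
    """Return changes inside the window before the failure, newest first."""
    recent = [c for c in changes if failure_at - window <= c.at <= failure_at]
    return sorted(recent, key=lambda c: c.at, reverse=True)


# Made-up timeline: the failure is observed at 12:00.
failure_at = datetime(2024, 1, 1, 12, 0)
changes = [
    Change(datetime(2024, 1, 1, 9, 0), "annotation added"),
    Change(datetime(2024, 1, 1, 11, 48), "image tag updated"),
    Change(datetime(2024, 1, 1, 12, 30), "replicas scaled"),
]
suspects = changes_before(failure_at, changes, timedelta(minutes=30))
print([c.description for c in suspects])
```

Only the image tag update falls inside the window before the failure, so it is the single suspect. The earlier annotation and the later scaling event are excluded.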
Suggests Remediation Steps
- Highlights potential causes like:
  - Incorrect image tags
  - Missing dependencies
  - Misconfigured environment variables
- Provides recommendations to fix issues.
Enables Quick Investigation & Fixes
- Offers direct actions like:
  - Restarting the deployment.
  - Editing YAML configurations.
  - Detecting configuration drift.
Klaudia Capabilities
Full-Stack Investigation: Following Root Cause Beyond Kubernetes
Kubernetes is where failures surface. The root cause can live anywhere — in a delivery pipeline, a network layer, a dependent data service, or a compute capacity limit. When a workload investigation points outside the cluster, Klaudia follows the evidence automatically.
| Investigation Domain | When Klaudia routes here | Tools & Integrations |
|---|---|---|
| GitOps & Delivery | Issue starts in a delivery pipeline, source control, or multi-cluster control plane | ArgoCD · FluxCD · Helm · GitHub · Cluster API |
| Networking & Security | Traffic routing, connectivity, DNS, certificates, or secrets injection failing | Cilium · Istio · NGINX · Cert-Manager · Vault · External Secrets |
| Compute & Capacity | Node provisioning, autoscaling, storage volumes, GPU, or cloud infra resources failing | Karpenter · KEDA · Crossplane · NVIDIA · Storage |
| Data & Messaging | Stateful services the application depends on at runtime are slow or unavailable | Kafka · Postgres · Redis · RabbitMQ · Elasticsearch |
| Workflows & ML | Orchestration jobs, batch pipelines, ML training runs, or inference endpoints failing | Airflow · Argo Workflows · Kubeflow · Spark · Flink · vLLM |
| Kubernetes Core | K8s admission, policy enforcement, or event-driven scaling configuration | K8s Admission · Kyverno |
Examples of cross-domain routing:
- Pods stuck in Pending because the node autoscaler has hit a hard capacity ceiling → Compute & Capacity
- An ingress returning 503s because a certificate silently expired → Networking & Security
- A CrashLoop introduced by a config change 12 minutes ago → GitOps & Delivery
- A service failing due to connection pool exhaustion in the database layer → Data & Messaging
Behind every cross-domain investigation, purpose-built domain agents join based on where the root cause leads. Klaudia routes to the right agent at the right step — no manual steering required.
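The routing examples above can be read as a lookup from an observed signal to an investigation domain. The sketch below is deliberately simplistic: the domain names come from the table, but the signal labels and the lookup itself are hypothetical and are not how Klaudia's domain agents are actually selected.

```python
from typing import Optional

# Domain names are taken from the table above; the signal keys are
# hypothetical labels invented for this sketch.
ROUTES = {
    "autoscaler_capacity_ceiling": "Compute & Capacity",
    "certificate_expired": "Networking & Security",
    "recent_config_change": "GitOps & Delivery",
    "db_connection_pool_exhausted": "Data & Messaging",
}


def route(signal: str) -> Optional[str]:
    """Return the domain to hand off to, or None to stay in the current scope."""
    return ROUTES.get(signal)


for signal in ("certificate_expired", "recent_config_change", "unrecognized"):
    print(signal, "->", route(signal))
```

Returning `None` for an unrecognized signal models the case where the evidence does not point outside the current investigation.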
Connecting to any MCP/API (Beta)
Connect to any tool or service that exposes an MCP endpoint or OpenAPI spec. Point Klaudia at the URL and the tool becomes available to the AI during investigation.
AI Powered Investigation
Klaudia Chat
Troubleshooting doesn’t stop at the first RCA. Often, you need to:
- Clarify what a finding means
- Understand resolution steps and prevention
- Investigate related resources
- Assess the broader impact of a change or failure
And sometimes, you just want to understand how a healthy resource behaves or fits into your environment. With Klaudia Chat, you now have a virtual SRE available on demand—whether you’re working on a live issue or exploring infrastructure.
Klaudia Chat enables ongoing, context-aware conversation at any point in an investigation — and from any resource in the platform, even healthy ones.
Ask natural-language questions to investigate any resource - get context, understand issues, assess impact, and explore solutions effortlessly. Restore or share historical chats.
What Klaudia Chat enables:
- Reduced Mean Time to Resolution (MTTR): Faster insights lead to quicker resolutions.
- Interactive Problem Exploration: A back-and-forth conversation simplifies troubleshooting.
- Deeper Insights: Leverage Klaudia’s expansive knowledge base for comprehensive answers.
- Adaptable Investigation Paths: Branch into related issues that the initial RCA doesn’t immediately cover.
Automated Correlation
Klaudia combines multi-source data, history, and Komodor's deep Kubernetes knowledge to correlate issues with recent changes, related resources, cross-cluster information, infrastructure issues, and more.
Cross-cluster chat support
Klaudia reasons across your entire fleet in a single conversation. Ask questions that span multiple clusters without pre-selecting one — she resolves scope dynamically, enabling fleet-wide investigations that would otherwise require multiple fragmented sessions.
Granular chat permissions
Users with scoped cluster access (e.g., specific namespaces) can engage with single and cross-cluster chat without needing elevated permissions, allowing investigations without compromising access control.
AI Driven RCA
From day one, and without extensive setup, Komodor's internal AI models draw on Komodor's existing Kubernetes knowledge and conduct independent investigations to automatically identify the root cause and suggest actionable next steps.
RCA Chat
Troubleshooting doesn't stop when the RCA ends. Chat keeps the investigation moving — from diagnosis to full resolution — without switching tools.
When viewing any Klaudia Root Cause Analysis, continue the investigation through conversation:
- Ask follow-up questions: "What does this finding mean?", "How do I prevent this from happening again?"
- Get step-by-step resolution guidance
- Investigate related logs, events, or recent changes
- Branch into adjacent issues the initial RCA didn't cover
Share your chat outputs
Want to share the entire conversation with a teammate? Click the “Share link” button next to the RCA results.
Chat from Any Resource
Open Ask Klaudia from any resource in the platform — Pods, Deployments, ConfigMaps, Secrets, Nodes, and more — regardless of whether an issue is active.
Klaudia uses Komodor’s event intelligence and investigation engine to:
- Identify issues and correlations across the cluster
- Connect logs, events, and configurations to uncover root causes
- Leverage user prompts to refine answers with context
Klaudia scopes the conversation to the selected resource and provides contextual answers, suggested questions, and explanations of metrics, recent changes, and dependencies.
Questions you can ask from any resource:
- "Why did this happen?"
- "How do I fix it?"
- "What changed recently?"
- "Which services are related to this?"
- "Has this occurred before?"
- "What does this config mean?"
Autonomous Remediation
One-Click Remediation
Execute fixes directly from Klaudia's RCA output, without leaving the investigation:
- Action preview — see the exact command before it runs
- Real-time status feedback — monitor execution as it happens
- RBAC enforcement — actions are permission-gated; Klaudia only acts within your defined access scope
- Full audit trail — every action executed through Klaudia is logged and traceable
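The four properties above (preview, status feedback, permission gating, and auditing) fit together roughly as in the sketch below. It is an illustration under assumptions: the `Remediator` class, its namespace-based scope check, and the log format are hypothetical. Only the `kubectl rollout restart` command shown in the preview is a real command.

```python
from dataclasses import dataclass, field


@dataclass
class Remediator:
    """Hypothetical executor: permission-gated, with preview and audit log."""
    allowed_namespaces: set[str]
    audit_log: list[dict] = field(default_factory=list)

    def preview(self, namespace: str, deployment: str) -> str:
        """Show the exact command before anything runs."""
        return f"kubectl rollout restart deployment/{deployment} -n {namespace}"

    def execute(self, user: str, namespace: str, deployment: str) -> str:
        """Act only inside the allowed scope; log every attempt."""
        command = self.preview(namespace, deployment)
        status = "executed" if namespace in self.allowed_namespaces else "denied"
        # A real executor would run the command here when status == "executed".
        self.audit_log.append({"user": user, "command": command, "status": status})
        return status


remediator = Remediator(allowed_namespaces={"payments"})
print(remediator.preview("payments", "api"))
results = [
    remediator.execute("dana", "payments", "api"),
    remediator.execute("dana", "kube-system", "coredns"),
]
print(results, len(remediator.audit_log))
```

Denied attempts are logged as well as successful ones, which keeps the audit trail complete even when an action is blocked by access scope.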
Git PR Generation (Beta)
For configuration-level root causes — manifest errors, resource limit misconfigurations, infrastructure settings — Klaudia opens a Pull Request with the correct fix directly in your repository. The PR goes through your team's standard review and approval flow. No changes are applied without human sign-off unless a self-healing policy is in place.
Fully Automated Self-Healing (Beta)
Automatically resolve issues (e.g., pod crashes, misconfigurations, failing rollouts) before they escalate, using tailor-made playbooks from day one.
Policy-Driven Guardrails
Control the scope, autonomy level, and limits of automated actions to keep them aligned with security and organizational policies.
Clear reasoning, causality and auditing
Easily understand the issue at hand, the investigation flow, and the remediation steps. All actions taken are fully audited.
Organizational Context
Klaudia uses three context layers to ground every investigation in your environment — not generic best practices.
Klaudia.md
A blueprint file written once by your team. It captures service dependencies, hard constraints, and rules that must never be violated. Klaudia.md is automatically loaded into every investigation session, ensuring remediations never violate your environment's specific constraints.
Knowledge Base
Connect your existing runbooks, postmortems, and troubleshooting guides. Klaudia semantically searches the knowledge base on-demand per incident, retrieving the most relevant runbook for the active failure pattern and applying your team's specific procedures rather than generic responses.
Klaudia Memory
Klaudia retains investigation history across sessions — what happened, what was tried, and what resolved each incident. Similar incidents are recognized instantly, and resolution playbooks are auto-indexed over time. Resolution speed improves with every incident.
Work Surfaces
Slack
A Klaudia bot natively embedded in Slack lets engineers run full investigations without leaving their operational channels:
- Trigger RCA investigations from a Slack message or war room
- Ask follow-up questions and get answers in-thread
- Review and approve remediation actions directly in Slack
- Receive investigation summaries and root cause verdicts as messages
REST API
Klaudia’s Root Cause Analysis (RCA) is now available through the Komodor API—enabling teams to trigger and retrieve AI-powered investigations directly from their existing toolchains and workflows.
Available Endpoints
- Trigger RCA: POST /api/v1/klaudia/rca
  Initiates an RCA for a specific Kubernetes workload (Pod, Deployment, Job).
- Retrieve Results: GET /api/v1/klaudia/rca/{session_id}
  Returns the root cause, confidence score, supporting evidence, and suggested remediation steps.
Usage Pattern
Investigations typically complete in 20–30 seconds. Use a polling pattern:
- Trigger investigation via POST /api/v1/klaudia/rca
- Receive a session_id in the response
- Poll GET /api/v1/klaudia/rca/{session_id} until complete
- Process the returned root cause analysis and recommendations
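The polling pattern can be sketched in Python as follows. The two endpoint paths and the `session_id` field come from this page. Everything else is an assumption to verify against the Swagger documentation: the base URL, the `X-API-KEY` header name, the request body shape, and the `status` field with its `pending` value. The offline `fake_sender` exists only so the sketch runs without credentials.

```python
import json
import time
import urllib.request
from typing import Callable, Optional

# Assumptions to confirm in the Swagger docs: base URL and auth header name.
BASE_URL = "https://api.komodor.com"
AUTH_HEADER = "X-API-KEY"

Send = Callable[[str, str, Optional[dict]], dict]


def http_sender(api_key: str) -> Send:
    """Build a sender that issues JSON requests with urllib."""
    def send(method: str, path: str, body: Optional[dict]) -> dict:
        data = json.dumps(body).encode() if body is not None else None
        request = urllib.request.Request(
            BASE_URL + path,
            data=data,
            method=method,
            headers={AUTH_HEADER: api_key, "Content-Type": "application/json"},
        )
        with urllib.request.urlopen(request, timeout=30) as response:
            return json.load(response)
    return send


def run_rca(send: Send, target: dict, interval: float = 5.0,
            timeout: float = 120.0) -> dict:
    """Trigger an RCA, then poll the session until it is no longer pending."""
    session_id = send("POST", "/api/v1/klaudia/rca", target)["session_id"]
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = send("GET", f"/api/v1/klaudia/rca/{session_id}", None)
        if result.get("status") != "pending":  # hypothetical status value
            return result
        time.sleep(interval)
    raise TimeoutError(f"RCA session {session_id} did not complete in {timeout}s")


def fake_sender() -> Send:
    """Offline stand-in for the API: one pending reply, then a final one."""
    replies = iter([{"status": "pending"},
                    {"status": "complete", "root_cause": "example"}])

    def send(method: str, path: str, body: Optional[dict]) -> dict:
        if method == "POST":
            return {"session_id": "abc123"}
        return next(replies)
    return send


# The target body is a hypothetical shape, not the documented schema.
result = run_rca(fake_sender(), {"kind": "Deployment", "name": "api"}, interval=0)
print(result["status"])
```

To call the real API, pass `http_sender(api_key)` instead of `fake_sender()`. Since investigations typically complete in 20 to 30 seconds, a polling interval of a few seconds and a timeout of a couple of minutes are reasonable starting points.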
Common use cases:
- Automate post-deployment troubleshooting by integrating directly into CI/CD tools like Jenkins, GitLab, or CircleCI
- Enhance incident response workflows in alerting platforms like PagerDuty and OpsGenie
- Bring RCA insights into ChatOps tools such as Slack or Microsoft Teams
- Centralize visibility by embedding RCA outputs into dashboards and internal ticketing systems
Getting Started
Explore the new endpoints in our public Swagger documentation. The RCA API uses the same Komodor authentication and permission model you’re already familiar with.
MCP Server (Beta)
Expose Klaudia as a standardized tool that external platforms and AI agents can discover and invoke — without custom API integrations. IDEs, CI/CD pipelines, autonomous agents, and internal tools can trigger investigations through a common protocol, embedding Klaudia across your ecosystem.
Supported clients: Claude, VS Code, Cursor, LLM agents, and any MCP-compatible platform.
- The server runs locally on your machine. It is distributed as a Python package on PyPI and launched via uvx or uv run.
- Two transport modes are supported:
  - HTTP: the server starts on http://localhost:8002 and the AI client connects to it over HTTP.
  - stdio (development): the AI client spawns the server as a subprocess and communicates via stdin/stdout. Supported by most MCP-compatible clients.
- It talks to Komodor's API over HTTPS using your API key.
- It is read-only. The tools expose observability and analysis capabilities only. One-Click Remediation is not currently exposed.
Installation & Setup
For installation instructions, prerequisites, and AI client configuration, see the official package page: komodor-mcp on PyPI.