Overview

Klaudia is Komodor's AI SRE — available across all investigation surfaces in the platform. This page covers how each capability works in practice: workload and deployment investigation, full-stack RCA across Kubernetes core, compute, networking, GitOps, data services, and ML workloads, chat-based troubleshooting, autonomous remediation, organizational context, and external integrations.

Klaudia can be enabled or disabled for each cluster.

For more complex cases, Klaudia may show further details about the running investigation, along with a delay banner.

What Klaudia Investigates

For each resource, Klaudia runs the same investigation loop: detection, root cause analysis, evidence collection, remediation suggestions, one-click remediation, and interactive follow-up.

For workloads, Klaudia investigates Pods, ReplicaSets, Deployments, Jobs, CronJobs, StatefulSets, and DaemonSets.
- Deployments, StatefulSets, DaemonSets — full RCA on pod failures, scheduling issues, resource limits, configuration errors, and readiness failures across all replica-based workloads.
- Jobs & CronJobs — investigates failed runs, timeout conditions, and retry exhaustion, correlating failures back to config or upstream dependency changes.
For storage, Klaudia investigates PVCs, PVs, and Storage Classes.
For configuration, Klaudia investigates ConfigMaps, Secrets, Resource Quotas, Limit Ranges, HPAs, and PDBs that may impact workload stability, scaling, or availability.
For networking and access control, Klaudia investigates Kubernetes Services, Endpoints, Endpoint Slices, Ingresses, Network Policies, Service Accounts, Roles, Cluster Roles, Role Bindings, and Cluster Role Bindings.
For Kubernetes add-ons, Klaudia investigates the add-ons themselves as first-class resources, not just the workloads they manage.
These include operational managers such as Argo CD, Helm, Cert Manager, External DNS, and Autoscalers.
For example, with ArgoCD: when a sync completes and a deployment immediately fails, Klaudia links the ArgoCD application state to the failing workload and investigates both together.
For CRDs, Klaudia can investigate issues with custom resource definitions.
For ML & Workflow Systems, Klaudia runs RCA directly on ML and data workflow resources — not just the individual pods they spawn.
Supported: Argo Workflows, Argo Rollouts, Airflow, Spark, and custom workflow CRDs.
Workflow-specific investigation capabilities:
- Runs RCA directly on the workflow resource itself (Argo, Airflow, Spark, custom)
- Automatically fetches all failing pods tied to the workflow run
- Analyzes failures together to correlate root causes across steps and tasks — not as isolated pod failures
- Includes the full workflow resource YAML, including CRDs for Spark and Argo, to produce a more complete RCA
This means a Spark job failing across multiple executors, or an Argo Workflow where three tasks fail for the same underlying reason, produces a single correlated root cause rather than three separate alerts.
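
As a rough illustration of what "correlated" means here, the sketch below groups failing pods from one workflow run by a shared failure signature. It is not Klaudia's implementation: the `PodFailure` type, the `correlate` function, and the sample pod names are all hypothetical, and a real correlation would draw on far more evidence than a reason and a message.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class PodFailure:
    # Hypothetical record of one failing pod in a workflow run.
    pod: str      # pod name
    step: str     # workflow step or task the pod belongs to
    reason: str   # termination reason reported for the pod
    message: str  # last error line from the pod


def correlate(failures: list[PodFailure]) -> dict[tuple[str, str], list[str]]:
    """Group pod failures that share the same (reason, message) signature."""
    groups: dict[tuple[str, str], list[str]] = defaultdict(list)
    for failure in failures:
        # Normalize whitespace and case so near-identical messages match.
        signature = (failure.reason, failure.message.strip().lower())
        groups[signature].append(failure.pod)
    return dict(groups)


# Three tasks failing for the same underlying reason (made-up sample data).
failures = [
    PodFailure("etl-extract-1", "extract", "Error", "Secret db-creds not found"),
    PodFailure("etl-load-1", "load", "Error", "secret db-creds not found "),
    PodFailure("etl-report-1", "report", "Error", "Secret db-creds not found"),
]
correlated = correlate(failures)
```

With this sample data, `correlated` holds a single group containing all three pods, which is the shape of result the paragraph above describes: one root cause for the workflow instead of one alert per pod.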

How Klaudia Investigates

The following investigation process applies to all resource types listed above — workloads, add-ons, ML workflows, and native resources.

Detection & Analysis

Identifies failures: CrashLoopBackOff, OOMKill, scheduling failures, readiness probe errors, and more.
Conducts an independent investigation without requiring manual setup or steering.

AI-Driven Root Cause Analysis

Extracts logs, configurations, and errors automatically — no manual copy-paste into the investigation.
Analyzes new information and refines the investigation.
Mimics an expert SRE's workflow: hypothesizes, queries data, analyzes, and refines.
Repeats the process, narrowing down to the root cause.
References the exact resource where the problem originates, with a full evidence chain.

Suggested Remediation

Synthesizes findings from multiple sources into a single, coherent conclusion.
Provides precise, actionable fixes tied to the identified root cause — not generic suggestions.
Highlights specific misconfigurations, such as YAML syntax errors, incorrect settings, resource limit issues, and missing dependencies, along with cross-resource context.
Links directly to the affected resource for quick access.

Interactive Troubleshooting

Supports real-time follow-up questions within any investigation.
Restart the analysis at any time or branch into adjacent issues without losing context.

One-Click Remediation

Execute remediation actions with a single click, directly from Klaudia's RCA output.
Verify the state of the remediation and trigger other flows if needed.

Optimization and Efficiency

Indexes context to memory, improving resolution of future incidents.

Investigating Failed Deploys and Workflows

When a deployment or rollout fails, Klaudia goes beyond object-level RCA to trace the failure back to the specific change that introduced it.

Detects Deployment Failures

Identifies failed deployments and flags them in the Komodor UI.
Tracks deployment events and failure points.

Provides Timeline & Event Breakdown

Displays a timeline of deployment events, including pod states like:
- Pending
- Running (not ready)
- Running (ready)
Shows key failure points (e.g., image update failure).

Correlates Events with Changes

Links failures to configuration changes (e.g., image updates, annotations).
Helps identify what changed before the failure occurred.
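
The idea of correlating a failure with preceding changes can be sketched as a simple time-window filter. This is only an illustration under stated assumptions, not how Komodor implements it: the `Change` type, the `changes_before` function, the 30-minute lookback, and the sample events are all hypothetical.

```python
from datetime import datetime, timedelta
from typing import NamedTuple


class Change(NamedTuple):
    # Hypothetical change event recorded for a resource.
    at: datetime   # when the change was applied
    kind: str      # what changed, such as an image update or an annotation
    resource: str  # the resource that changed


def changes_before(failure_at, changes, lookback=timedelta(minutes=30)):
    """Return changes inside [failure_at - lookback, failure_at], newest first."""
    window_start = failure_at - lookback
    hits = [c for c in changes if window_start <= c.at <= failure_at]
    return sorted(hits, key=lambda c: c.at, reverse=True)


# Made-up sample data: one change long before the failure, one shortly
# before it, and one after it.
failure_at = datetime(2024, 1, 1, 12, 0)
changes = [
    Change(datetime(2024, 1, 1, 9, 0), "annotation", "deploy/api"),
    Change(datetime(2024, 1, 1, 11, 48), "image update", "deploy/api"),
    Change(datetime(2024, 1, 1, 12, 5), "configmap edit", "cm/api-config"),
]
suspects = changes_before(failure_at, changes)
```

Only the image update applied 12 minutes before the failure falls inside the window, so it is the one change surfaced as a suspect. Sorting newest first reflects the usual heuristic that the most recent change is the first one worth checking.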

Suggests Remediation Steps

Highlights potential causes like:
- Incorrect image tags
- Missing dependencies
- Misconfigured environment variables
Provides recommendations to fix issues.

Enables Quick Investigation & Fixes

Offers direct actions like:
- Restarting the deployment.
- Editing YAML configurations.
- Detecting configuration drift.

Klaudia Capabilities

Full-Stack Investigation: Following Root Cause Beyond Kubernetes

Kubernetes is where failures surface. The root cause can live anywhere — in a delivery pipeline, a network layer, a dependent data service, or a compute capacity limit. When a workload investigation points outside the cluster, Klaudia follows the evidence automatically.

Investigation Domain	When Klaudia routes here	Tools & Integrations
GitOps & Delivery	Issue starts in a delivery pipeline, source control, or multi-cluster control plane	ArgoCD · FluxCD · Helm · GitHub · Cluster API
Networking & Security	Traffic routing, connectivity, DNS, certificates, or secrets injection failing	Cilium · Istio · NGINX · Cert-Manager · Vault · External Secrets
Compute & Capacity	Node provisioning, autoscaling, storage volumes, GPU, or cloud infra resources failing	Karpenter · KEDA · Crossplane · NVIDIA · Storage
Data & Messaging	Stateful services the application depends on at runtime are slow or unavailable	Kafka · Postgres · Redis · RabbitMQ · Elasticsearch
Workflows & ML	Orchestration jobs, batch pipelines, ML training runs, or inference endpoints failing	Airflow · Argo Workflows · Kubeflow · Spark · Flink · vLLM
Kubernetes Core	K8s admission, policy enforcement, or event-driven scaling configuration	K8s Admission · Kyverno

Examples of cross-domain routing:

Pods stuck in Pending because the node autoscaler has hit a hard capacity ceiling → Compute & Capacity
An ingress returning 503s because a certificate silently expired → Networking & Security
A CrashLoop introduced by a config change 12 minutes ago → GitOps & Delivery
A service failing due to connection pool exhaustion in the database layer → Data & Messaging

Behind every cross-domain investigation, purpose-built domain agents join based on where the root cause leads. Klaudia routes to the right agent at the right step — no manual steering required.

Connecting to any MCP/API (Beta)

Connect to any tool or service that exposes an MCP endpoint or OpenAPI spec. Point Klaudia at the URL and the tool becomes available to the AI during investigation.

AI-Powered Investigation

Klaudia Chat

Troubleshooting doesn’t stop at the first RCA. Often, you need to:

Clarify what a finding means
Understand resolution steps and prevention
Investigate related resources
Assess the broader impact of a change or failure

And sometimes, you just want to understand how a healthy resource behaves or fits into your environment. With Klaudia Chat, you now have a virtual SRE available on demand—whether you’re working on a live issue or exploring infrastructure.

Klaudia Chat enables ongoing, context-aware conversation at any point in an investigation — and from any resource in the platform, even healthy ones.

Ask natural-language questions to investigate any resource: get context, understand issues, assess impact, and explore solutions. You can also restore or share historical chats.

What Klaudia Chat enables:

Reduced Mean Time to Resolution (MTTR): Faster insights lead to quicker resolutions.
Interactive Problem Exploration: A back-and-forth conversation simplifies troubleshooting.
Deeper Insights: Leverage Klaudia’s expansive knowledge base for comprehensive answers.
Adaptable Investigation Paths: Branch into related issues that the initial RCA doesn’t immediately cover.

Automated Correlation

Klaudia combines multi-source data, history, and deep Kubernetes knowledge in Komodor to correlate issues with recent changes, related resources, cross-cluster information, infrastructure issues, and more.

Cross-cluster chat support

Klaudia reasons across your entire fleet in a single conversation. Ask questions that span multiple clusters without pre-selecting one — she resolves scope dynamically, enabling fleet-wide investigations that would otherwise require multiple fragmented sessions.

Granular chat permissions

Users with scoped cluster access (for example, specific namespaces) can use single-cluster and cross-cluster chat without needing elevated permissions, so investigations do not compromise access control.

AI-Driven RCA

From day one, and without extensive setup, Komodor's internal AI models leverage Komodor's existing knowledge base to conduct independent investigations, automatically highlight the root cause, and suggest actionable next steps.

RCA Chat

Troubleshooting doesn't stop when the RCA ends. Chat keeps the investigation moving — from diagnosis to full resolution — without switching tools.

When viewing any Klaudia Root Cause Analysis, continue the investigation through conversation:

Ask follow-up questions: "What does this finding mean?", "How do I prevent this from happening again?"
Get step-by-step resolution guidance
Investigate related logs, events, or recent changes
Branch into adjacent issues the initial RCA didn't cover

Share your chat outputs

Want to share the entire conversation with a teammate? Click the “Share link” button next to the RCA results.

Chat from Any Resource

Open Ask Klaudia from any resource in the platform — Pods, Deployments, ConfigMaps, Secrets, Nodes, and more — regardless of whether an issue is active.

Klaudia uses Komodor’s event intelligence and investigation engine to:

Identify issues and correlations across the cluster
Connect logs, events, and configurations to uncover root causes
Leverage user prompts to refine answers with context

Klaudia scopes the conversation to the selected resource and provides contextual answers, suggested questions, and explanations of metrics, recent changes, and dependencies.

Questions you can ask from any resource:

"Why did this happen?"
"How do I fix it?"
"What changed recently?"
"Which services are related to this?"
"Has this occurred before?"
"What does this config mean?"

Autonomous Remediation

One-Click Remediation

Execute fixes directly from Klaudia's RCA output, without leaving the investigation:

Action preview — see the exact command before it runs
Real-time status feedback — monitor execution as it happens
RBAC enforcement — actions are permission-gated; Klaudia only acts within your defined access scope
Full audit trail — every action executed through Klaudia is logged and traceable

Git PR Generation (Beta)

For configuration-level root causes — manifest errors, resource limit misconfigurations, infrastructure settings — Klaudia opens a Pull Request with the correct fix directly in your repository. The PR goes through your team's standard review and approval flow. No changes are applied without human sign-off unless a self-healing policy is in place.

Fully Automated Self-Healing (Beta)

Automatically resolve issues (e.g., pod crashes, misconfigurations, failing rollouts) before they escalate, using tailor-made playbooks from day one.

Policy-Driven Guardrails

Control the scope, autonomy level, and limits of automated actions to keep them aligned with security and organizational policies.

Clear reasoning, causality and auditing

Easily understand the issue at hand, the investigation flow, and the remediation steps. All actions taken are fully audited.

Organizational Context

Klaudia uses three context layers to ground every investigation in your environment — not generic best practices.

Klaudia.md

A blueprint file written once by your team. It captures service dependencies, hard constraints, and rules that must never be violated. Klaudia.md is automatically loaded into every investigation session, ensuring remediations never violate your environment's specific constraints.

Knowledge Base

Connect your existing runbooks, postmortems, and troubleshooting guides. Klaudia semantically searches the knowledge base on-demand per incident, retrieving the most relevant runbook for the active failure pattern and applying your team's specific procedures rather than generic responses.

Klaudia Memory

Klaudia retains investigation history across sessions — what happened, what was tried, and what resolved each incident. Similar incidents are recognized instantly, and resolution playbooks are auto-indexed over time. Resolution speed improves with every incident.

Work Surfaces

Slack

A Klaudia bot natively embedded in Slack lets engineers run full investigations without leaving their operational channels:

Trigger RCA investigations from a Slack message or war room
Ask follow-up questions and get answers in-thread
Review and approve remediation actions directly in Slack
Receive investigation summaries and root cause verdicts as messages

REST API

Klaudia’s Root Cause Analysis (RCA) is now available through the Komodor API—enabling teams to trigger and retrieve AI-powered investigations directly from their existing toolchains and workflows.

Available Endpoints

Trigger RCA — POST /api/v1/klaudia/rca
Initiates an RCA for a specific Kubernetes workload (Pod, Deployment, Job).
Retrieve Results — GET /api/v1/klaudia/rca/{session_id}
Returns the root cause, confidence score, supporting evidence, and suggested remediation steps.

Usage Pattern

Investigations typically complete in 20–30 seconds. Use a polling pattern:

Trigger investigation via POST /api/v1/klaudia/rca
Receive a session_id in the response
Poll GET /api/v1/klaudia/rca/{session_id} until complete
Process the returned root cause analysis and recommendations

Common use cases:

Automate post-deployment troubleshooting by integrating directly into CI/CD tools like Jenkins, GitLab, or CircleCI
Enhance incident response workflows in alerting platforms like PagerDuty and OpsGenie
Bring RCA insights into ChatOps tools such as Slack or Microsoft Teams
Centralize visibility by embedding RCA outputs into dashboards and internal ticketing systems

Example CI/CD script
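
A minimal sketch of the trigger-and-poll pattern, written as a script a pipeline step could run after a deploy. Only the two endpoint paths and the `session_id` field come from this page. The base URL, the `X-API-KEY` header name, the request body fields, and the `is_complete` response field are assumptions for illustration, so check the Swagger documentation for the real schema before using anything like this.

```python
import json
import os
import time
import urllib.request

# Assumptions: placeholder base URL and environment variable names.
BASE_URL = os.environ.get("KOMODOR_API_URL", "https://api.example.com")
API_KEY = os.environ.get("KOMODOR_API_KEY", "")


def http_json(method, path, body=None):
    """Send a JSON request to the API and decode the JSON response."""
    data = json.dumps(body).encode() if body is not None else None
    request = urllib.request.Request(
        BASE_URL + path,
        data=data,
        method=method,
        # The auth header name is an assumption, not taken from this page.
        headers={"Content-Type": "application/json", "X-API-KEY": API_KEY},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def run_rca(workload, send=http_json, interval=5.0, timeout=120.0, sleep=time.sleep):
    """Trigger an RCA, then poll until it completes or the timeout passes."""
    session_id = send("POST", "/api/v1/klaudia/rca", workload)["session_id"]
    waited = 0.0
    while waited <= timeout:
        result = send("GET", f"/api/v1/klaudia/rca/{session_id}")
        if result.get("is_complete"):  # hypothetical completion flag
            return result
        sleep(interval)
        waited += interval
    raise TimeoutError(f"RCA {session_id} did not complete within {timeout}s")


def main():
    """Run one RCA and return a process exit code for the pipeline."""
    # Hypothetical request body describing the workload to investigate.
    workload = {"kind": "Deployment", "name": "api", "namespace": "prod", "cluster": "prod-1"}
    try:
        result = run_rca(workload)
    except TimeoutError as exc:
        print(exc)
        return 1
    print(json.dumps(result, indent=2))
    return 0
```

A pipeline step could end the script with `raise SystemExit(main())` so that a timed-out investigation fails the step. The `send` and `sleep` parameters exist so the polling loop can be exercised with a stub instead of real network calls.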

Getting Started

Explore the new endpoints in our public Swagger documentation. The RCA API uses the same Komodor authentication and permission model you’re already familiar with.

MCP Server (Beta)

Expose Klaudia as a standardized tool that external platforms and AI agents can discover and invoke — without custom API integrations. IDEs, CI/CD pipelines, autonomous agents, and internal tools can trigger investigations through a common protocol, embedding Klaudia across your ecosystem.

Supported clients: Claude, VS Code, Cursor, LLM agents, and any MCP-compatible platform.

The server runs locally on your machine. It is distributed as a Python package on PyPI and launched via uvx or uv run.
Two transport modes are supported:
- HTTP: the server starts on http://localhost:8002 and the AI client connects to it over HTTP.
- stdio (development): the AI client spawns the server as a subprocess and communicates via stdin/stdout. Supported by most MCP-compatible clients.
The server talks to Komodor's API over HTTPS using your API key.
It is read-only: the tools expose observability and analysis capabilities only.
One-Click Remediation is not currently exposed.

Installation & Setup

For installation instructions, prerequisites, and AI client configuration, see the official package page: komodor-mcp on PyPI.
