Guided Troubleshooting - Simplifying Kubernetes Issue Resolution

Introduction

Komodor’s Guided Troubleshooting feature makes Kubernetes troubleshooting easier and more accessible. It runs automated checks and scenario-based flows behind the scenes, then surfaces the most relevant information and correlations so that you can pinpoint root causes and resolve issues efficiently.

Key Features & Capabilities

Automated Checks and Scenarios

Guided Troubleshooting runs automated checks and scenarios behind the scenes. These processes gather and correlate the relevant data, so you reach the root cause faster and with fewer wrong turns.

Komodor's Guided Troubleshooting covers a range of common issues:

Out of memory and eviction

Komodor's Out of Memory (OOM) issue detection runs automated checks to identify and help address memory-related problems within Kubernetes clusters.
The insights and correlations it offers help users resolve OOM issues promptly.
When a pod fails because of an eviction or an out-of-memory condition, Komodor runs a series of checks to provide actionable insights:

Container Memory Limit Check: Verifies whether the failed container reached its memory limit.
Memory Limit Adjustment Detection: Detects if the memory limit was decreased before the first occurrence of the issue, indicating a potential misconfiguration.
Noisy Neighbors Investigation: Identifies and reports on pods that are evicting others or causing out-of-memory issues, helping to isolate and manage disruptive elements in your cluster.
Memory Allocation Review: Prompts a review of the applications' memory allocation needs to ensure they are adequately provisioned.
Memory Leak Analysis: Suggests reviewing the memory usage graph and provides tips to identify potential memory leak issues.

Through these checks and insights, Komodor facilitates efficient troubleshooting and resolution of OOM issues, helping users maintain the stability and performance of their Kubernetes environments.
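To make the first two checks above more concrete, here is a minimal Python sketch of how similar signals could be read from a pod object returned by the Kubernetes API. It is not Komodor's implementation. The function names are hypothetical, the pod is assumed to be a plain dict in the shape of the API's JSON, and the quantity parser covers only the most common memory suffixes.

```python
# Multipliers for the common Kubernetes memory suffixes (binary and decimal).
_SUFFIXES = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}


def parse_memory(quantity: str) -> int:
    """Convert a memory quantity such as '512Mi' or '1G' to bytes."""
    # Try two-letter suffixes first so the longest match wins.
    for suffix in sorted(_SUFFIXES, key=len, reverse=True):
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * _SUFFIXES[suffix])
    return int(float(quantity))  # no suffix: plain byte count


def oom_killed_containers(pod: dict) -> list[str]:
    """Names of containers whose current or last termination was OOMKilled."""
    names = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        for key in ("state", "lastState"):
            terminated = (status.get(key) or {}).get("terminated") or {}
            if terminated.get("reason") == "OOMKilled":
                names.append(status["name"])
                break
    return names


def was_evicted(pod: dict) -> bool:
    """True if the pod's status reports that it was evicted."""
    return pod.get("status", {}).get("reason") == "Evicted"


def limit_decreased(previous: str, current: str) -> bool:
    """True if the memory limit was lowered between two revisions (hypothetical helper)."""
    return parse_memory(current) < parse_memory(previous)
```

For example, `limit_decreased("1Gi", "512Mi")` flags a limit that was halved, which is the kind of change the Memory Limit Adjustment Detection check looks for before the first occurrence of an issue.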

Noisy neighbors detection

By actively monitoring pod behavior, the workflow identifies instances where specific pods may be disrupting the performance of other services, such as causing evictions or out-of-memory issues.
This capability allows users to quickly isolate and manage these disruptive pods, saving precious time and effort that would otherwise be spent investigating multiple resources.
When suspected noisy neighbors are detected, Komodor provides a detailed report on the suspect pods, including their configurations, metrics, and behavior, alongside information about the nodes where they are running.
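The source does not describe how Komodor identifies noisy neighbors, so the following is only an illustrative heuristic: on a given node, list the pods whose memory usage exceeds what they requested, largest overage first. The `PodMemory` type and the function name are hypothetical, and the metrics are assumed to have been collected already.

```python
from dataclasses import dataclass


@dataclass
class PodMemory:
    """Memory figures for one pod (hypothetical structure for this sketch)."""
    name: str
    node: str
    request_bytes: int
    usage_bytes: int


def suspected_noisy_neighbors(pods: list[PodMemory], node: str) -> list[PodMemory]:
    """Pods on `node` using more memory than they requested, largest overage first."""
    over = [p for p in pods if p.node == node and p.usage_bytes > p.request_bytes]
    return sorted(over, key=lambda p: p.usage_bytes - p.request_bytes, reverse=True)
```

Comparing usage against requests rather than limits is a deliberate choice here: a pod can stay under its limit and still consume far more than the scheduler accounted for, which is what puts pressure on its neighbors.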

Smart analysis of error messages

Komodor automatically recognizes specific error templates and provides straightforward explanations. By explaining why a pod failed to come up and highlighting the issue for easy access, this feature helps engineers and DevOps professionals resolve issues faster and with more confidence.
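As a rough illustration of template-based error recognition, the sketch below matches a few well-known Kubernetes event messages with regular expressions and returns a plain-language explanation. The template list, the explanation wording, and the `explain` function are assumptions made for this example and do not reflect Komodor's actual rules.

```python
import re
from typing import Optional

# Illustrative templates only: each pairs a message pattern with an explanation.
TEMPLATES = [
    (re.compile(r"\d+/\d+ nodes are available: .*Insufficient (?P<resource>cpu|memory)"),
     "No node has enough free {resource} to schedule the pod."),
    (re.compile(r'Back-off pulling image "(?P<image>[^"]+)"'),
     "The image {image} could not be pulled; check the name, tag, and registry credentials."),
    (re.compile(r"Back-off restarting failed container"),
     "The container keeps crashing after start; check its logs and exit code."),
]


def explain(message: str) -> Optional[str]:
    """Return an explanation for the first matching template, or None."""
    for pattern, explanation in TEMPLATES:
        match = pattern.search(message)
        if match:
            # Fill placeholders from the named groups captured in the message.
            return explanation.format(**match.groupdict())
    return None
```

Messages that match no template return `None`, so a caller can fall back to showing the raw error text.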


Organized Data Steps

The feature breaks down the troubleshooting process into manageable steps, each simplified to demystify complicated Kubernetes concepts:

Introduction to the Issue: A clear summary of the problem at hand.
Logs Analysis: Comprehensive log analysis to identify potential issues.
Correlated Deployments: Insights into recent deployments that may have affected the service.
Correlated Node Issues and Terminations: Information about node issues and terminations that could be related to the problem.
Unhealthy Pods Information: Details about any unhealthy pods within the service.
Insight-Driven Steps: Additional steps based on insights related to the issue's root cause.
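The Correlated Deployments step above amounts to a time-window correlation. A minimal sketch, assuming deploy records are dicts with a hypothetical `finished_at` timestamp and using an arbitrary one-hour window (neither the field name nor the window comes from Komodor), might look like this:

```python
from datetime import datetime, timedelta


def correlated_deploys(
    deploys: list[dict],
    issue_start: datetime,
    window: timedelta = timedelta(hours=1),  # assumed window, tune as needed
) -> list[dict]:
    """Deploys that finished within `window` before the issue began, newest first."""
    earliest = issue_start - window
    recent = [d for d in deploys if earliest <= d["finished_at"] <= issue_start]
    return sorted(recent, key=lambda d: d["finished_at"], reverse=True)
```

Sorting newest first puts the most recent change at the top, which is often the first candidate worth checking.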

Getting Started

The Guided Troubleshooting feature is currently available for addressing availability issues.
You can open it from the health indicator in the service view, or by clicking the availability issue event.
First, you’ll see a summary of the issue.
To begin troubleshooting, click the "Investigate" button.