Reliability

Background

Kubernetes, when configured correctly, provides a self-healing infrastructure.
It is intended to mitigate local issues by restarting containers, replacing pods, limiting/throttling workloads, and other steps to make sure your applications are running properly.

This behavior sometimes hides a bigger problem that is only discovered down the line, or degrades application services and the end-user experience.

Komodor utilizes data collected by the platform (events, metrics, resource specifications) to identify faulty behaviors that could indicate a larger issue but may otherwise go unnoticed, and provides clear guidance on how to solve them.
Komodor also identifies misconfigurations within clusters and raises a flag, helping you proactively prevent potential issues and improve overall application reliability.

In this article, we will cover:

  • Reliability in Komodor
  • Ignoring reliability violations
  • Configuring reliability violations using policies

Reliability in Komodor

Prerequisites 

  • Komodor agent version 2.2.1
  • Metrics should be enabled, as some of the checks use this information. 

Overview

A reliability violation represents an outstanding risk tied to a specific workload/cluster.

Komodor runs daily scans on your clusters and creates a violation for each failed check, along with detailed guidance to help you resolve it. These violations can be found under Reliability (left-side menu) → Violations.

Violations are automatically grouped by “Impact group” - a logical grouping based on how the violation affects the reliability of the cluster.

The page also supports grouping by cluster or no grouping at all, can be scoped to different timeframes, and can be filtered by user preferences using the left-hand-side filters.

Example Use Cases

Over-provisioning/consumption leading to Node pressure

Poorly configured resource requests and limits can generate unexpected load on the infrastructure, affecting the performance and availability of other workloads running on the affected nodes. Komodor raises violations to indicate such cases, clearly shows their impact on the running application, and helps you address the risk and adjust resource allocation accordingly.
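As a generic Kubernetes illustration (not Komodor-specific; all names and values below are placeholders), requests and limits are set per container in the workload spec - requests reserve capacity on the node for scheduling, while limits cap what the container may actually consume:

```yaml
# Hypothetical Deployment snippet showing resource requests and limits.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app            # placeholder name
spec:
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      containers:
        - name: app
          image: example/app:1.0   # placeholder image
          resources:
            requests:
              cpu: "250m"          # scheduler reserves 0.25 CPU cores
              memory: "256Mi"
            limits:
              cpu: "500m"          # container is throttled above 0.5 cores
              memory: "512Mi"      # container is OOM-killed above this
```

Setting requests close to actual usage keeps node scheduling accurate; omitting them entirely is what typically leads to the over-consumption cases described above.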


Cluster-upgrades constraints

  • Upgrading Kubernetes clusters can be challenging: it involves upgrading both the control plane and the workloads running on the cluster.
  • As the Kubernetes API evolves, APIs are periodically reorganized or upgraded. When APIs evolve, the old API is deprecated and eventually removed.

For both of the above cases, Komodor creates “Cluster Upgrades” violations, indicating approaching or already end-of-life (EOL) Kubernetes versions and API deprecations. It clearly displays the gaps and suggests the right cluster/API version to upgrade to.
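As a concrete example of an API deprecation (standard Kubernetes behavior, not specific to Komodor; resource names are placeholders), the Ingress resource moved from the deprecated extensions/v1beta1 API group to networking.k8s.io/v1, and the old group stopped being served in Kubernetes 1.22:

```yaml
# Before: deprecated API group (removed in Kubernetes 1.22)
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: example-ingress        # placeholder name
spec:
  backend:                     # v1beta1-style default backend
    serviceName: example-svc
    servicePort: 80
---
# After: the supported group/version (available since Kubernetes 1.19)
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: example-ingress
spec:
  defaultBackend:              # field renamed and restructured in v1
    service:
      name: example-svc
      port:
        number: 80
```

Note that migrating is usually more than a find-and-replace on apiVersion - as in this case, the schema of the resource often changes between versions as well.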


Supported violations

Impact group - Degraded Service:

  • HPA Reached Max - Detects workloads that frequently experience high demand, causing their HPA to reach its maximum replica count, potentially impacting workload performance and stability.
  • Workload CPU throttling - Surfaces throttled containers to tackle issues such as resource shortages, slow applications, or crashes.
  • Container restarts - Detects containers that suffer from frequent restarts.
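As a generic illustration of the first check above (a hypothetical sketch using standard Kubernetes resources, not Komodor-specific), an HPA that repeatedly sits at its maxReplicas ceiling under load may need a higher ceiling, or the workload may need more resources per replica:

```yaml
# Hypothetical HorizontalPodAutoscaler manifest.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: example-hpa            # placeholder name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-app          # placeholder target workload
  minReplicas: 2
  maxReplicas: 10              # if replicas are pinned at this value under
                               # load, consider raising it or right-sizing pods
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```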

Impact group - Cluster Upgrades:

  • Cluster end of life (EOL) - Detects clusters running with an outdated Kubernetes version, posing security risks and possible compatibility issues with newer tools and applications, and provides upgrade recommendations.
    In addition, managed Kubernetes providers support only the latest versions, so you might not receive support for your cluster if you are running an older version; with some providers this can also incur additional costs (extended support).
  • Deprecated APIs - Raises a violation for deprecated APIs that are no longer recommended for use or are removed from future Kubernetes versions.

Impact group - Node Pressure:

  • Noisy neighbor - Identifies memory-heavy workloads that impact co-located workloads.
  • Under-provisioned workload - Detects services where usage surpasses the allocated resources (requests).

NOTE: If resource requests are not configured or are set to 0, Komodor assumes a request of 1 when calculating the violation impact.

Violation Lifecycle

Violations can have one of the following statuses:

  • Open - When a violation is detected, it is opened with this status.
  • Acknowledged - You can acknowledge a violation to signal awareness and coordinate with other team members.
  • Closed - A violation automatically closes when the issue does not appear in the next scan.
  • Manually Resolved - Violations can be marked as resolved. This applies for the next 7 days, after which the violation will re-open or close depending on whether the issue persists.

NOTE: If you’d like to remove violations from the dashboard, you can add an ignore rule; see Ignore reliability violations below.

Handling Violations

Different violations pose different challenges, which in turn require different strategies for tackling the issue.
Under each impact group, the top violations appear, sorted by severity.
Within each violation, Komodor shows what happened and what you can do:

  • Explanation of the issue.
  • What exactly was detected at runtime (what’s the impact on the cluster/application, including links to problems in Komodor, if relevant).
  • A brief explanation of why it’s important.
  • Guidance on possible ways to resolve it.

After collecting the required data to address the violation, changes can be made to resolve it.

Ignore reliability violations

As not all workloads, namespaces, or clusters are created equal, some organizations might not want to be prompted about specific violations in certain scopes.
Komodor allows you to configure ignore rules to accommodate just that.

Under Reliability → Ignored Checks, you can create ignore rules to exclude checks that are not of interest to you or your organization.

If exclusions are defined, violations of the specified type will not appear on the Reliability dashboard, nor in API call outputs.

To re-include violations in Komodor’s reliability checks, delete the defined ignore rules; the relevant violations will reappear on the platform.


NOTE: A user must have the manage:reliability permission to manage ignore rules.

Configure reliability violations using policies

Komodor offers default configurations for violations, based on market benchmarks and best practices.

However, some issues might be more (or less) crucial than others, depending on application types and organization protocols.

For that, Komodor allows configuring custom thresholds based on specific preferences:

  • A default policy is provided
  • Option to add new policies:
    • To be applied to a specific scope (one or more clusters)
    • Include a policy priority (numeric, a higher number means higher priority)
    • Can include one or more checks, with configurable thresholds
    • Can exclude severities in case they are irrelevant to your organization


NOTE: 

  • The policy priority has to be a unique number.
  • In case of multiple policies with different check configurations for the same scope, the policy with the higher priority will apply.

What’s next?

We are continuing to evolve this feature; new checks and violations will be added regularly.

If any best practices are enforced in your organization and are not yet a part of our Reliability offering, feel free to reach out to us at product@komodor.io and share your needs.

We’re always happy to get feedback and expand our features and capabilities.
