Background
Kubernetes, when configured correctly, provides a self-healing infrastructure.
It is intended to mitigate local issues by restarting containers, replacing pods, limiting or throttling workloads, and taking other steps to keep your applications running properly.
However, this behavior can sometimes hide a bigger problem that is only discovered down the line, or it can degrade application services and the end-user experience.
Komodor utilizes data collected by the platform (events, metrics, resource specifications) to identify faulty behaviors that could indicate a larger issue but might otherwise go unnoticed, and provides clear guidance on how to resolve them.
Komodor also identifies misconfigurations within clusters and flags them, helping you proactively prevent potential issues and improve overall application reliability.
In this article, we will cover what Reliability risks in Komodor are and what you can do about them.
Reliability in Komodor
Prerequisites
- Komodor agent version 2.2.1
- Metrics should be enabled, as some of the checks use this information.
Overview
A reliability violation represents an outstanding risk tied to a specific workload/cluster.
Komodor runs daily scans on your clusters and creates violations for failed checks, which you can resolve with the help of detailed guidance. These violations can be found under Reliability (left-side menu) → Violations.
Violations are automatically grouped by “Impact group” - a logical division of how the violation affects the reliability of the cluster.
The page also supports grouping by cluster or no grouping at all, can be scoped to different timeframes, and can be filtered to your preferences using the left-hand-side filters.
Example Use Cases
Over-provisioning/consumption leading to Node pressure
Poorly configured resource requests and limits can generate unexpected load on the infrastructure, affecting the overall performance and availability of other workloads running on the affected nodes. Komodor raises violations to indicate such cases, clearly shows their impact on the running application, and helps you address the risk by adjusting the resource allocation accordingly.
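As a point of reference, resource requests and limits are set per container in the pod spec. Below is a minimal, hypothetical sketch (the names, image, and values are placeholders to be tuned to your workload's observed usage):

```yaml
# Hypothetical pod spec illustrating requests vs. limits:
# requests reserve capacity for scheduling; limits cap consumption on the node.
apiVersion: v1
kind: Pod
metadata:
  name: example-app              # placeholder name
spec:
  containers:
    - name: example-app
      image: example/app:1.0     # placeholder image
      resources:
        requests:
          cpu: "250m"            # capacity the scheduler reserves on the node
          memory: "256Mi"
        limits:
          cpu: "500m"            # CPU above this is throttled
          memory: "512Mi"        # exceeding this gets the container OOM-killed
```

Requests sized well below actual usage let the scheduler overcommit nodes, which is one common path to node pressure.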
Cluster-upgrades constraints
- Upgrading Kubernetes clusters can be challenging; it involves upgrading both the control plane and the workloads running on the cluster.
- As Kubernetes evolves, its APIs are periodically reorganized or upgraded. When an API evolves, the old API is deprecated and eventually removed.
For both of the above cases, Komodor creates “Cluster Upgrades” violations, indicating Kubernetes versions that are approaching or have already reached end of life (EOL), as well as API deprecations. It clearly displays the gaps and suggests the right cluster or API version to upgrade to.
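As an illustration of such an API migration, Deployments were originally served under the extensions/v1beta1 API group, which was removed in Kubernetes 1.16; the same object must now use apps/v1. A minimal sketch with placeholder names:

```yaml
# Before (deprecated, removed in Kubernetes 1.16):
#   apiVersion: extensions/v1beta1
# After (current stable API group):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app              # placeholder name
spec:
  replicas: 2
  selector:                      # required (and immutable) under apps/v1
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      containers:
        - name: example-app
          image: example/app:1.0   # placeholder image
```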
Mitigating single points of failure through configuration best practices
Reliable applications should have a reasonable failure margin, providing resource redundancy to avoid availability or performance issues as much as possible.
When such redundancy does not exist and Komodor detects a runtime impact, a SPoF (single point of failure) violation is raised. It clearly visualizes what happened to your replicas during that time, correlates it to any node termination events, node issues, or availability issues, and recommends relevant best practices that can be implemented to prevent such cases in the future.
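One of those best practices is a PodDisruptionBudget, which limits how many replicas can be taken down at once during voluntary disruptions such as node drains. A minimal sketch (the name and label selector are placeholders for your own workload):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example-app-pdb          # placeholder name
spec:
  minAvailable: 1                # keep at least one replica running during drains
  selector:
    matchLabels:
      app: example-app           # placeholder label matching your workload's pods
```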
Supported violations
Impact group - Degraded Service:
- HPA Reached Max - Detects workloads that frequently experience high demand, causing their HPA to reach its maximum replica count and potentially impacting workload performance and stability (see the HPA sketch after this list).
- Workload CPU throttling - Surfaces throttled containers to help you tackle issues such as resource shortages, slow applications, or crashes.
- Container restarts - Detects containers that suffer from frequent restarts.
- Single point of failure - Surfaces cases in which a best-practice misconfiguration (one of: missing PDB, low replica count, missing topology spread constraints, or an HPA minimum of 1) coincides with a runtime impact (a node issue or termination that led to an availability issue), leaving your application with a single point of failure.
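When an HPA is persistently pinned at its maximum, one remediation (capacity permitting) is to raise maxReplicas; likewise, a minReplicas of 1 leaves a single point of failure. A minimal HPA sketch, with placeholder names and values:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: example-app-hpa          # placeholder name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-app            # placeholder target workload
  minReplicas: 2                 # a minimum of 1 is itself a single point of failure
  maxReplicas: 10                # raise if the HPA is frequently pinned at max
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```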
Impact group - Cluster Upgrades:
- Cluster end of life (EOL) - Detects clusters running an outdated Kubernetes version, posing security risks and possible compatibility issues with newer tools and applications, and provides upgrade recommendations. In addition, managed Kubernetes providers support only the latest versions, so you might not receive support for your cluster if you are running an older version, and some providers charge extra for extended support.
- Deprecated APIs - Raises a violation for deprecated APIs that are no longer recommended for use or are removed in newer Kubernetes versions.
Impact group - Node Pressure:
- Noisy neighbor - Identifies memory-heavy workloads that impact co-located workloads.
- Under-provisioned workload - Detects services whose usage surpasses the allocated resources (requests); see the sketch below.
NOTE: If resource requests are not configured or are set to 0, Komodor assumes the requests are set to 1 when calculating the violation impact.
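To remediate an under-provisioned workload, raise the container's requests toward its observed steady-state usage so the scheduler reserves enough capacity. A minimal sketch, assuming hypothetical observed usage of roughly 400m CPU and 600Mi memory:

```yaml
# Fragment of a pod template; names and values are placeholders.
containers:
  - name: example-app
    image: example/app:1.0
    resources:
      requests:
        cpu: "400m"              # sized to observed steady-state usage
        memory: "640Mi"          # observed usage plus a little headroom
```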
Violation Lifecycle
Violations can have one of the following statuses:
- Open - When a violation is detected, it is opened with this status.
- Acknowledged - You can acknowledge a violation to show awareness of it and to coordinate with other team members.
- Closed - A violation will automatically close when the issue does not appear in the next scan.
- Manually Resolved - Violations can be marked as resolved. This applies for the next 7 days, after which the violation will re-open or close depending on whether the issue persists.
NOTE: If you’d like to remove violations from the dashboard, you can add an ignore rule; please refer to Ignore Reliability violations.
Handling Violations
Different violations pose different challenges, which in turn require different strategies for tackling the issue.
Under each impact group, the top violations appear, sorted by severity.
Within each violation, Komodor shows what happened and what you can do:
- Explanation of the issue.
- What exactly was detected at runtime (what’s the impact on the cluster/application, including links to problems in Komodor, if relevant).
- A brief explanation of why it’s important.
- Guidance on possible ways to resolve it.
After collecting the required data to address the violation, changes can be made to resolve it.
Standards violations
Misconfiguration of Kubernetes workloads can lead to application downtime. Fortunately, Kubernetes offers various configuration options and best-practice recommendations to make sure your applications are fault tolerant and to help you stay up and running.
Komodor, as part of its Reliability set of tools, surfaces best practice violations to help you achieve exactly that.
How it works:
- Komodor checks your workloads against common reliability configuration checks.
- If the check criteria are not met (the best practice is not configured), a best practice violation is created.
- In it, you can find information about what the issue is, why it’s important, and how you can fix it.
- You also get an impact indication - whether Komodor detected a runtime impact related to the misconfiguration. In such cases, the best practice violation will be linked to the corresponding runtime violation.
Supported best practice violations
- Missing CPU / memory requests
- Missing CPU / memory limits
- Missing pod priority
- Missing HPA
- Missing readiness probe
- Missing liveness probe
- Missing pod disruption budget
- Low number of replicas (replicas = 1)
- Missing topology spread constraints
- HPA min is 1
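For reference, here is a minimal Deployment sketch covering the workload-level checks above (PDB and HPA sketches appear earlier in this article); all names, values, and the priority class are hypothetical placeholders:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app                        # placeholder name
spec:
  replicas: 2                              # more than one replica
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      priorityClassName: example-priority  # assumes this PriorityClass exists
      topologySpreadConstraints:           # spread replicas across nodes
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: example-app
      containers:
        - name: example-app
          image: example/app:1.0           # placeholder image
          resources:
            requests:                      # CPU / memory requests
              cpu: "250m"
              memory: "256Mi"
            limits:                        # CPU / memory limits
              cpu: "500m"
              memory: "512Mi"
          readinessProbe:                  # gate traffic until the app is ready
            httpGet:
              path: /healthz               # placeholder endpoint
              port: 8080
          livenessProbe:                   # restart the container if it hangs
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 10
```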
What’s next?
We are continuing to evolve this feature; new checks and violations will be added regularly.
If any best practices are enforced in your organization and are not yet a part of our Reliability offering, feel free to reach out to us at product@komodor.io and share your needs.
We’re always happy to get feedback and expand our features and capabilities.