Overview

Komodor provides rich out-of-the-box monitoring that captures a wide variety of issues from day one. From the moment you onboard a cluster, Komodor monitors significant events in it, so you can review when specific events and issues occurred.

The Realtime Health Policy Monitors in Komodor detect, in real time, issues that arise in the scope you specify (cluster, namespace, service, and so on), based on the settings configured for each monitor.

See the configuration found below Monitor Logic

[Image: Monitors for docs.png]

Monitor Logic

When a cluster is onboarded to Komodor, all of the monitors shown above are created for it out of the box. This allows issues to be recognized from the first moment, so Komodor begins providing value immediately.

When a service shows as UNHEALTHY in the service view, that status is directly linked to the monitor set up for the cluster.

The monitor configuration for that service then shows an Availability issue.

Realtime Health Monitors can be used to detect infrastructure failures (such as node or disk failures), as well as application-level issues caused by a failed deployment or any other reason.

Automated investigation playbooks - when any failure occurs, Komodor initiates automated playbooks that enrich the failure event with additional information, helping you identify the cause of the issue and providing suggested remediation steps to resolve it.

The monitors are highly configurable: you can set different thresholds for different reasons and adjust the scope of each monitor (workload, namespace, cluster).

NOTE: Understanding the Notification Selector for Monitors

The Notification Selector in Komodor monitors is used to define how and where notifications should be sent when certain conditions are met. However, it’s important to note:

If no notification sink (e.g., Slack, email, PagerDuty) is set up, the triggered notification will not be delivered anywhere. This is because a sink acts as the destination for these notifications.
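The role of a sink can be illustrated with a minimal sketch. This is not Komodor's implementation or API; `dispatch` and the sink callables are hypothetical names used only to show that a triggered notification with no configured sink is delivered nowhere.

```python
from typing import Callable, Dict, List

# A sink is modelled as any callable that accepts the notification text.
# (Hypothetical shape, for illustration only.)
Sink = Callable[[str], None]


def dispatch(message: str, sinks: Dict[str, Sink]) -> List[str]:
    """Send `message` to every configured sink and return the sink names used."""
    delivered = []
    for name, send in sinks.items():
        send(message)
        delivered.append(name)
    # With no sinks configured the loop never runs: nothing is delivered.
    return delivered
```

Calling `dispatch("Availability issue", {})` returns an empty list, which mirrors the behavior described above: the monitor triggers, but the notification has no destination.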

Removing a Monitor’s Logic

If you do not want the monitor’s logic to apply to the services within its scope:

Navigate to the monitor in the Komodor UI.
Delete the relevant rule from the monitor’s configuration.

This ensures that the monitor no longer evaluates or triggers notifications for those services.

If you have any further questions or need clarification, feel free to reach out to our support team.

Monitors can also trigger notifications based on the cause of the issue.

Integration with industry-leading tools

Komodor monitors are also configurable programmatically for integration into automated workflows.

Monitor Types

Availability Monitor

Monitors your workload’s health (available replicas < desired replicas) and creates an Availability issue on the Events and Service timelines with the information needed to resolve it. The Availability monitor will not be triggered during an active rollout.

Please note: modifying the scope of an Availability monitor might remove events from the timeline.

The monitor is triggered when a Service (Deployment/DaemonSet/Rollout/StatefulSet) has fewer available replicas than desired replicas, per the specified conditions, for the defined duration.
The following checks are performed:
Pods health. For each Pod, we provide the following:
- Phase, Reason, Pod events
- Containers list with their state, reason, logs, and metrics (CPU/Memory)
Correlated latest deployments
Correlated node issues & node terminations
For out-of-memory and evicted issues, additional checks are performed:
Check if the container limit has been reached
Check if the memory limit decreased before the issue
Noisy neighbors report

Please note: data provided in the Availability issue checks is a snapshot of the moment the issue occurred.
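The trigger condition for the Availability monitor can be sketched as follows. This is a simplified illustration under stated assumptions, not Komodor's implementation: `ReplicaSample` and `should_trigger` are hypothetical names, and we assume that a healthy sample or an active rollout resets the unhealthy window.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class ReplicaSample:
    """One observation of a workload's replica counts (hypothetical shape)."""
    at: datetime
    available: int
    desired: int
    rollout_active: bool = False


def should_trigger(samples: List[ReplicaSample], duration: timedelta) -> bool:
    """Return True when available < desired held for at least `duration`.

    Samples must be ordered oldest to newest. Assumption: a healthy sample,
    or a sample taken during an active rollout, resets the unhealthy window.
    """
    window_start: Optional[datetime] = None
    for sample in samples:
        unhealthy = sample.available < sample.desired and not sample.rollout_active
        if not unhealthy:
            window_start = None
            continue
        if window_start is None:
            window_start = sample.at
        if sample.at - window_start >= duration:
            return True
    return False
```

For example, three consecutive samples with 2 of 3 replicas available, taken 30 seconds apart, trigger a monitor configured with a 60-second duration but not one configured with a 90-second duration.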

Please note: you can customize your monitor's conditions and alerts for specific error categories. Go to the "Monitors" screen, choose the relevant cluster, and click "Add rule" under "Availability monitor". You can then choose the desired categories in the "Trigger conditions" section.

Each category includes the following reasons:

Category	Reasons
NonZeroExitCode	NonZeroExitCode
Unhealthy	Unhealthy
OOMKilled	OOMKilled, NonZeroExitCode - Exit code: 137
Creating/Initializing	ContainerCreating, PodInitializing, PodNotReady, ContainersNotReady
BackOff	BackOff, CrashLoopBackOff, ImagePullBackOff
Infrastructure	NodeNotReady, NetworkNotReady, Evicted, NodeShutdown, Terminated, Preempted
Scheduling	FailedScheduling, NotTriggerScaleUp, PodPending, NodeAffinity
Image	ErrImagePull, InvalidImageName
Volume/Secret/ConfigMap	FailedMount, FailedAttachVolume, CreateContainerConfigError
Container Creation	CreateContainerError, RunContainerError, ContainerCannotRun, ContainerStatusUnknown, ReadinessGatesNotReady
Pod Termination	FailedPreStopHook, FailedKillPod
Completed	Completed
Other	Any reason that was not mapped in other categories

Deploy Monitor

A Deploy monitor will be triggered whenever a resource is being deployed/rolled out.
Using the Deploy Monitor configuration, you can define which resources to watch (scope) and on which occasion (failed deploy, successful deploy, or both) you would like to be notified in one of your notification channels (Slack/Teams).
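The occasion setting can be thought of as a simple filter on the deploy outcome. The following is a minimal sketch with hypothetical names (`NotifyOn`, `should_notify`); it does not reflect Komodor's actual configuration schema.

```python
from enum import Enum


class NotifyOn(Enum):
    """Occasions on which a Deploy monitor notifies (hypothetical names)."""
    FAILED = "failed"
    SUCCESSFUL = "successful"
    BOTH = "both"


def should_notify(notify_on: NotifyOn, deploy_succeeded: bool) -> bool:
    """Decide whether a finished deploy should produce a notification."""
    if notify_on is NotifyOn.BOTH:
        return True
    if notify_on is NotifyOn.SUCCESSFUL:
        return deploy_succeeded
    # NotifyOn.FAILED: notify only when the deploy did not succeed.
    return not deploy_succeeded
```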

Node Monitor

Monitors Nodes with faulty Conditions.

The monitor is triggered when a Node's Conditions change to a faulty Condition and the faulty condition(s) persist through the configured duration.
We perform the following checks as part of our investigation:
Is the node ready?
Is the node overcommitted?
Is the node under pressure?
Are system Pods healthy?
Is the network available?
Are Pods being evicted?
Are user pods healthy?
Is the node schedulable?
Node overall resource consumption, including the top 5 pod consumers (requires metrics-server to be installed)
Notes
The Node detector currently does not handle nodes in an Unknown state. This means Spot interruptions or scale-down events will not be handled by the workflow, and other scenarios could be affected as well.
The monitor only runs on Nodes that were created more than 3 minutes ago (there is a 3-minute delay from Node creation time before the workflow runs).
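The trigger rules and notes above can be sketched as a small check. This is an illustration only, not Komodor's implementation: the function names are hypothetical, and the set of conditions treated as faulty is an assumption based on the standard Kubernetes node condition types (Ready is healthy when "True"; the pressure and network conditions are healthy when "False").

```python
from datetime import datetime, timedelta
from typing import Dict, List

# 3-minute delay after Node creation, per the notes above.
MIN_NODE_AGE = timedelta(minutes=3)

# Healthy status for each standard Kubernetes node condition type.
# Assumption: any other definite status counts as faulty.
HEALTHY_STATUS = {
    "Ready": "True",
    "MemoryPressure": "False",
    "DiskPressure": "False",
    "PIDPressure": "False",
    "NetworkUnavailable": "False",
}


def faulty_conditions(conditions: Dict[str, str]) -> List[str]:
    """Return the condition types that are in a faulty state.

    Conditions with status "Unknown" are skipped, since the notes above
    say nodes in an Unknown state are not handled.
    """
    faulty = []
    for condition_type, status in conditions.items():
        healthy = HEALTHY_STATUS.get(condition_type)
        if healthy is None or status == "Unknown":
            continue
        if status != healthy:
            faulty.append(condition_type)
    return faulty


def should_investigate(
    conditions: Dict[str, str],
    node_created_at: datetime,
    faulty_since: datetime,
    now: datetime,
    duration: timedelta,
) -> bool:
    """Apply the node-age delay and the configured duration."""
    if now - node_created_at <= MIN_NODE_AGE:
        return False  # Node is too new; the workflow does not run yet.
    if not faulty_conditions(conditions):
        return False
    return now - faulty_since >= duration
```

A node whose Ready condition has been "False" for 5 minutes would trigger a monitor with a 2-minute duration, provided the node itself was created more than 3 minutes ago.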

PVC Monitor

Monitors PVCs in a pending state.

The monitor is triggered when a PVC remains in a pending state for the defined duration.
We perform the following checks as part of our investigation:
PVC creation, utilization, and readiness issues
Volume provisioner-related issues
PVC spec change
Identify the impact on your services

Job Monitor

The Job Monitor will be triggered when a job fails execution.
It allows you to get notified of Job failures on the defined scope.

CronJob Monitor

The CronJob Monitor will be triggered when a job (managed by CronJob) fails execution.
It allows you to get notified either of the first failure only (the first failure after a success) or of any CronJob failure in the defined scope.
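The difference between the two notification modes can be sketched as follows. The function name and the boolean run history are hypothetical, and the handling of a failure with no earlier run is an assumption: here it does not count as a "first failure after a success".

```python
from typing import List


def should_notify_cronjob(history: List[bool], first_failure_only: bool) -> bool:
    """Decide whether the most recent CronJob run should notify.

    `history` lists run outcomes from oldest to newest (True = success).
    """
    if not history or history[-1]:
        return False  # No runs yet, or the latest run succeeded.
    if not first_failure_only:
        return True  # "Any failure" mode: every failed run notifies.
    # "First failure" mode: notify only if the previous run succeeded.
    # Assumption: a failure with no prior run does not notify.
    return len(history) >= 2 and history[-2]
```

With the history success, failure, failure, the second failure notifies in "any failure" mode but not in "first failure" mode.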

Workflow Pod Monitor

The Workflows Pod Monitor is specifically designed to improve the reliability and accuracy of monitoring for workflow engine pods, such as those used by Airflow and Argo Workflows.
This feature automatically detects faults and issues that arise in these pods, ensuring quicker identification and resolution of problems that may affect your workflow processes.

The monitor triggers investigations whenever there are faults detected in pod exit codes or conditions.
This ensures that problems are caught instantly, allowing you to take action promptly.

The Workflows Pod Monitor is enabled by default, targeting workflow-related pods.
If you’re using Airflow or Argo Workflows in your environment, this monitor will be automatically applied.

Notifications can be customized based on your preference, ensuring you receive alerts in the most effective format for your workflow.

Real-Time Health Policy Monitors