Overview
The Workflows feature in Komodor provides visibility into and monitoring of AI/ML workflows on Kubernetes. It lets you track workflow runs and troubleshoot issues across workflow engines. Workflows simplifies the management of workflows from popular engines such as Argo Workflows, Spark and Apache Airflow, and supports custom workflows through specific pod labels.
With Workflows, you can:
- View and monitor your workflows
- Quickly identify issues with pods in past runs
- Gain insight into related infrastructure events, such as node terminations or node issues, that may impact a workflow
Accessing Workflows
- Navigate to Workflows:
  - Access Workflows under the Kubernetes AddOns section in the left sidebar. This section provides access to monitoring and configuration tools specifically for Kubernetes-based workloads.
- View the Workflows Dashboard:
  - The Workflows Dashboard displays all monitored workflows, with separate tabs for Argo Workflows, Airflow, Spark and Custom Engines.
  - Each tab includes a table listing workflows, organized by DAG/Template, with details like the latest run status, duration, and any identified issues.
Out-of-the-Box Monitoring for Argo Workflows, Airflow and Spark
- Argo Workflows, Airflow and Spark Monitoring:
  - For workflows in Argo, Airflow and Spark, Komodor automatically monitors and displays workflows without additional setup. Simply navigate to the relevant tab to view active and historical runs.
  - The dashboard displays workflows organized by DAG/Template and shows the status (Running/Completed) and any issues for the latest run.
- Understanding Status Indicators:
  - Workflow statuses are calculated from the statuses of the pods within the workflow. Therefore, status and duration reflect pod-based data, and updates may be delayed by 10 minutes. Workflows display a Running or Completed status, with an indicator if issues were detected.
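To make the idea of a pod-based status concrete, here is a minimal Python sketch of how pod phases could roll up into a Running or Completed status with an issue indicator. This is an illustration only, not Komodor's actual logic: the function name, the WorkflowStatus type, and the rule that Failed or Unknown pods count as issues are all assumptions. The phase names themselves are the standard Kubernetes pod phases.

```python
from dataclasses import dataclass

# Standard Kubernetes pod phases, grouped for this illustration.
ACTIVE_PHASES = {"Pending", "Running"}
ISSUE_PHASES = {"Failed", "Unknown"}  # assumption: these count as "issues"


@dataclass
class WorkflowStatus:
    status: str       # "Running" or "Completed", as shown in the dashboard
    has_issues: bool  # True if any pod looks unhealthy


def summarize_workflow(pod_phases):
    """Hypothetical roll-up of pod phases into a workflow status.

    Assumption: a run is "Running" while any pod is still active, and it
    is flagged with an issue when any pod is Failed or Unknown.
    """
    running = any(phase in ACTIVE_PHASES for phase in pod_phases)
    has_issues = any(phase in ISSUE_PHASES for phase in pod_phases)
    return WorkflowStatus("Running" if running else "Completed", has_issues)


# One pod finished, one still running, one failed.
print(summarize_workflow(["Succeeded", "Running", "Failed"]))
```

Because the status is derived from pods rather than reported by the engine, a run whose pods have all finished would show as Completed in this sketch even if one of them failed; the issue indicator carries that information.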
Custom Workflow Support
Komodor can also monitor custom workflows once you add specific labels to your workflow’s pods.
- Adding Custom Workflow Labels:
  - To enable Komodor to identify and monitor custom workflows, label your pods with the following keys:
    - Workflow DAG ID: app.komodor.com/WorkflowDagId (e.g., data-processing-prod)
    - Workflow Engine: app.komodor.com/WorkflowEngine (e.g., MLFlow)
    - Workflow Run ID: app.komodor.com/WorkflowRunId (e.g., run-3235)
    - Workflow Task ID: app.komodor.com/WorkflowTaskId (e.g., validation-1121)
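As an illustration of the four labels in practice, the following Python sketch builds a minimal Pod manifest that carries them and checks each value against the Kubernetes label-value rules. The label keys and example values come from the list above; the function name, pod name, container name, and image are placeholders introduced for this example, not part of Komodor.

```python
import json
import re

# Label keys from the list above; values are the documented examples.
KOMODOR_WORKFLOW_LABELS = {
    "app.komodor.com/WorkflowDagId": "data-processing-prod",
    "app.komodor.com/WorkflowEngine": "MLFlow",
    "app.komodor.com/WorkflowRunId": "run-3235",
    "app.komodor.com/WorkflowTaskId": "validation-1121",
}

# Kubernetes label values: at most 63 characters, alphanumeric at both ends,
# with '-', '_' and '.' allowed in between. Kubernetes also permits empty
# values; this sketch rejects them because they would not identify anything.
LABEL_VALUE_RE = re.compile(r"[A-Za-z0-9]([A-Za-z0-9_.-]{0,61}[A-Za-z0-9])?")


def build_pod_manifest(name, image, labels):
    """Return a minimal Pod manifest (as a dict) carrying the given labels."""
    for key, value in labels.items():
        if not LABEL_VALUE_RE.fullmatch(value):
            raise ValueError(f"invalid label value for {key}: {value!r}")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name, "labels": dict(labels)},
        "spec": {
            "restartPolicy": "Never",
            "containers": [{"name": "task", "image": image}],
        },
    }


# Pod name and image are placeholders for illustration only.
manifest = build_pod_manifest(
    "validation-1121", "example.com/validator:latest", KOMODOR_WORKFLOW_LABELS
)
print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and applied with kubectl apply -f, which accepts JSON as well as YAML. In a real pipeline the run and task IDs would typically be generated per run by whatever launches the pods.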
- Viewing Custom Workflows:
  - Once labeled, custom workflows will appear under the Custom Engines tab in the Workflows Dashboard.
  - As with Argo, Airflow and Spark, workflows are organized by DAG/Template, showing the latest run status and any issues identified.
Workflow Pod Monitoring
The Workflow Pod Monitor is automatically enabled for each cluster, operating on a fault-based monitoring approach that tracks Argo, Airflow, and labeled pods. This monitor provides real-time insights into workflow performance and highlights any issues in pod execution.
For more detailed information, see our Monitors guide.
[Screenshot: example of a workflow pod issue]
Workflow View
The workflow view allows you to observe and troubleshoot your workflows easily.
The screen contains:
Runs Dropdown:
For each workflow template, you can switch between different runs using the Runs Dropdown. This allows you to review historical runs and compare performance across instances.
Timeline and events views:
Each workflow run contains detailed information about pod phases, issues, and infrastructure events. This data allows you to identify potential bottlenecks, troubleshoot failures, and understand the health of each workflow.
- Tracking Issues with Pod Phases:
  For each workflow’s pods, Komodor tracks all pod phases and pod issues within the workflow’s timeline, showing each task as a 'swimlane' in the workflow view.
- Tracking Correlated Infrastructure Events:
  For each workflow, Komodor identifies correlated infrastructure events (e.g., node terminations, node issues) and displays them to give context to potential workflow disruptions.
Timeline capabilities:
- Toggle between the timeline and event list views.
- Show only pods with issues.
Please note: Workflow data is retained for 3 days.