Workflows 🔀

Overview

The Workflows feature in Komodor provides full visibility and monitoring of AI/ML workflows on Kubernetes. This feature allows you to track workflow runs and troubleshoot issues across various workflow engines. Workflows is designed to simplify the management of workflows from popular engines like Argo Workflows, Spark and Apache Airflow, with support for custom workflows by using specific labels.

With Workflows, you can:

  • View and monitor your workflows
  • Quickly identify issues with pods in past runs
  • Gain insight into related infrastructure events, such as node terminations or node issues, that may impact the workflow

Accessing Workflows

  1. Navigate to Workflows:
      • Access Workflows under the Kubernetes AddOns section in the left sidebar. This section provides access to monitoring and configuration tools specifically for Kubernetes-based workloads.

  2. View the Workflows Dashboard:
      • The Workflows Dashboard displays all monitored workflows, with separate tabs for Argo Workflows, Airflow, Spark and Custom Engines.
      • Each tab includes a table listing workflows, organized by DAG/Template, with details like the latest run status, duration, and any identified issues.

Out-of-the-Box Monitoring for Argo Workflows, Airflow and Spark

Argo Workflows, Airflow and Spark Monitoring:

  1. For workflows in Argo, Airflow and Spark, Komodor automatically monitors and displays workflows without additional setup. Simply navigate to the relevant tab to view active and historical runs.
  2. The dashboard displays workflows organized by DAG/Template and shows the status (Running/Completed) and any issues for the latest run.
  3. Understanding Status Indicators:
    - Workflow statuses are calculated from the statuses of the pods within the workflow. Therefore, status and duration reflect pod-based data, and updates may be delayed by 10 minutes. Workflows display a Running or Completed status, with an indicator if issues were detected.
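
To make the pod-based calculation more concrete, here is a minimal sketch of how pod phases could be rolled up into a single run status. The roll-up rule and the names `summarize_run` and `RunSummary` are assumptions made for illustration; Komodor's actual calculation is not documented here and may differ. Only the pod phase names (Pending, Running, Succeeded, Failed) are standard Kubernetes values.

```python
from dataclasses import dataclass
from typing import Iterable

# Standard Kubernetes pod phases that mean a pod has not finished yet.
ACTIVE_PHASES = {"Pending", "Running"}


@dataclass(frozen=True)
class RunSummary:
    status: str       # "Running" or "Completed", as shown in the dashboard
    has_issues: bool  # True if at least one pod failed


def summarize_run(pod_phases: Iterable[str]) -> RunSummary:
    """Roll pod phases up into one run-level status (assumed rule, for illustration)."""
    phases = list(pod_phases)
    still_running = any(phase in ACTIVE_PHASES for phase in phases)
    has_issues = any(phase == "Failed" for phase in phases)
    return RunSummary("Running" if still_running else "Completed", has_issues)


# A run with one pod still active and one failed pod.
print(summarize_run(["Succeeded", "Running", "Failed"]))
# A run where every pod finished successfully.
print(summarize_run(["Succeeded", "Succeeded"]))
```

Under this assumed rule, a run stays Running while any pod is still active, and the issue indicator is independent of whether the run has finished.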

Custom Workflow Support

Komodor can also monitor custom workflows once you add specific labels to your workflow’s pods.

  1. Adding Custom Workflow Labels:
    • To enable Komodor to identify and monitor custom workflows, label your pods with the following keys:
      • Workflow DAG ID: app.komodor.com/WorkflowDagId (e.g., data-processing-prod)
      • Workflow Engine: app.komodor.com/WorkflowEngine (e.g., MLFlow)
      • Workflow Run ID: app.komodor.com/WorkflowRunId (e.g., run-3235)
      • Workflow Task ID: app.komodor.com/WorkflowTaskId (e.g., validation-1121)
  2. Viewing Custom Workflows:
    1. Once labeled, custom workflows will appear under the Custom Engines tab in the Workflows Dashboard.
    2. As with Argo Workflows and Airflow, workflows are organized by DAG/Template, showing the latest run status and any identified issues.
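
One way to apply the four labels listed above is to set them in the pod's metadata when the pod is created. The sketch below builds a minimal Pod manifest as a Python dictionary, using the example values from this article. The pod name, container name, and image are hypothetical placeholders, and the validation pattern reflects the general Kubernetes label-value syntax rather than anything specific to Komodor.

```python
import json
import re

# Label keys from this article; values are the article's example values.
KOMODOR_WORKFLOW_LABELS = {
    "app.komodor.com/WorkflowDagId": "data-processing-prod",
    "app.komodor.com/WorkflowEngine": "MLFlow",
    "app.komodor.com/WorkflowRunId": "run-3235",
    "app.komodor.com/WorkflowTaskId": "validation-1121",
}

# Kubernetes label values: at most 63 characters, alphanumeric at both ends,
# with '-', '_' and '.' allowed in between.
LABEL_VALUE_RE = re.compile(r"[A-Za-z0-9]([A-Za-z0-9._-]{0,61}[A-Za-z0-9])?")


def build_labeled_pod(name, image, labels):
    """Return a minimal Pod manifest carrying the given labels."""
    for key, value in labels.items():
        if not LABEL_VALUE_RE.fullmatch(value):
            raise ValueError(f"invalid label value for {key}: {value!r}")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name, "labels": dict(labels)},
        "spec": {
            "restartPolicy": "Never",
            # Hypothetical container; replace with your task's image.
            "containers": [{"name": "task", "image": image}],
        },
    }


pod = build_labeled_pod(
    "validation-1121",                # hypothetical pod name
    "example.com/validation:latest",  # hypothetical image
    KOMODOR_WORKFLOW_LABELS,
)
print(json.dumps(pod, indent=2))
```

Kubernetes accepts JSON manifests as well as YAML, so the printed output can be saved to a file and applied with `kubectl apply -f`. If your pods are created by a controller or a workflow engine rather than directly, the same labels would go in that resource's pod template instead.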

Workflow Pod Monitoring

The Workflow Pod Monitor is automatically enabled for each cluster, operating on a fault-based monitoring approach that tracks Argo, Airflow, and labeled pods. This monitor provides real-time insights into workflow performance and highlights any issues in pod execution.

For more detailed information, check out our Monitors guide.

[Screenshot: example of a workflow pod issue]

Workflow View

The workflow view allows you to observe and troubleshoot your workflows easily.

The screen contains:

Runs Dropdown:

For each workflow template, you can switch between different runs using the Runs Dropdown. This allows you to review historical runs and compare performance across instances.

Timeline and events views:

Each workflow run contains detailed information about pod phases, issues, and infrastructure events. This data allows you to identify potential bottlenecks, troubleshoot failures, and understand the health of each workflow.

  1. Tracking Issues with Pod Phases:
    For each workflow’s pods, Komodor tracks all pod phases and pod issues within the workflow’s timeline, showing each task as a 'swimlane' in the workflow view.
  2. Tracking Correlated Infrastructure Events:
    For each workflow, Komodor identifies correlated infrastructure events (e.g., node terminations, node issues) and displays them in the workflow view to give context to potential workflow disruptions.

Timeline capabilities:

  1. Toggle between the timeline and event list views.
  2. Show only pods with issues, using the dedicated toggle.

Please note: workflow data is retained for 3 days.

Klaudia RCA for Workflows

Klaudia RCA for Workflows extends Klaudia's root cause analysis capabilities beyond individual pods to entire workflow executions. This feature provides comprehensive failure analysis for workflow-based systems including Argo Workflows, Apache Airflow, Apache Spark, and custom workflow implementations.

Workflows can fail across multiple pods and steps, and often require critical context from the workflow resource itself to understand the full picture. Klaudia RCA for Workflows analyzes workflow failures holistically, correlating issues across all related pods while incorporating the complete workflow Custom Resource Definition (CRD).

This feature is generally available and automatically activates when workflow issues are detected.

What Klaudia RCA for Workflows Can Do

Klaudia RCA for Workflows provides end-to-end root cause analysis specifically designed for complex, multi-step workflow executions:

Workflow-Scoped Analysis
Rather than examining a single failing pod in isolation, Klaudia analyzes the entire workflow execution as a cohesive unit. This enables detection of cascading failures, dependency issues, and cross-step problems that would be invisible in pod-level analysis.

Automatic Pod Correlation
When analyzing a workflow, Klaudia automatically identifies and retrieves all failing pods associated with that workflow run, eliminating the need to manually investigate each pod separately and ensuring no relevant failure data is overlooked.
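
As a conceptual illustration only (this is not Klaudia's implementation), the idea of collecting every failing pod that belongs to a run can be sketched as a simple grouping step. The sketch assumes pods are available as dictionaries shaped like Kubernetes Pod objects and that each pod carries the `app.komodor.com/WorkflowRunId` label described in the Custom Workflow Support section; the function name and sample data are hypothetical.

```python
from collections import defaultdict

RUN_ID_LABEL = "app.komodor.com/WorkflowRunId"


def failing_pods_by_run(pods):
    """Group the names of failed pods by their workflow run ID label."""
    grouped = defaultdict(list)
    for pod in pods:
        metadata = pod.get("metadata", {})
        run_id = metadata.get("labels", {}).get(RUN_ID_LABEL)
        phase = pod.get("status", {}).get("phase")
        # Skip pods that are unlabeled or did not fail.
        if run_id is not None and phase == "Failed":
            grouped[run_id].append(metadata["name"])
    return dict(grouped)


# Hypothetical sample data: two failed pods in one run, one healthy pod in another.
sample_pods = [
    {"metadata": {"name": "extract-1", "labels": {RUN_ID_LABEL: "run-3235"}},
     "status": {"phase": "Failed"}},
    {"metadata": {"name": "validate-1", "labels": {RUN_ID_LABEL: "run-3235"}},
     "status": {"phase": "Failed"}},
    {"metadata": {"name": "extract-2", "labels": {RUN_ID_LABEL: "run-3236"}},
     "status": {"phase": "Succeeded"}},
]
print(failing_pods_by_run(sample_pods))
```

Grouping by run rather than inspecting pods one at a time is what makes it possible to look at related failures together.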

Multi-Pod Root Cause Correlation
Klaudia examines failures across all workflow pods together, identifying patterns and relationships between failures in different steps or tasks. This reveals whether failures are independent issues or symptoms of a common root cause.

CRD-Aware Investigation
The analysis incorporates the workflow's Custom Resource Definition YAML, including configurations specific to Spark, Argo, and other workflow engines. This ensures that workflow-level settings, dependencies, and constraints are considered in the root cause determination.

Comprehensive Context
By combining pod logs, pod events, workflow resource configurations, and CRD specifications, Klaudia delivers a complete picture of what went wrong and why, leading to more accurate diagnoses and faster resolution.

When to Use It

When It Appears
The workflow RCA investigation option is displayed only for workflows experiencing failures or issues. Successfully completed workflow executions do not present the RCA option.

Ideal Use Cases
This feature is particularly valuable when troubleshooting:

  • Multi-step workflow failures where the root cause isn't immediately apparent
  • Situations where multiple pods in a workflow execution have failed
  • Complex data pipelines where failures may cascade across dependent tasks
  • Scenarios where pod-level RCA has been insufficient or inconclusive

By analyzing workflow failures at the appropriate scope and incorporating all relevant context, Klaudia RCA for Workflows helps teams quickly understand and resolve issues in their orchestrated workloads, reducing mean time to resolution.

 
