GPU Tool Investigation

Overview

Klaudia’s GPU Tool focuses on GPU issues that occur below the Kubernetes layer - problems that Kubernetes itself cannot see or explain.
Instead of analyzing scheduling or pod configuration, Klaudia investigates node-level GPU, driver, and hardware failures that surface as application crashes, CUDA errors, or unstable workloads.

By running a DaemonSet on GPU nodes, Klaudia gains direct access to kernel logs, NVIDIA driver output, and real-time GPU diagnostics, enabling deep investigation of GPU failures that standard Kubernetes monitoring and observability tools miss.

This is especially critical for AI/ML training, inference, and HPC workloads, where GPU hardware, drivers, and interconnects are often the real source of failure.

How It Works

  • Komodor deploys a DaemonSet on GPU nodes
  • Klaudia accesses:
    • Node kernel logs (dmesg)
    • NVIDIA driver and firmware output
    • GPU diagnostic files on the host
  • When a GPU issue is detected, Klaudia:
    • Executes DCGM commands via PodExec
    • Collects real-time GPU health and hardware telemetry
  • This data is correlated with workload failures to identify true root causes below Kubernetes
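The trigger step above can be sketched in a few lines. This is an illustrative example, not Komodor's actual implementation: it scans kernel log lines (as read from dmesg on a GPU node) for NVIDIA XID error reports, which are the typical signature that warrants deeper GPU diagnostics.

```python
import re

# Matches NVIDIA driver XID reports in kernel logs, e.g.:
#   NVRM: Xid (PCI:0000:3b:00.0): 79, GPU has fallen off the bus.
XID_PATTERN = re.compile(r"NVRM: Xid \((?P<pci>PCI:[0-9a-fA-F:.]+)\): (?P<code>\d+),")

def find_xid_errors(dmesg_lines):
    """Return (pci_address, xid_code) pairs found in kernel log lines."""
    hits = []
    for line in dmesg_lines:
        m = XID_PATTERN.search(line)
        if m:
            hits.append((m.group("pci"), int(m.group("code"))))
    return hits

# Synthetic kernel log excerpt, for illustration only:
sample = [
    "[1234.567] NVRM: Xid (PCI:0000:3b:00.0): 79, GPU has fallen off the bus.",
    "[1234.570] audit: type=1400 ...",
]
print(find_xid_errors(sample))  # [('PCI:0000:3b:00.0', 79)]
```

Once such a signature appears, the follow-up DCGM diagnostics and telemetry collection described above provide the hardware-level context behind the kernel message.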

What Klaudia Analyzes

Klaudia analyzes low-level GPU and system signals that live outside Kubernetes:

  • Linux kernel GPU error logs
  • NVIDIA driver crashes and faults
  • GPU hardware diagnostics and health counters
  • GPU interconnect and PCIe/NVLink communication data

These signals are invisible to standard K8s events, metrics, and logs.

What Klaudia Can Do

Hardware & Driver Failures

  • XID Errors
    NVIDIA error codes indicating GPU faults such as:
    • Memory access violations
    • “GPU has fallen off the bus” errors
  • ECC Memory Failures
    • Correctable errors (early warning signs of degradation)
    • Uncorrectable errors causing crashes and data corruption
  • Driver Errors
    • Driver crashes and initialization failures
    • Version or firmware mismatches
    • Kernel-level GPU faults
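For context, a few of the XID codes behind the fault categories above can be sketched as a lookup table. This is an abridged, illustrative mapping; the authoritative list is in NVIDIA's XID error documentation.

```python
# Abridged, illustrative table of well-known NVIDIA XID codes.
XID_MEANINGS = {
    13: "Graphics engine exception",
    31: "GPU memory page fault (MMU fault / memory access violation)",
    48: "Double-bit (uncorrectable) ECC error",
    79: "GPU has fallen off the bus",
}

def describe_xid(code: int) -> str:
    """Human-readable description for an XID code, if known."""
    return XID_MEANINGS.get(code, f"Unrecognized XID {code}; see NVIDIA docs")

print(describe_xid(79))  # GPU has fallen off the bus
```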

Thermal & Power Issues

  • Thermal throttling
  • Power limit enforcement
  • GPU overheating events affecting performance or stability
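A thermal check of this kind can be approximated with a simple heuristic. This sketch is not Komodor's actual logic: it classifies a GPU's thermal state from its current temperature against the slowdown/shutdown thresholds that tools such as nvidia-smi report per device. The default threshold values below are made-up example numbers.

```python
def thermal_state(temp_c, slowdown_c=93, shutdown_c=98):
    """Classify a GPU temperature reading against example thresholds."""
    if temp_c >= shutdown_c:
        return "critical: at shutdown threshold"
    if temp_c >= slowdown_c:
        return "throttling: clocks reduced to shed heat"
    if temp_c >= slowdown_c - 10:
        return "warning: approaching slowdown threshold"
    return "ok"

print(thermal_state(95))  # throttling: clocks reduced to shed heat
```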

Connectivity & Multi-GPU Issues

  • PCIe failures
  • NVLink communication problems
  • GPU ↔ CPU or GPU ↔ GPU communication errors
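One common signal for the PCIe problems above is a rising replay counter (a count of link-level retransmissions that nvidia-smi reports per GPU). A minimal sketch, with a hypothetical sampling threshold chosen for the example:

```python
def pcie_link_suspect(replays_before, replays_after, threshold=10):
    """True if the PCIe replay counter grew by more than `threshold`
    between two samples, suggesting an unstable link."""
    return (replays_after - replays_before) > threshold

print(pcie_link_suspect(120, 180))  # True: link is retransmitting heavily
```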

Low-Level GPU Processing Errors

  • MMU faults
  • Graphics exceptions
  • Other kernel-reported GPU execution failures
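An MMU fault report in the kernel log usually names the faulting engine and address, which can be extracted for triage. This is a hedged sketch: the exact message format varies across driver versions, and the sample line below is synthetic.

```python
import re

# Pulls the engine name and faulting address from an XID 31 (MMU fault)
# kernel log line. Format varies by driver version; sample is synthetic.
MMU_FAULT = re.compile(r"MMU Fault: ENGINE (\w+).* faulted @ (0x[0-9a-f_]+)",
                       re.IGNORECASE)

def parse_mmu_fault(line):
    """Return (engine, address) if the line is an MMU fault report."""
    m = MMU_FAULT.search(line)
    return (m.group(1), m.group(2)) if m else None

sample = ("NVRM: Xid (PCI:0000:01:00.0): 31, pid=4132, Ch 00000010, "
          "MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_0 faulted @ 0x7f5c_80000000")
print(parse_mmu_fault(sample))  # ('GRAPHICS', '0x7f5c_80000000')
```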

When Klaudia Uses the GPU Tool - Usage Examples

Klaudia automatically engages GPU investigation when failures suggest issues outside Kubernetes control, such as:

Root Cause Analysis (RCA)

  • Training or inference pods failing with unexplained CUDA errors
  • Repeated GPU-related crashes without clear pod-level causes
  • Performance degradation tied to GPU behavior

Unhealthy GPU-Backed Workloads

  • AI/ML training jobs crashing intermittently
  • Inference services failing under load with GPU errors
  • Batch jobs failing with CUDA or driver-level exceptions 

Chat-Driven Investigations

When you ask Klaudia questions like:

  • “Why does my GPU workload keep crashing?”
  • “Are there hardware issues on this GPU node?”
  • “Is this a driver or GPU failure?”
