GPU Tool Investigation

Overview

Klaudia’s GPU Tool focuses on GPU issues that occur below the Kubernetes layer - problems that Kubernetes itself cannot see or explain.
Instead of analyzing scheduling or pod configuration, Klaudia investigates node-level GPU, driver, and hardware failures that surface as application crashes, CUDA errors, or unstable workloads.

By running a DaemonSet on GPU nodes, Klaudia gains direct access to kernel logs, NVIDIA driver output, and real-time GPU diagnostics, enabling deep investigation of GPU failures that standard Kubernetes monitoring and observability tools miss.

This is especially critical for AI/ML training, inference, and HPC workloads, where GPU hardware, drivers, and interconnects are often the real source of failure.

How It Works

  • Komodor deploys a DaemonSet on GPU nodes
  • Klaudia accesses:
    • Node kernel logs (dmesg)
    • NVIDIA driver and firmware output
    • GPU diagnostic files on the host
  • When a GPU issue is detected, Klaudia:
    • Executes DCGM commands via PodExec
    • Collects real-time GPU health and hardware telemetry
  • This data is correlated with workload failures to identify true root causes below Kubernetes
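The trigger step above can be sketched in a few lines. This is an illustrative example, not Komodor's actual implementation: it scans kernel log lines (as read from dmesg on a GPU node) for NVIDIA XID error reports, which are the typical signature that warrants deeper GPU diagnostics.

```python
import re

# Matches NVIDIA driver XID reports in kernel logs, e.g.:
#   NVRM: Xid (PCI:0000:3b:00.0): 79, GPU has fallen off the bus.
XID_PATTERN = re.compile(r"NVRM: Xid \((?P<pci>PCI:[0-9a-fA-F:.]+)\): (?P<code>\d+),")

def find_xid_errors(dmesg_lines):
    """Return (pci_address, xid_code) pairs found in kernel log lines."""
    hits = []
    for line in dmesg_lines:
        m = XID_PATTERN.search(line)
        if m:
            hits.append((m.group("pci"), int(m.group("code"))))
    return hits

# Synthetic kernel log excerpt, for illustration only:
sample = [
    "[1234.567] NVRM: Xid (PCI:0000:3b:00.0): 79, GPU has fallen off the bus.",
    "[1234.570] audit: type=1400 ...",
]
print(find_xid_errors(sample))  # [('PCI:0000:3b:00.0', 79)]
```

Once such a signature appears, the follow-up DCGM diagnostics and telemetry collection described above provide the hardware-level context behind the kernel message.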

What Klaudia Analyzes

Klaudia analyzes low-level GPU and system signals that live outside Kubernetes:

  • Linux kernel GPU error logs
  • NVIDIA driver crashes and faults
  • GPU hardware diagnostics and health counters
  • GPU interconnect and PCIe/NVLink communication data

These signals are invisible to standard K8s events, metrics, and logs.

What Klaudia Can Do

Hardware & Driver Failures

  • XID Errors
    NVIDIA error codes indicating GPU faults such as:
    • Memory access violations
    • “GPU has fallen off the bus” errors
  • ECC Memory Failures
    • Correctable errors (early warning signs of degradation)
    • Uncorrectable errors causing crashes and data corruption
  • Driver Errors
    • Driver crashes and initialization failures
    • Version or firmware mismatches
    • Kernel-level GPU faults
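For context, a few of the XID codes behind the fault categories above can be sketched as a lookup table. This is an abridged, illustrative mapping; the authoritative list is in NVIDIA's XID error documentation.

```python
# Abridged, illustrative table of well-known NVIDIA XID codes.
XID_MEANINGS = {
    13: "Graphics engine exception",
    31: "GPU memory page fault (MMU fault / memory access violation)",
    48: "Double-bit (uncorrectable) ECC error",
    79: "GPU has fallen off the bus",
}

def describe_xid(code: int) -> str:
    """Human-readable description for an XID code, if known."""
    return XID_MEANINGS.get(code, f"Unrecognized XID {code}; see NVIDIA docs")

print(describe_xid(79))  # GPU has fallen off the bus
```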

Thermal & Power Issues

  • Thermal throttling
  • Power limit enforcement
  • GPU overheating events affecting performance or stability
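A thermal check of this kind can be approximated with a simple heuristic. This sketch is not Komodor's actual logic: it classifies a GPU's thermal state from its current temperature against the slowdown/shutdown thresholds that tools such as nvidia-smi report per device. The default threshold values below are made-up example numbers.

```python
def thermal_state(temp_c, slowdown_c=93, shutdown_c=98):
    """Classify a GPU temperature reading against example thresholds."""
    if temp_c >= shutdown_c:
        return "critical: at shutdown threshold"
    if temp_c >= slowdown_c:
        return "throttling: clocks reduced to shed heat"
    if temp_c >= slowdown_c - 10:
        return "warning: approaching slowdown threshold"
    return "ok"

print(thermal_state(95))  # throttling: clocks reduced to shed heat
```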

Connectivity & Multi-GPU Issues

  • PCIe failures
  • NVLink communication problems
  • GPU ↔ CPU or GPU ↔ GPU communication errors
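One common signal for the PCIe problems above is a rising replay counter (a count of link-level retransmissions that nvidia-smi reports per GPU). A minimal sketch, with a hypothetical sampling threshold chosen for the example:

```python
def pcie_link_suspect(replays_before, replays_after, threshold=10):
    """True if the PCIe replay counter grew by more than `threshold`
    between two samples, suggesting an unstable link."""
    return (replays_after - replays_before) > threshold

print(pcie_link_suspect(120, 180))  # True: link is retransmitting heavily
```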

Low-Level GPU Processing Errors

  • MMU faults
  • Graphics exceptions
  • Other kernel-reported GPU execution failures
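An MMU fault report in the kernel log usually names the faulting engine and address, which can be extracted for triage. This is a hedged sketch: the exact message format varies across driver versions, and the sample line below is synthetic.

```python
import re

# Pulls the engine name and faulting address from an XID 31 (MMU fault)
# kernel log line. Format varies by driver version; sample is synthetic.
MMU_FAULT = re.compile(r"MMU Fault: ENGINE (\w+).* faulted @ (0x[0-9a-f_]+)",
                       re.IGNORECASE)

def parse_mmu_fault(line):
    """Return (engine, address) if the line is an MMU fault report."""
    m = MMU_FAULT.search(line)
    return (m.group(1), m.group(2)) if m else None

sample = ("NVRM: Xid (PCI:0000:01:00.0): 31, pid=4132, Ch 00000010, "
          "MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_0 faulted @ 0x7f5c_80000000")
print(parse_mmu_fault(sample))  # ('GRAPHICS', '0x7f5c_80000000')
```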

When Klaudia Uses the GPU Tool - Usage Examples

Klaudia automatically engages GPU investigation when failures suggest issues outside Kubernetes control, such as:

Root Cause Analysis (RCA)

  • Training or inference pods failing with unexplained CUDA errors
  • Repeated GPU-related crashes without clear pod-level causes
  • Performance degradation tied to GPU behavior

Unhealthy GPU-Backed Workloads

  • AI/ML training jobs crashing intermittently
  • Inference services failing under load with GPU errors
  • Batch jobs failing with CUDA or driver-level exceptions 

Chat-Driven Investigations

When you ask Klaudia questions like:

  • “Why does my GPU workload keep crashing?”
  • “Are there hardware issues on this GPU node?”
  • “Is this a driver or GPU failure?”
