Overview
Klaudia’s GPU Tool focuses on GPU issues that occur below the Kubernetes layer: problems that Kubernetes itself cannot see or explain.
Instead of analyzing scheduling or pod configuration, Klaudia investigates node-level GPU, driver, and hardware failures that surface as application crashes, CUDA errors, or unstable workloads.
By running a DaemonSet on GPU nodes, Klaudia gains direct access to kernel logs, NVIDIA driver output, and real-time GPU diagnostics, enabling deep investigation of GPU failures that standard Kubernetes monitoring and observability tools miss.
This is especially critical for AI/ML training, inference, and HPC workloads, where GPU hardware, drivers, and interconnects are often the real source of failure.
How It Works
- Komodor deploys a DaemonSet on GPU nodes
- Klaudia accesses:
  - Node kernel logs (dmesg)
  - NVIDIA driver and firmware output
  - GPU diagnostic files on the host
- When a GPU issue is detected, Klaudia:
  - Executes DCGM commands via PodExec
  - Collects real-time GPU health and hardware telemetry
- This data is correlated with workload failures to identify true root causes below Kubernetes
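The flow above can be sketched as a small script: scan the node's kernel log for NVIDIA Xid events and, if any are found, trigger a quick DCGM diagnostic. This is an illustrative sketch only; the log pattern is the standard `NVRM: Xid` kernel-log format, `dcgmi diag -r 1` is DCGM's quick health check, and the `investigate_node` wrapper is hypothetical, not Klaudia's actual DaemonSet internals.

```python
import re
import subprocess

# Kernel-log pattern for NVIDIA Xid events, e.g.:
#   NVRM: Xid (PCI:0000:3b:00): 79, pid=4221, GPU has fallen off the bus.
XID_PATTERN = re.compile(r"NVRM: Xid \((PCI:[0-9a-fA-F:]+)\): (\d+),\s*(.*)")

def find_xid_events(dmesg_text: str):
    """Return (pci_address, xid_code, detail) tuples found in kernel log text."""
    events = []
    for line in dmesg_text.splitlines():
        m = XID_PATTERN.search(line)
        if m:
            events.append((m.group(1), int(m.group(2)), m.group(3)))
    return events

def investigate_node():
    """Hypothetical node-level check: scan dmesg, then run a quick DCGM diag."""
    dmesg = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
    events = find_xid_events(dmesg)
    if events:
        # Level-1 DCGM diagnostic: a fast GPU health check on the node.
        subprocess.run(["dcgmi", "diag", "-r", "1"])
    return events
```

Correlating the returned Xid events with pod crash timestamps is what ties a node-level fault back to a specific workload failure.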
What Klaudia Analyzes
Klaudia analyzes low-level GPU and system signals that live outside Kubernetes:
- Linux kernel GPU error logs
- NVIDIA driver crashes and faults
- GPU hardware diagnostics and health counters
- GPU interconnect and PCIe/NVLink communication data
These signals are invisible to standard Kubernetes events, metrics, and logs.
What Klaudia Can Do
Hardware & Driver Failures
- XID Errors
  NVIDIA error codes indicating GPU faults such as:
  - Memory access violations
  - “GPU has fallen off the bus” errors
- ECC Memory Failures
  - Correctable errors (early warning signs of degradation)
  - Uncorrectable errors causing crashes and data corruption
- Driver Errors
  - Driver crashes and initialization failures
  - Version or firmware mismatches
  - Kernel-level GPU faults
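A few Xid codes map directly onto the failure classes above. The mapping below is a small sample drawn from NVIDIA's public Xid documentation; it is illustrative, not exhaustive, and the `classify_xid` helper is a hypothetical convenience, not part of any product API.

```python
# A small sample of NVIDIA Xid codes and the failure class each indicates
# (drawn from NVIDIA's public Xid error documentation; not exhaustive).
XID_MEANINGS = {
    13: ("Graphics Engine Exception", "processing"),
    31: ("GPU memory page fault (MMU fault)", "processing"),
    48: ("Double-bit ECC error", "hardware"),
    74: ("NVLink error", "connectivity"),
    79: ("GPU has fallen off the bus", "hardware"),
}

def classify_xid(code: int):
    """Return (description, category) for a known Xid code, else a generic bucket."""
    return XID_MEANINGS.get(code, ("Unrecognized Xid", "unknown"))
```

For example, Xid 79 points at a hardware/PCIe-level failure, while Xid 31 usually indicates a fault in the running workload's GPU memory accesses.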
Thermal & Power Issues
- Thermal throttling
- Power limit enforcement
- GPU overheating events affecting performance or stability
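Thermal and power state can be sampled per GPU with `nvidia-smi` query fields, as the sketch below shows. The query properties named in the comment (`temperature.gpu`, `power.draw`, `clocks_throttle_reasons.hw_thermal_slowdown`, `clocks_throttle_reasons.sw_power_cap`) are real `nvidia-smi` fields; the parsing function and its output shape are illustrative assumptions.

```python
import csv
import io

# CSV rows as produced per GPU by a query such as:
#   nvidia-smi --query-gpu=temperature.gpu,power.draw,\
#     clocks_throttle_reasons.hw_thermal_slowdown,clocks_throttle_reasons.sw_power_cap \
#     --format=csv,noheader
def parse_gpu_thermal(csv_text: str):
    """Parse nvidia-smi CSV rows into dicts flagging thermal/power throttling."""
    rows = []
    for fields in csv.reader(io.StringIO(csv_text)):
        temp, power, thermal, power_cap = (f.strip() for f in fields)
        rows.append({
            "temperature_c": int(temp),
            "power_draw": power,                 # e.g. "310.40 W"
            "thermal_throttled": thermal == "Active",
            "power_capped": power_cap == "Active",
        })
    return rows
```

A GPU reporting `thermal_throttled` while a training job slows down points at overheating rather than anything visible at the Kubernetes layer.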
Connectivity & Multi-GPU Issues
- PCIe failures
- NVLink communication problems
- GPU ↔ CPU or GPU ↔ GPU communication errors
Low-Level GPU Processing Errors
- MMU faults
- Graphics exceptions
- Other kernel-reported GPU execution failures
When Klaudia Uses the GPU Tool: Usage Examples
Klaudia automatically engages GPU investigation when failures suggest issues outside Kubernetes control, such as:
Root Cause Analysis (RCA)
- Training or inference pods failing with unexplained CUDA errors
- Repeated GPU-related crashes without clear pod-level causes
- Performance degradation tied to GPU behavior
Unhealthy GPU-Backed Workloads
- AI/ML training jobs crashing intermittently
- Inference services failing under load with GPU errors
- Batch jobs failing with CUDA or driver-level exceptions
Chat-Driven Investigations
When you ask Klaudia questions like:
- “Why does my GPU workload keep crashing?”
- “Are there hardware issues on this GPU node?”
- “Is this a driver or GPU failure?”