Build: (2239771) Merge pull request #280 from machadovilaca/add-observability-controller-alerts-runbooks

sradco · sradco · commit 8f22cba85477 · 2025-02-12T17:32:26.000Z
Add runbooks for observability controller alerts
diff --git a/runbooks/HAControlPlaneDown.md b/runbooks/HAControlPlaneDown.md
@@ -0,0 +1,88 @@
+# HAControlPlaneDown
+
+## Meaning
+
+This alert fires when a control plane node has not been ready for more than 5 minutes.
+
+## Impact
+
+A control plane node that is down reduces the high availability and
+redundancy of the Kubernetes control plane. This can negatively impact:
+- API server availability
+- Controller manager operations
+- Scheduler functionality
+- etcd cluster health (if etcd is co-located)
+
+## Diagnosis
+
+1. Check the status of all control plane nodes:
+   ```bash
+   kubectl get nodes -l node-role.kubernetes.io/control-plane=''
+   ```
+
+2. Get detailed information about the affected node:
+   ```bash
+   kubectl describe node <node-name>
+   ```
+
+3. Review system logs on the affected node:
+   ```bash
+   ssh <node-address>
+   ```
+
+   ```bash
+   journalctl -xeu kubelet
+   ```
+
+## Mitigation
+
+1. Check node resources:
+   - Verify CPU, memory, and disk usage
+      ```bash
+      # Check the node's CPU and memory resource usage
+      kubectl top node <node-name>
+      ```
+
+      ```bash
+      # Check the node's DiskPressure condition (jq parses JSON, so request JSON output)
+      kubectl get node <node-name> -o json | jq '.status.conditions[] | select(.type == "DiskPressure")'
+      ```
+   - Clear disk space if necessary
+   - Restart the kubelet after the resource issues are resolved
+
+2. If the node is unreachable:
+   - Verify network connectivity
+   - Check physical/virtual machine status
+   - Ensure the node has power and is running
+
+3. If the kubelet is generating errors:
+   ```bash
+   systemctl status kubelet
+   ```
+
+   ```bash
+   systemctl restart kubelet
+   ```
+
+4. If the node cannot be recovered:
+   - If possible, safely drain the node
+      ```bash
+      kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
+      ```
+   - Investigate hardware/infrastructure issues
+   - Consider replacing the node if necessary
+
+## Additional notes
+- Maintain at least three control plane nodes for high availability
+- Monitor etcd cluster health if the affected node runs etcd
+- Document any infrastructure-specific recovery procedures
+
+<!--DS: If you cannot resolve the issue, log in to the
+link:https://access.redhat.com[Customer Portal] and open a support case,
+attaching the artifacts gathered during the diagnosis procedure.-->
+<!--USstart-->
+If you cannot resolve the issue, see the following resources:
+
+- [OKD Help](https://www.okd.io/help/)
+- [#virtualization Slack channel](https://kubernetes.slack.com/channels/virtualization)
+<!--USend-->
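Diagnosis step 1 of the HAControlPlaneDown runbook lists every control plane node, which leaves the reader to pick out the affected one by eye. The following is a minimal sketch of filtering that listing, not part of the commit. It assumes the default `kubectl get nodes` column layout (NAME, STATUS, ROLES, AGE, VERSION), and the helper name `not_ready_nodes` is hypothetical.

```bash
# Hypothetical helper: print the names of nodes whose STATUS does not start with "Ready".
# Reads `kubectl get nodes --no-headers` style rows on stdin.
# Assumption: column 1 is NAME and column 2 is STATUS, as in the default output.
not_ready_nodes() {
  awk '{
    # STATUS can carry a suffix such as "Ready,SchedulingDisabled"; keep the first part.
    split($2, status, ",")
    if (status[1] != "Ready") print $1
  }'
}

# Usage against a live cluster (requires kubectl access):
#   kubectl get nodes -l node-role.kubernetes.io/control-plane='' --no-headers | not_ready_nodes
```

A cordoned but healthy node reports `Ready,SchedulingDisabled`, so the sketch compares only the first comma-separated status value and does not flag it.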
diff --git a/runbooks/HighCPUWorkload.md b/runbooks/HighCPUWorkload.md
@@ -0,0 +1,66 @@
+# HighCPUWorkload
+
+## Meaning
+
+This alert fires when a node's CPU utilization exceeds 90% for more than 5 minutes.
+
+## Impact
+
+High CPU utilization can lead to:
+- Degraded performance of applications running on the node
+- Increased latency in request processing
+- Potential service disruptions if CPU usage continues to climb
+
+## Diagnosis
+
+1. Identify the affected node:
+   ```bash
+   kubectl get nodes
+   ```
+
+2. Check the node's resource usage:
+   ```bash
+   kubectl describe node <node-name>
+   ```
+
+3. List pods that consume high amounts of CPU:
+   ```bash
+   kubectl top pods --all-namespaces --sort-by=cpu
+   ```
+
+4. Investigate specific pod details if needed:
+   ```bash
+   kubectl describe pod <pod-name> -n <namespace>
+   ```
+
+## Mitigation
+
+1. If the issue is caused by a malfunctioning pod:
+   - Consider restarting the pod
+   - Check pod logs for anomalies
+   - Review pod resource limits and requests
+
+2. If the issue is system-wide:
+   - Check for system processes that consume high amounts of CPU
+   - Consider cordoning the node and migrating workloads
+   - Evaluate if node scaling is needed
+
+3. Long-term solutions to avoid the issue:
+   - Implement or adjust pod resource limits
+   - Consider horizontal pod autoscaling
+   - Evaluate cluster capacity and scaling needs
+
+## Additional notes
+- Monitor the node after mitigation to ensure CPU usage returns to normal
+- Review application logs for potential root causes
+- Consider updating resource requests/limits if this is a recurring issue
+
+<!--DS: If you cannot resolve the issue, log in to the
+link:https://access.redhat.com[Customer Portal] and open a support case,
+attaching the artifacts gathered during the diagnosis procedure.-->
+<!--USstart-->
+If you cannot resolve the issue, see the following resources:
+
+- [OKD Help](https://www.okd.io/help/)
+- [#virtualization Slack channel](https://kubernetes.slack.com/channels/virtualization)
+<!--USend-->
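Diagnosis step 3 of the HighCPUWorkload runbook sorts pods across the whole cluster, while the alert concerns a single node. One possible way to narrow the listing to that node is sketched below; it is not part of the commit. It assumes the default column layouts of `kubectl get pods --all-namespaces` and `kubectl top pods --all-namespaces` (NAMESPACE first, NAME second), and the helper name `top_pods_on_node` and the temporary file path are hypothetical.

```bash
# Hypothetical helper: keep only the `kubectl top pods` rows that belong to
# pods listed in the file given as $1 (rows beginning with: NAMESPACE NAME).
# Reads `kubectl top pods --all-namespaces --no-headers` style rows on stdin.
top_pods_on_node() {
  awk 'NR == FNR { on_node[$1 "/" $2] = 1; next }  # first input: pods on the node
       ($1 "/" $2) in on_node                      # second input: print matching rows
      ' "$1" -
}

# Usage against a live cluster (requires kubectl access and metrics):
#   kubectl get pods --all-namespaces --no-headers \
#     --field-selector spec.nodeName=<node-name> > /tmp/pods-on-node.txt
#   kubectl top pods --all-namespaces --no-headers --sort-by=cpu \
#     | top_pods_on_node /tmp/pods-on-node.txt
```

The rows keep the order of the `kubectl top` output, so the highest CPU consumers on the node stay at the top when `--sort-by=cpu` is used.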
diff --git a/runbooks/NodeNetworkInterfaceDown.md b/runbooks/NodeNetworkInterfaceDown.md
@@ -0,0 +1,85 @@
+# NodeNetworkInterfaceDown
+
+## Meaning
+
+This alert fires when one or more network interfaces on a node have been down
+for more than 5 minutes. The alert excludes virtual ethernet (veth) devices and
+bridge tunnels.
+
+## Impact
+
+Network interface failures can lead to:
+- Reduced network connectivity for pods on the affected node
+- Potential service disruptions if critical network paths are affected
+- Degraded cluster communication if management interfaces are impacted
+
+## Diagnosis
+
+1. Identify the affected node and interfaces:
+   ```bash
+   kubectl get nodes
+   ```
+
+   ```bash
+   ssh <node-address>
+   ```
+
+   ```bash
+   ip link show | grep -i down
+   ```
+
+2. Check network interface details:
+   ```bash
+   ip addr show
+   ```
+
+   ```bash
+   ethtool <interface-name>
+   ```
+
+3. Review system logs for network-related issues:
+   ```bash
+   journalctl -u NetworkManager
+   ```
+
+   ```bash
+   dmesg | grep -i eth
+   ```
+
+## Mitigation
+
+1. For physical interface issues:
+   - Check physical cable connections
+   - Verify switch port configuration
+   - Test the interface with a different cable/port
+
+2. For software or configuration issues:
+   ```bash
+   # Restart NetworkManager
+   systemctl restart NetworkManager
+   ```
+
+   ```bash
+   # Bring interface up manually
+   ip link set <interface-name> up
+   ```
+
+3. If the issue persists:
+   - Check network interface configuration files
+   - Verify driver compatibility
+   - If the failure is on a physical interface, consider hardware replacement
+
+## Additional notes
+- Monitor interface status after mitigation
+- Document any hardware replacements or configuration changes
+- Consider implementing network redundancy for critical interfaces
+
+<!--DS: If you cannot resolve the issue, log in to the
+link:https://access.redhat.com[Customer Portal] and open a support case,
+attaching the artifacts gathered during the diagnosis procedure.-->
+<!--USstart-->
+If you cannot resolve the issue, see the following resources:
+
+- [OKD Help](https://www.okd.io/help/)
+- [#virtualization Slack channel](https://kubernetes.slack.com/channels/virtualization)
+<!--USend-->
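The `ip link show | grep -i down` check in the NodeNetworkInterfaceDown runbook also matches veth devices, which the alert itself excludes. A rough sketch of a narrower filter follows; it is not part of the commit. It assumes the one-line output format of `ip -o link show` and that veth devices use the default `veth*` naming. The helper name `down_interfaces` is hypothetical, and the sketch does not attempt to exclude the bridge tunnels that the alert also ignores.

```bash
# Hypothetical helper: print interfaces whose operational state is DOWN,
# skipping veth devices. Reads `ip -o link show` style rows on stdin.
down_interfaces() {
  awk '{
    name = $2
    sub(/:$/, "", name)       # drop the trailing colon after the name
    sub(/@.*/, "", name)      # drop a peer suffix such as "@if3"
    if (name ~ /^veth/) next  # assumption: veth devices are named veth*
    for (i = 3; i < NF; i++)
      if ($i == "state" && $(i + 1) == "DOWN") print name
  }'
}

# Usage on the affected node:
#   ip -o link show | down_interfaces
```

Matching on the `state` field rather than on any occurrence of "down" avoids false positives from other text in the line.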
diff --git a/runbooks_index.md b/runbooks_index.md
@@ -2,6 +2,7 @@
 
 - [HCOMisconfiguredDescheduler.md](runbooks/HCOMisconfiguredDescheduler.md)
 - [VirtApiRESTErrorsHigh.md](runbooks/VirtApiRESTErrorsHigh.md)
+- [HighCPUWorkload.md](runbooks/HighCPUWorkload.md)
 - [KubeVirtNoAvailableNodesToRunVMs.md](runbooks/KubeVirtNoAvailableNodesToRunVMs.md)
 - [KubeMacPoolDuplicateMacsFound.md](runbooks/KubeMacPoolDuplicateMacsFound.md)
 - [CDIStorageProfilesIncomplete.md](runbooks/CDIStorageProfilesIncomplete.md)
@@ -19,6 +20,7 @@
 - [VirtOperatorRESTErrorsHigh.md](runbooks/VirtOperatorRESTErrorsHigh.md)
 - [KubevirtVmHighMemoryUsage.md](runbooks/KubevirtVmHighMemoryUsage.md)
 - [HCOInstallationIncomplete.md](runbooks/HCOInstallationIncomplete.md)
+- [NodeNetworkInterfaceDown.md](runbooks/NodeNetworkInterfaceDown.md)
 - [CDIOperatorDown.md](runbooks/CDIOperatorDown.md)
 - [LowReadyVirtOperatorsCount.md](runbooks/LowReadyVirtOperatorsCount.md)
 - [SSPCommonTemplatesModificationReverted.md](runbooks/SSPCommonTemplatesModificationReverted.md)
@@ -39,6 +41,7 @@
 - [CDIDataVolumeUnusualRestartCount.md](runbooks/CDIDataVolumeUnusualRestartCount.md)
 - [CnaoDown.md](runbooks/CnaoDown.md)
 - [SSPFailingToReconcile.md](runbooks/SSPFailingToReconcile.md)
+- [HAControlPlaneDown.md](runbooks/HAControlPlaneDown.md)
 - [VirtOperatorDown.md](runbooks/VirtOperatorDown.md)
 - [VMStorageClassWarning.md](runbooks/VMStorageClassWarning.md)
 - [LowVirtAPICount.md](runbooks/LowVirtAPICount.md)