Commit 8f22cba

Build: (2239771) Merge pull request #280 from machadovilaca/add-observability-controller-alerts-runbooks
Add runbooks for observability controller alerts
1 parent 3068da7 commit 8f22cba

4 files changed (+242, -0 lines)

runbooks/HAControlPlaneDown.md

Lines changed: 88 additions & 0 deletions
@@ -0,0 +1,88 @@
# HAControlPlaneDown

## Meaning

A control plane node has been detected as not ready for more than 5 minutes.

## Impact

When a control plane node is down, it affects the high availability and
redundancy of the Kubernetes control plane. This can negatively impact:
- API server availability
- Controller manager operations
- Scheduler functionality
- etcd cluster health (if etcd is co-located)

## Diagnosis

1. Check the status of all control plane nodes:
```bash
kubectl get nodes -l node-role.kubernetes.io/control-plane=''
```

2. Get detailed information about the affected node:
```bash
kubectl describe node <node-name>
```

3. Review system logs on the affected node:
```bash
ssh <node-address>
```

```bash
journalctl -xeu kubelet
```
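
As a rough sketch of step 1, the node list can be narrowed down to the control plane nodes that are not ready. The helper name `not_ready_nodes` is hypothetical, and it assumes the default `kubectl get nodes --no-headers` column layout, in which the second column holds the node status.

```bash
# Hypothetical helper: print the names of nodes whose STATUS column does not
# start with "Ready". Reads `kubectl get nodes --no-headers` output on stdin.
not_ready_nodes() {
  awk 'NF >= 2 && $2 !~ /^Ready/ { print $1 }'
}

# Example usage (requires cluster access):
#   kubectl get nodes -l node-role.kubernetes.io/control-plane='' --no-headers \
#     | not_ready_nodes
```

A node that is cordoned but otherwise healthy reports `Ready,SchedulingDisabled` and is therefore not listed.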

## Mitigation

1. Check node resources:
- Verify CPU, memory, and disk usage
```bash
# Check the node's CPU and memory resource usage
kubectl top node <node-name>
```

```bash
# Check node status conditions for DiskPressure status
# (jq parses JSON, so request JSON output rather than YAML)
kubectl get node <node-name> -o json | jq '.status.conditions[] | select(.type == "DiskPressure")'
```
- Clear disk space if necessary
- Restart the kubelet after the resource issues are resolved

2. If the node is unreachable:
- Verify network connectivity
- Check physical/virtual machine status
- Ensure the node has power and is running

3. If the kubelet is generating errors:
```bash
systemctl status kubelet
```

```bash
systemctl restart kubelet
```

4. If the node cannot be recovered:
- If possible, safely drain the node
```bash
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
```
- Investigate hardware/infrastructure issues
- Consider replacing the node if necessary
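
The condition check in step 1 looks at `DiskPressure` only. The following is a minimal sketch that surfaces every unhealthy node condition at once. It assumes the usual node condition semantics (`Ready` should be `True`, while pressure-style conditions should be `False`); the helper name `unhealthy_conditions` is hypothetical.

```bash
# Hypothetical helper: read "TYPE<TAB>STATUS" lines on stdin and print the
# conditions that look unhealthy. Assumption: Ready must be "True" and every
# other condition is healthy when it is "False".
unhealthy_conditions() {
  awk -F '\t' '
    NF < 2 { next }                                   # skip blank lines
    $1 == "Ready" && $2 != "True"  { print $1 "=" $2 }
    $1 != "Ready" && $2 != "False" { print $1 "=" $2 }
  '
}

# Example usage (requires cluster access):
#   kubectl get node <node-name> \
#     -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\n"}{end}' \
#     | unhealthy_conditions
```

Empty output suggests that no condition is flagged; any printed line names a condition worth inspecting with `kubectl describe node <node-name>`.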

## Additional notes
- Maintain at least three control plane nodes for high availability
- Monitor etcd cluster health if the affected node runs etcd
- Document any infrastructure-specific recovery procedures

<!--DS: If you cannot resolve the issue, log in to the
link:https://access.redhat.com[Customer Portal] and open a support case,
attaching the artifacts gathered during the diagnosis procedure.-->
<!--USstart-->
If you cannot resolve the issue, see the following resources:

- [OKD Help](https://www.okd.io/help/)
- [#virtualization Slack channel](https://kubernetes.slack.com/channels/virtualization)
<!--USend-->

runbooks/HighCPUWorkload.md

Lines changed: 66 additions & 0 deletions
@@ -0,0 +1,66 @@
# HighCPUWorkload

## Meaning

This alert fires when a node's CPU utilization exceeds 90% for more than 5 minutes.
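
For a quick spot check directly on a node, CPU utilization over an interval can be estimated from two samples of the aggregate `cpu` line in `/proc/stat`. This is only an illustration: the alert's actual expression is not shown in this runbook and may be computed differently. The helper name `cpu_busy_percent` is hypothetical.

```bash
# Hypothetical helper: estimate the busy CPU percentage between two samples of
# the aggregate "cpu" line from /proc/stat.
# Field layout after the "cpu" label: user nice system idle iowait irq softirq steal ...
cpu_busy_percent() {
  printf '%s\n%s\n' "$1" "$2" | awk '
    {
      total = 0
      for (i = 2; i <= 9 && i <= NF; i++) total += $i   # user through steal
      idle = $5 + $6                                    # idle + iowait
      if (NR == 1) { t1 = total; i1 = idle } else { t2 = total; i2 = idle }
    }
    END {
      dt = t2 - t1
      if (dt <= 0) { print "0"; exit }                  # no elapsed ticks
      printf "%d\n", 100 * (dt - (i2 - i1)) / dt
    }
  '
}

# Example usage (on the node):
#   a=$(head -n 1 /proc/stat); sleep 5; b=$(head -n 1 /proc/stat)
#   cpu_busy_percent "$a" "$b"
```

The result is truncated to a whole percentage and counts `iowait` as idle time, which is one common convention but not the only one.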

## Impact

High CPU utilization can lead to:
- Degraded performance of applications running on the node
- Increased latency in request processing
- Potential service disruptions if CPU usage continues to climb

## Diagnosis

1. Identify the affected node:
```bash
kubectl get nodes
```

2. Check the node's resource usage:
```bash
kubectl describe node <node-name>
```

3. List pods that consume high amounts of CPU:
```bash
kubectl top pods --all-namespaces --sort-by=cpu
```

4. Investigate specific pod details if needed:
```bash
kubectl describe pod <pod-name> -n <namespace>
```
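
When many pods are running, the per-pod listing from step 3 can be hard to read. As a hedged sketch, the same output can be aggregated per namespace to see where CPU is being spent. The helper name `cpu_by_namespace` is hypothetical, and it assumes the `NAMESPACE NAME CPU MEMORY` column layout of `kubectl top pods --all-namespaces --no-headers`.

```bash
# Hypothetical helper: sum pod CPU usage per namespace, largest first.
# Reads `kubectl top pods --all-namespaces --no-headers` output on stdin.
# CPU values such as "250m" are millicores; bare numbers are treated as cores.
cpu_by_namespace() {
  awk '
    NF >= 3 {
      cpu = $3
      if (cpu ~ /m$/) { sub(/m$/, "", cpu); milli = cpu + 0 }
      else            { milli = cpu * 1000 }
      total[$1] += milli
    }
    END { for (ns in total) printf "%d\t%s\n", total[ns], ns }
  ' | sort -rn
}

# Example usage (requires cluster access and a metrics source):
#   kubectl top pods --all-namespaces --no-headers | cpu_by_namespace
```

Each output line is the total in millicores, a tab, and the namespace name.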

## Mitigation

1. If the issue was caused by a malfunctioning pod:
- Consider restarting the pod
- Check pod logs for anomalies
- Review pod resource limits and requests

2. If the issue is system-wide:
- Check for system processes that consume high amounts of CPU
- Consider cordoning the node and migrating workloads
- Evaluate if node scaling is needed

3. Long-term solutions to avoid the issue:
- Implement or adjust pod resource limits
- Consider horizontal pod autoscaling
- Evaluate cluster capacity and scaling needs
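
Cordoning the node and moving workloads away, as suggested in step 2, could look roughly like the following. This is a sketch, not a prescribed procedure: the helper name `evacuate_node` and the `KUBECTL` override variable are hypothetical, and the drain flags are simply the standard `kubectl drain` options for daemon sets and `emptyDir` data.

```bash
# Hypothetical helper: cordon a node, then drain it.
# Set KUBECTL="echo kubectl" to preview the commands without running them.
evacuate_node() {
  if [ -z "$1" ]; then
    echo "usage: evacuate_node <node-name>" >&2
    return 1
  fi
  node=$1
  # Stop new pods from being scheduled, then evict the existing ones.
  ${KUBECTL:-kubectl} cordon "$node" &&
    ${KUBECTL:-kubectl} drain "$node" --ignore-daemonsets --delete-emptydir-data
}

# Example usage:
#   KUBECTL="echo kubectl" evacuate_node <node-name>   # preview only
#   evacuate_node <node-name>                          # run against the cluster
```

After CPU usage returns to normal, `kubectl uncordon <node-name>` makes the node schedulable again.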

## Additional notes
- Monitor the node after mitigation to ensure CPU usage returns to normal
- Review application logs for potential root causes
- Consider updating resource requests/limits if this is a recurring issue

<!--DS: If you cannot resolve the issue, log in to the
link:https://access.redhat.com[Customer Portal] and open a support case,
attaching the artifacts gathered during the diagnosis procedure.-->
<!--USstart-->
If you cannot resolve the issue, see the following resources:

- [OKD Help](https://www.okd.io/help/)
- [#virtualization Slack channel](https://kubernetes.slack.com/channels/virtualization)
<!--USend-->

runbooks/NodeNetworkInterfaceDown.md

Lines changed: 85 additions & 0 deletions
@@ -0,0 +1,85 @@
# NodeNetworkInterfaceDown

## Meaning

This alert fires when one or more network interfaces on a node have been down
for more than 5 minutes. The alert excludes virtual ethernet (veth) devices and
bridge tunnels.

## Impact

Network interface failures can lead to:
- Reduced network connectivity for pods on the affected node
- Potential service disruptions if critical network paths are affected
- Degraded cluster communication if management interfaces are impacted

## Diagnosis

1. Identify the affected node and interfaces:
```bash
kubectl get nodes
```

```bash
ssh <node-address>
```

```bash
ip link show | grep -i down
```

2. Check network interface details:
```bash
ip addr show
```

```bash
ethtool <interface-name>
```

3. Review system logs for network-related issues:
```bash
journalctl -u NetworkManager
```

```bash
dmesg | grep -i eth
```
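
The `grep` in step 1 prints whole lines and includes veth devices, which the alert ignores. As a minimal sketch, the one-line-per-interface output of `ip -o link show` can be reduced to just the names of interfaces in the `DOWN` state. The helper name `down_interfaces` is hypothetical, and it filters out only names starting with `veth`; any other exclusions the alert applies are not reproduced here.

```bash
# Hypothetical helper: print the names of interfaces reported as "state DOWN",
# skipping veth devices. Reads `ip -o link show` output on stdin.
down_interfaces() {
  awk '
    / state DOWN / {
      name = $2
      sub(/:$/, "", name)     # drop the trailing colon
      sub(/@.*$/, "", name)   # drop an "@ifN" peer suffix, if present
      if (name !~ /^veth/) print name
    }
  '
}

# Example usage (on the node):
#   ip -o link show | down_interfaces
```

Each printed name can then be passed to `ethtool <interface-name>` as in step 2.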

## Mitigation

1. For physical interface issues:
- Check physical cable connections
- Verify switch port configuration
- Test the interface with a different cable/port

2. For software or configuration issues:
```bash
# Restart NetworkManager
systemctl restart NetworkManager
```

```bash
# Bring interface up manually
ip link set <interface-name> up
```

3. If the issue persists:
- Check network interface configuration files
- Verify driver compatibility
- If the failure is on a physical interface, consider hardware replacement

## Additional notes
- Monitor interface status after mitigation
- Document any hardware replacements or configuration changes
- Consider implementing network redundancy for critical interfaces

<!--DS: If you cannot resolve the issue, log in to the
link:https://access.redhat.com[Customer Portal] and open a support case,
attaching the artifacts gathered during the diagnosis procedure.-->
<!--USstart-->
If you cannot resolve the issue, see the following resources:

- [OKD Help](https://www.okd.io/help/)
- [#virtualization Slack channel](https://kubernetes.slack.com/channels/virtualization)
<!--USend-->

runbooks_index.md

Lines changed: 3 additions & 0 deletions
@@ -2,6 +2,7 @@
 
 - [HCOMisconfiguredDescheduler.md](runbooks/HCOMisconfiguredDescheduler.md)
 - [VirtApiRESTErrorsHigh.md](runbooks/VirtApiRESTErrorsHigh.md)
+- [HighCPUWorkload.md](runbooks/HighCPUWorkload.md)
 - [KubeVirtNoAvailableNodesToRunVMs.md](runbooks/KubeVirtNoAvailableNodesToRunVMs.md)
 - [KubeMacPoolDuplicateMacsFound.md](runbooks/KubeMacPoolDuplicateMacsFound.md)
 - [CDIStorageProfilesIncomplete.md](runbooks/CDIStorageProfilesIncomplete.md)
@@ -19,6 +20,7 @@
 - [VirtOperatorRESTErrorsHigh.md](runbooks/VirtOperatorRESTErrorsHigh.md)
 - [KubevirtVmHighMemoryUsage.md](runbooks/KubevirtVmHighMemoryUsage.md)
 - [HCOInstallationIncomplete.md](runbooks/HCOInstallationIncomplete.md)
+- [NodeNetworkInterfaceDown.md](runbooks/NodeNetworkInterfaceDown.md)
 - [CDIOperatorDown.md](runbooks/CDIOperatorDown.md)
 - [LowReadyVirtOperatorsCount.md](runbooks/LowReadyVirtOperatorsCount.md)
 - [SSPCommonTemplatesModificationReverted.md](runbooks/SSPCommonTemplatesModificationReverted.md)
@@ -39,6 +41,7 @@
 - [CDIDataVolumeUnusualRestartCount.md](runbooks/CDIDataVolumeUnusualRestartCount.md)
 - [CnaoDown.md](runbooks/CnaoDown.md)
 - [SSPFailingToReconcile.md](runbooks/SSPFailingToReconcile.md)
+- [HAControlPlaneDown.md](runbooks/HAControlPlaneDown.md)
 - [VirtOperatorDown.md](runbooks/VirtOperatorDown.md)
 - [VMStorageClassWarning.md](runbooks/VMStorageClassWarning.md)
 - [LowVirtAPICount.md](runbooks/LowVirtAPICount.md)
