
Commit d822c7d

Add runbooks for observability controller alerts

- HAControlPlaneDown
- NodeNetworkInterfaceDown
- HighCPUWorkload

Signed-off-by: João Vilaça <[email protected]>
1 parent 8c69bc4 commit d822c7d

3 files changed: +217 -0 lines changed


docs/runbooks/HAControlPlaneDown.md

Lines changed: 80 additions & 0 deletions

# HAControlPlaneDown

## Meaning

A control plane node has been detected as not ready for more than 5 minutes.

## Impact

When a control plane node is down, it affects the high availability and
redundancy of the Kubernetes control plane. This can negatively impact:

- API server availability
- Controller manager operations
- Scheduler functionality
- etcd cluster health (if etcd is co-located)

## Diagnosis

1. Check the status of all control plane nodes:

```bash
kubectl get nodes -l node-role.kubernetes.io/control-plane=''
```

2. Get detailed information about the affected node:

```bash
kubectl describe node <node-name>
```

3. Review system logs on the affected node:

```bash
ssh <node-address>
journalctl -xeu kubelet
```

## Mitigation

1. Check node resources:
- Verify CPU, memory, and disk usage

```bash
# Check the node's CPU and memory resource usage
kubectl top node <node-name>

# Check node status conditions for DiskPressure status
kubectl get node <node-name> -o yaml | jq '.status.conditions[] | select(.type == "DiskPressure")'
```

- Clear disk space if necessary (see the cleanup sketch below)
- Restart kubelet if resource issues are resolved
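
If disk pressure is the cause, the sketch below shows one way to reclaim space on the node. It assumes SSH access and a CRI-O or containerd node where `crictl` is available; it is not an exhaustive cleanup procedure.

```bash
# Run on the affected node (for example via ssh <node-address>)
df -h                          # identify which filesystem is full
crictl rmi --prune             # remove container images that are no longer referenced
journalctl --vacuum-size=500M  # cap the journal size to free space
```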

2. If the node is unreachable (see the reachability sketch below):
- Verify network connectivity
- Check physical/virtual machine status
- Ensure the node has power and is running
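
A minimal reachability check, assuming SSH access to the node network; `<node-name>` and `<node-address>` are placeholders:

```bash
# Find the address the node reports to the API server
kubectl get node <node-name> -o wide

# Basic reachability checks from a host with access to the node network
ping -c 3 <node-address>
ssh -o ConnectTimeout=5 <node-address> 'uptime'
```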

3. If kubelet is generating errors:

```bash
systemctl status kubelet
systemctl restart kubelet
```

4. If the node cannot be recovered:
- If possible, safely drain the node

```bash
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
```

- Investigate hardware/infrastructure issues
- Consider replacing the node if necessary (see the sketch below for removing the stale Node object)
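
If the node is rebuilt or replaced, the stale Node object usually has to be removed so the replacement can register cleanly. A minimal sketch; `<node-name>` is a placeholder:

```bash
# Only run this after the node has been drained and decommissioned
kubectl delete node <node-name>
```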

## Additional Notes

- Maintain at least three control plane nodes for high availability
- Monitor etcd cluster health if the affected node runs etcd (see the sketch below)
- Document any infrastructure-specific recovery procedures
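
One way to check etcd health is sketched below. It assumes a kubeadm-style cluster where etcd runs as a static pod named `etcd-<node-name>` in `kube-system` with the default certificate paths; adjust the pod name and paths for your distribution.

```bash
kubectl -n kube-system exec etcd-<node-name> -- etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint health --cluster
```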

<!--DS: If you cannot resolve the issue, log in to the
link:https://access.redhat.com[Customer Portal] and open a support case,
attaching the artifacts gathered during the diagnosis procedure.-->
<!--USstart-->
If you cannot resolve the issue, see the following resources:

- [OKD Help](https://www.okd.io/help/)
- [#virtualization Slack channel](https://kubernetes.slack.com/channels/virtualization)
<!--USend-->

docs/runbooks/HighCPUWorkload.md

Lines changed: 66 additions & 0 deletions

# HighCPUWorkload

## Meaning

This alert fires when a node's CPU utilization exceeds 90% for more than 5 minutes.

## Impact

High CPU utilization can lead to:

- Degraded performance of applications running on the node
- Increased latency in request processing
- Potential service disruptions if CPU usage continues to climb

## Diagnosis

1. Identify the affected node:

```bash
kubectl get nodes
```

2. Check node resource usage:

```bash
kubectl describe node <node-name>
```

3. List pods consuming high CPU:

```bash
kubectl top pods --all-namespaces --sort-by=cpu
```

4. Investigate specific pod details if needed:

```bash
kubectl describe pod <pod-name> -n <namespace>
```

## Mitigation

1. If the issue was caused by a malfunctioning pod (see the sketch below):
- Consider restarting the pod
- Check pod logs for anomalies
- Review pod resource limits and requests
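
A minimal sketch of these checks; `<pod-name>` and `<namespace>` are placeholders:

```bash
# Inspect recent logs, including the previous container instance if it restarted
kubectl logs <pod-name> -n <namespace> --previous

# Show the pod's configured requests and limits
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[*].resources}'

# Restart the pod by deleting it (its controller recreates it)
kubectl delete pod <pod-name> -n <namespace>
```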

2. If the issue is system-wide:
- Check for system processes consuming high CPU
- Consider cordoning the node and migrating workloads (see the sketch below)
- Evaluate if node scaling is needed
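
One way to inspect host processes and move workloads off the node, assuming SSH access; `<node-name>` and `<node-address>` are placeholders:

```bash
# Inspect the busiest processes on the node itself
ssh <node-address> 'top -b -n 1 | head -n 20'

# Stop new pods from being scheduled onto the node
kubectl cordon <node-name>

# Move existing workloads elsewhere
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
```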

3. Long-term solutions to avoid the issue:
- Implement or adjust pod resource limits
- Consider horizontal pod autoscaling (see the sketch below)
- Evaluate cluster capacity and scaling needs
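
A minimal horizontal pod autoscaling sketch; the deployment name, namespace, and thresholds are placeholders, and the cluster needs a working metrics pipeline (for example metrics-server) for the autoscaler to act:

```bash
# Scale the deployment between 2 and 10 replicas, targeting 80% CPU utilization
kubectl autoscale deployment <deployment-name> -n <namespace> --cpu-percent=80 --min=2 --max=10

# Verify the autoscaler and its current metrics
kubectl get hpa -n <namespace>
```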

## Additional Notes

- Monitor the node after mitigation to ensure CPU usage returns to normal
- Review application logs for potential root causes
- Consider updating resource requests/limits if this is a recurring issue (see the sketch below)
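
One way to adjust requests and limits on an existing deployment; the names and values are placeholders, and the change triggers a rolling restart:

```bash
kubectl set resources deployment <deployment-name> -n <namespace> \
  --requests=cpu=250m,memory=256Mi --limits=cpu=500m,memory=512Mi
```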

<!--DS: If you cannot resolve the issue, log in to the
link:https://access.redhat.com[Customer Portal] and open a support case,
attaching the artifacts gathered during the diagnosis procedure.-->
<!--USstart-->
If you cannot resolve the issue, see the following resources:

- [OKD Help](https://www.okd.io/help/)
- [#virtualization Slack channel](https://kubernetes.slack.com/channels/virtualization)
<!--USend-->

docs/runbooks/NodeNetworkInterfaceDown.md

Lines changed: 71 additions & 0 deletions

# NodeNetworkInterfaceDown

## Meaning

This alert fires when one or more network interfaces on a node have been down
for more than 5 minutes. The alert excludes virtual ethernet (veth) devices and
bridge tunnels.

## Impact

Network interface failures can lead to:

- Reduced network connectivity for pods on the affected node
- Potential service disruptions if critical network paths are affected
- Degraded cluster communication if management interfaces are impacted

## Diagnosis

1. Identify the affected node and interfaces:

```bash
kubectl get nodes
ssh <node-address>
ip link show | grep -i down
```

2. Check network interface details:

```bash
ip addr show
ethtool <interface-name>
```

3. Review system logs for network-related issues:

```bash
journalctl -u NetworkManager
dmesg | grep -i eth
```

## Mitigation

1. For physical interface issues:
- Check physical cable connections
- Verify switch port configuration
- Test the interface with a different cable/port

2. For software or configuration issues:

```bash
# Restart NetworkManager
systemctl restart NetworkManager

# Bring interface up manually
ip link set <interface-name> up
```

3. If the issue persists:
- Check network interface configuration files (see the sketch below)
- Verify driver compatibility
- Consider hardware replacement if a physical failure is confirmed
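
A minimal sketch for reviewing the interface configuration and driver, assuming the node uses NetworkManager; `<interface-name>` is a placeholder:

```bash
# List NetworkManager connection profiles and device states
nmcli connection show
nmcli device status

# Show details for the affected interface
nmcli device show <interface-name>

# Report the driver and firmware versions in use
ethtool -i <interface-name>
```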

## Additional Notes

- Monitor interface status after mitigation
- Document any hardware replacements or configuration changes
- Consider implementing network redundancy for critical interfaces (see the bonding sketch below)
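
One common form of redundancy is an active-backup bond managed by NetworkManager. A hypothetical sketch; the connection and interface names (`bond0`, `eno1`, `eno2`) are placeholders and the exact procedure depends on your network setup:

```bash
# Create an active-backup bond and attach two physical interfaces to it
nmcli connection add type bond con-name bond0 ifname bond0 bond.options "mode=active-backup,miimon=100"
nmcli connection add type bond-slave con-name bond0-port1 ifname eno1 master bond0
nmcli connection add type bond-slave con-name bond0-port2 ifname eno2 master bond0
nmcli connection up bond0
```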

<!--DS: If you cannot resolve the issue, log in to the
link:https://access.redhat.com[Customer Portal] and open a support case,
attaching the artifacts gathered during the diagnosis procedure.-->
<!--USstart-->
If you cannot resolve the issue, see the following resources:

- [OKD Help](https://www.okd.io/help/)
- [#virtualization Slack channel](https://kubernetes.slack.com/channels/virtualization)
<!--USend-->
