# Add runbooks for observability controller alerts #280
# HAControlPlaneDown

## Meaning

A control plane node has been detected as not ready for more than 5 minutes.

## Impact

When a control plane node is down, it affects the high availability and
redundancy of the Kubernetes control plane. This can negatively impact:
- API server availability
- Controller manager operations
- Scheduler functionality
- etcd cluster health (if etcd is co-located)

## Diagnosis

1. Check the status of all control plane nodes:
```bash
kubectl get nodes -l node-role.kubernetes.io/control-plane=''
```

2. Get detailed information about the affected node:
```bash
kubectl describe node <node-name>
```

3. Review system logs on the affected node:
```bash
ssh <node-address>
```

```bash
journalctl -xeu kubelet
```

## Mitigation

1. Check node resources:
- Verify CPU, memory, and disk usage
```bash
# Check the node's CPU and memory resource usage
kubectl top node <node-name>
```

```bash
# Check node status conditions for DiskPressure status
kubectl get node <node-name> -o yaml | jq '.status.conditions[] | select(.type == "DiskPressure")'
```
- Clear disk space if necessary (see the example below)
- Restart the kubelet if resource issues are resolved
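If `DiskPressure` is reported, the commands below are one way to find and reclaim space. This is a generic sketch run over SSH on the affected node; the paths and the `crictl` cleanup assume a typical CRI-O/containerd node and may differ in your environment.

```bash
# Show overall filesystem usage on the node
df -h

# Find the largest directories under common culprits such as logs and container storage
du -xh --max-depth=1 /var/log /var/lib/containers 2>/dev/null | sort -h | tail -n 10

# Reclaim space from the systemd journal, keeping only the last two days
journalctl --vacuum-time=2d

# Remove unused container images (requires crictl on the node)
crictl rmi --prune
```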

2. If the node is unreachable:
- Verify network connectivity (see the sketch below)
- Check physical/virtual machine status
- Ensure the node has power and is running
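A minimal reachability check, run from a host that can normally reach the node; `<node-address>` is a placeholder and port 10250 is the default kubelet port:

```bash
# Confirm the node answers on the network
ping -c 4 <node-address>

# Check that the kubelet port is reachable (requires nc/netcat)
nc -zv <node-address> 10250

# Try to open an interactive session for further inspection
ssh <node-address>
```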

3. If the kubelet is generating errors:
```bash
systemctl status kubelet
```

```bash
systemctl restart kubelet
```

4. If the node cannot be recovered:
- If possible, safely drain the node
```bash
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
```
- Investigate hardware/infrastructure issues
- Consider replacing the node if necessary (see the example below)
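If you do replace the node, a common sequence is sketched below. Removing a control plane node is environment specific, especially when it hosts an etcd member, so treat this only as an outline:

```bash
# After draining, remove the node object from the cluster
kubectl delete node <node-name>

# If the node hosted an etcd member, remove it from the member list before
# adding a replacement (add the --endpoints/--cacert/--cert/--key flags
# required in your environment)
etcdctl member list
etcdctl member remove <member-id>
```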

## Additional notes
- Maintain at least three control plane nodes for high availability
- Monitor etcd cluster health if the affected node runs etcd (see the sketch below)
- Document any infrastructure-specific recovery procedures
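A quick etcd health check, assuming a stacked etcd deployment with kubeadm-style certificate paths; adjust the endpoints and paths for your environment:

```bash
# Run on a control plane node that hosts an etcd member
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint health --cluster
```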

<!--DS: If you cannot resolve the issue, log in to the
link:https://access.redhat.com[Customer Portal] and open a support case,
attaching the artifacts gathered during the diagnosis procedure.-->
<!--USstart-->
If you cannot resolve the issue, see the following resources:

- [OKD Help](https://www.okd.io/help/)
- [#virtualization Slack channel](https://kubernetes.slack.com/channels/virtualization)
<!--USend-->
# HighCPUWorkload

## Meaning

This alert fires when a node's CPU utilization exceeds 90% for more than 5 minutes.

## Impact

High CPU utilization can lead to:
- Degraded performance of applications running on the node
- Increased latency in request processing
- Potential service disruptions if CPU usage continues to climb

## Diagnosis

1. Identify the affected node:
```bash
kubectl get nodes
```

2. Check the node's resource usage:
```bash
kubectl describe node <node-name>
```

3. List pods that consume high amounts of CPU:
```bash
kubectl top pods --all-namespaces --sort-by=cpu
```

4. Investigate specific pod details if needed:
```bash
kubectl describe pod <pod-name> -n <namespace>
```

## Mitigation

1. If the issue was caused by a malfunctioning pod:
- Consider restarting the pod
- Check pod logs for anomalies (see the example below)
- Review pod resource limits and requests
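For example, to inspect and recycle a suspect pod (the pod name and namespace are placeholders; a pod owned by a Deployment or ReplicaSet is recreated automatically after deletion):

```bash
# Review recent logs; --previous shows the prior container instance if it restarted
kubectl logs <pod-name> -n <namespace> --previous

# Check the pod's configured CPU requests and limits
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[*].resources}'

# Restart the pod by deleting it so its controller recreates it
kubectl delete pod <pod-name> -n <namespace>
```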

2. If the issue is system-wide:
- Check for system processes that consume high amounts of CPU
- Consider cordoning the node and migrating workloads (see the example below)
- Evaluate if node scaling is needed
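To keep new pods off the node while you investigate, and to see which host processes are consuming CPU (the `ps` command is run on the node itself, for example over SSH):

```bash
# Prevent new pods from being scheduled onto the node
kubectl cordon <node-name>

# On the node, list the processes using the most CPU
ps -eo pid,comm,%cpu --sort=-%cpu | head -n 15

# If workloads must be moved off the node, drain it
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
```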

3. Long-term solutions to avoid the issue:
- Implement or adjust pod resource limits (see the examples below)
- Consider horizontal pod autoscaling
- Evaluate cluster capacity and scaling needs
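Illustrative commands for the first two points; the deployment name, namespace, and values are placeholders to adapt to your workload:

```bash
# Set explicit CPU requests and limits on a deployment
kubectl set resources deployment <deployment-name> -n <namespace> \
  --requests=cpu=250m --limits=cpu=500m

# Add a Horizontal Pod Autoscaler that scales on CPU utilization
kubectl autoscale deployment <deployment-name> -n <namespace> \
  --min=2 --max=6 --cpu-percent=70
```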

## Additional notes
- Monitor the node after mitigation to ensure CPU usage returns to normal
- Review application logs for potential root causes
- Consider updating resource requests/limits if this is a recurring issue

<!--DS: If you cannot resolve the issue, log in to the
link:https://access.redhat.com[Customer Portal] and open a support case,
attaching the artifacts gathered during the diagnosis procedure.-->
<!--USstart-->
If you cannot resolve the issue, see the following resources:

- [OKD Help](https://www.okd.io/help/)
- [#virtualization Slack channel](https://kubernetes.slack.com/channels/virtualization)
<!--USend-->
# NodeNetworkInterfaceDown

## Meaning

This alert fires when one or more network interfaces on a node have been down
for more than 5 minutes. The alert excludes virtual ethernet (veth) devices and
bridge tunnels.

## Impact

Network interface failures can lead to:
- Reduced network connectivity for pods on the affected node
- Potential service disruptions if critical network paths are affected
- Degraded cluster communication if management interfaces are impacted

## Diagnosis

1. Identify the affected node and interfaces:
```bash
kubectl get nodes
```

```bash
ssh <node-address>
```

```bash
ip link show | grep -i down
```

2. Check network interface details:
```bash
ip addr show
```

```bash
ethtool <interface-name>
```

3. Review system logs for network-related issues:
```bash
journalctl -u NetworkManager
```

```bash
dmesg | grep -i eth
```

## Mitigation

1. For physical interface issues:
- Check physical cable connections
- Verify switch port configuration
- Test the interface with a different cable/port

2. For software or configuration issues:
```bash
# Restart NetworkManager
systemctl restart NetworkManager
```

```bash
# Bring interface up manually
ip link set <interface-name> up
```

3. If the issue persists:
- Check network interface configuration files (see the sketch below)
- Verify driver compatibility
- If the failure is on a physical interface, consider hardware replacement
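A sketch of how to review the interface configuration and driver on a NetworkManager-managed node; file locations and tooling vary by distribution:

```bash
# List NetworkManager connection profiles and the devices they are bound to
nmcli connection show

# Inspect the profile for the affected interface
nmcli connection show <connection-name>

# Show driver, firmware, and bus information for the interface
ethtool -i <interface-name>

# Look for driver or firmware errors mentioning the interface
dmesg | grep -i <interface-name>
```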

## Additional notes
- Monitor interface status after mitigation
- Document any hardware replacements or configuration changes
- Consider implementing network redundancy for critical interfaces (see the sketch below)
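One common form of redundancy is an active-backup bond. The nmcli sketch below is only illustrative: interface and profile names are placeholders, and clusters that manage host networking declaratively (for example via NMState or machine configs) should use those mechanisms instead:

```bash
# Create an active-backup bond and attach two physical interfaces to it
nmcli connection add type bond con-name bond0 ifname bond0 \
  bond.options "mode=active-backup,miimon=100"
nmcli connection add type ethernet con-name bond0-port1 ifname <nic1> master bond0
nmcli connection add type ethernet con-name bond0-port2 ifname <nic2> master bond0

# Verify the bond state and which interface is currently active
cat /proc/net/bonding/bond0
```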

<!--DS: If you cannot resolve the issue, log in to the
link:https://access.redhat.com[Customer Portal] and open a support case,
attaching the artifacts gathered during the diagnosis procedure.-->
<!--USstart-->
If you cannot resolve the issue, see the following resources:

- [OKD Help](https://www.okd.io/help/)
- [#virtualization Slack channel](https://kubernetes.slack.com/channels/virtualization)
<!--USend-->