|
14 | 14 | - [Verify Pod has the resource limit](#verify-pod-has-the-resource-limit)
|
15 | 15 | - [Verify Pod has the pod-eni annotation](#verify-pod-has-the-pod-eni-annotation)
|
16 | 16 | - [Check Issues with VPC CNI](#check-issues-with-vpc-cni)
|
| 17 | + - [Connection timeouts](#connection-timeouts) |
| 18 | + - [IP starvation issue](#ip-starvation-issue) |
17 | 19 | - [Troubleshooting Prefix Delegation for Windows](#troubleshooting-prefix-delegation-for-windows)
|
18 | 20 | - [Verify Windows prefix delegation is enabled in the ConfigMap](#verify-windows-prefix-delegation-is-enabled-in-the-configmap)
|
19 | 21 | - [Check both pod events and node events for any specific error](#check-both-pod-events-and-node-events-for-any-specific-error)
|
@@ -272,6 +274,31 @@ If the Pod is still stuck in `ContainerCreating` you can,
|
272 | 274 | - Check the CNI Logs from the collected logs.
|
273 | 275 | - Open an [Issue](https://github.com/aws/amazon-vpc-resource-controller-k8s/issues/new/choose) in this repository if the problem still persists.
|
274 | 276 |
|
| 277 | +### Connection timeouts |
| 278 | + |
| 279 | +If you observe connection failures such as intermittent DNS timeouts on pods that use security groups, you might need to update the branch ENI cooldown period or the kernel ARP cache timeout so that the **values are equal**. Otherwise, a new pod can reuse the IP address of a recently terminated pod before the kernel's ARP cache is updated, which causes DNS failures or general packet drops. |
| 280 | + |
| 281 | +The branch ENI cooldown period is the period of time to wait before deleting the branch ENI for propagation of iptables rules for the deleted pod. This can be set on the `amazon-vpc-cni` configmap. See more details [here](../docs/sgp/sgp_config_options.md). |
| 282 | + |
| 283 | +To update the kernel ARP cache timeout, set the following parameters for each existing interface on the node (`eth0` is used as the example interface here). If the branch ENI cooldown period is 30s, set: |
| 284 | +``` |
| 285 | +sudo sysctl -w net.ipv4.neigh.eth0.gc_stale_time=30 |
| 286 | +sudo sysctl -w net.ipv4.neigh.eth0.base_reachable_time_ms=15000 |
| 287 | +``` |
| 288 | + |
| 289 | +Also set the default so all new interfaces created are configured with these values: |
| 290 | +``` |
| 291 | +sudo sysctl -w net.ipv4.neigh.default.gc_stale_time=30 |
| 292 | +sudo sysctl -w net.ipv4.neigh.default.base_reachable_time_ms=15000 |
| 293 | +``` |
| 294 | + |
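The per-interface commands above can be repeated for every interface with a small loop. The following is a minimal sketch, not part of the original change: the function name `set_arp_timeouts` and its arguments are hypothetical, and it assumes that `base_reachable_time_ms` should be half of the cooldown period in milliseconds, which is only inferred from the 30s / 15000 ms pairing shown above. It writes to the files under `/proc/sys/net/ipv4/neigh/` directly instead of calling `sysctl -w`, which sidesteps key parsing for interface names that contain dots.

```shell
#!/bin/sh
# Hypothetical helper: apply ARP cache timeouts to every interface
# directory under the neigh tree, including "default".
#   $1: branch ENI cooldown period in seconds
#   $2: neigh directory (defaults to the real /proc path)
set_arp_timeouts() {
    cooldown_seconds="$1"
    neigh_dir="${2:-/proc/sys/net/ipv4/neigh}"

    # Assumption: reachable time is half the cooldown, in milliseconds
    # (30s -> 15000 ms, matching the example values above).
    base_reachable_ms=$((cooldown_seconds * 1000 / 2))

    for path in "$neigh_dir"/*/; do
        # Skip the unexpanded glob if the directory has no entries.
        [ -d "$path" ] || continue
        echo "$cooldown_seconds" > "${path}gc_stale_time"
        echo "$base_reachable_ms" > "${path}base_reachable_time_ms"
    done
}
```

Run as root on the node, `set_arp_timeouts 30` would cover the existing interfaces and the `default` entry in one pass. Like `sysctl -w`, these writes do not persist across a reboot.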
| 295 | +### IP starvation issue |
| 296 | + |
| 297 | +If pods are not `Running` because IP addresses are unavailable, even though only a few pods are running and you expect IP addresses to be available, tune the branch ENI cooldown period accordingly. |
| 298 | +The branch ENI cooldown period is the period of time to wait before deleting the branch ENI for propagation of iptables rules for the deleted pod. The default value is 60s, so IP addresses are not released for at least 60s. This can be configured via the `amazon-vpc-cni` configmap as described [here](../docs/sgp/sgp_config_options.md). Note that the minimum cooldown period is 30s. |
| 299 | + |
| 300 | +Be sure to also update the kernel ARP cache timeouts if you notice DNS issues, as outlined in the [above section](#connection-timeouts). |
| 301 | + |
275 | 302 | ## Troubleshooting Prefix Delegation for Windows
|
276 | 303 | Please follow the troubleshooting steps here for issues with Windows Nodes and Pods when using `prefix delegation` mode.
|
277 | 304 |
|