-
Notifications
You must be signed in to change notification settings - Fork 114
Graceful recovery tests fails with NGINX Plus #2022
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Seeing this error in the pipeline as well
|
I have handled this error case in my current PR . We should update the bug description to avoid confusion
|
that could happen if NGF updated NGINX with upstream servers that are not ready yet. we do have a check here to counter it here More over, I don't think it is the right thing to counter it -- I think we either have a bug in the NGF code or the test code. Note that what we do in the test -- https://github.com/nginxinc/nginx-gateway-fabric/blob/38b5498c1008f13bed0811dbbad978a96e3eb827/tests/suite/graceful_recovery_test.go#L142-L153 We redeploy the app and then send requests to NGINX The fact that we call So (1) (2) or some other thing. To test (1), I suggest deploying the cafe example https://github.com/nginxinc/nginx-gateway-fabric/tree/main/examples/cafe-example, and configure the tea pod spec with a readiness probe that sends a request to non-existing port (like 1234), so that the pod doesn't become ready:
and then check what config NGF generates and if it includes any tea endpoints. Since the test failed with Plus, let's check it with the Plus image. Regarding (2), I suspect we have a race condition in https://github.com/nginxinc/nginx-gateway-fabric/blob/38b5498c1008f13bed0811dbbad978a96e3eb827/tests/framework/resourcemanager.go#L311 as it calls https://github.com/nginxinc/nginx-gateway-fabric/blob/38b5498c1008f13bed0811dbbad978a96e3eb827/tests/framework/resourcemanager.go#L312 , specifically here https://github.com/nginxinc/nginx-gateway-fabric/blob/38b5498c1008f13bed0811dbbad978a96e3eb827/tests/framework/resourcemanager.go#L335 when we get the pod list, the pods might not be created, so we get 0 and the function quickly returns success. To test this theory, we create separate functions like below that check the count: func (rm *ResourceManager) WaitForAppsToBeReadyWithPodCount(namespace string, podCount int) error {
ctx, cancel := context.WithTimeout(context.Background(), rm.TimeoutConfig.CreateTimeout)
defer cancel()
return rm.WaitForAppsToBeReadyWithCtxWithPodCount(ctx, namespace, podCount)
}
func (rm *ResourceManager) WaitForAppsToBeReadyWithCtxWithPodCount(ctx context.Context, namespace string, podCount int) error {
if err := rm.WaitForPodsToBeReadyWithCount(ctx, namespace, podCount); err != nil {
return err
}
if err := rm.waitForHTTPRoutesToBeReady(ctx, namespace); err != nil {
return err
}
if err := rm.waitForGRPCRoutesToBeReady(ctx, namespace); err != nil {
return err
}
return rm.waitForGatewaysToBeReady(ctx, namespace)
}
// WaitForPodsToBeReady waits for all Pods in the specified namespace to be ready or
// until the provided context is cancelled.
func (rm *ResourceManager) WaitForPodsToBeReadyWithCount(ctx context.Context, namespace string, count int) error {
return wait.PollUntilContextCancel(
ctx,
500*time.Millisecond,
true, /* poll immediately */
func(ctx context.Context) (bool, error) {
var podList core.PodList
if err := rm.K8sClient.List(ctx, &podList, client.InNamespace(namespace)); err != nil {
return false, err
}
var podsReady int
for _, pod := range podList.Items {
for _, cond := range pod.Status.Conditions {
if cond.Type == core.PodReady && cond.Status == core.ConditionTrue {
podsReady++
}
}
}
return podsReady == count, nil
},
)
} and update the graceful recovery test to call those functions. so instead of
do
|
Could not reproduce the error above, but found a race condition in graceful recovery tests so closing this. Opened another PR |
When graceful recovery test is run with NGF with NGINX Plus, it fails
See https://github.com/nginxinc/nginx-gateway-fabric/actions/runs/9198281036/job/25300657770
NGF Logs excerpts:
Acceptance criteria:
The text was updated successfully, but these errors were encountered: