
Graceful recovery tests fail with NGINX Plus #2022


Closed
pleshakov opened this issue May 23, 2024 · 4 comments
Labels
bug Something isn't working
Milestone
v1.3.0

Comments

@pleshakov
Contributor

pleshakov commented May 23, 2024

When the graceful recovery test is run against NGF with NGINX Plus, it fails.

See https://github.com/nginxinc/nginx-gateway-fabric/actions/runs/9198281036/job/25300657770

NGF log excerpts:

. . .
 {"level":"error","ts":"2024-05-22T21:15:36Z","logger":"eventLoop.eventHandler","msg":"failed to get upstreams from API, reloading configuration instead","batchID":16,"error":"failed to get upstreams: failed to get http/upstreams: Get \"http://nginx-plus-api/api/9/http/upstreams\": dial unix /var/run/nginx/nginx-plus-api.sock: connect: no such file or directory","stacktrace":"github.com/nginxinc/nginx-gateway-fabric/internal/mode/static.(*eventHandlerImpl).updateUpstreamServers\n\t/home/runner/work/nginx-gateway-fabric/nginx-gateway-fabric/internal/mode/static/handler.go:357\ngithub.com/nginxinc/nginx-gateway-fabric/internal/mode/static.(*eventHandlerImpl).HandleEventBatch\n\t/home/runner/work/nginx-gateway-fabric/nginx-gateway-fabric/internal/mode/static/handler.go:204\ngithub.com/nginxinc/nginx-gateway-fabric/internal/framework/events.(*EventLoop).Start.func1.1\n\t/home/runner/work/nginx-gateway-fabric/nginx-gateway-fabric/internal/framework/events/loop.go:74"}

Acceptance criteria:

  • Further identify any bugs
@mpstefan mpstefan added the bug Something isn't working label Jun 3, 2024
@mpstefan mpstefan added this to the v1.3.0 milestone Jun 3, 2024
@salonichf5
Contributor

salonichf5 commented Jun 3, 2024

Seeing this error in the pipeline as well

[FAILED] 2024/06/03 19:56:02 [error] 56#56: *8 no live upstreams while connecting to upstream, client: 127.0.0.1, server: cafe.example.com, request: "GET /tea HTTP/2.0", upstream: "http://graceful-recovery_tea_80/tea", host: "cafe.example.com:10443"
  Expected
      <string>: 2024/06/03 19:56:02 [error] 56#56: *8 no live upstreams while connecting to upstream, client: 127.0.0.1, server: cafe.example.com, request: "GET /tea HTTP/2.0", upstream: "http://graceful-recovery_tea_80/tea", host: "cafe.example.com:10443"
  to contain substring
      <string>: connect() failed (111: Connection refused)
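
For readability, the failing assertion is of roughly this Gomega shape (a sketch; the errorLog variable name is hypothetical): the test expects the captured NGINX error log to contain a connection-refused message, but a "no live upstreams" line appears instead.

	// Sketch of the expectation that produced the failure above.
	Expect(errorLog).To(ContainSubstring("connect() failed (111: Connection refused)"))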

@salonichf5
Contributor

I have handled this error case in my current PR. We should update the bug description to avoid confusion:

{"level":"error","ts":"2024-05-22T21:15:18Z","logger":"usageReporter","msg":"Usage reporting must be enabled when using NGINX Plus; redeploy with usage reporting enabled","error":"usage reporting not enabled","stacktrace":"github.com/nginxinc/nginx-gateway-fabric/internal/mode/static.createUsageWarningJob.func1\n\t/home/runner/work/nginx-gateway-fabric/nginx-gateway-fabric/internal/mode/static/manager.go:601\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1\n\tpkg/mod/k8s.io/[email protected]/pkg/util/wait/backoff.go:259\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\tpkg/mod/k8s.io/[email protected]/pkg/util/wait/backoff.go:226\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\tpkg/mod/k8s.io/[email protected]/pkg/util/wait/backoff.go:227\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\tpkg/mod/k8s.io/[email protected]/pkg/util/wait/backoff.go:204\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext\n\tpkg/mod/k8s.io/[email protected]/pkg/util/wait/backoff.go:259\ngithub.com/nginxinc/nginx-gateway-fabric/internal/framework/runnables.(*CronJob).Start\n\t/home/runner/work/nginx-gateway-fabric/nginx-gateway-fabric/internal/framework/runnables/cronjob.go:53\nsigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1\n\tpkg/mod/sigs.k8s.io/[email protected]/pkg/manager/runnable_group.go:226"}

@pleshakov
Contributor Author

Seeing this error in the pipeline as well

@salonichf5

That could happen if NGF updated NGINX with upstream servers that are not ready yet.

We do have a check to counter it here:
https://github.com/nginxinc/nginx-gateway-fabric/blob/38b5498c1008f13bed0811dbbad978a96e3eb827/tests/suite/graceful_recovery_test.go#L267
but it doesn't fully counter it, as another error can appear.

Moreover, I don't think countering it in the test is the right approach -- I think we have a bug either in the NGF code or in the test code.

Note what we do in the test -- https://github.com/nginxinc/nginx-gateway-fabric/blob/38b5498c1008f13bed0811dbbad978a96e3eb827/tests/suite/graceful_recovery_test.go#L142-L153:

We redeploy the app and then send requests to NGINX.

The fact that we call Expect(resourceManager.WaitForAppsToBeReady(ns.Name)).To(Succeed()), combined with the readiness probe enabled in the backend pods -- https://github.com/nginxinc/nginx-gateway-fabric/blob/38b5498c1008f13bed0811dbbad978a96e3eb827/tests/suite/manifests/graceful-recovery/cafe.yaml#L57-L61 -- should have prevented unready endpoints from being propagated to NGINX upstreams in the first place.

So, either:

(1) we have a bug in the code that propagates not-ready endpoints to NGINX upstreams (see the sketch after this list for what filtering on endpoint readiness looks like),

(2) resourceManager.WaitForAppsToBeReady(ns.Name) doesn't work as expected,

or something else entirely.
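
For context on (1): a pod that fails its readiness probe is reported as not ready in its EndpointSlice, and the control plane is expected to skip such endpoints when building upstreams. A minimal, hypothetical sketch of that filtering (not NGF's actual code; discoveryv1 is k8s.io/api/discovery/v1):

// readyAddresses returns only the addresses of endpoints that are ready,
// so a pod failing its readiness probe must not show up in the result.
func readyAddresses(slices []discoveryv1.EndpointSlice) []string {
	var addrs []string
	for _, es := range slices {
		for _, ep := range es.Endpoints {
			// Per the EndpointSlice API, a nil Ready condition should be
			// interpreted as ready; only an explicit false excludes the endpoint.
			if ep.Conditions.Ready != nil && !*ep.Conditions.Ready {
				continue
			}
			addrs = append(addrs, ep.Addresses...)
		}
	}
	return addrs
}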

To test (1), I suggest deploying the cafe example https://github.com/nginxinc/nginx-gateway-fabric/tree/main/examples/cafe-example and configuring the tea pod spec with a readiness probe that sends requests to a non-existent port (like 1234), so that the pod never becomes ready:

      containers:
      - name: tea
        image: nginxdemos/nginx-hello:plain-text
        ports:
        - containerPort: 8080
        readinessProbe:
          httpGet:
            path: /
            port: 1234

and then checking what config NGF generates and whether it includes any tea endpoints. Since the test failed with Plus, let's check it with the Plus image.

Regarding (2), I suspect we have a race condition in https://github.com/nginxinc/nginx-gateway-fabric/blob/38b5498c1008f13bed0811dbbad978a96e3eb827/tests/framework/resourcemanager.go#L311, as it calls https://github.com/nginxinc/nginx-gateway-fabric/blob/38b5498c1008f13bed0811dbbad978a96e3eb827/tests/framework/resourcemanager.go#L312, specifically here: https://github.com/nginxinc/nginx-gateway-fabric/blob/38b5498c1008f13bed0811dbbad978a96e3eb827/tests/framework/resourcemanager.go#L335

When we get the pod list, the pods might not have been created yet, so we get 0 pods and the function quickly returns success.
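
Conceptually, the suspected failure mode looks something like this (a simplified, hypothetical condition function, not the actual resourcemanager.go code):

// allListedPodsReady is a hypothetical simplification of the suspected check:
// if the pods have not been created yet, the list is empty, readyPods and
// len(podList.Items) are both 0, and the poll reports success immediately.
func allListedPodsReady(ctx context.Context, c client.Client, namespace string) (bool, error) {
	var podList core.PodList
	if err := c.List(ctx, &podList, client.InNamespace(namespace)); err != nil {
		return false, err
	}

	readyPods := 0
	for _, pod := range podList.Items {
		for _, cond := range pod.Status.Conditions {
			if cond.Type == core.PodReady && cond.Status == core.ConditionTrue {
				readyPods++
			}
		}
	}

	// With an empty pod list this is 0 == 0, i.e. "ready" before any pod exists.
	return readyPods == len(podList.Items), nil
}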

To test this theory, we can create separate functions like the ones below that check the expected pod count:

func (rm *ResourceManager) WaitForAppsToBeReadyWithPodCount(namespace string, podCount int) error {
	ctx, cancel := context.WithTimeout(context.Background(), rm.TimeoutConfig.CreateTimeout)
	defer cancel()

	return rm.WaitForAppsToBeReadyWithCtxWithPodCount(ctx, namespace, podCount)
}

func (rm *ResourceManager) WaitForAppsToBeReadyWithCtxWithPodCount(ctx context.Context, namespace string, podCount int) error {
	if err := rm.WaitForPodsToBeReadyWithCount(ctx, namespace, podCount); err != nil {
		return err
	}

	if err := rm.waitForHTTPRoutesToBeReady(ctx, namespace); err != nil {
		return err
	}

	if err := rm.waitForGRPCRoutesToBeReady(ctx, namespace); err != nil {
		return err
	}

	return rm.waitForGatewaysToBeReady(ctx, namespace)
}

// WaitForPodsToBeReadyWithCount waits until the given number of Pods in the specified
// namespace are ready or until the provided context is cancelled.
func (rm *ResourceManager) WaitForPodsToBeReadyWithCount(ctx context.Context, namespace string, count int) error {
	return wait.PollUntilContextCancel(
		ctx,
		500*time.Millisecond,
		true, /* poll immediately */
		func(ctx context.Context) (bool, error) {
			var podList core.PodList
			if err := rm.K8sClient.List(ctx, &podList, client.InNamespace(namespace)); err != nil {
				return false, err
			}

			var podsReady int
			for _, pod := range podList.Items {
				for _, cond := range pod.Status.Conditions {
					if cond.Type == core.PodReady && cond.Status == core.ConditionTrue {
						podsReady++
					}
				}
			}

			return podsReady == count, nil
		},
	)
}

and update the graceful recovery test to call those functions.

so instead of

	Expect(resourceManager.WaitForAppsToBeReady(ns.Name)).To(Succeed())

do

	Expect(resourceManager.WaitForAppsToBeReadyWithPodCount(ns.Name, 2)).To(Succeed())

@salonichf5
Contributor

Could not reproduce the error above, but found a race condition in the graceful recovery tests, so closing this. Opened another PR.
