cmd/openshift-install/create: Make waitForInstallComplete() more robust #1413

wking · 2019-03-14T17:41:46Z

If the cluster failed to completely initialize, we may still have a router CA or console URL. With this commit, errors in finalization components are no longer immediately fatal; instead we hold any errors but continue to try the later components. We retain the first component component error, and log any later errors.

To avoid waiting on a console URL when the cluster failed to initialize, I've added a oneShot argument to waitForConsole. When waitForInitializedCluster failed, we set oneShot true and only make a single attempt to resolve the console URL. I've also shuffled around the Until function to flatten some of its previous if/else blocks so I only have to add single oneShot-cancel block.

Tangentially related, addRouterCAToClusterCA is idempotent, because AppendCertsFromPEM wraps AddCert and AddCert checks to avoid duplicate certificates.

CC @abhinavdahiya, I think this is as discussed on the group-G arch call just now.

cmd/openshift-install/create.go

wking · 2019-03-21T19:25:56Z

Heh, this is not quite what I was aiming for:

level=info msg="Waiting up to context.Background.WithDeadline(2019-03-21 18:52:12.039581075 +0000 UTC m=+2153.930641135 [9m59.999985838s]) for openshift-console routes to be created..."
level=info msg="Install complete!"
level=info msg="Run 'export KUBECONFIG=/tmp/artifacts/installer/auth/kubeconfig' to manage the cluster with 'oc', the OpenShift CLI."
level=info msg="The cluster is ready when 'oc login -u kubeadmin -p BdURA-FwDhx-B5Jix-JHHYK' succeeds (wait a few minutes)."
level=info msg="Login to the console with user: kubeadmin, password: BdURA-FwDhx-B5Jix-JHHYK"
level=fatal msg="failed to wait for openshift-console routes: timed out waiting for the condition"

wking · 2019-03-22T08:49:57Z

Ok, updated a few things with 2f625248e -> 8aba2fac1. Hopefully that covers it, although I guess CI will tell us ;).

wking · 2019-03-22T13:17:41Z

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/1413/pull-ci-openshift-installer-master-e2e-aws/4580/artifacts/e2e-aws/installer/.openshift_install.log | grep -B5 fatal
time="2019-03-22T09:23:19Z" level=debug msg="Cluster is initialized"
time="2019-03-22T09:23:19Z" level=info msg="Waiting up to 10m0s for openshift-console routes to be created..."
time="2019-03-22T09:33:19Z" level=info msg="Run 'export KUBECONFIG=/tmp/artifacts/installer/auth/kubeconfig' to manage the cluster with 'oc', the OpenShift CLI."
time="2019-03-22T09:33:19Z" level=info msg="Depending on the severity of the error, you may be able to log in with 'oc login -u kubeadmin -p Gqqf8-f5sSv-5KqZS-saiZF https://api.ci-op-9pdd09zr-1d3f3.origin-ci-int-aws.dev.rhcloud.com:6443' succeeds."
time="2019-03-22T09:33:19Z" level=info msg="Login to the console with user: kubeadmin, password: Gqqf8-f5sSv-5KqZS-saiZF"
time="2019-03-22T09:33:19Z" level=fatal msg="failed to wait for openshift-console routes: timed out waiting for the condition"
$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/1413/pull-ci-openshift-installer-master-e2e-aws/4580/artifacts/e2e-aws/pods/openshift-cluster-version_cluster-version-operator-d6958f9b6-2gr9r_cluster-version-operator.log.gz | gunzip | grep route | grep downloads | head -n2
I0322 09:16:56.022824       1 sync_worker.go:462] Running sync for route "openshift-console/downloads" (270 of 310)
I0322 09:16:56.095793       1 sync_worker.go:475] Done syncing for route "openshift-console/downloads" (270 of 310)

I seem to have broken the route watcher. Robust finish is looking pretty good though.

cmd/openshift-install/create.go

wking · 2019-03-22T13:40:20Z

So my UntilWithSync reroll (requested here) was broken, and even if it wasn't, the watcher approach was not compatible with oneShot (at least as I had it implemented so far). I've pushed 8aba2fac1 -> 2995aef49, rebasing onto master and dropping the UntilWithSync commit. Can we land the finish commit, and I'll come back around and take another pass at UntilWithSync for routes in a follow-up PR?

wking · 2019-03-22T15:39:13Z

e2e-aws:

level=fatal msg="failed to initialize the cluster: timed out waiting for the condition"

/retest

wking · 2019-03-22T16:07:32Z

e2e-aws:

level=error msg="\t* aws_route_table_association.route_net.1: timeout while waiting for state to become 'success' (timeout: 5m0s)"

/retest

wking · 2019-03-22T17:03:20Z

e2e-aws:

level=error msg="\t* aws_instance.master.1: Error waiting for instance (i-010d8bf6e3b4e5f7e) to become ready: timeout while waiting for state to become 'running' (last state: 'pending', timeout: 10m0s)"

Amazon short of instances :/

/retest

wking · 2019-03-26T13:35:45Z

/retest

wking · 2019-03-26T15:58:46Z

All green. @abhinavdahiya, can you take another look at some point today?

abhinavdahiya · 2019-03-26T23:51:48Z

cmd/openshift-install/create.go

 	consoleRouteContext, cancel := context.WithTimeout(ctx, consoleRouteTimeout)
 	defer cancel()
+	if oneShot {
+		logrus.Infof("Checking for the %s route...", consoleNamespace)


/s/consoleNamespace/consoleRouteName

/s/consoleNamespace/consoleRouteName

My follow up is waiting for the downloads route too; this choice will be a better fit then. Even without follow-up work, I think "the main console-namespace route" isn't that bad vs. naming the route without the namespace. If we end up sticking with a single route, I'll probably write both the namespace and name.

abhinavdahiya · 2019-03-26T23:51:59Z

cmd/openshift-install/create.go

+	if oneShot {
+		logrus.Infof("Checking for the %s route...", consoleNamespace)
+	} else {
+		logrus.Infof("Waiting up to %v for the %s route to be created...", consoleRouteTimeout, consoleNamespace)


/s/consoleNamespace/consoleRouteName

abhinavdahiya · 2019-03-27T00:00:52Z

cmd/openshift-install/create.go

-	if err := waitForInitializedCluster(ctx, config); err != nil {
-		return err
+	errs := []error{}
+	err := waitForInitializedCluster(ctx, config)


this is

err := <> if err != nil { .. }

while all other calls are

if err = <>; err != nil { .... }

??

... while all other calls are...

They're all using this err ;). But whatever, I wish gofmt would just make this sort of thing consistent with whatever a coin-flip says is the easiest to read :p.

wking · 2019-03-27T07:44:08Z

e2e-aws:

level=fatal msg="failed to initialize the cluster: Working towards 0.0.1-2019-03-27-010000: 98% complete: timed out waiting for the condition"

/retest

wking · 2019-03-27T07:49:08Z

Depending on the severity of the error, you may be able to log in with: oc login -u kubeadmin -p
I don't think we should be telling people maybe you will be able to log in. Please drop this

I thought that was helpful ;). You'd rather me not say anything about oc login when we have errors? Do I need to ever say anything about oc login? Folks using oc will likely be using auth/kubeconfig or setting up a real identity provider anyway.

abhinavdahiya · 2019-03-27T11:53:05Z

Depending on the severity of the error, you may be able to log in with: oc login -u kubeadmin -p
I don't think we should be telling people maybe you will be able to log in. Please drop this
I thought that was helpful ;). You'd rather me not say anything about oc login when we have errors? Do I need to ever say anything about oc login? Folks using oc will likely be using auth/kubeconfig or setting up a real identity provider anyway.

I'm not sure, if oc login needs to be called out at all. But it is now... So making sure we are not calling that out when we don't know if the authentication operator succeeded...

oc with auth/kubeconfig will work as we have successfully bootstrapped. We can keep that part on partial failure.

wking · 2019-04-05T21:28:05Z

I'm not sure, if oc login needs to be called out at all.

Rebased onto master with 2995aef49 -> c3723847d, and dropping the oc login reference as part of that (thanks to #1513).

wking · 2019-04-05T22:21:20Z

oc delete project ci-op-84i4xlit

/retest

wking · 2019-04-11T03:00:38Z

Rebased onto master with c3723847d -> 19866aa38.

wking · 2019-04-11T03:14:03Z

And I've pushed 19866aa38 -> cddd3f0ec to distingish between Failing and Progressing.

wking · 2019-04-11T04:28:33Z

All green. Here's the end of the e2e-aws run:

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/1413/pull-ci-openshift-installer-master-e2e-aws/5140/artifacts/e2e-aws/installer/.openshift_install.log | grep -A8 'Cluster is initialized'
time="2019-04-11T03:53:34Z" level=debug msg="Cluster is initialized"
time="2019-04-11T03:53:34Z" level=info msg="Waiting up to 10m0s for the openshift-console route to be created..."
time="2019-04-11T03:53:34Z" level=debug msg="Route found in openshift-console namespace: console"
time="2019-04-11T03:53:34Z" level=debug msg="Route found in openshift-console namespace: downloads"
time="2019-04-11T03:53:34Z" level=debug msg="OpenShift console route is created"
time="2019-04-11T03:53:34Z" level=info msg="Install complete!"
time="2019-04-11T03:53:34Z" level=info msg="To access the cluster as the system:admin user when using 'oc', run 'export KUBECONFIG=/tmp/artifacts/installer/auth/kubeconfig'"
time="2019-04-11T03:53:34Z" level=info msg="Access the OpenShift web-console here: https://console-openshift-console.apps.ci-op-b89kp5jd-1d3f3.origin-ci-int-aws.dev.rhcloud.com"
time="2019-04-11T03:53:34Z" level=info msg="Login to the console with user: kubeadmin, password: ShUna-zfIzJ-HPSYX-WhzpE"

trown · 2019-04-16T13:16:59Z

/test e2e-openstack

trown · 2019-04-18T17:47:37Z

/test e2e-openstack

If the cluster failed to completely initialize, we may still have a router CA or console URL. With this commit, errors in finalization components are no longer immediately fatal; instead we hold any errors but continue to try the later components. We retain the first component component error, and log any later errors. To avoid waiting on a console URL when the cluster failed to initialize, I've added a oneShot argument to waitForConsole. When waitForInitializedCluster failed, we set oneShot true and only make a single attempt to resolve the console URL. I've also shuffled around the Until function to flatten some of its previous if/else blocks so I only have to add single oneShot-cancel block. Tangentially related, addRouterCAToClusterCA is idempotent, because AppendCertsFromPEM wraps AddCert [1] and AddCert checks to avoid duplicate certificates [2]. [1]: https://github.com/golang/go/blob/go1.12/src/crypto/x509/cert_pool.go#L144 [2]: https://github.com/golang/go/blob/go1.12/src/crypto/x509/cert_pool.go#L106-L109

…sing And point at 'wait-for install-complete'. This gives a better experience for users that we time out because they're installing over a slow network or something. We might want to fail fast if we see a Failing, instead of waiting for the full timeout. But I'm leaving that alone for now; we can always come back to it in follow-up work.

wking · 2019-05-14T21:22:52Z

Rebased around #1718 with f79f53c0d -> 773a2b0.

wking · 2019-05-29T17:13:29Z

e2e-aws:

level=fatal msg="failed to initialize the cluster: Some cluster operators are still updating: authentication, console: timed out waiting for the condition"

/retest

openshift-ci-robot · 2019-05-29T18:07:51Z

@wking: The following test failed, say /retest to rerun them all:

Test name	Commit	Details	Rerun command
ci/prow/e2e-aws-scaleup-rhel7	`773a2b0`	link	`/test e2e-aws-scaleup-rhel7`

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

openshift-ci-robot · 2019-06-23T21:28:36Z

@wking: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

abhinavdahiya · 2019-07-24T22:16:13Z

closing due to inactivity. Please reopen if needed.

/close

openshift-ci-robot · 2019-07-24T22:16:14Z

@abhinavdahiya: Closed this PR.

In response to this:

closing due to inactivity. Please reopen if needed.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Mar 14, 2019

wking force-pushed the finishing-with-partial-failures branch from c8c4786 to 445e27e Compare March 14, 2019 17:42

openshift-ci-robot requested review from aaronlevy and tomassedovic March 14, 2019 17:42

abhinavdahiya reviewed Mar 21, 2019

View reviewed changes

cmd/openshift-install/create.go Outdated Show resolved Hide resolved

abhinavdahiya reviewed Mar 21, 2019

View reviewed changes

cmd/openshift-install/create.go Show resolved Hide resolved

wking force-pushed the finishing-with-partial-failures branch 4 times, most recently from d3214ac to 2f62524 Compare March 21, 2019 18:05

openshift-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Mar 21, 2019

wking force-pushed the finishing-with-partial-failures branch from 2f62524 to 8aba2fa Compare March 22, 2019 08:49

wking commented Mar 22, 2019

View reviewed changes

cmd/openshift-install/create.go Outdated Show resolved Hide resolved

wking force-pushed the finishing-with-partial-failures branch from 8aba2fa to 2995aef Compare March 22, 2019 13:36

openshift-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Mar 22, 2019

abhinavdahiya reviewed Mar 26, 2019

View reviewed changes

abhinavdahiya reviewed Mar 27, 2019

View reviewed changes

openshift-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 2, 2019

wking force-pushed the finishing-with-partial-failures branch from 2995aef to c372384 Compare April 5, 2019 21:27

openshift-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 5, 2019

wking mentioned this pull request Apr 11, 2019

status: Report the operators that have not yet deployed openshift/cluster-version-operator#158

Merged

wking force-pushed the finishing-with-partial-failures branch from c372384 to 19866aa Compare April 11, 2019 02:59

wking force-pushed the finishing-with-partial-failures branch from cddd3f0 to f79f53c Compare April 12, 2019 20:30

wking changed the title ~~cmd/openshift-install/create: Make finish() more robust~~ cmd/openshift-install/create: Make waitForInstallComplete() more robust Apr 12, 2019

openshift-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 12, 2019

wking added 2 commits May 14, 2019 14:19

wking force-pushed the finishing-with-partial-failures branch from f79f53c to 773a2b0 Compare May 14, 2019 21:22

openshift-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 14, 2019

openshift-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 23, 2019

openshift-ci-robot closed this Jul 24, 2019

cmd/openshift-install/create: Make waitForInstallComplete() more robust #1413

cmd/openshift-install/create: Make waitForInstallComplete() more robust #1413

Uh oh!

Conversation

wking commented Mar 14, 2019

Uh oh!

Uh oh!

Uh oh!

wking commented Mar 21, 2019

Uh oh!

wking commented Mar 22, 2019

Uh oh!

wking commented Mar 22, 2019

Uh oh!

Uh oh!

wking commented Mar 22, 2019

Uh oh!

wking commented Mar 22, 2019

Uh oh!

wking commented Mar 22, 2019

Uh oh!

wking commented Mar 22, 2019

Uh oh!

wking commented Mar 26, 2019

Uh oh!

wking commented Mar 26, 2019

Uh oh!

abhinavdahiya Mar 26, 2019

Choose a reason for hiding this comment

Uh oh!

wking Mar 27, 2019

Choose a reason for hiding this comment

Uh oh!

abhinavdahiya Mar 26, 2019

Choose a reason for hiding this comment

Uh oh!

abhinavdahiya Mar 27, 2019

Choose a reason for hiding this comment

Uh oh!

wking Mar 27, 2019

Choose a reason for hiding this comment

Uh oh!

wking commented Mar 27, 2019

Uh oh!

wking commented Mar 27, 2019

Uh oh!

abhinavdahiya commented Mar 27, 2019

Uh oh!

wking commented Apr 5, 2019

Uh oh!

wking commented Apr 5, 2019

Uh oh!

wking commented Apr 11, 2019

Uh oh!

wking commented Apr 11, 2019

Uh oh!

wking commented Apr 11, 2019

Uh oh!

trown commented Apr 16, 2019

Uh oh!

trown commented Apr 18, 2019

Uh oh!

wking commented May 14, 2019

Uh oh!

wking commented May 29, 2019

Uh oh!

openshift-ci-robot commented May 29, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

openshift-ci-robot commented Jun 23, 2019

Uh oh!

abhinavdahiya commented Jul 24, 2019

Uh oh!

openshift-ci-robot commented Jul 24, 2019

Uh oh!

Uh oh!

openshift-ci-robot commented May 29, 2019 •

edited

Loading