Skip to content

cmd/openshift-install/create: Make waitForInstallComplete() more robust #1413

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from

Conversation

wking
Copy link
Member

@wking wking commented Mar 14, 2019

If the cluster failed to completely initialize, we may still have a router CA or console URL. With this commit, errors in finalization components are no longer immediately fatal; instead we hold any errors but continue to try the later components. We retain the first component component error, and log any later errors.

To avoid waiting on a console URL when the cluster failed to initialize, I've added a oneShot argument to waitForConsole. When waitForInitializedCluster failed, we set oneShot true and only make a single attempt to resolve the console URL. I've also shuffled around the Until function to flatten some of its previous if/else blocks so I only have to add single oneShot-cancel block.

Tangentially related, addRouterCAToClusterCA is idempotent, because AppendCertsFromPEM wraps AddCert and AddCert checks to avoid duplicate certificates.

CC @abhinavdahiya, I think this is as discussed on the group-G arch call just now.

@openshift-ci-robot openshift-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Mar 14, 2019
@wking wking force-pushed the finishing-with-partial-failures branch from c8c4786 to 445e27e Compare March 14, 2019 17:42
@wking wking force-pushed the finishing-with-partial-failures branch 4 times, most recently from d3214ac to 2f62524 Compare March 21, 2019 18:05
@openshift-ci-robot openshift-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Mar 21, 2019
@wking
Copy link
Member Author

wking commented Mar 21, 2019

Heh, this is not quite what I was aiming for:

level=info msg="Waiting up to context.Background.WithDeadline(2019-03-21 18:52:12.039581075 +0000 UTC m=+2153.930641135 [9m59.999985838s]) for openshift-console routes to be created..."
level=info msg="Install complete!"
level=info msg="Run 'export KUBECONFIG=/tmp/artifacts/installer/auth/kubeconfig' to manage the cluster with 'oc', the OpenShift CLI."
level=info msg="The cluster is ready when 'oc login -u kubeadmin -p BdURA-FwDhx-B5Jix-JHHYK' succeeds (wait a few minutes)."
level=info msg="Login to the console with user: kubeadmin, password: BdURA-FwDhx-B5Jix-JHHYK"
level=fatal msg="failed to wait for openshift-console routes: timed out waiting for the condition"

@wking wking force-pushed the finishing-with-partial-failures branch from 2f62524 to 8aba2fa Compare March 22, 2019 08:49
@wking
Copy link
Member Author

wking commented Mar 22, 2019

Ok, updated a few things with 2f625248e -> 8aba2fac1. Hopefully that covers it, although I guess CI will tell us ;).

@wking
Copy link
Member Author

wking commented Mar 22, 2019

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/1413/pull-ci-openshift-installer-master-e2e-aws/4580/artifacts/e2e-aws/installer/.openshift_install.log | grep -B5 fatal
time="2019-03-22T09:23:19Z" level=debug msg="Cluster is initialized"
time="2019-03-22T09:23:19Z" level=info msg="Waiting up to 10m0s for openshift-console routes to be created..."
time="2019-03-22T09:33:19Z" level=info msg="Run 'export KUBECONFIG=/tmp/artifacts/installer/auth/kubeconfig' to manage the cluster with 'oc', the OpenShift CLI."
time="2019-03-22T09:33:19Z" level=info msg="Depending on the severity of the error, you may be able to log in with 'oc login -u kubeadmin -p Gqqf8-f5sSv-5KqZS-saiZF https://api.ci-op-9pdd09zr-1d3f3.origin-ci-int-aws.dev.rhcloud.com:6443' succeeds."
time="2019-03-22T09:33:19Z" level=info msg="Login to the console with user: kubeadmin, password: Gqqf8-f5sSv-5KqZS-saiZF"
time="2019-03-22T09:33:19Z" level=fatal msg="failed to wait for openshift-console routes: timed out waiting for the condition"
$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/1413/pull-ci-openshift-installer-master-e2e-aws/4580/artifacts/e2e-aws/pods/openshift-cluster-version_cluster-version-operator-d6958f9b6-2gr9r_cluster-version-operator.log.gz | gunzip | grep route | grep downloads | head -n2
I0322 09:16:56.022824       1 sync_worker.go:462] Running sync for route "openshift-console/downloads" (270 of 310)
I0322 09:16:56.095793       1 sync_worker.go:475] Done syncing for route "openshift-console/downloads" (270 of 310)

I seem to have broken the route watcher. Robust finish is looking pretty good though.

@wking wking force-pushed the finishing-with-partial-failures branch from 8aba2fa to 2995aef Compare March 22, 2019 13:36
@openshift-ci-robot openshift-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Mar 22, 2019
@wking
Copy link
Member Author

wking commented Mar 22, 2019

So my UntilWithSync reroll (requested here) was broken, and even if it wasn't, the watcher approach was not compatible with oneShot (at least as I had it implemented so far). I've pushed 8aba2fac1 -> 2995aef49, rebasing onto master and dropping the UntilWithSync commit. Can we land the finish commit, and I'll come back around and take another pass at UntilWithSync for routes in a follow-up PR?

@wking
Copy link
Member Author

wking commented Mar 22, 2019

e2e-aws:

level=fatal msg="failed to initialize the cluster: timed out waiting for the condition"

/retest

@wking
Copy link
Member Author

wking commented Mar 22, 2019

e2e-aws:

level=error msg="\t* aws_route_table_association.route_net.1: timeout while waiting for state to become 'success' (timeout: 5m0s)"

/retest

@wking
Copy link
Member Author

wking commented Mar 22, 2019

e2e-aws:

level=error msg="\t* aws_instance.master.1: Error waiting for instance (i-010d8bf6e3b4e5f7e) to become ready: timeout while waiting for state to become 'running' (last state: 'pending', timeout: 10m0s)"

Amazon short of instances :/

/retest

@wking
Copy link
Member Author

wking commented Mar 26, 2019

/retest

@wking
Copy link
Member Author

wking commented Mar 26, 2019

All green. @abhinavdahiya, can you take another look at some point today?

consoleRouteContext, cancel := context.WithTimeout(ctx, consoleRouteTimeout)
defer cancel()
if oneShot {
logrus.Infof("Checking for the %s route...", consoleNamespace)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/s/consoleNamespace/consoleRouteName

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/s/consoleNamespace/consoleRouteName

My follow up is waiting for the downloads route too; this choice will be a better fit then. Even without follow-up work, I think "the main console-namespace route" isn't that bad vs. naming the route without the namespace. If we end up sticking with a single route, I'll probably write both the namespace and name.

if oneShot {
logrus.Infof("Checking for the %s route...", consoleNamespace)
} else {
logrus.Infof("Waiting up to %v for the %s route to be created...", consoleRouteTimeout, consoleNamespace)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/s/consoleNamespace/consoleRouteName

if err := waitForInitializedCluster(ctx, config); err != nil {
return err
errs := []error{}
err := waitForInitializedCluster(ctx, config)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is

err := <>
if err != nil {
..
}

while all other calls are

if err = <>; err != nil {
....
}

??

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... while all other calls are...

They're all using this err ;). But whatever, I wish gofmt would just make this sort of thing consistent with whatever a coin-flip says is the easiest to read :p.

@wking
Copy link
Member Author

wking commented Mar 27, 2019

e2e-aws:

level=fatal msg="failed to initialize the cluster: Working towards 0.0.1-2019-03-27-010000: 98% complete: timed out waiting for the condition"

/retest

@wking
Copy link
Member Author

wking commented Mar 27, 2019

Depending on the severity of the error, you may be able to log in with: oc login -u kubeadmin -p

I don't think we should be telling people maybe you will be able to log in. Please drop this

I thought that was helpful ;). You'd rather me not say anything about oc login when we have errors? Do I need to ever say anything about oc login? Folks using oc will likely be using auth/kubeconfig or setting up a real identity provider anyway.

@abhinavdahiya
Copy link
Contributor

Depending on the severity of the error, you may be able to log in with: oc login -u kubeadmin -p

I don't think we should be telling people maybe you will be able to log in. Please drop this

I thought that was helpful ;). You'd rather me not say anything about oc login when we have errors? Do I need to ever say anything about oc login? Folks using oc will likely be using auth/kubeconfig or setting up a real identity provider anyway.

I'm not sure, if oc login needs to be called out at all. But it is now... So making sure we are not calling that out when we don't know if the authentication operator succeeded...

oc with auth/kubeconfig will work as we have successfully bootstrapped. We can keep that part on partial failure.

@openshift-ci-robot openshift-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 2, 2019
@wking wking force-pushed the finishing-with-partial-failures branch from 2995aef to c372384 Compare April 5, 2019 21:27
@openshift-ci-robot openshift-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 5, 2019
@wking
Copy link
Member Author

wking commented Apr 5, 2019

I'm not sure, if oc login needs to be called out at all.

Rebased onto master with 2995aef49 -> c3723847d, and dropping the oc login reference as part of that (thanks to #1513).

@wking
Copy link
Member Author

wking commented Apr 5, 2019

oc delete project ci-op-84i4xlit

/retest

@wking
Copy link
Member Author

wking commented Apr 11, 2019

Rebased onto master with c3723847d -> 19866aa38.

@wking
Copy link
Member Author

wking commented Apr 11, 2019

And I've pushed 19866aa38 -> cddd3f0ec to distingish between Failing and Progressing.

@wking
Copy link
Member Author

wking commented Apr 11, 2019

All green. Here's the end of the e2e-aws run:

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/1413/pull-ci-openshift-installer-master-e2e-aws/5140/artifacts/e2e-aws/installer/.openshift_install.log | grep -A8 'Cluster is initialized'
time="2019-04-11T03:53:34Z" level=debug msg="Cluster is initialized"
time="2019-04-11T03:53:34Z" level=info msg="Waiting up to 10m0s for the openshift-console route to be created..."
time="2019-04-11T03:53:34Z" level=debug msg="Route found in openshift-console namespace: console"
time="2019-04-11T03:53:34Z" level=debug msg="Route found in openshift-console namespace: downloads"
time="2019-04-11T03:53:34Z" level=debug msg="OpenShift console route is created"
time="2019-04-11T03:53:34Z" level=info msg="Install complete!"
time="2019-04-11T03:53:34Z" level=info msg="To access the cluster as the system:admin user when using 'oc', run 'export KUBECONFIG=/tmp/artifacts/installer/auth/kubeconfig'"
time="2019-04-11T03:53:34Z" level=info msg="Access the OpenShift web-console here: https://console-openshift-console.apps.ci-op-b89kp5jd-1d3f3.origin-ci-int-aws.dev.rhcloud.com"
time="2019-04-11T03:53:34Z" level=info msg="Login to the console with user: kubeadmin, password: ShUna-zfIzJ-HPSYX-WhzpE"

@wking wking force-pushed the finishing-with-partial-failures branch from cddd3f0 to f79f53c Compare April 12, 2019 20:30
@wking wking changed the title cmd/openshift-install/create: Make finish() more robust cmd/openshift-install/create: Make waitForInstallComplete() more robust Apr 12, 2019
@trown
Copy link

trown commented Apr 16, 2019

/test e2e-openstack

1 similar comment
@trown
Copy link

trown commented Apr 18, 2019

/test e2e-openstack

@openshift-ci-robot openshift-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 12, 2019
wking added 2 commits May 14, 2019 14:19
If the cluster failed to completely initialize, we may still have a
router CA or console URL.  With this commit, errors in finalization
components are no longer immediately fatal; instead we hold any errors
but continue to try the later components.  We retain the first
component component error, and log any later errors.

To avoid waiting on a console URL when the cluster failed to
initialize, I've added a oneShot argument to waitForConsole.  When
waitForInitializedCluster failed, we set oneShot true and only make a
single attempt to resolve the console URL.  I've also shuffled around
the Until function to flatten some of its previous if/else blocks so I
only have to add single oneShot-cancel block.

Tangentially related, addRouterCAToClusterCA is idempotent, because
AppendCertsFromPEM wraps AddCert [1] and AddCert checks to avoid
duplicate certificates [2].

[1]: https://github.com/golang/go/blob/go1.12/src/crypto/x509/cert_pool.go#L144
[2]: https://github.com/golang/go/blob/go1.12/src/crypto/x509/cert_pool.go#L106-L109
…sing

And point at 'wait-for install-complete'.  This gives a better
experience for users that we time out because they're installing over
a slow network or something.

We might want to fail fast if we see a Failing, instead of waiting for
the full timeout.  But I'm leaving that alone for now; we can always
come back to it in follow-up work.
@wking wking force-pushed the finishing-with-partial-failures branch from f79f53c to 773a2b0 Compare May 14, 2019 21:22
@openshift-ci-robot openshift-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 14, 2019
@wking
Copy link
Member Author

wking commented May 14, 2019

Rebased around #1718 with f79f53c0d -> 773a2b0.

@wking
Copy link
Member Author

wking commented May 29, 2019

e2e-aws:

level=fatal msg="failed to initialize the cluster: Some cluster operators are still updating: authentication, console: timed out waiting for the condition"

/retest

@openshift-ci-robot
Copy link
Contributor

openshift-ci-robot commented May 29, 2019

@wking: The following test failed, say /retest to rerun them all:

Test name Commit Details Rerun command
ci/prow/e2e-aws-scaleup-rhel7 773a2b0 link /test e2e-aws-scaleup-rhel7

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-ci-robot openshift-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 23, 2019
@openshift-ci-robot
Copy link
Contributor

@wking: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@abhinavdahiya
Copy link
Contributor

closing due to inactivity. Please reopen if needed.

/close

@openshift-ci-robot
Copy link
Contributor

@abhinavdahiya: Closed this PR.

In response to this:

closing due to inactivity. Please reopen if needed.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants