-
Notifications
You must be signed in to change notification settings - Fork 1.4k
cmd/openshift-install/create: Make waitForInstallComplete() more robust #1413
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
c8c4786
to
445e27e
Compare
d3214ac
to
2f62524
Compare
Heh, this is not quite what I was aiming for:
|
2f62524
to
8aba2fa
Compare
Ok, updated a few things with 2f625248e -> 8aba2fac1. Hopefully that covers it, although I guess CI will tell us ;). |
$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/1413/pull-ci-openshift-installer-master-e2e-aws/4580/artifacts/e2e-aws/installer/.openshift_install.log | grep -B5 fatal
time="2019-03-22T09:23:19Z" level=debug msg="Cluster is initialized"
time="2019-03-22T09:23:19Z" level=info msg="Waiting up to 10m0s for openshift-console routes to be created..."
time="2019-03-22T09:33:19Z" level=info msg="Run 'export KUBECONFIG=/tmp/artifacts/installer/auth/kubeconfig' to manage the cluster with 'oc', the OpenShift CLI."
time="2019-03-22T09:33:19Z" level=info msg="Depending on the severity of the error, you may be able to log in with 'oc login -u kubeadmin -p Gqqf8-f5sSv-5KqZS-saiZF https://api.ci-op-9pdd09zr-1d3f3.origin-ci-int-aws.dev.rhcloud.com:6443' succeeds."
time="2019-03-22T09:33:19Z" level=info msg="Login to the console with user: kubeadmin, password: Gqqf8-f5sSv-5KqZS-saiZF"
time="2019-03-22T09:33:19Z" level=fatal msg="failed to wait for openshift-console routes: timed out waiting for the condition"
$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/1413/pull-ci-openshift-installer-master-e2e-aws/4580/artifacts/e2e-aws/pods/openshift-cluster-version_cluster-version-operator-d6958f9b6-2gr9r_cluster-version-operator.log.gz | gunzip | grep route | grep downloads | head -n2
I0322 09:16:56.022824 1 sync_worker.go:462] Running sync for route "openshift-console/downloads" (270 of 310)
I0322 09:16:56.095793 1 sync_worker.go:475] Done syncing for route "openshift-console/downloads" (270 of 310) I seem to have broken the route watcher. Robust |
8aba2fa
to
2995aef
Compare
So my |
/retest |
/retest |
Amazon short of instances :/ /retest |
/retest |
All green. @abhinavdahiya, can you take another look at some point today? |
consoleRouteContext, cancel := context.WithTimeout(ctx, consoleRouteTimeout) | ||
defer cancel() | ||
if oneShot { | ||
logrus.Infof("Checking for the %s route...", consoleNamespace) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/s/consoleNamespace/consoleRouteName
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/s/consoleNamespace/consoleRouteName
My follow up is waiting for the downloads
route too; this choice will be a better fit then. Even without follow-up work, I think "the main console-namespace route" isn't that bad vs. naming the route without the namespace. If we end up sticking with a single route, I'll probably write both the namespace and name.
if oneShot { | ||
logrus.Infof("Checking for the %s route...", consoleNamespace) | ||
} else { | ||
logrus.Infof("Waiting up to %v for the %s route to be created...", consoleRouteTimeout, consoleNamespace) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/s/consoleNamespace/consoleRouteName
cmd/openshift-install/create.go
Outdated
if err := waitForInitializedCluster(ctx, config); err != nil { | ||
return err | ||
errs := []error{} | ||
err := waitForInitializedCluster(ctx, config) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is
err := <>
if err != nil {
..
}
while all other calls are
if err = <>; err != nil {
....
}
??
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
... while all other calls are...
They're all using this err
;). But whatever, I wish gofmt
would just make this sort of thing consistent with whatever a coin-flip says is the easiest to read :p.
/retest |
I thought that was helpful ;). You'd rather me not say anything about |
I'm not sure, if
|
2995aef
to
c372384
Compare
Rebased onto master with 2995aef49 -> c3723847d, and dropping the |
/retest |
c372384
to
19866aa
Compare
Rebased onto master with c3723847d -> 19866aa38. |
And I've pushed 19866aa38 -> cddd3f0ec to distingish between Failing and Progressing. |
All green. Here's the end of the e2e-aws run: $ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/1413/pull-ci-openshift-installer-master-e2e-aws/5140/artifacts/e2e-aws/installer/.openshift_install.log | grep -A8 'Cluster is initialized'
time="2019-04-11T03:53:34Z" level=debug msg="Cluster is initialized"
time="2019-04-11T03:53:34Z" level=info msg="Waiting up to 10m0s for the openshift-console route to be created..."
time="2019-04-11T03:53:34Z" level=debug msg="Route found in openshift-console namespace: console"
time="2019-04-11T03:53:34Z" level=debug msg="Route found in openshift-console namespace: downloads"
time="2019-04-11T03:53:34Z" level=debug msg="OpenShift console route is created"
time="2019-04-11T03:53:34Z" level=info msg="Install complete!"
time="2019-04-11T03:53:34Z" level=info msg="To access the cluster as the system:admin user when using 'oc', run 'export KUBECONFIG=/tmp/artifacts/installer/auth/kubeconfig'"
time="2019-04-11T03:53:34Z" level=info msg="Access the OpenShift web-console here: https://console-openshift-console.apps.ci-op-b89kp5jd-1d3f3.origin-ci-int-aws.dev.rhcloud.com"
time="2019-04-11T03:53:34Z" level=info msg="Login to the console with user: kubeadmin, password: ShUna-zfIzJ-HPSYX-WhzpE" |
cddd3f0
to
f79f53c
Compare
/test e2e-openstack |
1 similar comment
/test e2e-openstack |
If the cluster failed to completely initialize, we may still have a router CA or console URL. With this commit, errors in finalization components are no longer immediately fatal; instead we hold any errors but continue to try the later components. We retain the first component component error, and log any later errors. To avoid waiting on a console URL when the cluster failed to initialize, I've added a oneShot argument to waitForConsole. When waitForInitializedCluster failed, we set oneShot true and only make a single attempt to resolve the console URL. I've also shuffled around the Until function to flatten some of its previous if/else blocks so I only have to add single oneShot-cancel block. Tangentially related, addRouterCAToClusterCA is idempotent, because AppendCertsFromPEM wraps AddCert [1] and AddCert checks to avoid duplicate certificates [2]. [1]: https://github.com/golang/go/blob/go1.12/src/crypto/x509/cert_pool.go#L144 [2]: https://github.com/golang/go/blob/go1.12/src/crypto/x509/cert_pool.go#L106-L109
…sing And point at 'wait-for install-complete'. This gives a better experience for users that we time out because they're installing over a slow network or something. We might want to fail fast if we see a Failing, instead of waiting for the full timeout. But I'm leaving that alone for now; we can always come back to it in follow-up work.
f79f53c
to
773a2b0
Compare
/retest |
@wking: The following test failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
@wking: PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
closing due to inactivity. Please reopen if needed. /close |
@abhinavdahiya: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
If the cluster failed to completely initialize, we may still have a router CA or console URL. With this commit, errors in finalization components are no longer immediately fatal; instead we hold any errors but continue to try the later components. We retain the first component component error, and log any later errors.
To avoid waiting on a console URL when the cluster failed to initialize, I've added a
oneShot
argument towaitForConsole
. WhenwaitForInitializedCluster
failed, we setoneShot
true and only make a single attempt to resolve the console URL. I've also shuffled around theUntil
function to flatten some of its previousif
/else
blocks so I only have to add singleoneShot
-cancel
block.Tangentially related,
addRouterCAToClusterCA
is idempotent, becauseAppendCertsFromPEM
wrapsAddCert
andAddCert
checks to avoid duplicate certificates.CC @abhinavdahiya, I think this is as discussed on the group-G arch call just now.